2022-09-29 21:10:24

by Joel Fernandes

Subject: Sum of weights idea for CFS PI

Hi Peter, all,

Just following up on the idea Peter suggested at LPC22 about summing weights
to solve the CFS priority inversion issues using priority inheritance. I am not
sure that a straightforward summation of the weights of the dependencies in the
chain is sufficient (or whether it may cause too much unfairness).

I think it will work if all the tasks on the CPU are at 100% utilization:

Say you have 4 tasks (A, B, C, D) running, each with equal
weight (W) except for A, which has twice the weight (2W).
So the CPU bandwidth distribution is (assuming all are running):
A: 2/5
B, C, D: 1/5

Say that out of the 4 tasks, 3 are part of a classical priority
inversion scenario (A, B and C).

Say now A blocks on a lock whose owner C is running. Because A has blocked,
B now gets 1/3 of the bandwidth, whereas it should have been limited to 1/5.
To remedy this, say you give C a weight of 2W: B then gets 1/4 of the
bandwidth - still not fair, since B is eating away CPU bandwidth and causing
the priority inversion we want to remedy.

The correct bandwidth distribution should be (B and D should be unchanged):
B = 1/5
D = 1/5

C = 3/5

This means that C's weight should be 3W, and B and D should be W each
as before. So indeed, C's new weight is its original weight PLUS the
weight of A - that's needed to keep the CPU usage of the other
tasks (B, D) in check, so that C makes forward progress on behalf of A and the
other tasks don't eat into the CPU utilization.
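
As a sanity check on the arithmetic above, here is a minimal user-space sketch
(a toy model of weight-proportional shares, not the kernel's fixed-point
implementation) that reproduces both distributions:

#include <stdio.h>

/* Toy model: a runnable task's CPU share is weight / sum(runnable weights). */
static void print_shares(const char *label, const int *w, int n)
{
    int sum = 0;

    for (int i = 0; i < n; i++)
        sum += w[i];
    printf("%s:", label);
    for (int i = 0; i < n; i++)
        printf(" %.2f", (double)w[i] / sum);
    printf("\n");
}

int main(void)
{
    int all_running[] = { 2, 1, 1, 1 };  /* A=2W, B=C=D=W */
    int a_blocked[]   = { 3, 1, 1 };     /* A dequeued; C boosted to 3W, B=D=W */

    print_shares("all running (A B C D)", all_running, 4);  /* 0.40 0.20 0.20 0.20 */
    print_shares("A blocked on C (C B D)", a_blocked, 3);   /* 0.60 0.20 0.20 */
    return 0;
}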

However, I think this will kinda fall apart if A is asleep 50% of the time
(assume the sleep is because of I/O and unrelated to the PI chain).

Because now, if all were running (and assuming no PI dependencies), with A
being runnable 50% of the time, the bandwidth of B, C and D would each be
divided into 2 components:

a. when A is running, it would be as above.
b. but if A was sleeping, B, C, and D would get 1/3.

So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30
or 1/5 bandwidth.

But now say A happens to block on a lock that C is holding. You would boost C
to weight 3W, which gives it 3/5 (or 18/30) as we saw above - more than what
C should actually get.

C should get (8/30 + 6/30 = 14/30) AFAICS.

Hopefully one can see that a straight summation of weights is not enough. It
needs to be something like:

C's new weight = C's original weight + (A's weight) * (A's utilization)

Or something along those lines - otherwise the inherited weight may be too much
to properly solve it.
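
To illustrate what such a scaled boost could look like, here is a rough
user-space sketch. The 0..1024 utilization scale is only an assumption borrowed
from PELT-style tracking; none of this reflects an actual patch:

#include <stdio.h>

/*
 * Proposed boost: new weight = owner's weight + waiter's weight scaled by the
 * waiter's utilization (expressed here as a fraction of 1024).
 * Illustrative arithmetic only, not scheduler code.
 */
static unsigned long boosted_weight(unsigned long owner_w,
                                    unsigned long waiter_w,
                                    unsigned long waiter_util /* 0..1024 */)
{
    return owner_w + (waiter_w * waiter_util) / 1024;
}

int main(void)
{
    /* A: weight 2W (2048), 50% utilization; C: weight W (1024). */
    printf("C's boosted weight = %lu\n", boosted_weight(1024, 2048, 512));
    /* Prints 2048, i.e. 2W, instead of the full 3W a straight sum would give. */
    return 0;
}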

Any thoughts on this? You mentioned you had some notes on this and/or proxy
execution - could you share them?

Thanks,

- Joel



2022-09-30 14:00:01

by Qais Yousef

Subject: Re: Sum of weights idea for CFS PI

Hi Joel

I'm interested in the topic, if I can be CCed in any future discussions I'd
appreciate it :)

On 09/29/22 16:38, Joel Fernandes wrote:
> Hi Peter, all,
>
> Just following-up about the idea Peter suggested at LPC22 about sum of weights
> to solve the CFS priority inversion issues using priority inheritance. I am not
> sure if a straight forward summation of the weights of dependencies in the
> chain, is sufficient (or may cause too much unfairness).
>
> I think it will work if all the tasks on CPU are 100% in utilization:
>
> Say if you have 4 tasks (A, B, C, D) running and each one has equal
> weight (W) except for A which has twice the weight (2W).
> So the CPU bandwidth distribution is (assuming all are running):
> A: 2 / 5
> B, C. D: 1 / 5
>
> Say out of the 4 tasks, 3 of them are a part of a classical priority
> inversion scenario (A, B and C).
>
> Say now A blocks on a lock and that lock's owner C is running, however now
> because A has blocked, B gets 1/3 bandwidth, where as it should have been
> limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4
> bandwidth - still not fair since B is eating away CPU bandwidth causing the
> priority inversion we want to remedy.
>
> The correct bandwidth distribution should be (B and D should be unchanged):
> B = 1/5
> D = 1/5
>
> C = 3/5
>
> This means that C's weight should be 3W , and B and D should be W each
> as before. So indeed, C's new weight is its original weight PLUS the
> weight of the A - that's needed to keep the CPU usage of the other
> tasks (B, D) in check so that C makes forward progress on behalf of A and the
> other tasks don't eat into the CPU utilization.
>
> However, I think this will kinda fall apart if A is asleep 50% of the time
> (assume the sleep is because of I/O and unrelated to the PI chain).
>
> Because now if all were running (and assume no PI dependencies), with A being
> 50%, the bandwidth of B, C and D each would be divided into 2 components:
>
> a. when A is running, it would be as above.
> b. but if A was sleeping, B, C, and D would get 1/3.
>
> So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30
> or 1/5 bandwidth.

The average metric is an interesting one. It can be confusing to reason about,
too.

I think we have 3 events to take into account here, not 2:

a. when A is running and NOT blocked on C.
b. when A is running and BLOCKED on C.
c. A is sleeping.

This means A, B, C and D's shares will be:

A , B , C , D
a. 2/5, 1/5, 1/5, 1/5
b. - , 3/5, 1/5, 1/5
c. - , 1/3, 1/3, 1/3

Since A is sleeping for 50%, I don't think we can assume equal distribution for
the 3 events (can't just divide by 3).

I believe we can assume that

a. occurs 25% of the time
b. occurs 25% of the time
c. occurs 50% of the time

I *think* this should provide something more representative.

>
> But now say A happen to block on a lock that C is holding. You would boost C to
> weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what
> C should actually get.
>
> C should get (8/30 + 6/30 = 14/30) AFAICS.
>
> Hopefully one can see that a straight summation of weights is not enough. It
> needs to be something like:
>
> C's new weight = C's original weight + (A's weight) * (A's utilization)
>
> Or something, otherwise the inherited weight may be too much to properly solve it.
>
> Any thoughts on this? You mentioned you had some notes on this and/or proxy
> execution, could you share it?

I assume we'll be using the rt-mutex inheritance property to handle this? If
this was discussed during a talk, I'd appreciate a link to it.

In the past, at the OSPM conference, we brought up an issue with performance
inversion where a task running on a smaller (or, to be more generic, slower)
CPU is holding the lock and causing massive delays for waiters. This is an
artefact of DVFS. For HMP, there's an additional cause due to the unequal
capacities of the CPUs.

Proxy execution seems to be the nicest solution to all of these problems, but
it's a long way away. I'm interested to learn how this inheritance will be
implemented, and whether there are any userspace conversion issues, i.e. do
we need to convert all locks to rt-mutex locks?


Thanks

--
Qais Yousef

2022-09-30 15:48:45

by Youssef Esmat

Subject: Re: Sum of weights idea for CFS PI

Hi Everyone!

I am not sure we should care about A's sleeping pattern. The case we
care about is when A is running, or wants to run but can't because it
is blocked on C. In that case C should get the weight of A as if A were
running.

Ideally this is also a temporary boost, since critical sections should
be relatively small, so erring on the side of giving C slightly more
runtime would be safe, I think.

Thanks,
Youssef

On Fri, Sep 30, 2022 at 8:49 AM Qais Yousef <[email protected]> wrote:
>
> Hi Joel
>
> I'm interested in the topic, if I can be CCed in any future discussions I'd
> appreciate it :)
>
> On 09/29/22 16:38, Joel Fernandes wrote:
> > Hi Peter, all,
> >
> > Just following-up about the idea Peter suggested at LPC22 about sum of weights
> > to solve the CFS priority inversion issues using priority inheritance. I am not
> > sure if a straight forward summation of the weights of dependencies in the
> > chain, is sufficient (or may cause too much unfairness).
> >
> > I think it will work if all the tasks on CPU are 100% in utilization:
> >
> > Say if you have 4 tasks (A, B, C, D) running and each one has equal
> > weight (W) except for A which has twice the weight (2W).
> > So the CPU bandwidth distribution is (assuming all are running):
> > A: 2 / 5
> > B, C. D: 1 / 5
> >
> > Say out of the 4 tasks, 3 of them are a part of a classical priority
> > inversion scenario (A, B and C).
> >
> > Say now A blocks on a lock and that lock's owner C is running, however now
> > because A has blocked, B gets 1/3 bandwidth, where as it should have been
> > limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4
> > bandwidth - still not fair since B is eating away CPU bandwidth causing the
> > priority inversion we want to remedy.
> >
> > The correct bandwidth distribution should be (B and D should be unchanged):
> > B = 1/5
> > D = 1/5
> >
> > C = 3/5
> >
> > This means that C's weight should be 3W , and B and D should be W each
> > as before. So indeed, C's new weight is its original weight PLUS the
> > weight of the A - that's needed to keep the CPU usage of the other
> > tasks (B, D) in check so that C makes forward progress on behalf of A and the
> > other tasks don't eat into the CPU utilization.
> >
> > However, I think this will kinda fall apart if A is asleep 50% of the time
> > (assume the sleep is because of I/O and unrelated to the PI chain).
> >
> > Because now if all were running (and assume no PI dependencies), with A being
> > 50%, the bandwidth of B, C and D each would be divided into 2 components:
> >
> > a. when A is running, it would be as above.
> > b. but if A was sleeping, B, C, and D would get 1/3.
> >
> > So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30
> > or 1/5 bandwidth.
>
> The average metric is interesting one. It can be confusing to reason about too.
>
> I think we have 3 events to take into account here, not 2:
>
> a. when A is running and NOT blocked on C.
> b. when A is running and BLOCKED on C.
> c. A is sleeping.
>
> This means A, B, C and D's shares will be:
>
> A , B , C , D
> a. 2/5, 1/5, 1/5, 1/5
> b. - , 3/5, 1/5, 1/5
> c. - , 1/3, 1/3, 1/3
>
> Since A is sleeping for 50%, I don't think we can assume equal distribution for
> the 3 events (can't just divide by 3).
>
> I believe we can assume that
>
> a. occurs 25% of the time
> b. occurs 25% of the time
> c. occurs 50% of the time
>
> I *think* this should provide something more representative.
>
> >
> > But now say A happen to block on a lock that C is holding. You would boost C to
> > weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what
> > C should actually get.
> >
> > C should get (8/30 + 6/30 = 14/30) AFAICS.
> >
> > Hopefully one can see that a straight summation of weights is not enough. It
> > needs to be something like:
> >
> > C's new weight = C's original weight + (A's weight) * (A's utilization)
> >
> > Or something, otherwise the inherited weight may be too much to properly solve it.
> >
> > Any thoughts on this? You mentioned you had some notes on this and/or proxy
> > execution, could you share it?
>
> I assume we'll be using rt-mutex inheritance property to handle this? If this
> was discussed during a talk, I'd appreciate a link to that.
>
> In the past in OSPM conference we brought up an issue with performance
> inversion where a task running on a smaller (slower to be more generic) CPU is
> holding the lock and causing massive delays for waiters. This is an artefact of
> DVFS. For HMP, there's an additional cause due to the unequal capacities of the
> CPUs.
>
> Proxy execution seems to be the nice solution to all of these problems, but
> it's a long way away. I'm interested to learn how this inheritance will be
> implemented. And whether there are any userspace conversion issues. i.e: do
> we need to convert all locks to rt-mutex locks?
>
>
> Thanks
>
> --
> Qais Yousef

2022-09-30 17:47:46

by Joel Fernandes

Subject: Re: Sum of weights idea for CFS PI



On 9/30/2022 9:49 AM, Qais Yousef wrote:
> Hi Joel
>
> I'm interested in the topic, if I can be CCed in any future discussions I'd
> appreciate it :)

Yes, surely! Will do :)

> On 09/29/22 16:38, Joel Fernandes wrote:
>> Hi Peter, all,
>>
>> Just following-up about the idea Peter suggested at LPC22 about sum of weights
>> to solve the CFS priority inversion issues using priority inheritance. I am not
>> sure if a straight forward summation of the weights of dependencies in the
>> chain, is sufficient (or may cause too much unfairness).
>>
>> I think it will work if all the tasks on CPU are 100% in utilization:
>>
>> Say if you have 4 tasks (A, B, C, D) running and each one has equal
>> weight (W) except for A which has twice the weight (2W).
>> So the CPU bandwidth distribution is (assuming all are running):
>> A: 2 / 5
>> B, C. D: 1 / 5
>>
>> Say out of the 4 tasks, 3 of them are a part of a classical priority
>> inversion scenario (A, B and C).
>>
>> Say now A blocks on a lock and that lock's owner C is running, however now
>> because A has blocked, B gets 1/3 bandwidth, where as it should have been
>> limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4
>> bandwidth - still not fair since B is eating away CPU bandwidth causing the
>> priority inversion we want to remedy.
>>
>> The correct bandwidth distribution should be (B and D should be unchanged):
>> B = 1/5
>> D = 1/5
>>
>> C = 3/5
>>
>> This means that C's weight should be 3W , and B and D should be W each
>> as before. So indeed, C's new weight is its original weight PLUS the
>> weight of the A - that's needed to keep the CPU usage of the other
>> tasks (B, D) in check so that C makes forward progress on behalf of A and the
>> other tasks don't eat into the CPU utilization.
>>
>> However, I think this will kinda fall apart if A is asleep 50% of the time
>> (assume the sleep is because of I/O and unrelated to the PI chain).
>>
>> Because now if all were running (and assume no PI dependencies), with A being
>> 50%, the bandwidth of B, C and D each would be divided into 2 components:
>>
>> a. when A is running, it would be as above.
>> b. but if A was sleeping, B, C, and D would get 1/3.
>>
>> So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30
>> or 1/5 bandwidth.
>
> The average metric is interesting one. It can be confusing to reason about too.
>
> I think we have 3 events to take into account here, not 2:
>
> a. when A is running and NOT blocked on C.
> b. when A is running and BLOCKED on C.
> c. A is sleeping.
>
> This means A, B, C and D's shares will be:
>
> A , B , C , D
> a. 2/5, 1/5, 1/5, 1/5
> b. - , 3/5, 1/5, 1/5

Here did you mean:
b. -, 1/5, 3/5, 1/5 ?

A blocked on C means C should get A's weight (on top of its own).

> c. - , 1/3, 1/3, 1/3
>
> Since A is sleeping for 50%, I don't think we can assume equal distribution for
> the 3 events (can't just divide by 3).

Oh yeah, I had not gotten to _that_ part of the story yet at this point of the
email; I brought it up later, below, where I say "But now say A happens to block"...

>
> I believe we can assume that
>
> a. occurs 25% of the time
> b. occurs 25% of the time
> c. occurs 50% of the time
>
> I *think* this should provide something more representative.

Yes, possibly. My basic idea was that I was trying to *not* change the
distribution of B based on whether or not A blocks on C. In my view, B's
distribution should not change just because A and C have a lock dependency,
because otherwise C can get too much or too little time. If C gets too much
time, then it's hurting B. If C gets too little time, then it's hurting A.

>> But now say A happen to block on a lock that C is holding. You would boost C to
>> weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what
>> C should actually get.
>>
>> C should get (8/30 + 6/30 = 14/30) AFAICS.
>>
>> Hopefully one can see that a straight summation of weights is not enough. It
>> needs to be something like:
>>
>> C's new weight = C's original weight + (A's weight) * (A's utilization)
>>
>> Or something, otherwise the inherited weight may be too much to properly solve it.
>>
>> Any thoughts on this? You mentioned you had some notes on this and/or proxy
>> execution, could you share it?
>
> I assume we'll be using rt-mutex inheritance property to handle this? If this
> was discussed during a talk, I'd appreciate a link to that.

Yes, that's the idea, but I am also aware that there are 'other' kinds of
dependencies in userspace that cannot benefit from a kernel-only boost.

So, consider a bounded-buffer producer/consumer: the producer is A and the
consumer is C, with B being a random mid-priority task. Once the bounded buffer
gets full, A waits for a signal from C that space has been freed - for which C
needs to run in the first place - before A can queue its payload into the
buffer. However, trouble-maker B is determined to eat away C's time and develop
a prio inversion between itself and A. This pattern should also generalize to a
worker pool consuming work.
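
A minimal sketch of that pattern (a classic condition-variable bounded buffer;
names and buffer size are made up for illustration). A's forward progress
depends on C running, yet no lock is held across the wait, so lock-based PI
never sees the dependency:

#include <pthread.h>

#define BUF_SIZE 8

static int buf[BUF_SIZE];
static int count;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

/* Task A: producer. When the buffer is full it waits for C to drain it. */
void produce(int item)
{
    pthread_mutex_lock(&m);
    while (count == BUF_SIZE)
        pthread_cond_wait(&not_full, &m);   /* m is dropped while waiting */
    buf[count++] = item;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&m);
}

/* Task C: consumer. It must run for A to make progress, but it owns no lock
 * that A is blocked on while B eats its CPU time. */
int consume(void)
{
    int item;

    pthread_mutex_lock(&m);
    while (count == 0)
        pthread_cond_wait(&not_empty, &m);
    item = buf[--count];
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&m);
    return item;
}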

In this case, there is no lock involved, yet you have a dependency. But I don't
mean to sound depressing - just because there are cases like this does not
mean we should not solve the lock-based ones. When I looked at Android, I saw
that it uses futexes directly from the Android Runtime code instead of using
pthreads. So perhaps those can be trivially converted to FUTEX_LOCK_PI, and then
what we do in the kernel will JustWork(Tm)?

>
> In the past in OSPM conference we brought up an issue with performance
> inversion where a task running on a smaller (slower to be more generic) CPU is
> holding the lock and causing massive delays for waiters. This is an artefact of
> DVFS. For HMP, there's an additional cause due to the unequal capacities of the
> CPUs.
>
> Proxy execution seems to be the nice solution to all of these problems, but
> it's a long way away. I'm interested to learn how this inheritance will be
> implemented. And whether there are any userspace conversion issues. i.e: do
> we need to convert all locks to rt-mutex locks?

I am not an expert on FUTEX_LOCK_PI and this could be a good time for tglx to
weigh in, but I think converting all userspace locks to use FUTEX_LOCK_PI sounds
reasonable to me.

Other thoughts?

thanks,

- Joel

2022-09-30 18:55:17

by Joel Fernandes

Subject: Re: Sum of weights idea for CFS PI

On Fri, Sep 30, 2022 at 2:10 PM Youssef Esmat <[email protected]> wrote:
[..]
> > > Hi Everyone!
> >
> > Hi Youssef,
> >
> > (Youssef is new to LKML though in no way new to OS or software development. I
> > gave him the usual 'dont-top-post' chat already - fyi).
> >
> > > I am not sure we should care about A's sleeping pattern. The case we
> > > care about is when A is running or wants to run but can't because it
> > > is blocked on C. In that case C should get the weight of A as if A was
> > > running.
> >
> > Just to clarify - Youssef did mean sum of weights of different things in the
> > chain, and not just weights (he confirmed on chat that that's what he meant).
> >
>
> Yeah thanks for clarifying, I meant that C should get the sum of
> weights as if A was running (3/5 in your example) since in this
> segment of time A would have been running if it was not blocked on the
> lock. I think it's safe to ignore the average and just use the sum of

For the onlooker: we are talking about the classical case of priority
inversion involving 3 tasks A, B and C, which can be expanded to a
chain of tasks. Highest-prio A blocks on a lock that lowest-prio C
holds, while an unrelated medium-prio B keeps C from running (or reduces its
progress, as in the case of CFS).

On the note of "A would have been running if it was not blocked on the
lock": I think that is an assumption - we don't know whether A would
be running. We only know the past, not the future. A could very well
make an I/O request, for example. Hence there could be a need to use
A's past utilization, right?

thanks,

- Joel

2022-09-30 19:10:53

by Joel Fernandes

Subject: Re: Sum of weights idea for CFS PI

On 9/30/2022 11:44 AM, Youssef Esmat wrote:
> Hi Everyone!

Hi Youssef,

(Youssef is new to LKML though in no way new to OS or software development. I
gave him the usual 'dont-top-post' chat already - fyi).

> I am not sure we should care about A's sleeping pattern. The case we
> care about is when A is running or wants to run but can't because it
> is blocked on C. In that case C should get the weight of A as if A was
> running.

Just to clarify - Youssef did mean sum of weights of different things in the
chain, and not just weights (he confirmed on chat that that's what he meant).

> Ideally this is also a temporary boost since critical sections should
> be relatively small, so erring on the side of giving C slightly more
> runtime would be safe I think.

True. But I would not hold my breath on userspace not holding a lock
for very long periods of time. Though I agree that should generally be true.

thanks,

- Joel


>
> Thanks,
> Youssef
>
> On Fri, Sep 30, 2022 at 8:49 AM Qais Yousef <[email protected]> wrote:
>>
>> Hi Joel
>>
>> I'm interested in the topic, if I can be CCed in any future discussions I'd
>> appreciate it :)
>>
>> On 09/29/22 16:38, Joel Fernandes wrote:
>>> Hi Peter, all,
>>>
>>> Just following-up about the idea Peter suggested at LPC22 about sum of weights
>>> to solve the CFS priority inversion issues using priority inheritance. I am not
>>> sure if a straight forward summation of the weights of dependencies in the
>>> chain, is sufficient (or may cause too much unfairness).
>>>
>>> I think it will work if all the tasks on CPU are 100% in utilization:
>>>
>>> Say if you have 4 tasks (A, B, C, D) running and each one has equal
>>> weight (W) except for A which has twice the weight (2W).
>>> So the CPU bandwidth distribution is (assuming all are running):
>>> A: 2 / 5
>>> B, C. D: 1 / 5
>>>
>>> Say out of the 4 tasks, 3 of them are a part of a classical priority
>>> inversion scenario (A, B and C).
>>>
>>> Say now A blocks on a lock and that lock's owner C is running, however now
>>> because A has blocked, B gets 1/3 bandwidth, where as it should have been
>>> limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4
>>> bandwidth - still not fair since B is eating away CPU bandwidth causing the
>>> priority inversion we want to remedy.
>>>
>>> The correct bandwidth distribution should be (B and D should be unchanged):
>>> B = 1/5
>>> D = 1/5
>>>
>>> C = 3/5
>>>
>>> This means that C's weight should be 3W , and B and D should be W each
>>> as before. So indeed, C's new weight is its original weight PLUS the
>>> weight of the A - that's needed to keep the CPU usage of the other
>>> tasks (B, D) in check so that C makes forward progress on behalf of A and the
>>> other tasks don't eat into the CPU utilization.
>>>
>>> However, I think this will kinda fall apart if A is asleep 50% of the time
>>> (assume the sleep is because of I/O and unrelated to the PI chain).
>>>
>>> Because now if all were running (and assume no PI dependencies), with A being
>>> 50%, the bandwidth of B, C and D each would be divided into 2 components:
>>>
>>> a. when A is running, it would be as above.
>>> b. but if A was sleeping, B, C, and D would get 1/3.
>>>
>>> So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30
>>> or 1/5 bandwidth.
>>
>> The average metric is interesting one. It can be confusing to reason about too.
>>
>> I think we have 3 events to take into account here, not 2:
>>
>> a. when A is running and NOT blocked on C.
>> b. when A is running and BLOCKED on C.
>> c. A is sleeping.
>>
>> This means A, B, C and D's shares will be:
>>
>> A , B , C , D
>> a. 2/5, 1/5, 1/5, 1/5
>> b. - , 3/5, 1/5, 1/5
>> c. - , 1/3, 1/3, 1/3
>>
>> Since A is sleeping for 50%, I don't think we can assume equal distribution for
>> the 3 events (can't just divide by 3).
>>
>> I believe we can assume that
>>
>> a. occurs 25% of the time
>> b. occurs 25% of the time
>> c. occurs 50% of the time
>>
>> I *think* this should provide something more representative.
>>
>>>
>>> But now say A happen to block on a lock that C is holding. You would boost C to
>>> weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what
>>> C should actually get.
>>>
>>> C should get (8/30 + 6/30 = 14/30) AFAICS.
>>>
>>> Hopefully one can see that a straight summation of weights is not enough. It
>>> needs to be something like:
>>>
>>> C's new weight = C's original weight + (A's weight) * (A's utilization)
>>>
>>> Or something, otherwise the inherited weight may be too much to properly solve it.
>>>
>>> Any thoughts on this? You mentioned you had some notes on this and/or proxy
>>> execution, could you share it?
>>
>> I assume we'll be using rt-mutex inheritance property to handle this? If this
>> was discussed during a talk, I'd appreciate a link to that.
>>
>> In the past in OSPM conference we brought up an issue with performance
>> inversion where a task running on a smaller (slower to be more generic) CPU is
>> holding the lock and causing massive delays for waiters. This is an artefact of
>> DVFS. For HMP, there's an additional cause due to the unequal capacities of the
>> CPUs.
>>
>> Proxy execution seems to be the nice solution to all of these problems, but
>> it's a long way away. I'm interested to learn how this inheritance will be
>> implemented. And whether there are any userspace conversion issues. i.e: do
>> we need to convert all locks to rt-mutex locks?
>>
>>
>> Thanks
>>
>> --
>> Qais Yousef

2022-09-30 19:11:35

by Youssef Esmat

Subject: Re: Sum of weights idea for CFS PI

On Fri, Sep 30, 2022 at 12:42 PM Joel Fernandes <[email protected]> wrote:
>
> On 9/30/2022 11:44 AM, Youssef Esmat wrote:
> > Hi Everyone!
>
> Hi Youssef,
>
> (Youssef is new to LKML though in no way new to OS or software development. I
> gave him the usual 'dont-top-post' chat already - fyi).
>
> > I am not sure we should care about A's sleeping pattern. The case we
> > care about is when A is running or wants to run but can't because it
> > is blocked on C. In that case C should get the weight of A as if A was
> > running.
>
> Just to clarify - Youssef did mean sum of weights of different things in the
> chain, and not just weights (he confirmed on chat that that's what he meant).
>

Yeah, thanks for clarifying. I meant that C should get the sum of
weights as if A were running (3/5 in your example), since in this
segment of time A would have been running if it was not blocked on the
lock. I think it's safe to ignore the average and just use the sum of
weights.

So it would look like this:

Time ->
A: Slp, 2/5, Blk, 2/5, Slp
B: 1/3, 1/5, 1/5, 1/5, 1/3
C: 1/3, 1/5, 3/5, 1/5, 1/3
D: 1/3, 1/5, 1/5, 1/5, 1/3

* Slp means sleeping
* Blk means blocked on the lock owned by C.

> > Ideally this is also a temporary boost since critical sections should
> > be relatively small, so erring on the side of giving C slightly more
> > runtime would be safe I think.
>
> True. But I would not hold my breath too much on user space not holding a lock
> for very long periods of time. But I agree that generally should be true.
>
> thanks,
>
> - Joel
>
>
> >
> > Thanks,
> > Youssef
> >
> > On Fri, Sep 30, 2022 at 8:49 AM Qais Yousef <[email protected]> wrote:
> >>
> >> Hi Joel
> >>
> >> I'm interested in the topic, if I can be CCed in any future discussions I'd
> >> appreciate it :)
> >>
> >> On 09/29/22 16:38, Joel Fernandes wrote:
> >>> Hi Peter, all,
> >>>
> >>> Just following-up about the idea Peter suggested at LPC22 about sum of weights
> >>> to solve the CFS priority inversion issues using priority inheritance. I am not
> >>> sure if a straight forward summation of the weights of dependencies in the
> >>> chain, is sufficient (or may cause too much unfairness).
> >>>
> >>> I think it will work if all the tasks on CPU are 100% in utilization:
> >>>
> >>> Say if you have 4 tasks (A, B, C, D) running and each one has equal
> >>> weight (W) except for A which has twice the weight (2W).
> >>> So the CPU bandwidth distribution is (assuming all are running):
> >>> A: 2 / 5
> >>> B, C. D: 1 / 5
> >>>
> >>> Say out of the 4 tasks, 3 of them are a part of a classical priority
> >>> inversion scenario (A, B and C).
> >>>
> >>> Say now A blocks on a lock and that lock's owner C is running, however now
> >>> because A has blocked, B gets 1/3 bandwidth, where as it should have been
> >>> limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4
> >>> bandwidth - still not fair since B is eating away CPU bandwidth causing the
> >>> priority inversion we want to remedy.
> >>>
> >>> The correct bandwidth distribution should be (B and D should be unchanged):
> >>> B = 1/5
> >>> D = 1/5
> >>>
> >>> C = 3/5
> >>>
> >>> This means that C's weight should be 3W , and B and D should be W each
> >>> as before. So indeed, C's new weight is its original weight PLUS the
> >>> weight of the A - that's needed to keep the CPU usage of the other
> >>> tasks (B, D) in check so that C makes forward progress on behalf of A and the
> >>> other tasks don't eat into the CPU utilization.
> >>>
> >>> However, I think this will kinda fall apart if A is asleep 50% of the time
> >>> (assume the sleep is because of I/O and unrelated to the PI chain).
> >>>
> >>> Because now if all were running (and assume no PI dependencies), with A being
> >>> 50%, the bandwidth of B, C and D each would be divided into 2 components:
> >>>
> >>> a. when A is running, it would be as above.
> >>> b. but if A was sleeping, B, C, and D would get 1/3.
> >>>
> >>> So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30
> >>> or 1/5 bandwidth.
> >>
> >> The average metric is interesting one. It can be confusing to reason about too.
> >>
> >> I think we have 3 events to take into account here, not 2:
> >>
> >> a. when A is running and NOT blocked on C.
> >> b. when A is running and BLOCKED on C.
> >> c. A is sleeping.
> >>
> >> This means A, B, C and D's shares will be:
> >>
> >> A , B , C , D
> >> a. 2/5, 1/5, 1/5, 1/5
> >> b. - , 3/5, 1/5, 1/5
> >> c. - , 1/3, 1/3, 1/3
> >>
> >> Since A is sleeping for 50%, I don't think we can assume equal distribution for
> >> the 3 events (can't just divide by 3).
> >>
> >> I believe we can assume that
> >>
> >> a. occurs 25% of the time
> >> b. occurs 25% of the time
> >> c. occurs 50% of the time
> >>
> >> I *think* this should provide something more representative.
> >>
> >>>
> >>> But now say A happen to block on a lock that C is holding. You would boost C to
> >>> weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what
> >>> C should actually get.
> >>>
> >>> C should get (8/30 + 6/30 = 14/30) AFAICS.
> >>>
> >>> Hopefully one can see that a straight summation of weights is not enough. It
> >>> needs to be something like:
> >>>
> >>> C's new weight = C's original weight + (A's weight) * (A's utilization)
> >>>
> >>> Or something, otherwise the inherited weight may be too much to properly solve it.
> >>>
> >>> Any thoughts on this? You mentioned you had some notes on this and/or proxy
> >>> execution, could you share it?
> >>
> >> I assume we'll be using rt-mutex inheritance property to handle this? If this
> >> was discussed during a talk, I'd appreciate a link to that.
> >>
> >> In the past in OSPM conference we brought up an issue with performance
> >> inversion where a task running on a smaller (slower to be more generic) CPU is
> >> holding the lock and causing massive delays for waiters. This is an artefact of
> >> DVFS. For HMP, there's an additional cause due to the unequal capacities of the
> >> CPUs.
> >>
> >> Proxy execution seems to be the nice solution to all of these problems, but
> >> it's a long way away. I'm interested to learn how this inheritance will be
> >> implemented. And whether there are any userspace conversion issues. i.e: do
> >> we need to convert all locks to rt-mutex locks?
> >>
> >>
> >> Thanks
> >>
> >> --
> >> Qais Yousef

2022-10-03 16:37:27

by Qais Yousef

Subject: Re: Sum of weights idea for CFS PI

On 09/30/22 13:34, Joel Fernandes wrote:
>
>
> On 9/30/2022 9:49 AM, Qais Yousef wrote:
> > Hi Joel
> >
> > I'm interested in the topic, if I can be CCed in any future discussions I'd
> > appreciate it :)
>
> Yes, surely! Will do :)
>
> > On 09/29/22 16:38, Joel Fernandes wrote:
> >> Hi Peter, all,
> >>
> >> Just following-up about the idea Peter suggested at LPC22 about sum of weights
> >> to solve the CFS priority inversion issues using priority inheritance. I am not
> >> sure if a straight forward summation of the weights of dependencies in the
> >> chain, is sufficient (or may cause too much unfairness).
> >>
> >> I think it will work if all the tasks on CPU are 100% in utilization:
> >>
> >> Say if you have 4 tasks (A, B, C, D) running and each one has equal
> >> weight (W) except for A which has twice the weight (2W).
> >> So the CPU bandwidth distribution is (assuming all are running):
> >> A: 2 / 5
> >> B, C. D: 1 / 5
> >>
> >> Say out of the 4 tasks, 3 of them are a part of a classical priority
> >> inversion scenario (A, B and C).
> >>
> >> Say now A blocks on a lock and that lock's owner C is running, however now
> >> because A has blocked, B gets 1/3 bandwidth, where as it should have been
> >> limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4
> >> bandwidth - still not fair since B is eating away CPU bandwidth causing the
> >> priority inversion we want to remedy.
> >>
> >> The correct bandwidth distribution should be (B and D should be unchanged):
> >> B = 1/5
> >> D = 1/5
> >>
> >> C = 3/5
> >>
> >> This means that C's weight should be 3W , and B and D should be W each
> >> as before. So indeed, C's new weight is its original weight PLUS the
> >> weight of the A - that's needed to keep the CPU usage of the other
> >> tasks (B, D) in check so that C makes forward progress on behalf of A and the
> >> other tasks don't eat into the CPU utilization.
> >>
> >> However, I think this will kinda fall apart if A is asleep 50% of the time
> >> (assume the sleep is because of I/O and unrelated to the PI chain).
> >>
> >> Because now if all were running (and assume no PI dependencies), with A being
> >> 50%, the bandwidth of B, C and D each would be divided into 2 components:
> >>
> >> a. when A is running, it would be as above.
> >> b. but if A was sleeping, B, C, and D would get 1/3.
> >>
> >> So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30
> >> or 1/5 bandwidth.
> >
> > The average metric is interesting one. It can be confusing to reason about too.
> >
> > I think we have 3 events to take into account here, not 2:
> >
> > a. when A is running and NOT blocked on C.
> > b. when A is running and BLOCKED on C.
> > c. A is sleeping.
> >
> > This means A, B, C and D's shares will be:
> >
> > A , B , C , D
> > a. 2/5, 1/5, 1/5, 1/5
> > b. - , 3/5, 1/5, 1/5
>
> Here did you mean:
> b. -, 1/5, 3/5, 1/5 ?

Yes! Sorry a typo.

>
> A blocked on C means, C should get A's weight (on top of its own).
>
> > c. - , 1/3, 1/3, 1/3
> >
> > Since A is sleeping for 50%, I don't think we can assume equal distribution for
> > the 3 events (can't just divide by 3).
>
> Oh yeah, I did not get to _that_ part of the story yet at this point of the
> email, I brought it up later below where I say "But now say A happen to block"..

Hmm, but that's my point: I think you need to take _that_ part into account
here, or we're not doing an apples-to-apples comparison.

>
> >
> > I believe we can assume that
> >
> > a. occurs 25% of the time
> > b. occurs 25% of the time
> > c. occurs 50% of the time
> >
> > I *think* this should provide something more representative.
>
> Yes possible. My basic idea was I was trying to *not* change the distribution of
> B based on whether A blocks on C, or whether it does not. In my view, B's
> distribution should not change just because A and C have a lock dependency,
> because otherwise C can get too much or too little time. If C gets too much
> time, then its hurting B. If C gets too little time, then its hurting A.

Maybe I am getting confused. But AFAICT you're treating


a. when A is running, it would be as above.
b. but if A was sleeping, B, C, and D would get 1/3.

similar to

a. when A is running *and blocked on C for all its runtime*
b. but if A was sleeping, B, C, and D would get 1/3.

I don't think this is valid. If A is blocked on C for 50% of the time, and
sleeping for 50% of the time, when did it get blocked/unblocked?

This will have an impact on the average share for C and skew it, no?

Unless I missed something, the average share of C being (3/5 + 1/3) is an
impossible state. You need to consider the portion of time when C runs at 1/5,
when A is actually not blocked on anything, too.

Hmm, actually I just re-read your statement below and you only say that 3/5
(18/30) is too much; you didn't consider the average. I'll leave the above in,
in the hope that it helps me understand what I am missing and where I went
wrong :-)

Generally, IMHO, looking at the average will not help. I think if the share
values make sense in each state individually (and I believe they do), that
would be enough. AFAICS, B and D still get the right amount of time when
C inherits the bandwidth. And C, by definition, will run longer when A is
blocked on it, for the whole duration of that blocked time.

>
> >> But now say A happen to block on a lock that C is holding. You would boost C to
> >> weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what
> >> C should actually get.
> >>
> >> C should get (8/30 + 6/30 = 14/30) AFAICS.
> >>
> >> Hopefully one can see that a straight summation of weights is not enough. It
> >> needs to be something like:
> >>
> >> C's new weight = C's original weight + (A's weight) * (A's utilization)
> >>
> >> Or something, otherwise the inherited weight may be too much to properly solve it.
> >>
> >> Any thoughts on this? You mentioned you had some notes on this and/or proxy
> >> execution, could you share it?
> >
> > I assume we'll be using rt-mutex inheritance property to handle this? If this
> > was discussed during a talk, I'd appreciate a link to that.
>
> Yes that's the idea but I am also aware that 'other' kinds of dependencies in
> userspace that cannot benefit from a kernel-only boost.
>
> So if we consider a bounded-buffer producer/consumer. We can consider the
> producer as A, and the consumer as C, with B being a random mid-priority task.
> Once the bounded buffer gets full, A is waiting on a signal from C that it
> filled the buffer for which C needs to run in the first place to queue its
> payload into the buffer. However, trouble-maker B is determined to eat away's
> C's time and develop a prio inversion between itself and A. This pattern should
> also generalize to a worker pool consuming work.

Will proxy execution be able to handle these other cases? I think it is tied to
rt-mutex to be able to detect the dependency chain. This looks like a generic
extension of the problem that all potential solutions will need to deal with.

>
> In this case, there is no lock involved yet you have a dependency. But I don't
> mean to sound depressing, and just because there are cases like this does not
> mean we should not solve the lock-based ones. When I looked at Android, I saw
> that it uses futex directly from Android Runtime code instead of using pthread.
> So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do
> in the kernel will JustWork(Tm) ?

I guess it will depend on the individual libc implementation, but I thought all
of them use futexes under the hood for pthread mutexes.

Maybe we can add a bootparam to force all futexes to be FUTEX_LOCK_PI?

>
> >
> > In the past in OSPM conference we brought up an issue with performance
> > inversion where a task running on a smaller (slower to be more generic) CPU is
> > holding the lock and causing massive delays for waiters. This is an artefact of
> > DVFS. For HMP, there's an additional cause due to the unequal capacities of the
> > CPUs.
> >
> > Proxy execution seems to be the nice solution to all of these problems, but
> > it's a long way away. I'm interested to learn how this inheritance will be
> > implemented. And whether there are any userspace conversion issues. i.e: do
> > we need to convert all locks to rt-mutex locks?
>
> I am not an expert on FUTEX_LOCK_PI and this could be a good time for tglx to
> weigh in, but I think converting all userspace locks to use FUTEX_LOCK_PI sounds
> reasonable to me.

I also started wondering whether we need to worry about in-kernel kthread users
too...

pthreads requires users to specify whether they want PTHREAD_PRIO_{INHERIT, PROTECT}
when initializing their mutex.

AFAICS from glibc, PTHREAD_PRIO_INHERIT maps to FUTEX_LOCK_PI.
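
For reference, that userspace opt-in looks like this with the standard pthreads
API (just the plain initialization sequence, nothing CFS-specific):

#include <pthread.h>

/* Initialize a mutex with priority inheritance. With glibc, lock/unlock on
 * such a mutex use FUTEX_LOCK_PI/FUTEX_UNLOCK_PI when contended. */
int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int ret;

    pthread_mutexattr_init(&attr);
    ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (!ret)
        ret = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return ret;
}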

But PTHREAD_PRIO_PROTECT uses a different algorithm that is implemented in
glibc. This makes me wonder how much sense this makes on Linux, since for CFS
priorities are translated into weights and not actual priorities like in RT :-/

I need to dig more..

Expert opinions would help for sure :)


Thanks!

--
Qais Yousef

>
> Other thoughts?
>
> thanks,
>
> - Joel
>

2022-10-03 16:41:20

by Joel Fernandes

Subject: Re: Sum of weights idea for CFS PI

There's a lot to unwind, so I will reply in pieces after spending some time
thinking about it, but just for this part:

On 10/3/2022 12:14 PM, Qais Yousef wrote:
>> In this case, there is no lock involved yet you have a dependency. But I don't
>> mean to sound depressing, and just because there are cases like this does not
>> mean we should not solve the lock-based ones. When I looked at Android, I saw
>> that it uses futex directly from Android Runtime code instead of using pthread.
>> So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do
>> in the kernel will JustWork(Tm) ?
> I guess it will depend on individual libc implementation, but I thought all of
> them use FUTEX under the hood for pthreads mutexes.
>
> Maybe we can add a bootparam to force all futexes to be FUTEX_LOCK_PI?
>

In the case of FUTEX_LOCK_PI, you have to store the TID of the lock owner in
the futex word to signify that the lock is held.

That won't work for the case above - producer/consumer signalling each other on
a bounded buffer - right? That's not locking, even though it is acquisition and
release of a limited resource.

thanks,

- Joel

Subject: Re: Sum of weights idea for CFS PI

On 2022-09-30 13:34:49 [-0400], Joel Fernandes wrote:
> In this case, there is no lock involved yet you have a dependency. But I don't
> mean to sound depressing, and just because there are cases like this does not
> mean we should not solve the lock-based ones. When I looked at Android, I saw
> that it uses futex directly from Android Runtime code instead of using pthread.
> So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do
> in the kernel will JustWork(Tm) ?

The easy part is just to replace the lock/unlock functions with
FUTEX_LOCK_PI/UNLOCK_PI syscalls. The slightly more advanced part is where
you use an atomic operation to replace 0 with the thread's ID in the lock
path, to avoid going into the kernel for locking if the lock is not
contended. If it is contended, then you need to use the syscall.
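
A minimal sketch of that pattern, loosely following futex(2) (error handling,
FUTEX_WAITERS corner cases and robust-futex details are all glossed over, and
gettid() here assumes a recent glibc):

#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long sys_futex(atomic_uint *uaddr, int op)
{
    return syscall(SYS_futex, uaddr, op, 0, NULL, NULL, 0);
}

void pi_lock(atomic_uint *futex_word)
{
    unsigned int expected = 0;

    /* Fast path: uncontended, atomically claim the lock with our TID. */
    if (atomic_compare_exchange_strong(futex_word, &expected, (unsigned int)gettid()))
        return;
    /* Slow path: contended, let the kernel queue us and apply PI. */
    sys_futex(futex_word, FUTEX_LOCK_PI);
}

void pi_unlock(atomic_uint *futex_word)
{
    unsigned int expected = (unsigned int)gettid();

    /* Fast path: word still holds just our TID (no FUTEX_WAITERS bit set). */
    if (atomic_compare_exchange_strong(futex_word, &expected, 0))
        return;
    /* Slow path: there are waiters, hand over ownership via the kernel. */
    sys_futex(futex_word, FUTEX_UNLOCK_PI);
}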


> > Proxy execution seems to be the nice solution to all of these problems, but
> > it's a long way away. I'm interested to learn how this inheritance will be
> > implemented. And whether there are any userspace conversion issues. i.e: do
> > we need to convert all locks to rt-mutex locks?
>
> I am not an expert on FUTEX_LOCK_PI and this could be a good time for tglx to
> weigh in, but I think converting all userspace locks to use FUTEX_LOCK_PI sounds
> reasonable to me.

Based on my understanding of proxy execution, all in-kernel locks
should be covered.
Priority inheritance (PI) works only with FUTEX_LOCK_PI for userspace and
rtmutex for the in-kernel locks. A regular futex lock only does wait/wake,
so there is no way for the kernel to "help". Ah, and for PI
to work you need priorities that you can inherit. With two threads
running as SCHED_OTHER there will be just a "normal" sleep+wake in the
kernel. If the blocking thread is SCHED_FIFO, then the lock owner will
inherit its priority.
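
For completeness, the "priorities that you can inherit" part means the waiter
needs to be an RT task; with the standard POSIX API that is simply (shown only
to illustrate the point):

#include <pthread.h>
#include <sched.h>

/* Make the calling thread SCHED_FIFO so that, when it blocks on a
 * PTHREAD_PRIO_INHERIT mutex, the lock owner has a real RT priority to
 * inherit. With SCHED_OTHER there is currently nothing to inherit. */
int make_me_fifo(int prio)
{
    struct sched_param sp = { .sched_priority = prio };

    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}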

> Other thoughts?
>
> thanks,
>
> - Joel

Sebastian

2022-10-04 15:19:12

by Joel Fernandes

Subject: Re: Sum of weights idea for CFS PI



On 10/4/2022 7:50 AM, Sebastian Andrzej Siewior wrote:
> On 2022-09-30 13:34:49 [-0400], Joel Fernandes wrote:
>> In this case, there is no lock involved yet you have a dependency. But I don't
>> mean to sound depressing, and just because there are cases like this does not
>> mean we should not solve the lock-based ones. When I looked at Android, I saw
>> that it uses futex directly from Android Runtime code instead of using pthread.
>> So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do
>> in the kernel will JustWork(Tm) ?
>
> The easy part is just to replace the lock/unlock functions with
> FUTEX_LOCK_PI/UNLOCK_PI syscalls. The slightly advanced part is where
> you use an atomic operation to replace 0 with threads's ID in the lock
> path to avoid going into the kernel for locking if the lock is not
> contended. If it is, then you need to use the syscall.
>
> …
>>> Proxy execution seems to be the nice solution to all of these problems, but
>>> it's a long way away. I'm interested to learn how this inheritance will be
>>> implemented. And whether there are any userspace conversion issues. i.e: do
>>> we need to convert all locks to rt-mutex locks?
>>
>> I am not an expert on FUTEX_LOCK_PI and this could be a good time for tglx to
>> weigh in, but I think converting all userspace locks to use FUTEX_LOCK_PI sounds
>> reasonable to me.
>
> Based on my understanding with proxy-execution, all in-kernel locks
> should be covered.
> Priority inheritance (PI) works only with FUTEX_LOCK_PI for userpace and
> rtmutex for the in-kernel locks. Regular FUTEX_LOCK does only wait/wake
> in userspace so there is no way for the kernel to "help". Ah and for PI
> to work you need priorities that you can inherit. With two threads
> running as SCHED_OTHER there will be just "normal" sleep+wake in the
> kernel. If the blocking thread is SCHED_FIFO then it will inherit its
> priority to the lock owner.

Hi Sebastian, I agree with your thoughts on this. Yes, the proxy execution idea
should cover this. Basically, any primitive that allows userspace to let the
kernel know there is a dependency can use this, AFAICS, FUTEX_LOCK_PI being a
prime example. Perhaps Android's binder is another, where A sends a message to C
and blocks till C responds, while medium-prio B blocks C.

thanks,

- Joel

2022-10-04 16:50:54

by Qais Yousef

Subject: Re: Sum of weights idea for CFS PI

On 10/03/22 12:27, Joel Fernandes wrote:
> There's a lot to unwind so I will reply in pieces after spending some time
> thinking about it, but just for this part:
>
> On 10/3/2022 12:14 PM, Qais Yousef wrote:
> >> In this case, there is no lock involved yet you have a dependency. But I don't
> >> mean to sound depressing, and just because there are cases like this does not
> >> mean we should not solve the lock-based ones. When I looked at Android, I saw
> >> that it uses futex directly from Android Runtime code instead of using pthread.
> >> So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do
> >> in the kernel will JustWork(Tm) ?
> > I guess it will depend on individual libc implementation, but I thought all of
> > them use FUTEX under the hood for pthreads mutexes.
> >
> > Maybe we can add a bootparam to force all futexes to be FUTEX_LOCK_PI?
> >
>
> In the case of FUTEX_LOCK_PI, you have to store the TID of the 'lock owner' in
> the futex word to signify that lock is held.

Right. So userspace has to opt-in.

> That wont work for the case above, Producer/Consumer signalling each other on a
> bounded-buffer, right? That's not locking even though it is acquiring and
> release of a limited resource.

Yes, but as I tried to point out, I don't think proxy execution handles this
case where you don't hold a lock explicitly. But I could be wrong. IIUC,
Sebastian's understanding is similar to mine: only 'locks' (FUTEX_LOCK_PI,
which ends up using rt-mutex) do PI inheritance.

So this signaling scenario is a new class of problem that wasn't handled
before, to my understanding.


Thanks

--
Qais Yousef

2022-10-04 20:15:07

by Joel Fernandes

Subject: Re: Sum of weights idea for CFS PI



On 10/4/2022 12:30 PM, Qais Yousef wrote:
> On 10/03/22 12:27, Joel Fernandes wrote:
>> There's a lot to unwind so I will reply in pieces after spending some time
>> thinking about it, but just for this part:
>>
>> On 10/3/2022 12:14 PM, Qais Yousef wrote:
>>>> In this case, there is no lock involved yet you have a dependency. But I don't
>>>> mean to sound depressing, and just because there are cases like this does not
>>>> mean we should not solve the lock-based ones. When I looked at Android, I saw
>>>> that it uses futex directly from Android Runtime code instead of using pthread.
>>>> So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do
>>>> in the kernel will JustWork(Tm) ?
>>> I guess it will depend on individual libc implementation, but I thought all of
>>> them use FUTEX under the hood for pthreads mutexes.
>>>
>>> Maybe we can add a bootparam to force all futexes to be FUTEX_LOCK_PI?
>>>
>>
>> In the case of FUTEX_LOCK_PI, you have to store the TID of the 'lock owner' in
>> the futex word to signify that lock is held.
>
> Right. So userspace has to opt-in.
>
>> That wont work for the case above, Producer/Consumer signalling each other on a
>> bounded-buffer, right? That's not locking even though it is acquiring and
>> release of a limited resource.
>
> Yes but as I tried to point out I don't think proxy-execution handles this case
> where you don't hold a lock explicitly. But I could be wrong.

I don't disagree. Proxy execution is an implementation detail; without more
information from userspace, no implementation can help. I was just
responding to your point about converting all futexes, which you cannot do
without knowing what each futex is used for.

But I am thinking of messing around with rt_mutex_setprio() and some userspace
tests to see if I can make the sum-of-weights thing work for the *userspace
locking* usecases (FUTEX_LOCK_PI), then run some tests and collect some traces.
Perhaps you can do that on the Android side as well.

> IIUC Sebastian's
> understanding is similar to mine. Only 'locks' (FUTEX_LOCK_PI which ends up
> using rt-mutex) do PI inheritance.
>
> So this signaling scenario is a new class of problems that wasn't handled
> before; to my understanding.

Most certainly, agreed.

Thanks!

- Joel

2022-10-04 20:37:38

by Joel Fernandes

Subject: Re: Sum of weights idea for CFS PI



On 10/3/2022 12:14 PM, Qais Yousef wrote:
> On 09/30/22 13:34, Joel Fernandes wrote:
>>
>>
>> On 9/30/2022 9:49 AM, Qais Yousef wrote:
>>> Hi Joel
>>>
>>> I'm interested in the topic, if I can be CCed in any future discussions I'd
>>> appreciate it :)
>>
>> Yes, surely! Will do :)
>>
>>> On 09/29/22 16:38, Joel Fernandes wrote:
>>>> Hi Peter, all,
>>>>
>>>> Just following-up about the idea Peter suggested at LPC22 about sum of weights
>>>> to solve the CFS priority inversion issues using priority inheritance. I am not
>>>> sure if a straight forward summation of the weights of dependencies in the
>>>> chain, is sufficient (or may cause too much unfairness).
>>>>
>>>> I think it will work if all the tasks on CPU are 100% in utilization:
>>>>
>>>> Say if you have 4 tasks (A, B, C, D) running and each one has equal
>>>> weight (W) except for A which has twice the weight (2W).
>>>> So the CPU bandwidth distribution is (assuming all are running):
>>>> A: 2 / 5
>>>> B, C. D: 1 / 5
>>>>
>>>> Say out of the 4 tasks, 3 of them are a part of a classical priority
>>>> inversion scenario (A, B and C).
>>>>
>>>> Say now A blocks on a lock and that lock's owner C is running, however now
>>>> because A has blocked, B gets 1/3 bandwidth, where as it should have been
>>>> limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4
>>>> bandwidth - still not fair since B is eating away CPU bandwidth causing the
>>>> priority inversion we want to remedy.
>>>>
>>>> The correct bandwidth distribution should be (B and D should be unchanged):
>>>> B = 1/5
>>>> D = 1/5
>>>>
>>>> C = 3/5
>>>>
>>>> This means that C's weight should be 3W , and B and D should be W each
>>>> as before. So indeed, C's new weight is its original weight PLUS the
>>>> weight of the A - that's needed to keep the CPU usage of the other
>>>> tasks (B, D) in check so that C makes forward progress on behalf of A and the
>>>> other tasks don't eat into the CPU utilization.
>>>>
>>>> However, I think this will kinda fall apart if A is asleep 50% of the time
>>>> (assume the sleep is because of I/O and unrelated to the PI chain).
>>>>
>>>> Because now if all were running (and assume no PI dependencies), with A being
>>>> 50%, the bandwidth of B, C and D each would be divided into 2 components:
>>>>
>>>> a. when A is running, it would be as above.
>>>> b. but if A was sleeping, B, C, and D would get 1/3.
>>>>
>>>> So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30
>>>> or 1/5 bandwidth.
>>>
>>> The average metric is interesting one. It can be confusing to reason about too.
>>>
>>> I think we have 3 events to take into account here, not 2:
>>>
>>> a. when A is running and NOT blocked on C.
>>> b. when A is running and BLOCKED on C.
>>> c. A is sleeping.
>>>
>>> This means A, B, C and D's shares will be:
>>>
>>> A , B , C , D
>>> a. 2/5, 1/5, 1/5, 1/5
>>> b. - , 3/5, 1/5, 1/5
>>
>> Here did you mean:
>> b. -, 1/5, 3/5, 1/5 ?
>
> Yes! Sorry a typo.
>
>>
>> A blocked on C means, C should get A's weight (on top of its own).
>>
>>> c. - , 1/3, 1/3, 1/3
>>>
>>> Since A is sleeping for 50%, I don't think we can assume equal distribution for
>>> the 3 events (can't just divide by 3).
>>
>> Oh yeah, I did not get to _that_ part of the story yet at this point of the
>> email, I brought it up later below where I say "But now say A happen to block"..
>
> Hmm, but that's my point, I think you need to take _that_ part here or we're
> not be doing an apple to apple comparison.
>
>>
>>>
>>> I believe we can assume that
>>>
>>> a. occurs 25% of the time
>>> b. occurs 25% of the time
>>> c. occurs 50% of the time
>>>
>>> I *think* this should provide something more representative.
>>
>> Yes possible. My basic idea was I was trying to *not* change the distribution of
>> B based on whether A blocks on C, or whether it does not. In my view, B's
>> distribution should not change just because A and C have a lock dependency,
>> because otherwise C can get too much or too little time. If C gets too much
>> time, then its hurting B. If C gets too little time, then its hurting A.
>
> Maybe I am getting confused. But AFAICT you're treating
>
>
> a. when A is running, it would be as above.
> b. but if A was sleeping, B, C, and D would get 1/3.
>
> similar to
>
> a. when A is running *and blocked on C for all its runtime*
> b. but if A was sleeping, B, C, and D would get 1/3.


I am treating the following the same:

a. when A is running, it would be as above.
b. but if A was sleeping, B, C, and D would get 1/3.

similar to

a. when A is running *and blocked on C for all its runtime*
^^ -- in this case, B and D should not have their distributions
changed at all because they are not participating in the
lock acquire and release. So they should neither be hurt
any more, nor be boosted. They should simply stay same [1]

b. but if A was sleeping, B, C, and D would get 1/3.


[1] Why? Consider 3 tasks in the all-RT case: A high, B medium and C low prio.

If all are running 100% and A does not block on C, B is blocked by A
indefinitely. Now, when A blocks on the lock that C holds, B gets to run while
A waits - so the priorities of A and B are effectively inverted. We seek to
rectify this; that is, we need to make changes such that B is returned to the
blocked state. We do this by boosting C.

In other words, the priority inheritance will cause B's distribution to not be
changed (it was supposed to be blocked before, and it is now going to be in the
blocked state again).

CFS should not behave any differently, B's distribution should not be changed
before/after the priority inhertiance of A by C. That's just my opinion - and
that's how I calculated to distribution. With that mind, could you go back to
seeing if my math was originally correct or did I mess something up?
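
To put the arithmetic behind that in one place, here is a trivial userspace
sketch (plain C, nothing kernel-specific; W is 1 and the 2W/W/3W weights are
the ones from the example, so this only illustrates the numbers, not any
actual implementation):

#include <stdio.h>

/* Print each runnable task's share of the CPU given its weight. */
static void print_shares(const char *when, const char **names,
                         const int *weights, int n)
{
    int i, sum = 0;

    for (i = 0; i < n; i++)
        sum += weights[i];

    printf("%s\n", when);
    for (i = 0; i < n; i++)
        printf("  %s: %d/%d\n", names[i], weights[i], sum);
}

int main(void)
{
    /* All four runnable: A has weight 2W, the rest W each. */
    const char *all[] = { "A", "B", "C", "D" };
    const int w_all[] = { 2, 1, 1, 1 };

    /* A blocked on C: C inherits A's weight on top of its own (3W). */
    const char *boosted[] = { "B", "C", "D" };
    const int w_boosted[] = { 1, 3, 1 };

    print_shares("A runnable, no boost:", all, w_all, 4);
    print_shares("A blocked on C, C boosted to 3W:", boosted, w_boosted, 3);
    return 0;
}

B and D come out at 1/5 in both cases, which is the invariant I am arguing
for here.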

I do think, though, that Youssef's point about not worrying too much about
accuracy is reasonable if the critical sections are short-lived, but I'm not
sure.

>
> I don't think this is valid. If A is blocked on C for 50% of the time, and
> sleeping for 50% of the time, when did it get blocked/unblocked?
>
> This will have an impact on the average share for C and skew it, no?
>
> Unless I missed something, the average share of C being (3/5 + 1/3) is an
> impossible state. You need to consider the portion of time when C runs as 1/5,
> when A is actually not blocked on anything, too.
>
> Hmm actually I just re-read your statement below and you just say 3/5 (18/30)
> is too much. You didn't consider the average. I'll leave the above in hope to
> help me understand what am I missing and where I went wrong :-)
>
> Generally IMHO looking at the average will not help. I think if the share
> values make sense in each state individually (and I believe they are), that
> would be enough. AFAICS, B and D are still taking the right amount of time when
> C inherits the bandwidth. And C by definition will run longer when A is blocked
> on it for the whole duration of this blocked time.

I was collapsing the case where A sleeps (say on I/O) vs A blocks, to simplify
the math, and then taking the average of that. I think that's reasonable?
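
Concretely, the averaging I had in mind is no more than this (again just a
userspace sketch of the arithmetic, assuming A is runnable exactly 50% of the
time; not meant as real scheduler code):

#include <stdio.h>

int main(void)
{
    /* A is runnable only 50% of the time (sleeping on I/O otherwise). */
    double a_util = 0.5;

    /* While A runs: A gets 2/5, B/C/D get 1/5 each.
     * While A sleeps: B/C/D get 1/3 each. */
    double bcd_avg = a_util * (1.0 / 5) + (1.0 - a_util) * (1.0 / 3);
    double a_avg   = a_util * (2.0 / 5);

    /* What C "should" get if it also does A's work: its own average
     * share plus the share A would have consumed. */
    double c_target = bcd_avg + a_avg;

    /* What a straight 3W boost gives C while A is blocked on it. */
    double c_boosted = 3.0 / 5;

    printf("B/C/D average share: %.4f (8/30)\n", bcd_avg);
    printf("A average share:     %.4f (6/30)\n", a_avg);
    printf("C target:            %.4f (14/30)\n", c_target);
    printf("C with 3W boost:     %.4f (18/30)\n", c_boosted);
    return 0;
}

Which is why I suggested scaling the inherited weight by A's utilization
rather than adding it in full.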

>
>>
>>>> But now say A happen to block on a lock that C is holding. You would boost C to
>>>> weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what
>>>> C should actually get.
>>>>
>>>> C should get (8/30 + 6/30 = 14/30) AFAICS.
>>>>
>>>> Hopefully one can see that a straight summation of weights is not enough. It
>>>> needs to be something like:
>>>>
>>>> C's new weight = C's original weight + (A's weight) * (A's utilization)
>>>>
>>>> Or something, otherwise the inherited weight may be too much to properly solve it.
>>>>
>>>> Any thoughts on this? You mentioned you had some notes on this and/or proxy
>>>> execution, could you share it?
>>>
>>> I assume we'll be using rt-mutex inheritance property to handle this? If this
>>> was discussed during a talk, I'd appreciate a link to that.
>>
>> Yes that's the idea but I am also aware that 'other' kinds of dependencies in
>> userspace that cannot benefit from a kernel-only boost.
>>
>> So if we consider a bounded-buffer producer/consumer. We can consider the
>> producer as A, and the consumer as C, with B being a random mid-priority task.
>> Once the bounded buffer gets full, A is waiting on a signal from C that it
>> filled the buffer for which C needs to run in the first place to queue its
>> payload into the buffer. However, trouble-maker B is determined to eat away's
>> C's time and develop a prio inversion between itself and A. This pattern should
>> also generalize to a worker pool consuming work.
>
> Will proxy-execution be able to handle these other cases? I think it is tied to
> rt-mutex to be able to detect the dependency chain too. Looks a generic
> extension of the problem that all potential solutions will need to deal with.

Yeah, in theory, if the producer and consumer can "mark" themselves as
potentially having someone waiting on them, then the kernel has that
information and any implementation may capitalize on it. I am not sure if this
is already possible via some command in the futex API.

>>> In the past in OSPM conference we brought up an issue with performance
>>> inversion where a task running on a smaller (slower to be more generic) CPU is
>>> holding the lock and causing massive delays for waiters. This is an artefact of
>>> DVFS. For HMP, there's an additional cause due to the unequal capacities of the
>>> CPUs.
>>>
>>> Proxy execution seems to be the nice solution to all of these problems, but
>>> it's a long way away. I'm interested to learn how this inheritance will be
>>> implemented. And whether there are any userspace conversion issues. i.e: do
>>> we need to convert all locks to rt-mutex locks?
>>
>> I am not an expert on FUTEX_LOCK_PI and this could be a good time for tglx to
>> weigh in, but I think converting all userspace locks to use FUTEX_LOCK_PI sounds
>> reasonable to me.
>
> I realized whether we need to worry about in-kernel kthreads users too..

Possibly.

> pthreads requires users to specify if they want PTHREAD_PRIO_{INHERIT, PROTECT}
> when initializing their mutex.

PTHREAD_PRIO_PROTECT seems like an interesting approach! Thanks for mentioning.
> AFAICS from glibc, PTHREAD_PRIO_INHERIT maps to FUTEX_LOCK_PI.

Ok.
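
For reference, the userspace opt-in is just the standard mutexattr protocol
knob. Something like the below (untested sketch, build with -pthread) should
end up using FUTEX_LOCK_PI under the hood with glibc:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock;

int main(void)
{
    pthread_mutexattr_t attr;
    int ret;

    pthread_mutexattr_init(&attr);
    /* PTHREAD_PRIO_INHERIT is what glibc maps to FUTEX_LOCK_PI. */
    ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (ret) {
        fprintf(stderr, "setprotocol failed: %d\n", ret);
        return 1;
    }
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);

    /* While held, the futex word carries the owner TID, so the kernel
     * knows whom to boost when a waiter blocks here. */
    pthread_mutex_lock(&lock);
    pthread_mutex_unlock(&lock);

    pthread_mutex_destroy(&lock);
    return 0;
}
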
> But PTHREAD_PRIO_PROTOTECT uses a different algorithm that is implemented in
> glibc. This makes me think how much this makes sense in linux as for CFS
> priorities are transalted into weights and not actual priorites like in RT :-/
>
> I need to dig more..

Ah same here, more reading/digging to do ;).


- Joel

2022-10-05 10:07:42

by Qais Yousef

[permalink] [raw]
Subject: Re: Sum of weights idea for CFS PI

On 10/04/22 15:48, Joel Fernandes wrote:
>
>
> On 10/4/2022 12:30 PM, Qais Yousef wrote:
> > On 10/03/22 12:27, Joel Fernandes wrote:
> >> There's a lot to unwind so I will reply in pieces after spending some time
> >> thinking about it, but just for this part:
> >>
> >> On 10/3/2022 12:14 PM, Qais Yousef wrote:
> >>>> In this case, there is no lock involved yet you have a dependency. But I don't
> >>>> mean to sound depressing, and just because there are cases like this does not
> >>>> mean we should not solve the lock-based ones. When I looked at Android, I saw
> >>>> that it uses futex directly from Android Runtime code instead of using pthread.
> >>>> So perhaps this can be trivially converted to FUTEX_LOCK_PI and then what we do
> >>>> in the kernel will JustWork(Tm) ?
> >>> I guess it will depend on individual libc implementation, but I thought all of
> >>> them use FUTEX under the hood for pthreads mutexes.
> >>>
> >>> Maybe we can add a bootparam to force all futexes to be FUTEX_LOCK_PI?
> >>>
> >>
> >> In the case of FUTEX_LOCK_PI, you have to store the TID of the 'lock owner' in
> >> the futex word to signify that lock is held.
> >
> > Right. So userspace has to opt-in.
> >
> >> That wont work for the case above, Producer/Consumer signalling each other on a
> >> bounded-buffer, right? That's not locking even though it is acquiring and
> >> release of a limited resource.
> >
> > Yes but as I tried to point out I don't think proxy-execution handles this case
> > where you don't hold a lock explicitly. But I could be wrong.
>
> I don't disagree. Proxy execution is an implementation detail, without more
> information from userspace, any implementation cannot help. I was just
> responding to your point about converting all futexes which you cannot do
> without knowing what the futex is used for.

+1

I don't think I have read much literature on priority inversion caused by
waiting on signals. I need to research that.

I think it is considered a voluntary sleep, and sane system design should
ensure the priorities of these two tasks don't lead to starvation, based on
the expected rate of the producer/consumer.

It doesn't seem to be a problem for PREEMPT_RT, since nobody has done anything
about it AFAICT?

It could be that in CFS priorities are weights (or bandwidth), and this
introduces a new class of problems. I think we should still ask whether the
priority assignment is wrong when this happens. If there's a clear
relationship between producer and consumer, should they have the same priority
if they do an equal amount of work?

>
> But I am thinking of messing around with rt_mutex_setprio() and some userspace
> tests to see if I can make the sum of weights thing work for the *userspace
> locking* usecases (FUTEX_LOCK_PI). Then run some tests and collect some traces.
> Perhaps you can do that on the Android side as well.

I'd be happy to help, yes :)

In my view, the trickiest part is how to account for the stolen time. If C
gets a 3/5 share and runs for 2/5 before releasing the lock, then when A wakes
up it should perceive that it already ran for 1/5 (the time C stole from A
while holding the lock) and run only for another 1/5 before getting preempted,
to preserve its 2/5 share. That is, IF we want to be very accurate.

If the two tasks are not on the same rq, I don't think that will change how
things are done a lot..
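
Roughly, the bookkeeping I am thinking of is just this (userspace arithmetic
only, using the 2/5 and 1/5 figures from the example above; the real
implementation would of course have to express this in vruntime terms):

#include <stdio.h>

int main(void)
{
    /* Shares over one period, from the example above. */
    double a_share = 2.0 / 5;   /* A's fair share                    */
    double c_own   = 1.0 / 5;   /* C's own fair share                */
    double c_ran   = 2.0 / 5;   /* what C actually ran while boosted
                                   and holding the lock              */

    /* Whatever C ran beyond its own share was "stolen" from A. */
    double stolen      = c_ran - c_own;
    double a_remaining = a_share - stolen;

    printf("time stolen from A:         %.2f (1/5)\n", stolen);
    printf("A may still run this much:  %.2f (1/5)\n", a_remaining);
    return 0;
}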

>
> > IIUC Sebastian's
> > understanding is similar to mine. Only 'locks' (FUTEX_LOCK_PI which ends up
> > using rt-mutex) do PI inheritance.
> >
> > So this signaling scenario is a new class of problems that wasn't handled
> > before; to my understanding.
> Most certainly, agreed.


Sorry, I am thinking out loud for most/all of my reply :)


Thanks!

--
Qais Yousef

2022-10-05 10:11:15

by Qais Yousef

[permalink] [raw]
Subject: Re: Sum of weights idea for CFS PI

Hi Joel

On 10/04/22 16:27, Joel Fernandes wrote:

[...]

> I am treating the following the same:
>
> a. when A is running, it would be as above.
> b. but if A was sleeping, B, C, and D would get 1/3.
>
> similar to
>
> a. when A is running *and blocked on C for all its runtime*
> ^^ -- in this case, B and D should not have their distributions
> changed at all because they are not participating in the
> lock acquire and release. So they should neither be hurt
> any more, nor be boosted. They should simply stay same [1]
>
> b. but if A was sleeping, B, C, and D would get 1/3.
>
>
> [1] Why? Consider 3 tasks in the all-RT case, A high, B medium and C low prio.
>
> If all are running 100% and A does not block on C, B is blocked by A
> indefinitely. So the prio of A and B are inverted. We seek to rectify this, that
> is we need make changes such that, B is returned back to the blocked state. We
> do this by boosting C.
>
> In other words, the prio inheritance will cause B's distribution to not be
> changed (it was supposed to be blocked before and it is now going to be blocked
> state again).
>
> CFS should not behave any differently, B's distribution should not be changed
> before/after the priority inhertiance of A by C. That's just my opinion - and
> that's how I calculated to distribution. With that mind, could you go back to
> seeing if my math was originally correct or did I mess something up?

It's not about the math. But I think the before and after can't be the same for
C..

> I do think though that Youssef's point of not worrying about being too accurate
> is reasonable if the critical sections are short lived but I'm not sure.

.. I do agree with that as well. I was just trying to highlight that looking
at the average can be misleading, and I don't see C taking too much time.

If I have any worry, it'd be about not accounting correctly for the stolen
time C takes from A. Otherwise the combined A + C share would be higher than
it should be. That might be the problem you're trying to highlight but I am
unable to see. But this is an implementation detail and an artefact of wrong
accounting, not of how shares are summed.

> > I don't think this is valid. If A is blocked on C for 50% of the time, and
> > sleeping for 50% of the time, when did it get blocked/unblocked?
> >
> > This will have an impact on the average share for C and skew it, no?
> >
> > Unless I missed something, the average share of C being (3/5 + 1/3) is an
> > impossible state. You need to consider the portion of time when C runs as 1/5,
> > when A is actually not blocked on anything, too.
> >
> > Hmm actually I just re-read your statement below and you just say 3/5 (18/30)
> > is too much. You didn't consider the average. I'll leave the above in hope to
> > help me understand what am I missing and where I went wrong :-)
> >
> > Generally IMHO looking at the average will not help. I think if the share
> > values make sense in each state individually (and I believe they are), that
> > would be enough. AFAICS, B and D are still taking the right amount of time when
> > C inherits the bandwidth. And C by definition will run longer when A is blocked
> > on it for the whole duration of this blocked time.
>
> I was degenerating the case where A sleeps (say I/O) vs A blocks, to simplify
> the math, and then taking average of that. I think that's reasonable?

I'm not sure. This is skewing the results in my view.

I think the comparison should just be:

1) A, B, C, and D are all running and nothing gets blocked at all. Then shares
would be:

2/5, 1/5, 1/5, 1/5

2) A is blocked on C; B, C, D are running with no blocked time. Shares would
be:

- , 1/5, 3/5, 1/5

By definition, we want to treat A in (2) as RUNNING because as soon as
C unblocks A we should return to (1). From B and D's perspective, their share
is not impacted throughout this transition. Which, AFAIU, is what we want to
achieve.

I think considering the sleeping time and averaging can lead to misleading
results if care is not taken.

Anyway - just trying to explain how I see it and why C is unlikely to be taking
too much time. I could be wrong. As Youssef said, I think there's no
fundamental problem here.

My 2 cents ;-)


Thanks!

--
Qais Yousef

2022-10-06 13:58:05

by Joel Fernandes

[permalink] [raw]
Subject: Re: Sum of weights idea for CFS PI

On Wed, Oct 5, 2022 at 6:04 AM Qais Yousef <[email protected]> wrote:
>
> Hi Joel
>
> On 10/04/22 16:27, Joel Fernandes wrote:
>
> [...]
>
> > I am treating the following the same:
> >
> > a. when A is running, it would be as above.
> > b. but if A was sleeping, B, C, and D would get 1/3.
> >
> > similar to
> >
> > a. when A is running *and blocked on C for all its runtime*
> > ^^ -- in this case, B and D should not have their distributions
> > changed at all because they are not participating in the
> > lock acquire and release. So they should neither be hurt
> > any more, nor be boosted. They should simply stay same [1]
> >
> > b. but if A was sleeping, B, C, and D would get 1/3.
> >
> >
> > [1] Why? Consider 3 tasks in the all-RT case, A high, B medium and C low prio.
> >
> > If all are running 100% and A does not block on C, B is blocked by A
> > indefinitely. So the prio of A and B are inverted. We seek to rectify this, that
> > is we need make changes such that, B is returned back to the blocked state. We
> > do this by boosting C.
> >
> > In other words, the prio inheritance will cause B's distribution to not be
> > changed (it was supposed to be blocked before and it is now going to be blocked
> > state again).
> >
> > CFS should not behave any differently, B's distribution should not be changed
> > before/after the priority inhertiance of A by C. That's just my opinion - and
> > that's how I calculated to distribution. With that mind, could you go back to
> > seeing if my math was originally correct or did I mess something up?
>
> It's not about the math. But I think the before and after can't be the same for
> C..

C is acquiring/releasing the lock so I expect its distribution to
change. I was talking about the poor B who has nothing to do with the
lock.

> > > I don't think this is valid. If A is blocked on C for 50% of the time, and
> > > sleeping for 50% of the time, when did it get blocked/unblocked?
> > >
> > > This will have an impact on the average share for C and skew it, no?
> > >
> > > Unless I missed something, the average share of C being (3/5 + 1/3) is an
> > > impossible state. You need to consider the portion of time when C runs as 1/5,
> > > when A is actually not blocked on anything, too.
> > >
> > > Hmm actually I just re-read your statement below and you just say 3/5 (18/30)
> > > is too much. You didn't consider the average. I'll leave the above in hope to
> > > help me understand what am I missing and where I went wrong :-)
> > >
> > > Generally IMHO looking at the average will not help. I think if the share
> > > values make sense in each state individually (and I believe they are), that
> > > would be enough. AFAICS, B and D are still taking the right amount of time when
> > > C inherits the bandwidth. And C by definition will run longer when A is blocked
> > > on it for the whole duration of this blocked time.
> >
> > I was degenerating the case where A sleeps (say I/O) vs A blocks, to simplify
> > the math, and then taking average of that. I think that's reasonable?
>
> I'm not sure. This is skewing the results in my view.
>
> I think the comparison should just be:
>
> 1) A, B, C, and D are all running and nothing gets blocked at all. Then shares
> would be:
>
> 2/5, 1/5, 1/5, 1/5
>
> 2) A is blocked and C; B, C, D are running with no blocked time. Shares would
> be:
>
> - , 1/5, 3/5, 1/5
>
> By definition, we want to treat A in (2) as RUNNING because as soon as
> C unblocks A we should return to (1). From B and D perspective, their share is
> not impacted throughout this transition. Which is AFAIU is what we want to
> achieve.
>
> I think considering the sleeping time and averaging can lead to misleading
> results if care is not taken.

Yes, but that doesn't mean we can just ignore it. It is easy in my
view to skew the inherited weight to a very large number, only to find
that tasks unrelated to the lock acquire/release are "suffering"
though they had nothing to do with the lock or the PI. But it is
reasonable to try the simple approach first and see the impact.

I also never said the averaging approach or consideration of sleeping
time is perfect ;-)

> Anyway - just trying to explain how I see it and why C is unlikely to be taking
> too much time. I could be wrong. As Youssef said, I think there's no
> fundamental problem here.

I know that on Android, where they use a smaller HZ, the large tick causes
lots of problems for large nice deltas. For example, if a highly niced task
was supposed to be preempted after 1ms but is preempted at 3ms instead, then
the less-niced task will not be so nice (even less nice than it promised to
be) any more, because of the 2ms boost that the higher-niced task got. This
can lead to sched_latency being thrown out of the window. Not adjusting the
weights properly can potentially make that problem much worse IMO.
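
To put rough numbers on that (a sketch only; 1024 and 15 are the kernel's
nice-0 and nice-19 weights, but the 6ms sched_latency and the 10ms tick are
just assumptions for illustration):

#include <stdio.h>

int main(void)
{
    /* Kernel nice-to-weight values: nice 0 -> 1024, nice 19 -> 15. */
    double w_normal = 1024, w_niced = 15;

    /* Illustrative targeted preemption latency. */
    double sched_latency_ms = 6.0;

    /* Slice the niced task is entitled to when both are runnable. */
    double niced_slice_ms =
        sched_latency_ms * w_niced / (w_niced + w_normal);

    /* With HZ=100 the tick is 10ms, so the niced task may overrun its
     * tiny slice by up to roughly a full tick before the tick-based
     * preemption check notices. */
    double tick_ms = 10.0;

    printf("niced task slice:   %.3f ms\n", niced_slice_ms);
    printf("possible overrun:  ~%.1f ms\n", tick_ms - niced_slice_ms);
    return 0;
}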

Thanks.

2022-10-06 20:05:22

by Youssef Esmat

[permalink] [raw]
Subject: Re: Sum of weights idea for CFS PI

On Thu, Oct 6, 2022 at 8:53 AM Joel Fernandes <[email protected]> wrote:
>
> On Wed, Oct 5, 2022 at 6:04 AM Qais Yousef <[email protected]> wrote:
> >
> > Hi Joel
> >
> > On 10/04/22 16:27, Joel Fernandes wrote:
> >
> > [...]
> >
> > > I am treating the following the same:
> > >
> > > a. when A is running, it would be as above.
> > > b. but if A was sleeping, B, C, and D would get 1/3.
> > >
> > > similar to
> > >
> > > a. when A is running *and blocked on C for all its runtime*
> > > ^^ -- in this case, B and D should not have their distributions
> > > changed at all because they are not participating in the
> > > lock acquire and release. So they should neither be hurt
> > > any more, nor be boosted. They should simply stay same [1]
> > >
> > > b. but if A was sleeping, B, C, and D would get 1/3.
> > >
> > >
> > > [1] Why? Consider 3 tasks in the all-RT case, A high, B medium and C low prio.
> > >
> > > If all are running 100% and A does not block on C, B is blocked by A
> > > indefinitely. So the prio of A and B are inverted. We seek to rectify this, that
> > > is we need make changes such that, B is returned back to the blocked state. We
> > > do this by boosting C.
> > >
> > > In other words, the prio inheritance will cause B's distribution to not be
> > > changed (it was supposed to be blocked before and it is now going to be blocked
> > > state again).
> > >
> > > CFS should not behave any differently, B's distribution should not be changed
> > > before/after the priority inhertiance of A by C. That's just my opinion - and
> > > that's how I calculated to distribution. With that mind, could you go back to
> > > seeing if my math was originally correct or did I mess something up?
> >
> > It's not about the math. But I think the before and after can't be the same for
> > C..
>
> C is acquiring/releasing the lock so I expect its distribution to
> change. I was talking about the poor B who has nothing to do with the
> lock.
>
> > > > I don't think this is valid. If A is blocked on C for 50% of the time, and
> > > > sleeping for 50% of the time, when did it get blocked/unblocked?
> > > >
> > > > This will have an impact on the average share for C and skew it, no?
> > > >
> > > > Unless I missed something, the average share of C being (3/5 + 1/3) is an
> > > > impossible state. You need to consider the portion of time when C runs as 1/5,
> > > > when A is actually not blocked on anything, too.
> > > >
> > > > Hmm actually I just re-read your statement below and you just say 3/5 (18/30)
> > > > is too much. You didn't consider the average. I'll leave the above in hope to
> > > > help me understand what am I missing and where I went wrong :-)
> > > >
> > > > Generally IMHO looking at the average will not help. I think if the share
> > > > values make sense in each state individually (and I believe they are), that
> > > > would be enough. AFAICS, B and D are still taking the right amount of time when
> > > > C inherits the bandwidth. And C by definition will run longer when A is blocked
> > > > on it for the whole duration of this blocked time.
> > >
> > > I was degenerating the case where A sleeps (say I/O) vs A blocks, to simplify
> > > the math, and then taking average of that. I think that's reasonable?
> >
> > I'm not sure. This is skewing the results in my view.
> >
> > I think the comparison should just be:
> >
> > 1) A, B, C, and D are all running and nothing gets blocked at all. Then shares
> > would be:
> >
> > 2/5, 1/5, 1/5, 1/5
> >
> > 2) A is blocked and C; B, C, D are running with no blocked time. Shares would
> > be:
> >
> > - , 1/5, 3/5, 1/5
> >
> > By definition, we want to treat A in (2) as RUNNING because as soon as
> > C unblocks A we should return to (1). From B and D perspective, their share is
> > not impacted throughout this transition. Which is AFAIU is what we want to
> > achieve.
> >
> > I think considering the sleeping time and averaging can lead to misleading
> > results if care is not taken.
>
> Yes, but that doesn't mean we can just ignore it. It is easy in my
> view to skew the inherited weight to a very large number, only to find
> that tasks unrelated to the lock acquire/release are "suffering"
> though they had nothing to do with the lock or the PI. But it is
> reasonable to try the simple approach first and see the impact.
>
> I also never said the averaging approach or consideration of sleeping
> time is perfect ;-)
>
> > Anyway - just trying to explain how I see it and why C is unlikely to be taking
> > too much time. I could be wrong. As Youssef said, I think there's no
> > fundamental problem here.
>
> I know on Android where they use smaller HZ, the large tick causes
> lots of problems for large nice deltas. Example if a highly niced task
> was to be preempted for 1ms, and preempts instead at 3ms, then the
> less-niced task will not be so nice (even less nice than it promised
> to be) any more because of the 2ms boost that the higher niced task
> got. This can lead the the sched_latency thrown out of the window. Not
> adjusting the weights properly can potentially make that problem much
> worse IMO.

Once C releases the lock it should get adjusted and A will get
adjusted also regardless of tick. At the point we adjust the weights
we have a chance to check for preemption and cause a reschedule.

If C doesn't release the lock quickly (hopefully rare), it should
continue to run at the adjusted weight since it is still blocking A.

>
> Thanks.

2022-10-08 16:31:16

by Joel Fernandes

[permalink] [raw]
Subject: Re: Sum of weights idea for CFS PI



> On Oct 6, 2022, at 3:40 PM, Youssef Esmat <[email protected]> wrote:
>
[..]
>>
>>> Anyway - just trying to explain how I see it and why C is unlikely to be taking
>>> too much time. I could be wrong. As Youssef said, I think there's no
>>> fundamental problem here.
>>
>> I know on Android where they use smaller HZ, the large tick causes
>> lots of problems for large nice deltas. Example if a highly niced task
>> was to be preempted for 1ms, and preempts instead at 3ms, then the
>> less-niced task will not be so nice (even less nice than it promised
>> to be) any more because of the 2ms boost that the higher niced task
>> got. This can lead the the sched_latency thrown out of the window. Not
>> adjusting the weights properly can potentially make that problem much
>> worse IMO.
>
> Once C releases the lock it should get adjusted and A will get
> adjusted also regardless of tick. At the point we adjust the weights
> we have a chance to check for preemption and cause a reschedule.

Yes, but the lock can be held for a potentially long time (and it could even be a user space lock). I’m more comfortable with Peter’s PE patch, which seems a more generic solution than sum of weights, if we can get it working. I’m studying Connor’s patch set now…

> If C doesn't release the lock quickly (hopefully rare), it should
> continue to run at the adjusted weight since it is still blocking A.

We can’t depend on rare. Even one bad instance is a fail. So if the lock holder and releaser go crazy, we can’t let them destabilize the system. After all, this is CFS and fairness should not be broken, even if rarely.

Thanks.


>
>>
>> Thanks.

2022-10-10 15:13:08

by Qais Yousef

[permalink] [raw]
Subject: Re: Sum of weights idea for CFS PI

On 10/08/22 11:04, Joel Fernandes wrote:
>
>
> > On Oct 6, 2022, at 3:40 PM, Youssef Esmat <[email protected]> wrote:
> >
> [..]
> >>
> >>> Anyway - just trying to explain how I see it and why C is unlikely to be
> >>> taking too much time. I could be wrong. As Youssef said, I think there's
> >>> no fundamental problem here.
> >>
> >> I know on Android where they use smaller HZ, the large tick causes lots of
> >> problems for large nice deltas. Example if a highly niced task was to be
> >> preempted for 1ms, and preempts instead at 3ms, then the less-niced task
> >> will not be so nice (even less nice than it promised to be) any more
> >> because of the 2ms boost that the higher niced task got. This can lead the
> >> the sched_latency thrown out of the window. Not adjusting the weights
> >> properly can potentially make that problem much worse IMO.
> >
> > Once C releases the lock it should get adjusted and A will get adjusted
> > also regardless of tick. At the point we adjust the weights we have
> > a chance to check for preemption and cause a reschedule.
>
> Yes but the lock can be held for potentially long time (and even user space
> lock). I’m more comfortable with Peter’s PE patch which seems a more generic
> solution, than sum of weights if we can get it working. I’m studying Connor’s
> patch set now…

The 2 solutions are equivalent AFAICT.


With summation:

A , B , C , D
sleeping, running, running, running
- , 1/5 , 3/5 , 1/5

Where we'll treat A as running but donate its bandwidth to C, the mutex owner.



With PE:

A , B , C , D
running, running, running, running
2/5 , 1/5 , 1/5 , 1/5

Where A will donate its execution context to C, the mutex owner.


In both cases we should end up with the same distribution as if neither A nor
C ever go to sleep because of holding the mutex.


I still can't see how B and D's fairness will be impacted, as the solution to
the problem is to never treat a waiter as sleeping and to let the owner run
for more, but only within the limit of what the waiter is allowed to run for.
AFAICS, both solutions maintain this relationship.
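
As a sanity check of that equivalence, the arithmetic in both models gives the
same numbers (userspace sketch only; "running in A's context" is obviously far
more involved in the real PE patches):

#include <stdio.h>

int main(void)
{
    /* Weights: A = 2W, B = C = D = W, with W == 1. A is blocked on C. */

    /* Sum-of-weights: A is dequeued, C runs with weight 1 + 2 = 3. */
    double sum_w = 1 + 3 + 1;
    double c_summation = 3 / sum_w;
    double b_summation = 1 / sum_w;

    /* Proxy execution: A stays enqueued with weight 2, but C executes
     * in A's scheduler context, so C ends up consuming A's 2/5 slot on
     * top of its own 1/5. */
    double pe_w = 2 + 1 + 1 + 1;
    double c_pe = (2 + 1) / pe_w;
    double b_pe = 1 / pe_w;

    printf("CPU time executed by C: summation %.2f, PE %.2f\n",
           c_summation, c_pe);
    printf("B (and D) share:        summation %.2f, PE %.2f\n",
           b_summation, b_pe);
    return 0;
}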


Thanks

--
Qais Yousef

2022-10-10 15:28:58

by Joel Fernandes

[permalink] [raw]
Subject: Re: Sum of weights idea for CFS PI

On Mon, Oct 10, 2022 at 10:46 AM Qais Yousef <[email protected]> wrote:
>
> On 10/08/22 11:04, Joel Fernandes wrote:
> >
> >
> > > On Oct 6, 2022, at 3:40 PM, Youssef Esmat <[email protected]> wrote:
> > >
> > [..]
> > >>
> > >>> Anyway - just trying to explain how I see it and why C is unlikely to be
> > >>> taking too much time. I could be wrong. As Youssef said, I think there's
> > >>> no fundamental problem here.
> > >>
> > >> I know on Android where they use smaller HZ, the large tick causes lots of
> > >> problems for large nice deltas. Example if a highly niced task was to be
> > >> preempted for 1ms, and preempts instead at 3ms, then the less-niced task
> > >> will not be so nice (even less nice than it promised to be) any more
> > >> because of the 2ms boost that the higher niced task got. This can lead the
> > >> the sched_latency thrown out of the window. Not adjusting the weights
> > >> properly can potentially make that problem much worse IMO.
> > >
> > > Once C releases the lock it should get adjusted and A will get adjusted
> > > also regardless of tick. At the point we adjust the weights we have
> > > a chance to check for preemption and cause a reschedule.
> >
> > Yes but the lock can be held for potentially long time (and even user space
> > lock). I’m more comfortable with Peter’s PE patch which seems a more generic
> > solution, than sum of weights if we can get it working. I’m studying Connor’s
> > patch set now…
>
> The 2 solutions are equivalent AFAICT.

Possibly. Maybe I am talking about a non-issue then, but I had to be
careful ;-) Maybe both have the issue I was referring to, or they
don't. But in any case, PE seems more organic.

> With summation:
>
> A , B , C , D
> sleeping, running, running, running
> - , 1/5 , 3/5 , 1/5
>
> Where we'll treat A as running but donate its bandwidth to C, the mutex owner.

> With PE:
>
> A , B , C , D
> running, running, running, running
> 2/5 , 1/5 , 1/5 , 1/5
>
> Where A will donate its execution context to C, the mutex owner.

Yes. It would also be great if Peter could participate in this thread,
if he has time. Not to nitpick, but to be more precise in PE
terminology, you mean "scheduler context". The "execution context" is
not inherited [1]:

If p1 is selected to run while still blocked, the lock owner p2 can
run "on its behalf", inheriting p1's scheduler context. Execution
context is not inherited, meaning that e.g. the CPUs where p2 can run
are still determined by its own affinity and not p1's.

[1] https://lore.kernel.org/all/[email protected]/t/#mdf0146cdf78e48fc5cc515c1a34cdc1d596e0ed8

> In both cases we should end up with the same distribution as if neither A nor
> C ever go to sleep because of holding the mutex.

Hopefully!

> I still can't see how B and D fairness will be impacted as the solution to the
> problem is to never treat a waiter as sleeping and let the owner run for more,
> but only within the limit of what the waiter is allowed to run for. AFAICS,
> both solutions maintain this relationship.

True!

- Joel

2022-10-12 15:15:32

by Qais Yousef

[permalink] [raw]
Subject: Re: Sum of weights idea for CFS PI

On 10/10/22 11:11, Joel Fernandes wrote:
> On Mon, Oct 10, 2022 at 10:46 AM Qais Yousef <[email protected]> wrote:
> >
> > On 10/08/22 11:04, Joel Fernandes wrote:
> > >
> > >
> > > > On Oct 6, 2022, at 3:40 PM, Youssef Esmat <[email protected]> wrote:
> > > >
> > > [..]
> > > >>
> > > >>> Anyway - just trying to explain how I see it and why C is unlikely to be
> > > >>> taking too much time. I could be wrong. As Youssef said, I think there's
> > > >>> no fundamental problem here.
> > > >>
> > > >> I know on Android where they use smaller HZ, the large tick causes lots of
> > > >> problems for large nice deltas. Example if a highly niced task was to be
> > > >> preempted for 1ms, and preempts instead at 3ms, then the less-niced task
> > > >> will not be so nice (even less nice than it promised to be) any more
> > > >> because of the 2ms boost that the higher niced task got. This can lead the
> > > >> the sched_latency thrown out of the window. Not adjusting the weights
> > > >> properly can potentially make that problem much worse IMO.
> > > >
> > > > Once C releases the lock it should get adjusted and A will get adjusted
> > > > also regardless of tick. At the point we adjust the weights we have
> > > > a chance to check for preemption and cause a reschedule.
> > >
> > > Yes but the lock can be held for potentially long time (and even user space
> > > lock). I’m more comfortable with Peter’s PE patch which seems a more generic
> > > solution, than sum of weights if we can get it working. I’m studying Connor’s
> > > patch set now…
> >
> > The 2 solutions are equivalent AFAICT.
>
> Possibly. Maybe I am talking about a non-issue then, but I had to be
> careful ;-) Maybe both have the issue I was referring to, or they
> don't. But in any case, PE seems more organic.

Careful is good! I failed to see the problem, but that doesn't mean it doesn't
exist :-)

>
> > With summation:
> >
> > A , B , C , D
> > sleeping, running, running, running
> > - , 1/5 , 3/5 , 1/5
> >
> > Where we'll treat A as running but donate its bandwidth to C, the mutex owner.
>
> > With PE:
> >
> > A , B , C , D
> > running, running, running, running
> > 2/5 , 1/5 , 1/5 , 1/5
> >
> > Where A will donate its execution context to C, the mutex owner.
>
> Yes. It would also be great if Peter can participate in this thread,
> if he has time. Not to nitpick but to be more precise in PE
> terminology, you mean "scheduler context". The "execution context" is
> not inherited [1]
>
> If p1 is selected to run while still blocked, the lock owner p2 can
> run "on its behalf", inheriting p1's scheduler context. Execution
> context is not inherited, meaning that e.g. the CPUs where p2 can run
> are still determined by its own affinity and not p1's.

Yep sorry got the terminology mixed up :-)


Cheers

--
Qais Yousef

>
> [1] https://lore.kernel.org/all/[email protected]/t/#mdf0146cdf78e48fc5cc515c1a34cdc1d596e0ed8
>
> > In both cases we should end up with the same distribution as if neither A nor
> > C ever go to sleep because of holding the mutex.
>
> Hopefully!
>
> > I still can't see how B and D fairness will be impacted as the solution to the
> > problem is to never treat a waiter as sleeping and let the owner run for more,
> > but only within the limit of what the waiter is allowed to run for. AFAICS,
> > both solutions maintain this relationship.
>
> True!
>
> - Joel