LinuxLists.cc - Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight

2024-02-29 09:33:39

Subject: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight

Hi Abel:

I'm not so familiar with eevdf and still learning. Here I've some questions
about this patch.

1. You did proof that V will not change during reweight (COROLLARY #2). However,
according to the original paper, the new V will be:
V' = V + lag(j)/(W - w_j) - lag(j)/(W - w_j + w'_j)
So the V' in paper will change (when lag is not zero).
Is the V in Linux code slightly different from the paper?

2. I print some log around reweight_entity(), mainly want to print V by calling
avg_vruntime(cfs_rq). I found your algorithm only keeps the V unchanged during
reweight_eevdf(), but not reweight_entity().

In detail:
If curr is true (i.e., cfs_rq->curr == se), we will directly run
reweight_eevdf(), and the V is not changed.
If curr is false, we will have __dequeue_entity() -> reweight_eevdf() ->
__enqueue_entity(). The V is finally changed due to dequeue and enqueue. So the
result of reweight_entity() will be impacted by "cfs_rq->curr == se", is this
expected?

Thanks!

2024-02-29 14:32:07

by Abel Wu

[permalink] [raw]

Subject: Re: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight

Hi Tianchen,

On 2/29/24 5:24 PM, Tianchen Ding Wrote:
> Hi Abel:
>
> I'm not so familiar with eevdf and still learning. Here I've some questions about this patch.
>
> 1. You did proof that V will not change during reweight (COROLLARY #2). However, according to the original paper, the new V will be:
> V' = V + lag(j)/(W - w_j) - lag(j)/(W - w_j + w'_j)
> So the V' in paper will change (when lag is not zero).
> Is the V in Linux code slightly different from the paper?

Good catch. And to the best of my knowledge, the answer is YES. The
above Equation in the paper, which is Eq. (20), is based on the
assumption that:

"once client 3 leaves, the remaining two clients will
proportionally support the eventual loss or gain in the
service time" -- Page 10

"by updating the virtual time according to Eq. (18,19) we
ensure that the sum over the lags of all active clients
is always zero" -- Page 11

But in Peter's implementation, it is the competitors in the new group
that client 3 later joins in who actually support the effect. So when
client 3 leaves competition with !0-lag in Linux, the rq's sum(lag_i)
is no longer zero.

>
> 2. I print some log around reweight_entity(), mainly want to print V by calling avg_vruntime(cfs_rq). I found your algorithm only keeps the V unchanged during reweight_eevdf(), but not reweight_entity().
>
> In detail:
> If curr is true (i.e., cfs_rq->curr == se), we will directly run reweight_eevdf(), and the V is not changed.
> If curr is false, we will have __dequeue_entity() -> reweight_eevdf() -> __enqueue_entity(). The V is finally changed due to dequeue and enqueue. So the result of reweight_entity() will be impacted by "cfs_rq->curr == se", is this expected?

Good catch again! It smells like a bug. Since this @se is still on_rq,
it should be taken into consideration when calculating avg_runtime(),
but in fact it isn't because __dequeue_entity() will remove its share.

And I seem to spot another bug, although not relate to this problem,
that we actually need to call update_curr() unconditionally if curr is
available, because we need to commit curr's outstanding runtime to
ensure the result of avg_runtime() is up to date.

Thanks & BR,
Abel

2024-03-01 06:42:25

by Tianchen Ding

[permalink] [raw]

Subject: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight

On 2024/2/29 22:25, Abel Wu wrote:
> Good catch. And to the best of my knowledge, the answer is YES. The
> above Equation in the paper, which is Eq. (20), is based on the
> assumption that:
>
>     "once client 3 leaves, the remaining two clients will
>      proportionally support the eventual loss or gain in the
>      service time" -- Page 10
>
>     "by updating the virtual time according to Eq. (18,19) we
>      ensure that the sum over the lags of all active clients
>      is always zero" -- Page 11
>
> But in Peter's implementation, it is the competitors in the new group
> that client 3 later joins in who actually support the effect. So when
> client 3 leaves competition with !0-lag in Linux, the rq's sum(lag_i)
> is no longer zero.
>

I've different opinions. According to the comments above avg_vruntime_add(), V
is calculated exactly to satisfy sum(lag_i)=0. This is guaranteed by math.

Actually I print some logs in enqueue_entity() and dequeue_entity() to verify this:

[ 293.261236] before dequeue: V=2525278131 W=3072 v=2526243139 w=1024 lag_sum=0
[ 293.261237] after dequeue: V=2524795627 W=2048 v=2526243139 w=1024 lag_sum=0
[ 293.262286] before enqueue: V=2525319064 W=2048 v=2526766576 w=1024 lag_sum=0
[ 293.262287] after enqueue: V=2525801568 W=3072 v=2526766576 w=1024 lag_sum=0

For the first 2 lines, we have 2524795627 = 2525278131 + (2525278131 - 2526243139) * 1024 / 2048.
Which is Eq. (18)

For the last 2 lines, we have 2525801568 = 2525319064 - (2525319064 - 2526766576) * 1024 / 3072.
Which is Eq. (19)

So whatever client 3 leave or join competition with !0-lag in Linux, V is handled properly.

> Good catch again! It smells like a bug. Since this @se is still on_rq,
> it should be taken into consideration when calculating avg_runtime(),
> but in fact it isn't because __dequeue_entity() will remove its share.
>
> And I seem to spot another bug, although not relate to this problem,
> that we actually need to call update_curr() unconditionally if curr is
> available, because we need to commit curr's outstanding runtime to
> ensure the result of avg_runtime() is up to date.
>

I've tried to record avg_vruntime before __dequeue_entity() and pass it to
reweight_eevdf(). Then the issue is fixed. The V keeps the same during the whole
reweight_entity().

I could send these two bugfix patches (one for this bug and one you sugguested
about update_curr). But before doing so, I still want to dig out the answer of
my first question.

Hi Peter, would you please provide any information?

Thanks.

My rough logging code:
(Note: lag_sum may output a minus value, with its absolute value less than W.
This is ok because my lag_sum calculate is not so accurate due to the sign flips
in avg_vruntime())

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..9306c1bbd472 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5260,6 +5260,62 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);

static inline bool cfs_bandwidth_used(void);

+static int rbtree_all(const void *key, const struct rb_node *node)
+{
+ return 0;
+}
+
+static s64 get_lag_sum(struct cfs_rq *cfs_rq)
+{
+ u64 avruntime = avg_vruntime(cfs_rq);
+ struct sched_entity *curr = cfs_rq->curr;
+ struct rb_node *node;
+ s64 lag_sum = 0;
+
+ rb_for_each(node, 0, &cfs_rq->tasks_timeline.rb_root, rbtree_all) {
+ struct sched_entity *se = __node_2_se(node);
+
+ if (se->on_rq)
+ lag_sum += (avruntime - se->vruntime) * scale_load_down(se->load.weight);
+ }
+
+ if (curr && curr->on_rq) {
+ lag_sum += (avruntime - curr->vruntime) * scale_load_down(curr->load.weight);
+ }
+
+ return lag_sum;
+}
+
+static void print_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se, bool before, bool enqueue)
+{
+ if (cpu_of(rq_of(cfs_rq)))
+ return; // avoid too many printings.
+
+ long load = cfs_rq->avg_load;
+ struct sched_entity *curr = cfs_rq->curr;
+ unsigned long weight = scale_load_down(se->load.weight);
+ s64 lag_sum = get_lag_sum(cfs_rq);
+ u64 avruntime = avg_vruntime(cfs_rq);
+
+ if (curr && curr->on_rq) {
+ unsigned long curr_weight = scale_load_down(curr->load.weight);
+ load += curr_weight;
+ }
+
+ if (before) {
+ if (enqueue)
+ printk("before enqueue: V=%llu W=%ld v=%llu w=%lu lag_sum=%lld\n", avruntime, load, se->vruntime, weight, lag_sum);
+ else
+ printk("before dequeue: V=%llu W=%ld v=%llu w=%lu lag_sum=%lld\n", avruntime, load, se->vruntime, weight, lag_sum);
+ }
+ else {
+ if (enqueue)
+ printk("after enqueue: V=%llu W=%ld v=%llu w=%lu lag_sum=%lld\n", avruntime, load, se->vruntime, weight, lag_sum);
+ else
+ printk("after dequeue: V=%llu W=%ld v=%llu w=%lu lag_sum=%lld\n", avruntime, load, se->vruntime, weight, lag_sum);
+ }
+}
+
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
@@ -5307,9 +5363,11 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

check_schedstat_required();
update_stats_enqueue_fair(cfs_rq, se, flags);
+ print_eevdf(cfs_rq, se, true, true);
if (!curr)
__enqueue_entity(cfs_rq, se);
se->on_rq = 1;
+ print_eevdf(cfs_rq, se, false, true);

if (cfs_rq->nr_running == 1) {
check_enqueue_throttle(cfs_rq);
@@ -5347,6 +5405,7 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)

static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);

+
static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
@@ -5377,9 +5436,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
clear_buddies(cfs_rq, se);

update_entity_lag(cfs_rq, se);
+ print_eevdf(cfs_rq, se, true, false);
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
se->on_rq = 0;
+ print_eevdf(cfs_rq, se, false, false);
account_entity_dequeue(cfs_rq, se);

/* return excess runtime on last dequeue */

2024-03-01 08:30:30

by Abel Wu

[permalink] [raw]

Subject: Re: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight

On 3/1/24 2:41 PM, Tianchen Ding Wrote:
> On 2024/2/29 22:25, Abel Wu wrote:
>> Good catch. And to the best of my knowledge, the answer is YES. The
>> above Equation in the paper, which is Eq. (20), is based on the
>> assumption that:
>>
>>      "once client 3 leaves, the remaining two clients will
>>       proportionally support the eventual loss or gain in the
>>       service time" -- Page 10
>>
>>      "by updating the virtual time according to Eq. (18,19) we
>>       ensure that the sum over the lags of all active clients
>>       is always zero" -- Page 11
>>
>> But in Peter's implementation, it is the competitors in the new group
>> that client 3 later joins in who actually support the effect. So when
>> client 3 leaves competition with !0-lag in Linux, the rq's sum(lag_i)
>> is no longer zero.
>>
>
> I've different opinions. According to the comments above avg_vruntime_add(), V
> is calculated exactly to satisfy sum(lag_i)=0. This is guaranteed by math.

Yes, you are right. I mixed another fairness issue with this. What I
was thinking is that considering multiple competition groups (e.g.
runqueues), the latency bound could be violated, that is someone could
starve a bit. Say one entity even with positive lag could become less
competitive if migrated to a higher competitive group.

Staring at Eq. (20) again, what if we do a fake reweight? I mean let
the client leave and rejoin at the same time without changing weight?
IMHO it should have no effects, but according to Eq. (20) the V will
change to:

V' = V + lag(j)/(W - w_j) - lag(j)/W != V

Have I missed anything?

>
> Actually I print some logs in enqueue_entity() and dequeue_entity() to verify this:
>
> [ 293.261236] before dequeue: V=2525278131 W=3072 v=2526243139 w=1024 lag_sum=0
> [ 293.261237] after dequeue: V=2524795627 W=2048 v=2526243139 w=1024 lag_sum=0
> [ 293.262286] before enqueue: V=2525319064 W=2048 v=2526766576 w=1024 lag_sum=0
> [ 293.262287] after enqueue: V=2525801568 W=3072 v=2526766576 w=1024 lag_sum=0
>
> For the first 2 lines, we have 2524795627 = 2525278131 + (2525278131 - 2526243139) * 1024 / 2048.
> Which is Eq. (18)
>
> For the last 2 lines, we have 2525801568 = 2525319064 - (2525319064 - 2526766576) * 1024 / 3072.
> Which is Eq. (19)
>
> So whatever client 3 leave or join competition with !0-lag in Linux, V is handled properly.
>
>> Good catch again! It smells like a bug. Since this @se is still on_rq,
>> it should be taken into consideration when calculating avg_runtime(),
>> but in fact it isn't because __dequeue_entity() will remove its share.
>>
>> And I seem to spot another bug, although not relate to this problem,
>> that we actually need to call update_curr() unconditionally if curr is
>> available, because we need to commit curr's outstanding runtime to
>> ensure the result of avg_runtime() is up to date.
>>
>
> I've tried to record avg_vruntime before __dequeue_entity() and pass it to
> reweight_eevdf(). Then the issue is fixed. The V keeps the same during the whole
> reweight_entity().
>
> I could send these two bugfix patches (one for this bug and one you sugguested

That would be appreciated!

Thanks,
Abel

2024-03-01 10:10:40

by Tianchen Ding

[permalink] [raw]

Subject: Re: [PATCH 1/4] sched/eevdf: Fix vruntime adjustment on reweight

On 2024/3/1 16:30, Abel Wu wrote:
> On 3/1/24 2:41 PM, Tianchen Ding Wrote:
>> On 2024/2/29 22:25, Abel Wu wrote:
>>> Good catch. And to the best of my knowledge, the answer is YES. The
>>> above Equation in the paper, which is Eq. (20), is based on the
>>> assumption that:
>>>
>>>      "once client 3 leaves, the remaining two clients will
>>>       proportionally support the eventual loss or gain in the
>>>       service time" -- Page 10
>>>
>>>      "by updating the virtual time according to Eq. (18,19) we
>>>       ensure that the sum over the lags of all active clients
>>>       is always zero" -- Page 11
>>>
>>> But in Peter's implementation, it is the competitors in the new group
>>> that client 3 later joins in who actually support the effect. So when
>>> client 3 leaves competition with !0-lag in Linux, the rq's sum(lag_i)
>>> is no longer zero.
>>>
>>
>> I've different opinions. According to the comments above avg_vruntime_add(), V
>> is calculated exactly to satisfy sum(lag_i)=0. This is guaranteed by math.
>
> Yes, you are right. I mixed another fairness issue with this. What I
> was thinking is that considering multiple competition groups (e.g.
> runqueues), the latency bound could be violated, that is someone could
> starve a bit. Say one entity even with positive lag could become less
> competitive if migrated to a higher competitive group.
>
> Staring at Eq. (20) again, what if we do a fake reweight? I mean let
> the client leave and rejoin at the same time without changing weight?
> IMHO it should have no effects, but according to Eq. (20) the V will
> change to:
>
>     V' = V + lag(j)/(W - w_j) - lag(j)/W != V
>
> Have I missed anything?
>

Good point! I've not ever noticed this conflict.

I tried to modify reweight_entity() to run dequeue_entity() -> adjust se->vlag ->
enqueue_entity(). And I found V do not changed.

The difference is, when doing enqueue_entity(), Peter enlarges the lag in place_entity().
Because after enqueue, the lag will evaporate.
In order to keep the same lag after enqueue, during place_entity(),
the new lag(t) will be enlarged with (W+w_i)/W.

So the Eq. (20) should be:

V' = V + lag(j)/(W - w_j) - lag'(j)/(W - w_j + w'_j)

lag'(j) = lag(j) * (W - w_j + w'_j)/(W - w_j)

So we can get

V' = V + lag(j)/(W - w_j) - lag(j) * (W - w_j + w'_j)/(W - w_j)/(W - w_j + w'_j) = V

So COROLLARY #2 is correct.