2016-03-09 09:23:07

by Pavankumar Kondeti

[permalink] [raw]
Subject: Migrated CFS task getting an unfair advantage

Hi

When a CFS task is enqueued during migration (load balance or change in
affinity), its vruntime is normalized before updating the current and
cfs_rq->min_vruntime. If the current entity is a low-priority task or
belongs to a cgroup with a lower cpu.shares and it is the only entity
queued, there is a possibility of a big update to cfs_rq->min_vruntime.
As the migrated task is normalized before this update, it gets an unfair
advantage over tasks queued after this point. If the migrated task is
a CPU hog, the other CFS tasks queued on this CPU get starved.

This problem can be simulated by running the below workload:

- create NR_CPU low-prio CFS tasks, one affined to each CPU, and put them
under a cgroup with cpu.shares = 52 (1024/20). These tasks run forever.
- create another CFS task which runs forever and periodically hops across
all CPUs by changing its affinity.
- We could see the destination CPU's cfs_rq->min_vruntime fall behind
the migrated task by a few hundred msec after migration.

If we add the migrated task to the destination CPU cfs_rq's rb tree before
updating the current in enqueue_entity(), cfs_rq->min_vruntime does not go
beyond the newly migrated task. Is this an acceptable solution?

Thanks,
Pavan


--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project


2016-03-09 12:04:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Migrated CFS task getting an unfair advantage

On Wed, Mar 09, 2016 at 02:52:57PM +0530, Pavan Kondeti wrote:

> When a CFS task is enqueued during migration (load balance or change in
> affinity), its vruntime is normalized before updating the current and
> cfs_rq->min_vruntime.

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
/*
* Update the normalized vruntime before updating min_vruntime
* through calling update_curr().
*/
if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
se->vruntime += cfs_rq->min_vruntime;

update_curr(cfs_rq);

This, right? Some idiot wrote a comment but forgot to explain why.

> If the current entity is a low priority task or belongs to a cgroup
> that has lower cpu.shares and it is the only entity queued, there is a
> possibility of big update to the cfs_rq->min_vruntime.

> As the migrated task is normalized before this update, it gets an
> unfair advantage over tasks queued after this point. If the migrated
> task is a CPU hogger, the other CFS tasks queued on this CPU gets
> starved.

Because it takes a whole while for the newly placed task to gain on the
previous task, right?

> If we add the migrated task to destination CPU cfs_rq's rb tree before
> updating the current in enqueue_entity(), the cfs_rq->min_vruntime
> does not go beyond the newly migrated task. Is this an acceptable
> solution?

Hurm.. so I'm not sure how that would solve anything. The existing task
would still be shot far into the future.

What you want is to normalize after update_curr()... but we cannot do
that in the case cfs_rq->curr == se (which I suppose is what that
comment is on about).

Does something like the below work?

---
kernel/sched/fair.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33130529e9b5..3c114d971d84 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3157,17 +3157,25 @@ static inline void check_schedstat_required(void)
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
+ bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING);
+ bool curr = cfs_rq->curr == se;
+
/*
- * Update the normalized vruntime before updating min_vruntime
- * through calling update_curr().
+ * If we're the current task, we must renormalise before calling
+ * update_curr().
*/
- if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
+ if (renorm && curr)
se->vruntime += cfs_rq->min_vruntime;

+ update_curr(cfs_rq);
+
/*
- * Update run-time statistics of the 'current'.
+ * Otherwise, renormalise after, such that we're placed at the current
+ * moment in time, instead of some random moment in the past.
*/
- update_curr(cfs_rq);
+ if (renorm && !curr)
+ se->vruntime += cfs_rq->min_vruntime;
+
enqueue_entity_load_avg(cfs_rq, se);
account_entity_enqueue(cfs_rq, se);
update_cfs_shares(cfs_rq);
@@ -3183,7 +3191,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_stats_enqueue(cfs_rq, se);
check_spread(cfs_rq, se);
}
- if (se != cfs_rq->curr)
+ if (!curr)
__enqueue_entity(cfs_rq, se);
se->on_rq = 1;


2016-03-09 13:06:41

by pavankumar kondeti

[permalink] [raw]
Subject: Re: Migrated CFS task getting an unfair advantage

Hi Peter,

On Wed, Mar 9, 2016 at 5:34 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Mar 09, 2016 at 02:52:57PM +0530, Pavan Kondeti wrote:
>
>> When a CFS task is enqueued during migration (load balance or change in
>> affinity), its vruntime is normalized before updating the current and
>> cfs_rq->min_vruntime.
>
> static void
> enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> /*
> * Update the normalized vruntime before updating min_vruntime
> * through calling update_curr().
> */
> if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
> se->vruntime += cfs_rq->min_vruntime;
>
> update_curr(cfs_rq);
>
> This, right? Some idiot wrote a comment but forgot to explain why.
>
>> If the current entity is a low priority task or belongs to a cgroup
>> that has lower cpu.shares and it is the only entity queued, there is a
>> possibility of big update to the cfs_rq->min_vruntime.
>
>> As the migrated task is normalized before this update, it gets an
>> unfair advantage over tasks queued after this point. If the migrated
>> task is a CPU hogger, the other CFS tasks queued on this CPU gets
>> starved.
>
> Because it takes a whole while for the newly placed task to gain on the
> previous task, right?
>

Yes. The newly woken-up task's vruntime is adjusted wrt cfs_rq->min_vruntime,
which can potentially be hundreds of msec beyond the migrated task.

>> If we add the migrated task to destination CPU cfs_rq's rb tree before
>> updating the current in enqueue_entity(), the cfs_rq->min_vruntime
>> does not go beyond the newly migrated task. Is this an acceptable
>> solution?
>
> Hurm.. so I'm not sure how that would solve anything. The existing task
> would still be shot far into the future.
>
In my testing, the problem is gone with this approach.

The update_min_vruntime() called from update_curr() has a check to make
sure that cfs_rq->min_vruntime does not go beyond the leftmost entity's
vruntime (in this case, the migrated task's). So we don't see the
migrated task getting any advantage.

> What you want is to normalize after update_curr()... but we cannot do
> that in the case cfs_rq->curr == se (which I suppose is what that
> comment is on about).
>
> Does something like the below work?
>

Thanks for providing this patch. It solved the problem.

> ---
> kernel/sched/fair.c | 20 ++++++++++++++------
> 1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 33130529e9b5..3c114d971d84 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3157,17 +3157,25 @@ static inline void check_schedstat_required(void)
> static void
> enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> + bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING);
> + bool curr = cfs_rq->curr == se;
> +
> /*
> - * Update the normalized vruntime before updating min_vruntime
> - * through calling update_curr().
> + * If we're the current task, we must renormalise before calling
> + * update_curr().
> */
> - if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
> + if (renorm && curr)
> se->vruntime += cfs_rq->min_vruntime;
>
> + update_curr(cfs_rq);
> +
> /*
> - * Update run-time statistics of the 'current'.
> + * Otherwise, renormalise after, such that we're placed at the current
> + * moment in time, instead of some random moment in the past.
> */
> - update_curr(cfs_rq);
> + if (renorm && !curr)
> + se->vruntime += cfs_rq->min_vruntime;
> +
> enqueue_entity_load_avg(cfs_rq, se);
> account_entity_enqueue(cfs_rq, se);
> update_cfs_shares(cfs_rq);
> @@ -3183,7 +3191,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> update_stats_enqueue(cfs_rq, se);
> check_spread(cfs_rq, se);
> }
> - if (se != cfs_rq->curr)
> + if (!curr)
> __enqueue_entity(cfs_rq, se);
> se->on_rq = 1;
>

2016-03-09 19:00:47

by Andrew Hunter

[permalink] [raw]
Subject: Re: Migrated CFS task getting an unfair advantage

At Google, we essentially reverted 88ec22d and the subsequent tweaks
to it, keeping vruntime always absolute, and instead using
task_move_group_fair() to change the basis of relative min_vruntime
between CPUs. We found this made it a lot easier to reason about and
work with cross-CPU computations. I could post the patch if it would
be of interest...

On Wed, Mar 9, 2016 at 5:06 AM, pavankumar kondeti
<[email protected]> wrote:
> Hi Peter,
>
> On Wed, Mar 9, 2016 at 5:34 PM, Peter Zijlstra <[email protected]> wrote:
>> On Wed, Mar 09, 2016 at 02:52:57PM +0530, Pavan Kondeti wrote:
>>
>>> When a CFS task is enqueued during migration (load balance or change in
>>> affinity), its vruntime is normalized before updating the current and
>>> cfs_rq->min_vruntime.
>>
>> static void
>> enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> {
>> /*
>> * Update the normalized vruntime before updating min_vruntime
>> * through calling update_curr().
>> */
>> if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
>> se->vruntime += cfs_rq->min_vruntime;
>>
>> update_curr(cfs_rq);
>>
>> This, right? Some idiot wrote a comment but forgot to explain why.
>>
>>> If the current entity is a low priority task or belongs to a cgroup
>>> that has lower cpu.shares and it is the only entity queued, there is a
>>> possibility of big update to the cfs_rq->min_vruntime.
>>
>>> As the migrated task is normalized before this update, it gets an
>>> unfair advantage over tasks queued after this point. If the migrated
>>> task is a CPU hogger, the other CFS tasks queued on this CPU gets
>>> starved.
>>
>> Because it takes a whole while for the newly placed task to gain on the
>> previous task, right?
>>
>
> Yes. The newly woken up task vruntime is adjusted wrt the cfs_rq->min_vruntime.
> The cfs_rq->min_vruntime can potentially be hundreds of msec beyond the
> migrated task.
>
>>> If we add the migrated task to destination CPU cfs_rq's rb tree before
>>> updating the current in enqueue_entity(), the cfs_rq->min_vruntime
>>> does not go beyond the newly migrated task. Is this an acceptable
>>> solution?
>>
>> Hurm.. so I'm not sure how that would solve anything. The existing task
>> would still be shot far into the future.
>>
> In my testing, the problem is gone with this approach.
>
> The update_min_vruntime() called from update_curr() has a check to make
> sure that cfs_rq->min_vruntime does not go beyond the leftmost entity
> (in this case it would be the migrated task) vruntime. so we don't see
> the migrated task getting any advantage.
>
>> What you want is to normalize after update_curr()... but we cannot do
>> that in the case cfs_rq->curr == se (which I suppose is what that
>> comment is on about).
>>
>> Does something like the below work?
>>
>
> Thanks for providing this patch. It solved the problem.
>
>> ---
>> kernel/sched/fair.c | 20 ++++++++++++++------
>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 33130529e9b5..3c114d971d84 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3157,17 +3157,25 @@ static inline void check_schedstat_required(void)
>> static void
>> enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> {
>> + bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING);
>> + bool curr = cfs_rq->curr == se;
>> +
>> /*
>> - * Update the normalized vruntime before updating min_vruntime
>> - * through calling update_curr().
>> + * If we're the current task, we must renormalise before calling
>> + * update_curr().
>> */
>> - if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
>> + if (renorm && curr)
>> se->vruntime += cfs_rq->min_vruntime;
>>
>> + update_curr(cfs_rq);
>> +
>> /*
>> - * Update run-time statistics of the 'current'.
>> + * Otherwise, renormalise after, such that we're placed at the current
>> + * moment in time, instead of some random moment in the past.
>> */
>> - update_curr(cfs_rq);
>> + if (renorm && !curr)
>> + se->vruntime += cfs_rq->min_vruntime;
>> +
>> enqueue_entity_load_avg(cfs_rq, se);
>> account_entity_enqueue(cfs_rq, se);
>> update_cfs_shares(cfs_rq);
>> @@ -3183,7 +3191,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> update_stats_enqueue(cfs_rq, se);
>> check_spread(cfs_rq, se);
>> }
>> - if (se != cfs_rq->curr)
>> + if (!curr)
>> __enqueue_entity(cfs_rq, se);
>> se->on_rq = 1;
>>

2016-03-09 19:21:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Migrated CFS task getting an unfair advantage

On Wed, Mar 09, 2016 at 11:00:42AM -0800, Andrew Hunter wrote:
> At Google, we essentially reverted 88ec22d and the subsequent tweaks
> to it, keeping vruntime absolute always , instead using
> task_move_group_fair to change the basis of relative min_vruntime
> between cpus. We found this made it a lot easier to reason about and
> work with cross-CPU computations. I could post the patch if it would
> be of interest...

Yes please, Paul said he would many times.

2016-03-09 23:48:36

by Byungchul Park

[permalink] [raw]
Subject: Re: Migrated CFS task getting an unfair advantage

On Wed, Mar 09, 2016 at 11:00:42AM -0800, Andrew Hunter wrote:
> At Google, we essentially reverted 88ec22d and the subsequent tweaks
> to it, keeping vruntime absolute always , instead using
> task_move_group_fair to change the basis of relative min_vruntime

Hello, Andrew

I am curious about how it can be done with absolute values. :-)

> between cpus. We found this made it a lot easier to reason about and
> work with cross-CPU computations. I could post the patch if it would
> be of interest...

I am interested in it.

Thanks,
Byungchul


Subject: [tip:sched/urgent] sched/fair: Fix fairness issue on migration

Commit-ID: 3a47d5124a957358274e9ca7b115b2f3a914f56d
Gitweb: http://git.kernel.org/tip/3a47d5124a957358274e9ca7b115b2f3a914f56d
Author: Peter Zijlstra <[email protected]>
AuthorDate: Wed, 9 Mar 2016 13:04:03 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 21 Mar 2016 10:49:23 +0100

sched/fair: Fix fairness issue on migration

Pavan reported that in the presence of very light tasks (or cgroups)
the placement of migrated tasks can cause severe fairness issues.

The problem is that enqueue_entity() places the task before it updates
time, thereby it can place the task far in the past (remember that
light tasks will shoot virtual time forward at a high speed, so in
relation to the pre-existing light task, we can land far in the past).

This is done because update_curr() needs the current task, and we
might be placing the current task.

The obvious solution is to differentiate between the current and any
other task; placing the current before we update time, and placing any
other task after, such that !curr tasks end up at the current moment
in time, and not in the past.

Reported-by: Pavan Kondeti <[email protected]>
Tested-by: Pavan Kondeti <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matt Fleming <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3313052..3c114d9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3157,17 +3157,25 @@ static inline void check_schedstat_required(void)
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
+ bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING);
+ bool curr = cfs_rq->curr == se;
+
/*
- * Update the normalized vruntime before updating min_vruntime
- * through calling update_curr().
+ * If we're the current task, we must renormalise before calling
+ * update_curr().
*/
- if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
+ if (renorm && curr)
se->vruntime += cfs_rq->min_vruntime;

+ update_curr(cfs_rq);
+
/*
- * Update run-time statistics of the 'current'.
+ * Otherwise, renormalise after, such that we're placed at the current
+ * moment in time, instead of some random moment in the past.
*/
- update_curr(cfs_rq);
+ if (renorm && !curr)
+ se->vruntime += cfs_rq->min_vruntime;
+
enqueue_entity_load_avg(cfs_rq, se);
account_entity_enqueue(cfs_rq, se);
update_cfs_shares(cfs_rq);
@@ -3183,7 +3191,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_stats_enqueue(cfs_rq, se);
check_spread(cfs_rq, se);
}
- if (se != cfs_rq->curr)
+ if (!curr)
__enqueue_entity(cfs_rq, se);
se->on_rq = 1;