LinuxLists.cc - Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

On Thu, 2012-08-16 at 19:45 +0600, Rakib Mullick wrote:
> When a CPU is about to go down, it moves all it's sleeping task to an active CPU, then nr_uninterruptible counts are
> also moved. When moving nr_uninterruptible count, currently it chooses a randomly picked CPU from the active CPU mask
> to keep the global nr_uninterruptible count intact. But, it would be precise to move nr_uninterruptible counts to the
> CPU where all the sleeping tasks were moved and it also might have subtle impact over rq's load calculation. So, this
> patch is prepared to address this issue.

The Changelog is ill-formated. Other than that, the patch doesn't appear
to actually do what it says. The sleeping tasks can be scattered to any
number of cpus as decided by select_fallback_rq().

Furthermore there should be absolutely no impact on load calculation
what so ever. nr_uninterruptible is only ever useful as a sum over all
cpus, this total sum doesn't change regardless of where you put the
value.

Worse, there's absolutely no relation to the tasks on the runqueue
(sleeping or otherwise) and nr_uninterruptible, so coupling these
actions makes no sense what so ever.

2012-08-16 14:29:01

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

On 8/16/12, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-08-16 at 19:45 +0600, Rakib Mullick wrote:
>> When a CPU is about to go down, it moves all it's sleeping task to an
>> active CPU, then nr_uninterruptible counts are
>> also moved. When moving nr_uninterruptible count, currently it chooses a
>> randomly picked CPU from the active CPU mask
>> to keep the global nr_uninterruptible count intact. But, it would be
>> precise to move nr_uninterruptible counts to the
>> CPU where all the sleeping tasks were moved and it also might have subtle
>> impact over rq's load calculation. So, this
>> patch is prepared to address this issue.
>
> The Changelog is ill-formated. Other than that, the patch doesn't appear
> to actually do what it says. The sleeping tasks can be scattered to any
> number of cpus as decided by select_fallback_rq().
>
I'm not sure which parts are missing from Changelog to patch. And this
patch assumes that, sleeping tasks won't be scattered. From
select_fallback_rq(), sleeping tasks might get scattered due to
various cases like. if CPU is down, task isn't allowed to move a
particular CPU. Other than that, dest cpu supposed to be the same.

> Furthermore there should be absolutely no impact on load calculation
> what so ever. nr_uninterruptible is only ever useful as a sum over all
> cpus, this total sum doesn't change regardless of where you put the
> value.
>
> Worse, there's absolutely no relation to the tasks on the runqueue
> (sleeping or otherwise) and nr_uninterruptible, so coupling these
> actions makes no sense what so ever.
>
nr_uninterruptible is coupled with tasks on the runqueue to calculate
nr_active numbers.
In calc_load_fold_active(), this nr_active numbers are used to
calculate delta. This is how I understand this part and seeing some
impact.

Thanks,
Rakib

2012-08-16 14:42:31

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

On Thu, 2012-08-16 at 20:28 +0600, Rakib Mullick wrote:

> I'm not sure which parts are missing from Changelog to patch. And this
> patch assumes that, sleeping tasks won't be scattered. From
> select_fallback_rq(), sleeping tasks might get scattered due to
> various cases like. if CPU is down, task isn't allowed to move a
> particular CPU. Other than that, dest cpu supposed to be the same.

Sure but affinities and cpusets can still scatter, and therefore your
logic doesn't hold up, but see below.

> > Furthermore there should be absolutely no impact on load calculation
> > what so ever. nr_uninterruptible is only ever useful as a sum over all
> > cpus, this total sum doesn't change regardless of where you put the
> > value.
> >
> > Worse, there's absolutely no relation to the tasks on the runqueue
> > (sleeping or otherwise) and nr_uninterruptible, so coupling these
> > actions makes no sense what so ever.
> >
> nr_uninterruptible is coupled with tasks on the runqueue to calculate
> nr_active numbers.

It is not.. nr_uninterruptible is incremented on the cpu the task goes
to sleep and decremented on the cpu doing the wakeup.

This means that nr_uninterruptible is a complete mess and any per-cpu
value isn't meaningful at all.

It is quite possible to always have the inc on cpu0 and the decrement on
cpu1, yielding results like:

{1000, -1000} for an effective nr_uninterruptible = 0. Taking either cpu
down will then migrate whatever delta it has to another cpu, but there
might only be a single task, yet the delta is +-1000.

> In calc_load_fold_active(), this nr_active numbers are used to
> calculate delta. This is how I understand this part and seeing some
> impact.

You understand wrong, please re-read the comment added in commit
5167e8d5.

2012-08-16 15:32:47

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

On 8/16/12, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-08-16 at 20:28 +0600, Rakib Mullick wrote:
>
>> nr_uninterruptible is coupled with tasks on the runqueue to calculate
>> nr_active numbers.
>
> It is not.. nr_uninterruptible is incremented on the cpu the task goes
> to sleep and decremented on the cpu doing the wakeup.
>
If nr_uninterruptible's life cycle is this simple then, while CPU
goes down, nr_uninterruptible count will be decremented when all the
tasks are moved to other CPUs and should be fine.

> This means that nr_uninterruptible is a complete mess and any per-cpu
> value isn't meaningful at all.
>
Well, if nr_uninterruptible is a mess, then this patch has no meaning.
And also I think migrate_nr_uninterruptible() is meaning less too.

> It is quite possible to always have the inc on cpu0 and the decrement on
> cpu1, yielding results like:
>
> {1000, -1000} for an effective nr_uninterruptible = 0. Taking either cpu
> down will then migrate whatever delta it has to another cpu, but there
> might only be a single task, yet the delta is +-1000.
>
>> In calc_load_fold_active(), this nr_active numbers are used to
>> calculate delta. This is how I understand this part and seeing some
>> impact.
>
> You understand wrong, please re-read the comment added in commit
> 5167e8d5.
>
Yes, reading.

Thanks,
Rakib.

2012-08-16 17:46:51

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

On Thu, 2012-08-16 at 21:32 +0600, Rakib Mullick wrote:
> And also I think migrate_nr_uninterruptible() is meaning less too.

Hmm, I think I see a problem.. we forget to migrate the effective delta
created by rq->calc_load_active.

2012-08-17 13:39:26

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

On 8/16/12, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-08-16 at 21:32 +0600, Rakib Mullick wrote:
>> And also I think migrate_nr_uninterruptible() is meaning less too.
>
> Hmm, I think I see a problem.. we forget to migrate the effective delta
> created by rq->calc_load_active.
>
And rq->calc_load_active needs to be migrated to the proper dest_rq
not like currently picking any random rq.

2012-08-20 09:27:12

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

On Fri, 2012-08-17 at 19:39 +0600, Rakib Mullick wrote:
> On 8/16/12, Peter Zijlstra <[email protected]> wrote:
> > On Thu, 2012-08-16 at 21:32 +0600, Rakib Mullick wrote:
> >> And also I think migrate_nr_uninterruptible() is meaning less too.
> >
> > Hmm, I think I see a problem.. we forget to migrate the effective delta
> > created by rq->calc_load_active.
> >
> And rq->calc_load_active needs to be migrated to the proper dest_rq
> not like currently picking any random rq.

OK, so how about something like the below, it would also solve Paul's
issue with that code.

Please do double check the logic, I've had all of 4 hours sleep and its
far too warm for a brain to operate in any case.

---
Subject: sched: Fix load avg vs cpu-hotplug

Rabik and Paul reported two different issues related to the same few
lines of code.

Rabik's issue is that the nr_uninterruptible migration code is wrong in
that he sees artifacts due to this (Rabik please do expand in more
detail).

Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.

The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops, however since nr_uninterruptible() is the
only such loop and its using possible lets not bother at all.

The problem Rabik sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
involved.

So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.

Reported-by: Rakib Mullick <[email protected]>
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 31 ++++++++++---------------------
1 file changed, 10 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4376c9f..06d23c6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5338,27 +5338,17 @@ void idle_task_exit(void)
}

/*
- * While a dead CPU has no uninterruptible tasks queued at this point,
- * it might still have a nonzero ->nr_uninterruptible counter, because
- * for performance reasons the counter is not stricly tracking tasks to
- * their home CPUs. So we just add the counter to another CPU's counter,
- * to keep the global sum constant after CPU-down:
- */
-static void migrate_nr_uninterruptible(struct rq *rq_src)
-{
- struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
-
- rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
- rq_src->nr_uninterruptible = 0;
-}
-
-/*
- * remove the tasks which were accounted by rq from calc_load_tasks.
+ * Since this CPU is going 'away' for a while, fold any nr_active delta
+ * we might have. Assumes we're called after migrate_tasks() so that the
+ * nr_active count is stable.
+ *
+ * Also see the comment "Global load-average calculations".
*/
-static void calc_global_load_remove(struct rq *rq)
+static void calc_load_migrate(struct rq *rq)
{
- atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
- rq->calc_load_active = 0;
+ long delta = calc_load_fold_active(rq);
+ if (delta)
+ atomic_long_add(delta, &calc_load_tasks);
}

/*
@@ -5652,8 +5642,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
BUG_ON(rq->nr_running != 1); /* the migration thread */
raw_spin_unlock_irqrestore(&rq->lock, flags);

- migrate_nr_uninterruptible(rq);
- calc_global_load_remove(rq);
+ calc_load_migrate(rq);
break;
#endif
}

2012-08-20 16:10:21

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

On 8/20/12, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2012-08-17 at 19:39 +0600, Rakib Mullick wrote:
>> On 8/16/12, Peter Zijlstra <[email protected]> wrote:
>> > On Thu, 2012-08-16 at 21:32 +0600, Rakib Mullick wrote:
>> >> And also I think migrate_nr_uninterruptible() is meaning less too.
>> >
>> > Hmm, I think I see a problem.. we forget to migrate the effective delta
>> > created by rq->calc_load_active.
>> >
>> And rq->calc_load_active needs to be migrated to the proper dest_rq
>> not like currently picking any random rq.
>
>
> OK, so how about something like the below, it would also solve Paul's
> issue with that code.
>
>
> Please do double check the logic, I've had all of 4 hours sleep and its
> far too warm for a brain to operate in any case.
>
> ---
> Subject: sched: Fix load avg vs cpu-hotplug
>
> Rabik and Paul reported two different issues related to the same few
> lines of code.
>
First of all, you've misspelled my name, it's Rakib not Rabik.

> Rabik's issue is that the nr_uninterruptible migration code is wrong in
> that he sees artifacts due to this (Rabik please do expand in more
> detail).
>
Okay, I was thinking about per rq->nr_uninterruptible accounting due
to it's use in delta calculation at calc_load_fold_active(). So, I
proposed to migrate nr_uninterruptible to the rq, where tasks were
migrated as if, they gets folded from update_cpu_load(). Now, note
that, delta is calculated using rq->nr_running and
rq->nr_uninterruptible, so if we migrate tasks into a rq, but migrate
nr_uninterruptible into another rq, its wrong and we're screwing the
delta calculation.

> Paul's issue is that this code as it stands relies on us using
> stop_machine() for unplug, we all would like to remove this assumption
> so that eventually we can remove this stop_machine() usage altogether.
>
> The only reason we'd have to migrate nr_uninterruptible is so that we
> could use for_each_online_cpu() loops in favour of
> for_each_possible_cpu() loops, however since nr_uninterruptible() is the
> only such loop and its using possible lets not bother at all.
>
> The problem Rabik sees is (probably) caused by the fact that by
> migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
> involved.
>
Certainly, we don't care about the rq which is going down. But, the dest_rq.

> So don't bother with fancy migration schemes (meaning we now have to
> keep using for_each_possible_cpu()) and instead fold any nr_active delta
> after we migrate all tasks away to make sure we don't have any skewed
> nr_active accounting.
>
>
> Reported-by: Rakib Mullick <[email protected]>
> Reported-by: Paul E. McKenney <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> kernel/sched/core.c | 31 ++++++++++---------------------
> 1 file changed, 10 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4376c9f..06d23c6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5338,27 +5338,17 @@ void idle_task_exit(void)
> }
>
> /*
> - * While a dead CPU has no uninterruptible tasks queued at this point,
> - * it might still have a nonzero ->nr_uninterruptible counter, because
> - * for performance reasons the counter is not stricly tracking tasks to
> - * their home CPUs. So we just add the counter to another CPU's counter,
> - * to keep the global sum constant after CPU-down:
> - */
> -static void migrate_nr_uninterruptible(struct rq *rq_src)
> -{
> - struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
> -
> - rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
> - rq_src->nr_uninterruptible = 0;
> -}
> -
> -/*
> - * remove the tasks which were accounted by rq from calc_load_tasks.
> + * Since this CPU is going 'away' for a while, fold any nr_active delta
> + * we might have. Assumes we're called after migrate_tasks() so that the
> + * nr_active count is stable.
> + *
But after migrate_tasks(), it's likely that rq->nr_running will be 1.
Then, nr_active will be screwed. No?

> + * Also see the comment "Global load-average calculations".
> */
> -static void calc_global_load_remove(struct rq *rq)
> +static void calc_load_migrate(struct rq *rq)
> {
> - atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
> - rq->calc_load_active = 0;
> + long delta = calc_load_fold_active(rq);
> + if (delta)
> + atomic_long_add(delta, &calc_load_tasks);
> }
>
> /*
> @@ -5652,8 +5642,7 @@ migration_call(struct notifier_block *nfb, unsigned
> long action, void *hcpu)
> BUG_ON(rq->nr_running != 1); /* the migration thread */
> raw_spin_unlock_irqrestore(&rq->lock, flags);
>
> - migrate_nr_uninterruptible(rq);
> - calc_global_load_remove(rq);
> + calc_load_migrate(rq);
> break;
> #endif
> }
>

Thanks,
Rakib

2012-08-20 16:16:55

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

On Mon, 2012-08-20 at 22:10 +0600, Rakib Mullick wrote:
> >
> First of all, you've misspelled my name, it's Rakib not Rabik.

Damn, sorry!, lysdexic that..

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 4376c9f..06d23c6 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5338,27 +5338,17 @@ void idle_task_exit(void)
> > }
> >
> > /*
> > - * While a dead CPU has no uninterruptible tasks queued at this point,
> > - * it might still have a nonzero ->nr_uninterruptible counter, because
> > - * for performance reasons the counter is not stricly tracking tasks to
> > - * their home CPUs. So we just add the counter to another CPU's counter,
> > - * to keep the global sum constant after CPU-down:
> > - */
> > -static void migrate_nr_uninterruptible(struct rq *rq_src)
> > -{
> > - struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
> > -
> > - rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
> > - rq_src->nr_uninterruptible = 0;
> > -}
> > -
> > -/*
> > - * remove the tasks which were accounted by rq from calc_load_tasks.
> > + * Since this CPU is going 'away' for a while, fold any nr_active delta
> > + * we might have. Assumes we're called after migrate_tasks() so that the
> > + * nr_active count is stable.
> > + *
> But after migrate_tasks(), it's likely that rq->nr_running will be 1.
> Then, nr_active will be screwed. No?

Gah indeed. let me try that again.

2012-08-20 16:28:17

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

On Mon, Aug 20, 2012 at 11:26:57AM +0200, Peter Zijlstra wrote:
> On Fri, 2012-08-17 at 19:39 +0600, Rakib Mullick wrote:
> > On 8/16/12, Peter Zijlstra <[email protected]> wrote:
> > > On Thu, 2012-08-16 at 21:32 +0600, Rakib Mullick wrote:
> > >> And also I think migrate_nr_uninterruptible() is meaning less too.
> > >
> > > Hmm, I think I see a problem.. we forget to migrate the effective delta
> > > created by rq->calc_load_active.
> > >
> > And rq->calc_load_active needs to be migrated to the proper dest_rq
> > not like currently picking any random rq.
>
>
> OK, so how about something like the below, it would also solve Paul's
> issue with that code.
>
>
> Please do double check the logic, I've had all of 4 hours sleep and its
> far too warm for a brain to operate in any case.
>
> ---
> Subject: sched: Fix load avg vs cpu-hotplug
>
> Rabik and Paul reported two different issues related to the same few
> lines of code.
>
> Rabik's issue is that the nr_uninterruptible migration code is wrong in
> that he sees artifacts due to this (Rabik please do expand in more
> detail).
>
> Paul's issue is that this code as it stands relies on us using
> stop_machine() for unplug, we all would like to remove this assumption
> so that eventually we can remove this stop_machine() usage altogether.
>
> The only reason we'd have to migrate nr_uninterruptible is so that we
> could use for_each_online_cpu() loops in favour of
> for_each_possible_cpu() loops, however since nr_uninterruptible() is the
> only such loop and its using possible lets not bother at all.
>
> The problem Rabik sees is (probably) caused by the fact that by
> migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
> involved.
>
> So don't bother with fancy migration schemes (meaning we now have to
> keep using for_each_possible_cpu()) and instead fold any nr_active delta
> after we migrate all tasks away to make sure we don't have any skewed
> nr_active accounting.
>
>
> Reported-by: Rakib Mullick <[email protected]>
> Reported-by: Paul E. McKenney <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> kernel/sched/core.c | 31 ++++++++++---------------------
> 1 file changed, 10 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4376c9f..06d23c6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5338,27 +5338,17 @@ void idle_task_exit(void)
> }
>
> /*
> - * While a dead CPU has no uninterruptible tasks queued at this point,
> - * it might still have a nonzero ->nr_uninterruptible counter, because
> - * for performance reasons the counter is not stricly tracking tasks to
> - * their home CPUs. So we just add the counter to another CPU's counter,
> - * to keep the global sum constant after CPU-down:
> - */
> -static void migrate_nr_uninterruptible(struct rq *rq_src)
> -{
> - struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
> -
> - rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
> - rq_src->nr_uninterruptible = 0;
> -}
> -
> -/*
> - * remove the tasks which were accounted by rq from calc_load_tasks.
> + * Since this CPU is going 'away' for a while, fold any nr_active delta
> + * we might have. Assumes we're called after migrate_tasks() so that the
> + * nr_active count is stable.
> + *
> + * Also see the comment "Global load-average calculations".
> */
> -static void calc_global_load_remove(struct rq *rq)
> +static void calc_load_migrate(struct rq *rq)
> {
> - atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
> - rq->calc_load_active = 0;
> + long delta = calc_load_fold_active(rq);
> + if (delta)
> + atomic_long_add(delta, &calc_load_tasks);
> }
>
> /*
> @@ -5652,8 +5642,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
> BUG_ON(rq->nr_running != 1); /* the migration thread */
> raw_spin_unlock_irqrestore(&rq->lock, flags);
>
> - migrate_nr_uninterruptible(rq);
> - calc_global_load_remove(rq);
> + calc_load_migrate(rq);

Not sure that it matters, but...

This is called from the CPU_DYING notifier, which runs with irqs
disabled, but in process context. As I understand it, this means that
->nr_running==1. If my understanding is correct (ha!), this means that
this change sets ->calc_load_active to one (rather than zero as in the
original) and that it subtracts one fewer from calc_load_tasks than did
the original. Of course, I have no idea whether this matters.

If I am correct and if it does matter, one straightforward fix
is to add a "CPU_DEAD" branch to the switch statement and move the
"calc_load_migrate(rq)" to that new branch. Given that "rq" references
the outgoing CPU, my guess is that locking is not needed, but you would
know better than I.

Thanx, Paul

> break;
> #endif
> }
>

2012-08-27 18:51:16

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

On Mon, Aug 20, 2012 at 09:26:57AM -0700, Paul E. McKenney wrote:
> On Mon, Aug 20, 2012 at 11:26:57AM +0200, Peter Zijlstra wrote:

[ . . . ]

> > OK, so how about something like the below, it would also solve Paul's
> > issue with that code.
> >
> >
> > Please do double check the logic, I've had all of 4 hours sleep and its
> > far too warm for a brain to operate in any case.
> >
> > ---
> > Subject: sched: Fix load avg vs cpu-hotplug
> >
> > Rabik and Paul reported two different issues related to the same few
> > lines of code.
> >
> > Rabik's issue is that the nr_uninterruptible migration code is wrong in
> > that he sees artifacts due to this (Rabik please do expand in more
> > detail).
> >
> > Paul's issue is that this code as it stands relies on us using
> > stop_machine() for unplug, we all would like to remove this assumption
> > so that eventually we can remove this stop_machine() usage altogether.
> >
> > The only reason we'd have to migrate nr_uninterruptible is so that we
> > could use for_each_online_cpu() loops in favour of
> > for_each_possible_cpu() loops, however since nr_uninterruptible() is the
> > only such loop and its using possible lets not bother at all.
> >
> > The problem Rabik sees is (probably) caused by the fact that by
> > migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
> > involved.
> >
> > So don't bother with fancy migration schemes (meaning we now have to
> > keep using for_each_possible_cpu()) and instead fold any nr_active delta
> > after we migrate all tasks away to make sure we don't have any skewed
> > nr_active accounting.
> >
> >
> > Reported-by: Rakib Mullick <[email protected]>
> > Reported-by: Paul E. McKenney <[email protected]>
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > ---
> > kernel/sched/core.c | 31 ++++++++++---------------------
> > 1 file changed, 10 insertions(+), 21 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 4376c9f..06d23c6 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5338,27 +5338,17 @@ void idle_task_exit(void)
> > }
> >
> > /*
> > - * While a dead CPU has no uninterruptible tasks queued at this point,
> > - * it might still have a nonzero ->nr_uninterruptible counter, because
> > - * for performance reasons the counter is not stricly tracking tasks to
> > - * their home CPUs. So we just add the counter to another CPU's counter,
> > - * to keep the global sum constant after CPU-down:
> > - */
> > -static void migrate_nr_uninterruptible(struct rq *rq_src)
> > -{
> > - struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
> > -
> > - rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
> > - rq_src->nr_uninterruptible = 0;
> > -}
> > -
> > -/*
> > - * remove the tasks which were accounted by rq from calc_load_tasks.
> > + * Since this CPU is going 'away' for a while, fold any nr_active delta
> > + * we might have. Assumes we're called after migrate_tasks() so that the
> > + * nr_active count is stable.
> > + *
> > + * Also see the comment "Global load-average calculations".
> > */
> > -static void calc_global_load_remove(struct rq *rq)
> > +static void calc_load_migrate(struct rq *rq)
> > {
> > - atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
> > - rq->calc_load_active = 0;
> > + long delta = calc_load_fold_active(rq);
> > + if (delta)
> > + atomic_long_add(delta, &calc_load_tasks);
> > }
> >
> > /*
> > @@ -5652,8 +5642,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
> > BUG_ON(rq->nr_running != 1); /* the migration thread */
> > raw_spin_unlock_irqrestore(&rq->lock, flags);
> >
> > - migrate_nr_uninterruptible(rq);
> > - calc_global_load_remove(rq);
> > + calc_load_migrate(rq);
>
> Not sure that it matters, but...
>
> This is called from the CPU_DYING notifier, which runs with irqs
> disabled, but in process context. As I understand it, this means that
> ->nr_running==1. If my understanding is correct (ha!), this means that
> this change sets ->calc_load_active to one (rather than zero as in the
> original) and that it subtracts one fewer from calc_load_tasks than did
> the original. Of course, I have no idea whether this matters.
>
> If I am correct and if it does matter, one straightforward fix
> is to add a "CPU_DEAD" branch to the switch statement and move the
> "calc_load_migrate(rq)" to that new branch. Given that "rq" references
> the outgoing CPU, my guess is that locking is not needed, but you would
> know better than I.

How about the following updated patch?

Thanx, Paul

------------------------------------------------------------------------

sched: Fix load avg vs cpu-hotplug

Rabik and Paul reported two different issues related to the same few
lines of code.

Rabik's issue is that the nr_uninterruptible migration code is wrong in
that he sees artifacts due to this (Rabik please do expand in more
detail).

Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.

The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops, however since nr_uninterruptible() is the
only such loop and its using possible lets not bother at all.

The problem Rabik sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
involved.

So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.

[ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
miscounting noted by Rakib. ]

Reported-by: Rakib Mullick <[email protected]>
Reported-by: Paul E. McKenney <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e841dfc..a8807f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5309,27 +5309,17 @@ void idle_task_exit(void)
}

/*
- * While a dead CPU has no uninterruptible tasks queued at this point,
- * it might still have a nonzero ->nr_uninterruptible counter, because
- * for performance reasons the counter is not stricly tracking tasks to
- * their home CPUs. So we just add the counter to another CPU's counter,
- * to keep the global sum constant after CPU-down:
- */
-static void migrate_nr_uninterruptible(struct rq *rq_src)
-{
- struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
-
- rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
- rq_src->nr_uninterruptible = 0;
-}
-
-/*
- * remove the tasks which were accounted by rq from calc_load_tasks.
+ * Since this CPU is going 'away' for a while, fold any nr_active delta
+ * we might have. Assumes we're called after migrate_tasks() so that the
+ * nr_active count is stable.
+ *
+ * Also see the comment "Global load-average calculations".
*/
-static void calc_global_load_remove(struct rq *rq)
+static void calc_load_migrate(struct rq *rq)
{
- atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
- rq->calc_load_active = 0;
+ long delta = calc_load_fold_active(rq);
+ if (delta)
+ atomic_long_add(delta, &calc_load_tasks);
}

/*
@@ -5622,9 +5612,18 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
migrate_tasks(cpu);
BUG_ON(rq->nr_running != 1); /* the migration thread */
raw_spin_unlock_irqrestore(&rq->lock, flags);
+ break;

- migrate_nr_uninterruptible(rq);
- calc_global_load_remove(rq);
+ case CPU_DEAD:
+ {
+ struct rq *dest_rq;
+
+ local_irq_save(flags);
+ dest_rq = cpu_rq(smp_processor_id());
+ raw_spin_lock(&dest_rq->lock);
+ calc_load_migrate(rq);
+ raw_spin_unlock_irqrestore(&dest_rq->lock, flags);
+ }
break;
#endif
}

2012-08-28 06:57:15

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

Hello Paul,

On 8/28/12, Paul E. McKenney <[email protected]> wrote:
> On Mon, Aug 20, 2012 at 09:26:57AM -0700, Paul E. McKenney wrote:
>> On Mon, Aug 20, 2012 at 11:26:57AM +0200, Peter Zijlstra wrote:
>
> How about the following updated patch?
>
Actually, I was waiting for Peter's update.

> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> sched: Fix load avg vs cpu-hotplug
>
> Rabik and Paul reported two different issues related to the same few
> lines of code.
>
> Rabik's issue is that the nr_uninterruptible migration code is wrong in
> that he sees artifacts due to this (Rabik please do expand in more
> detail).
>
> Paul's issue is that this code as it stands relies on us using
> stop_machine() for unplug, we all would like to remove this assumption
> so that eventually we can remove this stop_machine() usage altogether.
>
> The only reason we'd have to migrate nr_uninterruptible is so that we
> could use for_each_online_cpu() loops in favour of
> for_each_possible_cpu() loops, however since nr_uninterruptible() is the
> only such loop and its using possible lets not bother at all.
>
> The problem Rabik sees is (probably) caused by the fact that by
> migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
> involved.
>
> So don't bother with fancy migration schemes (meaning we now have to
> keep using for_each_possible_cpu()) and instead fold any nr_active delta
> after we migrate all tasks away to make sure we don't have any skewed
> nr_active accounting.
>
> [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
> miscounting noted by Rakib. ]
>
> Reported-by: Rakib Mullick <[email protected]>
> Reported-by: Paul E. McKenney <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e841dfc..a8807f2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5309,27 +5309,17 @@ void idle_task_exit(void)
> }
>
> /*
> - * While a dead CPU has no uninterruptible tasks queued at this point,
> - * it might still have a nonzero ->nr_uninterruptible counter, because
> - * for performance reasons the counter is not stricly tracking tasks to
> - * their home CPUs. So we just add the counter to another CPU's counter,
> - * to keep the global sum constant after CPU-down:
> - */
> -static void migrate_nr_uninterruptible(struct rq *rq_src)
> -{
> - struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
> -
> - rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
> - rq_src->nr_uninterruptible = 0;
> -}
> -
> -/*
> - * remove the tasks which were accounted by rq from calc_load_tasks.
> + * Since this CPU is going 'away' for a while, fold any nr_active delta
> + * we might have. Assumes we're called after migrate_tasks() so that the
> + * nr_active count is stable.
> + *
> + * Also see the comment "Global load-average calculations".
> */
> -static void calc_global_load_remove(struct rq *rq)
> +static void calc_load_migrate(struct rq *rq)
> {
> - atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
> - rq->calc_load_active = 0;
> + long delta = calc_load_fold_active(rq);
> + if (delta)
> + atomic_long_add(delta, &calc_load_tasks);
> }
>
> /*
> @@ -5622,9 +5612,18 @@ migration_call(struct notifier_block *nfb, unsigned
> long action, void *hcpu)
> migrate_tasks(cpu);
> BUG_ON(rq->nr_running != 1); /* the migration thread */
> raw_spin_unlock_irqrestore(&rq->lock, flags);
> + break;
>
> - migrate_nr_uninterruptible(rq);
> - calc_global_load_remove(rq);
> + case CPU_DEAD:
> + {
> + struct rq *dest_rq;
> +
> + local_irq_save(flags);
> + dest_rq = cpu_rq(smp_processor_id());

Use of smp_processor_id() as dest cpu isn't clear to me, this
processor is about to get down, isn't it?

> + raw_spin_lock(&dest_rq->lock);
> + calc_load_migrate(rq);

Well, calc_load_migrate() has no impact cause rq->nr_running == 1 at
this point. It's been already pointed out previously.

Thanks,
Rakib

2012-08-28 13:49:47

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

On Tue, Aug 28, 2012 at 12:57:09PM +0600, Rakib Mullick wrote:
> Hello Paul,
>
> On 8/28/12, Paul E. McKenney <[email protected]> wrote:
> > On Mon, Aug 20, 2012 at 09:26:57AM -0700, Paul E. McKenney wrote:
> >> On Mon, Aug 20, 2012 at 11:26:57AM +0200, Peter Zijlstra wrote:
> >
> > How about the following updated patch?
> >
> Actually, I was waiting for Peter's update.

I was too, but chatted with Peter.

> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > sched: Fix load avg vs cpu-hotplug
> >
> > Rabik and Paul reported two different issues related to the same few
> > lines of code.
> >
> > Rabik's issue is that the nr_uninterruptible migration code is wrong in
> > that he sees artifacts due to this (Rabik please do expand in more
> > detail).
> >
> > Paul's issue is that this code as it stands relies on us using
> > stop_machine() for unplug, we all would like to remove this assumption
> > so that eventually we can remove this stop_machine() usage altogether.
> >
> > The only reason we'd have to migrate nr_uninterruptible is so that we
> > could use for_each_online_cpu() loops in favour of
> > for_each_possible_cpu() loops, however since nr_uninterruptible() is the
> > only such loop and its using possible lets not bother at all.
> >
> > The problem Rabik sees is (probably) caused by the fact that by
> > migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
> > involved.
> >
> > So don't bother with fancy migration schemes (meaning we now have to
> > keep using for_each_possible_cpu()) and instead fold any nr_active delta
> > after we migrate all tasks away to make sure we don't have any skewed
> > nr_active accounting.
> >
> > [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
> > miscounting noted by Rakib. ]
> >
> > Reported-by: Rakib Mullick <[email protected]>
> > Reported-by: Paul E. McKenney <[email protected]>
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index e841dfc..a8807f2 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5309,27 +5309,17 @@ void idle_task_exit(void)
> > }
> >
> > /*
> > - * While a dead CPU has no uninterruptible tasks queued at this point,
> > - * it might still have a nonzero ->nr_uninterruptible counter, because
> > - * for performance reasons the counter is not stricly tracking tasks to
> > - * their home CPUs. So we just add the counter to another CPU's counter,
> > - * to keep the global sum constant after CPU-down:
> > - */
> > -static void migrate_nr_uninterruptible(struct rq *rq_src)
> > -{
> > - struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
> > -
> > - rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
> > - rq_src->nr_uninterruptible = 0;
> > -}
> > -
> > -/*
> > - * remove the tasks which were accounted by rq from calc_load_tasks.
> > + * Since this CPU is going 'away' for a while, fold any nr_active delta
> > + * we might have. Assumes we're called after migrate_tasks() so that the
> > + * nr_active count is stable.
> > + *
> > + * Also see the comment "Global load-average calculations".
> > */
> > -static void calc_global_load_remove(struct rq *rq)
> > +static void calc_load_migrate(struct rq *rq)
> > {
> > - atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
> > - rq->calc_load_active = 0;
> > + long delta = calc_load_fold_active(rq);
> > + if (delta)
> > + atomic_long_add(delta, &calc_load_tasks);
> > }
> >
> > /*
> > @@ -5622,9 +5612,18 @@ migration_call(struct notifier_block *nfb, unsigned
> > long action, void *hcpu)
> > migrate_tasks(cpu);
> > BUG_ON(rq->nr_running != 1); /* the migration thread */
> > raw_spin_unlock_irqrestore(&rq->lock, flags);
> > + break;
> >
> > - migrate_nr_uninterruptible(rq);
> > - calc_global_load_remove(rq);
> > + case CPU_DEAD:
> > + {
> > + struct rq *dest_rq;
> > +
> > + local_irq_save(flags);
> > + dest_rq = cpu_rq(smp_processor_id());
>
> Use of smp_processor_id() as dest cpu isn't clear to me, this
> processor is about to get down, isn't it?

Nope. The CPU_DEAD notifier happens after the outgoing CPU has been
fully offlined, and so it must run on some other CPU.

> > + raw_spin_lock(&dest_rq->lock);
> > + calc_load_migrate(rq);
>
> Well, calc_load_migrate() has no impact cause rq->nr_running == 1 at
> this point. It's been already pointed out previously.

Even after the outgoing CPU is fully gone? I would hope that the value
would be zero.

Thanx, Paul

2012-08-28 16:52:48

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

On 8/28/12, Paul E. McKenney <[email protected]> wrote:
> On Tue, Aug 28, 2012 at 12:57:09PM +0600, Rakib Mullick wrote:
>> Hello Paul,
>>
>> On 8/28/12, Paul E. McKenney <[email protected]> wrote:
>> > On Mon, Aug 20, 2012 at 09:26:57AM -0700, Paul E. McKenney wrote:
>> >> On Mon, Aug 20, 2012 at 11:26:57AM +0200, Peter Zijlstra wrote:
>> >
>> > How about the following updated patch?
>> >
>> Actually, I was waiting for Peter's update.
>
> I was too, but chatted with Peter.
>
>> > Thanx, Paul
>> >
>> > ------------------------------------------------------------------------
>> >
>> > sched: Fix load avg vs cpu-hotplug
>> >
>> > Rabik and Paul reported two different issues related to the same few
>> > lines of code.
>> >
>> > Rabik's issue is that the nr_uninterruptible migration code is wrong in
>> > that he sees artifacts due to this (Rabik please do expand in more
>> > detail).
>> >
>> > Paul's issue is that this code as it stands relies on us using
>> > stop_machine() for unplug, we all would like to remove this assumption
>> > so that eventually we can remove this stop_machine() usage altogether.
>> >
>> > The only reason we'd have to migrate nr_uninterruptible is so that we
>> > could use for_each_online_cpu() loops in favour of
>> > for_each_possible_cpu() loops, however since nr_uninterruptible() is
>> > the
>> > only such loop and its using possible lets not bother at all.
>> >
>> > The problem Rabik sees is (probably) caused by the fact that by
>> > migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
>> > involved.
>> >
>> > So don't bother with fancy migration schemes (meaning we now have to
>> > keep using for_each_possible_cpu()) and instead fold any nr_active
>> > delta
>> > after we migrate all tasks away to make sure we don't have any skewed
>> > nr_active accounting.
>> >
>> > [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
>> > miscounting noted by Rakib. ]
>> >
>> > Reported-by: Rakib Mullick <[email protected]>
>> > Reported-by: Paul E. McKenney <[email protected]>
>> > Signed-off-by: Peter Zijlstra <[email protected]>
>> > Signed-off-by: Paul E. McKenney <[email protected]>
>> >
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index e841dfc..a8807f2 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -5309,27 +5309,17 @@ void idle_task_exit(void)
>> > }
>> >
>> > /*
>> > - * While a dead CPU has no uninterruptible tasks queued at this point,
>> > - * it might still have a nonzero ->nr_uninterruptible counter, because
>> > - * for performance reasons the counter is not stricly tracking tasks
>> > to
>> > - * their home CPUs. So we just add the counter to another CPU's
>> > counter,
>> > - * to keep the global sum constant after CPU-down:
>> > - */
>> > -static void migrate_nr_uninterruptible(struct rq *rq_src)
>> > -{
>> > - struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
>> > -
>> > - rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
>> > - rq_src->nr_uninterruptible = 0;
>> > -}
>> > -
>> > -/*
>> > - * remove the tasks which were accounted by rq from calc_load_tasks.
>> > + * Since this CPU is going 'away' for a while, fold any nr_active
>> > delta
>> > + * we might have. Assumes we're called after migrate_tasks() so that
>> > the
>> > + * nr_active count is stable.
>> > + *
>> > + * Also see the comment "Global load-average calculations".
>> > */
>> > -static void calc_global_load_remove(struct rq *rq)
>> > +static void calc_load_migrate(struct rq *rq)
>> > {
>> > - atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
>> > - rq->calc_load_active = 0;
>> > + long delta = calc_load_fold_active(rq);
>> > + if (delta)
>> > + atomic_long_add(delta, &calc_load_tasks);
>> > }
>> >
>> > /*
>> > @@ -5622,9 +5612,18 @@ migration_call(struct notifier_block *nfb,
>> > unsigned
>> > long action, void *hcpu)
>> > migrate_tasks(cpu);
>> > BUG_ON(rq->nr_running != 1); /* the migration thread */
>> > raw_spin_unlock_irqrestore(&rq->lock, flags);
>> > + break;
>> >
>> > - migrate_nr_uninterruptible(rq);
>> > - calc_global_load_remove(rq);
>> > + case CPU_DEAD:
>> > + {
>> > + struct rq *dest_rq;
>> > +
>> > + local_irq_save(flags);
>> > + dest_rq = cpu_rq(smp_processor_id());
>>
>> Use of smp_processor_id() as dest cpu isn't clear to me, this
>> processor is about to get down, isn't it?
>
> Nope. The CPU_DEAD notifier happens after the outgoing CPU has been
> fully offlined, and so it must run on some other CPU.
>
>> > + raw_spin_lock(&dest_rq->lock);
>> > + calc_load_migrate(rq);
>>
>> Well, calc_load_migrate() has no impact cause rq->nr_running == 1 at
>> this point. It's been already pointed out previously.
>
> Even after the outgoing CPU is fully gone? I would hope that the value
> would be zero.
>
Perhaps, yes and it doesn't make any difference. And so, at this point
doing calc_load_migrate()... I'm not sure. But, I'm sure that, this is
not what I had in my mind.

The patch I sent it was to move rq->nr_uninterruptible to the dest_rq
where all the tasks were moved. Then, Peter says that patch isn't
correct, cause tasks might get spread out amongst more than one CPU
due to tasks affinity (*if* task was affined). But, we can easily
expect that, admin is smart enough to not put a CPU offline, where
s/he put a task to explicitly run on. I agree that, my patch isn't
100% correct but it's better than current which is almost 100% wrong
which picks up a random CPU-rq to move rq->nr_uninterruptible count.
So, simply moving ->nr_uninterruptible to dest_rq (where tasks were
moved) should be fine. And when dest_rq's time will come to go idle
i.e calc_load_enter_idle() or calc_load_account_active() (if no_hz=n),
active count will get folded, we do not need to explicitly fold
nr_active count when CPU is going to die.

Thanks,
Rakib.

2012-08-28 17:58:04

[permalink] [raw]

Subject: Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

On Tue, Aug 28, 2012 at 10:52:45PM +0600, Rakib Mullick wrote:
> On 8/28/12, Paul E. McKenney <[email protected]> wrote:
> > On Tue, Aug 28, 2012 at 12:57:09PM +0600, Rakib Mullick wrote:
> >> Hello Paul,
> >>
> >> On 8/28/12, Paul E. McKenney <[email protected]> wrote:
> >> > On Mon, Aug 20, 2012 at 09:26:57AM -0700, Paul E. McKenney wrote:
> >> >> On Mon, Aug 20, 2012 at 11:26:57AM +0200, Peter Zijlstra wrote:
> >> >
> >> > How about the following updated patch?
> >> >
> >> Actually, I was waiting for Peter's update.
> >
> > I was too, but chatted with Peter.
> >
> >> > Thanx, Paul
> >> >
> >> > ------------------------------------------------------------------------
> >> >
> >> > sched: Fix load avg vs cpu-hotplug
> >> >
> >> > Rabik and Paul reported two different issues related to the same few
> >> > lines of code.
> >> >
> >> > Rabik's issue is that the nr_uninterruptible migration code is wrong in
> >> > that he sees artifacts due to this (Rabik please do expand in more
> >> > detail).
> >> >
> >> > Paul's issue is that this code as it stands relies on us using
> >> > stop_machine() for unplug, we all would like to remove this assumption
> >> > so that eventually we can remove this stop_machine() usage altogether.
> >> >
> >> > The only reason we'd have to migrate nr_uninterruptible is so that we
> >> > could use for_each_online_cpu() loops in favour of
> >> > for_each_possible_cpu() loops, however since nr_uninterruptible() is
> >> > the
> >> > only such loop and its using possible lets not bother at all.
> >> >
> >> > The problem Rabik sees is (probably) caused by the fact that by
> >> > migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
> >> > involved.
> >> >
> >> > So don't bother with fancy migration schemes (meaning we now have to
> >> > keep using for_each_possible_cpu()) and instead fold any nr_active
> >> > delta
> >> > after we migrate all tasks away to make sure we don't have any skewed
> >> > nr_active accounting.
> >> >
> >> > [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
> >> > miscounting noted by Rakib. ]
> >> >
> >> > Reported-by: Rakib Mullick <[email protected]>
> >> > Reported-by: Paul E. McKenney <[email protected]>
> >> > Signed-off-by: Peter Zijlstra <[email protected]>
> >> > Signed-off-by: Paul E. McKenney <[email protected]>
> >> >
> >> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> > index e841dfc..a8807f2 100644
> >> > --- a/kernel/sched/core.c
> >> > +++ b/kernel/sched/core.c
> >> > @@ -5309,27 +5309,17 @@ void idle_task_exit(void)
> >> > }
> >> >
> >> > /*
> >> > - * While a dead CPU has no uninterruptible tasks queued at this point,
> >> > - * it might still have a nonzero ->nr_uninterruptible counter, because
> >> > - * for performance reasons the counter is not stricly tracking tasks
> >> > to
> >> > - * their home CPUs. So we just add the counter to another CPU's
> >> > counter,
> >> > - * to keep the global sum constant after CPU-down:
> >> > - */
> >> > -static void migrate_nr_uninterruptible(struct rq *rq_src)
> >> > -{
> >> > - struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
> >> > -
> >> > - rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
> >> > - rq_src->nr_uninterruptible = 0;
> >> > -}
> >> > -
> >> > -/*
> >> > - * remove the tasks which were accounted by rq from calc_load_tasks.
> >> > + * Since this CPU is going 'away' for a while, fold any nr_active
> >> > delta
> >> > + * we might have. Assumes we're called after migrate_tasks() so that
> >> > the
> >> > + * nr_active count is stable.
> >> > + *
> >> > + * Also see the comment "Global load-average calculations".
> >> > */
> >> > -static void calc_global_load_remove(struct rq *rq)
> >> > +static void calc_load_migrate(struct rq *rq)
> >> > {
> >> > - atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
> >> > - rq->calc_load_active = 0;
> >> > + long delta = calc_load_fold_active(rq);
> >> > + if (delta)
> >> > + atomic_long_add(delta, &calc_load_tasks);
> >> > }
> >> >
> >> > /*
> >> > @@ -5622,9 +5612,18 @@ migration_call(struct notifier_block *nfb,
> >> > unsigned
> >> > long action, void *hcpu)
> >> > migrate_tasks(cpu);
> >> > BUG_ON(rq->nr_running != 1); /* the migration thread */
> >> > raw_spin_unlock_irqrestore(&rq->lock, flags);
> >> > + break;
> >> >
> >> > - migrate_nr_uninterruptible(rq);
> >> > - calc_global_load_remove(rq);
> >> > + case CPU_DEAD:
> >> > + {
> >> > + struct rq *dest_rq;
> >> > +
> >> > + local_irq_save(flags);
> >> > + dest_rq = cpu_rq(smp_processor_id());
> >>
> >> Use of smp_processor_id() as dest cpu isn't clear to me, this
> >> processor is about to get down, isn't it?
> >
> > Nope. The CPU_DEAD notifier happens after the outgoing CPU has been
> > fully offlined, and so it must run on some other CPU.
> >
> >> > + raw_spin_lock(&dest_rq->lock);
> >> > + calc_load_migrate(rq);
> >>
> >> Well, calc_load_migrate() has no impact cause rq->nr_running == 1 at
> >> this point. It's been already pointed out previously.
> >
> > Even after the outgoing CPU is fully gone? I would hope that the value
> > would be zero.
> >
> Perhaps, yes and it doesn't make any difference. And so, at this point
> doing calc_load_migrate()... I'm not sure. But, I'm sure that, this is
> not what I had in my mind.
>
> The patch I sent it was to move rq->nr_uninterruptible to the dest_rq
> where all the tasks were moved. Then, Peter says that patch isn't
> correct, cause tasks might get spread out amongst more than one CPU
> due to tasks affinity (*if* task was affined). But, we can easily
> expect that, admin is smart enough to not put a CPU offline, where
> s/he put a task to explicitly run on. I agree that, my patch isn't
> 100% correct but it's better than current which is almost 100% wrong
> which picks up a random CPU-rq to move rq->nr_uninterruptible count.
> So, simply moving ->nr_uninterruptible to dest_rq (where tasks were
> moved) should be fine. And when dest_rq's time will come to go idle
> i.e calc_load_enter_idle() or calc_load_account_active() (if no_hz=n),
> active count will get folded, we do not need to explicitly fold
> nr_active count when CPU is going to die.

OK, but I thought that Peter said that ->nr_uninterruptible was
meaningful only when summed across all CPUs. If that is the case,
it shouldn't matter where the counts are moved.

(I am not all that worried about the exact form of the patch, as long
as it allows us to get rid of the __stop_machine() in CPU offlining.)

Thanx, Paul

2012-08-29 01:05:10