This patchset includes a few modifications related to idle_balance().
Patch #1 addresses an issue introduced by commit e5fc6611, which can
potentially cause rq->max_idle_balance_cost to not get updated when it should.
Patch #2 initializes the per domain newidle balance stats in sd_numa_init().
Patch #3 is a performance-related patch. It stops searching for more tasks
to pull while traversing the domains in idle balance if we find that there
are runnable tasks. This patch resulted in approximately a 6% performance
improvement to a Java server workload on an 8 socket machine.
Jason Low (3):
sched, balancing: Update rq->max_idle_balance_cost whenever newidle
balance is attempted
sched: Initialize newidle balance stats in sd_numa_init()
sched, fair: Stop searching for tasks in newidle balance if there are
runnable tasks
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 17 +++++++++++------
2 files changed, 13 insertions(+), 6 deletions(-)
Commit e5fc6611 can potentially cause rq->max_idle_balance_cost to not be
updated, even when load_balance(NEWLY_IDLE) is attempted and the per-sd
max cost value is updated.
In this patch, we update the rq->max_idle_balance_cost regardless of
whether or not a task has been enqueued while browsing the domains.
Signed-off-by: Jason Low <[email protected]>
---
kernel/sched/fair.c | 9 +++++----
1 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43232b8..3e3ffb8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6658,6 +6658,7 @@ static int idle_balance(struct rq *this_rq)
int this_cpu = this_rq->cpu;
idle_enter_fair(this_rq);
+
/*
* We must set idle_stamp _before_ calling idle_balance(), such that we
* measure the duration of idle_balance() as idle time.
@@ -6710,9 +6711,12 @@ static int idle_balance(struct rq *this_rq)
raw_spin_lock(&this_rq->lock);
+ if (curr_cost > this_rq->max_idle_balance_cost)
+ this_rq->max_idle_balance_cost = curr_cost;
+
/*
* While browsing the domains, we released the rq lock.
- * A task could have be enqueued in the meantime
+ * A task could have been enqueued in the meantime.
*/
if (this_rq->cfs.h_nr_running && !pulled_task) {
pulled_task = 1;
@@ -6727,9 +6731,6 @@ static int idle_balance(struct rq *this_rq)
this_rq->next_balance = next_balance;
}
- if (curr_cost > this_rq->max_idle_balance_cost)
- this_rq->max_idle_balance_cost = curr_cost;
-
out:
/* Is there a task of a high priority class? */
if (this_rq->nr_running != this_rq->cfs.h_nr_running)
--
1.7.1
Also initialize the per-sd variables for newidle load balancing
in sd_numa_init().
Signed-off-by: Jason Low <[email protected]>
---
kernel/sched/core.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3f12533..b1c6cb9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6030,6 +6030,8 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
,
.last_balance = jiffies,
.balance_interval = sd_weight,
+ .max_newidle_lb_cost = 0,
+ .next_decay_max_lb_cost = jiffies,
};
SD_INIT_NAME(sd, NUMA);
sd->private = &tl->data;
--
1.7.1
It was found that when running some workloads (such as AIM7) on large systems
with many cores, CPUs do not remain idle for long. Thus, tasks can
wake/get enqueued while doing idle balancing.
In this patch, while traversing the domains in idle balance, in addition to
checking for pulled_task, we add an extra check for this_rq->nr_running for
determining if we should stop searching for tasks to pull. If there are
runnable tasks on this rq, then we will stop traversing the domains. This
reduces the chance that idle balance delays a task from running.
This patch resulted in approximately a 6% performance improvement when
running a Java Server workload on an 8 socket machine.
Signed-off-by: Jason Low <[email protected]>
---
kernel/sched/fair.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e3ffb8..232518c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6689,7 +6689,6 @@ static int idle_balance(struct rq *this_rq)
if (sd->flags & SD_BALANCE_NEWIDLE) {
t0 = sched_clock_cpu(this_cpu);
- /* If we've pulled tasks over stop searching: */
pulled_task = load_balance(this_cpu, this_rq,
sd, CPU_NEWLY_IDLE,
&continue_balancing);
@@ -6704,7 +6703,12 @@ static int idle_balance(struct rq *this_rq)
interval = msecs_to_jiffies(sd->balance_interval);
if (time_after(next_balance, sd->last_balance + interval))
next_balance = sd->last_balance + interval;
- if (pulled_task)
+
+ /*
+ * Stop searching for tasks to pull if there are
+ * now runnable tasks on this rq.
+ */
+ if (pulled_task || this_rq->nr_running > 0)
break;
}
rcu_read_unlock();
--
1.7.1
On Wed, 2014-04-23 at 18:30 -0700, Jason Low wrote:
> It was found that when running some workloads (such as AIM7) on large systems
> with many cores, CPUs do not remain idle for long. Thus, tasks can
> wake/get enqueued while doing idle balancing.
>
> In this patch, while traversing the domains in idle balance, in addition to
> checking for pulled_task, we add an extra check for this_rq->nr_running for
> determining if we should stop searching for tasks to pull. If there are
> runnable tasks on this rq, then we will stop traversing the domains. This
> reduces the chance that idle balance delays a task from running.
>
> This patch resulted in approximately a 6% performance improvement when
> running a Java Server workload on an 8 socket machine.
Checking rq->lock for contention before ever going to idle balancing as
well should give you a bit more. No need to run around looking for work
that's trying to arrive. By not going there, perhaps stacking tasks,
you may head off a future bounce as well.
-Mike
On Wed, Apr 23, 2014 at 06:30:35PM -0700, Jason Low wrote:
> It was found that when running some workloads (such as AIM7) on large systems
> with many cores, CPUs do not remain idle for long. Thus, tasks can
> wake/get enqueued while doing idle balancing.
>
> In this patch, while traversing the domains in idle balance, in addition to
> checking for pulled_task, we add an extra check for this_rq->nr_running for
> determining if we should stop searching for tasks to pull. If there are
> runnable tasks on this rq, then we will stop traversing the domains. This
> reduces the chance that idle balance delays a task from running.
>
> This patch resulted in approximately a 6% performance improvement when
> running a Java Server workload on an 8 socket machine.
>
> Signed-off-by: Jason Low <[email protected]>
> ---
> kernel/sched/fair.c | 8 ++++++--
> 1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3e3ffb8..232518c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6689,7 +6689,6 @@ static int idle_balance(struct rq *this_rq)
> if (sd->flags & SD_BALANCE_NEWIDLE) {
> t0 = sched_clock_cpu(this_cpu);
>
> - /* If we've pulled tasks over stop searching: */
> pulled_task = load_balance(this_cpu, this_rq,
> sd, CPU_NEWLY_IDLE,
> &continue_balancing);
> @@ -6704,7 +6703,12 @@ static int idle_balance(struct rq *this_rq)
> interval = msecs_to_jiffies(sd->balance_interval);
> if (time_after(next_balance, sd->last_balance + interval))
> next_balance = sd->last_balance + interval;
> - if (pulled_task)
> +
> + /*
> + * Stop searching for tasks to pull if there are
> + * now runnable tasks on this rq.
> + */
> + if (pulled_task || this_rq->nr_running > 0)
> break;
> }
> rcu_read_unlock();
There's also the CONFIG_PREEMPT bit in move_tasks(); does making that
unconditional also help such a workload?
On Thu, 2014-04-24 at 04:51 +0200, Mike Galbraith wrote:
> Checking rq->lock for contention before ever going to idle balancing as
> well should give you a bit more. No need to run around looking for work
> that's trying to arrive. By not going there, perhaps stacking tasks,
> you may head off a future bounce as well.
Zzt. Bailing without drop/retake to let work land would be a bad plan.
-Mike
Hi Jason,
On 04/24/2014 07:00 AM, Jason Low wrote:
> Commit e5fc6611 can potentially cause rq->max_idle_balance_cost to not be
> updated, even when load_balance(NEWLY_IDLE) is attempted and the per-sd
> max cost value is updated.
>
> In this patch, we update the rq->max_idle_balance_cost regardless of
> whether or not a task has been enqueued while browsing the domains.
>
> Signed-off-by: Jason Low <[email protected]>
> ---
> kernel/sched/fair.c | 9 +++++----
> 1 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 43232b8..3e3ffb8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6658,6 +6658,7 @@ static int idle_balance(struct rq *this_rq)
> int this_cpu = this_rq->cpu;
>
> idle_enter_fair(this_rq);
> +
> /*
> * We must set idle_stamp _before_ calling idle_balance(), such that we
> * measure the duration of idle_balance() as idle time.
> @@ -6710,9 +6711,12 @@ static int idle_balance(struct rq *this_rq)
>
> raw_spin_lock(&this_rq->lock);
>
> + if (curr_cost > this_rq->max_idle_balance_cost)
> + this_rq->max_idle_balance_cost = curr_cost;
> +
> /*
What about the update of the next_balance field? See the code snippet below.
This will also be skipped as a consequence of commit e5fc6611, right?
if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
/*
* We are going idle. next_balance may be set based on
* a busy processor. So reset next_balance.
*/
this_rq->next_balance = next_balance;
}
Also the comment in the above snippet does not look right to me.
It says "we are going idle" but the condition checks for pulled_task.
Regards
Preeti U Murthy
> * While browsing the domains, we released the rq lock.
> - * A task could have be enqueued in the meantime
> + * A task could have been enqueued in the meantime.
> */
> if (this_rq->cfs.h_nr_running && !pulled_task) {
> pulled_task = 1;
> @@ -6727,9 +6731,6 @@ static int idle_balance(struct rq *this_rq)
> this_rq->next_balance = next_balance;
> }
>
> - if (curr_cost > this_rq->max_idle_balance_cost)
> - this_rq->max_idle_balance_cost = curr_cost;
> -
> out:
> /* Is there a task of a high priority class? */
> if (this_rq->nr_running != this_rq->cfs.h_nr_running)
>
On Thu, Apr 24, 2014 at 02:30:35AM +0100, Jason Low wrote:
> @@ -6704,7 +6703,12 @@ static int idle_balance(struct rq *this_rq)
> interval = msecs_to_jiffies(sd->balance_interval);
> if (time_after(next_balance, sd->last_balance + interval))
> next_balance = sd->last_balance + interval;
> - if (pulled_task)
> +
> + /*
> + * Stop searching for tasks to pull if there are
> + * now runnable tasks on this rq.
> + */
> + if (pulled_task || this_rq->nr_running > 0)
Should this be cfs tasks instead?
+ if (pulled_task || this_rq->cfs.h_nr_running > 0)
3.15-rc2 commit 35805ff8f4fc535ac85330170d3c56829c87c677 seems to
indicate that using rq->nr_running may lead to trouble.
The other two patches look good to me.
Morten
On Thu, Apr 24, 2014 at 11:30:59AM +0100, Morten Rasmussen wrote:
> On Thu, Apr 24, 2014 at 02:30:35AM +0100, Jason Low wrote:
> > @@ -6704,7 +6703,12 @@ static int idle_balance(struct rq *this_rq)
> > interval = msecs_to_jiffies(sd->balance_interval);
> > if (time_after(next_balance, sd->last_balance + interval))
> > next_balance = sd->last_balance + interval;
> > - if (pulled_task)
> > +
> > + /*
> > + * Stop searching for tasks to pull if there are
> > + * now runnable tasks on this rq.
> > + */
> > + if (pulled_task || this_rq->nr_running > 0)
>
> Should this be cfs tasks instead?
>
> + if (pulled_task || this_rq->cfs.h_nr_running > 0)
>
> 3.15-rc2 commit 35805ff8f4fc535ac85330170d3c56829c87c677 seems to
> indicate that using rq->nr_running may lead to trouble.
>
> The other two patches look good to me.
No, this really wants to be nr_running, we want to bail the idle
balancer when there's anything runnable present.
Note how out: is very careful to return -1 (which results in RETRY_TASK)
when rq->nr_running != rq->cfs.h_nr_running.
That same out: test also makes the problem that commit fixes impossible
again.
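For reference, the out: test being referred to (from the diff context above):

out:
	/* Is there a task of a high priority class? */
	if (this_rq->nr_running != this_rq->cfs.h_nr_running)
		pulled_task = -1;	/* a negative return makes the caller use RETRY_TASK */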
On Thu, Apr 24, 2014 at 03:44:47PM +0530, Preeti U Murthy wrote:
> What about the update of the next_balance field? See the code snippet below.
> This will also be skipped as a consequence of commit e5fc6611, right?
>
> if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
> /*
> * We are going idle. next_balance may be set based on
> * a busy processor. So reset next_balance.
> */
> this_rq->next_balance = next_balance;
> }
>
> Also the comment in the above snippet does not look right to me.
> It says "we are going idle" but the condition checks for pulled_task.
Yeah, that's odd indeed. Ingo did that back in dd41f596cda0d, I suspect
it's an error, but..
So I think that should become !pulled_task || time_after().
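Spelled out, the suggested change would read (a sketch only):

	if (!pulled_task || time_after(jiffies, this_rq->next_balance)) {
		/*
		 * We are going idle; next_balance may be set based on
		 * a busy processor. So reset next_balance.
		 */
		this_rq->next_balance = next_balance;
	}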
On Wed, Apr 23, 2014 at 06:30:34PM -0700, Jason Low wrote:
> Also initialize the per-sd variables for newidle load balancing
> in sd_numa_init().
Vincent actually fixed this 'by accident' in his topology rework, but
I'll queue this for .15 and fix the conflict with his patch, which is
queued for .16.
On Thu, Apr 24, 2014 at 02:04:15PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 24, 2014 at 03:44:47PM +0530, Preeti U Murthy wrote:
> > What about the update of the next_balance field? See the code snippet below.
> > This will also be skipped as a consequence of commit e5fc6611, right?
> >
> > if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
> > /*
> > * We are going idle. next_balance may be set based on
> > * a busy processor. So reset next_balance.
> > */
> > this_rq->next_balance = next_balance;
> > }
> >
> > Also the comment in the above snippet does not look right to me.
> > It says "we are going idle" but the condition checks for pulled_task.
>
> Yeah, that's odd indeed. Ingo did that back in dd41f596cda0d, I suspect
> it's an error, but..
>
> So I think that should become !pulled_task || time_after().
Hmm, no, I missed that the for_each_domain() loop pushes next_balance
ahead if it did a balance on the domain.
So it actually makes sense and the comment is wrong, but then you're
also right that we want to not skip that.
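For reference, the per-domain update in that loop:

	interval = msecs_to_jiffies(sd->balance_interval);
	if (time_after(next_balance, sd->last_balance + interval))
		next_balance = sd->last_balance + interval;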
So how about something like so?
---
Subject: sched,fair: Fix idle_balance()'s pulled_task logic
From: Peter Zijlstra <[email protected]>
Date: Thu Apr 24 14:24:20 CEST 2014
Jason reported that we can fail to update max_idle_balance_cost, even
when we actually did attempt a newidle balance.
Preeti then noticed that the next_balance update logic was equally
flawed.
So fix both and update the comments.
Fixes: e5fc66119ec9 ("sched: Fix race in idle_balance()")
Cc: Daniel Lezcano <[email protected]>
Reported-by: Jason Low <[email protected]>
Reported-by: Preeti U Murthy <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched/fair.c | 23 ++++++++++-------------
1 file changed, 10 insertions(+), 13 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6711,21 +6711,18 @@ static int idle_balance(struct rq *this_
raw_spin_lock(&this_rq->lock);
/*
- * While browsing the domains, we released the rq lock.
- * A task could have be enqueued in the meantime
+ * If we pulled a task (or if the interval expired), we did a balance
+ * pass, so update next_balance.
*/
- if (this_rq->cfs.h_nr_running && !pulled_task) {
- pulled_task = 1;
- goto out;
- }
-
- if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
- /*
- * We are going idle. next_balance may be set based on
- * a busy processor. So reset next_balance.
- */
+ if (pulled_task || time_after(jiffies, this_rq->next_balance))
this_rq->next_balance = next_balance;
- }
+
+ /*
+ * While browsing the domains, we released the rq lock, a task could
+ * have been enqueued in the meantime.
+ */
+ if (this_rq->cfs.h_nr_running && !pulled_task)
+ pulled_task = 1;
if (curr_cost > this_rq->max_idle_balance_cost)
this_rq->max_idle_balance_cost = curr_cost;
On Thu, Apr 24, 2014 at 12:32:31PM +0100, Peter Zijlstra wrote:
> On Thu, Apr 24, 2014 at 11:30:59AM +0100, Morten Rasmussen wrote:
> > On Thu, Apr 24, 2014 at 02:30:35AM +0100, Jason Low wrote:
> > > @@ -6704,7 +6703,12 @@ static int idle_balance(struct rq *this_rq)
> > > interval = msecs_to_jiffies(sd->balance_interval);
> > > if (time_after(next_balance, sd->last_balance + interval))
> > > next_balance = sd->last_balance + interval;
> > > - if (pulled_task)
> > > +
> > > + /*
> > > + * Stop searching for tasks to pull if there are
> > > + * now runnable tasks on this rq.
> > > + */
> > > + if (pulled_task || this_rq->nr_running > 0)
> >
> > Should this be cfs tasks instead?
> >
> > + if (pulled_task || this_rq->cfs.h_nr_running > 0)
> >
> > 3.15-rc2 commit 35805ff8f4fc535ac85330170d3c56829c87c677 seems to
> > indicate that using rq->nr_running may lead to trouble.
> >
> > The other two patches look good to me.
>
> No, this really wants to be nr_running, we want to bail the idle
> balancer when there's anything runnable present.
>
> Note how out: is very careful to return -1 (which results in RETRY_TASK)
> when rq->nr_running != rq->cfs.h_nr_running.
>
> That same out: test also makes the problem that commit fixes impossible
> again.
I should have done my homework properly. I may be missing something, but
don't we risk bailing out of idle balance if there is a throttled rt
task and going straight to idle?
On Thu, Apr 24, 2014 at 03:08:02PM +0100, Morten Rasmussen wrote:
> On Thu, Apr 24, 2014 at 12:32:31PM +0100, Peter Zijlstra wrote:
> > On Thu, Apr 24, 2014 at 11:30:59AM +0100, Morten Rasmussen wrote:
> > > On Thu, Apr 24, 2014 at 02:30:35AM +0100, Jason Low wrote:
> > > > @@ -6704,7 +6703,12 @@ static int idle_balance(struct rq *this_rq)
> > > > interval = msecs_to_jiffies(sd->balance_interval);
> > > > if (time_after(next_balance, sd->last_balance + interval))
> > > > next_balance = sd->last_balance + interval;
> > > > - if (pulled_task)
> > > > +
> > > > + /*
> > > > + * Stop searching for tasks to pull if there are
> > > > + * now runnable tasks on this rq.
> > > > + */
> > > > + if (pulled_task || this_rq->nr_running > 0)
> > >
> > > Should this be cfs tasks instead?
> > >
> > > + if (pulled_task || this_rq->cfs.h_nr_running > 0)
> > >
> > > 3.15-rc2 commit 35805ff8f4fc535ac85330170d3c56829c87c677 seems to
> > > indicate that using rq->nr_running may lead to trouble.
> > >
> > > The other two patches look good to me.
> >
> > No, this really wants to be nr_running, we want to bail the idle
> > balancer when there's anything runnable present.
> >
> > Note how out: is very careful to return -1 (which results in RETRY_TASK)
> > when rq->nr_running != rq->cfs.h_nr_running.
> >
> > That same out: test also makes the problem that commit fixes impossible
> > again.
>
> I should have done my homework properly. I may be missing something, but
> don't we risk bailing out of idle balance if there is a throttled rt
> task and going straight to idle?
Good point, depends, in tip/sched/core Kirill fixed that. For -linus
this is indeed a problem.
See patches:
46383648b3c7 sched: Revert commit 4c6c4e38c4e9 ("sched/core: Fix endless loop in pick_next_task()")
f4ebcbc0d7e0 sched/rt: Substract number of tasks of throttled queues from rq->nr_running
653d07a6989a sched/rt: Add accessors rq_of_rt_se()
22abdef37ceb sched/rt: Sum number of all children tasks in hierarhy at ->rt_nr_running
So for sched/core everything should be fine. This does mean I have to
queue this patch for /core and not /urgent (/me quickly moves).
On Thu, 2014-04-24 at 04:51 +0200, Mike Galbraith wrote:
> On Wed, 2014-04-23 at 18:30 -0700, Jason Low wrote:
> > It was found that when running some workloads (such as AIM7) on large systems
> > with many cores, CPUs do not remain idle for long. Thus, tasks can
> > wake/get enqueued while doing idle balancing.
> >
> > In this patch, while traversing the domains in idle balance, in addition to
> > checking for pulled_task, we add an extra check for this_rq->nr_running for
> > determining if we should stop searching for tasks to pull. If there are
> > runnable tasks on this rq, then we will stop traversing the domains. This
> > reduces the chance that idle balance delays a task from running.
> >
> > This patch resulted in approximately a 6% performance improvement when
> > running a Java Server workload on an 8 socket machine.
>
> Checking rq->lock for contention before ever going to idle balancing as
> well should give you a bit more. No need to run around looking for work
> that's trying to arrive. By not going there, perhaps stacking tasks,
> you may head off a future bounce as well.
Are you thinking of something along the lines of this:
@@ -6658,7 +6658,8 @@ static int idle_balance(struct rq *this_rq)
*/
this_rq->idle_stamp = rq_clock(this_rq);
- if (this_rq->avg_idle < sysctl_sched_migration_cost)
+ if (this_rq->avg_idle < sysctl_sched_migration_cost ||
+ spin_is_contended(&this_rq->lock))
goto out;
On Thu, 2014-04-24 at 09:15 +0200, Peter Zijlstra wrote:
> On Wed, Apr 23, 2014 at 06:30:35PM -0700, Jason Low wrote:
> > It was found that when running some workloads (such as AIM7) on large systems
> > with many cores, CPUs do not remain idle for long. Thus, tasks can
> > wake/get enqueued while doing idle balancing.
> >
> > In this patch, while traversing the domains in idle balance, in addition to
> > checking for pulled_task, we add an extra check for this_rq->nr_running for
> > determining if we should stop searching for tasks to pull. If there are
> > runnable tasks on this rq, then we will stop traversing the domains. This
> > reduces the chance that idle balance delays a task from running.
> >
> > This patch resulted in approximately a 6% performance improvement when
> > running a Java Server workload on an 8 socket machine.
> >
> > Signed-off-by: Jason Low <[email protected]>
> > ---
> > kernel/sched/fair.c | 8 ++++++--
> > 1 files changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 3e3ffb8..232518c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6689,7 +6689,6 @@ static int idle_balance(struct rq *this_rq)
> > if (sd->flags & SD_BALANCE_NEWIDLE) {
> > t0 = sched_clock_cpu(this_cpu);
> >
> > - /* If we've pulled tasks over stop searching: */
> > pulled_task = load_balance(this_cpu, this_rq,
> > sd, CPU_NEWLY_IDLE,
> > &continue_balancing);
> > @@ -6704,7 +6703,12 @@ static int idle_balance(struct rq *this_rq)
> > interval = msecs_to_jiffies(sd->balance_interval);
> > if (time_after(next_balance, sd->last_balance + interval))
> > next_balance = sd->last_balance + interval;
> > - if (pulled_task)
> > +
> > + /*
> > + * Stop searching for tasks to pull if there are
> > + * now runnable tasks on this rq.
> > + */
> > + if (pulled_task || this_rq->nr_running > 0)
> > break;
> > }
> > rcu_read_unlock();
>
> There's also the CONFIG_PREEMPT bit in move_tasks(); does making that
> unconditional also help such a workload?
If the below patch is what you were referring to, I believe this
can help too. This was also something that I was testing out before
we went with those patches that compare avg_idle with idle balance
cost. I recall seeing somewhere around a +7% performance improvement
in at least 1 of the AIM7 workloads. I can do some more testing
with this.
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43232b8..d069054 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5304,7 +5304,6 @@ static int move_tasks(struct lb_env *env)
pulled++;
env->imbalance -= load;
-#ifdef CONFIG_PREEMPT
/*
* NEWIDLE balancing is a source of latency, so preemptible
* kernels will stop after the first task is pulled to minimize
@@ -5312,7 +5311,6 @@ static int move_tasks(struct lb_env *env)
*/
if (env->idle == CPU_NEWLY_IDLE)
break;
-#endif
/*
* We only want to steal up to the prescribed amount of
On Thu, Apr 24, 2014 at 09:43:09AM -0700, Jason Low wrote:
> If the below patch is what you were referring to, I believe this
> can help too. This was also something that I was testing out before
> we went with those patches that compare avg_idle with idle balance
> cost. I recall seeing somewhere around a +7% performance improvement
> in at least 1 of the AIM7 workloads. I can do some more testing
> with this.
Yes, exactly that.
I can't remember the details, but I suspect we feared the less aggressive
idle balance due to that patch (it will only pull a single task, instead
of multiple) would cause more idle_balance invocations and thereby
decrease throughput.
So I suppose something with _many_ bursty threads which leads to severe
inequalities would be the workload to trigger that.
Not sure we've ever seen that.. maybe Mike remembers, he seems to have a
head for such details.
On Thu, 2014-04-24 at 14:44 +0200, Peter Zijlstra wrote:
> On Thu, Apr 24, 2014 at 02:04:15PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 24, 2014 at 03:44:47PM +0530, Preeti U Murthy wrote:
> > > What about the update of the next_balance field? See the code snippet below.
> > > This will also be skipped as a consequence of commit e5fc6611, right?
> > >
> > > if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
> > > /*
> > > * We are going idle. next_balance may be set based on
> > > * a busy processor. So reset next_balance.
> > > */
> > > this_rq->next_balance = next_balance;
> > > }
> > >
> > > Also the comment in the above snippet does not look right to me.
> > > It says "we are going idle" but the condition checks for pulled_task.
> >
> > Yeah, that's odd indeed. Ingo did that back in dd41f596cda0d, I suspect
> > it's an error, but..
> >
> > So I think that should become !pulled_task || time_after().
>
> Hmm, no, I missed that the for_each_domain() loop pushes next_balance
> ahead if it did a balance on the domain.
>
> So it actually makes sense and the comment is wrong, but then you're
> also right that we want to not skip that.
Hi Preeti, Peter,
So I thought that the original rationale (commit 1bd77f2d) behind
updating rq->next_balance in idle_balance() is that, if we are going
idle (!pulled_task), we want to ensure that the next_balance gets
calculated without the busy_factor.
If the rq is busy, then rq->next_balance gets updated based on
sd->interval * busy_factor. However, when the rq goes from "busy"
to idle, rq->next_balance might still have been calculated under
the assumption that the rq is busy. Thus, if we are going idle, we
would then properly update next_balance without the busy factor
if we update when !pulled_task.
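For reference, the busy_factor scaling in rebalance_domains() being referred to:

	interval = sd->balance_interval;
	if (idle != CPU_IDLE)
		interval *= sd->busy_factor;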
On Thu, Apr 24, 2014 at 09:43:09AM -0700, Jason Low wrote:
> I recall seeing somewhere around a +7% performance improvement
> in at least 1 of the AIM7 workloads. I can do some more testing
> with this.
But yeah, please test more than 1 workload; it's fairly typical for one
workload to win and another to completely tank :-/
On Thu, Apr 24, 2014 at 09:53:37AM -0700, Jason Low wrote:
>
> So I thought that the original rationale (commit 1bd77f2d) behind
> updating rq->next_balance in idle_balance() is that, if we are going
> idle (!pulled_task), we want to ensure that the next_balance gets
> calculated without the busy_factor.
>
> If the rq is busy, then rq->next_balance gets updated based on
> sd->interval * busy_factor. However, when the rq goes from "busy"
> to idle, rq->next_balance might still have been calculated under
> the assumption that the rq is busy. Thus, if we are going idle, we
> would then properly update next_balance without the busy factor
> if we update when !pulled_task.
>
It's late here and I'm confused!
So the for_each_domain() loop calculates a new next_balance based on
->balance_interval (which has that busy_factor on, right).
But if it fails to pull anything, we'll (potentially) iterate the entire
tree up to the largest domain; and supposedly set next_balance to the
largest possible interval.
So when we go from busy to idle (!pulled_task), we actually set
->next_balance to the longest interval. Whereas the commit you
referenced says it sets it to a shorter while.
Not seeing it.
So the code as modified by Ingo in one of the initial CFS commits will
move the ->next_balance time ahead if the balance succeeded
(pulled_task), thereby reflecting that we are busy and we just did a
balance so we need not do one again soon. (we might want to re-think
this if we really make the idle balance only pull 1 task max).
Of course, I've now gone over this code 3 times today, so I'm terminally
confused.
On Thu, Apr 24, 2014 at 07:14:53PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 24, 2014 at 09:53:37AM -0700, Jason Low wrote:
> >
> > So I thought that the original rationale (commit 1bd77f2d) behind
> > updating rq->next_balance in idle_balance() is that, if we are going
> > idle (!pulled_task), we want to ensure that the next_balance gets
> > calculated without the busy_factor.
> >
> > If the rq is busy, then rq->next_balance gets updated based on
> > sd->interval * busy_factor. However, when the rq goes from "busy"
> > to idle, rq->next_balance might still have been calculated under
> > the assumption that the rq is busy. Thus, if we are going idle, we
> > would then properly update next_balance without the busy factor
> > if we update when !pulled_task.
> >
>
> It's late here and I'm confused!
>
> So the for_each_domain() loop calculates a new next_balance based on
> ->balance_interval (which has that busy_factor on, right).
Not right, ->balance_interval is the base interval. rebalance_domains()
doesn't update it.
On Thu, 2014-04-24 at 09:37 -0700, Jason Low wrote:
> On Thu, 2014-04-24 at 04:51 +0200, Mike Galbraith wrote:
> > On Wed, 2014-04-23 at 18:30 -0700, Jason Low wrote:
> > > It was found that when running some workloads (such as AIM7) on large systems
> > > with many cores, CPUs do not remain idle for long. Thus, tasks can
> > > wake/get enqueued while doing idle balancing.
> > >
> > > In this patch, while traversing the domains in idle balance, in addition to
> > > checking for pulled_task, we add an extra check for this_rq->nr_running for
> > > determining if we should stop searching for tasks to pull. If there are
> > > runnable tasks on this rq, then we will stop traversing the domains. This
> > > reduces the chance that idle balance delays a task from running.
> > >
> > > This patch resulted in approximately a 6% performance improvement when
> > > running a Java Server workload on an 8 socket machine.
> >
> > Checking rq->lock for contention before ever going to idle balancing as
> > well should give you a bit more. No need to run around looking for work
> > that's trying to arrive. By not going there, perhaps stacking tasks,
> > you may head off a future bounce as well.
>
> Are you thinking of something along the lines of this:
>
> @@ -6658,7 +6658,8 @@ static int idle_balance(struct rq *this_rq)
> */
> this_rq->idle_stamp = rq_clock(this_rq);
>
> - if (this_rq->avg_idle < sysctl_sched_migration_cost)
> + if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> + spin_is_contended(&this_rq->lock))
> goto out;
>
More or less, yes, that's what I was thinking, because the wakeup you
are watching for, and encountering in reality, could very well be that
very one. But as noted, my reaction to that wakeup couldn't possibly
have been further off the mark.
-Mike
On Thu, 2014-04-24 at 19:14 +0200, Peter Zijlstra wrote:
> On Thu, Apr 24, 2014 at 09:53:37AM -0700, Jason Low wrote:
> >
> > So I thought that the original rationale (commit 1bd77f2d) behind
> > updating rq->next_balance in idle_balance() is that, if we are going
> > idle (!pulled_task), we want to ensure that the next_balance gets
> > calculated without the busy_factor.
> >
> > If the rq is busy, then rq->next_balance gets updated based on
> > sd->interval * busy_factor. However, when the rq goes from "busy"
> > to idle, rq->next_balance might still have been calculated under
> > the assumption that the rq is busy. Thus, if we are going idle, we
> > would then properly update next_balance without the busy factor
> > if we update when !pulled_task.
> >
>
> It's late here and I'm confused!
>
> So the for_each_domain() loop calculates a new next_balance based on
> ->balance_interval (which has that busy_factor on, right).
>
> But if it fails to pull anything, we'll (potentially) iterate the entire
> tree up to the largest domain; and supposedly set next_balance to the
> largest possible interval.
>
> So when we go from busy to idle (!pulled_task), we actually set
> ->next_balance to the longest interval. Whereas the commit you
> referenced says it sets it to a shorter while.
>
> Not seeing it.
So this is the way I understand that code:
In rebalance_domains(), next_balance is supposed to be set to the
minimum of all sd->last_balance + interval so that we properly call
into rebalance_domains() if one of the domains is due for a balance.
In the domain traversals:
if (time_after(next_balance, sd->last_balance + interval))
next_balance = sd->last_balance + interval;
we update next_balance to a new value if the current next_balance
is after, and we only update next_balance to a smaller value.
In rebalance_domains, we have code:
interval = sd->balance_interval;
if (idle != CPU_IDLE)
interval *= sd->busy_factor;
...
if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
...
rq->next_balance = next_balance;
In the CPU_IDLE case, interval would not include the busy factor,
whereas in the !CPU_IDLE case, we multiply the interval by the
sd->busy_factor.
So as an example, if a CPU is not idle and we run this:
rebalance_domains()
interval = 1 ms;
if (idle != CPU_IDLE)
interval *= 64;
next_balance = sd->last_balance + 64 ms
rq->next_balance = next_balance
The rq->next_balance is set to a large value since the CPU is not idle.
Then, let's say the CPU goes idle 1 ms later. The
rq->next_balance can be up to 63 ms later, because we computed
it when the CPU is not idle. Now that we are going idle,
we would have to wait a long time for the next balance.
So I believe that the initial reason why rq->next_balance was
updated in idle_balance is that if the CPU is in the process
of going idle (!pulled_task in idle_balance()), we can reset the
rq->next_balance based on the interval = 1 ms, as opposed to
having it remain up to 64 ms later (in idle_balance(), interval
doesn't get multiplied by sd->busy_factor).
On Thu, 2014-04-24 at 18:52 +0200, Peter Zijlstra wrote:
> On Thu, Apr 24, 2014 at 09:43:09AM -0700, Jason Low wrote:
> > If the below patch is what you were referring to, I believe this
> > can help too. This was also something that I was testing out before
> > we went with those patches that compare avg_idle with idle balance
> > cost. I recall seeing somewhere around a +7% performance improvement
> > in at least 1 of the AIM7 workloads. I can do some more testing
> > with this.
>
> Yes, exactly that.
>
> I can't remember the details, but I suspect we feared the less aggressive
> idle balance due to that patch (it will only pull a single task, instead
> of multiple) would cause more idle_balance invocations and thereby
> decrease throughput.
>
> So I suppose something with _many_ bursty threads which leads to severe
> inequalities would be the workload to trigger that.
>
> Not sure we've ever seen that.. maybe Mike remembers, he seems to have a
> head for such details.
Okay, so running the AIM7 fserver workload, I didn't see any noticeable
performance gains with having move_tasks() pull at most one task. The
+7% performance improvement that I saw was without the idle balance cost
patches. I think that with those idle balance cost patches, there isn't
as much benefit from this patch, and allowing more than 1 task to be
pulled in move_tasks(), like we have now, may be the best option.
On Thu, 2014-04-24 at 18:52 +0200, Peter Zijlstra wrote:
> On Thu, Apr 24, 2014 at 09:43:09AM -0700, Jason Low wrote:
> > If the below patch is what you were referring to, I believe this
> > can help too. This was also something that I was testing out before
> > we went with those patches that compare avg_idle with idle balance
> > cost. I recall seeing somewhere around a +7% performance improvement
> > in at least 1 of the AIM7 workloads. I can do some more testing
> > with this.
>
> Yes, exactly that.
>
> I can't remember the details, but I suspect we feared the less aggressive
> idle balance due to that patch (it will only pull a single task, instead
> of multiple) would cause more idle_balance invocations and thereby
> decrease throughput.
>
> So I suppose something with _many_ bursty threads which leads to severe
> inequalities would be the workload to trigger that.
>
> Not sure we've ever seen that.. maybe Mike remembers, he seems to have a
> head for such details.
I don't recall ever seeing such.
-Mike
On Fri, 2014-04-25 at 04:45 +0200, Mike Galbraith wrote:
> On Thu, 2014-04-24 at 18:52 +0200, Peter Zijlstra wrote:
> > On Thu, Apr 24, 2014 at 09:43:09AM -0700, Jason Low wrote:
> > > If the below patch is what you were referring to, I believe this
> > > can help too. This was also something that I was testing out before
> > > we went with those patches that compare avg_idle with idle balance
> > > cost. I recall seeing somewhere around a +7% performance improvement
> > > in at least 1 of the AIM7 workloads. I can do some more testing
> > > with this.
> >
> > Yes, exactly that.
> >
> > I can't remember the details, but I suspect we feared the less aggressive
> > idle balance due to that patch (it will only pull a single task, instead
> > of multiple) would cause more idle_balance invocations and thereby
> > decrease throughput.
> >
> > So I suppose something with _many_ bursty threads which leads to severe
> > inequalities would be the workload to trigger that.
> >
> > Not sure we've ever seen that.. maybe Mike remembers, he seems to have a
> > head for such details.
>
> I don't recall ever seeing such.
Hmmm, could commit 1b9508f6831e (sched: Rate-limit newidle) be related
to what Peter's referring to? The patch mentioned that the rate-limit
benefited "sysbench oltp".
On 04/24/2014 10:44 PM, Peter Zijlstra wrote:
> On Thu, Apr 24, 2014 at 09:53:37AM -0700, Jason Low wrote:
>>
>> So I thought that the original rationale (commit 1bd77f2d) behind
>> updating rq->next_balance in idle_balance() is that, if we are going
>> idle (!pulled_task), we want to ensure that the next_balance gets
>> calculated without the busy_factor.
>>
>> If the rq is busy, then rq->next_balance gets updated based on
>> sd->interval * busy_factor. However, when the rq goes from "busy"
>> to idle, rq->next_balance might still have been calculated under
>> the assumption that the rq is busy. Thus, if we are going idle, we
>> would then properly update next_balance without the busy factor
>> if we update when !pulled_task.
>>
>
> It's late here and I'm confused!
>
> So the for_each_domain() loop calculates a new next_balance based on
> ->balance_interval (which has that busy_factor on, right).
>
> But if it fails to pull anything, we'll (potentially) iterate the entire
> tree up to the largest domain; and supposedly set next_balance to the
> largest possible interval.
*to the smallest possible interval.
>
> So when we go from busy to idle (!pulled_task), we actually set
> ->next_balance to the longest interval. Whereas the commit you
> referenced says it sets it to a shorter while.
We will set next_balance to the earliest balance time among the sched
domains iterated.
>
> Not seeing it.
>
> So the code as modified by Ingo in one of the initial CFS commits will
> move the ->next_balance time ahead if the balance succeeded
> (pulled_task), thereby reflecting that we are busy and we just did a
> balance so we need not do one again soon. (we might want to re-think
> this if we really make the idle balance only pull 1 task max).
>
> Of course, I've now gone over this code 3 times today, so I'm terminally
> confused.
I am unable to understand how the updating of rq->next_balance should depend
solely on the pulled_task parameter (I am not considering the expiry of
rq->next_balance here).
True that we will need to override the busy_factor in rq->next_balance
if we do not pull any tasks and go to idle. Besides that, however, we will
probably need to override rq->next_balance irrespective of whether we
pull any tasks.
Let's look at what happens to the sd->balance_interval in load_balance().
If we pull tasks then it is set to min_interval. If active balance
occurs or if tasks are pinned then we push the interval farther away. In
the former case, where it is set to min_interval, pulled_task > 0; in
the latter case, especially the pinned case, pulled_task = 0 (not sure
about the active balance case).
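For reference, the interval adjustments in load_balance() look roughly like this (a paraphrased sketch, not the exact source):

	if (likely(!active_balance)) {
		/* We were unbalanced, so reset the balancing interval */
		sd->balance_interval = sd->min_interval;
	} else {
		/* We've begun active balancing, so back off */
		if (sd->balance_interval < sd->max_interval)
			sd->balance_interval *= 2;
	}

and, in the pinned path:

out_one_pinned:
	/* tune up the balancing interval */
	if (((env.flags & LBF_ALL_PINNED) &&
	     sd->balance_interval < MAX_PINNED_INTERVAL) ||
	    (sd->balance_interval < sd->max_interval))
		sd->balance_interval *= 2;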
If, after this modification of sd->balance_interval,
rq->next_balance > sd->last_balance + sd->balance_interval, then
shouldn't we be resetting rq->next_balance? And if we should, then the
dependence on pulled_task is not justified, is it? All this assuming
that rq->next_balance should always reflect the minimum value of
sd->next_balance among the sched domains of which the rq is a part.
Regards
Preeti U Murthy
>
Hi Jason,
On 04/25/2014 03:48 AM, Jason Low wrote:
> On Thu, 2014-04-24 at 19:14 +0200, Peter Zijlstra wrote:
>> On Thu, Apr 24, 2014 at 09:53:37AM -0700, Jason Low wrote:
>>>
>>> So I thought that the original rationale (commit 1bd77f2d) behind
>>> updating rq->next_balance in idle_balance() is that, if we are going
>>> idle (!pulled_task), we want to ensure that the next_balance gets
>>> calculated without the busy_factor.
>>>
>>> If the rq is busy, then rq->next_balance gets updated based on
>>> sd->interval * busy_factor. However, when the rq goes from "busy"
>>> to idle, rq->next_balance might still have been calculated under
>>> the assumption that the rq is busy. Thus, if we are going idle, we
>>> would then properly update next_balance without the busy factor
>>> if we update when !pulled_task.
>>>
>>
>> It's late here and I'm confused!
>>
>> So the for_each_domain() loop calculates a new next_balance based on
>> ->balance_interval (which has that busy_factor on, right).
>>
>> But if it fails to pull anything, we'll (potentially) iterate the entire
>> tree up to the largest domain; and supposedly set next_balance to the
>> largest possible interval.
>>
>> So when we go from busy to idle (!pulled_task), we actually set
>> ->next_balance to the longest interval. Whereas the commit you
>> referenced says it sets it to a shorter while.
>>
>> Not seeing it.
>
> So this is the way I understand that code:
>
> In rebalance_domains(), next_balance is supposed to be set to the
> minimum of all sd->last_balance + interval so that we properly call
> into rebalance_domains() if one of the domains is due for a balance.
>
> In the domain traversals:
>
> if (time_after(next_balance, sd->last_balance + interval))
> next_balance = sd->last_balance + interval;
>
> we update next_balance to a new value if the current next_balance
> is after, and we only update next_balance to a smaller value.
>
> In rebalance_domains, we have code:
>
> interval = sd->balance_interval;
> if (idle != CPU_IDLE)
> interval *= sd->busy_factor;
>
> ...
>
> if (time_after(next_balance, sd->last_balance + interval)) {
> next_balance = sd->last_balance + interval;
>
> ...
>
> rq->next_balance = next_balance;
>
> In the CPU_IDLE case, interval would not include the busy factor,
> whereas in the !CPU_IDLE case, we multiply the interval by the
> sd->busy_factor.
>
> So as an example, if a CPU is not idle and we run this:
>
> rebalance_domains()
> interval = 1 ms;
> if (idle != CPU_IDLE)
> interval *= 64;
>
> next_balance = sd->last_balance + 64 ms
>
> rq->next_balance = next_balance
>
> The rq->next_balance is set to a large value since the CPU is not idle.
>
> Then, let's say the CPU goes idle 1 ms later. The
> rq->next_balance can be up to 63 ms later, because we computed
> it when the CPU is not idle. Now that we are going idle,
> we would have to wait a long time for the next balance.
>
> So I believe that the initial reason why rq->next_balance was
> updated in idle_balance is that if the CPU is in the process
> of going idle (!pulled_task in idle_balance()), we can reset the
> rq->next_balance based on the interval = 1 ms, as opposed to
> having it remain up to 64 ms later (in idle_balance(), interval
> doesn't get multiplied by sd->busy_factor).
I agree with this. However, I am concerned with an additional point that
I have mentioned in my reply to Peter's mail on this thread.
Should we verify if the rq->next_balance update is independent of
pulled_task? sd->balance_interval is changed during load_balance(), and
rq->next_balance should perhaps take that into account.
Regards
Preeti U Murthy
>
>
>
On Thu, 2014-04-24 at 20:33 -0700, Jason Low wrote:
> On Fri, 2014-04-25 at 04:45 +0200, Mike Galbraith wrote:
> > On Thu, 2014-04-24 at 18:52 +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 24, 2014 at 09:43:09AM -0700, Jason Low wrote:
> > > > If the below patch is what you were referring to, I believe this
> > > > can help too. This was also something that I was testing out before
> > > > we went with those patches that compare avg_idle with idle balance
> > > > cost. I recall seeing somewhere around a +7% performance improvement
> > > > in at least 1 of the AIM7 workloads. I can do some more testing
> > > > with this.
> > >
> > > Yes, exactly that.
> > >
> > > I can't remember the details, but I suspect we feared the less aggressive
> > > idle balance due to that patch (it will only pull a single task, instead
> > > of multiple) would cause more idle_balance invocations and thereby
> > > decrease throughput.
> > >
> > > So I suppose something with _many_ bursty threads which leads to severe
> > > inequalities would be the workload to trigger that.
> > >
> > > Not sure we've ever seen that.. maybe Mike remembers, he seems to have a
> > > head for such details.
> >
> > I don't recall ever seeing such.
>
> Hmmm, could commit 1b9508f6831e (sched: Rate-limit newidle) be related
> to what Peter's referring to? The patch mentioned that the rate-limit
> benefited "sysbench oltp".
That's the opposite; it was to prevent way too much idle balancing, which
was demolishing fast-mover loads.
-Mike
On 04/24/2014 07:00 AM, Jason Low wrote:
> Also initialize the per-sd variables for newidle load balancing
> in sd_numa_init().
>
> Signed-off-by: Jason Low <[email protected]>
> ---
> kernel/sched/core.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3f12533..b1c6cb9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6030,6 +6030,8 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
> ,
> .last_balance = jiffies,
> .balance_interval = sd_weight,
> + .max_newidle_lb_cost = 0,
> + .next_decay_max_lb_cost = jiffies,
> };
> SD_INIT_NAME(sd, NUMA);
> sd->private = &tl->data;
>
Reviewed-by: Preeti U Murthy <[email protected]>
On Fri, 2014-04-25 at 10:42 +0530, Preeti U Murthy wrote:
> I agree with this. However, I am concerned with an additional point that
> I have mentioned in my reply to Peter's mail on this thread.
>
> Should we verify if the rq->next_balance update is independent of
> pulled_task? sd->balance_interval is changed during load_balance(), and
> rq->next_balance should perhaps take that into account.
Hi Preeti,
I agree that we may want to consider having the rq->next_balance update be
independent of pulled_task. As you mentioned, load_balance() can modify
the balance_interval.
There are a few things I'm wondering if we would need to also add then:
1. In the case that this_rq->avg_idle < sysctl_sched_migration_cost, we
would need to also traverse the domains to properly compute
next_balance (without the sd->busy_factor) as we would be going idle.
Otherwise, next_balance could get set to jiffies + HZ while the
CPU goes idle.
2. In the domain traversal, when we pulled a task, we might want to
multiply interval by sd->busy_factor because the rq will remain busy.
3. If this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost, then we
may still want to compute next_balance, rather than simply break out
of the sched domain traversal loop. This is also to avoid having
the next_balance = jiffies + HZ when a domain should rebalance
less than 1 second later.
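A sketch of (3), assuming an update_next_balance() helper that keeps the earliest sd deadline:

	if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
		/* Record this domain's deadline before bailing out */
		update_next_balance(sd, &next_balance);
		break;
	}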
On Fri, 2014-04-25 at 00:13 -0700, Jason Low wrote:
> On Fri, 2014-04-25 at 10:42 +0530, Preeti U Murthy wrote:
> > I agree with this. However, I am concerned with an additional point that
> > I have mentioned in my reply to Peter's mail on this thread.
> >
> > Should we verify if the rq->next_balance update is independent of
> > pulled_task? sd->balance_interval is changed during load_balance(), and
> > rq->next_balance should perhaps take that into account.
>
> Hi Preeti,
>
> I agree that we may want to consider having the rq->next_balance update be
> independent of pulled_task. As you mentioned, load_balance() can modify
> the balance_interval.
>
> There are a few things I'm wondering if we would need to also add then:
>
> 1. In the case that this_rq->avg_idle < sysctl_sched_migration_cost, we
> would need to also traverse the domains to properly compute
> next_balance (without the sd->busy_factor) as we would be going idle.
> Otherwise, next_balance could get set to jiffies + HZ while the
> CPU goes idle.
Avoiding high frequency cache misses and cycle wastage on micro-idle was
what avg-idle was about. If you're going to traverse anyway, or have a
better way to not do that too frequently, you can just nuke it.
-Mike
On Fri, Apr 25, 2014 at 10:38:43AM +0530, Preeti U Murthy wrote:
> > But if it fails to pull anything, we'll (potentially) iterate the entire
> > tree up to the largest domain; and supposedly set next_balance to the
> > largest possible interval.
>
> *to the smallest possible interval.
> >
> > So when we go from busy to idle (!pulled_task), we actually set
> > ->next_balance to the longest interval. Whereas the commit you
> > referenced says it sets it to a shorter while.
>
> We will set next_balance to the earliest balance time among the sched
> domains iterated.
Duh. I read that time_after the wrong way around.. silly me :-)
> I am unable to understand how the updating of rq->next_balance should depend
> solely on the pulled_task parameter (I am not considering the expiry of
> rq->next_balance here).
>
> True that we will need to override the busy_factor in rq->next_balance
> if we do not pull any tasks and go to idle. Besides that, however, we will
> probably need to override rq->next_balance irrespective of whether we
> pull any tasks.
>
> Let's look at what happens to the sd->balance_interval in load_balance().
> If we pull tasks then it is set to min_interval. If active balance
> occurs or if tasks are pinned then we push the interval farther away. In
> the former case, where it is set to min_interval, pulled_task > 0; in
> the latter case, especially the pinned case, pulled_task = 0 (not sure
> about the active balance case).
>
> If, after this modification of sd->balance_interval,
> rq->next_balance > sd->last_balance + sd->balance_interval, then
> shouldn't we be resetting rq->next_balance? And if we should, then the
> dependence on pulled_task is not justified, is it? All this assuming
> that rq->next_balance should always reflect the minimum value of
> sd->next_balance among the sched domains of which the rq is a part.
So how about something like this? It tracks the minimal next_balance for
whatever domains we do visit, or the very bottom domain in the
insta-bail case (but yeah, Mike's got a point.. we could think of
removing that).
The thought is that since the higher domains have larger intervals
anyway, it's less likely they will move the timer back often, so skipping
them is not that big a deal.
I also considered tracking next_busy_balance and using that when
pulled_task, but I decided against it (after I'd actually written the
code to do so). We were on the brink of going idle, that's really not
busy.
---
kernel/sched/fair.c | 58 ++++++++++++++++++++++++++++++++---------------------
1 file changed, 35 insertions(+), 23 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43232b8bacde..2ac1ad3de6c6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6645,17 +6645,29 @@ static int load_balance(int this_cpu, struct rq *this_rq,
return ld_moved;
}
+static inline void
+update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
+{
+ unsigned long interval, next;
+
+ interval = msecs_to_jiffies(sd->balance_interval);
+ next = sd->last_balance + interval;
+
+ if (time_after(*next_balance, next))
+ *next_balance = next;
+}
+
/*
* idle_balance is called by schedule() if this_cpu is about to become
* idle. Attempts to pull tasks from other CPUs.
*/
static int idle_balance(struct rq *this_rq)
{
+ unsigned long next_balance = jiffies + HZ;
+ int this_cpu = this_rq->cpu;
struct sched_domain *sd;
int pulled_task = 0;
- unsigned long next_balance = jiffies + HZ;
u64 curr_cost = 0;
- int this_cpu = this_rq->cpu;
idle_enter_fair(this_rq);
/*
@@ -6664,8 +6676,14 @@ static int idle_balance(struct rq *this_rq)
*/
this_rq->idle_stamp = rq_clock(this_rq);
- if (this_rq->avg_idle < sysctl_sched_migration_cost)
+ if (this_rq->avg_idle < sysctl_sched_migration_cost) {
+ rcu_read_lock();
+ update_next_balance(
+ rcu_dereference_check_sched_domain(this_rq->sd),
+ &next_balance);
+ rcu_read_unlock();
goto out;
+ }
/*
* Drop the rq->lock, but keep IRQ/preempt disabled.
@@ -6675,15 +6693,16 @@ static int idle_balance(struct rq *this_rq)
update_blocked_averages(this_cpu);
rcu_read_lock();
for_each_domain(this_cpu, sd) {
- unsigned long interval;
int continue_balancing = 1;
u64 t0, domain_cost;
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
- if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
+ if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
+ update_next_balance(sd, &next_balance);
break;
+ }
if (sd->flags & SD_BALANCE_NEWIDLE) {
t0 = sched_clock_cpu(this_cpu);
@@ -6700,9 +6719,7 @@ static int idle_balance(struct rq *this_rq)
curr_cost += domain_cost;
}
- interval = msecs_to_jiffies(sd->balance_interval);
- if (time_after(next_balance, sd->last_balance + interval))
- next_balance = sd->last_balance + interval;
+ update_next_balance(sd, &next_balance);
if (pulled_task)
break;
}
@@ -6710,27 +6727,22 @@ static int idle_balance(struct rq *this_rq)
raw_spin_lock(&this_rq->lock);
+ if (curr_cost > this_rq->max_idle_balance_cost)
+ this_rq->max_idle_balance_cost = curr_cost;
+
/*
- * While browsing the domains, we released the rq lock.
- * A task could have be enqueued in the meantime
+ * While browsing the domains, we released the rq lock, a task could
+ * have been enqueued in the meantime. Since we're not going idle,
+ * pretend we pulled a task.
*/
- if (this_rq->cfs.h_nr_running && !pulled_task) {
+ if (this_rq->cfs.h_nr_running && !pulled_task)
pulled_task = 1;
- goto out;
- }
- if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
- /*
- * We are going idle. next_balance may be set based on
- * a busy processor. So reset next_balance.
- */
+out:
+ /* Move the next balance forward */
+ if (time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;
- }
-
- if (curr_cost > this_rq->max_idle_balance_cost)
- this_rq->max_idle_balance_cost = curr_cost;
-out:
/* Is there a task of a high priority class? */
if (this_rq->nr_running != this_rq->cfs.h_nr_running)
pulled_task = -1;
On Fri, 2014-04-25 at 09:58 +0200, Mike Galbraith wrote:
> On Fri, 2014-04-25 at 00:13 -0700, Jason Low wrote:
> > On Fri, 2014-04-25 at 10:42 +0530, Preeti U Murthy wrote:
> > > I agree with this. However, I am concerned with an additional point that
> > > I have mentioned in my reply to Peter's mail on this thread.
> > >
> > > Should we verify if the rq->next_balance update is independent of
> > > pulled_task? sd->balance_interval is changed during load_balance(), and
> > > rq->next_balance should perhaps take that into account.
> >
> > Hi Preeti,
> >
> > I agree that we may want to consider having the rq->next_balance update be
> > independent of pulled_task. As you mentioned, load_balance() can modify
> > the balance_interval.
> >
> > There are a few things I'm wondering if we would need to also add then:
> >
> > 1. In the case that this_rq->avg_idle < sysctl_sched_migration_cost, we
> > would need to also traverse the domains to properly compute
> > next_balance (without the sd->busy_factor) as we would be going idle.
> > Otherwise, next_balance could get set to jiffies + HZ while the
> > CPU goes idle.
>
> Avoiding high frequency cache misses and cycle wastage on micro-idle was
> what avg-idle was about. If you're going to traverse anyway, or have a
> better way to not do that too frequently, you can just nuke it.
Yeah, we already compare avg-idle with the per-domain costs in that
function. I'll run some performance tests with the first check removed,
as such a change can potentially have a (+/-) impact on performance.
Thanks,
Jason
On Fri, 2014-04-25 at 11:43 +0200, Peter Zijlstra wrote:
> So how about something like this? It tracks the minimal next_balance for
> whatever domains we do visit, or the very bottom domain in the
> insta-bail case (but yeah, Mike's got a point.. we could think of
> removing that).
>
> The thought is that since the higher domains have larger intervals
> anyway, it's less likely they will move the timer back often, so skipping
> them is not that big a deal.
>
> I also considered tracking next_busy_balance and using that when
> pulled_task, but I decided against it (after I'd actually written the
> code to do so). We were on the brink of going idle, that's really not
> busy.
Preeti mentioned that sd->balance_interval is changed during load_balance().
Should we also consider updating the interval in rebalance_domains() after
calling load_balance(), and also taking max_load_balance_interval into account
in the updates for next_balance in idle_balance()?
If so, how about something like the below change, which also introduces
get_sd_balance_interval() to obtain the sd's balance interval, and have both
update_next_balance() and rebalance_domains() use that function.
Signed-off-by: Jason Low <[email protected]>
---
kernel/sched/fair.c | 81 ++++++++++++++++++++++++++++++++-------------------
1 files changed, 51 insertions(+), 30 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43232b8..09c546c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6645,27 +6645,59 @@ out:
return ld_moved;
}
+static inline unsigned long get_sd_balance_interval(struct sched_domain *sd, int busy)
+{
+ unsigned long interval = sd->balance_interval;
+
+ if (busy)
+ interval *= sd->busy_factor;
+
+ /* scale ms to jiffies */
+ interval = msecs_to_jiffies(interval);
+ interval = clamp(interval, 1UL, max_load_balance_interval);
+
+ return interval;
+}
+
+static inline void
+update_next_balance(struct sched_domain *sd, int busy, unsigned long *next_balance)
+{
+ unsigned long interval, next;
+
+ interval = get_sd_balance_interval(sd, busy);
+ next = sd->last_balance + interval;
+
+ if (time_after(*next_balance, next))
+ *next_balance = next;
+}
+
/*
* idle_balance is called by schedule() if this_cpu is about to become
* idle. Attempts to pull tasks from other CPUs.
*/
static int idle_balance(struct rq *this_rq)
{
+ unsigned long next_balance = jiffies + HZ;
+ int this_cpu = this_rq->cpu;
struct sched_domain *sd;
int pulled_task = 0;
- unsigned long next_balance = jiffies + HZ;
u64 curr_cost = 0;
- int this_cpu = this_rq->cpu;
idle_enter_fair(this_rq);
+
/*
* We must set idle_stamp _before_ calling idle_balance(), such that we
* measure the duration of idle_balance() as idle time.
*/
this_rq->idle_stamp = rq_clock(this_rq);
- if (this_rq->avg_idle < sysctl_sched_migration_cost)
+ if (this_rq->avg_idle < sysctl_sched_migration_cost) {
+ rcu_read_lock();
+ sd = rcu_dereference_check_sched_domain(this_rq->sd);
+ update_next_balance(sd, 0, &next_balance);
+ rcu_read_unlock();
goto out;
+ }
/*
* Drop the rq->lock, but keep IRQ/preempt disabled.
@@ -6675,15 +6707,16 @@ static int idle_balance(struct rq *this_rq)
update_blocked_averages(this_cpu);
rcu_read_lock();
for_each_domain(this_cpu, sd) {
- unsigned long interval;
int continue_balancing = 1;
u64 t0, domain_cost;
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
- if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
+ if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
+ update_next_balance(sd, 0, &next_balance);
break;
+ }
if (sd->flags & SD_BALANCE_NEWIDLE) {
t0 = sched_clock_cpu(this_cpu);
@@ -6700,9 +6733,7 @@ static int idle_balance(struct rq *this_rq)
curr_cost += domain_cost;
}
- interval = msecs_to_jiffies(sd->balance_interval);
- if (time_after(next_balance, sd->last_balance + interval))
- next_balance = sd->last_balance + interval;
+ update_next_balance(sd, 0, &next_balance);
if (pulled_task)
break;
}
@@ -6710,27 +6741,22 @@ static int idle_balance(struct rq *this_rq)
raw_spin_lock(&this_rq->lock);
+ if (curr_cost > this_rq->max_idle_balance_cost)
+ this_rq->max_idle_balance_cost = curr_cost;
+
/*
- * While browsing the domains, we released the rq lock.
- * A task could have be enqueued in the meantime
+ * While browsing the domains, we released the rq lock, a task could
+ * have been enqueued in the meantime. Since we're not going idle,
+ * pretend we pulled a task.
*/
- if (this_rq->cfs.h_nr_running && !pulled_task) {
+ if (this_rq->cfs.h_nr_running && !pulled_task)
pulled_task = 1;
- goto out;
- }
- if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
- /*
- * We are going idle. next_balance may be set based on
- * a busy processor. So reset next_balance.
- */
+out:
+ /* Move the next balance forward */
+ if (time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;
- }
-
- if (curr_cost > this_rq->max_idle_balance_cost)
- this_rq->max_idle_balance_cost = curr_cost;
-out:
/* Is there a task of a high priority class? */
if (this_rq->nr_running != this_rq->cfs.h_nr_running)
pulled_task = -1;
@@ -7013,13 +7039,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
break;
}
- interval = sd->balance_interval;
- if (idle != CPU_IDLE)
- interval *= sd->busy_factor;
-
- /* scale ms to jiffies */
- interval = msecs_to_jiffies(interval);
- interval = clamp(interval, 1UL, max_load_balance_interval);
+ interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
need_serialize = sd->flags & SD_SERIALIZE;
@@ -7038,6 +7058,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
}
sd->last_balance = jiffies;
+ interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
}
if (need_serialize)
spin_unlock(&balancing);
--
1.7.1
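To make the interval math above concrete, here is a minimal userspace
sketch of get_sd_balance_interval() (assuming HZ=1000 so that
msecs_to_jiffies() is an identity mapping; the kernel derives
max_load_balance_interval from HZ and the number of online CPUs, so the
bound below is an arbitrary stand-in):

#include <stdio.h>

#define HZ 1000UL

/* Stand-in for the kernel's global clamp bound. */
static unsigned long max_load_balance_interval = HZ / 10 * 8;

static unsigned long msecs_to_jiffies(unsigned long ms)
{
	return ms * HZ / 1000;	/* exact when HZ == 1000 */
}

static unsigned long clamp_ul(unsigned long v, unsigned long lo,
			      unsigned long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

static unsigned long get_sd_balance_interval(unsigned long balance_interval,
					     unsigned long busy_factor,
					     int busy)
{
	unsigned long interval = balance_interval;	/* in ms */

	if (busy)
		interval *= busy_factor;

	interval = msecs_to_jiffies(interval);
	return clamp_ul(interval, 1UL, max_load_balance_interval);
}

int main(void)
{
	/*
	 * A domain with a 64ms interval and busy_factor 32: newidle
	 * balancing uses the raw 64 jiffies, while the busy case wants
	 * 2048 jiffies and gets clamped down to 800.
	 */
	printf("newidle: %lu jiffies\n", get_sd_balance_interval(64, 32, 0));
	printf("busy:    %lu jiffies\n", get_sd_balance_interval(64, 32, 1));
	return 0;
}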
On Fri, Apr 25, 2014 at 12:54:14PM -0700, Jason Low wrote:
> Preeti mentioned that sd->balance_interval is changed during load_balance().
> Should we also consider updating the interval in rebalance_domains() after
> calling load_balance(),
Yeah, that might make sense.
> and also taking max_load_balance_interval into account
> in the updates for next_balance in idle_balance()?
I was thinking that max_load_balance_interval thing was mostly about the
*busy_factor thing, but sure, can't hurt to be consistent and always do
it.
> If so, how about something like the below change, which also introduces
> get_sd_balance_interval() to obtain the sd's balance interval, and have both
> update_next_balance() and rebalance_domains() use that function.
Yes, that looks good.
Can you send it with a proper changelog?
Hi Jason, Peter,
The below patch looks good to me except for one point.
In idle_balance() the below code snippet does not look right:
- if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
- /*
- * We are going idle. next_balance may be set based on
- * a busy processor. So reset next_balance.
- */
+out:
+ /* Move the next balance forward */
+ if (time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;
- }
By not checking this_rq->next_balance against jiffies,
we might end up not updating this parameter when it
has expired.
So shouldn't it be:
if (time_after(jiffies, this_rq->next_balance) ||
time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;
Besides this:
Reviewed-by: Preeti U Murthy <[email protected]>
Regards
Preeti U Murthy
On Sat, Apr 26, 2014 at 1:24 AM, Jason Low <[email protected]> wrote:
> [ snipped: full patch quoted verbatim; see the version above ]
On Sun, Apr 27, 2014 at 02:01:45PM +0530, Preeti Murthy wrote:
> Hi Jason, Peter,
>
> The below patch looks good to me except for one point.
>
> In idle_balance() the below code snippet does not look right:
>
> - if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
> - /*
> - * We are going idle. next_balance may be set based on
> - * a busy processor. So reset next_balance.
> - */
> +out:
> + /* Move the next balance forward */
> + if (time_after(this_rq->next_balance, next_balance))
> this_rq->next_balance = next_balance;
> - }
>
> By not checking this_rq->next_balance against jiffies,
> we might end up not updating this parameter when it
> has expired.
>
> So shouldn't it be:
>
> if (time_after(jiffies, this_rq->next_balance) ||
> time_after(this_rq->next_balance, next_balance))
> this_rq->next_balance = next_balance;
So the reason I didn't do that is that nothing else does that either.
Also, note that the value we set rq->next_balance to might itself
already be expired. There is no guarantee that last_balance + interval
is in the future.
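Peter's point follows from the wraparound-safe jiffies comparisons. A
quick userspace sketch (with the macros copied from
include/linux/jiffies.h, typechecks omitted) shows that an already-expired
sd->last_balance + interval is still accepted by the update, and that an
expired rq->next_balance merely tests as due:

#include <stdio.h>

/* From include/linux/jiffies.h, with typecheck() dropped. */
#define time_after(a, b)	((long)((b) - (a)) < 0)
#define time_after_eq(a, b)	((long)((a) - (b)) >= 0)

int main(void)
{
	unsigned long jiffies = 100000;
	unsigned long next_balance = jiffies + 1000;

	/* last_balance + interval may already lie in the past... */
	unsigned long next = jiffies - 50;

	/* ...yet update_next_balance() still pulls next_balance to it. */
	if (time_after(next_balance, next))
		next_balance = next;

	/* An expired next_balance simply tests as due. */
	printf("due now: %d\n", time_after_eq(jiffies, next_balance));
	return 0;
}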
On Sat, 2014-04-26 at 16:50 +0200, Peter Zijlstra wrote:
> On Fri, Apr 25, 2014 at 12:54:14PM -0700, Jason Low wrote:
> > Preeti mentioned that sd->balance_interval is changed during load_balance().
> > Should we also consider updating the interval in rebalance_domains() after
> > calling load_balance(),
>
> Yeah, that might make sense.
>
> > and also taking max_load_balance_interval into account
> > in the updates for next_balance in idle_balance()?
>
> I was thinking that max_load_balance_interval thing was mostly about the
> *busy_factor thing, but sure, can't hurt to be consistent and always do
> it.
>
> > If so, how about something like the below change, which also introduces
> > get_sd_balance_interval() to obtain the sd's balance interval, and have both
> > update_next_balance() and rebalance_domains() use that function.
>
> Yes, that looks good.
>
> Can you send it with a proper changelog?
Sure, I'll send a v2 patchset so that this applies with the other
patches. I also think it would be beneficial to split this change into 2
patches (the 1st patch fixes commit e5fc6611, and the 2nd patch changes
how next_balance gets updated).
On Sun, 2014-04-27 at 14:01 +0530, Preeti Murthy wrote:
> Hi Jason, Peter,
>
> The below patch looks good to me except for one point.
>
> In idle_balance() the below code snippet does not look right:
>
> - if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
> - /*
> - * We are going idle. next_balance may be set based on
> - * a busy processor. So reset next_balance.
> - */
> +out:
> + /* Move the next balance forward */
> + if (time_after(this_rq->next_balance, next_balance))
> this_rq->next_balance = next_balance;
> - }
>
> By not checking this_rq->next_balance against jiffies,
> we might end up not updating this parameter when it
> has expired.
>
> So shouldn't it be:
>
> if (time_after(jiffies, this_rq->next_balance) ||
> time_after(this_rq->next_balance, next_balance))
> this_rq->next_balance = next_balance;
Hi Preeti,
If jiffies is after this_rq->next_balance, doesn't that mean that it's
actually due for a periodic balance and we wouldn't need to modify it?
In rebalance_domains(), we do load_balance if time_after_eq(jiffies,
sd->last_balance + interval).
>
> Besides this:
> Reviewed-by: Preeti U Murthy <[email protected]>
Thanks for the review.
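For reference, the periodic trigger Jason cites sits in
rebalance_domains(); trimmed here from the context shown in the patch
above to the relevant lines:

	if (time_after_eq(jiffies, sd->last_balance + interval)) {
		if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
			/* Re-evaluate idleness; balancing may change it. */
			idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
		}
		sd->last_balance = jiffies;
		interval = get_sd_balance_interval(sd, idle != CPU_IDLE);
	}

So an rq->next_balance already behind jiffies just means the domain is due
and gets balanced on the next pass.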
On 04/28/2014 02:54 PM, Peter Zijlstra wrote:
> On Sun, Apr 27, 2014 at 02:01:45PM +0530, Preeti Murthy wrote:
>> Hi Jason, Peter,
>>
>> The below patch looks good to me except for one point.
>>
>> In idle_balance() the below code snippet does not look right:
>>
>> - if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
>> - /*
>> - * We are going idle. next_balance may be set based on
>> - * a busy processor. So reset next_balance.
>> - */
>> +out:
>> + /* Move the next balance forward */
>> + if (time_after(this_rq->next_balance, next_balance))
>> this_rq->next_balance = next_balance;
>> - }
>>
>> By not checking this_rq->next_balance against jiffies,
>> we might end up not updating this parameter when it
>> has expired.
>>
>> So shouldn't it be:
>>
>> if (time_after(jiffies, this_rq->next_balance) ||
>> time_after(this_rq->next_balance, next_balance))
>> this_rq->next_balance = next_balance;
>
> So the reason I didn't do that is that nothing else does that either.
> Also, note that the value we set rq->next_balance to might itself
> already be expired. There is no guarantee that last_balance + interval
> is in the future.
>
Hmm this makes sense. Thanks!
Regards
Preeti U Murthy
On 04/28/2014 11:34 PM, Jason Low wrote:
> On Sun, 2014-04-27 at 14:01 +0530, Preeti Murthy wrote:
>> Hi Jason, Peter,
>>
>> The below patch looks good to me except for one point.
>>
>> In idle_balance() the below code snippet does not look right:
>>
>> - if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
>> - /*
>> - * We are going idle. next_balance may be set based on
>> - * a busy processor. So reset next_balance.
>> - */
>> +out:
>> + /* Move the next balance forward */
>> + if (time_after(this_rq->next_balance, next_balance))
>> this_rq->next_balance = next_balance;
>> - }
>>
>> By not checking this_rq->next_balance against jiffies,
>> we might end up not updating this parameter when it
>> has expired.
>>
>> So shouldn't it be:
>>
>> if (time_after(jiffies, this_rq->next_balance) ||
>> time_after(this_rq->next_balance, next_balance))
>> this_rq->next_balance = next_balance;
>
> Hi Preeti,
>
> If jiffies is after this_rq->next_balance, doesn't that mean that it's
> actually due for a periodic balance and we wouldn't need to modify it?
> In rebalance_domains(), we do load_balance if time_after_eq(jiffies,
> sd->last_balance + interval).
Right. So I missed the point that an expired rq->next_balance is not
really a problem. It simply ensures that load balancing will be done in
the next call to rebalance_domains(), which is all we want. Thanks for
pointing it out.
Regards
Preeti U Murthy
>
>>
>> Besides this:
>> Reviewed-by: Preeti U Murthy <[email protected]>
>
> Thanks for the review.
>