2007-09-15 13:07:06

by Ingo Molnar

Subject: [git] CFS-devel, group scheduler, fixes


The latest sched-devel.git tree can be pulled from:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

people were busy sending patches, so there are lots of updates since the
first announcement of the cfs-devel.git tree four days ago:

 include/linux/sched.h   |    4
 init/Kconfig            |    9 +
 kernel/sched.c          |  374 +++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched_debug.c    |   22 +-
 kernel/sched_fair.c     |  191 ++++++++++++++++++++----
 kernel/sched_idletask.c |    5
 kernel/sched_rt.c       |    5
 kernel/sysctl.c         |   19 ++
 8 files changed, 552 insertions(+), 77 deletions(-)

the most user-noticeable improvement should be the latency/interactivity
fixes from Mike Galbraith and Peter Zijlstra. The biggest (and most
complex) item is the new group scheduler code from Srivatsa Vaddagiri.
There are also speedups from Dmitry Adamushko and various cleanups and
fixes. Also included is the yield workaround that should address the
regression reported by Antoine Martin.

Changes:

- the biggest item is the addition of the group scheduler from Srivatsa
Vaddagiri - this is not configurable yet; it depends on
CONFIG_CONTAINERS. It causes no overhead on !CONFIG_CONTAINERS. This
code clearly seems mature now; hopefully the container bits go
upstream in 2.6.24 too. Srivatsa did lots of heavy lifting in the past
few months, and this final bit of code, which moves in all the
infrastructure, changes almost nothing in the core scheduler.

- a triplet of nice simplifications from Dmitry Adamushko. Most notably,
Dmitry got rid of se->fair_key, which shaves 8 bytes off task_struct and
gives a further speedup. Thanks Dmitry!

- continued refinements to the SMP ->vruntime code and timeslicing by
Peter Zijlstra and Mike Galbraith.

As usual, bug reports, fixes and suggestions are welcome, and please
holler if some patch went missing in action.

Ingo

------------------>
Dmitry Adamushko (3):
sched: clean up struct load_stat
sched: clean up schedstat block in dequeue_entity()
sched: optimize away ->fair_key

Matthias Kaehlcke (1):
sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (1):
sched: fix SMP migration latencies

Peter Zijlstra (7):
sched: simplify SCHED_FEAT_* code
sched: new task placement for vruntime
sched: simplify adaptive latency
sched: clean up new task placement
sched: add tree based averages
sched: handle vruntime overflow
sched: better min_vruntime tracking

Srivatsa Vaddagiri (1):
sched: group-scheduler core

Ingo Molnar (27):
sched: fix new-task method
sched: resched task in task_new_fair()
sched: small sched_debug cleanup
sched: debug: track maximum 'slice'
sched: uniform tunings
sched: use constants if !CONFIG_SCHED_DEBUG
sched: remove stat_gran
sched: remove precise CPU load
sched: remove precise CPU load calculations #2
sched: track cfs_rq->curr on !group-scheduling too
sched: cleanup: simplify cfs_rq_curr() methods
sched: uninline __enqueue_entity()/__dequeue_entity()
sched: speed up update_load_add/_sub()
sched: clean up calc_weighted()
sched: introduce se->vruntime
sched: move sched_feat() definitions
sched: optimize vruntime based scheduling
sched: simplify check_preempt() methods
sched: wakeup granularity fix
sched: add se->vruntime debugging
sched: sync up ->min_vruntime when going idle
sched: add more vruntime statistics
sched: debug: update exec_clock only when SCHED_DEBUG
sched: remove wait_runtime limit
sched: remove wait_runtime fields and features
sched: x86: allow single-depth wchan output
sched: yield workaround

 arch/i386/Kconfig       |   11
 include/linux/sched.h   |   21 -
 init/Kconfig            |    9
 kernel/sched.c          |  558 +++++++++++++++++++++++++------------
 kernel/sched_debug.c    |  100 +++---
 kernel/sched_fair.c     |  716 +++++++++++++++++++-----------------------------
 kernel/sched_idletask.c |    5
 kernel/sched_rt.c       |    5
 kernel/sysctl.c         |   33 +-
 9 files changed, 765 insertions(+), 693 deletions(-)


2007-09-18 19:36:46

by Dmitry Adamushko

Subject: Re: [git] CFS-devel, group scheduler, fixes


[ well, don't expect to find here anything like RDCFS
(no, 'D' does not stand for 'dumb'!). I was focused
on more prosaic things in the meantime so just
didn't have time for writing it.. ]

here are a few cleanups/simplifications/optimizations
based on the recent modifications in the sched-dev tree.

(1) optimize task_new_fair()
(2) simplify yield_task()
(3) rework enqueue/dequeue_entity() to get rid of
sched_class::set_curr_task()

additionally, the changes somewhat decrease code size:

   text    data     bss     dec     hex filename
  43538    5398      48   48984    bf58 build/kernel/sched.o.before
  43250    5390      48   48688    be30 build/kernel/sched.o

(SMP + lots of debugging options, but I guess in this case the diff
should remain visible for any combination.)

---

(1)

due to the fact that we no longer keep the 'current' within the tree,
dequeue/enqueue_entity() is useless for the 'current' in task_new_fair().
We are about to reschedule and sched_class->put_prev_task() will put
the 'current' back into the tree, based on its new key.

Signed-off-by: Dmitry Adamushko <[email protected]>

---
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 6e52d5a..5a244e2 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -969,10 +969,11 @@ static void task_new_fair(struct rq *rq, struct task_struct *p)

if (sysctl_sched_child_runs_first &&
curr->vruntime < se->vruntime) {
-
- dequeue_entity(cfs_rq, curr, 0);
+ /*
+ * Upon rescheduling, sched_class::put_prev_task() will place
+ * 'current' within the tree based on its new key value.
+ */
swap(curr->vruntime, se->vruntime);
- enqueue_entity(cfs_rq, curr, 0);
}

update_stats_enqueue(cfs_rq, se);

---

Dmitry



2007-09-18 20:32:22

by Ingo Molnar

Subject: Re: [git] CFS-devel, group scheduler, fixes


* dimm <[email protected]> wrote:

> [ well, don't expect to find here anything like RDCFS (no, 'D' does
> not stand for 'dumb'!). I was focused on more prosaic things in the
> meantime so just didn't have time for writing it.. ]
>
> here are a few cleanups/simplifications/optimizations based on the
> recent modifications in the sched-dev tree.
>
> (1) optimize task_new_fair()
> (2) simplify yield_task()
> (3) rework enqueue/dequeue_entity() to get rid of
> sched_class::set_curr_task()

the queue with your enhancements and simplifications applied looks good
here, and it booted fine on two testboxes. I've updated the
sched-devel.git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

(below is the full shortlog over current upstream.)

(I have not tested the group scheduling bits but perhaps Srivatsa would
like to do that?)

Ingo

------------------>
Dmitry Adamushko (9):
sched: clean up struct load_stat
sched: clean up schedstat block in dequeue_entity()
sched: sched_setscheduler() fix
sched: add set_curr_task() calls
sched: do not keep current in the tree and get rid of sched_entity::fair_key
sched: yield-workaround update
sched: optimize task_new_fair()
sched: simplify sched_class::yield_task()
sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()

Ingo Molnar (29):
sched: fix new-task method
sched: resched task in task_new_fair()
sched: small sched_debug cleanup
sched: debug: track maximum 'slice'
sched: uniform tunings
sched: use constants if !CONFIG_SCHED_DEBUG
sched: remove stat_gran
sched: remove precise CPU load
sched: remove precise CPU load calculations #2
sched: track cfs_rq->curr on !group-scheduling too
sched: cleanup: simplify cfs_rq_curr() methods
sched: uninline __enqueue_entity()/__dequeue_entity()
sched: speed up update_load_add/_sub()
sched: clean up calc_weighted()
sched: introduce se->vruntime
sched: move sched_feat() definitions
sched: optimize vruntime based scheduling
sched: simplify check_preempt() methods
sched: wakeup granularity fix
sched: add se->vruntime debugging
sched: sync up ->min_vruntime when going idle
sched: add more vruntime statistics
sched: debug: update exec_clock only when SCHED_DEBUG
sched: remove wait_runtime limit
sched: remove wait_runtime fields and features
sched: x86: allow single-depth wchan output
sched: yield workaround
sched: fix delay accounting performance regression
sched: disable START_DEBIT

Matthias Kaehlcke (1):
sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (1):
sched: fix SMP migration latencies

Peter Zijlstra (7):
sched: simplify SCHED_FEAT_* code
sched: new task placement for vruntime
sched: simplify adaptive latency
sched: clean up new task placement
sched: add tree based averages
sched: handle vruntime overflow
sched: better min_vruntime tracking

Srivatsa Vaddagiri (1):
sched: group-scheduler core

 arch/i386/Kconfig     |   11
 include/linux/sched.h |   24 -
 init/Kconfig          |    9
 kernel/sched.c        |  544 ++++++++++++++++++++++++------------
 kernel/sched_debug.c  |  100 ++----
 kernel/sched_fair.c   |  752 ++++++++++++++++++++------------------------------
 kernel/sched_rt.c     |    4
 kernel/sched_stats.h  |    4
 kernel/sysctl.c       |   35 +-
 9 files changed, 763 insertions(+), 720 deletions(-)

2007-09-18 20:47:52

by Ingo Molnar

Subject: Re: [git] CFS-devel, group scheduler, fixes


* dimm <[email protected]> wrote:

> (1)
>
> due to the fact that we no longer keep the 'current' within the tree,
> dequeue/enqueue_entity() is useless for the 'current' in
> task_new_fair(). We are about to reschedule and
> sched_class->put_prev_task() will put the 'current' back into the
> tree, based on its new key.
>
> Signed-off-by: Dmitry Adamushko <[email protected]>

thanks, applied.

Ingo

2007-09-19 03:45:36

by Srivatsa Vaddagiri

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Tue, Sep 18, 2007 at 10:22:43PM +0200, Ingo Molnar wrote:
> (I have not tested the group scheduling bits but perhaps Srivatsa would
> like to do that?)

Ingo,
I plan to test it today and send you any updates that may be
required.

--
Regards,
vatsa

2007-09-19 06:04:16

by Tong Li

Subject: Re: [git] CFS-devel, group scheduler, fixes

This patch attempts to improve CFS's SMP global fairness based on the new
virtual time design.

Removed vruntime adjustment in set_task_cpu() as it skews global fairness.

Modified small_imbalance logic in find_busiest_group(). If there's a small
imbalance, move tasks from busiest to local sched_group only if the local
group contains a CPU whose min_vruntime is the maximum among all CPUs in
the same sched_domain. This prevents any CPU from advancing too far ahead
in virtual time and avoids tasks thrashing between two CPUs without
utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs,
since the load is not evenly divisible by the number of CPUs, we want the
extra load to make fair use of every CPU in the system.

Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task
runs a trivial while (1) loop. The benchmark runs for 300 seconds and,
every T seconds, samples the following for each task:

1. Actual CPU time the task received during the past 60 seconds.

2. Ideal CPU time it would receive under a perfect fair scheduler.

3. Lag = ideal time - actual time, where a positive lag means the task
received less CPU time than its fair share and a negative lag means it
received more.

4. Error = lag / ideal time
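
(As a hypothetical worked example of these metrics, not from the
benchmark itself: with 10 equal-weight tasks sharing 8 CPUs, each task's
ideal share is 8/10 of a CPU, i.e. 48 seconds per 60-second window:)

#include <stdio.h>

int main(void)
{
        double window = 60.0;                /* sampling window, seconds   */
        double ideal  = window * 8.0 / 10.0; /* 8 CPUs / 10 tasks -> 48s   */
        double actual = 36.0;                /* example: task only got 36s */
        double lag    = ideal - actual;      /* +12s: below its fair share */
        double error  = lag / ideal;         /* 0.25, i.e. 25%             */

        printf("ideal=%.0fs lag=%.0fs error=%.0f%%\n",
               ideal, lag, error * 100.0);
        return 0;
}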

The following shows the max and min errors among all samples for all tasks
before and after applying the patch:

Before:

Sampling interval: 30 s
Max error: 100.00%
Min error: -25.00%

Sampling interval: 10 s
Max error: 27.62%
Min error: -25.00%

After:

Sampling interval: 30 s
Max error: 1.33%
Min error: -1.29%

Sampling interval: 10 s
Max error: 7.38%
Min error: -6.25%

The errors for the 10s sampling interval are still not as small as I had
hoped for, but it looks like the patch does bring some improvement.

tong

Signed-off-by: Tong Li <[email protected]>
---
--- linux-2.6-sched-devel-orig/kernel/sched.c 2007-09-15 22:00:48.000000000 -0700
+++ linux-2.6-sched-devel/kernel/sched.c 2007-09-18 22:10:52.000000000 -0700
@@ -1033,9 +1033,6 @@ void set_task_cpu(struct task_struct *p,
if (p->se.block_start)
p->se.block_start -= clock_offset;
#endif
- if (likely(new_rq->cfs.min_vruntime))
- p->se.vruntime -= old_rq->cfs.min_vruntime -
- new_rq->cfs.min_vruntime;

__set_task_cpu(p, new_cpu);
}
@@ -1599,6 +1596,7 @@ static void __sched_fork(struct task_str
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
p->se.prev_sum_exec_runtime = 0;
+ p->se.vruntime = 0;

#ifdef CONFIG_SCHEDSTATS
p->se.wait_start = 0;
@@ -2277,6 +2275,8 @@ find_busiest_group(struct sched_domain *
int *sd_idle, cpumask_t *cpus, int *balance)
{
struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
+ struct sched_group *max_vruntime_group = NULL;
+ u64 max_vruntime = 0;
unsigned long max_load, avg_load, total_load, this_load, total_pwr;
unsigned long max_pull;
unsigned long busiest_load_per_task, busiest_nr_running;
@@ -2322,6 +2322,11 @@ find_busiest_group(struct sched_domain *

rq = cpu_rq(i);

+ if (rq->cfs.min_vruntime > max_vruntime) {
+ max_vruntime = rq->cfs.min_vruntime;
+ max_vruntime_group = group;
+ }
+
if (*sd_idle && rq->nr_running)
*sd_idle = 0;

@@ -2483,59 +2488,16 @@ group_next:
* moved
*/
if (*imbalance < busiest_load_per_task) {
- unsigned long tmp, pwr_now, pwr_move;
- unsigned int imbn;
-
small_imbalance:
- pwr_move = pwr_now = 0;
- imbn = 2;
- if (this_nr_running) {
- this_load_per_task /= this_nr_running;
- if (busiest_load_per_task > this_load_per_task)
- imbn = 1;
- } else
- this_load_per_task = SCHED_LOAD_SCALE;
-
- if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
- busiest_load_per_task * imbn) {
- *imbalance = busiest_load_per_task;
- return busiest;
- }
-
- /*
- * OK, we don't have enough imbalance to justify moving tasks,
- * however we may be able to increase total CPU power used by
- * moving them.
+ /*
+ * When there's small imbalance, move tasks only if this
+ * sched_group contains a CPU whose min_vruntime is the
+ * maximum among all CPUs in the same domain.
*/
-
- pwr_now += busiest->__cpu_power *
- min(busiest_load_per_task, max_load);
- pwr_now += this->__cpu_power *
- min(this_load_per_task, this_load);
- pwr_now /= SCHED_LOAD_SCALE;
-
- /* Amount of load we'd subtract */
- tmp = sg_div_cpu_power(busiest,
- busiest_load_per_task * SCHED_LOAD_SCALE);
- if (max_load > tmp)
- pwr_move += busiest->__cpu_power *
- min(busiest_load_per_task, max_load - tmp);
-
- /* Amount of load we'd add */
- if (max_load * busiest->__cpu_power <
- busiest_load_per_task * SCHED_LOAD_SCALE)
- tmp = sg_div_cpu_power(this,
- max_load * busiest->__cpu_power);
- else
- tmp = sg_div_cpu_power(this,
- busiest_load_per_task * SCHED_LOAD_SCALE);
- pwr_move += this->__cpu_power *
- min(this_load_per_task, this_load + tmp);
- pwr_move /= SCHED_LOAD_SCALE;
-
- /* Move if we gain throughput */
- if (pwr_move > pwr_now)
+ if (max_vruntime_group == this)
*imbalance = busiest_load_per_task;
+ else
+ *imbalance = 0;
}

return busiest;

2007-09-19 06:28:57

by Mike Galbraith

Subject: Re: [git] CFS-devel, group scheduler, fixes

Greetings,

On Tue, 2007-09-18 at 23:03 -0700, Tong Li wrote:
> This patch attempts to improve CFS's SMP global fairness based on the new
> virtual time design.
>
> Removed vruntime adjustment in set_task_cpu() as it skews global fairness.

Since I'm (still) encountering Xorg latency issues (which go away if
load is hefty instead of light) even with that migration adjustment and
synchronization, and am having difficulty nailing it down to a specific
event, I'll test this immediately.

-Mike

2007-09-19 07:51:18

by Mike Galbraith

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Wed, 2007-09-19 at 08:28 +0200, Mike Galbraith wrote:
> Greetings,
>
> On Tue, 2007-09-18 at 23:03 -0700, Tong Li wrote:
> > This patch attempts to improve CFS's SMP global fairness based on the new
> > virtual time design.
> >
> > Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
>
> Since I'm (still) encountering Xorg latency issues (which go away if
> load is hefty instead of light) even with that migration adjustment and
> synchronization, and am having difficulty nailing it down to a specific
> event, I'll test this immediately.

(had to apply manually to freshly pulled tree)

Drat. This didn't cure the latency hits with Xorg at nice -5 running
with a make -j2 at nice 0, but it seems to have reinstated a latency
issue which was previously cured.

Xorg 1 sec. max latency samples:
(trimmed to only show >20ms latencies)
se.wait_max : 23343582
se.wait_max : 20119460
se.wait_max : 20771573
se.wait_max : 21084567
se.wait_max : 31338500
se.wait_max : 35368148
se.wait_max : 39199642
se.wait_max : 22889062
se.wait_max : 40285501
se.wait_max : 21002720
se.wait_max : 21002266
se.wait_max : 21680578
se.wait_max : 22012913
se.wait_max : 94646331
se.wait_max : 29003693
se.wait_max : 20812613

(boot with maxcpus=1 or nail X+make to one CPU and these latencies are
gone, so it does seem to be the migration logic - which is why I was so
interested in testing your patch)

The scenario which was previously cured was this:
taskset -c 1 nice -n 0 ./massive_intr 2 9999
taskset -c 1 nice -n 5 ./massive_intr 2 9999
click link
(http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
up browser and OpenOffice Impress.

Xorg (at nice -5 + above scenario) latency samples:
se.wait_max : 57985337
se.wait_max : 25163510
se.wait_max : 37005538
se.wait_max : 66986511
se.wait_max : 53990868
se.wait_max : 80976761
se.wait_max : 96967501
se.wait_max : 80989254
se.wait_max : 53990897
se.wait_max : 181963905
se.wait_max : 85985181

-Mike

2007-09-19 08:43:04

by Mike Galbraith

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Wed, 2007-09-19 at 09:51 +0200, Mike Galbraith wrote:

> The scenario which was previously cured was this:
> taskset -c 1 nice -n 0 ./massive_intr 2 9999
> taskset -c 1 nice -n 5 ./massive_intr 2 9999
> click link
> (http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
> up browser and OpenOffice Impress.
>
> Xorg (at nice -5 + above scenario) latency samples:
> se.wait_max : 57985337
> se.wait_max : 25163510
> se.wait_max : 37005538
> se.wait_max : 66986511
> se.wait_max : 53990868
> se.wait_max : 80976761
> se.wait_max : 96967501
> se.wait_max : 80989254
> se.wait_max : 53990897
> se.wait_max : 181963905
> se.wait_max : 85985181

To be doubly sure of the effect on the pinned tasks + migrating Xorg
scenario, I just ran the above test 10 times with virgin devel source.
Maximum Xorg latency was 20ms.

-Mike

2007-09-19 17:07:11

by Tong Li

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Wed, 19 Sep 2007, Mike Galbraith wrote:

> On Wed, 2007-09-19 at 09:51 +0200, Mike Galbraith wrote:
>
>> The scenario which was previously cured was this:
>> taskset -c 1 nice -n 0 ./massive_intr 2 9999
>> taskset -c 1 nice -n 5 ./massive_intr 2 9999
>> click link
>> (http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
>> up browser and OpenOffice Impress.
>>
>> Xorg (at nice -5 + above scenario) latency samples:
>> se.wait_max : 57985337
>> se.wait_max : 25163510
>> se.wait_max : 37005538
>> se.wait_max : 66986511
>> se.wait_max : 53990868
>> se.wait_max : 80976761
>> se.wait_max : 96967501
>> se.wait_max : 80989254
>> se.wait_max : 53990897
>> se.wait_max : 181963905
>> se.wait_max : 85985181
>
> To be doubly sure of the effect on the pinned tasks + migrating Xorg
> scenario, I just ran the above test 10 times with virgin devel source.
> Maximum Xorg latency was 20ms.
>
> -Mike
>

Yeah, the patch was a first attempt at getting better global fairness for
unpinned tasks. I expected there'd be latency problems when unpinned and
pinned tasks co-exist. The original code for vruntime adjustment in
set_task_cpu() helped alleviate this problem, so removing it in my patch
would definitely bring the problem back. On the other hand, leaving it
there broke global fairness in my fairness benchmark. So we'd need a
better compromise.

Were the experiments run on a 2-CPU system? When Xorg experiences large
wait time, is it on the same CPU that has the two pinned tasks? If this is
the case, then the problem could be that X somehow advanced faster and got
a larger vruntime than the two pinned tasks, so it had to wait for the
pinned ones to catch up before it got a chance to be scheduled.

tong

2007-09-19 19:36:42

by Suresh Siddha

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Tue, Sep 18, 2007 at 11:03:59PM -0700, Tong Li wrote:
> This patch attempts to improve CFS's SMP global fairness based on the new
> virtual time design.
>
> Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
>
> Modified small_imbalance logic in find_busiest_group(). If there's a small
> imbalance, move tasks from busiest to local sched_group only if the local
> group contains a CPU whose min_vruntime is the maximum among all CPUs in
> the same sched_domain. This prevents any CPU from advancing too far ahead
> in virtual time and avoids tasks thrashing between two CPUs without
> utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs,
> since the load is not evenly divisible by the number of CPUs, we want the
> extra load to make fair use of every CPU in the system.
>
> Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task

Just as an experiment, can you run 82 tasks on 8 CPUs? Current
imbalance_pct logic will not detect and fix the global fairness issue
even with this patch.

> if (*imbalance < busiest_load_per_task) {
> - unsigned long tmp, pwr_now, pwr_move;
> - unsigned int imbn;
> -
> small_imbalance:
> - pwr_move = pwr_now = 0;
> - imbn = 2;
> - if (this_nr_running) {
> - this_load_per_task /= this_nr_running;
> - if (busiest_load_per_task > this_load_per_task)
> - imbn = 1;
> - } else
> - this_load_per_task = SCHED_LOAD_SCALE;
> -
> - if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
> - busiest_load_per_task * imbn) {
> - *imbalance = busiest_load_per_task;
> - return busiest;
> - }

This patch removes quite a few lines, and all this logic is not for
fairness :( In particular, the above portion handles some of the HT/MC
optimizations.

> -
> - /*
> - * OK, we don't have enough imbalance to justify moving
> tasks,
> - * however we may be able to increase total CPU power used by
> - * moving them.
> + /*
> + * When there's small imbalance, move tasks only if this
> + * sched_group contains a CPU whose min_vruntime is the
> + * maximum among all CPUs in the same domain.
> */
> -
> - pwr_now += busiest->__cpu_power *
> - min(busiest_load_per_task, max_load);
> - pwr_now += this->__cpu_power *
> - min(this_load_per_task, this_load);
> - pwr_now /= SCHED_LOAD_SCALE;
> -
> - /* Amount of load we'd subtract */
> - tmp = sg_div_cpu_power(busiest,
> - busiest_load_per_task * SCHED_LOAD_SCALE);
> - if (max_load > tmp)
> - pwr_move += busiest->__cpu_power *
> - min(busiest_load_per_task, max_load - tmp);
> -
> - /* Amount of load we'd add */
> - if (max_load * busiest->__cpu_power <
> - busiest_load_per_task * SCHED_LOAD_SCALE)
> - tmp = sg_div_cpu_power(this,
> - max_load * busiest->__cpu_power);
> - else
> - tmp = sg_div_cpu_power(this,
> - busiest_load_per_task * SCHED_LOAD_SCALE);
> - pwr_move += this->__cpu_power *
> - min(this_load_per_task, this_load + tmp);
> - pwr_move /= SCHED_LOAD_SCALE;
> -
> - /* Move if we gain throughput */
> - if (pwr_move > pwr_now)
> + if (max_vruntime_group == this)
> *imbalance = busiest_load_per_task;
> + else
> + *imbalance = 0;

Not sure how this all interacts when some of the CPUs are idle. I have to
look more closely.

thanks,
suresh

2007-09-19 20:58:34

by Tong Li

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Wed, 19 Sep 2007, Siddha, Suresh B wrote:

> On Tue, Sep 18, 2007 at 11:03:59PM -0700, Tong Li wrote:
>> This patch attempts to improve CFS's SMP global fairness based on the new
>> virtual time design.
>>
>> Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
>>
>> Modified small_imbalance logic in find_busiest_group(). If there's a small
>> imbalance, move tasks from busiest to local sched_group only if the local
>> group contains a CPU whose min_vruntime is the maximum among all CPUs in
>> the same sched_domain. This prevents any CPU from advancing too far ahead
>> in virtual time and avoids tasks thrashing between two CPUs without
>> utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs,
>> since the load is not evenly divisible by the number of CPUs, we want the
>> extra load to make fair use of every CPU in the system.
>>
>> Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task
>
> Just as an experiment, can you run 82 tasks on 8 CPUs? Current
> imbalance_pct logic will not detect and fix the global fairness issue
> even with this patch.

Right. For 82 tasks on 8 CPUs, the max error is 7.07% and min is -14.12%.
It could be fixed by removing the imbalance_pct check (i.e., making
imbalance_pct effectively 1). The cost would be more migrations, so I
wasn't sure if that was the approach people would agree on.

>
>> if (*imbalance < busiest_load_per_task) {
>> - unsigned long tmp, pwr_now, pwr_move;
>> - unsigned int imbn;
>> -
>> small_imbalance:
>> - pwr_move = pwr_now = 0;
>> - imbn = 2;
>> - if (this_nr_running) {
>> - this_load_per_task /= this_nr_running;
>> - if (busiest_load_per_task > this_load_per_task)
>> - imbn = 1;
>> - } else
>> - this_load_per_task = SCHED_LOAD_SCALE;
>> -
>> - if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
>> - busiest_load_per_task * imbn) {
>> - *imbalance = busiest_load_per_task;
>> - return busiest;
>> - }
>
> This patch removes quite a few lines, and all this logic is not for
> fairness :( In particular, the above portion handles some of the HT/MC
> optimizations.

What are the optimizations? I was trying to simplify the code to use only
vruntime to control balancing when there's a small imbalance, but if it
breaks non-fairness related optimizations, we can certainly add the code
back.

tong

2007-09-20 04:55:29

by Mike Galbraith

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Wed, 2007-09-19 at 10:06 -0700, Tong Li wrote:

> Were the experiments run on a 2-CPU system?

Yes.

> When Xorg experiences large
> wait time, is it on the same CPU that has the two pinned tasks? If this is
> the case, then the problem could be that X somehow advanced faster and got
> a larger vruntime than the two pinned tasks, so it had to wait for the
> pinned ones to catch up before it got a chance to be scheduled.

Good question. I've not yet been able to capture any event where
vruntimes are that far apart in sched_debug.

-Mike

2007-09-20 07:15:28

by Mike Galbraith

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Thu, 2007-09-20 at 06:55 +0200, Mike Galbraith wrote:
> On Wed, 2007-09-19 at 10:06 -0700, Tong Li wrote:
>
> > Were the experiments run on a 2-CPU system?
>
> Yes.
>
> > When Xorg experiences large
> > wait time, is it on the same CPU that has the two pinned tasks? If this is
> > the case, then the problem could be that X somehow advanced faster and got
> > a larger vruntime than the two pinned tasks, so it had to wait for the
> > pinned ones to catch up before it got a chance to be scheduled.
>
> Good question. I've not yet been able to capture any event where
> vruntimes are that far apart in sched_debug.

But, I did just manage to trigger some horrid behavior, and log it. I
modified the kernel to print tasks' actual tree keys instead of their
current vruntimes, and was watching that while make -j2 was running (and
not seeing anything very interesting), when on a lark, I restarted
SuSE's system updater thingy. That beast chews 100% CPU for so long at
startup that I long ago got annoyed, and changed it to run at nice 19.
Anyway, when it started, interactivity went to hell in the proverbial
hand-basket, and the sched_debug log shows some interesting results..
like spread0 hitting -13659412644, and cc1 being keyed at -3867063305.

cpu#0, 2992.608 MHz
.nr_running : 4
.load : 4096
.nr_switches : 1105882
.nr_load_updates : 735146
.nr_uninterruptible : 4294966399
.jiffies : 1936201
.next_balance : 1936276
.curr->pid : 27004
.clock : 735034802877
.idle_clock : 0
.prev_clock_raw : 2259213083886
.clock_warps : 0
.clock_overflows : 677631
.clock_deep_idle_events : 0
.clock_max_delta : 999848
.cpu_load[0] : 4096
.cpu_load[1] : 3960
.cpu_load[2] : 3814
.cpu_load[3] : 3539
.cpu_load[4] : 3153

cfs_rq
.exec_clock : 623175870707
.MIN_vruntime : 3791173262297
.min_vruntime : 3791173259723
.max_vruntime : 3791173264579
.spread : 2282
.spread0 : 0
.nr_sync_min_vruntime : 296756

runnable tasks:
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bash 6338 3212 554 120 3791173262935 104271360 645224098563
R cc1 27004 -7509103 6 120 3791165750620 15396029 13300466
make 27006 4856 6 120 3791173264579 9540879 6458208
make 27010 2574 3 120 3791173262297 8897680 0

cpu#1, 2992.608 MHz
.nr_running : 5
.load : 6208
.nr_switches : 1012995
.nr_load_updates : 747540
.nr_uninterruptible : 897
.jiffies : 1936201
.next_balance : 1936303
.curr->pid : 27012
.clock : 747426624818
.idle_clock : 0
.prev_clock_raw : 2259213140042
.clock_warps : 0
.clock_overflows : 582091
.clock_deep_idle_events : 0
.clock_max_delta : 999848
.cpu_load[0] : 6208
.cpu_load[1] : 4545
.cpu_load[2] : 3810
.cpu_load[3] : 3065
.cpu_load[4] : 2397

cfs_rq
.exec_clock : 581940255731
.MIN_vruntime : 3791835890838
.min_vruntime : 3791854778322
.max_vruntime : 3791902652145
.spread : 66761307
.spread0 : 681518599
.nr_sync_min_vruntime : 256743

runnable tasks:
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Xorg 5740 -18887484 157025 115 3791835890838 46597617 638567698716
update-status 26093 48184396 1537 139 3791902652145 5709334592 2622161813
sh 27011 -6808461 2 120 3791847969861 4162855 1457101
R cat 27012 -19965525 3 120 3791836175385 1756456 1974618
bash 27013 1498 2 120 3791854779820 394907 0

Sched Debug Version: v0.05-v20, 2.6.23-rc6-smp-d #43
now at 2237384024289 nsecs

cpu#0, 2992.608 MHz
.nr_running : 2
.load : 1039
.nr_switches : 1106019
.nr_load_updates : 736329
.nr_uninterruptible : 4294966399
.jiffies : 1937383
.next_balance : 1937428
.curr->pid : 27087
.clock : 736217427466
.idle_clock : 0
.prev_clock_raw : 2260395464994
.clock_warps : 0
.clock_overflows : 678147
.clock_deep_idle_events : 0
.clock_max_delta : 999848
.cpu_load[0] : 1039
.cpu_load[1] : 1039
.cpu_load[2] : 1039
.cpu_load[3] : 1039
.cpu_load[4] : 1039

cfs_rq
.exec_clock : 624358414291
.MIN_vruntime : 3806531640151
.min_vruntime : 3806501788119
.max_vruntime : 3806531640151
.spread : 0
.spread0 : 0
.nr_sync_min_vruntime : 296782

runnable tasks:
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
update-status 26093 56810274 1572 139 3806531640151 5929003154 2622161813
R cc1 27087 0 13 120 3806501788119 403581890 15024992

cpu#1, 2992.608 MHz
.nr_running : 2
.load : 2048
.nr_switches : 1013933
.nr_load_updates : 748722
.nr_uninterruptible : 898
.jiffies : 1937383
.next_balance : 1937455
.curr->pid : 27089
.clock : 748608445154
.idle_clock : 0
.prev_clock_raw : 2260394709751
.clock_warps : 0
.clock_overflows : 582583
.clock_deep_idle_events : 0
.clock_max_delta : 999848
.cpu_load[0] : 2048
.cpu_load[1] : 5016
.cpu_load[2] : 6366
.cpu_load[3] : 7081
.cpu_load[4] : 6895

cfs_rq
.exec_clock : 583122024084
.MIN_vruntime : 3806331397540
.min_vruntime : 3806350696772
.max_vruntime : 3806331397540
.spread : 0
.spread0 : -151091347
.nr_sync_min_vruntime : 256756

runnable tasks:
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
konsole 6330 -19299232 124476 120 3806331397540 17086991675 615801753460
R cat 27089 -19513636 2 120 3806331713718 1163204 7839421

Sched Debug Version: v0.05-v20, 2.6.23-rc6-smp-d #43
now at 2238396055118 nsecs

cpu#0, 2992.608 MHz
.nr_running : 4
.load : 3087
.nr_switches : 1106671
.nr_load_updates : 737341
.nr_uninterruptible : 4294966395
.jiffies : 1938395
.next_balance : 1938412
.curr->pid : 27115
.clock : 737229273643
.idle_clock : 0
.prev_clock_raw : 2261407091960
.clock_warps : 0
.clock_overflows : 678565
.clock_deep_idle_events : 0
.clock_max_delta : 999848
.cpu_load[0] : 3087
.cpu_load[1] : 3055
.cpu_load[2] : 2723
.cpu_load[3] : 2169
.cpu_load[4] : 1698

cfs_rq
.exec_clock : 625370185925
.MIN_vruntime : 3833041200497
.min_vruntime : 3833041199517
.max_vruntime : 3833103226422
.spread : 62025925
.spread0 : 0
.nr_sync_min_vruntime : 296798

runnable tasks:
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
bash 6334 980 4556 120 3833041200497 565634278 646894253582
update-status 26093 66364651 1667 139 3833103226422 6318237803 2622161813
cc1 27112 1387807 22 120 3833042327482 242204855 12705113
R cat 27115 -19905217 2 120 3833023008660 1844735 3670718

cpu#1, 2992.608 MHz
.nr_running : 2
.load : 4145
.nr_switches : 1014099
.nr_load_updates : 749734
.nr_uninterruptible : 901
.jiffies : 1938395
.next_balance : 1938479
.curr->pid : 27084
.clock : 749620291330
.idle_clock : 0
.prev_clock_raw : 2261406342542
.clock_warps : 0
.clock_overflows : 582991
.clock_deep_idle_events : 0
.clock_max_delta : 999848
.cpu_load[0] : 4145
.cpu_load[1] : 4145
.cpu_load[2] : 4136
.cpu_load[3] : 3934
.cpu_load[4] : 3342

cfs_rq
.exec_clock : 584133842523
.MIN_vruntime : 3819361786873
.min_vruntime : 3819381786873
.max_vruntime : 3819361786873
.spread : 0
.spread0 : -13659412644
.nr_sync_min_vruntime : 256768

runnable tasks:
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
kacpid 73 -20000000 1073 115 3819361786873 489384092 743825068255
R cc1 27084 -3867063305 138 120 3815518715812 1205052524 24535558


2007-09-20 07:53:25

by Ingo Molnar

Subject: Re: [git] CFS-devel, group scheduler, fixes


* Mike Galbraith <[email protected]> wrote:

> [...] I modified the kernel to print tasks' actual tree keys instead
> of their current vruntimes, [...]

btw., that looks like a debug printout bug in sched-devel.git - could
you send me your fix? I've pushed out the latest sched-devel (on top of
-rc7) to the usual place:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

(note that there are some debug printout updates which might clash with
your fix)

Ingo

2007-09-20 08:17:21

by Mike Galbraith

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Thu, 2007-09-20 at 09:51 +0200, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > [...] I modified the kernel to print tasks' actual tree keys instead
> > of their current vruntimes, [...]
>
> btw., that looks like a debug printout bug in sched-devel.git - could
> you send me your fix? I've pushed out the latest sched-devel (on top of
> -rc7) to the usual place:

Well, I temporarily added a key field, which is only used to store the
key for debug...

> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
>
> (note that there are some debug printout updates which might clash with
> your fix)

I'll pull/build this and test my migration problem there. All I have to
do is add a nice 19 chew-max to my make -j2, and all hell breaks
loose. Always suspecting myself first in the search for the dirty rotten
SOB who broke the local scheduler :) I nuked the migration fix. Looks like
I'm not the SOB this time.. it got much worse, with max Xorg latency of
>500ms.

-Mike

2007-09-20 19:49:25

by Willy Tarreau

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Thu, Sep 20, 2007 at 09:15:07AM +0200, Mike Galbraith wrote:
> But, I did just manage to trigger some horrid behavior, and log it. I
> modified the kernel to print tasks' actual tree keys instead of their
> current vruntimes, and was watching that while make -j2 was running (and
> not seeing anything very interesting), when on a lark, I restarted
> SuSE's system updater thingy. That beast chews 100% CPU for so long at
> startup that I long ago got annoyed, and changed it to run at nice 19.
> Anyway, when it started, interactivity went to hell in the proverbial
> hand-basket, and the sched_debug log shows some interesting results..
> like spread0 hitting -13659412644, and cc1 being keyed at -3867063305.
>
> cpu#0, 2992.608 MHz
> .nr_running : 4
> .load : 4096
> .nr_switches : 1105882
> .nr_load_updates : 735146
> .nr_uninterruptible : 4294966399

(...)

> cpu#1, 2992.608 MHz
> .nr_running : 5
> .load : 6208
> .nr_switches : 1012995
> .nr_load_updates : 747540
> .nr_uninterruptible : 897

I don't know if this is relevant, but 4294966399 in nr_uninterruptible
for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
I don't know if this rings a bell for someone or if it's a completely
useless comment, but just in case...

HTH,
Willy

2007-09-21 02:41:16

by Mike Galbraith

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Thu, 2007-09-20 at 21:48 +0200, Willy Tarreau wrote:

> I don't know if this is relevant, but 4294966399 in nr_uninterruptible
> for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
> I don't know if this rings a bell for someone or if it's a completely
> useless comment, but just in case...

A task can block on one CPU, and wake up on another, which isn't
tracked, hence the fishy-looking numbers. The true nr_uninterruptible
is the sum of all.
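
(For reference, a sketch of how the global count is formed - modeled on
the nr_uninterruptible() helper in kernel/sched.c of this era; the final
clamp exists precisely because the locklessly-read per-CPU counters can
transiently go "negative":)

unsigned long nr_uninterruptible(void)
{
        unsigned long i, sum = 0;

        for_each_possible_cpu(i)
                sum += cpu_rq(i)->nr_uninterruptible;

        /*
         * Since we read the counters locklessly and a task can block
         * on one CPU and wake on another, individual counters can be
         * off; never report a sum that wrapped below zero:
         */
        if (unlikely((long)sum < 0))
                sum = 0;

        return sum;
}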

-Mike

2007-09-21 03:13:12

by Willy Tarreau

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Fri, Sep 21, 2007 at 04:40:55AM +0200, Mike Galbraith wrote:
> On Thu, 2007-09-20 at 21:48 +0200, Willy Tarreau wrote:
>
> > I don't know if this is relevant, but 4294966399 in nr_uninterruptible
> > for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
> > I don't know if this rings a bell for someone or if it's a completely
> > useless comment, but just in case...
>
> A task can block on one cpu, and wake up on another, which isn't
> tracked, hence the fishy looking numbers. The true nr_uninterruptible
> is the sum of all.

OK, thanks Mike for this clarification.

Willy

2007-09-22 03:27:42

by Tong Li

Subject: Re: [git] CFS-devel, group scheduler, fixes

Mike,

Could you try this patch to see if it solves the latency problem?

tong

Changes:

1. Modified vruntime adjustment logic in set_task_cpu(). See comments in
code. This fixed the negative vruntime problem.

2. This code in update_curr() seems to be wrong:

if (unlikely(!curr))
return sync_vruntime(cfs_rq);

cfs_rq->curr can be NULL even if cfs_rq->nr_running is non-zero (e.g.,
when an RT task is running). We only want to call sync_vruntime when
cfs_rq->nr_running is 0. This fixed the large latency problem (at least in
my tests).

Signed-off-by: Tong Li <[email protected]>
---
diff -uprN linux-2.6-sched-devel-orig/kernel/sched.c linux-2.6-sched-devel/kernel/sched.c
--- linux-2.6-sched-devel-orig/kernel/sched.c 2007-09-20 12:15:41.000000000 -0700
+++ linux-2.6-sched-devel/kernel/sched.c 2007-09-21 19:40:08.000000000 -0700
@@ -1033,9 +1033,20 @@ void set_task_cpu(struct task_struct *p,
if (p->se.block_start)
p->se.block_start -= clock_offset;
#endif
- if (likely(new_rq->cfs.min_vruntime))
- p->se.vruntime -= old_rq->cfs.min_vruntime -
- new_rq->cfs.min_vruntime;
+ /*
+ * Reset p's vruntime if it moves to new_cpu whose min_vruntime is
+ * 100,000,000 (equivalent to 100ms for nice-0 tasks) larger or
+ * smaller than p's vruntime. This improves interactivity when
+ * pinned and unpinned tasks co-exist. For example, pinning a few
+ * tasks to a CPU can cause its min_vruntime much smaller than the
+ * other CPUs. If a task moves to this CPU, its vruntime can be so
+ * large it won't be scheduled until the locally pinned tasks'
+ * vruntimes catch up, causing large delays.
+ */
+ if (unlikely(old_cpu != new_cpu && p->se.vruntime &&
+ (p->se.vruntime > new_rq->cfs.min_vruntime + 100000000 ||
+ p->se.vruntime + 100000000 < new_rq->cfs.min_vruntime)))
+ p->se.vruntime = new_rq->cfs.min_vruntime;

__set_task_cpu(p, new_cpu);
}
@@ -1599,6 +1610,7 @@ static void __sched_fork(struct task_str
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
p->se.prev_sum_exec_runtime = 0;
+ p->se.vruntime = 0;

#ifdef CONFIG_SCHEDSTATS
p->se.wait_start = 0;
diff -uprN linux-2.6-sched-devel-orig/kernel/sched_fair.c linux-2.6-sched-devel/kernel/sched_fair.c
--- linux-2.6-sched-devel-orig/kernel/sched_fair.c 2007-09-20 12:15:41.000000000 -0700
+++ linux-2.6-sched-devel/kernel/sched_fair.c 2007-09-21 17:23:09.000000000 -0700
@@ -306,8 +306,10 @@ static void update_curr(struct cfs_rq *c
u64 now = rq_of(cfs_rq)->clock;
unsigned long delta_exec;

- if (unlikely(!curr))
+ if (unlikely(!cfs_rq->nr_running))
return sync_vruntime(cfs_rq);
+ if (unlikely(!curr))
+ return;

/*
* Get the amount of time the current task was running

2007-09-22 10:02:04

by Mike Galbraith

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
> Mike,
>
> Could you try this patch to see if it solves the latency problem?

No, but it helps some when running two un-pinned busy loops, one at nice
0, and the other at nice 19. Yesterday I hit latencies of up to 1.2
_seconds_ doing this, and logging sched_debug and /proc/`pidof
Xorg`/sched from SCHED_RR shells.

se.wait_max : 164.242748
se.wait_max : 121.996128
se.wait_max : 194.464773
se.wait_max : 517.425411
se.wait_max : 131.453214
se.wait_max : 122.984190
se.wait_max : 111.729274
se.wait_max : 119.567580
se.wait_max : 126.980696
se.wait_max : 177.241452
se.wait_max : 129.604329
se.wait_max : 119.631657

It doesn't help at all with this oddity while running the same:

root@Homer: now is
the time
foooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooor aall good men to come to the aid of their country

That was a nice 0 shell window. I'm not a great typist, but I ain't
_that_ bad :)

-Mike

2007-09-23 07:14:36

by Mike Galbraith

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Sat, 2007-09-22 at 12:01 +0200, Mike Galbraith wrote:
> On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
> > Mike,
> >
> > Could you try this patch to see if it solves the latency problem?
>
> No, but it helps some when running two un-pinned busy loops, one at nice
> 0, and the other at nice 19. Yesterday I hit latencies of up to 1.2
> _seconds_ doing this, and logging sched_debug and /proc/`pidof
> Xorg`/sched from SCHED_RR shells.

Looking at a log (snippet attached) from this morning with the last hunk
of your patch still intact, it looks like min_vruntime is being modified
after a task is queued. If you look at the snippet, you'll see the nice
19 bash busy loop on CPU1 with a vruntime of 3010385.345325, and one
second later on CPU1 with its vruntime at 2814952.425082, but
min_vruntime is 3061874.838356.

I took a hammer to it, and my latencies running this test went away.

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c 2007-09-22 13:37:32.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c 2007-09-23 08:29:38.000000000 +0200
@@ -290,14 +290,19 @@ __update_curr(struct cfs_rq *cfs_rq, str
static void sync_vruntime(struct cfs_rq *cfs_rq)
{
struct rq *rq;
- int cpu;
+ int cpu, wrote = 0;

for_each_online_cpu(cpu) {
rq = &per_cpu(runqueues, cpu);
+ if (spin_is_locked(&rq->lock))
+ continue;
+ smp_wmb();
cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
rq->cfs.min_vruntime);
+ wrote++;
}
- schedstat_inc(cfs_rq, nr_sync_min_vruntime);
+ if (wrote)
+ schedstat_inc(cfs_rq, nr_sync_min_vruntime);
}

static void update_curr(struct cfs_rq *cfs_rq)
@@ -306,8 +311,10 @@ static void update_curr(struct cfs_rq *c
u64 now = rq_of(cfs_rq)->clock;
unsigned long delta_exec;

- if (unlikely(!curr))
+ if (unlikely(!cfs_rq->nr_running))
return sync_vruntime(cfs_rq);
+ if (unlikely(!curr))
+ return;

/*
* Get the amount of time the current task was running


Attachments:
sched_debug.save2 (6.56 kB)

2007-09-23 11:37:58

by Mike Galbraith

Subject: Re: [git] CFS-devel, group scheduler, fixes

Hi,

I haven't chased down the exact scenario, but using a min_vruntime which
is about to change definitely seems to be what's causing my latency
woes. Does the below cure your fairness woes as well?

(first bit is just some debug format changes for your convenience if you
try it)

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_debug.c linux-2.6.23-rc7.d/kernel/sched_debug.c
--- git/linux-2.6.sched-devel/kernel/sched_debug.c 2007-09-22 13:37:32.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_debug.c 2007-09-21 12:12:27.000000000 +0200
@@ -67,7 +67,7 @@ print_task(struct seq_file *m, struct rq
(long long)(p->nvcsw + p->nivcsw),
p->prio);
#ifdef CONFIG_SCHEDSTATS
- SEQ_printf(m, "%15Ld.%06ld %15Ld.%06ld %15Ld.%06ld\n",
+ SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld\n",
SPLIT_NS(p->se.vruntime),
SPLIT_NS(p->se.sum_exec_runtime),
SPLIT_NS(p->se.sum_sleep_runtime));
@@ -83,10 +83,10 @@ static void print_rq(struct seq_file *m,

SEQ_printf(m,
"\nrunnable tasks:\n"
- " task PID tree-key switches prio"
- " exec-runtime sum-exec sum-sleep\n"
+ " task PID tree-key switches prio"
+ " exec-runtime sum-exec sum-sleep\n"
"------------------------------------------------------"
- "------------------------------------------------");
+ "----------------------------------------------------\n");

read_lock_irq(&tasklist_lock);

@@ -134,7 +134,7 @@ void print_cfs_rq(struct seq_file *m, in
spread0 = min_vruntime - rq0_min_vruntime;
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "spread0",
SPLIT_NS(spread0));
- SEQ_printf(m, " .%-30s: %ld\n", "spread0",
+ SEQ_printf(m, " .%-30s: %ld\n", "nr_sync_min_vruntime",
cfs_rq->nr_sync_min_vruntime);
}

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c 2007-09-22 13:37:32.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c 2007-09-23 12:53:11.000000000 +0200
@@ -290,14 +290,19 @@ __update_curr(struct cfs_rq *cfs_rq, str
static void sync_vruntime(struct cfs_rq *cfs_rq)
{
struct rq *rq;
- int cpu;
+ int cpu, wrote = 0;

for_each_online_cpu(cpu) {
rq = &per_cpu(runqueues, cpu);
+ if (!spin_trylock(&rq->lock))
+ continue;
cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
rq->cfs.min_vruntime);
+ spin_unlock(&rq->lock);
+ wrote++;
}
- schedstat_inc(cfs_rq, nr_sync_min_vruntime);
+ if (wrote)
+ schedstat_inc(cfs_rq, nr_sync_min_vruntime);
}

static void update_curr(struct cfs_rq *cfs_rq)
@@ -306,8 +311,10 @@ static void update_curr(struct cfs_rq *c
u64 now = rq_of(cfs_rq)->clock;
unsigned long delta_exec;

- if (unlikely(!curr))
+ if (unlikely(!cfs_rq->nr_running))
return sync_vruntime(cfs_rq);
+ if (unlikely(!curr))
+ return;

/*
* Get the amount of time the current task was running


2007-09-24 06:21:36

by Tong Li

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Sun, 23 Sep 2007, Mike Galbraith wrote:

> On Sat, 2007-09-22 at 12:01 +0200, Mike Galbraith wrote:
>> On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
>>> Mike,
>>>
>>> Could you try this patch to see if it solves the latency problem?
>>
>> No, but it helps some when running two un-pinned busy loops, one at nice
>> 0, and the other at nice 19. Yesterday I hit latencies of up to 1.2
>> _seconds_ doing this, and logging sched_debug and /proc/`pidof
>> Xorg`/sched from SCHED_RR shells.
>
> Looking at a log (snippet attached) from this morning with the last hunk
> of your patch still intact, it looks like min_vruntime is being modified
> after a task is queued. If you look at the snippet, you'll see the nice
> 19 bash busy loop on CPU1 with a vruntime of 3010385.345325, and one
> second later on CPU1 with its vruntime at 2814952.425082, but
> min_vruntime is 3061874.838356.

I think this could be what was happening: between the two seconds, CPU 0
becomes idle and it pulls the nice 19 task over via pull_task(), which
calls set_task_cpu(), which changes the task's vruntime to the current
min_vruntime of CPU 0 (in my patch). Then, after set_task_cpu(), CPU 0
calls activate_task(), which calls enqueue_task() and in turn
update_curr(). Now, nr_running on CPU 0 is 0, so sync_vruntime() gets
called and CPU 0's min_vruntime gets synced to the system max. Thus, the
nice 19 task now has a vruntime less than CPU 0's min_vruntime. I think
this can be fixed by adding the following code in set_task_cpu() before we
adjust p->vruntime:

if (!new_rq->cfs.nr_running)
        sync_vruntime(&new_rq->cfs);

> static void sync_vruntime(struct cfs_rq *cfs_rq)
> {
> struct rq *rq;
> - int cpu;
> + int cpu, wrote = 0;
>
> for_each_online_cpu(cpu) {
> rq = &per_cpu(runqueues, cpu);
> + if (spin_is_locked(&rq->lock))
> + continue;
> + smp_wmb();
> cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
> rq->cfs.min_vruntime);
> + wrote++;
> }
> - schedstat_inc(cfs_rq, nr_sync_min_vruntime);
> + if (wrote)
> + schedstat_inc(cfs_rq, nr_sync_min_vruntime);
> }

I think this rq->lock check works because it prevents the above scenario
(CPU 0 is in pull_task so it must hold the rq lock). But my concern is
that it may be too conservative, since sync_vruntime is called by
update_curr, which mostly gets called in enqueue_task() and
dequeue_task(), both of which are often invoked with the rq lock being
held. Thus, if we don't allow sync_vruntime when rq lock is held, the sync
will occur much less frequently and thus weaken cross-CPU fairness.

tong

2007-09-24 10:10:27

by Mike Galbraith

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Sun, 2007-09-23 at 23:21 -0700, Tong Li wrote:
> On Sun, 23 Sep 2007, Mike Galbraith wrote:
>
> > On Sat, 2007-09-22 at 12:01 +0200, Mike Galbraith wrote:
> >> On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
> >>> Mike,
> >>>
> >>> Could you try this patch to see if it solves the latency problem?
> >>
> >> No, but it helps some when running two un-pinned busy loops, one at nice
> >> 0, and the other at nice 19. Yesterday I hit latencies of up to 1.2
> >> _seconds_ doing this, and logging sched_debug and /proc/`pidof
> >> Xorg`/sched from SCHED_RR shells.
> >
> > Looking at a log (snippet attached) from this morning with the last hunk
> > of your patch still intact, it looks like min_vruntime is being modified
> > after a task is queued. If you look at the snippet, you'll see the nice
> > 19 bash busy loop on CPU1 with a vruntime of 3010385.345325, and one
> > second later on CPU1 with its vruntime at 2814952.425082, but
> > min_vruntime is 3061874.838356.
>
> I think this could be what was happening: between the two seconds, CPU 0
> becomes idle and it pulls the nice 19 task over via pull_task(), which
> calls set_task_cpu(), which changes the task's vruntime to the current
> min_vruntime of CPU 0 (in my patch). Then, after set_task_cpu(), CPU 0
> calls activate_task(), which calls enqueue_task() and in turn
> update_curr(). Now, nr_running on CPU 0 is 0, so sync_vruntime() gets
> called and CPU 0's min_vruntime gets synced to the system max. Thus, the
> nice 19 task now has a vruntime less than CPU 0's min_vruntime. I think
> this can be fixed by adding the following code in set_task_cpu() before we
> adjust p->vruntime:
>
> if (!new_rq->cfs.nr_running)
>         sync_vruntime(&new_rq->cfs);

Hmm. I can imagine Mondo-Boxen-R-Us folks getting upset with that.
Better would be, like Ingo said, to see if we can toss sync_vruntime(), and
I've been playing with that...

I found something this morning, and as usual, the darn thing turned out
to be dirt simple. With sync_vruntime() disabled, I found queues with
negative min_vruntime right from boot, and went hunting. Adding some
instrumentation to set_task_cpu() (annoying consequences), I found the
below:

Legend:
vrun: task's vruntime
old: old queue's min_vruntime
new: new queue's min_vruntime
result: what's gonna happen

[ 60.214508] kseriod vrun: 1427596999 old: 15070988657 new: 4065818654 res: -9577573004
[ 218.274661] konqueror vrun: 342076210254 old: 658982966338 new: 219203403598 res: -97703352486
[ 218.284657] init vrun: 411638725179 old: 659187735208 new: 219203492472 res: -28345517557
[...]
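
(A hypothetical reconstruction of that instrumentation - 'res' is simply
the task's vruntime minus the migration adjustment (old - new); e.g. for
kseriod: 1427596999 - (15070988657 - 4065818654) = -9577573004:)

        long long res = (long long)(p->se.vruntime -
                (old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime));

        printk(KERN_DEBUG "%s vrun: %llu old: %llu new: %llu res: %lld\n",
                p->comm,
                (unsigned long long)p->se.vruntime,
                (unsigned long long)old_rq->cfs.min_vruntime,
                (unsigned long long)new_rq->cfs.min_vruntime, res);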

A task which hasn't run in so long that the queues have progressed
further than its vruntime is going to end up with a negative vruntime.
Looking at place_entity(), it looks like it's supposed to fix that up,
but due to the order of arguments passed to max_vruntime(), and the
unsigned comparison therein, it won't.

Running with the patchlet below, my box _so far_ has not become
terminally unhappy despite spread0. I'm writing this with the pinned
hogs test running right now, and all is well, so I _think_ it might be
ok to just remove sync_vruntime() after all.

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c 2007-09-23 14:48:18.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c 2007-09-24 11:02:05.000000000 +0200
@@ -117,7 +117,7 @@ static inline struct task_struct *task_o
static inline u64
max_vruntime(u64 min_vruntime, u64 vruntime)
{
- if ((vruntime > min_vruntime) ||
+ if (((s64)vruntime > (s64)min_vruntime) ||
(min_vruntime > (1ULL << 61) && vruntime < (1ULL << 50)))
min_vruntime = vruntime;

@@ -310,7 +310,7 @@ static void update_curr(struct cfs_rq *c
unsigned long delta_exec;

if (unlikely(!cfs_rq->nr_running))
- return sync_vruntime(cfs_rq);
+ return ;//sync_vruntime(cfs_rq);
if (unlikely(!curr))
return;



2007-09-24 10:24:46

by Peter Zijlstra

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Mon, 24 Sep 2007 12:10:09 +0200 Mike Galbraith <[email protected]> wrote:


> @@ -117,7 +117,7 @@ static inline struct task_struct *task_o
> static inline u64
> max_vruntime(u64 min_vruntime, u64 vruntime)
> {
> - if ((vruntime > min_vruntime) ||
> + if (((s64)vruntime > (s64)min_vruntime) ||
> (min_vruntime > (1ULL << 61) && vruntime < (1ULL << 50)))
> min_vruntime = vruntime;
>

how about something like:

s64 delta = (s64)(vruntime - min_vruntime);
if (delta > 0)
min_vruntime += delta;

That would rid us of most of the funny conditionals there.
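
(Folded into the helper itself, the suggestion would look something like
this sketch; the signed delta orders values correctly even across a u64
wrap, which makes the explicit (1ULL << 61)/(1ULL << 50) special case
unnecessary:)

static inline u64 max_vruntime(u64 min_vruntime, u64 vruntime)
{
        s64 delta = (s64)(vruntime - min_vruntime);

        if (delta > 0)
                min_vruntime += delta;

        return min_vruntime;
}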

2007-09-24 10:42:28

by Mike Galbraith

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:

> how about something like:
>
> s64 delta = (s64)(vruntime - min_vruntime);
> if (delta > 0)
> min_vruntime += delta;
>
> That would rid us of most of the funny conditionals there.

That still left me with negative min_vruntimes. The pinned hogs didn't
lock my box up, but I quickly got the below, so hastily killed it.

se.wait_max : 7.846949
se.wait_max : 301.951601
se.wait_max : 7.071359

-Mike

2007-09-24 11:09:12

by Peter Zijlstra

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Mon, 24 Sep 2007 12:42:15 +0200 Mike Galbraith <[email protected]> wrote:

> On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
>
> > how about something like:
> >
> > s64 delta = (s64)(vruntime - min_vruntime);
> > if (delta > 0)
> > min_vruntime += delta;
> >
> > That would rid us of most of the funny conditionals there.
>
> That still left me with negative min_vruntimes. The pinned hogs didn't
> lock my box up, but I quickly got the below, so hastily killed it.
>
> se.wait_max : 7.846949
> se.wait_max : 301.951601
> se.wait_max : 7.071359
>

Odd, the idea (which I think is clear) is that min_vruntime can wrap
around the u64 spectrum. And by using min_vruntime as an offset to base
the key around, we get a signed but limited-range key-space (because
we update min_vruntime to be the leftmost task's vruntime, in a
monotonic fashion).

So I'm having trouble with these patches, that is, both your wrap
around condition of:

if (likely(new_rq->cfs.min_vruntime))

as well as the last patchlet:

if (((s64)vruntime > (s64)min_vruntime) ||

in that neither of these changes makes sense in what it's trying to do.

It's perfectly valid for min_vruntime to exist in 1ULL << 63.
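
(A tiny user-space demo of why the signed delta is wrap-safe -
hypothetical illustration, with u64/s64 as stand-ins for the kernel
types:)

#include <stdio.h>
#include <stdint.h>

typedef uint64_t u64;
typedef int64_t s64;

int main(void)
{
        u64 a = ~0ULL - 5;      /* near the top of the u64 range */
        u64 b = a + 10;         /* wraps around to 4             */

        /* a naive unsigned compare gets the order backwards: */
        printf("b > a (unsigned): %d\n", b > a);                     /* 0  */

        /* the signed delta still sees b as 10 units "after" a: */
        printf("(s64)(b - a)    : %lld\n", (long long)(s64)(b - a)); /* 10 */

        return 0;
}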

2007-09-24 11:22:29

by Mike Galbraith

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Mon, 2007-09-24 at 12:42 +0200, Mike Galbraith wrote:
> On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
>
> > how about something like:
> >
> > s64 delta = (s64)(vruntime - min_vruntime);
> > if (delta > 0)
> > min_vruntime += delta;
> >
> > That would rid us of most of the funny conditionals there.
>
> That still left me with negative min_vruntimes. The pinned hogs didn't
> lock my box up, but I quickly got the below, so hastily killed it.

Shouldn't the max() in place_entity() be the max_vruntime() that my
lysdexia told me it was when I looked at it earlier? ;-)

-Mike

2007-09-24 11:43:26

by Mike Galbraith

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Mon, 2007-09-24 at 13:08 +0200, Peter Zijlstra wrote:

> It's perfectly valid for min_vruntime to exist in 1ULL << 63.

But the backward timewarp on wrap is what's killing my box.

-Mike

2007-09-24 11:52:17

by Peter Zijlstra

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Mon, 24 Sep 2007 13:22:14 +0200 Mike Galbraith <[email protected]> wrote:

> On Mon, 2007-09-24 at 12:42 +0200, Mike Galbraith wrote:
> > On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
> >
> > > how about something like:
> > >
> > > s64 delta = (s64)(vruntime - min_vruntime);
> > > if (delta > 0)
> > > min_vruntime += delta;
> > >
> > > That would rid us of most of the funny conditionals there.
> >
> > That still left me with negative min_vruntimes. The pinned hogs didn't
> > lock my box up, but I quickly got the below, so hastily killed it.
>
> Shouldn't the max() in place_entity() be the max_vruntime() that my
> lysdexia told me it was when I looked at it earlier? ;-)
>

probably, my tree doesn't have that max anymore so I'm not sure.

2007-09-24 16:48:22

by Tong Li

Subject: Re: [git] CFS-devel, group scheduler, fixes

On Mon, 24 Sep 2007, Peter Zijlstra wrote:

> On Mon, 24 Sep 2007 13:22:14 +0200 Mike Galbraith <[email protected]> wrote:
>
>> On Mon, 2007-09-24 at 12:42 +0200, Mike Galbraith wrote:
>>> On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
>>>
>>>> how about something like:
>>>>
>>>> s64 delta = (s64)(vruntime - min_vruntime);
>>>> if (delta > 0)
>>>> min_vruntime += delta;
>>>>
>>>> That would rid us of most of the funny conditionals there.
>>>
>>> That still left me with negative min_vruntimes. The pinned hogs didn't
>>> lock my box up, but I quickly got the below, so hastily killed it.
>>
>> Shouldn't the max() in place_entity() be the max_vruntime() that my
>> lysdexia told me it was when I looked at it earlier? ;-)
>>
>
> probably, my tree doesn't have that max anymore so I'm not sure.
>

Last time I looked, I thought the max() in place_entity() was fine and the
problem seemed to be in set_task_cpu(), which was causing the negative
vruntimes:

if (likely(new_rq->cfs.min_vruntime))
p->se.vruntime -= old_rq->cfs.min_vruntime -
new_rq->cfs.min_vruntime;

I think it's fine if we get rid of sync_vruntime(). I need to think more
about how to make global fairness work based on this--it seems to be more
complicated than if we had sync_vruntime().

tong