LinuxLists.cc - [git pull request] scheduler updates

2007-08-12 16:32:41

Subject: [git pull request] scheduler updates

Linus, please pull the latest scheduler git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

three bugfixes:

- a nice fix from eagle-eye Oleg for a subtle typo in the balancing
code, the effect of this bug was more agressive idle balancing. This
bug was introduced by one of the original CFS commits.

- a round of global->static fixes from Adrian Bunk - this change,
besides the cleanup effect, chops 100 bytes off sched.o.

- Peter Zijlstra noticed a sleeper-bonus bug. I kept this patch under
observation and testing this past week and saw no ill effects so far.
It could fix two suspected regressions. (It could improve Kasper
Sandberg's workload and it could improve the sleeper/runner
problem/bug Roman Zippel was seeing.)

test-built and test-booted on x86-32 and x86-64, and did a dozen of
randconfig builds for good measure (which uncovered two new build errors
in latest -git).

Thanks,

Ingo

--------------->
Adrian Bunk (1):
sched: make global code static

Ingo Molnar (1):
sched: fix sleeper bonus

Oleg Nesterov (1):
sched: run_rebalance_domains: s/SCHED_IDLE/CPU_IDLE/

include/linux/cpu.h | 2 --
kernel/sched.c | 48 ++++++++++++++++++++++++------------------------
kernel/sched_fair.c | 12 ++++++------
3 files changed, 30 insertions(+), 32 deletions(-)

2007-08-14 08:39:11

by Christian Borntraeger

[permalink] [raw]

Subject: Re: [accounting regression since rc1] scheduler updates

This is a 2nd try with correct email address, sorry for the duplicates.

Am Sonntag, 12. August 2007 schrieb Ingo Molnar:
> Linus, please pull the latest scheduler git tree from:

Hello Ingo,

this is a followup to the discussion in
http://lkml.org/lkml/2007/7/19/538

Since 2.6.12, s390 already does precise accouting for system and user time.
Depending on CONFIG_VIRT_CPU_ACCOUNTING, we use two 64bit hardware timers on
s390: the first returns the wall clock time and is stepped even if the
virtual cpu is not backed by a physical cpu. The second timer is only
stepped, when the virtual cpu is backed by a physical cpu. The timers have a
very high accurancy, and the architecture guarantees that bit 51 is increased
by one/microsecond. We store both timers on each context switch, irq,
syscall, and machinecheck in entry.S. The calculation are made in
arch/s390/kernel/vtime.c in accouting_system_vtime and friends with
microsecond accurracy. This is also used for irq accouting (see the
definition of irq_enter). It basically boils down to precise numbers in the
cpu stat and the utime/stime for processes as well as knowledge about time
stolen by the hypervisor.

With CFS the accounting was changed, and everything is now based on
sum_exec_runtime. There is now an accounting regression on s390 (and maybe
ppc64), as the default jiffy implemenation does not know anything about
virtual cpus.

While looking for a solution, I started with a very quick hack and reverted
b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa for the procfs related changes. If I
revert that commit, it seems that I get the old behaviour - but of course
this is just a hack.

I see some options now:

1. Jan could finish his sched_clock implementation for s390 and we would get
close to the precise numbers. This would also let CFS make better decisions.
Downside: its not as precise as before as we do some math on the numbers and
it will burn cycles to compute numbers we already have
(utime=sum*utime/stime).
2. set sum_exec_runtime based on the precise utime and stime. Dont know enough
about CFS if this would show different scheduling behaviour than 1
3. ifdef fs/proc/array.c depending on CONFIG_VIRT_CPU_ACCOUNTING. This will
save some cycles, and the numbers are precise to a microsecond. Downside: the
scheduler gets no information about virtual cpus and steal time so its
probably not completely fair
4. implement sched_clock AND reuse the exisiting utime and stime numbers.
5. other clever solutions I cannot see

Any suggestions?

Christian

2007-08-16 08:19:05

by Christian Borntraeger

[permalink] [raw]

Subject: [PATCH][RFC] Re: accounting regression since rc1

Ingo,

this patch fixes the accounting regression for CONFIG_VIRT_CPU_ACCOUNTING. It
reverts parts of commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa by converting
fs/proc/array.c back to cputime_t. The new functions task_utime and
task_stime now return cputime_t instead of clock_t. If
CONFIG_VIRT_CPU_ACCOUTING is set, task->utime and task->stime are returned
directly instead of using sum_exec_runtime.

Patch is tested on s390x with and without VIRT_CPU_ACCOUTING as well as on
i386. Not tested on ppc64 - Paul?

Feedback is welcome.

Signed-Off-By: Christian Borntraeger <[email protected]>

---
fs/proc/array.c | 41 ++++++++++++++++++++++++++---------------
1 file changed, 26 insertions(+), 15 deletions(-)

Index: linux-2.6/fs/proc/array.c
===================================================================
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -320,7 +320,18 @@ int proc_pid_status(struct task_struct *
return buffer - orig;
}

-static clock_t task_utime(struct task_struct *p)
+#if defined(CONFIG_VIRT_CPU_ACCOUNTING)
+static cputime_t task_utime(struct task_struct *p)
+{
+ return p->utime;
+}
+
+static cputime_t task_stime(struct task_struct *p)
+{
+ return p->stime;
+}
+#else
+static cputime_t task_utime(struct task_struct *p)
{
clock_t utime = cputime_to_clock_t(p->utime),
total = utime + cputime_to_clock_t(p->stime);
@@ -337,10 +348,10 @@ static clock_t task_utime(struct task_st
}
utime = (clock_t)temp;

- return utime;
+ return clock_t_to_cputime(utime);
}

-static clock_t task_stime(struct task_struct *p)
+static cputime_t task_stime(struct task_struct *p)
{
clock_t stime;

@@ -349,10 +360,12 @@ static clock_t task_stime(struct task_st
* the total, to make sure the total observed by userspace
* grows monotonically - apps rely on that):
*/
- stime = nsec_to_clock_t(p->se.sum_exec_runtime) - task_utime(p);
+ stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
+ cputime_to_clock_t(task_utime(p));

- return stime;
+ return clock_t_to_cputime(stime);
}
+#endif

static int do_task_stat(struct task_struct *task, char *buffer, int whole)
{
@@ -368,8 +381,7 @@ static int do_task_stat(struct task_stru
unsigned long long start_time;
unsigned long cmin_flt = 0, cmaj_flt = 0;
unsigned long min_flt = 0, maj_flt = 0;
- cputime_t cutime, cstime;
- clock_t utime, stime;
+ cputime_t cutime, cstime, utime, stime;
unsigned long rsslim = 0;
char tcomm[sizeof(task->comm)];
unsigned long flags;
@@ -387,8 +399,7 @@ static int do_task_stat(struct task_stru

sigemptyset(&sigign);
sigemptyset(&sigcatch);
- cutime = cstime = cputime_zero;
- utime = stime = 0;
+ cutime = cstime = utime = stime = cputime_zero;

rcu_read_lock();
if (lock_task_sighand(task, &flags)) {
@@ -414,15 +425,15 @@ static int do_task_stat(struct task_stru
do {
min_flt += t->min_flt;
maj_flt += t->maj_flt;
- utime += task_utime(t);
- stime += task_stime(t);
+ utime = cputime_add(utime, task_utime(t));
+ stime = cputime_add(stime, task_stime(t));
t = next_thread(t);
} while (t != task);

min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
- utime += cputime_to_clock_t(sig->utime);
- stime += cputime_to_clock_t(sig->stime);
+ utime = cputime_add(utime, sig->utime);
+ stime = cputime_add(stime, sig->stime);
}

sid = signal_session(sig);
@@ -471,8 +482,8 @@ static int do_task_stat(struct task_stru
cmin_flt,
maj_flt,
cmaj_flt,
- utime,
- stime,
+ cputime_to_clock_t(utime),
+ cputime_to_clock_t(stime),
cputime_to_clock_t(cutime),
cputime_to_clock_t(cstime),
priority,

2007-08-20 15:46:37