Andi,
You have mentioned the stream benchmark when reporting on the
performance of the Opteron NUMA sched-domains scheduler. I am trying to
reproduce your results and am struggling with the benchmark. Can you
provide the details of the tests you ran? Namely, your compiler
settings, compile command line, and your value of N. Also, I didn't see
how to specify the number of threads to run; how did you specify that?
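For what it's worth, my working assumption is that stream_omp behaves
like any other OpenMP binary and takes its thread count from the
OMP_NUM_THREADS environment variable. The sketch below illustrates that
assumption; it is my own code, not taken from the STREAM sources:

/*
 * Minimal OpenMP thread-count check. This only demonstrates my
 * assumption about how stream_omp picks its thread count.
 * Build with an OpenMP-capable compiler, e.g.: pgcc -mp check.c -o check
 * Run:   OMP_NUM_THREADS=4 ./check
 */
#include <stdio.h>
#include <omp.h>

int main(void)
{
#pragma omp parallel
	{
#pragma omp master
		/* omp_get_num_threads() reports the size of the team */
		printf("running with %d threads\n", omp_get_num_threads());
	}
	return 0;
}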
I have a 4-way 1.4 GHz Opteron machine with 1 MB cache and 7 GB of RAM.
When I ran the stream_omp benchmark with N=4000000, I got nearly
identical results (within statistical noise) from 2.6.5, 2.6.5-mm5, and
2.6.5-mm5-with_flat_domains (the patch I sent earlier). That is clearly
not what was expected, so I assume I am not running or building the
benchmark correctly. I found the project's build system (none) and docs
(minimal) to be lacking.
You mentioned your problem was fixed by some of "Ingo's tweaks". Which
patches are these tweaks in, and are they in the mm tree yet?
Thanks,
Darren
Darren Hart <[email protected]> writes:
> Andi,
>
> You have mentioned the stream benchmark when reporting on the
> performance of the Opteron NUMA sched-domains scheduler. I am trying to
> reproduce your results and am struggling with the benchmark. Can you
> provide the details of the tests you ran? Namely, your compiler
> settings, compile command line, and your value of N. Also, I didn't see
> how to specify the number of threads to run; how did you specify that?
> I have a 4-way 1.4 GHz Opteron machine with 1 MB cache and 7 GB of RAM.
I didn't actually compile them myself; someone sent me executables
compiled with the PGI compiler. Maybe your compiler has a different
runtime and behaves differently?
You can find them and my test script that tests everything in
ftp://ftp.suse.com/pub/people/ak/bench/stream.tar.gz
> When I ran the stream_omp benchmark with N=4000000, I got nearly
> identical results (within statistical noise) from 2.6.5, 2.6.5-mm5, and
> 2.6.5-mm5-with_flat_domains (the patch I sent earlier). That is clearly
> not what was expected, so I assume I am not running or building the
> benchmark correctly. I found the project's build system (none) and docs
> (minimal) to be lacking.
>
> You mentioned your problem was fixed by some of "Ingo's tweaks". Which
> patches are these tweaks in, and are they in the mm tree yet?
I don't think so, no. They were posted by Ingo Molnar to linux-kernel;
check the archives. They just enable aggressive balancing
at fork/clone time.
I've attached the patch I tested.
-Andi
I've attached sched-balance-context.patch, which is the current version
of fork()/clone() balancing, against 2.6.5-rc3-mm1.
Changes:
- only balance CLONE_VM threads
- take ->cpus_allowed into account when balancing.
I've checked kernel recompiles, and while they were not hurt by fork()
balancing on an 8-way SMP box, I implemented the thread-only balancing
nevertheless.
Ingo
[Attachment: sched-balance-context.patch]
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -715,12 +715,17 @@ extern void do_timer(struct pt_regs *);
 extern int FASTCALL(wake_up_state(struct task_struct * tsk, unsigned int state));
 extern int FASTCALL(wake_up_process(struct task_struct * tsk));
+extern void FASTCALL(wake_up_forked_process(struct task_struct * tsk));
 #ifdef CONFIG_SMP
 extern void kick_process(struct task_struct *tsk);
+extern void FASTCALL(wake_up_forked_thread(struct task_struct * tsk));
 #else
 static inline void kick_process(struct task_struct *tsk) { }
+static inline void wake_up_forked_thread(struct task_struct * tsk)
+{
+	return wake_up_forked_process(tsk);
+}
 #endif
-extern void FASTCALL(wake_up_forked_process(struct task_struct * tsk));
 extern void FASTCALL(sched_fork(task_t * p));
 extern void FASTCALL(sched_exit(task_t * p));
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -1139,6 +1137,119 @@ enum idle_type
 };
 #ifdef CONFIG_SMP
+
+/*
+ * find_idlest_cpu - find the least busy runqueue.
+ */
+static int find_idlest_cpu(int this_cpu, runqueue_t *this_rq, cpumask_t mask)
+{
+	unsigned long load, min_load, this_load;
+	int i, min_cpu;
+	cpumask_t tmp;
+
+	min_cpu = UINT_MAX;
+	min_load = ULONG_MAX;
+
+	cpus_and(tmp, mask, cpu_online_map);
+	for_each_cpu_mask(i, tmp) {
+		load = cpu_load(i);
+
+		if (load < min_load) {
+			min_cpu = i;
+			min_load = load;
+
+			/* break out early on an idle CPU: */
+			if (!min_load)
+				break;
+		}
+	}
+
+	/* add +1 to account for the new task */
+	this_load = cpu_load(this_cpu) + SCHED_LOAD_SCALE;
+
+	/*
+	 * Would there be an imbalance between this CPU and the
+	 * idlest CPU if the new task were added to the current
+	 * CPU?
+	 */
+	if (min_load*this_rq->sd->imbalance_pct < 100*this_load)
+		return min_cpu;
+
+	return this_cpu;
+}
+
+/*
+ * wake_up_forked_thread - wake up a freshly forked thread.
+ *
+ * This function will do some initial scheduler statistics housekeeping
+ * that must be done for every newly created context, and it also does
+ * runqueue balancing.
+ */
+void fastcall wake_up_forked_thread(task_t * p)
+{
+	unsigned long flags;
+	int this_cpu = get_cpu(), cpu;
+	runqueue_t *this_rq = cpu_rq(this_cpu), *rq;
+
+	/*
+	 * Migrate the new context to the least busy CPU,
+	 * if that CPU is out of balance.
+	 */
+	cpu = find_idlest_cpu(this_cpu, this_rq, p->cpus_allowed);
+
+	local_irq_save(flags);
+lock_again:
+	rq = cpu_rq(cpu);
+	double_rq_lock(this_rq, rq);
+
+	BUG_ON(p->state != TASK_RUNNING);
+
+	/*
+	 * We did find_idlest_cpu() unlocked, so in theory
+	 * the mask could have changed:
+	 */
+	if (!cpu_isset(cpu, p->cpus_allowed)) {
+		cpu = any_online_cpu(p->cpus_allowed);
+		double_rq_unlock(this_rq, rq);
+		goto lock_again;
+	}
+	/*
+	 * We decrease the sleep average of forking parents
+	 * and children as well, to keep max-interactive tasks
+	 * from forking tasks that are max-interactive.
+	 */
+	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
+		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
+
+	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
+		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
+
+	p->interactive_credit = 0;
+
+	p->prio = effective_prio(p);
+	set_task_cpu(p, cpu);
+
+	if (cpu == this_cpu) {
+		if (unlikely(!current->array))
+			__activate_task(p, rq);
+		else {
+			p->prio = current->prio;
+			list_add_tail(&p->run_list, &current->run_list);
+			p->array = current->array;
+			p->array->nr_active++;
+			rq->nr_running++;
+		}
+	} else {
+		__activate_task(p, rq);
+		if (TASK_PREEMPTS_CURR(p, rq))
+			resched_task(rq->curr);
+	}
+
+	double_rq_unlock(this_rq, rq);
+	local_irq_restore(flags);
+	put_cpu();
+}
+
 /*
  * If dest_cpu is allowed for this process, migrate the task to it.
  * This is accomplished by forcing the cpu_allowed mask to only
--- linux/kernel/fork.c.orig
+++ linux/kernel/fork.c
@@ -1179,9 +1179,23 @@ long do_fork(unsigned long clone_flags,
 		set_tsk_thread_flag(p, TIF_SIGPENDING);
 	}
-	if (!(clone_flags & CLONE_STOPPED))
-		wake_up_forked_process(p);	/* do this last */
-	else
+	if (!(clone_flags & CLONE_STOPPED)) {
+		/*
+		 * Do the wakeup last. On SMP we treat fork() and
+		 * CLONE_VM separately, because fork() has already
+		 * created cache footprint on this CPU (due to
+		 * copying the pagetables), hence migration would
+		 * probably be costly. Threads on the other hand
+		 * have less traction to the current CPU, and if
+		 * there's an imbalance then the scheduler can
+		 * migrate this fresh thread now, before it
+		 * accumulates a larger cache footprint:
+		 */
+		if (clone_flags & CLONE_VM)
+			wake_up_forked_thread(p);
+		else
+			wake_up_forked_process(p);
+	} else
 		p->state = TASK_STOPPED;
 	++total_forks;
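The migration test in find_idlest_cpu() above boils down to a
percentage comparison: the fresh thread is only moved if the idlest
CPU's load, scaled by the domain's imbalance_pct, is still below this
CPU's load with the new task counted in. A standalone sketch of the
same arithmetic follows; the imbalance_pct value of 125 is a made-up
example here, the real value comes from the sched-domain setup:

/*
 * Standalone illustration of the imbalance test in find_idlest_cpu().
 * imbalance_pct = 125 is an example value only; in the kernel it is
 * configured per sched-domain.
 */
#include <stdio.h>

#define SCHED_LOAD_SCALE 128UL	/* same scale factor the 2.6 kernel uses */

static int would_migrate(unsigned long min_load, unsigned long this_load,
			 unsigned long imbalance_pct)
{
	/* identical to the test in the patch */
	return min_load * imbalance_pct < 100 * this_load;
}

int main(void)
{
	/* this CPU: two runnable tasks, plus the freshly forked thread */
	unsigned long this_load = 2 * SCHED_LOAD_SCALE + SCHED_LOAD_SCALE;
	/* idlest CPU: one runnable task */
	unsigned long min_load = 1 * SCHED_LOAD_SCALE;

	/* 128*125 = 16000 < 100*384 = 38400, so the thread migrates */
	printf("migrate the new thread: %s\n",
	       would_migrate(min_load, this_load, 125) ? "yes" : "no");
	return 0;
}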
On Tue, 2004-04-20 at 11:58, Andi Kleen wrote:
> Darren Hart <[email protected]> writes:
>
> > Andi,
> >
> > You have mentioned the stream benchmark when reporting on the
> > performance of the Opteron NUMA sched-domains scheduler. I am trying to
> > reproduce your results and am struggling with the benchmark. Can you
> > provide the details of the tests you ran? Namely, your compiler
> > settings, compile command line, and your value of N. Also, I didn't see
> > how to specify the number of threads to run; how did you specify that?
> > I have a 4-way 1.4 GHz Opteron machine with 1 MB cache and 7 GB of RAM.
>
> I didn't actually compile them myself; someone sent me executables
> compiled with the PGI compiler. Maybe your compiler has a different
> runtime and behaves differently?
>
> You can find them and my test script that tests everything in
> ftp://ftp.suse.com/pub/people/ak/bench/stream.tar.gz
Thanks Andi,
I noticed your binary ran with N=2000000, which is only sufficient for a
2-proc, 1 MB cache Opteron box according to the documentation in the
STREAM FAQ. I also noticed wide variation in results (25% or so) when
running with 4 threads on a 4-proc Opteron on linux-2.6.5-mm5. Can you
provide me with the specs of the system you ran your tests on?
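To show my arithmetic: the sizing rule, as I read it in the FAQ, is
that each of the three arrays should be at least four times the total
last-level cache in the system. A quick check with my box's numbers
(the parameters below are only illustrative, not from the STREAM
sources):

/*
 * Rough check of the STREAM sizing rule as I read it from the FAQ:
 * each array >= 4x the total last-level cache.
 */
#include <stdio.h>

int main(void)
{
	long cache_per_cpu = 1L << 20;			/* 1 MB L2 per Opteron */
	long ncpus = 4;					/* my 4-way box */
	long min_bytes = 4 * ncpus * cache_per_cpu;	/* 16 MB per array */
	long min_n = min_bytes / (long)sizeof(double);	/* minimum N */

	/* prints 2097152: N=2000000 falls just short on 4 procs,
	 * while the 2-proc minimum of 1048576 is comfortably met */
	printf("minimum N for %ld CPUs: %ld\n", ncpus, min_n);
	return 0;
}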
Thanks,
Darren
> I noticed your binary ran with N=2000000, which is only sufficient for a
> 2-proc, 1 MB cache Opteron box according to the documentation in the
It does not seem to make any difference.
> STREAM FAQ. I also noticed wide variation in results (25% or so) when
> running with 4 threads on a 4-proc Opteron on linux-2.6.5-mm5. Can you
> provide me with the specs of the system you ran your tests on?
Yes, mm5 is still broken because it has the "tuned to numasaurus" NUMA
scheduler. Run it on a standard (non-mm*) kernel or with Ingo's early
load-balancing patch.
-Andi
Andi Kleen wrote:
>>I noticed your binary ran with N=2000000, which is only sufficient for a
>>2-proc, 1 MB cache Opteron box according to the documentation in the
>
>
> It does not seem to make any difference.
>
>
>>STREAM FAQ. I also noticed wide variation in results (25% or so) when
>>running with 4 threads on a 4-proc Opteron on linux-2.6.5-mm5. Can you
>>provide me with the specs of the system you ran your tests on?
>
>
> Yes, mm5 is still broken because it has the "tuned to numasaurus" NUMA
> scheduler. Run it on a standard (non-mm*) kernel or with Ingo's early
> load-balancing patch.
>
Now what is wrong with it? I thought you said it is OK
now that Ingo's balance-on-clone is implemented, and
that you also saw similar variation in results with
numasched?
> Now what is wrong with it? I thought you said it is OK
Old problem; STREAM does not scale linearly as it should.
> now that Ingo's balance-on-clone is implemented, and
> that you also saw similar variation in results with
> numasched?
I said it was fixed with Ingo's patch, nothing more.
The patch isn't in mm5 yet as far as I know.
There were some small variations left (a few percent,
much less than 25%), but these were there even with 2.4.
-Andi
* Andi Kleen <[email protected]> wrote:
> I said it was fixed with Ingo's patch, nothing more.
> The patch isn't in mm5 yet as far as I know.
sched-balance-context.patch in recent -mm kernels is a morphed version
of my previous patch. Could you try something like 2.6.6-rc2-mm2 and
check whether performance is linear?
Ingo
On Mon, 2004-04-26 at 19:33, Andi Kleen wrote:
> > I noticed your binary ran with N=2000000, which is only sufficient for a
> > 2-proc, 1 MB cache Opteron box according to the documentation in the
>
> It does not seem to make any difference.
I was under the impression you didn't change the N value (array size)
and ran the benchmark with someone else's precompiled binaries (the ones
you sent me). Did you have two binaries with different array sizes
compiled in, or did I miss some way of configuring that? The
documentation was admittedly sparse.
> > STREAM FAQ. I also noticed wide variation in results (25% or so) when
> > running with 4 threads on a 4-proc Opteron on linux-2.6.5-mm5. Can you
> > provide me with the specs of the system you ran your tests on?
>
> Yes, mm5 is still broken because it has the "tuned to numasaurus" NUMA
> scheduler. Run it on a standard (non-mm*) kernel or with Ingo's early
> load-balancing patch.
I ran it on 2.6.5, 2.6.5-mm5, and 2.6.5-mm5-flat-domains, trying to
reproduce the results you found (including the poor performance of
virgin and mm) so that I can have some context while analyzing the
sched_domains topology on x86_64 and its effects on performance. So
that I can see where the differences lie in our tests, could you please
provide some of the specs of the system you ran on, such as the number
of procs, cache size, and amount of RAM?
Thanks,
--Darren
On Tue, Apr 27, 2004 at 09:47:19AM -0700, Darren Hart wrote:
> On Mon, 2004-04-26 at 19:33, Andi Kleen wrote:
> > > I noticed your binary ran with N=2000000, which is only sufficient for a
> > > 2-proc, 1 MB cache Opteron box according to the documentation in the
> >
> > It does not seem to make any difference.
>
> I was under the impression you didn't change the N value (array size)
> and ran the benchmark with someone else's precompiled binaries (the ones
> you sent me). Did you have two binaries with different array sizes
Correct. I always used 2000000, but it did not seem to make
any difference in showing the scheduler issues, even when going
to 4 CPUs.
(There were some fluctuations, but much less than the 25% you reported.)
> > > STREAM FAQ. I also noticed wide variation in results (25% or so) when
> > > running with 4 threads on a 4-proc Opteron on linux-2.6.5-mm5. Can you
> > > provide me with the specs of the system you ran your tests on?
> >
> > Yes, mm5 is still broken because it has the "tuned to numasaurus" NUMA
> > scheduler. Run it on a standard (non-mm*) kernel or with Ingo's early
> > load-balancing patch.
>
> I ran it on 2.6.5, 2.6.5-mm5, and 2.6.5-mm5-flat-domains, trying to
> reproduce the results you found (including the poor performance of
> virgin and mm) so that I can have some context while analyzing the
> sched_domains topology on x86_64 and its effects on performance. So
> that I can see where the differences lie in our tests, could you please
> provide some of the specs of the system you ran on, such as the number
> of procs, cache size, and amount of RAM?
I saw the issue on a wide range of systems, from 2 CPUs to 4 CPUs,
all with enough per-CPU memory to fit the benchmark.
-Andi
> Darren Hart <[email protected]> writes:
>
> [...]
>
> I didn't actually compile them myself; someone sent me executables
> compiled with the PGI compiler. Maybe your compiler has a different
> runtime and behaves differently?
>
> You can find them and my test script that tests everything in
> ftp://ftp.suse.com/pub/people/ak/bench/stream.tar.gz
Is this benchmark available for other machines, e.g. for IA-64?
Can I have a copy for IA-64, please?
Thanks,
Zoltán Menyhárt
On Wed, Apr 28, 2004 at 11:47:39AM +0200, Zoltan Menyhart wrote:
> > Darren Hart <[email protected]> writes:
> >
> > [...]
> >
> > I didn't actually compile them myself; someone sent me executables
> > compiled with the PGI compiler. Maybe your compiler has a different
> > runtime and behaves differently?
> >
> > You can find them and my test script that tests everything in
> > ftp://ftp.suse.com/pub/people/ak/bench/stream.tar.gz
>
> Is this benchmark available for other machines, e.g. for IA-64?
> Can I have a copy for IA-64, please?
http://www.cs.virginia.edu/stream/
Compile the OpenMP C or Fortran version with an OpenMP-capable compiler
and run it in parallel on each CPU.
Exact results may also vary with the OpenMP runtime.
-Andi