Once upon a time, this patch was in -mm tree (2.6.13-mm1):
http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2
It is neither in Linus's official tree, nor is it in -mm anymore.
I guess I missed the objection that led to dropping the patch, so I'm
bringing up this discussion again. The wake-up path is a lot hotter on
a NUMA system running a database benchmark. Even on a moderate 8P NUMA
box, __wake_up and try_to_wake_up show up as the #1 and #4 hottest
kernel functions, while on a comparable 4P SMP box these two functions
are #5 and #9 respectively.
I think the situation will be worse on a 32P NUMA box in the wake-up
path. I don't have any measurements on a 32P setup yet, because 8P NUMA
performance sucks at the moment and that blocks us from proceeding to
any bigger setup.
Execution profile for 8P numa box [1]:

     Symbol              Clockticks  Inst. Retired  L3 Misses
#1   __wake_up                8.08%          1.88%      4.67%
#2   finish_task_switch       7.53%         18.11%      5.82%
#3   __make_request           6.87%          2.09%      4.35%
#4   try_to_wake_up           5.57%          0.64%      3.10%

Execution profile for 4P SMP box [2]:

     Symbol              Clockticks
#5   __wake_up                3.57%
#9   try_to_wake_up           2.38%

My question is: what was the reason this patch was dropped, and what
can we do to improve wake-up performance? In my opinion, we should
simply put the task on the CPU it previously ran on and let
rebalance_tick and load_balance_newidle balance out the load.
- Ken
[1] 8 processor: 1.6 GHz Itanium2 processor, 9M L3. 256 GB memory
[2] 4 processor: 1.6 GHz Itanium2 processor, 9M L3. 128 GB memory
* Chen, Kenneth W <[email protected]> wrote:
> Once upon a time, this patch was in -mm tree (2.6.13-mm1):
> http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2
>
> It is neither in Linus's official tree, nor it is in -mm anymore.
it's sched-better-wake-balancing-3.patch, in 2.6.14-rc5-mm1.
Ingo
Chen, Kenneth W wrote:
> Once upon a time, this patch was in -mm tree (2.6.13-mm1):
> http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2
>
> It is neither in Linus's official tree, nor it is in -mm anymore.
>
> I guess I missed the objection for dropping the patch. I'm bringing
My objection to the patch is that it seems to be designed just to
improve your TPC - and I don't think we've seen results yet... or
did I miss that?
Also - by no means do I think improving TPC is wrong, but I think
such a patch may not be the right way to go. It doesn't seem to solve
your problem well.
Now you may have one of two problems. Well it definitely looks like
you are taking a lot of cache misses in try_to_wake_up - however this
won't be due to the load balancing stuff, but rather from locking the
remote CPU's runqueue and touching its runqueues, and cachelines in
the task_struct that had been last touched by the remote CPU.
In fact, if the balancing stuff in try_to_wake_up is working as it
should, then it will result in fewer "remote wakeups" because tasks
will be moved to the same CPU that wakes them. Schedstats can tell
us a lot about this, BTW.
The second problem you may have is that the balancing stuff is going
haywire and actually causing tasks to move around too much. If this
is the case, then I really need to look at your workload (at least
schedstats output) and try to get things working a bit better. Knocking
half its brains out with a hammer is just going to make it perform
poorly in more cases without fixing your underlying problem.
Well - you may have a 3rd problem: that schedule and wake_up are simply
being called too often. What's going on with your workload? How many
context switches? What's the schedule profile look like? (we should get
a wake up profile too).
Basically I'd like to see a lot more information.
> up this discussion again. The wake-up path is a lot hotter on numa
> system running database benchmark. Even on a moderate 8P numa box,
> __wake_up and try_to_wake_up is showing up as #1 and #4 hottest kernel
> functions. While on a comparable 4P smp box, these two functions are
> #5 and #9 respectively.
>
With all else being equal, an 8P box is going to have 133% more remote
wakeups than a 4P box, and each of those cacheline transfers is going
to have a higher latency. The difference in numbers when moving from
4 to 8 way isn't very surprising.
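
A minimal sketch of the arithmetic behind that 133% figure, assuming each
CPU issues wakeups at the same rate and the woken task last ran on a
uniformly random CPU. The function names are hypothetical and this is a
model, not a measurement:

```c
/*
 * Hypothetical model: with n CPUs, a wakeup is "remote" when the task's
 * previous CPU differs from the waking CPU, which happens with
 * probability (n - 1) / n under uniform randomness. If every CPU issues
 * one unit of wakeups, the expected remote wakeups per unit time are
 * n * (n - 1) / n = n - 1.
 */
static double remote_wakeups(int ncpus)
{
	return (double)ncpus * (ncpus - 1) / ncpus;
}

/* Percentage increase in remote wakeups when going from a to b CPUs. */
static double remote_increase_pct(int a, int b)
{
	return (remote_wakeups(b) / remote_wakeups(a) - 1.0) * 100.0;
}
```

Under these assumptions a 4P box sees 3 units of remote wakeups and an 8P
box sees 7, so remote_increase_pct(4, 8) comes out at roughly 133.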
> I think situation will be worse on 32P numa box in the wake up path.
> I don't have any measurement on 32P setup yet, because 8P numa
> performance sucks at the moment and it is a blocker for us before
> proceed any bigger setup.
>
That's a pity. What do we suck in comparison to? What's our ratio
between actual and expected performance?
Thanks,
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Nick Piggin <[email protected]> wrote:
> Chen, Kenneth W wrote:
> >Once upon a time, this patch was in -mm tree (2.6.13-mm1):
> >http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2
> >
> >It is neither in Linus's official tree, nor it is in -mm anymore.
> >
> >I guess I missed the objection for dropping the patch. I'm bringing
>
> My objection for the patch is that it seems to be designed just to
> improve your TPC - and I don't think we've seen results yet... or did
> I miss that?
>
> Also - by no means do I think improving TPC is wrong, but I think such
> a patch may not be the right way to go. It doesn't seem to solve your
> problem well.
Nick, the TPC workload is simple and has been described before: lots of
interrupts arriving on many CPUs, and waking up tasks randomly, which do
a short amount of work and then go back to sleep again. There is no
correlation between the CPU the interrupt arrives on and the CPU the
task gets woken up on. There is no point in immediate balancing either:
the IRQs are well-balanced themselves so there are no load transients to
take care of (except for idle CPUs, which my patch handles), and the
next wakeup for that task won't arrive on the same CPU anyway.
in such a workload, my patch will clearly improve things, by not
bouncing tasks around wildly.
> Now you may have one of two problems. Well it definitely looks like
> you are taking a lot of cache misses in try_to_wake_up - however this
> won't be due to the load balancing stuff, but rather from locking the
> remote CPUs runqueue and touching its runqueues, and cachelines in the
> task_struct that had been last touched by the remote CPU.
no, because you are not considering a fundamentally random workload like
TPC. There is only a 1:8 chance to hit the right CPU with the interrupt,
and there is no benefit from moving the task to the CPU it got woken up
from. In fact, it hurts by doing pointless migrations.
my patch adds the rule that we only consider 'fast' migration when
provably beneficial: if the target CPU is idle. Any other case will have
to go over the 'slow' migration paths.
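
The rule can be sketched in a few lines, reading "target CPU" as the CPU
doing the wakeup. This is a simplified illustration with hypothetical
names, not the code in the patch itself:

```c
/*
 * Hypothetical sketch of the rule: on wakeup, pull the task to the
 * waking CPU only if that CPU is idle. Otherwise leave the task on the
 * CPU it last ran on and let the periodic ('slow') balancing paths move
 * it later if the load really is uneven.
 *
 * Returns the CPU on which the task should be activated.
 */
static int pick_wakeup_cpu(int prev_cpu, int waking_cpu, int waking_cpu_idle)
{
	if (prev_cpu == waking_cpu)
		return prev_cpu;	/* already local, nothing to decide */
	if (waking_cpu_idle)
		return waking_cpu;	/* 'fast' migration to an idle CPU */
	return prev_cpu;		/* busy waker: no wake-time migration */
}
```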
> In fact, if the balancing stuff in try_to_wake_up is working as it
> should, then it will result in fewer "remote wakups" because tasks
> will be moved to the same CPU that wakes them. Schedstats can tell us
> a lot about this, BTW.
wrong. Even if the balancing stuff in try_to_wake_up is working as it
should, it can easily happen that moving a task is not worthwhile: if
there is little or no further relationship between the wakeup CPU and
the IRQ CPU, i.e. when the migration cost is larger than the
relationship-win between the wakeup CPU and the IRQ CPU.
so for me the decision logic is simple: the balancing code logic is
migrating over-eagerly, and this simple and straightforward patch makes
it less eager for an important workload class. You are welcome to
suggest other approaches, but simply saying "I don't like this" won't
bring us further, as the damage on TPC workloads is clearly
demonstrated. If this patch hurts other workloads (and please
demonstrate them instead of calling my patch a hammer - the patch has
been in -mm for many months already) then simply provide the logic that
will do the balancing for those workloads only, without hurting this
workload!
When we have to pick between two workloads (only one of which is
identified at the moment!) that have to balance out against each
other then we will go towards the simpler solution (all other factors
being equal). I.e. in this case by not doing the balancing. Migration is
a fundamentally intrusive act and should be done carefully. If you can
pull it off without hurting other workloads then fine, but otherwise it
needs refinement. This rule is not a hard limit: we obviously will do
changes that hurt some rare workloads only a bit while helping other,
more common workloads enormously. This has not been demonstrated for
this case yet. There is also a simplicity factor: not doing a complex
balancing decision is obviously simpler, so we have a bias towards it.
I.e. do something complex and costly only if we can prove it is most
likely OK.
Ingo
Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
>
>>Chen, Kenneth W wrote:
>>
>>>Once upon a time, this patch was in -mm tree (2.6.13-mm1):
>>>http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2
>>>
>>>It is neither in Linus's official tree, nor it is in -mm anymore.
>>>
>>>I guess I missed the objection for dropping the patch. I'm bringing
>>
>>My objection for the patch is that it seems to be designed just to
>>improve your TPC - and I don't think we've seen results yet... or did
>>I miss that?
>>
>>Also - by no means do I think improving TPC is wrong, but I think such
>>a patch may not be the right way to go. It doesn't seem to solve your
>>problem well.
>
>
> Nick, the TPC workload is simple and has been described before: lots of
> interrupts arriving on many CPUs, and waking up tasks randomly, which do
> short amount of work and then go back to sleep again. There is no
> correlation between the CPU the interrupt arrives on and the CPU the
> task gets woken up on. There is no point in immediate balancing either:
> the IRQs are well-balanced themselves so there are no load transients to
> take care of (except for idle CPUs, which my patch handles), and the
> next wakeup for that task wont arrive on the same CPU anyway.
>
> in such a workload, my patch will clearly improve things, by not
> bouncing tasks around wildly.
>
Ingo, I wasn't aware that tasks are bouncing around wildly; does
your patch improve things? Then by definition it must penalise
workloads where the pairings are more predictable?
I would prefer to try fixing wake balancing before giving up and
turning it off for busy CPUs.
>
>>Now you may have one of two problems. Well it definitely looks like
>>you are taking a lot of cache misses in try_to_wake_up - however this
>>won't be due to the load balancing stuff, but rather from locking the
>>remote CPUs runqueue and touching its runqueues, and cachelines in the
>>task_struct that had been last touched by the remote CPU.
>
>
> no, because you are not considering a fundamentally random workload like
> TPC. There is only a 1:8 chance to hit the right CPU with the interrupt,
> and there is no benefit from moving the task to the CPU it got woken up
> from. In fact, it hurts by doing pointless migrations.
>
It doesn't always migrate though. That's the point of all the heuristics.
> my patch adds the rule that we only consider 'fast' migration when
> provably beneficial: if the target CPU is idle. Any other case will have
> to go over the 'slow' migration paths.
>
wrong. There is no way you can "prove" that a migration is beneficial!
>
>>In fact, if the balancing stuff in try_to_wake_up is working as it
>>should, then it will result in fewer "remote wakups" because tasks
>>will be moved to the same CPU that wakes them. Schedstats can tell us
>>a lot about this, BTW.
>
>
> wrong. Even if the balancing stuff in try_to_wake_up is working as it
> should, it can easily happen that moving a task is not worthwhile: if
> there is little or no further relationship between the wakeup CPU and
> the IRQ CPU, i.e. when the migration cost is larger than the
> relationship-win between the wakeup CPU and the IRQ CPU.
>
> so for me the decision logic is simple: the balancing code logic is
> migrating over-eagerly, and this simple and straightforward patch makes
> it less eager for an important workload class. You are welcome to
> suggest other approaches, but simply saying "I dont like this" wont
> bring us further, as the damage on TPC workloads is clearly
> demonstrated. If this patch hurts other workloads (and please
Ken mentioned it was worth 2%. Not a bad improvement, but if our
performance "sucks" then it sounds like we need to look elsewhere.
> demonstrate them instead of calling my patch a hammer - the patch has
> been in -mm for many months already) then simply provide the logic that
> will do the balancing for those workloads only, without hurting this
> workload!
>
No doubt that if it is doing pointless migrations that your patch
prevents, then that will improve performance here. However I'd rather
try to fix the actual balancing code.
Without any form of wake balancing, then a multiprocessor system will
tend to have a completely random distribution of tasks over CPUs over
time. I prefer to add a driver so it is not completely random for
amenable workloads.
--
SUSE Labs, Novell Inc.
* Nick Piggin <[email protected]> wrote:
> Ingo, I wasn't aware that tasks are bouncing around wildly; does your
> patch improve things? Then by definition it must penalise workloads
> where the pairings are more predictable?
for TPC, most of the non-to-idle migrations are 'wrong'. So basically
any change that gets rid of extra migrations is a win. This does not
mean that it is all bouncing madly.
> I would prefer to try fixing wake balancing before giving up and
> turning it off for busy CPUs.
agreed, and that was my suggestion: improve the heuristics to not hurt
workloads where there is no natural pairing.
one possible way would be to do a task_hot() check in the passive
balancing code, and only migrate the task when it's been inactive for a
long time: that should be the case for most TPC wakeups. (This assumes
an accurate cache-hot estimator, for which another patch exists.)
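
A rough sketch of that idea, assuming nanosecond timestamps with
now_ns >= last_ran_ns. The helper names and the cache_hot_ns threshold
are hypothetical stand-ins for whatever the real estimator provides:

```c
/*
 * Hypothetical helper: a task counts as cache-hot if it last ran less
 * than cache_hot_ns ago. Assumes now_ns >= last_ran_ns.
 */
static int task_is_hot(unsigned long long now_ns,
		       unsigned long long last_ran_ns,
		       unsigned long long cache_hot_ns)
{
	return now_ns - last_ran_ns < cache_hot_ns;
}

/*
 * Passive (wake-time) balancing would only pull a task that has been
 * inactive long enough to have lost its cache footprint.
 */
static int may_pull_on_wakeup(unsigned long long now_ns,
			      unsigned long long last_ran_ns,
			      unsigned long long cache_hot_ns)
{
	return !task_is_hot(now_ns, last_ran_ns, cache_hot_ns);
}
```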
> Without any form of wake balancing, then a multiprocessor system will
> tend to have a completely random distribution of tasks over CPUs over
> time. I prefer to add a driver so it is not completely random for
> amenable workloads.
but my patch does not do 'no form of wake balancing'. It will do
non-load-related wake balancing if the target CPU is idle. Arguably,
that can easily be 'never' under common workloads.
Ingo
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h 2005-10-29 00:52:31.000000000 +1000
+++ linux-2.6/include/linux/sched.h 2005-10-29 00:58:51.000000000 +1000
@@ -648,9 +648,12 @@ struct task_struct {
int lock_depth; /* BKL lock depth */
-#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
+#if defined(CONFIG_SMP)
+ int last_waker_cpu; /* CPU that last woke this task up */
+#if defined(__ARCH_WANT_UNLOCKED_CTXSW)
int oncpu;
#endif
+#endif
int prio, static_prio;
struct list_head run_list;
prio_array_t *array;
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c 2005-10-29 00:52:34.000000000 +1000
+++ linux-2.6/kernel/sched.c 2005-10-29 01:00:53.000000000 +1000
@@ -1183,6 +1183,10 @@ static int try_to_wake_up(task_t *p, uns
}
}
+ if (p->last_waker_cpu != this_cpu)
+ goto out_set_cpu;
+
+
if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
goto out_set_cpu;
@@ -1253,6 +1257,8 @@ out_set_cpu:
cpu = task_cpu(p);
}
+ p->last_waker_cpu = this_cpu;
+
out_activate:
#endif /* CONFIG_SMP */
if (old_state == TASK_UNINTERRUPTIBLE) {
@@ -1334,9 +1340,12 @@ void fastcall sched_fork(task_t *p, int
#ifdef CONFIG_SCHEDSTATS
memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
-#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
+#if defined(CONFIG_SMP)
+ p->last_waker_cpu = cpu;
+#if defined(__ARCH_WANT_UNLOCKED_CTXSW)
p->oncpu = 0;
#endif
+#endif
#ifdef CONFIG_PREEMPT
/* Want to start with kernel preemption disabled. */
p->thread_info->preempt_count = 1;
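
Reading the diff above, the heuristic appears to be: skip wake balancing
unless the same CPU also performed the previous wakeup of this task. A
simplified userspace sketch of that bookkeeping follows; the struct and
function names are hypothetical, and it ignores the early-exit paths in
try_to_wake_up that bypass the update:

```c
/* Hypothetical stand-in for the one task_struct field the patch adds. */
struct wake_state {
	int last_waker_cpu;	/* CPU that last woke this task up */
};

/*
 * Returns 1 if wake balancing would be considered for this wakeup,
 * i.e. the waking CPU is the same one that woke the task last time.
 * Always records this_cpu as the most recent waker.
 */
static int wake_balance_considered(struct wake_state *t, int this_cpu)
{
	int consider = (t->last_waker_cpu == this_cpu);

	t->last_waker_cpu = this_cpu;
	return consider;
}
```

With wakers arriving from random CPUs, two consecutive wakeups rarely
come from the same CPU, so balancing is mostly skipped; with a stable
waker/wakee pairing it is considered from the second wakeup onward.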
Ingo Molnar wrote on Friday, October 28, 2005 12:40 AM
> * Chen, Kenneth W <[email protected]> wrote:
> > Once upon a time, this patch was in -mm tree (2.6.13-mm1):
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=112265450426975&w=2
> >
> > It is neither in Linus's official tree, nor it is in -mm anymore.
>
> it's sched-better-wake-balancing-3.patch, in 2.6.14-rc5-mm1.
Sorry for the noise. I went through 2.6.14-rc5-mm1 a couple of times.
I can't believe I missed it. (I must have been asleep or not known
what I was doing at the time.)
- Ken