Compared with 2.6.31's results, hackbench shows some regression on a couple of
machines with kernel 2.6.32-rc1.
I run it with the command line:
./hackbench 100 process 2000
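For reference, the percentages below compare the average of the "Time:" value
hackbench reports on each kernel (the run count here is just an illustration):

# run a few iterations on 2.6.31 and on 2.6.32-rc1, then compare the averages
for i in 1 2 3 4 5; do ./hackbench 100 process 2000; done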
1) On 4*4 core tigerton: 70%;
2) On 2*4 core stoakley: 7%.
I located the 2 patches below.
commit 29cd8bae396583a2ee9a3340db8c5102acf9f6fd
Author: Peter Zijlstra <[email protected]>
Date: Thu Sep 17 09:01:14 2009 +0200
sched: Fix SD_POWERSAVING_BALANCE|SD_PREFER_LOCAL vs SD_WAKE_AFFINE
and
commit de69a80be32445b0a71e8e3b757e584d7beb90f7
Author: Peter Zijlstra <[email protected]>
Date: Thu Sep 17 09:01:20 2009 +0200
sched: Stop buddies from hogging the system
1) On 4*4 core tigerton: if I revert patch 29cd8b, the regression drops to
less than 55%; if I revert the 2 patches, all regression disappears.
2) On 2*4 core stoakley: if I revert the 2 patches, compared with 2.6.31,
I get about an 8% improvement instead of a regression.
Sorry for reporting the regression late; there was a long national holiday.
Yanmin
On Fri, 2009-10-09 at 17:19 +0800, Zhang, Yanmin wrote:
> Compared with 2.6.31's results, hackbench shows some regression on a couple of
> machines with kernel 2.6.32-rc1.
> I run it with the command line:
> ./hackbench 100 process 2000
>
> 1) On 4*4 core tigerton: 70%;
> 2) On 2*4 core stoakley: 7%.
>
> I located the 2 patches below.
> commit 29cd8bae396583a2ee9a3340db8c5102acf9f6fd
> Author: Peter Zijlstra <[email protected]>
> Date: Thu Sep 17 09:01:14 2009 +0200
>
> sched: Fix SD_POWERSAVING_BALANCE|SD_PREFER_LOCAL vs SD_WAKE_AFFINE
>
> and
That should, I guess, be solved by turning SD_PREFER_LOCAL off, right?
> commit de69a80be32445b0a71e8e3b757e584d7beb90f7
> Author: Peter Zijlstra <[email protected]>
> Date: Thu Sep 17 09:01:20 2009 +0200
>
> sched: Stop buddies from hogging the system
>
>
> 1) On 4*4 core tigerton: if I revert patch 29cd8b, the regression drops to
> less than 55%; if I revert the 2 patches, all regression disappears.
> 2) On 2*4 core stoakley: if I revert the 2 patches, compared with 2.6.31,
> I get about an 8% improvement instead of a regression.
>
> Sorry for reporting the regression late; there was a long national holiday.
No problem. There should still be plenty of time to poke at them before .32
hits the street.
I really liked de69a80b, and it affecting hackbench shows I wasn't
crazy ;-)
So hackbench is a multi-cast, with one sender spraying multiple
receivers, who in their turn don't spray back, right?
This would be exactly the scenario that patch 'cures'. Previously we
would not clear the last buddy after running the next, allowing the
sender to get back to work sooner than it otherwise ought to have been.
Now, since those receivers don't poke back, they don't enforce the buddy
relation...
/me ponders a bit
Does this make it any better?
---
kernel/sched_fair.c | 27 +++++++++++++--------------
1 files changed, 13 insertions(+), 14 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 4e777b4..bf5901e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -861,12 +861,21 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
struct sched_entity *se = __pick_next_entity(cfs_rq);
+ struct sched_entity *buddy;
- if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1)
- return cfs_rq->next;
+ if (cfs_rq->next) {
+ buddy = cfs_rq->next;
+ cfs_rq->next = NULL;
+ if (wakeup_preempt_entity(buddy, se) < 1)
+ return buddy;
+ }
- if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1)
- return cfs_rq->last;
+ if (cfs_rq->last) {
+ buddy = cfs_rq->last;
+ cfs_rq->last = NULL;
+ if (wakeup_preempt_entity(buddy, se) < 1)
+ return buddy;
+ }
return se;
}
@@ -1654,16 +1663,6 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
do {
se = pick_next_entity(cfs_rq);
- /*
- * If se was a buddy, clear it so that it will have to earn
- * the favour again.
- *
- * If se was not a buddy, clear the buddies because neither
- * was elegible to run, let them earn it again.
- *
- * IOW. unconditionally clear buddies.
- */
- __clear_buddies(cfs_rq, NULL);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
On Fri, 2009-10-09 at 12:43 +0200, Peter Zijlstra wrote:
> On Fri, 2009-10-09 at 17:19 +0800, Zhang, Yanmin wrote:
> > Compared with 2.6.31's results, hackbench shows some regression on a couple of
> > machines with kernel 2.6.32-rc1.
> > I run it with the command line:
> > ./hackbench 100 process 2000
> >
> > 1) On 4*4 core tigerton: 70%;
> > 2) On 2*4 core stoakley: 7%.
> >
> > I located the 2 patches below.
> > commit 29cd8bae396583a2ee9a3340db8c5102acf9f6fd
> > Author: Peter Zijlstra <[email protected]>
> > Date: Thu Sep 17 09:01:14 2009 +0200
> >
> > sched: Fix SD_POWERSAVING_BALANCE|SD_PREFER_LOCAL vs SD_WAKE_AFFINE
> >
> > and
>
> That should, I guess, be solved by turning SD_PREFER_LOCAL off, right?
>
> > commit de69a80be32445b0a71e8e3b757e584d7beb90f7
> > Author: Peter Zijlstra <[email protected]>
> > Date: Thu Sep 17 09:01:20 2009 +0200
> >
> > sched: Stop buddies from hogging the system
> >
> >
> > 1) On 4*4 core tigerton: if I revert patch 29cd8b, the regression drops to
> > less than 55%; if I revert the 2 patches, all regression disappears.
> > 2) On 2*4 core stoakley: if I revert the 2 patches, compared with 2.6.31,
> > I get about an 8% improvement instead of a regression.
> >
> > Sorry for reporting the regression late; there was a long national holiday.
>
> No problem. There should still be plenty of time to poke at them before .32
> hits the street.
>
> I really liked de69a80b, and it affecting hackbench shows I wasn't
> crazy ;-)
>
> So hackbench is a multi-cast, with one sender spraying multiple
> receivers, who in their turn don't spray back, right?
Right. volanoMark has about a 9% regression on stoakley and a 50% regression
on tigerton. If I revert the original patches, the volanoMark regression on stoakley
disappears, but it is still about 45% on tigerton.
>
> This would be exactly the scenario that patch 'cures'. Previously we
> would not clear the last buddy after running the next, allowing the
> sender to get back to work sooner than it otherwise ought to have been.
>
> Now, since those receivers don't poke back, they don't enforce the buddy
> relation...
>
>
> /me ponders a bit
>
> Does this make it any better?
I applied this patch and another one you sent on the tbench email thread.
On stoakley, hackbench recovers. If I revert the original 2 patches instead,
we get an 8% improvement.
On tigerton, with your 2 patches, there is still about a 45% regression.
As for volanoMark, with your 2 patches, the regression disappears on stoakley
and it becomes about 35% on tigerton.
aim7 has about a 6% regression on stoakley and tigerton. I haven't located the
root cause yet.
The good news is that only tbench has about a 6% regression on Nehalem machines.
Other regressions such as hackbench/aim7/volanoMark are not significant on
Nehalem. But reverting the original 2 patches doesn't fix the tbench regression
on Nehalem machines.
>
> ---
> kernel/sched_fair.c | 27 +++++++++++++--------------
> 1 files changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 4e777b4..bf5901e 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -861,12 +861,21 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
> static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
> {
> struct sched_entity *se = __pick_next_entity(cfs_rq);
> + struct sched_entity *buddy;
>
> - if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1)
> - return cfs_rq->next;
> + if (cfs_rq->next) {
> + buddy = cfs_rq->next;
> + cfs_rq->next = NULL;
> + if (wakeup_preempt_entity(buddy, se) < 1)
> + return buddy;
> + }
>
> - if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1)
> - return cfs_rq->last;
> + if (cfs_rq->last) {
> + buddy = cfs_rq->last;
> + cfs_rq->last = NULL;
> + if (wakeup_preempt_entity(buddy, se) < 1)
> + return buddy;
> + }
>
> return se;
> }
> @@ -1654,16 +1663,6 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
>
> do {
> se = pick_next_entity(cfs_rq);
> - /*
> - * If se was a buddy, clear it so that it will have to earn
> - * the favour again.
> - *
> - * If se was not a buddy, clear the buddies because neither
> - * was elegible to run, let them earn it again.
> - *
> - * IOW. unconditionally clear buddies.
> - */
> - __clear_buddies(cfs_rq, NULL);
> set_next_entity(cfs_rq, se);
> cfs_rq = group_cfs_rq(se);
> } while (cfs_rq);
>
>
On Mon, 2009-10-12 at 15:05 +0800, Zhang, Yanmin wrote:
> The good news is that only tbench has about a 6% regression on Nehalem machines.
> Other regressions such as hackbench/aim7/volanoMark are not significant on
> Nehalem. But reverting the original 2 patches doesn't fix the tbench regression
> on Nehalem machines.
Could be NO_NEXT_BUDDY. NEXT_BUDDY is pretty important for scaling.
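A quick check of which way it's currently set (assuming debugfs is mounted on /debug):

cat /debug/sched_features | tr ' ' '\n' | grep NEXT_BUDDY    # prints NEXT_BUDDY or NO_NEXT_BUDDY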
-Mike
On Mon, 2009-10-12 at 15:05 +0800, Zhang, Yanmin wrote:
> > So hackbench is a multi-cast, with one sender spraying multiple
> > receivers, who in their turn don't spray back, right?
> Right. volanoMark has about a 9% regression on stoakley and a 50% regression
> on tigerton. If I revert the original patches, the volanoMark regression on stoakley
> disappears, but it is still about 45% on tigerton.
> > /me ponders a bit
> >
> > Does this make it any better?
> I applied this patch and another one you sent on the tbench email thread.
> On stoakley, hackbench recovers. If I revert the original 2 patches instead,
> we get an 8% improvement.
> On tigerton, with your 2 patches, there is still about a 45% regression.
[ and here I got confused because this 45% seemed to match the 45%
above, but then I saw it was hackbench vs volano ]
> As for volanoMark, with your 2 patches, the regression disappears on stoakley
> and it becomes about 35% on tigerton.
So hackbench on tigerton is worse, but volano on tigerton is better with
this patch vs reverting bits?
> The good news is that only tbench has about a 6% regression on Nehalem machines.
> Other regressions such as hackbench/aim7/volanoMark are not significant on
> Nehalem. But reverting the original 2 patches doesn't fix the tbench regression
> on Nehalem machines.
Right, so Mike's suggestion of doing:
echo NEXT_BUDDY > /debug/sched_features
seems like the next thing to try.
Mike, did we ever figure out _why_ NEXT_BUDDY introduced latencies?
Buddies shouldn't make latencies worse than regular while(1); loops
would.
On Mon, 2009-10-12 at 16:21 +0200, Peter Zijlstra wrote:
> Mike, did we ever figure out _why_ NEXT_BUDDY introduced latencies?
No, I haven't gotten around to poking at that yet. I'll try that x264
thing; IIRC, NEXT_BUDDY hurt it pretty badly.
> Buddies shouldn't make latencies worse than regular while(1); loops
> would.
Yeah, no idea what's going wrong.
-Mike
On Mon, 2009-10-12 at 16:21 +0200, Peter Zijlstra wrote:
> On Mon, 2009-10-12 at 15:05 +0800, Zhang, Yanmin wrote:
>
> > > So hackbench is a multi-cast, with one sender spraying multiple
> > > receivers, who in their turn don't spray back, right?
>
> > Right. volanoMark has about a 9% regression on stoakley and a 50% regression
> > on tigerton. If I revert the original patches, the volanoMark regression on stoakley
> > disappears, but it is still about 45% on tigerton.
>
> > > /me ponders a bit
> > >
> > > Does this make it any better?
>
> > I applied this patch and another one you sent on the tbench email thread.
> > On stoakley, hackbench recovers. If I revert the original 2 patches instead,
> > we get an 8% improvement.
> > On tigerton, with your 2 patches, there is still about a 45% regression.
>
> [ and here I got confused because this 45% seemed to match the 45%
> above, but then I saw it was hackbench vs volano ]
Sorry for mentioning some data about multiple benchmarks in one email.
>
> > As for volanoMark, with your 2 patches, the regression disappears on stoakley
> > and it becomes about 35% on tigerton.
>
> So hackbench on tigerton is worse, but volano on tigerton is better with
> this patch vs reverting bits?
Right, with your 2 new patches vs reverting the 2 original patches.
>
> > The good news is that only tbench has about a 6% regression on Nehalem machines.
> > Other regressions such as hackbench/aim7/volanoMark are not significant on
> > Nehalem. But reverting the original 2 patches doesn't fix the tbench regression
> > on Nehalem machines.
>
> Right, so Mike's suggestion of doing:
> echo NEXT_BUDDY > /debug/sched_features
With your 2 new patches plus the NEXT_BUDDY setting, hackbench shows some
improvement instead of a regression on tigerton now, so it does work.
NEXT_BUDDY doesn't help volanoMark or tbench.
>
> seems like the next thing to try.
>
> Mike, did we ever figure out _why_ NEXT_BUDDY introduced latencies?
>
> Buddies shouldn't make latencies worse than regular while(1); loops
> would.
>
On Tue, 2009-10-13 at 11:12 +0800, Zhang, Yanmin wrote:
> NEXT_BUDDY doesn't help volanoMark or tbench.
Vmark is mostly about preemption and affinity. Increases in wakeup
preemption or load balancing will bring it down. The affinity bit
applies heavily to mysql+oltp too, though it loves wakeup preemption.
test test test...
My conclusion from the results _here_ is that the load balancing changes are
harming us cache-wise. GENTLE_FAIR_SLEEPERS harms short-term fairness, but despite
GENTLE_FAIR_SLEEPERS, we're still wakeup preempting too much, likely
doing more cache harm.
(even for the desktop, overly aggressive wakeup preemption can do harm)
I'd suggest trying the settings below.
vmark
92143 stock settings
96396 -SD_BALANCE_NEWIDLE
99403 -SD_WAKE_BALANCE
121821 min_granularity+wakeup_granularity *= 2
128795 NO_WAKEUP_PREEMPT (and back on, just checking fairness)
97193 NO_FAIR_SLEEPERS (vmark likes fairness, max out fairness)
107519 FAIR_SLEEPERS NO_GENTLE_FAIR_SLEEPERS (over-preempt again, so..)
123721 min_granularity+wakeup_granularity *= 2
131290 NO_WAKEUP_PREEMPT (and back on, just checking fairness)
123464 NEXT_BUDDY (no effect)
vs stock 1.339
tbench 8 with these settings.
752.249 MB/sec 8 procs
747.010 MB/sec 8 procs NO_NEXT_BUDDY
753.177 MB/sec 8 procs min_granularity+wakeup_granularity /= 2
749.518 MB/sec 8 procs GENTLE_FAIR_SLEEPERS
753.051 MB/sec 8 procs min_granularity+wakeup_granularity /= 2
734.772 MB/sec 8 procs +SD_WAKE_BALANCE
733.683 MB/sec 8 procs +SD_BALANCE_NEWIDLE (we are at stock)
vs stock 1.025
Turns off netfilter, stock settings
903.304 MB/sec 8 procs
900.656 MB/sec 8 procs -SD_BALANCE_NEWIDLE
928.914 MB/sec 8 procs -SD_WAKE_BALANCE
930.591 MB/sec 8 procs min_granularity+wakeup_granularity *= 2
926.836 MB/sec 8 procs NO_GENTLE_FAIR_SLEEPERS
931.148 MB/sec 8 procs min_granularity+wakeup_granularity *= 2
vs stock 1.030
vmark
146264
116559 stock
vs stock 1.254
On Tue, 2009-10-13 at 11:39 +0200, Mike Galbraith wrote:
> On Tue, 2009-10-13 at 11:12 +0800, Zhang, Yanmin wrote:
>
> > NEXT_BUDDY doesn't help volanoMark or tbench.
>
> Vmark is mostly about preemption and affinity. Increases in wakeup
> preemption or load balancing will bring it down. The affinity bit
> applies heavily to mysql+oltp too, though it loves wakeup preemption.
>
> test test test...
>
> My conclusion from the results _here_ is that the load balancing changes are
> harming us cache-wise. GENTLE_FAIR_SLEEPERS harms short-term fairness, but despite
> GENTLE_FAIR_SLEEPERS, we're still wakeup preempting too much, likely
> doing more cache harm.
>
> (even for the desktop, overly aggressive wakeup preemption can do harm)
>
> I'd suggest trying the settings below.
Except don't bother tweaking min_granularity. Further testing showed
that's fine. So turn off GENTLE_FAIR_SLEEPERS, and bump
wakeup_granularity up a bit.
Note: don't bump it further than the short term fairness goal
(sched_latency, or half of that if gentle is enabled), or you won't have
any wakeup preemption, which is deadly for many loads.
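Something along these lines (a sketch; assumes debugfs is mounted on /debug and
the 2.6.32 sysctl names):

echo NO_GENTLE_FAIR_SLEEPERS > /debug/sched_features

# bump wakeup granularity, but keep it below the short-term fairness goal
cat /proc/sys/kernel/sched_latency_ns
g=$(cat /proc/sys/kernel/sched_wakeup_granularity_ns)
echo $((g * 2)) > /proc/sys/kernel/sched_wakeup_granularity_ns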
> vmark
>
> 92143 stock settings
> 96396 -SD_BALANCE_NEWIDLE
> 99403 -SD_WAKE_BALANCE
> 121821 min_granularity+wakeup_granularity *= 2
> 128795 NO_WAKEUP_PREEMPT (and back on, just checking fairness)
> 97193 NO_FAIR_SLEEPERS (vmark likes fairness, max out fairness)
> 107519 FAIR_SLEEPERS NO_GENTLE_FAIR_SLEEPERS (over-preempt again, so..)
> 123721 min_granularity+wakeup_granularity *= 2
> 131290 NO_WAKEUP_PREEMPT (and back on, just checking fairness)
> 123464 NEXT_BUDDY (no effect)
> vs stock 1.339
>
> tbench 8 with these settings.
>
> 752.249 MB/sec 8 procs
> 747.010 MB/sec 8 procs NO_NEXT_BUDDY
> 753.177 MB/sec 8 procs min_granularity+wakeup_granularity /= 2
> 749.518 MB/sec 8 procs GENTLE_FAIR_SLEEPERS
> 753.051 MB/sec 8 procs min_granularity+wakeup_granularity /= 2
> 734.772 MB/sec 8 procs +SD_WAKE_BALANCE
> 733.683 MB/sec 8 procs +SD_BALANCE_NEWIDLE (we are at stock)
> vs stock 1.025
>
> Turns off netfilter, stock settings
>
> 903.304 MB/sec 8 procs
> 900.656 MB/sec 8 procs -SD_BALANCE_NEWIDLE
> 928.914 MB/sec 8 procs -SD_WAKE_BALANCE
> 930.591 MB/sec 8 procs min_granularity+wakeup_granularity *= 2
> 926.836 MB/sec 8 procs NO_GENTLE_FAIR_SLEEPERS
> 931.148 MB/sec 8 procs min_granularity+wakeup_granularity *= 2
> vs stock 1.030
>
> vmark
>
> 146264
> 116559 stock
> vs stock 1.254
Commit-ID: 92f6a5e37a2e2d3342dafb2b39c2f8bc340bbf84
Gitweb: http://git.kernel.org/tip/92f6a5e37a2e2d3342dafb2b39c2f8bc340bbf84
Author: Peter Zijlstra <[email protected]>
AuthorDate: Fri, 9 Oct 2009 12:43:07 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 14 Oct 2009 15:02:34 +0200
sched: Do less agressive buddy clearing
Yanmin reported a hackbench regression due to:
> commit de69a80be32445b0a71e8e3b757e584d7beb90f7
> Author: Peter Zijlstra <[email protected]>
> Date: Thu Sep 17 09:01:20 2009 +0200
>
> sched: Stop buddies from hogging the system
I really liked de69a80b, and it affecting hackbench shows I wasn't
crazy ;-)
So hackbench is a multi-cast, with one sender spraying multiple
receivers, who in their turn don't spray back.
This would be exactly the scenario that patch 'cures'. Previously
we would not clear the last buddy after running the next task,
allowing the sender to get back to work sooner than it otherwise
ought to have been, increasing latencies for other tasks.
Now, since those receivers don't poke back, they don't enforce the
buddy relation, which means there's nothing to re-elect the sender.
Cure this by less agressively clearing the buddy stats. Only clear
buddies when they were not chosen. It should still avoid a buddy
sticking around long after its served its time.
Reported-by: "Zhang, Yanmin" <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
CC: Mike Galbraith <[email protected]>
LKML-Reference: <1255084986.8802.46.camel@laptop>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched_fair.c | 27 +++++++++++++--------------
1 files changed, 13 insertions(+), 14 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 4e777b4..c32c3e6 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -861,12 +861,21 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
struct sched_entity *se = __pick_next_entity(cfs_rq);
+ struct sched_entity *buddy;
- if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1)
- return cfs_rq->next;
+ if (cfs_rq->next) {
+ buddy = cfs_rq->next;
+ cfs_rq->next = NULL;
+ if (wakeup_preempt_entity(buddy, se) < 1)
+ return buddy;
+ }
- if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1)
- return cfs_rq->last;
+ if (cfs_rq->last) {
+ buddy = cfs_rq->last;
+ cfs_rq->last = NULL;
+ if (wakeup_preempt_entity(buddy, se) < 1)
+ return buddy;
+ }
return se;
}
@@ -1654,16 +1663,6 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
do {
se = pick_next_entity(cfs_rq);
- /*
- * If se was a buddy, clear it so that it will have to earn
- * the favour again.
- *
- * If se was not a buddy, clear the buddies because neither
- * was elegible to run, let them earn it again.
- *
- * IOW. unconditionally clear buddies.
- */
- __clear_buddies(cfs_rq, NULL);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
On Tue, 2009-10-13 at 11:12 +0800, Zhang, Yanmin wrote:
> NEXT_BUDDY doesn't help volanoMark or tbench.
Can you try the patch below please? It tries to preserve buddy
affinity where possible, and mitigates over-preemption by strengthening
buddies a bit. It improves vmark here by ~7%.
diff --git a/kernel/sched.c b/kernel/sched.c
index 00f9e71..fb025d4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2007,8 +2007,12 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
/*
* Buddy candidates are cache hot:
+ *
+ * Do not honor buddies if there may be nothing else to
+ * prevent us from becoming idle.
*/
if (sched_feat(CACHE_HOT_BUDDY) &&
+ task_rq(p)->nr_running >= sched_nr_latency &&
(&p->se == cfs_rq_of(&p->se)->next ||
&p->se == cfs_rq_of(&p->se)->last))
return 1;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c32c3e6..428bf55 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -863,18 +863,20 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
struct sched_entity *se = __pick_next_entity(cfs_rq);
struct sched_entity *buddy;
- if (cfs_rq->next) {
+ if (cfs_rq->next && sched_feat(NEXT_BUDDY)) {
buddy = cfs_rq->next;
- cfs_rq->next = NULL;
- if (wakeup_preempt_entity(buddy, se) < 1)
+ if (wakeup_preempt_entity(buddy, se) < 1) {
+ cfs_rq->next = NULL;
return buddy;
+ }
}
- if (cfs_rq->last) {
+ if (cfs_rq->last && sched_feat(LAST_BUDDY)) {
buddy = cfs_rq->last;
- cfs_rq->last = NULL;
- if (wakeup_preempt_entity(buddy, se) < 1)
+ if (wakeup_preempt_entity(buddy, se) < 1) {
+ cfs_rq->last = NULL;
return buddy;
+ }
}
return se;
@@ -1600,9 +1602,9 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
* Also, during early boot the idle thread is in the fair class, for
* obvious reasons its a bad idea to schedule back to the idle thread.
*/
- if (sched_feat(LAST_BUDDY) && likely(se->on_rq && curr != rq->idle))
+ if (!(wake_flags & WF_FORK) && likely(se->on_rq && curr != rq->idle))
set_last_buddy(se);
- if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK))
+ if (!(wake_flags & WF_FORK))
set_next_buddy(pse);
/*
On Fri, 2009-10-16 at 13:06 +0200, Mike Galbraith wrote:
> On Tue, 2009-10-13 at 11:12 +0800, Zhang, Yanmin wrote:
>
> > NEXT_BUDDY doesn't help volanoMark or tbench.
>
> Can you try the patch below please? It tries to preserve buddy
> affinity where possible, and mitigates over-preemption by strengthening
> buddies a bit. It improves vmark here by ~7%.
I ran some benchmarks against 2.6.32-rc1+Peter_2_patches+below_patch.
The results below are compared with 2.6.32-rc1.
hackbench has about a 10% improvement on stoakley (2*4 cores) and
tigerton (4*4 cores).
tbench still has about a 5% regression on stoakley and tigerton.
VolanoMark has a 33% regression on tigerton, but a 2% improvement on stoakley.
I also ran the benchmarks against the latest tips/master and got similar
results to the above.
The testing against tips on a Nehalem machine didn't show much improvement or regression.
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 00f9e71..fb025d4 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2007,8 +2007,12 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
>
> /*
> * Buddy candidates are cache hot:
> + *
> + * Do not honor buddies if there may be nothing else to
> + * prevent us from becoming idle.
> */
> if (sched_feat(CACHE_HOT_BUDDY) &&
> + task_rq(p)->nr_running >= sched_nr_latency &&
> (&p->se == cfs_rq_of(&p->se)->next ||
> &p->se == cfs_rq_of(&p->se)->last))
> return 1;
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index c32c3e6..428bf55 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -863,18 +863,20 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
> struct sched_entity *se = __pick_next_entity(cfs_rq);
> struct sched_entity *buddy;
>
> - if (cfs_rq->next) {
> + if (cfs_rq->next && sched_feat(NEXT_BUDDY)) {
> buddy = cfs_rq->next;
> - cfs_rq->next = NULL;
> - if (wakeup_preempt_entity(buddy, se) < 1)
> + if (wakeup_preempt_entity(buddy, se) < 1) {
> + cfs_rq->next = NULL;
> return buddy;
> + }
> }
>
> - if (cfs_rq->last) {
> + if (cfs_rq->last && sched_feat(LAST_BUDDY)) {
> buddy = cfs_rq->last;
> - cfs_rq->last = NULL;
> - if (wakeup_preempt_entity(buddy, se) < 1)
> + if (wakeup_preempt_entity(buddy, se) < 1) {
> + cfs_rq->last = NULL;
> return buddy;
> + }
> }
>
> return se;
> @@ -1600,9 +1602,9 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> * Also, during early boot the idle thread is in the fair class, for
> * obvious reasons its a bad idea to schedule back to the idle thread.
> */
> - if (sched_feat(LAST_BUDDY) && likely(se->on_rq && curr != rq->idle))
> + if (!(wake_flags & WF_FORK) && likely(se->on_rq && curr != rq->idle))
> set_last_buddy(se);
> - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK))
> + if (!(wake_flags & WF_FORK))
> set_next_buddy(pse);
>
> /*
>
>
On Tue, 2009-10-27 at 16:03 +0800, Zhang, Yanmin wrote:
> On Fri, 2009-10-16 at 13:06 +0200, Mike Galbraith wrote:
> > On Tue, 2009-10-13 at 11:12 +0800, Zhang, Yanmin wrote:
> >
> > > NEXT_BUDDY doesn't help volanoMark or tbench.
> >
> > Can you try the patch below please? It tries to preserve buddy
> > affinity where possible, and mitigates over-preemption by strengthening
> > buddies a bit. It improves vmark here by ~7%.
> I ran some benchmarks against 2.6.32-rc1+Peter_2_patches+below_patch.
> The results below are compared with 2.6.32-rc1.
> hackbench has about a 10% improvement on stoakley (2*4 cores) and
> tigerton (4*4 cores).
> tbench still has about a 5% regression on stoakley and tigerton.
> VolanoMark has a 33% regression on tigerton, but a 2% improvement on stoakley.
>
> I also ran the benchmarks against the latest tips/master and got similar
> results to the above.
>
> The testing against tips on a Nehalem machine didn't show much improvement or regression.
Thanks for the testing. Your results suggest that I should revive the
'mark buddies whether you use them or not' idea.
-Mike
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 00f9e71..fb025d4 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2007,8 +2007,12 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
> >
> > /*
> > * Buddy candidates are cache hot:
> > + *
> > + * Do not honor buddies if there may be nothing else to
> > + * prevent us from becoming idle.
> > */
> > if (sched_feat(CACHE_HOT_BUDDY) &&
> > + task_rq(p)->nr_running >= sched_nr_latency &&
> > (&p->se == cfs_rq_of(&p->se)->next ||
> > &p->se == cfs_rq_of(&p->se)->last))
> > return 1;
> > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > index c32c3e6..428bf55 100644
> > --- a/kernel/sched_fair.c
> > +++ b/kernel/sched_fair.c
> > @@ -863,18 +863,20 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
> > struct sched_entity *se = __pick_next_entity(cfs_rq);
> > struct sched_entity *buddy;
> >
> > - if (cfs_rq->next) {
> > + if (cfs_rq->next && sched_feat(NEXT_BUDDY)) {
> > buddy = cfs_rq->next;
> > - cfs_rq->next = NULL;
> > - if (wakeup_preempt_entity(buddy, se) < 1)
> > + if (wakeup_preempt_entity(buddy, se) < 1) {
> > + cfs_rq->next = NULL;
> > return buddy;
> > + }
> > }
> >
> > - if (cfs_rq->last) {
> > + if (cfs_rq->last && sched_feat(LAST_BUDDY)) {
> > buddy = cfs_rq->last;
> > - cfs_rq->last = NULL;
> > - if (wakeup_preempt_entity(buddy, se) < 1)
> > + if (wakeup_preempt_entity(buddy, se) < 1) {
> > + cfs_rq->last = NULL;
> > return buddy;
> > + }
> > }
> >
> > return se;
> > @@ -1600,9 +1602,9 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> > * Also, during early boot the idle thread is in the fair class, for
> > * obvious reasons its a bad idea to schedule back to the idle thread.
> > */
> > - if (sched_feat(LAST_BUDDY) && likely(se->on_rq && curr != rq->idle))
> > + if (!(wake_flags & WF_FORK) && likely(se->on_rq && curr != rq->idle))
> > set_last_buddy(se);
> > - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK))
> > + if (!(wake_flags & WF_FORK))
> > set_next_buddy(pse);
> >
> > /*
> >
> >
>
On Tue, 2009-10-27 at 15:42 +0100, Mike Galbraith wrote:
> On Tue, 2009-10-27 at 16:03 +0800, Zhang, Yanmin wrote:
> > On Fri, 2009-10-16 at 13:06 +0200, Mike Galbraith wrote:
> > > On Tue, 2009-10-13 at 11:12 +0800, Zhang, Yanmin wrote:
> > >
> > > > NEXT_BUDDY doesn't help volanoMark or tbench.
> > >
> > > Can you try the patch below please? It tries to preserve buddy
> > > affinity where possible, and mitigates over-preemption by strengthening
> > > buddies a bit. It improves vmark here by ~7%.
> > I ran some benchmarks against 2.6.32-rc1+Peter_2_patches+below_patch.
> > The results below are compared with 2.6.32-rc1.
> > hackbench has about a 10% improvement on stoakley (2*4 cores) and
> > tigerton (4*4 cores).
> > tbench still has about a 5% regression on stoakley and tigerton.
> > VolanoMark has a 33% regression on tigerton, but a 2% improvement on stoakley.
> >
> > I also ran the benchmarks against the latest tips/master and got similar
> > results to the above.
> >
> > The testing against tips on a Nehalem machine didn't show much improvement or regression.
>
> Thanks for the testing. Your results suggest that I should revive the
> 'mark buddies whether you use them or not' idea.
>
> -Mike
I'm investigating the 5% tbench regression on a Nehalem machine. perf_counter shows
select_task_rq_fair consumes about 5% of cpu time with 2.6.32-rc1, while it consumes
less than 0.5% with 2.6.31.
Patch c88d5910890 has comments to explain it, but I still can't understand why
such complicated balance logic is needed when selecting a task's rq.
I will check which section of select_task_rq_fair consumes so much time.
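Roughly along these lines (assuming the in-tree perf tool; the exact tbench
invocation and perf options are just examples):

# tbench_srv already running locally
perf record -a -g -- ./tbench 8 127.0.0.1
perf report | head -40
perf annotate -s select_task_rq_fair    # see which lines burn the time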
Yanmin
On Wed, 2009-10-28 at 17:29 +0800, Zhang, Yanmin wrote:
> -Mike
> I'm investigating the 5% tbench regression on a Nehalem machine. perf_counter shows
> select_task_rq_fair consumes about 5% of cpu time with 2.6.32-rc1, while it consumes
> less than 0.5% with 2.6.31.
>
> Patch c88d5910890 has comments to explain it, but I still can't understand why
> such complicated balance logic is needed when selecting a task's rq.
>
> I will check which section of select_task_rq_fair consumes so much time.
Turn off SD_WAKE_BALANCE as it was called in rc1. See commit 182a85f.
-Mike
On Wed, 2009-10-28 at 15:22 +0100, Mike Galbraith wrote:
> On Wed, 2009-10-28 at 17:29 +0800, Zhang, Yanmin wrote:
> > -Mike
> > I'm investigating the 5% tbench regression on a Nehalem machine. perf_counter shows
> > select_task_rq_fair consumes about 5% of cpu time with 2.6.32-rc1, while it consumes
> > less than 0.5% with 2.6.31.
> >
> > Patch c88d5910890 has comments to explain it, but I still can't understand why
> > such complicated balance logic is needed when selecting a task's rq.
> >
> > I will check which section of select_task_rq_fair consumes so much time.
>
> Turn off SD_WAKE_BALANCE as it was called in rc1. See commit 182a85f.
I ran the test against 2.6.32-rc1, which already includes that patch.
On Thu, 2009-10-29 at 08:50 +0800, Zhang, Yanmin wrote:
> On Wed, 2009-10-28 at 15:22 +0100, Mike Galbraith wrote:
> > On Wed, 2009-10-28 at 17:29 +0800, Zhang, Yanmin wrote:
> > > -Mike
> > > I'm investigating the 5% tbench regression on a Nehalem machine. perf_counter shows
> > > select_task_rq_fair consumes about 5% of cpu time with 2.6.32-rc1, while it consumes
> > > less than 0.5% with 2.6.31.
> > >
> > > Patch c88d5910890 has comments to explain it, but I still can't understand why
> > > such complicated balance logic is needed when selecting a task's rq.
> > >
> > > I will check which section of select_task_rq_fair consumes so much time.
> >
> > Turn off SD_WAKE_BALANCE as it was called in rc1. See commit 182a85f.
> I ran the test against 2.6.32-rc1, which already includes that patch.
Duh, I checked the wrong tree.
SD_PREFER_LOCAL is still on in rc1 though (double checks;), so you'll go
through the power saving code until you reach a domain containing both
waker's cpu and wakee's previous cpu even if that code already found
that a higher domain wasn't overloaded. Looks to me like that block
wants a want_sd && qualifier.
Even if you turn SD_PREFER_LOCAL off, you can still hit the overhead if
SD_POWERSAVINGS_BALANCE is set, so I'd make sure both are off and see if
that's the source (likely, since the rest is already off).
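A quick way to see what each domain actually has set (a sketch; needs
CONFIG_SCHED_DEBUG, and the SD_* bit values to decode against come from
include/linux/sched.h in this tree):

# dump per-domain flags; decode against the SD_* defines to see whether
# SD_PREFER_LOCAL / SD_POWERSAVINGS_BALANCE are still set at some level
for f in /proc/sys/kernel/sched_domain/cpu0/domain*/flags; do
        printf '%s: %s\n' "$f" "$(cat $f)"
done
# the flags files are writable, so a bit can also be cleared there for a quick test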
-Mike
On Thu, 2009-10-29 at 06:46 +0100, Mike Galbraith wrote:
> On Thu, 2009-10-29 at 08:50 +0800, Zhang, Yanmin wrote:
> > On Wed, 2009-10-28 at 15:22 +0100, Mike Galbraith wrote:
> > > On Wed, 2009-10-28 at 17:29 +0800, Zhang, Yanmin wrote:
> > > > -Mike
> > > > I'm investigating the 5% tbench regression on a Nehalem machine. perf_counter shows
> > > > select_task_rq_fair consumes about 5% of cpu time with 2.6.32-rc1, while it consumes
> > > > less than 0.5% with 2.6.31.
> > > >
> > > > Patch c88d5910890 has comments to explain it, but I still can't understand why
> > > > such complicated balance logic is needed when selecting a task's rq.
> > > >
> > > > I will check which section of select_task_rq_fair consumes so much time.
> > >
> > > Turn off SD_WAKE_BALANCE as it was called in rc1. See commit 182a85f.
> > I ran the test against 2.6.32-rc1, which already includes that patch.
>
> Duh, I checked the wrong tree.
>
> SD_PREFER_LOCAL is still on in rc1 though (double checks;), so you'll go
> through the power saving code until you reach a domain containing both
> waker's cpu and wakee's previous cpu even if that code already found
> that a higher domain wasn't overloaded. Looks to me like that block
> wants a want_sd && qualifier.
>
> Even if you turn SD_PREFER_LOCAL off, you can still hit the overhead if
> SD_POWERSAVINGS_BALANCE is set, so I'd make sure both are off and see if
> that's the source (likely, since the rest is already off).
Yes. SD_POWERSAVINGS_BALANCE is disabled by default. I applied Peter's patch, which
turns SD_PREFER_LOCAL off for the MC and CPU domains, and it doesn't help.
perf counter shows select_task_rq_fair still consumes about 5% of cpu time. Eventually,
I found the for_each_cpu iteration inside for_each_domain consumes the 5%, because Peter's
patch doesn't turn off SD_PREFER_LOCAL for the node domain.
I turned it off for the node domain against the latest tips tree and the tbench regression
disappears on one Nehalem machine and drops to about 2% on another.
Can we turn it off for the node domain by default?
On Thu, 2009-10-29 at 14:26 +0800, Zhang, Yanmin wrote:
> On Thu, 2009-10-29 at 06:46 +0100, Mike Galbraith wrote:
> > SD_PREFER_LOCAL is still on in rc1 though (double checks;), so you'll go
> > through the power saving code until you reach a domain containing both
> > waker's cpu and wakee's previous cpu even if that code already found
> > that a higher domain wasn't overloaded. Looks to me like that block
> > wants a want_sd && qualifier.
> >
> > Even if you turn SD_PREFER_LOCAL off, you can still hit the overhead if
> > SD_POWERSAVINGS_BALANCE is set, so I'd make sure both are off and see if
> > that's the source (likely, since the rest is already off).
> Yes. SD_POWERSAVINGS_BALANCE is disabled by default. I applied Peter's patch, which
> turns SD_PREFER_LOCAL off for the MC and CPU domains, and it doesn't help.
> perf counter shows select_task_rq_fair still consumes about 5% of cpu time. Eventually,
> I found the for_each_cpu iteration inside for_each_domain consumes the 5%, because Peter's
> patch doesn't turn off SD_PREFER_LOCAL for the node domain.
> I turned it off for the node domain against the latest tips tree and the tbench regression
> disappears on one Nehalem machine and drops to about 2% on another.
>
> Can we turn it off for the node domain by default?
If it's hurting fast path overhead to the tune of an order of magnitude,
I guess there's no choice but to either fix it or turn it off. Since
SD_BALANCE_WAKE is off globally, I don't see any point in keeping
SD_PREFER_LOCAL at any level.
(That said, what we need is for this to not be so expensive that we
can't afford it in the fast path).
sched: Disable SD_PREFER_LOCAL at node level.
Yanmin Zhang reported that SD_PREFER_LOCAL induces an order of magnitude
increase in select_task_rq_fair() overhead while running heavy wakeup
benchmarks (tbench and vmark). Since SD_BALANCE_WAKE is off at node level,
turn SD_PREFER_LOCAL off as well pending further investigation.
Signed-off-by: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Reported-by: Zhang, Yanmin <[email protected]>
LKML-Reference: <new-submission>
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index d823c24..40e37b1 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -143,7 +143,7 @@ extern unsigned long node_remap_size[];
| 1*SD_BALANCE_FORK \
| 0*SD_BALANCE_WAKE \
| 1*SD_WAKE_AFFINE \
- | 1*SD_PREFER_LOCAL \
+ | 0*SD_PREFER_LOCAL \
| 0*SD_SHARE_CPUPOWER \
| 0*SD_POWERSAVINGS_BALANCE \
| 0*SD_SHARE_PKG_RESOURCES \
On Thu, 2009-10-29 at 10:14 +0100, Mike Galbraith wrote:
> On Thu, 2009-10-29 at 14:26 +0800, Zhang, Yanmin wrote:
> > On Thu, 2009-10-29 at 06:46 +0100, Mike Galbraith wrote:
>
> > > SD_PREFER_LOCAL is still on in rc1 though (double checks;), so you'll go
> > > through the power saving code until you reach a domain containing both
> > > waker's cpu and wakee's previous cpu even if that code already found
> > > that a higher domain wasn't overloaded. Looks to me like that block
> > > wants a want_sd && qualifier.
> > >
> > > Even if you turn SD_PREFER_LOCAL off, you can still hit the overhead if
> > > SD_POWERSAVINGS_BALANCE is set, so I'd make sure both are off and see if
> > > that's the source (likely, since the rest is already off).
> > Yes. SD_POWERSAVINGS_BALANCE is disabled by default. I applied Peter's patch, which
> > turns SD_PREFER_LOCAL off for the MC and CPU domains, and it doesn't help.
> > perf counter shows select_task_rq_fair still consumes about 5% of cpu time. Eventually,
> > I found the for_each_cpu iteration inside for_each_domain consumes the 5%, because Peter's
> > patch doesn't turn off SD_PREFER_LOCAL for the node domain.
> > I turned it off for the node domain against the latest tips tree and the tbench regression
> > disappears on one Nehalem machine and drops to about 2% on another.
> >
> > Can we turn it off for the node domain by default?
>
> If it's hurting fast path overhead to the tune of an order of magnitude,
> I guess there's no choice but to either fix it or turn it off. Since
> SD_BALANCE_WAKE is off globally, I don't see any point in keeping
> SD_PREFER_LOCAL at any level.
>
> (That said, what we need is for this to not be so expensive that we
> can't afford it in the fast path).
>
> sched: Disable SD_PREFER_LOCAL at node level.
>
> Yanmin Zhang reported that SD_PREFER_LOCAL induces an order of magnitude
> increase in select_task_rq_fair() overhead while running heavy wakeup
> benchmarks (tbench and vmark). Since SD_BALANCE_WAKE is off at node level,
> turn SD_PREFER_LOCAL off as well pending further investigation.
Mike,
Thanks a lot! With the patch, we do see much improvement on tbench and volanoMark.
For example, volanoMark has about an 8% improvement with tips+the_patch, compared with
kernel 2.6.31.
Yanmin
>
> Signed-off-by: Mike Galbraith <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Reported-by: Zhang, Yanmin <[email protected]>
> LKML-Reference: <new-submission>
>
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index d823c24..40e37b1 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -143,7 +143,7 @@ extern unsigned long node_remap_size[];
> | 1*SD_BALANCE_FORK \
> | 0*SD_BALANCE_WAKE \
> | 1*SD_WAKE_AFFINE \
> - | 1*SD_PREFER_LOCAL \
> + | 0*SD_PREFER_LOCAL \
> | 0*SD_SHARE_CPUPOWER \
> | 0*SD_POWERSAVINGS_BALANCE \
> | 0*SD_SHARE_PKG_RESOURCES \
>
>
>