2008-06-06 05:03:54

by Greg Smith

Subject: Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+

On Tue, 27 May 2008, Mike Galbraith wrote:

> Care to give the below a whirl? It fixes the over-enthusiastic affinity
> bug in a less restrictive way. It doesn't attempt to address the needs
> of any particular load though, that needs more thought (tricky issue).
>
> With default features, I get the below...

Sorry I didn't get back to you until now, got distracted for a bit.
Here's my table now updated with this patched version and with your
numbers for comparison, since we have the same basic processor setup:

Clients   2.6.22.19   2.6.26.git   patched    Mike
      1        7660        11043     11003   10122
      2       17798        11452     16868   14360
      3       29612        13231     20381   17049
      4       25584        13053     22222   18749
      6       25295        12263     23546   24913
      8       24344        11748     23895   27976
     10       23963        11612     22492   29347
     15       23026        11414     21896   29157
     20       22549        11332     21015   28392
     30       22074        10743     18411   26590
     40       21495        10406     17982   24422
     50       20051        10534     17009   23306

So this is a huge win for this patch compared with the stock 2.6.26.git
(I'm still using the daily snapshot from 2008-05-26) and a nice
improvement over the earlier, smaller patches I tested in this thread
(which peaked at 19537 for 10 clients for me with default features, vs. a
peak of 23895 @ 8 here).

I think I might not be testing exactly the same thing you did, though,
because the pattern doesn't match. I think that my Q6600 system runs a
little bit faster than yours, which is the case for small numbers of
clients here. But once we get above 8 clients your setup is way faster,
with the difference at 15 clients being the largest. Were you perhaps
using batch mode when you generated these results? Only thing I could
think of that would produce this pattern. If it's not something simple
like that, I may have to dig into whether there was some change in the git
snapshot between what you tested and what I did.

Regardless, clearly your patch reduces the regression with the default
parameters to a mild one instead of the gigantic one we started with.
Considering how generally incompatible this benchmark is with this
scheduler, and that there are clear workarounds (feature disabling) I can
document in PostgreSQL land to "fix" the problem defined for me now, I'd
be happy if all that came from this investigation was this change. I'd
hope that being strengthened against this workload improves the
scheduler's robustness for other tasks of this type, which I'm sure there
are more of than just pgbench.

You get my vote for moving toward committing it+backport even if
the improvement is only what I saw in my tests. If I can figure out how
to get closer to the results you got, all the better.

--
* Greg Smith [email protected] http://www.gregsmith.com Baltimore, MD


2008-06-06 06:13:17

by Mike Galbraith

Subject: Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+


On Fri, 2008-06-06 at 01:03 -0400, Greg Smith wrote:

> I think I might not be testing exactly the same thing you did, though,
> because the pattern doesn't match. I think that my Q6600 system runs a
> little bit faster than yours, which is the case for small numbers of
> clients here. But once we get above 8 clients your setup is way faster,
> with the difference at 15 clients being the largest. Were you perhaps
> using batch mode when you generated these results?

No, those were with stock settings.

> Regardless, clearly your patch reduces the regression with the default
> parameters to a mild one instead of the gigantic one we started with.

Unfortunately, after the recent reverts, we're right back to huge :-/

I'm trying to come up with a dirt simple solution that doesn't harm
other load types. I've found no clear reason why we regressed so badly;
it seems to be a luck-of-the-draw run order thing. As soon as the load
starts jamming up a bit, it avalanches into a serialized mess again. I
know the why; I just need to find that dirt simple, pure win fix.
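
For reference, the 1:N wakeup pattern at issue looks roughly like the toy
client/server program below (a hypothetical sketch, not pgbench or the
starve.c testcase): one waker fans requests out to N sleepers over pipes, so
every request it writes wakes a client that, with affine wakeup preemption,
can immediately preempt it on its own CPU.

/*
 * onetomany.c: toy 1:N waker/wakee load (illustration only).
 * One "server" drives N "clients" over pipes; each write is a wakeup
 * whose wakee may in turn wakeup-preempt the server.
 *
 * Build: cc -O2 -o onetomany onetomany.c
 * Run:   ./onetomany 10
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define ROUNDS 100000

int main(int argc, char **argv)
{
	int n = argc > 1 ? atoi(argv[1]) : 4;
	int req[n][2], rep[n][2];	/* server->client, client->server */
	int i, j, r;
	char c = '!';

	for (i = 0; i < n; i++)
		if (pipe(req[i]) || pipe(rep[i])) {
			perror("pipe");
			return 1;
		}

	for (i = 0; i < n; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			return 1;
		}
		if (!pid) {
			/* client: keep only our own pipe ends */
			for (j = 0; j < n; j++) {
				if (j == i)
					continue;
				close(req[j][0]);
				close(req[j][1]);
				close(rep[j][0]);
				close(rep[j][1]);
			}
			close(req[i][1]);
			close(rep[i][0]);
			/* echo one byte per request until EOF */
			while (read(req[i][0], &c, 1) == 1)
				if (write(rep[i][1], &c, 1) != 1)
					break;
			_exit(0);
		}
	}

	/* server keeps the request write ends and reply read ends */
	for (i = 0; i < n; i++) {
		close(req[i][0]);
		close(rep[i][1]);
	}

	/* 1:N fan-out: every write below is an affine wakeup candidate */
	for (r = 0; r < ROUNDS; r++) {
		for (i = 0; i < n; i++)
			if (write(req[i][1], &c, 1) != 1)
				return 1;
		for (i = 0; i < n; i++)
			if (read(rep[i][0], &c, 1) != 1)
				return 1;
	}

	for (i = 0; i < n; i++)
		close(req[i][1]);	/* EOF terminates the clients */
	while (wait(NULL) > 0)
		;
	return 0;
}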

> Considering how generally incompatible this benchmark is with this
> scheduler, and that there are clear workarounds (feature disabling) I can
> document in PostgreSQL land to "fix" the problem defined for me now, I'd
> be happy if all that came from this investigation was this change. I'd
> hope that being strengthened against this workload improves the
> scheduler's robustness for other tasks of this type, which I'm sure there
> are more of than just pgbench.

I consider pgbench to be a pretty excellent testcase. Getting this
fixed properly will certainly benefit similar loads, Xorg being one
that's just not as extreme as pgbench.

> You get my vote for moving toward committing it+backport even if
> the improvement is only what I saw in my tests. If I can figure out how
> to get closer to the results you got, all the better.

It's committed, but I don't think a back-port is justified. It does
what it's supposed to do, but there's a part 2. I suspect that your
results differ from mine due to that luck-of-the-draw run order thing.

-Mike

2008-06-07 11:38:20

by Mike Galbraith

Subject: Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+


On Fri, 2008-06-06 at 08:13 +0200, Mike Galbraith wrote:
> On Fri, 2008-06-06 at 01:03 -0400, Greg Smith wrote:
>
> > I think I might not be testing exactly the same thing you did, though,
> > because the pattern doesn't match. I think that my Q6600 system runs a
> > little bit faster than yours, which is the case for small numbers of
> > clients here. But once we get above 8 clients your setup is way faster,
> > with the difference at 15 clients being the largest. Were you perhaps
> > using batch mode when you generated these results?
>
> No, those were with stock settings.
>
> > Regardless, clearly your patch reduces the regression with the default
> > parameters to a mild one instead of the gigantic one we started with.
>
> Unfortunately, after the recent reverts, we're right back to huge :-/
>
> I'm trying to come up with a dirt simple solution that doesn't harm
> other load types.

The below doesn't hurt my volanomark numbers of the day, helps pgbench
considerably, and improves the higher client end of mysql+oltp a wee
bit. It may hurt the low end a wee bit, but the low end is always
pretty unstable, so it's hard to tell with only three runs.

pgbench
Clients      2.6.26-rc5 (2 runs)           2.6.26-rc5+ (2 runs)
      1    10213.768037  10237.990274    10165.511814  10183.705908
      2    15885.949053  15519.005195    14994.697875  15204.900479
      3    15663.233356  16043.733087    16554.371722  17279.376443
      4    14193.807355  15799.792612    18447.345925  18088.861169
      5    17239.456219  17326.938538    20119.250823  18537.351094
      6    15293.624093  14272.208159    21439.841579  22634.887824
      8    12483.727461  13486.991527    25579.379337  25908.373483
     10    11919.023584  12058.503518    23876.035623  22403.867804
     15    10128.724654  11253.959398    23276.797649  23595.597093
     20     9645.056147   9980.465235    23603.315133  23256.506240
     30     9288.747962   8801.059613    23633.448266  23229.286697
     40     8494.705123   8323.107702    22925.552706  23081.526954
     50     8357.781935   8239.867147    19102.481374  19558.624434

volanomark
2.6.26-rc5
test-1.log:Average throughput = 101768 messages per second
test-2.log:Average throughput = 99124 messages per second
test-3.log:Average throughput = 99821 messages per second
test-1.log:Average throughput = 101362 messages per second
test-2.log:Average throughput = 98891 messages per second
test-3.log:Average throughput = 99164 messages per second

2.6.26-rc5+
test-1.log:Average throughput = 103275 messages per second
test-2.log:Average throughput = 100034 messages per second
test-3.log:Average throughput = 99434 messages per second
test-1.log:Average throughput = 100460 messages per second
test-2.log:Average throughput = 100188 messages per second
test-3.log:Average throughput = 99617 messages per second


Index: linux-2.6.26.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched_fair.c
+++ linux-2.6.26.git/kernel/sched_fair.c
@@ -664,6 +664,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st

update_stats_dequeue(cfs_rq, se);
if (sleep) {
+ se->last_preempter = NULL;
update_avg_stats(cfs_rq, se);
#ifdef CONFIG_SCHEDSTATS
if (entity_is_task(se)) {
@@ -692,8 +693,10 @@ check_preempt_tick(struct cfs_rq *cfs_rq

ideal_runtime = sched_slice(cfs_rq, curr);
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
- if (delta_exec > ideal_runtime)
+ if (delta_exec > ideal_runtime) {
+ curr->last_preempter = NULL;
resched_task(rq_of(cfs_rq)->curr);
+ }
}

static void
@@ -994,6 +997,7 @@ wake_affine(struct rq *rq, struct sched_
unsigned int imbalance)
{
struct task_struct *curr = this_rq->curr;
+ struct sched_entity *se = &curr->se, *pse = &p->se;
unsigned long tl = this_load;
unsigned long tl_per_task;
int balanced;
@@ -1002,14 +1006,26 @@ wake_affine(struct rq *rq, struct sched_
return 0;

/*
+ * If the current task is being wakeup preempted by multiple tasks
+ * that it awakened, such that it can't get significant work done
+ * between preemptions, try to spread these preemption sources.
+ */
+ if (sync && se->last_preempter && se->last_preempter != pse) {
+ u64 se_last_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+ if (se_last_exec < sysctl_sched_migration_cost)
+ return 0;
+ }
+
+ /*
* If sync wakeup then subtract the (maximum possible)
* effect of the currently running task from the load
* of the current CPU:
*/
if (sync)
- tl -= current->se.load.weight;
+ tl -= se->load.weight;

- balanced = 100*(tl + p->se.load.weight) <= imbalance*load;
+ balanced = 100*(tl + pse->load.weight) <= imbalance*load;

/*
* If the currently running task will sleep within
@@ -1017,8 +1033,8 @@ wake_affine(struct rq *rq, struct sched_
* woken task:
*/
if (sync && balanced && curr->sched_class == &fair_sched_class) {
- if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
- p->se.avg_overlap < sysctl_sched_migration_cost)
+ if (se->avg_overlap < sysctl_sched_migration_cost &&
+ pse->avg_overlap < sysctl_sched_migration_cost)
return 1;
}

@@ -1219,8 +1235,27 @@ static void check_preempt_wakeup(struct
pse = parent_entity(pse);
}

- if (wakeup_preempt_entity(se, pse) == 1)
- resched_task(curr);
+ if (wakeup_preempt_entity(se, pse) == 1) {
+ int preempt = 1;
+
+ /*
+ * If current task is being preempted by multiple wakees,
+ * tag it for 1:N affine wakeup preemption avoidance.
+ */
+ if (se->last_preempter && se->last_preempter != pse &&
+ se->load.weight >= pse->load.weight) {
+ u64 exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+ if (exec < sysctl_sched_migration_cost)
+ preempt = 0;
+ }
+
+ if (se == &current->se)
+ se->last_preempter = pse;
+
+ if (preempt)
+ resched_task(curr);
+ }
}

static struct task_struct *pick_next_task_fair(struct rq *rq)
Index: linux-2.6.26.git/include/linux/sched.h
===================================================================
--- linux-2.6.26.git.orig/include/linux/sched.h
+++ linux-2.6.26.git/include/linux/sched.h
@@ -963,6 +963,7 @@ struct sched_entity {

u64 last_wakeup;
u64 avg_overlap;
+ struct sched_entity *last_preempter;

#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
Index: linux-2.6.26.git/kernel/sched.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched.c
+++ linux-2.6.26.git/kernel/sched.c
@@ -2176,6 +2176,7 @@ static void __sched_fork(struct task_str
p->se.prev_sum_exec_runtime = 0;
p->se.last_wakeup = 0;
p->se.avg_overlap = 0;
+ p->se.last_preempter = NULL;

#ifdef CONFIG_SCHEDSTATS
p->se.wait_start = 0;

2008-06-07 12:51:23

by Mike Galbraith

Subject: Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+

Since I tested mysql+oltp and made the dang pdf of the results, I may
as well actually attach the thing <does that before continuing...>.

BTW, I have a question wrt avg_overlap. When a wakeup causes the current
task to begin sharing CPU with a freshly awakened task, the current task
is tagged.. but the wakee isn't. How come? If one is sharing, so is
the other.

-Mike


Attachments:
zz.pdf (16.19 kB)

2008-06-07 13:08:20

by Peter Zijlstra

Subject: Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+

On Sat, 2008-06-07 at 14:50 +0200, Mike Galbraith wrote:
> Since I tested mysql+oltp and made the dang pdf of the results, I may
> as well actually attach the thing <does that before continuing...>.
>
> BTW, I have a question wrt avg_overlap. When a wakeup causes the current
> task to begin sharing CPU with a freshly awakened task, the current task
> is tagged.. but the wakee isn't. How come? If one is sharing, so is
> the other.

avg_overlap is about measuring how long we'll run after waking someone
else. The other measure, how long our waker shares the cpu with us,
hasn't proven to be relevant so far.
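
In code, that bookkeeping looks roughly like the sketch below (a from-memory
userspace reconstruction of the 2.6.26-era logic in kernel/sched.c and
kernel/sched_fair.c, not a verbatim kernel excerpt; the main() driver is
invented for illustration). The waker stamps last_wakeup on itself at wakeup
time and folds the sample into avg_overlap when it later sleeps, which is
also why only the waker gets tagged:

/* avg_overlap bookkeeping, sketched in userspace for illustration */
#include <stdio.h>

typedef unsigned long long u64;
typedef long long s64;

struct sched_entity {
	u64 sum_exec_runtime;	/* total ns this task has executed */
	u64 last_wakeup;	/* sum_exec_runtime when it last woke somebody */
	u64 avg_overlap;	/* EWMA: "ran this long after waking someone" */
};

/* 1/8th-gain exponentially weighted moving average */
static void update_avg(u64 *avg, u64 sample)
{
	s64 diff = sample - *avg;

	*avg += diff >> 3;
}

/* as in try_to_wake_up(): only the waker stamps itself */
static void note_wakeup(struct sched_entity *waker)
{
	waker->last_wakeup = waker->sum_exec_runtime;
}

/* as in dequeue_entity() with sleep=1: fold in how long the waker
 * kept running after its last wakeup of somebody else */
static void update_avg_stats(struct sched_entity *se)
{
	if (!se->last_wakeup)
		return;

	update_avg(&se->avg_overlap, se->sum_exec_runtime - se->last_wakeup);
	se->last_wakeup = 0;
}

int main(void)
{
	struct sched_entity se = { 0, 0, 0 };
	int i;

	for (i = 0; i < 10; i++) {
		se.sum_exec_runtime += 1000000;	/* run 1ms */
		note_wakeup(&se);		/* wake a client */
		se.sum_exec_runtime += 100000;	/* run 100us more (overlap) */
		update_avg_stats(&se);		/* then go to sleep */
	}
	/* a small avg_overlap is what wake_affine() treats as "sync" */
	printf("avg_overlap ~= %llu ns\n", se.avg_overlap);
	return 0;
}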

2008-06-07 13:09:06

by Peter Zijlstra

Subject: Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+

On Sat, 2008-06-07 at 13:38 +0200, Mike Galbraith wrote:

Interesting.. Looks good.

> [quoted patch snipped]

2008-06-07 14:16:35

by Mike Galbraith

Subject: Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+


On Sat, 2008-06-07 at 15:07 +0200, Peter Zijlstra wrote:
> On Sat, 2008-06-07 at 14:50 +0200, Mike Galbraith wrote:
> > Since I tested mysql+oltp and made the dang pdf of the results, I may
> > as well actually attach the thing <does that before continuing...>.
> >
> > BTW, I have a question wrt avg_overlap. When a wakeup causes the current
> > task to begin sharing CPU with a freshly awakened task, the current task
> > is tagged.. but the wakee isn't. How come? If one is sharing, so is
> > the other.
>
> avg_overlap is about measuring how long we'll run after waking someone
> else. The other measure, how long our waker shares the cpu with us,
> hasn't proven to be relevant so far.

Yeah wrt relevance, I've been playing with making it mean this and that,
with approx 0 success ;-) If it's a measure of how long we run after
waking though, don't we need to make sure it's not a cross CPU wakeup?

-Mike

2008-06-07 14:54:19

by Mike Galbraith

Subject: [patch part 2] Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+


On Sat, 2008-06-07 at 15:08 +0200, Peter Zijlstra wrote:
> On Sat, 2008-06-07 at 13:38 +0200, Mike Galbraith wrote:
>
> Interesting.. Looks good.

In that case it _might_ fly, so needs changelog and blame line.

Tasks which awaken many clients can suffer terribly due to affine wakeup
preemption. This can (and for pgbench does) lead to serialization of the
entire load on one CPU, due to the ever-lowering throughput of the preempted
waker and the constant affine wakeup of many preempters. Prevent this by
noticing when multi-task preemption is going on, ensuring that the 1:N waker
can always do a reasonable batch of work, and temporarily restricting further
affine wakeups.

Signed-off-by: Mike Galbraith <[email protected]>


 include/linux/sched.h |  1 +
 kernel/sched.c        |  1 +
 kernel/sched_fair.c   | 49 ++++++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae0be3c..73b7d23 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -963,6 +963,7 @@ struct sched_entity {

u64 last_wakeup;
u64 avg_overlap;
+ struct sched_entity *last_preempter;

#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
diff --git a/kernel/sched.c b/kernel/sched.c
index bfb8ad8..deb30e9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2176,6 +2176,7 @@ static void __sched_fork(struct task_struct *p)
p->se.prev_sum_exec_runtime = 0;
p->se.last_wakeup = 0;
p->se.avg_overlap = 0;
+ p->se.last_preempter = NULL;

#ifdef CONFIG_SCHEDSTATS
p->se.wait_start = 0;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 08ae848..4539a79 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -664,6 +664,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)

update_stats_dequeue(cfs_rq, se);
if (sleep) {
+ se->last_preempter = NULL;
update_avg_stats(cfs_rq, se);
#ifdef CONFIG_SCHEDSTATS
if (entity_is_task(se)) {
@@ -692,8 +693,10 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)

ideal_runtime = sched_slice(cfs_rq, curr);
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
- if (delta_exec > ideal_runtime)
+ if (delta_exec > ideal_runtime) {
+ curr->last_preempter = NULL;
resched_task(rq_of(cfs_rq)->curr);
+ }
}

static void
@@ -994,6 +997,7 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
unsigned int imbalance)
{
struct task_struct *curr = this_rq->curr;
+ struct sched_entity *se = &curr->se, *pse = &p->se;
unsigned long tl = this_load;
unsigned long tl_per_task;
int balanced;
@@ -1002,14 +1006,26 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
return 0;

/*
+ * If the current task is being wakeup preempted by multiple tasks
+ * that it awakened, such that it can't get significant work done
+ * between preemptions, try to spread these preemption sources.
+ */
+ if (sync && se->last_preempter && se->last_preempter != pse) {
+ u64 se_last_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+ if (se_last_exec < sysctl_sched_migration_cost)
+ return 0;
+ }
+
+ /*
* If sync wakeup then subtract the (maximum possible)
* effect of the currently running task from the load
* of the current CPU:
*/
if (sync)
- tl -= current->se.load.weight;
+ tl -= se->load.weight;

- balanced = 100*(tl + p->se.load.weight) <= imbalance*load;
+ balanced = 100*(tl + pse->load.weight) <= imbalance*load;

/*
* If the currently running task will sleep within
@@ -1017,8 +1033,8 @@ wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
* woken task:
*/
if (sync && balanced && curr->sched_class == &fair_sched_class) {
- if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
- p->se.avg_overlap < sysctl_sched_migration_cost)
+ if (se->avg_overlap < sysctl_sched_migration_cost &&
+ pse->avg_overlap < sysctl_sched_migration_cost)
return 1;
}

@@ -1219,8 +1235,27 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p)
pse = parent_entity(pse);
}

- if (wakeup_preempt_entity(se, pse) == 1)
- resched_task(curr);
+ if (wakeup_preempt_entity(se, pse) == 1) {
+ int preempt = 1;
+
+ /*
+ * If current task is being preempted by multiple wakees,
+ * tag it for 1:N affine wakeup preemption avoidance.
+ */
+ if (se->last_preempter && se->last_preempter != pse &&
+ se->load.weight >= pse->load.weight) {
+ u64 exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+ if (exec < sysctl_sched_migration_cost)
+ preempt = 0;
+ }
+
+ if (se == &current->se)
+ se->last_preempter = pse;
+
+ if (preempt)
+ resched_task(curr);
+ }
}

static struct task_struct *pick_next_task_fair(struct rq *rq)

2008-06-07 16:12:27

by Peter Zijlstra

Subject: Re: [patch part 2] Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+

On Sat, 2008-06-07 at 16:54 +0200, Mike Galbraith wrote:
> On Sat, 2008-06-07 at 15:08 +0200, Peter Zijlstra wrote:
> > On Sat, 2008-06-07 at 13:38 +0200, Mike Galbraith wrote:
> >
> > Interesting.. Looks good.
>
> In that case it _might_ fly, so needs changelog and blame line.

Just wondering, how much effect does the last_preempter stuff have? It
seems to me the minimum runtime check ought to throttle these wakeups
quite a bit as well.


2008-06-07 16:16:54

by Peter Zijlstra

Subject: Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+

On Sat, 2008-06-07 at 16:16 +0200, Mike Galbraith wrote:
> On Sat, 2008-06-07 at 15:07 +0200, Peter Zijlstra wrote:
> > On Sat, 2008-06-07 at 14:50 +0200, Mike Galbraith wrote:
> > > Since I tested mysql+oltp and made the dang pdf of the results, I may
> > > as well actually attach the thing <does that before continuing...>.
> > >
> > > BTW, I have a question wrt avg_overlap. When a wakeup causes the current
> > > task to begin sharing CPU with a freshly awakened task, the current task
> > > is tagged.. but the wakee isn't. How come? If one is sharing, so is
> > > the other.
> >
> > avg_overlap is about measuring how long we'll run after waking someone
> > else. The other measure, how long our waker shares the cpu with us,
> > hasn't proven to be relevant so far.
>
> Yeah wrt relevance, I've been playing with making it mean this and that,
> with approx 0 success ;-) If it's a measure of how long we run after
> waking though, don't we need to make sure it's not a cross CPU wakeup?

The idea was to dynamically detect sync wakeups, whose defining property
is that the waker will sleep after waking the wakee, and whose effect is
pulling tasks together on wakeups - so that we might have the most
benefit of cache sharing.

So if we were to exclude cross cpu wakeups from this measurement we'd
handicap the whole scheme, because then we'd never measure that it's
actually a sync wakeup and wants to run on the same cpu.

2008-06-07 17:53:18

by Mike Galbraith

Subject: Re: [patch part 2] Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+


On Sat, 2008-06-07 at 18:12 +0200, Peter Zijlstra wrote:
> On Sat, 2008-06-07 at 16:54 +0200, Mike Galbraith wrote:
> > On Sat, 2008-06-07 at 15:08 +0200, Peter Zijlstra wrote:
> > > On Sat, 2008-06-07 at 13:38 +0200, Mike Galbraith wrote:
> > >
> > > Interesting.. Looks good.
> >
> > In that case it _might_ fly, so needs changelog and blame line.
>
> Just wondering, how much effect does the last_preempter stuff have? It
> seems to me the minimum runtime check ought to throttle these wakeups
> quite a bit as well.

Without last_preempter, you'd have all tasks having a minimum runtime.
That would harm the single cpu starve.c testcase for sure, and anything
like it. I wanted to target this pretty accurately to 1:N type loads.

If you mean not trying to disperse preempters, I can test without it.

-Mike

2008-06-07 17:56:34

by Mike Galbraith

Subject: Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+


On Sat, 2008-06-07 at 18:16 +0200, Peter Zijlstra wrote:

> The idea was to dynamically detect sync wakeups, whose defining property
> is that the waker will sleep after waking the wakee, and whose effect is
> pulling tasks together on wakeups - so that we might have the most
> benefit of cache sharing.
>
> So if we were to exclude cross cpu wakeups from this measurement we'd
> handicap the whole scheme, because then we'd never measure that it's
> actually a sync wakeup and wants to run on the same cpu.

Ah, I get it now, thanks.

-Mike

2008-06-07 18:19:25

by Mike Galbraith

Subject: Re: [patch part 2] Re: [patch] Re: PostgreSQL pgbench performance regression in 2.6.23+


On Sat, 2008-06-07 at 19:53 +0200, Mike Galbraith wrote:
> On Sat, 2008-06-07 at 18:12 +0200, Peter Zijlstra wrote:

> > Just wondering, how much effect does the last_preempter stuff have? It
> > seems to me the minimum runtime check ought to throttle these wakeups
> > quite a bit as well.
>
> Without last_preempter, you'd have all tasks having a minimum runtime.
> That would harm the single cpu starve.c testcase for sure, and anything
> like it. I wanted to target this pretty accurately to 1:N type loads.
>
> If you mean not trying to disperse preempters, I can test without it.

pgbench
Clients      2.6.26-rc5+ (2 runs)          2.6.26-rc5+, no disperse (2 runs)
      1    10165.511814  10183.705908    10191.865953  10186.995546
      2    14994.697875  15204.900479    15209.856474  15239.639522
      3    16554.371722  17279.376443    16431.588533  15828.812843
      4    18447.345925  18088.861169    15967.533533  16827.107528
      5    20119.250823  18537.351094    17890.057368  18829.423686
      6    21439.841579  22634.887824    18562.389387  18907.807327
      8    25579.379337  25908.373483    19527.104304  19687.221241
     10    23876.035623  22403.867804    22635.429472  20627.666899
     15    23276.797649  23595.597093    22695.938882  22233.399329
     20    23603.315133  23256.506240    22623.205980  22637.340746
     30    23633.448266  23229.286697    22736.523283  22691.638135
     40    22925.552706  23081.526954    20037.610595  22174.404351
     50    19102.481374  19558.624434    21459.370223  21664.820102