2008-10-17 17:29:53

by Peter Zijlstra

Subject: [PATCH 0/4] pending scheduler updates

Please apply
--


2008-10-20 12:06:01

by Ingo Molnar

Subject: Re: [PATCH 0/4] pending scheduler updates


* Peter Zijlstra <[email protected]> wrote:

> Please apply

applied these to tip/sched/urgent:

f9c0b09: sched: revert back to per-rq vruntime
a4c2f00: sched: fair scheduler should not resched rt tasks
ffda12a: sched: optimize group load balancer

(will wait with 4/4 until you and Mike come to a verdict.)

thanks Peter,

Ingo

2008-10-21 17:35:51

by Srivatsa Vaddagiri

Subject: Re: [PATCH 0/4] pending scheduler updates

On Mon, Oct 20, 2008 at 02:05:38PM +0200, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > Please apply
>
> applied these to tip/sched/urgent:
>
> f9c0b09: sched: revert back to per-rq vruntime
> a4c2f00: sched: fair scheduler should not resched rt tasks
> ffda12a: sched: optimize group load balancer
>
> (will wait with 4/4 until you and Mike come to a verdict.)

Is there any conclusion on Patch 4/4? It looks sane to me!

--
Regards,
vatsa

2008-10-22 09:44:04

by Ingo Molnar

Subject: Re: [PATCH 0/4] pending scheduler updates


* Srivatsa Vaddagiri <[email protected]> wrote:

> On Mon, Oct 20, 2008 at 02:05:38PM +0200, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > Please apply
> >
> > applied these to tip/sched/urgent:
> >
> > f9c0b09: sched: revert back to per-rq vruntime
> > a4c2f00: sched: fair scheduler should not resched rt tasks
> > ffda12a: sched: optimize group load balancer
> >
> > (will wait with 4/4 until you and Mike come to a verdict.)
>
> Is there any conclusion on Patch 4/4? It looks sane to me!

Mike?

Ingo

2008-10-22 10:03:55

by Mike Galbraith

Subject: Re: [PATCH 0/4] pending scheduler updates

On Wed, 2008-10-22 at 11:40 +0200, Ingo Molnar wrote:
> * Srivatsa Vaddagiri <[email protected]> wrote:
>
> > On Mon, Oct 20, 2008 at 02:05:38PM +0200, Ingo Molnar wrote:
> > >
> > > * Peter Zijlstra <[email protected]> wrote:
> > >
> > > > Please apply
> > >
> > > applied these to tip/sched/urgent:
> > >
> > > f9c0b09: sched: revert back to per-rq vruntime
> > > a4c2f00: sched: fair scheduler should not resched rt tasks
> > > ffda12a: sched: optimize group load balancer
> > >
> > > (will wait with 4/4 until you and Mike come to a verdict.)
> >
> > Is there any conclusion on Patch 4/4? It looks sane to me!
>
> Mike?

Your call of course, but I don't think it's a good trade. In testing,
it does serious injury to mysql+oltp peak throughput, and it slows down
things that need frequent preemption in order to compete effectively
against hogs (this includes the desktop, though that's not heavily
tested).

It totally kills the attached starvation testcase, which was distilled
from a real app and posted to lkml a few years ago. It's a worst-case
scenario to be sure, but that worst case is pretty dramatic. I
guesstimated it would take ~12 hours to do 10M signals with wakeup
preemption so restricted, vs 43 seconds with preemption as is. To me,
that suggests that anything ultra fast/light will pay through the nose.

It has positive effects too, but IMHO, the bad outweigh the good.

-Mike


Attachments:
starve.c (715.00 B)
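
The starve.c attachment itself is not reproduced in this archive. As a
rough illustration of the kind of test being described, here is a
hypothetical minimal sketch (not the actual attachment; names, structure
and the round count are assumptions): parent and child ping-pong SIGUSR1
while a forked CPU hog competes. If the freshly woken task cannot
preempt the hog, every round trip has to wait out the hog's timeslice,
which is how 10M signals can stretch from tens of seconds to many hours.

/*
 * Hypothetical sketch in the spirit of starve.c (NOT the actual
 * attachment): parent and child ping-pong SIGUSR1 while a CPU hog
 * competes.  If the woken task cannot preempt the hog, every round
 * trip waits out the hog's timeslice.
 */
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define ROUNDS 10000000UL

int main(void)
{
	struct timeval start, stop;
	unsigned long i;
	sigset_t set;
	pid_t hog, child;
	int sig;

	/* block SIGUSR1 so it can be consumed race-free via sigwait() */
	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	sigprocmask(SIG_BLOCK, &set, NULL);

	hog = fork();
	if (hog == 0)		/* pure CPU hog competing for the CPU */
		for (;;)
			;

	child = fork();
	if (child == 0) {	/* child: echo every signal straight back */
		for (;;) {
			sigwait(&set, &sig);
			kill(getppid(), SIGUSR1);
		}
	}

	if (hog < 0 || child < 0)	/* minimal error handling */
		return 1;

	gettimeofday(&start, NULL);
	for (i = 0; i < ROUNDS; i++) {	/* parent: ping, wait for pong */
		kill(child, SIGUSR1);
		sigwait(&set, &sig);
	}
	gettimeofday(&stop, NULL);

	printf("%lu signal round trips in %.2f seconds\n", ROUNDS,
	       (stop.tv_sec - start.tv_sec) +
	       (stop.tv_usec - start.tv_usec) / 1e6);

	kill(hog, SIGKILL);
	kill(child, SIGKILL);
	return 0;
}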

2008-10-22 10:32:23

by Mike Galbraith

Subject: Re: [PATCH 0/4] pending scheduler updates

On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:

> It has positive effects too, but IMHO, the bad outweigh the good.

BTW, the most dramatic case on the other end of the spectrum is
pgsql+oltp. With preemption as is, it collapses as load climbs to heavy
with the preemption knobs at stock. Postgres uses user-land spinlocks
and _appears_ to wake others while these are still held. For this load,
there is such a thing as too much short-term fairness: preempting the
lock holder creates a nasty gaggle of contended lock spinners. It's
curable with knobs, and I think it's postgres's own fault, but I may be
wrong.
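
For illustration, a minimal sketch of a user-land test-and-set spinlock
of the sort being described (it is not PostgreSQL's actual s_lock code;
the names are made up). The kernel never learns that the lock is held,
so preempting the holder right after it wakes a peer leaves every
runnable waiter burning its timeslice in the spin loop:

/*
 * Illustrative user-land test-and-set spinlock (not PostgreSQL's actual
 * s_lock code).  The kernel is never told that "lock" is held, so if the
 * scheduler preempts the holder right after it has woken a peer, every
 * runnable waiter just burns its timeslice in the loop below.
 */
static volatile int lock;

void spin_lock(void)
{
	/* atomically grab the lock, otherwise busy-wait in user space */
	while (__sync_lock_test_and_set(&lock, 1)) {
		while (lock)
			;	/* spinning; the kernel sees a runnable CPU hog */
	}
}

void spin_unlock(void)
{
	__sync_lock_release(&lock);
}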

With that patch, pgsql+oltp scales perfectly.

-Mike

2008-10-22 12:10:26

by Ingo Molnar

Subject: Re: [PATCH 0/4] pending scheduler updates


* Mike Galbraith <[email protected]> wrote:

> On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
>
> > It has positive effects too, but IMHO, the bad outweigh the good.
>
> BTW, the most dramatic case on the other end of the spectrum is
> pgsql+oltp. With preemption as is, it collapses as load climbs to heavy
> with the preemption knobs at stock. Postgres uses user-land spinlocks
> and _appears_ to wake others while these are still held. For this load,
> there is such a thing as too much short-term fairness: preempting the
> lock holder creates a nasty gaggle of contended lock spinners. It's
> curable with knobs, and I think it's postgres's own fault, but I may be
> wrong.
>
> With that patch, pgsql+oltp scales perfectly.

hm, tempting.

Have you tried to hack/fix pgsql to do proper wakeups?

Right now pgsql punishes schedulers that preempt it while it is
holding totally undeclared (to the kernel) user-space spinlocks ...

Hence postgresql is rewarding a _bad_ scheduler policy in essence. And
pgsql scalability seems to fall totally apart above 16 cpus - regardless
of scheduler policy.

Ingo

2008-10-22 12:38:19

by Mike Galbraith

Subject: Re: [PATCH 0/4] pending scheduler updates

On Wed, 2008-10-22 at 14:10 +0200, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
> >
> > > It has positive effects too, but IMHO, the bad outweigh the good.
> >
> > BTW, the most dramatic case on the other end of the spectrum is
> > pgsql+oltp. With preemption as is, it collapses as load climbs to
> > heavy with the preemption knobs at stock. Postgres uses user-land
> > spinlocks and _appears_ to wake others while these are still held.
> > For this load, there is such a thing as too much short-term fairness:
> > preempting the lock holder creates a nasty gaggle of contended lock
> > spinners. It's curable with knobs, and I think it's postgres's own
> > fault, but I may be wrong.
> >
> > With that patch, pgsql+oltp scales perfectly.
>
> hm, tempting.

I disagree. Postgres's scaling problem is trivially corrected by
twiddling knobs (or whatnot). With that patch, you can't twiddle mysql
throughput back, or disk intensive loads for that matter. You can tweak
the preempt number, but it has nothing to do with lag, so anybody can
preempt anybody else as you turn the knob toward zero. Chaos.

> Have you tried to hack/fix pgsql to do proper wakeups?

No, I tried to build without spinlocks to verify, but the build
croaked. I never went back to slogging through the code.

> Right now pgsql punishes schedulers that preempt it while it is
> holding totally undeclared (to the kernel) user-space spinlocks ...
>
> Hence postgresql is rewarding a _bad_ scheduler policy in essence. And
> pgsql scalability seems to fall totally apart above 16 cpus - regardless
> of scheduler policy.

If someone gives me that problem, and a credit card for the electric
company, I'll do my very extra special best to defeat it ;-)

-Mike

2008-10-22 12:42:55

by Ingo Molnar

Subject: Re: [PATCH 0/4] pending scheduler updates


* Mike Galbraith <[email protected]> wrote:

> > > With that patch, pgsql+oltp scales perfectly.
> >
> > hm, tempting.
>
> I disagree. Postgres's scaling problem is trivially corrected by
> twiddling knobs (or whatnot). [...]

okay, then we need to document it a bit more: what knobs need twiddling
to make it scale perfectly?

> [...] With that patch, you can't twiddle mysql throughput back, or
> disk intensive loads for that matter. You can tweak the preempt
> number, but it has nothing to do with lag, so anybody can preempt
> anybody else as you turn the knob toward zero. Chaos.

okay, convinced.

> > Have you tried to hack/fix pgsql to do proper wakeups?
>
> No, I tried to build without spinlocks to verify, but the build
> croaked. I never went back to slogging through the code.

If it falls back to IPC semaphores, that's a bad trade from a
performance POV. The best would be if it used proper futexes (i.e.
pthread_mutex() and friends), not some home-grown user-space spinlock
thing.
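
As a sketch of the suggested alternative (illustrative only; the
function name and setup are assumptions, not PostgreSQL code): a
process-shared pthread mutex placed in shared memory, which on Linux is
futex-backed, so contended waiters sleep in the kernel and the scheduler
knows exactly who is blocked:

/*
 * Sketch of the futex-backed alternative: a process-shared pthread mutex
 * in shared memory.  Function name and setup are illustrative, not
 * PostgreSQL code.  Contended waiters sleep in the kernel instead of
 * spinning behind the scheduler's back.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

pthread_mutex_t *shared_lock_init(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t *m;

	/* shared anonymous mapping: visible to all forked backends */
	m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (m == MAP_FAILED)
		return NULL;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_mutex_init(m, &attr);
	pthread_mutexattr_destroy(&attr);

	return m;	/* lock/unlock with pthread_mutex_lock()/unlock() */
}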

Ingo

2008-10-22 13:05:52

by Mike Galbraith

Subject: Re: [PATCH 0/4] pending scheduler updates

On Wed, 2008-10-22 at 14:42 +0200, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > > > With that patch, pgsql+oltp scales perfectly.
> > >
> > > hm, tempting.
> >
> > I disagree. Postgres's scaling problem is trivially corrected by
> > twiddling knobs (or whatnot). [...]
>
> okay, then we need to document it a bit more: what knobs need twiddling
> to make it scale perfectly?

Twiddle sched_wakeup_granularity_ns, or just turn FAIR_SLEEPERS off, for
the best pgsql+oltp scalability. The pgsql+oltp collapse is delicate; it
doesn't take much to send it one way or the other. Cut preemption back a
wee bit, and it can snap right into shape.
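
For reference, a small hypothetical example of that knob twiddling,
equivalent to echoing the values from a shell. The paths are the
2.6.27-era ones; the granularity value is illustrative, not Mike's
actual setting, and the sleeper-fairness flag was spelled
NEW_FAIR_SLEEPERS in kernels of that era:

/*
 * Hypothetical illustration of the knob twiddling described above,
 * equivalent to echoing the values from a shell.  Requires root and a
 * mounted debugfs; the 10ms value is illustrative only.
 */
#include <stdio.h>

static void set_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (f) {
		fputs(val, f);
		fclose(f);
	}
}

int main(void)
{
	/* raise the wakeup granularity, i.e. make wakeup preemption rarer */
	set_knob("/proc/sys/kernel/sched_wakeup_granularity_ns", "10000000");
	/* or disable sleeper fairness entirely (NEW_FAIR_SLEEPERS back then) */
	set_knob("/sys/kernel/debug/sched_features", "NO_NEW_FAIR_SLEEPERS");
	return 0;
}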

I think we need to penalize sleepers a wee bit as load climbs in
general, though: everybody begins to like preemption less as load
increases. Mysql+oltp improves as well, and it loves preemption at
modest load.

-Mike

2008-10-22 17:49:28

by Peter Zijlstra

Subject: Re: [PATCH 0/4] pending scheduler updates

On Wed, 2008-10-22 at 12:32 +0200, Mike Galbraith wrote:
> On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
>
> > It has positive effects too, but IMHO, the bad outweigh the good.
>
> BTW, the most dramatic case on the other end of the spectrum is
> pgsql+oltp. With preemption as is, it collapses as load climbs to heavy
> with the preemption knobs at stock. Postgres uses user-land spinlocks
> and _appears_ to wake others while these are still held. For this load,
> there is such a thing as too much short-term fairness: preempting the
> lock holder creates a nasty gaggle of contended lock spinners. It's
> curable with knobs, and I think it's postgres's own fault, but I may be
> wrong.
>
> With that patch, pgsql+oltp scales perfectly.

Are we talking about this patch, which re-instates the vruntime-based
wakeup preemption?

---
Subject: sched: fix wakeup preemption
From: Peter Zijlstra <[email protected]>

Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched_fair.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 92 insertions(+), 6 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -143,6 +143,49 @@ static inline struct sched_entity *paren
 	return se->parent;
 }
 
+/* return depth at which a sched entity is present in the hierarchy */
+static inline int depth_se(struct sched_entity *se)
+{
+	int depth = 0;
+
+	for_each_sched_entity(se)
+		depth++;
+
+	return depth;
+}
+
+static void
+find_matching_se(struct sched_entity **se, struct sched_entity **pse)
+{
+	int se_depth, pse_depth;
+
+	/*
+	 * preemption test can be made between sibling entities who are in the
+	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
+	 * both tasks until we find their ancestors who are siblings of common
+	 * parent.
+	 */
+
+	/* First walk up until both entities are at same depth */
+	se_depth = depth_se(*se);
+	pse_depth = depth_se(*pse);
+
+	while (se_depth > pse_depth) {
+		se_depth--;
+		*se = parent_entity(*se);
+	}
+
+	while (pse_depth > se_depth) {
+		pse_depth--;
+		*pse = parent_entity(*pse);
+	}
+
+	while (!is_same_group(*se, *pse)) {
+		*se = parent_entity(*se);
+		*pse = parent_entity(*pse);
+	}
+}
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
@@ -193,6 +236,11 @@ static inline struct sched_entity *paren
 	return NULL;
 }
 
+static inline void
+find_matching_se(struct sched_entity **se, struct sched_entity **pse)
+{
+}
+
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 
@@ -1244,13 +1292,42 @@ static unsigned long wakeup_gran(struct
 	 * More easily preempt - nice tasks, while not making it harder for
 	 * + nice tasks.
 	 */
-	if (sched_feat(ASYM_GRAN))
-		gran = calc_delta_mine(gran, NICE_0_LOAD, &se->load);
+	if (!sched_feat(ASYM_GRAN) || se->load.weight > NICE_0_LOAD)
+		gran = calc_delta_fair(sysctl_sched_wakeup_granularity, se);
 
 	return gran;
 }
 
 /*
+ * Should 'se' preempt 'curr'.
+ *
+ *             |s1
+ *        |s2
+ *   |s3
+ *         g
+ *      |<--->|c
+ *
+ *  w(c, s1) = -1
+ *  w(c, s2) =  0
+ *  w(c, s3) =  1
+ *
+ */
+static int
+wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
+{
+	s64 gran, vdiff = curr->vruntime - se->vruntime;
+
+	if (vdiff < 0)
+		return -1;
+
+	gran = wakeup_gran(curr);
+	if (vdiff > gran)
+		return 1;
+
+	return 0;
+}
+
+/*
  * Preempt the current task with a newly woken task if needed:
  */
 static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
@@ -1258,7 +1335,6 @@ static void check_preempt_wakeup(struct
 	struct task_struct *curr = rq->curr;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	struct sched_entity *se = &curr->se, *pse = &p->se;
-	s64 delta_exec;
 
 	if (unlikely(rt_prio(p->prio))) {
 		update_rq_clock(rq);
@@ -1296,9 +1372,19 @@ static void check_preempt_wakeup(struct
 		return;
 	}
 
-	delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
-	if (delta_exec > wakeup_gran(pse))
-		resched_task(curr);
+	find_matching_se(&se, &pse);
+
+	while (se) {
+		BUG_ON(!pse);
+
+		if (wakeup_preempt_entity(se, pse) == 1) {
+			resched_task(curr);
+			break;
+		}
+
+		se = parent_entity(se);
+		pse = parent_entity(pse);
+	}
 }
 
 static struct task_struct *pick_next_task_fair(struct rq *rq)

2008-10-22 17:57:24

by Mike Galbraith

Subject: Re: [PATCH 0/4] pending scheduler updates

On Wed, 2008-10-22 at 19:38 +0200, Peter Zijlstra wrote:
> On Wed, 2008-10-22 at 12:32 +0200, Mike Galbraith wrote:
> > On Wed, 2008-10-22 at 12:03 +0200, Mike Galbraith wrote:
> >
> > > It has positive effects too, but IMHO, the bad outweigh the good.
> >
> > BTW, the most dramatic case on the other end of the spectrum is
> > pgsql+oltp. With preemption as is, it collapses as load climbs to
> > heavy with the preemption knobs at stock. Postgres uses user-land
> > spinlocks and _appears_ to wake others while these are still held.
> > For this load, there is such a thing as too much short-term fairness:
> > preempting the lock holder creates a nasty gaggle of contended lock
> > spinners. It's curable with knobs, and I think it's postgres's own
> > fault, but I may be wrong.
> >
> > With that patch, pgsql+oltp scales perfectly.
>
> Are we talking about this patch, which re-instates the vruntime-based
> wakeup preemption?

No, if it were that one, I'd be tinkering with mysql+oltp. Everything
else I tested with that patch (limited time, but a fairly wide spectrum)
was fine, including interactivity. Caveat: I'm not comfortable with the
tbench/netperf test results; I'd need to backport to 26 to feel at all
confident about those. (fwtw)

-Mike