2012-05-21 15:46:17

by Martin Schwidefsky

Subject: [PATCH 0/2] RFC: readd fair sleepers for server systems

our performance team found a performance degradation with a recent
distribution update in regard to fair sleepers (or the lack of fair
sleepers). On s390 we used to run with fair sleepers disabled.

We see the performance degradation with our network benchmark and fair
sleepers enabled, the largest hit is on virtual connections:

VM guest Hipersockets
Throughput degrades up to 18%
CPU load/cost increase up to 17%
VM stream
Throughput degrades up to 15%
CPU load/cost increase up to 22%
LPAR Hipersockets
Throughput degrades up to 27%
CPU load/cost increase up to 20%

Real world workloads are also affected, e.g. we see degradations with OLTP
database workloads. Christian has the numbers if needed.
The only workload on s390 with a performance benefit with fair sleepers
enabled are some J2EE workloads, but only in the <2% area.

In short, we want the fair sleepers tunable back. I understand that on
x86 we want to avoid the cost of a branch on the hot path in place_entity,
therefore add a compile time config option for the fair sleeper control.

blue skies,
Martin

Martin Schwidefsky (2):
sched: readd FAIR_SLEEPERS feature
sched: enable FAIR_SLEEPERS for s390

arch/s390/Kconfig | 1 +
init/Kconfig | 3 +++
kernel/sched/fair.c | 14 +++++++++++++-
kernel/sched/features.h | 9 +++++++++
4 files changed, 26 insertions(+), 1 deletion(-)

--
1.7.10.2


2012-05-21 15:46:12

by Martin Schwidefsky

Subject: [PATCH 2/2] sched: enable FAIR_SLEEPERS for s390

For s390 the performance benefit that comes with the ability to disable
fair sleepers is more important than the saved branch on the hot path
of the scheduler. Therefore enable CONFIG_SCHED_FAIR_SLEEPERS.

Reported-by: Christian Ehrhardt <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
---
arch/s390/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 9015060..d5746d8 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -122,6 +122,7 @@ config S390
select ARCH_INLINE_WRITE_UNLOCK_BH
select ARCH_INLINE_WRITE_UNLOCK_IRQ
select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE
+ select SCHED_FAIR_SLEEPERS

config SCHED_OMIT_FRAME_POINTER
def_bool y
--
1.7.10.2

2012-05-21 15:46:20

by Martin Schwidefsky

Subject: [PATCH 1/2] sched: readd FAIR_SLEEPERS feature

git commit 5ca9880c6f4ba4c8 "sched: Remove FAIR_SLEEPERS features" removed
the ability to disable sleeper fairness. The benefit is a saved branch
on desktop systems where preemption is important and fair sleepers are
always enabled. But the control is important for server systems where
disabling sleeper fairness has a performance benefit.

Readd the fair sleepers control, but add a compile time option to be able to
disable the control again. The default is no control; if an architecture
wants the control, CONFIG_SCHED_FAIR_SLEEPERS needs to be selected.

Reported-by: Christian Ehrhardt <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
---
init/Kconfig | 3 +++
kernel/sched/fair.c | 14 +++++++++++++-
kernel/sched/features.h | 9 +++++++++
3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/init/Kconfig b/init/Kconfig
index 6cfd71d..ddfd2c2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -865,6 +865,9 @@ config SCHED_AUTOGROUP
desktop applications. Task group autogeneration is currently based
upon task session.

+config SCHED_FAIR_SLEEPERS
+ bool
+
config MM_OWNER
bool

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e955364..a791a9d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1046,6 +1046,18 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
#endif
}

+#ifdef CONFIG_SCHED_FAIR_SLEEPERS
+static inline int is_fair_sleeper(int initial)
+{
+ return !initial && sched_feat(FAIR_SLEEPERS);
+}
+#else /* CONFIG_SCHED_FAIR_SLEEPERS */
+static inline int is_fair_sleeper(int initial)
+{
+ return !initial;
+}
+#endif /* CONFIG_SCHED_FAIR_SLEEPERS */
+
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
@@ -1061,7 +1073,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
vruntime += sched_vslice(cfs_rq, se);

/* sleeps up to a single latency don't count. */
- if (!initial) {
+ if (is_fair_sleeper(initial)) {
unsigned long thresh = sysctl_sched_latency;

/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index de00a48..e72dc7a 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -1,3 +1,12 @@
+#ifdef CONFIG_SCHED_FAIR_SLEEPERS
+/*
+ * Disregards a certain amount of sleep time (sched_latency_ns) and
+ * considers the task to be running during that period. This gives it
+ * a service deficit on wakeup, allowing it to run sooner.
+ */
+SCHED_FEAT(FAIR_SLEEPERS, false)
+#endif
+
/*
* Only give sleepers 50% of their service deficit. This allows
* them to run sooner, but does not allow tons of sleepers to
--
1.7.10.2

2012-05-21 18:17:25

by Mike Galbraith

Subject: Re: [PATCH 0/2] RFC: readd fair sleepers for server systems

On Mon, 2012-05-21 at 17:45 +0200, Martin Schwidefsky wrote:
> our performance team found a performance degradation with a recent
> distribution update in regard to fair sleepers (or the lack of fair
> sleepers). On s390 we used to run with fair sleepers disabled.
>
> We see the performance degradation with our network benchmark and fair
> sleepers enabled, the largest hit is on virtual connections:

I can see you wanting the feature back. You guys apparently do not
generally run mixed loads on your boxen, else you wouldn't want to turn
the scheduler into a tick granularity scheduler, but why compile time?
If the fast path branch isn't important, and given it only became
important while I was trying to scrape a few cycles together, why not
just restore the feature as it used to exist under the pretext that you
need it, and others may as well, so we eat the branch in the interest of
general flexibility, and call removal a booboo?

-Mike

2012-05-22 07:11:55

by Christian Ehrhardt

Subject: Re: [PATCH 0/2] RFC: readd fair sleepers for server systems



On 05/21/2012 08:17 PM, Mike Galbraith wrote:
> On Mon, 2012-05-21 at 17:45 +0200, Martin Schwidefsky wrote:
>> our performance team found a performance degradation with a recent
>> distribution update in regard to fair sleepers (or the lack of fair
>> sleepers). On s390 we used to run with fair sleepers disabled.
>>
>> We see the performance degradation with our network benchmark and fair
>> sleepers enabled, the largest hit is on virtual connections:
>
> I can see you wanting the feature back. You guys apparently do not
> generally run mixed loads on your boxen, else you wouldn't want to turn
> the scheduler into a tick granularity scheduler, but why compile time?
> If the fast path branch isn't important, and given it only became
> important while I was trying to scrape a few cycles together, why not
> just restore the feature as it used to exist under the pretext that you
> need it, and others may as well, so we eat the branch in the interest of
> general flexibility, and call removal a booboo?
>
> -Mike
>

If "eating the branches" is fine for everyone, s390 can surely live with
it. The intention of making it configurable was to allow systems that
really never want it to still be able to avoid the branch.

That way everyone can configure it the way they want, and we avoid
modifying the same code over and over again.


--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

2012-05-22 07:12:32

by Christian Ehrhardt

Subject: Re: [PATCH 1/2] sched: readd FAIR_SLEEPERS feature



On 05/21/2012 05:45 PM, Martin Schwidefsky wrote:
[...]
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -1,3 +1,12 @@
> +#ifdef CONFIG_SCHED_FAIR_SLEEPERS
> +/*
> + * Disregards a certain amount of sleep time (sched_latency_ns) and
> + * considers the task to be running during that period. This gives it
> + * a service deficit on wakeup, allowing it to run sooner.
> + */
> +SCHED_FEAT(FAIR_SLEEPERS, false)
> +#endif
> +
> /*
> * Only give sleepers 50% of their service deficit. This allows
> * them to run sooner, but does not allow tons of sleepers to

This would be right for s390, but a change for every other architecture.
As far as I know, s390 had custom patches in every distribution supported
on s390 to set the default to false (like in your patch), but the
upstream default for every other architecture was true.

I think the patch could look like this to make all happy:
...
+#ifndef CONFIG_S390
+SCHED_FEAT(FAIR_SLEEPERS, true)
+#else
+SCHED_FEAT(FAIR_SLEEPERS, false)
+#endif
...


--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

2012-05-22 08:53:50

by Mike Galbraith

Subject: Re: [PATCH 0/2] RFC: readd fair sleepers for server systems

On Tue, 2012-05-22 at 09:11 +0200, Christian Ehrhardt wrote:
>
> On 05/21/2012 08:17 PM, Mike Galbraith wrote:
> > On Mon, 2012-05-21 at 17:45 +0200, Martin Schwidefsky wrote:
> >> our performance team found a performance degradation with a recent
> >> distribution update in regard to fair sleepers (or the lack of fair
> >> sleepers). On s390 we used to run with fair sleepers disabled.
> >>
> >> We see the performance degradation with our network benchmark and fair
> >> sleepers enabled, the largest hit is on virtual connections:
> >
> > I can see you wanting the feature back. You guys apparently do not
> > generally run mixed loads on your boxen, else you wouldn't want to turn
> > the scheduler into a tick granularity scheduler, but why compile time?
> > If the fast path branch isn't important, and given it only became
> > important while I was trying to scrape a few cycles together, why not
> > just restore the feature as it used to exist under the pretext that you
> > need it, and others may as well, so we eat the branch in the interest of
> > general flexibility, and call removal a booboo?
> >
> > -Mike
> >
>
> If "eating the branches" is fine for everyone, s390 can surely live with
> it. The intention of making it configurable was to allow systems that
> really never want it to still be able to avoid the branch.
>
> That way everyone can configure it the way they want, and we avoid
> modifying the same code over and over again.

Ok. Features have become cheaper, but we can still use every cycle we
can get our grubby mitts on.

-Mike

2012-05-22 09:01:19

by Peter Zijlstra

Subject: Re: [PATCH 0/2] RFC: readd fair sleepers for server systems

On Mon, 2012-05-21 at 17:45 +0200, Martin Schwidefsky wrote:
> our performance team found a performance degradation with a recent
> distribution update in regard to fair sleepers (or the lack of fair
> sleepers). On s390 we used to run with fair sleepers disabled.

This change was made a very long time ago.. tell your people to mind
what upstream does if they want us to mind them.

Also, reports like this make me want to make /debug/sched_features a
patch in tip/out-of-tree so that it's never available outside
development.

> We see the performance degradation with our network benchmark and fair
> sleepers enabled, the largest hit is on virtual connections:
>
> VM guest Hipersockets
> Throughput degrades up to 18%
> CPU load/cost increase up to 17%
> VM stream
> Throughput degrades up to 15%
> CPU load/cost increase up to 22%
> LPAR Hipersockets
> Throughput degrades up to 27%
> CPU load/cost increase up to 20%

Why is this, is this some weird interaction with your hypervisor?

> In short, we want the fair sleepers tunable back. I understand that on
> x86 we want to avoid the cost of a branch on the hot path in place_entity,
> therefore add a compile time config option for the fair sleeper control.

I'm very much not liking this... this makes s390 schedule completely
differently from all the other architectures.

2012-05-22 09:06:17

by Mike Galbraith

Subject: Re: [PATCH 1/2] sched: readd FAIR_SLEEPERS feature

On Tue, 2012-05-22 at 09:11 +0200, Christian Ehrhardt wrote:
>
> On 05/21/2012 05:45 PM, Martin Schwidefsky wrote:
> [...]
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -1,3 +1,12 @@
> > +#ifdef CONFIG_SCHED_FAIR_SLEEPERS
> > +/*
> > + * Disregards a certain amount of sleep time (sched_latency_ns) and
> > + * considers the task to be running during that period. This gives it
> > + * a service deficit on wakeup, allowing it to run sooner.
> > + */
> > +SCHED_FEAT(FAIR_SLEEPERS, false)
> > +#endif
> > +
> > /*
> > * Only give sleepers 50% of their service deficit. This allows
> > * them to run sooner, but does not allow tons of sleepers to
>
> This would be right for s390, but a change for every other architecture.
> As far as I know, s390 had custom patches in every distribution supported
> on s390 to set the default to false (like in your patch), but the
> upstream default for every other architecture was true.
>
> I think the patch could look like this to make all happy:
> ...
> +#ifndef CONFIG_S390
> +SCHED_FEAT(FAIR_SLEEPERS, true)
> +#else
> +SCHED_FEAT(FAIR_SLEEPERS, false)
> +#endif
> ...

Um, yeah, default off is a non-starter.. until someone comes up with a
less annoying preemption model that is.

-Mike

2012-05-23 11:32:58

by Christian Ehrhardt

Subject: Re: [PATCH 0/2] RFC: readd fair sleepers for server systems


On 05/22/2012 11:01 AM, Peter Zijlstra wrote:
> On Mon, 2012-05-21 at 17:45 +0200, Martin Schwidefsky wrote:
>> our performance team found a performance degradation with a recent
>> distribution update in regard to fair sleepers (or the lack of fair
>> sleepers). On s390 we used to run with fair sleepers disabled.
>
> This change was made a very long time ago.. tell your people to mind
> what upstream does if they want us to mind them.

You're completely right, but we do mind - and we have reasons for that.
I'll try to explain why.
Upstream often has so many changes that some effects end up hidden
behind each other. A lot of issues are detected and fixed there, but
due to restricted resources not all of them. Also, every newly developed
feature goes through testing. Distribution releases are the third stage
of testing before something reaches a customer.

The analysis of the features in general, and fair sleepers among them,
was started by a teammate long ago. More precisely, a bit before the
time we both agreed about the related
http://comments.gmane.org/gmane.linux.kernel/920457, so I'd say in time.
It has recurred with every distribution release since then, but so far
without a real fix.

Then in early 2010 the removal of the fair sleepers tunable took place,
but - and here I admit our fault - this didn't increase the pressure for
a long time, as both major distributions were at 2.6.32 back then and
stayed there for a long time.

Eventually we also had to carry a revert of that patch in both major
distributions, for the last few service updates that backported it. All
that in the hope of finally identifying and fixing the issue so that the
revert is no longer needed upstream.
All of this causes a lot of discussion every distribution release.

I hope all that puts your feeling of "a long time" into perspective.

But currently a fix that would let us live with fair sleepers (without
being able to turn them off in case it is needed) seems out of reach.


> Also, reports like this make me want to make /debug/sched_features a
> patch in tip/out-of-tree so that it's never available outside
> development.

Sorry if we offended you in any way, that was not our intention at all.
But I guess keeping the tunables available is the only way to properly
test them across the myriad hardware/workload combinations out there -
and by far not all things can be reliably tested out-of-tree.

>> We see the performance degradation with our network benchmark and fair
>> sleepers enabled, the largest hit is on virtual connections:
>>
>> VM guest Hipersockets
>> Throughput degrades up to 18%
>> CPU load/cost increase up to 17%
>> VM stream
>> Throughput degrades up to 15%
>> CPU load/cost increase up to 22%
>> LPAR Hipersockets
>> Throughput degrades up to 27%
>> CPU load/cost increase up to 20%
>
> Why is this, is this some weird interaction with your hypervisor?

It is not completely analyzed, as soon as debugging goes out of Linux it
can be kind of complex even internally.

On top of these network degradations we also have issues with database
latencies, even when not using virtual network connections. But for
these I didn't have such summarized numbers at hand when I searched for
workload data. As a rule of thumb, the worst-case latency can grow by up
to 3x when fair sleepers are on. It feels a bit like the old throughput
vs. worst-case latency trade-off; eventually people might want to decide
between the two on their own.


>> In short, we want the fair sleepers tunable back. I understand that on
>> x86 we want to avoid the cost of a branch on the hot path in place_entity,
>> therefore add a compile time config option for the fair sleeper control.
>
> I'm very much not liking this... this makes s390 schedule completely
> differently from all the other architectures.

I don't even "like" it myself - if I could make a wish, I would want the
50% gentle sleepers to work fine, but unfortunately they don't.
Like it or not, for the moment this is the only way we can avoid several
severe degradations. And I'm not even sure whether some others just
haven't noticed yet, or have refused to ask loudly enough for it.

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

2012-05-23 11:50:14

by Peter Zijlstra

Subject: Re: [PATCH 0/2] RFC: readd fair sleepers for server systems

On Wed, 2012-05-23 at 13:32 +0200, Christian Ehrhardt wrote:
> > Why is this, is this some weird interaction with your hypervisor?
>
> It is not completely analyzed, as soon as debugging goes out of Linux it
> can be kind of complex even internally.

Is there significant steal time in these workloads? If so, does it help
if you implement
CONFIG_PARAVIRT_TIME_ACCOUNTING/paravirt_steal_rq_enabled for s390?
(although I guess we'd better lose the paravirt part of the name then).

This 'feature' subtracts steal time from the task-clock so that the
scheduler doesn't consider a task to be running when the vcpu wasn't
running as well.

Not doing that (current situation) could result in over-active
preemption because we think a task ran significantly longer than it
actually did. Same for sleeper fairness, we might think a task slept
very long (and give a bigger boost) when in fact it didn't.


2012-05-23 15:30:22

by Christian Ehrhardt

Subject: Re: [PATCH 0/2] RFC: readd fair sleepers for server systems



On 05/23/2012 01:49 PM, Peter Zijlstra wrote:
> On Wed, 2012-05-23 at 13:32 +0200, Christian Ehrhardt wrote:
>>> Why is this, is this some weird interaction with your hypervisor?
>>
>> It is not completely analyzed, as soon as debugging goes out of Linux it
>> can be kind of complex even internally.
>
> Is there significant steal time in these workloads? If so, does it help
> if you implement
> CONFIG_PARAVIRT_TIME_ACCOUNTING/paravirt_steal_rq_enabled for s390?
> (although I guess we'd better lose the paravirt part of the name then).

Interesting - yes, there is enough steal time; not in all, but in most
of the cases we have seen in conflict with fair sleepers so far.
We don't have any code for CONFIG_PARAVIRT and its children yet, so I need
to look further into it.

> This 'feature' subtracts steal time from the task-clock so that the
> scheduler doesn't consider a task to be running when the vcpu wasn't
> running as well.
>
> Not doing that (current situation) could result in over-active
> preemption because we think a task ran significantly longer than it
> actually did. Same for sleeper fairness, we might think a task slept
> very long (and give a bigger boost) when in fact it didn't.

Great - sounds like a good thing to check, I'll definitely try this out.
This week we are changing our automation environment, so give me a few
days for numbers on that.

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

2012-05-23 15:44:14

by Peter Zijlstra

Subject: Re: [PATCH 0/2] RFC: readd fair sleepers for server systems

On Wed, 2012-05-23 at 17:28 +0200, Christian Ehrhardt wrote:
> We don't have any code for CONFIG_PARAVIRT and its children yet, so I need
> to look further into it.

Yeah, screw CONFIG_PARAVIRT :-) It's just that the code that deals with
high-res steal time is only used by them. So ideally you'd extract the
relevant bits from under CONFIG_PARAVIRT and use them.

So it's that one block in update_rq_clock_task() (or both, if you also
have high-res irq accounting), and then you need to provide
paravirt_steal_clock(int cpu), which returns the steal time of that cpu
in nanosecond granularity. Which I'm assuming s390 has available
someplace.

Plus some renames to remove the paravirt_ part of the names.