Hi,
After investigating the claims of poor MySQL performance on Linux,
I had a look at postgresql with sysbench. This DBMS is even faster
than MySQL and scaled to higher connection counts with my (as close
as possible) out-of-the-box compile and config. The nice thing
about PostgreSQL is that it has no noticeable contention on
user-space locks in this workload.
The problem with MySQL's contention is that if the scheduler
unluckily chooses to deschedule a lock holder, then you can get
idle time building up on other cores and you can get context switch
cascades as things all pile up onto this heavily contended lock. As
such, MySQL is not an ideal candidate for looking at
performance behaviour. I discounted the relatively worse scaling of
MySQL on 2.6.25-rc (versus 2.6.22) as such an effect.
PostgreSQL is different. It has zero idle time when running this
workload. It actually scaled "super linearly" on my system here,
from single threaded performance to 8 cores (giving an 8.2x
performance increase)!
So PostgreSQL's performance profile is actually much more interesting.
To my dismay, I found that Linux 2.6.25-rc5 performs really badly
after saturating the runqueues and subsequently increasing threads.
2.6.22 drops a little bit, but basically settles near the peak
performance. With 2.6.25-rc5, throughput seems to be falling off
linearly with the number of threads.
Actually, this performance profile sort of matches the MySQL curve,
so while I thought the MySQL numbers might be invalid, they appear
to back up the pgsql numbers.
This was with postgresql 8.3; config and kernel config available on
request. Looks very much like a CPU scheduler issue. Please take a
look.
postgres.png contains 2.6.22 vs 2.6.25-rc5. compare.png contains
both of those plus MySQL on 2.6.22 vs 2.6.25-rc5.
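(For reference, a sketch of the sysbench invocation used to sweep the
thread counts - the pgsql driver name and the loop are a reconstruction,
so adjust them for your sysbench version:)

# after an initial "prepare" step to create the test table:
for T in 1 2 4 8 16 32 64 128 256; do
    sysbench --test=oltp --db-driver=pgsql --oltp-read-only=on \
             --max-time=60 --max-requests=0 --num-threads=$T run
done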
Thanks,
Nick
* Nick Piggin <[email protected]> wrote:
> PostgreSQL is different. It has zero idle time when running this
> workload. It actually scaled "super linearly" on my system here, from
> single threaded performance to 8 cores (giving an 8.2x performance
> increase)!
>
> So PostgreSQL's performance profile is actually much more interesting.
> To my dismay, I found that Linux 2.6.25-rc5 performs really badly
> after saturating the runqueues and subsequently increasing threads.
> 2.6.22 drops a little bit, but basically settles near the peak
> performance. With 2.6.25-rc5, throughput seems to be falling off
> linearly with the number of threads.
thanks Nick, i'll check this - and i agree that this very much looks
like a scheduler regression. Just a quick suggestion, does a simple
runtime tune like this fix the workload:
for N in /proc/sys/kernel/sched_domain/*/*/flags; do
echo $[`cat $N`|16] > $N
done
this sets SD_WAKE_IDLE for all the nodes in the scheduler domains tree.
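(If you want to be able to restore the current settings afterwards, a
trivial sketch first - each flags file holds a single decimal value:)

# emit one restore line per domain; run the saved file later to undo
for N in /proc/sys/kernel/sched_domain/*/*/flags; do
    echo "echo `cat $N` > $N"
done > /tmp/restore-sd-flags.sh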
(doing this results in over-aggressive idle balancing - but if this fixes
your testcase it shows that we were balancing under-aggressively for this
workload.) Thanks,
Ingo
On Tuesday 11 March 2008 18:58, Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
> > PostgreSQL is different. It has zero idle time when running this
> > workload. It actually scaled "super linearly" on my system here, from
> > single threaded performance to 8 cores (giving an 8.2x performance
> > increase)!
> >
> > So PostgreSQL's performance profile is actually much more interesting.
> > To my dismay, I found that Linux 2.6.25-rc5 performs really badly
> > after saturating the runqueues and subsequently increasing threads.
> > 2.6.22 drops a little bit, but basically settles near the peak
> > performance. With 2.6.25-rc5, throughput seems to be falling off
> > linearly with the number of threads.
>
> thanks Nick, i'll check this
Thanks.
> - and i agree that this very much looks
> like a scheduler regression.
I'd say it is. Quite a nasty one too: if your server gets nudged over
the edge of the cliff, it goes into a feedback loop and goes splat at
the bottom somewhere ;)
> Just a quick suggestion, does a simple
> runtime tune like this fix the workload:
>
> for N in /proc/sys/kernel/sched_domain/*/*/flags; do
> echo $[`cat $N`|16] > $N
> done
>
> this sets SD_WAKE_IDLE for all the nodes in the scheduler domains tree.
> (doing this results in over-aggressive idle balancing - but if this fixes
> your testcase it shows that we were balancing under-aggressively for this
> workload.) Thanks,
It doesn't change anything.
There is no idle time for this workload, btw.
* Nick Piggin <[email protected]> wrote:
> > Just a quick suggestion, does a simple runtime tune like this fix
> > the workload:
> >
> > for N in /proc/sys/kernel/sched_domain/*/*/flags; do
> > echo $[`cat $N`|16] > $N
> > done
> >
> > this sets SD_WAKE_IDLE for all the nodes in the scheduler domains
> > tree. (doing this results in over-aggressive idle balancing - but if
> > this fixes your testcase it shows that we were balancing
> > under-aggressively for this workload.) Thanks,
>
> It doesn't change anything.
>
> There is no idle time for this workload, btw.
oh, i thought you said that. Could you try to turn SD_WAKE_AFFINE
all-off and all-on perhaps, via the same scriptlet?
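something like this, assuming SD_WAKE_AFFINE is the 32 value in your
kernel's sched.h (the bit right above SD_WAKE_IDLE):

# all-off: clear the (assumed) SD_WAKE_AFFINE bit on every domain
for N in /proc/sys/kernel/sched_domain/*/*/flags; do
  echo $[`cat $N` & ~32] > $N
done

# all-on: set it on every domain
for N in /proc/sys/kernel/sched_domain/*/*/flags; do
  echo $[`cat $N` | 32] > $N
done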
Ingo
On Tue, 2008-03-11 at 17:49 +1100, Nick Piggin wrote:
> So PostgreSQL's performance profile is actually much more interesting.
> To my dismay, I found that Linux 2.6.25-rc5 performs really badly
> after saturating the runqueues and subsequently increasing threads.
> 2.6.22 drops a little bit, but basically settles near the peak
> performance. With 2.6.25-rc5, throughput seems to be falling off
> linearly with the number of threads.
>
The FreeBSD folks have a whole host of benchmark results (MySQL,
PostgreSQL, BIND, NSD, ebizzy, SPECjbb, etc.) located at
http://people.freebsd.org/~kris/scaling/ that demonstrate that the
2.6.23+ scheduler is worse than the 2.6.22 scheduler and both are worse
than FreeBSD 7.
The interesting thing is that they've been running these tests
constantly for years now to demonstrate that their new scheduler hasn't
regressed compared to their old scheduler and as a benchmark against the
competition (i.e. Linux).
Does anybody even do this at all for Linux?
(Also, ignoring MySQL because it's a terrible piece of software, at least
as regards its scalability, is a bad idea. It's the M in LAMP, it
has a huge user base, and FreeBSD manages to outperform Linux with the
same unscalable piece of software.)
--
Nicholas Miell <[email protected]>
Nicholas Miell wrote:
> On Tue, 2008-03-11 at 17:49 +1100, Nick Piggin wrote:
>
>> So PostgreSQL's performance profile is actually much more interesting.
>> To my dismay, I found that Linux 2.6.25-rc5 performs really badly
>> after saturating the runqueues and subsequently increasing threads.
>> 2.6.22 drops a little bit, but basically settles near the peak
>> performance. With 2.6.25-rc5, throughput seems to be falling off
>> linearly with the number of threads.
>>
>
> The FreeBSD folks have a whole host of benchmark results (MySQL,
> PostgreSQL, BIND, NSD, ebizzy, SPECjbb, etc.) located at
> http://people.freebsd.org/~kris/scaling/ that demonstrate that the
> 2.6.23+ scheduler is worse than the 2.6.22 scheduler and both are worse
> than FreeBSD 7.
>
> The interesting thing is that they've been running these tests
> constantly for years now to demonstrate that their new scheduler hasn't
> regressed compared to their old scheduler and as a benchmark against the
> competition (i.e. Linux).
>
> Does anybody even do this at all for Linux?
>
> (Also, ignoring MySQL because it's a terrible piece of software, at least
> as regards its scalability, is a bad idea. It's the M in LAMP, it
> has a huge user base, and FreeBSD manages to outperform Linux with the
> same unscalable piece of software.)
Did you actually see this?
http://www.kernel.org/pub/linux/kernel/people/npiggin/sysbench/
FreeBSD does not outperform Linux; Linux is actually a bit faster
according to Nick's tests.
I cannot comment on BIND and NSD, but SPECjbb looks pretty close and the
bad ebizzy performance seems to be an issue with glibc's memory allocator.
greetings
Cyrus
On Tue, 2008-03-11 at 22:34 +0100, Cyrus Massoumi wrote:
> Nicholas Miell wrote:
>
> > (Also, ignoring MySQL because it's a terrible piece of software, at least
> > as regards its scalability, is a bad idea. It's the M in LAMP, it
> > has a huge user base, and FreeBSD manages to outperform Linux with the
> > same unscalable piece of software.)
>
> Did you actually see this?
> http://www.kernel.org/pub/linux/kernel/people/npiggin/sysbench/
>
> FreeBSD does not outperform Linux; Linux is actually a bit faster
> according to Nick's tests.
I am aware of those results, but in the mail I was responding to, Nick
Piggin said the following:
> The problem with MySQL's contention is that if the scheduler
> unluckily chooses to deschedule a lock holder, then you can get
> idle time building up on other cores and you can get context switch
> cascades as things all pile up onto this heavily contended lock. As
> such, MySQL is not an ideal candidate for looking at
> performance behaviour. I discounted the relatively worse scaling of
> MySQL on 2.6.25-rc (versus 2.6.22) as such an effect.
which I interpreted to mean that MySQL performs worse on 2.6.23+ than on
2.6.22 but for some reason this doesn't matter.
--
Nicholas Miell <[email protected]>
On Wednesday 12 March 2008 10:12, Nicholas Miell wrote:
> On Tue, 2008-03-11 at 22:34 +0100, Cyrus Massoumi wrote:
> I am aware of those results, but in the mail I was responding to, Nick
> Piggin said the following:
> > The problem with MySQL's contention is that if the scheduler
> > unluckily chooses to deschedule a lock holder, then you can get
> > idle time building up on other cores and you can get context switch
> > cascades as things all pile up onto this heavily contended lock. As
> > such, MySQL is not an ideal candidate for looking at
> > performance behaviour. I discounted the relatively worse scaling of
> > MySQL on 2.6.25-rc (versus 2.6.22) as such an effect.
>
> which I interpreted to mean that MySQL performs worse on 2.6.23+ than on
> 2.6.22 but for some reason this doesn't matter.
I didn't try 2.6.23, which I think has bigger problems due to the new
CFS code. 2.6.25-rc fixes that, but yes, in general I think the MySQL
performance profile is worse on later kernels: while it has a
very slightly higher peak performance, it drops off in performance
more quickly after that peak. I initially wasn't too worried about
it, but seeing as postgresql has a similar problem, I've made the
scheduler developers aware of it.
On Tue, 11 Mar 2008 16:12:12 -0700
Nicholas Miell <[email protected]> wrote:
>
> On Tue, 2008-03-11 at 22:34 +0100, Cyrus Massoumi wrote:
> > Nicholas Miell wrote:
> >
> > > (Also, ignoring MySQL because it's a terrible piece of software, at least
> > > as regards its scalability, is a bad idea. It's the M in LAMP, it
> > > has a huge user base, and FreeBSD manages to outperform Linux with the
> > > same unscalable piece of software.)
> >
> > Did you actually see this?
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/sysbench/
> >
> > FreeBSD does not outperform Linux; Linux is actually a bit faster
> > according to Nick's tests.
>
> I am aware of those results, but in the mail I was responding to, Nick
> Piggin said the following:
>
> > The problem with MySQL's contention is that if the scheduler
> > unluckily chooses to deschedule a lock holder, then you can get
> > idle time building up on other cores and you can get context switch
> > cascades as things all pile up onto this heavily contended lock. As
> > such, MySQL is not an ideal candidate for looking at
> > performance behaviour. I discounted the relatively worse scaling of
> > MySQL on 2.6.25-rc (versus 2.6.22) as such an effect.
>
> which I interpreted to mean that MySQL performs worse on 2.6.23+ than on
> 2.6.22 but for some reason this doesn't matter.
>
How many of these problems are due to poorly implemented userlevel
spinlocks? Do the database spinlocks map to futexes?
(Back onto lkml)
On Tuesday 11 March 2008 23:02, Ingo Molnar wrote:
> another thing to try would be to increase:
>
> /proc/sys/kernel/sched_migration_cost
>
> from its 500 usecs default to a few msecs ?
This doesn't really help either (at 10ms).
(For the record, I've tried turning SD_WAKE_IDLE, SD_WAKE_AFFINE
on and off for each domain and that hasn't helped either).
I've also tried increasing sched_latency_ns as far as it can go.
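(Concretely, the tunes tried were along these lines - both values are in
nanoseconds, and the latency figure is just an illustrative large one:)

echo 10000000 > /proc/sys/kernel/sched_migration_cost  # 10ms, up from 500us
echo 1000000000 > /proc/sys/kernel/sched_latency_ns    # e.g. 1s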
BTW, this is pretty nasty behaviour if you ask me. It
starts *increasing* the number of involuntary context switches
as resources get oversubscribed. That's completely unintuitive as
far as I can see -- when we get overloaded, the obvious thing to
do is try to increase efficiency, or at least try as hard as
possible not to lose it. So context switches should be steady or
decreasing as I add more processes to a runqueue.
It seems to max out at nearly 100 context switches per second, which
has actually been shown to be too frequent for modern CPUs with big
caches.
Increasing the tunable didn't help for this workload, but it really
needs to be fixed so it doesn't decrease timeslices as the number
of processes increases.
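(For anyone reproducing this, the switch rate is easy to watch - the cs
column of vmstat, or the per-task counters on kernels that export them:)

vmstat 1                                 # "cs" column: context switches/sec
grep ctxt_switches /proc/<pid>/status    # (non)voluntary counts per task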
On Wed, 2008-03-12 at 12:21 +1100, Nick Piggin wrote:
> (Back onto lkml)
>
> On Tuesday 11 March 2008 23:02, Ingo Molnar wrote:
> > another thing to try would be to increase:
> >
> > /proc/sys/kernel/sched_migration_cost
> >
> > from its 500 usecs default to a few msecs ?
>
> This doesn't really help either (at 10ms).
>
> (For the record, I've tried turning SD_WAKE_IDLE, SD_WAKE_AFFINE
> on and off for each domain and that hasn't helped either).
>
> I've also tried increasing sched_latency_ns as far as it can go.
> BTW, this is pretty nasty behaviour if you ask me. It
> starts *increasing* the number of involuntary context switches
> as resources get oversubscribed. That's completely unintuitive as
> far as I can see -- when we get overloaded, the obvious thing to
> do is try to increase efficiency, or at least try as hard as
> possible not to lose it. So context switches should be steady or
> decreasing as I add more processes to a runqueue.
>
> It seems to max out at nearly 100 context switches per second, which
> has actually been shown to be too frequent for modern CPUs with big
> caches.
>
> Increasing the tunable didn't help for this workload, but it really
> needs to be fixed so it doesn't decrease timeslices as the number
> of processes increases.
/proc/sys/kernel/sched_min_granularity_ns
/proc/sys/kernel/sched_latency_ns
period := max(latency, nr_running * min_granularity)
slice := period * w_{i} / W
W := \Sum_{i} w_{i}
So if you want to increase the slice length for loaded systems, up
min_granularity.
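To make that concrete, a sketch of the arithmetic for equal-weight tasks
(the 20ms/4ms defaults here are illustrative; they vary by kernel version
and CPU count):

# latency = 20ms, min_granularity = 4ms:
#   nr_running = 2:   period = max(20ms, 2*4ms)   = 20ms  -> slice = 10ms
#   nr_running = 100: period = max(20ms, 100*4ms) = 400ms -> slice = 4ms
# raising min_granularity to 10ms restores 10ms slices under that load:
echo 10000000 > /proc/sys/kernel/sched_min_granularity_ns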
On Tue, 2008-03-11 at 16:12 -0700, Nicholas Miell wrote:
> On Tue, 2008-03-11 at 22:34 +0100, Cyrus Massoumi wrote:
> > Nicholas Miell wrote:
> >
> > > (Also, ignoring MySQL because it's a terrible piece of software, at least
> > > as regards its scalability, is a bad idea. It's the M in LAMP, it
> > > has a huge user base, and FreeBSD manages to outperform Linux with the
> > > same unscalable piece of software.)
> >
> > Did you actually see this?
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/sysbench/
> >
> > FreeBSD does not outperform Linux; Linux is actually a bit faster
> > according to Nick's tests.
>
> I am aware of those results, but in the mail I was responding to, Nick
> Piggin said the following:
>
> > The problem with MySQL's contention is that if the scheduler
> > unluckily chooses to deschedule a lock holder, then you can get
> > idle time building up on other cores and you can get context switch
> > cascades as things all pile up onto this heavily contended lock. As
> > such, MySQL is not an ideal candidate for looking at
> > performance behaviour. I discounted the relatively worse scaling of
> > MySQL on 2.6.25-rc (versus 2.6.22) as such an effect.
>
> which I interpreted to mean that MySQL performs worse on 2.6.23+ than on
> 2.6.22 but for some reason this doesn't matter.
That's because 2.6.23-rc lacks the scheduler's cache-hot feature. Ingo
fixed it in 2.6.24-rc.
-yanmin
Cyrus Massoumi wrote:
> FreeBSD does not outperform Linux; Linux is actually a bit faster
> according to Nick's tests.
>
> I cannot comment on BIND and NSD, but SPECjbb looks pretty close and the
> bad ebizzy performance seems to be an issue with glibc's memory allocator.
The DNS test was a request from the BIND team:
*BIND 9 performance while serving large zones under update*
<http://new.isc.org/proj/dnsperf/ISC-TN-2008-1.html>
More FreeBSD vs. Linux tests at http://jeffr-tech.livejournal.com/
-thanks-
regards,
--
so much to do, so little time.
On Wednesday 12 March 2008 18:58, Peter Zijlstra wrote:
> On Wed, 2008-03-12 at 12:21 +1100, Nick Piggin wrote:
> > (Back onto lkml)
> >
> > On Tuesday 11 March 2008 23:02, Ingo Molnar wrote:
> > > another thing to try would be to increase:
> > >
> > > /proc/sys/kernel/sched_migration_cost
> > >
> > > from its 500 usecs default to a few msecs ?
> >
> > This doesn't really help either (at 10ms).
> >
> > (For the record, I've tried turning SD_WAKE_IDLE, SD_WAKE_AFFINE
> > on and off for each domain and that hasn't helped either).
> >
> > I've also tried increasing sched_latency_ns as far as it can go.
> > BTW, this is pretty nasty behaviour if you ask me. It
> > starts *increasing* the number of involuntary context switches
> > as resources get oversubscribed. That's completely unintuitive as
> > far as I can see -- when we get overloaded, the obvious thing to
> > do is try to increase efficiency, or at least try as hard as
> > possible not to lose it. So context switches should be steady or
> > decreasing as I add more processes to a runqueue.
> >
> > It seems to max out at nearly 100 context switches per second, which
> > has actually been shown to be too frequent for modern CPUs with big
> > caches.
> >
> > Increasing the tunable didn't help for this workload, but it really
> > needs to be fixed so it doesn't decrease timeslices as the number
> > of processes increases.
>
> /proc/sys/kernel/sched_min_granularity_ns
> /proc/sys/kernel/sched_latency_ns
>
> period := max(latency, nr_running * min_granularity)
> slice := period * w_{i} / W
> W := \Sum_{i} w_{i}
>
> So if you want to increase the slice length for loaded systems, up
> min_granularity.
OK, but the very concept of reducing efficiency when load increases
is nasty, and leads to nasty feedback loops. It's just a very bad
behaviour to have out of the box, and as a general observation, 10ms
is too short a default timeslice IMO.
I don't see how it is really helpful for interactive processes either.
By definition, if they are not CPU bound, then they should be run
quite soon after waking up; if they are CPU bound, then reducing
efficiency by increasing context switches is effectively going to
increase their latency anyway.
Can this be changed by default, please?
On Sun, Mar 16, 2008 at 5:44 PM, Nick Piggin <[email protected]> wrote:
> I don't see how it is really helpful for interactive processes either.
> By definition, if they are not CPU bound, then they should be run
> quite soon after waking up; if they are CPU bound, then reducing
> efficiency by increasing context switches is effectively going to
> increase their latency anyway.
How? Are you saying that switching the granularity to, say, 25ms, will
*decrease* the latency of interactive tasks?
And the efficiency we're talking about reducing here is due to the
fact that tasks are hitting cold caches more times per second when the
granularity is smaller, correct? Or are you concerned by another
issue?
> Can this be changed by default, please?
Not without benchmarks of interactivity, please. There are far, far
more linux desktops than there are servers. People expect to have to
tune servers (I do, for the servers I maintain). People don't expect
to have to tune a desktop to make it run well.
On Sun, Mar 16, 2008 at 10:16:36PM -0700, Ray Lee wrote:
> On Sun, Mar 16, 2008 at 5:44 PM, Nick Piggin <[email protected]> wrote:
> > I don't see how it is really helpful for interactive processes either.
> > By definition, if they are not CPU bound, then they should be run
> > quite soon after waking up; if they are CPU bound, then reducing
> > efficiency by increasing context switches is effectively going to
> > increase their latency anyway.
>
> How? Are you saying that switching the granularity to, say, 25ms, will
> *decrease* the latency of interactive tasks?
>
> And the efficiency we're talking about reducing here is due to the
> fact that tasks are hitting cold caches more times per second when the
> granularity is smaller, correct? Or are you concerned by another
> issue?
>
> > Can this be changed by default, please?
>
> Not without benchmarks of interactivity, please. There are far, far
> more linux desktops than there are servers. People expect to have to
> tune servers (I do, for the servers I maintain). People don't expect
> to have to tune a desktop to make it run well.
Oh and even on servers, when your anti-virus proxy reaches a load of 800,
you're happy not to have too large a time-slice, so that you regularly
get a chance to type commands over SSH.
Large time-slices are needed only in HPC environments IMHO, where only
one task runs.
Willy
On Monday 17 March 2008 16:16, Ray Lee wrote:
> On Sun, Mar 16, 2008 at 5:44 PM, Nick Piggin <[email protected]> wrote:
> > I don't see how it is really helpful for interactive processes either.
> > By definition, if they are not CPU bound, then they should be run
> > quite soon after waking up; if they are CPU bound, then reducing
> > efficiency by increasing context switches is effectively going to
> > increase their latency anyway.
>
> How? Are you saying that switching the granularity to, say, 25ms, will
> *decrease* the latency of interactive tasks?
No. It shouldn't change them.
> And the efficiency we're talking about reducing here is due to the
> fact that tasks are hitting cold caches more times per second when the
> granularity is smaller, correct? Or are you concerned by another
> issue?
There are secondary issues, like the actual cost of the context switch
itself, but they are generally in the noise compared to cache and TLB
costs.
> > Can this be changed by default, please?
>
> Not without benchmarks of interactivity, please. There are far, far
> more linux desktops than there are servers. People expect to have to
> tune servers (I do, for the servers I maintain). People don't expect
> to have to tune a desktop to make it run well.
Linux desktops shouldn't run with massive loads anyway. Tuning the
scheduler to "work" well in an X session when you have a make -j100
in the background is retarded.
But sure, if the scheduler doesn't properly prioritize non-CPU bound
tasks versus CPU bound ones, then it should be fixed to do so.
On Monday 17 March 2008 16:21, Willy Tarreau wrote:
> On Sun, Mar 16, 2008 at 10:16:36PM -0700, Ray Lee wrote:
> > On Sun, Mar 16, 2008 at 5:44 PM, Nick Piggin <[email protected]>
> > > Can this be changed by default, please?
> >
> > Not without benchmarks of interactivity, please. There are far, far
> > more linux desktops than there are servers. People expect to have to
> > tune servers (I do, for the servers I maintain). People don't expect
> > to have to tune a desktop to make it run well.
>
> Oh and even on servers, when your anti-virus proxy reaches a load of 800,
> you're happy not to have too large a time-slice, so that you regularly
> get a chance to type commands over SSH.
Your ssh session should be allowed to run anyway. I don't see the difference.
If the runqueue length is 100 and the time-slice is (say) 10ms, then if your
ssh only needs an average of 5ms of CPU time per second, then it should be run
next when it becomes runnable. If it wants 20ms of CPU time per second, then
it has to wait for 2 seconds anyway to be run next, regardless of whether
the timeslice was 10ms or 20ms.
> Large time-slices are needed only in HPC environments IMHO, where only
> one task runs.
That's silly. By definition if there is only one task running, you don't
care what the timeslice is.
We actually did conduct some benchmarks, and a 10ms timeslice can start
hurting even things like kbuild quite a bit.
But anyway, I don't care what the time slice is so much (although it should
be higher -- if the scheduler can't get good interactive behaviour with a
20-30ms timeslice in most cases then it's no good IMO). I care mostly that
the timeslice does not decrease when load increases.
On Mon, Mar 17, 2008 at 06:19:38PM +1100, Nick Piggin wrote:
> On Monday 17 March 2008 16:21, Willy Tarreau wrote:
> > On Sun, Mar 16, 2008 at 10:16:36PM -0700, Ray Lee wrote:
> > > On Sun, Mar 16, 2008 at 5:44 PM, Nick Piggin <[email protected]>
> > > > Can this be changed by default, please?
> > >
> > > Not without benchmarks of interactivity, please. There are far, far
> > > more linux desktops than there are servers. People expect to have to
> > > tune servers (I do, for the servers I maintain). People don't expect
> > > to have to tune a desktop to make it run well.
> >
> > Oh and even on servers, when your anti-virus proxy reaches a load of 800,
> > you're happy not to have too large a time-slice, so that you regularly
> > get a chance to type commands over SSH.
>
> Your ssh session should be allowed to run anyway. I don't see the difference.
> If the runqueue length is 100 and the time-slice is (say) 10ms, then if your
> ssh only needs an average of 5ms of CPU time per second, then it should be run
> next when it becomes runnable. If it wants 20ms of CPU time per second, then
> it has to wait for 2 seconds anyway to be run next, regardless of whether
> the timeslice was 10ms or 20ms.
It's not about what *ssh* uses but about what *others* use. Except by
renicing SSH or marking it real-time, it has no way to say "give the
CPU to me right now, I have something very short to do". So it will
have to wait for the 100 other tasks to eat their 10ms, waiting 1 second
to consume 5ms of CPU (and I was speaking about 800 and not 100).
It is one of the situations where I prefer to shorten timeslices when
load increases because it will not slow down the service too much, but
will still provide better interactivity, which is also beneficial to the
service itself since there is no reason for the cycles usage to be the
same for all processes. So by having a finer granularity, small CPU eaters
will finish sooner.
> > Large time-slices are needed only in HPC environments IMHO, where only
> > one task runs.
>
> That's silly. By definition if there is only one task running, you don't
> care what the timeslice is.
I mean there's only one important task. There is always a bit of pollution
around it, and interrupting the tasks less often slightly reduces the
context-switch overhead.
> We actually did conduct some benchmarks, and a 10ms timeslice can start
> hurting even things like kbuild quite a bit.
>
> But anyway, I don't care what the time slice is so much (although it should
> be higher -- if the scheduler can't get good interactive behaviour with a
> 20-30ms timeslice in most cases then it's no good IMO). I care mostly that
> the timeslice does not decrease when load increases.
On the contrary, I think it's a fundamental requirement if you need to
maintain a reasonable interactivity, and a fair progress between all
tasks. I think it's obvious that the only way to maintain
a constant latency with a growing number of tasks is to reduce the time
each task may spend on the CPU. Contrary to other domains such as network,
you don't know how much time a task will spend on the CPU if you grant it
access, and there is no way to know, because only the work that this
task will perform will determine if it should run shorter or longer. Fair
scheduling in other areas such as network is "easier" because you know the
size of your packets so you know how much time they will take on the wire.
Here with tasks, the best you can do is estimate based on history, and
it will be rare that you can correctly guess and guarantee that the
latency is correct.
Maybe the timeslices should shrink only past a certain load though (I don't
know how it's done today).
Regards,
willy
On Monday 17 March 2008 19:26, Willy Tarreau wrote:
> On Mon, Mar 17, 2008 at 06:19:38PM +1100, Nick Piggin wrote:
> > Your ssh session should be allowed to run anyway. I don't see the
> > difference. If the runqueue length is 100 and the time-slice is (say)
> > 10ms, then if your ssh only needs an average of 5ms of CPU time per second,
> > then it should be run next when it becomes runnable. If it wants 20ms of
> > CPU time per second, then it has to wait for 2 seconds anyway to be run
> > next, regardless of whether the timeslice was 10ms or 20ms.
>
> It's not about what *ssh* uses but about what *others* use. Except by
> renicing SSH or marking it real-time, it has no way to say "give the
> CPU to me right now, I have something very short to do". So it will
> have to wait for the 100 other tasks to eat their 10ms, waiting 1 second
> to consume 5ms of CPU (and I was speaking about 800 and not 100).
Um, if ssh is not using as much CPU time as the other running processes
(if it has "something very short to do"), then yes it should get the CPU
*right now*, regardless of what the timeslice size is. If it *is* using
as much CPU time as everyone else, then it will have to wait to get time,
just like everybody else; and in that case, lowering the timeslice will
not help matters at all: consider that if ssh has to compute for 20ms
before returning control to the user, then with a 10ms timeslice it just
has to wait for two slices. So in that case you actually do want a longer
and more efficient timeslice so everybody (including ssh) can get their
job done faster.
> > > Large time-slices are needed only in HPC environments IMHO, where only
> > > one task runs.
> >
> > That's silly. By definition if there is only one task running, you don't
> > care what the timeslice is.
>
> I mean there's only one important task. There is always a bit of pollution
> around it, and interrupting the tasks less often slightly reduces the
> context-switch overhead.
I think it is important in many situations, not just HPC.
Just because TPC-C runs are set up so the number of server threads
exactly matches the number of CPUs doesn't mean that real-world servers
don't run into lots of different overload conditions. And yes, cache
efficiency does matter for those too, not just HPC.
> > We actually did conduct some benchmarks, and a 10ms timeslice can start
> > hurting even things like kbuild quite a bit.
> >
> > But anyway, I don't care what the time slice is so much (although it
> > should be higher -- if the scheduler can't get good interactive behaviour
> > with a 20-30ms timeslice in most cases then it's no good IMO). I care
> > mostly that the timeslice does not decrease when load increases.
>
> On the contrary, I think it's a fundamental requirement if you need to
> maintain a reasonable interactivity, and a fair progress between all
> tasks. I think it's obvious that the only way to maintain
> a constant latency with a growing number of tasks is to reduce the time
> each task may spend on the CPU. Contrary to other domains such as network,
> you don't know how much time a task will spend on the CPU if you grant it
> access, and there is no way to know, because only the work that this
> task will perform will determine if it should run shorter or longer. Fair
> scheduling in other areas such as network is "easier" because you know the
> size of your packets so you know how much time they will take on the wire.
>
> Here with tasks, the best you can do is estimate based on history, and
> it will be rare that you can correctly guess and guarantee that the
> latency is correct.
>
> Maybe the timeslices should shrink only past a certain load though (I don't
> know how it's done today).
You are just asserting that shorter timeslices are more interactive.
As far as I know (aside from implementation details of a given scheduler),
that assertion only holds in general for a small number of things like
for example video playing or 3d graphics that adaptively scale back their
output as they get starved for CPU (it might be better to drop every 2nd
frame than to drop 10 frames every 20). I doubt there are many server side
apps like that. What you really need on your server is to give ssh more
priority than your 800 spam threads. You can do that *properly* with nice
or with this group fairness stuff. Lowering timeslices is basically
shooting in the dark.
Nick,
We do grow the period as the load increases, and this keeps the slice
constant - although it might not be big enough for your taste (but its
tunable)
Short-running tasks will indeed very likely be run quickly after
wakeup, because wakeups are placed left in the tree (and, when using
sleeper fairness, can get up to a whole slice of bonus).
Interactivity is all about generating a scheduling pattern that is easy
on the human brain - that means predictable and preferably with lags <
40ms - as long as the interval is predictable, the human brain will patch
up a lot; once it becomes erratic, all bets are off. (Human perception
of lag is in the 10ms range, but up to 40ms seems to patch up acceptably
as long as it's predictable.)
Due to current desktop bloat, it's important that CPU-bound tasks are
treated well too. Take for instance scrolling firefox - that utterly
consumes the fastest CPUs, yet people still expect a smooth experience.
Hence we try to ensure the scheduler behaviour degrades in a predictable
fashion, and to keep latency at a sane level.
What seems to trip up this psql workload is the strict
requirement to always run the leftmost task. If all tasks have very
short runnable periods, we start interleaving between all contending
tasks. The way we're looking to solve this is by weakening the leftmost
requirement so that a server/client pair can ping-pong for a while, then
switch to another pair which gets to ping-pong for a while.
This alternating pattern as opposed to the interleaving pattern is much
more friendly to the cache. And we should do it in such a manner that we
still ensure fairness and predictability and such.
The latest sched code contains a few patches in this direction
(.25-rc6), and they seem to have the desired effect on 1 socket single
and dual core and 8 socket single core and dual core. On quad core we
seem to have some load balance problems that destroy the workload in
other interesting ways - looking into that now.
- Peter
On Monday 17 March 2008 20:28, Peter Zijlstra wrote:
> Nick,
>
> We do grow the period as the load increases, and this keeps the slice
> constant - although it might not be big enough for your taste (but its
> tunable)
>
> Short-running tasks will indeed very likely be run quickly after
> wakeup, because wakeups are placed left in the tree (and, when using
> sleeper fairness, can get up to a whole slice of bonus).
>
> Interactivity is all about generating a scheduling pattern that is easy
> on the human brain - that means predictable and preferably with lags <
> 40ms - as long as the interval is predictable, the human brain will patch
> up a lot; once it becomes erratic, all bets are off. (Human perception
> of lag is in the 10ms range, but up to 40ms seems to patch up acceptably
> as long as it's predictable.)
>
> Due to current desktop bloat, it's important that CPU-bound tasks are
> treated well too. Take for instance scrolling firefox - that utterly
> consumes the fastest CPUs, yet people still expect a smooth experience.
> Hence we try to ensure the scheduler behaviour degrades in a predictable
> fashion, and to keep latency at a sane level.
Yeah, and firefox scrolling is in the class of workloads where they
adaptively reduce CPU consumption as they get less quota (ie. because
they just start skipping).
Still, for desktop workloads you shouldn't have to deal with lots of
CPU hog processes on the runqueue, so I don't see why this is needed?
I don't mind having the timeslice smallish, but it shouldn't be
reduced as load increases.
> What seems to trip up this psql workload is the strict
> requirement to always run the leftmost task. If all tasks have very
> short runnable periods, we start interleaving between all contending
> tasks. The way we're looking to solve this is by weakening the leftmost
> requirement so that a server/client pair can ping-pong for a while, then
> switch to another pair which gets to ping-pong for a while.
>
> This alternating pattern as opposed to the interleaving pattern is much
> more friendly to the cache. And we should do it in such a manner that we
> still ensure fairness and predictability and such.
>
> The latest sched code contains a few patches in this direction
> (.25-rc6), and they seem to have the desired effect on 1 socket single
> and dual core and 8 socket single core and dual core. On quad core we
> seem to have some load balance problems that destroy the workload in
> other interesting ways - looking into that now.
Yeah, thanks for looking at that. Wow, scheduler patches sure make
it upstream a lot quicker than when I used to work on the damn thing ;)
I did a quick run and it seems like the postgresql overload problem is
far better if not solved now on my 2x quad core. Haven't had time to
get some reportable results, but I hope to.
* Peter Zijlstra <[email protected]> wrote:
> The latest sched code contains a few patches in this direction
> (.25-rc6), and they seem to have the desired effect on 1 socket single
> and dual core and 8 socket single core and dual core. On quad core we
> seem to have some load balance problems that destroy the workload in
> other interresting ways - looking into that now.
here's a performance comparison of 2.6.21, 2.6.25-rc5 and 2.6.25-rc6, on
an 8-socket/16-core system:
http://redhat.com/~mingo/misc/sysbench-rc6.jpg
[transactions/sec, higher is better]

  clients:     2.6.21   2.6.25-rc5   2.6.25-rc6
  ----------------------------------------------
        1:     383.26       270.47       269.69
        2:     741.02       527.67       560.52
        4:    1880.79      1049.59      1184.44
        8:    3815.59      2901.07      3881.78
       16:    8944.81      8993.24      9000.81
       32:    8647.19      8568.66      8638.64
       64:    8058.10      7624.46      8212.92
      128:    6500.06      5804.75      8182.71
      256:    5625.27      3656.52      7661.02
[ Postgresql 8.3, default scheduler parameters, sysbench parameters:
--test=oltp --db-driver=pgsql --max-time=60 --max-requests=0
--oltp-read-only=on. Ask if you need more info about the test. ]
as you can see, near and after the saturation point .25 has not only
fixed the regression but rules the picture: it is 35%+ faster at 256
clients and shows no breakdown at all at high client counts.
( i also have to observe that while running with 256 clients overload,
the 2.6.25 system was totally serviceable, while 2.6.21 showed bad
lags. )
The "early rampup" phase [less than 25% utilized] is still not as good
as we'd like it to be - our idle balancing force is still a tad too
strong for this workload. (But that is relatively easy to solve in
general and we are working on those bits.)
in any case, we welcome any help from you with these tuning efforts.
It's certainly fun :)
Ingo