LinuxLists.cc - Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Sat, Feb 09, 2008 at 01:08:30AM +0100, Ingo Molnar wrote:
>
> * Olof Johansson <[email protected]> wrote:
>
> > 2.6.22: 3332 ms
> > 2.6.23: 4397 ms
> > 2.6.24: 8953 ms
> > 2.6.24-git19: 8986 ms
>
> if you enable SCHED_DEBUG, and subtract 4 from the value of
> /proc/sys/kernel/sched_features, does it get any better?
>
> if not, does writing 0 into /proc/sys/kernel/sched_features have any
> impact?

Doesn't seem to make a difference, either of them. I can unfortunately
not retest on x86 at the moment, only powerpc. But it made no noticable
difference in runtime w/ pasemi_defconfig and current mainline.

-Olof

2008-02-09 01:52:01

by Robert Hancock

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

Olof Johansson wrote:
> Hi,
>
> I ended up with a customer benchmark in my lap this week that doesn't
> do well on recent kernels. :(
>
> After cutting it down to a simple testcase/microbenchmark, it seems like
> recent kernels don't do as well with short-lived threads competing
> with the thread it's cloned off of. The CFS scheduler changes come to
> mind, but I suppose it could be caused by something else as well.
>
> The pared-down testcase is included below. Reported runtime for the
> testcase has increased almost 3x between 2.6.22 and 2.6.24:
>
> 2.6.22: 3332 ms
> 2.6.23: 4397 ms
> 2.6.24: 8953 ms
> 2.6.24-git19: 8986 ms
>
> While running, it'll fork off a bunch of threads, each doing just a little
> work, then busy-waiting on the original thread to finish as well. Yes,
> it's incredibly stupidly coded but that's not my point here.
>
> During run, (runtime 10s on my 1.5GHz Core2 Duo laptop), vmstat 2 shows:
>
> 0 0 0 115196 364748 2248396 0 0 0 0 163 89 0 0 100 0
> 2 0 0 115172 364748 2248396 0 0 0 0 270 178 24 0 76 0
> 2 0 0 115172 364748 2248396 0 0 0 0 402 283 52 0 48 0
> 2 0 0 115180 364748 2248396 0 0 0 0 402 281 50 0 50 0
> 2 0 0 115180 364764 2248396 0 0 0 22 403 295 51 0 48 1
> 2 0 0 115056 364764 2248396 0 0 0 0 399 280 52 0 48 0
> 0 0 0 115196 364764 2248396 0 0 0 0 241 141 17 0 83 0
> 0 0 0 115196 364768 2248396 0 0 0 2 155 67 0 0 100 0
> 0 0 0 115196 364768 2248396 0 0 0 0 148 62 0 0 100 0
>
> I.e. runqueue is 2, but only one cpu is busy. However, this still seems
> true on the kernel that runs the testcase in more reasonable time.
>
> Also, 'time' reports real and user time roughly the same on all kernels,
> so it's not that the older kernels are better at spreading out the load
> between the two cores (either that or it doesn't account for stuff right).
>
> I've included the config files, runtime output and vmstat output at
> http://lixom.net/~olof/threadtest/. I see similar behaviour on PPC as
> well as x86, so it's not architecture-specific.
>
> Testcase below. Yes, I know, there's a bunch of stuff that could be done
> differently and better, but it still doesn't motivate why there's a 3x
> slowdown between kernel versions...

I would say that something coded this bizarrely is really an application
bug and not something that one could call a kernel regression. Any
change in how the parent and child threads get scheduled will have a
huge impact on this test. I bet if you replace that busy wait with a
pthread_cond_wait or something similar, this problem goes away.

Hopefully it doesn't have to be pointed out that spawning off threads to
do so little work before terminating is inefficient, a thread pool or
even just a single thread would likely do a much better job..

2008-02-09 07:58:52

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Fri, 2008-02-08 at 18:04 -0600, Olof Johansson wrote:
> Hi,
>
> I ended up with a customer benchmark in my lap this week that doesn't
> do well on recent kernels. :(
>
> After cutting it down to a simple testcase/microbenchmark, it seems like
> recent kernels don't do as well with short-lived threads competing
> with the thread it's cloned off of. The CFS scheduler changes come to
> mind, but I suppose it could be caused by something else as well.
>
> The pared-down testcase is included below. Reported runtime for the
> testcase has increased almost 3x between 2.6.22 and 2.6.24:
>
> 2.6.22: 3332 ms
> 2.6.23: 4397 ms
> 2.6.24: 8953 ms
> 2.6.24-git19: 8986 ms

My 3GHz P4 shows disjointed results.

2.6.22.17-smp
time 798 ms
time 780 ms
time 702 ms

2.6.22.17-cfs-v24.1-smp
time 562 ms
time 551 ms
time 551 ms

2.6.23.15-smp
time 254 ms
time 254 ms
time 293 ms

2.6.23.15-cfs-v24-smp
time 764 ms
time 791 ms
time 780 ms

2.6.24.1-smp
time 815 ms
time 820 ms
time 771 ms

2.6.25-smp (git today)
time 29 ms
time 61 ms
time 72 ms

2008-02-09 08:55:05

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

Hi Mike,

On Sat, Feb 09, 2008 at 08:58:39AM +0100, Mike Galbraith wrote:
>
> On Fri, 2008-02-08 at 18:04 -0600, Olof Johansson wrote:
> > Hi,
> >
> > I ended up with a customer benchmark in my lap this week that doesn't
> > do well on recent kernels. :(
> >
> > After cutting it down to a simple testcase/microbenchmark, it seems like
> > recent kernels don't do as well with short-lived threads competing
> > with the thread it's cloned off of. The CFS scheduler changes come to
> > mind, but I suppose it could be caused by something else as well.
> >
> > The pared-down testcase is included below. Reported runtime for the
> > testcase has increased almost 3x between 2.6.22 and 2.6.24:
> >
> > 2.6.22: 3332 ms
> > 2.6.23: 4397 ms
> > 2.6.24: 8953 ms
> > 2.6.24-git19: 8986 ms
>
> My 3GHz P4 shows disjointed results.
>
> 2.6.22.17-smp
> time 798 ms
> time 780 ms
> time 702 ms
>
> 2.6.22.17-cfs-v24.1-smp
> time 562 ms
> time 551 ms
> time 551 ms
>
> 2.6.23.15-smp
> time 254 ms
> time 254 ms
> time 293 ms
>
> 2.6.23.15-cfs-v24-smp
> time 764 ms
> time 791 ms
> time 780 ms
>
> 2.6.24.1-smp
> time 815 ms
> time 820 ms
> time 771 ms

How many CPUs do you have ? It's impressive to see such important variations.
I would guess from the numbers that you sometimes have 1 or 2 CPUs doing
nothing. I've already observed strange CPU usage patterns while building
kernels (bound to 50% usage on a dual-athlon), but I can't say for sure
that it was only on 2.6, so it may be related to make dependencies instead
(at least it's what I've been suspecting till now).

> 2.6.25-smp (git today)
> time 29 ms
> time 61 ms
> time 72 ms

These ones look rather strange. What type of workload is it ? Can you
publish the program for others to test it ?

Regards,
Willy

2008-02-09 11:05:24

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Sat, 2008-02-09 at 09:03 +0100, Willy Tarreau wrote:

> How many CPUs do you have ?

It's a P4/HT, so 1 plus $CHUMP_CHANGE_MAYBE

> > 2.6.25-smp (git today)
> > time 29 ms
> > time 61 ms
> > time 72 ms
>
> These ones look rather strange. What type of workload is it ? Can you
> publish the program for others to test it ?

It's the proglet posted in this thread.

-Mike

2008-02-09 12:31:57

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Sat, Feb 09, 2008 at 11:58:25AM +0100, Mike Galbraith wrote:
>
> On Sat, 2008-02-09 at 09:03 +0100, Willy Tarreau wrote:
>
> > How many CPUs do you have ?
>
> It's a P4/HT, so 1 plus $CHUMP_CHANGE_MAYBE
>
> > > 2.6.25-smp (git today)
> > > time 29 ms
> > > time 61 ms
> > > time 72 ms
> >
> > These ones look rather strange. What type of workload is it ? Can you
> > publish the program for others to test it ?
>
> It's the proglet posted in this thread.

OK sorry, I did not notice it when I first read the report.

Regards,
Willy

2008-02-09 13:39:26

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Sat, 2008-02-09 at 12:40 +0100, Willy Tarreau wrote:
> On Sat, Feb 09, 2008 at 11:58:25AM +0100, Mike Galbraith wrote:
> >
> > On Sat, 2008-02-09 at 09:03 +0100, Willy Tarreau wrote:
> >
> > > How many CPUs do you have ?
> >
> > It's a P4/HT, so 1 plus $CHUMP_CHANGE_MAYBE
> >
> > > > 2.6.25-smp (git today)
> > > > time 29 ms
> > > > time 61 ms
> > > > time 72 ms
> > >
> > > These ones look rather strange. What type of workload is it ? Can you
> > > publish the program for others to test it ?
> >
> > It's the proglet posted in this thread.
>
> OK sorry, I did not notice it when I first read the report.

Hm. The 2.6.25-smp kernel is the only one that looks like it's doing
what proggy wants to do, massive context switching. Bump threads to
larger number so you can watch: the supposedly good kernel (22) is doing
everything on one CPU. Everybody else sucks differently (idleness), and
the clear throughput winner, via mad over-schedule (!?!), is git today.

-Mike

2008-02-09 17:11:59

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Sat, Feb 09, 2008 at 02:37:39PM +0100, Mike Galbraith wrote:
>
> On Sat, 2008-02-09 at 12:40 +0100, Willy Tarreau wrote:
> > On Sat, Feb 09, 2008 at 11:58:25AM +0100, Mike Galbraith wrote:
> > >
> > > On Sat, 2008-02-09 at 09:03 +0100, Willy Tarreau wrote:
> > >
> > > > How many CPUs do you have ?
> > >
> > > It's a P4/HT, so 1 plus $CHUMP_CHANGE_MAYBE
> > >
> > > > > 2.6.25-smp (git today)
> > > > > time 29 ms
> > > > > time 61 ms
> > > > > time 72 ms
> > > >
> > > > These ones look rather strange. What type of workload is it ? Can you
> > > > publish the program for others to test it ?
> > >
> > > It's the proglet posted in this thread.
> >
> > OK sorry, I did not notice it when I first read the report.
>
> Hm. The 2.6.25-smp kernel is the only one that looks like it's doing
> what proggy wants to do, massive context switching. Bump threads to
> larger number so you can watch: the supposedly good kernel (22) is doing
> everything on one CPU. Everybody else sucks differently (idleness), and
> the clear throughput winner, via mad over-schedule (!?!), is git today.

For me, 2.6.25-smp gives pretty irregular results :

time 6548 ms
time 7272 ms
time 1188 ms
time 3772 ms

The CPU usage is quite irregular too and never goes beyond 50% (this is a
dual-athlon). If I start two of these processes, 100% of the CPU is used,
the context switch rate is more regular (about 700/s) and the total time
is more regular too (between 14.8 and 18.5 seconds).

Increasing the parallel run time of the two threads by changing the upper
limit of the for(j) loop correctly saturates both processors. I think that
this program simply does not have enough work to do for each thread to run
for a full timeslice, thus showing a random behaviour.

However, I fail to understand the goal of the reproducer. Granted it shows
irregularities in the scheduler under such conditions, but what *real*
workload would spend its time sequentially creating then immediately killing
threads, never using more than 2 at a time ?

If this could be turned into a DoS, I could understand, but here it looks
a bit pointless :-/

Regards,
Willy

2008-02-09 17:34:15

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Sat, 2008-02-09 at 17:19 +0100, Willy Tarreau wrote:

> However, I fail to understand the goal of the reproducer.

(me too, I was trying to figure out what could be expected)

2008-02-10 05:28:19

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Sat, Feb 09, 2008 at 05:19:57PM +0100, Willy Tarreau wrote:
> On Sat, Feb 09, 2008 at 02:37:39PM +0100, Mike Galbraith wrote:
> >
> > On Sat, 2008-02-09 at 12:40 +0100, Willy Tarreau wrote:
> > > On Sat, Feb 09, 2008 at 11:58:25AM +0100, Mike Galbraith wrote:
> > > >
> > > > On Sat, 2008-02-09 at 09:03 +0100, Willy Tarreau wrote:
> > > >
> > > > > How many CPUs do you have ?
> > > >
> > > > It's a P4/HT, so 1 plus $CHUMP_CHANGE_MAYBE
> > > >
> > > > > > 2.6.25-smp (git today)
> > > > > > time 29 ms
> > > > > > time 61 ms
> > > > > > time 72 ms
> > > > >
> > > > > These ones look rather strange. What type of workload is it ? Can you
> > > > > publish the program for others to test it ?
> > > >
> > > > It's the proglet posted in this thread.
> > >
> > > OK sorry, I did not notice it when I first read the report.
> >
> > Hm. The 2.6.25-smp kernel is the only one that looks like it's doing
> > what proggy wants to do, massive context switching. Bump threads to
> > larger number so you can watch: the supposedly good kernel (22) is doing
> > everything on one CPU. Everybody else sucks differently (idleness), and
> > the clear throughput winner, via mad over-schedule (!?!), is git today.
>
> For me, 2.6.25-smp gives pretty irregular results :
>
> time 6548 ms
> time 7272 ms
> time 1188 ms
> time 3772 ms
>
> The CPU usage is quite irregular too and never goes beyond 50% (this is a
> dual-athlon). If I start two of these processes, 100% of the CPU is used,
> the context switch rate is more regular (about 700/s) and the total time
> is more regular too (between 14.8 and 18.5 seconds).
>
> Increasing the parallel run time of the two threads by changing the upper
> limit of the for(j) loop correctly saturates both processors. I think that
> this program simply does not have enough work to do for each thread to run
> for a full timeslice, thus showing a random behaviour.

Right. I should have tinkered a bit more with it before I posted it, the
version posted had too little going on in the first loop and thus got
hung up on the second busywait loop instead.

I did a bunch of runs with various loop sizes. Basically, what seems to
happen is that the older kernels are quicker at rebalancing a new thread
over to the other cpu, while newer kernels let them share the same cpu
longer (and thus increases wall clock runtime).

All of these are built with gcc without optimization, larger loop size
and an added sched_yield() in the busy-wait loop at the end to take that
out as a factor. As you've seen yourself, runtimes can be quite noisy
but the trends are quite clear anyway. All of these numbers were
collected with default scheduler runtime options, same kernels and
configs as previously posted.

Loop to 1M:
2.6.22 time 4015 ms
2.6.23 time 4581 ms
2.6.24 time 10765 ms
2.6.24-git19 time 8286 ms

2M:
2.6.22 time 7574 ms
2.6.23 time 9031 ms
2.6.24 time 12844 ms
2.6.24-git19 time 10959 ms

3M:
2.6.22 time 8015 ms
2.6.23 time 13053 ms
2.6.24 time 16204 ms
2.6.24-git19 time 14984 ms

4M:
2.6.22 time 10045 ms
2.6.23 time 16642 ms
2.6.24 time 16910 ms
2.6.24-git19 time 16468 ms

5M:
2.6.22 time 12055 ms
2.6.23 time 21024 ms
<missed 2.6.24 here>
2.6.24-git19 time 16040 ms

10M:
2.6.22 time 24030 ms
2.6.23 time 33082 ms
2.6.24 time 34139 ms
2.6.24-git19 time 33724 ms

20M:
2.6.22 time 50015 ms
2.6.23 time 63963 ms
2.6.24 time 65100 ms
2.6.24-git19 time 63092 ms

40M:
2.6.22 time 94315 ms
2.6.23 time 107930 ms
2.6.24 time 113291 ms
2.6.24-git19 time 110360 ms

So with more work per thread, the differences become less but they're
still there. At the 40M loop, with 500 threads it's quite a bit of
runtime per thread.

> However, I fail to understand the goal of the reproducer. Granted it shows
> irregularities in the scheduler under such conditions, but what *real*
> workload would spend its time sequentially creating then immediately killing
> threads, never using more than 2 at a time ?
>
> If this could be turned into a DoS, I could understand, but here it looks
> a bit pointless :-/

It seems generally unfortunate that it takes longer for a new thread to
move over to the second cpu even when the first is busy with the original
thread. I can certainly see cases where this causes suboptimal overall
system behaviour.

I agree that the testcase is highly artificial. Unfortunately, it's
not uncommon to see these kind of weird testcases from customers tring
to evaluate new hardware. :( They tend to be pared-down versions of
whatever their real workload is (the real workload is doing things more
appropriately, but the smaller version is used for testing). I was lucky
enough to get source snippets to base a standalone reproduction case on
for this, normally we wouldn't even get copies of their binaries.

-Olof

2008-02-10 06:16:31

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Sat, Feb 09, 2008 at 11:29:41PM -0600, Olof Johansson wrote:
> On Sat, Feb 09, 2008 at 05:19:57PM +0100, Willy Tarreau wrote:
> > On Sat, Feb 09, 2008 at 02:37:39PM +0100, Mike Galbraith wrote:
> > >
> > > On Sat, 2008-02-09 at 12:40 +0100, Willy Tarreau wrote:
> > > > On Sat, Feb 09, 2008 at 11:58:25AM +0100, Mike Galbraith wrote:
> > > > >
> > > > > On Sat, 2008-02-09 at 09:03 +0100, Willy Tarreau wrote:
> > > > >
> > > > > > How many CPUs do you have ?
> > > > >
> > > > > It's a P4/HT, so 1 plus $CHUMP_CHANGE_MAYBE
> > > > >
> > > > > > > 2.6.25-smp (git today)
> > > > > > > time 29 ms
> > > > > > > time 61 ms
> > > > > > > time 72 ms
> > > > > >
> > > > > > These ones look rather strange. What type of workload is it ? Can you
> > > > > > publish the program for others to test it ?
> > > > >
> > > > > It's the proglet posted in this thread.
> > > >
> > > > OK sorry, I did not notice it when I first read the report.
> > >
> > > Hm. The 2.6.25-smp kernel is the only one that looks like it's doing
> > > what proggy wants to do, massive context switching. Bump threads to
> > > larger number so you can watch: the supposedly good kernel (22) is doing
> > > everything on one CPU. Everybody else sucks differently (idleness), and
> > > the clear throughput winner, via mad over-schedule (!?!), is git today.
> >
> > For me, 2.6.25-smp gives pretty irregular results :
> >
> > time 6548 ms
> > time 7272 ms
> > time 1188 ms
> > time 3772 ms
> >
> > The CPU usage is quite irregular too and never goes beyond 50% (this is a
> > dual-athlon). If I start two of these processes, 100% of the CPU is used,
> > the context switch rate is more regular (about 700/s) and the total time
> > is more regular too (between 14.8 and 18.5 seconds).
> >
> > Increasing the parallel run time of the two threads by changing the upper
> > limit of the for(j) loop correctly saturates both processors. I think that
> > this program simply does not have enough work to do for each thread to run
> > for a full timeslice, thus showing a random behaviour.
>
> Right. I should have tinkered a bit more with it before I posted it, the
> version posted had too little going on in the first loop and thus got
> hung up on the second busywait loop instead.
>
> I did a bunch of runs with various loop sizes. Basically, what seems to
> happen is that the older kernels are quicker at rebalancing a new thread
> over to the other cpu, while newer kernels let them share the same cpu
> longer (and thus increases wall clock runtime).
>
> All of these are built with gcc without optimization, larger loop size
> and an added sched_yield() in the busy-wait loop at the end to take that
> out as a factor. As you've seen yourself, runtimes can be quite noisy
> but the trends are quite clear anyway. All of these numbers were
> collected with default scheduler runtime options, same kernels and
> configs as previously posted.
>
> Loop to 1M:
> 2.6.22 time 4015 ms
> 2.6.23 time 4581 ms
> 2.6.24 time 10765 ms
> 2.6.24-git19 time 8286 ms
>
> 2M:
> 2.6.22 time 7574 ms
> 2.6.23 time 9031 ms
> 2.6.24 time 12844 ms
> 2.6.24-git19 time 10959 ms
>
> 3M:
> 2.6.22 time 8015 ms
> 2.6.23 time 13053 ms
> 2.6.24 time 16204 ms
> 2.6.24-git19 time 14984 ms
>
> 4M:
> 2.6.22 time 10045 ms
> 2.6.23 time 16642 ms
> 2.6.24 time 16910 ms
> 2.6.24-git19 time 16468 ms
>
> 5M:
> 2.6.22 time 12055 ms
> 2.6.23 time 21024 ms
> <missed 2.6.24 here>
> 2.6.24-git19 time 16040 ms
>
> 10M:
> 2.6.22 time 24030 ms
> 2.6.23 time 33082 ms
> 2.6.24 time 34139 ms
> 2.6.24-git19 time 33724 ms
>
> 20M:
> 2.6.22 time 50015 ms
> 2.6.23 time 63963 ms
> 2.6.24 time 65100 ms
> 2.6.24-git19 time 63092 ms
>
> 40M:
> 2.6.22 time 94315 ms
> 2.6.23 time 107930 ms
> 2.6.24 time 113291 ms
> 2.6.24-git19 time 110360 ms
>
> So with more work per thread, the differences become less but they're
> still there. At the 40M loop, with 500 threads it's quite a bit of
> runtime per thread.

No, it's really nothing. I had to push the loop to 1 billion to make the load
noticeable. You don't have 500 threads, you have 2 threads and that load is
repeated 500 times. And if we look at the numbers, let's take the worst one :
> 40M:
> 2.6.24 time 113291 ms
113291/500 = 227 microseconds/loop. This is still very low compared to the
smallest timeslice you would have (1 ms at HZ=1000).

So your threads are still completing *before* the scheduler has to preempt
them.

> > However, I fail to understand the goal of the reproducer. Granted it shows
> > irregularities in the scheduler under such conditions, but what *real*
> > workload would spend its time sequentially creating then immediately killing
> > threads, never using more than 2 at a time ?
> >
> > If this could be turned into a DoS, I could understand, but here it looks
> > a bit pointless :-/
>
> It seems generally unfortunate that it takes longer for a new thread to
> move over to the second cpu even when the first is busy with the original
> thread. I can certainly see cases where this causes suboptimal overall
> system behaviour.

In fact, I don't think it takes longer, I think it does not do it at their
creation, but will do it immediately after the first slice is consumed. This
would explain the important differences here. I don't know how we could ensure
that the new thread is created on the second CPU from the start, though.

I tried inserting a sched_yield() at the top of the busy loop (1M loops).
By default, it did not change a thing. Then I simply set sched_compat_yield
to 1, and the two threads then ran simultaneously with a stable low time
(2700 ms instead of 10-12 seconds).

Doing so with 10k loops (initial test) shows times in the range 240-300 ms
only instead of 2200-6500 ms.

Ingo, would it be possible (and wise) to ensure that a new thread being
created gets immediately rebalanced in order to emulate what is done here
with sched_compat_yield=1 and sched_yield() in both threads just after the
thread creation ? I don't expect any performance difference doing this,
but maybe some shell scripts reliying on short-lived pipes would get faster
on SMP.

In fact, right now the following command uses only 1 CPU :

$ dd if=/dev/urandom of=rnd bs=1M count=1
$ (time bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9)<rnd >/dev/null

real 0m6.977s
user 0m6.908s
sys 0m0.072s

Maybe this is due to the block size in bzip2 though, because gzip does not
exhibit exactly the same behaviour :

$ (time gzipi -1|gzip -1|gzip -1|gzip -1|gzip -1|gzip -1)<rnd >/dev/null

real 0m5.300s
user 0m7.392s
sys 0m0.392s

> I agree that the testcase is highly artificial. Unfortunately, it's
> not uncommon to see these kind of weird testcases from customers tring
> to evaluate new hardware. :( They tend to be pared-down versions of
> whatever their real workload is (the real workload is doing things more
> appropriately, but the smaller version is used for testing). I was lucky
> enough to get source snippets to base a standalone reproduction case on
> for this, normally we wouldn't even get copies of their binaries.

I'm well aware of that. What's important is to be able to explain what is
causing the difference and why the test case does not represent anything
related to performance. Maybe the code author wanted to get 500 parallel
threads and got his code wrong ?

Regards,
willy

2008-02-10 06:59:23

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Sun, Feb 10, 2008 at 07:15:58AM +0100, Willy Tarreau wrote:
> On Sat, Feb 09, 2008 at 11:29:41PM -0600, Olof Johansson wrote:
> > 40M:
> > 2.6.22 time 94315 ms
> > 2.6.23 time 107930 ms
> > 2.6.24 time 113291 ms
> > 2.6.24-git19 time 110360 ms
> >
> > So with more work per thread, the differences become less but they're
> > still there. At the 40M loop, with 500 threads it's quite a bit of
> > runtime per thread.
>
> No, it's really nothing. I had to push the loop to 1 billion to make the load
> noticeable. You don't have 500 threads, you have 2 threads and that load is
> repeated 500 times. And if we look at the numbers, let's take the worst one :
> > 40M:
> > 2.6.24 time 113291 ms
> 113291/500 = 227 microseconds/loop. This is still very low compared to the
> smallest timeslice you would have (1 ms at HZ=1000).
>
> So your threads are still completing *before* the scheduler has to preempt
> them.

Hmm? I get that to be 227ms per loop, which is way more than a full
timeslice. Running the program took in the range of 2 minutes, so it's
110000 milliseconds, not microseconds.

> > It seems generally unfortunate that it takes longer for a new thread to
> > move over to the second cpu even when the first is busy with the original
> > thread. I can certainly see cases where this causes suboptimal overall
> > system behaviour.
>
> In fact, I don't think it takes longer, I think it does not do it at their
> creation, but will do it immediately after the first slice is consumed. This
> would explain the important differences here. I don't know how we could ensure
> that the new thread is created on the second CPU from the start, though.

The math doesn't add up for me. Even if it rebalanced at the end of
the first slice (i.e. after 1ms), that would be a 1ms penalty per
iteration. With 500 threads that'd be a total penalty of 500ms.

> I tried inserting a sched_yield() at the top of the busy loop (1M loops).
> By default, it did not change a thing. Then I simply set sched_compat_yield
> to 1, and the two threads then ran simultaneously with a stable low time
> (2700 ms instead of 10-12 seconds).
>
> Doing so with 10k loops (initial test) shows times in the range 240-300 ms
> only instead of 2200-6500 ms.

Right, likely because the long-running cases got stuck at the busy loop
at the end, which would end up aborting quicker if the other thread got
scheduled for just a bit. It was a mistake to post that variant of the
testcase, it's not as relevant and doesn't mimic the original workload I
was trying to mimic as well as if the first loop was made larger.

> Ingo, would it be possible (and wise) to ensure that a new thread being
> created gets immediately rebalanced in order to emulate what is done here
> with sched_compat_yield=1 and sched_yield() in both threads just after the
> thread creation ? I don't expect any performance difference doing this,
> but maybe some shell scripts reliying on short-lived pipes would get faster
> on SMP.

There's always the tradeoff of losing cache warmth whenever a thread is
moved, so I'm not sure if it's a good idea to always migrate it at
creation time. It's not a simple problem, really.

> > I agree that the testcase is highly artificial. Unfortunately, it's
> > not uncommon to see these kind of weird testcases from customers tring
> > to evaluate new hardware. :( They tend to be pared-down versions of
> > whatever their real workload is (the real workload is doing things more
> > appropriately, but the smaller version is used for testing). I was lucky
> > enough to get source snippets to base a standalone reproduction case on
> > for this, normally we wouldn't even get copies of their binaries.
>
> I'm well aware of that. What's important is to be able to explain what is
> causing the difference and why the test case does not represent anything
> related to performance. Maybe the code author wanted to get 500 parallel
> threads and got his code wrong ?

I believe it started out as a simple attempt to parallelize a workload
that sliced the problem too low, instead of slicing it in larger chunks
and have each thread do more work at a time. It did well on 2.6.22 with
almost a 2x speedup, but did worse than the single-treaded testcase on a
2.6.24 kernel.

So yes, it can clearly be handled through explanations and education
and fixen the broken testcase, but I'm still not sure the new behaviour
is desired.

-Olof

2008-02-10 07:58:47

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Sun, Feb 10, 2008 at 01:00:56AM -0600, Olof Johansson wrote:
> On Sun, Feb 10, 2008 at 07:15:58AM +0100, Willy Tarreau wrote:
> > On Sat, Feb 09, 2008 at 11:29:41PM -0600, Olof Johansson wrote:
> > > 40M:
> > > 2.6.22 time 94315 ms
> > > 2.6.23 time 107930 ms
> > > 2.6.24 time 113291 ms
> > > 2.6.24-git19 time 110360 ms
> > >
> > > So with more work per thread, the differences become less but they're
> > > still there. At the 40M loop, with 500 threads it's quite a bit of
> > > runtime per thread.
> >
> > No, it's really nothing. I had to push the loop to 1 billion to make the load
> > noticeable. You don't have 500 threads, you have 2 threads and that load is
> > repeated 500 times. And if we look at the numbers, let's take the worst one :
> > > 40M:
> > > 2.6.24 time 113291 ms
> > 113291/500 = 227 microseconds/loop. This is still very low compared to the
> > smallest timeslice you would have (1 ms at HZ=1000).
> >
> > So your threads are still completing *before* the scheduler has to preempt
> > them.
>
> Hmm? I get that to be 227ms per loop, which is way more than a full
> timeslice. Running the program took in the range of 2 minutes, so it's
> 110000 milliseconds, not microseconds.

Damn you're right! I don't know why I assumed that the reported time was
in microseconds. Nevermind.

> > > It seems generally unfortunate that it takes longer for a new thread to
> > > move over to the second cpu even when the first is busy with the original
> > > thread. I can certainly see cases where this causes suboptimal overall
> > > system behaviour.
> >
> > In fact, I don't think it takes longer, I think it does not do it at their
> > creation, but will do it immediately after the first slice is consumed. This
> > would explain the important differences here. I don't know how we could ensure
> > that the new thread is created on the second CPU from the start, though.
>
> The math doesn't add up for me. Even if it rebalanced at the end of
> the first slice (i.e. after 1ms), that would be a 1ms penalty per
> iteration. With 500 threads that'd be a total penalty of 500ms.

yes you're right.

> > I tried inserting a sched_yield() at the top of the busy loop (1M loops).
> > By default, it did not change a thing. Then I simply set sched_compat_yield
> > to 1, and the two threads then ran simultaneously with a stable low time
> > (2700 ms instead of 10-12 seconds).
> >
> > Doing so with 10k loops (initial test) shows times in the range 240-300 ms
> > only instead of 2200-6500 ms.
>
> Right, likely because the long-running cases got stuck at the busy loop
> at the end, which would end up aborting quicker if the other thread got
> scheduled for just a bit. It was a mistake to post that variant of the
> testcase, it's not as relevant and doesn't mimic the original workload I
> was trying to mimic as well as if the first loop was made larger.

agreed, but what's important is not to change the workload, but to see
what changes induce a different behaviour.

> > Ingo, would it be possible (and wise) to ensure that a new thread being
> > created gets immediately rebalanced in order to emulate what is done here
> > with sched_compat_yield=1 and sched_yield() in both threads just after the
> > thread creation ? I don't expect any performance difference doing this,
> > but maybe some shell scripts reliying on short-lived pipes would get faster
> > on SMP.
>
> There's always the tradeoff of losing cache warmth whenever a thread is
> moved, so I'm not sure if it's a good idea to always migrate it at
> creation time. It's not a simple problem, really.

yes I know. That should not prevent us from experimenting though. If
thread-CPU affinity is too strong and causes the second CPU to be
rarely used, there's something wrong waiting for a fix.

> > > I agree that the testcase is highly artificial. Unfortunately, it's
> > > not uncommon to see these kind of weird testcases from customers tring
> > > to evaluate new hardware. :( They tend to be pared-down versions of
> > > whatever their real workload is (the real workload is doing things more
> > > appropriately, but the smaller version is used for testing). I was lucky
> > > enough to get source snippets to base a standalone reproduction case on
> > > for this, normally we wouldn't even get copies of their binaries.
> >
> > I'm well aware of that. What's important is to be able to explain what is
> > causing the difference and why the test case does not represent anything
> > related to performance. Maybe the code author wanted to get 500 parallel
> > threads and got his code wrong ?
>
> I believe it started out as a simple attempt to parallelize a workload
> that sliced the problem too low, instead of slicing it in larger chunks
> and have each thread do more work at a time. It did well on 2.6.22 with
> almost a 2x speedup, but did worse than the single-treaded testcase on a
> 2.6.24 kernel.
>
> So yes, it can clearly be handled through explanations and education
> and fixen the broken testcase, but I'm still not sure the new behaviour
> is desired.

While I could accept the first slice of the second thread being consumed
on the same CPU as the first one, it definitely must migrate quickly if
both threads are competing for CPU.

regards,
willy

2008-02-11 08:16:18

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Sun, 2008-02-10 at 01:00 -0600, Olof Johansson wrote:
> On Sun, Feb 10, 2008 at 07:15:58AM +0100, Willy Tarreau wrote:
>
> > > I agree that the testcase is highly artificial. Unfortunately, it's
> > > not uncommon to see these kind of weird testcases from customers tring
> > > to evaluate new hardware. :( They tend to be pared-down versions of
> > > whatever their real workload is (the real workload is doing things more
> > > appropriately, but the smaller version is used for testing). I was lucky
> > > enough to get source snippets to base a standalone reproduction case on
> > > for this, normally we wouldn't even get copies of their binaries.
> >
> > I'm well aware of that. What's important is to be able to explain what is
> > causing the difference and why the test case does not represent anything
> > related to performance. Maybe the code author wanted to get 500 parallel
> > threads and got his code wrong ?
>
> I believe it started out as a simple attempt to parallelize a workload
> that sliced the problem too low, instead of slicing it in larger chunks
> and have each thread do more work at a time. It did well on 2.6.22 with
> almost a 2x speedup, but did worse than the single-treaded testcase on a
> 2.6.24 kernel.
>
> So yes, it can clearly be handled through explanations and education
> and fixen the broken testcase, but I'm still not sure the new behaviour
> is desired.

Piddling around with your testcase, it still looks to me like things
improved considerably in latest greatest git. Hopefully that means
happiness is in the pipe for the real workload... synthetic load is
definitely happier here as burst is shortened.

#!/bin/sh

uname -r >> results
./threadtest 10 1000000 >> results
./threadtest 100 100000 >> results
./threadtest 1000 10000 >> results
./threadtest 10000 1000 >> results
./threadtest 100000 100 >> results
echo >> results

(threadtest <iterations> <burn_time>)

results:
2.6.22.17-smp
time: 10525370 (us) work: 20000000 wait: 181000 idx: 1.90
time: 13514232 (us) work: 20000001 wait: 2666366 idx: 1.48
time: 36280953 (us) work: 20000008 wait: 21156077 idx: 0.55
time: 196374337 (us) work: 20000058 wait: 177141620 idx: 0.10
time: 128721968 (us) work: 20000099 wait: 115052705 idx: 0.16

2.6.22.17-cfs-v24.1-smp
time: 10579591 (us) work: 20000000 wait: 203659 idx: 1.89
time: 11170784 (us) work: 20000000 wait: 471961 idx: 1.79
time: 11650138 (us) work: 20000000 wait: 1406224 idx: 1.72
time: 21447616 (us) work: 20000007 wait: 10689242 idx: 0.93
time: 106792791 (us) work: 20000060 wait: 92098132 idx: 0.19

2.6.23.15-smp
time: 10507122 (us) work: 20000000 wait: 159809 idx: 1.90
time: 10545417 (us) work: 20000000 wait: 263833 idx: 1.90
time: 11337770 (us) work: 20000012 wait: 1069588 idx: 1.76
time: 15969860 (us) work: 20000000 wait: 5577750 idx: 1.25
time: 54029726 (us) work: 20000027 wait: 41734789 idx: 0.37

2.6.23.15-cfs-v24-smp
time: 10528972 (us) work: 20000000 wait: 217060 idx: 1.90
time: 10697159 (us) work: 20000000 wait: 447224 idx: 1.87
time: 12242250 (us) work: 20000000 wait: 1930175 idx: 1.63
time: 26364658 (us) work: 20000011 wait: 15468447 idx: 0.76
time: 158338977 (us) work: 20000084 wait: 144048265 idx: 0.13

2.6.24.1-smp
time: 10570521 (us) work: 20000000 wait: 208947 idx: 1.89
time: 10699224 (us) work: 20000000 wait: 404644 idx: 1.87
time: 12280164 (us) work: 20000005 wait: 1969263 idx: 1.63
time: 26424580 (us) work: 20000004 wait: 15725967 idx: 0.76
time: 159012417 (us) work: 20000055 wait: 144906212 idx: 0.13

2.6.25-smp (.git)
time: 10707278 (us) work: 20000000 wait: 376063 idx: 1.87
time: 10696886 (us) work: 20000000 wait: 455909 idx: 1.87
time: 11068510 (us) work: 19990003 wait: 820104 idx: 1.81
time: 11493803 (us) work: 19995076 wait: 1160150 idx: 1.74
time: 21311673 (us) work: 20001848 wait: 9399490 idx: 0.94

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <pthread.h>

#ifdef __PPC__
static void atomic_inc(volatile long *a)
{
asm volatile ("1:\n\
lwarx %0,0,%1\n\
addic %0,%0,1\n\
stwcx. %0,0,%1\n\
bne- 1b" : "=&r" (result) : "r"(a));
}
#else
static void atomic_inc(volatile long *a)
{
asm volatile ("lock; incl %0" : "+m" (*a));
}
#endif

long usecs(void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec * 1000000 + tv.tv_usec;
}

void burn(long *burnt)
{
long then, now, delta, tolerance = 10;

then = now = usecs();
while (now == then)
now = usecs();
delta = now - then;
if (delta < tolerance)
*burnt += delta;
}

volatile long stopped;
long burn_usecs = 1000, tot_work, tot_wait;

void *thread_func(void *cpus)
{
long work = 0, wait = 0;

while (work < burn_usecs)
burn(&work);
tot_work += work;

atomic_inc(&stopped);

/* Busy-wait */
while (stopped < *(int *)cpus)
burn(&wait);
tot_wait += wait;

return NULL;
}

int main(int argc, char **argv)
{
pthread_t thread;
int iter = 500, cpus = 2;
long t1, t2;

if (argc > 1)
iter = atoi(argv[1]);

if (argc > 2)
burn_usecs = atoi(argv[2]);

t1 = usecs();
while(iter--) {
stopped = 0;

pthread_create(&thread, NULL, &thread_func, &cpus);
thread_func(&cpus);
pthread_join(thread, NULL);
}
t2 = usecs();

printf("time: %ld (us) work: %ld wait: %ld idx: %2.2f\n",
t2-t1, tot_work, tot_wait, (double)tot_work/(t2-t1));

return 0;
}

2008-02-11 17:25:07

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Mon, Feb 11, 2008 at 09:15:55AM +0100, Mike Galbraith wrote:
> Piddling around with your testcase, it still looks to me like things
> improved considerably in latest greatest git. Hopefully that means
> happiness is in the pipe for the real workload... synthetic load is
> definitely happier here as burst is shortened.

The real workload doesn't see much of an improvement. The changes I did
when tinkering yesterday seem like they're better at modelling just
what's going on with that one.

I've added the new testcase I'm using for the numbers I posted last
night at http://lixom.net/~olof/threadtest/new/, including numbers for
the various kernels. I also included the binaries I built. (with "gcc
-DLOOPS=<size> testcase.c -lpthread").

I tried graphing the numbers as well, it looks like for larger workloads
that 2.6.22 has a fixed benefit over newer kernels. I.e. it seems quicker
at rebalancing, but once things balance out they're obviously doing ok
independent of kernel version.

Graph at http://lixom.net/olof/threadtest/new/schedgraph.pdf. I couldn't
figure out how to make the X axis linear, so it obviously looks odd with
the current powers-of-two at the end instead of linear, but the
differences can still be seen clearly.

-Olof

2008-02-11 19:59:25

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Mon, 2008-02-11 at 11:26 -0600, Olof Johansson wrote:
> On Mon, Feb 11, 2008 at 09:15:55AM +0100, Mike Galbraith wrote:
> > Piddling around with your testcase, it still looks to me like things
> > improved considerably in latest greatest git. Hopefully that means
> > happiness is in the pipe for the real workload... synthetic load is
> > definitely happier here as burst is shortened.
>
> The real workload doesn't see much of an improvement. The changes I did
> when tinkering yesterday seem like they're better at modelling just
> what's going on with that one.

So the real application is trying to yield? If so, you could try
prodding /proc/sys/kernel/sched_compat_yield.

It shouldn't matter if you yield or not really, that should reduce the
number of non-work spin cycles wasted awaiting preemption as threads
execute in series (the problem), and should improve your performance
numbers, but not beyond single threaded.

If I plugged a yield into the busy wait, I would expect to see a large
behavioral difference due to yield implementation changes, but that
would only be a symptom in this case, no? Yield should be a noop.

-Mike

2008-02-11 20:30:18

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Mon, Feb 11, 2008 at 08:58:46PM +0100, Mike Galbraith wrote:
>
> On Mon, 2008-02-11 at 11:26 -0600, Olof Johansson wrote:
> > On Mon, Feb 11, 2008 at 09:15:55AM +0100, Mike Galbraith wrote:
> > > Piddling around with your testcase, it still looks to me like things
> > > improved considerably in latest greatest git. Hopefully that means
> > > happiness is in the pipe for the real workload... synthetic load is
> > > definitely happier here as burst is shortened.
> >
> > The real workload doesn't see much of an improvement. The changes I did
> > when tinkering yesterday seem like they're better at modelling just
> > what's going on with that one.
>
> So the real application is trying to yield? If so, you could try
> prodding /proc/sys/kernel/sched_compat_yield.

The real application wasn't using yield, but it also didn't get hung up
on the busy-wait loop, that was a mistake by me on Friday. Practically
all the time of that application was spent in the actual workload
loop. I tried adding a yield at the bottom loop, but it didn't make a
significant difference.

> It shouldn't matter if you yield or not really, that should reduce the
> number of non-work spin cycles wasted awaiting preemption as threads
> execute in series (the problem), and should improve your performance
> numbers, but not beyond single threaded.
>
> If I plugged a yield into the busy wait, I would expect to see a large
> behavioral difference due to yield implementation changes, but that
> would only be a symptom in this case, no? Yield should be a noop.

Exactly. It made a big impact on the first testcase from Friday, where
the spin-off thread spent the bulk of the time in the busy-wait loop,
with a very small initial workload loop. Thus the yield passed the cpu
over to the other thread who got a chance to run the small workload,
followed by a quick finish by both of them. The better model spends the
bulk of the time in the first workload loop, so yielding doesn't gain
at all the same amount.

I still added it to rule out that it was a factor in the time
differences.

-Olof

2008-02-11 21:44:59

by Bill Davidsen

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

Olof Johansson wrote:

>> However, I fail to understand the goal of the reproducer. Granted it shows
>> irregularities in the scheduler under such conditions, but what *real*
>> workload would spend its time sequentially creating then immediately killing
>> threads, never using more than 2 at a time ?
>>
>> If this could be turned into a DoS, I could understand, but here it looks
>> a bit pointless :-/
>
> It seems generally unfortunate that it takes longer for a new thread to
> move over to the second cpu even when the first is busy with the original
> thread. I can certainly see cases where this causes suboptimal overall
> system behaviour.
>
I think the moving to another CPU gets really dependent on the CPU type.
On a P4+HT the caches are shared, and moving costs almost nothing for
cache hits, while on CPUs which have other cache layouts the migration
cost is higher. Obviously multi-core should be cheaper than
multi-socket, by avoiding using the system memory bus, but it still can
get ugly.

I have an IPC test around which showed that, it ran like hell on HT, and
progressively worse as cache because less shared. I wonder why the
latest git works so much better?

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2008-02-12 04:30:50

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Mon, 2008-02-11 at 16:45 -0500, Bill Davidsen wrote:

> I think the moving to another CPU gets really dependent on the CPU type.
> On a P4+HT the caches are shared, and moving costs almost nothing for
> cache hits, while on CPUs which have other cache layouts the migration
> cost is higher. Obviously multi-core should be cheaper than
> multi-socket, by avoiding using the system memory bus, but it still can
> get ugly.
>
> I have an IPC test around which showed that, it ran like hell on HT, and
> progressively worse as cache because less shared. I wonder why the
> latest git works so much better?

Yes, I'm wondering the same. With latest git, ~400 usec work units
suffice to achieve overlap (on my P4/HT), whereas all other kernels
tested require several milliseconds.

-Mike

2008-02-12 09:24:12

[permalink] [raw]

Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

On Mon, 2008-02-11 at 14:31 -0600, Olof Johansson wrote:
> On Mon, Feb 11, 2008 at 08:58:46PM +0100, Mike Galbraith wrote:

> > It shouldn't matter if you yield or not really, that should reduce the
> > number of non-work spin cycles wasted awaiting preemption as threads
> > execute in series (the problem), and should improve your performance
> > numbers, but not beyond single threaded.
> >
> > If I plugged a yield into the busy wait, I would expect to see a large
> > behavioral difference due to yield implementation changes, but that
> > would only be a symptom in this case, no? Yield should be a noop.
>
> Exactly. It made a big impact on the first testcase from Friday, where
> the spin-off thread spent the bulk of the time in the busy-wait loop,
> with a very small initial workload loop. Thus the yield passed the cpu
> over to the other thread who got a chance to run the small workload,
> followed by a quick finish by both of them. The better model spends the
> bulk of the time in the first workload loop, so yielding doesn't gain
> at all the same amount.

There is a strong dependency on execution order in this testcase.

Between cpu affinity and giving the child a little head start to reduce
the chance (100% if child wakes on same CPU and doesn't preempt parent)
of busy wait, modified testcase behaves. I don't think I should need
the CPU affinity, but I do.

If you plunk a usleep(1) in prior to calling thread_func() does your
testcase performance change radically? If so, I wonder if the real
application has the same kind of dependency.

-Mike

Attachments:

threadtest.c (1.98 kB)

2008-02-13 05:49:36