2005-11-07 22:18:01

by Brian Twichell

Subject: Database regression due to scheduler changes ?

Hi,

We observed a 1.5% regression in an OLTP database workload going from
2.6.13-rc4 to 2.6.13-rc5. The regression has been carried forward
at least as far as 2.6.14-rc5.

Through experimentation, and through examining the changes that
went into 2.6.13-rc5, we found that we can eliminate the regression
in 2.6.13-rc5 with one straightforward change: eliminating the
NUMA level from the CPU scheduler domain structures.

After observing this, we collected schedstats (provided below)
to try to determine how the scheduler behaves differently
when the NUMA level is eliminated. It appears to us that
the scheduler is having more success in balancing in this
case. We tried to duplicate this effect by changing parameters
in the NUMA-level and SMP-level domain definitions to
increase the aggressiveness of the balancing, but none of the
changes could recoup the regression.

We suspect the regression was introduced in the scheduler changes
that went into 2.6.13-rc1. However, the regression was hidden
from us by a bug in include/asm-ppc64/topology.h that made ppc64
look non-NUMA from 2.6.13-rc1 through 2.6.13-rc4. That bug was
fixed in 2.6.13-rc5. Unfortunately the workload does not run to
completion on 2.6.12 or 2.6.13-rc1. We have measurements on
2.6.12-rc6-git7 that do not show the regression.

One alternative for fixing this in 2.6.13 would have been to #define
ARCH_HAS_SCHED_DOMAINS and to introduce a ppc64-specific version
of build_sched_domains that eliminates the NUMA-level domain for
small (e.g. 4-way) ppc64 systems. However, ARCH_HAS_SCHED_DOMAINS
has been eliminated from 2.6.14, and anyway that solution doesn't
seem very general to me.

So, at this point I am soliciting assistance from scheduler experts
to determine how this regression can be eliminated. We are keen
to prevent this regression from going into the next distro versions.
Simply shipping a distro kernel with CONFIG_NUMA off isn't a viable
option because we need it for our larger configurations.

Our system configuration is a 4-way 1.9 GHz Power5-based server. As
the system supports SMT, it shows eight online CPUs.

Below are the schedstats. The first set is with the NUMA-level
domain, while the second set is without the NUMA-level domain.

Cheers,
Brian Twichell

Schedstats (NUMA-level domain included)
----------------------------------------------------------------------
00:09:05--------------------------------------------------------------
2845 sys_sched_yield()
0( 0.00%) found (only) active queue empty on current cpu
0( 0.00%) found (only) expired queue empty on current cpu
157( 5.52%) found both queues empty on current cpu
2688( 94.48%) found neither queue empty on current cpu


23287180 schedule()
1( 0.00%) switched active and expired queues
0( 0.00%) used existing active queue

0 active_load_balance()
0 sched_balance_exec()

0.19/1.17 avg runtime/latency over all cpus (ms)

[scheduler domain #0]
1418943 load_balance()
112240( 7.91%) called while idle
499( 0.44%) tried but failed to move any tasks
80433( 71.66%) found no busier group
31308( 27.89%) succeeded in moving at least one task
(average imbalance: 1.549)
316022( 22.27%) called while busy
21( 0.01%) tried but failed to move any tasks
220440( 69.75%) found no busier group
95561( 30.24%) succeeded in moving at least one task
(average imbalance: 1.727)
990681( 69.82%) called when newly idle
533( 0.05%) tried but failed to move any tasks
808816( 81.64%) found no busier group
181332( 18.30%) succeeded in moving at least one task
(average imbalance: 1.500)

0 sched_balance_exec() tried to push a task

[scheduler domain #1]
922193 load_balance()
85822( 9.31%) called while idle
4032( 4.70%) tried but failed to move any tasks
70982( 82.71%) found no busier group
10808( 12.59%) succeeded in moving at least one task
(average imbalance: 1.348)
27022( 2.93%) called while busy
106( 0.39%) tried but failed to move any tasks
25478( 94.29%) found no busier group
1438( 5.32%) succeeded in moving at least one task
(average imbalance: 1.712)
809349( 87.76%) called when newly idle
6967( 0.86%) tried but failed to move any tasks
757097( 93.54%) found no busier group
45285( 5.60%) succeeded in moving at least one task
(average imbalance: 1.338)

0 sched_balance_exec() tried to push a task

[scheduler domain #2]
825662 load_balance()
52074( 6.31%) called while idle
17791( 34.16%) tried but failed to move any tasks
32839( 63.06%) found no busier group
1444( 2.77%) succeeded in moving at least one task
(average imbalance: 1.981)
9524( 1.15%) called while busy
1072( 11.26%) tried but failed to move any tasks
7654( 80.37%) found no busier group
798( 8.38%) succeeded in moving at least one task
(average imbalance: 2.976)
764064( 92.54%) called when newly idle
262831( 34.40%) tried but failed to move any tasks
409353( 53.58%) found no busier group
91880( 12.03%) succeeded in moving at least one task
(average imbalance: 2.518)

0 sched_balance_exec() tried to push a task


Schedstats (NUMA-level domain eliminated)
----------------------------------------------------------------------
00:09:03--------------------------------------------------------------
2576 sys_sched_yield()
0( 0.00%) found (only) active queue empty on current cpu
0( 0.00%) found (only) expired queue empty on current cpu
118( 4.58%) found both queues empty on current cpu
2458( 95.42%) found neither queue empty on current cpu


23617887 schedule()
1106774 goes idle
0( 0.00%) switched active and expired queues
0( 0.00%) used existing active queue

0 active_load_balance()
0 sched_balance_exec()

0.19/1.10 avg runtime/latency over all cpus (ms)

[scheduler domain #0]
1810988 load_balance()
153509( 8.48%) called while idle
680( 0.44%) tried but failed to move any tasks
104906( 68.34%) found no busier group
47923( 31.22%) succeeded in moving at least one task
(average imbalance: 1.658)
317016( 17.51%) called while busy
30( 0.01%) tried but failed to move any tasks
217438( 68.59%) found no busier group
99548( 31.40%) succeeded in moving at least one task
(average imbalance: 1.831)
1340463( 74.02%) called when newly idle
762( 0.06%) tried but failed to move any tasks
1092960( 81.54%) found no busier group
246741( 18.41%) succeeded in moving at least one task
(average imbalance: 1.564)

0 sched_balance_exec() tried to push a task

[scheduler domain #1]
1244187 load_balance()
111326( 8.95%) called while idle
8396( 7.54%) tried but failed to move any tasks
71276( 64.02%) found no busier group
31654( 28.43%) succeeded in moving at least one task
(average imbalance: 1.412)
39138( 3.15%) called while busy
220( 0.56%) tried but failed to move any tasks
34676( 88.60%) found no busier group
4242( 10.84%) succeeded in moving at least one task
(average imbalance: 1.360)
1093723( 87.91%) called when newly idle
15971( 1.46%) tried but failed to move any tasks
932422( 85.25%) found no busier group
145330( 13.29%) succeeded in moving at least one task
(average imbalance: 1.189)

0 sched_balance_exec() tried to push a task



2005-11-07 22:35:45

by David Lang

Subject: Re: Database regression due to scheduler changes ?

Brian,
If I am understanding the data you posted, it looks like you are using
sched_yield extensively in your database. This is known to have significant
problems on SMP machines, and even bigger ones on NUMA machines, in part
because the process doing the sched_yield may get rescheduled immediately
and not allow other processes to run (to free up whatever resource it's
waiting for). This causes the processor to look busy to the scheduler, and
therefore the scheduler doesn't migrate other processes to the CPU that's
spinning on sched_yield. On NUMA machines this is even more noticeable, as
processes now have to migrate through an additional layer of the
scheduler.
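
For illustration, the pattern I'm describing looks something like this
(a made-up sketch, not your actual code):

#include <sched.h>

static volatile int lock_flag;        /* hypothetical shared lock word */

static void acquire(void)
{
        /* the waiter never blocks: it stays runnable, so its CPU looks
         * busy to the load balancer even though no useful work gets done */
        while (__sync_lock_test_and_set(&lock_flag, 1))
                sched_yield();        /* may be rescheduled again immediately */
}

static void release(void)
{
        __sync_lock_release(&lock_flag);
}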

Have you tried eliminating the sched_yield to see what difference it makes?

David Lang


--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2005-11-07 22:47:47

by linux-os (Dick Johnson)

Subject: Re: Database regression due to scheduler changes ?


On Mon, 7 Nov 2005, Brian Twichell wrote:

> Hi,
>
> We observed a 1.5% regression in an OLTP database workload going from
> 2.6.13-rc4 to 2.6.13-rc5. The regression has been carried forward
> at least as far as 2.6.14-rc5.

[rest of the original message and schedstats snipped]

Can you change sched_yield() to usleep(1) or usleep(0) and see if
that works? I found that in recent kernels sched_yield() just seems
to spin (it may not actually spin, but it seems to, with high CPU usage).

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.55 BogoMips).
Warning : 98.36% of all statistics are fiction.
.


2005-11-07 23:06:15

by Brian Twichell

Subject: Re: Database regression due to scheduler changes ?

David Lang wrote:

> If I am understanding the data you posted, it looks like you are
> using sched_yield extensively in your database.

Yes, I've seen problems in the past with workloads that use sched_yield
heavily.

But bear in mind, the ~2700 sched_yields shown in the schedstats
occurred over a 9 minute period. That means that sched_yield is being
called at a rate of around 5 per second -- this is not a heavy user of
sched_yield.

To put this into a broader perspective, this workload has around 270
tasks, and the context switch rate is around 45,000 per second.



2005-11-08 00:49:59

by Nick Piggin

Subject: Re: Database regression due to scheduler changes ?

Brian Twichell wrote:
> David Lang wrote:
>
>> If I am understanding the data you posted, it looks like you are
>> using sched_yield extensively in your database.
>
>
> Yes, I've seen problems in the past with workloads that use sched_yield
> heavily.
>
> But bear in mind, the ~2700 sched_yields shown in the schedstats
> occurred over a 9 minute period. That means that sched_yield is being
> called at a rate of around 5 per second -- this is not a heavy user of
> sched_yield.
>
> To put this into a broader perspective, this workload has around 270
> tasks, and the context switch rate is around
> 45,000 per second.
>

Hi,

Thanks for your detailed report (and schedstats analysis). Sorry
I didn't see it until now.

I think you are right that the NUMA domain is probably being too
constrictive of task balancing, and that is where the regression
is coming from.

For some workloads it is definitely important to have the NUMA
domain, because it helps spread load over memory controllers as
well as CPUs - so I guess eliminating that domain is not a good
long term solution.

I would look at changing parameters of SD_NODE_INIT in include/
asm-powerpc/topology.h so they are closer to SD_CPU_INIT parameters
(ie. more aggressive).

Reducing min_interval, max_interval, busy_factor, and cache_hot_time will
all do this.

I would also take a look at removing SD_WAKE_IDLE from the flags.
This flag should make balancing more aggressive, but it can have
problems when applied to a NUMA domain due to too much task
movement.

I agree that sched_yield would be unlikely to be a problem at
those rates, and either way it doesn't explain the regression.

--
SUSE Labs, Novell Inc.


2005-11-08 01:17:08

by Anton Blanchard

Subject: Re: Database regression due to scheduler changes ?


Hi Nick,

> I would also take a look at removing SD_WAKE_IDLE from the flags.
> This flag should make balancing more aggressive, but it can have
> problems when applied to a NUMA domain due to too much task
> movement.

I was wondering how ppc64 ended up with different parameters in the NODE
definitions (added SD_BALANCE_NEWIDLE and SD_WAKE_IDLE) and it looks
like it was Andrew :)

http://lkml.org/lkml/2004/11/2/205

It looks like balancing was not aggressive enough on his workload too.
I'm a bit uneasy with only ppc64 having the two flags though.

I'm also considering adding balance on fork for ppc64; it seems like a
lot of people like to run stream-like benchmarks and I'm getting tired of
telling them to lock their threads down to cpus.

Anton

2005-11-08 01:34:41

by Martin Bligh

Subject: Re: Database regression due to scheduler changes ?

>> I would also take a look at removing SD_WAKE_IDLE from the flags.
>> This flag should make balancing more aggressive, but it can have
>> problems when applied to a NUMA domain due to too much task
>> movement.
>
> I was wondering how ppc64 ended up with different parameters in the NODE
> definitions (added SD_BALANCE_NEWIDLE and SD_WAKE_IDLE) and it looks
> like it was Andrew :)
>
> http://lkml.org/lkml/2004/11/2/205
>
> It looks like balancing was not aggressive enough on his workload too.
> I'm a bit uneasy with only ppc64 having the two flags though.
>
> I'm also considering adding balance on fork for ppc64; it seems like a
> lot of people like to run stream-like benchmarks and I'm getting tired of
> telling them to lock their threads down to cpus.

Please don't screw up everything else just for stream. It's a silly
frigging benchmark. There's very little real-world stuff that really
needs balance on fork, as opposed to balance on clone, and it'll slow
down everything else.

M.


2005-11-08 01:44:33

by Nick Piggin

Subject: Re: Database regression due to scheduler changes ?

Martin J. Bligh wrote:

>>I'm also considering adding balance on fork for ppc64; it seems like a
>>lot of people like to run stream-like benchmarks and I'm getting tired of
>>telling them to lock their threads down to cpus.
>
>
> Please don't screw up everything else just for stream. It's a silly
> frigging benchmark. There's very little real-world stuff that really
> needs balance on fork, as opposed to balance on clone, and it'll slow
> down everything else.
>

Long lived and memory intensive cloned or forked tasks will often
[but far from always :(] want to be put on another memory controller
from their siblings.

On workloads where there are lots of short lived ones (some bloated
java programs), the load balancer should normally detect this and
cut the balance-on-fork/clone.

Of course there are going to be cases where this fails. I haven't
seen significant slowdowns in tests, although I'm sure there would
be some at least small regressions. Have you seen any? Do you have
any tests in mind that might show a problem?

Thanks,
Nick

--
SUSE Labs, Novell Inc.


2005-11-08 01:46:35

by Nick Piggin

Subject: Re: Database regression due to scheduler changes ?

Nick Piggin wrote:

[...]

> be some at least small regressions. Have you seen any? Do you have
> any tests in mind that might show a problem?
>

To clarify, I'm not suggesting you should go one way or the other
for POWER4/5, but if you did have regressions I would be interested
at least so I can try helping platforms that do use balance on clone.

Thanks,
Nick

--
SUSE Labs, Novell Inc.


2005-11-08 01:58:21

by Martin Bligh

Subject: Re: Database regression due to scheduler changes ?

>>> I'm also considering adding balance on fork for ppc64; it seems like a
>>> lot of people like to run stream-like benchmarks and I'm getting tired of
>>> telling them to lock their threads down to cpus.
>>
>> Please don't screw up everything else just for stream. It's a silly
>> frigging benchmark. There's very little real-world stuff that really
>> needs balance on fork, as opposed to balance on clone, and it'll slow
>> down everything else.
>
> Long lived and memory intensive cloned or forked tasks will often
> [but far from always :(] want to be put on another memory controller
> from their siblings.
>
> On workloads where there are lots of short lived ones (some bloated
> java programs), the load balancer should normally detect this and
> cut the balance-on-fork/clone.
>
> Of course there are going to be cases where this fails. I haven't
> seen significant slowdowns in tests, although I'm sure there would
> be some at least small regressions. Have you seen any? Do you have
> any tests in mind that might show a problem?

Anything fork/exec-y should show that it's slower. Most stuff either forks
and execs (in which case it's silly to do it twice, and much cheaper to do
it at exec time), or it's a clone, in which case a different set of
rules applies for what you want (and actually, I suspect fork w/o exec
is much the same).

Of course the pig is you can't determine at fork whether it'll exec
or not, so you optimise for the common case, which is "do exec", unless
given a hint otherwise.

For clone, and I suspect fork w/o exec, you have a tightly coupled
group of processes that really would like to be close to each other.
If you have 1 app on the whole system, you *may* want it spread across
the system. If you have nr_apps >= nr_nodes, you probably want them
node local. Determining which workload you have is messy, and may
change.

Tweak the freak benchmark, not everything else ;-)

M.

2005-11-08 02:05:16

by David Lang

Subject: Re: Database regression due to scheduler changes ?

On Tue, 8 Nov 2005, Nick Piggin wrote:

> Martin J. Bligh wrote:
>
>>> I'm also considering adding balance on fork for ppc64; it seems like a
>>> lot of people like to run stream-like benchmarks and I'm getting tired of
>>> telling them to lock their threads down to cpus.
>>
>>
>> Please don't screw up everything else just for stream. It's a silly
>> frigging benchmark. There's very little real-world stuff that really
>> needs balance on fork, as opposed to balance on clone, and it'll slow
>> down everything else.
>>
>
> Long lived and memory intensive cloned or forked tasks will often
> [but far from always :(] want to be put on another memory controller
> from their siblings.
>
> On workloads where there are lots of short lived ones (some bloated
> java programs), the load balancer should normally detect this and
> cut the balance-on-fork/clone.

although if the primary workload is short-lived tasks and you don't do
balance-on-fork/clone won't you have trouble ever balancing things?
(anything that you do move over will probably exit quickly and put you
right back where you started)

at the risk of a slowdown from an extra test it almost sounds like what is
needed is to get feedback from the last scheduled balance attempt and use
that to decide per-fork what to do.

for example say the scheduled balance attempt leaves a per-cpu value that
has its high bit tested every fork/clone (and then rotated left 1 bit)
and if it's a 1 do a balance for this new process.

with a reasonably sized item (I would guess the default int size would
probably be the most efficient to process, but even 8 bits may be enough)
the scheduled balance attempt can leave quite an extensive range of
behavior, from 'always balance' to 'never balance' to 'balance every 5th
and 8th fork', etc.
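
as a rough sketch (made-up pseudo-C, not working kernel code; in reality
balance_hint would be one value per cpu):

unsigned int balance_hint;      /* pattern written by the scheduled balancer */

int should_balance_on_fork(void)
{
        unsigned int hint = balance_hint;
        int bits = sizeof(hint) * 8;
        int do_balance = (hint >> (bits - 1)) & 1;   /* test the high bit */

        /* rotate left one bit so the next fork/clone sees the next bit */
        balance_hint = (hint << 1) | (hint >> (bits - 1));

        return do_balance;
}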

> Of course there are going to be cases where this fails. I haven't
> seen significant slowdowns in tests, although I'm sure there would
> be some at least small regressions. Have you seen any? Do you have
> any tests in mind that might show a problem?

even though people will point out that it's a brain-dead workload (that
should be converted to a state machine) I would expect that most
fork-per-connection servers would show problems if the work per connection
is small

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2005-11-08 02:12:20

by Martin Bligh

Subject: Re: Database regression due to scheduler changes ?



--On Monday, November 07, 2005 18:04:23 -0800 David Lang <[email protected]> wrote:

> On Tue, 8 Nov 2005, Nick Piggin wrote:
>
>> Martin J. Bligh wrote:
>>
>>>> I'm also considering adding balance on fork for ppc64; it seems like a
>>>> lot of people like to run stream-like benchmarks and I'm getting tired of
>>>> telling them to lock their threads down to cpus.
>>>
>>>
>>> Please don't screw up everything else just for stream. It's a silly
>>> frigging benchmark. There's very little real-world stuff that really
>>> needs balance on fork, as opposed to balance on clone, and it'll slow
>>> down everything else.
>>>
>>
>> Long lived and memory intensive cloned or forked tasks will often
>> [but far from always :(] want to be put on another memory controller
>> from their siblings.
>>
>> On workloads where there are lots of short lived ones (some bloated
>> java programs), the load balancer should normally detect this and
>> cut the balance-on-fork/clone.
>
> although if the primary workload is short-lived tasks and you don't do balance-on-fork/clone won't you have trouble ever balancing things?
> (anything that you do move over will probably exit quickly and put you
> right back where you started)

If you fork without execing a lot, with no hints, and they all exit
quickly, then yes. But I don't think that's a common workload ;-)

> at the risk of a slowdown from an extra test it almost sounds like what is needed is to get feedback from the last scheduled balance attempt and use that to decide per-fork what to do.
>
> for example say the scheduled balance attempt leaves a per-cpu value that has its high bit tested every fork/clone (and then rotated left 1 bit) and if it's a 1 do a balance for this new process.
>
> with a reasonably sized item (I would guess the default int size would probably be the most efficient to process, but even 8 bits may be enough) the scheduled balance attempt can leave quite an extensive range of behavior, from 'always balance' to 'never balance' to 'balance every 5th and 8th fork', etc.

That might work, yes. But I'd prefer to see a real workload that
suffers before worrying about it too much. You have something in mind?

>> Of course there are going to be cases where this fails. I haven't
>> seen significant slowdowns in tests, although I'm sure there would
>> be some at least small regressions. Have you seen any? Do you have
>> any tests in mind that might show a problem?
>
> even though people will point out that it's a brain-dead workload (that should be converted to a state machine) I would expect that most fork-per-connection servers would show problems if the work per connection is small

I suspect most of those are either inetd (which execs) or multiple servers
that service requests by now. Maybe not. Threads might be quicker if
it's heavy anyway ;-)

M.

2005-11-08 02:14:14

by Nick Piggin

Subject: Re: Database regression due to scheduler changes ?

David Lang wrote:
> On Tue, 8 Nov 2005, Nick Piggin wrote:
>
>>
>> Long lived and memory intensive cloned or forked tasks will often
>> [but far from always :(] want to be put on another memory controller
>> from their siblings.
>>
>> On workloads where there are lots of short lived ones (some bloated
>> java programs), the load balancer should normally detect this and
>> cut the balance-on-fork/clone.
>
>
> although if the primary workload is short-lived tasks and you don't do
> balance-on-fork/clone won't you have trouble ever balancing things?
> (anything that you do move over will probably exit quickly and put you
> right back where you started)
>

You'll have no trouble if things *need* to be balanced, because
that would imply the runqueue length average is significantly
above the lengths of other runqueues.

As far as the extra test goes, it's really a minuscule overhead
compared with the fork / clone cost itself, and can be really
worthwhile if we get it right.

>
>> Of course there are going to be cases where this fails. I haven't
>> seen significant slowdowns in tests, although I'm sure there would
>> be some at least small regressions. Have you seen any? Do you have
>> any tests in mind that might show a problem?
>
>
> even though people will point out that it's a brain-dead workload (that
> should be converted to a state machine) I would expect that most
> fork-per-connection servers would show problems if the work per
> connection is small
>

Well it may be brain-dead, but if people use them (and they do)
then I would really be interested to see results.

I did testing with some things like apache and volanomark; however,
I was not able to make out much difference on my setups. Though
obviously that's not to say that there won't be with other software
or other workloads / architectures etc.

--
SUSE Labs, Novell Inc.


2005-11-08 02:31:22

by Byron Stanoszek

Subject: Re: Database regression due to scheduler changes ?

On Mon, 7 Nov 2005, David Lang wrote:

> Brian,
> If I am understanding the data you posted, it looks like you are using
> sched_yield extensively in your database. This is known to have significant
> problems on SMP machines, and even bigger ones on NUMA machines, in part
> because the process doing the sched_yield may get rescheduled immediately and
> not allow other processes to run (to free up whatever resource it's waiting
> for). This causes the processor to look busy to the scheduler, and therefore
> the scheduler doesn't migrate other processes to the CPU that's spinning on
> sched_yield. On NUMA machines this is even more noticeable, as processes now
> have to migrate through an additional layer of the scheduler.

I have an application designed on Linux where the only processes running are
'init' and those integral to the application. Each communicates using mutual
exclusion & semaphores across a shared file/memory backing.

The application was designed to be intrinsically close to what Linux itself
does--manage processes. There's only 1 thread per process, and each process
has a different executable for its own task.

One day I plan to extend this application across multiple CPUs using either SMP
or NUMA. Therefore a lot of the mutual exclusion routines I've coded use
sched_yield().

What should I do instead to alleviate the problem of causing the processor to
look busy? In this case I _want_ other processes to be migrated over to the
CPU in order to free up the critical section faster.

A simple test using a 2-cpu SMP system resulted in sched_yield() being a lot
faster than using futexes, but I don't know about the NUMA case.
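
For reference, the futex-based variant I compared against looks roughly like
this (a simplified sketch, not the exact code; error handling omitted). The
point is that the waiter sleeps in the kernel instead of spinning:

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int futex_wait(int *addr, int expected)
{
        /* sleep while *addr still equals 'expected'; returns once another
         * task calls futex_wake() on addr (or the value already changed) */
        return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static int futex_wake(int *addr, int nwaiters)
{
        /* wake up to 'nwaiters' tasks blocked in futex_wait() on addr */
        return syscall(SYS_futex, addr, FUTEX_WAKE, nwaiters, NULL, NULL, 0);
}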

Best regards,
-Byron

--
Byron Stanoszek Ph: (330) 644-3059
Systems Programmer Fax: (330) 644-8110
Commercial Timesharing Inc. Email: [email protected]

2005-11-08 03:52:45

by Nick Piggin

Subject: Re: Database regression due to scheduler changes ?

linux-os (Dick Johnson) wrote:

>
> Can you change sched_yield() to usleep(1) or usleep(0) and see if
> that works. I found that in recent kernels sched_yield() just seems
> to spin (may not actually spin, but seems to with a high CPU usage).
>

I've told you that it *does* spin and always has. Even with 2.4
kernels. In fact, it is *specified* to spin; anything else would
be a bug.

Caveat: it also yields the CPU, but only if there is another
runnable task with a higher priority (which is meaningless
between SCHED_OTHER tasks, though we try to do something sane
there too).

Secondly, Brian actually pinpointed the source of the
regression and it is not sched_yield(), nor has sched_yield
changed since the regression. So wouldn't this just be a wild
goose chase?

Nick

--
SUSE Labs, Novell Inc.


2005-11-09 02:13:46

by Andrew Theurer

Subject: Re: Database regression due to scheduler changes ?

Nick wrote:

>> I would also take a look at removing SD_WAKE_IDLE from the flags.
>> This flag should make balancing more aggressive, but it can have
>> problems when applied to a NUMA domain due to too much task
>> movement.
>
> Anton wrote:
> I was wondering how ppc64 ended up with different parameters in the NODE
> definitions (added SD_BALANCE_NEWIDLE and SD_WAKE_IDLE) and it looks
> like it was Andrew :)
>
> http://lkml.org/lkml/2004/11/2/205

FWIW I changed all arches, but most (except ppc) got changed back. At
the time we had data showing that the more aggressive wake-idle and newidle
balancing was good for things like OLTP.

Brian, do you have cpu util numbers and runqueue lengths for both tests?

>
> It looks like balancing was not aggressive enough on his workload too.
> I'm a bit uneasy with only ppc64 having the two flags though.

Brian wrote:

> We suspect the regression was introduced in the scheduler changes
> that went into 2.6.13-rc1. However, the regression was hidden
> from us by a bug in include/asm-ppc64/topology.h that made ppc64
> look non-NUMA from 2.6.13-rc1 through 2.6.13-rc4. That bug was
> fixed in 2.6.13-rc5. Unfortunately the workload does not run to
> completion on 2.6.12 or 2.6.13-rc1.

Brian, I am not sure if you were thinking of a particular set of sched
changes, but I suspect it might be one or more in the list below (my
guess is the first and last). Would it be possible to back out these
change-sets from 2.6.13-rc5 and see if there is any difference? FWIW,
even if they do help, I am not suggesting, yet, that they should be
reverted. I am hoping there is some compromise that can work better in
all situations.

-Andrew

commit cafb20c1f9976a70d633bb1e1c8c24eab00e4e80
Author: Nick Piggin <[email protected]>
Date: Sat Jun 25 14:57:17 2005 -0700

[PATCH] sched: no aggressive idle balancing

Remove the very aggressive idle stuff that has recently gone into 2.6 - it is
going against the direction we are trying to go. Hopefully we can regain
performance through other methods.

Signed-off-by: Nick Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
Author: Nick Piggin <[email protected]>
Date: Sat Jun 25 14:57:15 2005 -0700

[PATCH] sched: tweak affine wakeups

Do less affine wakeups. We're trying to reduce dbt2-pgsql idle time
regressions here... make sure we don't move tasks the wrong way in an
imbalance condition. Also, remove the cache coldness requirement from the
calculation - this seems to induce sharp cutoff points where behaviour will
suddenly change on some workloads if the load creeps slightly over or under
some point. It is good for periodic balancing because in that case we
otherwise have no other context to determine what task to move.

But also make a minor tweak to "wake balancing" - the imbalance tolerance is
now set at half the domain's imbalance, so we get the opportunity to do wake
balancing before the more random periodic rebalancing gets performed.

Signed-off-by: Nick Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

commit 7897986bad8f6cd50d6149345aca7f6480f49464
Author: Nick Piggin <[email protected]>
Date: Sat Jun 25 14:57:13 2005 -0700

[PATCH] sched: balance timers

Do CPU load averaging over a number of different intervals. Allow each
interval to be chosen by sending a parameter to source_load and target_load.
0 is instantaneous, idx > 0 returns a decaying average with the most recent
sample weighted at 2^(idx-1). To a maximum of 3 (could be easily increased).

So generally a higher number will result in more conservative balancing.

Signed-off-by: Nick Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

commit 99b61ccf0bf0e9a85823d39a5db6a1519caeb13d
Author: Nick Piggin <[email protected]>
Date: Sat Jun 25 14:57:12 2005 -0700

[PATCH] sched: less aggressive idle balancing

Remove the special casing for idle CPU balancing. Things like this are
hurting for example on SMT, where a single sibling being idle doesn't really
warrant a really aggressive pull over the NUMA domain, for example.

Signed-off-by: Nick Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>




2005-11-09 05:03:55

by Brian Twichell

Subject: Re: Database regression due to scheduler changes ?

Nick Piggin wrote:

>
> I think you are right that the NUMA domain is probably being too
> constrictive of task balancing, and that is where the regression
> is coming from.
>
> For some workloads it is definitely important to have the NUMA
> domain, because it helps spread load over memory controllers as
> well as CPUs - so I guess eliminating that domain is not a good
> long term solution.
>
> I would look at changing parameters of SD_NODE_INIT in include/
> asm-powerpc/topology.h so they are closer to SD_CPU_INIT parameters
> (ie. more aggressive).

I ran with the following:

--- topology.h.orig 2005-11-08 13:11:57.000000000 -0600
+++ topology.h 2005-11-08 13:17:15.000000000 -0600
@@ -43,11 +43,11 @@ static inline int node_to_first_cpu(int
.span = CPU_MASK_NONE, \
.parent = NULL, \
.groups = NULL, \
- .min_interval = 8, \
- .max_interval = 32, \
- .busy_factor = 32, \
+ .min_interval = 1, \
+ .max_interval = 4, \
+ .busy_factor = 64, \
.imbalance_pct = 125, \
- .cache_hot_time = (10*1000000), \
+ .cache_hot_time = (5*1000000/2), \
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_LOAD_BALANCE \

There was no improvement in performance. The schedstats from this run
follow:

2516 sys_sched_yield()
0( 0.00%) found (only) active queue empty on current cpu
0( 0.00%) found (only) expired queue empty on current cpu
46( 1.83%) found both queues empty on current cpu
2470( 98.17%) found neither queue empty on current cpu


22969106 schedule()
694922 goes idle
3( 0.00%) switched active and expired queues
0( 0.00%) used existing active queue

0 active_load_balance()
0 sched_balance_exec()

0.19/1.28 avg runtime/latency over all cpus (ms)

[scheduler domain #0]
1153606 load_balance()
82580( 7.16%) called while idle
488( 0.59%) tried but failed to move any tasks
63876( 77.35%) found no busier group
18216( 22.06%) succeeded in moving at least one task
(average imbalance: 1.526)
317610( 27.53%) called while busy
15( 0.00%) tried but failed to move any tasks
220139( 69.31%) found no busier group
97456( 30.68%) succeeded in moving at least one task
(average imbalance: 1.752)
753416( 65.31%) called when newly idle
487( 0.06%) tried but failed to move any tasks
624132( 82.84%) found no busier group
128797( 17.10%) succeeded in moving at least one task
(average imbalance: 1.531)

0 sched_balance_exec() tried to push a task

[scheduler domain #1]
715638 load_balance()
68533( 9.58%) called while idle
3140( 4.58%) tried but failed to move any tasks
60357( 88.07%) found no busier group
5036( 7.35%) succeeded in moving at least one task
(average imbalance: 1.251)
22486( 3.14%) called while busy
64( 0.28%) tried but failed to move any tasks
21352( 94.96%) found no busier group
1070( 4.76%) succeeded in moving at least one task
(average imbalance: 1.922)
624619( 87.28%) called when newly idle
5218( 0.84%) tried but failed to move any tasks
591970( 94.77%) found no busier group
27431( 4.39%) succeeded in moving at least one task
(average imbalance: 1.382)

0 sched_balance_exec() tried to push a task

[scheduler domain #2]
685164 load_balance()
63247( 9.23%) called while idle
7280( 11.51%) tried but failed to move any tasks
52200( 82.53%) found no busier group
3767( 5.96%) succeeded in moving at least one task
(average imbalance: 1.361)
24729( 3.61%) called while busy
418( 1.69%) tried but failed to move any tasks
21025( 85.02%) found no busier group
3286( 13.29%) succeeded in moving at least one task
(average imbalance: 3.579)
597188( 87.16%) called when newly idle
67577( 11.32%) tried but failed to move any tasks
371377( 62.19%) found no busier group
158234( 26.50%) succeeded in moving at least one task
(average imbalance: 2.146)

0 sched_balance_exec() tried to push a task

>
> I would also take a look at removing SD_WAKE_IDLE from the flags.
> This flag should make balancing more aggressive, but it can have
> problems when applied to a NUMA domain due to too much task
> movement.

Independently of the run above, I ran with the following:

--- topology.h.orig 2005-11-08 19:32:19.000000000 -0600
+++ topology.h 2005-11-08 19:34:25.000000000 -0600
@@ -53,7 +53,6 @@ static inline int node_to_first_cpu(int
.flags = SD_LOAD_BALANCE \
| SD_BALANCE_EXEC \
| SD_BALANCE_NEWIDLE \
- | SD_WAKE_IDLE \
| SD_WAKE_BALANCE, \
.last_balance = jiffies, \
.balance_interval = 1, \

There was no improvement in performance.

I didn't expect any change in performance this time, because I
don't think the SD_WAKE_IDLE flag is effective in the NUMA
domain, due to the following code in wake_idle:

        for_each_domain(cpu, sd) {
                if (sd->flags & SD_WAKE_IDLE) {
                        cpus_and(tmp, sd->span, p->cpus_allowed);
                        for_each_cpu_mask(i, tmp) {
                                if (idle_cpu(i))
                                        return i;
                        }
                }
                else
                        break;
        }

If I read that loop correctly it stops at the first domain
which doesn't have SD_WAKE_IDLE set, which is the CPU domain
(see SD_CPU_INIT), and thus it never gets to the NUMA domain.

Thanks for the suggestions Nick. Andrew raises some
good questions that I will address tomorrow.

Cheers,
Brian

2005-11-14 23:03:49

by Brian Twichell

Subject: Re: Database regression due to scheduler changes ?

Nick Piggin wrote:

> Just one other thing - A couple of fields aren't actually getting
> initialised at all, which I didn't pick up on.
>
> This bug looks to have been due to a mismerge between the
> common asm-powerpc directory and one of my scheduler changes
> somewhere along the line.
>
> If you get time to try this out, that would be great.
>
>===================================================================
>--- linux-2.6.orig/include/asm-powerpc/topology.h 2005-11-09 16:43:16.000000000 +1100
>+++ linux-2.6/include/asm-powerpc/topology.h 2005-11-09 16:45:17.000000000 +1100
>@@ -51,6 +51,10 @@ static inline int node_to_first_cpu(int
> .cache_hot_time = (10*1000000), \
> .cache_nice_tries = 1, \
> .per_cpu_gain = 100, \
>+ .busy_idx = 3, \
>+ .idle_idx = 1, \
>+ .newidle_idx = 2, \
>+ .wake_idx = 1, \
> .flags = SD_LOAD_BALANCE \
> | SD_BALANCE_EXEC \
> | SD_BALANCE_NEWIDLE \
>
>
Nick,

That patch eliminates the regression on 2.6.13-rc5. Thanks !!
We are currently evaluating it with other workloads.

It also gives a boost on 2.6.14, but unfortunately we are still 1%
regressed on 2.6.14. (The regression on 2.6.14 was larger than
the regression on 2.6.13-rc5.) We're trying to isolate the 2.6.14
regression now. I'll let you know if we isolate it to a
scheduler change.

Cheers,
Brian