2009-09-06 21:00:05

by Ingo Molnar

Subject: BFS vs. mainline scheduler benchmarks and measurements

hi Con,

I've read your BFS announcement/FAQ with great interest:

http://ck.kolivas.org/patches/bfs/bfs-faq.txt

First and foremost, let me say that i'm happy that you are hacking
the Linux scheduler again. It's perhaps proof that hacking the
scheduler is one of the most addictive things on the planet ;-)

I understand that BFS is still early code and that you are not
targeting BFS for mainline inclusion - but BFS is an interesting
and bold new approach, cutting a _lot_ of code out of
kernel/sched*.c, so it raised my curiosity and interest :-)

In the announcement and on your webpage you have compared BFS to
the mainline scheduler in various workloads - showing various
improvements over it. I have tried and tested BFS and ran a set of
benchmarks - this mail contains the results and my (quick)
findings.

So ... to get to the numbers - i've tested both BFS and the tip of
the latest upstream scheduler tree on a testbox of mine. I
intentionally didnt test BFS on any really large box - because you
described its upper limit like this in the announcement:

-----------------------
|
| How scalable is it?
|
| I don't own the sort of hardware that is likely to suffer from
| using it, so I can't find the upper limit. Based on first
| principles about the overhead of locking, and the way lookups
| occur, I'd guess that a machine with more than 16 CPUS would
| start to have less performance. BIG NUMA machines will probably
| suck a lot with this because it pays no deference to locality of
| the NUMA nodes when deciding what cpu to use. It just keeps them
| all busy. The so-called "light NUMA" that constitutes commodity
| hardware these days seems to really like BFS.
|
-----------------------

I generally agree with you that "light NUMA" is what a Linux
scheduler needs to concentrate on (at most) in terms of
scalability. Big NUMA, 4096 CPUs is not very common and we tune the
Linux scheduler for desktop and small-server workloads mostly.

So the testbox i picked fits into the upper portion of what i
consider a sane range of systems to tune for - and should still fit
into BFS's design bracket as well according to your description:
it's a dual quad core system with hyperthreading. It has twice as
many cores as the quad you tested on but it's not excessive and
certainly does not have 4096 CPUs ;-)

Here are the benchmark results:

kernel build performance:
http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild.jpg

pipe performance:
http://redhat.com/~mingo/misc/bfs-vs-tip-pipe.jpg

messaging performance (hackbench):
http://redhat.com/~mingo/misc/bfs-vs-tip-messaging.jpg

OLTP performance (postgresql + sysbench)
http://redhat.com/~mingo/misc/bfs-vs-tip-oltp.jpg

Alas, as can be seen in the graphs, i can not see any BFS
performance improvements on this box.

Here's a more detailed description of the results:

| Kernel build performance
---------------------------

http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild.jpg

In the kbuild test BFS is showing significant weaknesses up to 16
CPUs. On 8 CPUs utilized (half load) it's 27.6% slower. All results
(-j1, -j2 ... -j15) are slower. The peak at 100% utilization at -j16
is slightly stronger under BFS, by 1.5%. The 'absolute best' result
is sched-devel at -j64 with 46.65 seconds - the best BFS result is
47.38 seconds (also at -j64), so sched-devel's best is 1.5% faster.

| Pipe performance
-------------------

http://redhat.com/~mingo/misc/bfs-vs-tip-pipe.jpg

Pipe performance is a very simple test: two tasks message each
other via pipes. I measured 1 million such messages:

http://redhat.com/~mingo/cfs-scheduler/tools/pipe-test-1m.c

The pipe test ran a number of them in parallel:

for ((i=0;i<$NR;i++)); do ~/sched-tests/pipe-test-1m & done; wait

and measured elapsed time. This tests two things: basic scheduler
performance and also scheduler fairness. (if one of these parallel
jobs is delayed unfairly then the test will finish later.)

[ see further below for a simpler pipe latency benchmark as well. ]
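
For reference, the core of such a ping-pong test boils down to roughly
the following - a minimal sketch, not necessarily identical to the
pipe-test-1m.c linked above: parent and child bounce a single byte back
and forth one million times over a pair of pipes, so every message
forces a wakeup and a context switch in each direction:

/* pipe ping-pong sketch: 1 million message round-trips */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define LOOPS 1000000

int main(void)
{
        int ping[2], pong[2], i;
        char c = 'x';

        if (pipe(ping) || pipe(pong)) {
                perror("pipe");
                exit(1);
        }
        if (fork() == 0) {
                /* child: wait for a byte, echo one back */
                for (i = 0; i < LOOPS; i++)
                        if (read(ping[0], &c, 1) != 1 ||
                            write(pong[1], &c, 1) != 1)
                                exit(1);
                exit(0);
        }
        /* parent: send a byte, wait for the reply */
        for (i = 0; i < LOOPS; i++)
                if (write(ping[1], &c, 1) != 1 ||
                    read(pong[0], &c, 1) != 1)
                        exit(1);
        wait(NULL);
        return 0;
}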

As can be seen in the graph BFS performed very poorly in this test:
at 8 pairs of tasks it had a runtime of 45.42 seconds - while
sched-devel finished them in 3.8 seconds.

I saw really bad interactivity in the BFS test here - the system
was starved for as long as the test ran. I stopped the tests at 8
loops - the system was unusable and i was getting IO timeouts due
to the scheduling lag:

sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
end_request: I/O error, dev sda, sector 81949243
Aborting journal on device sda2.
ext3_abort called.
EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only

I measured interactivity during this test:

$ time ssh aldebaran /bin/true
real 2m17.968s
user 0m0.009s
sys 0m0.003s

A single command took more than 2 minutes.

| Messaging performance
------------------------

http://redhat.com/~mingo/misc/bfs-vs-tip-messaging.jpg

Hackbench ran better for BFS - but mainline sched-devel is still
significantly faster, at both smaller and larger loads. With 20 groups
mainline ran 61.5% faster.

| OLTP performance
--------------------

http://redhat.com/~mingo/misc/bfs-vs-tip-oltp.jpg

As can be seen in the graph for sysbench OLTP performance
sched-devel outperforms BFS on each of the main stages:

single client load ( 1 client - 6.3% faster )
half load ( 8 clients - 57.6% faster )
peak performance ( 16 clients - 117.6% faster )
overload ( 512 clients - 288.3% faster )

| Other tests
--------------

I also tested a couple of other things, such as lat_tcp:

BFS: TCP latency using localhost: 16.5608 microseconds
sched-devel: TCP latency using localhost: 13.5528 microseconds [22.1% faster]

lat_pipe:

BFS: Pipe latency: 4.9703 microseconds
sched-devel: Pipe latency: 2.6137 microseconds [90.1% faster]

General interactivity of BFS seemed good to me - except for the
pipe test when there was significant lag over a minute. I think
it's some starvation bug, not an inherent design property of BFS,
so i'm looking forward to re-test it with the fix.

Test environment: i used latest BFS (205 and then i re-ran under
208 and the numbers are all from 208), and the latest mainline
scheduler development tree from:

http://people.redhat.com/mingo/tip.git/README

Commit 840a065 in particular. It's on a .31-rc8 base while BFS is
on a .30 base - will be able to test BFS on a .31 base as well once
you release it. (but it doesnt matter much to the results - there
werent any heavy core kernel changes impacting these workloads.)

The system had enough RAM to have the workloads cached, and i
repeated all tests to make sure it's all representative.
Nevertheless i'd like to encourage others to repeat these (or
other) tests - the more testing the better.

I also tried to configure the kernel in a BFS friendly way, i used
HZ=1000 as recommended, turned off all debug options, etc. The
kernel config i used can be found here:

http://redhat.com/~mingo/misc/config

( Let me know if you need any more info about any of the tests i
conducted. )

Also, i'd like to outline that i agree with the general goals
described by you in the BFS announcement - small desktop systems
matter more than large systems. We find it critically important
that the mainline Linux scheduler performs well on those systems
too - and if you (or anyone else) can reproduce suboptimal behavior
please let the scheduler folks know so that we can fix/improve it.

I hope to be able to work with you on this, please dont hesitate
sending patches if you wish - and we'll also be following BFS for
good ideas and code to adopt to mainline.

Thanks,

Ingo


2009-09-07 02:05:24

by Frans Pop

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Ingo Molnar wrote:
> So the testbox i picked fits into the upper portion of what i
> consider a sane range of systems to tune for - and should still fit
> into BFS's design bracket as well according to your description:
> it's a dual quad core system with hyperthreading.

Ingo,

Nice that you've looked into this.

Would it be possible for you to run the same tests on e.g. a dual core
and/or a UP system (or maybe just offline some CPUs?)? It would be very
interesting to see whether BFS does better in the lower portion of the
range, or if the differences you show between the two schedulers are
consistent across the range.

Cheers,
FJP

2009-09-07 03:39:00

by Nikos Chantziaras

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/06/2009 11:59 PM, Ingo Molnar wrote:
>[...]
> Also, i'd like to outline that i agree with the general goals
> described by you in the BFS announcement - small desktop systems
> matter more than large systems. We find it critically important
> that the mainline Linux scheduler performs well on those systems
> too - and if you (or anyone else) can reproduce suboptimal behavior
> please let the scheduler folks know so that we can fix/improve it.

BFS improved behavior of many applications on my Intel Core 2 box in a
way that can't be benchmarked. Examples:

mplayer using OpenGL renderer doesn't drop frames anymore when dragging
and dropping the video window around in an OpenGL composited desktop
(KDE 4.3.1). (Start moving the mplayer window around; then drop it. At
the moment the move starts and at the moment you drop the window back to
the desktop, there's a big frame skip as if mplayer was frozen for a
bit; around 200 or 300ms.)

Composite desktop effects like zoom and fade out don't stall for
sub-second periods of time while there's CPU load in the background. In
other words, the desktop is more fluid and less skippy even during heavy
CPU load. Moving windows around with CPU load in the background doesn't
result in short skips.

LMMS (a tool utilizing real-time sound synthesis) does not produce
"pops", "crackles" and drops in the sound during real-time playback due
to buffer under-runs. Those problems amplify when there's heavy CPU
load in the background, while with BFS heavy load doesn't produce those
artifacts (though LMMS makes itself run SCHED_ISO with BFS). Also,
hitting a key on the keyboard needs less time for the note to become
audible when using BFS. The same should hold true for other tools that
traditionally benefit from the "-rt" kernel sources.

Games like Doom 3 and such don't "freeze" periodically for small amounts
of time (again for sub-second amounts) when something in the background
grabs CPU time (be it my mailer checking for new mail or a cron job, or
whatever.)

And, the most drastic improvement here, with BFS I can do a "make -j2"
in the kernel tree and the GUI stays fluid. Without BFS, things start
to lag, even with in-RAM builds (like having the whole kernel tree
inside a tmpfs) and gcc running with nice 19 and ionice -c 3.

Unfortunately, I can't come up with any way to somehow benchmark all of
this. There's no benchmark for "fluidity" and "responsiveness".
Running the Doom 3 benchmark, or any other benchmark, doesn't say
anything about responsiveness, it only measures how many frames were
calculated in a specific period of time. How "stable" (with no stalls)
those frames were making it to the screen is not measurable.

If BFS implied small drops in pure performance, counted in instructions
per second, that would be a totally acceptable regression for
desktop/multimedia/gaming PCs. Not for server machines, of course.
However, on my machine, BFS is faster in classic workloads. When I
run "make -j2" with BFS and the standard scheduler, BFS always finishes
a bit faster. Not by much, but still. One thing I'm noticing here is
that BFS produces 100% CPU load on each core with "make -j2" while the
normal scheduler stays at about 90-95% with -j2 or higher in at least
one of the cores. There seems to be under-utilization of CPU time.

Also, from searching around the net and from discussions on various
mailing lists, there seems to be a trend: the problems for some reason
seem to occur more often with Intel CPUs (Core 2 chips and lower; I
can't say anything about Core i7), while people on AMD CPUs are mostly
not affected by most or even all of the above. (And because of this,
flame wars often break out, with one party accusing the other of
imagining things.) Can the integrated memory controller on AMD chips
have something to do with this? Do AMD chips generally offer better
"multithreading" behavior? Unfortunately, you didn't mention what CPU
you ran your tests on. If it was AMD, it might be a good idea to run
tests on Pentium and Core 2 CPUs.

For reference, my system is:

CPU: Intel Core 2 Duo E6600 (2.4GHz)
Mainboard: Asus P5E (Intel X38 chipset)
RAM: 6GB (2+2+1+1) dual channel DDR2 800
GPU: RV770 (Radeon HD4870).

2009-09-07 03:56:14

by Con Kolivas

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

2009/9/7 Ingo Molnar <[email protected]>:
> hi Con,

Sigh..

Well hello there.

>
> I've read your BFS announcement/FAQ with great interest:
>
>    http://ck.kolivas.org/patches/bfs/bfs-faq.txt

> I understand that BFS is still early code and that you are not
> targeting BFS for mainline inclusion - but BFS is an interesting
> and bold new approach, cutting a _lot_ of code out of
> kernel/sched*.c, so it raised my curiosity and interest :-)

Hard to keep a project under wraps and get an audience at the same
time, it is. I do realise it was inevitable LKML would invade my
personal space no matter how much I didn't want it to, but it would be
rude of me to not respond.

> In the announcement and on your webpage you have compared BFS to
> the mainline scheduler in various workloads - showing various
> improvements over it. I have tried and tested BFS and ran a set of
> benchmarks - this mail contains the results and my (quick)
> findings.

/me sees Ingo run off to find the right combination of hardware and
benchmark to prove his point.

[snip lots of bullshit meaningless benchmarks showing how great cfs is
and/or how bad bfs is, along with telling people they should use these
artificial benchmarks to determine how good it is, demonstrating yet
again why benchmarks fail the desktop]

I'm not interested in a long protracted discussion about this since
I'm too busy to live linux the way full time developers do, so I'll
keep it short, and perhaps you'll understand my intent better if the
FAQ wasn't clear enough.


Do you know what a normal desktop PC looks like? No, a more realistic
question based on what you chose to benchmark to prove your point
would be: Do you know what normal people actually do on them?


Feel free to treat the question as rhetorical.

Regards,
-ck

/me checks on his distributed computing client's progress, fires up
his next H264 encode, changes music tracks and prepares to have his
arse whooped on quakelive.

2009-09-07 09:49:53

by Jens Axboe

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Sun, Sep 06 2009, Ingo Molnar wrote:
> So ... to get to the numbers - i've tested both BFS and the tip of
> the latest upstream scheduler tree on a testbox of mine. I
> intentionally didnt test BFS on any really large box - because you
> described its upper limit like this in the announcement:

I ran a simple test as well, since I was curious to see how it performed
wrt interactiveness. One of my pet peeves with the current scheduler is
that I have to nice compile jobs, or my X experience is just awful while
the compile is running.

Now, this test case is something that attempts to see what
interactiveness would be like. It'll run a given command line while at
the same time logging delays. The delays are measured as follows:

- The app creates a pipe, and forks a child that blocks on reading from
that pipe.
- The app sleeps for a random period of time, anywhere between 100ms
and 2s. When it wakes up, it gets the current time and writes that to
the pipe.
- The child then gets woken, checks the time on its own, and logs the
difference between the two.

The idea here being that the delay between writing to the pipe and the
child reading the data and comparing should (in some way) be indicative
of how responsive the system would seem to a user.
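
A minimal sketch of that measuring idea could look like the following
(this is not Jens' actual test app, just an approximation of the
description above - the compile job would be run alongside it):

/* wakeup-delay sketch: parent writes a timestamp into a pipe after a
 * random 100ms-2s sleep, the blocked child logs how late it sees it */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

static double now_ms(void)
{
        struct timeval tv;

        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(void)
{
        int fd[2];
        double ts;

        if (pipe(fd)) {
                perror("pipe");
                return 1;
        }
        if (fork() == 0) {
                /* child: block on the pipe, log each wakeup delay */
                while (read(fd[0], &ts, sizeof(ts)) == sizeof(ts))
                        printf("delay: %.1f ms\n", now_ms() - ts);
                return 0;
        }
        srand(getpid());
        for (;;) {
                struct timespec req;
                long us = 100000 + rand() % 1900000;

                /* sleep a random period between 100ms and 2s */
                req.tv_sec = us / 1000000;
                req.tv_nsec = (us % 1000000) * 1000L;
                nanosleep(&req, NULL);

                ts = now_ms();
                if (write(fd[1], &ts, sizeof(ts)) != sizeof(ts))
                        break;
        }
        return 0;
}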

The test app was quickly hacked up, so don't put too much into it. The
test run is a simple kernel compile, using -jX where X is the number of
threads in the system. The files are cache hot, so little IO is done.
The -x2 run uses double the number of processes as there are threads,
e.g. -j128 on a 64 thread box.

And I have to apologize for using a large system to test this on, I
realize it's out of the scope of BFS, but it's just easier to fire one
of these beasts up than it is to sacrifice my notebook or desktop
machine... So it's a 64 thread box. CFS -jX runtime is the baseline at
100, lower number means faster and vice versa. The latency numbers are
in msecs.


Scheduler Runtime Max lat Avg lat Std dev
----------------------------------------------------------------
CFS 100 951 462 267
CFS-x2 100 983 484 308
BFS
BFS-x2

And unfortunately this is where it ends for now, since BFS doesn't boot
on the two boxes I tried. It hard hangs right after disk detection. But
the latency numbers look pretty appalling for CFS, so it's a bit of a
shame that I did not get to compare. I'll try again later with a newer
revision, when available.

--
Jens Axboe

2009-09-07 10:12:39

by Nikos Chantziaras

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/07/2009 12:49 PM, Jens Axboe wrote:
> [...]
> And I have to apologize for using a large system to test this on, I
> realize it's out of the scope of BFS, but it's just easier to fire one
> of these beasts up than it is to sacrifice my notebook or desktop
> machine...

How does a kernel rebuild constitute "sacrifice"?


> So it's a 64 thread box. CFS -jX runtime is the baseline at
> 100, lower number means faster and vice versa. The latency numbers are
> in msecs.
>
>
> Scheduler Runtime Max lat Avg lat Std dev
> ----------------------------------------------------------------
> CFS 100 951 462 267
> CFS-x2 100 983 484 308
> BFS
> BFS-x2
>
> And unfortunately this is where it ends for now, since BFS doesn't boot
> on the two boxes I tried.

Then why post this in the first place?

2009-09-07 10:41:17

by Jens Axboe

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Mon, Sep 07 2009, Nikos Chantziaras wrote:
> On 09/07/2009 12:49 PM, Jens Axboe wrote:
>> [...]
>> And I have to apologize for using a large system to test this on, I
>> realize it's out of the scope of BFS, but it's just easier to fire one
>> of these beasts up than it is to sacrifice my notebook or desktop
>> machine...
>
> How does a kernel rebuild constitute "sacrifice"?

It's more of a bother since I have to physically be at the notebook,
whereas the server type boxes usually have remote management. The
workstation is what I'm using right now, so it'd be very disruptive to
do it there.
And as things are apparently very alpha on the bfs side currently, it's
easier to 'sacrifice' an idle test box. That's the keyword, 'test'
boxes. You know, machines used for testing. Not production machines.

Plus the notebook is using btrfs, which isn't on-disk format compatible
with 2.6.30.

Is there a point to this question?

>> So it's a 64 thread box. CFS -jX runtime is the baseline at
>> 100, lower number means faster and vice versa. The latency numbers are
>> in msecs.
>>
>>
>> Scheduler Runtime Max lat Avg lat Std dev
>> ----------------------------------------------------------------
>> CFS 100 951 462 267
>> CFS-x2 100 983 484 308
>> BFS
>> BFS-x2
>>
>> And unfortunately this is where it ends for now, since BFS doesn't boot
>> on the two boxes I tried.
>
> Then why post this in the first place?

You snipped the relevant part of the conclusion, the part where I make a
comment on the cfs latencies.

Don't bother replying to any of my emails if YOU continue writing emails
in this fashion. I have MUCH better things to do than entertain kiddies.
If you do get your act together and want to reply, follow lkml etiquette
and group reply.

--
Jens Axboe

2009-09-07 11:01:53

by Frederic Weisbecker

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Mon, Sep 07, 2009 at 06:38:36AM +0300, Nikos Chantziaras wrote:
> Unfortunately, I can't come up with any way to somehow benchmark all of
> this. There's no benchmark for "fluidity" and "responsiveness". Running
> the Doom 3 benchmark, or any other benchmark, doesn't say anything about
> responsiveness, it only measures how many frames were calculated in a
> specific period of time. How "stable" (with no stalls) those frames were
> making it to the screen is not measurable.



That should eventually be benchmarkable. This is about latency.
For example, you could run high load tasks in the background and
then launch a task that wakes up at middling/large intervals to do
something. You could measure the time it takes for it to wake up
and perform what it wants.

We have some events tracing infrastructure in the kernel that can
snapshot the wake up and sched switch events.

Having CONFIG_EVENT_TRACING=y should be sufficient for that.

You just need to mount a debugfs point, say in /debug.

Then you can activate these sched events by doing:

echo 0 > /debug/tracing/tracing_on
echo 1 > /debug/tracing/events/sched/sched_switch/enable
echo 1 > /debug/tracing/events/sched/sched_wakeup/enable

#Launch your tasks

echo 1 > /debug/tracing/tracing_on

#Wait for some time

echo 0 > /debug/tracing/tracing_on

That will require some parsing of the result in /debug/tracing/trace
to get the delays between wakeup events and switch-in events for the
task that periodically wakes up, and then producing some statistics
such as the average or the maximum latency.

That's a bit of a rough approach to measure such latencies but that
should work.
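
As a starting point, such a parser could look like the rough sketch
below. It assumes the usual human-readable trace format where a
seconds.microseconds timestamp immediately precedes the event name on
each line, and it matches the traced task simply by name (passed as the
first argument); the exact field layout of the sched events differs
between kernel versions, so treat it only as an illustration of the
idea:

/* trace-lat.c: rough parser for /debug/tracing/trace, computing
 * sched_wakeup -> sched_switch delays for one task, matched by name */
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static double event_timestamp(const char *line, const char *ev)
{
        const char *p = ev;

        while (p > line && (p[-1] == ' ' || p[-1] == ':'))
                p--;                    /* step back over ": " */
        while (p > line && (isdigit((unsigned char)p[-1]) || p[-1] == '.'))
                p--;                    /* step back over 1234.567890 */
        return atof(p);
}

int main(int argc, char **argv)
{
        const char *task = argc > 1 ? argv[1] : "mytask";
        char line[1024];
        double wake_ts = 0.0, sum = 0.0, max = 0.0;
        int pending = 0;
        long n = 0;
        FILE *f = fopen("/debug/tracing/trace", "r");

        if (!f) {
                perror("trace");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                char *ev;

                if ((ev = strstr(line, "sched_wakeup:"))) {
                        /* a wakeup of our task: remember when it happened */
                        if (strstr(ev, task)) {
                                wake_ts = event_timestamp(line, ev);
                                pending = 1;
                        }
                } else if (pending && (ev = strstr(line, "sched_switch:"))) {
                        /* the next switch *to* our task closes the interval */
                        char *to = strstr(ev, "==>");

                        if (to && strstr(to, task)) {
                                double d = event_timestamp(line, ev) - wake_ts;

                                sum += d;
                                if (d > max)
                                        max = d;
                                n++;
                                pending = 0;
                        }
                }
        }
        fclose(f);
        if (n)
                printf("%ld wakeups, avg %.3f ms, max %.3f ms\n",
                       n, 1e3 * sum / n, 1e3 * max);
        return 0;
}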


> If BFS implied small drops in pure performance, counted in
> instructions per second, that would be a totally acceptable regression
> for desktop/multimedia/gaming PCs. Not for server machines, of course.
> However, on my machine, BFS is faster in classic workloads. When I run
> "make -j2" with BFS and the standard scheduler, BFS always finishes a bit
> faster. Not by much, but still. One thing I'm noticing here is that BFS
> produces 100% CPU load on each core with "make -j2" while the normal
> scheduler stays at about 90-95% with -j2 or higher in at least one of the
> cores. There seems to be under-utilization of CPU time.



That could also be benchmarked by using the above sched events and
looking at the average time each cpu spends running the idle task.

2009-09-07 11:57:50

by Jens Axboe

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Mon, Sep 07 2009, Jens Axboe wrote:
> Scheduler Runtime Max lat Avg lat Std dev
> ----------------------------------------------------------------
> CFS 100 951 462 267
> CFS-x2 100 983 484 308
> BFS
> BFS-x2

Those numbers are buggy, btw, it's not nearly as bad. But responsiveness
under compile load IS bad though, the test app just didn't quantify it
correctly. I'll see if I can get it working properly.

--
Jens Axboe

2009-09-07 12:16:19

by Ingo Molnar

Subject: [quad core results] BFS vs. mainline scheduler benchmarks and measurements


* Frans Pop <[email protected]> wrote:

> Ingo Molnar wrote:
> > So the testbox i picked fits into the upper portion of what i
> > consider a sane range of systems to tune for - and should still fit
> > into BFS's design bracket as well according to your description:
> > it's a dual quad core system with hyperthreading.
>
> Ingo,
>
> Nice that you've looked into this.
>
> Would it be possible for you to run the same tests on e.g. a dual
> core and/or a UP system (or maybe just offline some CPUs?)? It
> would be very interesting to see whether BFS does better in the
> lower portion of the range, or if the differences you show between
> the two schedulers are consistent across the range.

Sure!

Note that usually we can extrapolate ballpark-figure quad and dual
socket results from 8 core results. Trends as drastic as the ones
i reported do not get reversed as one shrinks the number of cores.

[ This technique is not universal - for example borderline graphs
cannot be extrapolated down reliably - but the graphs i
posted were far from borderline. ]

Con posted single-socket quad comparisons/graphs so to make it 100%
apples to apples i re-tested with a single-socket (non-NUMA) quad as
well, and have uploaded the new graphs/results to:

kernel build performance on quad:
http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg

pipe performance on quad:
http://redhat.com/~mingo/misc/bfs-vs-tip-pipe-quad.jpg

messaging performance (hackbench) on quad:
http://redhat.com/~mingo/misc/bfs-vs-tip-messaging-quad.jpg

OLTP performance (postgresql + sysbench) on quad:
http://redhat.com/~mingo/misc/bfs-vs-tip-oltp-quad.jpg

It shows similar curves and behavior to the 8-core results i posted
- BFS is slower than mainline in virtually every measurement. The
ratios are different for different parts of the graphs - but the
trend is similar.

I also re-ran a few standalone kernel latency tests with a single
quad:

lat_tcp:

BFS: TCP latency using localhost: 16.9926 microseconds
sched-devel: TCP latency using localhost: 12.4141 microseconds [36.8% faster]

as a comparison, the 8 core lat_tcp result was:

BFS: TCP latency using localhost: 16.5608 microseconds
sched-devel: TCP latency using localhost: 13.5528 microseconds [22.1% faster]

lat_pipe quad result:

BFS: Pipe latency: 4.6978 microseconds
sched-devel: Pipe latency: 2.6860 microseconds [74.8% faster]

as a comparison, the 8 core lat_pipe result was:

BFS: Pipe latency: 4.9703 microseconds
sched-devel: Pipe latency: 2.6137 microseconds [90.1% faster]

On the desktop interactivity front, i also still saw that bad
starvation artifact with BFS with multiple copies of CPU-bound
pipe-test-1m.c running in parallel:

http://redhat.com/~mingo/cfs-scheduler/tools/pipe-test-1m.c

Start up a few copies of them like this:

for ((i=0;i<32;i++)); do ./pipe-test-1m & done

and the quad eventually came to a halt here - until the tasks
finished running.

I also tested a few key data points on dual core and it shows
similar trends as well (as expected from the 8 and 4 core results).

But ... i'd really encourage everyone to test these things yourself
as well and not take anyone's word on this as granted. The more
people provide numbers, the better. The latest BFS patch can be
found at:

http://ck.kolivas.org/patches/bfs/

The mainline sched-devel tree can be found at:

http://people.redhat.com/mingo/tip.git/README

Thanks,

Ingo

2009-09-07 12:37:06

by Stefan Richter

Subject: Re: [quad core results] BFS vs. mainline scheduler benchmarks and measurements

Ingo Molnar wrote:
> i'd really encourage everyone to test these things yourself
> as well and not take anyone's word on this as granted. The more
> people provide numbers, the better.

Besides mean values from bandwidth and latency focused tests, standard
deviations or variance, or e.g. 90th percentiles and perhaps maxima of
latency focused tests might be of interest. Or graphs with error bars.
--
Stefan Richter
-=====-==--= =--= --===
http://arcgraph.de/sr/

2009-09-07 13:41:51

by Markus Tornqvist

Subject: Re: [quad core results] BFS vs. mainline scheduler benchmarks and measurements

Please Cc me as I'm not a subscriber.

(LKML bounced this message once already for 8-bit headers, I'm retrying
now - sorry if someone gets it twice)

On Mon, Sep 07, 2009 at 02:16:13PM +0200, Ingo Molnar wrote:
>
>Con posted single-socket quad comparisons/graphs so to make it 100%
>apples to apples i re-tested with a single-socket (non-NUMA) quad as
>well, and have uploaded the new graphs/results to:
>
> kernel build performance on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg
[...]
>
>It shows similar curves and behavior to the 8-core results i posted
>- BFS is slower than mainline in virtually every measurement. The
>ratios are different for different parts of the graphs - but the
>trend is similar.

Dude, not cool.

1. Quad HT is not the same as a 4-core desktop, you're doing it with 8 cores
2. You just proved BFS is better on the job_count == core_count case, as BFS
says it is, if you look at the graph
3. You're comparing an old version of BFS against an unreleased dev kernel

Also, you said on http://article.gmane.org/gmane.linux.kernel/886319
"I also tried to configure the kernel in a BFS friendly way, i used
HZ=1000 as recommended, turned off all debug options, etc. The
kernel config i used can be found here:
http://redhat.com/~mingo/misc/config
"

Quickly looking at the conf you have
CONFIG_HZ_250=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set

CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y

And other DEBUG.

--
mjt

2009-09-07 13:59:36

by Ingo Molnar

Subject: Re: [quad core results] BFS vs. mainline scheduler benchmarks and measurements


* Markus Törnqvist <[email protected]> wrote:

> Please Cc me as I'm not a subscriber.
>
> On Mon, Sep 07, 2009 at 02:16:13PM +0200, Ingo Molnar wrote:
> >
> >Con posted single-socket quad comparisons/graphs so to make it 100%
> >apples to apples i re-tested with a single-socket (non-NUMA) quad as
> >well, and have uploaded the new graphs/results to:
> >
> > kernel build performance on quad:
> > http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg
> [...]
> >
> >It shows similar curves and behavior to the 8-core results i posted
> >- BFS is slower than mainline in virtually every measurement. The
> >ratios are different for different parts of the graphs - but the
> >trend is similar.
>
> Dude, not cool.
>
> 1. Quad HT is not the same as a 4-core desktop, you're doing it with 8 cores

No, it's 4 cores. HyperThreading adds two 'siblings' per core, which
are not 'cores'.

> 2. You just proved BFS is better on the job_count == core_count case, as BFS
> says it is, if you look at the graph

I pointed that out too. I think the graphs speak for themselves:

http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg
http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild.jpg

> 3. You're comparing an old version of BFS against an unreleased dev kernel

bfs-208 was 1 day old (and it is a 500K+ kernel patch) when i tested
it against the 2 days old sched-devel tree. Btw., i initially
measured 205 as well and spent one more day on acquiring and
analyzing the 208 results.

There's bfs-209 out there today. These tests take 8+ hours to
complete and validate. I'll re-test BFS in the future too, and as i
said it in the first mail i'll test it on a .31 base as well once
BFS has been ported to it:

> > It's on a .31-rc8 base while BFS is on a .30 base - will be able
> > to test BFS on a .31 base as well once you release it. (but it
> > doesnt matter much to the results - there werent any heavy core
> > kernel changes impacting these workloads.)

> Also, you said on http://article.gmane.org/gmane.linux.kernel/886319
> "I also tried to configure the kernel in a BFS friendly way, i used
> HZ=1000 as recommended, turned off all debug options, etc. The
> kernel config i used can be found here:
> http://redhat.com/~mingo/misc/config
> "
>
> Quickly looking at the conf you have
> CONFIG_HZ_250=y
> CONFIG_PREEMPT_NONE=y
> # CONFIG_PREEMPT_VOLUNTARY is not set
> # CONFIG_PREEMPT is not set

Indeed. HZ does not seem to matter according to what i see in my
measurements. Can you measure such sensitivity?

> CONFIG_ARCH_WANT_FRAME_POINTERS=y
> CONFIG_FRAME_POINTER=y
>
> And other DEBUG.

These are the defaults and they dont make a measurable difference to
these results. What other debug options do you mean and do they make
a difference?

Ingo

2009-09-07 14:15:13

by Ingo Molnar

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Jens Axboe <[email protected]> wrote:

> On Mon, Sep 07 2009, Jens Axboe wrote:
> > Scheduler Runtime Max lat Avg lat Std dev
> > ----------------------------------------------------------------
> > CFS 100 951 462 267
> > CFS-x2 100 983 484 308
> > BFS
> > BFS-x2
>
> Those numbers are buggy, btw, it's not nearly as bad. But
> responsiveness under compile load IS bad though, the test app just
> didn't quantify it correctly. I'll see if I can get it working
> properly.

What's the default latency target on your box:

cat /proc/sys/kernel/sched_latency_ns

?

And yes, it would be wonderful to get a test-app from you that would
express the kind of pain you are seeing during compile jobs.

Ingo

2009-09-07 14:37:17

by Arjan van de Ven

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Mon, 07 Sep 2009 06:38:36 +0300
Nikos Chantziaras <[email protected]> wrote:

> On 09/06/2009 11:59 PM, Ingo Molnar wrote:
> >[...]
> > Also, i'd like to outline that i agree with the general goals
> > described by you in the BFS announcement - small desktop systems
> > matter more than large systems. We find it critically important
> > that the mainline Linux scheduler performs well on those systems
> > too - and if you (or anyone else) can reproduce suboptimal behavior
> > please let the scheduler folks know so that we can fix/improve it.
>
> BFS improved behavior of many applications on my Intel Core 2 box in
> a way that can't be benchmarked. Examples:

Have you tried to see if latencytop catches such latencies ?

2009-09-07 14:41:50

by Arjan van de Ven

Subject: Re: [quad core results] BFS vs. mainline scheduler benchmarks and measurements

On Mon, 7 Sep 2009 16:41:51 +0300
> >It shows similar curves and behavior to the 8-core results i posted
> >- BFS is slower than mainline in virtually every measurement. The
> >ratios are different for different parts of the graphs - but the
> >trend is similar.
>
> Dude, not cool.
>
> 1. Quad HT is not the same as a 4-core desktop, you're doing it with
> 8 cores

4 cores, 8 threads. Which is basically the standard desktop cpu going
forward... (4 cores already is today, 8 threads is that any day now)



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-07 15:16:59

by Michael Büsch

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Here's a very simple test setup on an embedded singlecore bcm47xx machine (WL500GPv2)
It uses iperf for performance testing. The iperf server is run on the
embedded device. The device is so slow that the iperf test is completely
CPU bound. The network connection is a 100MBit on the device connected
via patch cable to a 1000MBit machine.

The kernel is openwrt-2.6.30.5.

Here are the results:



Mainline CFS scheduler:

mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 35793 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 27.4 MBytes 23.0 Mbits/sec
mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 35794 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 27.3 MBytes 22.9 Mbits/sec
mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 56147 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 27.3 MBytes 22.9 Mbits/sec


BFS scheduler:

mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 52489 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 38.2 MBytes 32.0 Mbits/sec
mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 52490 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 38.1 MBytes 31.9 Mbits/sec
mb@homer:~$ iperf -c 192.168.1.1
------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.99 port 52491 connected with 192.168.1.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 38.1 MBytes 31.9 Mbits/sec


--
Greetings, Michael.

2009-09-07 15:20:36

by Frans Pop

Subject: Re: [quad core results] BFS vs. mainline scheduler benchmarks and measurements

On Monday 07 September 2009, Arjan van de Ven wrote:
> 4 cores, 8 threads. Which is basically the standard desktop cpu going
> forward... (4 cores already is today, 8 threads is that any day now)

Despite that I'm personally more interested in what I have available here
*now*. And that's various UP Pentium systems, one dual core Pentium D and
Core Duo.

I've been running BFS on my laptop today while doing CPU intensive jobs
(not disk intensive), and I must say that BFS does seem very responsive.
OTOH, I've also noticed some surprising things, such as processors staying
on lower frequencies while doing CPU-intensive work.

It feels like I have fewer of the mouse cursor and typing freezes I'm used
to with CFS, even when I'm *not* doing anything special. I've been
blaming those on still running with ordered mode ext3, but now I'm
starting to wonder.

I'll try to do more structured testing, comparisons and measurements
later. At the very least it's nice to have something to compare _with_.

Cheers,
FJP

2009-09-07 15:24:42

by Xavier Bestel

Subject: Re: [quad core results] BFS vs. mainline scheduler benchmarks and measurements


On Mon, 2009-09-07 at 07:45 -0700, Arjan van de Ven wrote:
> On Mon, 7 Sep 2009 16:41:51 +0300
> > >It shows similar curves and behavior to the 8-core results i posted
> > >- BFS is slower than mainline in virtually every measurement. The
> > >ratios are different for different parts of the graphs - but the
> > >trend is similar.
> >
> > Dude, not cool.
> >
> > 1. Quad HT is not the same as a 4-core desktop, you're doing it with
> > 8 cores
>
> 4 cores, 8 threads. Which is basically the standard desktop cpu going
> forward... (4 cores already is today, 8 threads is that any day now)

Except on your typical smartphone, which will run linux and probably
vastly outnumber the number of "traditional" linux desktops.

Xav


2009-09-07 15:34:22

by Nikos Chantziaras

Subject: Re: [quad core results] BFS vs. mainline scheduler benchmarks and measurements

On 09/07/2009 03:16 PM, Ingo Molnar wrote:
> [...]
> Note that usually we can extrapolate ballpark-figure quad and dual
> socket results from 8 core results. Trends as drastic as the ones
> i reported do not get reversed as one shrinks the number of cores.
>
> Con posted single-socket quad comparisons/graphs so to make it 100%
> apples to apples i re-tested with a single-socket (non-NUMA) quad as
> well, and have uploaded the new graphs/results to:
>
> kernel build performance on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild-quad.jpg
>
> pipe performance on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-pipe-quad.jpg
>
> messaging performance (hackbench) on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-messaging-quad.jpg
>
> OLTP performance (postgresql + sysbench) on quad:
> http://redhat.com/~mingo/misc/bfs-vs-tip-oltp-quad.jpg
>
> It shows similar curves and behavior to the 8-core results i posted
> - BFS is slower than mainline in virtually every measurement.

Except for numbers, what's your *experience* with BFS when it comes to
composited desktops + games + multimedia apps? (Watching high
definition videos, playing some latest high-tech 3D game, etc.) I
described the exact problems experienced with mainline in a previous reply.

Are you actually using that stuff? Because it would be hard to
tell if your desktop consists mainly of Emacs and an xterm; you even
seem to be using Mutt so I suspect your desktop probably doesn't look
very Windows Vista/OS X/Compiz-like. Usually, with "multimedia desktop
PC" one doesn't mean:

http://foss.math.aegean.gr/~realnc/pics/desktop2.png

but rather:

http://foss.math.aegean.gr/~realnc/pics/desktop1.png

BFS probably wouldn't offer the former anything, while on the latter it
does make a difference. If your usage of the "desktop" bears a
resemblance to the first example, I'd say you might not be the most
qualified person to judge the "Linux desktop experience." That is not
meant to be offensive or patronizing, just an observation, and I might
even be totally wrong about it.

2009-09-07 15:33:27

by Arjan van de Ven

Subject: Re: [quad core results] BFS vs. mainline scheduler benchmarks and measurements

On Mon, 7 Sep 2009 17:20:33 +0200
Frans Pop <[email protected]> wrote:

> On Monday 07 September 2009, Arjan van de Ven wrote:
> > 4 cores, 8 threads. Which is basically the standard desktop cpu
> > going forward... (4 cores already is today, 8 threads is that any
> > day now)
>
> Despite that I'm personally more interested in what I have available
> here *now*. And that's various UP Pentium systems, one dual core
> Pentium D and Core Duo.
>
> I've been running BFS on my laptop today while doing CPU intensive
> jobs (not disk intensive), and I must say that BFS does seem very
> responsive. OTOH, I've also noticed some surprising things, such as
> processors staying on lower frequencies while doing CPU-intensive
> work.
>
> It feels like I have fewer of the mouse cursor and typing freezes I'm
> used to with CFS, even when I'm *not* doing anything special. I've
> been blaming those on still running with ordered mode ext3, but now
> I'm starting to wonder.
>
> I'll try to do more structured testing, comparisons and measurements
> later. At the very least it's nice to have something to compare
> _with_.
>

it's a shameless plug since I wrote it, but latencytop will be able to
tell you what your bottleneck is...
and that is very interesting to know, regardless of the "what scheduler
code" discussion;

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-07 15:33:52

by Arjan van de Ven

Subject: Re: [quad core results] BFS vs. mainline scheduler benchmarks and measurements

On Mon, 07 Sep 2009 17:24:29 +0200
Xavier Bestel <[email protected]> wrote:

>
> On Mon, 2009-09-07 at 07:45 -0700, Arjan van de Ven wrote:
> > On Mon, 7 Sep 2009 16:41:51 +0300
> > > >It shows similar curves and behavior to the 8-core results i
> > > >posted
> > > >- BFS is slower than mainline in virtually every measurement.
> > > >The ratios are different for different parts of the graphs - but
> > > >the trend is similar.
> > >
> > > Dude, not cool.
> > >
> > > 1. Quad HT is not the same as a 4-core desktop, you're doing it
> > > with 8 cores
> >
> > 4 cores, 8 threads. Which is basically the standard desktop cpu
> > going forward... (4 cores already is today, 8 threads is that any
> > day now)
>
> Except on your typical smartphone, which will run linux and probably
> vastly outnumber the number of "traditional" linux desktops.

yeah the trend in cellphones is only quad core without HT, not quad
core WITH ht ;-)



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-07 15:47:50

by Frans Pop

Subject: Re: [quad core results] BFS vs. mainline scheduler benchmarks and measurements

On Monday 07 September 2009, Arjan van de Ven wrote:
> it's a shameless plug since I wrote it, but latencytop will be able to
> tell you what your bottleneck is...
> and that is very interesting to know, regardless of the "what scheduler
> code" discussion;

I'm very much aware of that and I've tried pinning it down a few times,
but failed to come up with anything conclusive. I plan to make a new
effort in this context as the freezes have increasingly been annoying me.

Unfortunately latencytop only shows a blank screen when used with BFS, but
I guess that's not totally unexpected.

Cheers,
FJP

2009-09-07 15:59:01

by Diego Calleja

Subject: Re: [quad core results] BFS vs. mainline scheduler benchmarks and measurements

On Monday 07 September 2009 17:24:29, Xavier Bestel wrote:
> Except on your typical smartphone, which will run linux and probably
> vastly outnumber the number of "traditional" linux desktops.

Smartphones will probably start using ARM dual-core CPUs next year;
the embedded world is not SMP-free.

2009-09-07 17:38:45

by Jens Axboe

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Mon, Sep 07 2009, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > On Mon, Sep 07 2009, Jens Axboe wrote:
> > > Scheduler Runtime Max lat Avg lat Std dev
> > > ----------------------------------------------------------------
> > > CFS 100 951 462 267
> > > CFS-x2 100 983 484 308
> > > BFS
> > > BFS-x2
> >
> > Those numbers are buggy, btw, it's not nearly as bad. But
> > responsiveness under compile load IS bad though, the test app just
> > didn't quantify it correctly. I'll see if I can get it working
> > properly.
>
> What's the default latency target on your box:
>
> cat /proc/sys/kernel/sched_latency_ns
>
> ?

It's off right now, but it is set to whatever is the default. I don't
touch it.

> And yes, it would be wonderful to get a test-app from you that would
> express the kind of pain you are seeing during compile jobs.

I was hoping this one would, but it's not showing anything. I even added
support for doing the ping and wakeup over a socket, to see if the pipe
test was doing well because of the sync wakeup we do there. The net
latency is a little worse, but still good. So no luck in making that app
so far.

--
Jens Axboe

2009-09-07 17:56:26

by Avi Kivity

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/07/2009 12:49 PM, Jens Axboe wrote:
>
> I ran a simple test as well, since I was curious to see how it performed
> wrt interactiveness. One of my pet peeves with the current scheduler is
> that I have to nice compile jobs, or my X experience is just awful while
> the compile is running.
>

I think the problem is that CFS is optimizing for the wrong thing. It's
trying to be fair to tasks, but these are meaningless building blocks of
jobs, which is what the user sees and measures. Your make -j128
dominates your interactive task by two orders of magnitude. If the
scheduler attempts to bridge this gap using heuristics, it will fail
badly when it misdetects since it will starve the really important
100-thread job for a task that was misdetected as interactive.

I think that bash (and the GUI shell) should put any new job (for bash,
a pipeline; for the GUI, an application launch from the menu) in a
scheduling group of its own. This way it will have equal weight in the
scheduler's eyes with interactive tasks; one will not dominate the
other. Of course if the cpu is free the compile job is welcome to use
all 128 threads.

(similarly, different login sessions should be placed in different jobs
to prevent a heavily multithreaded screensaver from overwhelming ed).
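
For what it's worth, a rough sketch of that idea is possible today with
the cpu cgroup controller - the /cgroup/cpu mount point and the
job-<pid> naming below are just illustrative assumptions, not the
actual bash/GUI integration being proposed. A tiny launcher creates a
group per job, moves itself into it and execs the command, so
everything the job spawns inherits the group:

/* per-job scheduling group launcher (sketch); assumes the cpu cgroup
 * controller is already mounted at CGROUP_ROOT, e.g.:
 *   mount -t cgroup -o cpu none /cgroup/cpu
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define CGROUP_ROOT "/cgroup/cpu"

int main(int argc, char **argv)
{
        char path[256];
        FILE *f;

        if (argc < 2) {
                fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
                return 1;
        }
        /* create a fresh group for this job, named after our pid */
        snprintf(path, sizeof(path), CGROUP_ROOT "/job-%d", (int)getpid());
        if (mkdir(path, 0755) && errno != EEXIST) {
                perror("mkdir");
                return 1;
        }
        /* move ourselves into the group; the exec'ed job and all of
         * its children inherit it */
        strncat(path, "/tasks", sizeof(path) - strlen(path) - 1);
        f = fopen(path, "w");
        if (!f) {
                perror("tasks");
                return 1;
        }
        fprintf(f, "%d\n", (int)getpid());
        fclose(f);

        execvp(argv[1], &argv[1]);
        perror("execvp");
        return 1;
}

Launching 'make -j128' through such a wrapper would then compete with
an interactive task as one scheduling entity against another, rather
than as 128 runnable tasks against one.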

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-09-07 18:34:28

by Jerome Glisse

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Mon, 2009-09-07 at 13:50 +1000, Con Kolivas wrote:

> /me checks on his distributed computing client's progress, fires up
> his next H264 encode, changes music tracks and prepares to have his
> arse whooped on quakelive.
> --

For such computer usage i would strongly suggest that you look into
GPU driver development - there is a lot of performance to be won in
this area, and my feeling is that you can improve what you are doing:
games -> OpenGL (so GPU), H264 (encoding is harder to accelerate
with a GPU, but for decoding and displaying it you definitely want
to involve the GPU), and tons of other things you are doing on your
linux desktop would go faster if the GPU was put to more use. A wild
guess is that you could get a 2 or even 3 figure percentage improvement
with better GPU drivers. My point is that i don't think a linux
scheduler improvement (compared to what we have now) will give a
significant boost to the linux desktop; on the contrary, even a
slight improvement to the GPU driver stack can give you a boost.
Another way of saying that: there is no point in prioritizing X or
desktop apps if the CPU has to do all the drawing by itself (the CPU
is several orders of magnitude slower than the GPU at that kind of
task).

Regards,
Jerome Glisse

2009-09-07 18:26:38

by Ingo Molnar

Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Michael Buesch <[email protected]> wrote:

> Here's a very simple test setup on an embedded singlecore bcm47xx
> machine (WL500GPv2) It uses iperf for performance testing. The
> iperf server is run on the embedded device. The device is so slow
> that the iperf test is completely CPU bound. The network
> connection is a 100MBit on the device connected via patch cable to
> a 1000MBit machine.
>
> The kernel is openwrt-2.6.30.5.
>
> Here are the results:
>
>
>
> Mainline CFS scheduler:
>
> mb@homer:~$ iperf -c 192.168.1.1
> ------------------------------------------------------------
> Client connecting to 192.168.1.1, TCP port 5001
> TCP window size: 16.0 KByte (default)
> ------------------------------------------------------------
> [ 3] local 192.168.1.99 port 35793 connected with 192.168.1.1 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 27.4 MBytes 23.0 Mbits/sec
> mb@homer:~$ iperf -c 192.168.1.1
> ------------------------------------------------------------
> Client connecting to 192.168.1.1, TCP port 5001
> TCP window size: 16.0 KByte (default)
> ------------------------------------------------------------
> [ 3] local 192.168.1.99 port 35794 connected with 192.168.1.1 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 27.3 MBytes 22.9 Mbits/sec
> mb@homer:~$ iperf -c 192.168.1.1
> ------------------------------------------------------------
> Client connecting to 192.168.1.1, TCP port 5001
> TCP window size: 16.0 KByte (default)
> ------------------------------------------------------------
> [ 3] local 192.168.1.99 port 56147 connected with 192.168.1.1 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 27.3 MBytes 22.9 Mbits/sec
>
>
> BFS scheduler:
>
> mb@homer:~$ iperf -c 192.168.1.1
> ------------------------------------------------------------
> Client connecting to 192.168.1.1, TCP port 5001
> TCP window size: 16.0 KByte (default)
> ------------------------------------------------------------
> [ 3] local 192.168.1.99 port 52489 connected with 192.168.1.1 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 38.2 MBytes 32.0 Mbits/sec
> mb@homer:~$ iperf -c 192.168.1.1
> ------------------------------------------------------------
> Client connecting to 192.168.1.1, TCP port 5001
> TCP window size: 16.0 KByte (default)
> ------------------------------------------------------------
> [ 3] local 192.168.1.99 port 52490 connected with 192.168.1.1 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 38.1 MBytes 31.9 Mbits/sec
> mb@homer:~$ iperf -c 192.168.1.1
> ------------------------------------------------------------
> Client connecting to 192.168.1.1, TCP port 5001
> TCP window size: 16.0 KByte (default)
> ------------------------------------------------------------
> [ 3] local 192.168.1.99 port 52491 connected with 192.168.1.1 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 38.1 MBytes 31.9 Mbits/sec

That's interesting. I tried to reproduce it on x86, but the profile
does not show any scheduler overhead at all on the server:

$ perf report

#
# Samples: 8369
#
# Overhead Symbol
# ........ ......
#
9.20% [k] copy_user_generic_string
3.80% [k] e1000_clean
3.58% [k] ipt_do_table
2.72% [k] mwait_idle
2.68% [k] nf_iterate
2.28% [k] e1000_intr
2.15% [k] tcp_packet
2.10% [k] __hash_conntrack
1.59% [k] read_tsc
1.52% [k] _local_bh_enable_ip
1.34% [k] eth_type_trans
1.29% [k] __alloc_skb
1.19% [k] tcp_recvmsg
1.19% [k] ip_rcv
1.17% [k] e1000_clean_rx_irq
1.12% [k] apic_timer_interrupt
0.99% [k] vsnprintf
0.96% [k] nf_conntrack_in
0.96% [k] kmem_cache_free
0.93% [k] __kmalloc_track_caller


Could you profile it please? Also, what's the context-switch rate?

Below is the call-graph profile as well - all the overhead is in
networking and SLAB.

Ingo

$ perf report --call-graph fractal,5

#
# Samples: 8947
#
# Overhead Command Shared Object Symbol
# ........ .............. ............................. ......
#
9.06% iperf [kernel] [k] copy_user_generic_string
|
|--98.89%-- skb_copy_datagram_iovec
| |
| |--77.18%-- tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --22.82%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--1.11%-- system_call_fastpath
__GI___libc_nanosleep

3.62% [init] [kernel] [k] e1000_clean
2.96% [init] [kernel] [k] ipt_do_table
2.79% [init] [kernel] [k] mwait_idle
2.22% [init] [kernel] [k] e1000_intr
1.93% [init] [kernel] [k] nf_iterate
1.65% [init] [kernel] [k] __hash_conntrack
1.52% [init] [kernel] [k] tcp_packet
1.29% [init] [kernel] [k] ip_rcv
1.18% [init] [kernel] [k] __alloc_skb
1.15% iperf [kernel] [k] tcp_recvmsg

1.04% [init] [kernel] [k] _local_bh_enable_ip
1.02% [init] [kernel] [k] apic_timer_interrupt
1.02% [init] [kernel] [k] eth_type_trans
1.01% [init] [kernel] [k] tcp_v4_rcv
0.96% iperf [kernel] [k] kfree
|
|--95.35%-- skb_release_data
| __kfree_skb
| |
| |--79.27%-- tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --20.73%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--4.65%-- __kfree_skb
|
|--75.00%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--25.00%-- tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.96% [init] [kernel] [k] read_tsc
0.92% iperf [kernel] [k] tcp_v4_do_rcv
|
|--95.12%-- tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--4.88%-- tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.92% [init] [kernel] [k] e1000_clean_rx_irq
0.86% iperf [kernel] [k] tcp_rcv_established
|
|--96.10%-- tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--3.90%-- tcp_prequeue_process
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.84% iperf [kernel] [k] kmem_cache_free
|
|--93.33%-- __kfree_skb
| |
| |--71.43%-- tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --28.57%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--4.00%-- tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--2.67%-- tcp_rcv_established
tcp_v4_do_rcv
tcp_prequeue_process
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.80% [init] [kernel] [k] netif_receive_skb
0.79% iperf [kernel] [k] tcp_event_data_recv
|
|--83.10%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--12.68%-- tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--4.23%-- tcp_data_queue
tcp_rcv_established
tcp_v4_do_rcv
tcp_prequeue_process
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.67% perf [kernel] [k] format_decode
|
|--91.67%-- vsnprintf
| seq_printf
| |
| |--67.27%-- show_map_vma
| | show_map
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--23.64%-- render_sigset_t
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--7.27%-- proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| --1.82%-- cpuset_task_status_allowed
| proc_pid_status
| proc_single_show
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--8.33%-- seq_printf
|
|--60.00%-- proc_pid_status
| proc_single_show
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--40.00%-- show_map_vma
show_map
seq_read
vfs_read
sys_read
system_call_fastpath
__GI_read

0.65% [init] [kernel] [k] __kmalloc_track_caller
0.63% [init] [kernel] [k] nf_conntrack_in
0.63% [init] [kernel] [k] ip_route_input
0.58% perf [kernel] [k] vsnprintf
|
|--98.08%-- seq_printf
| |
| |--60.78%-- show_map_vma
| | show_map
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--19.61%-- render_sigset_t
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--9.80%-- proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--3.92%-- task_mem
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| |--3.92%-- cpuset_task_status_allowed
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| --1.96%-- render_cap_t
| proc_pid_status
| proc_single_show
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--1.92%-- snprintf
proc_task_readdir
vfs_readdir
sys_getdents
system_call_fastpath
__getdents64
0x69706565000a3430

0.57% [init] [kernel] [k] ktime_get
0.57% [init] [kernel] [k] nf_nat_fn
0.56% iperf [kernel] [k] tcp_packet
|
|--68.00%-- __tcp_ack_snd_check
| tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--32.00%-- tcp_cleanup_rbuf
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.56% iperf /usr/bin/iperf [.] 0x000000000059f8
|
|--8.00%-- 0x4059f8
|
|--8.00%-- 0x405a16
|
|--8.00%-- 0x4059fd
|
|--4.00%-- 0x409d22
|
|--4.00%-- 0x405871
|
|--4.00%-- 0x406ee1
|
|--4.00%-- 0x405726
|
|--4.00%-- 0x4058db
|
|--4.00%-- 0x406ee8
|
|--2.00%-- 0x405b60
|
|--2.00%-- 0x4058fd
|
|--2.00%-- 0x4058d5
|
|--2.00%-- 0x405490
|
|--2.00%-- 0x4058bb
|
|--2.00%-- 0x405b93
|
|--2.00%-- 0x405b8e
|
|--2.00%-- 0x405903
|
|--2.00%-- 0x405ba8
|
|--2.00%-- 0x406eae
|
|--2.00%-- 0x405545
|
|--2.00%-- 0x405870
|
|--2.00%-- 0x405b67
|
|--2.00%-- 0x4058ce
|
|--2.00%-- 0x40570e
|
|--2.00%-- 0x406ee4
|
|--2.00%-- 0x405a02
|
|--2.00%-- 0x406eec
|
|--2.00%-- 0x405b82
|
|--2.00%-- 0x40556a
|
|--2.00%-- 0x405755
|
|--2.00%-- 0x405a0a
|
|--2.00%-- 0x405498
|
|--2.00%-- 0x409d20
|
|--2.00%-- 0x405b21
|
--2.00%-- 0x405a2c

0.56% [init] [kernel] [k] kmem_cache_alloc
0.56% [init] [kernel] [k] __inet_lookup_established
0.55% perf [kernel] [k] number
|
|--95.92%-- vsnprintf
| |
| |--97.87%-- seq_printf
| | |
| | |--56.52%-- show_map_vma
| | | show_map
| | | seq_read
| | | vfs_read
| | | sys_read
| | | system_call_fastpath
| | | __GI_read
| | |
| | |--28.26%-- render_sigset_t
| | | proc_pid_status
| | | proc_single_show
| | | seq_read
| | | vfs_read
| | | sys_read
| | | system_call_fastpath
| | | __GI_read
| | |
| | |--6.52%-- proc_pid_status
| | | proc_single_show
| | | seq_read
| | | vfs_read
| | | sys_read
| | | system_call_fastpath
| | | __GI_read
| | |
| | |--4.35%-- render_cap_t
| | | proc_pid_status
| | | proc_single_show
| | | seq_read
| | | vfs_read
| | | sys_read
| | | system_call_fastpath
| | | __GI_read
| | |
| | --4.35%-- task_mem
| | proc_pid_status
| | proc_single_show
| | seq_read
| | vfs_read
| | sys_read
| | system_call_fastpath
| | __GI_read
| |
| --2.13%-- scnprintf
| bitmap_scnlistprintf
| seq_bitmap_list
| cpuset_task_status_allowed
| proc_pid_status
| proc_single_show
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--4.08%-- seq_printf
|
|--50.00%-- show_map_vma
| show_map
| seq_read
| vfs_read
| sys_read
| system_call_fastpath
| __GI_read
|
--50.00%-- render_sigset_t
proc_pid_status
proc_single_show
seq_read
vfs_read
sys_read
system_call_fastpath
__GI_read

0.55% [init] [kernel] [k] native_sched_clock
0.50% iperf [kernel] [k] e1000_xmit_frame
|
|--71.11%-- __tcp_ack_snd_check
| tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--28.89%-- tcp_cleanup_rbuf
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.50% iperf [kernel] [k] ipt_do_table
|
|--37.78%-- ipt_local_hook
| nf_iterate
| nf_hook_slow
| __ip_local_out
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_send_ack
| |
| |--58.82%-- __tcp_ack_snd_check
| | tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --41.18%-- tcp_cleanup_rbuf
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--31.11%-- ipt_post_routing_hook
| nf_iterate
| nf_hook_slow
| ip_output
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_send_ack
| |
| |--64.29%-- __tcp_ack_snd_check
| | tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --35.71%-- tcp_cleanup_rbuf
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--20.00%-- ipt_local_out_hook
| nf_iterate
| nf_hook_slow
| __ip_local_out
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_send_ack
| |
| |--88.89%-- __tcp_ack_snd_check
| | tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --11.11%-- tcp_cleanup_rbuf
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--6.67%-- nf_iterate
| nf_hook_slow
| |
| |--66.67%-- ip_output
| | ip_local_out
| | ip_queue_xmit
| | tcp_transmit_skb
| | tcp_send_ack
| | tcp_cleanup_rbuf
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --33.33%-- __ip_local_out
| ip_local_out
| ip_queue_xmit
| tcp_transmit_skb
| tcp_send_ack
| __tcp_ack_snd_check
| tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--2.22%-- ipt_local_in_hook
| nf_iterate
| nf_hook_slow
| ip_local_deliver
| ip_rcv_finish
| ip_rcv
| netif_receive_skb
| napi_skb_finish
| napi_gro_receive
| e1000_receive_skb
| e1000_clean_rx_irq
| e1000_clean
| net_rx_action
| __do_softirq
| call_softirq
| do_softirq
| irq_exit
| do_IRQ
| ret_from_intr
| vgettimeofday
|
--2.22%-- ipt_pre_routing_hook
nf_iterate
nf_hook_slow
ip_rcv
netif_receive_skb
napi_skb_finish
napi_gro_receive
e1000_receive_skb
e1000_clean_rx_irq
e1000_clean
net_rx_action
__do_softirq
call_softirq
do_softirq
irq_exit
do_IRQ
ret_from_intr
__GI___libc_nanosleep

0.50% iperf [kernel] [k] schedule
|
|--57.78%-- do_nanosleep
| hrtimer_nanosleep
| sys_nanosleep
| system_call_fastpath
| __GI___libc_nanosleep
|
|--33.33%-- schedule_timeout
| sk_wait_data
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--6.67%-- hrtimer_nanosleep
| sys_nanosleep
| system_call_fastpath
| __GI___libc_nanosleep
|
--2.22%-- sk_wait_data
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.49% iperf [kernel] [k] tcp_transmit_skb
|
|--97.73%-- tcp_send_ack
| |
| |--83.72%-- __tcp_ack_snd_check
| | tcp_rcv_established
| | tcp_v4_do_rcv
| | |
| | |--97.22%-- tcp_prequeue_process
| | | tcp_recvmsg
| | | sock_common_recvmsg
| | | __sock_recvmsg
| | | sock_recvmsg
| | | sys_recvfrom
| | | system_call_fastpath
| | | __recv
| | |
| | --2.78%-- release_sock
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --16.28%-- tcp_cleanup_rbuf
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--2.27%-- __tcp_ack_snd_check
tcp_rcv_established
tcp_v4_do_rcv
tcp_prequeue_process
tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv

0.49% [init] [kernel] [k] nf_hook_slow
0.48% iperf [kernel] [k] virt_to_head_page
|
|--53.49%-- kfree
| skb_release_data
| __kfree_skb
| |
| |--65.22%-- tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --34.78%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--18.60%-- skb_release_data
| __kfree_skb
| |
| |--62.50%-- tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --37.50%-- tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
|--18.60%-- kmem_cache_free
| __kfree_skb
| |
| |--62.50%-- tcp_rcv_established
| | tcp_v4_do_rcv
| | tcp_prequeue_process
| | tcp_recvmsg
| | sock_common_recvmsg
| | __sock_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| |
| --37.50%-- tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--9.30%-- __kfree_skb
|
|--75.00%-- tcp_rcv_established
| tcp_v4_do_rcv
| tcp_prequeue_process
| tcp_recvmsg
| sock_common_recvmsg
| __sock_recvmsg
| sock_recvmsg
| sys_recvfrom
| system_call_fastpath
| __recv
|
--25.00%-- tcp_recvmsg
sock_common_recvmsg
__sock_recvmsg
sock_recvmsg
sys_recvfrom
system_call_fastpath
__recv
...

2009-09-07 18:46:39

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Mon, Sep 07 2009, Avi Kivity wrote:
> On 09/07/2009 12:49 PM, Jens Axboe wrote:
>>
>> I ran a simple test as well, since I was curious to see how it performed
>> wrt interactiveness. One of my pet peeves with the current scheduler is
>> that I have to nice compile jobs, or my X experience is just awful while
>> the compile is running.
>>
>
> I think the problem is that CFS is optimizing for the wrong thing. It's
> trying to be fair to tasks, but these are meaningless building blocks of
> jobs, which is what the user sees and measures. Your make -j128
> dominates your interactive task by two orders of magnitude. If the
> scheduler attempts to bridge this gap using heuristics, it will fail
> badly when it misdetects since it will starve the really important
> 100-thread job for a task that was misdetected as interactive.

Agree, I was actually looking into doing joint latency for X number of
tasks for the test app. I'll try and do that and see if we can detect
something from that.

--
Jens Axboe

2009-09-07 18:47:24

by Daniel Walker

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Mon, 2009-09-07 at 20:26 +0200, Ingo Molnar wrote:
> That's interesting. I tried to reproduce it on x86, but the profile
> does not show any scheduler overhead at all on the server:

If the scheduler isn't running the task which causes the lower
throughput, would that even show up in profiling output?

Daniel

2009-09-07 18:51:15

by Michael Büsch

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Monday 07 September 2009 20:26:29 Ingo Molnar wrote:
> Could you profile it please? Also, what's the context-switch rate?

As far as I can tell, the broadcom mips architecture does not have profiling support.
It does only have some proprietary profiling registers that nobody wrote kernel
support for, yet.

--
Greetings, Michael.

2009-09-07 20:36:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Jens Axboe <[email protected]> wrote:

> Agree, I was actually looking into doing joint latency for X
> number of tasks for the test app. I'll try and do that and see if
> we can detect something from that.

Could you please try latest -tip:

http://people.redhat.com/mingo/tip.git/README

(c26f010 or later)

Does it get any better with make -j128 build jobs? Peter just fixed
a bug in the SMP load-balancer that can cause interactivity problems
on large CPU count systems.

Ingo

2009-09-07 20:44:58

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Mon, Sep 07 2009, Jens Axboe wrote:
> > And yes, it would be wonderful to get a test-app from you that would
> > express the kind of pain you are seeing during compile jobs.
>
> I was hoping this one would, but it's not showing anything. I even added
> support for doing the ping and wakeup over a socket, to see if the pipe
> test was doing well because of the sync wakeup we do there. The net
> latency is a little worse, but still good. So no luck in making that app
> so far.

Here's a version that bounces timestamps between a producer and a number
of consumers (clients). Not really tested much, but perhaps someone can
compare this on a box that boots BFS and see what happens.

To run it, use -cX where X is the number of children that you wait for a
response from. The max delay among these children is logged for each
wakeup. You can invoke it a la:

$ ./latt -c4 'make -j4'

and it'll dump the max/avg/stddev bounce time after make has completed,
or if you just want to play around, start the compile in one xterm and
do:

$ ./latt -c4 'sleep 5'

to just log for a small period of time. Vary the number of clients to
see how that changes the aggregated latency. 1 should be fast, adding
more clients quickly adds up.

Additionally, it has -f and -t options that control the window of sleep
time for the parent between each message. The numbers are in msecs, and
it defaults to a minimum of 100 msecs and a maximum of 500 msecs.
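
For a concrete picture of what this timestamp bounce does, here is a
minimal standalone sketch in C. It is illustrative only - not the
attached latt.c - and the client count, round count and 100-500 msec
sleep window below are assumptions taken from the description above:

/*
 * Sketch of a latt-style wakeup-latency bounce: the parent writes a
 * timestamp to each of N children over a pipe, every child replies with
 * the delay it observed, and the parent logs the worst delay per round.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <sys/wait.h>

#define NCLIENTS 4
#define ROUNDS   50

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	int to_child[NCLIENTS][2], to_parent[NCLIENTS][2];
	int i, r;

	srand(getpid());

	for (i = 0; i < NCLIENTS; i++) {
		pipe(to_child[i]);
		pipe(to_parent[i]);
		if (fork() == 0) {
			/* child: report how late each timestamp arrived */
			long long sent, delay;

			for (r = 0; r < ROUNDS; r++) {
				read(to_child[i][0], &sent, sizeof(sent));
				delay = now_ns() - sent;
				write(to_parent[i][1], &delay, sizeof(delay));
			}
			_exit(0);
		}
	}

	for (r = 0; r < ROUNDS; r++) {
		long long max_delay = 0, stamp, delay;

		for (i = 0; i < NCLIENTS; i++) {
			stamp = now_ns();
			write(to_child[i][1], &stamp, sizeof(stamp));
		}
		for (i = 0; i < NCLIENTS; i++) {
			read(to_parent[i][0], &delay, sizeof(delay));
			if (delay > max_delay)
				max_delay = delay;
		}
		printf("round %2d: max wakeup delay %lld usec\n",
		       r, max_delay / 1000);

		/* sleep 100-500 msec between messages, like the -f/-t window */
		usleep((100 + rand() % 400) * 1000);
	}

	while (wait(NULL) > 0)
		;
	return 0;
}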

--
Jens Axboe


Attachments:
(No filename) (1.42 kB)
latt.c (5.43 kB)

2009-09-07 20:46:44

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Mon, Sep 07 2009, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > Agree, I was actually looking into doing joint latency for X
> > number of tasks for the test app. I'll try and do that and see if
> > we can detect something from that.
>
> Could you please try latest -tip:
>
> http://people.redhat.com/mingo/tip.git/README
>
> (c26f010 or later)
>
> Does it get any better with make -j128 build jobs? Peter just fixed

The compile 'problem' is on my workstation, which is a dual core Intel
core 2. I use -j4 on that typically. On the bigger boxes, I don't notice
any interactivity problems, largely because I don't run anything latency
sensitive on those :-)

> a bug in the SMP load-balancer that can cause interactivity problems
> on large CPU count systems.

Worth trying on the dual core box?

--
Jens Axboe

2009-09-07 20:57:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Michael Buesch <[email protected]> wrote:

> On Monday 07 September 2009 20:26:29 Ingo Molnar wrote:
> > Could you profile it please? Also, what's the context-switch rate?
>
> As far as I can tell, the broadcom mips architecture does not have
> profiling support. It does only have some proprietary profiling
> registers that nobody wrote kernel support for, yet.

Well, what does 'vmstat 1' show - how many context switches are
there per second on the iperf server? In theory if it's a truly
saturated box, there shouldnt be many - just a single iperf task
running at 100% CPU utilization or so.

(Also, if there's hrtimer support for that board then perfcounters
could be used to profile it.)

Ingo

2009-09-07 21:03:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Mon, 2009-09-07 at 22:46 +0200, Jens Axboe wrote:
> > a bug in the SMP load-balancer that can cause interactivity problems
> > on large CPU count systems.
>
> Worth trying on the dual core box?

I debugged the issue on a dual core :-)

It should be more pronounced on larger machines, but its present on
dual-core too.

2009-09-07 21:05:50

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Mon, Sep 07 2009, Peter Zijlstra wrote:
> On Mon, 2009-09-07 at 22:46 +0200, Jens Axboe wrote:
> > > a bug in the SMP load-balancer that can cause interactivity problems
> > > on large CPU count systems.
> >
> > Worth trying on the dual core box?
>
> I debugged the issue on a dual core :-)
>
> It should be more pronounced on larger machines, but its present on
> dual-core too.

Alright, I'll upgrade that box to -tip tomorrow and see if it makes
a noticable difference. At -j4 or higher, I can literally see windows
slowly popping up when switching to a different virtual desktop.

--
Jens Axboe

2009-09-07 22:18:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Jens Axboe <[email protected]> wrote:

> On Mon, Sep 07 2009, Peter Zijlstra wrote:
> > On Mon, 2009-09-07 at 22:46 +0200, Jens Axboe wrote:
> > > > a bug in the SMP load-balancer that can cause interactivity problems
> > > > on large CPU count systems.
> > >
> > > Worth trying on the dual core box?
> >
> > I debugged the issue on a dual core :-)
> >
> > It should be more pronounced on larger machines, but its present on
> > dual-core too.
>
> Alright, I'll upgrade that box to -tip tomorrow and see if it
> makes a noticable difference. At -j4 or higher, I can literally
> see windows slowly popping up when switching to a different
> virtual desktop.

btw., if you run -tip and have these enabled:

CONFIG_PERF_COUNTER=y
CONFIG_EVENT_TRACING=y

cd tools/perf/
make -j install

... then you can use a couple of new perfcounters features to
measure scheduler latencies. For example:

perf stat -e sched:sched_stat_wait -e task-clock ./hackbench 20

Will tell you how many times this workload got delayed by waiting
for CPU time.

You can repeat the workload as well and see the statistical
properties of those metrics:

aldebaran:/home/mingo> perf stat --repeat 10 -e \
sched:sched_stat_wait:r -e task-clock ./hackbench 20
Time: 0.251
Time: 0.214
Time: 0.254
Time: 0.278
Time: 0.245
Time: 0.308
Time: 0.242
Time: 0.222
Time: 0.268
Time: 0.244

Performance counter stats for './hackbench 20' (10 runs):

59826 sched:sched_stat_wait # 0.026 M/sec ( +- 5.540% )
2280.099643 task-clock-msecs # 7.525 CPUs ( +- 1.620% )

0.303013390 seconds time elapsed ( +- 3.189% )

To get scheduling events, do:

# perf list 2>&1 | grep sched:
sched:sched_kthread_stop [Tracepoint event]
sched:sched_kthread_stop_ret [Tracepoint event]
sched:sched_wait_task [Tracepoint event]
sched:sched_wakeup [Tracepoint event]
sched:sched_wakeup_new [Tracepoint event]
sched:sched_switch [Tracepoint event]
sched:sched_migrate_task [Tracepoint event]
sched:sched_process_free [Tracepoint event]
sched:sched_process_exit [Tracepoint event]
sched:sched_process_wait [Tracepoint event]
sched:sched_process_fork [Tracepoint event]
sched:sched_signal_send [Tracepoint event]
sched:sched_stat_wait [Tracepoint event]
sched:sched_stat_sleep [Tracepoint event]
sched:sched_stat_iowait [Tracepoint event]

stat_wait/sleep/iowait would be the interesting ones, for latency
analysis.

Or, if you want to see all the specific delays and want to see
min/max/avg, you can do:

perf record -e sched:sched_stat_wait:r -f -R -c 1 ./hackbench 20
perf trace

Ingo

2009-09-07 23:57:16

by Pekka Pietikäinen

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
> > > Could you profile it please? Also, what's the context-switch rate?
> >
> > As far as I can tell, the broadcom mips architecture does not have
> > profiling support. It does only have some proprietary profiling
> > registers that nobody wrote kernel support for, yet.
> Well, what does 'vmstat 1' show - how many context switches are
> there per second on the iperf server? In theory if it's a truly
> saturated box, there shouldnt be many - just a single iperf task
Yay, finally something that's measurable in this thread \o/

Gigabit Ethernet iperf on an Atom or so might be something that
shows similar effects yet is debuggable. Anyone feel like taking a shot?

That beast doing iperf probably ends up making it go quite close to its
limits (IO, mem bw, cpu). IIRC the routing/bridging performance is
something like 40Mbps (depends a lot on the model, corresponds pretty
well with the MHz of the beast).

Maybe not totally unlike what make -j16 does to a 1-4 core box?

2009-09-07 23:54:31

by Thomas Fjellstrom

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Sun September 6 2009, Nikos Chantziaras wrote:
> On 09/06/2009 11:59 PM, Ingo Molnar wrote:
> >[...]
> > Also, i'd like to outline that i agree with the general goals
> > described by you in the BFS announcement - small desktop systems
> > matter more than large systems. We find it critically important
> > that the mainline Linux scheduler performs well on those systems
> > too - and if you (or anyone else) can reproduce suboptimal behavior
> > please let the scheduler folks know so that we can fix/improve it.
>
> BFS improved behavior of many applications on my Intel Core 2 box in a
> way that can't be benchmarked. Examples:
>
> mplayer using OpenGL renderer doesn't drop frames anymore when dragging
> and dropping the video window around in an OpenGL composited desktop
> (KDE 4.3.1). (Start moving the mplayer window around; then drop it. At
> the moment the move starts and at the moment you drop the window back to
> the desktop, there's a big frame skip as if mplayer was frozen for a
> bit; around 200 or 300ms.)
>
> Composite desktop effects like zoom and fade out don't stall for
> sub-second periods of time while there's CPU load in the background. In
> other words, the desktop is more fluid and less skippy even during heavy
> CPU load. Moving windows around with CPU load in the background doesn't
> result in short skips.
>
> LMMS (a tool utilizing real-time sound synthesis) does not produce
> "pops", "crackles" and drops in the sound during real-time playback due
> to buffer under-runs. Those problems amplify when there's heavy CPU
> load in the background, while with BFS heavy load doesn't produce those
> artifacts (though LMMS makes itself run SCHED_ISO with BFS) Also,
> hitting a key on the keyboard needs less time for the note to become
> audible when using BFS. Same should hold true for other tools who
> traditionally benefit from the "-rt" kernel sources.
>
> Games like Doom 3 and such don't "freeze" periodically for small amounts
> of time (again for sub-second amounts) when something in the background
> grabs CPU time (be it my mailer checking for new mail or a cron job, or
> whatever.)
>
> And, the most drastic improvement here, with BFS I can do a "make -j2"
> in the kernel tree and the GUI stays fluid. Without BFS, things start
> to lag, even with in-RAM builds (like having the whole kernel tree
> inside a tmpfs) and gcc running with nice 19 and ionice -c 3.
>
> Unfortunately, I can't come up with any way to somehow benchmark all of
> this. There's no benchmark for "fluidity" and "responsiveness".
> Running the Doom 3 benchmark, or any other benchmark, doesn't say
> anything about responsiveness, it only measures how many frames were
> calculated in a specific period of time. How "stable" (with no stalls)
> those frames were making it to the screen is not measurable.
>
> If BFS would imply small drops in pure performance counted in
> instructions per seconds, that would be a totally acceptable regression
> for desktop/multimedia/gaming PCs. Not for server machines, of course.
> However, on my machine, BFS is faster in classic workloads. When I
> run "make -j2" with BFS and the standard scheduler, BFS always finishes
> a bit faster. Not by much, but still. One thing I'm noticing here is
> that BFS produces 100% CPU load on each core with "make -j2" while the
> normal scheduler stays at about 90-95% with -j2 or higher in at least
> one of the cores. There seems to be under-utilization of CPU time.
>
> Also, by searching around the net but also through discussions on
> various mailing lists, there seems to be a trend: the problems for some
> reason seem to occur more often with Intel CPUs (Core 2 chips and lower;
> I can't say anything about Core I7) while people on AMD CPUs mostly not
> being affected by most or even all of the above. (And due to this flame
> wars often break out, with one party accusing the other of imagining
> things). Can the integrated memory controller on AMD chips have
> something to do with this? Do AMD chips generally offer better
> "multithreading" behavior? Unfortunately, you didn't mention on what CPU
> you ran your tests. If it was AMD, it might be a good idea to run tests
> on Pentium and Core 2 CPUs.
>
> For reference, my system is:
>
> CPU: Intel Core 2 Duo E6600 (2.4GHz)
> Mainboard: Asus P5E (Intel X38 chipset)
> RAM: 6GB (2+2+1+1) dual channel DDR2 800
> GPU: RV770 (Radeon HD4870).
>

My Phenom 9550 (2.2GHz) whips the pants off my Intel Q6600 (2.6GHz). A
friend of mine and I both get large amounts of stalling when doing a lot
of IO. I haven't seen such horrible desktop interactivity since before
the new schedulers and the -ck patchset came out for 2.4.x. It's a heck
of a lot better on my AMD Phenoms, but some lag is noticeable these
days, even though it wasn't a few kernel releases ago.

Intel Specs:
CPU: Intel Core 2 Quad Q6600 (2.6GHz)
Mainboard: ASUS P5K-SE (Intel P35 iirc)
RAM: 4G 800MHz DDR2 dual channel (4x1G)
GPU: NVidia 8800GTS 320M

AMD Specs:
CPU: AMD Phenom I 9550 (2.2GHz)
Mainboard: Gigabyte MA78GM-S2H
RAM: 4G 800MHz DDR2 dual channel (2x2G)
GPU: Onboard Radeon 3200HD

AMD Specs x2:
CPU: AMD Phenom II 810 (2.6GHz)
Mainboard: Gigabyte MA790FXT-UD5P
RAM: 4G 1066MHz DDR3 dual channel (2x2G)
GPU: NVidia 8800GTS 320M (or currently a 8400GS)

Of course I get better performance out of the Phenom II than either
other box, but it surprises me that I'd get more out of the budget AMD
box than out of the not-so-budget Intel box.

--
Thomas Fjellstrom
[email protected]

2009-09-08 07:19:08

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/07/2009 05:40 PM, Arjan van de Ven wrote:
> On Mon, 07 Sep 2009 06:38:36 +0300
> Nikos Chantziaras<[email protected]> wrote:
>
>> On 09/06/2009 11:59 PM, Ingo Molnar wrote:
>>> [...]
>>> Also, i'd like to outline that i agree with the general goals
>>> described by you in the BFS announcement - small desktop systems
>>> matter more than large systems. We find it critically important
>>> that the mainline Linux scheduler performs well on those systems
>>> too - and if you (or anyone else) can reproduce suboptimal behavior
>>> please let the scheduler folks know so that we can fix/improve it.
>>
>> BFS improved behavior of many applications on my Intel Core 2 box in
>> a way that can't be benchmarked. Examples:
>
> Have you tried to see if latencytop catches such latencies ?

I've just tried it.

I start latencytop and then mplayer on a video that doesn't max out the
CPU (it needs about 20-30% of a single core, out of the 2 available).
Then, while the video is playing, I press Alt+Tab repeatedly, which
makes the desktop compositor kick in and stay active (it lays out all
windows as a "flip-switch", similar to the Microsoft Vista Aero alt+tab
effect). Repeatedly pressing alt+tab keeps the compositor (in this case
KDE 4.3.1) busy processing. With the mainline scheduler, mplayer starts
dropping frames and skipping sound like crazy for the whole duration of
this exercise.

latencytop has this to say:

http://foss.math.aegean.gr/~realnc/pics/latop1.png

Though I don't really understand what this tool is trying to tell me, I
hope someone does.

2009-09-08 07:48:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Ingo Molnar <[email protected]> wrote:

> That's interesting. I tried to reproduce it on x86, but the
> profile does not show any scheduler overhead at all on the server:

I've now simulated a saturated iperf server by adding a udelay(3000)
to e1000_intr(), via the patch below.

There's no idle time left that way:

Cpu(s): 0.0%us, 2.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 93.2%hi, 4.2%si, 0.0%st
Mem: 1021044k total, 93400k used, 927644k free, 5068k buffers
Swap: 8193140k total, 0k used, 8193140k free, 25404k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1604 mingo 20 0 38300 956 724 S 99.4 0.1 3:15.07 iperf
727 root 15 -5 0 0 0 S 0.2 0.0 0:00.41 kondemand/0
1226 root 20 0 6452 336 240 S 0.2 0.0 0:00.06 irqbalance
1387 mingo 20 0 78872 1988 1300 S 0.2 0.2 0:00.23 sshd
1657 mingo 20 0 12752 1128 800 R 0.2 0.1 0:01.34 top
1 root 20 0 10320 684 572 S 0.0 0.1 0:01.79 init
2 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kthreadd

And the server is only able to saturate half of the 1 gigabit
bandwidth:

Client connecting to t, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.1.19 port 50836 connected with 10.0.1.14 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 504 MBytes 423 Mbits/sec
------------------------------------------------------------
Client connecting to t, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.1.19 port 50837 connected with 10.0.1.14 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 502 MBytes 420 Mbits/sec


perf top is showing:

------------------------------------------------------------------------------
PerfTop: 28517 irqs/sec kernel:99.4% [100000 cycles], (all, 1 CPUs)
------------------------------------------------------------------------------

samples pcnt kernel function
_______ _____ _______________

139553.00 - 93.2% : delay_tsc
2098.00 - 1.4% : hmac_digest
561.00 - 0.4% : ip_call_ra_chain
335.00 - 0.2% : neigh_alloc
279.00 - 0.2% : __hash_conntrack
257.00 - 0.2% : dev_activate
186.00 - 0.1% : proc_tcp_available_congestion_control
178.00 - 0.1% : e1000_get_regs
167.00 - 0.1% : tcp_event_data_recv

delay_tsc() dominates, as expected. Still zero scheduler overhead
and the context-switch rate is well below 1000 per sec.

Then i booted v2.6.30 vanilla, added the udelay(3000) and got:

[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47026
[ 5] 0.0-10.0 sec 493 MBytes 412 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47027
[ 4] 0.0-10.0 sec 520 MBytes 436 Mbits/sec
[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47028
[ 5] 0.0-10.0 sec 506 MBytes 424 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 47029
[ 4] 0.0-10.0 sec 496 MBytes 415 Mbits/sec

i.e. essentially the same throughput. (and this shows that using .30
versus .31 did not materially impact iperf performance in this test,
under these conditions and with this hardware)

Then i applied the BFS patch to v2.6.30 and used the same
udelay(3000) hack and got:

No measurable change in throughput.

Obviously, this test is not equivalent to your test - but it does
show that even saturated iperf is getting scheduled just fine. (or,
rather, does not get scheduled all that much.)

[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38505
[ 5] 0.0-10.1 sec 481 MBytes 401 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38506
[ 4] 0.0-10.0 sec 505 MBytes 423 Mbits/sec
[ 5] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38507
[ 5] 0.0-10.0 sec 508 MBytes 426 Mbits/sec
[ 4] local 10.0.1.14 port 5001 connected with 10.0.1.19 port 38508
[ 4] 0.0-10.0 sec 486 MBytes 406 Mbits/sec

So either your MIPS system has some unexpected dependency on the
scheduler, or there's something weird going on.

Mind poking on this one to figure out whether it's all repeatable
and why that slowdown happens? Multiple attempts to reproduce it
failed here for me.

Ingo

2009-09-08 08:04:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Pekka Pietikainen <[email protected]> wrote:

> On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
> > > > Could you profile it please? Also, what's the context-switch rate?
> > >
> > > As far as I can tell, the broadcom mips architecture does not have
> > > profiling support. It does only have some proprietary profiling
> > > registers that nobody wrote kernel support for, yet.
> > Well, what does 'vmstat 1' show - how many context switches are
> > there per second on the iperf server? In theory if it's a truly
> > saturated box, there shouldnt be many - just a single iperf task
>
> Yay, finally something that's measurable in this thread \o/

My initial posting in this thread contains 6 separate types of
measurements, rather extensive ones. Out of those, 4 measurements
were latency oriented, two were throughput oriented. Plenty of data,
plenty of results, and very good reproducibility.

> Gigabit Ethernet iperf on an Atom or so might be something that
> shows similar effects yet is debuggable. Anyone feel like taking a
> shot?

I tried iperf on x86 and simulated saturation and no, there's no BFS
versus mainline performance difference that i can measure - simply
because a saturated iperf server does not schedule much - it's busy
handling all that networking workload.

I did notice that iperf is somewhat noisy: it can easily have weird
outliers regardless of which scheduler is used. That could be an
effect of queueing/timing: depending on precisely the order in which
packets arrive and get queued by the networking stack, a cache-effective
pathway for packets may or may not open up - with slightly different
timings, that pathway closes and we get much worse queueing performance.
I saw noise on the order of 10%, so iperf has to be measured carefully
before drawing conclusions.

> That beast doing iperf probably ends up making it go quite close
> to it's limits (IO, mem bw, cpu). IIRC the routing/bridging
> performance is something like 40Mbps (depends a lot on the model,
> corresponds pretty well with the Mhz of the beast).
>
> Maybe not totally unlike what make -j16 does to a 1-4 core box?

No, a single iperf session is very different from kbuild make -j16.

Firstly, the iperf server is just a single long-lived task - so we
context-switch between that and the idle thread [and perhaps a
kernel thread such as ksoftirqd]. The scheduler essentially has no
leeway what task to schedule and for how long: if there's work going
on the iperf server task will run - if there's none, the idle task
runs. [modulo ksoftirqd - depending on the driver model and
dependent on precise timings.]

kbuild -j16 on the other hand is a complex hierarchy and mixture of
thousands of short-lived and long-lived tasks. The scheduler has a
lot of leeway to decide what to schedule and for how long.

From a scheduler perspective the two workloads could not be any more
different. Kbuild does test scheduler decisions in non-trivial ways
- iperf server does not really.

Ingo

2009-09-08 08:13:53

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/08/2009 11:04 AM, Ingo Molnar wrote:
>
> * Pekka Pietikainen<[email protected]> wrote:
>
>> On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
>>>>> Could you profile it please? Also, what's the context-switch rate?
>>>>
>>>> As far as I can tell, the broadcom mips architecture does not have
>>>> profiling support. It does only have some proprietary profiling
>>>> registers that nobody wrote kernel support for, yet.
>>> Well, what does 'vmstat 1' show - how many context switches are
>>> there per second on the iperf server? In theory if it's a truly
>>> saturated box, there shouldnt be many - just a single iperf task
>>
>> Yay, finally something that's measurable in this thread \o/
>
> My initial posting in this thread contains 6 separate types of
> measurements, rather extensive ones. Out of those, 4 measurements
> were latency oriented, two were throughput oriented. Plenty of data,
> plenty of results, and very good reproducability.

None of which involve latency-prone GUI applications running on cheap
commodity hardware though. I listed examples where mainline seems to
behave sub-optimally and ways to reproduce them, but this doesn't seem to
be an area of interest.

2009-09-08 08:28:11

by Arjan van de Ven

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tue, 08 Sep 2009 10:19:06 +0300
Nikos Chantziaras <[email protected]> wrote:

> latencytop has this to say:
>
> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>
> Though I don't really understand what this tool is trying to tell me,
> I hope someone does.

unfortunately this is both an older version of latencytop, and it's
incorrectly installed ;-(
Latencytop is supposed to translate those cryptic strings to english,
but due to not being correctly installed, it does not do this ;(

the latest version of latencytop also has a GUI (thanks to Ben)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-08 08:34:42

by Arjan van de Ven

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tue, 08 Sep 2009 10:19:06 +0300
Nikos Chantziaras <[email protected]> wrote:

> latencytop has this to say:
>
> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>
> Though I don't really understand what this tool is trying to tell me,
> I hope someone does.

despite the untranslated content, it is clear that you have scheduler
delays (either due to scheduler bugs or cpu contention) of up to 68
msecs... Second in line is your binary AMD graphics driver that is
chewing up 14% of your total latency...


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-08 09:13:05

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Mon, Sep 07 2009, Jens Axboe wrote:
> On Mon, Sep 07 2009, Jens Axboe wrote:
> > > And yes, it would be wonderful to get a test-app from you that would
> > > express the kind of pain you are seeing during compile jobs.
> >
> > I was hoping this one would, but it's not showing anything. I even added
> > support for doing the ping and wakeup over a socket, to see if the pipe
> > test was doing well because of the sync wakeup we do there. The net
> > latency is a little worse, but still good. So no luck in making that app
> > so far.
>
> Here's a version that bounces timestamps between a producer and a number
> of consumers (clients). Not really tested much, but perhaps someone can
> compare this on a box that boots BFS and see what happens.

And here's a newer version. It ensures that clients are running before
sending a timestamp, and it drops the first and last log entry to
eliminate any weird effects there. Accuracy should also be improved.

On an idle box, it'll usually log all zeroes. Sometimes I see 3-4msec
latencies, weird.

--
Jens Axboe


Attachments:
(No filename) (1.04 kB)
latt.c (9.56 kB)

2009-09-08 09:50:22

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tue, 2009-09-08 at 09:48 +0200, Ingo Molnar wrote:
> So either your MIPS system has some unexpected dependency on the
> scheduler, or there's something weird going on.
>
> Mind poking on this one to figure out whether it's all repeatable
> and why that slowdown happens? Multiple attempts to reproduce it
> failed here for me.

Could it be the scheduler using constructs that don't do well on MIPS ?

I remember at some stage we spotted an expensive multiply in there,
maybe there's something similar, or some unaligned or non-cache friendly
vs. the MIPS cache line size data structure, that sort of thing ...

Is this a SW loaded TLB ? Does it miss on kernel space ? That could
also be some differences in how many pages are touched by each scheduler
causing more TLB pressure. This will be mostly invisible on x86.

At this stage, it will be hard to tell without some profile data I
suppose. Maybe next week I can try on a small SW loaded TLB embedded PPC
see if I can reproduce some of that, but no promises here.

Cheers,
Ben.

2009-09-08 10:12:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Nikos Chantziaras <[email protected]> wrote:

> On 09/08/2009 11:04 AM, Ingo Molnar wrote:
>>
>> * Pekka Pietikainen<[email protected]> wrote:
>>
>>> On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
>>>>>> Could you profile it please? Also, what's the context-switch rate?
>>>>>
>>>>> As far as I can tell, the broadcom mips architecture does not have
>>>>> profiling support. It does only have some proprietary profiling
>>>>> registers that nobody wrote kernel support for, yet.
>>>> Well, what does 'vmstat 1' show - how many context switches are
>>>> there per second on the iperf server? In theory if it's a truly
>>>> saturated box, there shouldnt be many - just a single iperf task
>>>
>>> Yay, finally something that's measurable in this thread \o/
>>
>> My initial posting in this thread contains 6 separate types of
>> measurements, rather extensive ones. Out of those, 4 measurements
>> were latency oriented, two were throughput oriented. Plenty of
>> data, plenty of results, and very good reproducability.
>
> None of which involve latency-prone GUI applications running on
> cheap commodity hardware though. [...]

The lat_tcp, lat_pipe and pipe-test numbers are all benchmarks that
characterise such workloads - they show the latency of context
switches.
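
For reference, the core of such a ping-pong test is tiny. The sketch
below is illustrative C only - not the actual pipe-test/lat_pipe
sources, which differ in details such as CPU affinity and iteration
counts: two tasks bounce one byte over a pair of pipes, so each round
trip costs roughly two context switches when both run on one CPU.

/*
 * Rough pipe ping-pong sketch: parent and child bounce a single byte,
 * so every round trip is a pair of wakeups/context switches (when the
 * two tasks share a CPU).  The average round-trip time is reported.
 */
#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <sys/wait.h>

#define LOOPS 100000

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	int ping[2], pong[2];
	long long start, total;
	char c = 0;
	int i;

	pipe(ping);
	pipe(pong);

	if (fork() == 0) {
		/* child: echo every byte straight back */
		for (i = 0; i < LOOPS; i++) {
			read(ping[0], &c, 1);
			write(pong[1], &c, 1);
		}
		_exit(0);
	}

	start = now_ns();
	for (i = 0; i < LOOPS; i++) {
		write(ping[1], &c, 1);
		read(pong[0], &c, 1);
	}
	total = now_ns() - start;

	printf("%d round trips, %.2f usec each\n",
	       LOOPS, (double)total / 1000.0 / LOOPS);
	wait(NULL);
	return 0;
}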

I also tested where Con posted numbers that BFS has an edge over
mainline: kbuild performance. Should i not have done that?

Also note the interbench latency measurements that Con posted:

http://ck.kolivas.org/patches/bfs/interbench-bfs-cfs.txt

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load      Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None      0.004 +/- 0.00436     0.006         100             100
Video     0.008 +/- 0.00879     0.015         100             100
X         0.006 +/- 0.0067      0.014         100             100
Burn      0.005 +/- 0.00563     0.009         100             100
Write     0.005 +/- 0.00887     0.16          100             100
Read      0.006 +/- 0.00696     0.018         100             100
Compile   0.007 +/- 0.00751     0.019         100             100

Versus the mainline scheduler:

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load      Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None      0.005 +/- 0.00562     0.007         100             100
Video     0.003 +/- 0.00333     0.009         100             100
X         0.003 +/- 0.00409     0.01          100             100
Burn      0.004 +/- 0.00415     0.006         100             100
Write     0.005 +/- 0.00592     0.021         100             100
Read      0.004 +/- 0.00463     0.009         100             100
Compile   0.003 +/- 0.00426     0.014         100             100

look at those standard deviation numbers, their spread is way too
high, often 50% or more - very hard to compare such noisy data.

Furthermore, they happen to show the 2.6.30 mainline scheduler
outperforming BFS in almost every interactivity metric.

Check it for yourself and compare the entries. I havent made those
measurements, Con did.

For example 'Compile' latencies:

--- Benchmarking simulated cpu of Audio in the presence of simulated Load
                   Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
v2.6.30: Compile   0.003 +/- 0.00426     0.014         100             100
BFS:     Compile   0.007 +/- 0.00751     0.019         100             100

but ... with a near 100% standard deviation that's pretty hard to
judge. The Max Latency went from 14 usecs under v2.6.30 to 19 usecs
on BFS.

> [...] I listed examples where mainline seems to behave
> sub-optimal and ways to reproduce them but this doesn't seem to be
> an area of interest.

It is an area of interest of course. That's how the interactivity
results above became possible.

Ingo

2009-09-08 10:13:36

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/08/2009 11:38 AM, Arjan van de Ven wrote:
> On Tue, 08 Sep 2009 10:19:06 +0300
> Nikos Chantziaras<[email protected]> wrote:
>
>> latencytop has this to say:
>>
>> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>>
>> Though I don't really understand what this tool is trying to tell me,
>> I hope someone does.
>
> despite the untranslated content, it is clear that you have scheduler
> delays (either due to scheduler bugs or cpu contention) of upto 68
> msecs... Second in line is your binary AMD graphics driver that is
> chewing up 14% of your total latency...

I've now used a correctly installed and up-to-date version of latencytop
and repeated the test. Also, I got rid of AMD's binary blob and used
kernel DRM drivers for my graphics card to throw fglrx out of the
equation (which btw didn't help; the exact same problems occur).

Here the result:

http://foss.math.aegean.gr/~realnc/pics/latop2.png

Again: this is on an Intel Core 2 Duo CPU.

2009-09-08 10:40:51

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/08/2009 01:12 PM, Ingo Molnar wrote:
>
> * Nikos Chantziaras<[email protected]> wrote:
>
>> On 09/08/2009 11:04 AM, Ingo Molnar wrote:
>>>
>>> * Pekka Pietikainen<[email protected]> wrote:
>>>
>>>> On Mon, Sep 07, 2009 at 10:57:01PM +0200, Ingo Molnar wrote:
>>>>>>> Could you profile it please? Also, what's the context-switch rate?
>>>>>>
>>>>>> As far as I can tell, the broadcom mips architecture does not have
>>>>>> profiling support. It does only have some proprietary profiling
>>>>>> registers that nobody wrote kernel support for, yet.
>>>>> Well, what does 'vmstat 1' show - how many context switches are
>>>>> there per second on the iperf server? In theory if it's a truly
>>>>> saturated box, there shouldnt be many - just a single iperf task
>>>>
>>>> Yay, finally something that's measurable in this thread \o/
>>>
>>> My initial posting in this thread contains 6 separate types of
>>> measurements, rather extensive ones. Out of those, 4 measurements
>>> were latency oriented, two were throughput oriented. Plenty of
>>> data, plenty of results, and very good reproducability.
>>
>> None of which involve latency-prone GUI applications running on
>> cheap commodity hardware though. [...]
>
> The lat_tcp, lat_pipe and pipe-test numbers are all benchmarks that
> characterise such workloads - they show the latency of context
> switches.
>
> I also tested where Con posted numbers that BFS has an edge over
> mainline: kbuild performance. Should i not have done that?

It's good that you did, of course. However, when someone reports a
problem/issue, the developer usually tries to reproduce the problem; he
needs to see what the user sees. This is how it's usually done, not
only in most other development environments, but also here from I could
gather by reading this list. When getting reports about interactivity
issues and with very specific examples of how to reproduce, I would have
expected that most developers interested in identifying the issue would
try to reproduce the same problem and work from there. That would mean
that you (or anyone else with an interest of tracking this down) would
follow the examples given (by me and others, like enabling desktop
compositing, firing up mplayer with a video and generally reproducing
this using the quite detailed steps I posted as a recipe).

However, in this case, instead of the above, raw numbers are posted with
batch jobs and benchmarks that aren't actually reproducing the issue as
described by the reporter(s). That way, the developer doesn't get to
experience the issue first-hand (and due to this possibly missing the
real cause). In most other bug reports or issues, the right thing seems
to happen and the devs try to reproduce it exactly as described. But
not in this case. I suspect this is due to most devs not using the
software components on their machines that are necessary for this and
therefore it would take too much time to reproduce the issue exactly as
described?

2009-09-08 11:30:27

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/08/2009 02:54 AM, Thomas Fjellstrom wrote:
> On Sun September 6 2009, Nikos Chantziaras wrote:
>> [...]
>> For reference, my system is:
>>
>> CPU: Intel Core 2 Duo E6600 (2.4GHz)
>> Mainboard: Asus P5E (Intel X38 chipset)
>> RAM: 6GB (2+2+1+1) dual channel DDR2 800
>> GPU: RV770 (Radeon HD4870).
>>
>
> My Phenom 9550 (2.2Ghz) whips the pants off my Intel Q6600 (2.6Ghz). I and a
> friend of mine both get large amounts of stalling when doing a lot of IO. I
> haven't seen such horrible desktop interactivity since before the new
> schedulers and the -ck patchset came out for 2.4.x. Its a heck of a lot better
> on my AMD Phenom's, but some lag is noticeable these days, even when it wasn't
> a few kernel releases ago.

It seems someone tried BFS on much slower hardware: Android. According
to the feedback, the device is much more responsive with BFS:
http://twitter.com/cyanogen

2009-09-08 11:32:53

by Juergen Borleis

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tuesday, 8 September 2009, Nikos Chantziaras wrote:
> On 09/08/2009 11:38 AM, Arjan van de Ven wrote:
> > On Tue, 08 Sep 2009 10:19:06 +0300
> >
> > Nikos Chantziaras<[email protected]> wrote:
> >> latencytop has this to say:
> >>
> >> http://foss.math.aegean.gr/~realnc/pics/latop1.png
> >>
> >> Though I don't really understand what this tool is trying to tell me,
> >> I hope someone does.
> >
> > despite the untranslated content, it is clear that you have scheduler
> > delays (either due to scheduler bugs or cpu contention) of upto 68
> > msecs... Second in line is your binary AMD graphics driver that is
> > chewing up 14% of your total latency...
>
> I've now used a correctly installed and up-to-date version of latencytop
> and repeated the test. Also, I got rid of AMD's binary blob and used
> kernel DRM drivers for my graphics card to throw fglrx out of the
> equation (which btw didn't help; the exact same problems occur).
>
> Here the result:
>
> http://foss.math.aegean.gr/~realnc/pics/latop2.png
>
> Again: this is on an Intel Core 2 Duo CPU.

Just an idea: Maybe some system management code hits you?

jbe

--
Pengutronix e.K. | Juergen Beisert |
Linux Solutions for Science and Industry | Phone: +49-8766-939 228 |
Vertretung Sued/Muenchen, Germany | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de/ |

2009-09-08 11:36:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Nikos Chantziaras <[email protected]> wrote:

> [...] That would mean that you (or anyone else with an interest of
> tracking this down) would follow the examples given (by me and
> others, like enabling desktop compositing, firing up mplayer with
> a video and generally reproducing this using the quite detailed
> steps I posted as a recipe).

Could you follow up on Frederic's detailed tracing suggestions that
would give us the source of the latency?

( Also, as per lkml etiquette, please try to keep the Cc: list
intact when replying to emails. I missed your first reply
that you un-Cc:-ed. )

A quick look at the latencytop output suggests a scheduling latency.
Could you send me the kernel .config that you are using?

Ingo

2009-09-08 12:05:04

by el_es

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Ingo Molnar <mingo <at> elte.hu> writes:


> For example 'Compile' latencies:
>
> --- Benchmarking simulated cpu of Audio in the presence of simulated Load
> Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
> v2.6.30: Compile 0.003 +/- 0.00426 0.014 100 100
> BFS: Compile 0.007 +/- 0.00751 0.019 100 100
>
> but ... with a near 100% standard deviation that's pretty hard to
> judge. The Max Latency went from 14 usecs under v2.6.30 to 19 usecs
> on BFS.
>
[...]
> Ingo
>

This just struck me: maybe what desktop users *feel* is exactly that: the
current approach is too fine-grained, trying to achieve the minimum latency
with the *most* reproducible result (lower stddev) at all costs? And BFS
just doesn't care?
I know this sounds like heresy.

Lukasz


2009-09-08 12:03:16

by Theodore Ts'o

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tue, Sep 08, 2009 at 01:13:34PM +0300, Nikos Chantziaras wrote:
>> despite the untranslated content, it is clear that you have scheduler
>> delays (either due to scheduler bugs or cpu contention) of upto 68
>> msecs... Second in line is your binary AMD graphics driver that is
>> chewing up 14% of your total latency...
>
> I've now used a correctly installed and up-to-date version of latencytop
> and repeated the test. Also, I got rid of AMD's binary blob and used
> kernel DRM drivers for my graphics card to throw fglrx out of the
> equation (which btw didn't help; the exact same problems occur).
>
> Here the result:
>
> http://foss.math.aegean.gr/~realnc/pics/latop2.png

This was with an unmodified 2.6.31-rcX kernel? Does Latencytop do
anything useful on a BFS-patched kernel?

- Ted

2009-09-08 13:18:46

by Serge Belyshev

[permalink] [raw]
Subject: Epic regression in throughput since v2.6.23


Hi. I've done measurements of the time taken by a make -j4 kernel build
on a quadcore box. The results are interesting: the mainline kernel
has regressed since the v2.6.23 release by more than 10%.

The following graph is time taken by "make -j4" (median over 9 runs)
versus kernel version. The huge (10%) regression since v2.6.23 is
apparent. Note that tip/master c26f010 is better than current mainline.
Also note that BFS is significantly better than both and shows the same
throughput as vanilla v2.6.23:

http://img403.imageshack.us/img403/7029/epicmakej4.png


The following plot is a detailed comparison of time taken versus number
of parallel jobs. Note that at "make -j4" (which equals number of hardware
threads), BFS has the minimum (best performance),
and tip/master the maximum (worst). I've also tested mainline v2.6.31
(not shown on the graph), which produces results similar to, albeit a
bit slower than, tip/master.

http://img179.imageshack.us/img179/5335/epicbfstip.png


Conclusions are:
1) mainline has severely regressed since v2.6.23
2) BFS shows optimal performance at make -jN where N equals the number of
h/w threads, while current mainline scheduler performance is far from
optimal in this case.

2009-09-09 00:47:36

by Ralf Baechle

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tue, Sep 08, 2009 at 07:50:00PM +1000, Benjamin Herrenschmidt wrote:

> On Tue, 2009-09-08 at 09:48 +0200, Ingo Molnar wrote:
> > So either your MIPS system has some unexpected dependency on the
> > scheduler, or there's something weird going on.
> >
> > Mind poking on this one to figure out whether it's all repeatable
> > and why that slowdown happens? Multiple attempts to reproduce it
> > failed here for me.
>
> Could it be the scheduler using constructs that don't do well on MIPS ?

It would surprise me.

I'm wondering if BFS has properties that make it perform better on a very
low memory system; I guess the BCM74xx system will have like 32MB or 64MB
only.

> I remember at some stage we spotted an expensive multiply in there,
> maybe there's something similar, or some unaligned or non-cache friendly
> vs. the MIPS cache line size data structure, that sort of thing ...
>
> Is this a SW loaded TLB ? Does it miss on kernel space ? That could
> also be some differences in how many pages are touched by each scheduler
> causing more TLB pressure. This will be mostly invisible on x86.

Software refilled. No misses ever for kernel space or low-mem; think of
it as low-mem and kernel executable living in a 512MB page that is mapped
by a mechanism outside the TLB. Vmalloc ranges are TLB mapped. Ioremap
address ranges only if above physical address 512MB.

An emulated unaligned load/store is very expensive; one that is encoded
properly by GCC for __attribute__((packed)) is only 1 cycle and 1
instruction ( = 4 bytes) extra.

> At this stage, it will be hard to tell without some profile data I
> suppose. Maybe next week I can try on a small SW loaded TLB embedded PPC
> see if I can reproduce some of that, but no promises here.

Ralf

2009-09-08 13:41:59

by Felix Fietkau

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Benjamin Herrenschmidt wrote:
> On Tue, 2009-09-08 at 09:48 +0200, Ingo Molnar wrote:
>> So either your MIPS system has some unexpected dependency on the
>> scheduler, or there's something weird going on.
>>
>> Mind poking on this one to figure out whether it's all repeatable
>> and why that slowdown happens? Multiple attempts to reproduce it
>> failed here for me.
>
> Could it be the scheduler using constructs that don't do well on MIPS ?
>
> I remember at some stage we spotted an expensive multiply in there,
> maybe there's something similar, or some unaligned or non-cache friendly
> vs. the MIPS cache line size data structure, that sort of thing ...
>
> Is this a SW loaded TLB ? Does it miss on kernel space ? That could
> also be some differences in how many pages are touched by each scheduler
> causing more TLB pressure. This will be mostly invisible on x86.
The TLB is SW loaded, yes. However it should not do any misses on kernel
space, since the whole segment is in a wired TLB entry.

- Felix

2009-09-08 14:17:24

by Arjan van de Ven

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tue, 08 Sep 2009 13:13:34 +0300
Nikos Chantziaras <[email protected]> wrote:

> On 09/08/2009 11:38 AM, Arjan van de Ven wrote:
> > On Tue, 08 Sep 2009 10:19:06 +0300
> > Nikos Chantziaras<[email protected]> wrote:
> >
> >> latencytop has this to say:
> >>
> >> http://foss.math.aegean.gr/~realnc/pics/latop1.png
> >>
> >> Though I don't really understand what this tool is trying to tell
> >> me, I hope someone does.
> >
> > despite the untranslated content, it is clear that you have
> > scheduler delays (either due to scheduler bugs or cpu contention)
> > of upto 68 msecs... Second in line is your binary AMD graphics
> > driver that is chewing up 14% of your total latency...
>
> I've now used a correctly installed and up-to-date version of
> latencytop and repeated the test. Also, I got rid of AMD's binary
> blob and used kernel DRM drivers for my graphics card to throw fglrx
> out of the equation (which btw didn't help; the exact same problems
> occur).
>
> Here the result:
>
> http://foss.math.aegean.gr/~realnc/pics/latop2.png
>
> Again: this is on an Intel Core 2 Duo CPU.


so we finally have objective numbers!

now the interesting part is also WHERE the latency hits. Because
fundamentally, if you oversubscribe the CPU, you WILL get scheduling
latency.. simply you have more to run than there is CPU.

Now the scheduler impacts this latency in two ways
* Deciding how long apps run before someone else gets to take over
("time slicing")
* Deciding who gets to run first/more; eg priority between apps

the first one more or less controls the maximum, while the second one
controls which apps get to enjoy this maximum.

latencytop shows you both, but it is interesting to see how much latency
the apps you actually care about are getting...



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-08 14:45:19

by Michael Büsch

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tuesday 08 September 2009 09:48:25 Ingo Molnar wrote:
> Mind poking on this one to figure out whether it's all repeatable
> and why that slowdown happens?

I repeated the test several times, because I couldn't really believe that
there's such a big difference for me, but the results were the same.
I don't really know what's going on nor how to find out what's going on.

--
Greetings, Michael.

2009-09-08 15:23:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> And here's a newer version.

I tinkered a bit with your proglet and finally found the problem.

You used a single pipe per child; this means the loop in run_child()
would consume what it just wrote out until it got force-preempted by the
parent, which would also get woken.

This results in the child spinning for a while (its full quota) and only
reporting the last timestamp to the parent.

Since the consumer (parent) is a single thread, the program basically
measures the worst delay in a thundering-herd wakeup of N children.

The below version yields:

idle

[root@opteron sched]# ./latt -c8 sleep 30
Entries: 664 (clients=8)

Averages:
------------------------------
Max 128 usec
Avg 26 usec
Stdev 16 usec


make -j4

[root@opteron sched]# ./latt -c8 sleep 30
Entries: 648 (clients=8)

Averages:
------------------------------
Max 20861 usec
Avg 3763 usec
Stdev 4637 usec


Mike's patch, make -j4

[root@opteron sched]# ./latt -c8 sleep 30
Entries: 648 (clients=8)

Averages:
------------------------------
Max 17854 usec
Avg 6298 usec
Stdev 4735 usec


Attachments:
latt.c (9.00 kB)
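
For readers without the latt.c attachment, here is a tiny, hypothetical
illustration (it is not the actual latt code) of the failure mode described
above: with one pipe shared for both directions, a task that writes to the
pipe and then reads it while waiting for the other side simply consumes its
own message instead of blocking.

/* single_pipe.c - hypothetical illustration, not latt.c: with one shared
 * pipe, a task that both writes to it and reads from it consumes its own
 * data instead of blocking until the other side writes. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fds[2];
	char buf[32];
	ssize_t n;

	if (pipe(fds) < 0) {
		perror("pipe");
		return 1;
	}

	/* the "child" reports a timestamp on the shared pipe ... */
	if (write(fds[1], "child timestamp", 15) != 15) {
		perror("write");
		return 1;
	}

	/* ... and, reading the same pipe while waiting for the parent,
	 * immediately gets its own message back instead of sleeping. */
	n = read(fds[0], buf, sizeof(buf) - 1);
	buf[n > 0 ? n : 0] = '\0';
	printf("read back my own data: \"%s\"\n", buf);

	/* The fix Peter describes amounts to one pipe per direction, so
	 * each end only ever reads what the other end wrote. */
	return 0;
}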

2009-09-08 15:45:26

by Michael Büsch

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Monday 07 September 2009 22:57:01 Ingo Molnar wrote:
>
> * Michael Buesch <[email protected]> wrote:
>
> > On Monday 07 September 2009 20:26:29 Ingo Molnar wrote:
> > > Could you profile it please? Also, what's the context-switch rate?
> >
> > As far as I can tell, the broadcom mips architecture does not have
> > profiling support. It does only have some proprietary profiling
> > registers that nobody wrote kernel support for, yet.
>
> Well, what does 'vmstat 1' show - how many context switches are
> there per second on the iperf server? In theory if it's a truly
> saturated box, there shouldnt be many - just a single iperf task
> running at 100% CPU utilization or so.
>
> (Also, if there's hrtimer support for that board then perfcounters
> could be used to profile it.)

CFS:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 15892 1684 5868 0 0 0 0 268 6 31 69 0 0
1 0 0 15892 1684 5868 0 0 0 0 266 2 34 66 0 0
1 0 0 15892 1684 5868 0 0 0 0 266 6 33 67 0 0
1 0 0 15892 1684 5868 0 0 0 0 267 4 37 63 0 0
1 0 0 15892 1684 5868 0 0 0 0 267 6 34 66 0 0
[ 4] local 192.168.1.1 port 5001 connected with 192.168.1.99 port 47278
2 0 0 15756 1684 5868 0 0 0 0 1655 68 26 74 0 0
2 0 0 15756 1684 5868 0 0 0 0 1945 88 20 80 0 0
2 0 0 15756 1684 5868 0 0 0 0 1882 85 20 80 0 0
2 0 0 15756 1684 5868 0 0 0 0 1923 86 18 82 0 0
2 0 0 15756 1684 5868 0 0 0 0 1986 87 23 77 0 0
2 0 0 15756 1684 5868 0 0 0 0 1923 87 17 83 0 0
2 0 0 15756 1684 5868 0 0 0 0 1951 84 19 81 0 0
2 0 0 15756 1684 5868 0 0 0 0 1970 87 18 82 0 0
2 0 0 15756 1684 5868 0 0 0 0 1972 85 23 77 0 0
2 0 0 15756 1684 5868 0 0 0 0 1961 87 18 82 0 0
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 28.6 MBytes 23.9 Mbits/sec
1 0 0 15752 1684 5868 0 0 0 0 599 22 22 78 0 0
1 0 0 15752 1684 5868 0 0 0 0 269 4 32 68 0 0
1 0 0 15752 1684 5868 0 0 0 0 266 4 29 71 0 0
1 0 0 15764 1684 5868 0 0 0 0 267 6 37 63 0 0
1 0 0 15764 1684 5868 0 0 0 0 267 4 31 69 0 0
1 0 0 15768 1684 5868 0 0 0 0 266 4 51 49 0 0


I'm currently unable to test BFS, because the device throws strange flash errors.
Maybe the flash is broken :(

--
Greetings, Michael.

2009-09-08 17:55:32

by Jesse Brandeburg

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23

On Tue, Sep 8, 2009 at 5:57 AM, Serge
Belyshev<[email protected]> wrote:
>
> Hi. I've done measurements of the time taken by a "make -j4" kernel build
> on a quad-core box. The results are interesting: the mainline kernel
> has regressed since the v2.6.23 release by more than 10%.

Is this related to why I now have to double the number of threads X I
pass to make -jX in order to use all my idle time for a kernel
compile? I had noticed (without measuring exactly) that with each kernel
released in this series I had to increase my number of worker threads;
my common working model now is (cpus * 2) in order to get zero idle
time.

Sorry I haven't tested BFS yet, but am interested to see if it helps
interactivity when playing flash videos on my dual core laptop.

2009-09-08 18:15:25

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/07/2009 02:01 PM, Frederic Weisbecker wrote:
> On Mon, Sep 07, 2009 at 06:38:36AM +0300, Nikos Chantziaras wrote:
>> Unfortunately, I can't come up with any way to somehow benchmark all of
>> this. There's no benchmark for "fluidity" and "responsiveness". Running
>> the Doom 3 benchmark, or any other benchmark, doesn't say anything about
>> responsiveness, it only measures how many frames were calculated in a
>> specific period of time. How "stable" (with no stalls) those frames were
>> making it to the screen is not measurable.
>
>
> That actually looks benchmarkable. This is about latency.
> For example, you could try to run high-load tasks in the
> background and then launch a task that wakes up at medium/large
> intervals to do something. You could measure the time it takes for it
> to be woken up and to perform what it wants.
>
> We have some events tracing infrastructure in the kernel that can
> snapshot the wake up and sched switch events.
>
> Having CONFIG_EVENT_TRACING=y should be sufficient for that.
>
> You just need to mount a debugfs point, say in /debug.
>
> Then you can activate these sched events by doing:
>
> echo 0 > /debug/tracing/tracing_on
> echo 1 > /debug/tracing/events/sched/sched_switch/enable
> echo 1 > /debug/tracing/events/sched/sched_wakeup/enable
>
> #Launch your tasks
>
> echo 1 > /debug/tracing/tracing_on
>
> #Wait for some time
>
> echo 0 > /debug/tracing/tracing_on
>
> That will require some parsing of the result in /debug/tracing/trace
> to get the delays between wakeup events and switch-in events
> for the task that periodically wakes up, and then producing some
> statistics such as the average or the maximum latency.
>
> That's a bit of a rough approach to measure such latencies but that
> should work.

I've tried this with 2.6.31-rc9 while running mplayer and alt+tabbing
repeatedly to the point where mplayer starts to stall and drop frames.
This produced a 4.1MB trace file (132k bzip2'ed):

http://foss.math.aegean.gr/~realnc/kernel/trace1.bz2

Uncompressed for online viewing:

http://foss.math.aegean.gr/~realnc/kernel/trace1

I must admit that I don't know what it is I'm looking at :P
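
For what it's worth, below is a rough, hypothetical sketch of the kind of
post-processing Frederic describes: compute the delay between a sched_wakeup
event and the following switch-in of the same pid, from a saved copy of
/debug/tracing/trace. The exact event line layout differs between kernel
versions, so the pid and timestamp matching here is an assumption that may
need adjusting for a given trace.

/* wakelat.c - hypothetical sketch, not an official tool: report avg/max
 * wakeup-to-switch-in latency for one pid from an ftrace dump on stdin.
 * usage: ./wakelat <pid> < trace */
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Pull the "seconds.usecs" timestamp that immediately precedes
 * "<event>:" on an ftrace line. Returns 1 on success. */
static int event_ts(const char *line, const char *event, double *ts)
{
	const char *ev = strstr(line, event);
	const char *p;

	if (!ev || ev - line < 4)
		return 0;
	p = ev - 3;				/* last digit of "1234.567890: " */
	while (p > line && (isdigit((unsigned char)*p) || *p == '.'))
		p--;
	return sscanf(p, "%lf", ts) == 1;
}

int main(int argc, char **argv)
{
	char line[1024], tag_old[32], tag_new[32];
	double wake = -1.0, delta, max = 0.0, sum = 0.0;
	long count = 0;
	int pid;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid> < trace\n", argv[0]);
		return 1;
	}
	pid = atoi(argv[1]);

	/* The payload format varies: accept either the older "comm:pid [prio]"
	 * style (":<pid> ") or the newer "pid=<pid>" style. */
	snprintf(tag_old, sizeof(tag_old), ":%d ", pid);
	snprintf(tag_new, sizeof(tag_new), "pid=%d", pid);

	while (fgets(line, sizeof(line), stdin)) {
		double ts;

		if (strstr(line, "sched_wakeup") &&
		    (strstr(line, tag_old) || strstr(line, tag_new))) {
			/* remember the earliest still-pending wakeup */
			if (event_ts(line, "sched_wakeup", &ts) && wake < 0.0)
				wake = ts;
		} else if (strstr(line, "sched_switch") && wake >= 0.0) {
			const char *in = strstr(line, "==>");

			/* only count switches *to* our pid */
			if (!in || !(strstr(in, tag_old) || strstr(in, tag_new)))
				continue;
			if (!event_ts(line, "sched_switch", &ts))
				continue;
			delta = ts - wake;
			wake = -1.0;
			sum += delta;
			count++;
			if (delta > max)
				max = delta;
		}
	}

	if (!count) {
		fprintf(stderr, "no wakeup/switch pairs found for pid %d\n", pid);
		return 1;
	}
	printf("samples: %ld  avg: %.0f usec  max: %.0f usec\n",
	       count, sum / count * 1e6, max * 1e6);
	return 0;
}

Against a trace like the one posted above, one would run something like
"./wakelat <pid-of-mplayer> < trace1", assuming the trace was taken while
that pid was the task of interest.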

2009-09-08 18:20:27

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23

On 09/08/2009 08:47 PM, Jesse Brandeburg wrote:
>[...]
> Sorry I haven't tested BFS yet, but am interested to see if it helps
> interactivity when playing flash videos on my dual core laptop.

Interactivity: yes (Flash will not result in the rest of the system
lagging).

Flash videos: they will still play as badly as before. BFS has no way to
fix broken code inside Flash :P

2009-09-08 18:37:36

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23

On 09/08/2009 03:57 PM, Serge Belyshev wrote:
>
> Hi. I've done measurements of the time taken by a "make -j4" kernel build
> on a quad-core box. The results are interesting: the mainline kernel
> has regressed since the v2.6.23 release by more than 10%.

It seems more people are starting to confirm this issue:

http://foldingforum.org/viewtopic.php?f=44&t=11336

IMHO it's not quite as dramatic as some people there describe it ("Is it
the holy grail?"), but if something makes your desktop "smooth as silk"
just like that, it might seem like a holy grail ;) In any case, there
clearly seems to be a performance problem with the mainline scheduler on
many people's desktops that is being solved by BFS.

2009-09-08 19:01:06

by Jeff Garzik

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23

On 09/08/2009 01:47 PM, Jesse Brandeburg wrote:
> On Tue, Sep 8, 2009 at 5:57 AM, Serge
> Belyshev<[email protected]> wrote:
>>
>> Hi. I've done measurements of the time taken by a "make -j4" kernel build
>> on a quad-core box. The results are interesting: the mainline kernel
>> has regressed since the v2.6.23 release by more than 10%.
>
> Is this related to why I now have to double the amount of threads X I
> pass to make -jX, in order to use all my idle time for a kernel
> compile? I had noticed (without measuring exactly) that it seems with
> each kernel released in this series mentioned, I had to increase my
> number of worker threads, my common working model now is (cpus * 2) in
> order to get zero idle time.

You will almost certainly see idle CPUs/threads with "make -jN_CPUS" due
to processes waiting for I/O.

If you're curious, there is also room for experimenting with make's "-l"
argument, which caps the number of jobs based on load average rather
than a static number of job slots.

Jeff

2009-09-08 19:07:02

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/08/2009 02:35 PM, Ingo Molnar wrote:
>
> * Nikos Chantziaras<[email protected]> wrote:
>
>> [...] That would mean that you (or anyone else with an interest of
>> tracking this down) would follow the examples given (by me and
>> others, like enabling desktop compositing, firing up mplayer with
>> a video and generally reproducing this using the quite detailed
>> steps I posted as a recipe).
>
> Could you follow up on Frederic's detailed tracing suggestions that
> would give us the source of the latency?

I've set it up and ran the tests now.


> ( Also, as per lkml etiquette, please try to keep the Cc: list
> intact when replying to emails. I missed your first reply
> that you un-Cc:-ed. )

Sorry for that.


> A quick look at the latencytop output suggests a scheduling latency.
> Could you send me the kernel .config that you are using?

That would be this one:

http://foss.math.aegean.gr/~realnc/kernel/config-2.6.31-rc9

2009-09-08 19:20:21

by Serge Belyshev

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23

Jeff Garzik <[email protected]> writes:

> You will almost certainly see idle CPUs/threads with "make -jN_CPUS"
> due to processes waiting for I/O.

Just to clarify: I have excluded all I/O effects from my plots
by building entirely from tmpfs. Also, before each actual measurement
there was a thrown-away "pre-caching" run. And my box has 8GB of RAM.

2009-09-08 19:26:13

by Jeff Garzik

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23

On 09/08/2009 03:20 PM, Serge Belyshev wrote:
> Jeff Garzik<[email protected]> writes:
>
>> You will almost certainly see idle CPUs/threads with "make -jN_CPUS"
>> due to processes waiting for I/O.
>
> Just to clarify: I have excluded all I/O effects from my plots
> by building entirely from tmpfs. Also, before each actual measurement
> there was a thrown-away "pre-caching" run. And my box has 8GB of RAM.

You could always one-up that by using ramfs ;)

Jeff


2009-09-08 20:22:44

by Frans Pop

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Arjan van de Ven wrote:
> the latest version of latencytop also has a GUI (thanks to Ben)

That looks nice, but...

I kind of miss the split screen feature where latencytop would show both
the overall figures + the ones for the currently most affected task.
Downside of that last was that I never managed to keep the display on a
specific task.

The graphical display also makes it impossible to simply copy and paste
the results.

Having the freeze button is nice though.

Would it be possible to have a command line switch that allows to start
the old textual mode?

Looks like the man page needs updating too :-)

Cheers,
FJP

2009-09-08 20:34:08

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tue, Sep 08 2009, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> > And here's a newer version.
>
> I tinkered a bit with your proglet and finally found the problem.
>
> You used a single pipe per child, this means the loop in run_child()
> would consume what it just wrote out until it got force preempted by the
> parent which would also get woken.
>
> This results in the child spinning a while (its full quota) and only
> reporting the last timestamp to the parent.

Oh doh, that's not well thought out. Well it was a quick hack :-)
Thanks for the fixup, now it's at least usable to some degree.

> Since consumer (parent) is a single thread the program basically
> measures the worst delay in a thundering herd wakeup of N children.

Yes, it's really meant to measure how long it takes to wake a group of
processes, assuming that this is where things fall down on the 'box
loaded, switch desktop' case. Now whether that's useful or not or
whether this test app is worth the bits it takes up on the hard drive,
is another question.

--
Jens Axboe

2009-09-08 21:10:26

by Michal Schmidt

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tue, 8 Sep 2009 22:22:43 +0200,
Frans Pop <[email protected]> wrote:
> Would it be possible to have a command line switch that allows to
> start the old textual mode?

I use:
DISPLAY= latencytop

:-)
Michal

2009-09-08 21:11:52

by Frans Pop

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tuesday 08 September 2009, Frans Pop wrote:
> Arjan van de Ven wrote:
> > the latest version of latencytop also has a GUI (thanks to Ben)
>
> That looks nice, but...
>
> I kind of miss the split screen feature where latencytop would show
> both the overall figures + the ones for the currently most affected
> task. Downside of that last was that I never managed to keep the
> display on a specific task.
[...]
> Would it be possible to have a command line switch that allows to start
> the old textual mode?

I got a private reply suggesting that --nogui might work, and it does.
Thanks a lot Nikos!

> Looks like the man page needs updating too :-)

So this definitely needs attention :-P
Support of the standard -h and --help options would be great too.

Cheers,
FJP

2009-09-08 21:28:58

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/08/2009 03:03 PM, Theodore Tso wrote:
> On Tue, Sep 08, 2009 at 01:13:34PM +0300, Nikos Chantziaras wrote:
>>> despite the untranslated content, it is clear that you have scheduler
>>> delays (either due to scheduler bugs or cpu contention) of upto 68
>>> msecs... Second in line is your binary AMD graphics driver that is
>>> chewing up 14% of your total latency...
>>
>> I've now used a correctly installed and up-to-date version of latencytop
>> and repeated the test. Also, I got rid of AMD's binary blob and used
>> kernel DRM drivers for my graphics card to throw fglrx out of the
>> equation (which btw didn't help; the exact same problems occur).
>>
>> Here the result:
>>
>> http://foss.math.aegean.gr/~realnc/pics/latop2.png
>
> This was with an unmodified 2.6.31-rcX kernel?

Yes (-rc9). I also tested with 2.6.30.5 and got the same results.


> Does Latencytop do anything useful on a BFS-patched kernel?

Nope. BFS does not support any form of tracing yet. latencytop runs
but only shows a blank list. All I can say is that a BFS patched kernel
with the same .config fixes all visible latency issues.

2009-09-08 21:46:52

by Geunsik Lim

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, Sep 9, 2009 at 6:11 AM, Frans Pop<[email protected]> wrote:
>> Would it be possible to have a command line switch that allows to start
>> the old textual mode?
> I got a private reply suggesting that --nogui might work, and it does.
Um, you mean that you tested with runlevel 3 (multi-user mode), is that right?
Frans, can you share the Linux distribution you used for this test?
I want to check under the same conditions (e.g. Linux distribution such as
Fedora 11 or Ubuntu 9.04, runlevel, and so on).
> Thanks a lot Nikos!
>> Looks like the man page needs updating too :-)
> So this definitely needs attention :-P
> Support of the standard -h and --help options would be great too.
> Cheers,
> FJP
> --

Thanks,
GeunSik Lim.



--
Regards,
GeunSik Lim ( Samsung Electronics )
Blog : http://blog.naver.com/invain/
e-Mail: [email protected]
[email protected] , [email protected]

2009-09-08 22:00:49

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/08/2009 02:32 PM, Juergen Beisert wrote:
> On Dienstag, 8. September 2009, Nikos Chantziaras wrote:
>> On 09/08/2009 11:38 AM, Arjan van de Ven wrote:
>>> On Tue, 08 Sep 2009 10:19:06 +0300
>>>
>>> Nikos Chantziaras<[email protected]> wrote:
>>>> latencytop has this to say:
>>>>
>>>> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>>>>
>>>> Though I don't really understand what this tool is trying to tell me,
>>>> I hope someone does.
>>>
>>> despite the untranslated content, it is clear that you have scheduler
>>> delays (either due to scheduler bugs or cpu contention) of upto 68
>>> msecs... Second in line is your binary AMD graphics driver that is
>>> chewing up 14% of your total latency...
>>
>> I've now used a correctly installed and up-to-date version of latencytop
>> and repeated the test. Also, I got rid of AMD's binary blob and used
>> kernel DRM drivers for my graphics card to throw fglrx out of the
>> equation (which btw didn't help; the exact same problems occur).
>>
>> Here the result:
>>
>> http://foss.math.aegean.gr/~realnc/pics/latop2.png
>>
>> Again: this is on an Intel Core 2 Duo CPU.
>
> Just an idea: Maybe some system management code hits you?

I'm not sure what is meant by "system management code."

2009-09-08 22:15:38

by Serge Belyshev

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23

Serge Belyshev <[email protected]> writes:
>[snip]

I've updated the graphs, added kernels 2.6.24..2.6.29:
http://img186.imageshack.us/img186/7029/epicmakej4.png

And added comparison with best-performing 2.6.23 kernel:
http://img34.imageshack.us/img34/7563/epicbfstips.png

>
> Conclusions are
> 1) mainline has severely regressed since v2.6.23
> 2) BFS shows optimal performance at make -jN where N equals number of
> h/w threads, while current mainline scheduler performance is far from
> optimal in this case.

2009-09-08 22:36:34

by Frans Pop

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tuesday 08 September 2009, you wrote:
> On Wed, Sep 9, 2009 at 6:11 AM, Frans Pop<[email protected]> wrote:
> >> Would it be possible to have a command line switch that allows to
> >> start the old textual mode?
> >
> > I got a private reply suggesting that --nogui might work, and it
> > does.
>
> Um, you mean that you tested with runlevel 3 (multi-user mode), is that
> right? Frans, can you share the Linux distribution you used for this test? I
> want to check under the same conditions (e.g. Linux distribution such as
> Fedora 11 or Ubuntu 9.04, runlevel, and so on).

I ran it from KDE's konsole by just entering 'sudo latencytop --nogui' at
the command prompt.

Distro is Debian stable ("Lenny"), which does not have differences between
runlevels: by default they all start a desktop environment (if a display
manager like xdm/kdm/gdm is installed). But if you really want to know,
the runlevel was 2 ;-)

Cheers,
FJP

2009-09-08 22:53:48

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/08/2009 05:20 PM, Arjan van de Ven wrote:
> On Tue, 08 Sep 2009 13:13:34 +0300
> Nikos Chantziaras<[email protected]> wrote:
>
>> On 09/08/2009 11:38 AM, Arjan van de Ven wrote:
>>> On Tue, 08 Sep 2009 10:19:06 +0300
>>> Nikos Chantziaras<[email protected]> wrote:
>>>
>>>> latencytop has this to say:
>>>>
>>>> http://foss.math.aegean.gr/~realnc/pics/latop1.png
>>>>
>>>> Though I don't really understand what this tool is trying to tell
>>>> me, I hope someone does.
>>>
>>> despite the untranslated content, it is clear that you have
>>> scheduler delays (either due to scheduler bugs or cpu contention)
>>> of upto 68 msecs... Second in line is your binary AMD graphics
>>> driver that is chewing up 14% of your total latency...
>>
>> I've now used a correctly installed and up-to-date version of
>> latencytop and repeated the test. Also, I got rid of AMD's binary
>> blob and used kernel DRM drivers for my graphics card to throw fglrx
>> out of the equation (which btw didn't help; the exact same problems
>> occur).
>>
>> Here the result:
>>
>> http://foss.math.aegean.gr/~realnc/pics/latop2.png
>>
>> Again: this is on an Intel Core 2 Duo CPU.
>
>
> so we finally have objective numbers!
>
> now the interesting part is also WHERE the latency hits. Because
> fundamentally, if you oversubscribe the CPU, you WILL get scheduling
> latency.. simply you have more to run than there is CPU.

Sounds plausible. However, with mainline this latency is very, very
noticeable. With BFS I need to look really hard to detect it or do
outright silly things, like a "make -j50". (At first I wrote "-j20"
here but then went ahead and tested it just for kicks, and BFS would
still let me use the GUI smoothly, LOL. So then I corrected it to
"-j50"...)

2009-09-08 23:21:01

by Jiri Kosina

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, 9 Sep 2009, Nikos Chantziaras wrote:

> > > Here the result:
> > >
> > > http://foss.math.aegean.gr/~realnc/pics/latop2.png
> > >
> > > Again: this is on an Intel Core 2 Duo CPU.
> >
> > Just an idea: Maybe some system management code hits you?
>
> I'm not sure what is meant with "system management code."

A system management interrupt happens when firmware/BIOS/HW-debugger code is
executed at a privilege level so high that even the OS can't do anything
about it.

It is used in many situations, such as

- memory errors
- ACPI (mostly fan control)
- TPM

The OS has little to no ability to influence SMI/SMM. But if this were
the cause, you would probably obtain completely different results on a
different hardware configuration (as it is likely to have completely
different SMM behavior).

--
Jiri Kosina
SUSE Labs, Novell Inc.

2009-09-08 23:38:38

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/09/2009 02:20 AM, Jiri Kosina wrote:
> On Wed, 9 Sep 2009, Nikos Chantziaras wrote:
>
>>>> Here the result:
>>>>
>>>> http://foss.math.aegean.gr/~realnc/pics/latop2.png
>>>>
>>>> Again: this is on an Intel Core 2 Duo CPU.
>>>
>>> Just an idea: Maybe some system management code hits you?
>>
>> I'm not sure what is meant with "system management code."
>
> System management interrupt happens when firmware/BIOS/HW-debugger is
> executed in privilege mode so high, that even OS can't do anything about
> that.
>
> It is used in many situations, such as
>
> - memory errors
> - ACPI (mostly fan control)
> - TPM
>
> OS has small to none possibility to influence SMI/SMM. But if this would
> be the cause, you should probably obtain completely different results on
> different hardware configuration (as it is likely to have completely
> different SMM behavior).

Wouldn't that mean that a BFS-patched kernel would suffer from this too?

In any case, of the above, only fan control is active, and I've run with
it disabled on occasion (hot summer days, I wanted to just keep it max
with no fan control) with no change. As far as I can tell, the Asus P5E
doesn't have a TPM (the "Deluxe" and "VM" models seem to have one.) As
for memory errors, I use unbuffered non-ECC RAM which passes a
memtest86+ cycle cleanly (well, at least the last time I ran it through
one, a few months ago.)

2009-09-09 00:28:47

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

> The TLB is SW loaded, yes. However it should not do any misses on kernel
> space, since the whole segment is in a wired TLB entry.

Including vmalloc space ?

Ben.

2009-09-09 00:37:36

by David Miller

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

From: Benjamin Herrenschmidt <[email protected]>
Date: Wed, 09 Sep 2009 10:28:22 +1000

>> The TLB is SW loaded, yes. However it should not do any misses on kernel
>> space, since the whole segment is in a wired TLB entry.
>
> Including vmalloc space ?

No, MIPS does take SW tlb misses on vmalloc space. :-)

2009-09-09 01:36:22

by Felix Fietkau

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Ralf Baechle wrote:
>> I remember at some stage we spotted an expensive multiply in there,
>> maybe there's something similar, or some unaligned or non-cache friendly
>> vs. the MIPS cache line size data structure, that sort of thing ...
>>
>> Is this a SW loaded TLB ? Does it miss on kernel space ? That could
>> also be some differences in how many pages are touched by each scheduler
>> causing more TLB pressure. This will be mostly invisible on x86.
>
> Software refilled. No misses ever for kernel space or low-mem; think of
> it as low-mem and kernel executable living in a 512MB page that is mapped
> by a mechanism outside the TLB. Vmalloc ranges are TLB mapped. Ioremap
> address ranges only if above physical address 512MB.
>
> An emulated unaligned load/store is very expensive; one that is encoded
> properly by GCC for __attribute__((packed)) is only 1 cycle and 1
> instruction ( = 4 bytes) extra.
CFS definitely isn't causing any emulated unaligned load/stores on these
devices, we've tested that.

- Felix

2009-09-09 06:13:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Jens Axboe <[email protected]> wrote:

> On Tue, Sep 08 2009, Peter Zijlstra wrote:
> > On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> > > And here's a newer version.
> >
> > I tinkered a bit with your proglet and finally found the
> > problem.
> >
> > You used a single pipe per child, this means the loop in
> > run_child() would consume what it just wrote out until it got
> > force preempted by the parent which would also get woken.
> >
> > This results in the child spinning a while (its full quota) and
> > only reporting the last timestamp to the parent.
>
> Oh doh, that's not well thought out. Well it was a quick hack :-)
> Thanks for the fixup, now it's at least usable to some degree.

What kind of latencies does it report on your box?

Our vanilla scheduler default latency targets are:

single-core: 20 msecs
dual-core: 40 msecs
quad-core: 60 msecs
opto-core: 80 msecs

You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
/proc/sys/kernel/sched_latency_ns:

echo 10000000 > /proc/sys/kernel/sched_latency_ns

Ingo

2009-09-09 14:00:20

by Pavel Machek

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Hi!

> > So ... to get to the numbers - i've tested both BFS and the tip of
> > the latest upstream scheduler tree on a testbox of mine. I
> > intentionally didnt test BFS on any really large box - because you
> > described its upper limit like this in the announcement:
>
> I ran a simple test as well, since I was curious to see how it performed
> wrt interactiveness. One of my pet peeves with the current scheduler is
> that I have to nice compile jobs, or my X experience is just awful while
> the compile is running.
>
> Now, this test case is something that attempts to see what
> interactiveness would be like. It'll run a given command line while at
> the same time logging delays. The delays are measured as follows:
>
> - The app creates a pipe, and forks a child that blocks on reading from
> that pipe.
> - The app sleeps for a random period of time, anywhere between 100ms
> and 2s. When it wakes up, it gets the current time and writes that to
> the pipe.
> - The child then gets woken, checks the time on its own, and logs the
> difference between the two.
>
> The idea here being that the delay between writing to the pipe and the
> child reading the data and comparing should (in some way) be indicative
> of how responsive the system would seem to a user.
>
> The test app was quickly hacked up, so don't put too much into it. The
> test run is a simple kernel compile, using -jX where X is the number of
> threads in the system. The files are cache hot, so little IO is done.
> The -x2 run is using the double number of processes as we have threads,
> eg -j128 on a 64 thread box.

Could you post the source? Someone else might get us
numbers... preferably on dualcore box or something...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
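
Until the actual source shows up, here is a minimal, hypothetical sketch of
the idea described in the quoted text above (it is not Jens's latt tool): the
parent sleeps for a random 100ms..2s, writes the current time into a pipe,
and the blocked child logs how much later it actually got to run.

/* pipe_wake.c - hypothetical sketch of the described idea, not latt.c:
 * measure how long it takes a blocked child to run after the parent
 * writes a timestamp into the pipe the child sleeps on. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static double now_usec(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
	int fds[2], i, rounds = 20;
	pid_t child;

	if (pipe(fds) < 0) {
		perror("pipe");
		return 1;
	}

	child = fork();
	if (child == 0) {
		/* child: block on the pipe, log write-to-wakeup delay */
		double sent;

		close(fds[1]);
		while (read(fds[0], &sent, sizeof(sent)) == sizeof(sent))
			printf("wakeup delay: %.0f usec\n", now_usec() - sent);
		_exit(0);
	}

	/* parent: wake the child at random intervals of 100ms .. 2s */
	close(fds[0]);
	srand(getpid());
	for (i = 0; i < rounds; i++) {
		struct timespec delay;
		long us = 100000 + rand() % 1900000;
		double stamp;

		delay.tv_sec = us / 1000000;
		delay.tv_nsec = (us % 1000000) * 1000L;
		nanosleep(&delay, NULL);

		stamp = now_usec();
		if (write(fds[1], &stamp, sizeof(stamp)) != sizeof(stamp))
			break;
	}
	close(fds[1]);		/* child's read() returns 0 and it exits */
	waitpid(child, NULL, 0);
	return 0;
}

Run under load (say, a parallel kernel build) the logged delays would, in
principle, grow with scheduling latency, which is roughly what the latt tool
measures in a more careful way.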

2009-09-09 08:34:59

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/09/2009 09:13 AM, Ingo Molnar wrote:
>
> * Jens Axboe<[email protected]> wrote:
>
>> On Tue, Sep 08 2009, Peter Zijlstra wrote:
>>> On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
>>>> And here's a newer version.
>>>
>>> I tinkered a bit with your proglet and finally found the
>>> problem.
>>>
>>> You used a single pipe per child, this means the loop in
>>> run_child() would consume what it just wrote out until it got
>>> force preempted by the parent which would also get woken.
>>>
>>> This results in the child spinning a while (its full quota) and
>>> only reporting the last timestamp to the parent.
>>
>> Oh doh, that's not well thought out. Well it was a quick hack :-)
>> Thanks for the fixup, now it's at least usable to some degree.
>
> What kind of latencies does it report on your box?
>
> Our vanilla scheduler default latency targets are:
>
> single-core: 20 msecs
> dual-core: 40 msecs
> quad-core: 60 msecs
> opto-core: 80 msecs
>
> You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
> /proc/sys/kernel/sched_latency_ns:
>
> echo 10000000> /proc/sys/kernel/sched_latency_ns

I've tried values ranging from 10000000 down to 100000. This results in
the stalls/freezes being a bit shorter, but clearly still there. It
does not eliminate them.

If there's anything else I can try/test, I would be happy to do so.

2009-09-09 08:52:27

by Mike Galbraith

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, 2009-09-09 at 08:13 +0200, Ingo Molnar wrote:
> * Jens Axboe <[email protected]> wrote:
>
> > On Tue, Sep 08 2009, Peter Zijlstra wrote:
> > > On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> > > > And here's a newer version.
> > >
> > > I tinkered a bit with your proglet and finally found the
> > > problem.
> > >
> > > You used a single pipe per child, this means the loop in
> > > run_child() would consume what it just wrote out until it got
> > > force preempted by the parent which would also get woken.
> > >
> > > This results in the child spinning a while (its full quota) and
> > > only reporting the last timestamp to the parent.
> >
> > Oh doh, that's not well thought out. Well it was a quick hack :-)
> > Thanks for the fixup, now it's at least usable to some degree.
>
> What kind of latencies does it report on your box?
>
> Our vanilla scheduler default latency targets are:
>
> single-core: 20 msecs
> dual-core: 40 msecs
> quad-core: 60 msecs
> opto-core: 80 msecs
>
> You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
> /proc/sys/kernel/sched_latency_ns:
>
> echo 10000000 > /proc/sys/kernel/sched_latency_ns

He would also need to lower min_granularity, otherwise, it'd be larger
than the whole latency target.

I'm testing right now, and one thing that is definitely a problem is the
amount of sleeper fairness we're giving. A full latency is just too
much short term fairness in my testing. While sleepers are catching up,
hogs languish. That's the biggest issue going on.

I've also been doing some timings of make -j4 (looking at idle time),
and find that child_runs_first is mildly detrimental to fork/exec load,
as are buddies.

I'm running with the below at the moment. (the kthread/workqueue thing
is just because I don't see any reason for it to exist, so consider it
to be a waste of perfectly good math;)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 6ec4643..a44210e 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -16,8 +16,6 @@
#include <linux/mutex.h>
#include <trace/events/sched.h>

-#define KTHREAD_NICE_LEVEL (-5)
-
static DEFINE_SPINLOCK(kthread_create_lock);
static LIST_HEAD(kthread_create_list);

@@ -150,7 +148,6 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
* The kernel thread should not inherit these properties.
*/
sched_setscheduler_nocheck(create.result, SCHED_NORMAL, &param);
- set_user_nice(create.result, KTHREAD_NICE_LEVEL);
set_cpus_allowed_ptr(create.result, cpu_all_mask);
}
return create.result;
@@ -226,7 +223,6 @@ int kthreadd(void *unused)
/* Setup a clean context for our children to inherit. */
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
- set_user_nice(tsk, KTHREAD_NICE_LEVEL);
set_cpus_allowed_ptr(tsk, cpu_all_mask);
set_mems_allowed(node_possible_map);

diff --git a/kernel/sched.c b/kernel/sched.c
index c512a02..e68c341 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7124,33 +7124,6 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
*/
cpumask_var_t nohz_cpu_mask;

-/*
- * Increase the granularity value when there are more CPUs,
- * because with more CPUs the 'effective latency' as visible
- * to users decreases. But the relationship is not linear,
- * so pick a second-best guess by going with the log2 of the
- * number of CPUs.
- *
- * This idea comes from the SD scheduler of Con Kolivas:
- */
-static inline void sched_init_granularity(void)
-{
- unsigned int factor = 1 + ilog2(num_online_cpus());
- const unsigned long limit = 200000000;
-
- sysctl_sched_min_granularity *= factor;
- if (sysctl_sched_min_granularity > limit)
- sysctl_sched_min_granularity = limit;
-
- sysctl_sched_latency *= factor;
- if (sysctl_sched_latency > limit)
- sysctl_sched_latency = limit;
-
- sysctl_sched_wakeup_granularity *= factor;
-
- sysctl_sched_shares_ratelimit *= factor;
-}
-
#ifdef CONFIG_SMP
/*
* This is how migration works:
@@ -9356,7 +9329,6 @@ void __init sched_init_smp(void)
/* Move init over to a non-isolated CPU */
if (set_cpus_allowed_ptr(current, non_isolated_cpus) < 0)
BUG();
- sched_init_granularity();
free_cpumask_var(non_isolated_cpus);

alloc_cpumask_var(&fallback_doms, GFP_KERNEL);
@@ -9365,7 +9337,6 @@ void __init sched_init_smp(void)
#else
void __init sched_init_smp(void)
{
- sched_init_granularity();
}
#endif /* CONFIG_SMP */

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index e386e5d..ff7fec9 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -51,7 +51,7 @@ static unsigned int sched_nr_latency = 5;
* After fork, child runs first. (default) If set to 0 then
* parent will (try to) run first.
*/
-const_debug unsigned int sysctl_sched_child_runs_first = 1;
+const_debug unsigned int sysctl_sched_child_runs_first = 0;

/*
* sys_sched_yield() compat mode
@@ -713,7 +713,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
if (!initial) {
/* sleeps upto a single latency don't count. */
if (sched_feat(NEW_FAIR_SLEEPERS)) {
- unsigned long thresh = sysctl_sched_latency;
+ unsigned long thresh = sysctl_sched_min_granularity;

/*
* Convert the sleeper threshold into virtual time.
@@ -1502,7 +1502,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
*/
if (sched_feat(LAST_BUDDY) && likely(se->on_rq && curr != rq->idle))
set_last_buddy(se);
- set_next_buddy(pse);
+ if (sched_feat(NEXT_BUDDY))
+ set_next_buddy(pse);

/*
* We can come here with TIF_NEED_RESCHED already set from new task
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 4569bfa..85d30d1 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -13,5 +13,6 @@ SCHED_FEAT(LB_BIAS, 1)
SCHED_FEAT(LB_WAKEUP_UPDATE, 1)
SCHED_FEAT(ASYM_EFF_LOAD, 1)
SCHED_FEAT(WAKEUP_OVERLAP, 0)
-SCHED_FEAT(LAST_BUDDY, 1)
+SCHED_FEAT(LAST_BUDDY, 0)
+SCHED_FEAT(NEXT_BUDDY, 0)
SCHED_FEAT(OWNER_SPIN, 1)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3c44b56..addfe2d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -317,8 +317,6 @@ static int worker_thread(void *__cwq)
if (cwq->wq->freezeable)
set_freezable();

- set_user_nice(current, -5);
-
for (;;) {
prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
if (!freezing(current) &&

2009-09-09 09:02:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, 2009-09-09 at 10:52 +0200, Mike Galbraith wrote:
> @@ -1502,7 +1502,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
> */
> if (sched_feat(LAST_BUDDY) && likely(se->on_rq && curr != rq->idle))
> set_last_buddy(se);
> - set_next_buddy(pse);
> + if (sched_feat(NEXT_BUDDY))
> + set_next_buddy(pse);
>
> /*
> * We can come here with TIF_NEED_RESCHED already set from new task

You might want to test stuff like sysbench again, iirc we went on a
cache-trashing rampage without buddies.

Our goal is not to excel at any one load but to not suck at any one
load.

2009-09-09 09:05:22

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/09/2009 11:52 AM, Mike Galbraith wrote:
> On Wed, 2009-09-09 at 08:13 +0200, Ingo Molnar wrote:
>> * Jens Axboe<[email protected]> wrote:
>>
>>> On Tue, Sep 08 2009, Peter Zijlstra wrote:
>>>> On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
>>>>> And here's a newer version.
>>>>
>>>> I tinkered a bit with your proglet and finally found the
>>>> problem.
>>>>
>>>> You used a single pipe per child, this means the loop in
>>>> run_child() would consume what it just wrote out until it got
>>>> force preempted by the parent which would also get woken.
>>>>
>>>> This results in the child spinning a while (its full quota) and
>>>> only reporting the last timestamp to the parent.
>>>
>>> Oh doh, that's not well thought out. Well it was a quick hack :-)
>>> Thanks for the fixup, now it's at least usable to some degree.
>>
>> What kind of latencies does it report on your box?
>>
>> Our vanilla scheduler default latency targets are:
>>
>> single-core: 20 msecs
>> dual-core: 40 msecs
>> quad-core: 60 msecs
>> opto-core: 80 msecs
>>
>> You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
>> /proc/sys/kernel/sched_latency_ns:
>>
>> echo 10000000> /proc/sys/kernel/sched_latency_ns
>
> He would also need to lower min_granularity, otherwise, it'd be larger
> than the whole latency target.

Thank you for mentioning min_granularity. After:

echo 10000000 > /proc/sys/kernel/sched_latency_ns
echo 2000000 > /proc/sys/kernel/sched_min_granularity_ns

I can clearly see an improvement: animations that are supposed to be
fluid "skip" much less now, and in one case (simply moving the video
window around) the skips have been eliminated completely. However, there
seems to be a side effect of having CONFIG_SCHED_DEBUG enabled; things
seem to be generally a tad more "jerky" with that option enabled, even
when not touching the latency and granularity defaults.

I'll try the patch you posted and see if this further improves things.

2009-09-09 09:10:08

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, Sep 09 2009, Mike Galbraith wrote:
> On Wed, 2009-09-09 at 08:13 +0200, Ingo Molnar wrote:
> > * Jens Axboe <[email protected]> wrote:
> >
> > > On Tue, Sep 08 2009, Peter Zijlstra wrote:
> > > > On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> > > > > And here's a newer version.
> > > >
> > > > I tinkered a bit with your proglet and finally found the
> > > > problem.
> > > >
> > > > You used a single pipe per child, this means the loop in
> > > > run_child() would consume what it just wrote out until it got
> > > > force preempted by the parent which would also get woken.
> > > >
> > > > This results in the child spinning a while (its full quota) and
> > > > only reporting the last timestamp to the parent.
> > >
> > > Oh doh, that's not well thought out. Well it was a quick hack :-)
> > > Thanks for the fixup, now it's at least usable to some degree.
> >
> > What kind of latencies does it report on your box?
> >
> > Our vanilla scheduler default latency targets are:
> >
> > single-core: 20 msecs
> > dual-core: 40 msecs
> > quad-core: 60 msecs
> > opto-core: 80 msecs
> >
> > You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
> > /proc/sys/kernel/sched_latency_ns:
> >
> > echo 10000000 > /proc/sys/kernel/sched_latency_ns
>
> He would also need to lower min_granularity, otherwise, it'd be larger
> than the whole latency target.
>
> I'm testing right now, and one thing that is definitely a problem is the
> amount of sleeper fairness we're giving. A full latency is just too
> much short term fairness in my testing. While sleepers are catching up,
> hogs languish. That's the biggest issue going on.
>
> I've also been doing some timings of make -j4 (looking at idle time),
> and find that child_runs_first is mildly detrimental to fork/exec load,
> as are buddies.
>
> I'm running with the below at the moment. (the kthread/workqueue thing
> is just because I don't see any reason for it to exist, so consider it
> to be a waste of perfectly good math;)

Using latt, it seems better than -rc9. The below are entries logged
while running make -j128 on a 64 thread box. I did two runs on each, and
latt is using 8 clients.

-rc9
Max 23772 usec
Avg 1129 usec
Stdev 4328 usec
Stdev mean 117 usec

Max 32709 usec
Avg 1467 usec
Stdev 5095 usec
Stdev mean 136 usec

-rc9 + patch

Max 11561 usec
Avg 1532 usec
Stdev 1994 usec
Stdev mean 48 usec

Max 9590 usec
Avg 1550 usec
Stdev 2051 usec
Stdev mean 50 usec

max latency is way down, and much smaller variation as well.


--
Jens Axboe

2009-09-09 09:17:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, 2009-09-09 at 12:05 +0300, Nikos Chantziaras wrote:

> Thank you for mentioning min_granularity. After:
>
> echo 10000000 > /proc/sys/kernel/sched_latency_ns
> echo 2000000 > /proc/sys/kernel/sched_min_granularity_ns

You might also want to do:

echo 2000000 > /proc/sys/kernel/sched_wakeup_granularity_ns

That affects when a newly woken task will preempt an already running
task.

> I can clearly see an improvement: animations that are supposed to be
> fluid "skip" much less now, and in one occasion (simply moving the video
> window around) have been eliminated completely. However, there seems to
> be a side effect from having CONFIG_SCHED_DEBUG enabled; things seem to
> be generally a tad more "jerky" with that option enabled, even when not
> even touching the latency and granularity defaults.

There's more code in the scheduler with that enabled, but unless you've
got a terribly high context-switch rate it really shouldn't affect things.

Anyway, you can always poke at these numbers in the code, and like Mike
did, kill sched_init_granularity().


2009-09-09 09:18:46

by Mike Galbraith

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, 2009-09-09 at 11:02 +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-09 at 10:52 +0200, Mike Galbraith wrote:
> > @@ -1502,7 +1502,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
> > */
> > if (sched_feat(LAST_BUDDY) && likely(se->on_rq && curr != rq->idle))
> > set_last_buddy(se);
> > - set_next_buddy(pse);
> > + if (sched_feat(NEXT_BUDDY))
> > + set_next_buddy(pse);
> >
> > /*
> > * We can come here with TIF_NEED_RESCHED already set from new task
>
> You might want to test stuff like sysbench again, iirc we went on a
> cache-trashing rampage without buddies.
>
> Our goal is not to excel at any one load but to not suck at any one
> load.

Oh absolutely. I wouldn't want buddies disabled by default, I only
added the buddy knob to test effects on fork/exec.

I only posted to patch to give Jens something canned to try out.

-Mike

2009-09-09 09:40:27

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/09/2009 12:17 PM, Peter Zijlstra wrote:
> On Wed, 2009-09-09 at 12:05 +0300, Nikos Chantziaras wrote:
>
>> Thank you for mentioning min_granularity. After:
>>
>> echo 10000000> /proc/sys/kernel/sched_latency_ns
>> echo 2000000> /proc/sys/kernel/sched_min_granularity_ns
>
> You might also want to do:
>
> echo 2000000> /proc/sys/kernel/sched_wakeup_granularity_ns
>
> That affects when a newly woken task will preempt an already running
> task.

Lowering wakeup_granularity seems to make things worse in an interesting
way:

With low wakeup_granularity, the video itself will start skipping if I
move the window around. However, the window manager's effect of moving
a window around is smooth.

With high wakeup_granularity, the video itself will not skip while
moving the window around. But this time, the window manager's effect of
the window move is skippy.

(I should point out that only with the BFS-patched kernel can I have a
smooth video *and* a smooth window-moving effect at the same time.)
Mainline seems to prioritize one of the two according to whether
wakeup_granularity is raised or lowered. However, I have not tested
Mike's patch yet (but will do so ASAP.)

2009-09-09 09:53:21

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tue, 2009-09-08 at 22:22 +0200, Frans Pop wrote:
> Arjan van de Ven wrote:
> > the latest version of latencytop also has a GUI (thanks to Ben)
>
> That looks nice, but...
>
> I kind of miss the split screen feature where latencytop would show both
> the overall figures + the ones for the currently most affected task.
> Downside of that last was that I never managed to keep the display on a
> specific task.

Any idea of how to present it ? I'm happy to spend 5mn improving the
GUI :-)

> The graphical display also makes it impossible to simply copy and paste
> the results.

Ah, that's right. I'm not 100% sure how to do that (these are my first
experiments with gtk). I suppose I could try to do some kind of "snapshot"
feature which saves the results in textual form.

> Having the freeze button is nice though.
>
> Would it be possible to have a command line switch that allows to start
> the old textual mode?

It's there iirc. --nogui :-)

Cheers,
Ben.

> Looks like the man page needs updating too :-)
>
> Cheers,
> FJP
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2009-09-09 10:18:00

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/09/2009 12:40 PM, Nikos Chantziaras wrote:
> On 09/09/2009 12:17 PM, Peter Zijlstra wrote:
>> On Wed, 2009-09-09 at 12:05 +0300, Nikos Chantziaras wrote:
>>
>>> Thank you for mentioning min_granularity. After:
>>>
>>> echo 10000000> /proc/sys/kernel/sched_latency_ns
>>> echo 2000000> /proc/sys/kernel/sched_min_granularity_ns
>>
>> You might also want to do:
>>
>> echo 2000000> /proc/sys/kernel/sched_wakeup_granularity_ns
>>
>> That affects when a newly woken task will preempt an already running
>> task.
>
> Lowering wakeup_granularity seems to make things worse in an interesting
> way:
>
> With low wakeup_granularity, the video itself will start skipping if I
> move the window around. However, the window manager's effect of moving a
> window around is smooth.
>
> With high wakeup_granularity, the video itself will not skip while
> moving the window around. But this time, the window manager's effect of
> the window move is skippy.
>
> (I should point out that only with the BFS-patched kernel can I have a
> smooth video *and* a smooth window-moving effect at the same time.)
> Mainline seems to prioritize one of the two according to whether
> wakeup_granularity is raised or lowered. However, I have not tested
> Mike's patch yet (but will do so ASAP.)

I've tested Mike's patch and it achieves the same effect as raising
sched_min_granularity.

To sum it up:

By testing various values for sched_latency_ns, sched_min_granularity_ns
and sched_wakeup_granularity_ns, I can achieve three results:

1. Fluid animations for the foreground app, skippy ones for
the rest (video plays nicely, rest of the desktop lags.)

2. Fluid animations for the background apps, a skippy one for
the one in the foreground (desktop behaves nicely, video lags.)

3. Equally skippy/jerky behavior for all of them.

Unfortunately, a "4. Equally fluid behavior for all of them" cannot be
achieved with mainline, unless I missed some other tweak.

2009-09-09 11:14:32

by David Newall

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Benjamin Herrenschmidt wrote:
> On Tue, 2009-09-08 at 22:22 +0200, Frans Pop wrote:
>
>> Arjan van de Ven wrote:
>>
>>> the latest version of latencytop also has a GUI (thanks to Ben)
>>>
>> That looks nice, but...
>>
>> I kind of miss the split screen feature where latencytop would show both
>> the overall figures + the ones for the currently most affected task.
>> Downside of that last was that I never managed to keep the display on a
>> specific task.
>>
>
> Any idea of how to present it ? I'm happy to spend 5mn improving the
> GUI :-)

Use a second window.

2009-09-09 11:33:18

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, 2009-09-09 at 20:44 +0930, David Newall wrote:
> Benjamin Herrenschmidt wrote:
> > On Tue, 2009-09-08 at 22:22 +0200, Frans Pop wrote:
> >
> >> Arjan van de Ven wrote:
> >>
> >>> the latest version of latencytop also has a GUI (thanks to Ben)
> >>>
> >> That looks nice, but...
> >>
> >> I kind of miss the split screen feature where latencytop would show both
> >> the overall figures + the ones for the currently most affected task.
> >> Downside of that last was that I never managed to keep the display on a
> >> specific task.
> >>
> >
> > Any idea of how to present it ? I'm happy to spend 5mn improving the
> > GUI :-)
>
> Use a second window.

I'm not too much of a fan of cluttering the screen with windows... I
suppose I could have a separate pane for the "global" view, but I
haven't found a way to lay it out in a way that doesn't suck :-) I
could have added a 3rd column on the right with the overall view, but
it felt like using too much screen real estate.

I'll experiment a bit, maybe 2 windows is indeed the solution. But you
get into the problem of what to do if only one of them is closed? Do I
add a menu bar on each of them to re-open the "other" one if closed?
etc...

Don't get me wrong, I have a shitload of experience doing GUIs (back in
the old days when I was hacking on MacOS), though I'm relatively new to
GTK. But GUI design is rather hard in general :-)

Ben.

2009-09-09 11:52:05

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/08/2009 06:23 PM, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
>> And here's a newer version.
>
> I tinkered a bit with your proglet and finally found the problem.
>
> You used a single pipe per child, this means the loop in run_child()
> would consume what it just wrote out until it got force preempted by the
> parent which would also get woken.
>
> This results in the child spinning a while (its full quota) and only
> reporting the last timestamp to the parent.
>
> Since consumer (parent) is a single thread the program basically
> measures the worst delay in a thundering herd wakeup of N children.
>
> The below version yields:
>
> idle
>
> [root@opteron sched]# ./latt -c8 sleep 30
> Entries: 664 (clients=8)
>
> Averages:
> ------------------------------
> Max 128 usec
> Avg 26 usec
> Stdev 16 usec
>
>
> make -j4
>
> [root@opteron sched]# ./latt -c8 sleep 30
> Entries: 648 (clients=8)
>
> Averages:
> ------------------------------
> Max 20861 usec
> Avg 3763 usec
> Stdev 4637 usec
>
>
> Mike's patch, make -j4
>
> [root@opteron sched]# ./latt -c8 sleep 30
> Entries: 648 (clients=8)
>
> Averages:
> ------------------------------
> Max 17854 usec
> Avg 6298 usec
> Stdev 4735 usec

I've run two tests with this tool. One with mainline (2.6.31-rc9) and
one patched with 2.6.31-rc9-sched-bfs-210.patch.

Before running this test, I disabled the cron daemon so that nothing
would pop up in the background all of a sudden.

The test consisted of starting a "make -j2" in the kernel tree inside a
3GB tmpfs mountpoint and then running 'latt "mplayer -vo gl2 -framedrop
videofile.mkv"' (mplayer in this case is a single-threaded
application.) Caches were warmed up first; the results below are from
the second run of each test.

The kernel .config file used by the running kernels and also for "make
-j2" is:

http://foss.math.aegean.gr/~realnc/kernel/config-2.6.31-rc9-latt-test

The video file used for mplayer is:

http://foss.math.aegean.gr/~realnc/vids/3DMark2000.mkv (100MB)
(The reason this was used is that it's a 60FPS video,
therefore very smooth and makes all skips stand out
clearly.)
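
Roughly, the test sequence was the equivalent of the following (the
mount point and source-tree paths here are just placeholders):

  mount -t tmpfs -o size=3G tmpfs /mnt/build
  cp -a linux-2.6.31-rc9 /mnt/build/ && cd /mnt/build/linux-2.6.31-rc9
  make -j2 &
  # then, while the build runs, latt wraps the video playback:
  latt "mplayer -vo gl2 -framedrop videofile.mkv"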


Results for mainline:

Averages:
------------------------------
Max 29930 usec
Avg 11043 usec
Stdev 5752 usec


Results for BFS:

Averages:
------------------------------
Max 14017 usec
Avg 49 usec
Stdev 697 usec


One thing that's worth noting is that with mainline, mplayer would
occasionally spit this out:

YOUR SYSTEM IS TOO SLOW TO PLAY THIS

which doesn't happen with BFS.

2009-09-09 11:54:28

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, Sep 09 2009, Jens Axboe wrote:
> On Wed, Sep 09 2009, Mike Galbraith wrote:
> > On Wed, 2009-09-09 at 08:13 +0200, Ingo Molnar wrote:
> > > * Jens Axboe <[email protected]> wrote:
> > >
> > > > On Tue, Sep 08 2009, Peter Zijlstra wrote:
> > > > > On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> > > > > > And here's a newer version.
> > > > >
> > > > > I tinkered a bit with your proglet and finally found the
> > > > > problem.
> > > > >
> > > > > You used a single pipe per child, this means the loop in
> > > > > run_child() would consume what it just wrote out until it got
> > > > > force preempted by the parent which would also get woken.
> > > > >
> > > > > This results in the child spinning a while (its full quota) and
> > > > > only reporting the last timestamp to the parent.
> > > >
> > > > Oh doh, that's not well thought out. Well it was a quick hack :-)
> > > > Thanks for the fixup, now it's at least usable to some degree.
> > >
> > > What kind of latencies does it report on your box?
> > >
> > > Our vanilla scheduler default latency targets are:
> > >
> > > single-core: 20 msecs
> > > dual-core: 40 msecs
> > > quad-core: 60 msecs
> > > opto-core: 80 msecs
> > >
> > > You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
> > > /proc/sys/kernel/sched_latency_ns:
> > >
> > > echo 10000000 > /proc/sys/kernel/sched_latency_ns
> >
> > He would also need to lower min_granularity, otherwise, it'd be larger
> > than the whole latency target.
> >
> > I'm testing right now, and one thing that is definitely a problem is the
> > amount of sleeper fairness we're giving. A full latency is just too
> > much short term fairness in my testing. While sleepers are catching up,
> > hogs languish. That's the biggest issue going on.
> >
> > I've also been doing some timings of make -j4 (looking at idle time),
> > and find that child_runs_first is mildly detrimental to fork/exec load,
> > as are buddies.
> >
> > I'm running with the below at the moment. (the kthread/workqueue thing
> > is just because I don't see any reason for it to exist, so consider it
> > to be a waste of perfectly good math;)
>
> Using latt, it seems better than -rc9. The below are entries logged
> while running make -j128 on a 64 thread box. I did two runs on each, and
> latt is using 8 clients.
>
> -rc9
> Max 23772 usec
> Avg 1129 usec
> Stdev 4328 usec
> Stdev mean 117 usec
>
> Max 32709 usec
> Avg 1467 usec
> Stdev 5095 usec
> Stdev mean 136 usec
>
> -rc9 + patch
>
> Max 11561 usec
> Avg 1532 usec
> Stdev 1994 usec
> Stdev mean 48 usec
>
> Max 9590 usec
> Avg 1550 usec
> Stdev 2051 usec
> Stdev mean 50 usec
>
> max latency is way down, and much smaller variation as well.

Things are much better with this patch on the notebook! I cannot compare
with BFS as that still doesn't run anywhere I want it to run, but it's
way better than -rc9-git stock. latt numbers on the notebook have 1/3
the max latency, average is lower, and stddev is much smaller too.

--
Jens Axboe

2009-09-09 11:55:56

by Frans Pop

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wednesday 09 September 2009, Benjamin Herrenschmidt wrote:
> On Tue, 2009-09-08 at 22:22 +0200, Frans Pop wrote:
> > Arjan van de Ven wrote:
> > > the latest version of latencytop also has a GUI (thanks to Ben)
> >
> > That looks nice, but...
> >
> > I kind of miss the split screen feature where latencytop would show
> > both the overall figures + the ones for the currently most affected
> > task. Downside of that last was that I never managed to keep the
> > display on a specific task.
>
> Any idea of how to present it ? I'm happy to spend 5mn improving the
> GUI :-)

I'd say add an extra horizontal split in the second column, so you'd get
three areas in the right column:
- top for the global target (permanently)
- middle for current, either:
- "current most lagging" if "Global" is selected in left column
- selected process if a specific target is selected in left column
- bottom for backtrace

Maybe with that setup "Global" in the left column should be renamed to
something like "Dynamic".

The backtrace area would show selection from either top or middle areas
(so selecting a cause in top or middle area should unselect causes in the
other).

Cheers,
FJP

2009-09-09 12:20:06

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, Sep 09 2009, Jens Axboe wrote:
> On Wed, Sep 09 2009, Jens Axboe wrote:
> > On Wed, Sep 09 2009, Mike Galbraith wrote:
> > > On Wed, 2009-09-09 at 08:13 +0200, Ingo Molnar wrote:
> > > > * Jens Axboe <[email protected]> wrote:
> > > >
> > > > > On Tue, Sep 08 2009, Peter Zijlstra wrote:
> > > > > > On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> > > > > > > And here's a newer version.
> > > > > >
> > > > > > I tinkered a bit with your proglet and finally found the
> > > > > > problem.
> > > > > >
> > > > > > You used a single pipe per child, this means the loop in
> > > > > > run_child() would consume what it just wrote out until it got
> > > > > > force preempted by the parent which would also get woken.
> > > > > >
> > > > > > This results in the child spinning a while (its full quota) and
> > > > > > only reporting the last timestamp to the parent.
> > > > >
> > > > > Oh doh, that's not well thought out. Well it was a quick hack :-)
> > > > > Thanks for the fixup, now it's at least usable to some degree.
> > > >
> > > > What kind of latencies does it report on your box?
> > > >
> > > > Our vanilla scheduler default latency targets are:
> > > >
> > > > single-core: 20 msecs
> > > > dual-core: 40 msecs
> > > > quad-core: 60 msecs
> > > > opto-core: 80 msecs
> > > >
> > > > You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
> > > > /proc/sys/kernel/sched_latency_ns:
> > > >
> > > > echo 10000000 > /proc/sys/kernel/sched_latency_ns
> > >
> > > He would also need to lower min_granularity, otherwise, it'd be larger
> > > than the whole latency target.
> > >
> > > I'm testing right now, and one thing that is definitely a problem is the
> > > amount of sleeper fairness we're giving. A full latency is just too
> > > much short term fairness in my testing. While sleepers are catching up,
> > > hogs languish. That's the biggest issue going on.
> > >
> > > I've also been doing some timings of make -j4 (looking at idle time),
> > > and find that child_runs_first is mildly detrimental to fork/exec load,
> > > as are buddies.
> > >
> > > I'm running with the below at the moment. (the kthread/workqueue thing
> > > is just because I don't see any reason for it to exist, so consider it
> > > to be a waste of perfectly good math;)
> >
> > Using latt, it seems better than -rc9. The below are entries logged
> > while running make -j128 on a 64 thread box. I did two runs on each, and
> > latt is using 8 clients.
> >
> > -rc9
> > Max 23772 usec
> > Avg 1129 usec
> > Stdev 4328 usec
> > Stdev mean 117 usec
> >
> > Max 32709 usec
> > Avg 1467 usec
> > Stdev 5095 usec
> > Stdev mean 136 usec
> >
> > -rc9 + patch
> >
> > Max 11561 usec
> > Avg 1532 usec
> > Stdev 1994 usec
> > Stdev mean 48 usec
> >
> > Max 9590 usec
> > Avg 1550 usec
> > Stdev 2051 usec
> > Stdev mean 50 usec
> >
> > max latency is way down, and much smaller variation as well.
>
> Things are much better with this patch on the notebook! I cannot compare
> with BFS as that still doesn't run anywhere I want it to run, but it's
> way better than -rc9-git stock. latt numbers on the notebook have 1/3
> the max latency, average is lower, and stddev is much smaller too.

BFS210 runs on the laptop (dual core intel core duo). With make -j4
running, I clock the following latt -c8 'sleep 10' latencies:

-rc9

Max 17895 usec
Avg 8028 usec
Stdev 5948 usec
Stdev mean 405 usec

Max 17896 usec
Avg 4951 usec
Stdev 6278 usec
Stdev mean 427 usec

Max 17885 usec
Avg 5526 usec
Stdev 6819 usec
Stdev mean 464 usec

-rc9 + mike

Max 6061 usec
Avg 3797 usec
Stdev 1726 usec
Stdev mean 117 usec

Max 5122 usec
Avg 3958 usec
Stdev 1697 usec
Stdev mean 115 usec

Max 6691 usec
Avg 2130 usec
Stdev 2165 usec
Stdev mean 147 usec

-rc9 + bfs210

Max 92 usec
Avg 27 usec
Stdev 19 usec
Stdev mean 1 usec

Max 80 usec
Avg 23 usec
Stdev 15 usec
Stdev mean 1 usec

Max 97 usec
Avg 27 usec
Stdev 21 usec
Stdev mean 1 usec

One thing I also noticed is that when I have logged in, I run xmodmap
manually to load some keymappings (I always tell myself to add this to
the login scripts, but I suspend/resume this laptop for weeks at a
time and forget before the next boot). With the stock kernel, xmodmap
will halt X updates and take forever to run. With BFS, it returned
instantly, as I would expect.

So the BFS design may be lacking on the scalability end (which is
obviously true, if you look at the code), but I can understand the
appeal of the scheduler for "normal" desktop people.

--
Jens Axboe

2009-09-09 12:48:39

by Mike Galbraith

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, 2009-09-09 at 13:54 +0200, Jens Axboe wrote:

> Things are much better with this patch on the notebook! I cannot compare
> with BFS as that still doesn't run anywhere I want it to run, but it's
> way better than -rc9-git stock. latt numbers on the notebook have 1/3
> the max latency, average is lower, and stddev is much smaller too.

That patch has a bit of bustage in it.

We definitely want to turn down sched_latency though, and LAST_BUDDY
also wants some examination it seems.

taskset -c 3 ./xx 1 (a 100% CPU hog that measures perturbation at 1 sec intervals; "overhead" is the CPU time it is not getting)
xx says
2392.52 MHZ CPU
perturbation threshold 0.057 usecs.
...
'nuther terminal
taskset -c 3 make -j2 vmlinux

xx output

current (fixed breakage) patched tip tree
pert/s: 153 >18842.18us: 11 min: 0.50 max:36010.37 avg:4354.06 sum/s:666171us overhead:66.62%
pert/s: 160 >18767.18us: 12 min: 0.13 max:32011.66 avg:4172.69 sum/s:667631us overhead:66.66%
pert/s: 156 >18499.43us: 9 min: 0.13 max:27883.24 avg:4296.08 sum/s:670189us overhead:66.49%
pert/s: 146 >18480.71us: 10 min: 0.50 max:32009.38 avg:4615.19 sum/s:673818us overhead:67.26%
pert/s: 154 >18433.20us: 17 min: 0.14 max:31537.12 avg:4474.14 sum/s:689018us overhead:67.68%
pert/s: 158 >18520.11us: 9 min: 0.50 max:34328.86 avg:4275.66 sum/s:675554us overhead:66.76%
pert/s: 154 >18683.74us: 12 min: 0.51 max:35949.23 avg:4363.67 sum/s:672005us overhead:67.04%
pert/s: 154 >18745.53us: 8 min: 0.51 max:34203.43 avg:4399.72 sum/s:677556us overhead:67.03%

bfs209
pert/s: 124 >18681.88us: 17 min: 0.15 max:27274.74 avg:4627.36 sum/s:573793us overhead:56.70%
pert/s: 106 >18702.52us: 20 min: 0.55 max:32022.07 avg:5754.48 sum/s:609975us overhead:59.80%
pert/s: 116 >19082.42us: 17 min: 0.15 max:39835.34 avg:5167.69 sum/s:599452us overhead:59.95%
pert/s: 109 >19289.41us: 22 min: 0.14 max:36818.95 avg:5485.79 sum/s:597951us overhead:59.64%
pert/s: 108 >19238.97us: 19 min: 0.14 max:32026.74 avg:5543.17 sum/s:598662us overhead:59.87%
pert/s: 106 >19415.76us: 20 min: 0.54 max:36011.78 avg:6001.89 sum/s:636201us overhead:62.95%
pert/s: 115 >19341.89us: 16 min: 0.08 max:32040.83 avg:5313.45 sum/s:611047us overhead:59.98%
pert/s: 101 >19527.53us: 24 min: 0.14 max:36018.37 avg:6378.06 sum/s:644184us overhead:64.42%

stock tip (ouch ouch ouch)
pert/s: 153 >48453.23us: 5 min: 0.12 max:144009.85 avg:4688.90 sum/s:717401us overhead:70.89%
pert/s: 172 >47209.49us: 3 min: 0.48 max:68009.05 avg:4022.55 sum/s:691879us overhead:67.05%
pert/s: 148 >51139.18us: 5 min: 0.53 max:168094.76 avg:4918.14 sum/s:727885us overhead:71.65%
pert/s: 171 >51350.64us: 6 min: 0.12 max:102202.79 avg:4304.77 sum/s:736115us overhead:69.24%
pert/s: 153 >57686.54us: 5 min: 0.12 max:224019.85 avg:5399.31 sum/s:826094us overhead:74.50%
pert/s: 172 >55886.47us: 2 min: 0.11 max:75378.18 avg:3993.52 sum/s:686885us overhead:67.67%
pert/s: 157 >58819.31us: 3 min: 0.12 max:165976.63 avg:4453.16 sum/s:699146us overhead:69.91%
pert/s: 149 >58410.21us: 5 min: 0.12 max:104663.89 avg:4792.73 sum/s:714116us overhead:71.41%

sched_latency=20ms min_granularity=4ms
pert/s: 162 >30152.07us: 2 min: 0.49 max:60011.85 avg:4272.97 sum/s:692221us overhead:68.13%
pert/s: 147 >29705.33us: 8 min: 0.14 max:46577.27 avg:4792.03 sum/s:704428us overhead:70.44%
pert/s: 162 >29344.16us: 2 min: 0.49 max:48010.50 avg:4176.75 sum/s:676633us overhead:67.40%
pert/s: 155 >29109.69us: 2 min: 0.49 max:49575.08 avg:4423.87 sum/s:685700us overhead:68.30%
pert/s: 153 >30627.66us: 3 min: 0.13 max:84005.71 avg:4573.07 sum/s:699680us overhead:69.42%
pert/s: 142 >30652.47us: 5 min: 0.49 max:56760.06 avg:4991.61 sum/s:708808us overhead:70.88%
pert/s: 152 >30101.12us: 2 min: 0.49 max:45757.88 avg:4519.92 sum/s:687028us overhead:67.89%
pert/s: 161 >29303.50us: 3 min: 0.12 max:40011.73 avg:4238.15 sum/s:682342us overhead:67.43%

NO_LAST_BUDDY
pert/s: 154 >15257.87us: 28 min: 0.13 max:42004.05 avg:4590.99 sum/s:707013us overhead:70.41%
pert/s: 162 >15392.05us: 34 min: 0.12 max:29021.79 avg:4177.47 sum/s:676750us overhead:66.81%
pert/s: 162 >15665.11us: 33 min: 0.13 max:32008.34 avg:4237.10 sum/s:686410us overhead:67.90%
pert/s: 159 >15914.89us: 31 min: 0.56 max:32056.86 avg:4268.87 sum/s:678751us overhead:67.47%
pert/s: 166 >15858.94us: 26 min: 0.13 max:26655.84 avg:4055.02 sum/s:673134us overhead:66.65%
pert/s: 165 >15878.96us: 32 min: 0.13 max:28010.44 avg:4107.86 sum/s:677798us overhead:66.68%
pert/s: 164 >16213.55us: 29 min: 0.14 max:34263.04 avg:4186.64 sum/s:686610us overhead:68.04%
pert/s: 149 >16764.54us: 20 min: 0.13 max:38688.64 avg:4758.26 sum/s:708981us overhead:70.23%

2009-09-09 15:39:42

by Mike Galbraith

[permalink] [raw]
Subject: [tip:sched/core] sched: Turn off child_runs_first

Commit-ID: 2bba22c50b06abe9fd0d23933b1e64d35b419262
Gitweb: http://git.kernel.org/tip/2bba22c50b06abe9fd0d23933b1e64d35b419262
Author: Mike Galbraith <[email protected]>
AuthorDate: Wed, 9 Sep 2009 15:41:37 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 9 Sep 2009 17:30:05 +0200

sched: Turn off child_runs_first

Set child_runs_first default to off.

It hurts 'optimal' make -j<NR_CPUS> workloads as make jobs
get preempted by child tasks, reducing parallelism.

Note, this patch might make existing races in user
applications more prominent than before - so breakages
might be bisected to this commit.

Child-runs-first is broken on SMP to begin with, and we
already had it off briefly in v2.6.23 so most of the
offenders ought to be fixed. Would be nice not to revert
this commit but fix those apps finally ...

Signed-off-by: Mike Galbraith <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
[ made the sysctl independent of CONFIG_SCHED_DEBUG, in case
people want to work around broken apps. ]
Signed-off-by: Ingo Molnar <[email protected]>


---
include/linux/sched.h | 2 +-
kernel/sched_fair.c | 4 ++--
kernel/sysctl.c | 16 ++++++++--------
3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3b7f43e..3a50e82 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1820,8 +1820,8 @@ extern unsigned int sysctl_sched_min_granularity;
extern unsigned int sysctl_sched_wakeup_granularity;
extern unsigned int sysctl_sched_shares_ratelimit;
extern unsigned int sysctl_sched_shares_thresh;
-#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_child_runs_first;
+#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_features;
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index e386e5d..af325a3 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -48,10 +48,10 @@ unsigned int sysctl_sched_min_granularity = 4000000ULL;
static unsigned int sched_nr_latency = 5;

/*
- * After fork, child runs first. (default) If set to 0 then
+ * After fork, child runs first. If set to 0 (default) then
* parent will (try to) run first.
*/
-const_debug unsigned int sysctl_sched_child_runs_first = 1;
+unsigned int sysctl_sched_child_runs_first __read_mostly;

/*
* sys_sched_yield() compat mode
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6c9836e..25d6bf3 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -246,6 +246,14 @@ static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
#endif

static struct ctl_table kern_table[] = {
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_child_runs_first",
+ .data = &sysctl_sched_child_runs_first,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#ifdef CONFIG_SCHED_DEBUG
{
.ctl_name = CTL_UNNUMBERED,
@@ -300,14 +308,6 @@ static struct ctl_table kern_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
- .procname = "sched_child_runs_first",
- .data = &sysctl_sched_child_runs_first,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = &proc_dointvec,
- },
- {
- .ctl_name = CTL_UNNUMBERED,
.procname = "sched_features",
.data = &sysctl_sched_features,
.maxlen = sizeof(unsigned int),

2009-09-09 15:39:47

by Mike Galbraith

[permalink] [raw]
Subject: [tip:sched/core] sched: Re-tune the scheduler latency defaults to decrease worst-case latencies

Commit-ID: 172e082a9111ea504ee34cbba26284a5ebdc53a7
Gitweb: http://git.kernel.org/tip/172e082a9111ea504ee34cbba26284a5ebdc53a7
Author: Mike Galbraith <[email protected]>
AuthorDate: Wed, 9 Sep 2009 15:41:37 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 9 Sep 2009 17:30:06 +0200

sched: Re-tune the scheduler latency defaults to decrease worst-case latencies

Reduce the latency target from 20 msecs to 5 msecs.

Why? Larger latencies increase spread, which is good for scaling,
but bad for worst case latency.

We still have the ilog(nr_cpus) rule to scale up on bigger
server boxes.

Signed-off-by: Mike Galbraith <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


---
kernel/sched_fair.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index af325a3..26fadb4 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -24,7 +24,7 @@

/*
* Targeted preemption latency for CPU-bound tasks:
- * (default: 20ms * (1 + ilog(ncpus)), units: nanoseconds)
+ * (default: 5ms * (1 + ilog(ncpus)), units: nanoseconds)
*
* NOTE: this latency value is not the same as the concept of
* 'timeslice length' - timeslices in CFS are of variable length
@@ -34,13 +34,13 @@
* (to see the precise effective timeslice length of your workload,
* run vmstat and monitor the context-switches (cs) field)
*/
-unsigned int sysctl_sched_latency = 20000000ULL;
+unsigned int sysctl_sched_latency = 5000000ULL;

/*
* Minimal preemption granularity for CPU-bound tasks:
- * (default: 4 msec * (1 + ilog(ncpus)), units: nanoseconds)
+ * (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
*/
-unsigned int sysctl_sched_min_granularity = 4000000ULL;
+unsigned int sysctl_sched_min_granularity = 1000000ULL;

/*
* is kept at sysctl_sched_latency / sysctl_sched_min_granularity
@@ -63,13 +63,13 @@ unsigned int __read_mostly sysctl_sched_compat_yield;

/*
* SCHED_OTHER wake-up granularity.
- * (default: 5 msec * (1 + ilog(ncpus)), units: nanoseconds)
+ * (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
*
* This option delays the preemption effects of decoupled workloads
* and reduces their over-scheduling. Synchronous workloads will still
* have immediate wakeup/sleep latencies.
*/
-unsigned int sysctl_sched_wakeup_granularity = 5000000UL;
+unsigned int sysctl_sched_wakeup_granularity = 1000000UL;

const_debug unsigned int sysctl_sched_migration_cost = 500000UL;

2009-09-09 15:39:37

by Mike Galbraith

[permalink] [raw]
Subject: [tip:sched/core] sched: Keep kthreads at default priority

Commit-ID: 61cbe54d9479ad98283b2dda686deae4c34b2d59
Gitweb: http://git.kernel.org/tip/61cbe54d9479ad98283b2dda686deae4c34b2d59
Author: Mike Galbraith <[email protected]>
AuthorDate: Wed, 9 Sep 2009 15:41:37 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 9 Sep 2009 17:30:06 +0200

sched: Keep kthreads at default priority

Removes kthread/workqueue priority boost, they increase worst-case
desktop latencies.

Signed-off-by: Mike Galbraith <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


---
kernel/kthread.c | 4 ----
kernel/workqueue.c | 2 --
2 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index eb8751a..5fe7099 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -16,8 +16,6 @@
#include <linux/mutex.h>
#include <trace/events/sched.h>

-#define KTHREAD_NICE_LEVEL (-5)
-
static DEFINE_SPINLOCK(kthread_create_lock);
static LIST_HEAD(kthread_create_list);
struct task_struct *kthreadd_task;
@@ -145,7 +143,6 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
* The kernel thread should not inherit these properties.
*/
sched_setscheduler_nocheck(create.result, SCHED_NORMAL, &param);
- set_user_nice(create.result, KTHREAD_NICE_LEVEL);
set_cpus_allowed_ptr(create.result, cpu_all_mask);
}
return create.result;
@@ -221,7 +218,6 @@ int kthreadd(void *unused)
/* Setup a clean context for our children to inherit. */
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
- set_user_nice(tsk, KTHREAD_NICE_LEVEL);
set_cpus_allowed_ptr(tsk, cpu_all_mask);
set_mems_allowed(node_possible_map);

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0668795..ea1b4e7 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -317,8 +317,6 @@ static int worker_thread(void *__cwq)
if (cwq->wq->freezeable)
set_freezable();

- set_user_nice(current, -5);
-
for (;;) {
prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
if (!freezing(current) &&

2009-09-09 15:52:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23


* Serge Belyshev <[email protected]> wrote:

> Serge Belyshev <[email protected]> writes:
> >[snip]
>
> I've updated the graphs, added kernels 2.6.24..2.6.29:
> http://img186.imageshack.us/img186/7029/epicmakej4.png
>
> And added comparison with best-performing 2.6.23 kernel:
> http://img34.imageshack.us/img34/7563/epicbfstips.png

Thanks!

I think we found the reason for that regression - would you mind
to re-test with latest -tip, e157986 or later?

If that works for you i'll describe our theory.

Ingo

2009-09-09 17:02:45

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Keep kthreads at default priority

On Wed, Sep 09, 2009 at 03:37:34PM +0000, tip-bot for Mike Galbraith wrote:
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index eb8751a..5fe7099 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -16,8 +16,6 @@
> #include <linux/mutex.h>
> #include <trace/events/sched.h>
>
> -#define KTHREAD_NICE_LEVEL (-5)
> -

Why don't we just redefine it to 0? We may find out later that we'd
still prefer to have kernel threads have boost.

--
Dmitry

2009-09-09 17:06:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Keep kthreads at default priority

On Wed, 2009-09-09 at 09:55 -0700, Dmitry Torokhov wrote:
> On Wed, Sep 09, 2009 at 03:37:34PM +0000, tip-bot for Mike Galbraith wrote:
> >
> > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > index eb8751a..5fe7099 100644
> > --- a/kernel/kthread.c
> > +++ b/kernel/kthread.c
> > @@ -16,8 +16,6 @@
> > #include <linux/mutex.h>
> > #include <trace/events/sched.h>
> >
> > -#define KTHREAD_NICE_LEVEL (-5)
> > -
>
> Why don't we just redefine it to 0? We may find out later that we'd
> still prefer to have kernel threads have boost.

Seems sensible; also, the traditional reasoning behind this nice level is
that kernel threads do work on behalf of multiple tasks. It's a kind of
prio ceiling thing.

2009-09-09 17:34:58

by Mike Galbraith

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Keep kthreads at default priority

On Wed, 2009-09-09 at 19:06 +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-09 at 09:55 -0700, Dmitry Torokhov wrote:
> > On Wed, Sep 09, 2009 at 03:37:34PM +0000, tip-bot for Mike Galbraith wrote:
> > >
> > > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > > index eb8751a..5fe7099 100644
> > > --- a/kernel/kthread.c
> > > +++ b/kernel/kthread.c
> > > @@ -16,8 +16,6 @@
> > > #include <linux/mutex.h>
> > > #include <trace/events/sched.h>
> > >
> > > -#define KTHREAD_NICE_LEVEL (-5)
> > > -
> >
> > Why don't we just redefine it to 0? We may find out later that we'd
> > still prefer to have kernel threads have boost.
>
> Seems sensible, also the traditional reasoning behind this nice level is
> that kernel threads do work on behalf of multiple tasks. Its a kind of
> prio ceiling thing.

True. None of our current threads are heavy enough to matter much.

-Mike

2009-09-09 17:57:58

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Turn off child_runs_first

On Wed, Sep 09, 2009 at 03:37:07PM +0000, tip-bot for Mike Galbraith wrote:
> Commit-ID: 2bba22c50b06abe9fd0d23933b1e64d35b419262
> Gitweb: http://git.kernel.org/tip/2bba22c50b06abe9fd0d23933b1e64d35b419262
> Author: Mike Galbraith <[email protected]>
> AuthorDate: Wed, 9 Sep 2009 15:41:37 +0200
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Wed, 9 Sep 2009 17:30:05 +0200
>
> sched: Turn off child_runs_first
>
> Set child_runs_first default to off.
>
> It hurts 'optimal' make -j<NR_CPUS> workloads as make jobs
> get preempted by child tasks, reducing parallelism.

Wasn't one of the reasons we historically did child_runs_first that,
for fork/exec workloads, the child gets a chance to exec the new
process? If the parent runs first, then more pages will probably
need to be COW'ed.

- Ted

2009-09-09 18:04:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Jens Axboe <[email protected]> wrote:

> On Wed, Sep 09 2009, Jens Axboe wrote:
> > On Wed, Sep 09 2009, Jens Axboe wrote:
> > > On Wed, Sep 09 2009, Mike Galbraith wrote:
> > > > On Wed, 2009-09-09 at 08:13 +0200, Ingo Molnar wrote:
> > > > > * Jens Axboe <[email protected]> wrote:
> > > > >
> > > > > > On Tue, Sep 08 2009, Peter Zijlstra wrote:
> > > > > > > On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> > > > > > > > And here's a newer version.
> > > > > > >
> > > > > > > I tinkered a bit with your proglet and finally found the
> > > > > > > problem.
> > > > > > >
> > > > > > > You used a single pipe per child, this means the loop in
> > > > > > > run_child() would consume what it just wrote out until it got
> > > > > > > force preempted by the parent which would also get woken.
> > > > > > >
> > > > > > > This results in the child spinning a while (its full quota) and
> > > > > > > only reporting the last timestamp to the parent.
> > > > > >
> > > > > > Oh doh, that's not well thought out. Well it was a quick hack :-)
> > > > > > Thanks for the fixup, now it's at least usable to some degree.
> > > > >
> > > > > What kind of latencies does it report on your box?
> > > > >
> > > > > Our vanilla scheduler default latency targets are:
> > > > >
> > > > > single-core: 20 msecs
> > > > > dual-core: 40 msecs
> > > > > quad-core: 60 msecs
> > > > > opto-core: 80 msecs
> > > > >
> > > > > You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via
> > > > > /proc/sys/kernel/sched_latency_ns:
> > > > >
> > > > > echo 10000000 > /proc/sys/kernel/sched_latency_ns
> > > >
> > > > He would also need to lower min_granularity, otherwise, it'd be larger
> > > > than the whole latency target.
> > > >
> > > > I'm testing right now, and one thing that is definitely a problem is the
> > > > amount of sleeper fairness we're giving. A full latency is just too
> > > > much short term fairness in my testing. While sleepers are catching up,
> > > > hogs languish. That's the biggest issue going on.
> > > >
> > > > I've also been doing some timings of make -j4 (looking at idle time),
> > > > and find that child_runs_first is mildly detrimental to fork/exec load,
> > > > as are buddies.
> > > >
> > > > I'm running with the below at the moment. (the kthread/workqueue thing
> > > > is just because I don't see any reason for it to exist, so consider it
> > > > to be a waste of perfectly good math;)
> > >
> > > Using latt, it seems better than -rc9. The below are entries logged
> > > while running make -j128 on a 64 thread box. I did two runs on each, and
> > > latt is using 8 clients.
> > >
> > > -rc9
> > > Max 23772 usec
> > > Avg 1129 usec
> > > Stdev 4328 usec
> > > Stdev mean 117 usec
> > >
> > > Max 32709 usec
> > > Avg 1467 usec
> > > Stdev 5095 usec
> > > Stdev mean 136 usec
> > >
> > > -rc9 + patch
> > >
> > > Max 11561 usec
> > > Avg 1532 usec
> > > Stdev 1994 usec
> > > Stdev mean 48 usec
> > >
> > > Max 9590 usec
> > > Avg 1550 usec
> > > Stdev 2051 usec
> > > Stdev mean 50 usec
> > >
> > > max latency is way down, and much smaller variation as well.
> >
> > Things are much better with this patch on the notebook! I cannot compare
> > with BFS as that still doesn't run anywhere I want it to run, but it's
> > way better than -rc9-git stock. latt numbers on the notebook have 1/3
> > the max latency, average is lower, and stddev is much smaller too.
>
> BFS210 runs on the laptop (dual core intel core duo). With make -j4
> running, I clock the following latt -c8 'sleep 10' latencies:
>
> -rc9
>
> Max 17895 usec
> Avg 8028 usec
> Stdev 5948 usec
> Stdev mean 405 usec
>
> Max 17896 usec
> Avg 4951 usec
> Stdev 6278 usec
> Stdev mean 427 usec
>
> Max 17885 usec
> Avg 5526 usec
> Stdev 6819 usec
> Stdev mean 464 usec
>
> -rc9 + mike
>
> Max 6061 usec
> Avg 3797 usec
> Stdev 1726 usec
> Stdev mean 117 usec
>
> Max 5122 usec
> Avg 3958 usec
> Stdev 1697 usec
> Stdev mean 115 usec
>
> Max 6691 usec
> Avg 2130 usec
> Stdev 2165 usec
> Stdev mean 147 usec

At least in my tests these latencies were mainly due to a bug in
latt.c - i've attached the fixed version.

The other reason was wakeup batching. If you do this:

echo 0 > /proc/sys/kernel/sched_wakeup_granularity_ns

... then you can switch on insta-wakeups on -tip too.

With a dual-core box and a make -j4 background job running, on
latest -tip i get the following latencies:

$ ./latt -c8 sleep 30
Entries: 656 (clients=8)

Averages:
------------------------------
Max 158 usec
Avg 12 usec
Stdev 10 usec

Thanks,

Ingo


Attachments:
latt.c (8.85 kB)

2009-09-09 18:09:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Turn off child_runs_first


* Theodore Tso <[email protected]> wrote:

> On Wed, Sep 09, 2009 at 03:37:07PM +0000, tip-bot for Mike Galbraith wrote:
> > Commit-ID: 2bba22c50b06abe9fd0d23933b1e64d35b419262
> > Gitweb: http://git.kernel.org/tip/2bba22c50b06abe9fd0d23933b1e64d35b419262
> > Author: Mike Galbraith <[email protected]>
> > AuthorDate: Wed, 9 Sep 2009 15:41:37 +0200
> > Committer: Ingo Molnar <[email protected]>
> > CommitDate: Wed, 9 Sep 2009 17:30:05 +0200
> >
> > sched: Turn off child_runs_first
> >
> > Set child_runs_first default to off.
> >
> > It hurts 'optimal' make -j<NR_CPUS> workloads as make jobs
> > get preempted by child tasks, reducing parallelism.
>
> Wasn't one of the reasons why we historically did child_runs_first
> was so that for fork/exit workloads, the child has a chance to
> exec the new process? If the parent runs first, then more pages
> will probably need to be COW'ed.

That kind of workload should be using vfork() anyway, and be even
faster because it can avoid the fork overhead, right?

Also, on SMP we do that anyway - there's a good likelihood on an idle
system that we wake the child on the other core straight away.

Ingo

2009-09-09 19:01:26

by Chris Friesen

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Turn off child_runs_first

On 09/09/2009 12:08 PM, Ingo Molnar wrote:
>
> * Theodore Tso <[email protected]> wrote:

>> Wasn't one of the reasons why we historically did child_runs_first
>> was so that for fork/exit workloads, the child has a chance to
>> exec the new process? If the parent runs first, then more pages
>> will probably need to be COW'ed.
>
> That kind of workload should be using vfork() anyway, and be even
> faster because it can avoid the fork overhead, right?

According to my man page, POSIX.1-2008 removes the specification of
vfork().

Chris

2009-09-09 19:48:50

by Pavel Machek

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Turn off child_runs_first

Hi!

> > > It hurts 'optimal' make -j<NR_CPUS> workloads as make jobs
> > > get preempted by child tasks, reducing parallelism.
> >
> > Wasn't one of the reasons why we historically did child_runs_first
> > was so that for fork/exit workloads, the child has a chance to
> > exec the new process? If the parent runs first, then more pages
> > will probably need to be COW'ed.
>
> That kind of workload should be using vfork() anyway, and be even
> faster because it can avoid the fork overhead, right?

Well... one should not have to update userspace to keep
performance... and vfork is an extremely ugly interface.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-09 20:12:14

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/09/2009 09:04 PM, Ingo Molnar wrote:
> [...]
> * Jens Axboe<[email protected]> wrote:
>
>> On Wed, Sep 09 2009, Jens Axboe wrote:
>> [...]
>> BFS210 runs on the laptop (dual core intel core duo). With make -j4
>> running, I clock the following latt -c8 'sleep 10' latencies:
>>
>> -rc9
>>
>> Max 17895 usec
>> Avg 8028 usec
>> Stdev 5948 usec
>> Stdev mean 405 usec
>>
>> Max 17896 usec
>> Avg 4951 usec
>> Stdev 6278 usec
>> Stdev mean 427 usec
>>
>> Max 17885 usec
>> Avg 5526 usec
>> Stdev 6819 usec
>> Stdev mean 464 usec
>>
>> -rc9 + mike
>>
>> Max 6061 usec
>> Avg 3797 usec
>> Stdev 1726 usec
>> Stdev mean 117 usec
>>
>> Max 5122 usec
>> Avg 3958 usec
>> Stdev 1697 usec
>> Stdev mean 115 usec
>>
>> Max 6691 usec
>> Avg 2130 usec
>> Stdev 2165 usec
>> Stdev mean 147 usec
>
> At least in my tests these latencies were mainly due to a bug in
> latt.c - i've attached the fixed version.
>
> The other reason was wakeup batching. If you do this:
>
> echo 0> /proc/sys/kernel/sched_wakeup_granularity_ns
>
> ... then you can switch on insta-wakeups on -tip too.
>
> With a dual-core box and a make -j4 background job running, on
> latest -tip i get the following latencies:
>
> $ ./latt -c8 sleep 30
> Entries: 656 (clients=8)
>
> Averages:
> ------------------------------
> Max 158 usec
> Avg 12 usec
> Stdev 10 usec

With your version of latt.c, I get these results with 2.6-tip vs
2.6.31-rc9-bfs:


(mainline)
Averages:
------------------------------
Max 50 usec
Avg 12 usec
Stdev 3 usec


(BFS)
Averages:
------------------------------
Max 474 usec
Avg 11 usec
Stdev 16 usec


However, the interactivity problems still remain. Does that mean it's
not a latency issue?

2009-09-09 20:49:34

by Serge Belyshev

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23

Ingo Molnar <[email protected]> writes:

> Thanks!
>
> I think we found the reason for that regression - would you mind
> to re-test with latest -tip, e157986 or later?
>
> If that works for you i'll describe our theory.
>

Good job -- seems to work, thanks. Regression is still about 3% though:
http://img3.imageshack.us/img3/5335/epicbfstip.png

2009-09-09 20:50:42

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, Sep 09 2009, Nikos Chantziaras wrote:
> On 09/09/2009 09:04 PM, Ingo Molnar wrote:
>> [...]
>> * Jens Axboe<[email protected]> wrote:
>>
>>> On Wed, Sep 09 2009, Jens Axboe wrote:
>>> [...]
>>> BFS210 runs on the laptop (dual core intel core duo). With make -j4
>>> running, I clock the following latt -c8 'sleep 10' latencies:
>>>
>>> -rc9
>>>
>>> Max 17895 usec
>>> Avg 8028 usec
>>> Stdev 5948 usec
>>> Stdev mean 405 usec
>>>
>>> Max 17896 usec
>>> Avg 4951 usec
>>> Stdev 6278 usec
>>> Stdev mean 427 usec
>>>
>>> Max 17885 usec
>>> Avg 5526 usec
>>> Stdev 6819 usec
>>> Stdev mean 464 usec
>>>
>>> -rc9 + mike
>>>
>>> Max 6061 usec
>>> Avg 3797 usec
>>> Stdev 1726 usec
>>> Stdev mean 117 usec
>>>
>>> Max 5122 usec
>>> Avg 3958 usec
>>> Stdev 1697 usec
>>> Stdev mean 115 usec
>>>
>>> Max 6691 usec
>>> Avg 2130 usec
>>> Stdev 2165 usec
>>> Stdev mean 147 usec
>>
>> At least in my tests these latencies were mainly due to a bug in
>> latt.c - i've attached the fixed version.
>>
>> The other reason was wakeup batching. If you do this:
>>
>> echo 0> /proc/sys/kernel/sched_wakeup_granularity_ns
>>
>> ... then you can switch on insta-wakeups on -tip too.
>>
>> With a dual-core box and a make -j4 background job running, on
>> latest -tip i get the following latencies:
>>
>> $ ./latt -c8 sleep 30
>> Entries: 656 (clients=8)
>>
>> Averages:
>> ------------------------------
>> Max 158 usec
>> Avg 12 usec
>> Stdev 10 usec
>
> With your version of latt.c, I get these results with 2.6-tip vs
> 2.6.31-rc9-bfs:
>
>
> (mainline)
> Averages:
> ------------------------------
> Max 50 usec
> Avg 12 usec
> Stdev 3 usec
>
>
> (BFS)
> Averages:
> ------------------------------
> Max 474 usec
> Avg 11 usec
> Stdev 16 usec
>
>
> However, the interactivity problems still remain. Does that mean it's
> not a latency issue?

It probably just means that latt isn't a good measure of the problem.
Which isn't really too much of a surprise.

--
Jens Axboe

2009-09-09 21:23:22

by Cory Fields

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23

I've noticed the same regression since around 2.6.23, mainly in
multi-core video decoding. A git bisect reveals the guilty commit to
be: 33b0c4217dcd67b788318c3192a2912b530e4eef

It is easily visible because, with the guilty commit included, one core
of the CPU remains pegged while the other(s) are severely
underutilized.

Hope this helps

Cory Fields

2009-09-10 01:34:26

by Con Kolivas

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, 10 Sep 2009 06:50:43 Jens Axboe wrote:
> On Wed, Sep 09 2009, Nikos Chantziaras wrote:
> > On 09/09/2009 09:04 PM, Ingo Molnar wrote:
> >> [...]
> >>
> >> * Jens Axboe<[email protected]> wrote:
> >>> On Wed, Sep 09 2009, Jens Axboe wrote:
> >>> [...]
> >>> BFS210 runs on the laptop (dual core intel core duo). With make -j4
> >>> running, I clock the following latt -c8 'sleep 10' latencies:
> >>>
> >>> -rc9
> >>>
> >>> Max 17895 usec
> >>> Avg 8028 usec
> >>> Stdev 5948 usec
> >>> Stdev mean 405 usec
> >>>
> >>> Max 17896 usec
> >>> Avg 4951 usec
> >>> Stdev 6278 usec
> >>> Stdev mean 427 usec
> >>>
> >>> Max 17885 usec
> >>> Avg 5526 usec
> >>> Stdev 6819 usec
> >>> Stdev mean 464 usec
> >>>
> >>> -rc9 + mike
> >>>
> >>> Max 6061 usec
> >>> Avg 3797 usec
> >>> Stdev 1726 usec
> >>> Stdev mean 117 usec
> >>>
> >>> Max 5122 usec
> >>> Avg 3958 usec
> >>> Stdev 1697 usec
> >>> Stdev mean 115 usec
> >>>
> >>> Max 6691 usec
> >>> Avg 2130 usec
> >>> Stdev 2165 usec
> >>> Stdev mean 147 usec
> >>
> >> At least in my tests these latencies were mainly due to a bug in
> >> latt.c - i've attached the fixed version.
> >>
> >> The other reason was wakeup batching. If you do this:
> >>
> >> echo 0> /proc/sys/kernel/sched_wakeup_granularity_ns
> >>
> >> ... then you can switch on insta-wakeups on -tip too.
> >>
> >> With a dual-core box and a make -j4 background job running, on
> >> latest -tip i get the following latencies:
> >>
> >> $ ./latt -c8 sleep 30
> >> Entries: 656 (clients=8)
> >>
> >> Averages:
> >> ------------------------------
> >> Max 158 usec
> >> Avg 12 usec
> >> Stdev 10 usec
> >
> > With your version of latt.c, I get these results with 2.6-tip vs
> > 2.6.31-rc9-bfs:
> >
> >
> > (mainline)
> > Averages:
> > ------------------------------
> > Max 50 usec
> > Avg 12 usec
> > Stdev 3 usec
> >
> >
> > (BFS)
> > Averages:
> > ------------------------------
> > Max 474 usec
> > Avg 11 usec
> > Stdev 16 usec
> >
> >
> > However, the interactivity problems still remain. Does that mean it's
> > not a latency issue?
>
> It probably just means that latt isn't a good measure of the problem.
> Which isn't really too much of a surprise.

And that's a real shame because this was one of the first real good attempts
I've seen to actually measure the difference, and I thank you for your
efforts, Jens. I believe the reason it's limited is that all you're
measuring is the time from wakeup, and the test app isn't actually doing any work.
The issue is more than just waking up as fast as possible, it's then doing
some meaningful amount of work within a reasonable time frame as well. What
the "meaningful amount of work" and "reasonable time frame" are, remains a
mystery, but I guess could be added on to this testing app.

What does please me now, though, is that this message thread is finally
concentrating on what BFS was all about. The fact that it doesn't scale is no
mystery whatsoever. The fact that throughput and lack of scaling got
all the attention was missing the point entirely. To point that out I
used the bluntest response possible, because I know that works on lkml (does
it not?). Unfortunately I was so blunt that I ended up writing it in another
language: Troll. So for that, I apologise.

The unfortunate part is that BFS is still far from a working, complete state,
yet word got out that I had "released" something, which I had not, but
obviously there's no great distinction between putting something on a server
for testing, and a real release with an announce.

BFS is a scheduling experiment to demonstrate what effect the cpu scheduler
really has on the desktop and how it might be able to perform if we design
the scheduler for that one purpose.

It pleases me immensely to see that it has already spurred on a flood of
changes to the interactivity side of mainline development in its few days of
existence, including some ideas that BFS uses itself. That in itself, to me,
means it has already started to accomplish its goal, which ultimately, one
way or another, is to improve what the CPU scheduler can do for the linux
desktop. I can't track all the sensitive areas of the mainline kernel
scheduler changes without getting involved more deeply than I care to so it
would be counterproductive of me to try and hack on mainline. I much prefer
the quieter inbox.

If people want to use BFS for their own purposes or projects, or even better
help hack on it, that would make me happy for different reasons. I will
continue to work on my little project -in my own time- and hope that it
continues to drive further development of the mainline kernel in its own way.
We need more experiments like this to question what we currently have and
accept. Other major kernel subsystems are no exception.

Regards,
--
-ck

<code before rhetoric>

2009-09-10 03:15:29

by Mike Galbraith

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, 2009-09-09 at 23:12 +0300, Nikos Chantziaras wrote:

> With your version of latt.c, I get these results with 2.6-tip vs
> 2.6.31-rc9-bfs:
>
>
> (mainline)
> Averages:
> ------------------------------
> Max 50 usec
> Avg 12 usec
> Stdev 3 usec
>
>
> (BFS)
> Averages:
> ------------------------------
> Max 474 usec
> Avg 11 usec
> Stdev 16 usec
>
>
> However, the interactivity problems still remain. Does that mean it's
> not a latency issue?

Could be a fairness issue. If X+client needs more than its fair share
of CPU, there's nothing to do but use nice levels. I'm stuck with
unaccelerated X (nvidia card), so if I want a good DVD watching or
whatever eye-candy experience while my box does a lot of other work, I
either have to use SCHED_IDLE/nice for the background stuff, or renice
X. That's the down side of a fair scheduler.

There is another variant of latency-related interactivity issue for the
desktop though: too LOW latency. If X and clients are switching too
fast, redraw can look nasty - sliced/diced.

-Mike

2009-09-10 06:08:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Nikos Chantziaras <[email protected]> wrote:

> On 09/09/2009 09:04 PM, Ingo Molnar wrote:
>> [...]
>> * Jens Axboe<[email protected]> wrote:
>>
>>> On Wed, Sep 09 2009, Jens Axboe wrote:
>>> [...]
>>> BFS210 runs on the laptop (dual core intel core duo). With make -j4
>>> running, I clock the following latt -c8 'sleep 10' latencies:
>>>
>>> -rc9
>>>
>>> Max 17895 usec
>>> Avg 8028 usec
>>> Stdev 5948 usec
>>> Stdev mean 405 usec
>>>
>>> Max 17896 usec
>>> Avg 4951 usec
>>> Stdev 6278 usec
>>> Stdev mean 427 usec
>>>
>>> Max 17885 usec
>>> Avg 5526 usec
>>> Stdev 6819 usec
>>> Stdev mean 464 usec
>>>
>>> -rc9 + mike
>>>
>>> Max 6061 usec
>>> Avg 3797 usec
>>> Stdev 1726 usec
>>> Stdev mean 117 usec
>>>
>>> Max 5122 usec
>>> Avg 3958 usec
>>> Stdev 1697 usec
>>> Stdev mean 115 usec
>>>
>>> Max 6691 usec
>>> Avg 2130 usec
>>> Stdev 2165 usec
>>> Stdev mean 147 usec
>>
>> At least in my tests these latencies were mainly due to a bug in
>> latt.c - i've attached the fixed version.
>>
>> The other reason was wakeup batching. If you do this:
>>
>> echo 0> /proc/sys/kernel/sched_wakeup_granularity_ns
>>
>> ... then you can switch on insta-wakeups on -tip too.
>>
>> With a dual-core box and a make -j4 background job running, on
>> latest -tip i get the following latencies:
>>
>> $ ./latt -c8 sleep 30
>> Entries: 656 (clients=8)
>>
>> Averages:
>> ------------------------------
>> Max 158 usec
>> Avg 12 usec
>> Stdev 10 usec
>
> With your version of latt.c, I get these results with 2.6-tip vs
> 2.6.31-rc9-bfs:
>
>
> (mainline)
> Averages:
> ------------------------------
> Max 50 usec
> Avg 12 usec
> Stdev 3 usec
>
>
> (BFS)
> Averages:
> ------------------------------
> Max 474 usec
> Avg 11 usec
> Stdev 16 usec
>
> However, the interactivity problems still remain. Does that mean
> it's not a latency issue?

It means that Jens's test-app, which demonstrated and helped us fix
the issue for him, does not help us fix it for you just yet.

The "fluidity problem" you described might not be a classic latency
issue per se (which latt.c measures), but a timeslicing / CPU time
distribution problem.

A slight shift in CPU time allocation can change the flow of tasks
to result in a 'choppier' system.

Have you tried, in addition of the granularity tweaks you've done,
to renice mplayer either up or down? (or compiz and Xorg for that
matter)

I'm not necessarily suggesting this as a 'real' solution (we really
prefer kernels that just get it right) - but it's an additional
parameter dimension along which you can tweak CPU time distribution
on your box.

Here's the general rule of thumb: one nice level gives roughly plus 5%
CPU time to a task and takes away 5% CPU time from another task -
i.e. it shifts the CPU allocation by about 10%.

( this is modified by all sorts of dynamic conditions: by the number
of tasks running and their wakeup patterns, so it's not a rule cast in
stone - but still a good ballpark figure for CPU-intensive tasks. )
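
For example (a purely illustrative invocation - pick the direction and
amount that helps; negative nice values need root):

  # give mplayer two nice levels of priority:
  renice -n -2 -p $(pidof mplayer)
  # or, alternatively, push the background compile down instead:
  renice -n 5 -p $(pgrep make)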

Btw., i've read your descriptions about what you've tuned so far -
have you seen/checked the wakeup_granularity tunable as well?
Setting that to 0 will change the general balance of how CPU time is
allocated between tasks too.

There's also a whole bunch of scheduler features you can turn on/off
individually via /debug/sched_features. For example, to turn off
NEW_FAIR_SLEEPERS, you can do:

# cat /debug/sched_features
NEW_FAIR_SLEEPERS NO_NORMALIZED_SLEEPER ADAPTIVE_GRAN WAKEUP_PREEMPT
START_DEBIT AFFINE_WAKEUPS CACHE_HOT_BUDDY SYNC_WAKEUPS NO_HRTICK
NO_DOUBLE_TICK ASYM_GRAN LB_BIAS LB_WAKEUP_UPDATE ASYM_EFF_LOAD
NO_WAKEUP_OVERLAP LAST_BUDDY OWNER_SPIN

# echo NO_NEW_FAIR_SLEEPERS > /debug/sched_features

Btw., NO_NEW_FAIR_SLEEPERS is something that will turn the scheduler
into a more classic fair scheduler (like BFS is too).

NO_START_DEBIT might be another thing that improves (or worsens :-/)
make -j type of kernel build workloads.

Note, these flags are all runtime, the new settings take effect
almost immediately (and at the latest it takes effect when a task
has started up) and safe to do runtime.
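
A minimal example, using START_DEBIT as the guinea pig (the NO_ prefix
turns a feature off, writing the bare name turns it back on):

  echo NO_START_DEBIT > /debug/sched_features   # switch it off
  # ... re-run the workload, compare ...
  echo START_DEBIT > /debug/sched_features      # switch it back on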

It basically gives us 32768 pluggable schedulers, each with a
slightly different algorithm - each setting in essence creates a new
scheduler. (This mechanism is how we introduce new scheduler
features and allow their debugging / regression-testing.)

(okay, almost, so beware: turning on HRTICK might lock up your
system.)

Plus, yet another dimension of tuning on SMP systems (such as
dual-core) are the sched-domains tunable. There's a whole world of
tuning in that area and BFS essentially implements a very aggressive
'always balance to other CPUs' policy.

I've attached my sched-tune-domains script which helps tune these
parameters.

For example on a testbox of mine it outputs:

usage: tune-sched-domains <val>
{cpu0/domain0:SIBLING} SD flag: 239
+ 1: SD_LOAD_BALANCE: Do load balancing on this domain
+ 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
+ 4: SD_BALANCE_EXEC: Balance on exec
+ 8: SD_BALANCE_FORK: Balance on fork, clone
- 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
+ 32: SD_WAKE_AFFINE: Wake task to waking CPU
+ 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
+ 128: SD_SHARE_CPUPOWER: Domain members share cpu power
- 256: SD_POWERSAVINGS_BALANCE: Balance for power savings
- 512: SD_SHARE_PKG_RESOURCES: Domain members share cpu pkg resources
-1024: SD_SERIALIZE: Only a single load balancing instance
-2048: SD_WAKE_IDLE_FAR: Gain latency sacrificing cache hit
-4096: SD_PREFER_SIBLING: Prefer to place tasks in a sibling domain
{cpu0/domain1:MC} SD flag: 4735
+ 1: SD_LOAD_BALANCE: Do load balancing on this domain
+ 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
+ 4: SD_BALANCE_EXEC: Balance on exec
+ 8: SD_BALANCE_FORK: Balance on fork, clone
+ 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
+ 32: SD_WAKE_AFFINE: Wake task to waking CPU
+ 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
- 128: SD_SHARE_CPUPOWER: Domain members share cpu power
- 256: SD_POWERSAVINGS_BALANCE: Balance for power savings
+ 512: SD_SHARE_PKG_RESOURCES: Domain members share cpu pkg resources
-1024: SD_SERIALIZE: Only a single load balancing instance
-2048: SD_WAKE_IDLE_FAR: Gain latency sacrificing cache hit
+4096: SD_PREFER_SIBLING: Prefer to place tasks in a sibling domain
{cpu0/domain2:NODE} SD flag: 3183
+ 1: SD_LOAD_BALANCE: Do load balancing on this domain
+ 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
+ 4: SD_BALANCE_EXEC: Balance on exec
+ 8: SD_BALANCE_FORK: Balance on fork, clone
- 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
+ 32: SD_WAKE_AFFINE: Wake task to waking CPU
+ 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
- 128: SD_SHARE_CPUPOWER: Domain members share cpu power
- 256: SD_POWERSAVINGS_BALANCE: Balance for power savings
- 512: SD_SHARE_PKG_RESOURCES: Domain members share cpu pkg resources
+1024: SD_SERIALIZE: Only a single load balancing instance
+2048: SD_WAKE_IDLE_FAR: Gain latency sacrificing cache hit
-4096: SD_PREFER_SIBLING: Prefer to place tasks in a sibling domain

The way i can turn on say SD_WAKE_IDLE for the NODE domain is to:

tune-sched-domains 239 4735 $((3183+16))

( This is a pretty stone-age script i admit ;-)

Thanks for all your testing so far,

Ingo


Attachments:
tune-sched-domains (2.10 kB)

2009-09-10 06:41:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Ingo Molnar <[email protected]> wrote:

> > However, the interactivity problems still remain. Does that
> > mean it's not a latency issue?
>
> It means that Jens's test-app, which demonstrated and helped us
> fix the issue for him does not help us fix it for you just yet.

Lemme qualify that by saying that Jens's issues are improved, not
fixed [he has not re-run with the latest latt.c yet], but not all things
are fully fixed yet. For example, the xmodmap thing sounds
interesting - could that be a child-runs-first effect?

Ingo

2009-09-10 06:53:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23


* Serge Belyshev <[email protected]> wrote:

> Ingo Molnar <[email protected]> writes:
>
> > Thanks!
> >
> > I think we found the reason for that regression - would you mind
> > to re-test with latest -tip, e157986 or later?
> >
> > If that works for you i'll describe our theory.
> >
>
> Good job -- seems to work, thanks. Regression is still about 3%
> though: http://img3.imageshack.us/img3/5335/epicbfstip.png

Ok, thanks for the update. The problem is that i've run out of
test systems that can reproduce this. So we need your help to debug
this directly ...

A good start would be to post the -tip versus BFS "perf stat"
measurement results:

perf stat --repeat 3 make -j4 bzImage

And also the -j8 perf stat result, so that we can see what the
difference is between -j4 and -j8.

Note: please check out latest tip and do:

cd tools/perf/
make -j install

To pick up the latest 'perf' tool. In particular the precision of
--repeat has been improved recently so you want that binary from
-tip even if you measure vanilla .31 or .31 based BFS.

Also, it would be nice if you could send me your kernel config -
maybe it's some config detail that keeps me from being able to
reproduce these results. I havent seen a link to a config in your
mails (maybe i missed it - these threads are voluminous).

Ingo

2009-09-10 06:55:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, 2009-09-09 at 14:20 +0200, Jens Axboe wrote:
>
> One thing I also noticed is that when I have logged in, I run xmodmap
> manually to load some keymappings (I always tell myself to add this to
> the log in scripts, but I suspend/resume this laptop for weeks at the
> time and forget before the next boot). With the stock kernel, xmodmap
> will halt X updates and take forever to run. With BFS, it returned
> instantly. As I would expect.

Can you provide a little more detail (I'm a xmodmap n00b), how does one
run xmodmap and maybe provide your xmodmap config?

2009-09-10 06:58:55

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10 2009, Peter Zijlstra wrote:
> On Wed, 2009-09-09 at 14:20 +0200, Jens Axboe wrote:
> >
> > One thing I also noticed is that when I have logged in, I run xmodmap
> > manually to load some keymappings (I always tell myself to add this to
> > the log in scripts, but I suspend/resume this laptop for weeks at the
> > time and forget before the next boot). With the stock kernel, xmodmap
> > will halt X updates and take forever to run. With BFS, it returned
> > instantly. As I would expect.
>
> Can you provide a little more detail (I'm a xmodmap n00b), how does one
> run xmodmap and maybe provide your xmodmap config?

Will do, let me get the notebook and strace time it on both bfs and
mainline.

--
Jens Axboe

2009-09-10 06:59:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Peter Zijlstra <[email protected]> wrote:

> On Wed, 2009-09-09 at 14:20 +0200, Jens Axboe wrote:
> >
> > One thing I also noticed is that when I have logged in, I run xmodmap
> > manually to load some keymappings (I always tell myself to add this to
> > the log in scripts, but I suspend/resume this laptop for weeks at the
> > time and forget before the next boot). With the stock kernel, xmodmap
> > will halt X updates and take forever to run. With BFS, it returned
> > instantly. As I would expect.
>
> Can you provide a little more detail (I'm a xmodmap n00b), how
> does one run xmodmap and maybe provide your xmodmap config?

(and which version did you use, just in case it matters.)

Ingo

2009-09-10 07:04:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Jens Axboe <[email protected]> wrote:

> On Thu, Sep 10 2009, Peter Zijlstra wrote:
> > On Wed, 2009-09-09 at 14:20 +0200, Jens Axboe wrote:
> > >
> > > One thing I also noticed is that when I have logged in, I run xmodmap
> > > manually to load some keymappings (I always tell myself to add this to
> > > the log in scripts, but I suspend/resume this laptop for weeks at the
> > > time and forget before the next boot). With the stock kernel, xmodmap
> > > will halt X updates and take forever to run. With BFS, it returned
> > > instantly. As I would expect.
> >
> > Can you provide a little more detail (I'm a xmodmap n00b), how
> > does one run xmodmap and maybe provide your xmodmap config?
>
> Will do, let me get the notebook and strace time it on both bfs
> and mainline.

A 'perf stat' comparison would be nice as well - that will show us
events strace doesnt show, and shows us the basic scheduler behavior
as well.

A 'full' trace could be done as well via trace-cmd.c (attached), if
you enable:

CONFIG_CONTEXT_SWITCH_TRACER=y

and did something like:

trace-cmd -s xmodmap ... > trace.txt

Ingo


Attachments:
trace-cmd.c (6.39 kB)

2009-09-10 07:33:17

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10 2009, Jens Axboe wrote:
> On Thu, Sep 10 2009, Peter Zijlstra wrote:
> > On Wed, 2009-09-09 at 14:20 +0200, Jens Axboe wrote:
> > >
> > > One thing I also noticed is that when I have logged in, I run xmodmap
> > > manually to load some keymappings (I always tell myself to add this to
> > > the log in scripts, but I suspend/resume this laptop for weeks at the
> > > time and forget before the next boot). With the stock kernel, xmodmap
> > > will halt X updates and take forever to run. With BFS, it returned
> > > instantly. As I would expect.
> >
> > Can you provide a little more detail (I'm a xmodmap n00b), how does one
> > run xmodmap and maybe provide your xmodmap config?
>
> Will do, let me get the notebook and strace time it on both bfs and
> mainline.

Here's the result of running perf stat xmodmap .xmodmap-carl on the
notebook. I have attached the .xmodmap-carl file, it's pretty simple. I
have also attached the output of strace -o foo -f -tt xmodmap
.xmodmap-carl when run on 2.6.31-rc9.

2.6.31-rc9-bfs210

Performance counter stats for 'xmodmap .xmodmap-carl':

153.994976 task-clock-msecs # 0.990 CPUs (scaled from 99.86%)
0 context-switches # 0.000 M/sec (scaled from 99.86%)
0 CPU-migrations # 0.000 M/sec (scaled from 99.86%)
315 page-faults # 0.002 M/sec (scaled from 99.86%)
<not counted> cycles
<not counted> instructions
<not counted> cache-references
<not counted> cache-misses

0.155573406 seconds time elapsed

2.6.31-rc9

Performance counter stats for 'xmodmap .xmodmap-carl':

8.529265 task-clock-msecs # 0.001 CPUs
23 context-switches # 0.003 M/sec
1 CPU-migrations # 0.000 M/sec
315 page-faults # 0.037 M/sec
<not counted> cycles
<not counted> instructions
<not counted> cache-references
<not counted> cache-misses

11.804293482 seconds time elapsed


--
Jens Axboe


Attachments:
.xmodmap-carl (1.33 kB)
strace-xmodmap.txt (20.00 kB)

2009-09-10 07:43:28

by Ingo Molnar

[permalink] [raw]
Subject: [updated] BFS vs. mainline scheduler benchmarks and measurements


* Ingo Molnar <[email protected]> wrote:

> OLTP performance (postgresql + sysbench)
> http://redhat.com/~mingo/misc/bfs-vs-tip-oltp.jpg

To everyone who might care about this, i've updated the sysbench
results to latest -tip:

http://redhat.com/~mingo/misc/bfs-vs-tip-oltp-v2.jpg

This double-checks, in the throughput space too, the effects of the
various interactivity fixlets in the scheduler tree (whose
interactivity effects were mentioned/documented in the various
threads on lkml) - and they improved sysbench performance as well.

Con, i'd also like to thank you for raising general interest in
scheduler latencies once more by posting the BFS patch. It gave us
more bugreports upstream and gave us desktop users willing to test
patches which in turn helps us improve the code. When users choose
to suffer in silence that is never helpful.

BFS isnt particularly strong in this graph - from having looked at
the workload under BFS my impression was that this is primarily due
to you having cut out much of the sched-domains SMP load-balancer
code. BFS 'insta-balances' very aggressively, which hurts cache-affine
workloads rather visibly.

You might want to have a look at that design detail if you care -
load-balancing is in significant parts orthogonal to the basic
design of a fair scheduler.

For example we kept much of the existing load-balancer when we went
to CFS in v2.6.23 - the fairness engine and the load-balancer are in
large parts independent units of code and can be improved/tweaked
separately.

There's interactions, but the concepts are largely separate.

Thanks,

Ingo

2009-09-10 07:49:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Jens Axboe <[email protected]> wrote:

> On Thu, Sep 10 2009, Jens Axboe wrote:
> > On Thu, Sep 10 2009, Peter Zijlstra wrote:
> > > On Wed, 2009-09-09 at 14:20 +0200, Jens Axboe wrote:
> > > >
> > > > One thing I also noticed is that when I have logged in, I
> > > > run xmodmap manually to load some keymappings (I always tell
> > > > myself to add this to the log in scripts, but I
> > > > suspend/resume this laptop for weeks at the time and forget
> > > > before the next boot). With the stock kernel, xmodmap will
> > > > halt X updates and take forever to run. With BFS, it
> > > > returned instantly. As I would expect.
> > >
> > > Can you provide a little more detail (I'm a xmodmap n00b), how does one
> > > run xmodmap and maybe provide your xmodmap config?
> >
> > Will do, let me get the notebook and strace time it on both bfs and
> > mainline.
>
> Here's the result of running perf stat xmodmap .xmodmap-carl on
> the notebook. I have attached the .xmodmap-carl file, it's pretty
> simple. I have also attached the output of strace -o foo -f -tt
> xmodmap .xmodmap-carl when run on 2.6.31-rc9.
>
> 2.6.31-rc9-bfs210
>
> Performance counter stats for 'xmodmap .xmodmap-carl':
>
> 153.994976 task-clock-msecs # 0.990 CPUs (scaled from 99.86%)
> 0 context-switches # 0.000 M/sec (scaled from 99.86%)
> 0 CPU-migrations # 0.000 M/sec (scaled from 99.86%)
> 315 page-faults # 0.002 M/sec (scaled from 99.86%)
> <not counted> cycles
> <not counted> instructions
> <not counted> cache-references
> <not counted> cache-misses
>
> 0.155573406 seconds time elapsed

(Side question: what hardware is this - why are there no hw
counters? Could you post the /proc/cpuinfo?)

> 2.6.31-rc9
>
> Performance counter stats for 'xmodmap .xmodmap-carl':
>
> 8.529265 task-clock-msecs # 0.001 CPUs
> 23 context-switches # 0.003 M/sec
> 1 CPU-migrations # 0.000 M/sec
> 315 page-faults # 0.037 M/sec
> <not counted> cycles
> <not counted> instructions
> <not counted> cache-references
> <not counted> cache-misses
>
> 11.804293482 seconds time elapsed

Thanks - so we context-switch 23 times - possibly to Xorg. But 11
seconds is extremely long. Will try to reproduce it.

Ingo

2009-09-10 07:53:56

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10 2009, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > On Thu, Sep 10 2009, Jens Axboe wrote:
> > > On Thu, Sep 10 2009, Peter Zijlstra wrote:
> > > > On Wed, 2009-09-09 at 14:20 +0200, Jens Axboe wrote:
> > > > >
> > > > > One thing I also noticed is that when I have logged in, I
> > > > > run xmodmap manually to load some keymappings (I always tell
> > > > > myself to add this to the log in scripts, but I
> > > > > suspend/resume this laptop for weeks at the time and forget
> > > > > before the next boot). With the stock kernel, xmodmap will
> > > > > halt X updates and take forever to run. With BFS, it
> > > > > returned instantly. As I would expect.
> > > >
> > > > Can you provide a little more detail (I'm a xmodmap n00b), how does one
> > > > run xmodmap and maybe provide your xmodmap config?
> > >
> > > Will do, let me get the notebook and strace time it on both bfs and
> > > mainline.
> >
> > Here's the result of running perf stat xmodmap .xmodmap-carl on
> > the notebook. I have attached the .xmodmap-carl file, it's pretty
> > simple. I have also attached the output of strace -o foo -f -tt
> > xmodmap .xmodmap-carl when run on 2.6.31-rc9.
> >
> > 2.6.31-rc9-bfs210
> >
> > Performance counter stats for 'xmodmap .xmodmap-carl':
> >
> > 153.994976 task-clock-msecs # 0.990 CPUs (scaled from 99.86%)
> > 0 context-switches # 0.000 M/sec (scaled from 99.86%)
> > 0 CPU-migrations # 0.000 M/sec (scaled from 99.86%)
> > 315 page-faults # 0.002 M/sec (scaled from 99.86%)
> > <not counted> cycles
> > <not counted> instructions
> > <not counted> cache-references
> > <not counted> cache-misses
> >
> > 0.155573406 seconds time elapsed
>
> (Side question: what hardware is this - why are there no hw
> counters? Could you post the /proc/cpuinfo?)

Sure, attached. It's a Thinkpad x60, core duo. Nothing fancy. The perf
may be a bit dated.

I went to try -tip btw, but it crashes on boot. Here's the backtrace,
typed manually, it's crashing in queue_work_on+0x28/0x60.

Call Trace:
queue_work
schedule_work
clocksource_mark_unstable
mark_tsc_unstable
check_tsc_sync_source
native_cpu_up
relay_hotcpu_callback
do_fork_idle
_cpu_up
cpu_up
kernel_init
kernel_thread_helper

> > Performance counter stats for 'xmodmap .xmodmap-carl':
> >
> > 8.529265 task-clock-msecs # 0.001 CPUs
> > 23 context-switches # 0.003 M/sec
> > 1 CPU-migrations # 0.000 M/sec
> > 315 page-faults # 0.037 M/sec
> > <not counted> cycles
> > <not counted> instructions
> > <not counted> cache-references
> > <not counted> cache-misses
> >
> > 11.804293482 seconds time elapsed
>
> Thanks - so we context-switch 23 times - possibly to Xorg. But 11
> seconds is extremely long. Will try to reproduce it.

There's also the strace info with timings. Xorg is definitely involved,
during those 11s things stop updating completely.

--
Jens Axboe

2009-09-10 09:44:32

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10 2009, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > On Thu, Sep 10 2009, Peter Zijlstra wrote:
> > > On Wed, 2009-09-09 at 14:20 +0200, Jens Axboe wrote:
> > > >
> > > > One thing I also noticed is that when I have logged in, I run xmodmap
> > > > manually to load some keymappings (I always tell myself to add this to
> > > > the log in scripts, but I suspend/resume this laptop for weeks at the
> > > > time and forget before the next boot). With the stock kernel, xmodmap
> > > > will halt X updates and take forever to run. With BFS, it returned
> > > > instantly. As I would expect.
> > >
> > > Can you provide a little more detail (I'm a xmodmap n00b), how
> > > does one run xmodmap and maybe provide your xmodmap config?
> >
> > Will do, let me get the notebook and strace time it on both bfs
> > and mainline.
>
> A 'perf stat' comparison would be nice as well - that will show us
> events strace doesnt show, and shows us the basic scheduler behavior
> as well.
>
> A 'full' trace could be done as well via trace-cmd.c (attached), if
> you enable:
>
> CONFIG_CONTEXT_SWITCH_TRACER=y
>
> and did something like:
>
> trace-cmd -s xmodmap ... > trace.txt

trace.txt attached. Steven, you seem to go through a lot of trouble to
find the debugfs path, yet at the very end do:

> system("cat /debug/tracing/trace");

which doesn't seem quite right :-)
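
For what it's worth, here is a hypothetical sketch (not code from the
attached trace-cmd.c) of how the discovered mount point could be reused
instead of the fixed /debug path - it just looks debugfs up in /proc/mounts
and prints where the trace file lives:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char dev[64], dir[256], type[64];
	FILE *fp = fopen("/proc/mounts", "r");

	if (!fp)
		return 1;

	/* Each /proc/mounts line: device mountpoint fstype options ... */
	while (fscanf(fp, "%63s %255s %63s %*[^\n]", dev, dir, type) == 3) {
		if (!strcmp(type, "debugfs")) {
			printf("%s/tracing/trace\n", dir);
			break;
		}
	}
	fclose(fp);
	return 0;
}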

--
Jens Axboe

2009-09-10 09:46:01

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10 2009, Jens Axboe wrote:
> trace.txt attached.

Now it really is, I very much need a more clever MUA to help me with
these things :-)

--
Jens Axboe


Attachments:
trace.txt.bz2 (235.68 kB)

2009-09-10 09:48:20

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, Sep 09 2009, Ingo Molnar wrote:
> At least in my tests these latencies were mainly due to a bug in
> latt.c - i've attached the fixed version.

What bug? I don't see any functional change between the version you
attach and the current one.

--
Jens Axboe

2009-09-10 09:54:56

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10 2009, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > > However, the interactivity problems still remain. Does that
> > > mean it's not a latency issue?
> >
> > It means that Jens's test-app, which demonstrated and helped us
> > fix the issue for him does not help us fix it for you just yet.
>
> Lemme qualify that by saying that Jens's issues are improved not
> fixed [he has not re-run with latest latt.c yet] but not all things
> are fully fixed yet. For example the xmodmap thing sounds
> interesting - could that be a child-runs-first effect?

I thought so too, so when -tip failed to boot I pulled the patches from
Mike into 2.6.31. It doesn't change anything for xmodmap, though.

--
Jens Axboe

2009-09-10 09:59:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Jens Axboe <[email protected]> wrote:

> On Wed, Sep 09 2009, Ingo Molnar wrote:
> > At least in my tests these latencies were mainly due to a bug in
> > latt.c - i've attached the fixed version.
>
> What bug? I don't see any functional change between the version
> you attach and the current one.

Here's the diff of what i fixed yesterday over the last latt.c
version i found in this thread. The poll() thing is the significant
one.

Ingo

--- latt.c.orig
+++ latt.c
@@ -39,6 +39,7 @@ static unsigned int verbose;
struct stats
{
double n, mean, M2, max;
+ int max_pid;
};

static void update_stats(struct stats *stats, unsigned long long val)
@@ -85,22 +86,6 @@ static double stddev_stats(struct stats
return sqrt(variance);
}

-/*
- * The std dev of the mean is related to the std dev by:
- *
- * s
- * s_mean = -------
- * sqrt(n)
- *
- */
-static double stddev_mean_stats(struct stats *stats)
-{
- double variance = stats->M2 / (stats->n - 1);
- double variance_mean = variance / stats->n;
-
- return sqrt(variance_mean);
-}
-
struct stats delay_stats;

static int pipes[MAX_CLIENTS*2][2];
@@ -212,7 +197,7 @@ static unsigned long usec_since(struct t
static void log_delay(unsigned long delay)
{
if (verbose) {
- fprintf(stderr, "log delay %8lu usec\n", delay);
+ fprintf(stderr, "log delay %8lu usec (pid %d)\n", delay, getpid());
fflush(stderr);
}

@@ -300,7 +285,7 @@ static int __write_ts(int i, struct time
return write(fd, ts, sizeof(*ts)) != sizeof(*ts);
}

-static long __read_ts(int i, struct timespec *ts)
+static long __read_ts(int i, struct timespec *ts, pid_t *cpids)
{
int fd = pipes[2*i+1][0];
struct timespec t;
@@ -309,11 +294,14 @@ static long __read_ts(int i, struct time
return -1;

log_delay(usec_since(ts, &t));
+ if (verbose)
+ fprintf(stderr, "got delay %ld from child %d [pid %d]\n", usec_since(ts, &t), i, cpids[i]);

return 0;
}

-static int read_ts(struct pollfd *pfd, unsigned int nr, struct timespec *ts)
+static int read_ts(struct pollfd *pfd, unsigned int nr, struct timespec *ts,
+ pid_t *cpids)
{
unsigned int i;

@@ -322,7 +310,7 @@ static int read_ts(struct pollfd *pfd, u
return -1L;
if (pfd[i].revents & POLLIN) {
pfd[i].events = 0;
- if (__read_ts(i, &ts[i]))
+ if (__read_ts(i, &ts[i], cpids))
return -1L;
nr--;
}
@@ -368,7 +356,6 @@ static void run_parent(pid_t *cpids)
srand(1234);

do {
- unsigned long delay;
unsigned pending_events;

do_rand_sleep();
@@ -404,17 +391,17 @@ static void run_parent(pid_t *cpids)
*/
pending_events = clients;
while (pending_events) {
- int evts = poll(ipfd, clients, 0);
+ int evts = poll(ipfd, clients, -1);

if (evts < 0) {
do_exit = 1;
break;
} else if (!evts) {
- /* printf("bugger2\n"); */
+ printf("bugger2\n");
continue;
}

- if (read_ts(ipfd, evts, t1)) {
+ if (read_ts(ipfd, evts, t1, cpids)) {
do_exit = 1;
break;
}
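
For readers following the diff: poll()'s third argument is a timeout in
milliseconds. The old code passed 0, which makes poll() return immediately
even when no child has replied yet, so the parent spun through the loop -
presumably what inflated the measured latencies. Passing -1 makes it sleep
until an event actually arrives. A minimal standalone illustration of the
two modes (not part of latt.c):

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fds[2];
	struct pollfd pfd;

	if (pipe(fds))
		return 1;
	pfd.fd = fds[0];
	pfd.events = POLLIN;

	/* Nothing written yet: timeout 0 returns 0 immediately. */
	printf("poll(timeout=0)  -> %d\n", poll(&pfd, 1, 0));

	if (write(fds[1], "x", 1) != 1)
		return 1;
	/* Timeout -1 blocks until readable - here data is already queued. */
	printf("poll(timeout=-1) -> %d\n", poll(&pfd, 1, -1));
	return 0;
}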

2009-09-10 10:01:47

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10 2009, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > On Wed, Sep 09 2009, Ingo Molnar wrote:
> > > At least in my tests these latencies were mainly due to a bug in
> > > latt.c - i've attached the fixed version.
> >
> > What bug? I don't see any functional change between the version
> > you attach and the current one.
>
> Here's the diff of what i fixed yesterday over the last latt.c
> version i found in this thread. The poll() thing is the significant
> one.

Ah indeed, thanks Ingo! I'm tempted to add some actual work processing
into latt as well, to see if that helps improve it.

--
Jens Axboe

2009-09-10 10:02:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Jens Axboe <[email protected]> wrote:

> I went to try -tip btw, but it crashes on boot. Here's the
> backtrace, typed manually, it's crashing in
> queue_work_on+0x28/0x60.
>
> Call Trace:
> queue_work
> schedule_work
> clocksource_mark_unstable
> mark_tsc_unstable
> check_tsc_sync_source
> native_cpu_up
> relay_hotcpu_callback
> do_fork_idle
> _cpu_up
> cpu_up
> kernel_init
> kernel_thread_helper

hm, that looks like an old bug i fixed days ago via:

00a3273: Revert "x86: Make tsc=reliable override boot time stability checks"

Have you tested tip:master - do you still know which sha1?

Ingo

2009-09-10 10:03:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Jens Axboe <[email protected]> wrote:

> On Thu, Sep 10 2009, Ingo Molnar wrote:
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > > However, the interactivity problems still remain. Does that
> > > > mean it's not a latency issue?
> > >
> > > It means that Jens's test-app, which demonstrated and helped us
> > > fix the issue for him does not help us fix it for you just yet.
> >
> > Lemme qualify that by saying that Jens's issues are improved not
> > fixed [he has not re-run with latest latt.c yet] but not all things
> > are fully fixed yet. For example the xmodmap thing sounds
> > interesting - could that be a child-runs-first effect?
>
> I thought so too, so when -tip failed to boot I pulled the patches
> from Mike into 2.6.31. It doesn't change anything for xmodmap,
> though.

Note, you can access just the pristine scheduler patches by checking
out and testing tip:sched/core - no need to pull them out and apply.

Your crash looks like clocksource related - that's in a separate
topic which you can thus isolate if you use sched/core.

Thanks,

Ingo

2009-09-10 10:09:44

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10 2009, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > I went to try -tip btw, but it crashes on boot. Here's the
> > backtrace, typed manually, it's crashing in
> > queue_work_on+0x28/0x60.
> >
> > Call Trace:
> > queue_work
> > schedule_work
> > clocksource_mark_unstable
> > mark_tsc_unstable
> > check_tsc_sync_source
> > native_cpu_up
> > relay_hotcpu_callback
> > do_fork_idle
> > _cpu_up
> > cpu_up
> > kernel_init
> > kernel_thread_helper
>
> hm, that looks like an old bug i fixed days ago via:
>
> 00a3273: Revert "x86: Make tsc=reliable override boot time stability checks"
>
> Have you tested tip:master - do you still know which sha1?

It was -tip pulled this morning, 2-3 hours ago. I don't have the sha
anymore, but it was a fresh pull today.

--
Jens Axboe

2009-09-10 10:11:35

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10 2009, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > On Thu, Sep 10 2009, Ingo Molnar wrote:
> > >
> > > * Ingo Molnar <[email protected]> wrote:
> > >
> > > > > However, the interactivity problems still remain. Does that
> > > > > mean it's not a latency issue?
> > > >
> > > > It means that Jens's test-app, which demonstrated and helped us
> > > > fix the issue for him does not help us fix it for you just yet.
> > >
> > > Lemme qualify that by saying that Jens's issues are improved not
> > > fixed [he has not re-run with latest latt.c yet] but not all things
> > > are fully fixed yet. For example the xmodmap thing sounds
> > > interesting - could that be a child-runs-first effect?
> >
> > I thought so too, so when -tip failed to boot I pulled the patches
> > from Mike into 2.6.31. It doesn't change anything for xmodmap,
> > though.
>
> Note, you can access just the pristine scheduler patches by checking
> out and testing tip:sched/core - no need to pull them out and apply.
>
> Your crash looks like clocksource related - that's in a separate
> topic which you can thus isolate if you use sched/core.

I'm building sched/core now and will run the xmodmap test there.

--
Jens Axboe

2009-09-10 10:28:36

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10 2009, Jens Axboe wrote:
> On Thu, Sep 10 2009, Ingo Molnar wrote:
> >
> > * Jens Axboe <[email protected]> wrote:
> >
> > > On Thu, Sep 10 2009, Ingo Molnar wrote:
> > > >
> > > > * Ingo Molnar <[email protected]> wrote:
> > > >
> > > > > > However, the interactivity problems still remain. Does that
> > > > > > mean it's not a latency issue?
> > > > >
> > > > > It means that Jens's test-app, which demonstrated and helped us
> > > > > fix the issue for him does not help us fix it for you just yet.
> > > >
> > > > Lemme qualify that by saying that Jens's issues are improved not
> > > > fixed [he has not re-run with latest latt.c yet] but not all things
> > > > are fully fixed yet. For example the xmodmap thing sounds
> > > > interesting - could that be a child-runs-first effect?
> > >
> > > I thought so too, so when -tip failed to boot I pulled the patches
> > > from Mike into 2.6.31. It doesn't change anything for xmodmap,
> > > though.
> >
> > Note, you can access just the pristine scheduler patches by checking
> > out and testing tip:sched/core - no need to pull them out and apply.
> >
> > Your crash looks like clocksource related - that's in a separate
> > topic which you can thus isolate if you use sched/core.
>
> I'm building sched/core now and will run the xmodmap test there.

No difference. Then I tried switching NO_NEW_FAIR_SLEEPERS on, and then
I get:

Performance counter stats for 'xmodmap .xmodmap-carl':

9.009137 task-clock-msecs # 0.447 CPUs
18 context-switches # 0.002 M/sec
1 CPU-migrations # 0.000 M/sec
315 page-faults # 0.035 M/sec
<not counted> cycles
<not counted> instructions
<not counted> cache-references
<not counted> cache-misses

0.020167093 seconds time elapsed

Woot!

--
Jens Axboe

2009-09-10 10:57:32

by Mike Galbraith

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, 2009-09-10 at 12:28 +0200, Jens Axboe wrote:

> No difference. Then I tried switching NO_NEW_FAIR_SLEEPERS on, and then
> I get:
>
> Performance counter stats for 'xmodmap .xmodmap-carl':
>
> 9.009137 task-clock-msecs # 0.447 CPUs
> 18 context-switches # 0.002 M/sec
> 1 CPU-migrations # 0.000 M/sec
> 315 page-faults # 0.035 M/sec
> <not counted> cycles
> <not counted> instructions
> <not counted> cache-references
> <not counted> cache-misses
>
> 0.020167093 seconds time elapsed
>
> Woot!

Something is very seriously hosed on that box... clock?

Can you turn it back on, and do..
while sleep .1; do cat /proc/sched_debug >> foo; done
..on one core, and (quickly;) xmodmap .xmodmap-carl, then send me a few
seconds worth (gzipped up) to eyeball?

-Mike

2009-09-10 11:03:48

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10 2009, Con Kolivas wrote:
> > It probably just means that latt isn't a good measure of the problem.
> > Which isn't really too much of a surprise.
>
> And that's a real shame because this was one of the first real good attempts
> I've seen to actually measure the difference, and I thank you for your
> efforts Jens. I believe the reason it's limited is because all you're
> measuring is time from wakeup and the test app isn't actually doing any work.
> The issue is more than just waking up as fast as possible, it's then doing
> some meaningful amount of work within a reasonable time frame as well. What
> the "meaningful amount of work" and "reasonable time frame" are, remains a
> mystery, but I guess could be added on to this testing app.

Here's a quickie addition that adds some work to the threads. The
latency measure is now 'when did I wake up and complete my work'. The
default work is filling a buffer with pseudo random data and then
compressing it with zlib. Default is 64kb of data, can be adjusted with
-x. -x0 turns off work processing.
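
The work step described above boils down to filling a buffer with
pseudo-random bytes and running it through zlib. A minimal sketch of that
idea (function name and sizes are illustrative assumptions - the attached
latt.c is the authoritative version; link with -lz):

#include <stdlib.h>
#include <zlib.h>

static void do_work(size_t bytes)
{
	unsigned char *in = malloc(bytes);
	uLongf out_len = compressBound(bytes);
	unsigned char *out = malloc(out_len);
	size_t i;

	if (!in || !out)
		goto out;

	/* Fill the input with pseudo-random data... */
	for (i = 0; i < bytes; i++)
		in[i] = rand() & 0xff;

	/* ...and compress it in one shot with zlib. */
	compress(out, &out_len, in, bytes);
out:
	free(out);
	free(in);
}

int main(void)
{
	do_work(64 * 1024);	/* the 64kb default mentioned above */
	return 0;
}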

--
Jens Axboe


Attachments:
latt.c (10.98 kB)

2009-09-10 11:09:14

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10 2009, Mike Galbraith wrote:
> On Thu, 2009-09-10 at 12:28 +0200, Jens Axboe wrote:
>
> > No difference. Then I tried switching NO_NEW_FAIR_SLEEPERS on, and then
> > I get:
> >
> > Performance counter stats for 'xmodmap .xmodmap-carl':
> >
> > 9.009137 task-clock-msecs # 0.447 CPUs
> > 18 context-switches # 0.002 M/sec
> > 1 CPU-migrations # 0.000 M/sec
> > 315 page-faults # 0.035 M/sec
> > <not counted> cycles
> > <not counted> instructions
> > <not counted> cache-references
> > <not counted> cache-misses
> >
> > 0.020167093 seconds time elapsed
> >
> > Woot!
>
> Something is very seriously hosed on that box... clock?

model name : Genuine Intel(R) CPU T2400 @ 1.83GHz

Throttles down to 1.00GHz when idle.

> Can you turn it back on, and do..

I guess you mean turn NEW_FAIR_SLEEPERS back on, correct?

> while sleep .1; do cat /proc/sched_debug >> foo; done
> ..on one core, and (quickly;) xmodmap .xmodmap-carl, then send me a few
> seconds worth (gzipped up) to eyeball?

Attached.

--
Jens Axboe


Attachments:
sched-debug-cat.bz2 (11.92 kB)

2009-09-10 11:21:12

by Mike Galbraith

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, 2009-09-10 at 13:09 +0200, Jens Axboe wrote:
> On Thu, Sep 10 2009, Mike Galbraith wrote:
> > On Thu, 2009-09-10 at 12:28 +0200, Jens Axboe wrote:
> >
> > > No difference. Then I tried switching NO_NEW_FAIR_SLEEPERS on, and then
> > > I get:
> > >
> > > Performance counter stats for 'xmodmap .xmodmap-carl':
> > >
> > > 9.009137 task-clock-msecs # 0.447 CPUs
> > > 18 context-switches # 0.002 M/sec
> > > 1 CPU-migrations # 0.000 M/sec
> > > 315 page-faults # 0.035 M/sec
> > > <not counted> cycles
> > > <not counted> instructions
> > > <not counted> cache-references
> > > <not counted> cache-misses
> > >
> > > 0.020167093 seconds time elapsed
> > >
> > > Woot!
> >
> > Something is very seriously hosed on that box... clock?
>
> model name : Genuine Intel(R) CPU T2400 @ 1.83GHz
>
> Throttles down to 1.00GHz when idle.
>
> > Can you turn it back on, and do..
>
> I guess you mean turn NEW_FAIR_SLEEPERS back on, correct?
>
> > while sleep .1; do cat /proc/sched_debug >> foo; done
> > ..on one core, and (quickly;) xmodmap .xmodmap-carl, then send me a few
> > seconds worth (gzipped up) to eyeball?
>
> Attached.

xmodmap doesn't seem to be running in this sample.

-Mike

2009-09-10 11:24:42

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10 2009, Mike Galbraith wrote:
> On Thu, 2009-09-10 at 13:09 +0200, Jens Axboe wrote:
> > On Thu, Sep 10 2009, Mike Galbraith wrote:
> > > On Thu, 2009-09-10 at 12:28 +0200, Jens Axboe wrote:
> > >
> > > > No difference. Then I tried switching NO_NEW_FAIR_SLEEPERS on, and then
> > > > I get:
> > > >
> > > > Performance counter stats for 'xmodmap .xmodmap-carl':
> > > >
> > > > 9.009137 task-clock-msecs # 0.447 CPUs
> > > > 18 context-switches # 0.002 M/sec
> > > > 1 CPU-migrations # 0.000 M/sec
> > > > 315 page-faults # 0.035 M/sec
> > > > <not counted> cycles
> > > > <not counted> instructions
> > > > <not counted> cache-references
> > > > <not counted> cache-misses
> > > >
> > > > 0.020167093 seconds time elapsed
> > > >
> > > > Woot!
> > >
> > > Something is very seriously hosed on that box... clock?
> >
> > model name : Genuine Intel(R) CPU T2400 @ 1.83GHz
> >
> > Throttles down to 1.00GHz when idle.
> >
> > > Can you turn it back on, and do..
> >
> > I guess you mean turn NEW_FAIR_SLEEPERS back on, correct?
> >
> > > while sleep .1; do cat /proc/sched_debug >> foo; done
> > > ..on one core, and (quickly;) xmodmap .xmodmap-carl, then send me a few
> > > seconds worth (gzipped up) to eyeball?
> >
> > Attached.
>
> xmodmap doesn't seem to be running in this sample.

That's weird, it was definitely running. I did:

sleep 1; xmodmap .xmodmap-carl

in one xterm, and then switched to the other and ran the sched_debug
dump. I have to do it this way, as X will not move focus once xmodmap
starts running. It could be that xmodmap is mostly idle, and the real
work is done by Xorg and/or xfwm4 (my window manager).

--
Jens Axboe

2009-09-10 11:28:31

by Mike Galbraith

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, 2009-09-10 at 13:24 +0200, Jens Axboe wrote:
> On Thu, Sep 10 2009, Mike Galbraith wrote:

> > xmodmap doesn't seem to be running in this sample.
>
> That's weird, it was definitely running. I did:
>
> sleep 1; xmodmap .xmodmap-carl
>
> in one xterm, and then switched to the other and ran the sched_debug
> dump. I have to do it this way, as X will not move focus once xmodmap
> starts running. It could be that xmodmap is mostly idle, and the real
> work is done by Xorg and/or xfwm4 (my window manager).

Hm. Ok, I'll crawl over it, see if anything falls out.

-Mike

2009-09-10 11:35:15

by Jens Axboe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10 2009, Mike Galbraith wrote:
> On Thu, 2009-09-10 at 13:24 +0200, Jens Axboe wrote:
> > On Thu, Sep 10 2009, Mike Galbraith wrote:
>
> > > xmodmap doesn't seem to be running in this sample.
> >
> > That's weird, it was definitely running. I did:
> >
> > sleep 1; xmodmap .xmodmap-carl
> >
> > in one xterm, and then switched to the other and ran the sched_debug
> > dump. I have to do it this way, as X will not move focus once xmodmap
> > starts running. It could be that xmodmap is mostly idle, and the real
> > work is done by Xorg and/or xfwm4 (my window manager).
>
> Hm. Ok, I'll crawl over it, see if anything falls out.

That seems to be confirmed with the low context switch rate of the perf
stat of xmodmap. If I run perf stat -a to get a system wide collection
for xmodmap, I get:

Performance counter stats for 'xmodmap .xmodmap-carl':

20112.060925 task-clock-msecs # 1.998 CPUs
629360 context-switches # 0.031 M/sec
8 CPU-migrations # 0.000 M/sec
13489 page-faults # 0.001 M/sec
<not counted> cycles
<not counted> instructions
<not counted> cache-references
<not counted> cache-misses

10.067532449 seconds time elapsed

And again, system is idle while this is happening. Can't rule out that
this is some kind of user space bug of course.

--
Jens Axboe

2009-09-10 11:42:34

by Mike Galbraith

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, 2009-09-10 at 13:35 +0200, Jens Axboe wrote:
> On Thu, Sep 10 2009, Mike Galbraith wrote:
> > On Thu, 2009-09-10 at 13:24 +0200, Jens Axboe wrote:
> > > On Thu, Sep 10 2009, Mike Galbraith wrote:
> >
> > > > xmodmap doesn't seem to be running in this sample.
> > >
> > > That's weird, it was definitely running. I did:
> > >
> > > sleep 1; xmodmap .xmodmap-carl
> > >
> > > in one xterm, and then switched to the other and ran the sched_debug
> > > dump. I have to do it this way, as X will not move focus once xmodmap
> > > starts running. It could be that xmodmap is mostly idle, and the real
> > > work is done by Xorg and/or xfwm4 (my window manager).
> >
> > Hm. Ok, I'll crawl over it, see if anything falls out.
>
> That seems to be confirmed with the low context switch rate of the perf
> stat of xmodmap. If I run perf stat -a to get a system wide collection
> for xmodmap, I get:
>
> Performance counter stats for 'xmodmap .xmodmap-carl':
>
> 20112.060925 task-clock-msecs # 1.998 CPUs
> 629360 context-switches # 0.031 M/sec
> 8 CPU-migrations # 0.000 M/sec
> 13489 page-faults # 0.001 M/sec
> <not counted> cycles
> <not counted> instructions
> <not counted> cache-references
> <not counted> cache-misses
>
> 10.067532449 seconds time elapsed
>
> And again, system is idle while this is happening. Can't rule out that
> this is some kind of user space bug of course.

All I'm seeing so far is massive CPU usage for a dinky job.

-Mike

2009-09-10 12:19:57

by Jens Axboe

[permalink] [raw]
Subject: latt location (Was Re: BFS vs. mainline scheduler benchmarks and measurements)

On Wed, Sep 09 2009, Pavel Machek wrote:
> Could you post the source? Someone else might get us
> numbers... preferably on dualcore box or something...

Since it's posted in various places and by various people, I've put it
on the web now as well. Should always be the latest version.

http://kernel.dk/latt.c

Note that it requires zlib-devel packages to build now.

--
Jens Axboe

2009-09-10 13:53:58

by Steven Rostedt

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, 2009-09-10 at 11:44 +0200, Jens Axboe wrote:
> On Thu, Sep 10 2009, Ingo Molnar wrote:

> trace.txt attached. Steven, you seem to go through a lot of trouble to
> find the debugfs path, yet at the very end do:
>
> > system("cat /debug/tracing/trace");
>
> which doesn't seem quite right :-)
>

That's an older version of the tool. The newer version (still in alpha)
doesn't do that.

-- Steve

2009-09-10 16:02:23

by Bret Towe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, Sep 9, 2009 at 11:08 PM, Ingo Molnar <[email protected]> wrote:
>
> * Nikos Chantziaras <[email protected]> wrote:
>
>> On 09/09/2009 09:04 PM, Ingo Molnar wrote:
>>> [...]
>>> * Jens Axboe <[email protected]> wrote:
>>>
>>>> On Wed, Sep 09 2009, Jens Axboe wrote:
>>>> [...]
>>>> BFS210 runs on the laptop (dual core intel core duo). With make -j4
>>>> running, I clock the following latt -c8 'sleep 10' latencies:
>>>>
>>>> -rc9
>>>>
>>>>          Max               17895 usec
>>>>          Avg                8028 usec
>>>>          Stdev              5948 usec
>>>>          Stdev mean          405 usec
>>>>
>>>>          Max               17896 usec
>>>>          Avg                4951 usec
>>>>          Stdev              6278 usec
>>>>          Stdev mean          427 usec
>>>>
>>>>          Max               17885 usec
>>>>          Avg                5526 usec
>>>>          Stdev              6819 usec
>>>>          Stdev mean          464 usec
>>>>
>>>> -rc9 + mike
>>>>
>>>>          Max                6061 usec
>>>>          Avg                3797 usec
>>>>          Stdev              1726 usec
>>>>          Stdev mean          117 usec
>>>>
>>>>          Max                5122 usec
>>>>          Avg                3958 usec
>>>>          Stdev              1697 usec
>>>>          Stdev mean          115 usec
>>>>
>>>>          Max                6691 usec
>>>>          Avg                2130 usec
>>>>          Stdev              2165 usec
>>>>          Stdev mean          147 usec
>>>
>>> At least in my tests these latencies were mainly due to a bug in
>>> latt.c - i've attached the fixed version.
>>>
>>> The other reason was wakeup batching. If you do this:
>>>
>>>     echo 0 > /proc/sys/kernel/sched_wakeup_granularity_ns
>>>
>>> ... then you can switch on insta-wakeups on -tip too.
>>>
>>> With a dual-core box and a make -j4 background job running, on
>>> latest -tip i get the following latencies:
>>>
>>>   $ ./latt -c8 sleep 30
>>>   Entries: 656 (clients=8)
>>>
>>>   Averages:
>>>   ------------------------------
>>>      Max           158 usec
>>>      Avg            12 usec
>>>      Stdev          10 usec
>>
>> With your version of latt.c, I get these results with 2.6-tip vs
>> 2.6.31-rc9-bfs:
>>
>>
>> (mainline)
>> Averages:
>> ------------------------------
>>         Max            50 usec
>>         Avg            12 usec
>>         Stdev           3 usec
>>
>>
>> (BFS)
>> Averages:
>> ------------------------------
>>         Max           474 usec
>>         Avg            11 usec
>>         Stdev          16 usec
>>
>> However, the interactivity problems still remain. Does that mean
>> it's not a latency issue?
>
> It means that Jens's test-app, which demonstrated and helped us fix
> the issue for him does not help us fix it for you just yet.
>
> The "fluidity problem" you described might not be a classic latency
> issue per se (which latt.c measures), but a timeslicing / CPU time
> distribution problem.
>
> A slight shift in CPU time allocation can change the flow of tasks
> to result in a 'choppier' system.
>
> Have you tried, in addition to the granularity tweaks you've done,
> to renice mplayer either up or down? (or compiz and Xorg for that
> matter)
>
> I'm not necessarily suggesting this as a 'real' solution (we really
> prefer kernels that just get it right) - but it's an additional
> parameter dimension along which you can tweak CPU time distribution
> on your box.
>
> Here's the general rule of thumb: one nice level gives plus 5%
> CPU time to a task and takes away 5% CPU time from another task -
> i.e. shifts the CPU allocation by 10%.
>
> ( this is modified by all sorts of dynamic conditions: by the number
>  of tasks running and their wakeup patterns so not a rule cast into
>  stone - but still a good ballpark figure for CPU intense tasks. )
>
> Btw., i've read your descriptions about what you've tuned so far -
> have you seen/checked the wakeup_granularity tunable as well?
> Setting that to 0 will change the general balance of how CPU time is
> allocated between tasks too.
>
> There's also a whole bunch of scheduler features you can turn on/off
> individually via /debug/sched_features. For example, to turn off
> NEW_FAIR_SLEEPERS, you can do:
>
>  # cat /debug/sched_features
>  NEW_FAIR_SLEEPERS NO_NORMALIZED_SLEEPER ADAPTIVE_GRAN WAKEUP_PREEMPT
>  START_DEBIT AFFINE_WAKEUPS CACHE_HOT_BUDDY SYNC_WAKEUPS NO_HRTICK
>  NO_DOUBLE_TICK ASYM_GRAN LB_BIAS LB_WAKEUP_UPDATE ASYM_EFF_LOAD
>  NO_WAKEUP_OVERLAP LAST_BUDDY OWNER_SPIN
>
>  # echo NO_NEW_FAIR_SLEEPERS > /debug/sched_features
>
> Btw., NO_NEW_FAIR_SLEEPERS is something that will turn the scheduler
> into a more classic fair scheduler (like BFS is too).
>
> NO_START_DEBIT might be another thing that improves (or worsens :-/)
> make -j type of kernel build workloads.

thanks to this thread and others I've seen several kernel tunables
that can effect how the scheduler performs/acts
but what I don't see after a bit of looking is where all these are documented
perhaps thats also part of the reason there are unhappy people with
the current code in the kernel just because they don't know how
to tune it for their workload

> Note, these flags are all runtime, the new settings take effect
> almost immediately (and at the latest it takes effect when a task
> has started up) and safe to do runtime.
>
> It basically gives us 32768 pluggable schedulers each with a
> slightly separate algorithm - each setting in essence creates a new
> scheduler. (this mechanism is how we introduce new scheduler
> features and allow their debugging / regression-testing.)
>
> (okay, almost, so beware: turning on HRTICK might lock up your
> system.)
>
> Plus, yet another dimension of tuning on SMP systems (such as
> dual-core) are the sched-domains tunable. There's a whole world of
> tuning in that area and BFS essentially implements a very aggressive
> 'always balance to other CPUs' policy.
>
> I've attached my sched-tune-domains script which helps tune these
> parameters.
>
> For example on a testbox of mine it outputs:
>
> usage: tune-sched-domains <val>
> {cpu0/domain0:SIBLING} SD flag: 239
> + 1: SD_LOAD_BALANCE: Do load balancing on this domain
> + 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
> + 4: SD_BALANCE_EXEC: Balance on exec
> + 8: SD_BALANCE_FORK: Balance on fork, clone
> - 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
> + 32: SD_WAKE_AFFINE: Wake task to waking CPU
> + 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
> + 128: SD_SHARE_CPUPOWER: Domain members share cpu power
> - 256: SD_POWERSAVINGS_BALANCE: Balance for power savings
> - 512: SD_SHARE_PKG_RESOURCES: Domain members share cpu pkg resources
> -1024: SD_SERIALIZE: Only a single load balancing instance
> -2048: SD_WAKE_IDLE_FAR: Gain latency sacrificing cache hit
> -4096: SD_PREFER_SIBLING: Prefer to place tasks in a sibling domain
> {cpu0/domain1:MC} SD flag: 4735
> + 1: SD_LOAD_BALANCE: Do load balancing on this domain
> + 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
> + 4: SD_BALANCE_EXEC: Balance on exec
> + 8: SD_BALANCE_FORK: Balance on fork, clone
> + 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
> + 32: SD_WAKE_AFFINE: Wake task to waking CPU
> + 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
> - 128: SD_SHARE_CPUPOWER: Domain members share cpu power
> - 256: SD_POWERSAVINGS_BALANCE: Balance for power savings
> + 512: SD_SHARE_PKG_RESOURCES: Domain members share cpu pkg resources
> -1024: SD_SERIALIZE: Only a single load balancing instance
> -2048: SD_WAKE_IDLE_FAR: Gain latency sacrificing cache hit
> +4096: SD_PREFER_SIBLING: Prefer to place tasks in a sibling domain
> {cpu0/domain2:NODE} SD flag: 3183
> + 1: SD_LOAD_BALANCE: Do load balancing on this domain
> + 2: SD_BALANCE_NEWIDLE: Balance when about to become idle
> + 4: SD_BALANCE_EXEC: Balance on exec
> + 8: SD_BALANCE_FORK: Balance on fork, clone
> - 16: SD_WAKE_IDLE: Wake to idle CPU on task wakeup
> + 32: SD_WAKE_AFFINE: Wake task to waking CPU
> + 64: SD_WAKE_BALANCE: Perform balancing at task wakeup
> - 128: SD_SHARE_CPUPOWER: Domain members share cpu power
> - 256: SD_POWERSAVINGS_BALANCE: Balance for power savings
> - 512: SD_SHARE_PKG_RESOURCES: Domain members share cpu pkg resources
> +1024: SD_SERIALIZE: Only a single load balancing instance
> +2048: SD_WAKE_IDLE_FAR: Gain latency sacrificing cache hit
> -4096: SD_PREFER_SIBLING: Prefer to place tasks in a sibling domain
>
> The way i can turn on say SD_WAKE_IDLE for the NODE domain is to:
>
>   tune-sched-domains 239 4735 $((3183+16))
>
> ( This is a pretty stone-age script i admit ;-)
>
> Thanks for all your testing so far,
>
>        Ingo
>

2009-09-10 16:05:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, 2009-09-10 at 09:02 -0700, Bret Towe wrote:
>
> thanks to this thread and others I've seen several kernel tunables
> that can effect how the scheduler performs/acts
> but what I don't see after a bit of looking is where all these are
> documented
> perhaps thats also part of the reason there are unhappy people with
> the current code in the kernel just because they don't know how
> to tune it for their workload

The thing is, ideally they should not need to poke at these. These knobs
are under CONFIG_SCHED_DEBUG, and that is exactly what they are for.

2009-09-10 16:13:07

by Bret Towe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10, 2009 at 9:05 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2009-09-10 at 09:02 -0700, Bret Towe wrote:
>>
>> thanks to this thread and others I've seen several kernel tunables
>> that can effect how the scheduler performs/acts
>> but what I don't see after a bit of looking is where all these are
>> documented
>> perhaps thats also part of the reason there are unhappy people with
>> the current code in the kernel just because they don't know how
>> to tune it for their workload
>
> The thing is, ideally they should not need to poke at these. These knobs
> are under CONFIG_SCHED_DEBUG, and that is exactly what they are for.

even then I would think they should be documented so people can find out
what item is hurting their workload so they can better report the bug no?


2009-09-10 16:26:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Bret Towe <[email protected]> wrote:

> On Thu, Sep 10, 2009 at 9:05 AM, Peter Zijlstra <[email protected]> wrote:
> > On Thu, 2009-09-10 at 09:02 -0700, Bret Towe wrote:
> >>
> >> thanks to this thread and others I've seen several kernel tunables
> >> that can effect how the scheduler performs/acts
> >> but what I don't see after a bit of looking is where all these are
> >> documented
> >> perhaps thats also part of the reason there are unhappy people with
> >> the current code in the kernel just because they don't know how
> >> to tune it for their workload
> >
> > The thing is, ideally they should not need to poke at these.
> > These knobs are under CONFIG_SCHED_DEBUG, and that is exactly
> > what they are for.
>
> even then I would think they should be documented so people can
> find out what item is hurting their workload so they can better
> report the bug no?

Would be happy to apply such documentation patches. You could also
help start adding a 'scheduler performance' wiki portion to
perf.wiki.kernel.org, if you have time for that.

Ingo

2009-09-10 16:33:47

by Bret Towe

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Thu, Sep 10, 2009 at 9:26 AM, Ingo Molnar <[email protected]> wrote:
>
> * Bret Towe <[email protected]> wrote:
>
>> On Thu, Sep 10, 2009 at 9:05 AM, Peter Zijlstra <[email protected]> wrote:
>> > On Thu, 2009-09-10 at 09:02 -0700, Bret Towe wrote:
>> >>
>> >> thanks to this thread and others I've seen several kernel tunables
>> >> that can effect how the scheduler performs/acts
>> >> but what I don't see after a bit of looking is where all these are
>> >> documented
>> >> perhaps thats also part of the reason there are unhappy people with
>> >> the current code in the kernel just because they don't know how
>> >> to tune it for their workload
>> >
>> > The thing is, ideally they should not need to poke at these.
>> > These knobs are under CONFIG_SCHED_DEBUG, and that is exactly
>> > what they are for.
>>
>> even then I would think they should be documented so people can
>> find out what item is hurting their workload so they can better
>> report the bug no?
>
> Would be happy to apply such documentation patches. You could also
> help start adding a 'scheduler performance' wiki portion to
> perf.wiki.kernel.org, if you have time for that.

time isn't so much the issue but not having any clue as to what any
of the options do

2009-09-10 17:03:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Bret Towe <[email protected]> wrote:

> On Thu, Sep 10, 2009 at 9:26 AM, Ingo Molnar <[email protected]> wrote:
> >
> > * Bret Towe <[email protected]> wrote:
> >
> >> On Thu, Sep 10, 2009 at 9:05 AM, Peter Zijlstra <[email protected]> wrote:
> >> > On Thu, 2009-09-10 at 09:02 -0700, Bret Towe wrote:
> >> >>
> >> >> thanks to this thread and others I've seen several kernel tunables
> >> >> that can effect how the scheduler performs/acts
> >> >> but what I don't see after a bit of looking is where all these are
> >> >> documented
> >> >> perhaps thats also part of the reason there are unhappy people with
> >> >> the current code in the kernel just because they don't know how
> >> >> to tune it for their workload
> >> >
> >> > The thing is, ideally they should not need to poke at these.
> >> > These knobs are under CONFIG_SCHED_DEBUG, and that is exactly
> >> > what they are for.
> >>
> >> even then I would think they should be documented so people can
> >> find out what item is hurting their workload so they can better
> >> report the bug no?
> >
> > Would be happy to apply such documentation patches. You could also
> > help start adding a 'scheduler performance' wiki portion to
> > perf.wiki.kernel.org, if you have time for that.
>
> time isn't so much the issue but not having any clue as to what
> any of the options do

One approach would be to list them in an email in this thread with
question marks and let people here fill them in - then help by
organizing and prettifying the result on the wiki.

Asking for clarifications when an explanation is unclear is also
helpful - those who write this code are not the best people to judge
whether technical descriptions are understandable enough.

Ingo

2009-09-10 17:53:57

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/10/2009 09:08 AM, Ingo Molnar wrote:
>
> * Nikos Chantziaras<[email protected]> wrote:
>>
>> With your version of latt.c, I get these results with 2.6-tip vs
>> 2.6.31-rc9-bfs:
>>
>>
>> (mainline)
>> Averages:
>> ------------------------------
>> Max 50 usec
>> Avg 12 usec
>> Stdev 3 usec
>>
>>
>> (BFS)
>> Averages:
>> ------------------------------
>> Max 474 usec
>> Avg 11 usec
>> Stdev 16 usec
>>
>> However, the interactivity problems still remain. Does that mean
>> it's not a latency issue?
>
> It means that Jens's test-app, which demonstrated and helped us fix
> the issue for him does not help us fix it for you just yet.
>
> The "fluidity problem" you described might not be a classic latency
> issue per se (which latt.c measures), but a timeslicing / CPU time
> distribution problem.
>
> A slight shift in CPU time allocation can change the flow of tasks
> to result in a 'choppier' system.
>
> Have you tried, in addition to the granularity tweaks you've done,
> to renice mplayer either up or down? (or compiz and Xorg for that
> matter)

Yes. It seems to do what one would expect, but only if two separate
programs are competing for CPU time continuously. For example, when
running two glxgears instances, one with nice 0 the other with 19, the
first will report ~5000 FPS, the other ~1000. Renicing the second one
from 19 to 0 will result in both reporting ~3000. So nice values
obviously work in distributing CPU time. But the problem isn't the
available CPU time, it seems, since even if running glxgears nice -20, it
will still freeze during various other interactive tasks (moving windows
etc.)


> [...]
> # echo NO_NEW_FAIR_SLEEPERS> /debug/sched_features
>
> Btw., NO_NEW_FAIR_SLEEPERS is something that will turn the scheduler
> into a more classic fair scheduler (like BFS is too).

Setting NO_NEW_FAIR_SLEEPERS (with everything else at default values)
pretty much solves all issues I raised in all my other posts! With this
setting, I can do "nice -n 19 make -j20" and still have a very smooth
desktop and watch a movie at the same time. Various other annoyances
(like the "logout/shutdown/restart" dialog of KDE not appearing at all
until the background fade-out effect has finished) are also gone. So
this seems to be the single most important setting that vastly improves
desktop behavior, at least here.

In fact, I liked this setting so much that I went to
kernel/sched_features.h of kernel 2.6.30.5 (the kernel I use normally
right now) and set SCHED_FEAT(NEW_FAIR_SLEEPERS, 0) (default is 1) with
absolutely no other tweaks (like sched_latency_ns,
sched_wakeup_granularity_ns, etc.). It pretty much behaves like BFS now
from an interactivity point of view. But I've used it only for about an
hour or so, so I don't know if any ill effects will appear later on.
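
In concrete terms, that amounts to flipping a single line in
kernel/sched_features.h - shown here as a sketch based on the description
above, not as a verified patch:

-SCHED_FEAT(NEW_FAIR_SLEEPERS, 1)
+SCHED_FEAT(NEW_FAIR_SLEEPERS, 0)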


> NO_START_DEBIT might be another thing that improves (or worsens :-/)
> make -j type of kernel build workloads.

No effect with this one, at least not one I could observe.

I didn't have the opportunity yet to test and tweak all the other
various settings you listed, but I will try to do so as soon as possible.

2009-09-10 18:01:00

by Ingo Molnar

[permalink] [raw]
Subject: [crash, bisected] Re: clocksource: Resolve cpu hotplug dead lock with TSC unstable


* Ingo Molnar <[email protected]> wrote:

>
> * Jens Axboe <[email protected]> wrote:
>
> > I went to try -tip btw, but it crashes on boot. Here's the
> > backtrace, typed manually, it's crashing in
> > queue_work_on+0x28/0x60.
> >
> > Call Trace:
> > queue_work
> > schedule_work
> > clocksource_mark_unstable
> > mark_tsc_unstable
> > check_tsc_sync_source
> > native_cpu_up
> > relay_hotcpu_callback
> > do_forK_idle
> > _cpu_up
> > cpu_up
> > kernel_init
> > kernel_thread_helper
>
> hm, that looks like an old bug i fixed days ago via:
>
> 00a3273: Revert "x86: Make tsc=reliable override boot time stability checks"
>
> Have you tested tip:master - do you still know which sha1?

Ok, i reproduced it on a testbox and bisected it, the crash is
caused by:

7285dd7fd375763bfb8ab1ac9cf3f1206f503c16 is first bad commit
commit 7285dd7fd375763bfb8ab1ac9cf3f1206f503c16
Author: Thomas Gleixner <[email protected]>
Date: Fri Aug 28 20:25:24 2009 +0200

clocksource: Resolve cpu hotplug dead lock with TSC unstable

Martin Schwidefsky analyzed it:

I've reverted it in tip/master for now.

Ingo

2009-09-10 18:46:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Nikos Chantziaras <[email protected]> wrote:

> On 09/10/2009 09:08 AM, Ingo Molnar wrote:
>>
>> * Nikos Chantziaras<[email protected]> wrote:
>>>
>>> With your version of latt.c, I get these results with 2.6-tip vs
>>> 2.6.31-rc9-bfs:
>>>
>>>
>>> (mainline)
>>> Averages:
>>> ------------------------------
>>> Max 50 usec
>>> Avg 12 usec
>>> Stdev 3 usec
>>>
>>>
>>> (BFS)
>>> Averages:
>>> ------------------------------
>>> Max 474 usec
>>> Avg 11 usec
>>> Stdev 16 usec
>>>
>>> However, the interactivity problems still remain. Does that mean
>>> it's not a latency issue?
>>
>> It means that Jens's test-app, which demonstrated and helped us fix
>> the issue for him does not help us fix it for you just yet.
>>
>> The "fluidity problem" you described might not be a classic latency
>> issue per se (which latt.c measures), but a timeslicing / CPU time
>> distribution problem.
>>
>> A slight shift in CPU time allocation can change the flow of tasks
>> to result in a 'choppier' system.
>>
>> Have you tried, in addition of the granularity tweaks you've done,
>> to renice mplayer either up or down? (or compiz and Xorg for that
>> matter)
>
> Yes. It seems to do what one would expect, but only if two separate
> programs are competing for CPU time continuously. For example, when
> running two glxgears instances, one with nice 0 the other with 19, the
> first will report ~5000 FPS, the other ~1000. Renicing the second one
> from 19 to 0, will result in both reporting ~3000. So nice values
> obviously work in distributing CPU time. But the problem isn't the
> available CPU time it seems since even if running glxgears nice -20, it
> will still freeze during various other interactive taks (moving windows
> etc.)
>
>
>> [...]
>> # echo NO_NEW_FAIR_SLEEPERS> /debug/sched_features
>>
>> Btw., NO_NEW_FAIR_SLEEPERS is something that will turn the scheduler
>> into a more classic fair scheduler (like BFS is too).
>
> Setting NO_NEW_FAIR_SLEEPERS (with everything else at default
> values) pretty much solves all issues I raised in all my other
> posts! With this setting, I can do "nice -n 19 make -j20" and
> still have a very smooth desktop and watch a movie at the same
> time. Various other annoyances (like the
> "logout/shutdown/restart" dialog of KDE not appearing at all until
> the background fade-out effect has finished) are also gone. So
> this seems to be the single most important setting that vastly
> improves desktop behavior, at least here.
>
> In fact, I liked this setting so much that I went to
> kernel/sched_features.h of kernel 2.6.30.5 (the kernel I use
> normally right now) and set SCHED_FEAT(NEW_FAIR_SLEEPERS, 0)
> (default is 1) with absolutely no other tweaks (like
> sched_latency_ns, sched_wakeup_granularity_ns, etc.). It pretty
> much behaves like BFS now from an interactivity point of view.
> But I've used it only for about an hour or so, so I don't know if
> any ill effects will appear later on.

ok, this is quite an important observation!

Either NEW_FAIR_SLEEPERS is broken, or if it works it's not what we
want to do. Other measures in the scheduler protect us from fatal
badness here, but all the finer wakeup behavior is out the window
really.

Will check this. We'll probably start with a quick commit disabling
it first - then re-enabling it if it's fixed (will Cc: you so that
you can re-test with fixed-NEW_FAIR_SLEEPERS, if it's re-enabled).

Thanks a lot for the persistent testing!

Ingo

2009-09-10 18:52:13

by Ingo Molnar

[permalink] [raw]
Subject: [tip:sched/core] sched: Disable NEW_FAIR_SLEEPERS for now

Commit-ID: 3f2aa307c4d26b4ed6509d0a79e8254c9e07e921
Gitweb: http://git.kernel.org/tip/3f2aa307c4d26b4ed6509d0a79e8254c9e07e921
Author: Ingo Molnar <[email protected]>
AuthorDate: Thu, 10 Sep 2009 20:34:48 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 10 Sep 2009 20:34:48 +0200

sched: Disable NEW_FAIR_SLEEPERS for now

Nikos Chantziaras and Jens Axboe reported that turning off
NEW_FAIR_SLEEPERS improves desktop interactivity visibly.

Nikos described his experiences the following way:

" With this setting, I can do "nice -n 19 make -j20" and
still have a very smooth desktop and watch a movie at
the same time. Various other annoyances (like the
"logout/shutdown/restart" dialog of KDE not appearing
at all until the background fade-out effect has finished)
are also gone. So this seems to be the single most
important setting that vastly improves desktop behavior,
at least here. "

Jens described it the following way, referring to a 10-seconds
xmodmap scheduling delay he was trying to debug:

" Then I tried switching NO_NEW_FAIR_SLEEPERS on, and then
I get:

Performance counter stats for 'xmodmap .xmodmap-carl':

9.009137 task-clock-msecs # 0.447 CPUs
18 context-switches # 0.002 M/sec
1 CPU-migrations # 0.000 M/sec
315 page-faults # 0.035 M/sec

0.020167093 seconds time elapsed

Woot! "

So disable it for now. In perf trace output i can see weird
delta timestamps:

cc1-9943 [001] 2802.059479616: sched_stat_wait: task: as:9944 wait: 2801938766276 [ns]

That nsec field is not supposed to be that large. More digging
is needed - but lets turn it off while the real bug is found.

Reported-by: Nikos Chantziaras <[email protected]>
Tested-by: Nikos Chantziaras <[email protected]>
Reported-by: Jens Axboe <[email protected]>
Tested-by: Jens Axboe <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Mike Galbraith <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


---
kernel/sched_features.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 4569bfa..e2dc63a 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -1,4 +1,4 @@
-SCHED_FEAT(NEW_FAIR_SLEEPERS, 1)
+SCHED_FEAT(NEW_FAIR_SLEEPERS, 0)
SCHED_FEAT(NORMALIZED_SLEEPER, 0)
SCHED_FEAT(ADAPTIVE_GRAN, 1)
SCHED_FEAT(WAKEUP_PREEMPT, 1)

2009-09-10 18:58:17

by Ingo Molnar

[permalink] [raw]
Subject: [tip:sched/core] sched: Fix sched::sched_stat_wait tracepoint field

Commit-ID: e1f8450854d69f0291882804406ea1bab3ca44b4
Gitweb: http://git.kernel.org/tip/e1f8450854d69f0291882804406ea1bab3ca44b4
Author: Ingo Molnar <[email protected]>
AuthorDate: Thu, 10 Sep 2009 20:52:09 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 10 Sep 2009 20:52:54 +0200

sched: Fix sched::sched_stat_wait tracepoint field

This weird perf trace output:

cc1-9943 [001] 2802.059479616: sched_stat_wait: task: as:9944 wait: 2801938766276 [ns]

Is caused by setting one component field of the delta to zero
a bit too early. Move it to later.

( Note, this does not affect the NEW_FAIR_SLEEPERS interactivity bug,
it's just a reporting bug in essence. )

Acked-by: Peter Zijlstra <[email protected]>
Cc: Nikos Chantziaras <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Mike Galbraith <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


---
kernel/sched_fair.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 26fadb4..aa7f841 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -545,14 +545,13 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
schedstat_set(se->wait_count, se->wait_count + 1);
schedstat_set(se->wait_sum, se->wait_sum +
rq_of(cfs_rq)->clock - se->wait_start);
- schedstat_set(se->wait_start, 0);
-
#ifdef CONFIG_SCHEDSTATS
if (entity_is_task(se)) {
trace_sched_stat_wait(task_of(se),
rq_of(cfs_rq)->clock - se->wait_start);
}
#endif
+ schedstat_set(se->wait_start, 0);
}

static inline void

2009-09-10 19:55:10

by Martin Steigerwald

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Am Mittwoch 09 September 2009 schrieb Peter Zijlstra:
> On Wed, 2009-09-09 at 12:05 +0300, Nikos Chantziaras wrote:
> > Thank you for mentioning min_granularity. After:
> >
> > echo 10000000 > /proc/sys/kernel/sched_latency_ns
> > echo 2000000 > /proc/sys/kernel/sched_min_granularity_ns
>
> You might also want to do:
>
> echo 2000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
>
> That affects when a newly woken task will preempt an already running
> task.

Heh, that scheduler thing again... and unfortunately Con appearing to feel
hurt, while I think that Ingo is honest in his offer of collaboration...

While it is fun playing with those numbers and indeed subjectively
experiencing a more fluid desktop, how about just a

echo "This is a f* desktop!" > /proc/sys/kernel/sched_workload

Or to put it in other words: the Linux kernel should not require me to
fine-tune three or more values to have the scheduler act in a way that
matches my workload.

I am willing to test stuff on my work thinkpad and my Amarok thinkpad in
order to help improve that.

--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part.

2009-09-10 20:06:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Martin Steigerwald <[email protected]> wrote:

> Am Mittwoch 09 September 2009 schrieb Peter Zijlstra:
> > On Wed, 2009-09-09 at 12:05 +0300, Nikos Chantziaras wrote:
> > > Thank you for mentioning min_granularity. After:
> > >
> > > echo 10000000 > /proc/sys/kernel/sched_latency_ns
> > > echo 2000000 > /proc/sys/kernel/sched_min_granularity_ns
> >
> > You might also want to do:
> >
> > echo 2000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
> >
> > That affects when a newly woken task will preempt an already running
> > task.
>
> Heh that scheduler thing again... and unfortunately Col appearing
> to feel hurt while I am think that Ingo is honest on his offer on
> collaboration...
>
> While it makes fun playing with that numbers and indeed
> experiencing subjectively a more fluid deskopt how about just a
>
> echo "This is a f* desktop!" > /proc/sys/kernel/sched_workload

No need to do that, that's supposed to be the default :-) The knobs
are really just there to help us make it even more so - i.e. you
dont need to tune them. But it really relies on people helping us
out and telling us which combinations work best ...

> Or to say it in other words: The Linux kernel should not require
> me to fine-tune three or more values to have the scheduler act in
> a way that matches my workload.
>
> I am willing to test stuff on my work thinkpad and my Amarok
> thinkpad in order to help improving with that.

It would be great if you could check latest -tip:

http://people.redhat.com/mingo/tip.git/README

and compare it to vanilla .31?

Also, could you outline the interactivity problems/complaints you
have?

Ingo

2009-09-10 20:25:47

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Tue, Sep 08, 2009 at 09:15:22PM +0300, Nikos Chantziaras wrote:
> On 09/07/2009 02:01 PM, Frederic Weisbecker wrote:
>> That looks eventually benchmarkable. This is about latency.
>> For example, you could try to run high load tasks in the
>> background and then launch a task that wakes up in middle/large
>> periods to do something. You could measure the time it takes to wake
>> it up to perform what it wants.
>>
>> We have some events tracing infrastructure in the kernel that can
>> snapshot the wake up and sched switch events.
>>
>> Having CONFIG_EVENT_TRACING=y should be sufficient for that.
>>
>> You just need to mount a debugfs point, say in /debug.
>>
>> Then you can activate these sched events by doing:
>>
>> echo 0> /debug/tracing/tracing_on
>> echo 1> /debug/tracing/events/sched/sched_switch/enable
>> echo 1> /debug/tracing/events/sched/sched_wake_up/enable
>>
>> #Launch your tasks
>>
>> echo 1> /debug/tracing/tracing_on
>>
>> #Wait for some time
>>
>> echo 0> /debug/tracing/tracing_off
>>
>> That will require some parsing of the result in /debug/tracing/trace
>> to get the delays between wake_up events and switch in events
>> for the task that periodically wakes up and then produce some
>> statistics such as the average or the maximum latency.
>>
>> That's a bit of a rough approach to measure such latencies but that
>> should work.
>
> I've tried this with 2.6.31-rc9 while running mplayer and alt+tabbing
> repeatedly to the point where mplayer starts to stall and drop frames.
> This produced a 4.1MB trace file (132k bzip2'ed):
>
> http://foss.math.aegean.gr/~realnc/kernel/trace1.bz2
>
> Uncompressed for online viewing:
>
> http://foss.math.aegean.gr/~realnc/kernel/trace1
>
> I must admit that I don't know what it is I'm looking at :P


Hehe :-)

Basically you have samples of two kinds of events:

- wake up (when thread A wakes up B)

The format is as follows:


task-pid
(the waker A)
|
| cpu timestamp event-name wakee(B) prio status
| | | | | | |
X-11482 [001] 1023.219246: sched_wakeup: task kwin:11571 [120] success=1

Here X is awakening kwin.


- sched switch (when the scheduler stops A and launches B)

A, task B, task
that gets that gets
sched sched
out in
A cpu timestamp event-name | A prio | B prio
| | | | | | | |
X-11482 [001] 1023.219247: sched_switch: task X:11482 [120] (R) ==> kwin:11571 [120]
|
|
State of A
For A's state we can have either:

R: TASK_RUNNING, the task is not sleeping but it is rescheduled for later
to let another task run

S: TASK_INTERRUPTIBLE, the task is sleeping, waiting for an event that may
wake it up. The task can be woken by a signal

D: TASK_UNINTERRUPTIBLE, same as above but it can't be woken by a signal.


Now what could be interesting is to measure the time between
such pairs of events:

- t0: A wakes up B
- t1: B is sched in

t1 - t0 would then be the scheduler latency, or at least part of it:

The scheduler latency may be an addition of several factors:

- the time it takes for the actual wake up to be performed (re-inserting
the task into a runqueue, which can be subject to the runqueue(s)
design, the rebalancing if needed, etc.)

- the time between when a task is woken up and when the scheduler
eventually decides to schedule it in.

- the time it takes to perform the task switch, which is not only
in the scheduler's scope. But the time it takes may depend on a
rebalancing decision (cache cold, etc.)

Unfortunately we can only measure the second part with the above ftrace
events. But that's still an interesting abstraction that covers a
large part of the scheduler latency.

We could write a tiny parser that could walk through such ftrace traces
and produce some average, maximum, standard deviation numbers.
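
For illustration, here is a minimal sketch of such a parser in C. It
assumes exactly the sample line layout quoted above (the real field
layout of /debug/tracing/trace varies between kernel versions), so
treat it as an illustration of the t1 - t0 pairing rather than a
finished tool:

/*
 * latparse.c - pair each sched_wakeup (t0) with the next sched_switch
 * that switches the wakee in (t1) and report wakeup-latency statistics.
 * The sscanf() patterns below are assumptions based on the sample
 * lines in this mail.
 *
 *   gcc -O2 -o latparse latparse.c -lm
 *   ./latparse < /debug/tracing/trace
 */
#include <stdio.h>
#include <string.h>
#include <math.h>

#define MAX_PID 65536

static double wake_ts[MAX_PID];	/* t0 per wakee pid, 0.0 = none pending */

int main(void)
{
	char line[512], comm[64], *p;
	double ts, sum = 0.0, sqsum = 0.0, max = 0.0;
	long n = 0;
	int pid;

	while (fgets(line, sizeof(line), stdin)) {
		/* "waker-pid [cpu] timestamp: event: ..." */
		if (sscanf(line, "%*s [%*d] %lf:", &ts) != 1)
			continue;

		if ((p = strstr(line, "sched_wakeup: task "))) {
			/* record t0 for the wakee, e.g. "task kwin:11571 [120]" */
			if (sscanf(p, "sched_wakeup: task %63[^:]:%d", comm, &pid) == 2 &&
			    pid > 0 && pid < MAX_PID)
				wake_ts[pid] = ts;
		} else if ((p = strstr(line, "==> "))) {
			/* sched_switch: "... ==> kwin:11571 [120]" gives t1 */
			if (sscanf(p, "==> %63[^:]:%d", comm, &pid) == 2 &&
			    pid > 0 && pid < MAX_PID && wake_ts[pid] != 0.0) {
				double d = ts - wake_ts[pid];	/* t1 - t0, in seconds */

				wake_ts[pid] = 0.0;
				sum += d;
				sqsum += d * d;
				if (d > max)
					max = d;
				n++;
			}
		}
	}

	if (n) {
		double avg = sum / n;

		printf("samples %ld  avg %.0f usec  max %.0f usec  stdev %.0f usec\n",
		       n, avg * 1e6, max * 1e6,
		       sqrt(fmax(sqsum / n - avg * avg, 0.0)) * 1e6);
	}
	return 0;
}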

But we have userspace tools that can parse ftrace events (through perf
counter), so I'm trying to write something there, hopefully I could get
a relevant end result.

Thanks.

2009-09-10 20:39:49

by Martin Steigerwald

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Am Donnerstag 10 September 2009 schrieb Ingo Molnar:
> * Martin Steigerwald <[email protected]> wrote:
> > Am Mittwoch 09 September 2009 schrieb Peter Zijlstra:
> > > On Wed, 2009-09-09 at 12:05 +0300, Nikos Chantziaras wrote:
> > > > Thank you for mentioning min_granularity. After:
> > > >
> > > > echo 10000000 > /proc/sys/kernel/sched_latency_ns
> > > > echo 2000000 > /proc/sys/kernel/sched_min_granularity_ns
> > >
> > > You might also want to do:
> > >
> > > echo 2000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
> > >
> > > That affects when a newly woken task will preempt an already
> > > running task.
> >
> > Heh that scheduler thing again... and unfortunately Col appearing
> > to feel hurt while I am think that Ingo is honest on his offer on
> > collaboration...
> >
> > While it makes fun playing with that numbers and indeed
> > experiencing subjectively a more fluid deskopt how about just a
> >
> > echo "This is a f* desktop!" > /proc/sys/kernel/sched_workload
>
> No need to do that, that's supposed to be the default :-) The knobs
> are really just there to help us make it even more so - i.e. you
> dont need to tune them. But it really relies on people helping us
> out and tell us which combinations work best ...

Well currently I have:

shambhala:/proc/sys/kernel> grep "" sched_latency_ns
sched_min_granularity_ns sched_wakeup_granularity_ns
sched_latency_ns:100000
sched_min_granularity_ns:200000
sched_wakeup_granularity_ns:0

And this gives me *a completely different* desktop experience.

I am using KDE 4.3.1 on a mixture of Debian Squeeze/Sid/Experimental, with
compositing. And now when I flip desktops or open a window I can *actually
see* the animation. Before I jusooooooooooooooooooooot saw two to five
steps of the animation,
now its really a lot more fluid.

perceived _latency--! Well its like
oooooooooooooooooooooooooooooooooooooooooooooooooooooooopening the eyes
again cause I tended
to take the jerky behavior as normal and possibly related to having KDE
4.3.1 with compositing enabled on a ThinkPad T42 with
ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
01:00.0 VGA compatible controller [0300]: ATI Technologies Inc RV350
[Mobility Radeon 9600 M10] [1002:4e50]

which I consider to be low end for that workload. But then why actually?
Next to me is a Sam440ep with PPC 440 667 MHz and and even older Radeon M9
with AmigaOS 4.1 and some simple transparency effects with compositing. And
well this combo does feel like it wheel spins cause the hardware is
actually to fast
foooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo

>
> > Or to say it in other words: The Linux kernel should not require
> > me to fine-tune three or more values to have the scheduler act in
> > a way that matches my workload.
> >
> > I am willing to test stuff on my work thinkpad and my Amarok
> > thinkpad in order to help improving with that.
>
> It would be great if you could check latest -tip:
>
> http://people.redhat.com/mingo/tip.git/README
>
> and compare it to vanilla .31?
>
> Also, could you outline the interactivity problems/complaints you
> have?
>
> Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> in the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>


--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7

2009-09-10 20:43:02

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Martin Steigerwald <[email protected]> wrote:

> Am Donnerstag 10 September 2009 schrieb Ingo Molnar:
> > * Martin Steigerwald <[email protected]> wrote:
> > > Am Mittwoch 09 September 2009 schrieb Peter Zijlstra:
> > > > On Wed, 2009-09-09 at 12:05 +0300, Nikos Chantziaras wrote:
> > > > > Thank you for mentioning min_granularity. After:
> > > > >
> > > > > echo 10000000 > /proc/sys/kernel/sched_latency_ns
> > > > > echo 2000000 > /proc/sys/kernel/sched_min_granularity_ns
> > > >
> > > > You might also want to do:
> > > >
> > > > echo 2000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
> > > >
> > > > That affects when a newly woken task will preempt an already
> > > > running task.
> > >
> > > Heh that scheduler thing again... and unfortunately Col appearing
> > > to feel hurt while I am think that Ingo is honest on his offer on
> > > collaboration...
> > >
> > > While it makes fun playing with that numbers and indeed
> > > experiencing subjectively a more fluid deskopt how about just a
> > >
> > > echo "This is a f* desktop!" > /proc/sys/kernel/sched_workload
> >
> > No need to do that, that's supposed to be the default :-) The knobs
> > are really just there to help us make it even more so - i.e. you
> > dont need to tune them. But it really relies on people helping us
> > out and tell us which combinations work best ...
>
> Well currently I have:
>
> shambhala:/proc/sys/kernel> grep "" sched_latency_ns
> sched_min_granularity_ns sched_wakeup_granularity_ns
> sched_latency_ns:100000
> sched_min_granularity_ns:200000
> sched_wakeup_granularity_ns:0
>
> And this give me *a completely different* desktop experience.

what is /debug/sched_features - is NO_NEW_FAIR_SLEEPERS set? If not
set yet then try it:

echo NO_NEW_FAIR_SLEEPERS > /debug/sched_features

that too might make things more fluid.

Ingo

2009-09-10 21:17:41

by Martin Steigerwald

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


Uhoh, it seems I configured my kernel's scheduler to produce keyboard
failure. The many o's in my last mails were not intended.

Let's see whether it's better with:

shambhala:/proc/sys/kernel> echo 2000000 >
/proc/sys/kernel/sched_wakeup_granularity_ns

The first time, I just lost the keyboard in X for a while in such a way that
even a Ctrl-Alt-F1 did not yield any effect. None of what I typed appeared
anywhere, not in the mail composer window, nor in a Konsole terminal, nor in
the Kickoff menu search field. The mouse still worked okay; I was able to log
out via mouse. Then even in KDM the keyboard did not work. Suddenly it
produced repeating key input events without me typing anything anymore -
like the second time with those o's, where I accidentally pressed send
instead of save as draft.

I have not had any keyboard issues like this recently, nor in the last year
or longer. Let's see whether that raised wakeup_granularity helps. The desktop
experience still seems quite fluid, maybe not as fluid as with the settings
below. But I prefer a working keyboard in order to finish up this mail.

Am Donnerstag 10 September 2009 schrieb Ingo Molnar:
> * Martin Steigerwald <[email protected]> wrote:
> > Am Mittwoch 09 September 2009 schrieb Peter Zijlstra:
> > > On Wed, 2009-09-09 at 12:05 +0300, Nikos Chantziaras wrote:
> > > > Thank you for mentioning min_granularity. After:
> > > >
> > > > echo 10000000 > /proc/sys/kernel/sched_latency_ns
> > > > echo 2000000 > /proc/sys/kernel/sched_min_granularity_ns
> > >
> > > You might also want to do:
> > >
> > > echo 2000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
> > >
> > > That affects when a newly woken task will preempt an already
> > > running task.
> >
> > Heh that scheduler thing again... and unfortunately Col appearing
> > to feel hurt while I am think that Ingo is honest on his offer on
> > collaboration...
> >
> > While it makes fun playing with that numbers and indeed
> > experiencing subjectively a more fluid deskopt how about just a
> >
> > echo "This is a f* desktop!" > /proc/sys/kernel/sched_workload
>
> No need to do that, that's supposed to be the default :-) The knobs
> are really just there to help us make it even more so - i.e. you
> dont need to tune them. But it really relies on people helping us
> out and tell us which combinations work best ...

Well currently I have:

shambhala:/proc/sys/kernel> grep "" sched_latency_ns
sched_min_granularity_ns sched_wakeup_granularity_ns
sched_latency_ns:100000
sched_min_granularity_ns:200000
sched_wakeup_granularity_ns:0

And this gives me *a completely different* desktop experience.

I am using KDE 4.3.1 on a mixture of Debian Squeeze/Sid/Experimental, with
compositing. And now when I flip desktops or open a window I can *actually
see* the animation. Before, I just saw two to five steps of the animation;
now it's really a lot more fluid.

perceived _latency--! Well it's like opening my eyes again, because I tended
to take the jerky behavior as normal and possibly related to having KDE
4.3.1 with compositing enabled on a ThinkPad T42 with


01:00.0 VGA compatible controller [0300]: ATI Technologies Inc RV350
[Mobility Radeon 9600 M10] [1002:4e50]
(with OSS Radeon driver)

which I consider to be low end for that workload. But then why actually?
Next to me is a Sam440ep with a PPC 440 at 667 MHz and an even older Radeon M9
with AmigaOS 4.1 and some simple transparency effects with compositing. And
well, this combo does feel like it wheel-spins because the hardware is
actually too fast [2nd keyboard borkage, somewhere before where the first
where I saved as draft] for that operating system. So actually I knew
there could be less waiting and less latency. (Granted, AmigaOS 4.1 is much
more minimalistic, also in terms of features, with no complete memory
protection, and message passing by exchanging pointers and whatnot.)

At least to summarize this: with those settings I just keep switching
desktops and opening windows to enjoy the effects. Desktops flip over
fluidly and windows zoom in fluidly on opening as well.

All those experiences are with:

shambhala:~> cat /proc/version
Linux version 2.6.31-rc7-tp42-toi-3.0.1-04741-g57e61c0 (martin@shambhala)
(gcc version 4.3.3 (Debian 4.3.3-10) ) #6 PREEMPT Sun Aug 23 10:51:32 CEST
2009

(Nigel Cunningham's tuxonice-head git from about 10 days ago)

Keyboard still working. Possibly it really did break with zero as the
wakeup granularity.

> > Or to say it in other words: The Linux kernel should not require
> > me to fine-tune three or more values to have the scheduler act in
> > a way that matches my workload.
> >
> > I am willing to test stuff on my work thinkpad and my Amarok
> > thinkpad in order to help improving with that.
>
> It would be great if you could check latest -tip:
>
> http://people.redhat.com/mingo/tip.git/README
>
> and compare it to vanilla .31?
>
> Also, could you outline the interactivity problems/complaints you
> have?

Hmmm, would there be a simple possibility to somehow merge the tuxonice git
and your tip.git into a nice TuxOnIce + scheduler-enhanced kernel? I tend
not to stick with non-TuxOnIce-enabled kernels for too long. At least not
on my work thinkpad and my Amarok thinkpad, because I believe that reboots
are just for kernel upgrades (with API changes, that is) ;-).

Apart from that, I lack the time to compile a kernel a day at the moment,
like in the good old RSDL, SD and CFS testing times ;-). But the next kernel,
2.6.31-not-a-rc, is due and I take suggestions for that one. Preferably I
would like to have it with TuxOnIce though.

Problems I faced:

1) Well, those effect issues. Jerky at best. Animations which should have
had at least 25 frames per second showed 2-5 frames a second. The above
tuning helped a lot with that. On the other hand, DVD playback with Dragon
Player (and Xine) seems just fine - I thought, at least. Maybe I should
compare watching Star Trek TNG with and without the scheduler latency fixes.
And maybe I'll find some additional frames per duration there too.

2) Some jerks here and there. Difficult to categorize. It sometimes just
happens that the machine does not follow where I put my attention. Like a
distracted human who has other things to do than listening to me. This
could be trying to enter some text in a Qt text input widget. But I need
to look more carefully as to when, where and why. This is just too fuzzy.

3) Sometimes even typing has a visible latency. It's difficult to spot the
cause of that. When I type I expect to see each letter as I type it.

4) I/O latencies causing the machine to actually stall for seconds. But
this one got much better when I switched to 2.6.31 - I had to skip 2.6.30
cause it didn't tuxonice nicely, even 2.6.31 did not until it reached rc5.
It seems even way better after switching from XFS to Ext4. But well that
is a different issue. And at the moment I am quite happy with that.

5) Some window manager operations like resizing windows take very long
with compositing. But I think this issue may lie elsewhere, because these
did not improve with the above settings while many compositing effects did.
I don't know where that slow window resizing comes from - whether it's a
compositing / KWin / Qt refresh issue, or something scheduler related, or
something gfx driver related.

Keyboard is still working, yay! So the X.org keyboard driver might get
irritated with zero as wakeup granularity.

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part.

2009-09-10 21:19:35

by Martin Steigerwald

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Am Donnerstag 10 September 2009 schrieb Ingo Molnar:
> * Martin Steigerwald <[email protected]> wrote:
> > Am Donnerstag 10 September 2009 schrieb Ingo Molnar:
> > > * Martin Steigerwald <[email protected]> wrote:
> > > > Am Mittwoch 09 September 2009 schrieb Peter Zijlstra:
> > > > > On Wed, 2009-09-09 at 12:05 +0300, Nikos Chantziaras wrote:
> > > > > > Thank you for mentioning min_granularity. After:
> > > > > >
> > > > > > echo 10000000 > /proc/sys/kernel/sched_latency_ns
> > > > > > echo 2000000 > /proc/sys/kernel/sched_min_granularity_ns
> > > > >
> > > > > You might also want to do:
> > > > >
> > > > > echo 2000000 >
> > > > > /proc/sys/kernel/sched_wakeup_granularity_ns
> > > > >
> > > > > That affects when a newly woken task will preempt an already
> > > > > running task.
> > > >
> > > > Heh that scheduler thing again... and unfortunately Col appearing
> > > > to feel hurt while I am think that Ingo is honest on his offer on
> > > > collaboration...
> > > >
> > > > While it makes fun playing with that numbers and indeed
> > > > experiencing subjectively a more fluid deskopt how about just a
> > > >
> > > > echo "This is a f* desktop!" > /proc/sys/kernel/sched_workload
> > >
> > > No need to do that, that's supposed to be the default :-) The knobs
> > > are really just there to help us make it even more so - i.e. you
> > > dont need to tune them. But it really relies on people helping us
> > > out and tell us which combinations work best ...
> >
> > Well currently I have:
> >
> > shambhala:/proc/sys/kernel> grep "" sched_latency_ns
> > sched_min_granularity_ns sched_wakeup_granularity_ns
> > sched_latency_ns:100000
> > sched_min_granularity_ns:200000
> > sched_wakeup_granularity_ns:0
> >
> > And this give me *a completely different* desktop experience.
>
> what is /debug/sched_features - is NO_NEW_FAIR_SLEEPERS set? If not
> set yet then try it:
>
> echo NO_NEW_FAIR_SLEEPERS > /debug/sched_features
>
> that too might make things more fluid.

Hmmm, I need to mount that first. But not today, because I have to dig out
how to do it. Have to pack some things for tomorrow. And then sleep time.

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part.

2009-09-11 01:36:36

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


> I'd say add an extra horizontal split in the second column, so you'd get
> three areas in the right column:
> - top for the global target (permanently)
> - middle for current, either:
> - "current most lagging" if "Global" is selected in left column
> - selected process if a specific target is selected in left column
> - bottom for backtrace
>
> Maybe with that setup "Global" in the left column should be renamed to
> something like "Dynamic".
>
> The backtrace area would show selection from either top or middle areas
> (so selecting a cause in top or middle area should unselect causes in the
> other).

I'll have a look after the merge window madness. Multiple windows is
also still an option I suppose even if i don't like it that much: we
could support double-click on an app or "global" in the left list,
making that pop a new window with the same content as the right pane for
that app (or global) that updates at the same time as the rest.

Somebody ping me if I seem to have forgotten about it in 2 weeks :-)

Ben.


2009-09-11 06:10:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23


* Serge Belyshev <[email protected]> wrote:

> Ingo Molnar <[email protected]> writes:
>
> > perf stat --repeat 3 make -j4 bzImage
>
> BFS hangs here:
>
> [ 128.859000] BUG: soft lockup - CPU#2 stuck for 61s! [sh:7946]
> [ 128.859016] Modules linked in:
> [ 128.859016] CPU 2:
> [ 128.859016] Modules linked in:
> [ 128.859016] Pid: 7946, comm: sh Not tainted 2.6.31-bfs211-dirty #4 GA-MA790FX-DQ6
> [ 128.859016] RIP: 0010:[<ffffffff81055a52>] [<ffffffff81055a52>] task_oncpu_function_call+0x22/0x40
> [ 128.859016] RSP: 0018:ffff880205967e18 EFLAGS: 00000246
> [ 128.859016] RAX: 0000000000000002 RBX: ffff880205964cc0 RCX: 000000000000dd00
> [ 128.859016] RDX: ffff880211138c00 RSI: ffffffff8108d3f0 RDI: ffff88022e42a100
> [ 128.859016] RBP: ffffffff8102d76e R08: ffff880028066000 R09: 0000000000000000
> [ 128.859016] R10: 0000000000000000 R11: 0000000000000058 R12: ffffffff8108d3f0
> [ 128.859016] R13: ffff880211138c00 R14: 0000000000000001 R15: 000000000000e260
> [ 128.859016] FS: 00002b9ba0924e00(0000) GS:ffff880028066000(0000) knlGS:0000000000000000
> [ 128.859016] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 128.859016] CR2: 00002b9ba091e4a8 CR3: 0000000001001000 CR4: 00000000000006e0
> [ 128.859016] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 128.859016] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 128.859016] Call Trace:
> [ 128.859016] [<ffffffff8108ee3b>] ? perf_counter_remove_from_context+0x3b/0x90
> [ 128.859016] [<ffffffff810904b4>] ? perf_counter_exit_task+0x114/0x340
> [ 128.859016] [<ffffffff810c3f66>] ? filp_close+0x56/0x90
> [ 128.859016] [<ffffffff8105d3ac>] ? do_exit+0x14c/0x6f0
> [ 128.859016] [<ffffffff8105d991>] ? do_group_exit+0x41/0xb0
> [ 128.859016] [<ffffffff8105da12>] ? sys_exit_group+0x12/0x20
> [ 128.859016] [<ffffffff8102cceb>] ? system_call_fastpath+0x16/0x1b
>
> So, got nothing to compare with.

Could still compare -j5 to -j4 on -tip, to see why -j4 is 3% short
of -j5's throughput.

(Plus maybe the NEW_FAIR_SLEEPERS change in -tip fixes the 3% drop.)

Ingo

2009-09-11 07:37:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [crash, bisected] Re: clocksource: Resolve cpu hotplug dead lock with TSC unstable


* Ingo Molnar <[email protected]> wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Jens Axboe <[email protected]> wrote:
> >
> > > I went to try -tip btw, but it crashes on boot. Here's the
> > > backtrace, typed manually, it's crashing in
> > > queue_work_on+0x28/0x60.
> > >
> > > Call Trace:
> > > queue_work
> > > schedule_work
> > > clocksource_mark_unstable
> > > mark_tsc_unstable
> > > check_tsc_sync_source
> > > native_cpu_up
> > > relay_hotcpu_callback
> > > do_forK_idle
> > > _cpu_up
> > > cpu_up
> > > kernel_init
> > > kernel_thread_helper
> >
> > hm, that looks like an old bug i fixed days ago via:
> >
> > 00a3273: Revert "x86: Make tsc=reliable override boot time stability checks"
> >
> > Have you tested tip:master - do you still know which sha1?
>
> Ok, i reproduced it on a testbox and bisected it, the crash is
> caused by:
>
> 7285dd7fd375763bfb8ab1ac9cf3f1206f503c16 is first bad commit
> commit 7285dd7fd375763bfb8ab1ac9cf3f1206f503c16
> Author: Thomas Gleixner <[email protected]>
> Date: Fri Aug 28 20:25:24 2009 +0200
>
> clocksource: Resolve cpu hotplug dead lock with TSC unstable
>
> Martin Schwidefsky analyzed it:
>
> I've reverted it in tip/master for now.

and that uncovers the circular locking bug that this commit was
supposed to fix ...

Martin?

Ingo

2009-09-11 07:48:30

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [crash, bisected] Re: clocksource: Resolve cpu hotplug dead lock with TSC unstable

On Fri, 11 Sep 2009 09:37:47 +0200
Ingo Molnar <[email protected]> wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > >
> > > * Jens Axboe <[email protected]> wrote:
> > >
> > > > I went to try -tip btw, but it crashes on boot. Here's the
> > > > backtrace, typed manually, it's crashing in
> > > > queue_work_on+0x28/0x60.
> > > >
> > > > Call Trace:
> > > > queue_work
> > > > schedule_work
> > > > clocksource_mark_unstable
> > > > mark_tsc_unstable
> > > > check_tsc_sync_source
> > > > native_cpu_up
> > > > relay_hotcpu_callback
> > > > do_forK_idle
> > > > _cpu_up
> > > > cpu_up
> > > > kernel_init
> > > > kernel_thread_helper
> > >
> > > hm, that looks like an old bug i fixed days ago via:
> > >
> > > 00a3273: Revert "x86: Make tsc=reliable override boot time stability checks"
> > >
> > > Have you tested tip:master - do you still know which sha1?
> >
> > Ok, i reproduced it on a testbox and bisected it, the crash is
> > caused by:
> >
> > 7285dd7fd375763bfb8ab1ac9cf3f1206f503c16 is first bad commit
> > commit 7285dd7fd375763bfb8ab1ac9cf3f1206f503c16
> > Author: Thomas Gleixner <[email protected]>
> > Date: Fri Aug 28 20:25:24 2009 +0200
> >
> > clocksource: Resolve cpu hotplug dead lock with TSC unstable
> >
> > Martin Schwidefsky analyzed it:
> >
> > I've reverted it in tip/master for now.
>
> and that uncovers the circular locking bug that this commit was
> supposed to fix ...
>
> Martin?

Damn, back to running around in circles ..

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-09-11 08:55:04

by Serge Belyshev

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23

Ingo Molnar <[email protected]> writes:

> Could still compare -j5 to -j4 on -tip, to see why -j4 is 3% short
> of -j5's throughput.
>
> (Plus maybe the NEW_FAIR_SLEEPERS change in -tip fixes the 3% drop.)

Will do in about 12 hours or so.

2009-09-11 10:10:16

by Matt

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Martin Steigerwald <Martin <at> lichtvoll.de> writes:

>
> Am Donnerstag 10 September 2009 schrieb Ingo Molnar:

[snip]

> > what is /debug/sched_features - is NO_NEW_FAIR_SLEEPERS set? If not
> > set yet then try it:
> >
> > echo NO_NEW_FAIR_SLEEPERS > /debug/sched_features
> >
> > that too might make things more fluid.

Hi Martin,

it made a tremendous difference, which still has to be tested out :)

Hi Ingo,

what adverse effect could

cat /proc/sys/kernel/sched_wakeup_granularity_ns
0

have on throughput?

Concerning that "NO_NEW_FAIR_SLEEPERS" switch - isn't it as easy as to

do the following ? (I'm not sure if there's supposed to be another debug)

echo NO_NEW_FAIR_SLEEPERS > /sys/kernel/debug/sched_features

which after the change says:

cat /sys/kernel/debug/sched_features
NO_NEW_FAIR_SLEEPERS NO_NORMALIZED_SLEEPER ADAPTIVE_GRAN WAKEUP_PREEMPT
START_DEBIT AFFINE_WAKEUPS CACHE_HOT_BUDDY SYNC_WAKEUPS NO_HRTICK NO_DOUBLE_TICK
ASYM_GRAN LB_BIAS LB_WAKEUP_UPDATE ASYM_EFF_LOAD NO_WAKEUP_OVERLAP LAST_BUDDY
OWNER_SPIN

I hope that's the correct switch ^^

Greetings, and please keep on improving the scheduler (especially with regard
to the desktop crowd).

Regards

Mat


(Sorry for the "double-post" - this one is including all of the CC
which GMane left out :) )

2009-09-11 13:33:12

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [crash, bisected] Re: clocksource: Resolve cpu hotplug dead lock with TSC unstable

On Fri, 11 Sep 2009 09:37:47 +0200
Ingo Molnar <[email protected]> wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > >
> > > * Jens Axboe <[email protected]> wrote:
> > >
> > > > I went to try -tip btw, but it crashes on boot. Here's the
> > > > backtrace, typed manually, it's crashing in
> > > > queue_work_on+0x28/0x60.
> > > >
> > > > Call Trace:
> > > > queue_work
> > > > schedule_work
> > > > clocksource_mark_unstable
> > > > mark_tsc_unstable
> > > > check_tsc_sync_source
> > > > native_cpu_up
> > > > relay_hotcpu_callback
> > > > do_forK_idle
> > > > _cpu_up
> > > > cpu_up
> > > > kernel_init
> > > > kernel_thread_helper
> > >
> > > hm, that looks like an old bug i fixed days ago via:
> > >
> > > 00a3273: Revert "x86: Make tsc=reliable override boot time stability checks"
> > >
> > > Have you tested tip:master - do you still know which sha1?
> >
> > Ok, i reproduced it on a testbox and bisected it, the crash is
> > caused by:
> >
> > 7285dd7fd375763bfb8ab1ac9cf3f1206f503c16 is first bad commit
> > commit 7285dd7fd375763bfb8ab1ac9cf3f1206f503c16
> > Author: Thomas Gleixner <[email protected]>
> > Date: Fri Aug 28 20:25:24 2009 +0200
> >
> > clocksource: Resolve cpu hotplug dead lock with TSC unstable
> >
> > Martin Schwidefsky analyzed it:
> >
> > I've reverted it in tip/master for now.
>
> and that uncovers the circular locking bug that this commit was
> supposed to fix ...
>
> Martin?

This patch should fix the obvious problem that the watchdog_work
structure is not yet initialized if the clocksource watchdog is not
running yet.
--
Subject: [PATCH] clocksource: statically initialize watchdog workqueue

From: Martin Schwidefsky <[email protected]>

The watchdog timer is started after the watchdog clocksource and at least
one watched clocksource have been registered. The clocksource work element
watchdog_work is initialized just before the clocksource timer is started.
This is too late for the clocksource_mark_unstable call from native_cpu_up.
To fix this use a static initializer for watchdog_work.

Signed-off-by: Martin Schwidefsky <[email protected]>
---
kernel/time/clocksource.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/time/clocksource.c
===================================================================
--- linux-2.6.orig/kernel/time/clocksource.c
+++ linux-2.6/kernel/time/clocksource.c
@@ -123,10 +123,12 @@ static DEFINE_MUTEX(clocksource_mutex);
static char override_name[32];

#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
+static void clocksource_watchdog_work(struct work_struct *work);
+
static LIST_HEAD(watchdog_list);
static struct clocksource *watchdog;
static struct timer_list watchdog_timer;
-static struct work_struct watchdog_work;
+static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
static DEFINE_SPINLOCK(watchdog_lock);
static cycle_t watchdog_last;
static int watchdog_running;
@@ -230,7 +232,6 @@ static inline void clocksource_start_wat
{
if (watchdog_running || !watchdog || list_empty(&watchdog_list))
return;
- INIT_WORK(&watchdog_work, clocksource_watchdog_work);
init_timer(&watchdog_timer);
watchdog_timer.function = clocksource_watchdog;
watchdog_last = watchdog->read(watchdog);

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-09-11 18:23:03

by Martin Schwidefsky

[permalink] [raw]
Subject: [tip:timers/core] clocksource: Resolve cpu hotplug dead lock with TSC unstable, fix crash

Commit-ID: f79e0258ea1f04d63db499479b5fb855dff6dbc5
Gitweb: http://git.kernel.org/tip/f79e0258ea1f04d63db499479b5fb855dff6dbc5
Author: Martin Schwidefsky <[email protected]>
AuthorDate: Fri, 11 Sep 2009 15:33:05 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 11 Sep 2009 20:17:18 +0200

clocksource: Resolve cpu hotplug dead lock with TSC unstable, fix crash

The watchdog timer is started after the watchdog clocksource
and at least one watched clocksource have been registered. The
clocksource work element watchdog_work is initialized just
before the clocksource timer is started. This is too late for
the clocksource_mark_unstable call from native_cpu_up. To fix
this use a static initializer for watchdog_work.

This resolves a boot crash reported by multiple people.

Signed-off-by: Martin Schwidefsky <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: John Stultz <[email protected]>
LKML-Reference: <20090911153305.3fe9a361@skybase>
Signed-off-by: Ingo Molnar <[email protected]>


---
kernel/time/clocksource.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index a0af4ff..5697155 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -123,10 +123,12 @@ static DEFINE_MUTEX(clocksource_mutex);
static char override_name[32];

#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
+static void clocksource_watchdog_work(struct work_struct *work);
+
static LIST_HEAD(watchdog_list);
static struct clocksource *watchdog;
static struct timer_list watchdog_timer;
-static struct work_struct watchdog_work;
+static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
static DEFINE_SPINLOCK(watchdog_lock);
static cycle_t watchdog_last;
static int watchdog_running;
@@ -257,7 +259,6 @@ static inline void clocksource_start_watchdog(void)
{
if (watchdog_running || !watchdog || list_empty(&watchdog_list))
return;
- INIT_WORK(&watchdog_work, clocksource_watchdog_work);
init_timer(&watchdog_timer);
watchdog_timer.function = clocksource_watchdog;
watchdog_last = watchdog->read(watchdog);

2009-09-11 18:33:42

by Volker Armin Hemmann

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Hi,

this is with 2.6.31+reiser4+fglrx
Phenom II X4 955

KDE 4.3.1, composite temporary disabled.
tvtime running.

load:
fat emerge with make -j5 running in one konsole tab (xulrunner being
compiled).

without NO_NEW_FAIR_SLEEPERS:

tvtime is smooth most of the time

with NO_NEW_FAIR_SLEEPERS:

tvtime is more jerky. Very visible in scenes with movement.

without background load:

both settings act the same. tvtime is smooth, video is smooth, games are nice.
No real difference.

config is attached.

Glück Auf,
Volker

dmesg:

[ 0.000000] Linux version 2.6.31r4 (root@energy) (gcc version 4.4.1 (Gentoo
4.4.1 p1.0) ) #1 SMP Thu Sep 10 10:48:07 CEST 2009
[ 0.000000] Command line: root=/dev/md1 md=3,/dev/sda3,/dev/sdb3,/dev/sdc3
nmi_watchdog=0 mtrr_spare_reg_nr=1
[ 0.000000] KERNEL supported cpus:
[ 0.000000] AMD AuthenticAMD
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
[ 0.000000] BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000e6000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000c7eb0000 (usable)
[ 0.000000] BIOS-e820: 00000000c7eb0000 - 00000000c7ec0000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000c7ec0000 - 00000000c7ef0000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000c7ef0000 - 00000000c7f00000 (reserved)
[ 0.000000] BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 0000000238000000 (usable)
[ 0.000000] DMI present.
[ 0.000000] AMI BIOS detected: BIOS may corrupt low RAM, working around it.
[ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable)
==> (reserved)
[ 0.000000] last_pfn = 0x238000 max_arch_pfn = 0x400000000
[ 0.000000] MTRR default type: uncachable
[ 0.000000] MTRR fixed ranges enabled:
[ 0.000000] 00000-9FFFF write-back
[ 0.000000] A0000-EFFFF uncachable
[ 0.000000] F0000-FFFFF write-protect
[ 0.000000] MTRR variable ranges enabled:
[ 0.000000] 0 base 000000000000 mask FFFF80000000 write-back
[ 0.000000] 1 base 000080000000 mask FFFFC0000000 write-back
[ 0.000000] 2 base 0000C0000000 mask FFFFF8000000 write-back
[ 0.000000] 3 disabled
[ 0.000000] 4 disabled
[ 0.000000] 5 disabled
[ 0.000000] 6 disabled
[ 0.000000] 7 disabled
[ 0.000000] TOM2: 0000000238000000 aka 9088M
[ 0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new
0x7010600070106
[ 0.000000] e820 update range: 00000000c8000000 - 0000000100000000 (usable)
==> (reserved)
[ 0.000000] last_pfn = 0xc7eb0 max_arch_pfn = 0x400000000
[ 0.000000] Scanning 0 areas for low memory corruption
[ 0.000000] modified physical RAM map:
[ 0.000000] modified: 0000000000000000 - 0000000000010000 (reserved)
[ 0.000000] modified: 0000000000010000 - 000000000009fc00 (usable)
[ 0.000000] modified: 000000000009fc00 - 00000000000a0000 (reserved)
[ 0.000000] modified: 00000000000e6000 - 0000000000100000 (reserved)
[ 0.000000] modified: 0000000000100000 - 00000000c7eb0000 (usable)
[ 0.000000] modified: 00000000c7eb0000 - 00000000c7ec0000 (ACPI data)
[ 0.000000] modified: 00000000c7ec0000 - 00000000c7ef0000 (ACPI NVS)
[ 0.000000] modified: 00000000c7ef0000 - 00000000c7f00000 (reserved)
[ 0.000000] modified: 00000000fff00000 - 0000000100000000 (reserved)
[ 0.000000] modified: 0000000100000000 - 0000000238000000 (usable)
[ 0.000000] initial memory mapped : 0 - 20000000
[ 0.000000] Using GB pages for direct mapping
[ 0.000000] init_memory_mapping: 0000000000000000-00000000c7eb0000
[ 0.000000] 0000000000 - 00c0000000 page 1G
[ 0.000000] 00c0000000 - 00c7e00000 page 2M
[ 0.000000] 00c7e00000 - 00c7eb0000 page 4k
[ 0.000000] kernel direct mapping tables up to c7eb0000 @ 10000-13000
[ 0.000000] init_memory_mapping: 0000000100000000-0000000238000000
[ 0.000000] 0100000000 - 0200000000 page 1G
[ 0.000000] 0200000000 - 0238000000 page 2M
[ 0.000000] kernel direct mapping tables up to 238000000 @ 12000-14000
[ 0.000000] ACPI: RSDP 00000000000fa7c0 00014 (v00 ACPIAM)
[ 0.000000] ACPI: RSDT 00000000c7eb0000 00040 (v01 050609 RSDT2000 20090506
MSFT 00000097)
[ 0.000000] ACPI: FACP 00000000c7eb0200 00084 (v02 A M I OEMFACP 12000601
MSFT 00000097)
[ 0.000000] ACPI: DSDT 00000000c7eb0440 08512 (v01 AS140 AS140121 00000121
INTL 20051117)
[ 0.000000] ACPI: FACS 00000000c7ec0000 00040
[ 0.000000] ACPI: APIC 00000000c7eb0390 0006C (v01 050609 APIC2000 20090506
MSFT 00000097)
[ 0.000000] ACPI: MCFG 00000000c7eb0400 0003C (v01 050609 OEMMCFG 20090506
MSFT 00000097)
[ 0.000000] ACPI: OEMB 00000000c7ec0040 00071 (v01 050609 OEMB2000 20090506
MSFT 00000097)
[ 0.000000] ACPI: AAFT 00000000c7eb8960 00027 (v01 050609 OEMAAFT 20090506
MSFT 00000097)
[ 0.000000] ACPI: HPET 00000000c7eb8990 00038 (v01 050609 OEMHPET 20090506
MSFT 00000097)
[ 0.000000] ACPI: SSDT 00000000c7eb89d0 0088C (v01 A M I POWERNOW 00000001
AMD 00000001)
[ 0.000000] ACPI: Local APIC address 0xfee00000
[ 0.000000] (7 early reservations) ==> bootmem [0000000000 - 0238000000]
[ 0.000000] #0 [0000000000 - 0000001000] BIOS data page ==> [0000000000
- 0000001000]
[ 0.000000] #1 [0000006000 - 0000008000] TRAMPOLINE ==> [0000006000
- 0000008000]
[ 0.000000] #2 [0001000000 - 00015fb8c0] TEXT DATA BSS ==> [0001000000
- 00015fb8c0]
[ 0.000000] #3 [000009fc00 - 0000100000] BIOS reserved ==> [000009fc00
- 0000100000]
[ 0.000000] #4 [00015fc000 - 00015fc133] BRK ==> [00015fc000
- 00015fc133]
[ 0.000000] #5 [0000010000 - 0000012000] PGTABLE ==> [0000010000
- 0000012000]
[ 0.000000] #6 [0000012000 - 0000013000] PGTABLE ==> [0000012000
- 0000013000]
[ 0.000000] [ffffea0000000000-ffffea0007dfffff] PMD -> [ffff880028600000-
ffff88002f7fffff] on node 0
[ 0.000000] Zone PFN ranges:
[ 0.000000] DMA 0x00000010 -> 0x00001000
[ 0.000000] DMA32 0x00001000 -> 0x00100000
[ 0.000000] Normal 0x00100000 -> 0x00238000
[ 0.000000] Movable zone start PFN for each node
[ 0.000000] early_node_map[3] active PFN ranges
[ 0.000000] 0: 0x00000010 -> 0x0000009f
[ 0.000000] 0: 0x00000100 -> 0x000c7eb0
[ 0.000000] 0: 0x00100000 -> 0x00238000
[ 0.000000] On node 0 totalpages: 2096703
[ 0.000000] DMA zone: 56 pages used for memmap
[ 0.000000] DMA zone: 102 pages reserved
[ 0.000000] DMA zone: 3825 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 14280 pages used for memmap
[ 0.000000] DMA32 zone: 800488 pages, LIFO batch:31
[ 0.000000] Normal zone: 17472 pages used for memmap
[ 0.000000] Normal zone: 1260480 pages, LIFO batch:31
[ 0.000000] ACPI: PM-Timer IO Port: 0x808
[ 0.000000] ACPI: Local APIC address 0xfee00000
[ 0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
[ 0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
[ 0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
[ 0.000000] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled)
[ 0.000000] ACPI: IOAPIC (id[0x04] address[0xfec00000] gsi_base[0])
[ 0.000000] IOAPIC[0]: apic_id 4, version 33, address 0xfec00000, GSI 0-23
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
[ 0.000000] ACPI: IRQ0 used by override.
[ 0.000000] ACPI: IRQ2 used by override.
[ 0.000000] ACPI: IRQ9 used by override.
[ 0.000000] Using ACPI (MADT) for SMP configuration information
[ 0.000000] ACPI: HPET id: 0x8300 base: 0xfed00000
[ 0.000000] SMP: Allowing 4 CPUs, 0 hotplug CPUs
[ 0.000000] nr_irqs_gsi: 24
[ 0.000000] PM: Registered nosave memory: 000000000009f000 -
00000000000a0000
[ 0.000000] PM: Registered nosave memory: 00000000000a0000 -
00000000000e6000
[ 0.000000] PM: Registered nosave memory: 00000000000e6000 -
0000000000100000
[ 0.000000] PM: Registered nosave memory: 00000000c7eb0000 -
00000000c7ec0000
[ 0.000000] PM: Registered nosave memory: 00000000c7ec0000 -
00000000c7ef0000
[ 0.000000] PM: Registered nosave memory: 00000000c7ef0000 -
00000000c7f00000
[ 0.000000] PM: Registered nosave memory: 00000000c7f00000 -
00000000fff00000
[ 0.000000] PM: Registered nosave memory: 00000000fff00000 -
0000000100000000
[ 0.000000] Allocating PCI resources starting at c7f00000 (gap:
c7f00000:38000000)
[ 0.000000] NR_CPUS:4 nr_cpumask_bits:4 nr_cpu_ids:4 nr_node_ids:1
[ 0.000000] PERCPU: Embedded 25 pages at ffff880028034000, static data 72160
bytes
[ 0.000000] Built 1 zonelists in Zone order, mobility grouping on. Total
pages: 2064793
[ 0.000000] Kernel command line: root=/dev/md1
md=3,/dev/sda3,/dev/sdb3,/dev/sdc3 nmi_watchdog=0 mtrr_spare_reg_nr=1
[ 0.000000] md: Will configure md3 (super-block) from
/dev/sda3,/dev/sdb3,/dev/sdc3, below.
[ 0.000000] PID hash table entries: 4096 (order: 12, 32768 bytes)
[ 0.000000] Dentry cache hash table entries: 1048576 (order: 11, 8388608
bytes)
[ 0.000000] Inode-cache hash table entries: 524288 (order: 10, 4194304
bytes)
[ 0.000000] Initializing CPU#0
[ 0.000000] Checking aperture...
[ 0.000000] No AGP bridge found
[ 0.000000] Node 0: aperture @ 2a42000000 size 32 MB
[ 0.000000] Aperture beyond 4GB. Ignoring.
[ 0.000000] Your BIOS doesn't leave a aperture memory hole
[ 0.000000] Please enable the IOMMU option in the BIOS setup
[ 0.000000] This costs you 64 MB of RAM
[ 0.000000] Mapping aperture over 65536 KB of RAM @ 20000000
[ 0.000000] PM: Registered nosave memory: 0000000020000000 -
0000000024000000
[ 0.000000] Memory: 8184476k/9306112k available (3500k kernel code, 919300k
absent, 201380k reserved, 1751k data, 376k init)
[ 0.000000] SLUB: Genslabs=13, HWalign=64, Order=0-3, MinObjects=0, CPUs=4,
Nodes=1
[ 0.000000] Hierarchical RCU implementation.
[ 0.000000] NR_IRQS:4352 nr_irqs:440
[ 0.000000] Fast TSC calibration using PIT
[ 0.000000] Detected 3200.214 MHz processor.
[ 0.000609] Console: colour VGA+ 80x25
[ 0.000611] console [tty0] enabled
[ 0.003333] hpet clockevent registered
[ 0.003333] alloc irq_desc for 24 on node 0
[ 0.003333] alloc kstat_irqs on node 0
[ 0.003333] HPET: 4 timers in total, 1 timers will be used for per-cpu
timer
[ 0.003339] Calibrating delay loop (skipped), value calculated using timer
frequency.. 6402.10 BogoMIPS (lpj=10667366)
[ 0.003421] Mount-cache hash table entries: 256
[ 0.003543] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64
bytes/line)
[ 0.003578] CPU: L2 Cache: 512K (64 bytes/line)
[ 0.003613] tseg: 0000000000
[ 0.003618] CPU: Physical Processor ID: 0
[ 0.003652] CPU: Processor Core ID: 0
[ 0.003686] mce: CPU supports 6 MCE banks
[ 0.003725] using C1E aware idle routine
[ 0.003768] ACPI: Core revision 20090521
[ 0.016704] Setting APIC routing to flat
[ 0.017026] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.050860] CPU0: AMD Phenom(tm) II X4 955 Processor stepping 02
[ 0.053333] Booting processor 1 APIC 0x1 ip 0x6000
[ 0.003333] Initializing CPU#1
[ 0.003333] Calibrating delay using timer specific routine.. 6402.85
BogoMIPS (lpj=10666966)
[ 0.003333] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64
bytes/line)
[ 0.003333] CPU: L2 Cache: 512K (64 bytes/line)
[ 0.003333] CPU: Physical Processor ID: 0
[ 0.003333] CPU: Processor Core ID: 1
[ 0.003333] mce: CPU supports 6 MCE banks
[ 0.003333] x86 PAT enabled: cpu 1, old 0x7040600070406, new
0x7010600070106
[ 0.144161] CPU1: AMD Phenom(tm) II X4 955 Processor stepping 02
[ 0.144507] checking TSC synchronization [CPU#0 -> CPU#1]: passed.
[ 0.146699] Booting processor 2 APIC 0x2 ip 0x6000
[ 0.003333] Initializing CPU#2
[ 0.003333] Calibrating delay using timer specific routine.. 6402.85
BogoMIPS (lpj=10666970)
[ 0.003333] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64
bytes/line)
[ 0.003333] CPU: L2 Cache: 512K (64 bytes/line)
[ 0.003333] CPU: Physical Processor ID: 0
[ 0.003333] CPU: Processor Core ID: 2
[ 0.003333] mce: CPU supports 6 MCE banks
[ 0.003333] x86 PAT enabled: cpu 2, old 0x7040600070406, new
0x7010600070106
[ 0.240822] CPU2: AMD Phenom(tm) II X4 955 Processor stepping 02
[ 0.241168] checking TSC synchronization [CPU#0 -> CPU#2]: passed.
[ 0.243373] Booting processor 3 APIC 0x3 ip 0x6000
[ 0.003333] Initializing CPU#3
[ 0.003333] Calibrating delay using timer specific routine.. 6402.85
BogoMIPS (lpj=10666972)
[ 0.003333] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64
bytes/line)
[ 0.003333] CPU: L2 Cache: 512K (64 bytes/line)
[ 0.003333] CPU: Physical Processor ID: 0
[ 0.003333] CPU: Processor Core ID: 3
[ 0.003333] mce: CPU supports 6 MCE banks
[ 0.003333] x86 PAT enabled: cpu 3, old 0x7040600070406, new
0x7010600070106
[ 0.337491] CPU3: AMD Phenom(tm) II X4 955 Processor stepping 02
[ 0.337836] checking TSC synchronization [CPU#0 -> CPU#3]: passed.
[ 0.340006] Brought up 4 CPUs
[ 0.340040] Total of 4 processors activated (25611.67 BogoMIPS).
[ 0.340109] CPU0 attaching sched-domain:
[ 0.340111] domain 0: span 0-3 level MC
[ 0.340112] groups: 0 1 2 3
[ 0.340116] CPU1 attaching sched-domain:
[ 0.340117] domain 0: span 0-3 level MC
[ 0.340118] groups: 1 2 3 0
[ 0.340120] CPU2 attaching sched-domain:
[ 0.340121] domain 0: span 0-3 level MC
[ 0.340122] groups: 2 3 0 1
[ 0.340125] CPU3 attaching sched-domain:
[ 0.340126] domain 0: span 0-3 level MC
[ 0.340127] groups: 3 0 1 2
[ 0.340162] xor: automatically using best checksumming function:
generic_sse
[ 0.356667] generic_sse: 12835.200 MB/sec
[ 0.356700] xor: using function: generic_sse (12835.200 MB/sec)
[ 0.356754] Time: 7:07:42 Date: 09/11/09
[ 0.356802] NET: Registered protocol family 16
[ 0.356851] node 0 link 0: io port [1000, ffffff]
[ 0.356851] TOM: 00000000c8000000 aka 3200M
[ 0.356851] Fam 10h mmconf [e0000000, efffffff]
[ 0.356851] node 0 link 0: mmio [e0000000, efffffff] ==> none
[ 0.356851] node 0 link 0: mmio [f0000000, ffffffff]
[ 0.356851] node 0 link 0: mmio [a0000, bffff]
[ 0.356851] node 0 link 0: mmio [c8000000, dfffffff]
[ 0.356851] TOM2: 0000000238000000 aka 9088M
[ 0.356851] bus: [00,07] on node 0 link 0
[ 0.356851] bus: 00 index 0 io port: [0, ffff]
[ 0.356851] bus: 00 index 1 mmio: [f0000000, ffffffff]
[ 0.356851] bus: 00 index 2 mmio: [a0000, bffff]
[ 0.356851] bus: 00 index 3 mmio: [c8000000, dfffffff]
[ 0.356851] bus: 00 index 4 mmio: [238000000, fcffffffff]
[ 0.356851] ACPI: bus type pci registered
[ 0.356851] PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
[ 0.356851] PCI: Not using MMCONFIG.
[ 0.356851] PCI: Using configuration type 1 for base access
[ 0.356851] PCI: Using configuration type 1 for extended access
[ 0.356851] bio: create slab <bio-0> at 0
[ 0.356978] ACPI: EC: Look up EC in DSDT
[ 0.367031] ACPI: Interpreter enabled
[ 0.367538] ACPI: (supports S0 S1 S3 S4 S5)
[ 0.367652] ACPI: Using IOAPIC for interrupt routing
[ 0.367725] PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
[ 0.370531] PCI: MCFG area at e0000000 reserved in ACPI motherboard
resources
[ 0.375947] PCI: Using MMCONFIG at e0000000 - efffffff
[ 0.380367] ACPI: No dock devices found.
[ 0.380451] ACPI: PCI Root Bridge [PCI0] (0000:00)
[ 0.380520] pci 0000:00:00.0: reg 1c 64bit mmio: [0xe0000000-0xffffffff]
[ 0.380520] pci 0000:00:02.0: PME# supported from D0 D3hot D3cold
[ 0.380520] pci 0000:00:02.0: PME# disabled
[ 0.380520] pci 0000:00:09.0: PME# supported from D0 D3hot D3cold
[ 0.380520] pci 0000:00:09.0: PME# disabled
[ 0.380520] pci 0000:00:0a.0: PME# supported from D0 D3hot D3cold
[ 0.380520] pci 0000:00:0a.0: PME# disabled
[ 0.380520] pci 0000:00:11.0: reg 10 io port: [0xa000-0xa007]
[ 0.380520] pci 0000:00:11.0: reg 14 io port: [0x9000-0x9003]
[ 0.380520] pci 0000:00:11.0: reg 18 io port: [0x8000-0x8007]
[ 0.380520] pci 0000:00:11.0: reg 1c io port: [0x7000-0x7003]
[ 0.380520] pci 0000:00:11.0: reg 20 io port: [0x6000-0x600f]
[ 0.380520] pci 0000:00:11.0: reg 24 32bit mmio: [0xfddff800-0xfddffbff]
[ 0.380520] pci 0000:00:12.0: reg 10 32bit mmio: [0xfddfe000-0xfddfefff]
[ 0.380520] pci 0000:00:12.1: reg 10 32bit mmio: [0xfddfd000-0xfddfdfff]
[ 0.380576] pci 0000:00:12.2: reg 10 32bit mmio: [0xfddff000-0xfddff0ff]
[ 0.380625] pci 0000:00:12.2: supports D1 D2
[ 0.380626] pci 0000:00:12.2: PME# supported from D0 D1 D2 D3hot
[ 0.380663] pci 0000:00:12.2: PME# disabled
[ 0.380724] pci 0000:00:13.0: reg 10 32bit mmio: [0xfddfc000-0xfddfcfff]
[ 0.380775] pci 0000:00:13.1: reg 10 32bit mmio: [0xfddf7000-0xfddf7fff]
[ 0.380843] pci 0000:00:13.2: reg 10 32bit mmio: [0xfddf6800-0xfddf68ff]
[ 0.380893] pci 0000:00:13.2: supports D1 D2
[ 0.380894] pci 0000:00:13.2: PME# supported from D0 D1 D2 D3hot
[ 0.380930] pci 0000:00:13.2: PME# disabled
[ 0.381072] pci 0000:00:14.1: reg 10 io port: [0x00-0x07]
[ 0.381078] pci 0000:00:14.1: reg 14 io port: [0x00-0x03]
[ 0.381084] pci 0000:00:14.1: reg 18 io port: [0x00-0x07]
[ 0.381089] pci 0000:00:14.1: reg 1c io port: [0x00-0x03]
[ 0.381095] pci 0000:00:14.1: reg 20 io port: [0xff00-0xff0f]
[ 0.381219] pci 0000:00:14.5: reg 10 32bit mmio: [0xfddf5000-0xfddf5fff]
[ 0.381353] pci 0000:02:00.0: reg 10 64bit mmio: [0xd0000000-0xdfffffff]
[ 0.381360] pci 0000:02:00.0: reg 18 64bit mmio: [0xfdff0000-0xfdffffff]
[ 0.381365] pci 0000:02:00.0: reg 20 io port: [0xc000-0xc0ff]
[ 0.381372] pci 0000:02:00.0: reg 30 32bit mmio: [0xfdfc0000-0xfdfdffff]
[ 0.381387] pci 0000:02:00.0: supports D1 D2
[ 0.381415] pci 0000:02:00.1: reg 10 64bit mmio: [0xfdfec000-0xfdfeffff]
[ 0.381445] pci 0000:02:00.1: supports D1 D2
[ 0.381495] pci 0000:00:02.0: bridge io port: [0xc000-0xcfff]
[ 0.381497] pci 0000:00:02.0: bridge 32bit mmio: [0xfdf00000-0xfdffffff]
[ 0.381500] pci 0000:00:02.0: bridge 64bit mmio pref: [0xd0000000-0xdfffffff]
[ 0.381543] pci 0000:00:09.0: bridge io port: [0xd000-0xdfff]
[ 0.381545] pci 0000:00:09.0: bridge 32bit mmio: [0xfe000000-0xfebfffff]
[ 0.381548] pci 0000:00:09.0: bridge 64bit mmio pref: [0xfa000000-0xfcefffff]
[ 0.383347] pci 0000:01:00.0: reg 10 io port: [0xb800-0xb8ff]
[ 0.383360] pci 0000:01:00.0: reg 18 64bit mmio: [0xcffff000-0xcfffffff]
[ 0.383370] pci 0000:01:00.0: reg 20 64bit mmio: [0xcffe0000-0xcffeffff]
[ 0.383375] pci 0000:01:00.0: reg 30 32bit mmio: [0xfdef0000-0xfdefffff]
[ 0.383402] pci 0000:01:00.0: supports D1 D2
[ 0.383403] pci 0000:01:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 0.383440] pci 0000:01:00.0: PME# disabled
[ 0.383525] pci 0000:00:0a.0: bridge io port: [0xb000-0xbfff]
[ 0.383527] pci 0000:00:0a.0: bridge 32bit mmio: [0xfde00000-0xfdefffff]
[ 0.383530] pci 0000:00:0a.0: bridge 64bit mmio pref: [0xcff00000-0xcfffffff]
[ 0.383565] pci 0000:05:06.0: reg 10 32bit mmio: [0xfcfff000-0xfcffffff]
[ 0.383644] pci 0000:05:08.0: reg 10 io port: [0xe800-0xe83f]
[ 0.383701] pci 0000:05:08.0: supports D1 D2
[ 0.383743] pci 0000:00:14.4: transparent bridge
[ 0.383779] pci 0000:00:14.4: bridge io port: [0xe000-0xefff]
[ 0.383785] pci 0000:00:14.4: bridge 32bit mmio pref: [0xfcf00000-0xfcffffff]
[ 0.383797] pci_bus 0000:00: on NUMA node 0
[ 0.383800] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
[ 0.383936] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCE2._PRT]
[ 0.383981] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCEA._PRT]
[ 0.384025] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0PC._PRT]
[ 0.384091] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCE9._PRT]
[ 0.386726] ACPI: PCI Interrupt Link [LNKA] (IRQs 4 *7 10 11 12 14 15)
[ 0.386953] ACPI: PCI Interrupt Link [LNKB] (IRQs 4 7 10 *11 12 14 15)
[ 0.387183] ACPI: PCI Interrupt Link [LNKC] (IRQs 4 7 *10 11 12 14 15)
[ 0.387408] ACPI: PCI Interrupt Link [LNKD] (IRQs 4 7 *10 11 12 14 15)
[ 0.387643] ACPI: PCI Interrupt Link [LNKE] (IRQs 4 7 10 11 12 14 15) *0,
disabled.
[ 0.387914] ACPI: PCI Interrupt Link [LNKF] (IRQs 4 7 10 *11 12 14 15)
[ 0.388140] ACPI: PCI Interrupt Link [LNKG] (IRQs *4 10 11 12 14 15)
[ 0.388352] ACPI: PCI Interrupt Link [LNKH] (IRQs 4 7 *10 11 12 14 15)
[ 0.388538] SCSI subsystem initialized
[ 0.388538] libata version 3.00 loaded.
[ 0.388538] usbcore: registered new interface driver usbfs
[ 0.388538] usbcore: registered new interface driver hub
[ 0.388538] usbcore: registered new device driver usb
[ 0.443347] raid6: int64x1 2755 MB/s
[ 0.500010] raid6: int64x2 3858 MB/s
[ 0.556669] raid6: int64x4 2850 MB/s
[ 0.613353] raid6: int64x8 2537 MB/s
[ 0.670007] raid6: sse2x1 3999 MB/s
[ 0.726666] raid6: sse2x2 7012 MB/s
[ 0.783343] raid6: sse2x4 7975 MB/s
[ 0.783377] raid6: using algorithm sse2x4 (7975 MB/s)
[ 0.783423] PCI: Using ACPI for IRQ routing
[ 0.783423] pci 0000:00:00.0: BAR 3: address space collision on of device
[0xe0000000-0xffffffff]
[ 0.783425] pci 0000:00:00.0: BAR 3: can't allocate resource
[ 0.793433] PCI-DMA: Disabling AGP.
[ 0.793518] PCI-DMA: aperture base @ 20000000 size 65536 KB
[ 0.793518] PCI-DMA: using GART IOMMU.
[ 0.793518] PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
[ 0.795205] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 24, 0
[ 0.795312] hpet0: 4 comparators, 32-bit 14.318180 MHz counter
[ 0.800030] hpet: hpet2 irq 24 for MSI
[ 0.810015] Switched to high resolution mode on CPU 0
[ 0.811193] Switched to high resolution mode on CPU 2
[ 0.811196] Switched to high resolution mode on CPU 1
[ 0.811200] Switched to high resolution mode on CPU 3
[ 0.820035] pnp: PnP ACPI init
[ 0.820085] ACPI: bus type pnp registered
[ 0.822351] pnp 00:0b: mem resource (0x0-0x9ffff) overlaps 0000:00:00.0 BAR 3
(0x0-0x1fffffff), disabling
[ 0.822406] pnp 00:0b: mem resource (0xc0000-0xcffff) overlaps 0000:00:00.0
BAR 3 (0x0-0x1fffffff), disabling
[ 0.822461] pnp 00:0b: mem resource (0xe0000-0xfffff) overlaps 0000:00:00.0
BAR 3 (0x0-0x1fffffff), disabling
[ 0.822515] pnp 00:0b: mem resource (0x100000-0xc7efffff) overlaps
0000:00:00.0 BAR 3 (0x0-0x1fffffff), disabling
[ 0.822837] pnp: PnP ACPI: found 12 devices
[ 0.822871] ACPI: ACPI bus type pnp unregistered
[ 0.822911] system 00:06: iomem range 0xfec00000-0xfec00fff could not be
reserved
[ 0.822965] system 00:06: iomem range 0xfee00000-0xfee00fff has been
reserved
[ 0.823002] system 00:07: ioport range 0x4d0-0x4d1 has been reserved
[ 0.823037] system 00:07: ioport range 0x40b-0x40b has been reserved
[ 0.823072] system 00:07: ioport range 0x4d6-0x4d6 has been reserved
[ 0.823106] system 00:07: ioport range 0xc00-0xc01 has been reserved
[ 0.823141] system 00:07: ioport range 0xc14-0xc14 has been reserved
[ 0.823176] system 00:07: ioport range 0xc50-0xc51 has been reserved
[ 0.823211] system 00:07: ioport range 0xc52-0xc52 has been reserved
[ 0.823245] system 00:07: ioport range 0xc6c-0xc6c has been reserved
[ 0.823280] system 00:07: ioport range 0xc6f-0xc6f has been reserved
[ 0.823315] system 00:07: ioport range 0xcd0-0xcd1 has been reserved
[ 0.823359] system 00:07: ioport range 0xcd2-0xcd3 has been reserved
[ 0.823394] system 00:07: ioport range 0xcd4-0xcd5 has been reserved
[ 0.823429] system 00:07: ioport range 0xcd6-0xcd7 has been reserved
[ 0.823464] system 00:07: ioport range 0xcd8-0xcdf has been reserved
[ 0.823499] system 00:07: ioport range 0x800-0x89f has been reserved
[ 0.823533] system 00:07: ioport range 0xb00-0xb0f has been reserved
[ 0.823568] system 00:07: ioport range 0xb20-0xb3f has been reserved
[ 0.823603] system 00:07: ioport range 0x900-0x90f has been reserved
[ 0.823638] system 00:07: ioport range 0x910-0x91f has been reserved
[ 0.823673] system 00:07: ioport range 0xfe00-0xfefe has been reserved
[ 0.823708] system 00:07: iomem range 0xffb80000-0xffbfffff has been reserved
[ 0.823743] system 00:07: iomem range 0xfec10000-0xfec1001f has been
reserved
[ 0.823780] system 00:09: ioport range 0x290-0x29f has been reserved
[ 0.823816] system 00:0a: iomem range 0xe0000000-0xefffffff has been reserved
[ 0.823852] system 00:0b: iomem range 0xfec00000-0xffffffff could not be
reserved
[ 0.828757] pci 0000:00:02.0: PCI bridge, secondary bus 0000:02
[ 0.828792] pci 0000:00:02.0: IO window: 0xc000-0xcfff
[ 0.828828] pci 0000:00:02.0: MEM window: 0xfdf00000-0xfdffffff
[ 0.828863] pci 0000:00:02.0: PREFETCH window:
0x000000d0000000-0x000000dfffffff
[ 0.828918] pci 0000:00:09.0: PCI bridge, secondary bus 0000:03
[ 0.828953] pci 0000:00:09.0: IO window: 0xd000-0xdfff
[ 0.828988] pci 0000:00:09.0: MEM window: 0xfe000000-0xfebfffff
[ 0.829023] pci 0000:00:09.0: PREFETCH window:
0x000000fa000000-0x000000fcefffff
[ 0.829078] pci 0000:00:0a.0: PCI bridge, secondary bus 0000:01
[ 0.829112] pci 0000:00:0a.0: IO window: 0xb000-0xbfff
[ 0.829148] pci 0000:00:0a.0: MEM window: 0xfde00000-0xfdefffff
[ 0.829183] pci 0000:00:0a.0: PREFETCH window:
0x000000cff00000-0x000000cfffffff
[ 0.829237] pci 0000:00:14.4: PCI bridge, secondary bus 0000:05
[ 0.829273] pci 0000:00:14.4: IO window: 0xe000-0xefff
[ 0.829310] pci 0000:00:14.4: MEM window: disabled
[ 0.829346] pci 0000:00:14.4: PREFETCH window: 0xfcf00000-0xfcffffff
[ 0.829387] alloc irq_desc for 18 on node -1
[ 0.829388] alloc kstat_irqs on node -1
[ 0.829392] pci 0000:00:02.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[ 0.829428] pci 0000:00:02.0: setting latency timer to 64
[ 0.829432] alloc irq_desc for 17 on node -1
[ 0.829433] alloc kstat_irqs on node -1
[ 0.829435] pci 0000:00:09.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
[ 0.829471] pci 0000:00:09.0: setting latency timer to 64
[ 0.829474] pci 0000:00:0a.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[ 0.829509] pci 0000:00:0a.0: setting latency timer to 64
[ 0.829516] pci_bus 0000:00: resource 0 io: [0x00-0xffff]
[ 0.829517] pci_bus 0000:00: resource 1 mem: [0x000000-0xffffffffffffffff]
[ 0.829519] pci_bus 0000:02: resource 0 io: [0xc000-0xcfff]
[ 0.829521] pci_bus 0000:02: resource 1 mem: [0xfdf00000-0xfdffffff]
[ 0.829522] pci_bus 0000:02: resource 2 pref mem [0xd0000000-0xdfffffff]
[ 0.829523] pci_bus 0000:03: resource 0 io: [0xd000-0xdfff]
[ 0.829525] pci_bus 0000:03: resource 1 mem: [0xfe000000-0xfebfffff]
[ 0.829526] pci_bus 0000:03: resource 2 pref mem [0xfa000000-0xfcefffff]
[ 0.829527] pci_bus 0000:01: resource 0 io: [0xb000-0xbfff]
[ 0.829529] pci_bus 0000:01: resource 1 mem: [0xfde00000-0xfdefffff]
[ 0.829530] pci_bus 0000:01: resource 2 pref mem [0xcff00000-0xcfffffff]
[ 0.829531] pci_bus 0000:05: resource 0 io: [0xe000-0xefff]
[ 0.829533] pci_bus 0000:05: resource 2 pref mem [0xfcf00000-0xfcffffff]
[ 0.829534] pci_bus 0000:05: resource 3 io: [0x00-0xffff]
[ 0.829535] pci_bus 0000:05: resource 4 mem: [0x000000-0xffffffffffffffff]
[ 0.829547] NET: Registered protocol family 2
[ 0.829602] IP route cache hash table entries: 262144 (order: 9, 2097152
bytes)
[ 0.830084] TCP established hash table entries: 262144 (order: 10, 4194304
bytes)
[ 0.831067] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[ 0.831468] TCP: Hash tables configured (established 262144 bind 65536)
[ 0.831504] TCP reno registered
[ 0.831582] NET: Registered protocol family 1
[ 0.832803] Scanning for low memory corruption every 60 seconds
[ 0.833572] HugeTLB registered 2 MB page size, pre-allocated 0 pages
[ 0.833696] Loading Reiser4. See http://www.namesys.com for a description of
Reiser4.
[ 0.833781] msgmni has been set to 15987
[ 0.834082] alg: No test for stdrng (krng)
[ 0.834123] async_tx: api initialized (sync-only)
[ 0.834227] Block layer SCSI generic (bsg) driver version 0.4 loaded (major
253)
[ 0.834280] io scheduler noop registered
[ 0.834315] io scheduler cfq registered (default)
[ 0.834448] pci 0000:02:00.0: Boot video device
[ 0.834538] alloc irq_desc for 25 on node -1
[ 0.834540] alloc kstat_irqs on node -1
[ 0.834545] pcieport-driver 0000:00:02.0: irq 25 for MSI/MSI-X
[ 0.834550] pcieport-driver 0000:00:02.0: setting latency timer to 64
[ 0.834642] alloc irq_desc for 26 on node -1
[ 0.834643] alloc kstat_irqs on node -1
[ 0.834646] pcieport-driver 0000:00:09.0: irq 26 for MSI/MSI-X
[ 0.834650] pcieport-driver 0000:00:09.0: setting latency timer to 64
[ 0.834741] alloc irq_desc for 27 on node -1
[ 0.834742] alloc kstat_irqs on node -1
[ 0.834744] pcieport-driver 0000:00:0a.0: irq 27 for MSI/MSI-X
[ 0.834748] pcieport-driver 0000:00:0a.0: setting latency timer to 64
[ 0.834953] input: Power Button as
/devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
[ 0.835007] ACPI: Power Button [PWRF]
[ 0.835098] input: Power Button as
/devices/LNXSYSTM:00/device:00/PNP0C0C:00/input/input1
[ 0.835152] ACPI: Power Button [PWRB]
[ 0.835302] processor LNXCPU:00: registered as cooling_device0
[ 0.835337] ACPI: Processor [CPU0] (supports 8 throttling states)
[ 0.835434] processor LNXCPU:01: registered as cooling_device1
[ 0.835504] processor LNXCPU:02: registered as cooling_device2
[ 0.835577] processor LNXCPU:03: registered as cooling_device3
[ 0.839315] Linux agpgart interface v0.103
[ 0.839511] ahci 0000:00:11.0: version 3.0
[ 0.839521] alloc irq_desc for 22 on node -1
[ 0.839522] alloc kstat_irqs on node -1
[ 0.839525] ahci 0000:00:11.0: PCI INT A -> GSI 22 (level, low) -> IRQ 22
[ 0.839673] ahci 0000:00:11.0: AHCI 0001.0100 32 slots 6 ports 3 Gbps 0x3f
impl SATA mode
[ 0.839727] ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio
slum part
[ 0.840272] scsi0 : ahci
[ 0.840403] scsi1 : ahci
[ 0.840501] scsi2 : ahci
[ 0.840598] scsi3 : ahci
[ 0.840697] scsi4 : ahci
[ 0.840795] scsi5 : ahci
[ 0.840926] ata1: SATA max UDMA/133 irq_stat 0x00400000, PHY RDY changed
[ 0.840962] ata2: SATA max UDMA/133 abar m1024@0xfddff800 port 0xfddff980 irq
22
[ 0.841016] ata3: SATA max UDMA/133 abar m1024@0xfddff800 port 0xfddffa00 irq
22
[ 0.841070] ata4: SATA max UDMA/133 abar m1024@0xfddff800 port 0xfddffa80 irq
22
[ 0.841124] ata5: SATA max UDMA/133 abar m1024@0xfddff800 port 0xfddffb00 irq
22
[ 0.841178] ata6: SATA max UDMA/133 abar m1024@0xfddff800 port 0xfddffb80 irq
22
[ 0.841459] PNP: PS/2 Controller [PNP0303:PS2K] at 0x60,0x64 irq 1
[ 0.841493] PNP: PS/2 appears to have AUX port disabled, if this is
incorrect please boot with i8042.nopnp
[ 0.841943] serio: i8042 KBD port at 0x60,0x64 irq 1
[ 0.842074] mice: PS/2 mouse device common for all mice
[ 0.842243] rtc_cmos 00:02: RTC can wake from S4
[ 0.842313] rtc_cmos 00:02: rtc core: registered rtc_cmos as rtc0
[ 0.842369] rtc0: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
[ 0.842428] md: raid1 personality registered for level 1
[ 0.842462] md: raid6 personality registered for level 6
[ 0.842496] md: raid5 personality registered for level 5
[ 0.842530] md: raid4 personality registered for level 4
[ 0.843117] cpuidle: using governor ladder
[ 0.843151] cpuidle: using governor menu
[ 0.843689] usbcore: registered new interface driver hiddev
[ 0.843746] usbcore: registered new interface driver usbhid
[ 0.843780] usbhid: v2.6:USB HID core driver
[ 0.843843] Advanced Linux Sound Architecture Driver Version 1.0.20.
[ 0.843878] ALSA device list:
[ 0.843911] No soundcards found.
[ 0.843979] TCP cubic registered
[ 0.844019] NET: Registered protocol family 10
[ 0.844148] IPv6 over IPv4 tunneling driver
[ 0.844265] NET: Registered protocol family 17
[ 0.844315] powernow-k8: Found 1 AMD Phenom(tm) II X4 955 Processor
processors (4 cpu cores) (version 2.20.00)
[ 0.844391] powernow-k8: 0 : pstate 0 (3200 MHz)
[ 0.844425] powernow-k8: 1 : pstate 1 (2500 MHz)
[ 0.844459] powernow-k8: 2 : pstate 2 (2100 MHz)
[ 0.844492] powernow-k8: 3 : pstate 3 (800 MHz)
[ 0.844886] PM: Resume from disk failed.
[ 0.844966] Magic number: 9:648:116
[ 0.866018] input: AT Translated Set 2 keyboard as
/devices/platform/i8042/serio0/input/input2
[ 1.160036] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1.160097] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1.160150] ata6: SATA link down (SStatus 0 SControl 300)
[ 1.160205] ata5: SATA link down (SStatus 0 SControl 300)
[ 1.160259] ata3: SATA link down (SStatus 0 SControl 300)
[ 1.166391] ata4.00: ATA-7: SAMSUNG HD753LJ, 1AA01113, max UDMA7
[ 1.166432] ata4.00: 1465149168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[ 1.166480] ata2.00: ATA-7: SAMSUNG HD502IJ, 1AA01110, max UDMA7
[ 1.166514] ata2.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[ 1.172888] ata4.00: configured for UDMA/133
[ 1.172943] ata2.00: configured for UDMA/133
[ 1.560035] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1.566394] ata1.00: ATA-7: SAMSUNG HD502IJ, 1AA01109, max UDMA7
[ 1.566430] ata1.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[ 1.572855] ata1.00: configured for UDMA/133
[ 1.583424] scsi 0:0:0:0: Direct-Access ATA SAMSUNG HD502IJ 1AA0
PQ: 0 ANSI: 5
[ 1.583684] sd 0:0:0:0: [sda] 976773168 512-byte logical blocks: (500
GB/465 GiB)
[ 1.583756] sd 0:0:0:0: [sda] Write Protect is off
[ 1.583791] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 1.583800] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled,
doesn't support DPO or FUA
[ 1.583911] sda:
[ 1.583952] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 1.584094] scsi 1:0:0:0: Direct-Access ATA SAMSUNG HD502IJ 1AA0
PQ: 0 ANSI: 5
[ 1.584283] sd 1:0:0:0: [sdb] 976773168 512-byte logical blocks: (500
GB/465 GiB)
[ 1.584354] sd 1:0:0:0: [sdb] Write Protect is off
[ 1.584389] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[ 1.584398] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled,
doesn't support DPO or FUA
[ 1.584507] sdb:
[ 1.584577] sd 1:0:0:0: Attached scsi generic sg1 type 0
[ 1.584709] scsi 3:0:0:0: Direct-Access ATA SAMSUNG HD753LJ 1AA0
PQ: 0 ANSI: 5
[ 1.584865] sd 3:0:0:0: [sdc] 1465149168 512-byte logical blocks: (750
GB/698 GiB)
[ 1.584934] sd 3:0:0:0: [sdc] Write Protect is off
[ 1.584969] sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[ 1.584977] sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled,
doesn't support DPO or FUA
[ 1.585071] sdc:
[ 1.585123] sd 3:0:0:0: Attached scsi generic sg2 type 0
[ 1.588354] sdc1 sdc2 sdc3 sdc4 < sdb1 sdb2 sdb3 sdb4 < sda1 sda2 sda3
sda4 < sdb5 sdc5 sda5 sdc6 sdb6 >
[ 1.606850] sd 1:0:0:0: [sdb] Attached SCSI disk
[ 1.610449] sda6 >
[ 1.610814] sd 0:0:0:0: [sda] Attached SCSI disk
[ 1.617957] sdc7 >
[ 1.618341] sd 3:0:0:0: [sdc] Attached SCSI disk
[ 1.618382] md: Waiting for all devices to be available before autodetect
[ 1.618417] md: If you don't use raid, use raid=noautodetect
[ 1.618533] md: Autodetecting RAID arrays.
[ 1.724723] md: Scanned 12 and added 12 devices.
[ 1.724758] md: autorun ...
[ 1.724792] md: considering sdc6 ...
[ 1.724827] md: adding sdc6 ...
[ 1.724862] md: sdc5 has different UUID to sdc6
[ 1.724897] md: sdc3 has different UUID to sdc6
[ 1.724931] md: sdc1 has different UUID to sdc6
[ 1.724967] md: adding sda6 ...
[ 1.725001] md: sda5 has different UUID to sdc6
[ 1.725036] md: sda3 has different UUID to sdc6
[ 1.725070] md: sda1 has different UUID to sdc6
[ 1.725106] md: adding sdb6 ...
[ 1.725140] md: sdb5 has different UUID to sdc6
[ 1.725175] md: sdb3 has different UUID to sdc6
[ 1.725209] md: sdb1 has different UUID to sdc6
[ 1.725349] md: created md3
[ 1.725382] md: bind<sdb6>
[ 1.725419] md: bind<sda6>
[ 1.725456] md: bind<sdc6>
[ 1.725492] md: running: <sdc6><sda6><sdb6>
[ 1.725626] raid5: device sdc6 operational as raid disk 2
[ 1.725661] raid5: device sda6 operational as raid disk 0
[ 1.725695] raid5: device sdb6 operational as raid disk 1
[ 1.725846] raid5: allocated 3220kB for md3
[ 1.725910] raid5: raid level 5 set md3 active with 3 out of 3 devices,
algorithm 2
[ 1.725963] RAID5 conf printout:
[ 1.725996] --- rd:3 wd:3
[ 1.726029] disk 0, o:1, dev:sda6
[ 1.726062] disk 1, o:1, dev:sdb6
[ 1.726095] disk 2, o:1, dev:sdc6
[ 1.726142] md3: detected capacity change from 0 to 864065421312
[ 1.726213] md: considering sdc5 ...
[ 1.726249] md: adding sdc5 ...
[ 1.726283] md: sdc3 has different UUID to sdc5
[ 1.726318] md: sdc1 has different UUID to sdc5
[ 1.726353] md: adding sda5 ...
[ 1.726388] md: sda3 has different UUID to sdc5
[ 1.726422] md: sda1 has different UUID to sdc5
[ 1.726458] md: adding sdb5 ...
[ 1.726492] md: sdb3 has different UUID to sdc5
[ 1.726526] md: sdb1 has different UUID to sdc5
[ 1.726630] md: created md2
[ 1.726663] md: bind<sdb5>
[ 1.726700] md: bind<sda5>
[ 1.726738] md: bind<sdc5>
[ 1.726774] md: running: <sdc5><sda5><sdb5>
[ 1.726901] raid5: device sdc5 operational as raid disk 2
[ 1.726935] raid5: device sda5 operational as raid disk 0
[ 1.726969] raid5: device sdb5 operational as raid disk 1
[ 1.727126] raid5: allocated 3220kB for md2
[ 1.727190] raid5: raid level 5 set md2 active with 3 out of 3 devices,
algorithm 2
[ 1.727243] RAID5 conf printout:
[ 1.727276] --- rd:3 wd:3
[ 1.727309] disk 0, o:1, dev:sda5
[ 1.727342] disk 1, o:1, dev:sdb5
[ 1.727376] disk 2, o:1, dev:sdc5
[ 1.727420] md2: detected capacity change from 0 to 40007499776
[ 1.727490] md: considering sdc3 ...
[ 1.727526] md: adding sdc3 ...
[ 1.727560] md: sdc1 has different UUID to sdc3
[ 1.727595] md: adding sda3 ...
[ 1.727629] md: sda1 has different UUID to sdc3
[ 1.727664] md: adding sdb3 ...
[ 1.727698] md: sdb1 has different UUID to sdc3
[ 1.727799] md: created md1
[ 1.727832] md: bind<sdb3>
[ 1.727869] md: bind<sda3>
[ 1.727905] md: bind<sdc3>
[ 1.727945] md: running: <sdc3><sda3><sdb3>
[ 1.728090] raid5: device sdc3 operational as raid disk 2
[ 1.728125] raid5: device sda3 operational as raid disk 0
[ 1.728159] raid5: device sdb3 operational as raid disk 1
[ 1.728320] raid5: allocated 3220kB for md1
[ 1.728370] raid5: raid level 5 set md1 active with 3 out of 3 devices,
algorithm 2
[ 1.728423] RAID5 conf printout:
[ 1.728455] --- rd:3 wd:3
[ 1.728488] disk 0, o:1, dev:sda3
[ 1.728522] disk 1, o:1, dev:sdb3
[ 1.728555] disk 2, o:1, dev:sdc3
[ 1.728604] md1: detected capacity change from 0 to 79998877696
[ 1.728674] md: considering sdc1 ...
[ 1.728710] md: adding sdc1 ...
[ 1.728745] md: adding sda1 ...
[ 1.728779] md: adding sdb1 ...
[ 1.728813] md: created md0
[ 1.728846] md: bind<sdb1>
[ 1.728882] md: bind<sda1>
[ 1.728919] md: bind<sdc1>
[ 1.728955] md: running: <sdc1><sda1><sdb1>
[ 1.729133] raid1: raid set md0 active with 3 out of 3 mirrors
[ 1.729176] md0: detected capacity change from 0 to 65667072
[ 1.729232] md: ... autorun DONE.
[ 1.729284] md: Loading md3: /dev/sda3
[ 1.729322] md3: unknown partition table
[ 1.729481] md: couldn't update array info. -22
[ 1.729518] md: could not bd_claim sda3.
[ 1.729552] md: md_import_device returned -16
[ 1.729588] md: could not bd_claim sdb3.
[ 1.729621] md: md_import_device returned -16
[ 1.729657] md: could not bd_claim sdc3.
[ 1.729690] md: md_import_device returned -16
[ 1.729725] md: starting md3 failed
[ 1.729800] md1: unknown partition table
[ 1.767199] reiser4: md1: found disk format 4.0.0.
[ 5.790318] VFS: Mounted root (reiser4 filesystem) readonly on device 9:1.
[ 5.790370] Freeing unused kernel memory: 376k freed
[ 9.037775] udev: starting version 145
[ 9.217043] md2:
[ 9.217072] md0: unknown partition table
[ 9.282015] unknown partition table
[ 10.420576] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[ 10.420591] r8169 0000:01:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[ 10.420840] r8169 0000:01:00.0: setting latency timer to 64
[ 10.420871] alloc irq_desc for 28 on node -1
[ 10.420872] alloc kstat_irqs on node -1
[ 10.420882] r8169 0000:01:00.0: irq 28 for MSI/MSI-X
[ 10.420988] eth0: RTL8168c/8111c at 0xffffc90012f18000, 00:19:66:86:ce:12,
XID 3c4000c0 IRQ 28
[ 10.440462] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[ 10.440465] ehci_hcd: block sizes: qh 192 qtd 96 itd 192 sitd 96
[ 10.440491] ehci_hcd 0000:00:12.2: PCI INT B -> GSI 17 (level, low) -> IRQ
17
[ 10.440518] ehci_hcd 0000:00:12.2: EHCI Host Controller
[ 10.440532] drivers/usb/core/inode.c: creating file 'devices'
[ 10.440534] drivers/usb/core/inode.c: creating file '001'
[ 10.440566] ehci_hcd 0000:00:12.2: new USB bus registered, assigned bus
number 1
[ 10.440572] ehci_hcd 0000:00:12.2: reset hcs_params 0x102306 dbg=1 cc=2
pcc=3 ordered !ppc ports=6
[ 10.440575] ehci_hcd 0000:00:12.2: reset hcc_params a072 thresh 7 uframes
256/512/1024
[ 10.440596] ehci_hcd 0000:00:12.2: applying AMD SB600/SB700 USB freeze
workaround
[ 10.440602] ehci_hcd 0000:00:12.2: reset command 080002 (park)=0 ithresh=8
period=1024 Reset HALT
[ 10.440615] ehci_hcd 0000:00:12.2: debug port 1
[ 10.440619] ehci_hcd 0000:00:12.2: MWI active
[ 10.440620] ehci_hcd 0000:00:12.2: supports USB remote wakeup
[ 10.440633] ehci_hcd 0000:00:12.2: irq 17, io mem 0xfddff000
[ 10.440637] ehci_hcd 0000:00:12.2: reset command 080002 (park)=0 ithresh=8
period=1024 Reset HALT
[ 10.440642] ehci_hcd 0000:00:12.2: init command 010009 (park)=0 ithresh=1
period=256 RUN
[ 10.447931] ehci_hcd 0000:00:12.2: USB 2.0 started, EHCI 1.00
[ 10.447961] usb usb1: default language 0x0409
[ 10.447965] usb usb1: udev 1, busnum 1, minor = 0
[ 10.447967] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[ 10.447968] usb usb1: New USB device strings: Mfr=3, Product=2,
SerialNumber=1
[ 10.447969] usb usb1: Product: EHCI Host Controller
[ 10.447970] usb usb1: Manufacturer: Linux 2.6.31r4 ehci_hcd
[ 10.447971] usb usb1: SerialNumber: 0000:00:12.2
[ 10.447998] usb usb1: uevent
[ 10.448007] usb usb1: usb_probe_device
[ 10.448009] usb usb1: configuration #1 chosen from 1 choice
[ 10.448014] usb usb1: adding 1-0:1.0 (config #1, interface 0)
[ 10.448021] usb 1-0:1.0: uevent
[ 10.448028] hub 1-0:1.0: usb_probe_interface
[ 10.448029] hub 1-0:1.0: usb_probe_interface - got id
[ 10.448031] hub 1-0:1.0: USB hub found
[ 10.448035] hub 1-0:1.0: 6 ports detected
[ 10.448036] hub 1-0:1.0: standalone hub
[ 10.448037] hub 1-0:1.0: no power switching (usb 1.0)
[ 10.448038] hub 1-0:1.0: individual port over-current protection
[ 10.448039] hub 1-0:1.0: power on to power good time: 20ms
[ 10.448042] hub 1-0:1.0: local power source is good
[ 10.448043] hub 1-0:1.0: trying to enable port power on non-switchable hub
[ 10.448067] drivers/usb/core/inode.c: creating file '001'
[ 10.448085] alloc irq_desc for 19 on node -1
[ 10.448087] alloc kstat_irqs on node -1
[ 10.448091] ehci_hcd 0000:00:13.2: PCI INT B -> GSI 19 (level, low) -> IRQ
19
[ 10.448101] ehci_hcd 0000:00:13.2: EHCI Host Controller
[ 10.448105] drivers/usb/core/inode.c: creating file '002'
[ 10.448122] ehci_hcd 0000:00:13.2: new USB bus registered, assigned bus
number 2
[ 10.448127] ehci_hcd 0000:00:13.2: reset hcs_params 0x102306 dbg=1 cc=2
pcc=3 ordered !ppc ports=6
[ 10.448130] ehci_hcd 0000:00:13.2: reset hcc_params a072 thresh 7 uframes
256/512/1024
[ 10.448143] ehci_hcd 0000:00:13.2: applying AMD SB600/SB700 USB freeze
workaround
[ 10.448148] ehci_hcd 0000:00:13.2: reset command 080002 (park)=0 ithresh=8
period=1024 Reset HALT
[ 10.448161] ehci_hcd 0000:00:13.2: debug port 1
[ 10.448164] ehci_hcd 0000:00:13.2: MWI active
[ 10.448165] ehci_hcd 0000:00:13.2: supports USB remote wakeup
[ 10.448173] ehci_hcd 0000:00:13.2: irq 19, io mem 0xfddf6800
[ 10.448176] ehci_hcd 0000:00:13.2: reset command 080002 (park)=0 ithresh=8
period=1024 Reset HALT
[ 10.448181] ehci_hcd 0000:00:13.2: init command 010009 (park)=0 ithresh=1
period=256 RUN
[ 10.457930] ehci_hcd 0000:00:13.2: USB 2.0 started, EHCI 1.00
[ 10.457945] usb usb2: default language 0x0409
[ 10.457949] usb usb2: udev 1, busnum 2, minor = 128
[ 10.457950] usb usb2: New USB device found, idVendor=1d6b, idProduct=0002
[ 10.457952] usb usb2: New USB device strings: Mfr=3, Product=2,
SerialNumber=1
[ 10.457953] usb usb2: Product: EHCI Host Controller
[ 10.457954] usb usb2: Manufacturer: Linux 2.6.31r4 ehci_hcd
[ 10.457955] usb usb2: SerialNumber: 0000:00:13.2
[ 10.457977] usb usb2: uevent
[ 10.457984] usb usb2: usb_probe_device
[ 10.457986] usb usb2: configuration #1 chosen from 1 choice
[ 10.457989] usb usb2: adding 2-0:1.0 (config #1, interface 0)
[ 10.457997] usb 2-0:1.0: uevent
[ 10.458003] hub 2-0:1.0: usb_probe_interface
[ 10.458004] hub 2-0:1.0: usb_probe_interface - got id
[ 10.458005] hub 2-0:1.0: USB hub found
[ 10.458009] hub 2-0:1.0: 6 ports detected
[ 10.458010] hub 2-0:1.0: standalone hub
[ 10.458010] hub 2-0:1.0: no power switching (usb 1.0)
[ 10.458011] hub 2-0:1.0: individual port over-current protection
[ 10.458013] hub 2-0:1.0: power on to power good time: 20ms
[ 10.458015] hub 2-0:1.0: local power source is good
[ 10.458016] hub 2-0:1.0: trying to enable port power on non-switchable hub
[ 10.458038] drivers/usb/core/inode.c: creating file '001'
[ 10.474718] Linux video capture interface: v2.00
[ 10.489875] bttv: driver version 0.9.18 loaded
[ 10.489877] bttv: using 8 buffers with 2080k (520 pages) each for capture
[ 10.489914] bttv: Bt8xx card found (0).
[ 10.489923] alloc irq_desc for 21 on node -1
[ 10.489924] alloc kstat_irqs on node -1
[ 10.489928] bttv 0000:05:06.0: PCI INT A -> GSI 21 (level, low) -> IRQ 21
[ 10.489938] bttv0: Bt848 (rev 18) at 0000:05:06.0, irq: 21, latency: 128,
mmio: 0xfcfff000
[ 10.489971] bttv0: using: Terratec TerraTV+ Version 1.0 (Bt848)/ Terra
TValue Version 1.0/ Vobis TV-Boostar [card=25,insmod option]
[ 10.489974] IRQ 21/bttv0: IRQF_DISABLED is not guaranteed on shared IRQs
[ 10.490007] bttv0: gpio: en=00000000, out=00000000 in=00ffffff [init]
[ 10.547936] ehci_hcd 0000:00:12.2: GetStatus port 2 status 001803 POWER
sig=j CSC CONNECT
[ 10.547939] hub 1-0:1.0: port 2: status 0501 change 0001
[ 10.557954] hub 2-0:1.0: state 7 ports 6 chg 0000 evt 0000
[ 10.647931] hub 1-0:1.0: state 7 ports 6 chg 0004 evt 0000
[ 10.647940] hub 1-0:1.0: port 2, status 0501, change 0000, 480 Mb/s
[ 10.700045] ehci_hcd 0000:00:12.2: port 2 high speed
[ 10.700049] ehci_hcd 0000:00:12.2: GetStatus port 2 status 001005 POWER
sig=se0 PE CONNECT
[ 10.757104] usb 1-2: new high speed USB device using ehci_hcd and address 2
[ 10.810036] ehci_hcd 0000:00:12.2: port 2 high speed
[ 10.810039] ehci_hcd 0000:00:12.2: GetStatus port 2 status 001005 POWER
sig=se0 PE CONNECT
[ 10.881474] usb 1-2: default language 0x0409
[ 10.881724] usb 1-2: udev 2, busnum 1, minor = 1
[ 10.881725] usb 1-2: New USB device found, idVendor=05e3, idProduct=0608
[ 10.881726] usb 1-2: New USB device strings: Mfr=0, Product=1,
SerialNumber=0
[ 10.881728] usb 1-2: Product: USB2.0 Hub
[ 10.881764] usb 1-2: uevent
[ 10.881773] usb 1-2: usb_probe_device
[ 10.881775] usb 1-2: configuration #1 chosen from 1 choice
[ 10.882188] usb 1-2: adding 1-2:1.0 (config #1, interface 0)
[ 10.882199] usb 1-2:1.0: uevent
[ 10.882206] hub 1-2:1.0: usb_probe_interface
[ 10.882207] hub 1-2:1.0: usb_probe_interface - got id
[ 10.882209] hub 1-2:1.0: USB hub found
[ 10.882473] hub 1-2:1.0: 4 ports detected
[ 10.882474] hub 1-2:1.0: standalone hub
[ 10.882476] hub 1-2:1.0: individual port power switching
[ 10.882477] hub 1-2:1.0: individual port over-current protection
[ 10.882478] hub 1-2:1.0: Single TT
[ 10.882479] hub 1-2:1.0: TT requires at most 32 FS bit times (2664 ns)
[ 10.882480] hub 1-2:1.0: Port indicators are supported
[ 10.882481] hub 1-2:1.0: power on to power good time: 100ms
[ 10.882848] hub 1-2:1.0: local power source is good
[ 10.882849] hub 1-2:1.0: enabling power on all ports
[ 10.883860] drivers/usb/core/inode.c: creating file '002'
[ 10.983961] hub 1-2:1.0: port 2: status 0301 change 0001
[ 11.083356] usb 1-2: link qh256-0001/ffff8800c7800180 start 1 [1/0 us]
[ 11.083364] hub 1-2:1.0: state 7 ports 4 chg 0004 evt 0000
[ 11.083700] hub 1-2:1.0: port 2, status 0301, change 0000, 1.5 Mb/s
[ 11.152198] usb 1-2.2: new low speed USB device using ehci_hcd and address
3
[ 11.241187] usb 1-2.2: skipped 1 descriptor after interface
[ 11.241189] usb 1-2.2: skipped 1 descriptor after interface
[ 11.241685] usb 1-2.2: default language 0x0409
[ 11.243941] usb 1-2.2: udev 3, busnum 1, minor = 2
[ 11.243943] usb 1-2.2: New USB device found, idVendor=046d, idProduct=c518
[ 11.243944] usb 1-2.2: New USB device strings: Mfr=1, Product=2,
SerialNumber=0
[ 11.243945] usb 1-2.2: Product: USB Receiver
[ 11.243947] usb 1-2.2: Manufacturer: Logitech
[ 11.243972] usb 1-2.2: uevent
[ 11.243980] usb 1-2.2: usb_probe_device
[ 11.243981] usb 1-2.2: configuration #1 chosen from 1 choice
[ 11.251309] usb 1-2.2: adding 1-2.2:1.0 (config #1, interface 0)
[ 11.251324] usb 1-2.2:1.0: uevent
[ 11.251334] usbhid 1-2.2:1.0: usb_probe_interface
[ 11.251335] usbhid 1-2.2:1.0: usb_probe_interface - got id
[ 11.251796] usb 1-2: clear tt buffer port 2, a3 ep0 t80008d42
[ 11.254692] input: Logitech USB Receiver as
/devices/pci0000:00/0000:00:12.2/usb1/1-2/1-2.2/1-2.2:1.0/input/input3
[ 11.254732] generic-usb 0003:046D:C518.0001: input,hidraw0: USB HID v1.11
Mouse [Logitech USB Receiver] on usb-0000:00:12.2-2.2/input0
[ 11.254740] usb 1-2.2: adding 1-2.2:1.1 (config #1, interface 1)
[ 11.254749] usb 1-2.2:1.1: uevent
[ 11.254755] usbhid 1-2.2:1.1: usb_probe_interface
[ 11.254757] usbhid 1-2.2:1.1: usb_probe_interface - got id
[ 11.255046] usb 1-2: clear tt buffer port 2, a3 ep0 t80008d42
[ 11.260359] input: Logitech USB Receiver as
/devices/pci0000:00/0000:00:12.2/usb1/1-2/1-2.2/1-2.2:1.1/input/input4
[ 11.260368] usb 1-2.2: link qh8-0601/ffff8800c7800300 start 2 [1/2 us]
[ 11.260384] drivers/usb/core/file.c: looking for a minor, starting at 96
[ 11.260414] generic-usb 0003:046D:C518.0002: input,hiddev96,hidraw1: USB
HID v1.11 Device [Logitech USB Receiver] on usb-0000:00:12.2-2.2/input1
[ 11.260427] drivers/usb/core/inode.c: creating file '003'
[ 11.260438] hub 1-2:1.0: state 7 ports 4 chg 0000 evt 0004
[ 11.280770] usb 1-2.2:1.0: uevent
[ 11.280827] usb 1-2.2: uevent
[ 11.280844] usb 1-2.2:1.0: uevent
[ 11.280903] usb 1-2.2: uevent
[ 11.281405] usb 1-2.2:1.1: uevent
[ 11.281467] usb 1-2.2: uevent
[ 11.281517] usb 1-2.2:1.0: uevent
[ 11.281529] usb 1-2.2:1.0: uevent
[ 11.282164] usb 1-2.2:1.1: uevent
[ 11.493831] bttv0: tea5757: read timeout
[ 11.493832] bttv0: tuner type=5
[ 11.505445] bttv0: audio absent, no audio device found!
[ 11.525030] TUNER: Unable to find symbol tea5767_autodetection()
[ 11.525033] tuner 0-0060: chip found @ 0xc0 (bt848 #0 [sw])
[ 11.527843] tuner-simple 0-0060: creating new instance
[ 11.527846] tuner-simple 0-0060: type set to 5 (Philips PAL_BG (FI1216 and
compatibles))
[ 11.528664] bttv0: registered device video0
[ 11.528695] bttv0: registered device vbi0
[ 11.814508] reiser4: md2: found disk format 4.0.0.
[ 13.337101] hub 2-0:1.0: hub_suspend
[ 13.337107] usb usb2: bus auto-suspend
[ 13.337109] ehci_hcd 0000:00:13.2: suspend root hub
[ 13.591267] reiser4: md3: found disk format 4.0.0.
[ 55.020583] reiser4: sdc7: found disk format 4.0.0.
[ 67.386552] Adding 7815612k swap on /dev/sda2. Priority:1 extents:1
across:7815612k
[ 67.408915] Adding 7815612k swap on /dev/sdb2. Priority:1 extents:1
across:7815612k
[ 67.501498] Adding 7815612k swap on /dev/sdc2. Priority:1 extents:1
across:7815612k
[ 68.205087] w83627ehf: Found W83627EHG chip at 0x290
[ 68.421646] r8169: eth0: link up
[ 68.421650] r8169: eth0: link up
[ 70.281241] usb usb1: uevent
[ 70.281269] usb 1-0:1.0: uevent
[ 70.281293] usb 1-2: uevent
[ 70.281320] usb 1-2.2: uevent
[ 70.281346] usb 1-2.2:1.0: uevent
[ 70.281533] usb 1-2.2:1.1: uevent
[ 70.281706] usb 1-2:1.0: uevent
[ 70.281804] usb usb2: uevent
[ 70.281830] usb 2-0:1.0: uevent
[ 71.272886] alloc irq_desc for 23 on node -1
[ 71.272889] alloc kstat_irqs on node -1
[ 71.272895] EMU10K1_Audigy 0000:05:08.0: PCI INT A -> GSI 23 (level, low) -
> IRQ 23
[ 71.278891] Audigy2 value: Special config.
[ 72.473408] fglrx: module license 'Proprietary. (C) 2002 - ATI
Technologies, Starnberg, GERMANY' taints kernel.
[ 72.473417] Disabling lock debugging due to kernel taint
[ 72.490481] [fglrx] Maximum main memory to use for locked dma buffers: 7760
MBytes.
[ 72.490559] [fglrx] vendor: 1002 device: 9501 count: 1
[ 72.490739] [fglrx] ioport: bar 4, base 0xc000, size: 0x100
[ 72.490750] pci 0000:02:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[ 72.490754] pci 0000:02:00.0: setting latency timer to 64
[ 72.490884] [fglrx] Kernel PAT support is enabled
[ 72.490901] [fglrx] module loaded - fglrx 8.66.2 [Sep 1 2009] with 1
minors
[ 72.668413] alloc irq_desc for 29 on node -1
[ 72.668416] alloc kstat_irqs on node -1
[ 72.668424] fglrx_pci 0000:02:00.0: irq 29 for MSI/MSI-X
[ 72.668759] [fglrx] Firegl kernel thread PID: 3800
[ 74.787429] [fglrx] Gart USWC size:1279 M.
[ 74.787431] [fglrx] Gart cacheable size:508 M.
[ 74.787435] [fglrx] Reserved FB block: Shared offset:0, size:1000000
[ 74.787437] [fglrx] Reserved FB block: Unshared offset:fbff000, size:401000
[ 74.787438] [fglrx] Reserved FB block: Unshared offset:1fffc000, size:4000
[ 75.947153] usb 1-2.2: link qh8-0601/ffff8800c78003c0 start 3 [1/2 us]
[ 616.849440] reiser4[ktxnmgrd:md1:ru(581)]: disable_write_barrier
(fs/reiser4/wander.c:235)[zam-1055]:
[ 616.849445] NOTICE: md1 does not support write barriers, using synchronous
write instead.
[ 671.813536] reiser4[ktxnmgrd:md2:ru(2774)]: disable_write_barrier
(fs/reiser4/wander.c:235)[zam-1055]:
[ 671.813541] NOTICE: md2 does not support write barriers, using synchronous
write instead.
[ 703.842289] reiser4[ktxnmgrd:md3:ru(2776)]: disable_write_barrier
(fs/reiser4/wander.c:235)[zam-1055]:
[ 703.842293] NOTICE: md3 does not support write barriers, using synchronous
write instead.

PS: I have to disable C1E in the BIOS, or video is very jerky - unless
something CPU-heavy runs on at least one core...




Attachments:
.config (53.96 kB)

2009-09-12 07:37:47

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

(Volker stripped all CCs from his posts; I restored them manually.)

On 09/11/2009 09:33 PM, Volker Armin Hemmann wrote:
> Hi,
>
> this is with 2.6.31+reiser4+fglrx
> Phenom II X4 955
>
> KDE 4.3.1, compositing temporarily disabled.
> tvtime running.
>
> load:
> fat emerge with make -j5 running in one konsole tab (xulrunner being
> compiled).
>
> without NO_NEW_FAIR_SLEEPERS:
>
> tvtime is smooth most of the time
>
> with NO_NEW_FAIR_SLEEPERS:
>
> tvtime is more jerky. Very visible in scenes with movement.

Is the make -j5 running at nice 0? If yes, that would actually be the
correct behavior. Unfortunately, I can't test tvtime specifically (I
don't have a TV card), but other applications displaying video continue
to work smoothly on my dual core machine (Core 2 Duo E6600) even if I
do "nice -n 19 make -j20". If I don't nice it, the video is skippy
here too, though.
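
For reference, here is a rough sketch of the kind of test I mean - run
only one of the two build variants at a time; the kernel tree path, the
-j values and the clip name are just placeholders, adjust for your box:

  # variant A: un-niced background build (nice 0) - video gets skippy here
  cd ~/src/linux-2.6.31
  make -j20 > build.log 2>&1 &

  # variant B: the same build niced to 19 - video stays smooth for me
  nice -n 19 make -j20 > build.log 2>&1 &

  # in another terminal, run something latency-sensitive and watch for
  # dropped or late frames, e.g.:
  mplayer some-clip.mkv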

Question to Ingo:
Would posting perf results help in any way with finding differences
between mainline NEW_FAIR_SLEEPERS/NO_NEW_FAIR_SLEEPERS and BFS?
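
In case raw numbers would help, this is roughly what I could collect
(assuming CONFIG_SCHED_DEBUG so the sched_features knob is available -
that switch only exists on mainline, BFS has no equivalent - and a perf
binary built from the same tree; the 10 second window is arbitrary):

  # check and flip the fair-sleepers feature at runtime (mainline CFS)
  mount -t debugfs none /sys/kernel/debug 2>/dev/null
  cat /sys/kernel/debug/sched_features
  echo NO_NEW_FAIR_SLEEPERS > /sys/kernel/debug/sched_features  # disable
  echo NEW_FAIR_SLEEPERS > /sys/kernel/debug/sched_features     # restore

  # system-wide counters and a profile while the load + video are running
  perf stat -a sleep 10
  perf record -a sleep 10
  perf report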


> without background load:
>
> both settings act the same. tvtime is smooth, video is smooth, games are nice.
> No real difference.
>
> config is attached.
>
> Glück Auf,
> Volker
>
> dmesg:
>
> [ 0.000000] Linux version 2.6.31r4 (root@energy) (gcc version 4.4.1 (Gentoo
> 4.4.1 p1.0) ) #1 SMP Thu Sep 10 10:48:07 CEST 2009
> [ 0.000000] Command line: root=/dev/md1 md=3,/dev/sda3,/dev/sdb3,/dev/sdc3
> nmi_watchdog=0 mtrr_spare_reg_nr=1
> [ 0.000000] KERNEL supported cpus:
> [ 0.000000] AMD AuthenticAMD
> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
> [ 0.000000] BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
> [ 0.000000] BIOS-e820: 00000000000e6000 - 0000000000100000 (reserved)
> [ 0.000000] BIOS-e820: 0000000000100000 - 00000000c7eb0000 (usable)
> [ 0.000000] BIOS-e820: 00000000c7eb0000 - 00000000c7ec0000 (ACPI data)
> [ 0.000000] BIOS-e820: 00000000c7ec0000 - 00000000c7ef0000 (ACPI NVS)
> [ 0.000000] BIOS-e820: 00000000c7ef0000 - 00000000c7f00000 (reserved)
> [ 0.000000] BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
> [ 0.000000] BIOS-e820: 0000000100000000 - 0000000238000000 (usable)
> [ 0.000000] DMI present.
> [ 0.000000] AMI BIOS detected: BIOS may corrupt low RAM, working around it.
> [ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable)
> ==> (reserved)
> [ 0.000000] last_pfn = 0x238000 max_arch_pfn = 0x400000000
> [ 0.000000] MTRR default type: uncachable
> [ 0.000000] MTRR fixed ranges enabled:
> [ 0.000000] 00000-9FFFF write-back
> [ 0.000000] A0000-EFFFF uncachable
> [ 0.000000] F0000-FFFFF write-protect
> [ 0.000000] MTRR variable ranges enabled:
> [ 0.000000] 0 base 000000000000 mask FFFF80000000 write-back
> [ 0.000000] 1 base 000080000000 mask FFFFC0000000 write-back
> [ 0.000000] 2 base 0000C0000000 mask FFFFF8000000 write-back
> [ 0.000000] 3 disabled
> [ 0.000000] 4 disabled
> [ 0.000000] 5 disabled
> [ 0.000000] 6 disabled
> [ 0.000000] 7 disabled
> [ 0.000000] TOM2: 0000000238000000 aka 9088M
> [ 0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new
> 0x7010600070106
> [ 0.000000] e820 update range: 00000000c8000000 - 0000000100000000 (usable)
> ==> (reserved)
> [ 0.000000] last_pfn = 0xc7eb0 max_arch_pfn = 0x400000000
> [ 0.000000] Scanning 0 areas for low memory corruption
> [ 0.000000] modified physical RAM map:
> [ 0.000000] modified: 0000000000000000 - 0000000000010000 (reserved)
> [ 0.000000] modified: 0000000000010000 - 000000000009fc00 (usable)
> [ 0.000000] modified: 000000000009fc00 - 00000000000a0000 (reserved)
> [ 0.000000] modified: 00000000000e6000 - 0000000000100000 (reserved)
> [ 0.000000] modified: 0000000000100000 - 00000000c7eb0000 (usable)
> [ 0.000000] modified: 00000000c7eb0000 - 00000000c7ec0000 (ACPI data)
> [ 0.000000] modified: 00000000c7ec0000 - 00000000c7ef0000 (ACPI NVS)
> [ 0.000000] modified: 00000000c7ef0000 - 00000000c7f00000 (reserved)
> [ 0.000000] modified: 00000000fff00000 - 0000000100000000 (reserved)
> [ 0.000000] modified: 0000000100000000 - 0000000238000000 (usable)
> [ 0.000000] initial memory mapped : 0 - 20000000
> [ 0.000000] Using GB pages for direct mapping
> [ 0.000000] init_memory_mapping: 0000000000000000-00000000c7eb0000
> [ 0.000000] 0000000000 - 00c0000000 page 1G
> [ 0.000000] 00c0000000 - 00c7e00000 page 2M
> [ 0.000000] 00c7e00000 - 00c7eb0000 page 4k
> [ 0.000000] kernel direct mapping tables up to c7eb0000 @ 10000-13000
> [ 0.000000] init_memory_mapping: 0000000100000000-0000000238000000
> [ 0.000000] 0100000000 - 0200000000 page 1G
> [ 0.000000] 0200000000 - 0238000000 page 2M
> [ 0.000000] kernel direct mapping tables up to 238000000 @ 12000-14000
> [ 0.000000] ACPI: RSDP 00000000000fa7c0 00014 (v00 ACPIAM)
> [ 0.000000] ACPI: RSDT 00000000c7eb0000 00040 (v01 050609 RSDT2000 20090506
> MSFT 00000097)
> [ 0.000000] ACPI: FACP 00000000c7eb0200 00084 (v02 A M I OEMFACP 12000601
> MSFT 00000097)
> [ 0.000000] ACPI: DSDT 00000000c7eb0440 08512 (v01 AS140 AS140121 00000121
> INTL 20051117)
> [ 0.000000] ACPI: FACS 00000000c7ec0000 00040
> [ 0.000000] ACPI: APIC 00000000c7eb0390 0006C (v01 050609 APIC2000 20090506
> MSFT 00000097)
> [ 0.000000] ACPI: MCFG 00000000c7eb0400 0003C (v01 050609 OEMMCFG 20090506
> MSFT 00000097)
> [ 0.000000] ACPI: OEMB 00000000c7ec0040 00071 (v01 050609 OEMB2000 20090506
> MSFT 00000097)
> [ 0.000000] ACPI: AAFT 00000000c7eb8960 00027 (v01 050609 OEMAAFT 20090506
> MSFT 00000097)
> [ 0.000000] ACPI: HPET 00000000c7eb8990 00038 (v01 050609 OEMHPET 20090506
> MSFT 00000097)
> [ 0.000000] ACPI: SSDT 00000000c7eb89d0 0088C (v01 A M I POWERNOW 00000001
> AMD 00000001)
> [ 0.000000] ACPI: Local APIC address 0xfee00000
> [ 0.000000] (7 early reservations) ==> bootmem [0000000000 - 0238000000]
> [ 0.000000] #0 [0000000000 - 0000001000] BIOS data page ==> [0000000000
> - 0000001000]
> [ 0.000000] #1 [0000006000 - 0000008000] TRAMPOLINE ==> [0000006000
> - 0000008000]
> [ 0.000000] #2 [0001000000 - 00015fb8c0] TEXT DATA BSS ==> [0001000000
> - 00015fb8c0]
> [ 0.000000] #3 [000009fc00 - 0000100000] BIOS reserved ==> [000009fc00
> - 0000100000]
> [ 0.000000] #4 [00015fc000 - 00015fc133] BRK ==> [00015fc000
> - 00015fc133]
> [ 0.000000] #5 [0000010000 - 0000012000] PGTABLE ==> [0000010000
> - 0000012000]
> [ 0.000000] #6 [0000012000 - 0000013000] PGTABLE ==> [0000012000
> - 0000013000]
> [ 0.000000] [ffffea0000000000-ffffea0007dfffff] PMD -> [ffff880028600000-
> ffff88002f7fffff] on node 0
> [ 0.000000] Zone PFN ranges:
> [ 0.000000] DMA 0x00000010 -> 0x00001000
> [ 0.000000] DMA32 0x00001000 -> 0x00100000
> [ 0.000000] Normal 0x00100000 -> 0x00238000
> [ 0.000000] Movable zone start PFN for each node
> [ 0.000000] early_node_map[3] active PFN ranges
> [ 0.000000] 0: 0x00000010 -> 0x0000009f
> [ 0.000000] 0: 0x00000100 -> 0x000c7eb0
> [ 0.000000] 0: 0x00100000 -> 0x00238000
> [ 0.000000] On node 0 totalpages: 2096703
> [ 0.000000] DMA zone: 56 pages used for memmap
> [ 0.000000] DMA zone: 102 pages reserved
> [ 0.000000] DMA zone: 3825 pages, LIFO batch:0
> [ 0.000000] DMA32 zone: 14280 pages used for memmap
> [ 0.000000] DMA32 zone: 800488 pages, LIFO batch:31
> [ 0.000000] Normal zone: 17472 pages used for memmap
> [ 0.000000] Normal zone: 1260480 pages, LIFO batch:31
> [ 0.000000] ACPI: PM-Timer IO Port: 0x808
> [ 0.000000] ACPI: Local APIC address 0xfee00000
> [ 0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
> [ 0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
> [ 0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
> [ 0.000000] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled)
> [ 0.000000] ACPI: IOAPIC (id[0x04] address[0xfec00000] gsi_base[0])
> [ 0.000000] IOAPIC[0]: apic_id 4, version 33, address 0xfec00000, GSI 0-23
> [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
> [ 0.000000] ACPI: IRQ0 used by override.
> [ 0.000000] ACPI: IRQ2 used by override.
> [ 0.000000] ACPI: IRQ9 used by override.
> [ 0.000000] Using ACPI (MADT) for SMP configuration information
> [ 0.000000] ACPI: HPET id: 0x8300 base: 0xfed00000
> [ 0.000000] SMP: Allowing 4 CPUs, 0 hotplug CPUs
> [ 0.000000] nr_irqs_gsi: 24
> [ 0.000000] PM: Registered nosave memory: 000000000009f000 -
> 00000000000a0000
> [ 0.000000] PM: Registered nosave memory: 00000000000a0000 -
> 00000000000e6000
> [ 0.000000] PM: Registered nosave memory: 00000000000e6000 -
> 0000000000100000
> [ 0.000000] PM: Registered nosave memory: 00000000c7eb0000 -
> 00000000c7ec0000
> [ 0.000000] PM: Registered nosave memory: 00000000c7ec0000 -
> 00000000c7ef0000
> [ 0.000000] PM: Registered nosave memory: 00000000c7ef0000 -
> 00000000c7f00000
> [ 0.000000] PM: Registered nosave memory: 00000000c7f00000 -
> 00000000fff00000
> [ 0.000000] PM: Registered nosave memory: 00000000fff00000 -
> 0000000100000000
> [ 0.000000] Allocating PCI resources starting at c7f00000 (gap:
> c7f00000:38000000)
> [ 0.000000] NR_CPUS:4 nr_cpumask_bits:4 nr_cpu_ids:4 nr_node_ids:1
> [ 0.000000] PERCPU: Embedded 25 pages at ffff880028034000, static data 72160
> bytes
> [ 0.000000] Built 1 zonelists in Zone order, mobility grouping on. Total
> pages: 2064793
> [ 0.000000] Kernel command line: root=/dev/md1
> md=3,/dev/sda3,/dev/sdb3,/dev/sdc3 nmi_watchdog=0 mtrr_spare_reg_nr=1
> [ 0.000000] md: Will configure md3 (super-block) from
> /dev/sda3,/dev/sdb3,/dev/sdc3, below.
> [ 0.000000] PID hash table entries: 4096 (order: 12, 32768 bytes)
> [ 0.000000] Dentry cache hash table entries: 1048576 (order: 11, 8388608
> bytes)
> [ 0.000000] Inode-cache hash table entries: 524288 (order: 10, 4194304
> bytes)
> [ 0.000000] Initializing CPU#0
> [ 0.000000] Checking aperture...
> [ 0.000000] No AGP bridge found
> [ 0.000000] Node 0: aperture @ 2a42000000 size 32 MB
> [ 0.000000] Aperture beyond 4GB. Ignoring.
> [ 0.000000] Your BIOS doesn't leave a aperture memory hole
> [ 0.000000] Please enable the IOMMU option in the BIOS setup
> [ 0.000000] This costs you 64 MB of RAM
> [ 0.000000] Mapping aperture over 65536 KB of RAM @ 20000000
> [ 0.000000] PM: Registered nosave memory: 0000000020000000 -
> 0000000024000000
> [ 0.000000] Memory: 8184476k/9306112k available (3500k kernel code, 919300k
> absent, 201380k reserved, 1751k data, 376k init)
> [ 0.000000] SLUB: Genslabs=13, HWalign=64, Order=0-3, MinObjects=0, CPUs=4,
> Nodes=1
> [ 0.000000] Hierarchical RCU implementation.
> [ 0.000000] NR_IRQS:4352 nr_irqs:440
> [ 0.000000] Fast TSC calibration using PIT
> [ 0.000000] Detected 3200.214 MHz processor.
> [ 0.000609] Console: colour VGA+ 80x25
> [ 0.000611] console [tty0] enabled
> [ 0.003333] hpet clockevent registered
> [ 0.003333] alloc irq_desc for 24 on node 0
> [ 0.003333] alloc kstat_irqs on node 0
> [ 0.003333] HPET: 4 timers in total, 1 timers will be used for per-cpu
> timer
> [ 0.003339] Calibrating delay loop (skipped), value calculated using timer
> frequency.. 6402.10 BogoMIPS (lpj=10667366)
> [ 0.003421] Mount-cache hash table entries: 256
> [ 0.003543] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64
> bytes/line)
> [ 0.003578] CPU: L2 Cache: 512K (64 bytes/line)
> [ 0.003613] tseg: 0000000000
> [ 0.003618] CPU: Physical Processor ID: 0
> [ 0.003652] CPU: Processor Core ID: 0
> [ 0.003686] mce: CPU supports 6 MCE banks
> [ 0.003725] using C1E aware idle routine
> [ 0.003768] ACPI: Core revision 20090521
> [ 0.016704] Setting APIC routing to flat
> [ 0.017026] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [ 0.050860] CPU0: AMD Phenom(tm) II X4 955 Processor stepping 02
> [ 0.053333] Booting processor 1 APIC 0x1 ip 0x6000
> [ 0.003333] Initializing CPU#1
> [ 0.003333] Calibrating delay using timer specific routine.. 6402.85
> BogoMIPS (lpj=10666966)
> [ 0.003333] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64
> bytes/line)
> [ 0.003333] CPU: L2 Cache: 512K (64 bytes/line)
> [ 0.003333] CPU: Physical Processor ID: 0
> [ 0.003333] CPU: Processor Core ID: 1
> [ 0.003333] mce: CPU supports 6 MCE banks
> [ 0.003333] x86 PAT enabled: cpu 1, old 0x7040600070406, new
> 0x7010600070106
> [ 0.144161] CPU1: AMD Phenom(tm) II X4 955 Processor stepping 02
> [ 0.144507] checking TSC synchronization [CPU#0 -> CPU#1]: passed.
> [ 0.146699] Booting processor 2 APIC 0x2 ip 0x6000
> [ 0.003333] Initializing CPU#2
> [ 0.003333] Calibrating delay using timer specific routine.. 6402.85
> BogoMIPS (lpj=10666970)
> [ 0.003333] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64
> bytes/line)
> [ 0.003333] CPU: L2 Cache: 512K (64 bytes/line)
> [ 0.003333] CPU: Physical Processor ID: 0
> [ 0.003333] CPU: Processor Core ID: 2
> [ 0.003333] mce: CPU supports 6 MCE banks
> [ 0.003333] x86 PAT enabled: cpu 2, old 0x7040600070406, new
> 0x7010600070106
> [ 0.240822] CPU2: AMD Phenom(tm) II X4 955 Processor stepping 02
> [ 0.241168] checking TSC synchronization [CPU#0 -> CPU#2]: passed.
> [ 0.243373] Booting processor 3 APIC 0x3 ip 0x6000
> [ 0.003333] Initializing CPU#3
> [ 0.003333] Calibrating delay using timer specific routine.. 6402.85
> BogoMIPS (lpj=10666972)
> [ 0.003333] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64
> bytes/line)
> [ 0.003333] CPU: L2 Cache: 512K (64 bytes/line)
> [ 0.003333] CPU: Physical Processor ID: 0
> [ 0.003333] CPU: Processor Core ID: 3
> [ 0.003333] mce: CPU supports 6 MCE banks
> [ 0.003333] x86 PAT enabled: cpu 3, old 0x7040600070406, new
> 0x7010600070106
> [ 0.337491] CPU3: AMD Phenom(tm) II X4 955 Processor stepping 02
> [ 0.337836] checking TSC synchronization [CPU#0 -> CPU#3]: passed.
> [ 0.340006] Brought up 4 CPUs
> [ 0.340040] Total of 4 processors activated (25611.67 BogoMIPS).
> [ 0.340109] CPU0 attaching sched-domain:
> [ 0.340111] domain 0: span 0-3 level MC
> [ 0.340112] groups: 0 1 2 3
> [ 0.340116] CPU1 attaching sched-domain:
> [ 0.340117] domain 0: span 0-3 level MC
> [ 0.340118] groups: 1 2 3 0
> [ 0.340120] CPU2 attaching sched-domain:
> [ 0.340121] domain 0: span 0-3 level MC
> [ 0.340122] groups: 2 3 0 1
> [ 0.340125] CPU3 attaching sched-domain:
> [ 0.340126] domain 0: span 0-3 level MC
> [ 0.340127] groups: 3 0 1 2
> [ 0.340162] xor: automatically using best checksumming function:
> generic_sse
> [ 0.356667] generic_sse: 12835.200 MB/sec
> [ 0.356700] xor: using function: generic_sse (12835.200 MB/sec)
> [ 0.356754] Time: 7:07:42 Date: 09/11/09
> [ 0.356802] NET: Registered protocol family 16
> [ 0.356851] node 0 link 0: io port [1000, ffffff]
> [ 0.356851] TOM: 00000000c8000000 aka 3200M
> [ 0.356851] Fam 10h mmconf [e0000000, efffffff]
> [ 0.356851] node 0 link 0: mmio [e0000000, efffffff] ==> none
> [ 0.356851] node 0 link 0: mmio [f0000000, ffffffff]
> [ 0.356851] node 0 link 0: mmio [a0000, bffff]
> [ 0.356851] node 0 link 0: mmio [c8000000, dfffffff]
> [ 0.356851] TOM2: 0000000238000000 aka 9088M
> [ 0.356851] bus: [00,07] on node 0 link 0
> [ 0.356851] bus: 00 index 0 io port: [0, ffff]
> [ 0.356851] bus: 00 index 1 mmio: [f0000000, ffffffff]
> [ 0.356851] bus: 00 index 2 mmio: [a0000, bffff]
> [ 0.356851] bus: 00 index 3 mmio: [c8000000, dfffffff]
> [ 0.356851] bus: 00 index 4 mmio: [238000000, fcffffffff]
> [ 0.356851] ACPI: bus type pci registered
> [ 0.356851] PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
> [ 0.356851] PCI: Not using MMCONFIG.
> [ 0.356851] PCI: Using configuration type 1 for base access
> [ 0.356851] PCI: Using configuration type 1 for extended access
> [ 0.356851] bio: create slab<bio-0> at 0
> [ 0.356978] ACPI: EC: Look up EC in DSDT
> [ 0.367031] ACPI: Interpreter enabled
> [ 0.367538] ACPI: (supports S0 S1 S3 S4 S5)
> [ 0.367652] ACPI: Using IOAPIC for interrupt routing
> [ 0.367725] PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
> [ 0.370531] PCI: MCFG area at e0000000 reserved in ACPI motherboard
> resources
> [ 0.375947] PCI: Using MMCONFIG at e0000000 - efffffff
> [ 0.380367] ACPI: No dock devices found.
> [ 0.380451] ACPI: PCI Root Bridge [PCI0] (0000:00)
> [ 0.380520] pci 0000:00:00.0: reg 1c 64bit mmio: [0xe0000000-0xffffffff]
> [ 0.380520] pci 0000:00:02.0: PME# supported from D0 D3hot D3cold
> [ 0.380520] pci 0000:00:02.0: PME# disabled
> [ 0.380520] pci 0000:00:09.0: PME# supported from D0 D3hot D3cold
> [ 0.380520] pci 0000:00:09.0: PME# disabled
> [ 0.380520] pci 0000:00:0a.0: PME# supported from D0 D3hot D3cold
> [ 0.380520] pci 0000:00:0a.0: PME# disabled
> [ 0.380520] pci 0000:00:11.0: reg 10 io port: [0xa000-0xa007]
> [ 0.380520] pci 0000:00:11.0: reg 14 io port: [0x9000-0x9003]
> [ 0.380520] pci 0000:00:11.0: reg 18 io port: [0x8000-0x8007]
> [ 0.380520] pci 0000:00:11.0: reg 1c io port: [0x7000-0x7003]
> [ 0.380520] pci 0000:00:11.0: reg 20 io port: [0x6000-0x600f]
> [ 0.380520] pci 0000:00:11.0: reg 24 32bit mmio: [0xfddff800-0xfddffbff]
> [ 0.380520] pci 0000:00:12.0: reg 10 32bit mmio: [0xfddfe000-0xfddfefff]
> [ 0.380520] pci 0000:00:12.1: reg 10 32bit mmio: [0xfddfd000-0xfddfdfff]
> [ 0.380576] pci 0000:00:12.2: reg 10 32bit mmio: [0xfddff000-0xfddff0ff]
> [ 0.380625] pci 0000:00:12.2: supports D1 D2
> [ 0.380626] pci 0000:00:12.2: PME# supported from D0 D1 D2 D3hot
> [ 0.380663] pci 0000:00:12.2: PME# disabled
> [ 0.380724] pci 0000:00:13.0: reg 10 32bit mmio: [0xfddfc000-0xfddfcfff]
> [ 0.380775] pci 0000:00:13.1: reg 10 32bit mmio: [0xfddf7000-0xfddf7fff]
> [ 0.380843] pci 0000:00:13.2: reg 10 32bit mmio: [0xfddf6800-0xfddf68ff]
> [ 0.380893] pci 0000:00:13.2: supports D1 D2
> [ 0.380894] pci 0000:00:13.2: PME# supported from D0 D1 D2 D3hot
> [ 0.380930] pci 0000:00:13.2: PME# disabled
> [ 0.381072] pci 0000:00:14.1: reg 10 io port: [0x00-0x07]
> [ 0.381078] pci 0000:00:14.1: reg 14 io port: [0x00-0x03]
> [ 0.381084] pci 0000:00:14.1: reg 18 io port: [0x00-0x07]
> [ 0.381089] pci 0000:00:14.1: reg 1c io port: [0x00-0x03]
> [ 0.381095] pci 0000:00:14.1: reg 20 io port: [0xff00-0xff0f]
> [ 0.381219] pci 0000:00:14.5: reg 10 32bit mmio: [0xfddf5000-0xfddf5fff]
> [ 0.381353] pci 0000:02:00.0: reg 10 64bit mmio: [0xd0000000-0xdfffffff]
> [ 0.381360] pci 0000:02:00.0: reg 18 64bit mmio: [0xfdff0000-0xfdffffff]
> [ 0.381365] pci 0000:02:00.0: reg 20 io port: [0xc000-0xc0ff]
> [ 0.381372] pci 0000:02:00.0: reg 30 32bit mmio: [0xfdfc0000-0xfdfdffff]
> [ 0.381387] pci 0000:02:00.0: supports D1 D2
> [ 0.381415] pci 0000:02:00.1: reg 10 64bit mmio: [0xfdfec000-0xfdfeffff]
> [ 0.381445] pci 0000:02:00.1: supports D1 D2
> [ 0.381495] pci 0000:00:02.0: bridge io port: [0xc000-0xcfff]
> [ 0.381497] pci 0000:00:02.0: bridge 32bit mmio: [0xfdf00000-0xfdffffff]
> [ 0.381500] pci 0000:00:02.0: bridge 64bit mmio pref: [0xd0000000-0xdfffffff]
> [ 0.381543] pci 0000:00:09.0: bridge io port: [0xd000-0xdfff]
> [ 0.381545] pci 0000:00:09.0: bridge 32bit mmio: [0xfe000000-0xfebfffff]
> [ 0.381548] pci 0000:00:09.0: bridge 64bit mmio pref: [0xfa000000-0xfcefffff]
> [ 0.383347] pci 0000:01:00.0: reg 10 io port: [0xb800-0xb8ff]
> [ 0.383360] pci 0000:01:00.0: reg 18 64bit mmio: [0xcffff000-0xcfffffff]
> [ 0.383370] pci 0000:01:00.0: reg 20 64bit mmio: [0xcffe0000-0xcffeffff]
> [ 0.383375] pci 0000:01:00.0: reg 30 32bit mmio: [0xfdef0000-0xfdefffff]
> [ 0.383402] pci 0000:01:00.0: supports D1 D2
> [ 0.383403] pci 0000:01:00.0: PME# supported from D0 D1 D2 D3hot D3cold
> [ 0.383440] pci 0000:01:00.0: PME# disabled
> [ 0.383525] pci 0000:00:0a.0: bridge io port: [0xb000-0xbfff]
> [ 0.383527] pci 0000:00:0a.0: bridge 32bit mmio: [0xfde00000-0xfdefffff]
> [ 0.383530] pci 0000:00:0a.0: bridge 64bit mmio pref: [0xcff00000-0xcfffffff]
> [ 0.383565] pci 0000:05:06.0: reg 10 32bit mmio: [0xfcfff000-0xfcffffff]
> [ 0.383644] pci 0000:05:08.0: reg 10 io port: [0xe800-0xe83f]
> [ 0.383701] pci 0000:05:08.0: supports D1 D2
> [ 0.383743] pci 0000:00:14.4: transparent bridge
> [ 0.383779] pci 0000:00:14.4: bridge io port: [0xe000-0xefff]
> [ 0.383785] pci 0000:00:14.4: bridge 32bit mmio pref: [0xfcf00000-0xfcffffff]
> [ 0.383797] pci_bus 0000:00: on NUMA node 0
> [ 0.383800] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
> [ 0.383936] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCE2._PRT]
> [ 0.383981] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCEA._PRT]
> [ 0.384025] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0PC._PRT]
> [ 0.384091] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCE9._PRT]
> [ 0.386726] ACPI: PCI Interrupt Link [LNKA] (IRQs 4 *7 10 11 12 14 15)
> [ 0.386953] ACPI: PCI Interrupt Link [LNKB] (IRQs 4 7 10 *11 12 14 15)
> [ 0.387183] ACPI: PCI Interrupt Link [LNKC] (IRQs 4 7 *10 11 12 14 15)
> [ 0.387408] ACPI: PCI Interrupt Link [LNKD] (IRQs 4 7 *10 11 12 14 15)
> [ 0.387643] ACPI: PCI Interrupt Link [LNKE] (IRQs 4 7 10 11 12 14 15) *0,
> disabled.
> [ 0.387914] ACPI: PCI Interrupt Link [LNKF] (IRQs 4 7 10 *11 12 14 15)
> [ 0.388140] ACPI: PCI Interrupt Link [LNKG] (IRQs *4 10 11 12 14 15)
> [ 0.388352] ACPI: PCI Interrupt Link [LNKH] (IRQs 4 7 *10 11 12 14 15)
> [ 0.388538] SCSI subsystem initialized
> [ 0.388538] libata version 3.00 loaded.
> [ 0.388538] usbcore: registered new interface driver usbfs
> [ 0.388538] usbcore: registered new interface driver hub
> [ 0.388538] usbcore: registered new device driver usb
> [ 0.443347] raid6: int64x1 2755 MB/s
> [ 0.500010] raid6: int64x2 3858 MB/s
> [ 0.556669] raid6: int64x4 2850 MB/s
> [ 0.613353] raid6: int64x8 2537 MB/s
> [ 0.670007] raid6: sse2x1 3999 MB/s
> [ 0.726666] raid6: sse2x2 7012 MB/s
> [ 0.783343] raid6: sse2x4 7975 MB/s
> [ 0.783377] raid6: using algorithm sse2x4 (7975 MB/s)
> [ 0.783423] PCI: Using ACPI for IRQ routing
> [ 0.783423] pci 0000:00:00.0: BAR 3: address space collision on of device
> [0xe0000000-0xffffffff]
> [ 0.783425] pci 0000:00:00.0: BAR 3: can't allocate resource
> [ 0.793433] PCI-DMA: Disabling AGP.
> [ 0.793518] PCI-DMA: aperture base @ 20000000 size 65536 KB
> [ 0.793518] PCI-DMA: using GART IOMMU.
> [ 0.793518] PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
> [ 0.795205] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 24, 0
> [ 0.795312] hpet0: 4 comparators, 32-bit 14.318180 MHz counter
> [ 0.800030] hpet: hpet2 irq 24 for MSI
> [ 0.810015] Switched to high resolution mode on CPU 0
> [ 0.811193] Switched to high resolution mode on CPU 2
> [ 0.811196] Switched to high resolution mode on CPU 1
> [ 0.811200] Switched to high resolution mode on CPU 3
> [ 0.820035] pnp: PnP ACPI init
> [ 0.820085] ACPI: bus type pnp registered
> [ 0.822351] pnp 00:0b: mem resource (0x0-0x9ffff) overlaps 0000:00:00.0 BAR 3
> (0x0-0x1fffffff), disabling
> [ 0.822406] pnp 00:0b: mem resource (0xc0000-0xcffff) overlaps 0000:00:00.0
> BAR 3 (0x0-0x1fffffff), disabling
> [ 0.822461] pnp 00:0b: mem resource (0xe0000-0xfffff) overlaps 0000:00:00.0
> BAR 3 (0x0-0x1fffffff), disabling
> [ 0.822515] pnp 00:0b: mem resource (0x100000-0xc7efffff) overlaps
> 0000:00:00.0 BAR 3 (0x0-0x1fffffff), disabling
> [ 0.822837] pnp: PnP ACPI: found 12 devices
> [ 0.822871] ACPI: ACPI bus type pnp unregistered
> [ 0.822911] system 00:06: iomem range 0xfec00000-0xfec00fff could not be
> reserved
> [ 0.822965] system 00:06: iomem range 0xfee00000-0xfee00fff has been
> reserved
> [ 0.823002] system 00:07: ioport range 0x4d0-0x4d1 has been reserved
> [ 0.823037] system 00:07: ioport range 0x40b-0x40b has been reserved
> [ 0.823072] system 00:07: ioport range 0x4d6-0x4d6 has been reserved
> [ 0.823106] system 00:07: ioport range 0xc00-0xc01 has been reserved
> [ 0.823141] system 00:07: ioport range 0xc14-0xc14 has been reserved
> [ 0.823176] system 00:07: ioport range 0xc50-0xc51 has been reserved
> [ 0.823211] system 00:07: ioport range 0xc52-0xc52 has been reserved
> [ 0.823245] system 00:07: ioport range 0xc6c-0xc6c has been reserved
> [ 0.823280] system 00:07: ioport range 0xc6f-0xc6f has been reserved
> [ 0.823315] system 00:07: ioport range 0xcd0-0xcd1 has been reserved
> [ 0.823359] system 00:07: ioport range 0xcd2-0xcd3 has been reserved
> [ 0.823394] system 00:07: ioport range 0xcd4-0xcd5 has been reserved
> [ 0.823429] system 00:07: ioport range 0xcd6-0xcd7 has been reserved
> [ 0.823464] system 00:07: ioport range 0xcd8-0xcdf has been reserved
> [ 0.823499] system 00:07: ioport range 0x800-0x89f has been reserved
> [ 0.823533] system 00:07: ioport range 0xb00-0xb0f has been reserved
> [ 0.823568] system 00:07: ioport range 0xb20-0xb3f has been reserved
> [ 0.823603] system 00:07: ioport range 0x900-0x90f has been reserved
> [ 0.823638] system 00:07: ioport range 0x910-0x91f has been reserved
> [ 0.823673] system 00:07: ioport range 0xfe00-0xfefe has been reserved
> [ 0.823708] system 00:07: iomem range 0xffb80000-0xffbfffff has been reserved
> [ 0.823743] system 00:07: iomem range 0xfec10000-0xfec1001f has been
> reserved
> [ 0.823780] system 00:09: ioport range 0x290-0x29f has been reserved
> [ 0.823816] system 00:0a: iomem range 0xe0000000-0xefffffff has been reserved
> [ 0.823852] system 00:0b: iomem range 0xfec00000-0xffffffff could not be
> reserved
> [ 0.828757] pci 0000:00:02.0: PCI bridge, secondary bus 0000:02
> [ 0.828792] pci 0000:00:02.0: IO window: 0xc000-0xcfff
> [ 0.828828] pci 0000:00:02.0: MEM window: 0xfdf00000-0xfdffffff
> [ 0.828863] pci 0000:00:02.0: PREFETCH window:
> 0x000000d0000000-0x000000dfffffff
> [ 0.828918] pci 0000:00:09.0: PCI bridge, secondary bus 0000:03
> [ 0.828953] pci 0000:00:09.0: IO window: 0xd000-0xdfff
> [ 0.828988] pci 0000:00:09.0: MEM window: 0xfe000000-0xfebfffff
> [ 0.829023] pci 0000:00:09.0: PREFETCH window:
> 0x000000fa000000-0x000000fcefffff
> [ 0.829078] pci 0000:00:0a.0: PCI bridge, secondary bus 0000:01
> [ 0.829112] pci 0000:00:0a.0: IO window: 0xb000-0xbfff
> [ 0.829148] pci 0000:00:0a.0: MEM window: 0xfde00000-0xfdefffff
> [ 0.829183] pci 0000:00:0a.0: PREFETCH window:
> 0x000000cff00000-0x000000cfffffff
> [ 0.829237] pci 0000:00:14.4: PCI bridge, secondary bus 0000:05
> [ 0.829273] pci 0000:00:14.4: IO window: 0xe000-0xefff
> [ 0.829310] pci 0000:00:14.4: MEM window: disabled
> [ 0.829346] pci 0000:00:14.4: PREFETCH window: 0xfcf00000-0xfcffffff
> [ 0.829387] alloc irq_desc for 18 on node -1
> [ 0.829388] alloc kstat_irqs on node -1
> [ 0.829392] pci 0000:00:02.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
> [ 0.829428] pci 0000:00:02.0: setting latency timer to 64
> [ 0.829432] alloc irq_desc for 17 on node -1
> [ 0.829433] alloc kstat_irqs on node -1
> [ 0.829435] pci 0000:00:09.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> [ 0.829471] pci 0000:00:09.0: setting latency timer to 64
> [ 0.829474] pci 0000:00:0a.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
> [ 0.829509] pci 0000:00:0a.0: setting latency timer to 64
> [ 0.829516] pci_bus 0000:00: resource 0 io: [0x00-0xffff]
> [ 0.829517] pci_bus 0000:00: resource 1 mem: [0x000000-0xffffffffffffffff]
> [ 0.829519] pci_bus 0000:02: resource 0 io: [0xc000-0xcfff]
> [ 0.829521] pci_bus 0000:02: resource 1 mem: [0xfdf00000-0xfdffffff]
> [ 0.829522] pci_bus 0000:02: resource 2 pref mem [0xd0000000-0xdfffffff]
> [ 0.829523] pci_bus 0000:03: resource 0 io: [0xd000-0xdfff]
> [ 0.829525] pci_bus 0000:03: resource 1 mem: [0xfe000000-0xfebfffff]
> [ 0.829526] pci_bus 0000:03: resource 2 pref mem [0xfa000000-0xfcefffff]
> [ 0.829527] pci_bus 0000:01: resource 0 io: [0xb000-0xbfff]
> [ 0.829529] pci_bus 0000:01: resource 1 mem: [0xfde00000-0xfdefffff]
> [ 0.829530] pci_bus 0000:01: resource 2 pref mem [0xcff00000-0xcfffffff]
> [ 0.829531] pci_bus 0000:05: resource 0 io: [0xe000-0xefff]
> [ 0.829533] pci_bus 0000:05: resource 2 pref mem [0xfcf00000-0xfcffffff]
> [ 0.829534] pci_bus 0000:05: resource 3 io: [0x00-0xffff]
> [ 0.829535] pci_bus 0000:05: resource 4 mem: [0x000000-0xffffffffffffffff]
> [ 0.829547] NET: Registered protocol family 2
> [ 0.829602] IP route cache hash table entries: 262144 (order: 9, 2097152
> bytes)
> [ 0.830084] TCP established hash table entries: 262144 (order: 10, 4194304
> bytes)
> [ 0.831067] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
> [ 0.831468] TCP: Hash tables configured (established 262144 bind 65536)
> [ 0.831504] TCP reno registered
> [ 0.831582] NET: Registered protocol family 1
> [ 0.832803] Scanning for low memory corruption every 60 seconds
> [ 0.833572] HugeTLB registered 2 MB page size, pre-allocated 0 pages
> [ 0.833696] Loading Reiser4. See http://www.namesys.com for a description of
> Reiser4.
> [ 0.833781] msgmni has been set to 15987
> [ 0.834082] alg: No test for stdrng (krng)
> [ 0.834123] async_tx: api initialized (sync-only)
> [ 0.834227] Block layer SCSI generic (bsg) driver version 0.4 loaded (major
> 253)
> [ 0.834280] io scheduler noop registered
> [ 0.834315] io scheduler cfq registered (default)
> [ 0.834448] pci 0000:02:00.0: Boot video device
> [ 0.834538] alloc irq_desc for 25 on node -1
> [ 0.834540] alloc kstat_irqs on node -1
> [ 0.834545] pcieport-driver 0000:00:02.0: irq 25 for MSI/MSI-X
> [ 0.834550] pcieport-driver 0000:00:02.0: setting latency timer to 64
> [ 0.834642] alloc irq_desc for 26 on node -1
> [ 0.834643] alloc kstat_irqs on node -1
> [ 0.834646] pcieport-driver 0000:00:09.0: irq 26 for MSI/MSI-X
> [ 0.834650] pcieport-driver 0000:00:09.0: setting latency timer to 64
> [ 0.834741] alloc irq_desc for 27 on node -1
> [ 0.834742] alloc kstat_irqs on node -1
> [ 0.834744] pcieport-driver 0000:00:0a.0: irq 27 for MSI/MSI-X
> [ 0.834748] pcieport-driver 0000:00:0a.0: setting latency timer to 64
> [ 0.834953] input: Power Button as
> /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> [ 0.835007] ACPI: Power Button [PWRF]
> [ 0.835098] input: Power Button as
> /devices/LNXSYSTM:00/device:00/PNP0C0C:00/input/input1
> [ 0.835152] ACPI: Power Button [PWRB]
> [ 0.835302] processor LNXCPU:00: registered as cooling_device0
> [ 0.835337] ACPI: Processor [CPU0] (supports 8 throttling states)
> [ 0.835434] processor LNXCPU:01: registered as cooling_device1
> [ 0.835504] processor LNXCPU:02: registered as cooling_device2
> [ 0.835577] processor LNXCPU:03: registered as cooling_device3
> [ 0.839315] Linux agpgart interface v0.103
> [ 0.839511] ahci 0000:00:11.0: version 3.0
> [ 0.839521] alloc irq_desc for 22 on node -1
> [ 0.839522] alloc kstat_irqs on node -1
> [ 0.839525] ahci 0000:00:11.0: PCI INT A -> GSI 22 (level, low) -> IRQ 22
> [ 0.839673] ahci 0000:00:11.0: AHCI 0001.0100 32 slots 6 ports 3 Gbps 0x3f
> impl SATA mode
> [ 0.839727] ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio
> slum part
> [ 0.840272] scsi0 : ahci
> [ 0.840403] scsi1 : ahci
> [ 0.840501] scsi2 : ahci
> [ 0.840598] scsi3 : ahci
> [ 0.840697] scsi4 : ahci
> [ 0.840795] scsi5 : ahci
> [ 0.840926] ata1: SATA max UDMA/133 irq_stat 0x00400000, PHY RDY changed
> [ 0.840962] ata2: SATA max UDMA/133 abar m1024@0xfddff800 port 0xfddff980 irq
> 22
> [ 0.841016] ata3: SATA max UDMA/133 abar m1024@0xfddff800 port 0xfddffa00 irq
> 22
> [ 0.841070] ata4: SATA max UDMA/133 abar m1024@0xfddff800 port 0xfddffa80 irq
> 22
> [ 0.841124] ata5: SATA max UDMA/133 abar m1024@0xfddff800 port 0xfddffb00 irq
> 22
> [ 0.841178] ata6: SATA max UDMA/133 abar m1024@0xfddff800 port 0xfddffb80 irq
> 22
> [ 0.841459] PNP: PS/2 Controller [PNP0303:PS2K] at 0x60,0x64 irq 1
> [ 0.841493] PNP: PS/2 appears to have AUX port disabled, if this is
> incorrect please boot with i8042.nopnp
> [ 0.841943] serio: i8042 KBD port at 0x60,0x64 irq 1
> [ 0.842074] mice: PS/2 mouse device common for all mice
> [ 0.842243] rtc_cmos 00:02: RTC can wake from S4
> [ 0.842313] rtc_cmos 00:02: rtc core: registered rtc_cmos as rtc0
> [ 0.842369] rtc0: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
> [ 0.842428] md: raid1 personality registered for level 1
> [ 0.842462] md: raid6 personality registered for level 6
> [ 0.842496] md: raid5 personality registered for level 5
> [ 0.842530] md: raid4 personality registered for level 4
> [ 0.843117] cpuidle: using governor ladder
> [ 0.843151] cpuidle: using governor menu
> [ 0.843689] usbcore: registered new interface driver hiddev
> [ 0.843746] usbcore: registered new interface driver usbhid
> [ 0.843780] usbhid: v2.6:USB HID core driver
> [ 0.843843] Advanced Linux Sound Architecture Driver Version 1.0.20.
> [ 0.843878] ALSA device list:
> [ 0.843911] No soundcards found.
> [ 0.843979] TCP cubic registered
> [ 0.844019] NET: Registered protocol family 10
> [ 0.844148] IPv6 over IPv4 tunneling driver
> [ 0.844265] NET: Registered protocol family 17
> [ 0.844315] powernow-k8: Found 1 AMD Phenom(tm) II X4 955 Processor
> processors (4 cpu cores) (version 2.20.00)
> [ 0.844391] powernow-k8: 0 : pstate 0 (3200 MHz)
> [ 0.844425] powernow-k8: 1 : pstate 1 (2500 MHz)
> [ 0.844459] powernow-k8: 2 : pstate 2 (2100 MHz)
> [ 0.844492] powernow-k8: 3 : pstate 3 (800 MHz)
> [ 0.844886] PM: Resume from disk failed.
> [ 0.844966] Magic number: 9:648:116
> [ 0.866018] input: AT Translated Set 2 keyboard as
> /devices/platform/i8042/serio0/input/input2
> [ 1.160036] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [ 1.160097] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [ 1.160150] ata6: SATA link down (SStatus 0 SControl 300)
> [ 1.160205] ata5: SATA link down (SStatus 0 SControl 300)
> [ 1.160259] ata3: SATA link down (SStatus 0 SControl 300)
> [ 1.166391] ata4.00: ATA-7: SAMSUNG HD753LJ, 1AA01113, max UDMA7
> [ 1.166432] ata4.00: 1465149168 sectors, multi 0: LBA48 NCQ (depth 31/32)
> [ 1.166480] ata2.00: ATA-7: SAMSUNG HD502IJ, 1AA01110, max UDMA7
> [ 1.166514] ata2.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
> [ 1.172888] ata4.00: configured for UDMA/133
> [ 1.172943] ata2.00: configured for UDMA/133
> [ 1.560035] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [ 1.566394] ata1.00: ATA-7: SAMSUNG HD502IJ, 1AA01109, max UDMA7
> [ 1.566430] ata1.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
> [ 1.572855] ata1.00: configured for UDMA/133
> [ 1.583424] scsi 0:0:0:0: Direct-Access ATA SAMSUNG HD502IJ 1AA0
> PQ: 0 ANSI: 5
> [ 1.583684] sd 0:0:0:0: [sda] 976773168 512-byte logical blocks: (500
> GB/465 GiB)
> [ 1.583756] sd 0:0:0:0: [sda] Write Protect is off
> [ 1.583791] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> [ 1.583800] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled,
> doesn't support DPO or FUA
> [ 1.583911] sda:
> [ 1.583952] sd 0:0:0:0: Attached scsi generic sg0 type 0
> [ 1.584094] scsi 1:0:0:0: Direct-Access ATA SAMSUNG HD502IJ 1AA0
> PQ: 0 ANSI: 5
> [ 1.584283] sd 1:0:0:0: [sdb] 976773168 512-byte logical blocks: (500
> GB/465 GiB)
> [ 1.584354] sd 1:0:0:0: [sdb] Write Protect is off
> [ 1.584389] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
> [ 1.584398] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled,
> doesn't support DPO or FUA
> [ 1.584507] sdb:
> [ 1.584577] sd 1:0:0:0: Attached scsi generic sg1 type 0
> [ 1.584709] scsi 3:0:0:0: Direct-Access ATA SAMSUNG HD753LJ 1AA0
> PQ: 0 ANSI: 5
> [ 1.584865] sd 3:0:0:0: [sdc] 1465149168 512-byte logical blocks: (750
> GB/698 GiB)
> [ 1.584934] sd 3:0:0:0: [sdc] Write Protect is off
> [ 1.584969] sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> [ 1.584977] sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled,
> doesn't support DPO or FUA
> [ 1.585071] sdc:
> [ 1.585123] sd 3:0:0:0: Attached scsi generic sg2 type 0
> [ 1.588354] sdc1 sdc2 sdc3 sdc4< sdb1 sdb2 sdb3 sdb4< sda1 sda2 sda3
> sda4< sdb5 sdc5 sda5 sdc6 sdb6>
> [ 1.606850] sd 1:0:0:0: [sdb] Attached SCSI disk
> [ 1.610449] sda6>
> [ 1.610814] sd 0:0:0:0: [sda] Attached SCSI disk
> [ 1.617957] sdc7>
> [ 1.618341] sd 3:0:0:0: [sdc] Attached SCSI disk
> [ 1.618382] md: Waiting for all devices to be available before autodetect
> [ 1.618417] md: If you don't use raid, use raid=noautodetect
> [ 1.618533] md: Autodetecting RAID arrays.
> [ 1.724723] md: Scanned 12 and added 12 devices.
> [ 1.724758] md: autorun ...
> [ 1.724792] md: considering sdc6 ...
> [ 1.724827] md: adding sdc6 ...
> [ 1.724862] md: sdc5 has different UUID to sdc6
> [ 1.724897] md: sdc3 has different UUID to sdc6
> [ 1.724931] md: sdc1 has different UUID to sdc6
> [ 1.724967] md: adding sda6 ...
> [ 1.725001] md: sda5 has different UUID to sdc6
> [ 1.725036] md: sda3 has different UUID to sdc6
> [ 1.725070] md: sda1 has different UUID to sdc6
> [ 1.725106] md: adding sdb6 ...
> [ 1.725140] md: sdb5 has different UUID to sdc6
> [ 1.725175] md: sdb3 has different UUID to sdc6
> [ 1.725209] md: sdb1 has different UUID to sdc6
> [ 1.725349] md: created md3
> [ 1.725382] md: bind<sdb6>
> [ 1.725419] md: bind<sda6>
> [ 1.725456] md: bind<sdc6>
> [ 1.725492] md: running:<sdc6><sda6><sdb6>
> [ 1.725626] raid5: device sdc6 operational as raid disk 2
> [ 1.725661] raid5: device sda6 operational as raid disk 0
> [ 1.725695] raid5: device sdb6 operational as raid disk 1
> [ 1.725846] raid5: allocated 3220kB for md3
> [ 1.725910] raid5: raid level 5 set md3 active with 3 out of 3 devices,
> algorithm 2
> [ 1.725963] RAID5 conf printout:
> [ 1.725996] --- rd:3 wd:3
> [ 1.726029] disk 0, o:1, dev:sda6
> [ 1.726062] disk 1, o:1, dev:sdb6
> [ 1.726095] disk 2, o:1, dev:sdc6
> [ 1.726142] md3: detected capacity change from 0 to 864065421312
> [ 1.726213] md: considering sdc5 ...
> [ 1.726249] md: adding sdc5 ...
> [ 1.726283] md: sdc3 has different UUID to sdc5
> [ 1.726318] md: sdc1 has different UUID to sdc5
> [ 1.726353] md: adding sda5 ...
> [ 1.726388] md: sda3 has different UUID to sdc5
> [ 1.726422] md: sda1 has different UUID to sdc5
> [ 1.726458] md: adding sdb5 ...
> [ 1.726492] md: sdb3 has different UUID to sdc5
> [ 1.726526] md: sdb1 has different UUID to sdc5
> [ 1.726630] md: created md2
> [ 1.726663] md: bind<sdb5>
> [ 1.726700] md: bind<sda5>
> [ 1.726738] md: bind<sdc5>
> [ 1.726774] md: running:<sdc5><sda5><sdb5>
> [ 1.726901] raid5: device sdc5 operational as raid disk 2
> [ 1.726935] raid5: device sda5 operational as raid disk 0
> [ 1.726969] raid5: device sdb5 operational as raid disk 1
> [ 1.727126] raid5: allocated 3220kB for md2
> [ 1.727190] raid5: raid level 5 set md2 active with 3 out of 3 devices,
> algorithm 2
> [ 1.727243] RAID5 conf printout:
> [ 1.727276] --- rd:3 wd:3
> [ 1.727309] disk 0, o:1, dev:sda5
> [ 1.727342] disk 1, o:1, dev:sdb5
> [ 1.727376] disk 2, o:1, dev:sdc5
> [ 1.727420] md2: detected capacity change from 0 to 40007499776
> [ 1.727490] md: considering sdc3 ...
> [ 1.727526] md: adding sdc3 ...
> [ 1.727560] md: sdc1 has different UUID to sdc3
> [ 1.727595] md: adding sda3 ...
> [ 1.727629] md: sda1 has different UUID to sdc3
> [ 1.727664] md: adding sdb3 ...
> [ 1.727698] md: sdb1 has different UUID to sdc3
> [ 1.727799] md: created md1
> [ 1.727832] md: bind<sdb3>
> [ 1.727869] md: bind<sda3>
> [ 1.727905] md: bind<sdc3>
> [ 1.727945] md: running:<sdc3><sda3><sdb3>
> [ 1.728090] raid5: device sdc3 operational as raid disk 2
> [ 1.728125] raid5: device sda3 operational as raid disk 0
> [ 1.728159] raid5: device sdb3 operational as raid disk 1
> [ 1.728320] raid5: allocated 3220kB for md1
> [ 1.728370] raid5: raid level 5 set md1 active with 3 out of 3 devices,
> algorithm 2
> [ 1.728423] RAID5 conf printout:
> [ 1.728455] --- rd:3 wd:3
> [ 1.728488] disk 0, o:1, dev:sda3
> [ 1.728522] disk 1, o:1, dev:sdb3
> [ 1.728555] disk 2, o:1, dev:sdc3
> [ 1.728604] md1: detected capacity change from 0 to 79998877696
> [ 1.728674] md: considering sdc1 ...
> [ 1.728710] md: adding sdc1 ...
> [ 1.728745] md: adding sda1 ...
> [ 1.728779] md: adding sdb1 ...
> [ 1.728813] md: created md0
> [ 1.728846] md: bind<sdb1>
> [ 1.728882] md: bind<sda1>
> [ 1.728919] md: bind<sdc1>
> [ 1.728955] md: running:<sdc1><sda1><sdb1>
> [ 1.729133] raid1: raid set md0 active with 3 out of 3 mirrors
> [ 1.729176] md0: detected capacity change from 0 to 65667072
> [ 1.729232] md: ... autorun DONE.
> [ 1.729284] md: Loading md3: /dev/sda3
> [ 1.729322] md3: unknown partition table
> [ 1.729481] md: couldn't update array info. -22
> [ 1.729518] md: could not bd_claim sda3.
> [ 1.729552] md: md_import_device returned -16
> [ 1.729588] md: could not bd_claim sdb3.
> [ 1.729621] md: md_import_device returned -16
> [ 1.729657] md: could not bd_claim sdc3.
> [ 1.729690] md: md_import_device returned -16
> [ 1.729725] md: starting md3 failed
> [ 1.729800] md1: unknown partition table
> [ 1.767199] reiser4: md1: found disk format 4.0.0.
> [ 5.790318] VFS: Mounted root (reiser4 filesystem) readonly on device 9:1.
> [ 5.790370] Freeing unused kernel memory: 376k freed
> [ 9.037775] udev: starting version 145
> [ 9.217043] md2:
> [ 9.217072] md0: unknown partition table
> [ 9.282015] unknown partition table
> [ 10.420576] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [ 10.420591] r8169 0000:01:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
> [ 10.420840] r8169 0000:01:00.0: setting latency timer to 64
> [ 10.420871] alloc irq_desc for 28 on node -1
> [ 10.420872] alloc kstat_irqs on node -1
> [ 10.420882] r8169 0000:01:00.0: irq 28 for MSI/MSI-X
> [ 10.420988] eth0: RTL8168c/8111c at 0xffffc90012f18000, 00:19:66:86:ce:12,
> XID 3c4000c0 IRQ 28
> [ 10.440462] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
> [ 10.440465] ehci_hcd: block sizes: qh 192 qtd 96 itd 192 sitd 96
> [ 10.440491] ehci_hcd 0000:00:12.2: PCI INT B -> GSI 17 (level, low) -> IRQ
> 17
> [ 10.440518] ehci_hcd 0000:00:12.2: EHCI Host Controller
> [ 10.440532] drivers/usb/core/inode.c: creating file 'devices'
> [ 10.440534] drivers/usb/core/inode.c: creating file '001'
> [ 10.440566] ehci_hcd 0000:00:12.2: new USB bus registered, assigned bus
> number 1
> [ 10.440572] ehci_hcd 0000:00:12.2: reset hcs_params 0x102306 dbg=1 cc=2
> pcc=3 ordered !ppc ports=6
> [ 10.440575] ehci_hcd 0000:00:12.2: reset hcc_params a072 thresh 7 uframes
> 256/512/1024
> [ 10.440596] ehci_hcd 0000:00:12.2: applying AMD SB600/SB700 USB freeze
> workaround
> [ 10.440602] ehci_hcd 0000:00:12.2: reset command 080002 (park)=0 ithresh=8
> period=1024 Reset HALT
> [ 10.440615] ehci_hcd 0000:00:12.2: debug port 1
> [ 10.440619] ehci_hcd 0000:00:12.2: MWI active
> [ 10.440620] ehci_hcd 0000:00:12.2: supports USB remote wakeup
> [ 10.440633] ehci_hcd 0000:00:12.2: irq 17, io mem 0xfddff000
> [ 10.440637] ehci_hcd 0000:00:12.2: reset command 080002 (park)=0 ithresh=8
> period=1024 Reset HALT
> [ 10.440642] ehci_hcd 0000:00:12.2: init command 010009 (park)=0 ithresh=1
> period=256 RUN
> [ 10.447931] ehci_hcd 0000:00:12.2: USB 2.0 started, EHCI 1.00
> [ 10.447961] usb usb1: default language 0x0409
> [ 10.447965] usb usb1: udev 1, busnum 1, minor = 0
> [ 10.447967] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
> [ 10.447968] usb usb1: New USB device strings: Mfr=3, Product=2,
> SerialNumber=1
> [ 10.447969] usb usb1: Product: EHCI Host Controller
> [ 10.447970] usb usb1: Manufacturer: Linux 2.6.31r4 ehci_hcd
> [ 10.447971] usb usb1: SerialNumber: 0000:00:12.2
> [ 10.447998] usb usb1: uevent
> [ 10.448007] usb usb1: usb_probe_device
> [ 10.448009] usb usb1: configuration #1 chosen from 1 choice
> [ 10.448014] usb usb1: adding 1-0:1.0 (config #1, interface 0)
> [ 10.448021] usb 1-0:1.0: uevent
> [ 10.448028] hub 1-0:1.0: usb_probe_interface
> [ 10.448029] hub 1-0:1.0: usb_probe_interface - got id
> [ 10.448031] hub 1-0:1.0: USB hub found
> [ 10.448035] hub 1-0:1.0: 6 ports detected
> [ 10.448036] hub 1-0:1.0: standalone hub
> [ 10.448037] hub 1-0:1.0: no power switching (usb 1.0)
> [ 10.448038] hub 1-0:1.0: individual port over-current protection
> [ 10.448039] hub 1-0:1.0: power on to power good time: 20ms
> [ 10.448042] hub 1-0:1.0: local power source is good
> [ 10.448043] hub 1-0:1.0: trying to enable port power on non-switchable hub
> [ 10.448067] drivers/usb/core/inode.c: creating file '001'
> [ 10.448085] alloc irq_desc for 19 on node -1
> [ 10.448087] alloc kstat_irqs on node -1
> [ 10.448091] ehci_hcd 0000:00:13.2: PCI INT B -> GSI 19 (level, low) -> IRQ
> 19
> [ 10.448101] ehci_hcd 0000:00:13.2: EHCI Host Controller
> [ 10.448105] drivers/usb/core/inode.c: creating file '002'
> [ 10.448122] ehci_hcd 0000:00:13.2: new USB bus registered, assigned bus
> number 2
> [ 10.448127] ehci_hcd 0000:00:13.2: reset hcs_params 0x102306 dbg=1 cc=2
> pcc=3 ordered !ppc ports=6
> [ 10.448130] ehci_hcd 0000:00:13.2: reset hcc_params a072 thresh 7 uframes
> 256/512/1024
> [ 10.448143] ehci_hcd 0000:00:13.2: applying AMD SB600/SB700 USB freeze
> workaround
> [ 10.448148] ehci_hcd 0000:00:13.2: reset command 080002 (park)=0 ithresh=8
> period=1024 Reset HALT
> [ 10.448161] ehci_hcd 0000:00:13.2: debug port 1
> [ 10.448164] ehci_hcd 0000:00:13.2: MWI active
> [ 10.448165] ehci_hcd 0000:00:13.2: supports USB remote wakeup
> [ 10.448173] ehci_hcd 0000:00:13.2: irq 19, io mem 0xfddf6800
> [ 10.448176] ehci_hcd 0000:00:13.2: reset command 080002 (park)=0 ithresh=8
> period=1024 Reset HALT
> [ 10.448181] ehci_hcd 0000:00:13.2: init command 010009 (park)=0 ithresh=1
> period=256 RUN
> [ 10.457930] ehci_hcd 0000:00:13.2: USB 2.0 started, EHCI 1.00
> [ 10.457945] usb usb2: default language 0x0409
> [ 10.457949] usb usb2: udev 1, busnum 2, minor = 128
> [ 10.457950] usb usb2: New USB device found, idVendor=1d6b, idProduct=0002
> [ 10.457952] usb usb2: New USB device strings: Mfr=3, Product=2,
> SerialNumber=1
> [ 10.457953] usb usb2: Product: EHCI Host Controller
> [ 10.457954] usb usb2: Manufacturer: Linux 2.6.31r4 ehci_hcd
> [ 10.457955] usb usb2: SerialNumber: 0000:00:13.2
> [ 10.457977] usb usb2: uevent
> [ 10.457984] usb usb2: usb_probe_device
> [ 10.457986] usb usb2: configuration #1 chosen from 1 choice
> [ 10.457989] usb usb2: adding 2-0:1.0 (config #1, interface 0)
> [ 10.457997] usb 2-0:1.0: uevent
> [ 10.458003] hub 2-0:1.0: usb_probe_interface
> [ 10.458004] hub 2-0:1.0: usb_probe_interface - got id
> [ 10.458005] hub 2-0:1.0: USB hub found
> [ 10.458009] hub 2-0:1.0: 6 ports detected
> [ 10.458010] hub 2-0:1.0: standalone hub
> [ 10.458010] hub 2-0:1.0: no power switching (usb 1.0)
> [ 10.458011] hub 2-0:1.0: individual port over-current protection
> [ 10.458013] hub 2-0:1.0: power on to power good time: 20ms
> [ 10.458015] hub 2-0:1.0: local power source is good
> [ 10.458016] hub 2-0:1.0: trying to enable port power on non-switchable hub
> [ 10.458038] drivers/usb/core/inode.c: creating file '001'
> [ 10.474718] Linux video capture interface: v2.00
> [ 10.489875] bttv: driver version 0.9.18 loaded
> [ 10.489877] bttv: using 8 buffers with 2080k (520 pages) each for capture
> [ 10.489914] bttv: Bt8xx card found (0).
> [ 10.489923] alloc irq_desc for 21 on node -1
> [ 10.489924] alloc kstat_irqs on node -1
> [ 10.489928] bttv 0000:05:06.0: PCI INT A -> GSI 21 (level, low) -> IRQ 21
> [ 10.489938] bttv0: Bt848 (rev 18) at 0000:05:06.0, irq: 21, latency: 128,
> mmio: 0xfcfff000
> [ 10.489971] bttv0: using: Terratec TerraTV+ Version 1.0 (Bt848)/ Terra
> TValue Version 1.0/ Vobis TV-Boostar [card=25,insmod option]
> [ 10.489974] IRQ 21/bttv0: IRQF_DISABLED is not guaranteed on shared IRQs
> [ 10.490007] bttv0: gpio: en=00000000, out=00000000 in=00ffffff [init]
> [ 10.547936] ehci_hcd 0000:00:12.2: GetStatus port 2 status 001803 POWER
> sig=j CSC CONNECT
> [ 10.547939] hub 1-0:1.0: port 2: status 0501 change 0001
> [ 10.557954] hub 2-0:1.0: state 7 ports 6 chg 0000 evt 0000
> [ 10.647931] hub 1-0:1.0: state 7 ports 6 chg 0004 evt 0000
> [ 10.647940] hub 1-0:1.0: port 2, status 0501, change 0000, 480 Mb/s
> [ 10.700045] ehci_hcd 0000:00:12.2: port 2 high speed
> [ 10.700049] ehci_hcd 0000:00:12.2: GetStatus port 2 status 001005 POWER
> sig=se0 PE CONNECT
> [ 10.757104] usb 1-2: new high speed USB device using ehci_hcd and address 2
> [ 10.810036] ehci_hcd 0000:00:12.2: port 2 high speed
> [ 10.810039] ehci_hcd 0000:00:12.2: GetStatus port 2 status 001005 POWER
> sig=se0 PE CONNECT
> [ 10.881474] usb 1-2: default language 0x0409
> [ 10.881724] usb 1-2: udev 2, busnum 1, minor = 1
> [ 10.881725] usb 1-2: New USB device found, idVendor=05e3, idProduct=0608
> [ 10.881726] usb 1-2: New USB device strings: Mfr=0, Product=1,
> SerialNumber=0
> [ 10.881728] usb 1-2: Product: USB2.0 Hub
> [ 10.881764] usb 1-2: uevent
> [ 10.881773] usb 1-2: usb_probe_device
> [ 10.881775] usb 1-2: configuration #1 chosen from 1 choice
> [ 10.882188] usb 1-2: adding 1-2:1.0 (config #1, interface 0)
> [ 10.882199] usb 1-2:1.0: uevent
> [ 10.882206] hub 1-2:1.0: usb_probe_interface
> [ 10.882207] hub 1-2:1.0: usb_probe_interface - got id
> [ 10.882209] hub 1-2:1.0: USB hub found
> [ 10.882473] hub 1-2:1.0: 4 ports detected
> [ 10.882474] hub 1-2:1.0: standalone hub
> [ 10.882476] hub 1-2:1.0: individual port power switching
> [ 10.882477] hub 1-2:1.0: individual port over-current protection
> [ 10.882478] hub 1-2:1.0: Single TT
> [ 10.882479] hub 1-2:1.0: TT requires at most 32 FS bit times (2664 ns)
> [ 10.882480] hub 1-2:1.0: Port indicators are supported
> [ 10.882481] hub 1-2:1.0: power on to power good time: 100ms
> [ 10.882848] hub 1-2:1.0: local power source is good
> [ 10.882849] hub 1-2:1.0: enabling power on all ports
> [ 10.883860] drivers/usb/core/inode.c: creating file '002'
> [ 10.983961] hub 1-2:1.0: port 2: status 0301 change 0001
> [ 11.083356] usb 1-2: link qh256-0001/ffff8800c7800180 start 1 [1/0 us]
> [ 11.083364] hub 1-2:1.0: state 7 ports 4 chg 0004 evt 0000
> [ 11.083700] hub 1-2:1.0: port 2, status 0301, change 0000, 1.5 Mb/s
> [ 11.152198] usb 1-2.2: new low speed USB device using ehci_hcd and address
> 3
> [ 11.241187] usb 1-2.2: skipped 1 descriptor after interface
> [ 11.241189] usb 1-2.2: skipped 1 descriptor after interface
> [ 11.241685] usb 1-2.2: default language 0x0409
> [ 11.243941] usb 1-2.2: udev 3, busnum 1, minor = 2
> [ 11.243943] usb 1-2.2: New USB device found, idVendor=046d, idProduct=c518
> [ 11.243944] usb 1-2.2: New USB device strings: Mfr=1, Product=2,
> SerialNumber=0
> [ 11.243945] usb 1-2.2: Product: USB Receiver
> [ 11.243947] usb 1-2.2: Manufacturer: Logitech
> [ 11.243972] usb 1-2.2: uevent
> [ 11.243980] usb 1-2.2: usb_probe_device
> [ 11.243981] usb 1-2.2: configuration #1 chosen from 1 choice
> [ 11.251309] usb 1-2.2: adding 1-2.2:1.0 (config #1, interface 0)
> [ 11.251324] usb 1-2.2:1.0: uevent
> [ 11.251334] usbhid 1-2.2:1.0: usb_probe_interface
> [ 11.251335] usbhid 1-2.2:1.0: usb_probe_interface - got id
> [ 11.251796] usb 1-2: clear tt buffer port 2, a3 ep0 t80008d42
> [ 11.254692] input: Logitech USB Receiver as
> /devices/pci0000:00/0000:00:12.2/usb1/1-2/1-2.2/1-2.2:1.0/input/input3
> [ 11.254732] generic-usb 0003:046D:C518.0001: input,hidraw0: USB HID v1.11
> Mouse [Logitech USB Receiver] on usb-0000:00:12.2-2.2/input0
> [ 11.254740] usb 1-2.2: adding 1-2.2:1.1 (config #1, interface 1)
> [ 11.254749] usb 1-2.2:1.1: uevent
> [ 11.254755] usbhid 1-2.2:1.1: usb_probe_interface
> [ 11.254757] usbhid 1-2.2:1.1: usb_probe_interface - got id
> [ 11.255046] usb 1-2: clear tt buffer port 2, a3 ep0 t80008d42
> [ 11.260359] input: Logitech USB Receiver as
> /devices/pci0000:00/0000:00:12.2/usb1/1-2/1-2.2/1-2.2:1.1/input/input4
> [ 11.260368] usb 1-2.2: link qh8-0601/ffff8800c7800300 start 2 [1/2 us]
> [ 11.260384] drivers/usb/core/file.c: looking for a minor, starting at 96
> [ 11.260414] generic-usb 0003:046D:C518.0002: input,hiddev96,hidraw1: USB
> HID v1.11 Device [Logitech USB Receiver] on usb-0000:00:12.2-2.2/input1
> [ 11.260427] drivers/usb/core/inode.c: creating file '003'
> [ 11.260438] hub 1-2:1.0: state 7 ports 4 chg 0000 evt 0004
> [ 11.280770] usb 1-2.2:1.0: uevent
> [ 11.280827] usb 1-2.2: uevent
> [ 11.280844] usb 1-2.2:1.0: uevent
> [ 11.280903] usb 1-2.2: uevent
> [ 11.281405] usb 1-2.2:1.1: uevent
> [ 11.281467] usb 1-2.2: uevent
> [ 11.281517] usb 1-2.2:1.0: uevent
> [ 11.281529] usb 1-2.2:1.0: uevent
> [ 11.282164] usb 1-2.2:1.1: uevent
> [ 11.493831] bttv0: tea5757: read timeout
> [ 11.493832] bttv0: tuner type=5
> [ 11.505445] bttv0: audio absent, no audio device found!
> [ 11.525030] TUNER: Unable to find symbol tea5767_autodetection()
> [ 11.525033] tuner 0-0060: chip found @ 0xc0 (bt848 #0 [sw])
> [ 11.527843] tuner-simple 0-0060: creating new instance
> [ 11.527846] tuner-simple 0-0060: type set to 5 (Philips PAL_BG (FI1216 and
> compatibles))
> [ 11.528664] bttv0: registered device video0
> [ 11.528695] bttv0: registered device vbi0
> [ 11.814508] reiser4: md2: found disk format 4.0.0.
> [ 13.337101] hub 2-0:1.0: hub_suspend
> [ 13.337107] usb usb2: bus auto-suspend
> [ 13.337109] ehci_hcd 0000:00:13.2: suspend root hub
> [ 13.591267] reiser4: md3: found disk format 4.0.0.
> [ 55.020583] reiser4: sdc7: found disk format 4.0.0.
> [ 67.386552] Adding 7815612k swap on /dev/sda2. Priority:1 extents:1
> across:7815612k
> [ 67.408915] Adding 7815612k swap on /dev/sdb2. Priority:1 extents:1
> across:7815612k
> [ 67.501498] Adding 7815612k swap on /dev/sdc2. Priority:1 extents:1
> across:7815612k
> [ 68.205087] w83627ehf: Found W83627EHG chip at 0x290
> [ 68.421646] r8169: eth0: link up
> [ 68.421650] r8169: eth0: link up
> [ 70.281241] usb usb1: uevent
> [ 70.281269] usb 1-0:1.0: uevent
> [ 70.281293] usb 1-2: uevent
> [ 70.281320] usb 1-2.2: uevent
> [ 70.281346] usb 1-2.2:1.0: uevent
> [ 70.281533] usb 1-2.2:1.1: uevent
> [ 70.281706] usb 1-2:1.0: uevent
> [ 70.281804] usb usb2: uevent
> [ 70.281830] usb 2-0:1.0: uevent
> [ 71.272886] alloc irq_desc for 23 on node -1
> [ 71.272889] alloc kstat_irqs on node -1
> [ 71.272895] EMU10K1_Audigy 0000:05:08.0: PCI INT A -> GSI 23 (level, low) -> IRQ 23
> [ 71.278891] Audigy2 value: Special config.
> [ 72.473408] fglrx: module license 'Proprietary. (C) 2002 - ATI
> Technologies, Starnberg, GERMANY' taints kernel.
> [ 72.473417] Disabling lock debugging due to kernel taint
> [ 72.490481] [fglrx] Maximum main memory to use for locked dma buffers: 7760
> MBytes.
> [ 72.490559] [fglrx] vendor: 1002 device: 9501 count: 1
> [ 72.490739] [fglrx] ioport: bar 4, base 0xc000, size: 0x100
> [ 72.490750] pci 0000:02:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
> [ 72.490754] pci 0000:02:00.0: setting latency timer to 64
> [ 72.490884] [fglrx] Kernel PAT support is enabled
> [ 72.490901] [fglrx] module loaded - fglrx 8.66.2 [Sep 1 2009] with 1
> minors
> [ 72.668413] alloc irq_desc for 29 on node -1
> [ 72.668416] alloc kstat_irqs on node -1
> [ 72.668424] fglrx_pci 0000:02:00.0: irq 29 for MSI/MSI-X
> [ 72.668759] [fglrx] Firegl kernel thread PID: 3800
> [ 74.787429] [fglrx] Gart USWC size:1279 M.
> [ 74.787431] [fglrx] Gart cacheable size:508 M.
> [ 74.787435] [fglrx] Reserved FB block: Shared offset:0, size:1000000
> [ 74.787437] [fglrx] Reserved FB block: Unshared offset:fbff000, size:401000
> [ 74.787438] [fglrx] Reserved FB block: Unshared offset:1fffc000, size:4000
> [ 75.947153] usb 1-2.2: link qh8-0601/ffff8800c78003c0 start 3 [1/2 us]
> [ 616.849440] reiser4[ktxnmgrd:md1:ru(581)]: disable_write_barrier
> (fs/reiser4/wander.c:235)[zam-1055]:
> [ 616.849445] NOTICE: md1 does not support write barriers, using synchronous
> write instead.
> [ 671.813536] reiser4[ktxnmgrd:md2:ru(2774)]: disable_write_barrier
> (fs/reiser4/wander.c:235)[zam-1055]:
> [ 671.813541] NOTICE: md2 does not support write barriers, using synchronous
> write instead.
> [ 703.842289] reiser4[ktxnmgrd:md3:ru(2776)]: disable_write_barrier
> (fs/reiser4/wander.c:235)[zam-1055]:
> [ 703.842293] NOTICE: md3 does not support write barriers, using synchronous
> write instead.
>
> PS: I have to disable C1E in the BIOS, or video is very jerky - unless something
> CPU-heavy runs on at least one core...
>
>
>

2009-09-12 07:47:47

by Arjan van de Ven

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Sat, 12 Sep 2009 10:37:45 +0300
Nikos Chantziaras <[email protected]> wrote:

> (Volker stripped all CCs from his posts; I restored them manually.)
>
> On 09/11/2009 09:33 PM, Volker Armin Hemmann wrote:
> > Hi,
> >
> > this is with 2.6.31+reiser4+fglrx
> > Phenom II X4 955
> >
> > KDE 4.3.1, composite temporary disabled.
> > tvtime running.
> >
> > load:
> > fat emerge with make -j5 running in one konsole tab (xulrunner being
> > compiled).
> >
> > without NO_NEW_FAIR_SLEEPERS:
> >
> > tvtime is smooth most of the time
> >
> > with NO_NEW_FAIR_SLEEPERS:
> >
> > tvtime is more jerky. Very visible in scenes with movement.
>
> Is the make -j5 running niced 0? If yes, that would be actually the
> correct behavior. Unfortunately, I can't test tvtime specifically (I
> don't have a TV card), but other applications displaying video
> continue to work smooth on my dual core machine (Core 2 Duo E6600)
> even if I do "nice -n 19 make -j20". If I don't nice it, the video
> is skippy here too though.
>
> Question to Ingo:
> Would posting perf results help in any way with finding differences
> between mainline NEW_FAIR_SLEEPERS/NO_NEW_FAIR_SLEEPERS and BFS?

please also post latencytop output for the app you care about.
(The system-wide latencytop numbers aren't as relevant; to a large degree,
what is happening is that if you oversubscribe, you have to pay the price
for that period, and all you can do is move the cost around to the tasks
you don't care about. For that reason, latencytop output for the task you
care about is what's relevant ;-)


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-12 08:27:11

by Volker Armin Hemmann

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Hi,

On Saturday 12 September 2009, Nikos Chantziaras wrote:
> (Volker stripped all CCs from his posts; I restored them manually.)

stripping is not entirely correct. marc does not list all recipients ;)
Thank you.

>
> On 09/11/2009 09:33 PM, Volker Armin Hemmann wrote:
> > Hi,
> >
> > this is with 2.6.31+reiser4+fglrx
> > Phenom II X4 955
> >
> > KDE 4.3.1, composite temporary disabled.
> > tvtime running.
> >
> > load:
> > fat emerge with make -j5 running in one konsole tab (xulrunner being
> > compiled).
> >
> > without NO_NEW_FAIR_SLEEPERS:
> >
> > tvtime is smooth most of the time
> >
> > with NO_NEW_FAIR_SLEEPERS:
> >
> > tvtime is more jerky. Very visible in scenes with movement.
>
> Is the make -j5 running niced 0?

yes. It always is.


> If yes, that would be actually the
> correct behavior.

maybe. But I am not complaining about the jerkiness at all. I have

[ 3618.305918] hpet1: lost 1 rtc interrupts

with tvtime running since I switched CPUs - so something is wrong anyway.

I just wanted to report that for _me_ the behaviour is worse with
NO_NEW_FAIR_SLEEPERS and plain 2.6.31.

I tried it yesterday when the firefox update came in and switched between
NO_NEW_FAIR... and NEW_FAIR... several times; with NO_NEW_FAIR... tvtime
was just _more_ jerky. I am not saying that it wasn't jerky without it,
nor am I complaining about it at all. ;)

Glück Auf,
Volker

2009-09-12 09:03:50

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On 09/12/2009 11:27 AM, Volker Armin Hemmann wrote:
> Hi,
>
> On Saturday 12 September 2009, Nikos Chantziaras wrote:
>>
>> On 09/11/2009 09:33 PM, Volker Armin Hemmann wrote:
>>>
>>> this is with 2.6.31+reiser4+fglrx
>>> Phenom II X4 955
>>>
>>> KDE 4.3.1, composite temporary disabled.
>>> tvtime running.
>>>
>>> load:
>>> fat emerge with make -j5 running in one konsole tab (xulrunner being
>>> compiled).
>>>
>>> without NO_NEW_FAIR_SLEEPERS:
>>>
>>> tvtime is smooth most of the time
>>>
>>> with NO_NEW_FAIR_SLEEPERS:
>>>
>>> tvtime is more jerky. Very visible in scenes with movement.
>>
>> Is the make -j5 running niced 0?
>
> yes. It always is.
>
>> If yes, that would be actually the
>> correct behavior.
>
> maybe. But I do not complain about jerks at all. I have
>
> [ 3618.305918] hpet1: lost 1 rtc interrupts
>
> with tvtime running since I switched cpus - so something is wrong anyway.

Seeing the "lost 1 rtc interrupts" message makes me wonder if this could
possibly relate to problems with the C1E state on AMD systems (missing
timer interrupts):

http://lkml.org/lkml/2008/6/12/127

That thread is one year old though and your Phenom II CPU was released 7
months later.

2009-09-12 09:34:27

by Volker Armin Hemmann

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Saturday 12 September 2009, Nikos Chantziaras wrote:

> Seeing the "lost 1 rtc interrupts" message makes me wonder if this could
> possibly relate to problems with the C1E state on AMD systems (missing
> timer interrupts):
>
> http://lkml.org/lkml/2008/6/12/127
>
> That thread is one year old though and your Phenom II CPU was released 7
> months later.
>

thanks for the link!

2009-09-12 11:26:42

by Martin Steigerwald

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Friday 11 September 2009, Mat wrote:
> Martin Steigerwald <Martin <at> lichtvoll.de> writes:
> > On Thursday 10 September 2009, Ingo Molnar wrote:
>
> [snip]
>
> > > what is /debug/sched_features - is NO_NEW_FAIR_SLEEPERS set? If not
> > > set yet then try it:
> > >
> > > echo NO_NEW_FAIR_SLEEPERS > /debug/sched_features
> > >
> > > that too might make things more fluid.
>
> Hi Martin,

Hi Mat,

> it made an tremendous difference which still has to be tested out :)

[...]

> Concerning that "NO_NEW_FAIR_SLEEPERS" switch - isn't it as easy as to
>
> do the following ? (I'm not sure if there's supposed to be another
> debug)
>
> echo NO_NEW_FAIR_SLEEPERS > /sys/kernel/debug/sched_features
>
> which after the change says:
>
> cat /sys/kernel/debug/sched_features
> NO_NEW_FAIR_SLEEPERS NO_NORMALIZED_SLEEPER ADAPTIVE_GRAN WAKEUP_PREEMPT
> START_DEBIT AFFINE_WAKEUPS CACHE_HOT_BUDDY SYNC_WAKEUPS NO_HRTICK
> NO_DOUBLE_TICK ASYM_GRAN LB_BIAS LB_WAKEUP_UPDATE ASYM_EFF_LOAD
> NO_WAKEUP_OVERLAP LAST_BUDDY OWNER_SPIN
>
> I hope that's the correct switch ^^

Thanks. Appears to work nicely here ;-). I thought this might be a debugfs
that I need to mount separately, but it's already there. I will see how it
works out.

I wondered whether it might be a good idea to have a

echo default > /sys/kernel/kernel-tuning-knob

that would reset it to the compiled-in factory defaults. It would be a nice
way to go back to safe settings again once you have got carried away too
far with trying those tuning knobs.
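
Lacking such a knob, the closest workaround I can think of is to snapshot
the current feature tokens before experimenting and write them back one by
one afterwards. A rough userspace sketch of that idea (purely illustrative:
it restores the snapshot, not the compiled-in defaults, assumes debugfs is
mounted at /sys/kernel/debug, and the backup path is made up):

#include <stdio.h>

#define FEAT_PATH "/sys/kernel/debug/sched_features"

/* Save each feature token (e.g. NO_NEW_FAIR_SLEEPERS) on its own line. */
static int save_features(const char *backup)
{
	char tok[64];
	FILE *in = fopen(FEAT_PATH, "r");
	FILE *out = fopen(backup, "w");

	if (!in || !out)
		return -1;
	while (fscanf(in, "%63s", tok) == 1)
		fprintf(out, "%s\n", tok);
	fclose(in);
	fclose(out);
	return 0;
}

/* Write the saved tokens back, one write per token. */
static int restore_features(const char *backup)
{
	char tok[64];
	FILE *in = fopen(backup, "r");

	if (!in)
		return -1;
	while (fscanf(in, "%63s", tok) == 1) {
		FILE *out = fopen(FEAT_PATH, "w");

		if (!out)
			break;
		fprintf(out, "%s", tok);
		fclose(out);
	}
	fclose(in);
	return 0;
}

int main(void)
{
	save_features("/tmp/sched_features.saved");
	/* ... echo NO_NEW_FAIR_SLEEPERS etc., do the experiments ... */
	return restore_features("/tmp/sched_features.saved") ? 1 : 0;
}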

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7



2009-09-12 11:45:30

by Martin Steigerwald

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Re-tune the scheduler latency defaults to decrease worst-case latencies

On Wednesday 09 September 2009, tip-bot for Mike Galbraith wrote:
> Commit-ID: 172e082a9111ea504ee34cbba26284a5ebdc53a7
> Gitweb:
> http://git.kernel.org/tip/172e082a9111ea504ee34cbba26284a5ebdc53a7
> Author: Mike Galbraith <[email protected]>
> AuthorDate: Wed, 9 Sep 2009 15:41:37 +0200
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Wed, 9 Sep 2009 17:30:06 +0200
>
> sched: Re-tune the scheduler latency defaults to decrease worst-case
> latencies
>
> Reduce the latency target from 20 msecs to 5 msecs.
>
> Why? Larger latencies increase spread, which is good for scaling,
> but bad for worst case latency.
>
> We still have the ilog(nr_cpus) rule to scale up on bigger
> server boxes.
>
> Signed-off-by: Mike Galbraith <[email protected]>
> Acked-by: Peter Zijlstra <[email protected]>
> LKML-Reference: <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
>
>
> ---
> kernel/sched_fair.c | 12 ++++++------
> 1 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index af325a3..26fadb4 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -24,7 +24,7 @@
>
> /*
> * Targeted preemption latency for CPU-bound tasks:
> - * (default: 20ms * (1 + ilog(ncpus)), units: nanoseconds)
> + * (default: 5ms * (1 + ilog(ncpus)), units: nanoseconds)
> *
> * NOTE: this latency value is not the same as the concept of
> * 'timeslice length' - timeslices in CFS are of variable length
> @@ -34,13 +34,13 @@
> * (to see the precise effective timeslice length of your workload,
> * run vmstat and monitor the context-switches (cs) field)
> */
> -unsigned int sysctl_sched_latency = 20000000ULL;
> +unsigned int sysctl_sched_latency = 5000000ULL;
>
> /*
> * Minimal preemption granularity for CPU-bound tasks:
> - * (default: 4 msec * (1 + ilog(ncpus)), units: nanoseconds)
> + * (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
> */
> -unsigned int sysctl_sched_min_granularity = 4000000ULL;
> +unsigned int sysctl_sched_min_granularity = 1000000ULL;

Needs to be lower for a fluid desktop experience here:

shambhala:/proc/sys/kernel> cat sched_min_granularity_ns
100000

>
> /*
> * is kept at sysctl_sched_latency / sysctl_sched_min_granularity
> @@ -63,13 +63,13 @@ unsigned int __read_mostly
> sysctl_sched_compat_yield;
>
> /*
> * SCHED_OTHER wake-up granularity.
> - * (default: 5 msec * (1 + ilog(ncpus)), units: nanoseconds)
> + * (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
> *
> * This option delays the preemption effects of decoupled workloads
> * and reduces their over-scheduling. Synchronous workloads will still
> * have immediate wakeup/sleep latencies.
> */
> -unsigned int sysctl_sched_wakeup_granularity = 5000000UL;
> +unsigned int sysctl_sched_wakeup_granularity = 1000000UL;

Ditto:

shambhala:/proc/sys/kernel> cat sched_wakeup_granularity_ns
100000

With

shambhala:~> cat /proc/version
Linux version 2.6.31-rc7-tp42-toi-3.0.1-04741-g57e61c0 (martin@shambhala)
(gcc version 4.3.3 (Debian 4.3.3-10) ) #6 PREEMPT Sun Aug 23 10:51:32 CEST
2009

on my ThinkPad T42.

Otherwise compositing animations like switching desktops and zooming in
newly opening windows still appear jerky. Even with:

shambhala:/sys/kernel/debug> cat sched_features
NO_NEW_FAIR_SLEEPERS NO_NORMALIZED_SLEEPER ADAPTIVE_GRAN WAKEUP_PREEMPT
START_DEBIT AFFINE_WAKEUPS CACHE_HOT_BUDDY SYNC_WAKEUPS NO_HRTICK
NO_DOUBLE_TICK ASYM_GRAN LB_BIAS LB_WAKEUP_UPDATE ASYM_EFF_LOAD
NO_WAKEUP_OVERLAP LAST_BUDDY OWNER_SPIN

But NO_NEW_FAIR_SLEEPERS also gives a benefit. It makes those animations
even more fluid.

All in all I am quite happy with

shambhala:/proc/sys/kernel> grep "" *sched*
sched_child_runs_first:0
sched_compat_yield:0
sched_features:113916
sched_latency_ns:5000000
sched_migration_cost:500000
sched_min_granularity_ns:100000
sched_nr_migrate:32
sched_rt_period_us:1000000
sched_rt_runtime_us:950000
sched_shares_ratelimit:250000
sched_shares_thresh:4
sched_wakeup_granularity_ns:100000

for now.

It really makes a *lot* of difference. But it appears that both
sched_min_granularity_ns and sched_wakeup_granularity_ns have to be lower
on my ThinkPad for best effect.

I would still prefer some autotuning, where I say "desktop!" or nothing at
all. And that's it.
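
In the meantime the closest thing is a trivial "desktop" setter that just
writes the values above via /proc/sys/kernel (a sketch only, run as root;
the numbers are simply the ones I quoted, nothing is actually autotuned):

#include <stdio.h>

static int write_sysctl(const char *name, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/sys/kernel/%s", name);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	int ret = 0;

	/* the values I currently use on the ThinkPad T42 */
	ret |= write_sysctl("sched_latency_ns", "5000000");
	ret |= write_sysctl("sched_min_granularity_ns", "100000");
	ret |= write_sysctl("sched_wakeup_granularity_ns", "100000");
	return ret ? 1 : 0;
}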

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7



2009-09-12 11:48:38

by Martin Steigerwald

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Keep kthreads at default priority

On Wednesday 09 September 2009, Mike Galbraith wrote:
> On Wed, 2009-09-09 at 19:06 +0200, Peter Zijlstra wrote:
> > On Wed, 2009-09-09 at 09:55 -0700, Dmitry Torokhov wrote:
> > > On Wed, Sep 09, 2009 at 03:37:34PM +0000, tip-bot for Mike Galbraith
wrote:
> > > > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > > > index eb8751a..5fe7099 100644
> > > > --- a/kernel/kthread.c
> > > > +++ b/kernel/kthread.c
> > > > @@ -16,8 +16,6 @@
> > > > #include <linux/mutex.h>
> > > > #include <trace/events/sched.h>
> > > >
> > > > -#define KTHREAD_NICE_LEVEL (-5)
> > > > -
> > >
> > > Why don't we just redefine it to 0? We may find out later that we'd
> > > still prefer to have kernel threads have boost.
> >
> > Seems sensible, also the traditional reasoning behind this nice level
> > is that kernel threads do work on behalf of multiple tasks. Its a
> > kind of prio ceiling thing.
>
> True. None of our current threads are heavy enough to matter much.

Does it make sense to have this as a tunable? Where does it matter? Server
workloads?

(Oh no, not another tunable, I can hear you yell ;-).

--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7



2009-09-12 12:19:14

by Mike Galbraith

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Keep kthreads at default priority

On Sat, 2009-09-12 at 13:48 +0200, Martin Steigerwald wrote:
> On Wednesday 09 September 2009, Mike Galbraith wrote:
> > On Wed, 2009-09-09 at 19:06 +0200, Peter Zijlstra wrote:
> > > On Wed, 2009-09-09 at 09:55 -0700, Dmitry Torokhov wrote:
> > > > On Wed, Sep 09, 2009 at 03:37:34PM +0000, tip-bot for Mike Galbraith
> wrote:
> > > > > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > > > > index eb8751a..5fe7099 100644
> > > > > --- a/kernel/kthread.c
> > > > > +++ b/kernel/kthread.c
> > > > > @@ -16,8 +16,6 @@
> > > > > #include <linux/mutex.h>
> > > > > #include <trace/events/sched.h>
> > > > >
> > > > > -#define KTHREAD_NICE_LEVEL (-5)
> > > > > -
> > > >
> > > > Why don't we just redefine it to 0? We may find out later that we'd
> > > > still prefer to have kernel threads have boost.
> > >
> > > Seems sensible, also the traditional reasoning behind this nice level
> > > is that kernel threads do work on behalf of multiple tasks. Its a
> > > kind of prio ceiling thing.
> >
> > True. None of our current threads are heavy enough to matter much.
>
> Does it make sense to have this as a tunable? Where does it matter? Server
> workloads?

I don't think it should be a knob. It only makes a difference to
kthreads that are heavy CPU users. If one pops up as a performance
problem, IMHO, it should be tweaked separately. Running at default
weight saves a bit of unnecessary math for the common case.

-Mike

2009-09-13 15:47:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23


* Serge Belyshev <[email protected]> wrote:

> Note that the disabling NEW_FAIR_SLEEPERS doesn't fix 3%
> regression from v2.6.23, but instead makes "make -j4" runtime
> another 2% worse (27.05 -> 27.72).

ok - thanks for the numbers, will have a look.

> ---
> tools/perf/builtin-stat.c | 18 +++++++++++++++++-
> 1 file changed, 17 insertions(+), 1 deletion(-)

> + // quick ugly hack: if a "--" appears in the command, treat is as
> + // a delimiter and use remaining part as a "cleanup command",
> + // not affecting performance counters.
> + cleanup = cleanup_argc = 0;
> + for (j = 1; j < (argc-1); j ++) {
> + if (!strcmp (argv[j], "--")) {
> + cleanup = j + 1;
> + cleanup_argc = argc - j - 1;
> + argv[j] = NULL;
> + argc = j;
> + }
> + }

Nice feature!

How about doing it a bit cleaner, as '--repeat-prepare' and
'--repeat-cleanup' options, to allow both pre-repeat and post-repeat
cleanup ops to be done outside of the measured period?
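
A minimal standalone sketch of the delimiter idea, for illustration only (the
names and handling below are assumptions, not perf's actual option code):

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	int split = 0;
	int i;

	/* find the first literal "--"; everything after it is treated as
	 * the (hypothetical) cleanup command, excluded from measurement */
	for (i = 1; i < argc - 1; i++) {
		if (!strcmp(argv[i], "--")) {
			split = i;
			break;
		}
	}

	if (split) {
		printf("measured command: %d arg(s), cleanup command: %d arg(s)\n",
		       split - 1, argc - split - 1);
		printf("first cleanup arg: %s\n", argv[split + 1]);
	} else {
		printf("no cleanup command given\n");
	}
	return 0;
}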

Ingo

2009-09-13 19:17:57

by Mike Galbraith

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23

On Sun, 2009-09-13 at 17:47 +0200, Ingo Molnar wrote:
> * Serge Belyshev <[email protected]> wrote:
>
> > Note that the disabling NEW_FAIR_SLEEPERS doesn't fix 3%
> > regression from v2.6.23, but instead makes "make -j4" runtime
> > another 2% worse (27.05 -> 27.72).
>
> ok - thanks for the numbers, will have a look.

Seems NEXT_BUDDY is hurting the -j4 build.

LAST_BUDDY helps, which makes some sense.. if a task has heated up
cache, and is wakeup preempted by a fast mover (kthread, make..), it can
get the CPU back with still toasty data. Hm. If NEXT_BUDDY is on, that
benefit would likely be frequently destroyed too, because NEXT_BUDDY is
preferred over LAST_BUDDY.

Anyway, I'm thinking of tracking forks/sec as a means of detecting the
fork/exec load. Or, maybe just enable it when there's > 1 buddy pair
running.. or something. After all, NEXT_BUDDY is about scalability, and
make -j4 on a quad surely doesn't need any scalability help :)

Performance counter stats for 'make -j4 vmlinux':

stock
111.625198810 seconds time elapsed avg 112.120 1.00
112.209501685 seconds time elapsed
112.528258240 seconds time elapsed

NO_NEXT_BUDDY NO_LAST_BUDDY
109.405064078 seconds time elapsed avg 109.351 .975
108.708076118 seconds time elapsed
109.942346026 seconds time elapsed

NO_NEXT_BUDDY
108.005756718 seconds time elapsed avg 108.064 .963
107.689862679 seconds time elapsed
108.497117555 seconds time elapsed

NO_LAST_BUDDY
110.208717063 seconds time elapsed avg 110.120 .982
110.362412902 seconds time elapsed
109.791359601 seconds time elapsed


diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index aa7f841..7cfea64 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1501,7 +1501,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
*/
if (sched_feat(LAST_BUDDY) && likely(se->on_rq && curr != rq->idle))
set_last_buddy(se);
- set_next_buddy(pse);
+ if (sched_feat(NEXT_BUDDY))
+ set_next_buddy(pse);

/*
* We can come here with TIF_NEED_RESCHED already set from new task
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index e2dc63a..6e7070b 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -13,5 +13,6 @@ SCHED_FEAT(LB_BIAS, 1)
SCHED_FEAT(LB_WAKEUP_UPDATE, 1)
SCHED_FEAT(ASYM_EFF_LOAD, 1)
SCHED_FEAT(WAKEUP_OVERLAP, 0)
+SCHED_FEAT(NEXT_BUDDY, 1)
SCHED_FEAT(LAST_BUDDY, 1)
SCHED_FEAT(OWNER_SPIN, 1)
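
For readers unfamiliar with the SCHED_FEAT()/sched_feat() plumbing the patch
hooks into, here is a simplified, self-contained sketch of the feature-bit
pattern behind it (an illustration, not the kernel's exact implementation; in
the real tree the bits are generated from sched_features.h and can be toggled
at runtime, e.g. via /sys/kernel/debug/sched_features):

#include <stdio.h>

enum {
	FEAT_NEXT_BUDDY = 1 << 0,
	FEAT_LAST_BUDDY = 1 << 1,
};

/* default mask: both buddy features enabled */
static unsigned int sched_features = FEAT_NEXT_BUDDY | FEAT_LAST_BUDDY;

/* test one feature bit, mirroring the kernel's sched_feat(x) idea */
#define sched_feat(x) (sched_features & FEAT_##x)

int main(void)
{
	/* simulate turning a feature off, as NO_NEXT_BUDDY would */
	sched_features &= ~FEAT_NEXT_BUDDY;

	printf("NEXT_BUDDY: %s\n", sched_feat(NEXT_BUDDY) ? "on" : "off");
	printf("LAST_BUDDY: %s\n", sched_feat(LAST_BUDDY) ? "on" : "off");
	return 0;
}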

2009-09-14 06:15:45

by Mike Galbraith

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23

On Sun, 2009-09-13 at 21:17 +0200, Mike Galbraith wrote:

> Anyway, I'm thinking of tracking forks/sec as a means of detecting the
> fork/exec load. Or, maybe just enable it when there's > 1 buddy pair
> running.. or something. After all, NEXT_BUDDY is about scalability, and
> make -j4 on a quad surely doesn't need any scalability help :)

But, this buddy vs fork/exec thing is not at all cut and dried. Even
with fork/exec load being the primary CPU consumer, there are genuine
buddies to worry about when you've got a GUI running, next/last buddy
can reduce the chances that an oinker slips in between X and client.

Ponder...

(oil for rusty old ponder machine welcome, gears grinding)

-Mike

2009-09-14 09:46:12

by Nikos Chantziaras

[permalink] [raw]
Subject: Phoronix CFS vs BFS benchmarks

Phoronix has published some benchmarks, including some "non-synthetic"
real-life applications:

http://www.phoronix.com/vr.php?view=14179

The benchmarks are:

* World of Padman
* Timed Apache Compilation
* Timed PHP Compilation
* 7-Zip Compression
* GraphicsMagick
* Apache Benchmark
* Threaded I/O Tester
* PostMark

The test was performed on an Ubuntu 9.10 daily snapshot from 2009-09-10
with the GNOME 2.27.91 desktop, X Server 1.6.3, NVIDIA 190.32 display
driver, GCC 4.4.1, and an EXT4 file-system.

2009-09-14 11:35:35

by Mike Galbraith

[permalink] [raw]
Subject: Re: Phoronix CFS vs BFS bencharks

On Mon, 2009-09-14 at 12:46 +0300, Nikos Chantziaras wrote:
> Phoronix has published some benchmarks, including some "non-synthetic"
> real-life applications:
>
> http://www.phoronix.com/vr.php?view=14179
>
> The benchmarks are:
>
> * World of Padman
> * Timed Apache Compilation
> * Timed PHP Compilation
> * 7-Zip Compression
> * GraphicsMagick
> * Apache Benchmark
> * Threaded I/O Tester
> * PostMark
>
> The test was performed on an Ubuntu 9.10 daily snapshot from 2009-09-10
> with the GNOME 2.27.91 desktop, X Server 1.6.3, NVIDIA 190.32 display
> driver, GCC 4.4.1, and an EXT4 file-system.

Interesting results.

It'd be nice to see what difference the changes since .31 have made to
these comparisons. In particular, child_runs_first was found to have a
substantial negative impact on parallel compiles, and has been turned
off. The reduction of sched_latency has a rather large effect on worst
case latency for CPU hogs, so will likely affect some results markedly.

Hohum, back to the grindstone.

-Mike

2009-09-14 15:20:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: [crash, bisected] Re: clocksource: Resolve cpu hotplug dead lock with TSC unstable


* Martin Schwidefsky <[email protected]> wrote:

> On Fri, 11 Sep 2009 09:37:47 +0200
> Ingo Molnar <[email protected]> wrote:
>
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > >
> > > * Ingo Molnar <[email protected]> wrote:
> > >
> > > >
> > > > * Jens Axboe <[email protected]> wrote:
> > > >
> > > > > I went to try -tip btw, but it crashes on boot. Here's the
> > > > > backtrace, typed manually, it's crashing in
> > > > > queue_work_on+0x28/0x60.
> > > > >
> > > > > Call Trace:
> > > > > queue_work
> > > > > schedule_work
> > > > > clocksource_mark_unstable
> > > > > mark_tsc_unstable
> > > > > check_tsc_sync_source
> > > > > native_cpu_up
> > > > > relay_hotcpu_callback
> > > > > do_fork_idle
> > > > > _cpu_up
> > > > > cpu_up
> > > > > kernel_init
> > > > > kernel_thread_helper
> > > >
> > > > hm, that looks like an old bug i fixed days ago via:
> > > >
> > > > 00a3273: Revert "x86: Make tsc=reliable override boot time stability checks"
> > > >
> > > > Have you tested tip:master - do you still know which sha1?
> > >
> > > Ok, i reproduced it on a testbox and bisected it, the crash is
> > > caused by:
> > >
> > > 7285dd7fd375763bfb8ab1ac9cf3f1206f503c16 is first bad commit
> > > commit 7285dd7fd375763bfb8ab1ac9cf3f1206f503c16
> > > Author: Thomas Gleixner <[email protected]>
> > > Date: Fri Aug 28 20:25:24 2009 +0200
> > >
> > > clocksource: Resolve cpu hotplug dead lock with TSC unstable
> > >
> > > Martin Schwidefsky analyzed it:
> > >
> > > I've reverted it in tip/master for now.
> >
> > and that uncovers the circular locking bug that this commit was
> > supposed to fix ...
> >
> > Martin?
>
> This patch should fix the obvious problem that the watchdog_work
> structure is not yet initialized if the clocksource watchdog is not
> running yet.
> --
> Subject: [PATCH] clocksource: statically initialize watchdog workqueue
>
> From: Martin Schwidefsky <[email protected]>
>
> The watchdog timer is started after the watchdog clocksource and at least
> one watched clocksource have been registered. The clocksource work element
> watchdog_work is initialized just before the clocksource timer is started.
> This is too late for the clocksource_mark_unstable call from native_cpu_up.
> To fix this use a static initializer for watchdog_work.
>
> Signed-off-by: Martin Schwidefsky <[email protected]>
> ---
> kernel/time/clocksource.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/kernel/time/clocksource.c
> ===================================================================
> --- linux-2.6.orig/kernel/time/clocksource.c
> +++ linux-2.6/kernel/time/clocksource.c
> @@ -123,10 +123,12 @@ static DEFINE_MUTEX(clocksource_mutex);
> static char override_name[32];
>
> #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
> +static void clocksource_watchdog_work(struct work_struct *work);
> +
> static LIST_HEAD(watchdog_list);
> static struct clocksource *watchdog;
> static struct timer_list watchdog_timer;
> -static struct work_struct watchdog_work;
> +static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> static DEFINE_SPINLOCK(watchdog_lock);
> static cycle_t watchdog_last;
> static int watchdog_running;
> @@ -230,7 +232,6 @@ static inline void clocksource_start_wat
> {
> if (watchdog_running || !watchdog || list_empty(&watchdog_list))
> return;
> - INIT_WORK(&watchdog_work, clocksource_watchdog_work);
> init_timer(&watchdog_timer);
> watchdog_timer.function = clocksource_watchdog;
> watchdog_last = watchdog->read(watchdog);
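
A side note on the fix above (an illustrative sketch, not part of the patch):
the difference between the two initialization styles is that DECLARE_WORK()
produces a work item that is fully initialized at compile time, while
INIT_WORK() only makes the item valid once that line has actually executed.

#include <linux/workqueue.h>

static void demo_fn(struct work_struct *work)
{
	/* no-op, for illustration */
}

/* compile-time initialization: the work item itself is valid from the
 * start (whether the workqueue code is ready to run it this early in
 * boot is a separate question, addressed later in this thread) */
static DECLARE_WORK(demo_static_work, demo_fn);

/* run-time initialization: queueing this before demo_setup() has run
 * is exactly the kind of bug the patch above avoids */
static struct work_struct demo_dynamic_work;

static void demo_setup(void)
{
	INIT_WORK(&demo_dynamic_work, demo_fn);
}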

Now another box crashes during bootup. Reverting these two:

f79e025: clocksource: Resolve cpu hotplug dead lock with TSC unstable, fix crash
7285dd7: clocksource: Resolve cpu hotplug dead lock with TSC unstable

allows me to boot it.

plain 32-bit defconfig.

Ingo

2009-09-14 15:32:45

by Mike Galbraith

[permalink] [raw]
Subject: Re: Phoronix CFS vs BFS bencharks

On Mon, 2009-09-14 at 16:27 +0200, Marcin Letyns wrote:
> Hello,
>
> Disabling NEW_FAIR_SLEEPERS makes a lot of difference here in the
> Apache benchmark:
>
> 2.6.30.6-bfs: 7311.05
>
> 2.6.30.6-cfs-fair_sl_disabled: 8249.17
>
> 2.6.30.6-cfs-fair_sl_enabled: 4894.99

Wow.

Some loads like wakeup preemption (mysql+oltp), and some hate it. This
load appears to REALLY hate it (as does volanomark, but that thing is
extremely overloaded). How many threads does that benchmark run
concurrently?

In any case, it's currently disabled in tip. Time will tell which
benchmarks gain, and which lose. With it disabled, anything light loses
when competing with hog(s). There _are_ one heck of a lot of hogs out
there though, so maybe it _should_ be disabled by default. Dunno.

-Mike

2009-09-14 15:37:21

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [crash, bisected] Re: clocksource: Resolve cpu hotplug dead lock with TSC unstable

On Mon, 14 Sep 2009 17:19:58 +0200
Ingo Molnar <[email protected]> wrote:

> Now another box crashes during bootup. Reverting these two:
>
> f79e025: clocksource: Resolve cpu hotplug dead lock with TSC unstable, fix crash
> 7285dd7: clocksource: Resolve cpu hotplug dead lock with TSC unstable
>
> allows me to boot it.
>
> plain 32-bit defconfig.

I've seen the bug report. init_workqueues comes after smp_init.
The idea I'm currently playing with is a simple check in the tsc
code for whether the tsc clocksource is already registered.
When smp_init is called the tsc is not yet registered, we could
just set the rating to zero.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-09-14 18:00:12

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [crash, bisected] Re: clocksource: Resolve cpu hotplug dead lock with TSC unstable

On Mon, 14 Sep 2009 17:19:58 +0200
Ingo Molnar <[email protected]> wrote:

> Now another box crashes during bootup. Reverting these two:
>
> f79e025: clocksource: Resolve cpu hotplug dead lock with TSC unstable, fix crash
> 7285dd7: clocksource: Resolve cpu hotplug dead lock with TSC unstable
>
> allows me to boot it.
>
> plain 32-bit defconfig.

Ok, I forced the situation where the bad thing happens. With the patch below
the crash goes away.

[ 0.152056] checking TSC synchronization [CPU#0 -> CPU#1]:
[ 0.156001] Measured 0 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.156001] Marking TSC unstable due to check_tsc_sync_source failed

Is there a reason why we need the TSC as a clocksource early in the boot
process?

--
Subject: clocksource: delay tsc clocksource registration

From: Martin Schwidefsky <[email protected]>

Until the tsc clocksource has been registered it can be
downgraded by setting the CLOCK_SOURCE_UNSTABLE bit and the
rating to zero. Once the tsc clocksource is registered a
work queue is needed to change the rating.

Delay the registration of the tsc clocksource to a point in
the boot process after the work queues have been initialized.

This hopefully finally resolves the boot crash due to the
tsc downgrade.

Signed-off-by: Martin Schwidefsky <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: John Stultz <[email protected]>
---

Index: linux-2.6-tip/arch/x86/kernel/tsc.c
===================================================================
--- linux-2.6-tip.orig/arch/x86/kernel/tsc.c 2009-09-14 19:25:02.000000000 +0200
+++ linux-2.6-tip/arch/x86/kernel/tsc.c 2009-09-14 19:30:13.000000000 +0200
@@ -853,9 +853,16 @@
clocksource_tsc.rating = 0;
clocksource_tsc.flags &= ~CLOCK_SOURCE_IS_CONTINUOUS;
}
+}
+
+static int __init register_tsc_clocksource(void)
+{
clocksource_register(&clocksource_tsc);
+ return 0;
}

+core_initcall(register_tsc_clocksource);
+
#ifdef CONFIG_X86_64
/*
* calibrate_cpu is used on systems with fixed rate TSCs to determine
--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-09-14 19:21:08

by Marcin Letyns

[permalink] [raw]
Subject: Re: Phoronix CFS vs BFS bencharks

2009/9/14 Mike Galbraith <[email protected]>
>
> On Mon, 2009-09-14 at 16:27 +0200, Marcin Letyns wrote:
> > Hello,
> >
> > Disabling NEW_FAIR_SLEEPERS makes a lot of difference here in the
> > Apache benchmark:
> >
> > 2.6.30.6-bfs: 7311.05
> >
> > 2.6.30.6-cfs-fair_sl_disabled: 8249.17
> >
> > 2.6.30.6-cfs-fair_sl_enabled: 4894.99
>
> Wow.
>
> > Some loads like wakeup preemption (mysql+oltp), and some hate it. This
> load appears to REALLY hate it (as does volanomark, but that thing is
> > extremely overloaded). How many threads does that benchmark run
> concurrently?

From the benchmark description:

This is a test of ab, which is the Apache Benchmark program. This test
profile measures how many requests per second a given system can
sustain when carrying out 500,000 requests with 100 requests being
carried out concurrently.

2009-09-14 20:49:23

by Willy Tarreau

[permalink] [raw]
Subject: Re: Phoronix CFS vs BFS bencharks

On Mon, Sep 14, 2009 at 09:14:35PM +0200, Marcin Letyns wrote:
> 2009/9/14 Mike Galbraith <[email protected]>
> >
> > On Mon, 2009-09-14 at 16:27 +0200, Marcin Letyns wrote:
> > > Hello,
> > >
> > > Disabling NEW_FAIR_SLEEPERS makes a lot of difference here in the
> > > Apache benchmark:
> > >
> > > 2.6.30.6-bfs: 7311.05
> > >
> > > 2.6.30.6-cfs-fair_sl_disabled: 8249.17
> > >
> > > 2.6.30.6-cfs-fair_sl_enabled: 4894.99
> >
> > Wow.
> >
> > > Some loads like wakeup preemption (mysql+oltp), and some hate it. This
> > load appears to REALLY hate it (as does volanomark, but that thing is
> > > extremely overloaded). How many threads does that benchmark run
> > concurrently?
>
> From the benchmark description:
>
> This is a test of ab, which is the Apache Benchmark program. This test
> profile measures how many requests per second a given system can
> sustain when carrying out 500,000 requests with 100 requests being
> carried out concurrently.

Be careful not to run ab on the same machine as you run apache, otherwise
the numerous apache processes can limit ab's throughput. This is the same
reason as why I educate people so that they don't run a single-process
proxy in front of a multi-process/multi-thread web server. Apparently
it's not obvious to everyone.

Regards,
Willy

2009-09-15 08:37:43

by Mike Galbraith

[permalink] [raw]
Subject: Re: Phoronix CFS vs BFS bencharks

On Mon, 2009-09-14 at 22:49 +0200, Willy Tarreau wrote:
> On Mon, Sep 14, 2009 at 09:14:35PM +0200, Marcin Letyns wrote:
> > 2009/9/14 Mike Galbraith <[email protected]>
> > >
> > > On Mon, 2009-09-14 at 16:27 +0200, Marcin Letyns wrote:
> > > > Hello,
> > > >
> > > > Disabling NEW_FAIR_SLEEPERS makes a lot of difference here in the
> > > > Apache benchmark:
> > > >
> > > > 2.6.30.6-bfs: 7311.05
> > > >
> > > > 2.6.30.6-cfs-fair_sl_disabled: 8249.17
> > > >
> > > > 2.6.30.6-cfs-fair_sl_enabled: 4894.99
> > >
> > > Wow.
> > >
> > > Some loads like wakeup preemption (mysql+oltp), and some hate it. This
> > > load appears to REALLY hate it (as does volanomark, but that thing is
> > > extremely overloaded). How many threads does that benchmark run
> > > concurrently?
> >
> > From the benchmark description:
> >
> > This is a test of ab, which is the Apache Benchmark program. This test
> > profile measures how many requests per second a given system can
> > sustain when carrying out 500,000 requests with 100 requests being
> > carried out concurrently.
>
> Be careful not to run ab on the same machine as you run apache, otherwise
> the numerous apache processes can limit ab's throughput. This is the same
> reason as why I educate people so that they don't run a single-process
> proxy in front of a multi-process/multi-thread web server. Apparently
> it's not obvious to everyone.

I turned on apache, and played with ab a bit, and yup, ab is a hog, so
any fairness hurts it badly. Ergo, running ab on the same box as
apache suffers with CFS when NEW_FAIR_SLEEPERS are turned on. Issuing
ab bandwidth to match its 1:N pig nature brings throughput right back.
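
A rough back-of-the-envelope view of why the nice -15 trick works (an
approximation for illustration, not the kernel's exact prio_to_weight table):
CFS scales a task's load weight by roughly a factor of 1.25 per nice level,
with nice 0 at weight 1024, so ab at nice -15 carries on the order of 28x
the weight of a nice-0 apache worker.

#include <stdio.h>
#include <math.h>

int main(void)
{
	int nice;

	/* approximate CFS weights: 1024 at nice 0, ~1.25x per nice step */
	for (nice = -20; nice <= 19; nice += 5)
		printf("nice %3d -> approx weight %8.0f\n",
		       nice, 1024.0 * pow(1.25, -nice));

	return 0;
}

(build with -lm)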

(In all the comparison testing I've done, BFS favors hogs, and with
NEW_FAIR_SLEEPERS turned off, so does CFS, though not as much.)

Running apache on one core and ab on another (with shared cache tho),
something went south with BFS. I would have expected it to be much
closer (shrug).

Some likely not very interesting numbers below. I wasted a lot more
of my time generating them than anyone will downloading them :)

ab -n 500000 -c 100 http://localhost/openSUSE.org.html

2.6.31-bfs221-smp
Concurrency Level: 100
Time taken for tests: 43.556 seconds
Complete requests: 500000
Failed requests: 0
Write errors: 0
Total transferred: 7158558404 bytes
HTML transferred: 7027047358 bytes
Requests per second: 11479.50 [#/sec] (mean)
Time per request: 8.711 [ms] (mean)
Time per request: 0.087 [ms] (mean, across all concurrent requests)
Transfer rate: 160501.38 [Kbytes/sec] received

2.6.32-tip-smp NO_NEW_FAIR_SLEEPERS
Concurrency Level: 100
Time taken for tests: 42.834 seconds
Complete requests: 500000
Failed requests: 0
Write errors: 0
Total transferred: 7158429480 bytes
HTML transferred: 7026921590 bytes
Requests per second: 11672.84 [#/sec] (mean)
Time per request: 8.567 [ms] (mean)
Time per request: 0.086 [ms] (mean, across all concurrent requests)
Transfer rate: 163201.63 [Kbytes/sec] received

2.6.32-tip-smp NEW_FAIR_SLEEPERS
Concurrency Level: 100
Time taken for tests: 68.221 seconds
Complete requests: 500000
Failed requests: 0
Write errors: 0
Total transferred: 7158357900 bytes
HTML transferred: 7026851325 bytes
Requests per second: 7329.12 [#/sec] (mean)
Time per request: 13.644 [ms] (mean)
Time per request: 0.136 [ms] (mean, across all concurrent requests)
Transfer rate: 102469.65 [Kbytes/sec] received

2.6.32-tip-smp NEW_FAIR_SLEEPERS + ab at nice -15
Concurrency Level: 100
Time taken for tests: 42.824 seconds
Complete requests: 500000
Failed requests: 0
Write errors: 0
Total transferred: 7158451988 bytes
HTML transferred: 7026943572 bytes
Requests per second: 11675.68 [#/sec] (mean)
Time per request: 8.565 [ms] (mean)
Time per request: 0.086 [ms] (mean, across all concurrent requests)
Transfer rate: 163241.78 [Kbytes/sec] received

taskset -c 2 /etc/init.d/apache2 restart
taskset -c 3 ab -n 500000 -c 100 http://localhost/openSUSE.org.html

2.6.31-bfs221-smp
Concurrency Level: 100
Time taken for tests: 86.590 seconds
Complete requests: 500000
Failed requests: 0
Write errors: 0
Total transferred: 7158000000 bytes
HTML transferred: 7026500000 bytes
Requests per second: 5774.37 [#/sec] (mean)
Time per request: 17.318 [ms] (mean)
Time per request: 0.173 [ms] (mean, across all concurrent requests)
Transfer rate: 80728.41 [Kbytes/sec] received

2.6.32-tip-smp
Concurrency Level: 100
Time taken for tests: 48.640 seconds
Complete requests: 500000
Failed requests: 0
Write errors: 0
Total transferred: 7158000000 bytes
HTML transferred: 7026500000 bytes
Requests per second: 10279.71 [#/sec] (mean)
Time per request: 9.728 [ms] (mean)
Time per request: 0.097 [ms] (mean, across all concurrent requests)
Transfer rate: 143715.15 [Kbytes/sec] received

2009-09-16 18:27:34

by Frans Pop

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Benjamin Herrenschmidt wrote:
> I'll have a look after the merge window madness. Multiple windows is
> also still an option I suppose even if i don't like it that much: we
> could support double-click on an app or "global" in the left list,
> making that pop a new window with the same content as the right pane for
> that app (or global) that updates at the same time as the rest.

I have another request. If I select a specific application to watch (say a
mail client) but it is idle for a while and thus has no latencies, it will
get dropped from the list and thus my selection of it will be lost.

It would be nice if in that case a selected application would stay visible
and selected, or maybe get reselected automatically when it appears again.

Thanks,
FJP

2009-09-16 19:45:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23


* Serge Belyshev <[email protected]> wrote:

> Note that the disabling NEW_FAIR_SLEEPERS doesn't fix 3% regression
> from v2.6.23, but instead makes "make -j4" runtime another 2% worse
> (27.05 -> 27.72).

Ok, i think we've got a handle on that finally - mind checking latest
-tip?

Ingo

2009-09-16 23:18:32

by Serge Belyshev

[permalink] [raw]
Subject: Re: Epic regression in throughput since v2.6.23

Ingo Molnar <[email protected]> writes:

> Ok, i think we've got a handle on that finally - mind checking latest
> -tip?

Kernel build benchmark:
http://img11.imageshack.us/img11/4544/makej20090916.png

I have also repeated video encode benchmarks described here:
http://article.gmane.org/gmane.linux.kernel/889444

"x264 --preset ultrafast":
http://img11.imageshack.us/img11/9020/ultrafast20090916.png

"x264 --preset medium":
http://img11.imageshack.us/img11/7729/medium20090916.png

2009-09-17 01:30:19

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

On Wed, 2009-09-16 at 20:27 +0200, Frans Pop wrote:
> Benjamin Herrenschmidt wrote:
> > I'll have a look after the merge window madness. Multiple windows is
> > also still an option I suppose even if i don't like it that much: we
> > could support double-click on an app or "global" in the left list,
> > making that pop a new window with the same content as the right pane for
> > that app (or global) that updates at the same time as the rest.
>
> I have another request. If I select a specific application to watch (say a
> mail client) but it is idle for a while and thus has no latencies, it will
> get dropped from the list and thus my selection of it will be lost.
>
> It would be nice if in that case a selected application would stay visible
> and selected, or maybe get reselected automatically when it appears again.

Hrm... I thought I forced the selected app to remain ... or maybe I
wanted to do that and failed :-) Ok. On the list. Please ping me next
week if nothing happens.

Ben.

2009-09-17 04:55:44

by Mike Galbraith

[permalink] [raw]
Subject: [patchlet] Re: Epic regression in throughput since v2.6.23

On Wed, 2009-09-16 at 23:18 +0000, Serge Belyshev wrote:
> Ingo Molnar <[email protected]> writes:
>
> > Ok, i think we've got a handle on that finally - mind checking latest
> > -tip?
>
> Kernel build benchmark:
> http://img11.imageshack.us/img11/4544/makej20090916.png
>
> I have also repeated video encode benchmarks described here:
> http://article.gmane.org/gmane.linux.kernel/889444
>
> "x264 --preset ultrafast":
> http://img11.imageshack.us/img11/9020/ultrafast20090916.png
>
> "x264 --preset medium":
> http://img11.imageshack.us/img11/7729/medium20090916.png

Pre-ramble..
Most of the performance differences I've examined in all these CFS vs
BFS threads boil down to fair scheduler vs unfair scheduler. If you
favor hogs, naturally, hogs getting more bandwidth perform better than
hogs getting their fair share. That's wonderful for hogs, somewhat less
than wonderful for their competition. That fairness is not necessarily
the best thing for throughput is well known. If you've got a single
dissimilar task load running alone, favoring hogs may perform better..
or not. What about mixed loads though? Is the throughput of frequent
switchers less important than hog throughput?

Moving right along..

That x264 thing uncovered an interesting issue within CFS. That load is
a frequent clone() customer, and when it has to compete against a not so
fork/clone happy load, it suffers mightily. Even when running solo, ie
only competing against it's own siblings, IFF sleeper fairness is
enabled, the pain of thread startup latency is quite visible. With
concurrent loads, it is agonizingly painful.

concurrent load test
tbench 8 vs
x264 --preset ultrafast --no-scenecut --sync-lookahead 0 --qp 20 -o /dev/null --threads 8 soccer_4cif.y4m

(i can turn knobs and get whatever numbers i want, including
outperforming bfs, concurrent or solo.. not the point)

START_DEBIT
encoded 600 frames, 44.29 fps, 22096.60 kb/s
encoded 600 frames, 43.59 fps, 22096.60 kb/s
encoded 600 frames, 43.78 fps, 22096.60 kb/s
encoded 600 frames, 43.77 fps, 22096.60 kb/s
encoded 600 frames, 45.67 fps, 22096.60 kb/s

8 1068214 672.35 MB/sec execute 57 sec
8 1083785 672.16 MB/sec execute 58 sec
8 1099188 672.18 MB/sec execute 59 sec
8 1114626 672.00 MB/sec cleanup 60 sec
8 1114626 671.96 MB/sec cleanup 60 sec

NO_START_DEBIT
encoded 600 frames, 123.19 fps, 22096.60 kb/s
encoded 600 frames, 123.85 fps, 22096.60 kb/s
encoded 600 frames, 120.05 fps, 22096.60 kb/s
encoded 600 frames, 123.43 fps, 22096.60 kb/s
encoded 600 frames, 121.27 fps, 22096.60 kb/s

8 848135 533.79 MB/sec execute 57 sec
8 860829 534.08 MB/sec execute 58 sec
8 872840 533.74 MB/sec execute 59 sec
8 885036 533.66 MB/sec cleanup 60 sec
8 885036 533.64 MB/sec cleanup 60 sec

2.6.31-bfs221-smp
encoded 600 frames, 169.00 fps, 22096.60 kb/s
encoded 600 frames, 163.85 fps, 22096.60 kb/s
encoded 600 frames, 161.00 fps, 22096.60 kb/s
encoded 600 frames, 155.57 fps, 22096.60 kb/s
encoded 600 frames, 162.01 fps, 22096.60 kb/s

8 458328 287.67 MB/sec execute 57 sec
8 464442 288.68 MB/sec execute 58 sec
8 471129 288.71 MB/sec execute 59 sec
8 477643 288.61 MB/sec cleanup 60 sec
8 477643 288.60 MB/sec cleanup 60 sec

patchlet:

sched: disable START_DEBIT.

START_DEBIT induces unfairness to loads which fork/clone frequently when they
must compete against loads which do not.


Signed-off-by: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
LKML-Reference: <new-submission>

kernel/sched_features.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index d5059fd..2fc94a0 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -23,7 +23,7 @@ SCHED_FEAT(NORMALIZED_SLEEPER, 0)
* Place new tasks ahead so that they do not starve already running
* tasks
*/
-SCHED_FEAT(START_DEBIT, 1)
+SCHED_FEAT(START_DEBIT, 0)

/*
* Should wakeups try to preempt running tasks.
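
To make the fairness argument concrete, a toy illustration (not the kernel's
place_entity() code, and the numbers are made up): with START_DEBIT a newly
forked or cloned task is placed roughly one scheduling slice of virtual
runtime behind the queue minimum, so a load that clones constantly keeps
paying that startup debit while a non-forking competitor never does.

#include <stdio.h>

int main(void)
{
	/* illustrative values only */
	unsigned long long min_vruntime = 1000000ULL;	/* ns */
	unsigned long long slice        = 4000000ULL;	/* ~4ms */

	unsigned long long no_debit = min_vruntime;		/* NO_START_DEBIT */
	unsigned long long debit    = min_vruntime + slice;	/* START_DEBIT   */

	printf("new task placed at vruntime %llu without debit\n", no_debit);
	printf("new task placed at vruntime %llu with debit (%llu ns behind)\n",
	       debit, debit - no_debit);
	return 0;
}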

2009-09-17 05:06:41

by Mike Galbraith

[permalink] [raw]
Subject: Re: [patchlet] Re: Epic regression in throughput since v2.6.23

Aw poo, forgot to add Peter to CC list before poking xmit.

On Thu, 2009-09-17 at 06:55 +0200, Mike Galbraith wrote:
> On Wed, 2009-09-16 at 23:18 +0000, Serge Belyshev wrote:
> > Ingo Molnar <[email protected]> writes:
> >
> > > Ok, i think we've got a handle on that finally - mind checking latest
> > > -tip?
> >
> > Kernel build benchmark:
> > http://img11.imageshack.us/img11/4544/makej20090916.png
> >
> > I have also repeated video encode benchmarks described here:
> > http://article.gmane.org/gmane.linux.kernel/889444
> >
> > "x264 --preset ultrafast":
> > http://img11.imageshack.us/img11/9020/ultrafast20090916.png
> >
> > "x264 --preset medium":
> > http://img11.imageshack.us/img11/7729/medium20090916.png
>
> Pre-ramble..
> Most of the performance differences I've examined in all these CFS vs
> BFS threads boil down to fair scheduler vs unfair scheduler. If you
> favor hogs, naturally, hogs getting more bandwidth perform better than
> hogs getting their fair share. That's wonderful for hogs, somewhat less
> than wonderful for their competition. That fairness is not necessarily
> the best thing for throughput is well known. If you've got a single
> dissimilar task load running alone, favoring hogs may perform better..
> or not. What about mixed loads though? Is the throughput of frequent
> switchers less important than hog throughput?
>
> Moving right along..
>
> That x264 thing uncovered an interesting issue within CFS. That load is
> a frequent clone() customer, and when it has to compete against a not so
> fork/clone happy load, it suffers mightily. Even when running solo, ie
> only competing against it's own siblings, IFF sleeper fairness is
> enabled, the pain of thread startup latency is quite visible. With
> concurrent loads, it is agonizingly painful.
>
> concurrent load test
> tbench 8 vs
> x264 --preset ultrafast --no-scenecut --sync-lookahead 0 --qp 20 -o /dev/null --threads 8 soccer_4cif.y4m
>
> (i can turn knobs and get whatever numbers i want, including
> outperforming bfs, concurrent or solo.. not the point)
>
> START_DEBIT
> encoded 600 frames, 44.29 fps, 22096.60 kb/s
> encoded 600 frames, 43.59 fps, 22096.60 kb/s
> encoded 600 frames, 43.78 fps, 22096.60 kb/s
> encoded 600 frames, 43.77 fps, 22096.60 kb/s
> encoded 600 frames, 45.67 fps, 22096.60 kb/s
>
> 8 1068214 672.35 MB/sec execute 57 sec
> 8 1083785 672.16 MB/sec execute 58 sec
> 8 1099188 672.18 MB/sec execute 59 sec
> 8 1114626 672.00 MB/sec cleanup 60 sec
> 8 1114626 671.96 MB/sec cleanup 60 sec
>
> NO_START_DEBIT
> encoded 600 frames, 123.19 fps, 22096.60 kb/s
> encoded 600 frames, 123.85 fps, 22096.60 kb/s
> encoded 600 frames, 120.05 fps, 22096.60 kb/s
> encoded 600 frames, 123.43 fps, 22096.60 kb/s
> encoded 600 frames, 121.27 fps, 22096.60 kb/s
>
> 8 848135 533.79 MB/sec execute 57 sec
> 8 860829 534.08 MB/sec execute 58 sec
> 8 872840 533.74 MB/sec execute 59 sec
> 8 885036 533.66 MB/sec cleanup 60 sec
> 8 885036 533.64 MB/sec cleanup 60 sec
>
> 2.6.31-bfs221-smp
> encoded 600 frames, 169.00 fps, 22096.60 kb/s
> encoded 600 frames, 163.85 fps, 22096.60 kb/s
> encoded 600 frames, 161.00 fps, 22096.60 kb/s
> encoded 600 frames, 155.57 fps, 22096.60 kb/s
> encoded 600 frames, 162.01 fps, 22096.60 kb/s
>
> 8 458328 287.67 MB/sec execute 57 sec
> 8 464442 288.68 MB/sec execute 58 sec
> 8 471129 288.71 MB/sec execute 59 sec
> 8 477643 288.61 MB/sec cleanup 60 sec
> 8 477643 288.60 MB/sec cleanup 60 sec
>
> patchlet:
>
> sched: disable START_DEBIT.
>
> START_DEBIT induces unfairness to loads which fork/clone frequently when they
> must compete against loads which do not.
>
>
> Signed-off-by: Mike Galbraith <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> LKML-Reference: <new-submission>
>
> kernel/sched_features.h | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched_features.h b/kernel/sched_features.h
> index d5059fd..2fc94a0 100644
> --- a/kernel/sched_features.h
> +++ b/kernel/sched_features.h
> @@ -23,7 +23,7 @@ SCHED_FEAT(NORMALIZED_SLEEPER, 0)
> * Place new tasks ahead so that they do not starve already running
> * tasks
> */
> -SCHED_FEAT(START_DEBIT, 1)
> +SCHED_FEAT(START_DEBIT, 0)
>
> /*
> * Should wakeups try to preempt running tasks.
>

2009-09-17 07:21:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patchlet] Re: Epic regression in throughput since v2.6.23


here's some start-debit versus non-start-debit numbers.

The workload: on a dual-core box start and kill 10 loops, once every
second. PID 23137 is a shell doing interactive stuff. (running a loop of
usleep 100000 and echo)

START_DEBIT:

europe:~> perf sched lat | grep 23137
bash:23137 | 34.380 ms | 187 | avg: 0.005 ms | max: 0.017 ms |
bash:23137 | 36.410 ms | 188 | avg: 0.005 ms | max: 0.011 ms |
bash:23137 | 36.680 ms | 183 | avg: 0.007 ms | max: 0.333 ms |

NO_START_DEBIT:

europe:~> perf sched lat | grep 23137
bash:23137 | 35.531 ms | 183 | avg: 0.005 ms | max: 0.019 ms |
bash:23137 | 35.511 ms | 188 | avg: 0.007 ms | max: 0.334 ms |
bash:23137 | 35.774 ms | 185 | avg: 0.005 ms | max: 0.019 ms |

Seems very similar at first sight.

Ingo

2009-09-18 11:25:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Michael Buesch <[email protected]> wrote:

> On Tuesday 08 September 2009 09:48:25 Ingo Molnar wrote:
> > Mind poking on this one to figure out whether it's all repeatable
> > and why that slowdown happens?
>
> I repeated the test several times, because I couldn't really believe
> that there's such a big difference for me, but the results were the
> same. I don't really know what's going on nor how to find out what's
> going on.

Well that's a really memory constrained MIPS device with like 16 MB of
RAM or so? So having effects from small things like changing details in
a kernel image is entirely plausible.

Ingo

2009-09-18 14:46:29

by Felix Fietkau

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Ingo Molnar wrote:
> * Michael Buesch <[email protected]> wrote:
>
>> On Tuesday 08 September 2009 09:48:25 Ingo Molnar wrote:
>> > Mind poking on this one to figure out whether it's all repeatable
>> > and why that slowdown happens?
>>
>> I repeated the test several times, because I couldn't really believe
>> that there's such a big difference for me, but the results were the
>> same. I don't really know what's going on nor how to find out what's
>> going on.
>
> Well that's a really memory constrained MIPS device with like 16 MB of
> RAM or so? So having effects from small things like changing details in
> a kernel image is entirely plausible.
Normally changing small details doesn't have much of an effect. While 16
MB is indeed not that much, we do usually have around 8 MB free with a
full user space running. Changes to other subsystems normally produce
consistent and repeatable differences that seem entirely unrelated to
memory use, so any measurable difference related to scheduler changes is
unlikely to be related to the low amount of RAM.
By the way, we do frequently also test the same software with devices
that have more RAM, e.g. 32 or 64 MB and it usually behaves in a very
similar way.

- Felix

2009-09-19 18:01:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Felix Fietkau <[email protected]> wrote:

> Ingo Molnar wrote:
> > * Michael Buesch <[email protected]> wrote:
> >
> >> On Tuesday 08 September 2009 09:48:25 Ingo Molnar wrote:
> >> > Mind poking on this one to figure out whether it's all repeatable
> >> > and why that slowdown happens?
> >>
> >> I repeated the test several times, because I couldn't really believe
> >> that there's such a big difference for me, but the results were the
> >> same. I don't really know what's going on nor how to find out what's
> >> going on.
> >
> > Well that's a really memory constrained MIPS device with like 16 MB of
> > RAM or so? So having effects from small things like changing details in
> > a kernel image is entirely plausible.
>
> Normally changing small details doesn't have much of an effect. While
> 16 MB is indeed not that much, we do usually have around 8 MB free
> with a full user space running. Changes to other subsystems normally
> produce consistent and repeatable differences that seem entirely
> unrelated to memory use, so any measurable difference related to
> scheduler changes is unlikely to be related to the low amount of RAM.
> By the way, we do frequently also test the same software with devices
> that have more RAM, e.g. 32 or 64 MB and it usually behaves in a very
> similar way.

Well, Michael Buesch posted vmstat results, and they show what i have
found with my x86 simulated reproducer as well (these are Michael's
numbers):

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 15892 1684 5868 0 0 0 0 268 6 31 69 0 0
1 0 0 15892 1684 5868 0 0 0 0 266 2 34 66 0 0
1 0 0 15892 1684 5868 0 0 0 0 266 6 33 67 0 0
1 0 0 15892 1684 5868 0 0 0 0 267 4 37 63 0 0
1 0 0 15892 1684 5868 0 0 0 0 267 6 34 66 0 0

on average 4 context switches _per second_. The scheduler is not a
factor on this box.

Furthermore:

| I'm currently unable to test BFS, because the device throws strange
| flash errors. Maybe the flash is broken :(

So maybe those flash errors somehow impacted the measurements as well?

Ingo

2009-09-19 18:44:07

by Felix Fietkau

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Ingo Molnar wrote:
> * Felix Fietkau <[email protected]> wrote:
>
>> Ingo Molnar wrote:
>> > Well that's a really memory constrained MIPS device with like 16 MB of
>> > RAM or so? So having effects from small things like changing details in
>> > a kernel image is entirely plausible.
>>
>> Normally changing small details doesn't have much of an effect. While
>> 16 MB is indeed not that much, we do usually have around 8 MB free
>> with a full user space running. Changes to other subsystems normally
>> produce consistent and repeatable differences that seem entirely
>> unrelated to memory use, so any measurable difference related to
>> scheduler changes is unlikely to be related to the low amount of RAM.
>> By the way, we do frequently also test the same software with devices
>> that have more RAM, e.g. 32 or 64 MB and it usually behaves in a very
>> similar way.
>
> Well, Michael Buesch posted vmstat results, and they show what i have
> found with my x86 simulated reproducer as well (these are Michael's
> numbers):
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 1 0 0 15892 1684 5868 0 0 0 0 268 6 31 69 0 0
> 1 0 0 15892 1684 5868 0 0 0 0 266 2 34 66 0 0
> 1 0 0 15892 1684 5868 0 0 0 0 266 6 33 67 0 0
> 1 0 0 15892 1684 5868 0 0 0 0 267 4 37 63 0 0
> 1 0 0 15892 1684 5868 0 0 0 0 267 6 34 66 0 0
>
> on average 4 context switches _per second_. The scheduler is not a
> factor on this box.
>
> Furthermore:
>
> | I'm currently unable to test BFS, because the device throws strange
> | flash errors. Maybe the flash is broken :(
>
> So maybe those flash errors somehow impacted the measurements as well?
I did some tests with BFS v230 vs CFS on Linux 2.6.30 on a different
MIPS device (Atheros AR2317) with 180 MHz and 16 MB RAM. When running
iperf tests, I consistently get the following results when running the
transfer from the device to my laptop:

CFS: [ 5] 0.0-60.0 sec 107 MBytes 15.0 Mbits/sec
BFS: [ 5] 0.0-60.0 sec 119 MBytes 16.6 Mbits/sec

The transfer speed from my laptop to the device are the same with BFS
and CFS. I repeated the tests a few times just to be sure, and I will
check vmstat later.
The difference here cannot be flash related, as I ran a kernel image
with the whole userland contained in initramfs. No on-flash filesystem
was mounted or accessed.

- Felix

2009-09-19 19:40:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Felix Fietkau <[email protected]> wrote:

> Ingo Molnar wrote:
> > * Felix Fietkau <[email protected]> wrote:
> >
> >> Ingo Molnar wrote:
> >> > Well that's a really memory constrained MIPS device with like 16 MB of
> >> > RAM or so? So having effects from small things like changing details in
> >> > a kernel image is entirely plausible.
> >>
> >> Normally changing small details doesn't have much of an effect. While
> >> 16 MB is indeed not that much, we do usually have around 8 MB free
> >> with a full user space running. Changes to other subsystems normally
> >> produce consistent and repeatable differences that seem entirely
> >> unrelated to memory use, so any measurable difference related to
> >> scheduler changes is unlikely to be related to the low amount of RAM.
> >> By the way, we do frequently also test the same software with devices
> >> that have more RAM, e.g. 32 or 64 MB and it usually behaves in a very
> >> similar way.
> >
> > Well, Michael Buesch posted vmstat results, and they show what i have
> > found with my x86 simulated reproducer as well (these are Michael's
> > numbers):
> >
> > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> > r b swpd free buff cache si so bi bo in cs us sy id wa
> > 1 0 0 15892 1684 5868 0 0 0 0 268 6 31 69 0 0
> > 1 0 0 15892 1684 5868 0 0 0 0 266 2 34 66 0 0
> > 1 0 0 15892 1684 5868 0 0 0 0 266 6 33 67 0 0
> > 1 0 0 15892 1684 5868 0 0 0 0 267 4 37 63 0 0
> > 1 0 0 15892 1684 5868 0 0 0 0 267 6 34 66 0 0
> >
> > on average 4 context switches _per second_. The scheduler is not a
> > factor on this box.
> >
> > Furthermore:
> >
> > | I'm currently unable to test BFS, because the device throws strange
> > | flash errors. Maybe the flash is broken :(
> >
> > So maybe those flash errors somehow impacted the measurements as well?
> I did some tests with BFS v230 vs CFS on Linux 2.6.30 on a different
> MIPS device (Atheros AR2317) with 180 MHz and 16 MB RAM. When running
> iperf tests, I consistently get the following results when running the
> transfer from the device to my laptop:
>
> CFS: [ 5] 0.0-60.0 sec 107 MBytes 15.0 Mbits/sec
> BFS: [ 5] 0.0-60.0 sec 119 MBytes 16.6 Mbits/sec
>
> The transfer speed from my laptop to the device are the same with BFS
> and CFS. I repeated the tests a few times just to be sure, and I will
> check vmstat later.

Which exact mainline kernel have you tried? For anything performance
related running latest upstream -git (currently at 202c467) would be
recommended.

Ingo

2009-09-19 20:15:14

by Felix Fietkau

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Ingo Molnar wrote:
> * Felix Fietkau <[email protected]> wrote:
>> I did some tests with BFS v230 vs CFS on Linux 2.6.30 on a different
>> MIPS device (Atheros AR2317) with 180 MHz and 16 MB RAM. When running
>> iperf tests, I consistently get the following results when running the
>> transfer from the device to my laptop:
>>
>> CFS: [ 5] 0.0-60.0 sec 107 MBytes 15.0 Mbits/sec
>> BFS: [ 5] 0.0-60.0 sec 119 MBytes 16.6 Mbits/sec
>>
>> The transfer speed from my laptop to the device are the same with BFS
>> and CFS. I repeated the tests a few times just to be sure, and I will
>> check vmstat later.
>
> Which exact mainline kernel have you tried? For anything performance
> related running latest upstream -git (currently at 202c467) would be
> recommended.
I used the OpenWrt-patched 2.6.30. Support for the hardware that I
tested with hasn't been merged upstream yet. Do you think that the
scheduler related changes after 2.6.30 are relevant for non-SMP
performance as well? If so, I'll work on a test with latest upstream
-git with the necessary patches when I have time for it.

- Felix

2009-09-19 20:22:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Felix Fietkau <[email protected]> wrote:

> Ingo Molnar wrote:
> > * Felix Fietkau <[email protected]> wrote:
> >> I did some tests with BFS v230 vs CFS on Linux 2.6.30 on a different
> >> MIPS device (Atheros AR2317) with 180 MHz and 16 MB RAM. When running
> >> iperf tests, I consistently get the following results when running the
> >> transfer from the device to my laptop:
> >>
> >> CFS: [ 5] 0.0-60.0 sec 107 MBytes 15.0 Mbits/sec
> >> BFS: [ 5] 0.0-60.0 sec 119 MBytes 16.6 Mbits/sec
> >>
> >> The transfer speed from my laptop to the device are the same with BFS
> >> and CFS. I repeated the tests a few times just to be sure, and I will
> >> check vmstat later.
> >
> > Which exact mainline kernel have you tried? For anything performance
> > related running latest upstream -git (currently at 202c467) would be
> > recommended.
>
> I used the OpenWrt-patched 2.6.30. Support for the hardware that I
> tested with hasn't been merged upstream yet. Do you think that the
> scheduler related changes after 2.6.30 are relevant for non-SMP
> performance as well? If so, I'll work on a test with latest upstream
> -git with the necessary patches when I have time for it.

Dont know - it's hard to tell what happens without basic analysis tools.
Is there _any_ way to profile what happens on that system? (Do hrtimers
work on it that could be used to profile it?)

Ingo

2009-09-19 20:34:03

by Felix Fietkau

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Ingo Molnar wrote:
> * Felix Fietkau <[email protected]> wrote:
>
>> Ingo Molnar wrote:
>> > * Felix Fietkau <[email protected]> wrote:
>> >> I did some tests with BFS v230 vs CFS on Linux 2.6.30 on a different
>> >> MIPS device (Atheros AR2317) with 180 MHz and 16 MB RAM. When running
>> >> iperf tests, I consistently get the following results when running the
>> >> transfer from the device to my laptop:
>> >>
>> >> CFS: [ 5] 0.0-60.0 sec 107 MBytes 15.0 Mbits/sec
>> >> BFS: [ 5] 0.0-60.0 sec 119 MBytes 16.6 Mbits/sec
>> >>
>> >> The transfer speed from my laptop to the device are the same with BFS
>> >> and CFS. I repeated the tests a few times just to be sure, and I will
>> >> check vmstat later.
>> >
>> > Which exact mainline kernel have you tried? For anything performance
>> > related running latest upstream -git (currently at 202c467) would be
>> > recommended.
>>
>> I used the OpenWrt-patched 2.6.30. Support for the hardware that I
>> tested with hasn't been merged upstream yet. Do you think that the
>> scheduler related changes after 2.6.30 are relevant for non-SMP
>> performance as well? If so, I'll work on a test with latest upstream
>> -git with the necessary patches when I have time for it.
>
> Dont know - it's hard to tell what happens without basic analysis tools.
> Is there _any_ way to profile what happens on that system? (Do hrtimers
> work on it that could be used to profile it?)
oprofile doesn't have any support for it (mips r4k, no generic
perfcounters), the only usable clock source is a simple cpu cycle
counter (which is also used for the timer interrupt).

- Felix

2009-09-20 18:10:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements


* Felix Fietkau <[email protected]> wrote:

> Ingo Molnar wrote:
> > * Felix Fietkau <[email protected]> wrote:
> >
> >> Ingo Molnar wrote:
> >> > * Felix Fietkau <[email protected]> wrote:
> >> >> I did some tests with BFS v230 vs CFS on Linux 2.6.30 on a different
> >> >> MIPS device (Atheros AR2317) with 180 MHz and 16 MB RAM. When running
> >> >> iperf tests, I consistently get the following results when running the
> >> >> transfer from the device to my laptop:
> >> >>
> >> >> CFS: [ 5] 0.0-60.0 sec 107 MBytes 15.0 Mbits/sec
> >> >> BFS: [ 5] 0.0-60.0 sec 119 MBytes 16.6 Mbits/sec
> >> >>
> >> >> The transfer speed from my laptop to the device are the same with BFS
> >> >> and CFS. I repeated the tests a few times just to be sure, and I will
> >> >> check vmstat later.
> >> >
> >> > Which exact mainline kernel have you tried? For anything performance
> >> > related running latest upstream -git (currently at 202c467) would be
> >> > recommended.
> >>
> >> I used the OpenWrt-patched 2.6.30. Support for the hardware that I
> >> tested with hasn't been merged upstream yet. Do you think that the
> >> scheduler related changes after 2.6.30 are relevant for non-SMP
> >> performance as well? If so, I'll work on a test with latest upstream
> >> -git with the necessary patches when I have time for it.
> >
> > Dont know - it's hard to tell what happens without basic analysis tools.
> > Is there _any_ way to profile what happens on that system? (Do hrtimers
> > work on it that could be used to profile it?)
>
> oprofile doesn't have any support for it (mips r4k, no generic
> perfcounters), the only usable clock source is a simple cpu cycle
> counter (which is also used for the timer interrupt).

A simple cpu cycle counter ought to be enough to get pretty good
perfcounters support going on that box.

It takes a surprisingly small amount of code to do that, and a large
portion of the perf tooling should then work out of box. Here's a few
example commits of minimal perfcounters support, on other architectures:

310d6b6: [S390] wire up sys_perf_counter_open
2d4618d: parisc: perf: wire up sys_perf_counter_open
19470e1: sh: Wire up sys_perf_counter_open.

Takes about 15 well placed lines of code, if there are no other
complications on MIPS ;-)
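
Very roughly, "wiring up" a syscall amounts to defining the syscall number
and adding a matching entry to the architecture's syscall table; the sketch
below is generic and the number is purely illustrative (the commits above
are the authoritative examples):

/* in the architecture's unistd.h; 336 is an illustrative number only */
#define __NR_perf_counter_open		336

/* plus one entry appended to the architecture's syscall table, e.g.
 *	.long sys_perf_counter_open			(assembly-style tables)
 * or
 *	[__NR_perf_counter_open] = sys_perf_counter_open,	(C-style tables)
 */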

Ingo

2009-10-01 09:36:03

by Frans Pop

[permalink] [raw]
Subject: Re: BFS vs. mainline scheduler benchmarks and measurements

Benjamin Herrenschmidt wrote:
> On Wed, 2009-09-16 at 20:27 +0200, Frans Pop wrote:
>> Benjamin Herrenschmidt wrote:
>> > I'll have a look after the merge window madness. Multiple windows is
>> > also still an option I suppose even if i don't like it that much: we
>> > could support double-click on an app or "global" in the left list,
>> > making that pop a new window with the same content as the right pane
>> > for that app (or global) that updates at the same time as the rest.
>>
>> I have another request. If I select a specific application to watch (say
>> a mail client) but it is idle for a while and thus has no latencies, it
>> will get dropped from the list and thus my selection of it will be lost.
>>
>> It would be nice if in that case a selected application would stay
>> visible and selected, or maybe get reselected automatically when it
>> appears again.
>
> Hrm... I thought I forced the selected app to remain ... or maybe I
> wanted to do that and failed :-) Ok. On the list. Please ping me next
> week if nothing happens.

As requested: ping?

And while I'm writing anyway, one more suggestion.
I find the fact that the buttons jump twice every 30 seconds (because of a
change in the timer between <10 and >=10 seconds) slightly annoying.
Any chance of making the position of the buttons fixed? One option could be
moving the timer to the left side of the bottom bar.

Cheers,
FJP