well well :) nothing quite speaks out like graphs..
http://doom10.org/index.php?topic=78.0
regards,
Kasper Sandberg
On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <[email protected]> wrote:
> well well :) nothing quite speaks out like graphs..
>
> http://doom10.org/index.php?topic=78.0
>
>
>
> regards,
> Kasper Sandberg
Yeah, I sent this to Mike a bit ago. Seems that .32 has basically
tied it--and given the strict thread-ordering expectations of x264,
you basically can't expect it to do any better, though I'm curious
what's responsible for the gap in "veryslow", even with SCHED_BATCH
enabled.
The most odd case is that of "ultrafast", in which CFS immediately
ties BFS when we enable SCHED_BATCH. We're doing some further testing
to see exactly what the conditions of this are--is it because
ultrafast is just so much faster than all the other modes and so
switches threads/loads faster? Is it because ultrafast has relatively
equal workload among the threads, unlike the other loads? We'll
probably know soon.
Jason
* Jason Garrett-Glaser <[email protected]> wrote:
> On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <[email protected]> wrote:
> > well well :) nothing quite speaks out like graphs..
> >
> > http://doom10.org/index.php?topic=78.0
> >
> >
> >
> > regards,
> > Kasper Sandberg
>
> Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied
> it--and given the strict thread-ordering expectations of x264, you basically
> can't expect it to do any better, though I'm curious what's responsible for
> the gap in "veryslow", even with SCHED_BATCH enabled.
>
> The most odd case is that of "ultrafast", in which CFS immediately ties BFS
> when we enable SCHED_BATCH. We're doing some further testing to see exactly
> what the conditions of this are--is it because ultrafast is just so much
> faster than all the other modes and so switches threads/loads faster? Is it
> because ultrafast has relatively equal workload among the threads, unlike
> the other loads? We'll probably know soon.
Thanks for testing it!
BTW, you might want to make use of 'perf sched record', 'perf sched map',
'perf sched trace', etc. to get insight into how a particular workload
schedules and why those decisions are made. (You'll need CONFIG_SCHED_DEBUG=y
for best results.)
Thanks,
Ingo
On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote:
> * Jason Garrett-Glaser <[email protected]> wrote:
>
> > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <[email protected]> wrote:
> > > well well :) nothing quite speaks out like graphs..
> > >
> > > http://doom10.org/index.php?topic=78.0
> > >
> > >
> > >
> > > regards,
> > > Kasper Sandberg
> >
> > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied
> > it--and given the strict thread-ordering expectations of x264, you basically
> > can't expect it to do any better, though I'm curious what's responsible for
> > the gap in "veryslow", even with SCHED_BATCH enabled.
> >
> > The most odd case is that of "ultrafast", in which CFS immediately ties BFS
> > when we enable SCHED_BATCH. We're doing some further testing to see exactly
That's kind of beside the point.
All these tunables and weirdness are _NEVER_ going to work for people.
Now forgive me for being so blunt, but for a user, having to do
echo x264 > /proc/cfs/gief_me_performance_on_app
or
echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app
just isn't usable. BFS matches, even exceeds, CFS on all counts, with
ZERO user tuning, so while CFS may be able to nearly match up with a ton
of application-specific tuning, that just doesn't work for a normal user.
Not to mention that BFS does this whilst not losing interactivity,
something which CFS certainly cannot boast.
<snip>
> Thanks,
>
> Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
* Kasper Sandberg <[email protected]> wrote:
> On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote:
> > * Jason Garrett-Glaser <[email protected]> wrote:
> >
> > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <[email protected]> wrote:
> > > > well well :) nothing quite speaks out like graphs..
> > > >
> > > > http://doom10.org/index.php?topic=78.0
> > > >
> > > >
> > > >
> > > > regards,
> > > > Kasper Sandberg
> > >
> > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied
> > > it--and given the strict thread-ordering expectations of x264, you basically
> > > can't expect it to do any better, though I'm curious what's responsible for
> > > the gap in "veryslow", even with SCHED_BATCH enabled.
> > >
> > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS
> > > when we enable SCHED_BATCH. We're doing some further testing to see exactly
>
> Thats kinda besides the point.
>
> all these tunables and weirdness is _NEVER_ going to work for people.
v2.6.32 improved quite a bit on the x264 front, so I don't think that's
necessarily the case.
But yes, I'll subscribe to the view that we cannot satisfy everything all the
time. There are tradeoffs in every scheduler design.
> now forgive me for being so blunt, but for a user, having to do
> echo x264 > /proc/cfs/gief_me_performance_on_app
> or
> echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app
>
> just isnt usable, bfs matches, even exceeds cfs on all accounts, with ZERO
> user tuning, so while cfs may be able to nearly match up with a ton of
> application specific stuff, that just doesnt work for a normal user.
>
> not to mention that bfs does this whilst not loosing interactivity,
> something which cfs certainly cannot boast.
What kind of latencies are those? Aren't they just Compiz-induced, due to
different weighting of workloads in BFS and in the upstream scheduler?
Would you be willing to help us pin them down?
To move the discussion to the numeric front please send the 'perf sched
latency' output of an affected workload.
Thanks,
Ingo
On Thu, 2009-12-17 at 13:08 +0100, Ingo Molnar wrote:
> * Kasper Sandberg <[email protected]> wrote:
>
> > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote:
> > > * Jason Garrett-Glaser <[email protected]> wrote:
> > >
> > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <[email protected]> wrote:
> > > > > well well :) nothing quite speaks out like graphs..
> > > > >
> > > > > http://doom10.org/index.php?topic=78.0
> > > > >
> > > > >
> > > > >
> > > > > regards,
> > > > > Kasper Sandberg
> > > >
> > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied
> > > > it--and given the strict thread-ordering expectations of x264, you basically
> > > > can't expect it to do any better, though I'm curious what's responsible for
> > > > the gap in "veryslow", even with SCHED_BATCH enabled.
> > > >
> > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS
> > > > when we enable SCHED_BATCH. We're doing some further testing to see exactly
> >
> > Thats kinda besides the point.
> >
> > all these tunables and weirdness is _NEVER_ going to work for people.
>
> v2.6.32 improved quite a bit on the x264 front so i dont think that's
> necessarily the case.
Again, that's pretty much application-specific, and furthermore, ONLY with
SCHED_BATCH is it near BFS. As you know, SCHED_BATCH isn't exactly what
you want to use for desktop or other interactivity-hungry tasks. BFS
manages better performance than CFS with SCHED_BATCH, without using
SCHED_BATCH at all.
>
> But yes, i'll subscribe to the view that we cannot satisfy everything all the
> time. There's tradeoffs in every scheduler design.
Yet getting performance from CFS that is not even as good, on average, as
BFS requires tunables, switching scheduler policies, etc.
>
> > now forgive me for being so blunt, but for a user, having to do
> > echo x264 > /proc/cfs/gief_me_performance_on_app
> > or
> > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app
> >
> > just isnt usable, bfs matches, even exceeds cfs on all accounts, with ZERO
> > user tuning, so while cfs may be able to nearly match up with a ton of
> > application specific stuff, that just doesnt work for a normal user.
^^^^ This is also something you need to consider.
> >
> > not to mention that bfs does this whilst not loosing interactivity,
> > something which cfs certainly cannot boast.
>
> What kind of latencies are those? Arent they just compiz induced due to
> different weighting of workloads in BFS and in the upstream scheduler?
> Would you be willing to help us out pinning them down?
There's not much I can do; I don't have time to switch kernels on my
systems. All I can give you is this simple information: on my systems,
ranging from embedded to dual Core 2 Quad and Core i7, BFS manages to
give lower latencies (i.e. JACK doesn't skip with very-low-latency
output, and everything is smoother, even measurably so, on the desktop)
and greater performance (as evidenced by lots of benchmarks, including
those I posted), and that is without touching a single scheduler policy
or tunable at all.
I'm well aware that CFS can be tweaked via tunables/policies to achieve
one of these goals at a time, and I'm also well aware you cannot ever
handle every single corner case perfectly with one scheduler. However,
and consider this very thoroughly: BFS manages, without any tunables, to
handle the vast majority of cases with an excellence CFS cannot fully
match even with tunables and scheduler policies, and with A LOT less
code as well. This ought to tell you that something can and should be
done.
>
> To move the discussion to the numeric front please send the 'perf sched
> latency' output of an affected workload.
>
> Thanks,
>
> Ingo
On Thu, 2009-12-17 at 12:00 +0100, Kasper Sandberg wrote:
> On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote:
> > * Jason Garrett-Glaser <[email protected]> wrote:
> >
> > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <[email protected]> wrote:
> > > > well well :) nothing quite speaks out like graphs..
> > > >
> > > > http://doom10.org/index.php?topic=78.0
> > > >
> > > >
> > > >
> > > > regards,
> > > > Kasper Sandberg
> > >
> > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied
> > > it--and given the strict thread-ordering expectations of x264, you basically
> > > can't expect it to do any better, though I'm curious what's responsible for
> > > the gap in "veryslow", even with SCHED_BATCH enabled.
> > >
> > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS
> > > when we enable SCHED_BATCH. We're doing some further testing to see exactly
>
> Thats kinda besides the point.
>
> all these tunables and weirdness is _NEVER_ going to work for people.
Fact is, it is working for a great number of people, the vast majority
of whom don't even know where the knobs are, much less what they do.
> now forgive me for being so blunt, but for a user, having to do
> echo x264 > /proc/cfs/gief_me_performance_on_app
> or
> echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app
Theatrics noted.
> just isnt usable, bfs matches, even exceeds cfs on all accounts, with
> ZERO user tuning, so while cfs may be able to nearly match up with a ton
> of application specific stuff, that just doesnt work for a normal user.
Seems you haven't done much benchmarking. BFS has strengths as well as
weaknesses; all schedulers do.
> not to mention that bfs does this whilst not loosing interactivity,
> something which cfs certainly cannot boast.
Not true. I sent Con hard evidence of a severe problem area wrt
interactivity, and hard numbers showing other places where BFS needs
some work. But hey, if BFS blows your skirt up, use it and be happy.
-Mike
On Thu, 17 Dec 2009 13:08:26 +0100
Ingo Molnar <[email protected]> wrote:
> >
> > not to mention that bfs does this whilst not loosing interactivity,
> > something which cfs certainly cannot boast.
>
> What kind of latencies are those? Arent they just compiz induced due
> to different weighting of workloads in BFS and in the upstream
> scheduler? Would you be willing to help us out pinning them down?
>
> To move the discussion to the numeric front please send the 'perf
> sched latency' output of an affected workload.
CFS in .32 and before has one known, and now fixed, latency issue.
In .32, wake_up() (which is behind most inter-thread communication
and lots of other wakeups) was trying to keep the waker and wakee on the
same logical CPU at pretty much all cost. In .33-git, Mike fixed this so
that, if there's a free logical CPU sibling, or, on a multicore CPU,
another core which shares the cache, the newly woken task is scheduled on
that free CPU rather than on the current, guaranteed-busy, CPU.
This change helps latency a lot, and as a result, performance for
various latency-sensitive workloads...
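As a rough illustration (a toy model with invented data structures, not the actual kernel code, which lives in the fair scheduler's wakeup path), the placement behaviour Arjan describes amounts to something like this:

```c
/* Toy model of the .33-git wakeup placement: prefer an idle CPU that
 * shares a cache with the waker, instead of piling the wakee onto the
 * waker's (busy) CPU as .32 did. Data structures are hypothetical. */
#include <stdbool.h>

#define NR_CPUS 8

struct cpu {
    bool idle;
    int  cache_domain;   /* CPUs with equal ids share a last-level cache */
};

/* Pick where a freshly woken task should run. */
static int place_wakee(const struct cpu cpus[NR_CPUS], int waker_cpu)
{
    int domain = cpus[waker_cpu].cache_domain;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        /* A free sibling/core sharing cache with the waker wins... */
        if (cpu != waker_cpu && cpus[cpu].idle &&
            cpus[cpu].cache_domain == domain)
            return cpu;

    /* ...otherwise fall back to the waker's own, busy, CPU. */
    return waker_cpu;
}
```

With an idle cache-sharing sibling available, the wakee no longer queues behind its waker, which is where the latency win comes from.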
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Thu December 17 2009, Kasper Sandberg wrote:
> On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote:
> > * Jason Garrett-Glaser <[email protected]> wrote:
> > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <[email protected]>
wrote:
> > > > well well :) nothing quite speaks out like graphs..
> > > >
> > > > http://doom10.org/index.php?topic=78.0
> > > >
> > > >
> > > >
> > > > regards,
> > > > Kasper Sandberg
> > >
> > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically
> > > tied it--and given the strict thread-ordering expectations of x264,
> > > you basically can't expect it to do any better, though I'm curious
> > > what's responsible for the gap in "veryslow", even with SCHED_BATCH
> > > enabled.
> > >
> > > The most odd case is that of "ultrafast", in which CFS immediately
> > > ties BFS when we enable SCHED_BATCH. We're doing some further
> > > testing to see exactly
>
> Thats kinda besides the point.
>
> all these tunables and weirdness is _NEVER_ going to work for people.
>
> now forgive me for being so blunt, but for a user, having to do
> echo x264 > /proc/cfs/gief_me_performance_on_app
> or
> echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app
>
> just isnt usable, bfs matches, even exceeds cfs on all accounts, with
> ZERO user tuning, so while cfs may be able to nearly match up with a ton
> of application specific stuff, that just doesnt work for a normal user.
>
> not to mention that bfs does this whilst not loosing interactivity,
> something which cfs certainly cannot boast.
>
> <snip>
Strange, I seem to recall that BFS needs you to run apps with some silly
schedtool program to get media apps to not skip while doing other tasks. (I
don't have to tweak CFS at all)
> > Thanks,
> >
> > Ingo
--
Thomas Fjellstrom
[email protected]
On Thu, Dec 17, 2009 at 3:00 AM, Kasper Sandberg <[email protected]> wrote:
> On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote:
>> * Jason Garrett-Glaser <[email protected]> wrote:
>>
>> > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <[email protected]> wrote:
>> > > well well :) nothing quite speaks out like graphs..
>> > >
>> > > http://doom10.org/index.php?topic=78.0
>> > >
>> > >
>> > >
>> > > regards,
>> > > Kasper Sandberg
>> >
>> > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied
>> > it--and given the strict thread-ordering expectations of x264, you basically
>> > can't expect it to do any better, though I'm curious what's responsible for
>> > the gap in "veryslow", even with SCHED_BATCH enabled.
>> >
>> > The most odd case is that of "ultrafast", in which CFS immediately ties BFS
>> > when we enable SCHED_BATCH. We're doing some further testing to see exactly
>
> Thats kinda besides the point.
>
> all these tunables and weirdness is _NEVER_ going to work for people.
Can't individual applications request SCHED_BATCH? Our plan was to
have x264 simply detect whether it was necessary (once we figure out
which encoding settings result in the large-gap situation) and
automatically enable it for the current application.
Jason
* Jason Garrett-Glaser <[email protected]> wrote:
> On Thu, Dec 17, 2009 at 3:00 AM, Kasper Sandberg <[email protected]> wrote:
> > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote:
> >> * Jason Garrett-Glaser <[email protected]> wrote:
> >>
> >> > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <[email protected]> wrote:
> >> > > well well :) nothing quite speaks out like graphs..
> >> > >
> >> > > http://doom10.org/index.php?topic=78.0
> >> > >
> >> > >
> >> > >
> >> > > regards,
> >> > > Kasper Sandberg
> >> >
> >> > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied
> >> > it--and given the strict thread-ordering expectations of x264, you basically
> >> > can't expect it to do any better, though I'm curious what's responsible for
> >> > the gap in "veryslow", even with SCHED_BATCH enabled.
> >> >
> >> > The most odd case is that of "ultrafast", in which CFS immediately ties BFS
> >> > when we enable SCHED_BATCH. We're doing some further testing to see exactly
> >
> > Thats kinda besides the point.
> >
> > all these tunables and weirdness is _NEVER_ going to work for people.
>
> Can't individually applications request SCHED_BATCH? Our plan was to have
> x264 simply detect if it was necessary (once we figure out what encoding
> settings result in the large gap situation) and automatically enable it for
> the current application.
Yeah, SCHED_BATCH can be requested at will by an app. It's an unprivileged
operation. It gets passed down to child tasks. (You can just do it
unconditionally - older kernels will ignore it and give you an error code
for the sched_setscheduler() call.)
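A minimal sketch of what Ingo describes (illustrative only, not x264's actual code) would be an app opting itself into SCHED_BATCH and silently carrying on if the kernel refuses:

```c
/* Minimal sketch (illustrative, not x264's actual code) of an app
 * requesting SCHED_BATCH for itself. The request is unprivileged, is
 * inherited by threads and children created afterwards, and on kernels
 * that predate SCHED_BATCH the call simply fails, leaving the process
 * under SCHED_OTHER. */
#define _GNU_SOURCE          /* for SCHED_BATCH */
#include <sched.h>

static int request_batch_scheduling(void)
{
    struct sched_param param = { .sched_priority = 0 };  /* must be 0 for BATCH */

    if (sched_setscheduler(0, SCHED_BATCH, &param) == -1)
        return 0;            /* old/restricted kernel: stay SCHED_OTHER */
    return 1;
}
```

An encoder could call this once at startup, before spawning its worker threads, since the policy is inherited by everything created afterwards.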
Having said that, we generally try to make things perform well without apps
having to switch themselves to SCHED_BATCH. Mike, do you think we can make
x264 perform as well (or nearly as well) under SCHED_OTHER as under
SCHED_BATCH?
Ingo
On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote:
> Having said that, we generally try to make things perform well without apps
> having to switch themselves to SCHED_BATCH. Mike, do you think we can make
> x264 perform as well (or nearly as well) under SCHED_OTHER as under
> SCHED_BATCH?
It's not bad as is, except for ultrafast mode. START_DEBIT is the
biggest problem there. I don't think SCHED_OTHER will ever match
SCHED_BATCH for this load, though I must say I haven't full-spectrum
tested. This load really wants RR scheduling, and wakeup preemption
necessarily perturbs run order.
I'll probably piddle with it some more, it's an interesting load.
-Mike
On Thu, Dec 17, 2009 at 11:30 PM, Mike Galbraith <[email protected]> wrote:
> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote:
>
>> Having said that, we generally try to make things perform well without apps
>> having to switch themselves to SCHED_BATCH. Mike, do you think we can make
>> x264 perform as well (or nearly as well) under SCHED_OTHER as under
>> SCHED_BATCH?
>
> It's not bad as is, except for ultrafast mode. START_DEBIT is the
> biggest problem there. I don't think SCHED_OTHER will ever match
> SCHED_BATCH for this load, though I must say I haven't full-spectrum
> tested. This load really wants RR scheduling, and wakeup preemption
> necessarily perturbs run order.
>
> I'll probably piddle with it some more, it's an interesting load.
>
> ? ? ? ?-Mike
>
>
Two more thoughts here:
1) We're considering moving to a thread pool soon; we already have a
working patch for it and if anything it'll save a few clocks spent on
nice()ing threads and other such things. Will this improve
START_DEBIT at all? I've attached the beta patch if you want to try
it. Note this also works with 2) as well, so it adds yet another
dimension to what's mentioned below.
2) We recently implemented a new threading model which may be
interesting to test as well. This threading model gives worse
compression *and* performance, but has one benefit: it adds zero
latency, whereas normal threading adds a full frame of latency per
thread. This was paid for by a company interested in
ultra-low-latency streaming applications, where 1 millisecond is a
huge deal. I've been thinking this might be interesting to bench from
a kernel perspective as well, as when you're spawning a half-dozen
threads and need them all done within 6 milliseconds, you start
getting down to serious scheduler issues.
The new threading model is much less complex than the regular one and
works as follows. The frame is split into X slices, and each slice
encoded with one thread. Specifically, it works via the following
process:
1. Preprocess input frame, perform lookahead analysis on input frame
(all singlethreaded)
2. Spin up a ton of threads to do the main encode, one per slice.
3. Join all the threads.
4. Do post-filtering on the output frame, return.
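A hypothetical skeleton of that per-frame fork/join pattern (names invented for illustration; not x264's code) makes clear why the scheduler sees a burst of short-lived threads every frame:

```c
/* Skeleton of the sliced-threads fork/join pattern described above:
 * one short-lived worker per slice, all joined before post-filtering.
 * Names and "work" are stand-ins; the real encoder does this once per
 * frame. */
#include <pthread.h>

#define NUM_SLICES 4

struct slice_job {
    int first_row, last_row;   /* rows of the frame this worker encodes */
    int encoded_rows;          /* toy stand-in for the encode result */
};

static void *encode_slice(void *arg)
{
    struct slice_job *job = arg;
    job->encoded_rows = job->last_row - job->first_row;  /* stand-in work */
    return NULL;
}

/* "Encode" one frame of `height` rows; returns total rows encoded. */
static int encode_frame(int height)
{
    pthread_t threads[NUM_SLICES];
    struct slice_job jobs[NUM_SLICES];
    int rows_per_slice = height / NUM_SLICES, total = 0;

    /* Step 2: spawn one worker per slice... */
    for (int i = 0; i < NUM_SLICES; i++) {
        jobs[i].first_row = i * rows_per_slice;
        jobs[i].last_row  = (i == NUM_SLICES - 1) ? height
                                                  : (i + 1) * rows_per_slice;
        pthread_create(&threads[i], NULL, encode_slice, &jobs[i]);
    }
    /* Step 3: ...then join them all before the single-threaded
     * post-filter runs. */
    for (int i = 0; i < NUM_SLICES; i++) {
        pthread_join(threads[i], NULL);
        total += jobs[i].encoded_rows;
    }
    return total;
}
```

Every frame thus pays a spawn/join cycle, and all workers must finish inside the frame deadline, which is exactly the scheduler stress described below.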
Clearly this is an utter disaster, since it spawns N times as many
threads as the old threading model *and* they are far shorter-lived, *and*
only part of the application is multithreaded. But there's not really
a better way to do low-latency threading, and it's an interesting
challenge to boot. IIRC, it's also the way ffmpeg's encoder threading
works. It's widely considered an inferior model, but as mentioned
before, in this particular use-case there's no choice.
To enable this, use --sliced-threads. I'd recommend using a
higher-resolution clip for this, as it performs atrociously badly on
very low resolution videos, for reasons you might be able to guess. If
you need a higher-res clip, check the SD or HD ones here:
http://media.xiph.org/video/derf/ .
I'm personally curious as to what kind of scheduler issues this
results in--I haven't done any BFS vs CFS tests with this option
enabled yet.
Jason
On Thu, 2009-12-17 at 14:30 +0100, Mike Galbraith wrote:
> On Thu, 2009-12-17 at 12:00 +0100, Kasper Sandberg wrote:
> > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote:
> > > * Jason Garrett-Glaser <[email protected]> wrote:
> > >
> > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <[email protected]> wrote:
> > > > > well well :) nothing quite speaks out like graphs..
> > > > >
> > > > > http://doom10.org/index.php?topic=78.0
> > > > >
> > > > >
> > > > >
> > > > > regards,
> > > > > Kasper Sandberg
> > > >
> > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied
> > > > it--and given the strict thread-ordering expectations of x264, you basically
> > > > can't expect it to do any better, though I'm curious what's responsible for
> > > > the gap in "veryslow", even with SCHED_BATCH enabled.
> > > >
> > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS
> > > > when we enable SCHED_BATCH. We're doing some further testing to see exactly
> >
> > Thats kinda besides the point.
> >
> > all these tunables and weirdness is _NEVER_ going to work for people.
>
> Fact is, it is working for a great number of people, the vast majority
> of whom don't even know where the knobs are, much less what they do.
but not as great as it could be :)
>
> > now forgive me for being so blunt, but for a user, having to do
> > echo x264 > /proc/cfs/gief_me_performance_on_app
> > or
> > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app
>
> Theatrics noted.
>
> > just isnt usable, bfs matches, even exceeds cfs on all accounts, with
> > ZERO user tuning, so while cfs may be able to nearly match up with a ton
> > of application specific stuff, that just doesnt work for a normal user.
>
> Seems you haven't done much benchmarking. BFS has strengths as well as
> weaknesses, all schedulers do.
yeah, BFS just has more strengths and fewer weaknesses than CFS :)
>
> > not to mention that bfs does this whilst not loosing interactivity,
> > something which cfs certainly cannot boast.
>
> Not true. I sent Con hard evidence of a severe problem area wrt
> interactivity, and hard numbers showing other places where BFS needs
> some work. But hey, if BFS blows your skirt up, use it and be happy.
Theatrics noted.
As for your point, well... as far as I have heard, all you've come up
with is COMPLETELY WORTHLESS use cases which nobody is ever EVAR going
to hit, and which are thus irrelevant.
>
> -Mike
>
On Thu, 2009-12-17 at 14:22 -0700, Thomas Fjellstrom wrote:
> On Thu December 17 2009, Kasper Sandberg wrote:
> > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote:
> > > * Jason Garrett-Glaser <[email protected]> wrote:
> > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <[email protected]>
> wrote:
> > > > > well well :) nothing quite speaks out like graphs..
> > > > >
> > > > > http://doom10.org/index.php?topic=78.0
> > > > >
> > > > >
> > > > >
> > > > > regards,
> > > > > Kasper Sandberg
> > > >
> > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically
> > > > tied it--and given the strict thread-ordering expectations of x264,
> > > > you basically can't expect it to do any better, though I'm curious
> > > > what's responsible for the gap in "veryslow", even with SCHED_BATCH
> > > > enabled.
> > > >
> > > > The most odd case is that of "ultrafast", in which CFS immediately
> > > > ties BFS when we enable SCHED_BATCH. We're doing some further
> > > > testing to see exactly
> >
> > Thats kinda besides the point.
> >
> > all these tunables and weirdness is _NEVER_ going to work for people.
> >
> > now forgive me for being so blunt, but for a user, having to do
> > echo x264 > /proc/cfs/gief_me_performance_on_app
> > or
> > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app
> >
> > just isnt usable, bfs matches, even exceeds cfs on all accounts, with
> > ZERO user tuning, so while cfs may be able to nearly match up with a ton
> > of application specific stuff, that just doesnt work for a normal user.
> >
> > not to mention that bfs does this whilst not loosing interactivity,
> > something which cfs certainly cannot boast.
> >
> > <snip>
>
> Strange, I seem to recall that BFS needs you to run apps with some silly
> schedtool program to get media apps to not skip while doing other tasks. (I
> don't have to tweak CFS at all)
You recall incorrectly.
>
> > > Thanks,
> > >
> > > Ingo
>
>
>
On Thu, 2009-12-17 at 17:18 -0800, Jason Garrett-Glaser wrote:
> On Thu, Dec 17, 2009 at 3:00 AM, Kasper Sandberg <[email protected]> wrote:
> > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote:
> >> * Jason Garrett-Glaser <[email protected]> wrote:
> >>
> >> > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <[email protected]> wrote:
> >> > > well well :) nothing quite speaks out like graphs..
> >> > >
> >> > > http://doom10.org/index.php?topic=78.0
> >> > >
> >> > >
> >> > >
> >> > > regards,
> >> > > Kasper Sandberg
> >> >
> >> > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied
> >> > it--and given the strict thread-ordering expectations of x264, you basically
> >> > can't expect it to do any better, though I'm curious what's responsible for
> >> > the gap in "veryslow", even with SCHED_BATCH enabled.
> >> >
> >> > The most odd case is that of "ultrafast", in which CFS immediately ties BFS
> >> > when we enable SCHED_BATCH. We're doing some further testing to see exactly
> >
> > Thats kinda besides the point.
> >
> > all these tunables and weirdness is _NEVER_ going to work for people.
>
> Can't individually applications request SCHED_BATCH? Our plan was to
> have x264 simply detect if it was necessary (once we figure out what
> encoding settings result in the large gap situation) and automatically
> enable it for the current application.
That is an insane solution, especially considering better schedulers
outperform CFS with SCHED_BATCH without doing ANYTHING special.
Do you not see what is happening here? It is simply grotesque.
>
> Jason
On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote:
> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote:
>
> > Having said that, we generally try to make things perform well without apps
> > having to switch themselves to SCHED_BATCH. Mike, do you think we can make
> > x264 perform as well (or nearly as well) under SCHED_OTHER as under
> > SCHED_BATCH?
>
> It's not bad as is, except for ultrafast mode. START_DEBIT is the
> biggest problem there. I don't think SCHED_OTHER will ever match
> SCHED_BATCH for this load, though I must say I haven't full-spectrum
> tested. This load really wants RR scheduling, and wakeup preemption
> necessarily perturbs run order.
>
> I'll probably piddle with it some more, it's an interesting load.
Yes, I must say, very interesting, it's very complicated and... oh wait,
it's just encoding a movie!
>
> -Mike
>
On Fri, Dec 18, 2009 at 2:57 AM, Kasper Sandberg <[email protected]> wrote:
> On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote:
>> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote:
>>
>> > Having said that, we generally try to make things perform well without apps
>> > having to switch themselves to SCHED_BATCH. Mike, do you think we can make
>> > x264 perform as well (or nearly as well) under SCHED_OTHER as under
>> > SCHED_BATCH?
>>
>> It's not bad as is, except for ultrafast mode. START_DEBIT is the
>> biggest problem there. I don't think SCHED_OTHER will ever match
>> SCHED_BATCH for this load, though I must say I haven't full-spectrum
>> tested. This load really wants RR scheduling, and wakeup preemption
>> necessarily perturbs run order.
>>
>> I'll probably piddle with it some more, it's an interesting load.
> Yes, i must say, very interresting, its very complicated and... oh wait,
> its just encoding a movie!
Your trolling is becoming a bit over-the-top at this point. You
should also consider replying to multiple people in one email, as
opposed to spamming a whole bunch in sequence.
Perhaps as the lead x264 developer I'm qualified to say that it
certainly is a very complicated load due to the strict ordering
requirements of the threading model--and that you should tone down the
whining just a tad and perhaps read a bit more about how BFS and CFS
work before complaining about them.
Jason
On Fri, 2009-12-18 at 11:54 +0100, Kasper Sandberg wrote:
> On Thu, 2009-12-17 at 14:30 +0100, Mike Galbraith wrote:
> > On Thu, 2009-12-17 at 12:00 +0100, Kasper Sandberg wrote:
> > > On Thu, 2009-12-17 at 11:53 +0100, Ingo Molnar wrote:
> > > > * Jason Garrett-Glaser <[email protected]> wrote:
> > > >
> > > > > On Thu, Dec 17, 2009 at 1:33 AM, Kasper Sandberg <[email protected]> wrote:
> > > > > > well well :) nothing quite speaks out like graphs..
> > > > > >
> > > > > > http://doom10.org/index.php?topic=78.0
> > > > > >
> > > > > >
> > > > > >
> > > > > > regards,
> > > > > > Kasper Sandberg
> > > > >
> > > > > Yeah, I sent this to Mike a bit ago. Seems that .32 has basically tied
> > > > > it--and given the strict thread-ordering expectations of x264, you basically
> > > > > can't expect it to do any better, though I'm curious what's responsible for
> > > > > the gap in "veryslow", even with SCHED_BATCH enabled.
> > > > >
> > > > > The most odd case is that of "ultrafast", in which CFS immediately ties BFS
> > > > > when we enable SCHED_BATCH. We're doing some further testing to see exactly
> > >
> > > Thats kinda besides the point.
> > >
> > > all these tunables and weirdness is _NEVER_ going to work for people.
> >
> > Fact is, it is working for a great number of people, the vast majority
> > of whom don't even know where the knobs are, much less what they do.
> but not as great as it could be :)
>
> >
> > > now forgive me for being so blunt, but for a user, having to do
> > > echo x264 > /proc/cfs/gief_me_performance_on_app
> > > or
> > > echo some_benchmark > x264 > /proc/cfs/gief_me_performance_on_app
> >
> > Theatrics noted.
> >
> > > just isnt usable, bfs matches, even exceeds cfs on all accounts, with
> > > ZERO user tuning, so while cfs may be able to nearly match up with a ton
> > > of application specific stuff, that just doesnt work for a normal user.
> >
> > Seems you haven't done much benchmarking. BFS has strengths as well as
> > weaknesses, all schedulers do.
> yeah, BFS just has more strengths and fewer weaknesses than CFS :)
> >
> > > not to mention that bfs does this whilst not loosing interactivity,
> > > something which cfs certainly cannot boast.
> >
> > Not true. I sent Con hard evidence of a severe problem area wrt
> > interactivity, and hard numbers showing other places where BFS needs
> > some work. But hey, if BFS blows your skirt up, use it and be happy.
> Theatrics noted.
>
> As for your point, well.. as far as i have heard, all you've come up
> with is COMPLETELY WORTHLESS use cases which nobody is ever EVAR going
> to do, and thus irellevant
Goodbye troll.
*PLONK*
On Fri, 2009-12-18 at 02:11 -0800, Jason Garrett-Glaser wrote:
> Two more thoughts here:
>
> 1) We're considering moving to a thread pool soon; we already have a
> working patch for it and if anything it'll save a few clocks spent on
> nice()ing threads and other such things. Will this improve
> START_DEBIT at all?
Yeah, START_DEBIT only affects a thread once.
> I've attached the beta patch if you want to try
> it. Note this also works with 2) as well, so it adds yet another
> dimension to what's mentioned below.
>
> 2) We recently implemented a new threading model which may be
> interesting to test as well. This threading model gives worse
> compression *and* performance, but has one benefit: it adds zero
> latency, whereas normal threading adds a full frame of latency per
> thread. This was paid for by a company interested in
> ultra-low-latency streaming applications, where 1 millisecond is a
> huge deal. I've been thinking this might be interesting to bench from
> a kernel perspective as well, as when you're spawning a half-dozen
> threads and need them all done within 6 milliseconds, you start
> getting down to serious scheduler issues.
>
> The new threading model is much less complex than the regular one and
> works as follows. The frame is split into X slices, and each slice
> encoded with one thread. Specifically, it works via the following
> process:
>
> 1. Preprocess input frame, perform lookahead analysis on input frame
> (all singlethreaded)
> 2. Split up a ton of threads to do the main encode, one per slice.
> 3. Join all the threads.
> 4. Do post-filtering on the output frame, return.
>
> Clearly this is an utter disaster, since it spawns N times as many
> threads as the old threading model *and* they last far shorter, *and*
> only part of the application is multithreaded. But there's not really
> a better way to do low-latency threading, and it's an interesting
> challenge to boot. IIRC, it's also the way ffmpeg's encoder threading
> works. It's widely considered an inferior model, but as mentioned
> before, in this particular use-case there's no choice.
>
> To enable this, use --sliced-threads. I'd recommend using a
> higher-resolution clip for this, as it performs atrociously bad on
> very low resolution videos for reasons you might be able to guess. If
> you need a higher-res clip, check the SD or HD ones here:
> http://media.xiph.org/video/derf/ .
In another 8 hrs 24 min, I'll have a sunflower to stare at.
> I'm personally curious as to what kind of scheduler issues this
> results in--I haven't done any BFS vs CFS tests with this option
> enabled yet.
I'll look for x264 source, and patch/piddle.
-Mike
* Mike Galbraith <[email protected]> wrote:
> > I'm personally curious as to what kind of scheduler issues this results
> > in--I haven't done any BFS vs CFS tests with this option enabled yet.
>
> I'll look for x264 source, and patch/piddle.
btw., would be nice to look at it via tools/perf/ as well:
perf stat --repeat 3 ...
to see the basic hardware utilization (cycles/cache-misses, branch execution
rate, instructions, etc.) and the basic parallelism metrics, at a glance.
i suspect "perf stat -e L1-icache-loads -e L1-icache-load-misses" would give
us an even more detailed picture.
Ingo
On Fri, 2009-12-18 at 14:06 +0100, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > > I'm personally curious as to what kind of scheduler issues this results
> > > in--I haven't done any BFS vs CFS tests with this option enabled yet.
> >
> > I'll look for x264 source, and patch/piddle.
>
> btw., would be nice to look at it via tools/perf/ as well:
>
> perf stat --repeat 3 ...
>
> to see the basic hardware utilization (cycles/cache-misses, branch execution
> rate, instructions, etc.) and the basic parallelism metrics, at a glance.
>
> i suspect "perf stat -e L1-icache-loads -e L1-icache-load-misses" would give
> us an even more detailed picture.
Almost virgin v2.6.32-10468-g020307d running 'medium'.
encoded 600 frames, 36.52 fps, 13003.54 kb/s
Performance counter stats for './x264.sh 8' (3 runs):
63742.218844 task-clock-msecs # 3.870 CPUs ( +- 0.016% )
42593 context-switches # 0.001 M/sec ( +- 0.487% )
3011 CPU-migrations # 0.000 M/sec ( +- 0.417% )
12862 page-faults # 0.000 M/sec ( +- 0.004% )
151734450892 cycles # 2380.439 M/sec ( +- 1.947% ) (scaled from 71.44%)
205642315207 instructions # 1.355 IPC ( +- 0.085% ) (scaled from 80.68%)
16274905932 branches # 255.324 M/sec ( +- 0.080% ) (scaled from 80.67%)
1257135617 branch-misses # 7.724 % ( +- 0.255% ) (scaled from 80.06%)
3116653323 cache-references # 48.895 M/sec ( +- 0.340% ) (scaled from 23.78%)
50823973 cache-misses # 0.797 M/sec ( +- 1.400% ) (scaled from 23.76%)
16.470164901 seconds time elapsed ( +- 0.079% )
encoded 600 frames, 36.58 fps, 13003.54 kb/s
Performance counter stats for './x264.sh 8' (3 runs):
133692266953 L1-icache-loads ( +- 0.027% )
997371592 L1-icache-load-misses ( +- 0.009% )
16.407060367 seconds time elapsed ( +- 0.036% )
On Fri, 2009-12-18 at 13:49 +0100, Mike Galbraith wrote:
> I'll look for x264 source, and patch/piddle.
encoder/encoder.c: In function ‘x264_slice_write’:
encoder/encoder.c:1813: error: ‘x264_t’ has no member named ‘i_threads’
make: *** [encoder/encoder.o] Error 1
marge:..src/x264 # git remote -v
origin git://git.videolan.org/x264.git (fetch)
origin git://git.videolan.org/x264.git (push)
On Fri, 18 Dec 2009 22:05:34 Jason Garrett-Glaser wrote:
> On Fri, Dec 18, 2009 at 2:57 AM, Kasper Sandberg <[email protected]> wrote:
> > On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote:
> >> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote:
> >> > Having said that, we generally try to make things perform well without
> >> > apps having to switch themselves to SCHED_BATCH. Mike, do you think we
> >> > can make x264 perform as well (or nearly as well) under SCHED_OTHER as
> >> > under SCHED_BATCH?
> >>
> >> It's not bad as is, except for ultrafast mode. START_DEBIT is the
> >> biggest problem there. I don't think SCHED_OTHER will ever match
> >> SCHED_BATCH for this load, though I must say I haven't full-spectrum
> >> tested. This load really wants RR scheduling, and wakeup preemption
> >> necessarily perturbs run order.
> >>
> >> I'll probably piddle with it some more, it's an interesting load.
> >
> > Yes, i must say, very interresting, its very complicated and... oh wait,
> > its just encoding a movie!
>
> Your trolling is becoming a bit over-the-top at this point.  You
> should also consider replying to multiple people in one email as
> opposed to spamming a whole bunch in sequence.
>
> Perhaps as the lead x264 developer I'm qualified to say that it
> certainly is a very complicated load due to the strict ordering
> requirements of the threading model--and that you should tone down the
> whining just a tad and perhaps read a bit more about how BFS and CFS
> work before complaining about them.
Your workload is interesting because it is a well written real world
application with a solid threading model written in a cross platform portable
way. Your code is valuable as a measure for precisely this reason, and
there's a trap in trying to program in a way that "the scheduler might like".
That's presumably what Kasper is trying to point out, albeit in a much blunter
fashion.
The only workloads I'm remotely interested in are real world workloads
involving real applications like yours, software compilation, video playback,
audio playback, gaming, apache page serving, mysql performance and so on that
people in the real world use on real hardware all day every day. These are, of
course, measurable even above and beyond the elusive and impossible to measure
and quantify interactivity and responsiveness.
I couldn't care less about some artificial benchmark involving LTP, timing
mplayer playing in the presence of 100,000 pipes, volanomark which is just a
sched_yield benchmark, dbench and hackbench which even their original
programmers don't like them being used as a meaningful measure, and so on, and
normal users should also not care about the values returned by these artificial
benchmarks when they bear no resemblance to their real world performance cases
as above.
I have zero interest in adding any "tweaks" to BFS to perform well in X
benchmark, for there be a path where dragons lie. I've always maintained, and
still stick to it, that the more tweaks you add for corner cases, the more
corner cases you introduce yourself. BFS will remain for a targeted audience
and I care not to appeal to any artificial benchmarketing-obsessed population
that drives mainline, since I don't -have- to. Mainline can do what it wants,
and hopefully uses BFS as a yardstick for comparison when appropriate.
Regards,
--
-ck
On Sat, 2009-12-19 at 12:08 +1100, Con Kolivas wrote:
> On Fri, 18 Dec 2009 22:05:34 Jason Garrett-Glaser wrote:
> > On Fri, Dec 18, 2009 at 2:57 AM, Kasper Sandberg <[email protected]> wrote:
> > > On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote:
> > >> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote:
> > >> > Having said that, we generally try to make things perform well without
> > >> > apps having to switch themselves to SCHED_BATCH. Mike, do you think we
> > >> > can make x264 perform as well (or nearly as well) under SCHED_OTHER as
> > >> > under SCHED_BATCH?
> > >>
> > >> It's not bad as is, except for ultrafast mode. START_DEBIT is the
> > >> biggest problem there. I don't think SCHED_OTHER will ever match
> > >> SCHED_BATCH for this load, though I must say I haven't full-spectrum
> > >> tested. This load really wants RR scheduling, and wakeup preemption
> > >> necessarily perturbs run order.
> > >>
> > >> I'll probably piddle with it some more, it's an interesting load.
> > >
> > > Yes, i must say, very interresting, its very complicated and... oh wait,
> > > its just encoding a movie!
> >
> > Your trolling is becoming a bit over-the-top at this point. You
> > should also considering replying to multiple people in one email as
> > opposed to spamming a whole bunch in sequence.
> >
> > Perhaps as the lead x264 developer I'm qualified to say that it
> > certainly is a very complicated load due to the strict ordering
> > requirements of the threading model--and that you should tone down the
> > whining just a tad and perhaps read a bit more about how BFS and CFS
> > work before complaining about them.
>
> Your workload is interesting because it is a well written real world
> application with a solid threading model written in a cross platform portable
> way. Your code is valuable as a measure for precisely this reason, and
> there's a trap in trying to program in a way that "the scheduler might like".
> That's presumably what Kasper is trying to point out, albeit in a much blunter
> fashion.
If using a different kernel facility gives better results, go for what
works best. Programmers have been doing that since day one. I doubt
you'd call it a trap to trade a pipe for a socketpair if one produced
better results than the other.
Mind you, we should be able to better service the load with plain
SCHED_OTHER, no argument there.
> The only workloads I'm remotely interested in are real world workloads
> involving real applications like yours, software compilation, video playback,
> audio playback, gaming, apache page serving, mysql performance and so on that
> people in the real world use on real hardware all day every day. These are, of
> course, measurable even above and beyond the elusive and impossible to measure
> and quantify interactivity and responsiveness.
>
> I couldn't care less about some artificial benchmark involving LTP, timing
> mplayer playing in the presence of 100,000 pipes, volanomark which is just a
> sched_yield benchmark, dbench and hackbench which even their original
> programmers don't like them being used as a meaningful measure, and so on, and
> normal users should also not care about the values returned by these artificial
> benchmarks when they bear no resemblance to their real world performance cases
> as above.
I find all programs interesting and valid in their own right, whether
they be a benchmark or not, though I agree that vmark and hackbench are
a bit over the top.
> I have zero interest in adding any "tweaks" to BFS to perform well in X
> benchmark, for there be a path where dragons lie. I've always maintained that,
> and still stick to it, that the more tweaks you add for corner cases, the more
> corner cases you introduce yourself. BFS will remain for a targeted audience
> and I care not to appeal to any artificial benchmarketing obsessed population
> that drives mainline, since I don't -have- to. Mainline can do what it wants,
> and hopefully uses BFS as a yardstick for comparison when appropriate.
Interesting rant. IMO, benchmarks are all merely programs that do some
work and quantify. Whether you like what they measure or not, whether
they emit flattering numbers or not, they can all tell you something if
you're willing to listen.
Oh, and for the record, timing mplayer thing was NOT in the presence of
100000 pipes, it was in the presence of one cpu hog, as was the time
amarok loading thing. Those were UP tests showing you a weakness. All
of the results I sent you were intended to show you areas that could use
some improvement, but you don't want to hear, so label and hand-wave.
Below is a quote of the results I sent you.
<quote>
I've taken BFS out for a few spins while looking into BFS vs CFS latency
reports, and noticed a couple of problems I'll share; comparison testing
has been healthy for CFS, so maybe BFS can profit as well.  Below are
some bfs304 vs my working tree numbers from a run this morning, looking
to see if some issues seen in earlier releases were still present.
Comments on noted issues:
It looks like there may be some affinity troubles, and there definitely
seems to be a fairness bug still lurking. No idea what's up with that,
but see data below, it's pretty nasty. Any sleepy load competing with a
pure hog seems to be troublesome.
The pgsql+oltp test data is very interesting to me; pgsql+oltp hates
preemption with a passion, because of its USERLAND spinlocks.  Preempt
the lock holder, and watch the fun.  Your preemption model suits it very
well at the low end, and does pretty well all the way through.  Really
interesting to me is the difference in 1 and 2 client throughput, which
is why I'm including these.
mysql+oltp and tbench look like they're griping about affinity to me, but
I haven't instrumented anything, so can't be sure.  mysql+oltp I know is
a wakeup preemption load and is very affinity sensitive.  Too little wakeup
preemption, it suffers; any load balancing, it suffers.
What vmark is so upset about, I have no idea. I know it's very affinity
sensitive, and hates wakeup preemption passionately.
Numbers:
vmark
tip 108841 messages per second
tip++ 116260 messages per second
31.bfs304 28279 messages per second
tbench 8
tip 938.421 MB/sec 8 procs
tip++ 952.302 MB/sec 8 procs
31.bfs304 709.121 MB/sec 8 procs
mysql+oltp
clients 1 2 4 8 16 32 64 128 256
tip 9999.36 18493.54 34652.91 34253.13 32057.64 30297.43 28300.96 25450.14 20675.99
tip++ 10041.16 18531.16 34934.22 34192.65 32829.65 32010.55 30341.31 27340.65 22724.87
31.bfs304 9459.85 14952.44 32209.07 29724.03 28608.02 27051.10 24851.44 21223.15 15809.46
pgsql+oltp
clients 1 2 4 8 16 32 64 128 256
tip 13577.63 26510.67 51871.05 51374.62 50190.69 45494.64 37173.83 27767.09 22795.23
tip++ 13685.69 26693.42 52056.45 51733.30 50854.75 49790.95 48972.02 47517.34 44999.22
31.bfs304 15467.03 21126.57 52673.76 50972.41 49652.54 46015.73 44567.18 40419.90 33276.67
fairness bug in 31.bfs304?
prep:
set CPU governor to performance first, as in all benchmarking.
taskset -c 0 pert (100% CPU hog TSC perturbation measurement proggy)
taskset -p 0x1 `pidof Xorg`
perf stat taskset -c 0 konsole -e exit
31.bfs304 2.073724549 seconds time elapsed
tip++ 0.989323860 seconds time elapsed
note: amarok pins itself to CPU0, and is set up to use mysql database.
prep: cache warmup run.
perf stat amarokapp (quit after 12000 song mp3 collection is loaded)
31.bfs304 136.418518486 seconds time elapsed
tip++ 19.439268066 seconds time elapsed
prep: restart amarok, wait for load, start playing
perf stat taskset -c 0 mplayer -nosound 3DMark2000.mkv (exact 6 minute movie)
31.bfs304 432.712500554 seconds time elapsed
tip++ 363.622519583 seconds time elapsed
On Sat, 2009-12-19 at 05:03 +0100, Mike Galbraith wrote:
> On Sat, 2009-12-19 at 12:08 +1100, Con Kolivas wrote:
> > On Fri, 18 Dec 2009 22:05:34 Jason Garrett-Glaser wrote:
> > > On Fri, Dec 18, 2009 at 2:57 AM, Kasper Sandberg <[email protected]> wrote:
> > > > On Fri, 2009-12-18 at 08:30 +0100, Mike Galbraith wrote:
> > > >> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote:
> > > >> > Having said that, we generally try to make things perform well without
> > > >> > apps having to switch themselves to SCHED_BATCH. Mike, do you think we
> > > >> > can make x264 perform as well (or nearly as well) under SCHED_OTHER as
> > > >> > under SCHED_BATCH?
> > > >>
> > > >> It's not bad as is, except for ultrafast mode. START_DEBIT is the
> > > >> biggest problem there. I don't think SCHED_OTHER will ever match
> > > >> SCHED_BATCH for this load, though I must say I haven't full-spectrum
> > > >> tested. This load really wants RR scheduling, and wakeup preemption
> > > >> necessarily perturbs run order.
> > > >>
> > > >> I'll probably piddle with it some more, it's an interesting load.
> > > >
> > > > Yes, i must say, very interresting, its very complicated and... oh wait,
> > > > its just encoding a movie!
> > >
> > > Your trolling is becoming a bit over-the-top at this point. You
> > > should also considering replying to multiple people in one email as
> > > opposed to spamming a whole bunch in sequence.
> > >
> > > Perhaps as the lead x264 developer I'm qualified to say that it
> > > certainly is a very complicated load due to the strict ordering
> > > requirements of the threading model--and that you should tone down the
> > > whining just a tad and perhaps read a bit more about how BFS and CFS
> > > work before complaining about them.
> >
> > Your workload is interesting because it is a well written real world
> > application with a solid threading model written in a cross platform portable
> > way. Your code is valuable as a measure for precisely this reason, and
> > there's a trap in trying to program in a way that "the scheduler might like".
> > That's presumably what Kasper is trying to point out, albeit in a much blunter
> > fashion.
>
> If using a different kernel facility gives better results, go for what
> works best. Programmers have been doing that since day one. I doubt
> you'd call it a trap to trade a pipe for a socketpair if one produced
> better results than the other.
Of course in this case that is what performs best on a single
scheduler...
>
> Mind you, we should be able to better service the load with plain
> SCHED_OTHER, no argument there.
Great, so when you said "I don't think it will get better" (or words to
that effect), that didn't mean anything?
>
> > The only workloads I'm remotely interested in are real world workloads
> > involving real applications like yours, software compilation, video playback,
> > audio playback, gaming, apache page serving, mysql performance and so on that
> > people in the real world use on real hardware all day every day. These are, of
> > course, measurable even above and beyond the elusive and impossible to measure
> > and quantify interactivity and responsiveness.
> >
> > I couldn't care less about some artificial benchmark involving LTP, timing
> > mplayer playing in the presence of 100,000 pipes, volanomark which is just a
> > sched_yield benchmark, dbench and hackbench which even their original
> > programmers don't like them being used as a meaningful measure, and so on, and
> > normal users should also not care about the values returned by these artificial
> > benchmarks when they bear no resemblance to their real world performance cases
> > as above.
>
> I find all programs interesting and valid in their own right, whether
> they be a benchmark or not, though I agree that vmark and hackbench are
> a bit over the top.
Yes.. it's interesting to SEE; whether it's relevant and something to
care about is entirely different.
Yes, it's very interesting that something craps out, but this thing is
_NEVER_ going to occur in real life, and if it happens to by some
magical christmas fluke, then that is fortunately only ONE time you're
seeing that problem, and as such it's irrelevant, and certainly doesn't
merit workarounds which make other very common stuff perform
significantly worse.
>
> > I have zero interest in adding any "tweaks" to BFS to perform well in X
> > benchmark, for there be a path where dragons lie. I've always maintained that,
> > and still stick to it, that the more tweaks you add for corner cases, the more
> > corner cases you introduce yourself. BFS will remain for a targeted audience
> > and I care not to appeal to any artificial benchmarketing obsessed population
> > that drives mainline, since I don't -have- to. Mainline can do what it wants,
> > and hopefully uses BFS as a yardstick for comparison when appropriate.
>
> Interesting rant. IMO, benchmarks are all merely programs that do some
> work and quantify. Whether you like what they measure or not, whether
> they emit flattering numbers or not, they can all tell you something if
> you're willing to listen.
I suspect Con is very interested in listening; however, as he has
stated, if fixing some corner case in an artificial load requires
damaging a real-world load, that is an unacceptable modification to him,
and I agree. I ask you this: would you rather some artificial benchmark
ran better, but your own everyday applications ran slower as a result?
It seems to me you would, which I cannot understand.
>
> Oh, and for the record, timing mplayer thing was NOT in the presence of
> 100000 pipes, it was in the presence of one cpu hog, as was the time
> amarok loading thing. Those were UP tests showing you a weakness. All
> of the results I sent you were intended to show you areas that could use
> some improvement, but you don't want to hear, so label and hand-wave.
>
> Below is a quote of the results I sent you.
>
> <quote>
>
> I've taken BFS out for a few spins while looking into BFS vs CFS latency
> reports, and noticed a couple problems I'll share, comparison testing
> has been healthy for CFS, so maybe BFS can profit as well. Below are
> some bfs304 vs my working tree numbers from a run this morning, looking
> to see if some issues seen in earlier releases were still present.
>
> Comments on noted issues:
>
> It looks like there may be some affinity troubles, and there definitely
> seems to be a fairness bug still lurking. No idea what's up with that,
> but see data below, it's pretty nasty. Any sleepy load competing with a
> pure hog seems to be troublesome.
>
> The pgsql+oltp test data is very interesting to me, pgsql+oltp hates
> preemption with a passion, because of it's USERLAND spinlocks. Preempt
> the lock holder, and watch the fun. Your preemption model suits it very
> well at the low end, and does pretty well all the way though. Really
> interesting to me is the difference in 1 and 2 client throughput, why
> I'm including these.
>
> msql+oltp and tbench look like they're griping about affinity to me, but
> I haven't instrumented anything, so can't be sure. mysql+oltp I know is
> a wakeup preemption and is very affinity sensitive. Too little wakeup
> preemption, it suffers, any load balancing, it suffers.
>
> What vmark is so upset about, I have no idea. I know it's very affinity
> sensitive, and hates wakeup preemption passionately.
>
> Numbers:
>
> vmark
> tip 108841 messages per second
> tip++ 116260 messages per second
> 31.bfs304 28279 messages per second
>
> tbench 8
> tip 938.421 MB/sec 8 procs
> tip++ 952.302 MB/sec 8 procs
> 31.bfs304 709.121 MB/sec 8 procs
>
> mysql+oltp
> clients 1 2 4 8 16 32 64 128 256
> tip 9999.36 18493.54 34652.91 34253.13 32057.64 30297.43 28300.96 25450.14 20675.99
> tip++ 10041.16 18531.16 34934.22 34192.65 32829.65 32010.55 30341.31 27340.65 22724.87
> 31.bfs304 9459.85 14952.44 32209.07 29724.03 28608.02 27051.10 24851.44 21223.15 15809.46
>
> pgsql+oltp
> clients 1 2 4 8 16 32 64 128 256
> tip 13577.63 26510.67 51871.05 51374.62 50190.69 45494.64 37173.83 27767.09 22795.23
> tip++ 13685.69 26693.42 52056.45 51733.30 50854.75 49790.95 48972.02 47517.34 44999.22
> 31.bfs304 15467.03 21126.57 52673.76 50972.41 49652.54 46015.73 44567.18 40419.90 33276.67
>
> fairness bug in 31.bfs304?
>
> prep:
> set CPU governor to performance first, as in all benchmarking.
> taskset -c 0 pert (100% CPU hog TSC perturbation measurement proggy)
> taskset -p 0x1 `pidof Xorg`
>
> perf stat taskset -c 0 konsole -e exit
> 31.bfs304 2.073724549 seconds time elapsed
> tip++ 0.989323860 seconds time elapsed
>
> note: amarok pins itself to CPU0, and is set up to use mysql database.
>
> prep: cache warmup run.
> perf stat amarokapp (quit after 12000 song mp3 collection is loaded)
>
> 31.bfs304 136.418518486 seconds time elapsed
> tip++ 19.439268066 seconds time elapsed
>
> prep: restart amarok, wait for load, start playing
>
> perf stat taskset -c 0 mplayer -nosound 3DMark2000.mkv (exact 6 minute movie)
> 31.bfs304 432.712500554 seconds time elapsed
> tip++ 363.622519583 seconds time elapsed
>
But presumably the cpu hog is running at the same priority, and if this is
done on a UP system, that will obviously mean fairness will make stuff
slower..
Try this on a dualcore or quadcore system, or of course just set the
niceness accordingly...
>
On Sat, 2009-12-19 at 18:36 +0100, Kasper Sandberg wrote:
> On Sat, 2009-12-19 at 05:03 +0100, Mike Galbraith wrote:
> > On Sat, 2009-12-19 at 12:08 +1100, Con Kolivas wrote:
> > > Your workload is interesting because it is a well written real world
> > > application with a solid threading model written in a cross platform portable
> > > way. Your code is valuable as a measure for precisely this reason, and
> > > there's a trap in trying to program in a way that "the scheduler might like".
> > > That's presumably what Kasper is trying to point out, albeit in a much blunter
> > > fashion.
> >
> > If using a different kernel facility gives better results, go for what
> > works best. Programmers have been doing that since day one. I doubt
> > you'd call it a trap to trade a pipe for a socketpair if one produced
> > better results than the other.
>
> Ofcourse in this case that is what performs best one a single
> scheduler...
I have no idea what you're talking about here.
> > Mind you, we should be able to better service the load with plain
> > SCHED_OTHER, no argument there.
> Great, so when you said "i dont think it will get better"(or words to
> that effect), that didnt mean anything?
Or here.
Look. BFS handles this load well, a little better than CFS in fact. I
don't have a problem with that, but you seem to think it's a big hairy
deal for some strange reason.
> > > The only workloads I'm remotely interested in are real world workloads
> > > involving real applications like yours, software compilation, video playback,
> > > audio playback, gaming, apache page serving, mysql performance and so on that
> > > people in the real world use on real hardware all day every day. These are, of
> > > course, measurable even above and beyond the elusive and impossible to measure
> > > and quantify interactivity and responsiveness.
> > >
> > > I couldn't care less about some artificial benchmark involving LTP, timing
> > > mplayer playing in the presence of 100,000 pipes, volanomark which is just a
> > > sched_yield benchmark, dbench and hackbench which even their original
> > > programmers don't like them being used as a meaningful measure, and so on, and
> > > normal users should also not care about the values returned by these artificial
> > > benchmarks when they bear no resemblance to their real world performance cases
> > > as above.
> >
> > I find all programs interesting and valid in their own right, whether
> > they be a benchmark or not, though I agree that vmark and hackbench are
> > a bit over the top.
>
> Yes.. its interresting to SEE, whether its relevant and something to
> care about is entirely different.
>
> Yes, its very interresting that something craps out, now, this thing is
> _NEVER_ going to occur in real life, and if it happens to do by some
> magical christmas fluke, then that is fortunately only ONE time you're
> seeing that problem, and as such, its irellevant, and certainly doesnt
> merit workarounds which makes other very common stuff perform
> significantly worse.
Haven't you noticed yet that nobody but you and Con has suggested any
course of action whatsoever? That it is you two who both mention then
condemn workarounds and load specific tweaks all in the same breath with
not one word having come from any other source?
> > > I have zero interest in adding any "tweaks" to BFS to perform well in X
> > > benchmark, for there be a path where dragons lie. I've always maintained that,
> > > and still stick to it, that the more tweaks you add for corner cases, the more
> > > corner cases you introduce yourself. BFS will remain for a targeted audience
> > > and I care not to appeal to any artificial benchmarketing obsessed population
> > > that drives mainline, since I don't -have- to. Mainline can do what it wants,
> > > and hopefully uses BFS as a yardstick for comparison when appropriate.
> >
> > Interesting rant. IMO, benchmarks are all merely programs that do some
> > work and quantify. Whether you like what they measure or not, whether
> > they emit flattering numbers or not, they can all tell you something if
> > you're willing to listen.
>
> I suspect con is very interrested in listening, however, as he have
> stated, if fixing some corner case in an artificial load requires
> damaging a realworld load, that is an unacceptable modification to him,
> and I agree. I ask you this, would you rather some artificial benchmark
> ran better, but your own everyday applications ran slower as a result?
> It seems to me you do, which i can not understand.
You can hand-wave all you want, I really do not care, but kindly keep
your words out of my mouth.
> > fairness bug in 31.bfs304?
> >
> > prep:
> > set CPU governor to performance first, as in all benchmarking.
> > taskset -c 0 pert (100% CPU hog TSC perturbation measurement proggy)
> > taskset -p 0x1 `pidof Xorg`
> >
> > perf stat taskset -c 0 konsole -e exit
> > 31.bfs304 2.073724549 seconds time elapsed
> > tip++ 0.989323860 seconds time elapsed
> >
> > note: amarok pins itself to CPU0, and is set up to use mysql database.
> >
> > prep: cache warmup run.
> > perf stat amarokapp (quit after 12000 song mp3 collection is loaded)
> >
> > 31.bfs304 136.418518486 seconds time elapsed
> > tip++ 19.439268066 seconds time elapsed
> >
> > prep: restart amarok, wait for load, start playing
> >
> > perf stat taskset -c 0 mplayer -nosound 3DMark2000.mkv (exact 6 minute movie)
> > 31.bfs304 432.712500554 seconds time elapsed
> > tip++ 363.622519583 seconds time elapsed
> >
>
> But presumably the CPU hog is running at the same priority, and if this
> is done on a UP system, that will obviously mean fairness makes stuff
> slower...
>
> Try this on a dual-core or quad-core system, or of course just set the
> niceness accordingly...
Amazing that you can actually say that with a straight face.
Look. You can hand-wave all results into irrelevance, I do not care.
You've both made it perfectly clear that test results are not welcome.
-Mike
On Saturday 19 December 2009 18:36:03 Kasper Sandberg wrote:
> Try this on a dual-core or quad-core system, or of course just set the
> niceness accordingly...
Oh well. This is getting too much for a normally very silent and flame-fearing
reader. Didn't *you* just tell others to shut up about using any tunables for
any application? And that you don't need any tunables for BFS?
Andres
On Sun, 2009-12-20 at 04:22 +0100, Andres Freund wrote:
> On Saturday 19 December 2009 18:36:03 Kasper Sandberg wrote:
> > Try this on a dual-core or quad-core system, or of course just set the
> > niceness accordingly...
> Oh well. This is getting too much for a normally very silent and flame-fearing
> reader. Didn't *you* just tell others to shut up about using any tunables for
> any application? And that you don't need any tunables for BFS?
That was an entirely different case; have you even been following the
thread?
OF COURSE you're going to see slowdowns on a UP system if you have a CPU
hog and then run something else; that is the only behavior possible, and
BFS handles it in a fair way.
When I said we needed no tunables, that was for running a _SINGLE_
application and then measuring that application's performance (where BFS
indeed does beat CFS by quite a large margin).
And as for CFS, it SHOULD exhibit fair behavior anyway; isn't it called
the "Completely FAIR Scheduler"? Or is that just the marketing name?
>
> Andres
On Sun, 2009-12-20 at 13:10 +0100, Kasper Sandberg wrote:
> On Sun, 2009-12-20 at 04:22 +0100, Andres Freund wrote:
> > On Saturday 19 December 2009 18:36:03 Kasper Sandberg wrote:
> > > Try this on a dual-core or quad-core system, or of course just set the
> > > niceness accordingly...
> > Oh well. This is getting too much for a normally very silent and flame-fearing
> > reader. Didn't *you* just tell others to shut up about using any tunables for
> > any application? And that you don't need any tunables for BFS?
Oh, and by the way, niceness is not really a "tunable".
On Sun, 2009-12-20 at 13:10 +0100, Kasper Sandberg wrote:
> and as for CFS, it SHOULD exhibit fair behavior anyway; isn't it called
> the "Completely FAIR Scheduler"? Or is that just the marketing name?
Clue: CFS _did_ distribute CPU evenly. Ponder that for a moment.
-Mike
On Sun, 2009-12-20 at 16:13 +0100, Mike Galbraith wrote:
> On Sun, 2009-12-20 at 13:10 +0100, Kasper Sandberg wrote:
>
> > and as for CFS, it SHOULD exhibit fair behavior anyway; isn't it called
> > the "Completely FAIR Scheduler"? Or is that just the marketing name?
>
> Clue: CFS _did_ distribute CPU evenly. Ponder that for a moment.
All done?
Do you think THAT may be why I thought Con might be interested?!?
-Mike
Benchmarks for the new threading model are up, along with a few others:
http://doom10.org/index.php?topic=78.0
Interestingly enough, CFS beats BFS on zerolatency by a significant
margin. Unsurprisingly, given the threading model, the optimal number
of threads is equal to the number of cores.
Jason
On Tue, Dec 22, 2009 at 2:33 AM, Jason Garrett-Glaser
<[email protected]> wrote:
> Benchmarks for the new threading model are up, along with a few others:
>
> http://doom10.org/index.php?topic=78.0
>
> Interestingly enough, CFS beats BFS on zerolatency by a significant
> margin. Unsurprisingly, given the threading model, the optimal number
> of threads is equal to the number of cores.
>
> Jason
>
And I am apparently blind: I cannot read graphs. Ignore the
conclusion made in the above post ;)
Jason