2012-05-09 16:01:32

by Jason Garrett-Glaser

[permalink] [raw]
Subject: Scheduler still seems awful with x264, worse with patches

Many months ago, the topic of CFS's inefficiencies with x264 came up
and some improvements were made, but BFS and Windows still stayed a
little bit in the lead. This seemed to be because of a mix of three
issues. Firstly, relatively short-lived jobs (x264
uses a thread pool, so the actual threads themselves are long-lived). Secondly,
in frame threads, heavy dependencies between threads, benefiting
greatly from a dumb scheduler. Thirdly, in sliced threads -- the
focus of this post -- the best scheduling approach is to simply spread
them throughout the cores and do nothing, so again, a dumb scheduler
will do the right thing.

Recently I tried multithreading x264's lookahead for a customer. The
lookahead previously wasn't threaded, causing bottlenecks with many
cores and threads. I do my development mainly on Windows, and the
patch looked to be quite a success, with nice performance boosts for
many target use-cases.

And then I ran it on Linux and it choked horribly.

The patch is here:
https://github.com/DarkShikari/x264-devel/commit/99e830f1581eac3cf30f07b1d6c6c32bae1725c8
. To replicate the test, simply test that version against the
previous version. My guess is the reason it chokes is that it
involves spawning even *shorter*-lived jobs than x264 typically does,
something that CFS seems to simply collapse on.

Here's some stats from a recent kernel:

SD encoding (before -> after patch):
CFS: 325.49 +/- 1.22 fps -> 251.68 +/- 2.32 fps
BFS: 334.94 +/- 0.59 fps -> 344.47 +/- 0.68 fps

HD encoding (before -> after patch):
CFS: 39.05 +/- 0.22 fps -> 40.56 +/- 0.23 fps
BFS: 40.15 +/- 0.05 fps -> 44.89 +/- 0.05 fps

As can be seen, the longer the threads live (the lower the fps), the
less horrific the penalty is. Furthermore, though I don't have
numbers, using schedtool -R -p 1 does basically as well as BFS in
eliminating the problem. Naturally, this is not really a solution as
it requires root.

To replicate this test, a commandline like this should work on any
cached raw input file (a collection of free raw videos can be found
here if you don't like making your own:
http://media.xiph.org/video/derf/ ):

./x264 --preset superfast --tune zerolatency --threads X input -o /dev/null

Most of my testing was done with X = 4 or 8 on quad-core or
quad-core-with-HT machines. Though I don't have numbers, I should
note that the absolute worst results are on a Core i7 machine where I
used taskset 0xf with --threads 4; the patch literally cut the
framerate in half, despite giving a performance boost when SCHED_RR
was used.

Jason

P.S. Here is the structure of low-latency sliced threading with the
patch, for anyone curious.

To encode a frame with X threads:
1. The previous frame's X threads are still going -- they're
finishing up hpel and deblock calculation, which can be done after the
encoder returns, but before the next frame is encoded.
2. Serially, copy in the input frame and preprocess it, running
various not-yet-parallelized tasks while those threads finish up.
3. In parallel, do lookahead analysis of the input frame -- split it
into X slices and analyze them simultaneously. Then wait on the jobs
and merge the results.
4. If any of the previous frame's threads are still going, wait on them now.
5. Split the main encode into X slices and do them simultaneously.
Once they've all finished encoding -- after they've signaled the main
thread with a condition variable -- serially finish up a few minor
tasks and return to the caller. These threads will continue until
step 4) of the next frame.

When encoding at ~300fps, the encode jobs will typically run for
roughly 3ms each, and the lookahead threads for around 0.5ms. For a
low-latency encoder, there is no way to make these
threads last longer while still getting parallelism, and it must be
doable because Windows handles it just fine.

Jason


2012-05-09 16:24:36

by Mike Galbraith

[permalink] [raw]
Subject: Re: Scheduler still seems awful with x264, worse with patches

On Wed, 2012-05-09 at 09:00 -0700, Jason Garrett-Glaser wrote:
> Many months ago, the topic of CFS's inefficiencies with x264 came up
> and some improvements were made, but BFS and Windows still stayed a
> little bit in the lead. This seemed to be because of a mix of three
> issues. Firstly, relatively short-lived jobs (x264
> uses a thread pool, so the actual threads themselves are long-lived). Secondly,
> in frame threads, heavy dependencies between threads, benefiting
> greatly from a dumb scheduler. Thirdly, in sliced threads -- the
> focus of this post -- the best scheduling approach is to simply spread
> them throughout the cores and do nothing, so again, a dumb scheduler
> will do the right thing.

I took x264 for a quick test drive a short while ago, and it looks like
we slipped a bit. I didn't have time to futz with it much, but did find
that SCHED_IDLE kicked SCHED_OTHER's butt. x264 really really wants RR.

> Recently I tried multithreading x264's lookahead for a customer. The
> lookahead previously wasn't threaded, causing bottlenecks with many
> cores and threads. I do my development mainly on Windows, and the
> patch looked to be quite a success, with nice performance boosts for
> many target use-cases.
>
> And then I ran it on Linux and it choked horribly.
>
> The patch is here:
> https://github.com/DarkShikari/x264-devel/commit/99e830f1581eac3cf30f07b1d6c6c32bae1725c8
> . To replicate the test, simply test that version against the
> previous version. My guess is the reason it chokes is that it
> involves spawning even *shorter*-lived jobs than x264 typically does,
> something that CFS seems to simply collapse on.
>
> Here's some stats from a recent kernel:
>
> SD encoding (before -> after patch):
> CFS: 325.49 +/- 1.22 fps -> 251.68 +/- 2.32 fps
> BFS: 334.94 +/- 0.59 fps -> 344.47 +/- 0.68 fps
>
> HD encoding (before -> after patch):
> CFS: 39.05 +/- 0.22 fps -> 40.56 +/- 0.23 fps
> BFS: 40.15 +/- 0.05 fps -> 44.89 +/- 0.05 fps
>
> As can be seen, the longer the threads live (the lower the fps), the
> less horrific the penalty is. Furthermore, though I don't have
> numbers, using schedtool -R -p 1 does basically as well as BFS in
> eliminating the problem. Naturally, this is not really a solution as
> it requires root.
>
> To replicate this test, a commandline like this should work on any
> cached raw input file (a collection of free raw videos can be found
> here if you don't like making your own:
> http://media.xiph.org/video/derf/ ):
>
> ./x264 --preset superfast --tune zerolatency --threads X input -o /dev/null
>
> Most of my testing was done with X = 4 or 8 on quad-core or
> quad-core-with-HT machines. Though I don't have numbers, I should
> note that the absolute worst results are on a Core i7 machine where I
> set taskset 0xf and used threads 4; the patch literally cut the
> framerate in half, despite giving a performance boost when SCHED_RR
> was used.
>
> Jason
>
> P.S. Here is the structure of low-latency sliced threading with the
> patch, for anyone curious.
>
> To encode a frame with X threads:
> 1. The previous frame's X threads are still going -- they're
> finishing up hpel and deblock calculation, which can be done after the
> encoder returns, but before the next frame is encoded.
> 2. Serially, copy in the input frame and preprocess it, running
> various not-yet-parallelized tasks while those threads finish up.
> 3. In parallel, do lookahead analysis of the input frame -- split it
> into X slices and analyze them simultaneously. Then wait on the jobs
> and merge the results.
> 4. If any of the previous frame's threads are still going, wait on them now.
> 5. Split the main encode into X slices and do them simultaneously.
> Once they've all finished encoding -- after they've signaled the main
> thread with a condition variable -- serially finish up a few minor
> tasks and return to the caller. These threads will continue until
> step 4) of the next frame.
>
> When encoding at ~300fps, the encode jobs will typically run for about
> ~3ms each, and the lookahead threads likely for around 0.5ms, very
> roughly. For a low-latency encoder, there is no way to make these
> threads last longer while still getting parallelism, and it must be
> doable because Windows handles it just fine.
>
> Jason

2012-05-09 16:30:19

by Jason Garrett-Glaser

[permalink] [raw]
Subject: Re: Scheduler still seems awful with x264, worse with patches

On Wed, May 9, 2012 at 9:24 AM, Mike Galbraith <[email protected]> wrote:
> On Wed, 2012-05-09 at 09:00 -0700, Jason Garrett-Glaser wrote:
>> Many months ago, the topic of CFS's inefficiencies with x264 came up
>> and some improvements were made, but BFS and Windows still stayed a
>> little bit in the lead. This seemed to be because of a mix of three
>> issues. Firstly, relatively short-lived jobs (x264
>> uses a thread pool, so the actual threads themselves are long-lived). Secondly,
>> in frame threads, heavy dependencies between threads, benefiting
>> greatly from a dumb scheduler. Thirdly, in sliced threads -- the
>> focus of this post -- the best scheduling approach is to simply spread
>> them throughout the cores and do nothing, so again, a dumb scheduler
>> will do the right thing.
>
> I took x264 for a quick test drive a short while ago, and it looks like
> we slipped a bit. I didn't have time to futz with it much, but did find
> that SCHED_IDLE kicked SCHED_OTHER's butt. x264 really really wants RR.

Do remember to separate frame and slice threading in tests; they work
totally differently and, while you might be able to kill two birds
with one stone sometimes, some particular tuning might not affect both
in the same way.

Slice-threading is probably harder in general because the threads last
far less time, and that seems to be the thing that angers CFS.

Note also that my patch slice-threads the lookahead, even if the main
encode is frame-threaded. This is because, for various reasons,
frame-threading the lookahead would be considerably harder, so I
decided to do it this way (and it worked on Windows, so...). Note
also that when using automatic lookahead threads (default in that
patch), x264 currently does:

number of lookahead threads = MIN( sliced threads ? threads : threads / 6, 16 );

Jason

2012-05-10 10:29:50

by Mike Galbraith

[permalink] [raw]
Subject: Re: Scheduler still seems awful with x264, worse with patches

On Wed, 2012-05-09 at 09:00 -0700, Jason Garrett-Glaser wrote:
> Many months ago, the topic of CFS's inefficiencies with x264 came up
> and some improvements were made, but BFS and Windows still stayed a
> little bit in the lead. This seemed to be because of a mix of three
> issues. Firstly, relatively short-lived jobs (x264
> uses a thread pool, so the actual threads themselves are long-lived). Secondly,
> in frame threads, heavy dependencies between threads, benefiting
> greatly from a dumb scheduler. Thirdly, in sliced threads -- the
> focus of this post -- the best scheduling approach is to simply spread
> them throughout the cores and do nothing, so again, a dumb scheduler
> will do the right thing.
>
> Recently I tried multithreading x264's lookahead for a customer. The
> lookahead previously wasn't threaded, causing bottlenecks with many
> cores and threads. I do my development mainly on Windows, and the
> patch looked to be quite a success, with nice performance boosts for
> many target use-cases.
>
> And then I ran it on Linux and it choked horribly.
>
> The patch is here:
> https://github.com/DarkShikari/x264-devel/commit/99e830f1581eac3cf30f07b1d6c6c32bae1725c8
> . To replicate the test, simply test that version against the
> previous version. My guess is the reason it chokes is that it
> involves spawning even *shorter*-lived jobs than x264 typically does,
> something that CFS seems to simply collapse on.
>
> Here's some stats from a recent kernel:
>
> SD encoding (before -> after patch):
> CFS: 325.49 +/- 1.22 fps -> 251.68 +/- 2.32 fps
> BFS: 334.94 +/- 0.59 fps -> 344.47 +/- 0.68 fps
>
> HD encoding (before -> after patch):
> CFS: 39.05 +/- 0.22 fps -> 40.56 +/- 0.23 fps
> BFS: 40.15 +/- 0.05 fps -> 44.89 +/- 0.05 fps
>
> As can be seen, the longer the threads live (the lower the fps), the
> less horrific the penalty is. Furthermore, though I don't have
> numbers, using schedtool -R -p 1 does basically as well as BFS in
> eliminating the problem. Naturally, this is not really a solution as
> it requires root.
>
> To replicate this test, a commandline like this should work on any
> cached raw input file (a collection of free raw videos can be found
> here if you don't like making your own:
> http://media.xiph.org/video/derf/ ):
>
> ./x264 --preset superfast --tune zerolatency --threads X input -o /dev/null

On my Q6600 box, neither scheduler (identical configs) seems to like
--tune zerolatency much.

# ultrafast
x264 --quiet --no-progress --preset ultrafast --no-scenecut --sync-lookahead 0 --qp 20 -o /dev/null --threads $1 ./soccer_4cif.y4m
x264 --quiet --no-progress --preset ultrafast --tune zerolatency --no-scenecut --sync-lookahead 0 --qp 20 -o /dev/null --threads $1 ./soccer_4cif.y4m
x264 --quiet --no-progress --preset ultrafast --tune zerolatency -o /dev/null --threads $1 ./soccer_4cif.y4m

                       3.3.0-bfs     3.3.0-cfs
marge:~/tmp # ./x264.sh 8

encoded 600 frames,    449.63 fps    400.20 fps
encoded 600 frames,    355.00 fps    304.12 fps
encoded 600 frames,    305.65 fps    267.25 fps

marge:~/tmp # schedctl -I ./x264.sh 8

encoded 600 frames,    475.00 fps    483.19 fps
encoded 600 frames,    364.72 fps    278.37 fps
encoded 600 frames,    311.69 fps    256.25 fps

marge:~/tmp # schedctl -R ./x264.sh 8

encoded 600 frames,    454.70 fps    489.00 fps
encoded 600 frames,    358.83 fps    365.61 fps
encoded 600 frames,    308.81 fps    310.46 fps

2012-05-10 16:23:01

by Jason Garrett-Glaser

[permalink] [raw]
Subject: Re: Scheduler still seems awful with x264, worse with patches

On Thu, May 10, 2012 at 3:29 AM, Mike Galbraith <[email protected]> wrote:
> On Wed, 2012-05-09 at 09:00 -0700, Jason Garrett-Glaser wrote:
>> Many months ago, the topic of CFS's inefficiencies with x264 came up
>> and some improvements were made, but BFS and Windows still stayed a
>> little bit in the lead. This seemed to be because of a mix of three
>> issues. Firstly, relatively short-lived jobs (x264
>> uses a thread pool, so the actual threads themselves are long-lived). Secondly,
>> in frame threads, heavy dependencies between threads, benefiting
>> greatly from a dumb scheduler. Thirdly, in sliced threads -- the
>> focus of this post -- the best scheduling approach is to simply spread
>> them throughout the cores and do nothing, so again, a dumb scheduler
>> will do the right thing.
>>
>> Recently I tried multithreading x264's lookahead for a customer. The
>> lookahead previously wasn't threaded, causing bottlenecks with many
>> cores and threads. I do my development mainly on Windows, and the
>> patch looked to be quite a success, with nice performance boosts for
>> many target use-cases.
>>
>> And then I ran it on Linux and it choked horribly.
>>
>> The patch is here:
>> https://github.com/DarkShikari/x264-devel/commit/99e830f1581eac3cf30f07b1d6c6c32bae1725c8
>> . To replicate the test, simply test that version against the
>> previous version. My guess is the reason it chokes is that it
>> involves spawning even *shorter*-lived jobs than x264 typically does,
>> something that CFS seems to simply collapse on.
>>
>> Here's some stats from a recent kernel:
>>
>> SD encoding (before -> after patch):
>> CFS: 325.49 +/- 1.22 fps -> 251.68 +/- 2.32 fps
>> BFS: 334.94 +/- 0.59 fps -> 344.47 +/- 0.68 fps
>>
>> HD encoding (before -> after patch):
>> CFS: 39.05 +/- 0.22 fps -> 40.56 +/- 0.23 fps
>> BFS: 40.15 +/- 0.05 fps -> 44.89 +/- 0.05 fps
>>
>> As can be seen, the longer the threads live (the lower the fps), the
>> less horrific the penalty is. Furthermore, though I don't have
>> numbers, using schedtool -R -p 1 does basically as well as BFS in
>> eliminating the problem. Naturally, this is not really a solution as
>> it requires root.
>>
>> To replicate this test, a commandline like this should work on any
>> cached raw input file (a collection of free raw videos can be found
>> here if you don't like making your own:
>> http://media.xiph.org/video/derf/ ):
>>
>> ./x264 --preset superfast --tune zerolatency --threads X input -o /dev/null
>
> On my Q6600 box, neither scheduler (identical configs) seems to like
> --tune zerolatency much.

Sliced-threads (zero latency mode) should probably never be run with
more threads than cores -- virtual cores, at the very least. 8
threads on a quad-core is definitely not the best idea.

Your tests are very very short so I suspect the standard deviation of
those tests is so high as to obscure any actual results; please always
remember to post error bars. A test that only lasts for 2 seconds can
easily have +/- 50fps of error.

Jason

2012-05-10 18:25:12

by Mike Galbraith

[permalink] [raw]
Subject: Re: Scheduler still seems awful with x264, worse with patches

On Thu, 2012-05-10 at 09:22 -0700, Jason Garrett-Glaser wrote:
> On Thu, May 10, 2012 at 3:29 AM, Mike Galbraith <[email protected]> wrote:
> > On Wed, 2012-05-09 at 09:00 -0700, Jason Garrett-Glaser wrote:
> >> Many months ago, the topic of CFS's inefficiencies with x264 came up
> >> and some improvements were made, but BFS and Windows still stayed a
> >> little bit in the lead. This seemed to be because of a mix of three
> >> issues. Firstly, relatively short-lived jobs (x264
> >> uses a thread pool, so the actual threads themselves are long-lived). Secondly,
> >> in frame threads, heavy dependencies between threads, benefiting
> >> greatly from a dumb scheduler. Thirdly, in sliced threads -- the
> >> focus of this post -- the best scheduling approach is to simply spread
> >> them throughout the cores and do nothing, so again, a dumb scheduler
> >> will do the right thing.
> >>
> >> Recently I tried multithreading x264's lookahead for a customer. The
> >> lookahead previously wasn't threaded, causing bottlenecks with many
> >> cores and threads. I do my development mainly on Windows, and the
> >> patch looked to be quite a success, with nice performance boosts for
> >> many target use-cases.
> >>
> >> And then I ran it on Linux and it choked horribly.
> >>
> >> The patch is here:
> >> https://github.com/DarkShikari/x264-devel/commit/99e830f1581eac3cf30f07b1d6c6c32bae1725c8
> >> . To replicate the test, simply test that version against the
> >> previous version. My guess is the reason it chokes is that it
> >> involves spawning even *shorter*-lived jobs than x264 typically does,
> >> something that CFS seems to simply collapse on.
> >>
> >> Here's some stats from a recent kernel:
> >>
> >> SD encoding (before -> after patch):
> >> CFS: 325.49 +/- 1.22 fps -> 251.68 +/- 2.32 fps
> >> BFS: 334.94 +/- 0.59 fps -> 344.47 +/- 0.68 fps
> >>
> >> HD encoding (before -> after patch):
> >> CFS: 39.05 +/- 0.22 fps -> 40.56 +/- 0.23 fps
> >> BFS: 40.15 +/- 0.05 fps -> 44.89 +/- 0.05 fps
> >>
> >> As can be seen, the longer the threads live (the lower the fps), the
> >> less horrific the penalty is. Furthermore, though I don't have
> >> numbers, using schedtool -R -p 1 does basically as well as BFS in
> >> eliminating the problem. Naturally, this is not really a solution as
> >> it requires root.
> >>
> >> To replicate this test, a commandline like this should work on any
> >> cached raw input file (a collection of free raw videos can be found
> >> here if you don't like making your own:
> >> http://media.xiph.org/video/derf/ ):
> >>
> >> ./x264 --preset superfast --tune zerolatency --threads X input -o /dev/null
> >
> > On my Q6600 box, neither scheduler (identical configs) seems to like
> > --tune zerolatency much.
>
> Sliced-threads (zero latency mode) should probably never be run with
> more threads than cores -- virtual cores, at the very least. 8
> threads on a quad-core is definitely not the best idea.

Ok.

> Your tests are very very short so I suspect the standard deviation of
> those tests is so high as to obscure any actual results; please always
> remember to post error bars. A test that only lasts for 2 seconds can
> easily have +/- 50fps of error.

No. Results are plenty repeatable enough that 'problem' sticks out far
above s/n. No idea what to do with this, but 'tree' seems significant.

-Mike
