As an x264 developer, I have no position on the whole debate over
BFS/CFS (nor am I a kernel hacker), but a friend of mine recently ran
this set of tests with BFS vs CFS that still doesn't make any sense to
me and suggests some sort of serious suboptimality in the existing
scheduler:
>>>>>>>>>>>>>>>>>>
Background information necessary to replicate test:
Input file: http://media.xiph.org/video/derf/y4m/soccer_4cif.y4m
x264 source: git://git.videolan.org/x264.git
revision of x264 used: e553a4c
CPU: Core 2 Quad Q9300 (2.5GHz)
Kernel/distro/platform: 2.6.31 patched with the gentoo patchset, Gentoo, x86_64.
BFS patch: Latest available (BFS 220).
Methodology: Each test was run 3 times. The median of the three was
then selected.
./x264/x264 --preset ultrafast --no-scenecut --sync-lookahead 0 --qp 20 samples/soccer_4cif.y4m -o /dev/null --threads X
         BFS          CFS
1:  124.79 fps   131.69 fps
2:  252.14 fps   192.14 fps
3:  376.55 fps   223.24 fps
4:  447.69 fps   242.54 fps
5:  447.98 fps   252.43 fps
6:  447.87 fps   253.56 fps
7:  444.79 fps   250.37 fps
8:  441.08 fps   251.95 fps
./x264/x264 -B 2000 samples/soccer_4cif.y4m -o /dev/null --threads X
        BFS         CFS
1:  19.72 fps   19.97 fps
2:  39.03 fps   29.75 fps
3:  60.85 fps   39.83 fps
4:  68.60 fps   42.04 fps
5:  70.61 fps   43.78 fps
6:  71.35 fps   46.43 fps
7:  70.80 fps   48.02 fps
8:  70.68 fps   46.95 fps
./x264/x264 --preset veryslow --crf 20 samples/soccer_4cif.y4m -o /dev/null --threads X
       BFS        CFS
1:  1.89 fps   1.89 fps
2:  3.24 fps   2.78 fps
3:  4.18 fps   3.47 fps
4:  5.76 fps   4.61 fps
5:  6.07 fps   4.67 fps
6:  6.29 fps   4.90 fps
7:  6.52 fps   5.08 fps
8:  6.65 fps   5.27 fps
I noticed that when running single-threaded, BFS seemed to be migrating the
process between CPUs. So, binding the process to a single CPU, I got the
numbers below.
taskset -c 0 $x264_cmd --threads 1
ultrafast: 130.76 fps
defaults: 20.01 fps
veryslow: 1.90 fps
<<<<<<<<<<<<<<<<<<
What is particularly troubling about these results is that this is not
a situation that should seriously challenge the scheduler (the way a
thousand-thread HTTP server would). In ultrafast mode, the threading model
is phenomenally simple: each thread, if it gets too far ahead of the
previous thread, is blocked. That's it. (Full gory details at
http://akuvian.org/src/x264/sliceless_threads.txt.)
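
To make that rule concrete without reading the whole document: a frame thread
simply waits until the previous frame has encoded enough rows to serve as a
reference. A minimal sketch in C with pthreads, using hypothetical names
rather than x264's actual functions:

#include <pthread.h>

/* Hypothetical per-frame progress tracker -- an illustration of the
 * sliceless threading rule, not x264's actual code. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  row_done;
    int             rows_completed;   /* rows encoded so far in this frame */
} frame_progress;

/* Before encoding row y of the current frame, block until the previous
 * frame has encoded far enough ahead to be usable as a motion-search
 * reference (a fixed margin of rows in this sketch). */
static void wait_for_previous_frame(frame_progress *prev, int y, int margin)
{
    pthread_mutex_lock(&prev->lock);
    while (prev->rows_completed < y + margin)
        pthread_cond_wait(&prev->row_done, &prev->lock);
    pthread_mutex_unlock(&prev->lock);
}

/* After finishing a row, wake anyone waiting on this frame's progress. */
static void mark_row_done(frame_progress *self)
{
    pthread_mutex_lock(&self->lock);
    self->rows_completed++;
    pthread_cond_broadcast(&self->row_done);
    pthread_mutex_unlock(&self->lock);
}

In this model each thread spends essentially all of its time encoding and
touches the condition variable only once per row, which is why the workload
should not be hard to schedule.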
In the other modes, the only complication is that there is one extra thread
(the lookahead) running ahead of all the main threads, and the main threads
are set to a lower priority via nice() so that they don't end up blocking on
the lookahead thread.
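
As a rough sketch of that setup (again illustrative, not x264's real code;
note that on Linux, nice() called from inside a thread lowers only that
thread's priority, which is the behaviour the description above relies on):

#include <pthread.h>
#include <unistd.h>

#define NUM_MAIN_THREADS 4   /* illustrative; x264 derives this from --threads */

static void *lookahead_thread(void *arg)
{
    (void)arg;
    /* Runs ahead of the encoder threads at normal priority,
     * making frame-type decisions. */
    return NULL;
}

static void *main_encoder_thread(void *arg)
{
    (void)arg;
    /* Deprioritize this thread only (per-thread nice on Linux);
     * the increment of 10 is made up for illustration. */
    nice(10);
    /* ... encode frames, using the row-progress synchronization
     * sketched above ... */
    return NULL;
}

int main(void)
{
    pthread_t la, workers[NUM_MAIN_THREADS];
    pthread_create(&la, NULL, lookahead_thread, NULL);
    for (int i = 0; i < NUM_MAIN_THREADS; i++)
        pthread_create(&workers[i], NULL, main_encoder_thread, NULL);
    for (int i = 0; i < NUM_MAIN_THREADS; i++)
        pthread_join(workers[i], NULL);
    pthread_join(la, NULL);
    return 0;
}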
Though I'm not a scheduler hacker, these enormous differences in an
application that is entirely CPU-bound and uses very few threads strike me
as seriously wrong.
Jason Garrett-Glaser
On Mon, 2009-09-14 at 15:29 -0700, Jason Garrett-Glaser wrote:
> As an x264 developer, I have no position on the whole debate over
> BFS/CFS (nor am I a kernel hacker), but a friend of mine recently ran
> this set of tests with BFS vs CFS that still doesn't make any sense to
> me and suggests some sort of serious suboptimality in the existing
> scheduler:
Yup, I confirmed your friend's results.
> >>>>>>>>>>>>>>>>>>
>
> Background information necessary to replicate test:
>
> Input file: http://media.xiph.org/video/derf/y4m/soccer_4cif.y4m
> x264 source: git://git.videolan.org/x264.git
> revision of x264 used: e553a4c
> CPU: Core 2 Quad Q9300 (2.5GHz)
> Kernel/distro/platform: 2.6.31 patched with the gentoo patchset, Gentoo, x86_64.
> BFS patch: Latest available (BFS 220).
> Methodology: Each test was run 3 times. The median of the three was
> then selected.
>
> ./x264/x264 --preset ultrafast --no-scenecut --sync-lookahead 0 --qp 20 samples/soccer_4cif.y4m -o /dev/null --threads X
>          BFS          CFS
> 1:  124.79 fps   131.69 fps
> 2:  252.14 fps   192.14 fps
> 3:  376.55 fps   223.24 fps
> 4:  447.69 fps   242.54 fps
> 5:  447.98 fps   252.43 fps
> 6:  447.87 fps   253.56 fps
> 7:  444.79 fps   250.37 fps
> 8:  441.08 fps   251.95 fps
After a bit of testing, it turns out that NEXT_BUDDY and LB_BIAS
features are _both_ doing injury to this load. We've been looking at
NEXT_BUDDY, but LB_BIAS is a new target.
Thanks a bunch for the nice repeatable testcase!
-Mike
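
For anyone who wants to reproduce the NO_NEXT_BUDDY / NO_LB_BIAS runs below:
on kernels built with CONFIG_SCHED_DEBUG, individual scheduler features can
be toggled at runtime by writing the feature name (prefixed with NO_ to clear
it) to /sys/kernel/debug/sched_features. A minimal helper along those lines,
assuming debugfs is mounted at /sys/kernel/debug and root privileges:

#include <stdio.h>

/* Toggle one scheduler feature, e.g. "NO_NEXT_BUDDY" or "NEXT_BUDDY".
 * Requires CONFIG_SCHED_DEBUG and debugfs mounted at /sys/kernel/debug. */
static int set_sched_feat(const char *feat)
{
    FILE *f = fopen("/sys/kernel/debug/sched_features", "w");
    if (!f) {
        perror("sched_features");
        return -1;
    }
    int err = (fputs(feat, f) == EOF);
    err |= (fclose(f) != 0);
    return err ? -1 : 0;
}

int main(void)
{
    /* Reproduce the NO_NEXT_BUDDY NO_LB_BIAS configuration measured below. */
    if (set_sched_feat("NO_NEXT_BUDDY") || set_sched_feat("NO_LB_BIAS"))
        return 1;
    return 0;
}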
x264 --preset ultrafast --no-scenecut --sync-lookahead 0 --qp 20 -o /dev/null --threads $THREADS soccer_4cif.y4m
2.6.32-tip-smp
4   encoded 600 frames, 280.07 fps, 22096.60 kb/s
    encoded 600 frames, 280.67 fps, 22096.60 kb/s
    encoded 600 frames, 274.80 fps, 22096.60 kb/s
8   encoded 600 frames, 269.57 fps, 22096.60 kb/s
    encoded 600 frames, 282.96 fps, 22096.60 kb/s
    encoded 600 frames, 279.66 fps, 22096.60 kb/s

2.6.31-bfs221-smp
4   encoded 600 frames, 408.38 fps, 22096.60 kb/s
    encoded 600 frames, 409.17 fps, 22096.60 kb/s
    encoded 600 frames, 407.50 fps, 22096.60 kb/s
8   encoded 600 frames, 409.82 fps, 22096.60 kb/s
    encoded 600 frames, 413.00 fps, 22096.60 kb/s
    encoded 600 frames, 411.10 fps, 22096.60 kb/s
test test test...
2.6.32-tip-smp NO_NEXT_BUDDY NO_LB_BIAS
4   encoded 600 frames, 418.07 fps, 22096.60 kb/s
    encoded 600 frames, 418.72 fps, 22096.60 kb/s
    encoded 600 frames, 419.10 fps, 22096.60 kb/s
8   encoded 600 frames, 425.75 fps, 22096.60 kb/s
    encoded 600 frames, 425.45 fps, 22096.60 kb/s
    encoded 600 frames, 422.49 fps, 22096.60 kb/s
Commit-ID: 0ec9fab3d186d9cbb00c0f694d4a260d07c198d9
Gitweb: http://git.kernel.org/tip/0ec9fab3d186d9cbb00c0f694d4a260d07c198d9
Author: Mike Galbraith <[email protected]>
AuthorDate: Tue, 15 Sep 2009 15:07:03 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 15 Sep 2009 16:51:16 +0200
sched: Improve latencies and throughput
Make the idle balancer more aggressive, to improve an
x264 encoding workload provided by Jason Garrett-Glaser:
NEXT_BUDDY NO_LB_BIAS
encoded 600 frames, 252.82 fps, 22096.60 kb/s
encoded 600 frames, 250.69 fps, 22096.60 kb/s
encoded 600 frames, 245.76 fps, 22096.60 kb/s
NO_NEXT_BUDDY LB_BIAS
encoded 600 frames, 344.44 fps, 22096.60 kb/s
encoded 600 frames, 346.66 fps, 22096.60 kb/s
encoded 600 frames, 352.59 fps, 22096.60 kb/s
NO_NEXT_BUDDY NO_LB_BIAS
encoded 600 frames, 425.75 fps, 22096.60 kb/s
encoded 600 frames, 425.45 fps, 22096.60 kb/s
encoded 600 frames, 422.49 fps, 22096.60 kb/s
Peter pointed out that this is better done via newidle_idx,
not via LB_BIAS: newidle balancing should look for where there
is load _now_, not where there was load 2 ticks ago.
Worst-case latencies are improved as well, since having no buddies
means less vruntime spread (as per prior lkml discussions).
This change also improves kbuild-peak parallelism.
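
For context on what LB_BIAS and the various *_idx fields select: the balancer
can estimate a runqueue's load either from the instantaneous weighted load or
from one of the decayed cpu_load[] averages, chosen by an index, and an index
of 0 (or LB_BIAS off) means "trust the load as it is right now". A toy
user-space model of that choice (illustrative only, not the kernel's code):

/* Toy model of the load-estimate choice the balancer makes; not kernel
 * code.  idx == 0 (or lb_bias off) means "use the instantaneous load";
 * a higher idx trusts an older, smoothed average instead, which biases
 * newidle balancing toward where load *was* rather than where it is. */
struct cpu_load_model {
    unsigned long now;         /* instantaneous weighted load */
    unsigned long decayed[5];  /* cpu_load[]-style decayed averages */
};

static unsigned long pick_source_load(const struct cpu_load_model *c,
                                      int idx, int lb_bias)
{
    if (idx == 0 || !lb_bias)
        return c->now;
    /* Be conservative: never report more load than is there right now. */
    return c->decayed[idx - 1] < c->now ? c->decayed[idx - 1] : c->now;
}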
Reported-by: Jason Garrett-Glaser <[email protected]>
Signed-off-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
 arch/ia64/include/asm/topology.h    |    5 +++--
 arch/powerpc/include/asm/topology.h |    2 +-
 arch/sh/include/asm/topology.h      |    3 ++-
 arch/x86/include/asm/topology.h     |    4 +---
 include/linux/topology.h            |    2 +-
 kernel/sched_features.h             |    2 +-
 6 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index 47f3c51..42f1673 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -61,7 +61,7 @@ void build_cpu_to_node_map(void);
.cache_nice_tries = 2, \
.busy_idx = 2, \
.idle_idx = 1, \
- .newidle_idx = 2, \
+ .newidle_idx = 0, \
.wake_idx = 0, \
.forkexec_idx = 1, \
.flags = SD_LOAD_BALANCE \
@@ -87,10 +87,11 @@ void build_cpu_to_node_map(void);
.cache_nice_tries = 2, \
.busy_idx = 3, \
.idle_idx = 2, \
- .newidle_idx = 2, \
+ .newidle_idx = 0, \
.wake_idx = 0, \
.forkexec_idx = 1, \
.flags = SD_LOAD_BALANCE \
+ | SD_BALANCE_NEWIDLE \
| SD_BALANCE_EXEC \
| SD_BALANCE_FORK \
| SD_BALANCE_WAKE \
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index a6b220a..1a2c9eb 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -57,7 +57,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
.cache_nice_tries = 1, \
.busy_idx = 3, \
.idle_idx = 1, \
- .newidle_idx = 2, \
+ .newidle_idx = 0, \
.wake_idx = 0, \
.flags = SD_LOAD_BALANCE \
| SD_BALANCE_EXEC \
diff --git a/arch/sh/include/asm/topology.h b/arch/sh/include/asm/topology.h
index 9054e5c..c843677 100644
--- a/arch/sh/include/asm/topology.h
+++ b/arch/sh/include/asm/topology.h
@@ -15,13 +15,14 @@
.cache_nice_tries = 2, \
.busy_idx = 3, \
.idle_idx = 2, \
- .newidle_idx = 2, \
+ .newidle_idx = 0, \
.wake_idx = 0, \
.forkexec_idx = 1, \
.flags = SD_LOAD_BALANCE \
| SD_BALANCE_FORK \
| SD_BALANCE_EXEC \
| SD_BALANCE_WAKE \
+ | SD_BALANCE_NEWIDLE \
| SD_SERIALIZE, \
.last_balance = jiffies, \
.balance_interval = 1, \
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4b1b335..7fafd1b 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -116,14 +116,12 @@ extern unsigned long node_remap_size[];
#ifdef CONFIG_X86_32
# define SD_CACHE_NICE_TRIES 1
# define SD_IDLE_IDX 1
-# define SD_NEWIDLE_IDX 2
# define SD_FORKEXEC_IDX 0
#else
# define SD_CACHE_NICE_TRIES 2
# define SD_IDLE_IDX 2
-# define SD_NEWIDLE_IDX 2
# define SD_FORKEXEC_IDX 1
#endif
@@ -137,7 +135,7 @@ extern unsigned long node_remap_size[];
.cache_nice_tries = SD_CACHE_NICE_TRIES, \
.busy_idx = 3, \
.idle_idx = SD_IDLE_IDX, \
- .newidle_idx = SD_NEWIDLE_IDX, \
+ .newidle_idx = 0, \
.wake_idx = 0, \
.forkexec_idx = SD_FORKEXEC_IDX, \
\
diff --git a/include/linux/topology.h b/include/linux/topology.h
index c87edcd..4298745 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -151,7 +151,7 @@ int arch_update_cpu_topology(void);
.cache_nice_tries = 1, \
.busy_idx = 2, \
.idle_idx = 1, \
- .newidle_idx = 2, \
+ .newidle_idx = 0, \
.wake_idx = 0, \
.forkexec_idx = 1, \
\
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 891ea0f..e98c2e8 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -67,7 +67,7 @@ SCHED_FEAT(AFFINE_WAKEUPS, 1)
* wakeup-preemption), since its likely going to consume data we
* touched, increases cache locality.
*/
-SCHED_FEAT(NEXT_BUDDY, 1)
+SCHED_FEAT(NEXT_BUDDY, 0)
/*
* Prefer to schedule the task that ran last (when we did