From: Jason Garrett-Glaser
Date: Wed, 9 May 2012 09:00:48 -0700
Subject: Scheduler still seems awful with x264, worse with patches
To: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>

Many months ago, the topic of CFS's inefficiencies with x264 came up, and some improvements were made, but BFS and Windows still stayed a little bit in the lead. This seemed to come down to a mix of three issues. First, x264 spawns relatively short-lived jobs (it uses a thread pool, so the actual threads are long-lived). Second, frame threads have heavy dependencies between one another and so benefit greatly from a dumb scheduler. Third, with sliced threads -- the focus of this post -- the best scheduling approach is to simply spread them across the cores and do nothing, so again, a dumb scheduler does the right thing.

Recently I tried multithreading x264's lookahead for a customer. The lookahead previously wasn't threaded, causing bottlenecks with many cores and threads. I do my development mainly on Windows, and there the patch looked to be quite a success, with nice performance boosts for many target use-cases. And then I ran it on Linux and it choked horribly.

The patch is here: https://github.com/DarkShikari/x264-devel/commit/99e830f1581eac3cf30f07b1d6c6c32bae1725c8 . To replicate the test, simply benchmark that version against the previous one.

My guess is that it chokes because it spawns even *shorter*-lived jobs than x264 typically does, something CFS seems to simply collapse on. Here are some stats from a recent kernel:

SD encoding (before -> after patch):
CFS: 325.49 +/- 1.22 fps -> 251.68 +/- 2.32 fps
BFS: 334.94 +/- 0.59 fps -> 344.47 +/- 0.68 fps

HD encoding (before -> after patch):
CFS: 39.05 +/- 0.22 fps -> 40.56 +/- 0.23 fps
BFS: 40.15 +/- 0.05 fps -> 44.89 +/- 0.05 fps

As can be seen, the longer the threads live (i.e. the lower the fps), the less horrific the penalty. Furthermore, though I don't have numbers, using schedtool -R -p 1 does basically as well as BFS at eliminating the problem. Naturally, this is not really a solution, as it requires root.
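If you'd rather request SCHED_RR from inside the process than via schedtool, the equivalent is roughly the sketch below. This is illustrative only, not x264 code, and it hits the same wall: it needs root (or CAP_SYS_NICE), which is exactly why it isn't a real fix.

    /* Minimal sketch of what "schedtool -R -p 1" does, done from
     * within the process itself.  Illustrative only; requires root
     * or CAP_SYS_NICE, which is the whole problem. */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param param = { .sched_priority = 1 };

        /* pid 0 = the caller; pthreads created after this inherit
         * the policy by default on Linux. */
        if (sched_setscheduler(0, SCHED_RR, &param) != 0) {
            perror("sched_setscheduler");
            return 1;
        }

        /* ... spawn the encoder threads / exec x264 from here ... */
        return 0;
    }

From the shell, schedtool -R -p 1 -e ./x264 <args> should accomplish the same thing.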
To replicate this test, a commandline like this should work on any cached raw input file (if you don't feel like making your own, a collection of free raw videos can be found at http://media.xiph.org/video/derf/ ):

./x264 --preset superfast --tune zerolatency --threads X input -o /dev/null

Most of my testing was done with X = 4 or 8 on quad-core or quad-core-with-HT machines. Though I don't have numbers, I should note that the absolute worst results were on a Core i7 machine with taskset 0xf and --threads 4: the patch literally cut the framerate in half, despite giving a performance boost when SCHED_RR was used.

Jason

P.S. Here is the structure of low-latency sliced threading with the patch, for anyone curious. To encode a frame with X threads:

1. The previous frame's X threads are still going -- they're finishing up hpel and deblock calculation, which can be done after the encoder returns but before the next frame is encoded.

2. Serially copy in the input frame and preprocess it, running various not-yet-parallelized tasks while those threads finish up.

3. In parallel, do lookahead analysis of the input frame: split it into X slices and analyze them simultaneously, then wait on the jobs and merge the results.

4. If any of the previous frame's threads are still going, wait on them now.

5. Split the main encode into X slices and run them simultaneously. Once they've all finished encoding -- after they've signaled the main thread with a condition variable -- serially finish up a few minor tasks and return to the caller. These threads continue until step 4 of the next frame.

When encoding at ~300fps, the encode jobs typically run for about 3ms each and the lookahead jobs for roughly 0.5ms. For a low-latency encoder, there is no way to make these threads last longer while still getting parallelism, and it must be doable, because Windows handles it just fine.

Jason
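P.P.S. For anyone who'd rather read code than prose, here is a rough, self-contained pthreads sketch of the fork/join shape described above. It is illustrative only -- real x264 keeps a persistent thread pool rather than spawning per job, and the slice work is stubbed out -- but it shows the short-job dispatch and the condition-variable signal back to the main thread.

    /* Rough sketch of the per-frame fork/join flow above.  Not x264
     * internals: threads are spawned per job (x264 pools them) and the
     * actual slice work is stubbed.  Build with: cc -pthread sketch.c */
    #include <pthread.h>
    #include <stdio.h>

    #define X 4                     /* slice count, as in --threads X */

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  all_done = PTHREAD_COND_INITIALIZER;
    static int jobs_left;

    static void *slice_job(void *arg)
    {
        (void)arg;  /* ~0.5ms of lookahead or ~3ms of encode goes here */

        pthread_mutex_lock(&lock);
        if (--jobs_left == 0)
            pthread_cond_signal(&all_done); /* last job wakes the main thread */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    /* Fork X short jobs, then block until every one has signaled back. */
    static void run_slices(void)
    {
        pthread_t tid[X];

        jobs_left = X;              /* no workers exist yet, so no race */
        for (int i = 0; i < X; i++)
            pthread_create(&tid[i], NULL, slice_job, NULL);

        pthread_mutex_lock(&lock);
        while (jobs_left > 0)
            pthread_cond_wait(&all_done, &lock);
        pthread_mutex_unlock(&lock);

        for (int i = 0; i < X; i++)
            pthread_join(tid[i], NULL);
    }

    int main(void)
    {
        for (int frame = 0; frame < 100; frame++) {
            /* step 2: serial input copy/preprocess happens here      */
            run_slices();  /* step 3: parallel lookahead analysis     */
            /* step 4: wait on previous frame's hpel/deblock here     */
            run_slices();  /* step 5: parallel slice encode           */
            /* serial finish-up, then return the frame to the caller  */
        }
        printf("done\n");
        return 0;
    }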