Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754532AbdCTLOq (ORCPT ); Mon, 20 Mar 2017 07:14:46 -0400 Received: from mga14.intel.com ([192.55.52.115]:43609 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754510AbdCTLOk (ORCPT ); Mon, 20 Mar 2017 07:14:40 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.36,193,1486454400"; d="scan'208";a="78937469" Subject: Re: Perf regression after enabling nvme autonomous power state transitions To: Andy Lutomirski References: <770cf82e-d966-19cc-f05a-f8150cc6866a@linux.intel.com> Cc: Andy Lutomirski , Jens Axboe , Christoph Hellwig , LKML , Chris Wilson From: Tvrtko Ursulin Message-ID: Date: Mon, 20 Mar 2017 11:14:11 +0000 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3944 Lines: 97 On 17/03/2017 15:26, Andy Lutomirski wrote: > On Fri, Mar 17, 2017 at 3:58 AM, Tvrtko Ursulin > wrote: >> >> Hi Andy, all, >> >> I have bisected and verified an interesting performance regression caused by >> commit c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf "nvme: Enable autonomous >> power state transitions". >> >> Having that patch or not accounts for approx. 3% perf difference in a test >> which is, and this is the best part, not even i/o bound by any stretch of >> the imagination. > > Weird. Is there something in the workload that is periodically > reading or otherwise doing a synchronous IO? APST shouldn't affect > throughput in any meaningful way, but it certainly affects latency of > sporadic IO. Not as far as I can see. All output files are opened in a most basic way and there are no sync sycalls of any sort in the trace. Reads are all coming from page cache. I did look more closely at the I/O stats (I use "vmstat 1" for this), and it looks that there is some I/O happening during the runtime. I was surprised by that since the total data written out is only ~14MiB as I said. (And those are the only writes the test does, apart from stdout/err at beginning and end.) I tried increasing dirty_*_bytes values to some huge ones to try to eliminate any potential background or direct writeback during the test run but that did not seem to help. I could still see a tiny bit of I/O, few KiB here and there occasionally, and then a large 14MiB dump at the end of a test run. Any ideas on how to figure out who is exactly doing these writes and why? >> The test is multi-process with overall medium CPU usage and high GPU (Intel) >> usage. Average runtime is around 13 seconds during which it writes out >> around 14MiB of data. >> >> It does so in chunks during the whole runtime but doesn't do anything >> special, just normal O_RDWR | O_CREAT | O_TRUNC so in practice this is all >> written to the device only the end of the test run in one chunk. Via the >> background write out I suspect. >> >> The 3% mentioned earlier translates to approx. 430ms longer average runtime >> with the above patch. > > That is surprisingly long. If your device is telling the truth about > its latencies, this should be literally impossible. > >> ps 0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0 >> rwt:0 rwl:0 idle_power:- active_power:- > > This is the normal on state. > >> ps 1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1 >> rwt:1 rwl:1 idle_power:- active_power:- >> ps 2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2 >> rwt:2 rwl:2 idle_power:- active_power:- > > These won't be used at all. > >> ps 3 : mp:0.0700W non-operational enlat:10000 exlat:300 rrt:3 rrl:3 >> rwt:3 rwl:3 idle_power:- active_power:- >> ps 4 : mp:0.0050W non-operational enlat:2000 exlat:10000 rrt:4 rrl:4 >> rwt:4 rwl:4 idle_power:- active_power:- > > These two have their latencies specified in microseconds and will be > used autonomously. > > Can you re-test with > /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us set to the > following values and report back? > > 0 - will turn APST off entirely > 1 - will leave APST on but will not program any transitions > 11000 - should allow state 3 but not state 4 11000 seems to be enough to bring the results back to expected. 1 and 0 seem to not have any further effect. I've also tried making the test output to /dev/null and that had the same beneficial result, even with the default "25000" setting for pm_qos_latency_tolerance. So I suspected unexpected direct write back was happening. But as I wrote above increasing the dirty_bytes (or dirty_background_bytes, or the expire controls) did not help. I used 200Mb and 20s for all the fields, which I assumed would have caused everything to go quiet during the test run, but it hadn't. Regards, Tvrtko