From: Andy Lutomirski
Date: Mon, 20 Mar 2017 14:23:27 -0700
Subject: Re: Perf regression after enabling nvme autonomous power state transitions
To: Tvrtko Ursulin
Cc: Andy Lutomirski, Jens Axboe, Christoph Hellwig, LKML, Chris Wilson

On Mon, Mar 20, 2017 at 4:14 AM, Tvrtko Ursulin wrote:
>
> On 17/03/2017 15:26, Andy Lutomirski wrote:
>>
>> On Fri, Mar 17, 2017 at 3:58 AM, Tvrtko Ursulin wrote:
>>>
>>> Hi Andy, all,
>>>
>>> I have bisected and verified an interesting performance regression
>>> caused by commit c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf "nvme:
>>> Enable autonomous power state transitions".
>>>
>>> Having that patch or not accounts for approx. 3% perf difference in
>>> a test which is, and this is the best part, not even I/O bound by
>>> any stretch of the imagination.
>>
>> Weird. Is there something in the workload that is periodically
>> reading or otherwise doing synchronous I/O? APST shouldn't affect
>> throughput in any meaningful way, but it certainly affects the
>> latency of sporadic I/O.
>
> Not as far as I can see. All output files are opened in the most basic
> way and there are no sync syscalls of any sort in the trace. Reads are
> all coming from the page cache.
>
> I did look more closely at the I/O stats (I use "vmstat 1" for this),
> and it looks like there is some I/O happening during the runtime. I
> was surprised by that since the total data written out is only ~14MiB,
> as I said. (And those are the only writes the test does, apart from
> stdout/err at the beginning and end.)
>
> I tried increasing the dirty_*_bytes values to some huge ones to try
> to eliminate any potential background or direct writeback during the
> test run, but that did not seem to help. I could still see a tiny bit
> of I/O, a few KiB here and there occasionally, and then a large 14MiB
> dump at the end of a test run.
>
> Any ideas on how to figure out who exactly is doing these writes and
> why?

You could try playing with iosnoop or blktrace.
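Something along these lines (untested, and assuming the device is
/dev/nvme0n1 -- adjust to match your system) will attribute each
request to a process:

  blktrace -d /dev/nvme0n1 -o - | blkparse -i -

blkparse prints the command name and PID for each queued request, so
the stray writes during the run and the 14MiB dump at the end should
show up with their originator. iosnoop (from the perf-tools
collection) gives a similar one-line-per-I/O summary if you prefer
that.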
>
>>> The test is multi-process with overall medium CPU usage and high
>>> GPU (Intel) usage. Average runtime is around 13 seconds, during
>>> which it writes out around 14MiB of data.
>>>
>>> It does so in chunks during the whole runtime but doesn't do
>>> anything special, just normal O_RDWR | O_CREAT | O_TRUNC, so in
>>> practice this is all written to the device only at the end of the
>>> test run, in one chunk, via background writeout I suspect.
>>>
>>> The 3% mentioned earlier translates to approx. 430ms longer average
>>> runtime with the above patch.
>>
>> That is surprisingly long. If your device is telling the truth about
>> its latencies, this should be literally impossible.
>>
>>> ps 0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0
>>>        rwt:0 rwl:0 idle_power:- active_power:-
>>
>> This is the normal on state.
>>
>>> ps 1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1
>>>        rwt:1 rwl:1 idle_power:- active_power:-
>>> ps 2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2
>>>        rwt:2 rwl:2 idle_power:- active_power:-
>>
>> These won't be used at all.
>>
>>> ps 3 : mp:0.0700W non-operational enlat:10000 exlat:300 rrt:3 rrl:3
>>>        rwt:3 rwl:3 idle_power:- active_power:-
>>> ps 4 : mp:0.0050W non-operational enlat:2000 exlat:10000 rrt:4 rrl:4
>>>        rwt:4 rwl:4 idle_power:- active_power:-
>>
>> These two have their latencies specified in microseconds and will be
>> used autonomously.
>>
>> Can you re-test with
>> /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us set to the
>> following values and report back?
>>
>> 0 - will turn APST off entirely
>> 1 - will leave APST on but will not program any transitions
>> 11000 - should allow state 3 but not state 4
>
> 11000 seems to be enough to bring the results back to expected. 1 and
> 0 seem to not have any further effect.
>
> I've also tried making the test output to /dev/null and that had the
> same beneficial result, even with the default "25000" setting for
> pm_qos_latency_tolerance.

My best guess is that your disk lies and the latency in ps 4 is just
extraordinarily high.

--Andy
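P.S. In case it helps make sense of the 11000 number: roughly speaking
(this is the gist, not the exact kernel logic), a non-operational state
is only programmed for autonomous transitions when its entry plus exit
latency fits within the tolerance, so with your table:

  ps 3: enlat 10000 + exlat   300 = 10300 us <= 11000 -> allowed
  ps 4: enlat  2000 + exlat 10000 = 12000 us >  11000 -> left out

With the default tolerance of 25000 both states fit, which is why the
default setting shows the regression.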