To: Andy Lutomirski
Cc: Jens Axboe, Christoph Hellwig, LKML, Chris Wilson
From: Tvrtko Ursulin
Subject: Perf regression after enabling nvme autonomous power state transitions
Message-ID: <770cf82e-d966-19cc-f05a-f8150cc6866a@linux.intel.com>
Date: Fri, 17 Mar 2017 10:58:22 +0000
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Andy, all,

I have bisected and verified an interesting performance regression
caused by commit c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf "nvme: Enable
autonomous power state transitions".

Having that patch or not accounts for approx. 3% perf difference in a
test which is, and this is the best part, not even I/O bound by any
stretch of the imagination.

The test is multi-process with overall medium CPU usage and high GPU
(Intel) usage. Average runtime is around 13 seconds, during which it
writes out around 14MiB of data.

It does so in chunks during the whole runtime but doesn't do anything
special, just normal O_RDWR | O_CREAT | O_TRUNC, so in practice this is
all written to the device only at the end of the test run, in one chunk.
Via the background writeout, I suspect.

The 3% mentioned earlier translates to approx. 430ms longer average
runtime with the above patch.

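As a quick sanity check on the figures above, the following minimal Python sketch recomputes the relative slowdown and the average write rate. It assumes the quoted 13 seconds is the baseline runtime without the patch, which the report does not state explicitly; the variable and function names are illustrative only.

```python
# Approximate figures quoted in the report; all are rounded by the author.
AVG_RUNTIME_S = 13.0   # average runtime (assumed here to be the baseline)
SLOWDOWN_MS = 430.0    # extra average runtime with the patch applied
DATA_WRITTEN_MIB = 14.0

def relative_slowdown_pct(runtime_s, extra_ms):
    """Extra runtime as a percentage of the given runtime."""
    return (extra_ms / 1000.0) / runtime_s * 100.0

def avg_write_rate_mib_s(data_mib, runtime_s):
    """Average write rate if the data were spread evenly over the run."""
    return data_mib / runtime_s

slowdown_pct = relative_slowdown_pct(AVG_RUNTIME_S, SLOWDOWN_MS)
write_rate = avg_write_rate_mib_s(DATA_WRITTEN_MIB, AVG_RUNTIME_S)

print(f"slowdown: {slowdown_pct:.1f}%")
print(f"average write rate: {write_rate:.2f} MiB/s")
```

Under that assumption, 430ms on 13 seconds is consistent with the "approx. 3%" figure, and an average of roughly 1 MiB/s is far below what an NVMe device can sustain, which matches the claim that the test is not I/O bound.
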
NVMe storage in question:

NVME Identify Controller:
vid     : 0x8086
ssvid   : 0x8086
sn      : BTPY70130HEB256D
mn      : INTEL SSDPEKKW256G7
fr      : PSF109C
rab     : 6
ieee    : 5cd2e4
cmic    : 0
mdts    : 5
cntlid  : 1
ver     : 10200
rtd3r   : 249f0
rtd3e   : 13880
oaes    : 0
oacs    : 0x6
acl     : 4
aerl    : 7
frmw    : 0x12
lpa     : 0x3
elpe    : 63
npss    : 4
avscc   : 0
apsta   : 0x1
wctemp  : 343
cctemp  : 353
mtfa    : 20
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x1e
fuses   : 0
fna     : 0x4
vwc     : 0x1
awun    : 0
awupf   : 0
nvscc   : 0
acwu    : 0
sgls    : 0
ps    0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0700W non-operational enlat:10000 exlat:300 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:10000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

I see there are some latencies (unit?) mentioned here, but as the test
does not appear to be blocking on IO I am confused as to why this patch
would be causing this. Nevertheless the regression is 100% repeatable.

Any ideas on what could be causing this and if there is something else
to check or look at?

Regards,

Tvrtko
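
On the "(unit?)" question: the NVMe specification defines the Entry Latency (enlat) and Exit Latency (exlat) fields of a power state descriptor in microseconds. The following is a minimal Python sketch that parses the power state lines from the identify output above and prints the round-trip latency of each state in milliseconds. The regular expression and the helper name `parse_power_states` are illustrative assumptions, not part of nvme-cli or the kernel, and the sketch says nothing about how the driver chooses its idle timeouts.

```python
import re

# Power state lines copied from the identify output in this report
# (trailing rrt/rrl/rwt/rwl fields omitted for brevity).
PS_LINES = """\
ps 0 : mp:9.00W operational enlat:5 exlat:5
ps 1 : mp:4.60W operational enlat:30 exlat:30
ps 2 : mp:3.80W operational enlat:30 exlat:30
ps 3 : mp:0.0700W non-operational enlat:10000 exlat:300
ps 4 : mp:0.0050W non-operational enlat:2000 exlat:10000
"""

# Hypothetical pattern matching the line layout shown above.
PS_RE = re.compile(
    r"ps\s+(\d+)\s*:\s*mp:([\d.]+)W\s+(non-operational|operational)"
    r"\s+enlat:(\d+)\s+exlat:(\d+)"
)

def parse_power_states(text):
    """Return one dict per power state found in the text."""
    states = []
    for m in PS_RE.finditer(text):
        idx, mp, kind, enlat, exlat = m.groups()
        states.append({
            "ps": int(idx),
            "max_power_w": float(mp),
            "operational": kind == "operational",
            "enlat_us": int(enlat),   # microseconds, per the NVMe spec
            "exlat_us": int(exlat),   # microseconds, per the NVMe spec
        })
    return states

states = parse_power_states(PS_LINES)
for s in states:
    round_trip_ms = (s["enlat_us"] + s["exlat_us"]) / 1000.0
    kind = "operational" if s["operational"] else "non-operational"
    print(f"ps {s['ps']}: {s['max_power_w']}W {kind}, "
          f"enter+exit = {round_trip_ms} ms")
```

Read this way, the two non-operational states cost on the order of 10 ms to enter and leave (10.3 ms for ps 3, 12 ms for ps 4), compared with well under 0.1 ms for the operational states.
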