To: Andy Lutomirski <luto@kernel.org>
Cc: Jens Axboe <axboe@fb.com>, Christoph Hellwig <hch@lst.de>,
        LKML <linux-kernel@vger.kernel.org>,
        Chris Wilson <chris@chris-wilson.co.uk>
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Subject: Perf regression after enabling nvme autonomous power state
 transitions
Message-ID: <770cf82e-d966-19cc-f05a-f8150cc6866a@linux.intel.com>
Date: Fri, 17 Mar 2017 10:58:22 +0000
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.8.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2463
Lines: 90


Hi Andy, all,

I have bisected and verified an interesting performance regression 
caused by commit c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf "nvme: Enable 
autonomous power state transitions".

Having that patch applied or not accounts for an approx. 3% perf
difference in a test which is, and this is the best part, not I/O bound
by any stretch of the imagination.

The test is multi-process with overall medium CPU usage and high GPU 
(Intel) usage. Average runtime is around 13 seconds during which it 
writes out around 14MiB of data.

It does so in chunks during the whole runtime but doesn't do anything
special, just a normal open with O_RDWR | O_CREAT | O_TRUNC, so in
practice this is all written to the device only at the end of the test
run, in one chunk. Via the background writeout, I suspect.
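
For reference, the write pattern described above can be sketched roughly
as follows. This is only an illustration, not the actual test code: the
helper name write_in_chunks, the 64KiB chunk size and the file name are
made up for the example. The point is that there is no fsync and no
O_SYNC, so the data is left to the normal background writeout.

```python
import os
import tempfile


def write_in_chunks(path, total_bytes, chunk_bytes):
    """Write total_bytes to path in chunk_bytes pieces, without fsync/O_SYNC."""
    # Same open flags as mentioned in the report; 0o644 is an assumed mode.
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    written = 0
    try:
        chunk = b"\0" * chunk_bytes
        while written < total_bytes:
            # The last chunk is trimmed so we never exceed total_bytes.
            written += os.write(fd, chunk[: total_bytes - written])
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        # Around 14MiB in total, as in the test; chunk size is hypothetical.
        print(write_in_chunks(os.path.join(d, "out.bin"), 14 * 1024 * 1024, 64 * 1024))
```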

The 3% mentioned earlier translates to approx. 430ms longer average 
runtime with the above patch.
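
As a quick sanity check that the two figures are consistent, a small
sketch (assuming the ~13 second average is the baseline; both inputs are
approximate, and the function name is just for illustration):

```python
# Both values are the approximate averages quoted in this mail.
avg_runtime_s = 13.0  # "around 13 seconds", assumed to be the baseline
slowdown_s = 0.430    # "approx. 430ms longer"


def relative_slowdown(extra_s, baseline_s):
    """Return the slowdown as a percentage of the baseline runtime."""
    return 100.0 * extra_s / baseline_s


print(f"{relative_slowdown(slowdown_s, avg_runtime_s):.1f}%")  # prints 3.3%
```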

NVMe storage in question:

NVME Identify Controller:
vid     : 0x8086
ssvid   : 0x8086
sn      : BTPY70130HEB256D
mn      : INTEL SSDPEKKW256G7
fr      :  PSF109C
rab     : 6
ieee    : 5cd2e4
cmic    : 0
mdts    : 5
cntlid  : 1
ver     : 10200
rtd3r   : 249f0
rtd3e   : 13880
oaes    : 0
oacs    : 0x6
acl     : 4
aerl    : 7
frmw    : 0x12
lpa     : 0x3
elpe    : 63
npss    : 4
avscc   : 0
apsta   : 0x1
wctemp  : 343
cctemp  : 353
mtfa    : 20
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x1e
fuses   : 0
fna     : 0x4
vwc     : 0x1
awun    : 0
awupf   : 0
nvscc   : 0
acwu    : 0
sgls    : 0
ps    0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0
           rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1
           rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2
           rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0700W non-operational enlat:10000 exlat:300 rrt:3 rrl:3
           rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:10000 rrt:4 rrl:4
           rwt:4 rwl:4 idle_power:- active_power:-

I see there are some latencies (unit?) mentioned here, but as the test
does not appear to be blocking on IO I am confused as to why this patch
would be causing this. Nevertheless the regression is 100% repeatable.
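
To put the latencies in perspective, here is a small sketch which parses
the power state lines from the identify output above. It assumes the
enlat/exlat fields are in microseconds, which is how I read the NVMe
specification; the helper name and the dictionary keys are made up for
the example.

```python
import re

# First line of each power state entry, copied from the output above.
PS_LINES = """\
ps    0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0
ps    1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1
ps    2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2
ps    3 : mp:0.0700W non-operational enlat:10000 exlat:300 rrt:3 rrl:3
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:10000 rrt:4 rrl:4
"""

PS_RE = re.compile(
    r"ps\s+(\d+)\s*:\s*mp:([\d.]+)W\s+(non-operational|operational)"
    r"\s+enlat:(\d+)\s+exlat:(\d+)"
)


def parse_power_states(text):
    """Return one dict per power state found in text."""
    states = []
    for m in PS_RE.finditer(text):
        ps, mp, kind, enlat, exlat = m.groups()
        states.append({
            "ps": int(ps),
            "max_power_w": float(mp),
            "operational": kind == "operational",
            "enlat_us": int(enlat),  # assumed to be microseconds
            "exlat_us": int(exlat),  # assumed to be microseconds
        })
    return states


for s in parse_power_states(PS_LINES):
    # Entry plus exit latency, converted from microseconds to milliseconds.
    round_trip_ms = (s["enlat_us"] + s["exlat_us"]) / 1000.0
    kind = "operational" if s["operational"] else "non-operational"
    print(f"ps {s['ps']}: {kind}, enter+exit = {round_trip_ms:.2f} ms")
```

Under that assumption, entering and leaving the two non-operational
states costs about 10.3 ms (ps 3) and 12 ms (ps 4), while the
operational states stay well below 0.1 ms.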

Any ideas on what could be causing this and if there is something else 
to check or look at?

Regards,

Tvrtko