Subject: Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
From: John Hubbard
To: Dave Chinner, Jan Kara
CC: Christoph Hellwig, Matthew Wilcox, Michal Hocko, Christopher Lameter,
    Jason Gunthorpe, Dan Williams, Andrew Morton, LKML, linux-rdma
Date: Tue, 6 Nov 2018 22:36:30 -0800
In-Reply-To: <20181106204149.GV6311@dastard>
References: <20181012060014.10242-1-jhubbard@nvidia.com>
 <20181012060014.10242-5-jhubbard@nvidia.com>
 <20181013035516.GA18822@dastard>
 <7c2e3b54-0b1d-6726-a508-804ef8620cfd@nvidia.com>
 <20181013164740.GA6593@infradead.org>
 <84811b54-60bf-2bc3-a58d-6a7925c24aad@nvidia.com>
 <20181105095447.GE6953@quack2.suse.cz>
 <20181106024715.GU6311@dastard>
 <20181106110006.GE25414@quack2.suse.cz>
 <20181106204149.GV6311@dastard>

On 11/6/18 12:41 PM, Dave Chinner wrote:
> On Tue, Nov 06, 2018 at 12:00:06PM +0100, Jan Kara wrote:
>> On Tue 06-11-18 13:47:15, Dave Chinner wrote:
>>> On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
>>>> On 11/5/18 1:54 AM, Jan Kara wrote:
>>>>> Hmm, have you tried larger buffer sizes? Because synchronous 8k IO
>>>>> isn't going to max out NVMe IOPS by far. Can I suggest you install
>>>>> fio [1] (it has the advantage that it is pretty much the standard
>>>>> for a test like this, so everyone knows at a glance what the test
>>>>> does) and run something like the following workfile with it:
>>>>>
>>>>> [reader]
>>>>> direct=1
>>>>> ioengine=libaio
>>>>> blocksize=4096
>>>>> size=1g
>>>>> numjobs=1
>>>>> rw=read
>>>>> iodepth=64
>>>>>
>>>>> And see how the numbers with and without your patches compare?
>>>>>
>>>>>								Honza
>>>>>
>>>>> [1] https://github.com/axboe/fio
>>>>
>>>> That program is *very* good to have. Whew. Anyway, it looks like read
>>>> bandwidth is approximately 74 MiB/s with my patch (it varies a bit
>>>> from run to run), as compared to around 85 MiB/s without the patch,
>>>> so still showing about a 20% performance degradation, assuming I'm
>>>> reading this correctly.
>>>>
>>>> Raw data follows, using the fio options you listed above:
>>>>
>>>> Baseline (without my patch):
>>>> ----------------------------
>>> ....
>>>>     lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
>>>>     clat percentiles (usec):
>>>>      |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
>>>>      | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
>>>>      | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
>>>>      | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
>>>>      | 99.99th=[12256]
>>> .....
>>>> Modified (with my patch):
>>>> ----------------------------
>>> .....
>>>>     lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
>>>>     clat percentiles (usec):
>>>>      |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
>>>>      | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
>>>>      | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
>>>>      | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
>>>>      | 99.99th=[14484]
>>>
>>> So it's adding at least 500us of completion latency to every IO?
>>> I'd argue that the IO latency impact is far worse than a 20%
>>> throughput drop.
>>
>> Hum, right. So for each IO we have to remove the page from the LRU on
>> submit
>
> Which costs us less than 10us on average:
>
>    slat (usec): min=13, max=3855, avg=44.17, stdev=61.18
> vs
>    slat (usec): min=18, max=4378, avg=52.59, stdev=63.66
>
>> and then put it back on IO completion (which is going to race with
>> new submits, so LRU lock contention might be an issue).
>
> Removal has to take the same LRU lock, so I don't think contention
> is the problem here. More likely the overhead is in selecting the
> LRU to put it on, e.g. list_lru_from_kmem(), which may well be doing
> a memcg lookup.
>
>> Spending 500 us on that is not unthinkable when the lock is contended,
>> but it is more expensive than I'd have thought. John, could you
>> perhaps profile where the time is spent?

OK, some updates on that:

1. First of all, I fixed a direct-IO-related call site (it was still
calling put_page() instead of put_user_page()), and that not only got
rid of a problem, it also changed performance: it makes the impact of
the patch a bit smaller. (Sorry about that! I was hoping that I could
get away with temporarily ignoring that failure, but no.) The bandwidth
numbers in particular look much closer to each other.
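
For clarity, the shape of that call-site fix is roughly as follows.
This is an illustrative sketch only, not the exact diff: the helper
name and loop here are made up, but the point is that pages obtained
via get_user_pages*() and released on the direct IO path must go
through put_user_page() (the call introduced by this series) rather
than plain put_page(), so the pin accounting stays balanced:

#include <linux/mm.h>

/* Hypothetical helper, for illustration only. */
static void dio_release_user_pages(struct page **pages, int npages)
{
	int i;

	for (i = 0; i < npages; i++)
		put_user_page(pages[i]);	/* was: put_page(pages[i]); */
}
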
2. Second, note that these fio results are noisy. The standard deviation
is large enough that some of this could be noise. To highlight that, I
did 5 runs each with and without the patch, and while there is
definitely a performance drop on average, there is also overlap in the
results. In other words, the best "with patch" run is about the same as
the worst "without patch" run.

3. Finally, initial profiling shows that we're adding about 1% total to
this particular test run...I think. I may have to narrow this down some
more, but I don't seem to see any real lock contention. Hints or ideas
on measurement methods are welcome, btw.

   -- 0.59% in put_user_page

   -- 0.2% (or 0.54%, depending on how you read the perf output below)
      in get_user_pages_fast:

     1.36%--iov_iter_get_pages
            |
             --1.27%--get_user_pages_fast
                       |
                        --0.54%--pin_page_for_dma

     0.59%--put_user_page

     1.34%  0.03%  fio  [kernel.vmlinux]  [k] _raw_spin_lock
     0.95%  0.55%  fio  [kernel.vmlinux]  [k] do_raw_spin_lock
     0.17%  0.03%  fio  [kernel.vmlinux]  [k] isolate_lru_page
     0.06%  0.00%  fio  [kernel.vmlinux]  [k] putback_lru_page
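
To make the profile above concrete, here is a minimal sketch of the
pin/unpin pairing those symbols correspond to. The assumptions: the
helper names and exact placement are illustrative rather than the
actual patch, and isolate_lru_page()/putback_lru_page() are the stock
kernel primitives; both acquire the zone lru_lock, which is where the
_raw_spin_lock/do_raw_spin_lock samples come from:

#include <linux/mm.h>
#include <linux/swap.h>

/* Sketch only: pin side, one lru_lock acquisition per page at submit. */
static int sketch_pin_page_for_dma(struct page *page)
{
	/* Take the page off the LRU so reclaim skips DMA-pinned pages. */
	return isolate_lru_page(page);	/* 0 on success, -EBUSY if it lost a race */
}

/* Sketch only: unpin side, a second lru_lock acquisition at completion. */
static void sketch_put_user_page(struct page *page)
{
	/*
	 * Put the page back on the appropriate LRU list; selecting that
	 * list may involve a memcg lookup, per Dave's comment above.
	 */
	putback_lru_page(page);
	put_page(page);		/* drop the get_user_pages() reference */
}
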
4. Here's some raw fio data: one run without the patch, and two with
the patch:

------------------------------------------------------
WITHOUT the patch:
------------------------------------------------------
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov 6 20:18:06 2018
   read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)
    slat (usec): min=25, max=4870, avg=68.19, stdev=85.21
    clat (usec): min=74, max=19814, avg=4525.40, stdev=1844.03
     lat (usec): min=183, max=19927, avg=4593.69, stdev=1866.65
    clat percentiles (usec):
     |  1.00th=[ 3687],  5.00th=[ 3720], 10.00th=[ 3720], 20.00th=[ 3752],
     | 30.00th=[ 3752], 40.00th=[ 3752], 50.00th=[ 3752], 60.00th=[ 3785],
     | 70.00th=[ 4178], 80.00th=[ 4490], 90.00th=[ 6652], 95.00th=[ 8225],
     | 99.00th=[13173], 99.50th=[14353], 99.90th=[16581], 99.95th=[17171],
     | 99.99th=[18220]
   bw (  KiB/s): min=49920, max=59320, per=100.00%, avg=55742.24, stdev=2224.20, samples=37
   iops        : min=12480, max=14830, avg=13935.35, stdev=556.05, samples=37
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=68.78%, 10=28.14%, 20=3.08%
  cpu          : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=54.4MiB/s (57.0MB/s), 54.4MiB/s-54.4MiB/s (57.0MB/s-57.0MB/s), io=1024MiB (1074MB), run=18826-18826msec

Disk stats (read/write):
  nvme0n1: ios=259490/1, merge=0/0, ticks=14822/0, in_queue=19241, util=100.00%

------------------------------------------------------
With the patch applied:
------------------------------------------------------
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=51.2MiB/s,w=0KiB/s][r=13.1k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=2568: Tue Nov 6 20:03:50 2018
   read: IOPS=12.8k, BW=50.1MiB/s (52.5MB/s)(1024MiB/20453msec)
    slat (usec): min=33, max=4365, avg=74.05, stdev=85.79
    clat (usec): min=39, max=19818, avg=4916.61, stdev=1961.79
     lat (usec): min=100, max=20002, avg=4990.78, stdev=1985.23
    clat percentiles (usec):
     |  1.00th=[ 4047],  5.00th=[ 4080], 10.00th=[ 4080], 20.00th=[ 4080],
     | 30.00th=[ 4113], 40.00th=[ 4113], 50.00th=[ 4113], 60.00th=[ 4146],
     | 70.00th=[ 4178], 80.00th=[ 4817], 90.00th=[ 7308], 95.00th=[ 8717],
     | 99.00th=[14091], 99.50th=[15270], 99.90th=[17433], 99.95th=[18220],
     | 99.99th=[19006]
   bw (  KiB/s): min=45370, max=55784, per=100.00%, avg=51332.33, stdev=1843.77, samples=40
   iops        : min=11342, max=13946, avg=12832.83, stdev=460.92, samples=40
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=96.44%, 20=3.53%
  cpu          : usr=2.91%, sys=95.18%, ctx=398, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=50.1MiB/s (52.5MB/s), 50.1MiB/s-50.1MiB/s (52.5MB/s-52.5MB/s), io=1024MiB (1074MB), run=20453-20453msec

Disk stats (read/write):
  nvme0n1: ios=261399/0, merge=0/0, ticks=16019/0, in_queue=20910, util=100.00%

------------------------------------------------------
OR, here's a better run WITH the patch applied, and you can see that
this is nearly as good as the "without" case:
------------------------------------------------------
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov 6 20:01:33 2018
   read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)
    slat (usec): min=30, max=12458, avg=69.71, stdev=88.01
    clat (usec): min=39, max=25590, avg=4687.42, stdev=1925.29
     lat (usec): min=97, max=25704, avg=4757.25, stdev=1946.06
    clat percentiles (usec):
     |  1.00th=[ 3884],  5.00th=[ 3884], 10.00th=[ 3916], 20.00th=[ 3916],
     | 30.00th=[ 3916], 40.00th=[ 3916], 50.00th=[ 3949], 60.00th=[ 3949],
     | 70.00th=[ 3982], 80.00th=[ 4555], 90.00th=[ 6915], 95.00th=[ 8848],
     | 99.00th=[13566], 99.50th=[14877], 99.90th=[16909], 99.95th=[17695],
     | 99.99th=[24249]
   bw (  KiB/s): min=48905, max=58016, per=100.00%, avg=53855.79, stdev=2115.03, samples=38
   iops        : min=12226, max=14504, avg=13463.79, stdev=528.76, samples=38
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=71.80%, 10=24.66%, 20=3.51%, 50=0.02%
  cpu          : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=52.5MiB/s (55.1MB/s), 52.5MiB/s-52.5MiB/s (55.1MB/s-55.1MB/s), io=1024MiB (1074MB), run=19499-19499msec

Disk stats (read/write):
  nvme0n1: ios=260720/0, merge=0/0, ticks=15036/0, in_queue=19876, util=100.00%

thanks,
-- 
John Hubbard
NVIDIA