Received: by 2002:a25:ab43:0:0:0:0:0 with SMTP id u61csp3703289ybi; Mon, 10 Jun 2019 15:14:35 -0700 (PDT) X-Google-Smtp-Source: APXvYqwtsPQQVg+0nF/aDVCsfHm8x9Q3mQbbymAkVnURufzA+d7dsVl47tI5HL06ThhPmH3Y1Q6f X-Received: by 2002:a17:902:25ab:: with SMTP id y40mr16305353pla.268.1560204875808; Mon, 10 Jun 2019 15:14:35 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1560204875; cv=none; d=google.com; s=arc-20160816; b=Fv7LIWtap9fLZ+ZwkLy6bNJVzgxjd2Z9nZadnGpJ20xncch1iMRtzkz/h2/GJn+bJ4 cxNoqUZKXlgxRpUlfgOqdNTPnsxZH0WanrAuoVVsTuwpnvwUYurGSnsUzeBv5Wqtfq2j v42rYb4dJTAnMsOstOl8cXKU/heyF4PjxnKXnQ6ze5976fzp1V0ouCtwacSfcBQL6BTG L9lOj0V+Hv20zPzpjgdo3PlhbJAop6SMGCHYCSlzvenBvjB2ZgdO4Ang4eMW+nsjRTzj agEoWqDBTH28wVE6TuKeiYHD9ZIbLFFsjtbsu7M9QadqK3vkc+6dc2V29eKaDjLCVjPf rh9w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject; bh=J4p3khd7sk1xNmeAE1WQh8oD4tp7xwoBakgp3C9KmoE=; b=bwpDZU3115z4CkWvcfC01y/dt3OIc4I2rC3DjD+PjcT90nEnVmL9ds4dJqTxTvAyKn O8q6hETR7PdXxsPAZMYvaTD7E/Zz0akzreGflySmt6BCVdNqEFo8C0aTyr2spIlMA1EM QQEKmrOIUwRBZTptOp1yi0BjkYW0wZSgrRKwFKD+g8emnVsOil5+To830eNxHpVL8hl5 HIDPPtcnoVNc90ndAv32C+rL513SSsnTtOvv9acCv+2xy1oEuJyegDiGR0D3PXWuRJEj 9MnaA1Laipqzo+rT41qMFUYYvamG8txwIPEyW7yY6KIgtyQblAAk9jCcBzcBF5wE4+FT 9zFg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-nfs-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-nfs-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id m2si10357545pgv.89.2019.06.10.15.14.18; Mon, 10 Jun 2019 15:14:35 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-nfs-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-nfs-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-nfs-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2388328AbfFJWN7 (ORCPT + 99 others); Mon, 10 Jun 2019 18:13:59 -0400 Received: from p3plsmtpa06-09.prod.phx3.secureserver.net ([173.201.192.110]:40753 "EHLO p3plsmtpa06-09.prod.phx3.secureserver.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387661AbfFJWN7 (ORCPT ); Mon, 10 Jun 2019 18:13:59 -0400 Received: from [192.168.0.67] ([24.218.182.144]) by :SMTPAUTH: with ESMTPSA id aSY9h3KkkXGbMaSY9hY9zB; Mon, 10 Jun 2019 15:13:58 -0700 Subject: Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash To: Chuck Lever Cc: linux-rdma@vger.kernel.org, Linux NFS Mailing List References: <20190605121518.2150.26479.stgit@klimt.1015granger.net> <9E0019E1-1C1B-465C-B2BF-76372029ABD8@talpey.com> <955993A4-0626-4819-BC6F-306A50E2E048@oracle.com> <4b05cdf7-2c2d-366f-3a29-1034bfec2941@talpey.com> <721DF459-ECAE-4FDD-A016-AFB193BA1C65@oracle.com> <12611B6E-39F0-491E-A357-AEA290F75C25@oracle.com> From: Tom Talpey Message-ID: Date: Mon, 10 Jun 2019 18:13:57 -0400 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101 Thunderbird/60.7.0 MIME-Version: 1.0 In-Reply-To: <12611B6E-39F0-491E-A357-AEA290F75C25@oracle.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-CMAE-Envelope: MS4wfBkhUIRHyKzsmp1NSzgAf6tORQDCvEmCJnP+5RBRl7HTH+Acsr7X//a4o4dfKfABtB+mL3BGeAMkq56ve/EWDfrMKQA+YvEyOPTs90k4Q57PQZ0ncZbA AEkc0Twtg/9MjXAZyp2ONnXnoPZOqytsR/U6Fe/aBXJm5TpwSiEQl4XWFzN9lhM7gGmbKxUGS8VprskH7fog0glwFKhMQY2E3Q+qggIlwFPvCXyuAHJJt86j j/puNDFsW4uqtf+TvKsXTA== Sender: linux-nfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-nfs@vger.kernel.org On 6/10/2019 5:57 PM, Chuck Lever wrote: > > >> On Jun 10, 2019, at 3:14 PM, Tom Talpey wrote: >> >> On 6/10/2019 1:50 PM, Chuck Lever wrote: >>> Hi Tom- >>>> On Jun 10, 2019, at 10:50 AM, Tom Talpey wrote: >>>> >>>> On 6/5/2019 1:25 PM, Chuck Lever wrote: >>>>> Hi Tom- >>>>>> On Jun 5, 2019, at 12:43 PM, Tom Talpey wrote: >>>>>> >>>>>> On 6/5/2019 8:15 AM, Chuck Lever wrote: >>>>>>> The DRC is not working at all after an RPC/RDMA transport reconnect. >>>>>>> The problem is that the new connection uses a different source port, >>>>>>> which defeats DRC hash. >>>>>>> >>>>>>> An NFS/RDMA client's source port is meaningless for RDMA transports. >>>>>>> The transport layer typically sets the source port value on the >>>>>>> connection to a random ephemeral port. The server already ignores it >>>>>>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore >>>>>>> client's source port on RDMA transports"). >>>>>> >>>>>> Where does the entropy come from, then, for the server to not >>>>>> match other requests from other mount points on this same client? >>>>> The first ~200 bytes of each RPC Call message. >>>>> [ Note that this has some fun ramifications for calls with small >>>>> RPC headers that use Read chunks. ] >>>> >>>> Ok, good to know. I forgot that the Linux server implemented this. >>>> I have some concerns abot it, honestly, and it's important to remember >>>> that it's not the same on all servers. But for the problem you're >>>> fixing, it's ok I guess and certainly better than today. Still, the >>>> errors are goingto be completely silent, and can lead to data being >>>> corrupted. Well, welcome to the world of NFSv3. >>> I don't see another option. >>> Some regard this checksum as more robust than using the client's >>> IP source port. After all, the same argument can be made that >>> the server cannot depend on clients to reuse their source port. >>> That is simply a convention that many clients adopted before >>> servers used a stronger DRC hash mechanism. >>>>>> And since RDMA is capable of >>>>>> such high IOPS, the likelihood seems rather high. >>>>> Only when the server's durable storage is slow enough to cause >>>>> some RPC requests to have extremely high latency. >>>>> And, most clients use an atomic counter for their XIDs, so they >>>>> are also likely to wrap that counter over some long-pending RPC >>>>> request. >>>>> The only real answer here is NFSv4 sessions. >>>>>> Missing the cache >>>>>> might actually be safer than hitting, in this case. >>>>> Remember that _any_ retransmit on RPC/RDMA requires a fresh >>>>> connection, that includes NFSv3, to reset credit accounting >>>>> due to the lost half of the RPC Call/Reply pair. >>>>> I can very quickly reproduce bad (non-deterministic) behavior >>>>> by running a software build on an NFSv3 on RDMA mount point >>>>> with disconnect injection. If the DRC issue is addressed, the >>>>> software build runs to completion. >>>> >>>> Ok, good. But I have a better test. >>>> >>>> In the Connectathon suite, there's a "Special" test called "nfsidem". >>>> I wrote this test in, like, 1989 so I remember it :-) >>>> >>>> This test performs all the non-idempotent NFv3 operations in a loop, >>>> and each loop element depends on the previous one, so if there's >>>> any failure, the test imemdiately bombs. >>>> >>>> Nobody seems to understand it, usually when it gets run people will >>>> run it without injecting errors, and it "passes" so they decide >>>> everything is ok. >>>> >>>> So my suggestion is to run your flakeway packet-drop harness while >>>> running nfsidem in a huge loop (nfsidem 10000). The test is slow, >>>> owing to the expensive operations it performs, so you'll need to >>>> run it for a long time. >>>> >>>> You'll almost definitely get a failure or two, since the NFSv3 >>>> protocol is flawed by design. But you can compare the behaviors, >>>> and even compute a likelihood. I'd love to see some actual numbers. >>> I configured the client to disconnect after 23711 RPCs have completed. >>> (I can re-run these with more frequent disconnects if you think that >>> would be useful). >>> Here's a run with the DRC modification: >>> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt >>> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000) >>> testing 100000 idempotencies in directory "./TEST" >>> [cel@manet ~]$ sudo umount /mnt >>> Here's a run with the stock v5.1 Linux server: >>> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt >>> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000) >>> testing 100000 idempotencies in directory "./TEST" >>> [cel@manet ~]$ >>> This test reported no errors in either case. We can see that the >>> disconnects did trigger retransmits: >>> RPC statistics: >>> 1888819 RPC requests sent, 1888581 RPC replies received (0 XIDs not found) >>> average backlog queue length: 119 >> >> Ok, well, that's 1.2% error rate, which IMO could be cranked up much >> higher for testing purposes. I'd also be sure the server was serving >> other workloads during the same time, putting at least some pressure >> on the DRC. The op rate of a single nfsidem test is pretty low so I >> doubt it's ever evicting anything. >> >> Ideally, it would be best to >> 1) increase the error probability >> 2) run several concurrent nfsidem tests, on different connections >> 3) apply some other load to the server, e.g. several cthon basics >> >> The idea being to actually get the needle off of zero and measure some >> kind of difference. Otherwise it really isn't giving any information >> apart from a slight didn't-break-it confidence. Honestly, I'm surprised >> you couldn't evince a failure from stock. On paper, these results don't >> actually tell us the patch is doing anything. > > I boosted the disconnect injection rate to once every 11353 RPCs, > and mounted a second share with "nosharecache", running nfsidem on > both mounts. Both mounts are subject to disconnect injection. > > With the current v5.2-rc Linux server, both nfsidem jobs fail within > 30 seconds. > > With my DRC fix applied, both jobs run to completion with no errors. > It takes more than an hour. Great, that definitely proves it. It's relatively low risk, and should go in. I think it's worth thinking through the best way to address this across multiple transports, especially RDMA, where the port and address behaviors may differ from TCP/UDP. Perhaps tossing the whole idea of using the 5-tuple should simply be set aside, after all these years. Even though NFSv4.1 already has! Thanks for humoring me with the extra testing, by the way :-) Tom. > Here are the op metrics for one of the mounts during a run that > completes successfully: > > RPC statistics: > 4091380 RPC requests sent, 4088143 RPC replies received (0 XIDs not found) > average backlog queue length: 1800 > > ACCESS: > 300395 ops (7%) 301 retrans (0%) 0 major timeouts > avg bytes sent per op: 132 avg bytes received per op: 120 > backlog wait: 4.289199 RTT: 0.019821 total execute time: 4.315092 (milliseconds) > REMOVE: > 300390 ops (7%) 168 retrans (0%) 0 major timeouts > avg bytes sent per op: 136 avg bytes received per op: 144 > backlog wait: 2.368664 RTT: 0.070106 total execute time: 2.445148 (milliseconds) > MKDIR: > 200262 ops (4%) 193 retrans (0%) 0 major timeouts > avg bytes sent per op: 158 avg bytes received per op: 271 > backlog wait: 4.207034 RTT: 0.075421 total execute time: 4.289101 (milliseconds) > RMDIR: > 200262 ops (4%) 100 retrans (0%) 0 major timeouts > avg bytes sent per op: 130 avg bytes received per op: 144 > backlog wait: 2.050749 RTT: 0.071676 total execute time: 2.128801 (milliseconds) > LOOKUP: > 194664 ops (4%) 233 retrans (0%) 0 major timeouts > avg bytes sent per op: 136 avg bytes received per op: 176 > backlog wait: 5.365984 RTT: 0.020615 total execute time: 5.392769 (milliseconds) > SETATTR: > 100130 ops (2%) 86 retrans (0%) 0 major timeouts > avg bytes sent per op: 160 avg bytes received per op: 144 > backlog wait: 3.520603 RTT: 0.066863 total execute time: 3.594327 (milliseconds) > WRITE: > 100130 ops (2%) 82 retrans (0%) 0 major timeouts > avg bytes sent per op: 180 avg bytes received per op: 136 > backlog wait: 3.331249 RTT: 0.118316 total execute time: 3.459563 (milliseconds) > CREATE: > 100130 ops (2%) 95 retrans (0%) 0 major timeouts > avg bytes sent per op: 168 avg bytes received per op: 272 > backlog wait: 4.099451 RTT: 0.071437 total execute time: 4.177479 (milliseconds) > SYMLINK: > 100130 ops (2%) 83 retrans (0%) 0 major timeouts > avg bytes sent per op: 188 avg bytes received per op: 271 > backlog wait: 3.727704 RTT: 0.073534 total execute time: 3.807700 (milliseconds) > RENAME: > 100130 ops (2%) 68 retrans (0%) 0 major timeouts > avg bytes sent per op: 180 avg bytes received per op: 260 > backlog wait: 2.659982 RTT: 0.070518 total execute time: 2.738979 (milliseconds) > LINK: > 100130 ops (2%) 85 retrans (0%) 0 major timeouts > avg bytes sent per op: 172 avg bytes received per op: 232 > backlog wait: 3.676680 RTT: 0.066773 total execute time: 3.749785 (milliseconds) > GETATTR: > 230 ops (0%) 81 retrans (35%) 0 major timeouts > avg bytes sent per op: 170 avg bytes received per op: 112 > backlog wait: 1584.026087 RTT: 0.043478 total execute time: 1584.082609 (milliseconds) > READDIRPLUS: > 10 ops (0%) > avg bytes sent per op: 149 avg bytes received per op: 1726 > backlog wait: 0.000000 RTT: 0.300000 total execute time: 0.400000 (milliseconds) > FSINFO: > 2 ops (0%) > avg bytes sent per op: 112 avg bytes received per op: 80 > backlog wait: 0.000000 RTT: 0.000000 total execute time: 0.000000 (milliseconds) > NULL: > 1 ops (0%) > avg bytes sent per op: 40 avg bytes received per op: 24 > backlog wait: 2.000000 RTT: 0.000000 total execute time: 3.000000 (milliseconds) > READLINK: > 1 ops (0%) > avg bytes sent per op: 128 avg bytes received per op: 1140 > backlog wait: 0.000000 RTT: 0.000000 total execute time: 0.000000 (milliseconds) > PATHCONF: > 1 ops (0%) > avg bytes sent per op: 112 avg bytes received per op: 56 > backlog wait: 0.000000 RTT: 0.000000 total execute time: 0.000000 (milliseconds) > > >> Tom. >> >>> ACCESS: >>> 300001 ops (15%) 44 retrans (0%) 0 major timeouts >>> avg bytes sent per op: 132 avg bytes received per op: 120 >>> backlog wait: 0.591118 RTT: 0.017463 total execute time: 0.614795 (milliseconds) >>> REMOVE: >>> 300000 ops (15%) 40 retrans (0%) 0 major timeouts >>> avg bytes sent per op: 136 avg bytes received per op: 144 >>> backlog wait: 0.531667 RTT: 0.018973 total execute time: 0.556927 (milliseconds) >>> MKDIR: >>> 200000 ops (10%) 26 retrans (0%) 0 major timeouts >>> avg bytes sent per op: 158 avg bytes received per op: 272 >>> backlog wait: 0.518940 RTT: 0.019755 total execute time: 0.545230 (milliseconds) >>> RMDIR: >>> 200000 ops (10%) 24 retrans (0%) 0 major timeouts >>> avg bytes sent per op: 130 avg bytes received per op: 144 >>> backlog wait: 0.512320 RTT: 0.018580 total execute time: 0.537095 (milliseconds) >>> LOOKUP: >>> 188533 ops (9%) 21 retrans (0%) 0 major timeouts >>> avg bytes sent per op: 136 avg bytes received per op: 174 >>> backlog wait: 0.455925 RTT: 0.017721 total execute time: 0.480011 (milliseconds) >>> SETATTR: >>> 100000 ops (5%) 11 retrans (0%) 0 major timeouts >>> avg bytes sent per op: 160 avg bytes received per op: 144 >>> backlog wait: 0.371960 RTT: 0.019470 total execute time: 0.398330 (milliseconds) >>> WRITE: >>> 100000 ops (5%) 9 retrans (0%) 0 major timeouts >>> avg bytes sent per op: 180 avg bytes received per op: 136 >>> backlog wait: 0.399190 RTT: 0.022860 total execute time: 0.436610 (milliseconds) >>> CREATE: >>> 100000 ops (5%) 9 retrans (0%) 0 major timeouts >>> avg bytes sent per op: 168 avg bytes received per op: 272 >>> backlog wait: 0.365290 RTT: 0.019560 total execute time: 0.391140 (milliseconds) >>> SYMLINK: >>> 100000 ops (5%) 18 retrans (0%) 0 major timeouts >>> avg bytes sent per op: 188 avg bytes received per op: 272 >>> backlog wait: 0.750470 RTT: 0.020150 total execute time: 0.786410 (milliseconds) >>> RENAME: >>> 100000 ops (5%) 14 retrans (0%) 0 major timeouts >>> avg bytes sent per op: 180 avg bytes received per op: 260 >>> backlog wait: 0.461650 RTT: 0.020710 total execute time: 0.489670 (milliseconds) >>> -- >>> Chuck Lever > > -- > Chuck Lever > > > > >