From: Jeff Layton
Date: Mon, 8 Dec 2014 15:24:58 -0500
To: "J. Bruce Fields"
Cc: Jeff Layton, Trond Myklebust, Chris Worley, linux-nfs@vger.kernel.org,
 Ben Myers
Subject: Re: [PATCH 3/4] sunrpc: convert to lockless lookup of queued server threads
Message-ID: <20141208152458.704b5814@tlielax.poochiereds.net>
In-Reply-To: <20141208195855.GC16612@fieldses.org>
References: <1416597571-4265-1-git-send-email-jlayton@primarydata.com>
 <1416597571-4265-4-git-send-email-jlayton@primarydata.com>
 <20141201234759.GF30749@fieldses.org>
 <20141202065750.283704a7@tlielax.poochiereds.net>
 <20141202071422.5b01585d@tlielax.poochiereds.net>
 <20141202165023.GA9195@fieldses.org>
 <20141208185730.GB16612@fieldses.org>
 <20141208145429.56234bf2@tlielax.poochiereds.net>
 <20141208195855.GC16612@fieldses.org>

On Mon, 8 Dec 2014 14:58:55 -0500
"J. Bruce Fields" wrote:

> On Mon, Dec 08, 2014 at 02:54:29PM -0500, Jeff Layton wrote:
> > On Mon, 8 Dec 2014 13:57:31 -0500
> > "J. Bruce Fields" wrote:
> >
> > > On Tue, Dec 02, 2014 at 11:50:24AM -0500, J. Bruce Fields wrote:
> > > > On Tue, Dec 02, 2014 at 07:14:22AM -0500, Jeff Layton wrote:
> > > > > On Tue, 2 Dec 2014 06:57:50 -0500
> > > > > Jeff Layton wrote:
> > > > >
> > > > > > On Mon, 1 Dec 2014 19:38:19 -0500
> > > > > > Trond Myklebust wrote:
> > > > > >
> > > > > > > On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields wrote:
> > > > > > > > I find it hard to think about how we expect this to affect
> > > > > > > > performance. So it comes down to the observed results, I
> > > > > > > > guess, but just trying to get an idea:
> > > > > > > >
> > > > > > > > - this eliminates sp_lock. I think the original idea here was
> > > > > > > >   that if interrupts could be routed correctly then there
> > > > > > > >   shouldn't normally be cross-cpu contention on this lock. Do
> > > > > > > >   we understand why that didn't pan out? Is hardware capable
> > > > > > > >   of doing this really rare, or is it just too hard to
> > > > > > > >   configure it correctly?
> > > > > > > >
> > > > > > > One problem is that a 1MB incoming write will generate a lot of
> > > > > > > interrupts. While that is not so noticeable on a 1GigE network,
> > > > > > > it is on a 40GigE network. The other thing you should note is
> > > > > > > that this workload was generated with ~100 clients pounding on
> > > > > > > that server, so there are a fair amount of TCP connections to
> > > > > > > service in parallel. Playing with the interrupt routing doesn't
> > > > > > > necessarily help you so much when all those connections are hot.
> > > > > >
> > > > > In principle though, the percpu pool_mode should have alleviated the
> > > > > contention on the sp_lock. When an interrupt comes in, the xprt gets
> > > > > queued to its pool. If there is a pool for each cpu then there should
> > > > > be no sp_lock contention. The pernode pool mode might also have
> > > > > alleviated the lock contention to a lesser degree in a NUMA
> > > > > configuration.
> > > > >
> > > > > Do we understand why that didn't help?
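Just to spell out what I had in mind with the percpu point there: the
enqueue path looks roughly like the sketch below (simplified from
memory, so this is not the actual net/sunrpc/svc_xprt.c code). Since
the pool is picked from the CPU taking the interrupt, one pool per CPU
should mean each sp_lock is only ever taken locally:

#include <linux/sunrpc/svc.h>
#include <linux/sunrpc/svc_xprt.h>

/*
 * Simplified sketch of the existing enqueue path (not the real
 * svc_xprt_enqueue()): the pool is chosen based on the CPU handling
 * the interrupt, so in the percpu pool_mode each pool's sp_lock
 * should normally only ever be taken from its own CPU.
 */
static void sketch_xprt_enqueue(struct svc_xprt *xprt)
{
	int cpu = get_cpu();
	struct svc_pool *pool = svc_pool_for_cpu(xprt->xpt_server, cpu);

	spin_lock_bh(&pool->sp_lock);
	list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
	/* ...then wake an idle thread from pool->sp_threads, if any */
	spin_unlock_bh(&pool->sp_lock);
	put_cpu();
}

So in percpu mode I'd have expected the sp_lock contention to mostly
vanish regardless of how the interrupts are routed, which is why the
numbers surprised me.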
> > > >
> > > > Yes, the lots-of-interrupts-per-rpc problem strikes me as a separate if
> > > > not entirely orthogonal problem.
> > > >
> > > > (And I thought it should be addressable separately; Trond and I talked
> > > > about this in Westford. I think it currently wakes a thread to handle
> > > > each individual tcp segment--but shouldn't it be able to do all the data
> > > > copying in the interrupt and wait to wake up a thread until it's got the
> > > > entire rpc?)
> > >
> > > By the way, Jeff, isn't this part of what's complicating the workqueue
> > > change? That would seem simpler if we didn't need to queue work until
> > > we had the full rpc.
> >
> > No, I don't think that really adds much in the way of complexity there.
> >
> > I have that set working. Most of what's holding me up from posting the
> > next iteration of that set is performance. So far, my testing shows
> > that the workqueue-based code is slightly slower. I've been trying to
> > figure out why that is and whether I can do anything about it. Maybe
> > I'll go ahead and post it as a second RFC set, until I can get to the
> > bottom of the perf delta.
> >
> > I have pondered doing what you're suggesting above though and it's not a
> > trivial change.
> >
> > The problem is that all of the buffers into which we do receives are
> > associated with the svc_rqst (which we don't really have when the
> > interrupt comes in), and not the svc_xprt (which we do have at that
> > point).
> >
> > So, you'd need to restructure the code to hang a receive buffer off
> > of the svc_xprt.
>
> Have you looked at svsk->sk_pages and svc_tcp_{save,restore}_pages?
>
> --b.

Ahh, no I hadn't...interesting. So, basically do the receive into the
rqstp's buffer, and if you don't get everything you need you stuff the
pages into the sk_pages array to await the next pass. Weird design...

Ok, so you could potentially flip that around. Do the receive into the
sk_pages buffer in softirq context, and then hand those off to the rqst
(in some fashion) once you've received a full RPC. You'd have to work
out how to replenish the sk_pages after each receive, and what to do
about RDMA, but it's probably doable.

> > Once you receive an entire RPC, you'd then have to
> > flip that buffer over to a svc_rqst, queue up the job and grab a new
> > buffer for the xprt (maybe you could swap them?).
> >
> > The problem is what to do if you don't have a buffer (or svc_rqst)
> > available when an IRQ comes in. You can't allocate one from softirq
> > context, so you'd need to offload that case to a workqueue or something
> > anyway (which adds a bit of complexity as you'd then have to deal with
> > two different receive paths).
> >
> > I'm also not sure about RDMA. When you get an RPC, the server usually
> > has to do an RDMA READ from the client to pull all of the data in. I
> > don't think you want to do that from softirq context, so that would
> > also need to be queued up somehow.
> >
> > All of that said, it would probably reduce some context switching if
> > we can make that work. Also, I suspect that doing that in the context
> > of the workqueue-based code would probably be at least a little simpler.
> >
> > --
> > Jeff Layton

-- 
Jeff Layton
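P.S. To make the "flip it around" idea above a bit more concrete, here's
a very rough sketch of the sort of data_ready path I'm picturing. The
helpers below (svc_sock_fill_record and svc_sock_hand_off_pages) are
made-up names purely for illustration -- this isn't working code, and it
punts entirely on how sk_pages gets replenished and on the RDMA case:

#include <linux/sunrpc/svcsock.h>
#include <linux/sunrpc/svc_xprt.h>

/* Hypothetical helper: pull whatever has arrived on the socket into
 * svsk->sk_pages; return true once a complete RPC record is present. */
static bool svc_sock_fill_record(struct svc_sock *svsk)
{
	/* the actual socket reads / skb copying would go here */
	return false;
}

/* Hypothetical helper: hand the filled pages to a svc_rqst (perhaps by
 * swapping page arrays) so the thread never has to redo the receive. */
static void svc_sock_hand_off_pages(struct svc_sock *svsk)
{
}

static void sketch_tcp_data_ready(struct svc_sock *svsk)
{
	/* softirq context: copy data as it arrives, but don't wake anyone yet */
	if (!svc_sock_fill_record(svsk))
		return;		/* partial record, wait for more segments */

	/*
	 * Only once a full RPC has been assembled do we hand the pages off
	 * and queue the xprt (wake a thread, or queue work in the workqueue
	 * model).  If no svc_rqst or replacement pages are available, we'd
	 * have to defer to a workqueue, since we can't allocate here.
	 */
	svc_sock_hand_off_pages(svsk);
	svc_xprt_enqueue(&svsk->sk_xprt);
}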