Date: Thu, 14 Feb 2008 09:48:52 -0800
From: "Caitlin Bestler"
To: "Steve Wise"
Subject: Re: [ofa-general] Re: Demand paging for memory regions
Cc: "Robin Holt", "Rik van Riel", steiner@sgi.com, "Andrea Arcangeli",
 a.p.zijlstra@chello.nl, izike@qumranet.com, "Roland Dreier",
 linux-kernel@vger.kernel.org, avi@qumranet.com,
 kvm-devel@lists.sourceforge.net, linux-mm@kvack.org,
 daniel.blueman@quadrics.com, general@lists.openfabrics.org,
 "Andrew Morton", "Christoph Lameter"
Message-ID: <469958e00802140948j162cc8baqae0b55cd6fb1cd22@mail.gmail.com>
In-Reply-To: <47B46AFB.9070009@opengridcomputing.com>
References: <8A71B368A89016469F72CD08050AD334026D5C23@maui.asicdesigners.com>
 <47B45994.7010805@opengridcomputing.com>
 <20080214155333.GA1029@sgi.com>
 <47B46AFB.9070009@opengridcomputing.com>

On Thu, Feb 14, 2008 at 8:23 AM, Steve Wise wrote:
> Robin Holt wrote:
> > On Thu, Feb 14, 2008 at 09:09:08AM -0600, Steve Wise wrote:
> >> Note that for T3, this involves suspending _all_ rdma connections that are
> >> in the same PD as the MR being remapped.  This is because the driver
> >> doesn't know who the application advertised the rkey/stag to.  So without
> >
> > Is there a reason the driver can not track these?
> >
>
> Because advertising of a MR (ie telling the peer about your rkey/stag,
> offset and length) is application-specific and can be done out of band,
> or in band as simple SEND/RECV payload.  Either way, the driver has no
> way of tracking this because the protocol used is application-specific.
>

I fully agree. If there is one important thing that must be understood
about RDMA and other fastpath solutions, it is that the driver does not
see the payload. This is a fundamental strength, but it means that you
have to identify in advance what intercept points, if any, there are.

You also raise a good point on the scope of any suspend/resume API.
Device reporting of this capability would not be a simple boolean, but
more of a suspend/resume scope. A minimal scope would be any connection
that actually attempts to use the suspended MR. Slightly wider would be
any connection *allowed* to use the MR, which could expand all the way
to any connection under the same PD. Conceivably an RDMA device might
even report that it supports suspend/resume only at the scope of the
entire device. But even at such a wide scope, suspend/resume could be
useful to a Memory Manager.

The pages could be fully migrated to the new location, with the only
work still required inside the critical suspend/resume window being the
actual switch to the new map. That window might be short enough that
refusing *any* incoming RDMA packet would be acceptable. And if the
goal is to replace a memory card, the alternative might be migrating
the applications to other physical servers, which would mean a much
longer period of not accepting incoming RDMA packets.
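To make the scope idea concrete, here is a minimal sketch of how a
device might report such a capability. Nothing like this exists in the
current verbs interface; every name below is an assumption made purely
for illustration:

/*
 * Hypothetical sketch only -- none of these names are real API.
 * The point is that a device would advertise *how wide* a region of
 * traffic it must quiesce to remap one MR, rather than a simple
 * yes/no capability bit.
 */
enum mr_suspend_scope {
	MR_SUSPEND_SCOPE_NONE,		/* cannot suspend/resume at all */
	MR_SUSPEND_SCOPE_MR,		/* connections that touch the MR */
	MR_SUSPEND_SCOPE_MR_ALLOWED,	/* connections allowed to use it */
	MR_SUSPEND_SCOPE_PD,		/* every connection under the PD */
	MR_SUSPEND_SCOPE_DEVICE,	/* the entire device */
};

struct mr_suspend_caps {
	enum mr_suspend_scope scope;	/* widest blast radius of a suspend */
	unsigned int max_quiesce_usec;	/* worst-case stall, if known */
};

/*
 * A Memory Manager could then weigh the blast radius and the stall
 * against simply migrating the whole application elsewhere.
 */
static int remap_in_place_is_cheap(const struct mr_suspend_caps *caps)
{
	return caps->scope != MR_SUSPEND_SCOPE_NONE &&
	       caps->scope <= MR_SUSPEND_SCOPE_PD &&
	       caps->max_quiesce_usec < 1000;
}

With something like that, even a device-wide scope is still usable
information -- it just shifts the decision toward migrating the
application instead of remapping underneath it.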
But the broader question is what the goal is here. Allowing memory to
be shuffled is valuable, and perhaps even ultimately a requirement for
high-availability systems. RDMA and other direct-access APIs should be
evolving their interfaces to accommodate these needs.

Oversubscribing memory is a totally different matter. If an application
is working with memory that is oversubscribed by a factor of 2 or more,
can it really benefit from zero-copy direct placement? At first glance
I can't see what value RDMA brings when the overhead of swapping is
going to be that large. If it really does make sense, then the right
answer is to explicitly register the portion of memory that should
remain able to receive incoming traffic while the application is
swapped out.

Current Memory Registration methods force applications to either
register too much or too often. They register too much when the cost
of registration is high and the application responds by registering
its entire buffer pool permanently. That is a problem when it
overstates the amount of memory the application really needs to have
resident, or when the device imposes limits on the size of the memory
maps it can know about. The alternative is to register too often, that
is, on a per-operation basis.

To me that suggests the solutions lie in making it more reasonable to
register more memory, or in making it practical to register memory on
the fly, per operation, with low enough overhead that applications
don't feel the need to build elaborate registration caching schemes.
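That last point is worth illustrating. Below is a minimal sketch of
the kind of registration cache applications end up building today. The
ibv_reg_mr()/ibv_dereg_mr() calls are the real libibverbs API; the
fixed-size cache, its names, and its naive eviction are illustrative
assumptions only:

#include <stddef.h>
#include <infiniband/verbs.h>

#define CACHE_SLOTS 64

struct reg_cache {
	struct ibv_pd *pd;			/* PD to register under */
	struct ibv_mr *slot[CACHE_SLOTS];	/* cached, still-pinned MRs */
};

/* Return an MR covering [addr, addr+len), registering only on a miss. */
static struct ibv_mr *reg_cache_get(struct reg_cache *c,
				    void *addr, size_t len)
{
	struct ibv_mr *mr;
	int i;

	for (i = 0; i < CACHE_SLOTS; i++) {
		mr = c->slot[i];
		if (mr &&
		    (char *)mr->addr <= (char *)addr &&
		    (char *)addr + len <= (char *)mr->addr + mr->length)
			return mr;	/* hit: reuse the already-pinned MR */
	}

	/* Miss: register (and thereby pin) the region. */
	mr = ibv_reg_mr(c->pd, addr, len,
			IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
	if (!mr)
		return NULL;

	for (i = 0; i < CACHE_SLOTS; i++) {
		if (!c->slot[i]) {
			c->slot[i] = mr;
			return mr;
		}
	}

	/* Cache full: evict slot 0.  A real cache must first be sure no
	 * operation is still using that MR before deregistering it. */
	ibv_dereg_mr(c->slot[0]);
	c->slot[0] = mr;
	return mr;
}

Note what even this cache cannot know: whether the application is
really still using any of the regions it keeps pinned.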
As has been pointed out a few times in this thread, the RDMA and
transport layers simply do not have enough information to know which
portion of registered memory *really* had to be registered. So any
back-pressure scheme where the Memory Manager asks for pinned memory
to be "given back" would have to go all the way up to the application.
Only the application knows what it is "really" using.

I also suspect that most applications interested in using RDMA would
rather be told they can allocate 200M indefinitely (and with real
memory backing it) than be given 1GB of virtual memory backed by only
200-300M of physical memory, especially if that meant dealing with
memory-pressure upcalls.

> >> Point being, it will probably stop all connections that an application is
> >> using (assuming the application uses a single PD).
> >
> > It seems like the need to not stop all would be a compelling enough reason
> > to modify the driver to track which processes have received the rkey/stag.
> >
>
> Yes, _if_ the driver could track this.
>
> And _if_ the rdma API and paradigm were such that the kernel/driver
> could keep track, then remote revocations of MR tags could be
> supported.
>
> Stevo