Subject: RE: [ofa-general] Re: Demand paging for memory regions
Date: Thu, 14 Feb 2008 20:26:51 -0500
Message-ID: <78C9135A3D2ECE4B8162EBDCE82CAD77030E2456@nekter>
From: "Caitlin Bestler"
To: "Christoph Lameter"
X-Mailing-List: linux-kernel@vger.kernel.org

> -----Original Message-----
> From: Christoph Lameter [mailto:clameter@sgi.com]
> Sent: Thursday, February 14, 2008 2:49 PM
> To: Caitlin Bestler
> Cc: linux-kernel@vger.kernel.org; avi@qumranet.com; linux-mm@kvack.org;
> general@lists.openfabrics.org; kvm-devel@lists.sourceforge.net
> Subject: Re: [ofa-general] Re: Demand paging for memory regions
>
> On Thu, 14 Feb 2008, Caitlin Bestler wrote:
>
> > I have no problem with that, as long as the application layer is
> > responsible for tearing down and re-establishing
> > the connections. The RDMA/transport layers are incapable of tearing
> > down and re-establishing a connection transparently because
> > connections need to be approved above the RDMA layer.
>
> I am not that familiar with the RDMA layers, but it seems that RDMA has
> a library that does device-driver-like things, right? So the logic would
> best fit in there, I guess.
>
> If you combine mlock with the mmu notifier then you can actually
> guarantee that a certain memory range will not be swapped out. The
> notifier will then only be called if the memory range needs to be
> moved for page migration, memory unplug, etc. There may be a limit on
> the percentage of memory that you can mlock in the future. This may be
> done to guarantee that the VM still has memory to work with.

The problem is that with existing APIs, or even slightly modified APIs,
the RDMA layer will not be able to figure out which connections need to
be "interrupted" in order to deal with which memory suspensions. Further,
because any request for a new connection will be handled by the remote
*application layer* peer, there is no way for the two RDMA layers to
agree to covertly tear down and re-establish the connection. Nor really
should there be: connections should be approved by OS-layer networking
controls. RDMA should not be able to tell the network stack, "trust me,
you don't have to check whether this connection is legitimate."

Another example: if you terminate a connection, pending receive
operations complete *to the user* in a Completion Queue. Those
completions are NOT seen by the RDMA layer, and especially not by the
Connection Manager. It has absolutely no way to transparently repost them
to the same connection when the connection is re-established. Even worse,
some portion of a receive operation might already have been placed in the
receive buffer and acknowledged to the remote peer, but there is no
mechanism to report this fact in the CQE. A receive operation that is
aborted is aborted.
There is no concept of partial success. Therefore you cannot covertly
terminate a connection mid-operation and covertly re-establish it later.
Data will be lost, the connection will no longer be reliable, and it will
therefore need to be torn down anyway.

The RDMA layer also cannot tell the other side not to transmit. Flow
control is the responsibility of the application layer, not RDMA.

What the RDMA layer could do is this: once you tell it to suspend a given
memory region, it can either tell you that it doesn't know how to do
that, or it can instruct the device to stop processing a set of
connections, which ceases all access to that Memory Region. When you
resume, it can guarantee that it is no longer using any cached older
mappings for the memory region (assuming it was capable of doing the
suspend), and then, because RDMA connections are reliable, everything
will recover unless the connection timed out. The chance that it will
time out is probably low, but the chance that the underlying connection
will be in slow start or equivalent is much higher.

So any solution that requires the upper layers to suspend operations for
a brief bit will require explicit interaction with those layers. No RDMA
layer can perform the sleight-of-hand tricks that you seem to want it to
perform. At the RDMA layer the best you could get is very brief
suspensions for the purpose of *re-arranging* memory, not of reducing the
amount of registered memory.

If you need to reduce the amount of registered memory, then you have to
talk to the application. Discussions on making it easier for the
application to trim a memory region dynamically might be in order, but
you will not work around the fact that the application layer needs to
determine which pages are registered. And applications would really
prefer just to be told how much memory they can have up front; they can
figure out how to deal with that amount of memory on their own.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/