Date: Tue, 12 Feb 2008 20:25:33 -0700
From: Jason Gunthorpe
To: Christoph Lameter
Cc: Roland Dreier, Rik van Riel, steiner@sgi.com, Andrea Arcangeli,
    a.p.zijlstra@chello.nl, izike@qumranet.com, linux-kernel@vger.kernel.org,
    avi@qumranet.com, linux-mm@kvack.org, daniel.blueman@quadrics.com,
    Robin Holt, general@lists.openfabrics.org, Andrew Morton,
    kvm-devel@lists.sourceforge.net
Subject: Re: [ofa-general] Re: Demand paging for memory regions
Message-ID: <20080213032533.GC32047@obsidianresearch.com>
References: <20080209075556.63062452@bree.surriel.com>
    <47B2174E.5000708@opengridcomputing.com>
    <20080212232329.GC31435@obsidianresearch.com>
    <20080213012638.GD31435@obsidianresearch.com>

On Tue, Feb 12, 2008 at 06:35:09PM -0800, Christoph Lameter wrote:
> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>
> > The problem is that the existing wire protocols do not have a
> > provision for doing an 'are you ready' or 'I am not ready' exchange,
> > and they are not designed to store page tables on both sides as you
> > propose. The remote side can send RDMA WRITE traffic at any time after
> > the RDMA region is established. The local side must be able to handle
> > it. There is no way to signal that a page is not ready and that the
> > remote should not send.
> >
> > This means the only possible implementation is to stall/discard at the
> > local adaptor when an RDMA WRITE is received for a page that has been
> > reclaimed. This is what leads to deadlock/poor performance..
>
> You would only use the wire protocols *after* having established the RDMA
> region. The notifier chains allow an RDMA region (or parts thereof) to be
> taken down on demand by the VM. The region can be reestablished if one of
> the sides accesses it. I hope I got that right. Not much exposure to
> Infiniband so far.

[clip explanation]

But this isn't how IB or iWARP work at all. What you describe is a
significant change to the general RDMA operation and requires changes to
both sides of the connection and to the wire protocol.

A few comments on RDMA operation that might clarify things a little bit
more:

- In RDMA (both the iWARP and IB variants) the hardware page tables exist
  to linearize the local memory so that the remote side does not need to
  be aware of non-linearities in the physical address space. The main
  motivation for this is kernel bypass, where a user space app wants to
  instruct the remote side to DMA into memory using user space addresses.
  The hardware provides the page tables to translate incoming user space
  virtual addresses into physical addresses. This greatly simplifies the
  user space programming model, since you don't need to pass around or
  create s/g lists for memory that is already virtually contiguous. Many
  kernel RDMA drivers (SCSI, NFS) only use the HW page tables for access
  control and for enforcing the lifetime of the mapping.

  The page tables in the RDMA hardware exist primarily to support this,
  and not for other reasons. The pinning of pages is partly there to
  support the HW page tables and partly to support the RDMA lifetime
  rules; the lifetime rules are what cause problems for the VM.
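  To make that concrete, here is a minimal user space sketch with
  libibverbs (the buffer, the protection domain setup, and the names are
  illustrative assumptions on my part, not anything mandated by the
  specs). ibv_reg_mr() is what pins the pages and programs the HCA's
  translation tables; the rkey it returns is the region handle the
  remote side will later use:

  /* Sketch only: register a user space buffer for remote RDMA WRITE.
   * Assumes 'pd' came from ibv_alloc_pd() on an opened device; error
   * handling is trimmed. */
  #include <infiniband/verbs.h>
  #include <stdio.h>

  struct ibv_mr *register_region(struct ibv_pd *pd, void *buf, size_t len)
  {
          /* Pins the pages and builds the HW page table entries so the
           * HCA can translate incoming virtual addresses in this range. */
          struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                         IBV_ACCESS_LOCAL_WRITE |
                                         IBV_ACCESS_REMOTE_WRITE);
          if (!mr)
                  return NULL;

          /* mr->rkey is the region handle: once the peer learns it (out
           * of band), it may send RDMA WRITEs to this memory at any time,
           * with no per-operation handshake on the wire. */
          printf("region: addr=%p len=%zu rkey=0x%x\n",
                 buf, len, (unsigned) mr->rkey);
          return mr;
  }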
- The wire protocol consists of packets that say 'Write XXX bytes to
  offset YY in Region RRR'. Creating a region produces the RRR label and
  currently pins the pages. So long as the RRR label is valid, the remote
  side can issue write packets at any time without any further
  synchronization. There are no wire-level events associated with
  creating RRR. You can pass RRR to the other machine in any fashion,
  even using carrier pigeons :)

- The RDMA layer is very general (a la TCP); useful protocols (like SCSI)
  are built on top of it, and they specify the lifetime rules and the
  protocol for exchanging RRR. Every protocol is different. In-kernel
  protocols like SRP and NFS-RDMA seem to have very short lifetimes for
  RRR and work more like pci_map_* in real SCSI hardware.

- HPC userspace apps, like MPI apps, have different lifetime rules and
  tend to be really long lived. These people will not want anything that
  makes their ops more expensive, and they also probably don't care too
  much about the VM problems you are looking at (?)

- There is no protocol support to exchange RRR. This is all done by upper
  level protocols (a la HTTP vs TCP). You cannot assert and revoke RRR in
  a general way. Every protocol is different and optimized. This is your
  step 'A will then send a message to B notifying..'. It simply does not
  exist in the protocol specifications.

I don't know much about Quadrics, but I would be hesitant to lump it in
too much with these RDMA semantics. Christian's comments sound like they
operate closer to what you described, and that is why they have an
existing patch set. I don't know :)

What it boils down to is that to implement true removal of pages in a
general way, the kernel and HCA must either drop packets or stall
incoming packets. Both are big performance problems, and I can't see
many users wanting this. Enterprise-style users of SCSI, NFS, etc.
already have short pin periods, and HPC MPI users probably won't care
about the VM issues enough to warrant the performance overhead.

Regards,
Jason
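For the initiator's side of the 'Write XXX bytes to offset YY in Region
RRR' packet above, a matching libibverbs sketch (again only illustrative:
it assumes a connected RC QP, a registered local source buffer, and a
remote {addr, rkey} pair that some upper-level protocol has already
exchanged, since the verbs themselves do not define how that happens):

/* Sketch only: post 'write len bytes at offset off in region rkey'.
 * qp, src_mr, remote_addr, and remote_rkey are assumed to exist;
 * completion polling and error handling are omitted. */
#include <infiniband/verbs.h>
#include <stdint.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *src_mr,
               uint64_t remote_addr, uint32_t remote_rkey,
               uint64_t off, uint32_t len)
{
        struct ibv_sge sge = {
                .addr   = (uintptr_t) src_mr->addr,
                .length = len,
                .lkey   = src_mr->lkey,
        };
        struct ibv_send_wr wr = {
                .sg_list    = &sge,
                .num_sge    = 1,
                .opcode     = IBV_WR_RDMA_WRITE,
                .send_flags = IBV_SEND_SIGNALED,
                .wr.rdma.remote_addr = remote_addr + off,
                .wr.rdma.rkey        = remote_rkey,
        };
        struct ibv_send_wr *bad_wr = NULL;

        /* Nothing here asks the target whether the pages behind
         * (remote_addr, rkey) are resident; the far HCA has to accept,
         * stall, or drop whatever arrives. */
        return ibv_post_send(qp, &wr, &bad_wr);
}

Once the rkey is out, the only local recourse is to destroy the
registration, and an incoming WRITE that hits a dead rkey is a fatal
access error for the connection rather than a 'try again later'.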