Date: Tue, 12 Feb 2008 17:55:33 -0800
From: Christian Bell
To: Christoph Lameter
Cc: Jason Gunthorpe, Rik van Riel, Andrea Arcangeli, a.p.zijlstra@chello.nl, izike@qumranet.com, Roland Dreier, steiner@sgi.com, linux-kernel@vger.kernel.org, avi@qumranet.com, linux-mm@kvack.org, daniel.blueman@quadrics.com, Robin Holt, general@lists.openfabrics.org, Andrew Morton, kvm-devel@lists.sourceforge.net
Subject: Re: [ofa-general] Re: Demand paging for memory regions
Message-ID: <20080213015533.GP29340@mv.qlogic.com>
References: <20080209015659.GC7051@v2.random> <20080209075556.63062452@bree.surriel.com> <47B2174E.5000708@opengridcomputing.com> <20080212232329.GC31435@obsidianresearch.com>

On Tue, 12 Feb 2008, Christoph Lameter wrote:

> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>
> > Well, certainly today the memfree IB devices store the page tables in
> > host memory so they are already designed to hang onto packets during
> > the page lookup over PCIE, adding in faulting makes this time
> > larger.
>
> You really do not need a page table to use it. What needs to be maintained
> is knowledge on both sides about what pages are currently shared across
> RDMA. If the VM decides to reclaim a page then the notification is used to
> remove the remote entry. If the remote side then tries to access the page
> again then the page fault on the remote side will stall until the local
> page has been brought back. RDMA can proceed after both sides again agree
> on that page now being sharable.

HPC environments won't be amenable to a pessimistic approach of
synchronizing before every data transfer.  RDMA is assumed to be a
low-level data movement mechanism that has no implied synchronization.
In some parallel programming models, it's not uncommon to use RDMA to
send 8-byte messages.  It can be difficult to make and hold guarantees
about in-memory pages when many concurrent RDMA operations are in
flight (not uncommon in reasonably large machines).  Some of the
in-memory page information could be shared with some form of remote
caching strategy, but then it's a different problem with its own
scalability challenges.

I think there are very real potential clients of the interface when an
optimistic approach is used.  Part of the trick, however, has to do
with being able to re-start transfers instead of buffering the data or
making guarantees about delivery that could cause deadlock (as was
alluded to earlier in this thread).  InfiniBand is constrained in this
regard since it requires message ordering between endpoints (or queue
pairs).  One could argue that this is still possible with IB, at the
cost of throwing more packets away when a referenced page is not in
memory.
With this approach, the worst-case demand paging scenario is met when
the active working set of referenced pages is larger than the amount of
physical memory -- but HPC applications are already bound by this
anyway.

You'll find that Quadrics has the most experience in this area and that
their entire architecture is adapted to being optimistic about demand
paging in RDMA transfers -- they've been maintaining a patchset to do
this for years.

    . . christian
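
A minimal sketch of the invalidate/stall protocol described in the quoted
text above, assuming a toy in-process model rather than the real
mmu_notifier or verbs interfaces; every name here (shared_page,
local_invalidate, local_fault_in, remote_access) is hypothetical and the
"remote" side is simulated locally purely to show the ordering of events:
reclaim invalidates the remote entry, a later remote access stalls, and
RDMA resumes once both sides again agree the page is present.

    /* Hypothetical toy model of the notifier-driven RDMA invalidation
     * protocol; not a real kernel or verbs API. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NPAGES 8

    struct shared_page {
        bool present;       /* local page currently pinned for RDMA */
        bool remote_valid;  /* remote side still holds a translation */
    };

    static struct shared_page table[NPAGES];

    /* Local VM reclaims a page: notify the remote side and drop the entry. */
    static void local_invalidate(int pg)
    {
        table[pg].present = false;
        table[pg].remote_valid = false;   /* models the invalidate message */
        printf("page %d reclaimed, remote entry invalidated\n", pg);
    }

    /* Local fault path: bring the page back and re-advertise it. */
    static void local_fault_in(int pg)
    {
        table[pg].present = true;
        table[pg].remote_valid = true;    /* both sides agree again */
        printf("page %d re-pinned, RDMA may resume\n", pg);
    }

    /* Remote access: stalls (here, retries) until the translation is valid. */
    static void remote_access(int pg)
    {
        while (!table[pg].remote_valid)
            local_fault_in(pg);           /* stand-in for the stalled remote fault */
        printf("remote RDMA to page %d completes\n", pg);
    }

    int main(void)
    {
        local_fault_in(3);    /* page initially shared for RDMA */
        local_invalidate(3);  /* memory pressure reclaims it */
        remote_access(3);     /* remote fault stalls, then proceeds */
        return 0;
    }

An optimistic client in this model would keep issuing RDMA and only pay
the stall when a referenced page has actually been reclaimed, restarting
the transfer rather than buffering data or holding delivery guarantees
that could deadlock.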