Date: Tue, 12 Feb 2008 16:23:29 -0700
From: Jason Gunthorpe
To: Roland Dreier
Cc: Christoph Lameter, Rik van Riel, steiner@sgi.com, Andrea Arcangeli,
    a.p.zijlstra@chello.nl, izike@qumranet.com, linux-kernel@vger.kernel.org,
    avi@qumranet.com, linux-mm@kvack.org, daniel.blueman@quadrics.com,
    Robin Holt, general@lists.openfabrics.org, Andrew Morton,
    kvm-devel@lists.sourceforge.net
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Tue, Feb 12, 2008 at 02:41:48PM -0800, Roland Dreier wrote:

> > > Chelsio's T3 HW doesn't support this.
>
> > Not so far I guess but it could be equipped with these features right?
>
> I don't know anything about the T3 internals, but it's not clear that
> you could do this without a new chip design in general. Lots of RDMA
> devices were designed expecting that when a packet arrives, the HW can
> look up the bus address for a given memory region/offset and place the
> data.

Well, today's memfree IB devices certainly store their page tables in
host memory, so they are already designed to hang onto packets during
the page lookup over PCIe; adding faulting makes that window larger.

But this is not a good thing at all: IB's congestion model is based on
the notion that end ports can always accept packets, without making
input contingent on output. If you take a software interrupt to fill
in the page pointer then you could potentially deadlock on the fabric.
For example, using this mechanism to allow swap-in of RDMA target
pages while putting the storage itself over IB would be deadlock prone
(a toy model of this cycle appears below).

Even without deadlock, slowing down the input path will cause network
congestion and poor performance for other nodes. It is not a desirable
thing to do.

I expect that iWARP running over flow-controlled Ethernet has similar
kinds of problems, for similar reasons.

In general, the best I think you can hope for with RDMA hardware is
page migration, using some atomic operations with the adaptor and a
CPU page copy with a retry sort of scheme (a rough sketch of this also
follows below) - but is pure page migration interesting at all?

Jason
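
To make the congestion argument concrete, here is a toy, compilable
model of the credit cycle described above: a receiver with a fixed
number of end-port credits, where each faulting RDMA packet holds its
credit while it waits for a swap-in whose data must itself arrive over
the same fabric. The credit accounting and all constants here are
invented for illustration; real IB link-level flow control is more
involved, but the shape of the cycle is the same.

/* toy-deadlock.c: invented credit model, not real IB flow control */
#include <stdio.h>

#define CREDITS 4   /* receive credits the end port can hand out */

int main(void)
{
	int free_credits = CREDITS;
	int parked = 0;          /* packets stalled in fault handling */

	/* A burst of RDMA writes, each one hitting a non-present page.
	 * Each packet consumes a credit and then holds it while the
	 * fault handler waits for a swap-in over the fabric. */
	for (int pkt = 0; pkt < 2 * CREDITS; pkt++) {
		if (free_credits == 0)
			break;   /* the port can no longer accept input */
		free_credits--;
		parked++;
	}

	/* The swap-in data arrives as more inbound packets, but every
	 * credit is held by a packet waiting for exactly that data:
	 * input has become contingent on output, and nothing moves. */
	if (free_credits == 0 && parked > 0)
		printf("deadlock: %d packets hold all %d credits while "
		       "waiting for completions the port cannot accept\n",
		       parked, CREDITS);
	return 0;
}

The same shape appears whenever servicing input requires more input on
a flow-controlled link, which is also why merely slowing the receive
path hurts the neighbors sharing the fabric.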
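
And here is a compilable user-space sketch of the "atomic operations
with the adaptor plus CPU copy with retry" migration idea from the
last paragraph. Everything in it is made up for illustration: the
translation-entry layout, the per-entry dirty bit, the hw_dma_write()
helper standing in for the adaptor, and the mid-copy test hook. Note
that a real adaptor would have to make its data write and its
dirty-bit update one atomic event as seen by the migrating CPU; this
single-threaded simulation sidesteps that, which is exactly the
hardware support in question.

/* migrate-retry.c: hypothetical per-entry dirty bit + copy/retry */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096
#define PTE_DIRTY ((uintptr_t)1)            /* low bit of the entry */
#define PTE_ADDR(e) ((unsigned char *)((e) & ~PTE_DIRTY))

/* One translation entry shared with the (simulated) adaptor.  Pages
 * are page aligned, so the low address bits are free for flags. */
static _Atomic uintptr_t pte;

static _Alignas(PAGE_SIZE) unsigned char old_page[PAGE_SIZE];
static _Alignas(PAGE_SIZE) unsigned char new_page[PAGE_SIZE];

/* Adaptor side: mark the entry dirty, then DMA into whichever page
 * the entry pointed at.  A migration racing with us sees the dirty
 * bit and knows its copy may be stale. */
static void hw_dma_write(size_t off, unsigned char val)
{
	uintptr_t e = atomic_fetch_or(&pte, PTE_DIRTY);

	PTE_ADDR(e)[off] = val;
}

/* CPU side: clear the dirty bit, copy the page, then publish the new
 * address only if no DMA dirtied the entry while we were copying.
 * mid_copy_dma is a test hook standing in for a racing packet. */
static int migrate(unsigned char *dst, void (*mid_copy_dma)(void))
{
	for (int tries = 0; tries < 16; tries++) {
		uintptr_t e = atomic_fetch_and(&pte, ~PTE_DIRTY) & ~PTE_DIRTY;

		memcpy(dst, PTE_ADDR(e), PAGE_SIZE);
		if (mid_copy_dma) {
			mid_copy_dma();      /* packet lands mid-copy */
			mid_copy_dma = NULL; /* only inject it once */
		}
		if (atomic_compare_exchange_strong(&pte, &e, (uintptr_t)dst))
			return 0;            /* clean copy published */
	}
	return -1;    /* page too hot; caller would have to stall the HW */
}

static void racing_packet(void)
{
	hw_dma_write(7, 0x5a);
}

int main(void)
{
	atomic_store(&pte, (uintptr_t)old_page);
	hw_dma_write(0, 0xab);               /* packet before migration */

	if (migrate(new_page, racing_packet) == 0)
		printf("migrated after retry: [0]=%#x [7]=%#x\n",
		       new_page[0], new_page[7]);
	return 0;
}

The first copy races with the injected packet, the compare-exchange
fails on the dirty bit, and the second pass publishes a clean copy -
which also shows why a page the adaptor keeps writing to can starve
the migration forever.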