Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932604AbYBMXXh (ORCPT ); Wed, 13 Feb 2008 18:23:37 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1759338AbYBMXXN (ORCPT ); Wed, 13 Feb 2008 18:23:13 -0500 Received: from quasar.osc.edu ([192.148.249.15]:56887 "EHLO quasar.osc.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758387AbYBMXXK (ORCPT ); Wed, 13 Feb 2008 18:23:10 -0500 Date: Wed, 13 Feb 2008 18:23:08 -0500 From: Pete Wyckoff To: Christoph Lameter Cc: Christian Bell , Rik van Riel , Andrea Arcangeli , a.p.zijlstra@chello.nl, izike@qumranet.com, Roland Dreier , steiner@sgi.com, linux-kernel@vger.kernel.org, avi@qumranet.com, linux-mm@kvack.org, daniel.blueman@quadrics.com, Robin Holt , general@lists.openfabrics.org, Andrew Morton , kvm-devel@lists.sourceforge.net Subject: Re: [ofa-general] Re: Demand paging for memory regions Message-ID: <20080213232308.GB7597@osc.edu> References: <47B2174E.5000708@opengridcomputing.com> <20080212232329.GC31435@obsidianresearch.com> <20080213012638.GD31435@obsidianresearch.com> <20080213040905.GQ29340@mv.qlogic.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20080213040905.GQ29340@mv.qlogic.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3930 Lines: 92 christian.bell@qlogic.com wrote on Tue, 12 Feb 2008 20:09 -0800: > One other area that has not been brought up yet (I think) is the > applicability of notifiers in letting users know when pinned memory > is reclaimed by the kernel. This is useful when a lower-level > library employs lazy deregistration strategies on memory regions that > are subsequently released to the kernel via the application's use of > munmap or sbrk. Ohio Supercomputing Center has work in this area but > a generalized approach in the kernel would certainly be welcome. The whole need for memory registration is a giant pain. There is no motivating application need for it---it is simply a hack around virtual memory and the lack of full VM support in current hardware. There are real hardware issues that interact poorly with virtual memory, as discussed previously in this thread. The way a messaging cycle goes in IB is: register buf post send from buf wait for completion deregister buf This tends to get hidden via userspace software libraries into a single call: MPI_send(buf) Now if you actually do the reg/dereg every time, things are very slow. So userspace library writers came up with the idea of caching registrations: if buf is not registered: register buf post send from buf wait for completion The second time that the app happens to do a send from the same buffer, it proceeds much faster. Spatial locality applies here, and this caching is generally worth it. Some libraries have schemes to limit the size of the registration cache too. But there are plenty of ways to hurt yourself with such a scheme. The first being a huge pool of unused but registered memory, as the library doesn't know the app patterns, and it doesn't know the VM pressure level in the kernel. There are plenty of subtle ways that this breaks too. If the registered buf is removed from the address space via munmap() or sbrk() or other ways, the mapping and registration are gone, but the library has no way of knowing that the app just did this. Sure the physical page is still there and pinned, but the app cannot get at it. Later if new address space arrives at the same virtual address but a different physical page, the library will mistakenly think it already has it registered properly, and data is transferred from this old now-unmapped physical page. The whole situation is rather ridiculuous, but we are quite stuck with it for current generation IB and iWarp hardware. If we can't have the kernel interact with the device directly, we could at least manage state in these multiple userspace registration caches. The VM could ask for certain (or any) pages to be released, and the library would respond if they are indeed not in use by the device. The app itself does not know about pinned regions, and the library is aware of exactly which regions are potentially in use. Since the great majority of userspace messaging over IB goes through middleware like MPI or PGAS languages, and they all have the same approach to registration caching, this approach could fix the problem for a big segment of use cases. More text on the registration caching problem is here: http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf with an approach using vm_ops open and close operations in a kernel module here: http://www.osc.edu/~pw/dreg/ There is a place for VM notifiers in RDMA messaging, but not in talking to devices, at least not the current set. If you can define a reasonable userspace interface for VM notifiers, libraries can manage registration caches more efficiently, letting the kernel unmap pinned pages as it likes. -- Pete -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/