From: KOSAKI Motohiro
Date: Sat, 25 May 2013 21:11:21 -0400
Subject: Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK
To: Christoph Lameter
Cc: Peter Zijlstra, Al Viro, Vince Weaver, LKML, Paul Mackerras,
    Ingo Molnar, Arnaldo Carvalho de Melo, trinity@vger.kernel.org,
    Andrew Morton, Linus Torvalds, Roland Dreier, infinipath@qlogic.com,
    "linux-mm@kvack.org", linux-rdma@vger.kernel.org, Or Gerlitz

On Fri, May 24, 2013 at 11:40 AM, Christoph Lameter wrote:
> On Fri, 24 May 2013, Peter Zijlstra wrote:
>
>> Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
>> broke RLIMIT_MEMLOCK.
>
> Nope the patch fixed a problem with double accounting.
>
> The problem that we seem to have is to define what mlocked and pinned mean
> and how this relates to RLIMIT_MEMLOCK.
>
> mlocked pages are pages that are movable (not pinned!!!) and that are
> marked in some way by user space actions as mlocked (POSIX semantics).
> They are marked with a special page flag (PG_mlocked).
>
> Pinned pages are pages that have an elevated refcount because the hardware
> needs to use these pages for I/O. The elevated refcount may be temporary
> (then we dont care about this) or for a longer time (such as the memory
> registration of the IB subsystem). That is when we account the memory as
> pinned. The elevated refcount stops page migration and other things from
> trying to move that memory.
>
> Pages can be both pinned and mlocked. Before my patch some pages those two
> issues were conflated since the same counter was used and therefore these
> pages were counted twice. If an RDMA application was running using
> mlockall() and was performing large scale I/O then the counters could show
> extraordinary large numbers and the VM would start to behave erratically.
>
> It is important for the VM to know which pages cannot be evicted but that
> involves many more pages due to dirty pages etc etc.
>
> So far the assumption has been that RLIMIT_MEMLOCK is a limit on the pages
> that userspace has mlocked.
>
> You want the counter to mean something different it seems. What is it?
>
> I think we need to be first clear on what we want to accomplish and what
> these counters actually should count before changing things.

Hm.
If pinned and mlocked are intentionally totally different, why does IB
use RLIMIT_MEMLOCK? Why doesn't IB use an IB-specific limit, and why does
only IB raise the number of pinned pages while other gup users don't?
I can't guess the IB folks' intent.

And now every piece of IB code has its own duplicated RLIMIT_MEMLOCK
check, and at least __ipath_get_user_pages() forgets to check
capable(CAP_IPC_LOCK). That's bad.
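
To make the kind of check under discussion concrete, here is a minimal
userspace sketch in C of an RLIMIT_MEMLOCK-style test with a
CAP_IPC_LOCK bypass. It is only a model under stated assumptions:
memlock_charge_allowed() and all of its parameters are hypothetical
names, and it is not the code of any IB driver or of the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model of an RLIMIT_MEMLOCK-style limit check.
 * All names are hypothetical; this is not kernel code.
 *
 * already_pages: pages already charged against the limit
 * new_pages:     pages the caller wants to add
 * limit_bytes:   the memlock limit, in bytes
 * page_size:     page size in bytes (must be non-zero)
 * privileged:    stands in for capable(CAP_IPC_LOCK)
 */
static bool memlock_charge_allowed(uint64_t already_pages, uint64_t new_pages,
                                   uint64_t limit_bytes, uint64_t page_size,
                                   bool privileged)
{
    uint64_t limit_pages = limit_bytes / page_size;

    /* A privileged caller is not held to the limit. */
    if (privileged)
        return true;

    /* Refuse instead of letting the page count wrap around. */
    if (new_pages > UINT64_MAX - already_pages)
        return false;

    return already_pages + new_pages <= limit_pages;
}
```

In this model, dropping the "privileged" branch means a caller with
CAP_IPC_LOCK is held to the limit like everyone else. Keeping one shared
helper instead of a copy per driver would make that kind of omission
harder to introduce.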

> Certainly would appreciate improvements in this area but resurrecting the
> conflation between mlocked and pinned pages is not the way to go.
>
>> This patch proposes to properly fix the problem by introducing
>> VM_PINNED. This also provides the groundwork for a possible mpin()
>> syscall or MADV_PIN -- although these are not included.
>
> Maybe add a new PIN page flag? Pages are not pinned per vma as the patch
> seems to assume.

Generically, you are right. But if VM_PINNED is really only for IB, this
is an acceptable limitation. They can split the vma for their own
purposes.

Anyway, I agree we should clearly understand the semantics of IB pinning
and the userland usage and assumptions.