Date: Tue, 07 Jan 2014 16:01:48 +0100
From: Vlastimil Babka
To: Motohiro Kosaki, Andrew Morton
Cc: Linus Torvalds, Sasha Levin, Wanpeng Li, Michel Lespinasse, Bob Liu,
 Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes,
 Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
 Linux Kernel Mailing List, Joonsoo Kim
Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
Message-ID: <52CC16DC.9070308@suse.cz>
In-Reply-To: <6B2BA408B38BA1478B473C31C3D2074E2BF812BC82@SV-EXCHANGE1.Corp.FC.LOCAL>

On 01/06/2014 05:47 PM, Motohiro Kosaki wrote:
>
>> -----Original Message-----
>> From: linus971@gmail.com [mailto:linus971@gmail.com] On Behalf Of Linus Torvalds
>> Sent: Friday, January 03, 2014 7:18 PM
>> To: Vlastimil Babka
>> Cc: Sasha Levin; Andrew Morton; Wanpeng Li; Michel Lespinasse; Bob Liu; Nick Piggin;
>> Motohiro Kosaki JP; Rik van Riel; David Rientjes; Mel Gorman;
>> Minchan Kim; Hugh Dickins; Johannes Weiner; linux-mm;
>> Linux Kernel Mailing List
>> Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
>>
>> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka wrote:
>>>
>>> I'm for going with the removal of BUG_ON. The TestSetPageMlocked
>>> should provide enough race protection.
>>
>> Maybe. But dammit, that's subtle, and I don't think you're even right.
>>
>> It basically depends on mlock_vma_page() and munlock_vma_page() being
>> able to run CONCURRENTLY on the same page. In particular, you could
>> have a mlock_vma_page() set the bit on one CPU, and munlock_vma_page()
>> immediately clearing it on another, and then the rest of those
>> functions could run with a totally arbitrary interleaving when working
>> with the exact same page.
>>
>> They both do basically
>>
>>     if (!isolate_lru_page(page))
>>         putback_lru_page(page);
>>
>> but one or the other would randomly win the race (it's internally
>> protected by the lru lock), and *if* the munlock_vma_page() wins it,
>> it would also do
>>
>>     try_to_munlock(page);
>>
>> but if mlock_vma_page() wins it, that wouldn't happen. That looks
>> entirely broken - you end up with the PageMlocked bit clear, but
>> try_to_munlock() was never called on that page, because
>> mlock_vma_page() got to the page isolation before the "subsequent"
>> munlock_vma_page().
>>
>> And this is very much what the page lock serialization would prevent.
>> So no, the PageMlocked in *no* way gives serialization. It's an atomic
>> bit op, yes, but that only "serializes" in one direction, not when you
>> can have a mix of bit setting and clearing.
>>
>> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at
>> least enforces some kind of ordering. And try_to_unmap_cluster() is
>> just broken in calling that without the page being locked. That's my
>> opinion.
>> There may be some *other* reason why it all happens to work, but no,
>> "TestSetPageMlocked should provide enough race protection" is simply
>> not true, and even if it were, it's way too subtle and odd to be a
>> good rule.
>>
>> So I really object to just removing the BUG_ON(). Not with a *lot*
>> more explanation as to why these kinds of issues wouldn't matter.
>
> I don't have a perfect answer, but I can explain a bit of the history.
> Let me try.
>
> First off, 5 years ago, Lee's original putback_lru_page()
> implementation required the page lock, but I removed that restriction
> months later. That's why we see this strange BUG_ON here.
>
> 5 years ago, both mlock(2) and munlock(2) called do_mlock(), which was
> protected by mmap_sem (write mode). So mlock and munlock could not
> race. Now, __mm_populate() (called by mlock(2)) is only protected by
> mmap_sem in read mode. However, that is still enough to protect
> against munlock.
>
> Next, in the case of mlock vs reclaim, the key is that mlock(2) is a
> two-step operation: 1) turn on VM_LOCKED under mmap_sem in write mode,
> 2) turn on PG_mlocked under mmap_sem in read mode. If reclaim races
> against step (1), reclaim must lose because it uses trylock. On the
> other hand, if reclaim races against step (2), reclaim must observe
> VM_LOCKED because both the VM_LOCKED modifier and the observer take
> mmap_sem.
>
> By the way, page isolation is still necessary because we need to
> protect against other page modifications like page migration.

I guess you meant page locking, not (lru) isolation. Indeed, the
documentation in Documentation/vm/unevictable-lru.txt also discusses
races with page migration. I've checked, and it really seems to be the
case that mlock_migrate_page() could race with mlock_vma_page() so that
PG_mlocked is set on the old page again after it was decided that the
new page will be without the flag (or after the flag was transferred to
the new page). That would not be fatal, but it would distort the
accounting anyway.
So here's a patch that, if accepted, should replace the removal of the
BUG_ON patch in the -mm tree:
http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch

The idea is that try_to_unmap_cluster() will try locking the page for
mlock, and just leave it alone if the lock cannot be obtained. Again,
that's not fatal, as eventually something will encounter and mlock the
page.

-----8<-----
From: Vlastimil Babka
Date: Tue, 7 Jan 2014 14:59:58 +0100
Subject: [PATCH] mm: try_to_unmap_cluster() should lock_page() before mlocking

A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
fuzzing with trinity. The call site try_to_unmap_cluster() does not lock
the pages other than its check_page parameter (which is already locked).

The BUG_ON in mlock_vma_page() is not documented and its purpose is
somewhat unclear, but apparently it serializes against page migration,
which could otherwise fail to transfer the PG_mlocked flag. This would
not be fatal, as the page would be eventually encountered again, but
NR_MLOCK accounting would become distorted nevertheless. This patch adds
a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to
that effect.

The call site try_to_unmap_cluster() is fixed so that for page !=
check_page, trylock_page() is attempted (to avoid possible deadlocks as
we already have check_page locked) and mlock_vma_page() is performed
only upon success. If the page lock cannot be obtained, the page is left
without PG_mlocked, which is again not a problem in the whole
unevictable memory design.
Reported-by: Sasha Levin
Cc: Wanpeng Li
Cc: Michel Lespinasse
Cc: Bob Liu
Cc: KOSAKI Motohiro
Cc: Rik van Riel
Cc: David Rientjes
Cc: Mel Gorman
Cc: Hugh Dickins
Cc: Joonsoo Kim
Cc:
Signed-off-by: Vlastimil Babka
---
 mm/mlock.c |  2 ++
 mm/rmap.c  | 14 ++++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 192e6ee..1b12dfa 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -79,6 +79,7 @@ void clear_page_mlock(struct page *page)
  */
 void mlock_vma_page(struct page *page)
 {
+	/* Serialize with page migration */
 	BUG_ON(!PageLocked(page));

 	if (!TestSetPageMlocked(page)) {
@@ -153,6 +154,7 @@ unsigned int munlock_vma_page(struct page *page)
 {
 	unsigned int nr_pages;

+	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));

 	if (TestClearPageMlocked(page)) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 068522d..b99c742 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		BUG_ON(!page || PageAnon(page));

 		if (locked_vma) {
-			mlock_vma_page(page);	/* no-op if already mlocked */
-			if (page == check_page)
+			if (page == check_page) {
+				/* we know we have check_page locked */
+				mlock_vma_page(page);
 				ret = SWAP_MLOCK;
+			} else if (trylock_page(page)) {
+				/*
+				 * If we can lock the page, perform mlock.
+				 * Otherwise leave the page alone, it will be
+				 * eventually encountered again later.
+				 */
+				mlock_vma_page(page);
+				unlock_page(page);
+			}
 			continue;	/* don't unmap */
 		}

-- 
1.8.4