Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751683AbaAMOEB (ORCPT ); Mon, 13 Jan 2014 09:04:01 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60985 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751634AbaAMOD6 (ORCPT ); Mon, 13 Jan 2014 09:03:58 -0500 Message-ID: <52D3F248.7030803@suse.cz> Date: Mon, 13 Jan 2014 15:03:52 +0100 From: Vlastimil Babka User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Motohiro Kosaki , Andrew Morton CC: Linus Torvalds , Sasha Levin , Wanpeng Li , Michel Lespinasse , Bob Liu , Nick Piggin , Motohiro Kosaki JP , Rik van Riel , David Rientjes , Mel Gorman , Minchan Kim , Hugh Dickins , Johannes Weiner , linux-mm , Linux Kernel Mailing List , Joonsoo Kim Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs References: <1387267550-8689-1-git-send-email-liwanp@linux.vnet.ibm.com> <52b1138b.0201430a.19a8.605dSMTPIN_ADDED_BROKEN@mx.google.com> <52B11765.8030005@oracle.com> <52b120a5.a3b2440a.3acf.ffffd7c3SMTPIN_ADDED_BROKEN@mx.google.com> <52B166CF.6080300@suse.cz> <52b1699f.87293c0a.75d1.34d3SMTPIN_ADDED_BROKEN@mx.google.com> <20131218134316.977d5049209d9278e1dad225@linux-foundation.org> <52C71ACC.20603@oracle.com> <52C74972.6050909@suse.cz> <6B2BA408B38BA1478B473C31C3D2074E2BF812BC82@SV-EXCHANGE1.Corp.FC.LOCAL> <52CC16DC.9070308@suse.cz> <6B2BA408B38BA1478B473C31C3D2074E2C386DF586@SV-EXCHANGE1.Corp.FC.LOCAL> In-Reply-To: <6B2BA408B38BA1478B473C31C3D2074E2C386DF586@SV-EXCHANGE1.Corp.FC.LOCAL> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/10/2014 06:48 PM, Motohiro Kosaki wrote: >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 068522d..b99c742 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long >> cursor, unsigned int *mapcount, >> BUG_ON(!page || PageAnon(page)); >> >> if (locked_vma) { >> - mlock_vma_page(page); /* no-op if already >> mlocked */ >> - if (page == check_page) >> + if (page == check_page) { >> + /* we know we have check_page locked */ >> + mlock_vma_page(page); >> ret = SWAP_MLOCK; >> + } else if (trylock_page(page)) { >> + /* >> + * If we can lock the page, perform mlock. >> + * Otherwise leave the page alone, it will be >> + * eventually encountered again later. >> + */ >> + mlock_vma_page(page); >> + unlock_page(page); >> + } >> continue; /* don't unmap */ >> } > > I audited all related mm code. However I couldn't find any race that it can close. Well, I would say the lock here closes the race with page migration, no? (As discussed below) > First off, current munlock code is crazy tricky. Oops, that's actually a result of my patches to speed up munlock by batching pages (since 3.12). > munlock > down_write(mmap_sem) > do_mlock() > mlock_fixup > munlock_vma_pages_range > __munlock_pagevec > spin_lock_irq(zone->lru_lock) > TestClearPageMlocked(page) > del_page_from_lru_list > spin_unlock_irq(zone->lru_lock) > lock_page > __munlock_isolated_page > unlock_page > > up_write(mmap_sem) > > Look, TestClearPageMlocked(page) is not protected by lock_page. Right :( That's my fault, when developing the patch series I didn't see the page migration race, and it seemed that lock is only needed to protect the rmap operations in __munlock_isolated_page() > But this is still > safe because Page_mocked mean one or more vma marked VM_LOCKED then we > only need care about turning down final VM_LOCKED. I.e. mmap_sem protect them. > > And, > > spin_lock_irq(zone->lru_lock) > del_page_from_lru_list > spin_unlock_irq(zone->lru_lock) > > This idiom ensures I or someone else isolate the page from lru and isolated pages > will be put back by putback_lru_page in anycase. So, the pages will move the right > lru eventually. > > And then, taking page-lock doesn't help to close vs munlock race. > > On the other hands, page migration has the following call stack. > > some-caller [isolate_lru_page] > unmap_and_move > __unmap_and_move > trylock_page > try_to_unmap > move_to_new_page > migrate_page > migrate_page_copy > unlock_page > > The critical part (i.e. migrate_page_copy) is protected by both page isolation and page lock. However, none of that protects against setting PG_mlocked in try_to_unmap_cluster() -> mlock_vma_page(), or clearing PG_mlocked in __munlock_pagevec(). The race I have in mind for munlock is: CPU1 does page migration some-caller [isolate_lru_page] unmap_and_move __unmap_and_move trylock_page try_to_unmap move_to_new_page migrate_page migrate_page_copy mlock_migrate_page - transfers PG_mlocked from old page to new page CPU2 does munlock: munlock down_write(mmap_sem) do_mlock() mlock_fixup munlock_vma_pages_range __munlock_pagevec spin_lock_irq(zone->lru_lock) TestClearPageMlocked(page) - here it finds PG_mlocked already cleared so it stops, but meanwhile new page already has PG_mlocked set and will stay inevictable > Page fault path take lock page and doesn't use page isolation. This is correct. > try_to_unmap_cluster doesn't take page lock, but it ensure the page is isolated. This is correct too. I don't see where try_to_unmap_cluster() has guaranteed that pages other than check_page are isolated? (I might just be missing that). So in the race example above, CPU2 could be doing try_to_unmap_cluster() and set PG_mlocked on old page, with no effect on the new page. Not fatal for the design of lazy mlocking, but a distorted accounting anyway. > Plus, do_wp_page() has the following comment. But it is wrong. This lock is necessary to protect against > page migration, but not lru manipulation. > > if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) { > lock_page(old_page); /* LRU manipulation */ > munlock_vma_page(old_page); > unlock_page(old_page); > } > > > But, you know, this is crazy ugly lock design. I'm ok to change the rule to that PG_mlocked must be protected > page lock. If so, I propose to add PageMlocked() like this > > } else if (!PageMlocked() && trylock_page(page)) { > /* > * If we can lock the page, perform mlock. > * Otherwise leave the page alone, it will be > * eventually encountered again later. > */ > mlock_vma_page(page); > unlock_page(page); > > This is safe. Even if race is happen and vmscan failed to mark PG_mlocked, that's no problem. Next vmscan may > do the right job and will fix the mistake. > > Anyway, I'm also sure this is not recent issue. It lived 5 years. It is no reason to rush. Yeah but there's the new issue with __munlock_pagevec() :( Perhaps a more serious one as it could leave pages inevictable when they shouldn't be. I'm not sure if the performance benefits of that could be preserved with full page locking. Another option would be that page migration would somehow deal with the race, or just leave the target pages without PG_mlocked and let them be dealt with later. But if we go with the rule that page lock must protect PG_mlocked, then there's also clear_page_mlock() to consider. And finally, we could then at least replace the atomic test and set with faster non-atomic variants. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/