Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
From: Nadav Amit
Date: Tue, 29 Aug 2017 20:18:41 -0700
To: Jerome Glisse
Cc: Linux Kernel Mailing List, "open list:MEMORY MANAGEMENT", Dan Williams,
    Ross Zwisler, Linus Torvalds, Bernhard Held, Adam Borowski,
    Andrea Arcangeli, Radim Krčmář, Wanpeng Li, Paolo Bonzini, Takashi Iwai,
    Mike Galbraith, "Kirill A. Shutemov", axie, Andrew Morton
Message-Id: <0D062FD4-AAEB-46C9-9D3A-AC91E84414D3@gmail.com>
References: <20170829235447.10050-1-jglisse@redhat.com>
            <20170829235447.10050-3-jglisse@redhat.com>
            <6D58FBE4-5D03-49CC-AAFF-3C1279A5A849@gmail.com>
            <20170830025910.GB2386@redhat.com>

Nadav Amit wrote:

> Jerome Glisse wrote:
>
>> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
>>> Jérôme Glisse wrote:
>>>
>>>> Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidate_range()
>>>> and making sure it is bracketed by calls to mmu_notifier_invalidate_range_start()/
>>>> end().
>>>>
>>>> Note that because we cannot presume the pmd value or pte value we have to
>>>> assume the worst and unconditionally report an invalidation as happening.
>>>>
>>>> Signed-off-by: Jérôme Glisse
>>>> Cc: Dan Williams
>>>> Cc: Ross Zwisler
>>>> Cc: Linus Torvalds
>>>> Cc: Bernhard Held
>>>> Cc: Adam Borowski
>>>> Cc: Andrea Arcangeli
>>>> Cc: Radim Krčmář
>>>> Cc: Wanpeng Li
>>>> Cc: Paolo Bonzini
>>>> Cc: Takashi Iwai
>>>> Cc: Nadav Amit
>>>> Cc: Mike Galbraith
>>>> Cc: Kirill A. Shutemov
>>>> Cc: axie
>>>> Cc: Andrew Morton
>>>> ---
>>>>  mm/rmap.c | 44 +++++++++++++++++++++++++++++++++++++++++---
>>>>  1 file changed, 41 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index c8993c63eb25..da97ed525088 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -887,11 +887,21 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>>  		.address = address,
>>>>  		.flags = PVMW_SYNC,
>>>>  	};
>>>> +	unsigned long start = address, end;
>>>>  	int *cleaned = arg;
>>>>
>>>> +	/*
>>>> +	 * We have to assume the worse case ie pmd for invalidation. Note that
>>>> +	 * the page can not be free from this function.
>>>> +	 */
>>>> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
>>>> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>>> +
>>>>  	while (page_vma_mapped_walk(&pvmw)) {
>>>> +		unsigned long cstart, cend;
>>>>  		int ret = 0;
>>>> -		address = pvmw.address;
>>>> +
>>>> +		cstart = address = pvmw.address;
>>>>  		if (pvmw.pte) {
>>>>  			pte_t entry;
>>>>  			pte_t *pte = pvmw.pte;
>>>> @@ -904,6 +914,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>>  			entry = pte_wrprotect(entry);
>>>>  			entry = pte_mkclean(entry);
>>>>  			set_pte_at(vma->vm_mm, address, pte, entry);
>>>> +			cend = cstart + PAGE_SIZE;
>>>>  			ret = 1;
>>>>  		} else {
>>>>  #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
>>>> @@ -918,6 +929,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>>  			entry = pmd_wrprotect(entry);
>>>>  			entry = pmd_mkclean(entry);
>>>>  			set_pmd_at(vma->vm_mm, address, pmd, entry);
>>>> +			cstart &= PMD_MASK;
>>>> +			cend = cstart + PMD_SIZE;
>>>>  			ret = 1;
>>>>  #else
>>>>  			/* unexpected pmd-mapped page? */
>>>> @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>>  		}
>>>>
>>>>  		if (ret) {
>>>> -			mmu_notifier_invalidate_page(vma->vm_mm, address);
>>>> +			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
>>>>  			(*cleaned)++;
>>>>  		}
>>>>  	}
>>>>
>>>> +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>>>> +
>>>>  	return true;
>>>>  }
>>>>
>>>> @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>  	pte_t pteval;
>>>>  	struct page *subpage;
>>>>  	bool ret = true;
>>>> +	unsigned long start = address, end;
>>>>  	enum ttu_flags flags = (enum ttu_flags)arg;
>>>>
>>>>  	/* munlock has nothing to gain from examining un-locked vmas */
>>>> @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>  				flags & TTU_MIGRATION, page);
>>>>  	}
>>>>
>>>> +	/*
>>>> +	 * We have to assume the worse case ie pmd for invalidation. Note that
>>>> +	 * the page can not be free in this function as call of try_to_unmap()
>>>> +	 * must hold a reference on the page.
>>>> +	 */
>>>> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
>>>> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>>> +
>>>>  	while (page_vma_mapped_walk(&pvmw)) {
>>>>  		/*
>>>>  		 * If the page is mlock()d, we cannot swap it out.
>>>> @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>  			set_huge_swap_pte_at(mm, address,
>>>>  					     pvmw.pte, pteval,
>>>>  					     vma_mmu_pagesize(vma));
>>>> +			mmu_notifier_invalidate_range(mm, address,
>>>> +				address + vma_mmu_pagesize(vma));
>>>
>>> I don’t think that the notifier should be called after the PTE is set, but
>>> after the PTE is cleared, PTE permissions are demoted (e.g., RW->RO), or
>>> access/dirty bits are cleared. [There is an exception: if the PFN in the PTE
>>> is changed without clearing the PTE first, but that does not apply here, and
>>> there is a different notifier for that case.]
>>>
>>> Therefore, IIUC, try_to_unmap_one() should only call
>>> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
>>> ptep_clear_flush() are called. All the other calls to
>>> mmu_notifier_invalidate_range() in this function can be removed.
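
(To make the ordering I am describing concrete, here is a rough sketch and
not the actual patch: a hypothetical helper, unmap_pte_and_notify(), with
the try_to_unmap_one() context assumed. The only point is the relative
order of the primary PTE clear and the secondary-MMU invalidation, both
inside the range_start/end bracket this patch already adds; everything
else is elided.)

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Rough sketch only; name and context are assumed, not taken from the patch. */
static void unmap_pte_and_notify(struct mm_struct *mm,
				 struct vm_area_struct *vma,
				 unsigned long address, pte_t *ptep,
				 struct page *page)
{
	unsigned long start = address & PAGE_MASK;
	unsigned long end = start + PAGE_SIZE;
	pte_t pteval;

	mmu_notifier_invalidate_range_start(mm, start, end);

	/* Clear the primary PTE (and flush its TLB entry) first ... */
	pteval = ptep_clear_flush(vma, address, ptep);
	if (pte_dirty(pteval))
		set_page_dirty(page);

	/* ... and only then invalidate the secondary page tables. */
	mmu_notifier_invalidate_range(mm, start, end);

	/* (installing the swap/migration entry and so on would go here) */

	mmu_notifier_invalidate_range_end(mm, start, end);
}
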
>>
>> Yes, it would simplify the patch. I was trying to optimize for the case
>> where we restore the pte to its original value after ptep_clear_flush()
>> or ptep_get_and_clear(), as in that case there is no need to invalidate
>> any secondary page table, but that is an overkill optimization that just
>> adds too much complexity.
>
> Interesting. Actually, prior to your changes, it seems that the break
> statements would skip mmu_notifier_invalidate_page() when the PTE is not
> changed. So why not just change mmu_notifier_invalidate_page() into
> mmu_notifier_invalidate_range()?

Sorry - I noticed it was actually wrong before (as you noted) in at least
one case: (unlikely(PageSwapBacked(page) != PageSwapCache(page)))

Regards,
Nadav
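
P.S. For reference, the case I am thinking of is the early exit in the
anonymous-page path of try_to_unmap_one(). Quoting roughly from memory (so
the exact code may differ), it runs after the PTE has already been cleared
by ptep_get_and_clear()/ptep_clear_flush(), yet it breaks out of the loop
before reaching the mmu_notifier_invalidate_page() call at the bottom:

	if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
		WARN_ON_ONCE(1);
		ret = false;
		page_vma_mapped_walk_done(&pvmw);
		break;	/* skips the invalidation at the end of the loop */
	}

So the secondary page tables could already be left stale there, even before
this series.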