[Based on tag v5.14, but it should still apply to -mm too. If not, I can
repost anytime]
I picked up these patches from uffd-wp v5 series here:
https://lore.kernel.org/lkml/[email protected]/
IMHO all of them are nice cleanups to existing code already, and they're all
small and self-contained. They'll be needed by the coming uffd-wp series. I
would appreciate it if they could be accepted earlier, so that I don't need to
keep carrying them along with the uffd-wp series.
I removed some CCs from the uffd-wp v5 series to reduce the noise, and added a
few new ones.
Reviews are greatly welcomed, thanks.
Peter Xu (5):
mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
mm: Clear vmf->pte after pte_unmap_same() returns
mm: Drop first_index/last_index in zap_details
mm: Introduce zap_details.zap_flags
mm: Introduce ZAP_FLAG_SKIP_SWAP
include/linux/mm.h | 33 +++++++++++++++++--
mm/memory.c | 82 +++++++++++++++++++++-------------------------
mm/shmem.c | 1 -
mm/userfaultfd.c | 3 +-
4 files changed, 68 insertions(+), 51 deletions(-)
--
2.31.1
pte_unmap_same() will always unmap the pte pointer. After the unmap, vmf->pte
will no longer be valid, so we should clear it.
This was safe only because no one accesses vmf->pte after pte_unmap_same()
returns: the only caller of pte_unmap_same() (so far) is do_swap_page(), where
vmf->pte will in most cases be overwritten very soon anyway.
Directly pass vmf into pte_unmap_same(), which also lets us avoid the long
parameter list - a nice cleanup.
Reviewed-by: Miaohe Lin <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 25fc46e87214..204141e8a53d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2724,19 +2724,20 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
* proceeding (but do_wp_page is only called after already making such a check;
* and do_anonymous_page can safely check later on).
*/
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
- pte_t *page_table, pte_t orig_pte)
+static inline int pte_unmap_same(struct vm_fault *vmf)
{
int same = 1;
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
if (sizeof(pte_t) > sizeof(unsigned long)) {
- spinlock_t *ptl = pte_lockptr(mm, pmd);
+ spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
spin_lock(ptl);
- same = pte_same(*page_table, orig_pte);
+ same = pte_same(*vmf->pte, vmf->orig_pte);
spin_unlock(ptl);
}
#endif
- pte_unmap(page_table);
+ pte_unmap(vmf->pte);
+ /* After unmap of pte, the pointer is invalid now - clear it. */
+ vmf->pte = NULL;
return same;
}
@@ -3487,7 +3488,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vm_fault_t ret = 0;
void *shadow = NULL;
- if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+ if (!pte_unmap_same(vmf))
goto out;
entry = pte_to_swp_entry(vmf->orig_pte);
--
2.31.1
Instead of trying to introduce one variable for every new zap_details field,
let's introduce a flags field that can start to encode true/false information.
Use this flag first to clean up the only check_mapping variable. Firstly, the
name "check_mapping" implies it is a boolean, but it actually stores the
mapping itself, just in a way that it won't be set if we don't want to check
the mapping.
To make things clearer, introduce the first zap flag ZAP_FLAG_CHECK_MAPPING,
so that we only check against the mapping if this bit is set. At the same
time, rename check_mapping to zap_mapping and set it always.
While at it, introduce another helper zap_skip_check_mapping() and use it in
zap_pte_range() properly.
Some old comments have been removed from zap_pte_range() because they were
duplicated; and now that we have the ZAP_FLAG_CHECK_MAPPING flag, it's easy
to find this information by simply grepping for the flag.
It'll also make life easier when we want to e.g. pass zap_flags into callers
like unmap_mapping_pages() (instead of adding new booleans besides the
even_cows parameter).
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 19 ++++++++++++++++++-
mm/memory.c | 34 ++++++++++------------------------
2 files changed, 28 insertions(+), 25 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 69259229f090..fcbc1c4f8e8e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1716,14 +1716,31 @@ static inline bool can_do_mlock(void) { return false; }
extern int user_shm_lock(size_t, struct ucounts *);
extern void user_shm_unlock(size_t, struct ucounts *);
+/* Whether to check page->mapping when zapping */
+#define ZAP_FLAG_CHECK_MAPPING BIT(0)
+
/*
* Parameter block passed down to zap_pte_range in exceptional cases.
*/
struct zap_details {
- struct address_space *check_mapping; /* Check page->mapping if set */
+ struct address_space *zap_mapping;
struct page *single_page; /* Locked page to be unmapped */
+ unsigned long zap_flags;
};
+/* Return true if we should skip zapping this page, false otherwise */
+static inline bool
+zap_skip_check_mapping(struct zap_details *details, struct page *page)
+{
+ if (!details || !page)
+ return false;
+
+ if (!(details->zap_flags & ZAP_FLAG_CHECK_MAPPING))
+ return false;
+
+ return details->zap_mapping != page_rmapping(page);
+}
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 3b860f6a51ac..05ccacda4fe9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1333,16 +1333,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct page *page;
page = vm_normal_page(vma, addr, ptent);
- if (unlikely(details) && page) {
- /*
- * unmap_shared_mapping_pages() wants to
- * invalidate cache without truncating:
- * unmap shared but keep private pages.
- */
- if (details->check_mapping &&
- details->check_mapping != page_rmapping(page))
- continue;
- }
+ if (unlikely(zap_skip_check_mapping(details, page)))
+ continue;
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
tlb_remove_tlb_entry(tlb, pte, addr);
@@ -1375,17 +1367,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
is_device_exclusive_entry(entry)) {
struct page *page = pfn_swap_entry_to_page(entry);
- if (unlikely(details && details->check_mapping)) {
- /*
- * unmap_shared_mapping_pages() wants to
- * invalidate cache without truncating:
- * unmap shared but keep private pages.
- */
- if (details->check_mapping !=
- page_rmapping(page))
- continue;
- }
-
+ if (unlikely(zap_skip_check_mapping(details, page)))
+ continue;
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
rss[mm_counter(page)]--;
@@ -3369,8 +3352,9 @@ void unmap_mapping_page(struct page *page)
first_index = page->index;
last_index = page->index + thp_nr_pages(page) - 1;
- details.check_mapping = mapping;
+ details.zap_mapping = mapping;
details.single_page = page;
+ details.zap_flags = ZAP_FLAG_CHECK_MAPPING;
i_mmap_lock_write(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
@@ -3395,9 +3379,11 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
pgoff_t nr, bool even_cows)
{
pgoff_t first_index = start, last_index = start + nr - 1;
- struct zap_details details = { };
+ struct zap_details details = { .zap_mapping = mapping };
+
+ if (!even_cows)
+ details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
- details.check_mapping = even_cows ? NULL : mapping;
if (last_index < first_index)
last_index = ULONG_MAX;
--
2.31.1
It was conditionally done previously, as there's one shmem special case where
we use SetPageDirty() instead. However that's not necessary, and it should be
easier and cleaner to do it unconditionally in mfill_atomic_install_pte().
The most recent discussion about this is here, where Hugh explained the history
of SetPageDirty() and why it's possible that it's not required at all:
https://lore.kernel.org/lkml/[email protected]/
Currently mfill_atomic_install_pte() has three callers:
1. shmem_mfill_atomic_pte
2. mcopy_atomic_pte
3. mcontinue_atomic_pte
After the change: case (1) should have its SetPageDirty replaced by the dirty
bit on the pte (so we unify them together, finally), case (2) should have no
functional change at all as it has page_in_cache==false, and case (3) may add
a dirty bit to the pte. However since case (3) is UFFDIO_CONTINUE for shmem,
it's nearly 100% sure the page is dirty anyway, so it should not make a real
difference either.
This should make it much easier to follow which cases will set dirty for
uffd, as we'll now simply set it for all uffd-related ioctls. Meanwhile,
there's no special handling of SetPageDirty() when there's no need for it.
Cc: Hugh Dickins <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/shmem.c | 1 -
mm/userfaultfd.c | 3 +--
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index dacda7463d54..3f91c8ce4d02 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2437,7 +2437,6 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
shmem_recalc_inode(inode);
spin_unlock_irq(&info->lock);
- SetPageDirty(page);
unlock_page(page);
return 0;
out_delete_from_cache:
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0e2132834bc7..b30a3724c701 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -69,10 +69,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
pgoff_t offset, max_off;
_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+ _dst_pte = pte_mkdirty(_dst_pte);
if (page_in_cache && !vm_shared)
writable = false;
- if (writable || !page_in_cache)
- _dst_pte = pte_mkdirty(_dst_pte);
if (writable) {
if (wp_copy)
_dst_pte = pte_mkuffd_wp(_dst_pte);
--
2.31.1
Firstly, the comment in zap_pte_range() is misleading because it checks
against "details" rather than "check_mapping", so it doesn't match what the
code actually does. Meanwhile, it's also confusing in that it doesn't explain
why passing in the details pointer means skipping all swap entries. A new
user of zap_details could very possibly miss this fact if they don't read all
the way down to zap_pte_range(), because there's no comment at zap_details
mentioning it at all, so swap entries could be erroneously skipped without
anyone noticing.
This partly reverts 3e8715fdc03e ("mm: drop zap_details::check_swap_entries"),
but introduces the ZAP_FLAG_SKIP_SWAP flag, which means the opposite of the
previous "details" parameter: the caller should explicitly set this to skip
swap entries; otherwise swap entries will always be considered (which is
still the major case here).
Cc: Kirill A. Shutemov <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 12 ++++++++++++
mm/memory.c | 8 +++++---
2 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fcbc1c4f8e8e..f798f5e4baa5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1718,6 +1718,8 @@ extern void user_shm_unlock(size_t, struct ucounts *);
/* Whether to check page->mapping when zapping */
#define ZAP_FLAG_CHECK_MAPPING BIT(0)
+/* Whether to skip zapping swap entries */
+#define ZAP_FLAG_SKIP_SWAP BIT(1)
/*
* Parameter block passed down to zap_pte_range in exceptional cases.
@@ -1741,6 +1743,16 @@ zap_skip_check_mapping(struct zap_details *details, struct page *page)
return details->zap_mapping != page_rmapping(page);
}
+/* Return true if we should skip swap entries, false otherwise */
+static inline bool
+zap_skip_swap(struct zap_details *details)
+{
+ if (!details)
+ return false;
+
+ return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
+}
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 05ccacda4fe9..79957265afb4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1379,8 +1379,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
}
- /* If details->check_mapping, we leave swap entries. */
- if (unlikely(details))
+ if (unlikely(zap_skip_swap(details)))
continue;
if (!non_swap_entry(entry))
@@ -3379,7 +3378,10 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
pgoff_t nr, bool even_cows)
{
pgoff_t first_index = start, last_index = start + nr - 1;
- struct zap_details details = { .zap_mapping = mapping };
+ struct zap_details details = {
+ .zap_mapping = mapping,
+ .zap_flags = ZAP_FLAG_SKIP_SWAP,
+ };
if (!even_cows)
details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
--
2.31.1
On Wed, Sep 1, 2021 at 1:56 PM Peter Xu <[email protected]> wrote:
>
> It was conditionally done previously, as there's one shmem special case where
> we use SetPageDirty() instead. However that's not necessary, and it should be
> easier and cleaner to do it unconditionally in mfill_atomic_install_pte().
>
> The most recent discussion about this is here, where Hugh explained the history
> of SetPageDirty() and why it's possible that it's not required at all:
>
> https://lore.kernel.org/lkml/[email protected]/
Thanks for the cleanup Peter!
I think the discussion of whether or not the data can be marked dirty
below is correct, and the code change looks good as well. But, I think
we're missing an explanation why Hugh's concern is indeed not a
problem?
Specifically, this question:
"Haha: I think Andrea is referring to exactly the dirty_accountable
code in change_pte_protection() which worried me above. Now, I think
that will turn out okay (shmem does not have a page_mkwrite(), and
does not participate in dirty accounting), but you will have to do
some work to assure us all of that, before sending in a cleanup
patch."
Do we have more evidence that this is indeed fine, vs. what we had
when discussing this before? If so, we should talk about it explicitly
in this commit message, I think.
(Sorry if you've covered this and it's just going over my head. ;) )
>
>
>
> Currently mfill_atomic_install_pte() has three callers:
>
> 1. shmem_mfill_atomic_pte
> 2. mcopy_atomic_pte
> 3. mcontinue_atomic_pte
>
> After the change: case (1) should have its SetPageDirty replaced by the dirty
> bit on the pte (so we unify them together, finally), case (2) should have no
> functional change at all as it has page_in_cache==false, and case (3) may add
> a dirty bit to the pte. However since case (3) is UFFDIO_CONTINUE for shmem,
> it's nearly 100% sure the page is dirty anyway, so it should not make a real
> difference either.
>
> This should make it much easier to follow which cases will set dirty for
> uffd, as we'll now simply set it for all uffd-related ioctls. Meanwhile,
> there's no special handling of SetPageDirty() when there's no need for it.
>
> Cc: Hugh Dickins <[email protected]>
> Cc: Axel Rasmussen <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/shmem.c | 1 -
> mm/userfaultfd.c | 3 +--
> 2 files changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index dacda7463d54..3f91c8ce4d02 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2437,7 +2437,6 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
> shmem_recalc_inode(inode);
> spin_unlock_irq(&info->lock);
>
> - SetPageDirty(page);
> unlock_page(page);
> return 0;
> out_delete_from_cache:
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 0e2132834bc7..b30a3724c701 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -69,10 +69,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> pgoff_t offset, max_off;
>
> _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> + _dst_pte = pte_mkdirty(_dst_pte);
> if (page_in_cache && !vm_shared)
> writable = false;
> - if (writable || !page_in_cache)
> - _dst_pte = pte_mkdirty(_dst_pte);
> if (writable) {
> if (wp_copy)
> _dst_pte = pte_mkuffd_wp(_dst_pte);
> --
> 2.31.1
>
Hi, Axel,
On Wed, Sep 01, 2021 at 02:48:53PM -0700, Axel Rasmussen wrote:
> On Wed, Sep 1, 2021 at 1:56 PM Peter Xu <[email protected]> wrote:
> >
> > It was conditionally done previously, as there's one shmem special case where
> > we use SetPageDirty() instead. However that's not necessary, and it should be
> > easier and cleaner to do it unconditionally in mfill_atomic_install_pte().
> >
> > The most recent discussion about this is here, where Hugh explained the history
> > of SetPageDirty() and why it's possible that it's not required at all:
> >
> > https://lore.kernel.org/lkml/[email protected]/
>
> Thanks for the cleanup Peter!
No problem. Obviously that special handling of SetPageDirty is still too
tricky for me, and I'd love to remove it.
>
> I think the discussion of whether or not the data can be marked dirty
> below is correct, and the code change looks good as well. But, I think
> we're missing an explanation why Hugh's concern is indeed not a
> problem?
>
> Specifically, this question:
>
> "Haha: I think Andrea is referring to exactly the dirty_accountable
> code in change_pte_protection() which worried me above. Now, I think
> that will turn out okay (shmem does not have a page_mkwrite(), and
> does not participate in dirty accounting), but you will have to do
> some work to assure us all of that, before sending in a cleanup
> patch."
>
> Do we have more evidence that this is indeed fine, vs. what we had
> when discussing this before? If so, we should talk about it explicitly
> in this commit message, I think.
>
> (Sorry if you've covered this and it's just going over my head. ;) )
Thanks for looking into this.
I thought Hugh's explanation should mostly have covered that. The previous
worry was that mprotect() might apply the write bit erroneously if we have
some read-only pte marked dirty. But I don't think that will happen, just as
Hugh stated in the thread I attached, because the dirty accountable flag is
only set if vma_wants_writenotify() returns true.
Take the first example within that helper:
if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
return 0;
So firstly it never applies to a vma that doesn't have VM_WRITE|VM_SHARED. So
far it doesn't even work for anonymous memory, though logically it could,
like:
https://github.com/aagit/aa/commit/05dc2c56ef79b3836c75fcf68c5b19b08f4e4c58
Peter Collingbourne originated that patch; for some reason (which I forgot) it
didn't land, however I still think it's doable even for anonymous memory.
Sorry to have gone off-topic; let me go back to it.
It also checks for e.g. page_mkwrite() needs, soft dirty tracking and so on,
to make sure it's okay to grant the write bit when possible.
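For reference, here's roughly the whole helper (paraphrased from memory of
the v5.14 mm/mmap.c and trimmed, so details may be slightly off):

	int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
	{
		vm_flags_t vm_flags = vma->vm_flags;
		const struct vm_operations_struct *vm_ops = vma->vm_ops;

		/* If it was private or non-writable, the write bit is already clear */
		if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
			return 0;

		/* The backer wishes to know when pages are first written to? */
		if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
			return 1;

		/* Do we need to track softdirty? */
		if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vm_flags & VM_SOFTDIRTY))
			return 1;

		/* Specialty mapping? */
		if (vm_flags & VM_PFNMAP)
			return 0;

		/* Can the mapping track the dirty pages? */
		return vma->vm_file && vma->vm_file->f_mapping &&
			mapping_can_writeback(vma->vm_file->f_mapping);
	}

For a shmem vma there's no page_mkwrite(), and mapping_can_writeback() should
be false since shmem doesn't do dirty accounting - that's the "shmem does not
have a page_mkwrite(), and does not participate in dirty accounting" part of
Hugh's comment above.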
Hugh mentioned "do some work to assure us all of that" - firstly I went
through the code carefully myself, so I'm more certain it's doing the right
thing; secondly I ran quite a few tests on the patch (actually on the whole
uffd-wp shmem+hugetlbfs branch). Even if I'm going to switch the uffd-wp
series to the pte marker format, this patch won't change.
I also analyzed the three callers that may be affected by this change below,
explaining why each is okay. I hope that can also be counted as part of the
"some work" that Hugh asked for.
Besides all these, I'm pretty happy too if anyone can help point out anything
still missing that means we can't do this. That's the "code review" part for
every single patch, including this one, isn't it? :)
Thanks,
>
> >
> >
> >
> > Currently mfill_atomic_install_pte() has three callers:
> >
> > 1. shmem_mfill_atomic_pte
> > 2. mcopy_atomic_pte
> > 3. mcontinue_atomic_pte
> >
> > After the change: case (1) should have its SetPageDirty replaced by the dirty
> > bit on the pte (so we unify them together, finally), case (2) should have no
> > functional change at all as it has page_in_cache==false, and case (3) may add
> > a dirty bit to the pte. However since case (3) is UFFDIO_CONTINUE for shmem,
> > it's nearly 100% sure the page is dirty anyway, so it should not make a real
> > difference either.
--
Peter Xu
On 01.09.21 22:56, Peter Xu wrote:
> pte_unmap_same() will always unmap the pte pointer. After the unmap, vmf->pte
> will no longer be valid, so we should clear it.
>
> This was safe only because no one accesses vmf->pte after pte_unmap_same()
> returns: the only caller of pte_unmap_same() (so far) is do_swap_page(), where
> vmf->pte will in most cases be overwritten very soon anyway.
>
> Directly pass vmf into pte_unmap_same(), which also lets us avoid the long
> parameter list - a nice cleanup.
>
> Reviewed-by: Miaohe Lin <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/memory.c | 13 +++++++------
> 1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 25fc46e87214..204141e8a53d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2724,19 +2724,20 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
> * proceeding (but do_wp_page is only called after already making such a check;
> * and do_anonymous_page can safely check later on).
> */
> -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> - pte_t *page_table, pte_t orig_pte)
> +static inline int pte_unmap_same(struct vm_fault *vmf)
> {
> int same = 1;
> #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
> if (sizeof(pte_t) > sizeof(unsigned long)) {
> - spinlock_t *ptl = pte_lockptr(mm, pmd);
> + spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> spin_lock(ptl);
> - same = pte_same(*page_table, orig_pte);
> + same = pte_same(*vmf->pte, vmf->orig_pte);
> spin_unlock(ptl);
> }
> #endif
> - pte_unmap(page_table);
> + pte_unmap(vmf->pte);
> + /* After unmap of pte, the pointer is invalid now - clear it. */
I'd just drop the comment, it's what we do in similar code.
> + vmf->pte = NULL;
> return same;
> }
>
> @@ -3487,7 +3488,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> vm_fault_t ret = 0;
> void *shadow = NULL;
>
> - if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
> + if (!pte_unmap_same(vmf))
> goto out;
Funny, I prototyped something similar yesterday. I did it via
same = pte_lock_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte);
pte_unmap(vmf->pte);
vmf->pte = NULL;
if (!same)
goto out;
To just move handling to the caller.
But this also looks fine, whatever you prefer.
Reviewed-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
On 01.09.21 22:57, Peter Xu wrote:
> Instead of trying to introduce one variable for every new zap_details field,
> let's introduce a flags field that can start to encode true/false information.
>
> Use this flag first to clean up the only check_mapping variable. Firstly, the
> name "check_mapping" implies it is a boolean, but it actually stores the
> mapping itself, just in a way that it won't be set if we don't want to check
> the mapping.
>
> To make things clearer, introduce the first zap flag ZAP_FLAG_CHECK_MAPPING,
> so that we only check against the mapping if this bit is set. At the same
> time, rename check_mapping to zap_mapping and set it always.
>
> While at it, introduce another helper zap_skip_check_mapping() and use it in
> zap_pte_range() properly.
>
> Some old comments have been removed from zap_pte_range() because they were
> duplicated; and now that we have the ZAP_FLAG_CHECK_MAPPING flag, it's easy
> to find this information by simply grepping for the flag.
>
> It'll also make life easier when we want to e.g. pass zap_flags into callers
> like unmap_mapping_pages() (instead of adding new booleans besides the
> even_cows parameter).
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/mm.h | 19 ++++++++++++++++++-
> mm/memory.c | 34 ++++++++++------------------------
> 2 files changed, 28 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 69259229f090..fcbc1c4f8e8e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1716,14 +1716,31 @@ static inline bool can_do_mlock(void) { return false; }
> extern int user_shm_lock(size_t, struct ucounts *);
> extern void user_shm_unlock(size_t, struct ucounts *);
>
> +/* Whether to check page->mapping when zapping */
> +#define ZAP_FLAG_CHECK_MAPPING BIT(0)
So we want to go the full way, like:
typedef int __bitwise zap_flags_t;
#define ZAP_FLAG_CHECK_MAPPING ((__force zap_flags_t)BIT(0))
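... and then, just sketching how the struct would pick it up (untested):

	struct zap_details {
		struct address_space *zap_mapping;
		struct page *single_page;	/* Locked page to be unmapped */
		zap_flags_t zap_flags;
	};

so sparse can catch anyone mixing plain integers into the flags.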
> +
> /*
> * Parameter block passed down to zap_pte_range in exceptional cases.
> */
> struct zap_details {
> - struct address_space *check_mapping; /* Check page->mapping if set */
> + struct address_space *zap_mapping;
> struct page *single_page; /* Locked page to be unmapped */
> + unsigned long zap_flags;
Why call it "zap_*" if everything in the structure is related to
zapping? IOW, simply "mapping", "flags" would be good enough.
> };
>
> +/* Return true if we should skip zapping this page, false otherwise */
> +static inline bool
> +zap_skip_check_mapping(struct zap_details *details, struct page *page)
> +{
> + if (!details || !page)
> + return false;
> +
> + if (!(details->zap_flags & ZAP_FLAG_CHECK_MAPPING))
> + return false;
> +
> + return details->zap_mapping != page_rmapping(page);
> +}
I'm confused, why isn't "!details->zap_mapping" vs.
"details->zap_mapping" sufficient? I can see that you may need flags for
other purposes (next patch), but why do we need it here?
Factoring it out into this helper is a nice cleanup, though. But I'd
just not introduce ZAP_FLAG_CHECK_MAPPING.
--
Thanks,
David / dhildenb
On Thu, Sep 02, 2021 at 09:28:42AM +0200, David Hildenbrand wrote:
> On 01.09.21 22:57, Peter Xu wrote:
> > Instead of trying to introduce one variable for every new zap_details field,
> > let's introduce a flags field that can start to encode true/false information.
> >
> > Use this flag first to clean up the only check_mapping variable. Firstly, the
> > name "check_mapping" implies it is a boolean, but it actually stores the
> > mapping itself, just in a way that it won't be set if we don't want to check
> > the mapping.
> >
> > To make things clearer, introduce the first zap flag ZAP_FLAG_CHECK_MAPPING,
> > so that we only check against the mapping if this bit is set. At the same
> > time, rename check_mapping to zap_mapping and set it always.
> >
> > While at it, introduce another helper zap_skip_check_mapping() and use it in
> > zap_pte_range() properly.
> >
> > Some old comments have been removed from zap_pte_range() because they were
> > duplicated; and now that we have the ZAP_FLAG_CHECK_MAPPING flag, it's easy
> > to find this information by simply grepping for the flag.
> >
> > It'll also make life easier when we want to e.g. pass zap_flags into callers
> > like unmap_mapping_pages() (instead of adding new booleans besides the
> > even_cows parameter).
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > include/linux/mm.h | 19 ++++++++++++++++++-
> > mm/memory.c | 34 ++++++++++------------------------
> > 2 files changed, 28 insertions(+), 25 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 69259229f090..fcbc1c4f8e8e 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1716,14 +1716,31 @@ static inline bool can_do_mlock(void) { return false; }
> > extern int user_shm_lock(size_t, struct ucounts *);
> > extern void user_shm_unlock(size_t, struct ucounts *);
> > +/* Whether to check page->mapping when zapping */
> > +#define ZAP_FLAG_CHECK_MAPPING BIT(0)
>
> So we want to go the full way, like:
>
> typedef int __bitwise zap_flags_t;
>
> #define ZAP_FLAG_CHECK_MAPPING ((__force zap_flags_t)BIT(0))
Sure.
>
> > +
> > /*
> > * Parameter block passed down to zap_pte_range in exceptional cases.
> > */
> > struct zap_details {
> > - struct address_space *check_mapping; /* Check page->mapping if set */
> > + struct address_space *zap_mapping;
> > struct page *single_page; /* Locked page to be unmapped */
> > + unsigned long zap_flags;
>
> Why call it "zap_*" if everything in the structure is related to zapping?
> IOW, simply "mapping", "flags" would be good enough.
Not sure if it's a good habit or a bad one - it's just so that a tagging
system can distinguish it from other "mapping" variables, or for a simple
grep on the name. So I normally prefix fields with some special wording to
avoid collisions.
>
> > };
> > +/* Return true if we should skip zapping this page, false otherwise */
> > +static inline bool
> > +zap_skip_check_mapping(struct zap_details *details, struct page *page)
> > +{
> > + if (!details || !page)
> > + return false;
> > +
> > + if (!(details->zap_flags & ZAP_FLAG_CHECK_MAPPING))
> > + return false;
> > +
> > + return details->zap_mapping != page_rmapping(page);
> > +}
>
> I'm confused, why isn't "!details->zap_mapping" vs. "details->zap_mapping"
> sufficient? I can see that you may need flags for other purposes (next
> patch), but why do we need it here?
>
> Factoring it out into this helper is a nice cleanup, though. But I'd just
> not introduce ZAP_FLAG_CHECK_MAPPING.
Yes, I think it's okay. I wanted to separate them as they're fundamentally
two things to me. Example: what if the mapping we want to check is NULL
itself (remove private pages only; though that may not have a real user, at
least so far)? In that case one variable won't be able to cover it.
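To illustrate (hypothetical, as said - there's no real user of this today):

	/* Zap everything; mappings are not checked at all */
	struct zap_details details = { .zap_mapping = NULL };

	/* Zap only pages whose page_rmapping() is NULL - this needs the
	 * flag, since a bare "if (details->zap_mapping)" test can't tell
	 * it apart from the case above */
	struct zap_details details = {
		.zap_mapping = NULL,
		.zap_flags = ZAP_FLAG_CHECK_MAPPING,
	};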
But indeed Matthew raised a similar comment, so it seems to be a common
preference. No strong opinion on my side; let me go along with it.
Thanks for looking,
--
Peter Xu
On Wed, Sep 1, 2021 at 4:00 PM Peter Xu <[email protected]> wrote:
>
> Hi, Axel,
>
> On Wed, Sep 01, 2021 at 02:48:53PM -0700, Axel Rasmussen wrote:
> > On Wed, Sep 1, 2021 at 1:56 PM Peter Xu <[email protected]> wrote:
> > >
> > > It was conditionally done previously, as there's one shmem special case where
> > > we use SetPageDirty() instead. However that's not necessary, and it should be
> > > easier and cleaner to do it unconditionally in mfill_atomic_install_pte().
> > >
> > > The most recent discussion about this is here, where Hugh explained the history
> > > of SetPageDirty() and why it's possible that it's not required at all:
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> >
> > Thanks for the cleanup Peter!
>
> No problem. Obviously that special handling of SetPageDirty is still too
> tricky for me, and I'd love to remove it.
>
> >
> > I think the discussion of whether or not the data can be marked dirty
> > below is correct, and the code change looks good as well. But, I think
> > we're missing an explanation why Hugh's concern is indeed not a
> > problem?
> >
> > Specifically, this question:
> >
> > "Haha: I think Andrea is referring to exactly the dirty_accountable
> > code in change_pte_protection() which worried me above. Now, I think
> > that will turn out okay (shmem does not have a page_mkwrite(), and
> > does not participate in dirty accounting), but you will have to do
> > some work to assure us all of that, before sending in a cleanup
> > patch."
> >
> > Do we have more evidence that this is indeed fine, vs. what we had
> > when discussing this before? If so, we should talk about it explicitly
> > in this commit message, I think.
> >
> > (Sorry if you've covered this and it's just going over my head. ;) )
>
> Thanks for looking into this.
>
> I thought Hugh's explanation should mostly have covered that. The previous
> worry was that mprotect() might apply the write bit erroneously if we have
> some read-only pte marked dirty. But I don't think that will happen, just as
> Hugh stated in the thread I attached, because the dirty accountable flag is
> only set if vma_wants_writenotify() returns true.
>
> Take the first example within that helper:
>
> if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
> return 0;
>
> So firstly it never applies to a vma that doesn't have VM_WRITE|VM_SHARED. So
> far it doesn't even work for anonymous memory, though logically it could,
> like:
>
> https://github.com/aagit/aa/commit/05dc2c56ef79b3836c75fcf68c5b19b08f4e4c58
>
> Peter Collingbourne originated that patch; for some reason (which I forgot) it
> didn't land, however I still think it's doable even for anonymous memory.
>
> Sorry to have gone off-topic; let me go back to it.
>
> It also checks for e.g. page_mkwrite() needs, soft dirty tracking and so on,
> to make sure it's okay to grant the write bit when possible.
>
> Hugh mentioned "do some work to assure us all of that" - firstly I went
> through the code carefully myself, so I'm more certain it's doing the right
> thing; secondly I ran quite a few tests on the patch (actually on the whole
> uffd-wp shmem+hugetlbfs branch). Even if I'm going to switch the uffd-wp
> series to the pte marker format, this patch won't change.
>
> I also analyzed the three callers that may be affected by this change below,
> explaining why each is okay. I hope that can also be counted as part of the
> "some work" that Hugh asked for.
>
> Besides all these, I'm pretty happy too if anyone can help point out anything
> still missing that means we can't do this. That's the "code review" part for
> every single patch, including this one, isn't it? :)
Makes sense. :) Allow me to think out loud for a moment about the three UFFD
cases:
I agree case (2) mcopy_atomic_pte has !page_in_cache (by definition,
since COPY specifically means inserting a new page), and so is not
changing functionally. I'll ignore this case.
Case (1) shmem_mfill_atomic_pte is only called for shmem VMAs, with
VM_SHARED, and with mode being one of copy (NORMAL) or ZEROPAGE. This
case is interesting, because like case (2) we certainly have
!page_in_cache, so we were always doing pte_mkdirty(). The difference
here is we now no longer *also* SetPageDirty(). What does it mean for
a PTE to be dirty, but not the page (or the other way around)? I
confess I don't know.
But, at least as far as mprotect is concerned, it seems to not care
about PageDirty really. It does check it in one place, "if
(page_is_file_lru(page) && PageDirty(page))", but "page_is_file_lru"
really means "is the page *not* swap backed", so I think shmem pages
are never file lru? If true, then this check is never true for shmem
anyway, regardless of whether we SetPageDirty or not, and can be
ignored.
So, for case (1) I think there's no behavior change, at least with
regard to mprotect().
Case (3) mcontinue_atomic_pte handles CONTINUE for all shmem VMAs.
Meaning, we by definition have page_in_cache. It also seems to me we
could have any combination of VM_SHARED or not, VM_WRITE or not, PTE
writeable or not.
I have to admit, I am a little bit fuzzy on exactly what PageDirty
means for shmem (since there is no "filesystem", it's all just RAM). I
think being "dirty" in this case means, if we wanted to reclaim the
page, we would need to *swap it out*. I.e., shmem pages are always
"dirty", as long as they are in the page cache (and have been modified
at least once since being allocated? or maybe we don't bother keeping
track of zero pages).
So, this patch changes the behavior for case (3), when we end up with
!writable (i.e., !(VM_WRITE && VM_SHARED) -- note we check VM_SHARED
further down). We used to *skip* pte_mkdirty() if !writable, but now
we do it after all. But, for mprotect to do its thing, it requires
vma_wants_writenotify() -- but if we have !VM_WRITE, or !VM_SHARED, it
returns false. So in the only case where case (3) actually does
something different, mprotect *doesn't* do anything.
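(For completeness, the write bit grant we're worried about is this hunk in
change_pte_range() - quoting mm/mprotect.c from memory, so it may be slightly
off:

	/* Avoid taking write faults for known dirty pages */
	if (dirty_accountable && pte_dirty(ptent) &&
			(pte_soft_dirty(ptent) ||
			 !(vma->vm_flags & VM_SOFTDIRTY))) {
		ptent = pte_mkwrite(ptent);
	}

where dirty_accountable is exactly the vma_wants_writenotify() result
discussed above.)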
Okay, I've convinced myself. :) Sorry for the long e-mail. Feel free
to take (for v2 as well, this patch didn't change):
Reviewed-by: Axel Rasmussen <[email protected]>
>
> Thanks,
>
> >
> > >
> > >
> > >
> > > Currently mfill_atomic_install_pte() has three callers:
> > >
> > > 1. shmem_mfill_atomic_pte
> > > 2. mcopy_atomic_pte
> > > 3. mcontinue_atomic_pte
> > >
> > > After the change: case (1) should have its SetPageDirty replaced by the dirty
> > > bit on the pte (so we unify them together, finally), case (2) should have no
> > > functional change at all as it has page_in_cache==false, and case (3) may add
> > > a dirty bit to the pte. However since case (3) is UFFDIO_CONTINUE for shmem,
> > > it's nearly 100% sure the page is dirty anyway, so it should not make a real
> > > difference either.
>
> --
> Peter Xu
>
On Thu, Sep 02, 2021 at 02:54:12PM -0700, Axel Rasmussen wrote:
> On Wed, Sep 1, 2021 at 4:00 PM Peter Xu <[email protected]> wrote:
> >
> > Hi, Axel,
> >
> > On Wed, Sep 01, 2021 at 02:48:53PM -0700, Axel Rasmussen wrote:
> > > On Wed, Sep 1, 2021 at 1:56 PM Peter Xu <[email protected]> wrote:
> > > >
> > > > It was conditionally done previously, as there's one shmem special case where
> > > > we use SetPageDirty() instead. However that's not necessary, and it should be
> > > > easier and cleaner to do it unconditionally in mfill_atomic_install_pte().
> > > >
> > > > The most recent discussion about this is here, where Hugh explained the history
> > > > of SetPageDirty() and why it's possible that it's not required at all:
> > > >
> > > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > Thanks for the cleanup Peter!
> >
> > No problem. Obviously that special handling of SetPageDirty is still too
> > tricky for me, and I'd love to remove it.
> >
> > >
> > > I think the discussion of whether or not the data can be marked dirty
> > > below is correct, and the code change looks good as well. But, I think
> > > we're missing an explanation why Hugh's concern is indeed not a
> > > problem?
> > >
> > > Specifically, this question:
> > >
> > > "Haha: I think Andrea is referring to exactly the dirty_accountable
> > > code in change_pte_protection() which worried me above. Now, I think
> > > that will turn out okay (shmem does not have a page_mkwrite(), and
> > > does not participate in dirty accounting), but you will have to do
> > > some work to assure us all of that, before sending in a cleanup
> > > patch."
> > >
> > > Do we have more evidence that this is indeed fine, vs. what we had
> > > when discussing this before? If so, we should talk about it explicitly
> > > in this commit message, I think.
> > >
> > > (Sorry if you've covered this and it's just going over my head. ;) )
> >
> > Thanks for looking into this.
> >
> > I thought Hugh's explanation should mostly have covered that. The previous
> > worry was that mprotect() might apply the write bit erroneously if we have
> > some read-only pte marked dirty. But I don't think that will happen, just as
> > Hugh stated in the thread I attached, because the dirty accountable flag is
> > only set if vma_wants_writenotify() returns true.
> >
> > Take the first example within that helper:
> >
> > if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
> > return 0;
> >
> > So firstly it never applies to a vma that doesn't have VM_WRITE|VM_SHARED. So
> > far it doesn't even work for anonymous memory, though logically it could,
> > like:
> >
> > https://github.com/aagit/aa/commit/05dc2c56ef79b3836c75fcf68c5b19b08f4e4c58
> >
> > Peter Collingbourne originated that patch; for some reason (which I forgot) it
> > didn't land, however I still think it's doable even for anonymous memory.
> >
> > Sorry to have gone off-topic; let me go back to it.
> >
> > It also checks for e.g. page_mkwrite() needs, soft dirty tracking and so on,
> > to make sure it's okay to grant the write bit when possible.
> >
> > Hugh mentioned "do some work to assure us all of that" - firstly I went
> > through the code carefully myself, so I'm more certain it's doing the right
> > thing; secondly I ran quite a few tests on the patch (actually on the whole
> > uffd-wp shmem+hugetlbfs branch). Even if I'm going to switch the uffd-wp
> > series to the pte marker format, this patch won't change.
> >
> > I also analyzed the three callers that may be affected by this change below,
> > explaining why each is okay. I hope that can also be counted as part of the
> > "some work" that Hugh asked for.
> >
> > Besides all these, I'm pretty happy too if anyone can help point out anything
> > still missing that means we can't do this. That's the "code review" part for
> > every single patch, including this one, isn't it? :)
>
> Makes sense. :) Allow me to think out loud for a moment about the three UFFD
> cases:
>
> I agree case (2) mcopy_atomic_pte has !page_in_cache (by definition,
> since COPY specifically means inserting a new page), and so is not
> changing functionally. I'll ignore this case.
>
>
>
> Case (1) shmem_mfill_atomic_pte is only called for shmem VMAs, with
> VM_SHARED, and with mode being one of copy (NORMAL) or ZEROPAGE. This
> case is interesting, because like case (2) we certainly have
> !page_in_cache, so we were always doing pte_mkdirty(). The difference
> here is we now no longer *also* SetPageDirty(). What does it mean for
> a PTE to be dirty, but not the page (or the other way around)? I
> confess I don't know.
The PTE being dirty should be fine to cover the PageDirty case.
When the PTE is zapped, we'll need to move that pte dirty bit over to
PageDirty, no matter when we zap it. E.g. see zap_pte_range():
if (!PageAnon(page)) {
if (pte_dirty(ptent)) {
force_flush = 1;
set_page_dirty(page);
}
...
}
So IMHO you can see it as a delayed (but unified) version of setting
PG_dirty on the page.
>
> But, at least as far as mprotect is concerned, it seems to not care
> about PageDirty really. It does check it in one place, "if
> (page_is_file_lru(page) && PageDirty(page))", but "page_is_file_lru"
> really means "is the page *not* swap backed", so I think shmem pages
> are never file lru? If true, then this check is never true for shmem
> anyway, regardless of whether we SetPageDirty or not, and can be
> ignored.
>
> So, for case (1) I think there's no behavior change, at least with
> regard to mprotect().
The behavior change is that before this patch the pte does not have the dirty
bit set for this case; now it will.
I think that's also why Hugh explained why it's okay, and it's the source of
the discussion where we used to think it might have an impact - that
originates from Andrea's patch introducing the first use of SetPageDirty()
within the UFFDIO_COPY shmem path.
>
>
>
> Case (3) mcontinue_atomic_pte handles CONTINUE for all shmem VMAs.
> Meaning, we by definition have page_in_cache. It also seems to me we
> could have any combination of VM_SHARED or not, VM_WRITE or not, PTE
> writeable or not.
>
> I have to admit, I am a little bit fuzzy on exactly what PageDirty
> means for shmem (since there is no "filesystem", it's all just RAM). I
> think being "dirty" in this case means, if we wanted to reclaim the
> page, we would need to *swap it out*.
Yes, at least that's how I understand it too.
If the page doesn't have the dirty bit set, it means it can be thrown away at
any time. See pageout() calling clear_page_dirty_for_io(), and its call stack
during page reclaim.
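Roughly, from mm/vmscan.c (quoting from memory and trimmed):

	/* In pageout(): only a dirty page gets written back (for shmem
	 * that means swapping it out); a clean page can simply be
	 * reclaimed without any writeback. */
	if (clear_page_dirty_for_io(page)) {
		struct writeback_control wbc = {
			.sync_mode = WB_SYNC_NONE,
			.nr_to_write = SWAP_CLUSTER_MAX,
			.range_start = 0,
			.range_end = LLONG_MAX,
			.for_reclaim = 1,
		};
		...
		res = mapping->a_ops->writepage(page, &wbc);
		...
	}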
> I.e., shmem pages are always
> "dirty", as long as they are in the page cache (and have been modified
> at least once since being allocated? or maybe we don't bother keeping
> track of zero pages).
>
> So, this patch changes the behavior for case (3), when we end up with
> !writable (i.e., !(VM_WRITE && VM_SHARED) -- note we check VM_SHARED
> further down). We used to *skip* pte_mkdirty() if !writable, but now
> we do it after all. But, for mprotect to do its thing, it requires
> vma_wants_writenotify() -- but if we have !VM_WRITE, or !VM_SHARED, it
> returns false. So in the only case where case (3) actually does
> something different, mprotect *doesn't* do anything.
Yes, I think that could be a slight difference for the !write case, but IMHO
it's fine too: it's the case where process A maps the shmem read-only while
process B maps it read-write. Process B should be the one that updates the
page and sends UFFDIO_CONTINUE, and it can't map the page read-only because
otherwise it couldn't update it at all! That means the page cache must be
dirty anyway, irrespective of the dirty bit in process A. So I don't think
that's a general use case for minor faults either, as I should have expressed
in the commit message.
Keeping that "conditionally set the dirty bit" optimization is okay, but it
just makes things complicated, and it'll grow more complex when uffd-wp shmem
comes. That is not worth it. I figured it's easier to always set dirty - no
reason to risk data loss again.
>
>
>
> Okay, I've convinced myself. :) Sorry for the long e-mail. Feel free
> to take (for v2 as well, this patch didn't change):
>
> Reviewed-by: Axel Rasmussen <[email protected]>
Thank you, Axel. I'll collect that in v3.
--
Peter Xu