This is a small series, picked up from Linus's suggestion [0], that simplifies
COW handling (and also makes it stricter) by checking against page refcounts
rather than mapcounts.
I'm CCing the author and reviewer of ksm commit 52d1e606ee73 ("mm: reuse
only-pte-mapped KSM page in do_wp_page()", 2019-03-05). Please shout if
there's any reason to keep that logic; otherwise it will be removed by this
series. For more information, please refer to [3,4].
The new mm counter in the last patch can be seen as RFC, depending on whether
anyone dislikes it... I used it mainly for observing the page reuses, so it is
somewhat optional.
Two tests I did:
- Ran a busy-loop dirtying program [1] that uses 6G of memory, restricted to 1G
  RAM + 5G swap (cgroup). A few hours later, everything still looked good.
  Made sure to observe (still massive numbers of) correct page reuses with the
  new counter from the last patch, probably when swapping in.
- Ran umapsort [2] to make sure uffd-wp works again after applying this series
  on top of master 5.9-rc1 (plain 5.9-rc1 is broken).
In any case, I must confess it's quite pleasing to post a series with a
diffstat like this... Hopefully it won't break anything, but only make
everything better.
Please review, thanks.
[0] https://lore.kernel.org/lkml/CAHk-=wjn90-=s6MBerxTuP=-FVEZtR-LpoH9eenEQ3A-QfKTZw@mail.gmail.com
[1] https://github.com/xzpeter/clibs/blob/master/bsd/mig_mon/mig_mon.c
[2] https://github.com/LLNL/umap-apps/blob/develop/src/umapsort/umapsort.cpp
[3] https://lore.kernel.org/lkml/CAHk-=wh0syDtNzt9jGyHRV0r1pVX5gkdJWdenwmvy=dq0AL5mA@mail.gmail.com
[4] https://lore.kernel.org/lkml/CAHk-=wj5Oyg0LeAxSw_vizerm=sLd=sHfcVecZMKPZn6kNbbXA@mail.gmail.com
Linus Torvalds (1):
mm: Trial do_wp_page() simplification
Peter Xu (3):
mm/ksm: Remove reuse_ksm_page()
mm/gup: Remove enforced COW mechanism
mm: Add PGREUSE counter
drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 8 ---
include/linux/ksm.h | 7 ---
include/linux/vm_event_item.h | 1 +
mm/gup.c | 40 ++------------
mm/huge_memory.c | 7 +--
mm/ksm.c | 25 ---------
mm/memory.c | 60 +++++++--------------
mm/vmstat.c | 1 +
8 files changed, 29 insertions(+), 120 deletions(-)
--
2.26.2
Remove the function as the last reference has gone away with the do_wp_page()
changes.
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/ksm.h | 7 -------
mm/ksm.c | 25 -------------------------
2 files changed, 32 deletions(-)
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index e48b1e453ff5..161e8164abcf 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -53,8 +53,6 @@ struct page *ksm_might_need_to_copy(struct page *page,
void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc);
void ksm_migrate_page(struct page *newpage, struct page *oldpage);
-bool reuse_ksm_page(struct page *page,
- struct vm_area_struct *vma, unsigned long address);
#else /* !CONFIG_KSM */
@@ -88,11 +86,6 @@ static inline void rmap_walk_ksm(struct page *page,
static inline void ksm_migrate_page(struct page *newpage, struct page *oldpage)
{
}
-static inline bool reuse_ksm_page(struct page *page,
- struct vm_area_struct *vma, unsigned long address)
-{
- return false;
-}
#endif /* CONFIG_MMU */
#endif /* !CONFIG_KSM */
diff --git a/mm/ksm.c b/mm/ksm.c
index 0aa2247bddd7..74d6bfff27c4 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2657,31 +2657,6 @@ void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc)
goto again;
}
-bool reuse_ksm_page(struct page *page,
- struct vm_area_struct *vma,
- unsigned long address)
-{
-#ifdef CONFIG_DEBUG_VM
- if (WARN_ON(is_zero_pfn(page_to_pfn(page))) ||
- WARN_ON(!page_mapped(page)) ||
- WARN_ON(!PageLocked(page))) {
- dump_page(page, "reuse_ksm_page");
- return false;
- }
-#endif
-
- if (PageSwapCache(page) || !page_stable_node(page))
- return false;
- /* Prohibit parallel get_ksm_page() */
- if (!page_ref_freeze(page, 1))
- return false;
-
- page_move_anon_rmap(page, vma);
- page->index = linear_page_index(vma, address);
- page_ref_unfreeze(page, 1);
-
- return true;
-}
#ifdef CONFIG_MIGRATION
void ksm_migrate_page(struct page *newpage, struct page *oldpage)
{
--
2.26.2
This accounts for the wp_page_reuse() case, where we reuse a page for COW.
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/vm_event_item.h | 1 +
mm/memory.c | 1 +
mm/vmstat.c | 1 +
3 files changed, 3 insertions(+)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 2e6ca53b9bbd..18e75974d4e3 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -30,6 +30,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGFAULT, PGMAJFAULT,
PGLAZYFREED,
PGREFILL,
+ PGREUSE,
PGSTEAL_KSWAPD,
PGSTEAL_DIRECT,
PGSCAN_KSWAPD,
diff --git a/mm/memory.c b/mm/memory.c
index cb9006189d22..148eafb8cbb1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2622,6 +2622,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
update_mmu_cache(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
+ count_vm_event(PGREUSE);
}
/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index e670f910cd2f..4f7b4ee6aa12 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1241,6 +1241,7 @@ const char * const vmstat_text[] = {
"pglazyfreed",
"pgrefill",
+ "pgreuse",
"pgsteal_kswapd",
"pgsteal_direct",
"pgscan_kswapd",
--
2.26.2
With the more strict (but greatly simplified) page reuse logic in do_wp_page(),
we can safely go back to the world where COW is not enforced with writes.
This (mostly) reverts commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f.
There are some context differences due to later changes around it:
2170ecfa7688 ("drm/i915: convert get_user_pages() --> pin_user_pages()", 2020-06-03)
376a34efa4ee ("mm/gup: refactor and de-duplicate gup_fast() code", 2020-06-03)
Some lines moved back and forth with those, but this revert patch should have
stripped out and covered all the enforced COW bits anyway.
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 8 -----
mm/gup.c | 40 +++------------------
mm/huge_memory.c | 7 ++--
3 files changed, 9 insertions(+), 46 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 2c2bf24140c9..12b30075134a 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -596,14 +596,6 @@ static int i915_gem_userptr_get_pages(struct drm_i915_gem_object *obj)
GFP_KERNEL |
__GFP_NORETRY |
__GFP_NOWARN);
- /*
- * Using __get_user_pages_fast() with a read-only
- * access is questionable. A read-only page may be
- * COW-broken, and then this might end up giving
- * the wrong side of the COW..
- *
- * We may or may not care.
- */
if (pvec) {
/* defer to worker if malloc fails */
if (!i915_gem_object_is_readonly(obj))
diff --git a/mm/gup.c b/mm/gup.c
index ae096ea7583f..bb93251194d8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -381,22 +381,13 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
}
/*
- * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
- * but only after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE can write to even unwritable pte's, but only
+ * after we've gone through a COW cycle and they are dirty.
*/
static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
{
- return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
-}
-
-/*
- * A (separate) COW fault might break the page the other way and
- * get_user_pages() would return the page from what is now the wrong
- * VM. So we need to force a COW break at GUP time even for reads.
- */
-static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
-{
- return is_cow_mapping(vma->vm_flags) && (flags & (FOLL_GET | FOLL_PIN));
+ return pte_write(pte) ||
+ ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
}
static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -1067,11 +1058,9 @@ static long __get_user_pages(struct mm_struct *mm,
goto out;
}
if (is_vm_hugetlb_page(vma)) {
- if (should_force_cow_break(vma, foll_flags))
- foll_flags |= FOLL_WRITE;
i = follow_hugetlb_page(mm, vma, pages, vmas,
&start, &nr_pages, i,
- foll_flags, locked);
+ gup_flags, locked);
if (locked && *locked == 0) {
/*
* We've got a VM_FAULT_RETRY
@@ -1085,10 +1074,6 @@ static long __get_user_pages(struct mm_struct *mm,
continue;
}
}
-
- if (should_force_cow_break(vma, foll_flags))
- foll_flags |= FOLL_WRITE;
-
retry:
/*
* If we have a pending SIGKILL, don't keep faulting pages and
@@ -2689,19 +2674,6 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
return -EFAULT;
/*
- * The FAST_GUP case requires FOLL_WRITE even for pure reads,
- * because get_user_pages() may need to cause an early COW in
- * order to avoid confusing the normal COW routines. So only
- * targets that are already writable are safe to do by just
- * looking at the page tables.
- *
- * NOTE! With FOLL_FAST_ONLY we allow read-only gup_fast() here,
- * because there is no slow path to fall back on. But you'd
- * better be careful about possible COW pages - you'll get _a_
- * COW page, but not necessarily the one you intended to get
- * depending on what COW event happens after this. COW may break
- * the page copy in a random direction.
- *
* Disable interrupts. The nested form is used, in order to allow
* full, general purpose use of this routine.
*
@@ -2714,8 +2686,6 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
*/
if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && gup_fast_permitted(start, end)) {
unsigned long fast_flags = gup_flags;
- if (!(gup_flags & FOLL_FAST_ONLY))
- fast_flags |= FOLL_WRITE;
local_irq_save(flags);
gup_pgd_range(addr, end, fast_flags, pages, &nr_pinned);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2ccff8472cd4..7ff29cc3d55c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1291,12 +1291,13 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
}
/*
- * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
- * but only after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE can write to even unwritable pmd's, but only
+ * after we've gone through a COW cycle and they are dirty.
*/
static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
{
- return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
+ return pmd_write(pmd) ||
+ ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
}
struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
--
2.26.2
From: Linus Torvalds <[email protected]>
How about we just make sure we're the only possible valid user of the
page before we bother to reuse it?
Simplify, simplify, simplify.
And get rid of the nasty serialization on the page lock at the same time.
Signed-off-by: Linus Torvalds <[email protected]>
[peterx: add subject prefix]
Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 59 +++++++++++++++--------------------------------------
1 file changed, 17 insertions(+), 42 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 602f4283122f..cb9006189d22 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2927,50 +2927,25 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
* not dirty accountable.
*/
if (PageAnon(vmf->page)) {
- int total_map_swapcount;
- if (PageKsm(vmf->page) && (PageSwapCache(vmf->page) ||
- page_count(vmf->page) != 1))
+ struct page *page = vmf->page;
+
+ /* PageKsm() doesn't necessarily raise the page refcount */
+ if (PageKsm(page) || page_count(page) != 1)
+ goto copy;
+ if (!trylock_page(page))
+ goto copy;
+ if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
+ unlock_page(page);
goto copy;
- if (!trylock_page(vmf->page)) {
- get_page(vmf->page);
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- lock_page(vmf->page);
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
- if (!pte_same(*vmf->pte, vmf->orig_pte)) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
- unlock_page(vmf->page);
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- put_page(vmf->page);
- return 0;
- }
- put_page(vmf->page);
- }
- if (PageKsm(vmf->page)) {
- bool reused = reuse_ksm_page(vmf->page, vmf->vma,
- vmf->address);
- unlock_page(vmf->page);
- if (!reused)
- goto copy;
- wp_page_reuse(vmf);
- return VM_FAULT_WRITE;
- }
- if (reuse_swap_page(vmf->page, &total_map_swapcount)) {
- if (total_map_swapcount == 1) {
- /*
- * The page is all ours. Move it to
- * our anon_vma so the rmap code will
- * not search our parent or siblings.
- * Protected against the rmap code by
- * the page lock.
- */
- page_move_anon_rmap(vmf->page, vma);
- }
- unlock_page(vmf->page);
- wp_page_reuse(vmf);
- return VM_FAULT_WRITE;
}
- unlock_page(vmf->page);
+ /*
+ * Ok, we've got the only map reference, and the only
+ * page count reference, and the page is locked,
+ * it's dark out, and we're wearing sunglasses. Hit it.
+ */
+ wp_page_reuse(vmf);
+ unlock_page(page);
+ return VM_FAULT_WRITE;
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
return wp_page_shared(vmf);
--
2.26.2
On Fri, Aug 21, 2020 at 4:50 PM Peter Xu <[email protected]> wrote:
>
> - Run a busy loop dirty program [1] that uses 6G of memory, restrict to 1G
> RAM + 5G swap (cgroup). A few hours later, all things still look good.
> Make sure to observe (still massive) correct page reuses using the new
> counter using the last patch, probably when swapping in.
>
> - Run umapsort [2] to make sure uffd-wp will work again after applying this
> series upon master 5.9-rc1 (5.9-rc1 is broken).
I obviously like the diffstat, am wondering if you saw any throughput
changes or similar for the busy-loop dirtying thing?
Linus
On Fri, Aug 21, 2020 at 4:50 PM Peter Xu <[email protected]> wrote:
>
> This accounts for wp_page_reuse() case, where we reused a page for COW.
If we do this, wouldn't it make more sense to also count the COW case
to see how they match up?
Right now we count faults and major faults. So as a result you can
calculate minor faults trivially.
But if you count page reuse, you can't calculate any stats on it,
because most of the minor faults will presumably be for new pages
(either zero or cached file mappings).
So the "pgreuse" seems to be a counter without any context to it.
IOW, I get the feeling that either we should do this properly (and
maybe count "dirty faults" and "access" faults, at which point the
reuse case becomes a subcase of the dirty ones) or we shouldn't do it
at all. Creating a counter without any way to sanely compare it to
anything seems a bit pointless.
Linus
On Sat, Aug 22, 2020 at 09:05:37AM -0700, Linus Torvalds wrote:
> On Fri, Aug 21, 2020 at 4:50 PM Peter Xu <[email protected]> wrote:
> >
> > - Run a busy loop dirty program [1] that uses 6G of memory, restrict to 1G
> > RAM + 5G swap (cgroup). A few hours later, all things still look good.
> > Make sure to observe (still massive) correct page reuses using the new
> > counter using the last patch, probably when swapping in.
> >
> > - Run umapsort [2] to make sure uffd-wp will work again after applying this
> > series upon master 5.9-rc1 (5.9-rc1 is broken).
>
> I obviously like the diffstat, am wondering if you saw any throughput
> changes or similar for the busy-loop dirtying thing?
I didn't compare the two in my previous testing.
Firstly, although my program does output some dirty rate information on a
per-second basis, the dirty rate fluctuates quite a bit during the runs due to
frequent page swap in/out, and I cannot observe a stable dirty rate, at least
with the patch applied. I'm afraid I'd see a similar thing even without the
patch. Maybe it would show some statistical pattern if I sampled every 10
seconds or longer, but I'm not sure.
More importantly, I'm not sure that's the right case to compare if we want a
rough understanding of patch 1's performance impact. The thing is, my test
program only dirties privately allocated pages, with itself as the only owner
of the pages. IIUC the page reuse logic will trigger on both the old and the
new code, because both the mapcount and the refcount will be one. If we really
want to compare the two, shouldn't we run tests that trigger COW differently
before/after the patch? E.g., when some pages are referenced by GUP users and
COW happens with only one pte mapping the page. I haven't thought about such a
test any more deeply than this.
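For reference, the core of the dirtying test is essentially a loop like the
one below (a minimal sketch only, not mig_mon [1] itself, which also reports
dirty rates and handles sizing differently):

#include <stdlib.h>

#define MEM_SIZE  (6UL << 30)	/* 6G of private anonymous memory */

int main(void)
{
	unsigned long i;
	unsigned char val = 0;
	char *buf = malloc(MEM_SIZE);

	if (!buf)
		return 1;
	/*
	 * Dirty one byte per page, forever.  Under a 1G memcg limit the
	 * working set never fits, so pages are continuously swapped out
	 * and written to again after being swapped back in.
	 */
	for (;;) {
		for (i = 0; i < MEM_SIZE; i += 4096)
			buf[i] = val;
		val++;
	}
	return 0;
}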
So my previous testing was mainly to make sure the general COW and swap paths
still behave as expected.
Thanks,
--
Peter Xu
On Sat, Aug 22, 2020 at 09:14:53AM -0700, Linus Torvalds wrote:
> On Fri, Aug 21, 2020 at 4:50 PM Peter Xu <[email protected]> wrote:
> >
> > This accounts for wp_page_reuse() case, where we reused a page for COW.
>
> If we do this, wouldn't it make more sense to also count the COW case
> to see how they match up?
Is wp_page_reuse() only used for COW?
I saw that the callers of wp_page_reuse() are:
(1) finish_mkwrite_fault
(2) wp_pfn_shared
(3) wp_page_shared
(4) do_wp_page
Since (1) is only called by either (2) or (3), while (2) and (3) apply only to
shared memory, I'm fairly sure the statistic is counted via the changed path
in do_wp_page() that is touched in patch 1 (my program was using private
anonymous pages). Maybe I missed something, though...
>
> Right now we count faults and major faults. So as a result you can can
> calculate minor faults trivially.
>
> But if you count page reuse, you can't calculate any stats on it,
> because most of the minor faults will presumably be for new pages
> (either zero or cached file mappings).
>
> So the "pgreuse" seems to be a counter without any context to it.
>
> IOW, I get the feeling that either we should do this properly (and
> maybe count "dirty faults" and "access" faults, at which point the
> reuse case becomes a subcase of the dirty ones) or we shouldn't do it
> at all. Creating a counter without any way to sanely compare it to
> anything seems a bit pointless.
Yeah, I haven't thought deeply about this statistic; IMHO it would be something
nice to have (besides helping me verify the tests), so I posted it instead of
keeping it in my local repo. If nobody likes this statistic, we can definitely
drop it.
Thanks,
--
Peter Xu
On 22.08.2020 02:49, Peter Xu wrote:
> This is a small series that I picked up from Linus's suggestion [0] to simplify
> cow handling (and also more strict) by checking against page refcounts rather
> than mapcounts.
>
> I'm CCing the author and reviewer of commit 52d1e606ee73 on ksm ("mm: reuse
> only-pte-mapped KSM page in do_wp_page()", 2019-03-05). Please shoot if
> there's any reason to keep the logic, or it'll be removed in this series. For
> more information, please refer to [3,4].
I'm not sure I understand the reasons to remove that. But the reason to add it
was to avoid excess page copying when it is not really needed.
> The new mm counter in the last patch can be seen as RFC, depending on whether
> anyone dislikes it... I used it majorly for observing the page reuses, so it is
> kind of optional.
>
> Two tests I did:
>
> - Run a busy loop dirty program [1] that uses 6G of memory, restrict to 1G
> RAM + 5G swap (cgroup). A few hours later, all things still look good.
> Make sure to observe (still massive) correct page reuses using the new
> counter using the last patch, probably when swapping in.
>
> - Run umapsort [2] to make sure uffd-wp will work again after applying this
> series upon master 5.9-rc1 (5.9-rc1 is broken).
>
> In all cases, I must confess it's quite pleased to post a series with diffstat
> like this... Hopefully this won't break anyone but only to make everything
> better.
>
> Please review, thanks.
>
> [0] https://lore.kernel.org/lkml/CAHk-=wjn90-=s6MBerxTuP=-FVEZtR-LpoH9eenEQ3A-QfKTZw@mail.gmail.com
> [1] https://github.com/xzpeter/clibs/blob/master/bsd/mig_mon/mig_mon.c
> [2] https://github.com/LLNL/umap-apps/blob/develop/src/umapsort/umapsort.cpp
> [3] https://lore.kernel.org/lkml/CAHk-=wh0syDtNzt9jGyHRV0r1pVX5gkdJWdenwmvy=dq0AL5mA@mail.gmail.com
> [4] https://lore.kernel.org/lkml/CAHk-=wj5Oyg0LeAxSw_vizerm=sLd=sHfcVecZMKPZn6kNbbXA@mail.gmail.com
>
> Linus Torvalds (1):
> mm: Trial do_wp_page() simplification
>
> Peter Xu (3):
> mm/ksm: Remove reuse_ksm_page()
> mm/gup: Remove enfornced COW mechanism
> mm: Add PGREUSE counter
>
> drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 8 ---
> include/linux/ksm.h | 7 ---
> include/linux/vm_event_item.h | 1 +
> mm/gup.c | 40 ++------------
> mm/huge_memory.c | 7 +--
> mm/ksm.c | 25 ---------
> mm/memory.c | 60 +++++++--------------
> mm/vmstat.c | 1 +
> 8 files changed, 29 insertions(+), 120 deletions(-)
>
On 22.08.2020 02:49, Peter Xu wrote:
> From: Linus Torvalds <[email protected]>
>
> How about we just make sure we're the only possible valid user fo the
> page before we bother to reuse it?
>
> Simplify, simplify, simplify.
>
> And get rid of the nasty serialization on the page lock at the same time.
>
> Signed-off-by: Linus Torvalds <[email protected]>
> [peterx: add subject prefix]
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/memory.c | 59 +++++++++++++++--------------------------------------
> 1 file changed, 17 insertions(+), 42 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 602f4283122f..cb9006189d22 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2927,50 +2927,25 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> * not dirty accountable.
> */
> if (PageAnon(vmf->page)) {
> - int total_map_swapcount;
> - if (PageKsm(vmf->page) && (PageSwapCache(vmf->page) ||
> - page_count(vmf->page) != 1))
> + struct page *page = vmf->page;
> +
> + /* PageKsm() doesn't necessarily raise the page refcount */
No, this is wrong. PageKsm() always raises the refcount. There was another
problem: KSM may raise the refcount without lock_page(), and only take the lock
afterwards. See get_ksm_page(GET_KSM_PAGE_NOLOCK) for the details.
So, reliable protection against parallel access requires freezing the page
counter, which is what reuse_ksm_page() does.
> + if (PageKsm(page) || page_count(page) != 1)
> + goto copy;
> + if (!trylock_page(page))
> + goto copy;
> + if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> + unlock_page(page);
> goto copy;
> - if (!trylock_page(vmf->page)) {
> - get_page(vmf->page);
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> - lock_page(vmf->page);
> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> - vmf->address, &vmf->ptl);
> - if (!pte_same(*vmf->pte, vmf->orig_pte)) {
> - update_mmu_tlb(vma, vmf->address, vmf->pte);
> - unlock_page(vmf->page);
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> - put_page(vmf->page);
> - return 0;
> - }
> - put_page(vmf->page);
> - }
> - if (PageKsm(vmf->page)) {
> - bool reused = reuse_ksm_page(vmf->page, vmf->vma,
> - vmf->address);
> - unlock_page(vmf->page);
> - if (!reused)
> - goto copy;
> - wp_page_reuse(vmf);
> - return VM_FAULT_WRITE;
> - }
> - if (reuse_swap_page(vmf->page, &total_map_swapcount)) {
> - if (total_map_swapcount == 1) {
> - /*
> - * The page is all ours. Move it to
> - * our anon_vma so the rmap code will
> - * not search our parent or siblings.
> - * Protected against the rmap code by
> - * the page lock.
> - */
> - page_move_anon_rmap(vmf->page, vma);
> - }
> - unlock_page(vmf->page);
> - wp_page_reuse(vmf);
> - return VM_FAULT_WRITE;
> }
> - unlock_page(vmf->page);
> + /*
> + * Ok, we've got the only map reference, and the only
> + * page count reference, and the page is locked,
> + * it's dark out, and we're wearing sunglasses. Hit it.
> + */
> + wp_page_reuse(vmf);
> + unlock_page(page);
> + return VM_FAULT_WRITE;
> } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> (VM_WRITE|VM_SHARED))) {
> return wp_page_shared(vmf);
>
On Mon 24-08-20 11:36:22, Kirill Tkhai wrote:
> On 22.08.2020 02:49, Peter Xu wrote:
> > From: Linus Torvalds <[email protected]>
> >
> > How about we just make sure we're the only possible valid user fo the
> > page before we bother to reuse it?
> >
> > Simplify, simplify, simplify.
> >
> > And get rid of the nasty serialization on the page lock at the same time.
> >
> > Signed-off-by: Linus Torvalds <[email protected]>
> > [peterx: add subject prefix]
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > mm/memory.c | 59 +++++++++++++++--------------------------------------
> > 1 file changed, 17 insertions(+), 42 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 602f4283122f..cb9006189d22 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2927,50 +2927,25 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> > * not dirty accountable.
> > */
> > if (PageAnon(vmf->page)) {
> > - int total_map_swapcount;
> > - if (PageKsm(vmf->page) && (PageSwapCache(vmf->page) ||
> > - page_count(vmf->page) != 1))
> > + struct page *page = vmf->page;
> > +
> > + /* PageKsm() doesn't necessarily raise the page refcount */
>
> No, this is wrong. PageKSM() always raises refcount.
OK, then I'm confused. The comment before get_ksm_page() states:
* get_ksm_page: checks if the page indicated by the stable node
* is still its ksm page, despite having held no reference to it.
* In which case we can trust the content of the page, and it
* returns the gotten page; but if the page has now been zapped,
* remove the stale node from the stable tree and return NULL.
...
* You would expect the stable_node to hold a reference to the ksm page.
* But if it increments the page's count, swapping out has to wait for
* ksmd to come around again before it can free the page, which may take
* seconds or even minutes: much too unresponsive. So instead we use a
* "keyhole reference": access to the ksm page from the stable node peeps
* out through its keyhole to see if that page still holds the right key,
* pointing back to this stable node.
So this all seems to indicate that KSM doesn't hold a proper page reference
and relies on anyone making page writeable to change page->mapping so that
KSM notices this and doesn't use the page anymore... Am I missing
something?
> There was another
> problem: KSM may raise refcount without lock_page(), and only then it
> takes the lock. See get_ksm_page(GET_KSM_PAGE_NOLOCK) for the details.
>
> So, reliable protection against parallel access requires to freeze page
> counter, which is made in reuse_ksm_page().
OK, this as well.
Honza
>
> > + if (PageKsm(page) || page_count(page) != 1)
> > + goto copy;
> > + if (!trylock_page(page))
> > + goto copy;
> > + if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> > + unlock_page(page);
> > goto copy;
> > - if (!trylock_page(vmf->page)) {
> > - get_page(vmf->page);
> > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > - lock_page(vmf->page);
> > - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > - vmf->address, &vmf->ptl);
> > - if (!pte_same(*vmf->pte, vmf->orig_pte)) {
> > - update_mmu_tlb(vma, vmf->address, vmf->pte);
> > - unlock_page(vmf->page);
> > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > - put_page(vmf->page);
> > - return 0;
> > - }
> > - put_page(vmf->page);
> > - }
> > - if (PageKsm(vmf->page)) {
> > - bool reused = reuse_ksm_page(vmf->page, vmf->vma,
> > - vmf->address);
> > - unlock_page(vmf->page);
> > - if (!reused)
> > - goto copy;
> > - wp_page_reuse(vmf);
> > - return VM_FAULT_WRITE;
> > - }
> > - if (reuse_swap_page(vmf->page, &total_map_swapcount)) {
> > - if (total_map_swapcount == 1) {
> > - /*
> > - * The page is all ours. Move it to
> > - * our anon_vma so the rmap code will
> > - * not search our parent or siblings.
> > - * Protected against the rmap code by
> > - * the page lock.
> > - */
> > - page_move_anon_rmap(vmf->page, vma);
> > - }
> > - unlock_page(vmf->page);
> > - wp_page_reuse(vmf);
> > - return VM_FAULT_WRITE;
> > }
> > - unlock_page(vmf->page);
> > + /*
> > + * Ok, we've got the only map reference, and the only
> > + * page count reference, and the page is locked,
> > + * it's dark out, and we're wearing sunglasses. Hit it.
> > + */
> > + wp_page_reuse(vmf);
> > + unlock_page(page);
> > + return VM_FAULT_WRITE;
> > } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> > (VM_WRITE|VM_SHARED))) {
> > return wp_page_shared(vmf);
> >
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 24.08.2020 17:30, Jan Kara wrote:
> On Mon 24-08-20 11:36:22, Kirill Tkhai wrote:
>> On 22.08.2020 02:49, Peter Xu wrote:
>>> From: Linus Torvalds <[email protected]>
>>>
>>> How about we just make sure we're the only possible valid user fo the
>>> page before we bother to reuse it?
>>>
>>> Simplify, simplify, simplify.
>>>
>>> And get rid of the nasty serialization on the page lock at the same time.
>>>
>>> Signed-off-by: Linus Torvalds <[email protected]>
>>> [peterx: add subject prefix]
>>> Signed-off-by: Peter Xu <[email protected]>
>>> ---
>>> mm/memory.c | 59 +++++++++++++++--------------------------------------
>>> 1 file changed, 17 insertions(+), 42 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 602f4283122f..cb9006189d22 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -2927,50 +2927,25 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>> * not dirty accountable.
>>> */
>>> if (PageAnon(vmf->page)) {
>>> - int total_map_swapcount;
>>> - if (PageKsm(vmf->page) && (PageSwapCache(vmf->page) ||
>>> - page_count(vmf->page) != 1))
>>> + struct page *page = vmf->page;
>>> +
>>> + /* PageKsm() doesn't necessarily raise the page refcount */
>>
>> No, this is wrong. PageKSM() always raises refcount.
>
> OK, then I'm confused. The comment before get_ksm_page() states:
>
> * get_ksm_page: checks if the page indicated by the stable node
> * is still its ksm page, despite having held no reference to it.
> * In which case we can trust the content of the page, and it
> * returns the gotten page; but if the page has now been zapped,
> * remove the stale node from the stable tree and return NULL.
> ...
> * You would expect the stable_node to hold a reference to the ksm page.
> * But if it increments the page's count, swapping out has to wait for
> * ksmd to come around again before it can free the page, which may take
> * seconds or even minutes: much too unresponsive. So instead we use a
> * "keyhole reference": access to the ksm page from the stable node peeps
> * out through its keyhole to see if that page still holds the right key,
> * pointing back to this stable node.
>
> So this all seems to indicate that KSM doesn't hold a proper page reference
> and relies on anyone making page writeable to change page->mapping so that
> KSM notices this and doesn't use the page anymore... Am I missing
> something?
Sure, KSM does not increment the page counter when a page becomes PageKsm().
Is the patch comment about that? Even if so, I don't understand what the
comment is about. "PageKsm() does not take an additional reference" is not
a reason the page can't be reused there. The reason is that readers
of this page may increase the counter without taking the lock, so
page_count() == 1 under the lock does not guarantee anything.
>> There was another
>> problem: KSM may raise refcount without lock_page(), and only then it
>> takes the lock. See get_ksm_page(GET_KSM_PAGE_NOLOCK) for the details.
>>
>> So, reliable protection against parallel access requires to freeze page
>> counter, which is made in reuse_ksm_page().
>
> OK, this as well.
>
> Honza
>
>>
>>> + if (PageKsm(page) || page_count(page) != 1)
>>> + goto copy;
>>> + if (!trylock_page(page))
>>> + goto copy;
>>> + if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
>>> + unlock_page(page);
>>> goto copy;
>>> - if (!trylock_page(vmf->page)) {
>>> - get_page(vmf->page);
>>> - pte_unmap_unlock(vmf->pte, vmf->ptl);
>>> - lock_page(vmf->page);
>>> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>>> - vmf->address, &vmf->ptl);
>>> - if (!pte_same(*vmf->pte, vmf->orig_pte)) {
>>> - update_mmu_tlb(vma, vmf->address, vmf->pte);
>>> - unlock_page(vmf->page);
>>> - pte_unmap_unlock(vmf->pte, vmf->ptl);
>>> - put_page(vmf->page);
>>> - return 0;
>>> - }
>>> - put_page(vmf->page);
>>> - }
>>> - if (PageKsm(vmf->page)) {
>>> - bool reused = reuse_ksm_page(vmf->page, vmf->vma,
>>> - vmf->address);
>>> - unlock_page(vmf->page);
>>> - if (!reused)
>>> - goto copy;
>>> - wp_page_reuse(vmf);
>>> - return VM_FAULT_WRITE;
>>> - }
>>> - if (reuse_swap_page(vmf->page, &total_map_swapcount)) {
>>> - if (total_map_swapcount == 1) {
>>> - /*
>>> - * The page is all ours. Move it to
>>> - * our anon_vma so the rmap code will
>>> - * not search our parent or siblings.
>>> - * Protected against the rmap code by
>>> - * the page lock.
>>> - */
>>> - page_move_anon_rmap(vmf->page, vma);
>>> - }
>>> - unlock_page(vmf->page);
>>> - wp_page_reuse(vmf);
>>> - return VM_FAULT_WRITE;
>>> }
>>> - unlock_page(vmf->page);
>>> + /*
>>> + * Ok, we've got the only map reference, and the only
>>> + * page count reference, and the page is locked,
>>> + * it's dark out, and we're wearing sunglasses. Hit it.
>>> + */
>>> + wp_page_reuse(vmf);
>>> + unlock_page(page);
>>> + return VM_FAULT_WRITE;
>>> } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
>>> (VM_WRITE|VM_SHARED))) {
>>> return wp_page_shared(vmf);
>>>
>>
On Mon, Aug 24, 2020 at 8:38 AM Kirill Tkhai <[email protected]> wrote:
>
> Sure, KSM does not increment page counter, when a page becomes PageKsm().
> Is patch comment about that? Even if so, I don't understand what this
> comment is about. "PageKsm() does not take additional counter" is not
> a reason the page can't be reused there.
No, the reason is that we don't want to reuse a KSM page, and the
page_count() check apparently isn't sufficient in all circumstances.
So the comment is there to explain why a plain "page_count()"
apparently isn't sufficient.
> The reason is that readers
> of this page may increase a counter without taking the lock, so
> this page_count() == 1 under the lock does not guarantee anything.
The intent is to get rid of
(a) all the locking costs. The "lock_page()" we had here used to be
very expensive.
It's shown up several times in the page lock problems, and the
reason seems to be simply that this is _the_ hottest non-IO path there
is, so it's somewhat easy to generate lots of contention on a shared
page.
(b) the problems with GUP - because GUP (and some other page sharing
cases) don't increase the page_mapping() count, GUP was "invisible" to
the re-use code, and as a result the reuse code was a buggy mess.
(c) the complete and pointless complexity of this path, that isn't
actually done anywhere else. The GUP issue was the immediate - and
currently existing - bug caused by this, but the locking costs are
another example.
So the page reuse is simply wrong. It's almost certainly also
pointless and entirely historical. The _reason_ for trying to reuse
the KSM pages was documented not as performance, but simply to match
the other (also pointless) complexity of the swap cache reuse.
So the intent is to do the "page_count()" test early, to get rid of
the locking issues with any shared pages.
So the logic is "if this page is marked PageKsm(), or if it has an
elevated page count, don't even try - just copy".
To make a very concrete example: it's not unusual at all to basically
have simultaneous page faults on a dirty page because it's COW-shared
in both parent and child. Trivial to trigger, with the child and
parent running on different CPU's and just writing to the same page
right after a fork. And there is absolutely _zero_ reason that should
be serialized by anything at all. The parent and child are complete
share-nothing things: taking the page lock was and is simply wrong.
Solution: don't do it. Just notice "Oh, this page has other users"
(and page_count() is the correct thing to do for that, not
page_mapcount(), since GUP is also another user), and actively *avoid*
any serialization. Just copy the damn thing.
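A minimal userspace sketch of that trigger, just for illustration:

#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	p[0] = 1;		/* populate and dirty the anonymous page */
	if (fork() == 0) {
		p[1] = 2;	/* child: write fault on the COW-shared page */
		_exit(0);
	}
	p[2] = 3;		/* parent: concurrent write fault on the same page */
	wait(NULL);
	return 0;
}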
I'll take full blame for the historical stupidity. This was a bigger
deal back in the days when 4MB of RAM was considered normal. Plus page
locking wasn't even an issue back then. In fact, no locking at all was
needed back when the "try to reuse" code was originally written.
Things were simpler back then.
It's just that I'm 100% convinced that that historical legacy is very
very wrong these days. That "serialize on page lock if COW fault in
parent and child" is just an example of where this is fundamentally
wrong. But the whole complexity in the map count logic is just wholly
and totally wrong too.
I dare anybody to read the swapfile code for "total_map_swapcount" and
tell me they understand it fully.
So my theory is that this code - that is meant to *improve*
performance by sharing pages aggressively after a fork(), because that
used to be a primary issue, is now in fact making performance *much
worse*, because it's trying to optimize for a case that doesn't even
matter any more (does anybody truly believe that swap cache and shared
COW pages are a major source of performance?) and it does so at a huge
complexity _and_ performance cost.
So ripping out the KSM reuse code is just another "this is pointless
and wrong" issue. If you seriously try to KSM share a page that now
only has _one_ single user left, and that one single user writes to it
and is modifying it, then the problem is absolutely *NOT* that we
should try to re-use the page. No, the problem is that the KSM code
picked a horribly bad page to try to share.
Will that happen _occasionally_? Sure. But if it happens once in a
blue moon, we really shouldn't have that code to deal with it.
It's really that simple. All that reuse code is pointless and wrong.
It has historical roots, and it made sense at the time, but in this
day and age I'm convinced it's completely wrong.
Now, I'm _also_ admittedly convinced that I am occasionally completely
wrong, and people do odd things, and maybe there are loads where it
really matters. I doubt it in this case, but I think what we should do
is rip out all the existing historical code, and _if_ somebody has a
case where it matters, we can look at THAT case, and people can show
(a) what the exact pattern is that we actually care about
(b) numbers
and then maybe we can re-introduce some sort of re-use code with -
hopefully - a much more targeted and documented "this is why this
matters" approach.
So the intent is to get rid of the page lock thing, but I also hope
that long-term, we can get rid of reuse_swap_page() and some of that
mapcount stuff entirely.
Linus
On Fri, Aug 21, 2020 at 07:49:54PM -0400, Peter Xu wrote:
> This is a small series that I picked up from Linus's suggestion [0] to simplify
> cow handling (and also more strict) by checking against page refcounts rather
> than mapcounts.
>
> I'm CCing the author and reviewer of commit 52d1e606ee73 on ksm ("mm: reuse
> only-pte-mapped KSM page in do_wp_page()", 2019-03-05). Please shoot if
> there's any reason to keep the logic, or it'll be removed in this series. For
> more information, please refer to [3,4].
>
> The new mm counter in the last patch can be seen as RFC, depending on whether
> anyone dislikes it... I used it majorly for observing the page reuses, so it is
> kind of optional.
>
> Two tests I did:
>
> - Run a busy loop dirty program [1] that uses 6G of memory, restrict to 1G
> RAM + 5G swap (cgroup). A few hours later, all things still look good.
> Make sure to observe (still massive) correct page reuses using the new
> counter using the last patch, probably when swapping in.
>
> - Run umapsort [2] to make sure uffd-wp will work again after applying this
> series upon master 5.9-rc1 (5.9-rc1 is broken).
>
> In all cases, I must confess it's quite pleased to post a series with diffstat
> like this... Hopefully this won't break anyone but only to make everything
> better.
>
> Please review, thanks.
>
> [0] https://lore.kernel.org/lkml/CAHk-=wjn90-=s6MBerxTuP=-FVEZtR-LpoH9eenEQ3A-QfKTZw@mail.gmail.com
> [1] https://github.com/xzpeter/clibs/blob/master/bsd/mig_mon/mig_mon.c
> [2] https://github.com/LLNL/umap-apps/blob/develop/src/umapsort/umapsort.cpp
> [3] https://lore.kernel.org/lkml/CAHk-=wh0syDtNzt9jGyHRV0r1pVX5gkdJWdenwmvy=dq0AL5mA@mail.gmail.com
> [4] https://lore.kernel.org/lkml/CAHk-=wj5Oyg0LeAxSw_vizerm=sLd=sHfcVecZMKPZn6kNbbXA@mail.gmail.com
Are there any more comments on this series, either reviews or suggested tests
to carry out (before a broader trial)?
Thanks,
--
Peter Xu
I am not happy with this patch. But every time I come back to it,
I realize that you've written more to justify it here or there, that
I haven't fully digested; yet if I wait until I've grasped it all,
I shall never arrive at responding at all, so let's wade in now.
(Sometimes I wonder why I say "I" more than other people.)
On Mon, 24 Aug 2020, Linus Torvalds wrote:
> On Mon, Aug 24, 2020 at 8:38 AM Kirill Tkhai <[email protected]> wrote:
> >
> > Sure, KSM does not increment page counter, when a page becomes PageKsm().
> > Is patch comment about that? Even if so, I don't understand what this
> > comment is about. "PageKsm() does not take additional counter" is not
> > a reason the page can't be reused there.
>
> No, the reason is that we don't want to reuse a KSM page, and the
> page_count() check apparently isn't sufficient in all circumstances.
>
> So the comment is there to explain why a plain "page_count()"
> apparently isn't sufficient.
>
> > The reason is that readers
> > of this page may increase a counter without taking the lock, so
> > this page_count() == 1 under the lock does not guarantee anything.
One thing I'm happy with, is the removal of reuse_ksm_page(). I did
not speak up at the time, but it always seemed to me a misdirected
optimization, if even an optimization at all; and it violates the
comment in mm/rmap.c "but PageKsm is never downgraded to PageAnon" -
though I'd have some trouble enumerating where that assumption matters.
>
> The intent is to get rid of
>
> (a) all the locking costs. The "lock_page()" we had here used to be
> very expensive.
>
> It's shown up several times in the page lock problems, and the
> reason seems to be simply that this is _the_ hottest non-IO path there
> is, so it's somewhat easy to generate lots of contention on a shared
> page.
And I'd be happy with adding a "page_mapcount(page) > 1" check, to
avoid trying for page lock in unambiguous cases of shared PageAnon.
But it's there that we part company: you believe in page_count()
and trylock_page(), which to me are buggily unreliable indicators.
2.6.13 21jun05 c475a8ab625d ("can_share_swap_page: use page_mapcount")
2.6.29 06jan09 ab967d86015a ("mm: wp lock page before deciding cow")
The latter was (two years belated) response to bugreport
https://lore.kernel.org/lkml/[email protected]/
which after some confusion I came to understand in
https://lore.kernel.org/lkml/[email protected]/
2006: some years before the net_dma adventures that you highlighted in
https://lore.kernel.org/lkml/CAHk-=whw3QcceKCdYS2ktCPQ96m8Ysyek+w4ny0ygvy7z-_2rw@mail.gmail.com/
>
> (b) the problems with GUP - because GUP (and some other page sharing
> cases) don't increase the page_mapping() count, GUP was "invisible" to
> the re-use code, and as a result the reuse code was a buggy mess.
We have certainly had lots of GUP/fork/COW bugs and difficulties down
the years, plenty of mail threads that I haven't attempted to collate,
2.6.16 14feb06 f822566165dd ("madvise MADV_DONTFORK/MADV_DOFORK").
But I thought that we had arrived at a reasonably understood compromise
eleven years ago. "the reuse code was a buggy mess" is news to me, and
I still haven't grasped why GUP needs to be "visible" to it.
Obviously, if someone is writing to private pages via GUP, without having
declared that intention with the appropriate flag to break COW in advance,
trouble may follow. But you saw a problem just with reading, that I have
failed to grasp.
>
> (c) the complete and pointless complexity of this path, that isn't
> actually done anywhere else. The GUP issue was the immediate - and
> currently existing - bug caused by this, but the locking costs are
> another example.
I probably need that particular GUP issue explained again. But no rush:
having sounded this alarm, I must turn attention to other things.
>
> So the page reuse is simply wrong. It's almost certainly also
> pointless and entirely historical. The _reason_ for trying to reuse
> the KSM pages was documented not as performance, but simple to match
> the other (also pointless) complexity of the swap cache reuse.
I concede on KSM, but disagree that the rest was pointless: Yingchao
Zhou in the lore links above had a reasonable expectation, that COW
would *not* be broken erratically and unnecessarily. You see it
differently, you think that relying on COW not being broken is wrong.
You may be right that it's all historical by now; but we do risk
breaking users by changing the guarantee of the last eleven years:
blame me for establishing that guarantee by fixing the page_count
and trylock_page unreliabilities.
Hugh
(Not snipping what you wrote below, so it's easy to come back to if
necessary: some I agree with, some I don't, but have said enough.)
>
> So the intent is to do the "page_count()" test early, to get rid of
> the locking issues with any shared pages.
>
> So the logic is "if this page is marked PageKsm(), or if it has an
> elevated page count, don't even try - just copy".
>
> To make a very concrete example: it's not unusual at all to basically
> have simultaneous page faults on a dirty page because it's COW-shared
> in both parent and child. Trivial to trigger, with the child and
> parent running on different CPU's and just writing to the same page
> right after a fork. And there is absolutely _zero_ reason that should
> be serialized by anything at all. The parent and child are complete
> share-nothing things: taking the page lock was and is simply wrong.
>
> Solution: don't do it. Just notice "Oh, this page has other users"
> (and page_count() is the correct thing to do for that, not
> page_mappings(), since GUP is also another user), and actively *avoid*
> any serialization. Just copy the damn thing.
>
> I'll take full blame for the historical stupidity. This was a bigger
> deal back in the days when 4MB of RAM was considered normal. Plus page
> locking wasn't even an issue back then. In fact, no locking at all was
> needed back when the "try to reuse" code was originally written.
> Things were simpler back then.
>
> It's just that I'm 100% convinced that that historical legacy is very
> very wrong these days. That "serialize on page lock if COW fault in
> parent and child" is just an example of where this is fundamentally
> wrong. But the whole complexity in the map count logic is just wholly
> and totally wrong too.
>
> I dare anybody to read the swapfile code for "total_map_swapcount" and
> tell me they understand it fully.
>
> So my theory is that this code - that is meant to *improve*
> performance by sharing pages aggressively after a fork(), because that
> used to be a primary issue, is now in fact making performance *much
> worse*, because it's trying to optimize for a case that doesn't even
> matter any more (does anybody truly believe that swap cache and shared
> COW pages are a major source of performance?) and it does so at a huge
> complexity _and_ performance cost.
>
> So ripping out the KSM reuse code is just another "this is pointless
> and wrong" issue. If you seriously try to KSM share a page that now
> only has _one_ single user left, and that one single user writes to it
> and is modifying it, then the problem is absolutely *NOT* that we
> should try to re-use the page. No, the problem is that the KSM code
> picked a horribly bad page to try to share.
>
> Will that happen _occasionally_? Sure. But if it happens once in a
> blue moon, we really shouldn't have that code to deal with it.
>
> It's really that simple. All that reuse code is pointless and wrong.
> It has historical roots, and it made sense at the time, but in this
> day and age I'm convinced it's completely wrong.
>
> Now, I'm _also_ admittedly convinced that I am occasionally completely
> wrong, and people do odd things, and maybe there are loads where it
> really matters. I doubt it in this case, but I think what we should do
> is rip out all the existing historical code, and _if_ somebody has a
> case where it matters, we can look at THAT case, and people can show
>
> (a) what the exact pattern is that we actually care about
>
> (b) numbers
>
> and then maybe we can re-introduce some sort of re-use code with -
> hopefully - a much more targeted and documented "this is why this
> matters" approach.
>
> So the intent is to get rid of the page lock thing, but I also hope
> that long-term, we can get rid of reuse_swap_page() and some of that
> mapcount stuff entirely.
>
> Linus
On 08/21, Peter Xu wrote:
>
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -381,22 +381,13 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> }
>
> /*
> - * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
> - * but only after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE can write to even unwritable pte's, but only
> + * after we've gone through a COW cycle and they are dirty.
> */
> static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> {
> - return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
> -}
> -
> -/*
> - * A (separate) COW fault might break the page the other way and
> - * get_user_pages() would return the page from what is now the wrong
> - * VM. So we need to force a COW break at GUP time even for reads.
> - */
> -static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
> -{
> - return is_cow_mapping(vma->vm_flags) && (flags & (FOLL_GET | FOLL_PIN));
> + return pte_write(pte) ||
> + ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
Do we really need to add the FOLL_FORCE check back?
Afaics, FOLL_COW is only possible if FOLL_FORCE was set.
Oleg.
On Fri, Aug 21, 2020 at 07:49:55PM -0400, Peter Xu wrote:
> From: Linus Torvalds <[email protected]>
>
> How about we just make sure we're the only possible valid user fo the
> page before we bother to reuse it?
>
> Simplify, simplify, simplify.
>
> And get rid of the nasty serialization on the page lock at the same time.
>
> Signed-off-by: Linus Torvalds <[email protected]>
> [peterx: add subject prefix]
> Signed-off-by: Peter Xu <[email protected]>
> mm/memory.c | 59 +++++++++++++++--------------------------------------
> 1 file changed, 17 insertions(+), 42 deletions(-)
I don't have a detailed explanation right now, but this patch appears
to be causing a regression where RDMA subsystem tests fail. Tests
return to normal when this patch is reverted.
It kind of looks like the process is not seeing DMA'd data to a
pin_user_pages()?
Thanks,
Jason
On Mon, Sep 14, 2020 at 7:38 AM Jason Gunthorpe <[email protected]> wrote:
>
> I don't have a detailed explanation right now, but this patch appears
> to be causing a regression where RDMA subsystem tests fail. Tests
> return to normal when this patch is reverted.
>
> It kind of looks like the process is not seeing DMA'd data to a
> pin_user_pages()?
I'm a nincompoop. I actually _talked_ to Hugh Dickins about this when
he raised concerns, and I dismissed his concerns with "but PAGE_PIN is
special".
As usual, Hugh was right. Page pinning certainly _is_ special, but
it's not that different from the regular GUP code.
But in the meantime, I have a lovely confirmation from the kernel test
robot, saying that commit 09854ba94c results in a
"vm-scalability.throughput 31.4% improvement", which was what I was
hoping for - the complexity wasn't just complexity, it was active
badness due to the page locking horrors.
I think what we want to do is basically do the "early COW", but only
do it for FOLL_PIN (and not turn them into writes for anything but the
COW code). So basically redo the "enforced COW mechanism", but rather
than do it for everything, now do it only for FOLL_PIN, and only in
that COW path.
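Roughly something like this - a sketch only of the direction described above,
not a tested patch (the call sites that used the old helper would then do the
write-fault only for that COW break):

/*
 * Sketch: force the early COW break only for FOLL_PIN (long-term pins
 * that may see DMA), rather than for all FOLL_GET/FOLL_PIN users as
 * the reverted code did.
 */
static inline bool should_force_cow_break(struct vm_area_struct *vma,
					  unsigned int flags)
{
	return is_cow_mapping(vma->vm_flags) && (flags & FOLL_PIN);
}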
Peter - any chance you can look at this? I'm still looking at the page
lock fairness performance regression, although I now think I have a
test patch for Phoronix to test out.
Linus
On Mon, Sep 14, 2020 at 04:27:24PM +0200, Oleg Nesterov wrote:
> On 08/21, Peter Xu wrote:
> >
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -381,22 +381,13 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> > }
> >
> > /*
> > - * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
> > - * but only after we've gone through a COW cycle and they are dirty.
> > + * FOLL_FORCE can write to even unwritable pte's, but only
> > + * after we've gone through a COW cycle and they are dirty.
> > */
> > static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> > {
> > - return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
> > -}
> > -
> > -/*
> > - * A (separate) COW fault might break the page the other way and
> > - * get_user_pages() would return the page from what is now the wrong
> > - * VM. So we need to force a COW break at GUP time even for reads.
> > - */
> > -static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
> > -{
> > - return is_cow_mapping(vma->vm_flags) && (flags & (FOLL_GET | FOLL_PIN));
> > + return pte_write(pte) ||
> > + ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
>
> Do we really need to add the FOLL_FORCE check back?
>
> Afaics, FOLL_COW is only possible if FOLL_FORCE was set.
When I proposed the patch I wanted to add back FOLL_FORCE because the previous
removal of FOLL_FORCE was related to the enforced COW mechanism, where FOLL_COW
can definitely happen without FOLL_FORCE. So when we revert the enforced COW we
also need to restore this check as it was. I didn't think any deeper than that.
However, now I'm a bit confused about why FOLL_COW must come with FOLL_FORCE
even without the enforced COW... Shouldn't FOLL_COW be able to happen even
without FOLL_FORCE (as long as a page is shared and the GUP is with WRITE
permission)? Not sure what I've missed, though.
--
Peter Xu
On Mon, Sep 14, 2020 at 10:32:11AM -0700, Linus Torvalds wrote:
> On Mon, Sep 14, 2020 at 7:38 AM Jason Gunthorpe <[email protected]> wrote:
> >
> > I don't have a detailed explanation right now, but this patch appears
> > to be causing a regression where RDMA subsystem tests fail. Tests
> > return to normal when this patch is reverted.
> >
> > It kind of looks like the process is not seeing DMA'd data to a
> > pin_user_pages()?
>
> I'm a nincompoop. I actually _talked_ to Hugh Dickins about this when
> he raised concerns, and I dismissed his concerns with "but PAGE_PIN is
> special".
>
> As usual, Hugh was right. Page pinning certainly _is_ special, but
> it's not that different from the regular GUP code.
>
> But in the meantime, I have a lovely confirmation from the kernel test
> robot, saying that commit 09854ba94c results in a
> "vm-scalability.throughput 31.4% improvement", which was what I was
> hoping for - the complexity wasn't just complexity, it was active
> badness due to the page locking horrors.
>
> I think what we want to do is basically do the "early COW", but only
> do it for FOLL_PIN (and not turn them into writes for anything but the
> COW code). So basically redo the "enforced COW mechanism", but rather
> than do it for everything, now do it only for FOLL_PIN, and only in
> that COW path.
>
> Peter - any chance you can look at this? I'm still looking at the page
> lock fairness performance regression, although I now think I have a
> test patch for Phoronix to test out.
Sure, I'll try to prepare something like that and share it shortly.
Thanks,
--
Peter Xu
On Mon, Sep 14, 2020 at 10:59 AM Peter Xu <[email protected]> wrote:
>
> However now I'm a bit confused on why FOLL_COW must be with FOLL_FORCE even
> without the enforced COW... Shouldn't FOLL_COW be able to happen even without
> FOLL_FORCE (as long as when a page is shared, and the gup is with WRITE
> permission)? Not sure what I've missed, though.
Afaik, the FOLL_FORCE test was (and is) unnecessary.
If FOLL_COW is set, we're going through this the second time, and we
either had that pte_write() or we had FOLL_FORCE originally.
So can_follow_write_pte() doesn't need the FOLL_FORCE test - it's
redundant - but it isn't technically wrong either.
Linus
On Mon, Sep 14, 2020 at 02:34:36PM -0400, Peter Xu wrote:
> On Mon, Sep 14, 2020 at 10:32:11AM -0700, Linus Torvalds wrote:
> > On Mon, Sep 14, 2020 at 7:38 AM Jason Gunthorpe <[email protected]> wrote:
> > >
> > > I don't have a detailed explanation right now, but this patch appears
> > > to be causing a regression where RDMA subsystem tests fail. Tests
> > > return to normal when this patch is reverted.
> > >
> > > It kind of looks like the process is not seeing DMA'd data to a
> > > pin_user_pages()?
> >
> > I'm a nincompoop. I actually _talked_ to Hugh Dickins about this when
> > he raised concerns, and I dismissed his concerns with "but PAGE_PIN is
> > special".
> >
> > As usual, Hugh was right. Page pinning certainly _is_ special, but
> > it's not that different from the regular GUP code.
> >
> > But in the meantime, I have a lovely confirmation from the kernel test
> > robot, saying that commit 09854ba94c results in a
> > "vm-scalability.throughput 31.4% improvement", which was what I was
> > hoping for - the complexity wasn't just complexity, it was active
> > badness due to the page locking horrors.
> >
> > I think what we want to do is basically do the "early COW", but only
> > do it for FOLL_PIN (and not turn them into writes for anything but the
> > COW code). So basically redo the "enforced COW mechanism", but rather
> > than do it for everything, now do it only for FOLL_PIN, and only in
> > that COW path.
> >
> > Peter - any chance you can look at this? I'm still looking at the page
> > lock fairness performance regression, although I now think I have a
> > test patch for Phoronix to test out.
>
> Sure, I'll try to prepare something like that and share it shortly.
Jason, would you please try the attached patch to see whether it unbreaks the
rdma test? Thanks!
--
Peter Xu
On Mon, Sep 14, 2020 at 05:15:15PM -0400, Peter Xu wrote:
> On Mon, Sep 14, 2020 at 02:34:36PM -0400, Peter Xu wrote:
> > On Mon, Sep 14, 2020 at 10:32:11AM -0700, Linus Torvalds wrote:
> > > On Mon, Sep 14, 2020 at 7:38 AM Jason Gunthorpe <[email protected]> wrote:
> > > >
> > > > I don't have a detailed explanation right now, but this patch appears
> > > > to be causing a regression where RDMA subsystem tests fail. Tests
> > > > return to normal when this patch is reverted.
> > > >
> > > > It kind of looks like the process is not seeing DMA'd data to a
> > > > pin_user_pages()?
> > >
> > > I'm a nincompoop. I actually _talked_ to Hugh Dickins about this when
> > > he raised concerns, and I dismissed his concerns with "but PAGE_PIN is
> > > special".
> > >
> > > As usual, Hugh was right. Page pinning certainly _is_ special, but
> > > it's not that different from the regular GUP code.
> > >
> > > But in the meantime, I have a lovely confirmation from the kernel test
> > > robot, saying that commit 09854ba94c results in a
> > > "vm-scalability.throughput 31.4% improvement", which was what I was
> > > hoping for - the complexity wasn't just complexity, it was active
> > > badness due to the page locking horrors.
> > >
> > > I think what we want to do is basically do the "early COW", but only
> > > do it for FOLL_PIN (and not turn them into writes for anything but the
> > > COW code). So basically redo the "enforced COW mechanism", but rather
> > > than do it for everything, now do it only for FOLL_PIN, and only in
> > > that COW path.
> > >
> > > Peter - any chance you can look at this? I'm still looking at the page
> > > lock fairness performance regression, although I now think I have a
> > > test patch for Phoronix to test out.
> >
> > Sure, I'll try to prepare something like that and share it shortly.
>
> Jason, would you please try the attached patch to see whether it unbreaks the
> rdma test? Thanks!
Sure, I'll get back to you
> Fast gup is not affected by this because it is never used with FOLL_PIN.
? What is pin_user_pages_fast() then? That is the API the failing test
is using.
> Note: hugetlbfs is not considered throughout this patch, because it's missing
> some required bits after all (like proper setting of FOLL_COW when page fault
> retries). Considering we may want to unbreak RDMA tests even during the rcs,
> this patch only fixes the non-hugetlbfs cases. THPs should still be in count.
People do RDMA with hugetlbfs too.
Just as an aside, the RDMA stuff is also supposed to set MADV_DONTFORK
on these regions, so I'm a bit puzzled what is happening here.
Jason
On Mon, Sep 14, 2020 at 3:55 PM Jason Gunthorpe <[email protected]> wrote:
>
> Just as an aside, the RDMA stuff is also supposed to set MADV_DONTFORK
> on these regions, so I'm a bit puzzled what is happening here.
Did the fork perhaps happen _before_ , so the pages are shared when
you do the pin?
MADV_DONTFORK doesn't mean COW doesn't happen. It just means that the
next fork() won't be copying that memory area.
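(For readers following along, a minimal sketch of the expected userspace
pattern is below; the helper name and error handling are illustrative only,
not something from this thread:)

  #include <sys/mman.h>
  #include <err.h>

  /* Allocate a DMA buffer and opt it out of fork() before pinning. */
  static void *alloc_dma_buf(size_t len)
  {
          void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (buf == MAP_FAILED)
                  err(1, "mmap");
          if (madvise(buf, len, MADV_DONTFORK))
                  err(1, "madvise(MADV_DONTFORK)");
          /* ...hand buf to the driver, which pins it with pin_user_pages()... */
          return buf;
  }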
That said, it's possible that the test cases do something invalid - or
maybe we've broken MADV_DONTFORK - and it all just happened to work
before.
So it's possible the breakage is exposing some other bug..
Linus
On Mon, Sep 14, 2020 at 03:59:31PM -0700, Linus Torvalds wrote:
> On Mon, Sep 14, 2020 at 3:55 PM Jason Gunthorpe <[email protected]> wrote:
> >
> > Just as an aside, the RDMA stuff is also supposed to set MADV_DONTFORK
> > on these regions, so I'm a bit puzzled what is happening here.
>
> Did the fork perhaps happen _before_ , so the pages are shared when
> you do the pin?
Looking at the program, it seems there are a number of forks for exec
before and after pin_user_pages_fast(), but the parent process always
does waitpid() after the fork.
> MADV_DONTFORK doesn't mean COW doesn't happen. It just means that the
> next fork() won't be copying that memory area.
Yes, this stuff does pin_user_pages_fast() and MADV_DONTFORK
together. It sets FOLL_FORCE and FOLL_WRITE to get an exclusive copy
of the page and MADV_DONTFORK was needed to ensure that a future fork
doesn't establish a COW that would break the DMA by moving the
physical page over to the fork. DMA should stay with the process that
called pin_user_pages_fast() (Is MADV_DONTFORK still needed with
recent years work to GUP/etc? It is a pretty terrible ancient thing)
> That said, it's possible that the test cases do something invalid - or
> maybe we've broken MADV_DONTFORK - and it all just happened to work
> before.
Hmm. If symptoms stop with this patch should we investigate
MADV_DONTFORK?
Thanks,
Jason
On Mon, Sep 14, 2020 at 4:28 PM Jason Gunthorpe <[email protected]> wrote:
>
> Hmm. If symptoms stop with this patch should we investigate
> MADV_DONTFORK?
I took a quick look at it, and it all looks trivially correct.
All MADV_DONTFORK does is to set the VM_DONTCOPY flag in the vma.
And dup_mmap() in kernel/fork.c is very trivial, and does
if (mpnt->vm_flags & VM_DONTCOPY) {
vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
continue;
}
for a vma that has that VM_DONTCOPY flag.
So I don't think it's MADV_DONTFORK, and in fact if that _had_ been
broken, then the old "look at page_mapcount()" would have shown the
problem too, since by definition a fork() would have increased that
count.
That said, the thing Hugh worried about was random other VM-internal
reasons why the page counts end up being elevated, that aren't due to
these things. And he's right. The new aggressive COW by that
do_wp_page() simplification will basically COW for any random thing.
My argument to Hugh was that if the page has become private to a
single mapping, even if it has its count elevated it should all simply
be writable, ie it shouldn't have gotten the page fault that causes
the COW in the first place. IOW, my thinking was that any proper page
pinning will also have to make sure that the page is already writable
and dirty, so no amount of other page counts should even matter.
But that argument may be entirely bogus, because I didn't think of
case "insert random case here".
My original force-COW patch for GUP avoided this issue, exactly
because it basically said that a GUP is a write - so it didn't care
about whatever state the page had, it *forced* the page to be mapped
dirty and writable in the target.
But part of the argument for the do_wp_page() simplification thing was
that it allowed us to remove my force-COW thing.
Is there perhaps some easy test-case that shows this that doesn't
require any actual rdma hardware?
Linus
On Mon, Sep 14, 2020 at 05:15:15PM -0400, Peter Xu wrote:
> On Mon, Sep 14, 2020 at 02:34:36PM -0400, Peter Xu wrote:
> > On Mon, Sep 14, 2020 at 10:32:11AM -0700, Linus Torvalds wrote:
> > > On Mon, Sep 14, 2020 at 7:38 AM Jason Gunthorpe <[email protected]> wrote:
> > > >
> > > > I don't have a detailed explanation right now, but this patch appears
> > > > to be causing a regression where RDMA subsystem tests fail. Tests
> > > > return to normal when this patch is reverted.
> > > >
> > > > It kind of looks like the process is not seeing DMA'd data to a
> > > > pin_user_pages()?
> > >
> > > I'm a nincompoop. I actually _talked_ to Hugh Dickins about this when
> > > he raised concerns, and I dismissed his concerns with "but PAGE_PIN is
> > > special".
> > >
> > > As usual, Hugh was right. Page pinning certainly _is_ special, but
> > > it's not that different from the regular GUP code.
> > >
> > > But in the meantime, I have a lovely confirmation from the kernel test
> > > robot, saying that commit 09854ba94c results in a
> > > "vm-scalability.throughput 31.4% improvement", which was what I was
> > > hoping for - the complexity wasn't just complexity, it was active
> > > badness due to the page locking horrors.
> > >
> > > I think what we want to do is basically do the "early COW", but only
> > > do it for FOLL_PIN (and not turn them into writes for anything but the
> > > COW code). So basically redo the "enforced COW mechanism", but rather
> > > than do it for everything, now do it only for FOLL_PIN, and only in
> > > that COW path.
> > >
> > > Peter - any chance you can look at this? I'm still looking at the page
> > > lock fairness performance regression, although I now think I have a
> > > test patch for Phoronix to test out.
> >
> > Sure, I'll try to prepare something like that and share it shortly.
>
> Jason, would you please try the attached patch to see whether it unbreaks the
> rdma test? Thanks!
>
> --
> Peter Xu
> From 93c534866d2c548cf193a5c17f7058a1f770df5a Mon Sep 17 00:00:00 2001
> From: Peter Xu <[email protected]>
> Date: Mon, 14 Sep 2020 15:34:41 -0400
> Subject: [PATCH] mm/gup: Allow enfornced COW for FOLL_PIN
>
> FOLL_PIN may need the enforced COW mechanism as reported by Jason and analyzed
> by Linus [1]. This is a continued work based on previous patch [2], however
> there's some trivial differences.
>
> Instead of applying enforced COW everywhere, we only apply it for FOLL_PIN to
> make sure the pages that were pinned will not be COWed again later on. In
> other words, we'll do early phase COW for pinned page along with the gup
> procedure. And since only FOLL_PIN is affected, we don't need to introduce a
> new flag as FOLL_BREAK_COW. However we'll still need a new fault flag as
> FAULT_FLAG_BREAK_COW inside the page fault handler.
>
> Fast gup is not affected by this because it is never used with FOLL_PIN.
>
> Now userfaultfd-wp needs to be ready with COW happening since read gup could
> trigger COW now with FOLL_PIN (which will never happen previously). So when
> COW happens we'll need to carry over the uffd-wp bits too if it's there.
>
> Meanwhile, both userfaultfd_pte_wp() and userfaultfd_huge_pmd_wp() need to be
> smarter than before on that it needs to return true only if this is a "real"
> write fault. With that extra check, we can identify a real write against an
> enforced COW procedure from a FOLL_PIN gup.
>
> Note: hugetlbfs is not considered throughout this patch, because it's missing
> some required bits after all (like proper setting of FOLL_COW when page fault
> retries). Considering we may want to unbreak RDMA tests even during the rcs,
> this patch only fixes the non-hugetlbfs cases. THPs should still be in count.
>
> [1] https://lore.kernel.org/lkml/[email protected]
> [2] https://lore.kernel.org/lkml/[email protected]
>
> Reported-by: Jason Gunthorpe <[email protected]>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/mm.h | 2 ++
> include/linux/userfaultfd_k.h | 12 ++++++------
> mm/gup.c | 17 ++++++++++++-----
> mm/huge_memory.c | 17 ++++++++++++-----
> mm/memory.c | 16 +++++++++-------
> 5 files changed, 41 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ca6e6a81576b..741574bfd343 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -416,6 +416,7 @@ extern pgprot_t protection_map[16];
> * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
> * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
> * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
> + * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read).
> *
> * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
> * whether we would allow page faults to retry by specifying these two
> @@ -446,6 +447,7 @@ extern pgprot_t protection_map[16];
> #define FAULT_FLAG_REMOTE 0x80
> #define FAULT_FLAG_INSTRUCTION 0x100
> #define FAULT_FLAG_INTERRUPTIBLE 0x200
> +#define FAULT_FLAG_BREAK_COW 0x400
>
> /*
> * The default fault flags that should be used by most of the
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index a8e5f3ea9bb2..fbcb75daf870 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -62,16 +62,16 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
> return vma->vm_flags & VM_UFFD_WP;
> }
>
> -static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
> - pte_t pte)
> +static inline bool userfaultfd_pte_wp(struct vm_fault *vmf, pte_t pte)
> {
> - return userfaultfd_wp(vma) && pte_uffd_wp(pte);
> + return (vmf->flags & FAULT_FLAG_WRITE) &&
> + userfaultfd_wp(vmf->vma) && pte_uffd_wp(pte);
> }
>
> -static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
> - pmd_t pmd)
> +static inline bool userfaultfd_huge_pmd_wp(struct vm_fault *vmf, pmd_t pmd)
> {
> - return userfaultfd_wp(vma) && pmd_uffd_wp(pmd);
> + return (vmf->flags & FAULT_FLAG_WRITE) &&
> + userfaultfd_wp(vmf->vma) && pmd_uffd_wp(pmd);
> }
Don't forget to change !CONFIG_USERFAULTFD declarations too.
Thanks
On Tue, Sep 15, 2020 at 10:50:40AM -0400, Peter Xu wrote:
> Hi, all,
>
> I prepared another version of the FOLL_PIN enforced cow patch attached, just in
> case it would still be anything close to useful (though now I highly doubt it
> considering below...). I took care of !USERFAULTFD as suggested by Leon, and
> also the fast gup path.
Now with the patch attached (for real..).
>
> However...
>
> On Mon, Sep 14, 2020 at 08:28:51PM -0300, Jason Gunthorpe wrote:
> > Yes, this stuff does pin_user_pages_fast() and MADV_DONTFORK
> > together. It sets FOLL_FORCE and FOLL_WRITE to get an exclusive copy
> > of the page and MADV_DONTFORK was needed to ensure that a future fork
> > doesn't establish a COW that would break the DMA by moving the
> > physical page over to the fork. DMA should stay with the process that
> > called pin_user_pages_fast() (Is MADV_DONTFORK still needed with
> > recent years work to GUP/etc? It is a pretty terrible ancient thing)
>
> ... Now I'm more confused on what has happened.
>
> If we're with FORCE|WRITE, iiuc it should guarantee that the page will trigger
> COW during gup even if it is shared, so no problem on the gup side. Then I'm
> quite confused on why the write bit is not set when cow triggered.
>
> E.g., in wp_page_copy(), if I'm not wrong, the write bit is only controlled by
> (besides the fix patch, though I believe the rdma test should have nothing to
> do with uffd-wp after all so it should be the same anyways):
>
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>
> It means, as long as the rdma region has VM_WRITE set (which I think of no
> reason on why it shouldn't...), then it should have the write bit in the COWed
> page entry. If so, the page should be stable and I don't undersdand why
> another COW could even trigger and how the code path in the "trial cow" patch
> is triggered.
>
> Or, the VMA is without VM_WRITE due to some reason? Sorry I probably know
> nothing about RDMA, more information on that side might help too. E.g., is the
> hardware going to walk the software process page table too when doing RDMA (or
> is IOMMU page table used, or none)?
>
> Thanks,
>
> --
> Peter Xu
--
Peter Xu
On Mon, Sep 14, 2020 at 05:15:15PM -0400, Peter Xu wrote:
> On Mon, Sep 14, 2020 at 02:34:36PM -0400, Peter Xu wrote:
> > On Mon, Sep 14, 2020 at 10:32:11AM -0700, Linus Torvalds wrote:
> > > On Mon, Sep 14, 2020 at 7:38 AM Jason Gunthorpe <[email protected]> wrote:
> > > >
> > > > I don't have a detailed explanation right now, but this patch appears
> > > > to be causing a regression where RDMA subsystem tests fail. Tests
> > > > return to normal when this patch is reverted.
> > > >
> > > > It kind of looks like the process is not seeing DMA'd data to a
> > > > pin_user_pages()?
> > >
> > > I'm a nincompoop. I actually _talked_ to Hugh Dickins about this when
> > > he raised concerns, and I dismissed his concerns with "but PAGE_PIN is
> > > special".
> > >
> > > As usual, Hugh was right. Page pinning certainly _is_ special, but
> > > it's not that different from the regular GUP code.
> > >
> > > But in the meantime, I have a lovely confirmation from the kernel test
> > > robot, saying that commit 09854ba94c results in a
> > > "vm-scalability.throughput 31.4% improvement", which was what I was
> > > hoping for - the complexity wasn't just complexity, it was active
> > > badness due to the page locking horrors.
> > >
> > > I think what we want to do is basically do the "early COW", but only
> > > do it for FOLL_PIN (and not turn them into writes for anything but the
> > > COW code). So basically redo the "enforced COW mechanism", but rather
> > > than do it for everything, now do it only for FOLL_PIN, and only in
> > > that COW path.
> > >
> > > Peter - any chance you can look at this? I'm still looking at the page
> > > lock fairness performance regression, although I now think I have a
> > > test patch for Phoronix to test out.
> >
> > Sure, I'll try to prepare something like that and share it shortly.
>
> Jason, would you please try the attached patch to see whether it unbreaks the
> rdma test? Thanks!
Hi Peter,
My tester says the patch does not help (modified as Leon pointed out, to
make it compile).
He did another test where all forks were removed and the test program
succeeds. Overall in our test suites we see failures on tests that
involve fork and success on tests that don't.
So fork and COW seem very likely to be the issue.
Thanks,
Jason
On Tue, Sep 15, 2020 at 01:05:53PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 10:50:40AM -0400, Peter Xu wrote:
> > On Mon, Sep 14, 2020 at 08:28:51PM -0300, Jason Gunthorpe wrote:
> > > Yes, this stuff does pin_user_pages_fast() and MADV_DONTFORK
> > > together. It sets FOLL_FORCE and FOLL_WRITE to get an exclusive copy
> > > of the page and MADV_DONTFORK was needed to ensure that a future fork
> > > doesn't establish a COW that would break the DMA by moving the
> > > physical page over to the fork. DMA should stay with the process that
> > > called pin_user_pages_fast() (Is MADV_DONTFORK still needed with
> > > recent years work to GUP/etc? It is a pretty terrible ancient thing)
> >
> > ... Now I'm more confused on what has happened.
>
> I'm going to try to confirm that the MADV_DONTFORK is actually being
> done by userspace properly, more later.
It turns out the test is broken and does not call MADV_DONTFORK when
doing forks - it is an opt-in that the test didn't do.
It looks to me like this patch makes it much more likely that the COW
break after page pinning will end up moving the pinned physical page
to the fork, while before it was not very common. Does that make sense?
Given that the tests are wrong it seems like broken userspace;
however, it also worked reliably for a fairly long time.
I'm waiting for confirmation that adding the missing MADV_DONTFORK
calls restores all tests to success even with this patch.
Regards,
Jason
On Tue, Sep 15, 2020 at 03:29:33PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 01:05:53PM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 15, 2020 at 10:50:40AM -0400, Peter Xu wrote:
> > > On Mon, Sep 14, 2020 at 08:28:51PM -0300, Jason Gunthorpe wrote:
> > > > Yes, this stuff does pin_user_pages_fast() and MADV_DONTFORK
> > > > together. It sets FOLL_FORCE and FOLL_WRITE to get an exclusive copy
> > > > of the page and MADV_DONTFORK was needed to ensure that a future fork
> > > > doesn't establish a COW that would break the DMA by moving the
> > > > physical page over to the fork. DMA should stay with the process that
> > > > called pin_user_pages_fast() (Is MADV_DONTFORK still needed with
> > > > recent years work to GUP/etc? It is a pretty terrible ancient thing)
> > >
> > > ... Now I'm more confused on what has happened.
> >
> > I'm going to try to confirm that the MADV_DONTFORK is actually being
> > done by userspace properly, more later.
>
> It turns out the test is broken and does not call MADV_DONTFORK when
> doing forks - it is an opt-in it didn't do.
>
> It looks to me like this patch makes it much more likely that the COW
> break after page pinning will end up moving the pinned physical page
> to the fork while before it was not very common. Does that make sense?
My understanding is that the fix should not matter much for the current failing
test case, as long as it's with FOLL_FORCE & FOLL_WRITE. However, what I'm not
sure about is what happens if the RDMA/DMA buffers are designed for pure reads
from userspace.
E.g. for vfio I'm looking at vaddr_get_pfn(), where I believe such pure read
buffers will be a GUP with FOLL_PIN and !FOLL_WRITE which will finally pass to
pin_user_pages_remote(). So what I'm worried about is something like this:
1. Proc A gets a private anon page X for DMA, mapcount==refcount==1.
2. Proc A fork()s and gives birth to proc B, page X will now have
mapcount==refcount==2, write-protected. proc B quits. Page X goes back
to mapcount==refcount==1 (note! without WRITE bits set in the PTE).
3. pin_user_pages(write=false) for page X. Since it's with !FORCE & !WRITE,
no COW needed. Refcount==2 after that.
4. Pass these pages to device. We either setup IOMMU page table or just use
the PFNs, which is not important imho - the most important thing is the
device will DMA into page X no matter what.
5. Some thread of proc A writes to page X, triggering COW since it's
write-protected with mapcount==1 && refcount==2. The HVA that was pointing
to page X will be changed to point to another page Y after the COW.
6. Device DMA happens, data resides on X. Proc A can never get the data,
though, because it's looking at page Y now.
If this is a problem, we may still need the fix patch (maybe not as urgent as
before, at least). But I'd like to double confirm, just in case I'm missing
some obvious facts above.
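(To illustrate, here is a hedged userspace sketch of steps 1, 2 and 5 above;
the pin in step 3 and the DMA in steps 4 and 6 need a real driver, so they are
only marked as comments:)

  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          size_t len = 4096;
          char *x = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          memset(x, 0, len);      /* 1. page X, mapcount==refcount==1 */

          if (fork() == 0)        /* 2. fork: X is now write-protected */
                  _exit(0);
          wait(NULL);             /*    child quits, PTE stays read-only */

          /* 3. pin_user_pages(write=false) on X happens in the driver here */
          /* 4. device is programmed to DMA into X */

          x[0] = 1;               /* 5. write fault: refcount != 1, so X is
                                   *    COWed to Y and the later DMA into X
                                   *    is never seen by proc A (step 6)   */
          return 0;
  }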
>
> Given that the tests are wrong it seems like broken userspace,
> however, it also worked reliably for a fairly long time.
IMHO it worked because the page used for RDMA had mapcount==1, so previously it
was reused just as-is even after a fork without MADV_DONTFORK, once the child
quit. However, logically it should really be protected by MADV_DONTFORK
rather than relying on the reuse.
Thanks,
--
Peter Xu
On Tue, Sep 15, 2020 at 03:13:46PM -0400, Peter Xu wrote:
> On Tue, Sep 15, 2020 at 03:29:33PM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 15, 2020 at 01:05:53PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Sep 15, 2020 at 10:50:40AM -0400, Peter Xu wrote:
> > > > On Mon, Sep 14, 2020 at 08:28:51PM -0300, Jason Gunthorpe wrote:
> > > > > Yes, this stuff does pin_user_pages_fast() and MADV_DONTFORK
> > > > > together. It sets FOLL_FORCE and FOLL_WRITE to get an exclusive copy
> > > > > of the page and MADV_DONTFORK was needed to ensure that a future fork
> > > > > doesn't establish a COW that would break the DMA by moving the
> > > > > physical page over to the fork. DMA should stay with the process that
> > > > > called pin_user_pages_fast() (Is MADV_DONTFORK still needed with
> > > > > recent years work to GUP/etc? It is a pretty terrible ancient thing)
> > > >
> > > > ... Now I'm more confused on what has happened.
> > >
> > > I'm going to try to confirm that the MADV_DONTFORK is actually being
> > > done by userspace properly, more later.
> >
> > It turns out the test is broken and does not call MADV_DONTFORK when
> > doing forks - it is an opt-in it didn't do.
> >
> > It looks to me like this patch makes it much more likely that the COW
> > break after page pinning will end up moving the pinned physical page
> > to the fork while before it was not very common. Does that make sense?
>
> My understanding is that the fix should not matter much with current failing
> test case, as long as it's with FOLL_FORCE & FOLL_WRITE. However what I'm not
> sure is what if the RDMA/DMA buffers are designed for pure read from userspace.
No, they are write. Always FOLL_WRITE.
> E.g. for vfio I'm looking at vaddr_get_pfn() where I believe such pure read
> buffers will be a GUP with FOLL_PIN and !FOLL_WRITE which will finally pass to
> pin_user_pages_remote(). So what I'm worrying is something like this:
I think the !(prot & IOMMU_WRITE) case is probably very rare for
VFIO. I'm also not sure it will work reliably, in RDMA we had this as
a more common case and long ago found bugs. The COW had to be broken
for the pin anyhow.
> 1. Proc A gets a private anon page X for DMA, mapcount==refcount==1.
>
> 2. Proc A fork()s and gives birth to proc B, page X will now have
> mapcount==refcount==2, write-protected. proc B quits. Page X goes back
> to mapcount==refcount==1 (note! without WRITE bits set in the PTE).
> 3. pin_user_pages(write=false) for page X. Since it's with !FORCE & !WRITE,
> no COW needed. Refcount==2 after that.
>
> 4. Pass these pages to device. We either setup IOMMU page table or just use
> the PFNs, which is not important imho - the most important thing is the
> device will DMA into page X no matter what.
>
> 5. Some thread of proc A writes to page X, trigger COW since it's
> write-protected with mapcount==1 && refcount==2. The HVA that pointing to
> page X will be changed to point to another page Y after the COW.
>
> 6. Device DMA happens, data resides on X. Proc A can never get the data,
> though, because it's looking at page Y now.
RDMA doesn't ever use !WRITE
I'm guessing #5 is the issue, just with a different ordering. If the
#3 pin_user_pages() precedes the #2 fork, don't we get to the same #5?
> If this is a problem, we may still need the fix patch (maybe not as urgent as
> before at least). But I'd like to double confirm, just in case I miss some
> obvious facts above.
I'm worried that the sudden need to have MADV_DONTFORK is going to
turn into a huge regression. It already blew up our first level of
synthetic test cases. I'm worried about what we will see when the
application suite is run in a few months :\
> > Given that the tests are wrong it seems like broken userspace,
> > however, it also worked reliably for a fairly long time.
>
> IMHO it worked because the page to do RDMA has mapcount==1, so it was reused
> previously just as-is even after the fork without MADV_DONTFORK and after the
> child quits.
That would match the results we see.. So this patch changes things so
it is not re-used as-is, but replaced with Y?
This basically means any driver using pin_user_pages() can no longer
have fork() in userspace, when before fork() only failed in fairly
narrow cases. Unfortunately I think this will break things broadly
beyond RDMA.
Jason
On Tue, Sep 15, 2020 at 04:38:38PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 03:13:46PM -0400, Peter Xu wrote:
> > On Tue, Sep 15, 2020 at 03:29:33PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Sep 15, 2020 at 01:05:53PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Sep 15, 2020 at 10:50:40AM -0400, Peter Xu wrote:
> > > > > On Mon, Sep 14, 2020 at 08:28:51PM -0300, Jason Gunthorpe wrote:
> > > > > > Yes, this stuff does pin_user_pages_fast() and MADV_DONTFORK
> > > > > > together. It sets FOLL_FORCE and FOLL_WRITE to get an exclusive copy
> > > > > > of the page and MADV_DONTFORK was needed to ensure that a future fork
> > > > > > doesn't establish a COW that would break the DMA by moving the
> > > > > > physical page over to the fork. DMA should stay with the process that
> > > > > > called pin_user_pages_fast() (Is MADV_DONTFORK still needed with
> > > > > > recent years work to GUP/etc? It is a pretty terrible ancient thing)
> > > > >
> > > > > ... Now I'm more confused on what has happened.
> > > >
> > > > I'm going to try to confirm that the MADV_DONTFORK is actually being
> > > > done by userspace properly, more later.
> > >
> > > It turns out the test is broken and does not call MADV_DONTFORK when
> > > doing forks - it is an opt-in it didn't do.
> > >
> > > It looks to me like this patch makes it much more likely that the COW
> > > break after page pinning will end up moving the pinned physical page
> > > to the fork while before it was not very common. Does that make sense?
> >
> > My understanding is that the fix should not matter much with current failing
> > test case, as long as it's with FOLL_FORCE & FOLL_WRITE. However what I'm not
> > sure is what if the RDMA/DMA buffers are designed for pure read from userspace.
>
> No, they are write. Always FOLL_WRITE.
>
> > E.g. for vfio I'm looking at vaddr_get_pfn() where I believe such pure read
> > buffers will be a GUP with FOLL_PIN and !FOLL_WRITE which will finally pass to
> > pin_user_pages_remote(). So what I'm worrying is something like this:
>
> I think the !(prot & IOMMU_WRITE) case is probably very rare for
> VFIO. I'm also not sure it will work reliably, in RDMA we had this as
> a more common case and long ago found bugs. The COW had to be broken
> for the pin anyhow.
If I'm not wrong.. QEMU/KVM (assuming there's one vIOMMU in the guest) will try
to do VFIO maps in this read-only way if the IOVA mapped in the guest points to
read only buffers (say, allocated with PCI_DMA_FROMDEVICE).
>
> > 1. Proc A gets a private anon page X for DMA, mapcount==refcount==1.
> >
> > 2. Proc A fork()s and gives birth to proc B, page X will now have
> > mapcount==refcount==2, write-protected. proc B quits. Page X goes back
> > to mapcount==refcount==1 (note! without WRITE bits set in the PTE).
>
> > 3. pin_user_pages(write=false) for page X. Since it's with !FORCE & !WRITE,
> > no COW needed. Refcount==2 after that.
> >
> > 4. Pass these pages to device. We either setup IOMMU page table or just use
> > the PFNs, which is not important imho - the most important thing is the
> > device will DMA into page X no matter what.
> >
> > 5. Some thread of proc A writes to page X, trigger COW since it's
> > write-protected with mapcount==1 && refcount==2. The HVA that pointing to
> > page X will be changed to point to another page Y after the COW.
> >
> > 6. Device DMA happens, data resides on X. Proc A can never get the data,
> > though, because it's looking at page Y now.
>
> RDMA doesn't ever use !WRITE
>
> I'm guessing #5 is the issue, just with a different ordering. If the
> #3 pin_user_pages() preceeds the #2 fork, don't we get to the same #5?
Right, but only without MADV_DONTFORK? Without MADV_DONTFORK I'd
probably still see that as a userspace bug instead of a kernel one, if
userspace decided to fork() after step #3.
>
> > If this is a problem, we may still need the fix patch (maybe not as urgent as
> > before at least). But I'd like to double confirm, just in case I miss some
> > obvious facts above.
>
> I'm worred that the sudden need to have MAD_DONTFORK is going to be a
> turn into a huge regression. It already blew up our first level of
> synthetic test cases. I'm worried what we will see when the
> application suite is run in a few months :\
My own preference would be to consider changing the kernel behavior if the
impact is still under control (the reported 30%+ performance boost after the
simplify-cow patch is also attractive). The other way is to maintain the old
reuse logic forever, which would be another kind of burden. There seems to be
no easy way on either side...
>
> > > Given that the tests are wrong it seems like broken userspace,
> > > however, it also worked reliably for a fairly long time.
> >
> > IMHO it worked because the page to do RDMA has mapcount==1, so it was reused
> > previously just as-is even after the fork without MADV_DONTFORK and after the
> > child quits.
>
> That would match the results we see.. So this patch changes things so
> it is not re-used as-is, but replaced with Y?
Yes. The patch lets "replaced with Y" (cow) happen earlier at step #3. Then
with MADV_DONTFORK, reuse should not happen any more.
Thanks,
--
Peter Xu
On Tue, Sep 15, 2020 at 05:03:49PM +0200, Oleg Nesterov wrote:
> > - if (reuse_swap_page(vmf->page, &total_map_swapcount)) {
>
> It seems that nobody else calls reuse_swap_page() with total_map_swapcount != NULL.
Seems correct. Though maybe we could postpone the cleanup until the
fallout of the cow rework settles (just in case we'd like to bring some
mapcount logic back). Thanks,
--
Peter Xu
On Tue, Sep 15, 2020 at 10:50:40AM -0400, Peter Xu wrote:
> On Mon, Sep 14, 2020 at 08:28:51PM -0300, Jason Gunthorpe wrote:
> > Yes, this stuff does pin_user_pages_fast() and MADV_DONTFORK
> > together. It sets FOLL_FORCE and FOLL_WRITE to get an exclusive copy
> > of the page and MADV_DONTFORK was needed to ensure that a future fork
> > doesn't establish a COW that would break the DMA by moving the
> > physical page over to the fork. DMA should stay with the process that
> > called pin_user_pages_fast() (Is MADV_DONTFORK still needed with
> > recent years work to GUP/etc? It is a pretty terrible ancient thing)
>
> ... Now I'm more confused on what has happened.
I'm going to try to confirm that the MADV_DONTFORK is actually being
done by userspace properly, more later.
> It means, as long as the rdma region has VM_WRITE set (which I think of no
> reason on why it shouldn't...), then it should have the write bit in the COWed
> page entry. If so, the page should be stable and I don't undersdand why
> another COW could even trigger and how the code path in the "trial cow" patch
> is triggered.
All the regions the test is doing DMA to will be simple, process-writable
anonymous VMAs from malloc().
> Or, the VMA is without VM_WRITE due to some reason? Sorry I probably know
> nothing about RDMA, more information on that side might help too. E.g., is the
> hardware going to walk the software process page table too when doing RDMA (or
> is IOMMU page table used, or none)?
It does pin_user_pages_fast(), gets a list of DMA addresses for the
pages and then programs the hardware. The pin remains for a very long
time and the HW does DMA to those pages independently.
Userspace will write to the memory and trigger DMA reads and HW will
do DMA writes and trigger something close to an eventfd to let
userspace know to check the DMA'd data.
Very similar to how an in-kernel driver works. It is similar to VFIO
in how it uses pin_user_pages_fast().
Symptoms look to be like the DMA's are not arriving.
As before, the requirement is that once a process has done
pin_user_pages() the physical page stays with the process. If there is
a fork() and a COW then the current memory stays with the original
process and the fork'd child gets the copy. MADV_DONTFORK is expected
to ensure this..
We haven't been able to narrow it down to a reproduction that doesn't require
a lot of hardware, unfortunately. It seems oddly sensitive, maybe due to the
memory layout triggering a COW..
Jason
Hi, all,
I prepared another version of the FOLL_PIN enforced cow patch attached, just in
case it would still be anything close to useful (though now I highly doubt it
considering below...). I took care of !USERFAULTFD as suggested by Leon, and
also the fast gup path.
However...
On Mon, Sep 14, 2020 at 08:28:51PM -0300, Jason Gunthorpe wrote:
> Yes, this stuff does pin_user_pages_fast() and MADV_DONTFORK
> together. It sets FOLL_FORCE and FOLL_WRITE to get an exclusive copy
> of the page and MADV_DONTFORK was needed to ensure that a future fork
> doesn't establish a COW that would break the DMA by moving the
> physical page over to the fork. DMA should stay with the process that
> called pin_user_pages_fast() (Is MADV_DONTFORK still needed with
> recent years work to GUP/etc? It is a pretty terrible ancient thing)
... Now I'm more confused on what has happened.
If we're with FORCE|WRITE, iiuc it should guarantee that the page will trigger
COW during gup even if it is shared, so no problem on the gup side. Then I'm
quite confused on why the write bit is not set when cow triggered.
E.g., in wp_page_copy(), if I'm not wrong, the write bit is only controlled by
(besides the fix patch, though I believe the rdma test should have nothing to
do with uffd-wp after all so it should be the same anyways):
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
It means, as long as the rdma region has VM_WRITE set (and I see no reason
why it shouldn't...), it should have the write bit in the COWed page entry.
If so, the page should be stable and I don't understand why another COW could
even trigger, or how the code path in the "trial cow" patch is triggered.
Or, the VMA is without VM_WRITE due to some reason? Sorry I probably know
nothing about RDMA, more information on that side might help too. E.g., is the
hardware going to walk the software process page table too when doing RDMA (or
is IOMMU page table used, or none)?
Thanks,
--
Peter Xu
On 08/21, Peter Xu wrote:
>
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2927,50 +2927,25 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> * not dirty accountable.
> */
> if (PageAnon(vmf->page)) {
> - int total_map_swapcount;
> - if (PageKsm(vmf->page) && (PageSwapCache(vmf->page) ||
> - page_count(vmf->page) != 1))
> + struct page *page = vmf->page;
> +
> + /* PageKsm() doesn't necessarily raise the page refcount */
> + if (PageKsm(page) || page_count(page) != 1)
> + goto copy;
> + if (!trylock_page(page))
> + goto copy;
> + if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> + unlock_page(page);
> goto copy;
> - if (!trylock_page(vmf->page)) {
> - get_page(vmf->page);
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> - lock_page(vmf->page);
> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> - vmf->address, &vmf->ptl);
> - if (!pte_same(*vmf->pte, vmf->orig_pte)) {
> - update_mmu_tlb(vma, vmf->address, vmf->pte);
> - unlock_page(vmf->page);
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> - put_page(vmf->page);
> - return 0;
> - }
> - put_page(vmf->page);
> - }
> - if (PageKsm(vmf->page)) {
> - bool reused = reuse_ksm_page(vmf->page, vmf->vma,
> - vmf->address);
> - unlock_page(vmf->page);
> - if (!reused)
> - goto copy;
> - wp_page_reuse(vmf);
> - return VM_FAULT_WRITE;
> - }
> - if (reuse_swap_page(vmf->page, &total_map_swapcount)) {
It seems that nobody else calls reuse_swap_page() with total_map_swapcount != NULL.
Oleg.
On Tue, Sep 15, 2020 at 05:33:30PM -0400, Peter Xu wrote:
> > RDMA doesn't ever use !WRITE
> >
> > I'm guessing #5 is the issue, just with a different ordering. If the
> > #3 pin_user_pages() preceeds the #2 fork, don't we get to the same #5?
>
> Right, but only if without MADV_DONTFORK?
Yes, results are that MADV_DONTFORK resolves the issue for initial
tests. I should know if a bigger test suite passes in a few days.
> > > If this is a problem, we may still need the fix patch (maybe not as urgent as
> > > before at least). But I'd like to double confirm, just in case I miss some
> > > obvious facts above.
> >
> > I'm worred that the sudden need to have MAD_DONTFORK is going to be a
> > turn into a huge regression. It already blew up our first level of
> > synthetic test cases. I'm worried what we will see when the
> > application suite is run in a few months :\
>
> For my own preference I'll consider changing kernel behavior if the impact is
> still under control (the performance report of 30%+ boost is also attractive
> after the simplify-cow patch). The other way is to maintain the old reuse
> logic forever, that'll be another kind of burden. Seems no easy way on either
> side...
It seems very strange that a physical page exclusively owned by a
process can become copied if pin_user_pages() is active and the
process did fork() at some point.
Could the new pin_user_pages() logic help here? eg the
GUP_PIN_COUNTING_BIAS stuff?
Could the COW code consider a refcount of GUP_PIN_COUNTING_BIAS + 1 as
being owned by the current mm and not needing COW? The DMA pin would
be 'invisible' for COW purposes?
Perhaps an ideal thing would be to have fork() not set the write
protect on pages that have GUP_PIN_COUNTING_BIAS (ie are under DMA),
instead immediately copy during fork(). Then we don't get into this
situation and also don't need MADV_DONTFORK anymore (yay!!). Feels like
this could be low cost as fork() must already be touching the
refcount?
It looks like RDMA, media, vfio, vdpa, io_uring (!!) and xdp all use
FOLL_LONGTERM and may be at regression risk.
I can't say at this point what the scope of the problem with RDMA is.
*Technically* apps forking without MADV_DONTFORK are against the
defined programming model, but since the old kernel didn't fail
robustly there could be misses. FWIW the failure is catastrophic: the
app just breaks completely.
io_uring seems like something people would want to mix with
fork().. I see mentions of MADV_DONTFORK in the io_uring documentation;
however, it is not documented as an 'if you ever call fork() you have to
use this API' requirement. That seems worrying.
> > > IMHO it worked because the page to do RDMA has mapcount==1, so it was reused
> > > previously just as-is even after the fork without MADV_DONTFORK and after the
> > > child quits.
> >
> > That would match the results we see.. So this patch changes things so
> > it is not re-used as-is, but replaced with Y?
>
> Yes. The patch lets "replaced with Y" (cow) happen earlier at step #3. Then
> with MADV_DONTFORK, reuse should not happen any more.
?? Step #3 is pin_user_pages(), why would that replace with COW?
Thanks,
Jason
On 9/15/20 4:22 PM, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 05:33:30PM -0400, Peter Xu wrote:
>
>>> RDMA doesn't ever use !WRITE
>>>
>>> I'm guessing #5 is the issue, just with a different ordering. If the
>>> #3 pin_user_pages() preceeds the #2 fork, don't we get to the same #5?
>>
>> Right, but only if without MADV_DONTFORK?
>
> Yes, results are that MADV_DONTFORK resolves the issue for initial
> tests. I should know if a bigger test suite passes in a few days.
>
>>>> If this is a problem, we may still need the fix patch (maybe not as urgent as
>>>> before at least). But I'd like to double confirm, just in case I miss some
>>>> obvious facts above.
>>>
>>> I'm worred that the sudden need to have MAD_DONTFORK is going to be a
>>> turn into a huge regression. It already blew up our first level of
>>> synthetic test cases. I'm worried what we will see when the
>>> application suite is run in a few months :\
>>
>> For my own preference I'll consider changing kernel behavior if the impact is
>> still under control (the performance report of 30%+ boost is also attractive
>> after the simplify-cow patch). The other way is to maintain the old reuse
>> logic forever, that'll be another kind of burden. Seems no easy way on either
>> side...
>
> It seems very strange that a physical page exclusively owned by a
> process can become copied if pin_user_pages() is active and the
> process did fork() at some point.
>
> Could the new pin_user_pages() logic help here? eg the
> GUP_PIN_COUNTING_BIAS stuff?
>
> Could the COW code consider a refcount of GUP_PIN_COUNTING_BIAS + 1 as
> being owned by the current mm and not needing COW? The DMA pin would
> be 'invisible' for COW purposes?
Please do be careful to use the API, rather than the implementation. The
FOLL_PIN refcounting system results in being able to get a "maybe
DMA-pinned", or a "definitely not DMA-pinned", via this API call:
static inline bool page_maybe_dma_pinned(struct page *page)
...which does *not* always use GUP_PIN_COUNTING_BIAS to provide that
answer.
thanks,
--
John Hubbard
NVIDIA
On Wed, Sep 16, 2020 at 02:48:04PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 06:50:46PM -0700, John Hubbard wrote:
> > >
> > > It seems very strange that a physical page exclusively owned by a
> > > process can become copied if pin_user_pages() is active and the
> > > process did fork() at some point.
> > >
> > > Could the new pin_user_pages() logic help here? eg the
> > > GUP_PIN_COUNTING_BIAS stuff?
> > >
> > > Could the COW code consider a refcount of GUP_PIN_COUNTING_BIAS + 1 as
> > > being owned by the current mm and not needing COW? The DMA pin would
> > > be 'invisible' for COW purposes?
> >
> >
> > Please do be careful to use the API, rather than the implementation. The
> > FOLL_PIN refcounting system results in being able to get a "maybe
> > DMA-pinned", or a "definitely not DMA-pinned", via this API call:
>
> So, what I'm thinking is something like (untested):
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 469af373ae76e1..77f63183667e52 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2889,6 +2889,26 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
> return ret;
> }
>
> +static bool cow_needed(struct vm_fault *vmf)
> +{
> + int total_map_swapcount;
> +
> + if (!reuse_swap_page(vmf->page, &total_map_swapcount)) {
> + unlock_page(vmf->page);
> + return true;
> + }
> +
> + if (total_map_swapcount == 1) {
> + /*
> + * The page is all ours. Move it to our anon_vma so the rmap
> + * code will not search our parent or siblings. Protected
> + * against the rmap code by the page lock.
> + */
> + page_move_anon_rmap(vmf->page, vmf->vma);
> + }
> + return false;
> +}
> +
> /*
> * This routine handles present pages, when users try to write
> * to a shared page. It is done by copying the page to a new address
> @@ -2947,8 +2967,21 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> if (!trylock_page(page))
> goto copy;
> if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> + bool cow = true;
> +
> + /*
> + * If the page is DMA pinned we can't rely on the
> + * above to know if there are other CPU references as
> + * page_count() will be elevated by the
> + * pin. Needlessly copying the page will cause the DMA
> + * pin to break, try harder to avoid that.
> + */
> + if (page_maybe_dma_pinned(page))
> + cow = cow_needed(vmf);
> +
> unlock_page(page);
> - goto copy;
> + if (cow)
> + goto copy;
> }
> /*
> * Ok, we've got the only map reference, and the only
>
> What do you think Peter? Is this remotely close?
My understanding is this may only work for the case where the fork()ed child
quit before we reach here (so we still have mapcount==1 for the page). What
if not? Then mapcount will be greater than 1, and cow will still trigger. Is
that what we want?
Another problem is that, aiui, one of the major changes the previous patch
proposed is to avoid using lock_page() so that we never block in this path.
However, if we need the page reuse logic to guarantee pinned pages, then it
becomes a correctness issue, so I'm afraid we'll start to make all things
complicated again. Maybe even more complicated, because "correctness" is even
harder than "best effort reuse" since it can cause data corruption if we don't
do it right...
--
Peter Xu
On Tue, Sep 15, 2020 at 06:50:46PM -0700, John Hubbard wrote:
> >
> > It seems very strange that a physical page exclusively owned by a
> > process can become copied if pin_user_pages() is active and the
> > process did fork() at some point.
> >
> > Could the new pin_user_pages() logic help here? eg the
> > GUP_PIN_COUNTING_BIAS stuff?
> >
> > Could the COW code consider a refcount of GUP_PIN_COUNTING_BIAS + 1 as
> > being owned by the current mm and not needing COW? The DMA pin would
> > be 'invisible' for COW purposes?
>
>
> Please do be careful to use the API, rather than the implementation. The
> FOLL_PIN refcounting system results in being able to get a "maybe
> DMA-pinned", or a "definitely not DMA-pinned", via this API call:
So, what I'm thinking is something like (untested):
diff --git a/mm/memory.c b/mm/memory.c
index 469af373ae76e1..77f63183667e52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2889,6 +2889,26 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
return ret;
}
+static bool cow_needed(struct vm_fault *vmf)
+{
+ int total_map_swapcount;
+
+ if (!reuse_swap_page(vmf->page, &total_map_swapcount)) {
+ unlock_page(vmf->page);
+ return true;
+ }
+
+ if (total_map_swapcount == 1) {
+ /*
+ * The page is all ours. Move it to our anon_vma so the rmap
+ * code will not search our parent or siblings. Protected
+ * against the rmap code by the page lock.
+ */
+ page_move_anon_rmap(vmf->page, vmf->vma);
+ }
+ return false;
+}
+
/*
* This routine handles present pages, when users try to write
* to a shared page. It is done by copying the page to a new address
@@ -2947,8 +2967,21 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
if (!trylock_page(page))
goto copy;
if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
+ bool cow = true;
+
+ /*
+ * If the page is DMA pinned we can't rely on the
+ * above to know if there are other CPU references as
+ * page_count() will be elevated by the
+ * pin. Needlessly copying the page will cause the DMA
+ * pin to break, try harder to avoid that.
+ */
+ if (page_maybe_dma_pinned(page))
+ cow = cow_needed(vmf);
+
unlock_page(page);
- goto copy;
+ if (cow)
+ goto copy;
}
/*
* Ok, we've got the only map reference, and the only
What do you think Peter? Is this remotely close?
Seems like it preserves the fast path in most cases; the page_count &
page_maybe_dma_pinned checks could be further optimized down to one atomic
read in the non-huge-page cases.
Jason
On Wed, Sep 16, 2020 at 02:46:19PM -0400, Peter Xu wrote:
> My understanding is this may only work for the case when the fork()ed child
> quitted before we reach here (so we still have mapcount==1 for the
> page).
Yes
> What if not? Then mapcount will be greater than 1, and cow will
> still trigger. Is that what we want?
That doesn't work today anyhow, so it is fine continuing to be broken.
> Another problem is that, aiui, one of the major change previous patch proposed
> is to avoid using lock_page() so that we never block in this path.
I saw you mention this before, but it looks like the change was to
lift some of the atomic reads out of the lock and avoid the lock if
they indicate failure; checking also for page_maybe_dma_pinned()
outside the lock just means that in the rare FOLL_PIN case we will take
the lock again.
> Maybe even more complicated, because "correctness" should be even harder
> than "best effort reuse" since it can cause data corruption if we didn't do it
> right...
The only correct way is for the application to avoid write protect on
FOLL_PIN pages. The purpose here is to allow applications that hadn't
hit "bad luck" and failed to keep working.
Another thought is to insert a warning print here as well, noting that the
program is doing something improper? At least it would give a transition
period to evaluate the extent of the problem.
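(For what it's worth, a minimal sketch of what such a warning might look like
in do_wp_page(), purely illustrative and not part of any posted patch:)

          /* Hedged sketch: warn once when COW-breaking a DMA-pinned page. */
          if (page_maybe_dma_pinned(page))
                  pr_warn_once("%s: breaking COW on a DMA-pinned page; missing MADV_DONTFORK?\n",
                               current->comm);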
We are thinking it is going to be a notable regression.
I botched the last version of the patch, here is something a bit
better.
Does it seem like it could be OK? I know very little about this part
of the kernel
Thanks,
Jason
diff --git a/mm/memory.c b/mm/memory.c
index 469af373ae76e1..332de777854f8b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2889,6 +2889,24 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
return ret;
}
+static bool cow_needed(struct vm_fault *vmf)
+{
+ int total_map_swapcount;
+
+ if (!reuse_swap_page(vmf->page, &total_map_swapcount))
+ return true;
+
+ if (total_map_swapcount == 1) {
+ /*
+ * The page is all ours. Move it to our anon_vma so the rmap
+ * code will not search our parent or siblings. Protected
+ * against the rmap code by the page lock.
+ */
+ page_move_anon_rmap(vmf->page, vmf->vma);
+ }
+ return false;
+}
+
/*
* This routine handles present pages, when users try to write
* to a shared page. It is done by copying the page to a new address
@@ -2942,13 +2960,27 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
struct page *page = vmf->page;
/* PageKsm() doesn't necessarily raise the page refcount */
- if (PageKsm(page) || page_count(page) != 1)
+ if (PageKsm(page))
goto copy;
+ if (page_count(page) != 1) {
+ /*
+ * If the page is DMA pinned we can't rely on the
+ * above to know if there are other CPU references as
+ * page_count() will be elevated by the
+ * pin. Needlessly copying the page will cause the DMA
+ * pin to break, try harder to avoid that.
+ */
+ if (!page_maybe_dma_pinned(page))
+ goto copy;
+ }
+
if (!trylock_page(page))
goto copy;
if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
- unlock_page(page);
- goto copy;
+ if (cow_needed(vmf)) {
+ unlock_page(page);
+ goto copy;
+ }
}
/*
* Ok, we've got the only map reference, and the only
> I botched the last version of the patch, here is something a bit
> better.
So I'd like to understand why this problem happens.
My argument to Hugh a few weeks ago was that page pinning should take
care of all this:
(a) if the pinner is going to change the page, it will have to get
the pin with FOLL_WRITE in addition to FOLL_PIN
(b) that means it will do the COW and mark the page writable and dirty
(c) the whole _point_ of the FOLL_PIN is that subsequent operations
shouldn't make it non-writable any more (ie it can't be unmapped, and
we should synchronize on fork etc)
So I get the strong feeling that this whole approach to "fix" COW is
really papering over the real problem.
To me, the whole point of pinning is to avoid issues like this. If it
didn't fix this issue, then why did we go to all the effort of
treating pinned pages differently?
Your patch may avoid the problem, but I think it implies things are
horribly horribly broken in pinning land.
I also note that the _only_ user of page_maybe_dma_pinned() right now
is a debug check in gup_benchmark. I think your patch may _work_, and
I think that thanks to page_maybe_dma_pinned() it might even avoid the
horrible case, but I feel we should just truly fix pinning to be
meaningful instead.
IOW, in addition to keeping the page mapped, it should keep the page writable.
Why isn't it?
Linus
On Thu, Sep 17, 2020 at 08:25:38AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 16, 2020 at 02:46:19PM -0400, Peter Xu wrote:
>
> > My understanding is this may only work for the case when the fork()ed child
> > quitted before we reach here (so we still have mapcount==1 for the
> > page).
>
> Yes
>
> > What if not? Then mapcount will be greater than 1, and cow will
> > still trigger. Is that what we want?
>
> That doesn't work today anyhow, so it is fine continuing to be broken.
>
> > Another problem is that, aiui, one of the major change previous patch proposed
> > is to avoid using lock_page() so that we never block in this path.
>
> I saw you mention this before, but it looks like the change was to
> lift some of the atomc_reads out of the lock and avoid the lock if
> they indicate failure, checking also for page_maybe_dma_pinned()
> outside the lock just means the rare case of FOLL_PIN we will take the
> lock again.
Sorry to be unclear. What I meant was that, iiuc, if we want to guarantee the
pinned page will be reused, then trylock_page() could be too weak, and we may
still need lock_page(). E.g., a race with another thread that happens to hold
the page lock, failing the trylock_page(), could trigger an unexpected cow for
the pinned pages.
>
> > Maybe even more complicated, because "correctness" should be even harder
> > than "best effort reuse" since it can cause data corruption if we didn't do it
> > right...
>
> The only correct way is for the application to avoid write protect on
> FOLL_PIN pages. The purpose here is to allow applications that hadn't
> hit "bad luck" and failed to keep working.
>
> Another thought is to insert a warning print here as well that the
> program is working improperly? At least it would give a transition
> period to evaluate the extent of the problem.
>
> We are thinking it is going to be a notable regression.
>
> I botched the last version of the patch, here is something a bit
> better.
>
> Does it seem like it could be OK? I know very little about this part
> of the kernel
In my humble opinion, the real solution is still to use MADV_DONTFORK properly
so we should never share the DMA pages with others when we know the fact.
The apps may work with the old kernels, but IMHO they just work by accident,
because the DMA pages happened not to be shared for any other reason when the
write-protect page fault happens. E.g., the parent did waitpid() on the
children, so it is guaranteed that there will only be one user of the page, and
then it'll be reused as long as we check the mapcounts. But that's kind of a
"workaround" at least to me, since I'd say the sharing should not happen in the
first place, right after we know it's a DMA page.
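(Concretely, "use MADV_DONTFORK properly" means something like the following
user-space sketch - the buffer size and the placeholder for the actual
pin/registration step are illustrative only:)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1 << 20;		/* illustrative 1 MiB DMA buffer */
	void *buf;

	if (posix_memalign(&buf, sysconf(_SC_PAGESIZE), len))
		return 1;
	memset(buf, 0, len);

	/*
	 * Tell the kernel not to map this range into children at all, so a
	 * later fork() can never turn the pinned pages into COW pages.
	 */
	if (madvise(buf, len, MADV_DONTFORK)) {
		perror("madvise(MADV_DONTFORK)");
		return 1;
	}

	/* ... register/pin buf with the device here (driver-specific) ... */

	if (fork() == 0)
		_exit(0);		/* child never sees buf */
	wait(NULL);

	free(buf);
	return 0;
}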
It would be good if Linus or Andrea could share their thoughts, or anyone who
knows this better (I'd bet there are plenty.. :).
Thanks,
>
> Thanks,
> Jason
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 469af373ae76e1..332de777854f8b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2889,6 +2889,24 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf)
> return ret;
> }
>
> +static bool cow_needed(struct vm_fault *vmf)
> +{
> + int total_map_swapcount;
> +
> + if (!reuse_swap_page(vmf->page, &total_map_swapcount))
> + return true;
> +
> + if (total_map_swapcount == 1) {
> + /*
> + * The page is all ours. Move it to our anon_vma so the rmap
> + * code will not search our parent or siblings. Protected
> + * against the rmap code by the page lock.
> + */
> + page_move_anon_rmap(vmf->page, vmf->vma);
> + }
> + return false;
> +}
> +
> /*
> * This routine handles present pages, when users try to write
> * to a shared page. It is done by copying the page to a new address
> @@ -2942,13 +2960,27 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> struct page *page = vmf->page;
>
> /* PageKsm() doesn't necessarily raise the page refcount */
> - if (PageKsm(page) || page_count(page) != 1)
> + if (PageKsm(page))
> goto copy;
> + if (page_count(page) != 1) {
> + /*
> + * If the page is DMA pinned we can't rely on the
> + * above to know if there are other CPU references as
> + * page_count() will be elevated by the
> + * pin. Needlessly copying the page will cause the DMA
> + * pin to break, try harder to avoid that.
> + */
> + if (!page_maybe_dma_pinned(page))
> + goto copy;
> + }
> +
> if (!trylock_page(page))
> goto copy;
> if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) {
> - unlock_page(page);
> - goto copy;
> + if (cow_needed(vmf)) {
> + unlock_page(page);
> + goto copy;
> + }
> }
> /*
> * Ok, we've got the only map reference, and the only
>
--
Peter Xu
On Thu, Sep 17, 2020 at 11:14 AM Peter Xu <[email protected]> wrote:
>
> In my humble opinion, the real solution is still to use MADV_DONTFORK properly
> so we should never share the DMA pages with others when we know the fact.
Is this all just because somebody does a fork() after doing page pinning?
If so, I feel this should be trivially fixed in copy_one_pte().
That's where we currently do
/*
* If it's a COW mapping, write protect it both
* in the parent and the child
*/
if (is_cow_mapping(vm_flags) && pte_write(pte)) {
ptep_set_wrprotect(src_mm, addr, src_pte);
pte = pte_wrprotect(pte);
}
and I feel that that is where we could just change the code to do a
COW event for pinned pages (and *not* mark the parent write protected,
since the parent page now isn't a COW page).
Because if that's the case that Jason is hitting, then I feel that
really is the correct fix: make sure that the pinning action is
meaningful.
As mentioned, I really think the whole (and only) point of page
pinning is that it should keep the page locked in the page tables. And
by "locked" I mean exactly that: not just present, but writable.
And then the "we never COW a pinned page" comes not from the COW code
doing magic, but by it simply never becoming non-writable - because
the page table entry is locked!
Linus
On Thu, Sep 17, 2020 at 11:26:01AM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2020 at 11:14 AM Peter Xu <[email protected]> wrote:
> >
> > In my humble opinion, the real solution is still to use MADV_DONTFORK properly
> > so we should never share the DMA pages with others when we know the fact.
>
> Is this all just because somebody does a fork() after doing page pinning?
>
> If so, I feel this should be trivially fixed in copy_one_pte().
> That's where we currently do
>
> /*
> * If it's a COW mapping, write protect it both
> * in the parent and the child
> */
> if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> ptep_set_wrprotect(src_mm, addr, src_pte);
> pte = pte_wrprotect(pte);
> }
>
> and I feel that that is where we could just change the code to do a
> COW event for pinned pages (and *not* mark the parent write protected,
> since the parent page now isn't a COW page).
>
> Because if that's the case that Jason is hitting, then I feel that
> really is the correct fix: make sure that the pinning action is
> meaningful.
>
> As mentioned, I really think the whole (and only) point of page
> pinning is that it should keep the page locked in the page tables. And
> by "locked" I mean exactly that: not just present, but writable.
>
> And then the "we never COW a pinned page" comes not from the COW code
> doing magic, but by it simply never becoming non-writable - because
> the page table entry is locked!
Looks reasonable to me.
The fork() should be slightly slower though, since we'll need to copy the data
of all the DMA buffers for each of the child processes, even if we should be
pretty sure those processes won't use these pages at all. But it seems a good
approach anyway if we care about the potential breakages in userspace, since
the breakage is turned into a perf degradation, and if any userspace notices
such a perf degradation on fork() then people will probably remember to use
MADV_DONTFORK properly, since afaict MADV_DONTFORK can remove this extra
overhead..
Another side effect I can think of is that we'll bring some uncertainty to
fork() starting from when page_maybe_dma_pinned() is used, since it's sometimes
bogus (hpage_pincount_available()==false) so some COWs might be triggered
during fork() even when not necessary if we've got some normal pages with too
many refcounts (over GUP_PIN_COUNTING_BIAS). But assuming that's not a big
deal since it should be extremely rare, or is it?..
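(For context, the heuristic in question boils down to roughly the sketch
below - simplified from the 5.9-era page_maybe_dma_pinned(), which also has an
exact compound_pincount() path when hpage_pincount_available() is true:)

#include <linux/mm.h>	/* GUP_PIN_COUNTING_BIAS (1024), page_ref_count() */

static bool page_maybe_dma_pinned_sketch(struct page *page)
{
	/*
	 * pin_user_pages*() bumps the refcount by GUP_PIN_COUNTING_BIAS, so
	 * a refcount at or above the bias *probably* means a pin; but a page
	 * with ~1024 ordinary references looks exactly the same, which is
	 * the false positive worried about above.
	 */
	return page_ref_count(page) >= GUP_PIN_COUNTING_BIAS;
}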
--
Peter Xu
On 9/17/20 12:42 PM, Linus Torvalds wrote:
> On Thu, Sep 17, 2020 at 12:03 PM Peter Xu <[email protected]> wrote:
...
>
> Is there possibly somethign else we can filter on than just
> GUP_PIN_COUNTING_BIAS? Because it could be as simple as just marking
> the vma itself and saying "this vma has had a page pinning event done
> on it".
>
> Because if we only start copying the page *iff* the vma is marked by
> that "this vma had page pinning" _and_ the page count is bigger than
> GUP_PIN_COUNTING_BIAS, than I think we can rest pretty easily knowing
> that we aren't going to hit some regular old-fashioned UNIX server
> cases with a lot of forks..
>
> Linus
OK, so this sounds like an improvement that would be desirable in the
core API:
1) Have all the pin_user_pages*() functions reach up and mark the vma(s)
as FOLL_PIN_HAPPENED. I'm assuming that the vma can live its full life
without the need to ever clear that flag. That *seems* reasonable, based
on the use cases for these kinds of pages.
2) And then rename:
page_maybe_dma_pinned() --> page_likely_dma_pinned()
:)
thanks,
--
John Hubbard
NVIDIA
On Thu, Sep 17, 2020 at 12:38 PM Jason Gunthorpe <[email protected]> wrote:
>
> Looking for awhile, this now looks reasonable and
> doable. page_maybe_dma_pinned() was created for exactly this kind of
> case.
>
> I've attached a dumb sketch for the pte level (surely wrong! I have
> never looked at this part of the mm before!) at the end of this
> message.
This looks conceptually fine to me.
But as mentioned, I think I'd be even happier if we added a "this vma
has seen a page pin event" flag to the vma flags, and didn't rely
_just_ on the page_maybe_dma_pinned() check, which might be triggered
by those fork-happy loads.
Side note: I wonder if that COW mapping check could be entirely within
that vm_normal_page() path.
Because how could a non-normal page be a COW page and not already
write-protected?
But that's a separate issue, it's just how your patch makes that odd
case more obvious.
Linus
On Thu, Sep 17, 2020 at 12:03 PM Peter Xu <[email protected]> wrote:
>
> The fork() should be slightly slower though, since we'll need to copy the data
> for all the DMA buffers for each of the child processes, even if we should be
> pretty sure those processes won't use these pages at all.
I think that as long as we only trigger for pinned pages then that's
fine - and in fact I think we might want to add a WARN_ON_ONCE() or
something to it if we are sure enough about that page pinning.
Because the whole "do page pinning without MADV_DONTFORK and then fork
the area" is I feel a very very invalid load. It sure as hell isn't
something we should care about performance for, and in fact it is
something we should very well warn for exactly to let people know
"this process is doing bad things".
My main worry is page_maybe_dma_pinned() not being exact. I'm not
entirely convinced whether it really is extremely rare or not.
I could well imagine some very fork-happy load having very elevated
page counts (exactly *because* it is fork-happy), and then the
performance thing might in fact be an issue.
That said, you really have to be *very* fork-happy for this to
trigger, with GUP_PIN_COUNTING_BIAS being 1024. Those kinds of
fork-happy loads very much used to exist, but I think anybody who
cares about performance will have long long since switched to
threading, not forking.
Do people fork a thousand children? Sure. Is it something we need to
worry about a lot? I don't know. I'm thinking of the android zygote
disaster..
Is there possibly something else we can filter on than just
GUP_PIN_COUNTING_BIAS? Because it could be as simple as just marking
the vma itself and saying "this vma has had a page pinning event done
on it".
Because if we only start copying the page *iff* the vma is marked by
that "this vma had page pinning" _and_ the page count is bigger than
GUP_PIN_COUNTING_BIAS, then I think we can rest pretty easily knowing
that we aren't going to hit some regular old-fashioned UNIX server
cases with a lot of forks..
Linus
On Thu, Sep 17, 2020 at 11:11:06AM -0700, Linus Torvalds wrote:
> (a) if the pinner is going to change the page, it will have to get
> the pin with FOLL_WRITE in addition to FOLL_PIN
>
> (b) that means it will do the COW and mark the page writable and dirty
Yep
> (c) the whole _point_ of the FOLL_PIN is that subsequent operations
> shouldn't make it non-writable any more (ie it can't be unmapped, and
> we should synchronize on fork etc)
It is the ideal, but FOLL_PIN has been troubled for a long time:
https://lwn.net/Articles/753027/
ie writeprotect is known to happen due to writeback. I had understood
that fork() will also cause write protect. Eg
copy_process()
copy_mm()
dup_mm()
dup_mmap()
copy_page_range()
[..]
copy_one_pte()
Gets to:
if (is_cow_mapping(vm_flags) && pte_write(pte)) {
ptep_set_wrprotect(src_mm, addr, src_pte);
pte = pte_wrprotect(pte);
}
Which blindly write protects a FOLL_PIN page. Here src_pte will be
under a pin.
I *think* the failing test is basically:
1) pin_user_pages(mem, FOLL_FORCE | FOLL_WRITE)
2) pid = fork()
3) child: does a bit, then exec
4) parent: waitpid(pid)
5) *mem = 0
Here #5 is the WP-triggered COW. We know a WP-triggered COW is
happening from the bisect and from the successful result with MADV_DONTFORK.
#2 triggers the dup_mmap() and the ptep_set_wrprotect() (From
inspection, at least)
This "Trial do_wp_page() simplification" patch means that when #5 goes
to do COW it gets a copy instead of a re-use because the
reuse_swap_page() was aborting the copy before.
So, to your point, yes ideally FOLL_PIN would never write-protect
pages!
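(From the user-space side the failing pattern is roughly the sketch below; the
pin in step 1 really happens in-kernel from a driver, so a placeholder comment
stands in for whatever registration ioctl/verb triggers it:)

#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	char *mem = malloc(4096);
	pid_t pid;

	/* 1) driver pins mem: pin_user_pages(mem, FOLL_FORCE | FOLL_WRITE) */

	pid = fork();			/* 2) dup_mmap() write-protects the PTE */
	if (pid == 0) {
		execlp("true", "true", (char *)NULL);	/* 3) does a bit, then exec */
		_exit(1);
	}
	waitpid(pid, NULL, 0);		/* 4) parent waits, child is gone */

	*mem = 0;			/* 5) WP fault; with the simplification
					 *    patch the parent gets a fresh copy,
					 *    decoupled from the DMA-pinned page */
	return 0;
}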
Looking for awhile, this now looks reasonable and
doable. page_maybe_dma_pinned() was created for exactly this kind of
case.
I've attached a dumb sketch for the pte level (surely wrong! I have
never looked at this part of the mm before!) at the end of this
message.
Peter, do you know this better? Does this inspire you to make a patch? :)
Would really love to see this. We have such a huge expensive mess with
MADV_DONTFORK, this would eliminate all that overhead.
Thanks,
Jason
diff --git a/mm/memory.c b/mm/memory.c
index 469af373ae76e1..6bc19a43da1391 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -689,6 +689,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
}
#endif
+static int duplicate_page(pte_t *newpte, struct vm_area_struct *vma,
+ unsigned long address, struct page *page)
+{
+ struct page *new_page;
+
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ if (!new_page)
+ return -ENOMEM;
+ copy_user_highpage(new_page, page, address, vma);
+
+ /* FIXME: surely more than this */
+ *newpte = mk_pte(new_page, vma->vm_page_prot);
+ return 0;
+}
+
/*
* copy one vm_area from one task to the other. Assumes the page tables
* already present in the new task to be cleared in the whole range
@@ -703,6 +718,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
unsigned long vm_flags = vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
+ bool do_src_wp;
/* pte contains position in swap or file, so copy. */
if (unlikely(!pte_present(pte))) {
@@ -775,15 +791,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_set_pte;
}
- /*
- * If it's a COW mapping, write protect it both
- * in the parent and the child
- */
- if (is_cow_mapping(vm_flags) && pte_write(pte)) {
- ptep_set_wrprotect(src_mm, addr, src_pte);
- pte = pte_wrprotect(pte);
- }
-
/*
* If it's a shared mapping, mark it clean in
* the child
@@ -800,11 +807,34 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (!(vm_flags & VM_UFFD_WP))
pte = pte_clear_uffd_wp(pte);
+ do_src_wp = is_cow_mapping(vm_flags) && pte_write(pte);
page = vm_normal_page(vma, addr, pte);
if (page) {
get_page(page);
page_dup_rmap(page, false);
rss[mm_counter(page)]++;
+
+ /*
+ * If a page is DMA pinned we never want to write protect it,
+ * copy it now.
+ */
+ if (do_src_wp && page_maybe_dma_pinned(page)) {
+ do_src_wp = false;
+ ret = duplicate_page(&pte, vma, addr, page);
+ if (ret)
+ /* FIXME: need to restructure a bit to handle this */
+ return ret;
+ }
+ }
+
+ /*
+ * If it's a COW mapping, write protect it both
+ * in the parent and the child
+ * FIXME check carefully that this new order is OK
+ */
+ if (do_src_wp) {
+ ptep_set_wrprotect(src_mm, addr, src_pte);
+ pte = pte_wrprotect(pte);
}
out_set_pte:
On Thu, Sep 17, 2020 at 12:42:11PM -0700, Linus Torvalds wrote:
> Because the whole "do page pinning without MADV_DONTFORK and then fork
> the area" is I feel a very very invalid load. It sure as hell isn't
> something we should care about performance for, and in fact it is
> something we should very well warn for exactly to let people know
> "this process is doing bad things".
It is easy for things like io_uring that can just allocate the queue
memory they care about and MADV_DONTFORK it.
Other things work more like O_DIRECT - the data it is working on is
arbitrary app memory, not controlled in any way.
In RDMA we have this ugly scheme where we automatically call
MADV_DONTFORK on the virtual address and hope it doesn't explode. It
is very hard to call MADV_DONTFORK if you don't control the
allocation. Don't want to break huge pages, have to hope really really
hard that a fork doesn't need that memory. Hope you don't run out of
vmas because it causes a vma split. So ugly. So much overhead.
Considering almost anything can do a fork() - we've seen app authors
become confused. They say stuff is busted, support folks ask if they
use fork, author says no.. Investigation later shows some hidden
library did system() or whatever.
In this case the tests that found this failed because they were
written in Python and buried in there was some subprocess.call().
I would prefer the kernel consider it a valid workload with the
semantics the sketch patch shows..
> Is there possibly somethign else we can filter on than just
> GUP_PIN_COUNTING_BIAS? Because it could be as simple as just marking
> the vma itself and saying "this vma has had a page pinning event done
> on it".
We'd have to give up pin_user_pages_fast() to do that as we can't fast
walk and get vmas?
Hmm, there are many users. I remember that the hfi1 folks really
wanted the fast version for some reason..
> Because if we only start copying the page *iff* the vma is marked by
> that "this vma had page pinning" _and_ the page count is bigger than
> GUP_PIN_COUNTING_BIAS, than I think we can rest pretty easily knowing
> that we aren't going to hit some regular old-fashioned UNIX server
> cases with a lot of forks..
Agree
Given that this is a user-visible regression and it is nearly rc6, what
do you prefer for next steps?
Sorting this out for fork, especially if it includes the vma change, is
probably more than a week's time.
Revert this patch and try again next cycle?
Thanks,
Jason
On 9/17/20 1:06 PM, Jason Gunthorpe wrote:
> On Thu, Sep 17, 2020 at 12:42:11PM -0700, Linus Torvalds wrote:
...
>> Is there possibly somethign else we can filter on than just
>> GUP_PIN_COUNTING_BIAS? Because it could be as simple as just marking
>> the vma itself and saying "this vma has had a page pinning event done
>> on it".
>
> We'd have to give up pin_user_pages_fast() to do that as we can't fast
> walk and get vmas?
oops, yes. I'd forgotten about that point. Basically all of the O_DIRECT
callers need _fast. This is a big problem.
thanks,
--
John Hubbard
NVIDIA
>
> Hmm, there are many users. I remember that the hfi1 folks really
> wanted the fast version for some reason..
>
>> Because if we only start copying the page *iff* the vma is marked by
>> that "this vma had page pinning" _and_ the page count is bigger than
>> GUP_PIN_COUNTING_BIAS, than I think we can rest pretty easily knowing
>> that we aren't going to hit some regular old-fashioned UNIX server
>> cases with a lot of forks..
>
> Agree
>
> Given that this is a user visible regression, it is nearly rc6, what
> do you prefer for next steps?
>
> Sorting out this for fork, especially if it has the vma change is
> probably more than a weeks time.
>
> Revert this patch and try again next cycle?
>
> Thanks,
> Jason
>
On Thu, Sep 17, 2020 at 01:19:02PM -0700, John Hubbard wrote:
> On 9/17/20 1:06 PM, Jason Gunthorpe wrote:
> > On Thu, Sep 17, 2020 at 12:42:11PM -0700, Linus Torvalds wrote:
> ...
> > > Is there possibly somethign else we can filter on than just
> > > GUP_PIN_COUNTING_BIAS? Because it could be as simple as just marking
> > > the vma itself and saying "this vma has had a page pinning event done
> > > on it".
> >
> > We'd have to give up pin_user_pages_fast() to do that as we can't fast
> > walk and get vmas?
>
> oops, yes. I'd forgotten about that point. Basically all of the O_DIRECT
> callers need _fast. This is a big problem.
What about an atomic counter in the mm_struct?
# of pages currently under pin.
The use case Linus is worried about will never have any pins, so it
would be 0 and it could skip this entire test.
Jason
On Thu, Sep 17, 2020 at 1:06 PM Jason Gunthorpe <[email protected]> wrote:
>
> Given that this is a user visible regression, it is nearly rc6, what
> do you prefer for next steps?
Since I had to deal with the page lock performance regression too, and
I think we have a fairly simple way forward, I'd rather do an rc8 and
get this done with, than revert and know we have to deal with it
anyway.
Particularly since I think this would _finally_ make page pinning make
sense. In its current state, it's just yet another broken GUP that
doesn't actually fix things. But if we have the semantics that page
pinning really fixes things in the page tables, I think we now have a
proper solution to this and a _much_ cleaner model for it.
If it means that page pinning can also stop worrying about
MADV_DONTFORK any more, that would be an added API bonus.
For that to happen, we'd need to have the vma flag so that we wouldn't
have any worry about non-pinners, but as you suggested, I think even
just an mm-wide counter - or flag - to deal with the fast-gup case is
likely perfectly sufficient.
Yes, you'd still take a hit on fork, but only the actual process that
did pinning would take that hit. And *if* that ends up being a big
deal, the MADV_DONTFORK can be used to avoid it, rather than be a
correctness issue.
It really feels like this should be just "tens of lines of fairly
simple code", and it would clarify our GUP/PIN/COW rules enormously.
Linus
On Thu, Sep 17, 2020 at 01:35:56PM -0700, Linus Torvalds wrote:
> For that to happen, we'd need to have the vma flag so that we wouldn't
> have any worry about non-pinners, but as you suggested, I think even
> just a mm-wide counter - or flag - to deal with the fast-bup case is
> likely perfectly sufficient.
Would mm_struct.pinned_vm suffice?
Though now I'm not 100% certain whether all pin_user_pages*() callers are
accounting pinned_vm correctly. My gut feeling is that bc3e53f682d93df67
did not convert all the old locked_vm users that do page pinning (e.g.,
mm_iommu_do_alloc, vaddr_get_pfn, etc.; I didn't try harder).
--
Peter Xu
On Thu, Sep 17, 2020 at 05:40:59PM -0400, Peter Xu wrote:
> On Thu, Sep 17, 2020 at 01:35:56PM -0700, Linus Torvalds wrote:
> > For that to happen, we'd need to have the vma flag so that we wouldn't
> > have any worry about non-pinners, but as you suggested, I think even
> > just a mm-wide counter - or flag - to deal with the fast-bup case is
> > likely perfectly sufficient.
>
> Would mm_struct.pinned_vm suffice?
I think that could be a good long term goal
IIRC last time we dug into the locked_vm vs pinned_vm mess it didn't
get fixed. There is a mix of both kinds, as you saw, and some
resistance I don't clearly remember to changing it.
My advice for this -rc fix is to go with a single bit in the mm_struct
set on any call to pin_user_pages*
Then users using pin_user_pages and forking are the only ones who
would ever do extra COW on fork. I think that is OK for -rc, this
workload should be rare due to the various historical issues. Anyhow,
a slowdown regression is better than an "it is broken" regression.
This can be improved into a counter later. Due to the pinned_vm
accounting all call sites should have the mm_struct at unpin, but I
have a feeling it will take a lot of driver patches to sort it all
out.
Jason
On Thu, Sep 17, 2020 at 3:09 PM Jason Gunthorpe <[email protected]> wrote:
>
> My advice for this -rc fix is to go with a single bit in the mm_struct
> set on any call to pin_user_pages*
Ack, except make sure it's a byte rather than a bitfield that could
have races. Or even just a separate atomic_t.
Keep it simple and stupid and obvious. As you say, we can aim for
cleanups later, make it obvious and reliable right now.
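(The bitfield concern is roughly this - illustrative sketch only, not real
mm_struct layout:)

	/*
	 * Bitfields packed into one word are updated with a read-modify-write
	 * of the whole word, so concurrent writers to *different* bits can
	 * lose each other's updates:
	 */
	struct racy_sketch {
		unsigned has_pinned:1;
		unsigned some_other_flag:1;	/* written from another path */
	};

	/* A plain int (or atomic_t) written with WRITE_ONCE() avoids that: */
	struct safe_sketch {
		int has_pinned;		/* WRITE_ONCE(mm->has_pinned, 1) */
	};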
Linus
On Thu, Sep 17, 2020 at 07:09:00PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 17, 2020 at 05:40:59PM -0400, Peter Xu wrote:
> > On Thu, Sep 17, 2020 at 01:35:56PM -0700, Linus Torvalds wrote:
> > > For that to happen, we'd need to have the vma flag so that we wouldn't
> > > have any worry about non-pinners, but as you suggested, I think even
> > > just a mm-wide counter - or flag - to deal with the fast-bup case is
> > > likely perfectly sufficient.
> >
> > Would mm_struct.pinned_vm suffice?
>
> I think that could be a good long term goal
>
> IIRC last time we dug into the locked_vm vs pinned_vm mess it didn't
> get fixed. There is a mix of both kinds, as you saw, and some
> resistance I don't clearly remember to changing it.
>
> My advice for this -rc fix is to go with a single bit in the mm_struct
> set on any call to pin_user_pages*
>
> Then only users using pin_user_pages and forking are the only ones who
> would ever do extra COW on fork. I think that is OK for -rc, this
> workload should be rare due to the various historical issues. Anyhow,
> a slow down regression is better than a it is broken regression.
>
> This can be improved into a counter later. Due to the pinned_vm
> accounting all call sites should have the mm_struct at unpin, but I
> have a feeling it will take a alot of driver patches to sort it all
> out.
Agreed. The HFI1 driver for example increments/decrements pinned_vm on its
own. I've kind of always felt dirty for that...
I think long term it would be better to move this accounting to
pin_user_pages() but Jason is correct that that is going to be too
complex for an rc.
Could we move pinned_vm out of the drivers/rdma subsystem?
Ira
>
> Jason
>
On Thu 17-09-20 15:48:57, Ira Weiny wrote:
> On Thu, Sep 17, 2020 at 07:09:00PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 17, 2020 at 05:40:59PM -0400, Peter Xu wrote:
> > > On Thu, Sep 17, 2020 at 01:35:56PM -0700, Linus Torvalds wrote:
> > > > For that to happen, we'd need to have the vma flag so that we wouldn't
> > > > have any worry about non-pinners, but as you suggested, I think even
> > > > just a mm-wide counter - or flag - to deal with the fast-bup case is
> > > > likely perfectly sufficient.
> > >
> > > Would mm_struct.pinned_vm suffice?
> >
> > I think that could be a good long term goal
> >
> > IIRC last time we dug into the locked_vm vs pinned_vm mess it didn't
> > get fixed. There is a mix of both kinds, as you saw, and some
> > resistance I don't clearly remember to changing it.
> >
> > My advice for this -rc fix is to go with a single bit in the mm_struct
> > set on any call to pin_user_pages*
> >
> > Then only users using pin_user_pages and forking are the only ones who
> > would ever do extra COW on fork. I think that is OK for -rc, this
> > workload should be rare due to the various historical issues. Anyhow,
> > a slow down regression is better than a it is broken regression.
> >
> > This can be improved into a counter later. Due to the pinned_vm
> > accounting all call sites should have the mm_struct at unpin, but I
> > have a feeling it will take a alot of driver patches to sort it all
> > out.
>
> Agreed. The HFI1 driver for example increments/decrements pinned_vm on it's
> own. I've kind of always felt dirty for that...
>
> I think long term it would be better to move this accounting to
> pin_user_pages() but Jason is correct that I think that is going to be too
> complex for an rc.
Moving accounting to pin_user_pages() won't be simple because you need to
unaccount on unpin. And that can happen from a different task context (e.g.
IRQ handler for direct IO) so we won't have proper mm_struct available.
> Could we move pinned_vm out of the drivers/rdma subsystem?
I'd love to because IMO it's a mess...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu 17-09-20 19:09:00, Jason Gunthorpe wrote:
> On Thu, Sep 17, 2020 at 05:40:59PM -0400, Peter Xu wrote:
> > On Thu, Sep 17, 2020 at 01:35:56PM -0700, Linus Torvalds wrote:
> > > For that to happen, we'd need to have the vma flag so that we wouldn't
> > > have any worry about non-pinners, but as you suggested, I think even
> > > just a mm-wide counter - or flag - to deal with the fast-bup case is
> > > likely perfectly sufficient.
> >
> > Would mm_struct.pinned_vm suffice?
>
> I think that could be a good long term goal
>
> IIRC last time we dug into the locked_vm vs pinned_vm mess it didn't
> get fixed. There is a mix of both kinds, as you saw, and some
> resistance I don't clearly remember to changing it.
>
> My advice for this -rc fix is to go with a single bit in the mm_struct
> set on any call to pin_user_pages*
>
> Then only users using pin_user_pages and forking are the only ones who
> would ever do extra COW on fork. I think that is OK for -rc, this
> workload should be rare due to the various historical issues. Anyhow,
> a slow down regression is better than a it is broken regression.
Agreed. I really like the solution of not write-protecting pinned pages on
fork(2).
> This can be improved into a counter later. Due to the pinned_vm
> accounting all call sites should have the mm_struct at unpin, but I
> have a feeling it will take a alot of driver patches to sort it all
> out.
I somewhat fear that some of the users of pin_user_pages() don't bother
with pinned_vm accounting exactly because they don't have mm_struct on
unpin...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, Sep 17, 2020 at 03:03:32PM -0400, Peter Xu wrote:
> Another side effect I can think of is that we'll bring some uncertainty to
> fork() starting from when page_maybe_dma_pinned() is used, since it's sometimes
> bogus (hpage_pincount_available()==false) so some COWs might be triggered
> during fork() even when not necessary if we've got some normal pages with too
> many refcounts (over GUP_PIN_COUNTING_BIAS). But assuming that's not a big
> deal since it should be extremely rare, or is it?..
Looking at this a bit more.. A complete implementation will have to
touch all four places doing write protect during fork:
copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
unsigned long addr, int *rss)
{
[..]
if (is_cow_mapping(vm_flags) && pte_write(pte)) {
ptep_set_wrprotect(src_mm, addr, src_pte);
pte = pte_wrprotect(pte);
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma)
{
[..]
pmdp_set_wrprotect(src_mm, addr, src_pmd);
pmd = pmd_mkold(pmd_wrprotect(pmd));
int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
struct vm_area_struct *vma)
{
[..]
pudp_set_wrprotect(src_mm, addr, src_pud);
pud = pud_mkold(pud_wrprotect(pud));
int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma)
{
[..]
if (cow) {
huge_ptep_set_wrprotect(src, addr, src_pte);
As a regression I'm pretty sure we will hit only the PTE and PMD
cases.
Most likely the other two could be done outside the rc cycle
Jason
On Thu, Sep 17, 2020 at 12:51:49PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2020 at 12:38 PM Jason Gunthorpe <[email protected]> wrote:
> >
> > Looking for awhile, this now looks reasonable and
> > doable. page_maybe_dma_pinned() was created for exactly this kind of
> > case.
> >
> > I've attached a dumb sketch for the pte level (surely wrong! I have
> > never looked at this part of the mm before!) at the end of this
> > message.
>
> This looks conceptually fine to me.
>
> But as mentioned, I think I'd be even happier if we added a "thsi vma
> has seen a page pin event" flag to the vma flags, and didn't rely
> _just_ on the page_maybe_dma_pinned() check, which migth be triggered
> by those fork-happy loads.
>
> Side note: I wonder if that COW mapping check could be entirely within
> that vm_normal_page() path.
>
> Because how could a non-normal page be a COW page and not already
> write-protected?
I feel like we still need to keep those to cover the !page case for PFNMAP.
I tried to draft a patch, however I do see some other issues, so I'd like to
discuss them here before mindlessly going on with it.
Firstly in the draft patch mm->has_pinned is introduced and it's written to 1
as long as FOLL_GUP is called once. It's never reset after set.
One more thing (I think) we need to do is to pass the new vma from
copy_page_range() all the way down, because if we want to start COW during
fork() then we need to operate on that new vma too when a new page gets linked
to it, rather than on the parent's.
One issue is when we charge for cgroup we probably can't do that onto the new
mm/task, since copy_namespaces() is called after copy_mm(). I don't know
enough about cgroup, I thought the child will inherit the parent's, but I'm not
sure. Or, can we change that order of copy_namespaces() && copy_mm()? I don't
see a problem so far but I'd like to ask first..
The other thing is on how to fail. E.g., when COW failed due to either
charging of cgroup or ENOMEM, ideally we should fail fork() too. Though that
might need more changes - current patch silently kept the shared page for
simplicity.
The draft patch is attached for reference. It should cover the small pages,
but it still needs to cover huge pages and also needs a well-written commit log.
Thanks,
--
Peter Xu
On Fri, Sep 18, 2020 at 9:40 AM Peter Xu <[email protected]> wrote:
>
> Firstly in the draft patch mm->has_pinned is introduced and it's written to 1
> as long as FOLL_GUP is called once. It's never reset after set.
That's fine. That was what I was expecting you to do. It only needs to
be cleared at mm creation time (fork/exec), and I see you added that
in mm_init() already.
Since this only matters for fork(), and is really just a filter for
the (very) special behavior, and people who _actually_ do the page
pinning won't likely be mixing that with thousands of forks anyway, it
really doesn't matter.
It's literally just about "I'm a normal process, I've never done any
special rdma page pinning, let me fork a lot with no overhead" flag.
The only change I'd make is to make it an "int" and put it next to the
"int map_count", since that will pack better on 64-bit (assuming you
don't do the randomize_layout thing, in which case it's all moot).
Even if we were to expand it to an actual page count, I'm not
convinced we'd ever want anybody to pin more than 2 billion pages.
That's a minimum of 8 TB of RAM. Even if that were physically possible
on some machines, it doesn't seem reasonable.
So even if that flag were to ever become an actual count, more than 32
bits seems pointless and wrong.
And as a flag, it most certainly doesn't need "unsigned long".
> One more thing (I think) we need to do is to pass the new vma from
> copy_page_range() down into the end because if we want to start cow during
> fork() then we need to operate on that new vma too when new page linked to it
> rather than the parent's.
Ahh. Because you pass the new vma down to the rmap routines.
I actually think it's unnecessary, because all the rmap routines
*really* care about is not the vma, but the anon_vma associated with
it. Which will be the same for the parent and the child.
But we'd probably have to change the calling convention for rmap for
that to be obvious, so your solution seems ok. Maybe not optimal, but
I think we're going for "let's make things as clear as possible"
rather than optimal right now.
My main worry here is that it makes the calls really ugly, and we
generally try to avoid having that many arguments, but it was bad
before, and these are generally inlined, so changing it to use an
argument structure wouldn't even help code generation.
So it's not pretty. But it is what it is.
> One issue is when we charge for cgroup we probably can't do that onto the new
> mm/task, since copy_namespaces() is called after copy_mm().
That cannot possibly matter as far as I can see.
Copying the page in between those two calls is already possible since
we've already dropped the mmap_lock by the time copy_namespaces() is
called. So if the parent was threaded, and another thread did a write
access, the parent would have caused that COW that we did early.
And any page copying cost should be to the parent anyway, since that
is who did the pinning that caused the copy in the first place.
So for both of those reasons - the COW can already happen between
copy_mm() and copy_namespaces(), *and* charging it to the parent
namespace is proper anyway - I think your cgroup worry is not
relevant.
I'm not even sure anything relevant to accounting is created, but my
point is that even if it is, I don't see how it could be an issue.
> The other thing is on how to fail. E.g., when COW failed due to either
> charging of cgroup or ENOMEM, ideally we should fail fork() too. Though that
> might need more changes - current patch silently kept the shared page for
> simplicity.
We already can fail forking due to memory allocations failing. Again,
not an issue. It happens.
The only real thing to worry about would be that this doesn't affect
normal programs, and that mm flag takes care of that.
Linus
On Fri, Sep 18, 2020 at 12:40:32PM -0400, Peter Xu wrote:
> Firstly in the draft patch mm->has_pinned is introduced and it's written to 1
> as long as FOLL_GUP is called once. It's never reset after set.
Worth thinking about also adding FOLL_LONGTERM here, at least as long
as it is not a counter. That further limits the impact.
> One more thing (I think) we need to do is to pass the new vma from
> copy_page_range() down into the end because if we want to start cow during
> fork() then we need to operate on that new vma too when new page linked to it
> rather than the parent's.
Makes sense to me
> One issue is when we charge for cgroup we probably can't do that onto the new
> mm/task, since copy_namespaces() is called after copy_mm(). I don't know
> enough about cgroup, I thought the child will inherit the parent's, but I'm not
> sure. Or, can we change that order of copy_namespaces() && copy_mm()? I don't
> see a problem so far but I'd like to ask first..
Know nothing about cgroups, but I would have guessed that the page
table allocations would want to be in the cgroup too, is the struct
page a different bucket?
> The other thing is on how to fail. E.g., when COW failed due to either
> charging of cgroup or ENOMEM, ideally we should fail fork() too. Though that
> might need more changes - current patch silently kept the shared page for
> simplicity.
I didn't notice anything tricky here.. Something a bit gross but
simple seemed workable:
@@ -852,7 +852,7 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
continue;
}
entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
- vma, addr, rss);
+ vma, addr, rss, &err);
if (entry.val)
break;
progress += 8;
@@ -865,6 +865,9 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_unmap_unlock(orig_dst_pte, dst_ptl);
cond_resched();
+ if (err)
+ return err;
+
if (entry.val) {
if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
return -ENOMEM;
It is not really any different from add_swap_count_continuation()
failure, which already works..
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 496c3ff97cce..b3812fa6383f 100644
> +++ b/include/linux/mm_types.h
> @@ -458,6 +458,7 @@ struct mm_struct {
>
> unsigned long total_vm; /* Total pages mapped */
> unsigned long locked_vm; /* Pages that have PG_mlocked set */
> + unsigned long has_pinned; /* Whether this mm has pinned any page */
Can be unsigned int or smaller, if there is a better hole in mm_struct
> diff --git a/mm/gup.c b/mm/gup.c
> index e5739a1974d5..cab10cefefe4 100644
> +++ b/mm/gup.c
> @@ -1255,6 +1255,17 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
> BUG_ON(*locked != 1);
> }
>
> + /*
> + * Mark the mm struct if there's any page pinning attempt. We're
> + * aggresive on this bit since even if the pinned pages were unpinned
> + * later on, we'll still keep this bit set for this address space just
> + * to make everything easy.
> + *
> + * TODO: Ideally we can use mm->pinned_vm but only until it's stable.
> + */
> + if (flags & FOLL_PIN)
> + WRITE_ONCE(mm->has_pinned, 1);
This should probably be its own commit, here is a stab at a commit
message:
Reduce the chance of false positive from page_maybe_dma_pinned() by
keeping track if the mm_struct has ever been used with
pin_user_pages(). mm_structs that have never been passed to
pin_user_pages() cannot have a positive page_maybe_dma_pinned() by
definition. This allows cases that might drive up the page ref_count
to avoid any penalty from handling dma_pinned pages.
Due to complexities with unpinning this trivial version is a permanent
sticky bit, future work will be needed to make this a counter.
> +/*
> + * Do early cow for the page and the pte. Return true if page duplicate
> + * succeeded, false otherwise.
> + */
> +static bool duplicate_page(struct mm_struct *mm, struct vm_area_struct *vma,
Suggest calling 'vma' 'new' here for consistency
> + unsigned long address, struct page *page,
> + pte_t *newpte)
> +{
> + struct page *new_page;
> + pte_t entry;
> +
> + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> + if (!new_page)
> + return false;
> +
> + copy_user_highpage(new_page, page, address, vma);
> + if (mem_cgroup_charge(new_page, mm, GFP_KERNEL)) {
> + put_page(new_page);
> + return false;
> + }
> + cgroup_throttle_swaprate(new_page, GFP_KERNEL);
> + __SetPageUptodate(new_page);
It looks like these GFP flags can't be GFP_KERNEL, this is called
inside the pte_alloc_map_lock() which is a spinlock
One thought is to lift this logic out to around
add_swap_count_continuation()? Would need some serious rework to be
able to store the dst_pte though.
Can't help about the cgroup question
Jason
On Fri, Sep 18, 2020 at 10:16:19AM -0700, Linus Torvalds wrote:
> The only change I'd make is to make it a "int" and put it next to the
> "int map_count", since that will pack better on 64-bit (assuming you
> don't do the randomize_layout thing, in which case it's all moot).
Will do.
[...]
> > One issue is when we charge for cgroup we probably can't do that onto the new
> > mm/task, since copy_namespaces() is called after copy_mm().
>
> That cannot possibly matter as far as I can see.
>
> Copying the page in between those two calls is already possible since
> we've already dropped the mmap_lock by the time copy_namespaces() is
> called. So if the parent was threaded, and another thread did a write
> access, the parent would have caused that COW that we did early.
>
> And any page copying cost should be to the parent anyway, since that
> is who did the pinning that caused the copy in the first place.
>
> So for both of those reasons - the COW can already happen between
> copy_mm() and copy_namespaces(), *and* charging it to the parent
> namespace is proper anyway - I think your cgroup worry is not
> relevant.
>
> I'm not even sure anything relevant to accounting is created, but my
> point is that even if it is, I don't see how it could be an issue.
The parent process should be fine to do any COW and do its accounting right,
which I agree with. But the new COW that we're adding here is for the child
process rather than the parent.
I'm just afraid the parent process's accounting could grow very large after
it pins some pages and fork()s a few more times, since those extra charges
will accumulate even after the children die, if I'm not wrong...
Actually I'm now inclined to move copy_namespaces() to before copy_mm().. I know
nothing about namespaces, however copy_namespaces() seems to be
quite self-contained. But I'm always ready for a "no, you can't"...
>
> > The other thing is on how to fail. E.g., when COW failed due to either
> > charging of cgroup or ENOMEM, ideally we should fail fork() too. Though that
> > might need more changes - current patch silently kept the shared page for
> > simplicity.
>
> We already can fail forkind due to memory allocations failing. Again,
> not an issue. It happens.
Yes. I didn't change this only because I thought it could save quite a few
lines of code. However, now that I've noticed this patch will probably
grow bigger no matter what, I'm not worrying about this any more..
--
Peter Xu
On Fri, Sep 18, 2020 at 02:32:40PM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 18, 2020 at 12:40:32PM -0400, Peter Xu wrote:
>
> > Firstly in the draft patch mm->has_pinned is introduced and it's written to 1
> > as long as FOLL_GUP is called once. It's never reset after set.
>
> Worth thinking about also adding FOLL_LONGTERM here, at last as long
> as it is not a counter. That further limits the impact.
But theoretically we should also trigger COW here for pages even with PIN &&
!LONGTERM, am I right? Assuming that FOLL_PIN is already a corner case.
> > One issue is when we charge for cgroup we probably can't do that onto the new
> > mm/task, since copy_namespaces() is called after copy_mm(). I don't know
> > enough about cgroup, I thought the child will inherit the parent's, but I'm not
> > sure. Or, can we change that order of copy_namespaces() && copy_mm()? I don't
> > see a problem so far but I'd like to ask first..
>
> Know nothing about cgroups, but I would have guessed that the page
> table allocations would want to be in the cgroup too, is the struct
> page a different bucket?
Good question... I feel like this kind of accounting was always done to
"current" via alloc_page(). But frankly speaking I don't know whether I
understand it right, because afaict "current" is the parent during fork(), while
I feel it would make more sense if it were accounted to the child process. I
think I must have missed something important but I can't tell..
>
> > The other thing is on how to fail. E.g., when COW failed due to either
> > charging of cgroup or ENOMEM, ideally we should fail fork() too. Though that
> > might need more changes - current patch silently kept the shared page for
> > simplicity.
>
> I didn't notice anything tricky here.. Something a bit gross but
> simple seemed workable:
>
> @@ -852,7 +852,7 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> continue;
> }
> entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
> - vma, addr, rss);
> + vma, addr, rss, &err);
> if (entry.val)
> break;
> progress += 8;
> @@ -865,6 +865,9 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pte_unmap_unlock(orig_dst_pte, dst_ptl);
> cond_resched();
>
> + if (err)
> + return err;
> +
> if (entry.val) {
> if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
> return -ENOMEM;
>
> It is not really any different from add_swap_count_continuation()
> failure, which already works..
Yes it's not pretty, but I do plan to use something like this to avoid touching
all the return paths in copy_one_pte(), and I think the answer to the last
question matters too, below.
> > diff --git a/mm/gup.c b/mm/gup.c
> > index e5739a1974d5..cab10cefefe4 100644
> > +++ b/mm/gup.c
> > @@ -1255,6 +1255,17 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
> > BUG_ON(*locked != 1);
> > }
> >
> > + /*
> > + * Mark the mm struct if there's any page pinning attempt. We're
> > + * aggresive on this bit since even if the pinned pages were unpinned
> > + * later on, we'll still keep this bit set for this address space just
> > + * to make everything easy.
> > + *
> > + * TODO: Ideally we can use mm->pinned_vm but only until it's stable.
> > + */
> > + if (flags & FOLL_PIN)
> > + WRITE_ONCE(mm->has_pinned, 1);
>
> This should probably be its own commit, here is a stab at a commit
> message:
>
> Reduce the chance of false positive from page_maybe_dma_pinned() by
> keeping track if the mm_struct has ever been used with
> pin_user_pages(). mm_structs that have never been passed to
> pin_user_pages() cannot have a positive page_maybe_dma_pinned() by
> definition. This allows cases that might drive up the page ref_count
> to avoid any penalty from handling dma_pinned pages.
>
> Due to complexities with unpining this trivial version is a permanent
> sticky bit, future work will be needed to make this a counter.
Thanks for writing this. I'll keep the commit message for when the patch is
split for formal posting. Before that, I hope it's fine that I'll still use a
single patch for simplicity, because I still want to keep the discussion within
the thread.
>
> > +/*
> > + * Do early cow for the page and the pte. Return true if page duplicate
> > + * succeeded, false otherwise.
> > + */
> > +static bool duplicate_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
> Suggest calling 'vma' 'new' here for consistency
OK.
>
> > + unsigned long address, struct page *page,
> > + pte_t *newpte)
> > +{
> > + struct page *new_page;
> > + pte_t entry;
> > +
> > + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> > + if (!new_page)
> > + return false;
> > +
> > + copy_user_highpage(new_page, page, address, vma);
> > + if (mem_cgroup_charge(new_page, mm, GFP_KERNEL)) {
> > + put_page(new_page);
> > + return false;
> > + }
> > + cgroup_throttle_swaprate(new_page, GFP_KERNEL);
> > + __SetPageUptodate(new_page);
>
> It looks like these GFP flags can't be GFP_KERNEL, this is called
> inside the pte_alloc_map_lock() which is a spinlock
>
> One thought is to lift this logic out to around
> add_swap_count_continuation()? Would need some serious rework to be
> able to store the dst_pte though.
What would be the result if we simply used GFP_ATOMIC? Would there be too many
pages to allocate in bulk for ATOMIC? IMHO slowness would be fine, but I don't
know the internals of page allocation, and I'm not sure whether
__GFP_KSWAPD_RECLAIM means we might kick kswapd, and whether we'll deadlock if
kswapd could potentially try to take the spinlock again somewhere while we're
waiting for it?
It would be good to go this (easy) way considering this is a very rarely
triggered path, so we can still keep copy_one_pte() simple. Otherwise I seem to
have no choice but to move the page copy logic out of copy_one_pte(), as you
suggested.
--
Peter Xu
On 9/18/20 1:40 PM, Peter Xu wrote:
> On Fri, Sep 18, 2020 at 02:32:40PM -0300, Jason Gunthorpe wrote:
>> On Fri, Sep 18, 2020 at 12:40:32PM -0400, Peter Xu wrote:
>>
>>> Firstly in the draft patch mm->has_pinned is introduced and it's written to 1
>>> as long as FOLL_GUP is called once. It's never reset after set.
>>
>> Worth thinking about also adding FOLL_LONGTERM here, at last as long
>> as it is not a counter. That further limits the impact.
>
> But theoritically we should also trigger COW here for pages even with PIN &&
> !LONGTERM, am I right? Assuming that FOLL_PIN is already a corner case.
>
This note, plus Linus' comment about "I'm a normal process, I've never
done any special rdma page pinning", has me a little worried. Because
page_maybe_dma_pinned() is counting both short- and long-term pins,
actually. And that includes O_DIRECT callers.
O_DIRECT pins are short-term, and RDMA systems are long-term (and should
be setting FOLL_LONGTERM). But there's no way right now to discern
between them, once the initial pin_user_pages*() call is complete. All
we can do today is to count the number of FOLL_PIN calls, not the number
of FOLL_PIN | FOLL_LONGTERM calls.
The reason it's that way is that writeback and such can experience
problems regardless of the duration of the pin. There are ideas about
how to deal with the pins and the filesystem (layout leases...) but
still disagreement, which is why there are basically no
page_maybe_dma_pinned() callers yet.
Although I think we're getting closer to using it. There was a recent
attempt at using this stuff, from Chris Wilson. [1]
[1] https://lore.kernel.org/intel-gfx/20200624191417.16735-1-chris%40chris-wilson.co.uk/
thanks,
--
John Hubbard
NVIDIA
On Fri, Sep 18, 2020 at 1:41 PM Peter Xu <[email protected]> wrote:
>
> What would be the result if we simply use GFP_ATOMIC? Would there be too many
> pages to allocate in bulk for ATOMIC?
It's very easy to run out of memory with GFP_ATOMIC, and also cause
various nasty issues with networking (ie when you've depleted atomic
memory, networking starts losing packets etc).
So yeah, this code should not use GFP_ATOMIC, I think it just needs to
drop and re-take the page table lock.
Which is easy enough to do: returning a zero 'entry.val' already does
that for other reasons, there's nothing magical about holding the lock
here, there's no larger page table lock requirement.
The main annoyance is that I think it means that copy_pte_range()
would also have to have a special "preallocation page" thing for this
case, so that it can drop the lock, do the allocation, and then take
the lock again and return 0 (to repeat - now with the preallocation
filled).
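(Roughly that shape, as an illustrative fragment only - the -EAGAIN value and
the way prealloc is threaded through are hypothetical, not a real patch:)

	struct page *prealloc = NULL;
	int ret;
again:
	/* pte_offset_map_lock() etc: page table spinlock held from here */
	ret = 0;
	/* loop over the ptes; a modified copy_one_pte(..., &prealloc) would
	 * consume prealloc when it hits a page_maybe_dma_pinned() page, or
	 * set ret = -EAGAIN and break out if prealloc is still NULL */
	/* pte_unmap_unlock(): spinlock dropped here */

	if (ret == -EAGAIN) {
		prealloc = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
		if (!prealloc)
			return -ENOMEM;
		goto again;	/* redo the batch, now with a page in hand */
	}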
Honestly, if we had a completely *reliable* sign of "this page is
pinned", then I think the much nicer option would be to just say
"pinned pages will not be copied at all". Kind of an implicit
VM_DONTCOPY.
(Or we'd do the reverse, and say "pinned pages stay pinned even in the child").
But that's not an option when the pinning test is a "maybe". Oh well.
Linus
On Fri, Sep 18, 2020 at 02:06:23PM -0700, John Hubbard wrote:
> On 9/18/20 1:40 PM, Peter Xu wrote:
> > On Fri, Sep 18, 2020 at 02:32:40PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Sep 18, 2020 at 12:40:32PM -0400, Peter Xu wrote:
> > >
> > > > Firstly in the draft patch mm->has_pinned is introduced and it's written to 1
> > > > as long as FOLL_GUP is called once. It's never reset after set.
> > >
> > > Worth thinking about also adding FOLL_LONGTERM here, at last as long
> > > as it is not a counter. That further limits the impact.
> >
> > But theoritically we should also trigger COW here for pages even with PIN &&
> > !LONGTERM, am I right? Assuming that FOLL_PIN is already a corner case.
> >
>
> This note, plus Linus' comment about "I'm a normal process, I've never
> done any special rdma page pinning", has me a little worried. Because
> page_maybe_dma_pinned() is counting both short- and long-term pins,
> actually. And that includes O_DIRECT callers.
>
> O_DIRECT pins are short-term, and RDMA systems are long-term (and should
> be setting FOLL_LONGTERM). But there's no way right now to discern
> between them, once the initial pin_user_pages*() call is complete. All
> we can do today is to count the number of FOLL_PIN calls, not the number
> of FOLL_PIN | FOLL_LONGTERM calls.
My thinking is that to hit this issue you have to already be doing
FOLL_LONGTERM, and if some driver hasn't been properly marked and
regresses, the fix is to mark it.
Remember, this use case requires the pin to extend after a system
call, past another fork() system call, and still have data-coherence.
IMHO that can only happen in the FOLL_LONGTERM case as it inherently
means the lifetime of the pin is being controlled by userspace, not by
the kernel. Otherwise userspace could not cause new DMA touches after
fork.
Explaining it like that makes me pretty confident it is the right
thing to do, at least for a single bit.
Yes, if we figure out how to do a counter, then the counter can be
everything, but for now, as an rc regression fix, let us limit the
number of impacted cases. Don't need to worry about the unpin problem
because it is never undone.
Jason
On Fri, Sep 18, 2020 at 01:59:41PM -0700, Linus Torvalds wrote:
> Honestly, if we had a completely *reliable* sign of "this page is
> pinned", then I think the much nicer option would be to just say
> "pinned pages will not be copied at all". Kind of an implicit
> VM_DONTCOPY.
It would be simpler to implement, but it makes the programming model
really sketchy. For instance O_DIRECT is using FOLL_PIN, so imagine
this program:
    CPU0                                     CPU1
    a = malloc(1024);
    b = malloc(1024);
    read(fd, a, 1024); // FD is O_DIRECT
    ...                                      fork()
    *b = ...
    read completes
Here a and b got lucky and both come from the same page due to the
allocator.
In this case the fork() child on CPU1 would be very surprised that
'b' was not mapped into the fork.
Similarly, CPU0 would have silent data corruption if the read didn't
deposit data into 'a' - which is a bug we have today. In this race the
COW break of *b might leave the physical page with the child, and *a
won't see the data. For this reason, John is right, fork needs to
eventually do this for O_DIRECT as well.
The copy on fork nicely fixes all of this weird oddball stuff.
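(The "both come from the same page" precondition is easy to see from user
space - a tiny illustration, not part of the failing test:)

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	char *a = malloc(1024);
	char *b = malloc(1024);

	/* Two small allocations very often share a page, so a COW break
	 * caused by writing *b also moves the page that 'a' points into. */
	printf("a=%p b=%p same page: %s\n", (void *)a, (void *)b,
	       (uintptr_t)a / pagesz == (uintptr_t)b / pagesz ? "yes" : "no");
	free(a);
	free(b);
	return 0;
}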
Jason
On Fri 18-09-20 21:01:53, Jason Gunthorpe wrote:
> On Fri, Sep 18, 2020 at 02:06:23PM -0700, John Hubbard wrote:
> > On 9/18/20 1:40 PM, Peter Xu wrote:
> > > On Fri, Sep 18, 2020 at 02:32:40PM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Sep 18, 2020 at 12:40:32PM -0400, Peter Xu wrote:
> > > >
> > > > > Firstly in the draft patch mm->has_pinned is introduced and it's written to 1
> > > > > as long as FOLL_GUP is called once. It's never reset after set.
> > > >
> > > > Worth thinking about also adding FOLL_LONGTERM here, at last as long
> > > > as it is not a counter. That further limits the impact.
> > >
> > > But theoritically we should also trigger COW here for pages even with PIN &&
> > > !LONGTERM, am I right? Assuming that FOLL_PIN is already a corner case.
> > >
> >
> > This note, plus Linus' comment about "I'm a normal process, I've never
> > done any special rdma page pinning", has me a little worried. Because
> > page_maybe_dma_pinned() is counting both short- and long-term pins,
> > actually. And that includes O_DIRECT callers.
> >
> > O_DIRECT pins are short-term, and RDMA systems are long-term (and should
> > be setting FOLL_LONGTERM). But there's no way right now to discern
> > between them, once the initial pin_user_pages*() call is complete. All
> > we can do today is to count the number of FOLL_PIN calls, not the number
> > of FOLL_PIN | FOLL_LONGTERM calls.
>
> My thinking is to hit this issue you have to already be doing
> FOLL_LONGTERM, and if some driver hasn't been properly marked and
> regresses, the fix is to mark it.
>
> Remember, this use case requires the pin to extend after a system
> call, past another fork() system call, and still have data-coherence.
>
> IMHO that can only happen in the FOLL_LONGTERM case as it inhernetly
> means the lifetime of the pin is being controlled by userspace, not by
> the kernel. Otherwise userspace could not cause new DMA touches after
> fork.
I agree that the new aggressive COW behavior is probably causing issues
only for FOLL_LONGTERM users. That being said it would be nice if even
ordinary threaded FOLL_PIN users would not have to be that careful about
fork(2) and possible data loss due to COW - we certainly had reports of
O_DIRECT IO losing data due to fork(2) and COW exactly because it is very
subtle how it behaves... But as I wrote above this is not urgent since that
problematic behavior exists since the beginning of O_DIRECT IO in Linux.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, Sep 21, 2020 at 10:35:05AM +0200, Jan Kara wrote:
> > My thinking is to hit this issue you have to already be doing
> > FOLL_LONGTERM, and if some driver hasn't been properly marked and
> > regresses, the fix is to mark it.
> >
> > Remember, this use case requires the pin to extend after a system
> > call, past another fork() system call, and still have data-coherence.
> >
> > IMHO that can only happen in the FOLL_LONGTERM case as it inhernetly
> > means the lifetime of the pin is being controlled by userspace, not by
> > the kernel. Otherwise userspace could not cause new DMA touches after
> > fork.
>
> I agree that the new aggressive COW behavior is probably causing issues
> only for FOLL_LONGTERM users. That being said it would be nice if even
> ordinary threaded FOLL_PIN users would not have to be that careful about
> fork(2) and possible data loss due to COW - we had certainly reports of
> O_DIRECT IO loosing data due to fork(2) and COW exactly because it is very
> subtle how it behaves... But as I wrote above this is not urgent since that
> problematic behavior exists since the beginning of O_DIRECT IO in Linux.
Yes, I agree - what I was thinking is to do just FOLL_LONGTERM for the
rc and then a small patch to make it wider for the next cycle so it
can be tested in linux-next for a reasonable time period.
Interesting to hear you confirm that the block layer has seen subtle user
problems with this as well.
Jason
[Cc Tejun and Christian - this is a part of a larger discussion which is
not directly related to this particular question so let me trim the
original email to the bare minimum.]
On Fri 18-09-20 12:40:32, Peter Xu wrote:
[...]
> One issue is when we charge for cgroup we probably can't do that onto the new
> mm/task, since copy_namespaces() is called after copy_mm(). I don't know
> enough about cgroup, I thought the child will inherit the parent's, but I'm not
> sure. Or, can we change that order of copy_namespaces() && copy_mm()? I don't
> see a problem so far but I'd like to ask first..
I suspect you are referring to CLONE_INTO_CGROUP, right? I have only now
learned about this feature so I am not deeply familiar with all the
details and I might be easily wrong. Normally all the cgroup aware
resources are accounted to the parent's cgroup. For memcg that includes
all the page tables, early CoW and other allocations with __GFP_ACCOUNT.
IIUC CLONE_INTO_CGROUP properly then this hasn't changed as the child is
associated to its new cgroup (and memcg) only in cgroup_post_fork. If
that is correct then we might have quite a lot of resources bound to
child's lifetime but accounted to the parent's memcg which can lead to
all sorts of interesting problems (e.g. unreclaimable memory - even by
the oom killer).
Christian, Tejun is this the expected semantic or I am just misreading
the code?
--
Michal Hocko
SUSE Labs
Hi, Michal,
On Mon, Sep 21, 2020 at 03:42:00PM +0200, Michal Hocko wrote:
> [Cc Tejun and Christian - this is a part of a larger discussion which is
> not directly related to this particular question so let me trim the
> original email to the bare minimum.]
>
> On Fri 18-09-20 12:40:32, Peter Xu wrote:
> [...]
> > One issue is when we charge for cgroup we probably can't do that onto the new
> > mm/task, since copy_namespaces() is called after copy_mm(). I don't know
> > enough about cgroup, I thought the child will inherit the parent's, but I'm not
> > sure. Or, can we change that order of copy_namespaces() && copy_mm()? I don't
> > see a problem so far but I'd like to ask first..
>
> I suspect you are referring to CLONE_INTO_CGROUP, right?
Thanks for raising this question up to a broader audience.
I was not referring to that explicitly, but it would definitely be good to know
about this new feature before I post the (probably stupid :) patch. Because I noticed
it's only done in cgroup_can_fork() or later, so it's definitely also later
than copy_mm() even if I do the move.
> I have only now
> learned about this feature so I am not deeply familiar with all the
> details and I might be easily wrong. Normally all the cgroup aware
> resources are accounted to the parent's cgroup. For memcg that includes
> all the page tables, early CoW and other allocations with __GFP_ACCOUNT.
> IIUC CLONE_INTO_CGROUP properly then this hasn't changed as the child is
> associated to its new cgroup (and memcg) only in cgroup_post_fork. If
> that is correct then we might have quite a lot of resources bound to
> child's lifetime but accounted to the parent's memcg which can lead to
> all sorts of interesting problems (e.g. unreclaimable memory - even by
> the oom killer).
Right, that's one of the things I'm confused about too: if we always
account to the parent, then when the child quits, do we uncharge them or
not, and how.. I'm also not sure whether the parent's accounting could
steadily grow as it continues to fork().
So is it by design that we account all these to the parents?
>
> Christian, Tejun is this the expected semantic or I am just misreading
> the code?
I'll try to summarize the question here too - what we'd like to do right now is
to do cgroup accounting (e.g., mem_cgroup_charge()) upon the newly created
process within copy_mm(). A quick summary of why we want to do this is to
"trigger early copy-on-write for pinned pages during fork()".
Initially I thought moving copy_mm() to be after copy_namespaces() would be the
right thing to do, but now I highly doubt it..
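For context, the relevant ordering in copy_process() is roughly the below
(paraphrased, not the exact code):

	/*
	 *   copy_mm(clone_flags, p);          <- early COW / charging would
	 *                                        happen in here
	 *   copy_namespaces(clone_flags, p);
	 *   ...
	 *   cgroup_can_fork(p, args);         <- child's css_set picked here
	 *   ...
	 *   cgroup_post_fork(p, args);        <- child attached to its cgroup
	 */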
Thanks,
--
Peter Xu
On Mon 21-09-20 10:18:30, Peter Xu wrote:
> Hi, Michal,
>
> On Mon, Sep 21, 2020 at 03:42:00PM +0200, Michal Hocko wrote:
[...]
> > I have only now
> > learned about this feature so I am not deeply familiar with all the
> > details and I might be easily wrong. Normally all the cgroup aware
> > resources are accounted to the parent's cgroup. For memcg that includes
> > all the page tables, early CoW and other allocations with __GFP_ACCOUNT.
> > IIUC CLONE_INTO_CGROUP properly then this hasn't changed as the child is
> > associated to its new cgroup (and memcg) only in cgroup_post_fork. If
> > that is correct then we might have quite a lot of resources bound to
> > child's lifetime but accounted to the parent's memcg which can lead to
> > all sorts of interesting problems (e.g. unreclaimable memory - even by
> > the oom killer).
>
> Right that's one of the things that I'm confused too, on that if we always
> account to the parent, then when the child quits whether we uncharge them or
> not, and how.. Not sure whether the accounting of the parent could steadily
> grow as it continues the fork()s.
>
> So is it by design that we account all these to the parents?
Let me try to clarify a bit further my concern. Without CLONE_INTO_CGROUP
this makes some sense. Because both parent and child will live in
the same cgroup. All the charges are reference counted so they will
be released when the respective resource gets freed (e.g. page table
released or the backing page dropped) irrespective of the current cgroup
the owner is living in.
Fundamentally CLONE_INTO_CGROUP is similar to regular fork + move to the
target cgroup after the child gets executed. So in principle there
shouldn't be any big difference. Except that the move has to be explicit
and the child has to have enough privileges to move itself. I am not
completely sure about the CLONE_INTO_CGROUP model though. According to man
clone(2) it seems that O_RDONLY for the target cgroup directory is
sufficient. That seems much more relaxed IIUC and it would allow forking
into a different cgroup while keeping a lot of resources charged to the
parent proper.
--
Michal Hocko
SUSE Labs
Hello,
On Mon, Sep 21, 2020 at 04:28:34PM +0200, Michal Hocko wrote:
> Fundamentaly CLONE_INTO_CGROUP is similar to regular fork + move to the
> target cgroup after the child gets executed. So in principle there
> shouldn't be any big difference. Except that the move has to be explicit
> and the the child has to have enough privileges to move itself. I am not
Yeap, they're supposed to be the same operations. We've never clearly
defined how the accounting gets split across moves because 1. it's
inherently blurry and difficult 2. doesn't make any practical difference for
the recommended and vast majority usage pattern which uses migration to seed
the new cgroup. CLONE_INTO_CGROUP doesn't change any of that.
> completely sure about CLONE_INTO_CGROUP model though. According to man
> clone(2) it seems that O_RDONLY for the target cgroup directory is
> sufficient. That seems much more relaxed IIUC and it would allow to fork
> into a different cgroup while keeping a lot of resources in the parent's
> proper.
If the man page is documenting that, it's wrong. cgroup_css_set_fork() has
an explicit cgroup_may_write() test on the destination cgroup.
CLONE_INTO_CGROUP should follow exactly the same rules as regular
migrations.
Thanks.
--
tejun
On Mon, Sep 21, 2020 at 03:42:00PM +0200, Michal Hocko wrote:
> [Cc Tejun and Christian - this is a part of a larger discussion which is
> not directly related to this particular question so let me trim the
> original email to the bare minimum.]
>
> On Fri 18-09-20 12:40:32, Peter Xu wrote:
> [...]
> > One issue is when we charge for cgroup we probably can't do that onto the new
> > mm/task, since copy_namespaces() is called after copy_mm(). I don't know
> > enough about cgroup, I thought the child will inherit the parent's, but I'm not
> > sure. Or, can we change that order of copy_namespaces() && copy_mm()? I don't
> > see a problem so far but I'd like to ask first..
>
> I suspect you are referring to CLONE_INTO_CGROUP, right? I have only now
> learned about this feature so I am not deeply familiar with all the
> details and I might be easily wrong. Normally all the cgroup aware
> resources are accounted to the parent's cgroup. For memcg that includes
> all the page tables, early CoW and other allocations with __GFP_ACCOUNT.
> IIUC CLONE_INTO_CGROUP properly then this hasn't changed as the child is
> associated to its new cgroup (and memcg) only in cgroup_post_fork. If
> that is correct then we might have quite a lot of resources bound to
> child's lifetime but accounted to the parent's memcg which can lead to
> all sorts of interesting problems (e.g. unreclaimable memory - even by
> the oom killer).
>
> Christian, Tejun is this the expected semantic or I am just misreading
> the code?
Hey Michal,
Thanks for the Cc!
If I understand your question correctly, then you are correct. The logic
is split in three simple parts:
1. Child gets created and doesn't live in any cset
   - This should mean that resources are still charged against the
     parent's memcg, which is what you're asking afaiu.
2. cgroup_can_fork()
   - create new or find existing matching cset for the child
3. cgroup_post_fork()
   - move/attach child to the new or found cset
_Purely from a CLONE_INTO_CGROUP perspective_ you should be ok to
reverse the order of copy_mm() and copy_namespaces().
Christian
On Mon, Sep 21, 2020 at 10:38:47AM -0400, Tejun Heo wrote:
> Hello,
>
> On Mon, Sep 21, 2020 at 04:28:34PM +0200, Michal Hocko wrote:
> > Fundamentaly CLONE_INTO_CGROUP is similar to regular fork + move to the
> > target cgroup after the child gets executed. So in principle there
> > shouldn't be any big difference. Except that the move has to be explicit
> > and the the child has to have enough privileges to move itself. I am not
>
> Yeap, they're supposed to be the same operations. We've never clearly
> defined how the accounting gets split across moves because 1. it's
> inherently blurry and difficult 2. doesn't make any practical difference for
> the recommended and vast majority usage pattern which uses migration to seed
> the new cgroup. CLONE_INTO_CGROUP doesn't change any of that.
>
> > completely sure about CLONE_INTO_CGROUP model though. According to man
> > clone(2) it seems that O_RDONLY for the target cgroup directory is
> > sufficient. That seems much more relaxed IIUC and it would allow to fork
> > into a different cgroup while keeping a lot of resources in the parent's
> > proper.
>
> If the man page is documenting that, it's wrong. cgroup_css_set_fork() has
> an explicit cgroup_may_write() test on the destination cgroup.
> CLONE_INTO_CGROUP should follow exactly the same rules as regular
> migrations.
Indeed!
The O_RDONLY mention on the manpage doesn't make sense but it is
explained that the semantics are exactly the same for moving via the
filesystem:
"In order to place the child process in a different cgroup, the caller
specifies CLONE_INTO_CGROUP in cl_args.flags and passes a file
descriptor that refers to a version 2 cgroup in the
cl_args.cgroup field. (This file descriptor can be obtained by opening
a cgroup v2 directory using either the O_RDONLY or the O_PATH flag.)
Note that all of the usual restrictions (described in cgroups(7)) on
placing a process into a version 2 cgroup apply."
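For completeness, the usage looks roughly like this (there is no glibc wrapper,
so raw clone3(); hypothetical helper, error handling trimmed):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <signal.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */

	static pid_t fork_into_cgroup(const char *cgrp_path)
	{
		int cgrp_fd = open(cgrp_path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
		struct clone_args args = {
			.flags       = CLONE_INTO_CGROUP,
			.exit_signal = SIGCHLD,
			.cgroup      = cgrp_fd,
		};

		/* 0 in the child (already attached to cgrp_path), pid in parent */
		return syscall(__NR_clone3, &args, sizeof(args));
	}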
Christian
On Mon 21-09-20 16:43:55, Christian Brauner wrote:
> On Mon, Sep 21, 2020 at 10:38:47AM -0400, Tejun Heo wrote:
> > Hello,
> >
> > On Mon, Sep 21, 2020 at 04:28:34PM +0200, Michal Hocko wrote:
> > > Fundamentaly CLONE_INTO_CGROUP is similar to regular fork + move to the
> > > target cgroup after the child gets executed. So in principle there
> > > shouldn't be any big difference. Except that the move has to be explicit
> > > and the the child has to have enough privileges to move itself. I am not
> >
> > Yeap, they're supposed to be the same operations. We've never clearly
> > defined how the accounting gets split across moves because 1. it's
> > inherently blurry and difficult 2. doesn't make any practical difference for
> > the recommended and vast majority usage pattern which uses migration to seed
> > the new cgroup. CLONE_INTO_CGROUP doesn't change any of that.
> >
> > > completely sure about CLONE_INTO_CGROUP model though. According to man
> > > clone(2) it seems that O_RDONLY for the target cgroup directory is
> > > sufficient. That seems much more relaxed IIUC and it would allow to fork
> > > into a different cgroup while keeping a lot of resources in the parent's
> > > proper.
> >
> > If the man page is documenting that, it's wrong. cgroup_css_set_fork() has
> > an explicit cgroup_may_write() test on the destination cgroup.
> > CLONE_INTO_CGROUP should follow exactly the same rules as regular
> > migrations.
>
> Indeed!
> The O_RDONLY mention on the manpage doesn't make sense but it is
> explained that the semantics are exactly the same for moving via the
> filesystem:
OK, if the semantics are the same as for task migration then I do not
see any (new) problems. Care to point me to where the actual check is
enforced? For the migration you need write access to cgroup.procs but
if the API expects a directory fd then I am not sure how that would expose
the same behavior.
--
Michal Hocko
SUSE Labs
On Mon 21-09-20 16:41:34, Christian Brauner wrote:
> On Mon, Sep 21, 2020 at 03:42:00PM +0200, Michal Hocko wrote:
> > [Cc Tejun and Christian - this is a part of a larger discussion which is
> > not directly related to this particular question so let me trim the
> > original email to the bare minimum.]
> >
> > On Fri 18-09-20 12:40:32, Peter Xu wrote:
> > [...]
> > > One issue is when we charge for cgroup we probably can't do that onto the new
> > > mm/task, since copy_namespaces() is called after copy_mm(). I don't know
> > > enough about cgroup, I thought the child will inherit the parent's, but I'm not
> > > sure. Or, can we change that order of copy_namespaces() && copy_mm()? I don't
> > > see a problem so far but I'd like to ask first..
> >
> > I suspect you are referring to CLONE_INTO_CGROUP, right? I have only now
> > learned about this feature so I am not deeply familiar with all the
> > details and I might be easily wrong. Normally all the cgroup aware
> > resources are accounted to the parent's cgroup. For memcg that includes
> > all the page tables, early CoW and other allocations with __GFP_ACCOUNT.
> > IIUC CLONE_INTO_CGROUP properly then this hasn't changed as the child is
> > associated to its new cgroup (and memcg) only in cgroup_post_fork. If
> > that is correct then we might have quite a lot of resources bound to
> > child's lifetime but accounted to the parent's memcg which can lead to
> > all sorts of interesting problems (e.g. unreclaimable memory - even by
> > the oom killer).
> >
> > Christian, Tejun is this the expected semantic or I am just misreading
> > the code?
>
> Hey Michal,
>
> Thanks for the Cc!
>
> If I understand your question correctly, then you are correct. The logic
> is split in three simple parts:
> 1. Child gets created and doesn't live in any cset
> - This should mean that resources are still charged against the
> parent's memcg which is what you're asking afiu.
> 1. cgroup_can_fork()
> - create new or find existing matching cset for the child
> 3. cgroup_post_fork()
> - move/attach child to the new or found cset
>
> _Purely from a CLONE_INTO_CGROUP perspective_ you should be ok to
> reverse the order of copy_mm() and copy_namespaces().
Switching the order wouldn't make much of a difference right. At least
not for memcg where all the accounted allocations will still go to
current's memcg.
--
Michal Hocko
SUSE Labs
On Mon, Sep 21, 2020 at 04:55:37PM +0200, Michal Hocko wrote:
> On Mon 21-09-20 16:43:55, Christian Brauner wrote:
> > On Mon, Sep 21, 2020 at 10:38:47AM -0400, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Mon, Sep 21, 2020 at 04:28:34PM +0200, Michal Hocko wrote:
> > > > Fundamentaly CLONE_INTO_CGROUP is similar to regular fork + move to the
> > > > target cgroup after the child gets executed. So in principle there
> > > > shouldn't be any big difference. Except that the move has to be explicit
> > > > and the the child has to have enough privileges to move itself. I am not
> > >
> > > Yeap, they're supposed to be the same operations. We've never clearly
> > > defined how the accounting gets split across moves because 1. it's
> > > inherently blurry and difficult 2. doesn't make any practical difference for
> > > the recommended and vast majority usage pattern which uses migration to seed
> > > the new cgroup. CLONE_INTO_CGROUP doesn't change any of that.
> > >
> > > > completely sure about CLONE_INTO_CGROUP model though. According to man
> > > > clone(2) it seems that O_RDONLY for the target cgroup directory is
> > > > sufficient. That seems much more relaxed IIUC and it would allow to fork
> > > > into a different cgroup while keeping a lot of resources in the parent's
> > > > proper.
> > >
> > > If the man page is documenting that, it's wrong. cgroup_css_set_fork() has
> > > an explicit cgroup_may_write() test on the destination cgroup.
> > > CLONE_INTO_CGROUP should follow exactly the same rules as regular
> > > migrations.
> >
> > Indeed!
> > The O_RDONLY mention on the manpage doesn't make sense but it is
> > explained that the semantics are exactly the same for moving via the
> > filesystem:
>
> OK, if the semantic is the same as for the task migration then I do not
> see any (new) problems. Care to point me where the actual check is
> enforced? For the migration you need a write access to cgroup.procs but
> if the API expects directory fd then I am not sure how that would expose
> the same behavior.
kernel/cgroup/cgroup.c:cgroup_css_set_fork()
The first check in there is essentially an inode_permission() check:
	/*
	 * Verify that we the target cgroup is writable for us. This is
	 * usually done by the vfs layer but since we're not going through
	 * the vfs layer here we need to do it "manually".
	 */
	ret = cgroup_may_write(dst_cgrp, sb);
	if (ret)
		goto err;

and what you're referring to is checked right after in:

	ret = cgroup_attach_permissions(cset->dfl_cgrp, dst_cgrp, sb,
					!(kargs->flags & CLONE_THREAD));
	if (ret)
		goto err;

which calls:

	ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp, sb);
	if (ret)
		return ret;
That should be what you're looking for. I've also added selftests as
always that verify this behavior under:
tools/testing/selftests/cgroup/
As soon as CLONE_INTO_CGROUP is detected on the kernel, all the
usual tests are exercised using CLONE_INTO_CGROUP as well, so we should
hopefully have caught any regression.
Thanks!
Christian
On Mon 21-09-20 17:04:50, Christian Brauner wrote:
> On Mon, Sep 21, 2020 at 04:55:37PM +0200, Michal Hocko wrote:
> > On Mon 21-09-20 16:43:55, Christian Brauner wrote:
> > > On Mon, Sep 21, 2020 at 10:38:47AM -0400, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Mon, Sep 21, 2020 at 04:28:34PM +0200, Michal Hocko wrote:
> > > > > Fundamentaly CLONE_INTO_CGROUP is similar to regular fork + move to the
> > > > > target cgroup after the child gets executed. So in principle there
> > > > > shouldn't be any big difference. Except that the move has to be explicit
> > > > > and the the child has to have enough privileges to move itself. I am not
> > > >
> > > > Yeap, they're supposed to be the same operations. We've never clearly
> > > > defined how the accounting gets split across moves because 1. it's
> > > > inherently blurry and difficult 2. doesn't make any practical difference for
> > > > the recommended and vast majority usage pattern which uses migration to seed
> > > > the new cgroup. CLONE_INTO_CGROUP doesn't change any of that.
> > > >
> > > > > completely sure about CLONE_INTO_CGROUP model though. According to man
> > > > > clone(2) it seems that O_RDONLY for the target cgroup directory is
> > > > > sufficient. That seems much more relaxed IIUC and it would allow to fork
> > > > > into a different cgroup while keeping a lot of resources in the parent's
> > > > > proper.
> > > >
> > > > If the man page is documenting that, it's wrong. cgroup_css_set_fork() has
> > > > an explicit cgroup_may_write() test on the destination cgroup.
> > > > CLONE_INTO_CGROUP should follow exactly the same rules as regular
> > > > migrations.
> > >
> > > Indeed!
> > > The O_RDONLY mention on the manpage doesn't make sense but it is
> > > explained that the semantics are exactly the same for moving via the
> > > filesystem:
> >
> > OK, if the semantic is the same as for the task migration then I do not
> > see any (new) problems. Care to point me where the actual check is
> > enforced? For the migration you need a write access to cgroup.procs but
> > if the API expects directory fd then I am not sure how that would expose
> > the same behavior.
>
> kernel/cgroup/cgroup.c:cgroup_csset_fork()
>
> So there's which is the first check for inode_permission() essentially:
>
> /*
> * Verify that we the target cgroup is writable for us. This is
> * usually done by the vfs layer but since we're not going through
> * the vfs layer here we need to do it "manually".
> */
> ret = cgroup_may_write(dst_cgrp, sb);
> if (ret)
> goto err;
>
> and what you're referring to is checked right after in:
>
> ret = cgroup_attach_permissions(cset->dfl_cgrp, dst_cgrp, sb,
> !(kargs->flags & CLONE_THREAD));
> if (ret)
> goto err;
>
> which calls:
>
> ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp, sb);
> if (ret)
> return ret;
>
> That should be what you're looking for. I've also added selftests as
> always that verify this behavior under:
>
> tools/testing/selftests/cgroup/
>
> as soon as CLONE_INTO_CGROUP is detected on the kernel than all the
> usual tests are exercised using CLONE_INTO_CGROUP so we should've seen
> any regression hopefully.
Thanks a lot for this clarification! So I believe the only existing bug
is in documentation which should be explicit that the cgroup fd read
access is not sufficient because it also requires write access
to cgroup.procs in the same directory at the time of fork. I will send
a patch if I find some time for that.
Thanks!
--
Michal Hocko
SUSE Labs
On Mon, Sep 21, 2020 at 04:57:38PM +0200, Michal Hocko wrote:
> Switching the order wouldn't make much of a difference right. At least
> not for memcg where all the accounted allocations will still go to
> current's memcg.
I think I'll make it simple to charge the parent directly just like what we do
right now. I do expect that the charged amount could be quite large when
there are a lot of pinned pages in the parent process, however hopefully it's
acceptable as it's still a very rare case, and proper MADV_DONTFORK upon pinned
pages on the parent process will eliminate all these charges too.
I actually already prepared some patches. They're definitely bigger than expected
due to the complexity of holding the page table locks in copy_one_pte() when
trying to break the COW pages. I don't know whether it would still be
appropriate for an rc, especially this late in the rc cycle. Anyway I'll post
them out soon after I've done some basic tests, because I think it's the right
thing to do regardless of when it gets merged.
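The shape of it is roughly the below (hypothetical helper name, only to show
where the complexity comes from; the real patches differ in detail):

	/* in copy_one_pte(), with the page table locks held */
	if (is_cow_mapping(vm_flags) && pte_write(pte) &&
	    unlikely(page_needs_early_copy(src_mm, page))) {
		/*
		 * We can't allocate and copy a page with the PTL held, so
		 * tell the caller to drop the locks, pre-allocate a page
		 * (charged to the parent's memcg), then retry this entry
		 * with the pre-allocated page in hand.
		 */
		return -EAGAIN;
	}
	/* otherwise, the usual write-protect path */
	ptep_set_wrprotect(src_mm, addr, src_pte);
	pte = pte_wrprotect(pte);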
Thanks,
--
Peter Xu
On Mon 21-09-20 18:06:44, Michal Hocko wrote:
[...]
> Thanks a lot for this clarification! So I believe the only existing bug
> is in documentation which should be explicit that the cgroup fd read
> access is not sufficient because it also requires to have a write access
> for cgroup.procs in the same directory at the time of fork. I will send
> a patch if I find some time for that.
I have reread the man page and concluded that the current wording is
not buggy. It is referring to cgroups(7), which has all the information,
but it takes quite some time to drill down to the important point. On the
other hand there are many details (like delegation, namespaces) which
make it quite complex to be concise in clone(2), so it is very likely
better to leave it as it is.
--
Michal Hocko
SUSE Labs
> On Thu, Sep 17, 2020 at 03:03:32PM -0400, Peter Xu wrote:
>
>> Another side effect I can think of is that we'll bring some uncertainty to
>> fork() starting from when page_maybe_dma_pinned() is used, since it's sometimes
>> bogus (hpage_pincount_available()==false) so some COWs might be triggered
>> during fork() even when not necessary if we've got some normal pages with too
>> many refcounts (over GUP_PIN_COUNTING_BIAS). But assuming that's not a big
>> deal since it should be extremely rare, or is it?..
>
> Looking at this a bit more.. A complete implementation will have to
> touch all four places doing write protect during fork:
>
> copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
> unsigned long addr, int *rss)
> {
> [..]
> if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> ptep_set_wrprotect(src_mm, addr, src_pte);
> pte = pte_wrprotect(pte);
>
> int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> struct vm_area_struct *vma)
> {
> [..]
> pmdp_set_wrprotect(src_mm, addr, src_pmd);
> pmd = pmd_mkold(pmd_wrprotect(pmd));
>
> int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
> struct vm_area_struct *vma)
> {
> [..]
> pudp_set_wrprotect(src_mm, addr, src_pud);
> pud = pud_mkold(pud_wrprotect(pud));
>
> int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> struct vm_area_struct *vma)
> {
> [..]
> if (cow) {
> huge_ptep_set_wrprotect(src, addr, src_pte);
>
> As a regression I'm pretty sure we will hit only the PTE and PMD
> cases.
>
> Most likely the other two could be done outside the rc cycle
Hi Peter & Jason,
It seems the hugetlb part was overlooked?
We're testing if the RDMA fork MADV_DONTFORK stuff can be removed on appropriate
kernels, but our tests still fail due to lacking explicit huge pages support [1].
Peter, was it left unchanged on purpose?
Are you planning to submit the hugetlb changes as well?
[1] https://github.com/linux-rdma/rdma-core/pull/883#issuecomment-770398171
On Tue, Feb 02, 2021 at 04:40:33PM +0200, Gal Pressman wrote:
> Hi Peter & Jason,
Hi, Gal, Jason,
>
> It seems the hugetlb part was overlooked?
> We're testing if the RDMA fork MADV_DONTFORK stuff can be removed on appropriate
> kernels, but our tests still fail due to lacking explicit huge pages support [1].
I didn't think it was high priority, only because I think most hugetlbfs users
should be using it shared, but maybe I'm wrong.. Then it got lost indeed.
But I definitely agree it would be better to have it.
>
> Peter, was it left unchanged on purpose?
> Are you planning to submit the hugetlb changes as well?
>
> [1] https://github.com/linux-rdma/rdma-core/pull/883#issuecomment-770398171
>
I'll prepare a fix soon and post it. Thanks,
--
Peter Xu
On Tue, Feb 02, 2021 at 11:31:27AM -0500, Peter Xu wrote:
> On Tue, Feb 02, 2021 at 04:40:33PM +0200, Gal Pressman wrote:
> > Hi Peter & Jason,
>
> Hi, Gal, Jason,
>
> >
> > It seems the hugetlb part was overlooked?
> > We're testing if the RDMA fork MADV_DONTFORK stuff can be removed on appropriate
> > kernels, but our tests still fail due to lacking explicit huge pages support [1].
>
> I didn't think it high priority only because I think most hugetlbfs users
> should be using it shared, but maybe I'm wrong.. Then it got lost indeed.
It turns out people are doing this:
mmap(NULL, SEND_BUFF_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0)
Which makes some sense...
Gal, you could also MADV_DONTFORK this range if you are explicitly
allocating them via special mmap.
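Something along these lines (hypothetical helper, error handling trimmed); with
MADV_DONTFORK the range is simply never part of the child's mm, so fork() has
nothing to COW or copy there:

	#include <stddef.h>
	#include <sys/mman.h>

	static void *alloc_send_buf(size_t size)
	{
		void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

		if (buf == MAP_FAILED)
			return NULL;
		if (madvise(buf, size, MADV_DONTFORK)) {
			munmap(buf, size);
			return NULL;
		}
		/* now safe to register/pin and DMA into across a later fork() */
		return buf;
	}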
Jason
On Tue, Feb 02, 2021 at 12:44:20PM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 02, 2021 at 11:31:27AM -0500, Peter Xu wrote:
> > On Tue, Feb 02, 2021 at 04:40:33PM +0200, Gal Pressman wrote:
> > > Hi Peter & Jason,
> >
> > Hi, Gal, Jason,
> >
> > >
> > > It seems the hugetlb part was overlooked?
> > > We're testing if the RDMA fork MADV_DONTFORK stuff can be removed on appropriate
> > > kernels, but our tests still fail due to lacking explicit huge pages support [1].
> >
> > I didn't think it high priority only because I think most hugetlbfs users
> > should be using it shared, but maybe I'm wrong.. Then it got lost indeed.
>
> It turns out people are doing this:
>
> mmap(NULL, SEND_BUFF_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0)
>
> Which makes some sense...
Yes, thanks Jason. Though my understanding is that hugetlb pages are normally
reserved in production, used with careful pre-provisioning of which app would
consume how much (either 2M or 1G). Such an app (especially if it forks
randomly) could actually easily exhaust the huge page pool.
>
> Gal, you could also MADV_DONTFORK this range if you are explicitly
> allocating them via special mmap.
Yeah I wanted to mention this one too but I just forgot when replying: the
previously pasted issue thread smells like some people would like to drop
MADV_DONTFORK, but if it can still be applied I don't know why not.. It
should still be better than depending on the coming patch imho - it marks the
region as "not necessary for the fork" so we skip the whole hugetlbfs chunk.
It should at least be more efficient where applicable.
Thanks,
--
Peter Xu
On Tue, Feb 02, 2021 at 12:05:36PM -0500, Peter Xu wrote:
> > Gal, you could also MADV_DONTFORK this range if you are explicitly
> > allocating them via special mmap.
>
> Yeah I wanted to mention this one too but I just forgot when reply: the issue
> thread previously pasted smells like some people would like to drop
> MADV_DONTFORK, but if it's able to still be applied I don't know why
> not..
I want to drop the MADV_DONTFORK for dynamic data memory allocated by
the application layers (eg with malloc) without knowledge of how they
will be used.
This case is a buffer internal to the communication system that we
know at allocation time how it will be used; so an explicit,
deliberate, MADV_DONTFORK is fine
Jason
> On Tue, Feb 02, 2021 at 12:05:36PM -0500, Peter Xu wrote:
>
>> > Gal, you could also MADV_DONTFORK this range if you are explicitly
>> > allocating them via special mmap.
>>
>> Yeah I wanted to mention this one too but I just forgot when reply: the issue
>> thread previously pasted smells like some people would like to drop
>> MADV_DONTFORK, but if it's able to still be applied I don't know why
>> not..
>
> I want to drop the MADV_DONTFORK for dynamic data memory allocated by
> the application layers (eg with malloc) without knowledge of how they
> will be used.
>
> This case is a buffer internal to the communication system that we
> know at allocation time how it will be used; so an explicit,
> deliberate, MADV_DONTFORK is fine
We are referring to libfabric's bounce buffers, correct?
Libfabric could be considered the "app" here; it's not clear why these
buffers should be DONTFORK'd before ibv_reg_mr() but others shouldn't.
Anyway, it should be simple enough to madvise them after allocation, although I
think it's part of libfabric's generic code (which isn't necessarily used on
top of rdma-core).
I'll take this discussion with the libfabric guys.
Thanks Peter and Jason.
On Wed, Feb 03, 2021 at 02:43:58PM +0200, Gal Pressman wrote:
> > On Tue, Feb 02, 2021 at 12:05:36PM -0500, Peter Xu wrote:
> >
> >> > Gal, you could also MADV_DONTFORK this range if you are explicitly
> >> > allocating them via special mmap.
> >>
> >> Yeah I wanted to mention this one too but I just forgot when reply: the issue
> >> thread previously pasted smells like some people would like to drop
> >> MADV_DONTFORK, but if it's able to still be applied I don't know why
> >> not..
> >
> > I want to drop the MADV_DONTFORK for dynamic data memory allocated by
> > the application layers (eg with malloc) without knowledge of how they
> > will be used.
> >
> > This case is a buffer internal to the communication system that we
> > know at allocation time how it will be used; so an explicit,
> > deliberate, MADV_DONTFORK is fine
>
> We are referring to libfabric's bounce buffers, correct?
> Libfabric could be considered as the "app" here, it's not clear why these
> buffers should be DONTFORK'd before ibv_reg_mr() but others don't.
I assumed they were internal to the EFA code itself.
> Anyway, it should be simple enough to madvise them after allocation, although I
> think it's part of libfabric's generic code (which isn't necessarily used on
> top of rdma-core).
Ah, so that is a reasonable justification for wanting to fix this in
the kernel..
Let's give Peter some time first.
The other direction to validate this approach is to remove the
MAP_HUGETLB flags and rely on THP instead, and/or mark them as
MAP_SHARED.
I'm not sure generic code should be using MAP_HUGETLB..
This would be enough to confirm that everything else is working as
expected
Thanks,
Jason
On 03/02/2021 16:00, Jason Gunthorpe wrote:
> On Wed, Feb 03, 2021 at 02:43:58PM +0200, Gal Pressman wrote:
>>> On Tue, Feb 02, 2021 at 12:05:36PM -0500, Peter Xu wrote:
>>>
>>>>> Gal, you could also MADV_DONTFORK this range if you are explicitly
>>>>> allocating them via special mmap.
>>>>
>>>> Yeah I wanted to mention this one too but I just forgot when reply: the issue
>>>> thread previously pasted smells like some people would like to drop
>>>> MADV_DONTFORK, but if it's able to still be applied I don't know why
>>>> not..
>>>
>>> I want to drop the MADV_DONTFORK for dynamic data memory allocated by
>>> the application layers (eg with malloc) without knowledge of how they
>>> will be used.
>>>
>>> This case is a buffer internal to the communication system that we
>>> know at allocation time how it will be used; so an explicit,
>>> deliberate, MADV_DONTFORK is fine
>>
>> We are referring to libfabric's bounce buffers, correct?
>> Libfabric could be considered as the "app" here, it's not clear why these
>> buffers should be DONTFORK'd before ibv_reg_mr() but others don't.
>
> I assumed they were internal to the EFA code itself.
The hugepage allocation is part of libfabric's generic bufpool implementation:
https://github.com/ofiwg/libfabric/blob/cde8665ca5ec2fb957260490d0c8700d8ac69863/include/linux/osd.h#L64
I guess we could madvise them at the libfabric provider's layer.
>> Anyway, it should be simple enough to madvise them after allocation, although I
>> think it's part of libfabric's generic code (which isn't necessarily used on
>> top of rdma-core).
>
> Ah, so that is a reasonable justification for wanting to fix this in
> the kernel..
>
> Lets give Peter some time first.
>
> The other direction to validate this approach is to remove the
> MAP_HUGETLB flags and rely on THP instead, and/or mark them as
> MAP_SHARED.
>
> I'm not sure generic code should be use using MAP_HUGETLB..
It's using MAP_HUGETLB but has a fallback in case it fails:
	ret = ofi_alloc_hugepage_buf((void **) &buf_region->alloc_region,
				     pool->alloc_size);
	/* If we can't allocate huge pages, fall back to normal
	 * allocations for all future attempts.
	 */
	if (ret) {
		pool->attr.flags &= ~OFI_BUFPOOL_HUGEPAGES;
		goto retry;
	}

	buf_region->flags = OFI_BUFPOOL_HUGEPAGES;
> This would be enough to confirm that everything else is working as
> expected
Agree.
On Wed, Feb 16, 2022 at 6:59 PM Oded Gabbay <[email protected]> wrote:
>
> On Mon, Sep 21, 2020 at 3:03 PM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Mon, Sep 21, 2020 at 10:35:05AM +0200, Jan Kara wrote:
> > > > My thinking is to hit this issue you have to already be doing
> > > > FOLL_LONGTERM, and if some driver hasn't been properly marked and
> > > > regresses, the fix is to mark it.
> > > >
> > > > Remember, this use case requires the pin to extend after a system
> > > > call, past another fork() system call, and still have data-coherence.
> > > >
> > > > IMHO that can only happen in the FOLL_LONGTERM case as it inhernetly
> > > > means the lifetime of the pin is being controlled by userspace, not by
> > > > the kernel. Otherwise userspace could not cause new DMA touches after
> > > > fork.
> > >
> > > I agree that the new aggressive COW behavior is probably causing issues
> > > only for FOLL_LONGTERM users. That being said it would be nice if even
> > > ordinary threaded FOLL_PIN users would not have to be that careful about
> > > fork(2) and possible data loss due to COW - we had certainly reports of
> > > O_DIRECT IO loosing data due to fork(2) and COW exactly because it is very
> > > subtle how it behaves... But as I wrote above this is not urgent since that
> > > problematic behavior exists since the beginning of O_DIRECT IO in Linux.
> >
> > Yes, I agree - what I was thinking is to do this FOLL_LONGTERM for the
> > rc and then a small patch to make it wider for the next cycle so it
> > can test in linux-next for a responsible time period.
> >
> > Interesting to hear you confirm block has also seen subtle user
> > problems with this as well.
> >
> > Jason
> >
>
> Hi Jason, Linus,
> Sorry for waking up this thread, but I've filed a bug against this change:
> https://bugzilla.kernel.org/show_bug.cgi?id=215616
>
> In the past week, I've bisected a problem we have in one of our new
> demos running on our Gaudi accelerator, and after a very long
> bisection, I've come to this commit.
> All the details are in the bug, but the bottom line is that somehow,
> this patch causes corruption when the numa balancing feature is
> enabled AND we don't use process affinity AND we use GUP to pin pages
> so our accelerator can DMA to/from system memory.
>
> Either disabling numa balancing, using process affinity to bind to
> specific numa-node or reverting this patch causes the bug to
> disappear.
> I validated the bug and the revert on kernels 5.9, 5.11 and 5.17-rc4.
>
> You can see our GUP code in the driver in get_user_memory() in
> drivers/misc/habanalabs/common/memory.c.
> It is fairly standard and I think I got that line from Daniel (cc'ed
> on this email).
>
> I would appreciate help from the mm experts here to understand how to
> fix this, but it looks as if this simplification caused or exposed
> some race between numa migration code and GUP.
>
> Thanks,
> Oded
Although I wrote it inside the bug, I just wanted to specify here the
exact commit id in the tree:
2020-09-04 - 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 - mm:
do_wp_page() simplification <Linus Torvalds>
Thanks,
Oded
[ Added David Hildenbrand to the participants. David, see
https://bugzilla.kernel.org/show_bug.cgi?id=215616
for details ]
On Wed, Feb 16, 2022 at 8:59 AM Oded Gabbay <[email protected]> wrote:
>
> All the details are in the bug, but the bottom line is that somehow,
> this patch causes corruption when the numa balancing feature is
> enabled AND we don't use process affinity AND we use GUP to pin pages
> so our accelerator can DMA to/from system memory.
Hmm. I thought all the remaining issues were related to THP - and
David Hildenbrand had a series to fix those up.
The fact that it also shows up with numa balancing is a bit
unfortunate, because I think that means that that patch series may not
have caught that case.
That said - what does "we use GUP to pin pages" mean? Does it actually
use the pinning logic, or just regular old GUP?
I'm assuming this is just the existing pin_user_pages_fast() (ie a
proper pin) in drivers/misc/habanalabs/common/memory.c. But I wanted
to confirm that it's not some other situation.
Linus
On Wed, Feb 16, 2022 at 9:04 PM Linus Torvalds
<[email protected]> wrote:
>
> [ Added David Hildenbrand to the participants. David, see
>
> https://bugzilla.kernel.org/show_bug.cgi?id=215616
>
> for details ]
>
> On Wed, Feb 16, 2022 at 8:59 AM Oded Gabbay <[email protected]> wrote:
> >
> > All the details are in the bug, but the bottom line is that somehow,
> > this patch causes corruption when the numa balancing feature is
> > enabled AND we don't use process affinity AND we use GUP to pin pages
> > so our accelerator can DMA to/from system memory.
>
> Hmm. I thought all the remaining issues were related to THP - and
> David Hildenbrand had a series to fix those up.
>
> The fact that it also shows up with numa balancing is a bit
> unfortunate, because I think that means that that patch series may not
> have caught that case.
>
> That said - what does "we use GUP to pin pages" mean? Does it actually
> use the pinning logic, or just regular old GUP?
>
> I'm assuming this is just the existing pin_user_pages_fast() (ie a
> proper pin) in drivers/misc/habanalabs/common/memory.c. But I wanted
> to confirm that it's not some other situation.
>
> Linus
>
[Adding dave hansen as we chatted about it in irc]
It uses the pinning logic, simply calling pin_user_pages_fast with the
relevant gup flags to pin the userspace memory so we can then
DMA-map it and give the bus address to the h/w.
And afaik, the gup flags combination we use
(FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM) is the correct combination,
at least according to the last time it was discussed with Daniel,
Jason and other people.
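Simplified, the call pattern is basically this (trimmed from
get_user_memory(), not the exact driver code):

	rc = pin_user_pages_fast(start_addr, npages,
				 FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
				 userptr->pages);
	if (rc != npages) {
		if (rc > 0)
			unpin_user_pages(userptr->pages, rc);
		return -EFAULT;
	}
	/* the pages are then sg-mapped and the bus addresses given to the h/w */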
Oded
On Mon, Sep 21, 2020 at 3:03 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Mon, Sep 21, 2020 at 10:35:05AM +0200, Jan Kara wrote:
> > > My thinking is to hit this issue you have to already be doing
> > > FOLL_LONGTERM, and if some driver hasn't been properly marked and
> > > regresses, the fix is to mark it.
> > >
> > > Remember, this use case requires the pin to extend after a system
> > > call, past another fork() system call, and still have data-coherence.
> > >
> > > IMHO that can only happen in the FOLL_LONGTERM case as it inhernetly
> > > means the lifetime of the pin is being controlled by userspace, not by
> > > the kernel. Otherwise userspace could not cause new DMA touches after
> > > fork.
> >
> > I agree that the new aggressive COW behavior is probably causing issues
> > only for FOLL_LONGTERM users. That being said it would be nice if even
> > ordinary threaded FOLL_PIN users would not have to be that careful about
> > fork(2) and possible data loss due to COW - we had certainly reports of
> > O_DIRECT IO loosing data due to fork(2) and COW exactly because it is very
> > subtle how it behaves... But as I wrote above this is not urgent since that
> > problematic behavior exists since the beginning of O_DIRECT IO in Linux.
>
> Yes, I agree - what I was thinking is to do this FOLL_LONGTERM for the
> rc and then a small patch to make it wider for the next cycle so it
> can test in linux-next for a responsible time period.
>
> Interesting to hear you confirm block has also seen subtle user
> problems with this as well.
>
> Jason
>
Hi Jason, Linus,
Sorry for waking up this thread, but I've filed a bug against this change:
https://bugzilla.kernel.org/show_bug.cgi?id=215616
In the past week, I've bisected a problem we have in one of our new
demos running on our Gaudi accelerator, and after a very long
bisection, I've come to this commit.
All the details are in the bug, but the bottom line is that somehow,
this patch causes corruption when the numa balancing feature is
enabled AND we don't use process affinity AND we use GUP to pin pages
so our accelerator can DMA to/from system memory.
Either disabling numa balancing, using process affinity to bind to
specific numa-node or reverting this patch causes the bug to
disappear.
I validated the bug and the revert on kernels 5.9, 5.11 and 5.17-rc4.
You can see our GUP code in the driver in get_user_memory() in
drivers/misc/habanalabs/common/memory.c.
It is fairly standard and I think I got that line from Daniel (cc'ed
on this email).
I would appreciate help from the mm experts here to understand how to
fix this, but it looks as if this simplification caused or exposed
some race between numa migration code and GUP.
Thanks,
Oded
On 16.02.22 20:04, Linus Torvalds wrote:
> [ Added David Hildenbrand to the participants. David, see
>
> https://bugzilla.kernel.org/show_bug.cgi?id=215616
>
> for details ]
>
Thanks for sharing.
> On Wed, Feb 16, 2022 at 8:59 AM Oded Gabbay <[email protected]> wrote:
>>
>> All the details are in the bug, but the bottom line is that somehow,
>> this patch causes corruption when the numa balancing feature is
>> enabled AND we don't use process affinity AND we use GUP to pin pages
>> so our accelerator can DMA to/from system memory.
>
> Hmm. I thought all the remaining issues were related to THP - and
> David Hildenbrand had a series to fix those up.
What I shared so far recently [1] was part 1 of my COW fixes to fix the
COW security issues -- missed COW. This fixes 1) of [2].
Part 2 is around fixing "wrong COW" for FOLL_PIN a.k.a. memory
corruption. That's essentially what PageAnonExclusive() will be all
about, making sure that we don't lose synchronicity between GUP and mm
due to a wrong COW. This will fix 3) of [2].
Part 3 is converting O_DIRECT to use FOLL_PIN instead of FOLL_GET to
similarly fix "wrong COW" for O_DIRECT. John is working on that. This
will fix 2) of [2].
>
> The fact that it also shows up with numa balancing is a bit
> unfortunate, because I think that means that that patch series may not
> have caught that case.
>
> That said - what does "we use GUP to pin pages" mean? Does it actually
> use the pinning logic, or just regular old GUP?
If it uses FOLL_PIN it might be handled by part 2, if it uses O_DIRECT
magic it might be covered by part 3. If it's neither of the two, more work might
be needed to convert it to FOLL_PIN, as with the new COW logic we won't
be able to have the same guarantees for FOLL_GET as we'll have for
FOLL_PIN (which is a difference to our original plans to fix it all [3]).
[1] https://lkml.kernel.org/r/[email protected]
[2]
https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/T/#u
--
Thanks,
David / dhildenb