by Peter Xu

[permalink] [raw]

Subject: Re: [PATCH RFC v2 1/2] mm: Don't skip swap entry even if zap_details specified

On Sat, Jan 08, 2022 at 05:19:04PM -0800, Hugh Dickins wrote:
> On Mon, 15 Nov 2021, Peter Xu wrote:
>
> > This check existed since the 1st git commit of Linux repository, but at that
> > time there's no page migration yet so I think it's okay.
>
> //git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
> helps with the history. Yes, the check was okay back then,
> but a lot of changes have come in since: I'll tell more of those below.

Thanks for looking at this. By the way, the link is greatly helpful. It's
always good to be able to read into the history.

>
> You are absolutely right to clean this up and fix the bugs that
> have crept in, but I think the patch itself is not quite right yet.

Do you mean the pmd path on checking mapping? If so I agree, and I'll add that
in v2 (even if as you pointed out that shouldn't be a real bug, iiuc, as you
analyzed below).

Let me know if I missed anything else..

>
> >
> > With page migration enabled, it should logically be possible that we zap some
> > shmem pages during migration.
>
> Yes.
>
> > When that happens, IIUC the old code could have
> > the RSS counter accounted wrong on MM_SHMEMPAGES because we will zap the ptes
> > without decreasing the counters for the migrating entries. I have no unit test
> > to prove it as I don't know an easy way to trigger this condition, though.
>
> In the no-details case, yes, it does look like that. I ought to try
> and reproduce that, but responding to mails seems more important.

Please let me know if you know how to reproduce it, since I don't know yet a
real reproducer.

What I can do, though, is if with below patch applied to current linux master:

=============
diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..51fe02a22ea9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1383,8 +1383,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
}

/* If details->check_mapping, we leave swap entries. */
- if (unlikely(details))
+ if (unlikely(details)) {
+ WARN_ON_ONCE(is_migration_entry(entry));
continue;
+ }

if (!non_swap_entry(entry))
rss[MM_SWAPENTS]--;
=============

Then I can easily trigger it if with the help of a test program attached
(zap-migrate.c).

IMHO it won't really reproduce because it seems all relevant shmem operations
(e.g. hole punch, unmap..) will take the page lock so it won't really race with
migration (which needs the page lock too), as mentioned in previous cover
letters when I was still working on the old versions of this. But it could be
possible that I missed something.

While the WARN_ON_ONCE() can trigger only because of the fast path in hole
punching we have the pre-unmap without lock:

if ((u64)unmap_end > (u64)unmap_start)
unmap_mapping_range(mapping, unmap_start,
1 + unmap_end - unmap_start, 0);
shmem_truncate_range(inode, offset, offset + len - 1);

But that will be fixed up right afterwards in shmem_truncate_range(), afaict,
which is with-lock. Hence we have a small window to at least trigger the
warning above, even if it won't really mess up the accounting finally.

>
> >
> > Besides, the optimization itself is already confusing IMHO to me in a few points:
>
> It is confusing and unnecessary and wrong, I agree.
>
> >
> > - The wording "skip swap entries" is confusing, because we're not skipping all
> > swap entries - we handle device private/exclusive pages before that.
>
> I'm entirely ignorant of device pages, so cannot comment on your 2/2,
> but of course it's good if you can bring the cases closer together.
>
> >
> > - The skip behavior is enabled as long as zap_details pointer passed over.
> > It's very hard to figure that out for a new zap caller because it's unclear
> > why we should skip swap entries when we have zap_details specified.
>
> History below will clarify that.
>
> >
> > - With modern systems, especially performance critical use cases, swap
> > entries should be rare, so I doubt the usefulness of this optimization
> > since it should be on a slow path anyway.
> >
> > - It is not aligned with what we do with huge pmd swap entries, where in
> > zap_huge_pmd() we'll do the accounting unconditionally.
>
> The patch below does not align with what's done in zap_huge_pmd() either;
> but I think zap_huge_pmd() is okay without "details" because its only swap
> entries are migration entries, and we do not use huge pages when COWing
> from file huge pages.
>
> >
> > This patch drops that trick, so we handle swap ptes coherently. Meanwhile we
> > should do the same mapping check upon migration entries too.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > mm/memory.c | 6 ++----
> > 1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8f1de811a1dc..e454f3c6aeb9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1382,16 +1382,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > continue;
> > }
> >
> > - /* If details->check_mapping, we leave swap entries. */
> > - if (unlikely(details))
> > - continue;
> > -
>
> Good.
>
> > if (!non_swap_entry(entry))
> > rss[MM_SWAPENTS]--;
> > else if (is_migration_entry(entry)) {
> > struct page *page;
> >
> > page = pfn_swap_entry_to_page(entry);
> > + if (unlikely(zap_skip_check_mapping(details, page)))
> > + continue;
>
> Good; though I don't think it's worth an "unlikely", and I say again,

Sure, I'll drop this "unlikely". Meanwhile, shall we drop all the rest of the
"unlikely" too around this check?

> more forcefully, that "zap_skip_check_mapping" is a terrible name.
>
> David suggested naming its inversion should_zap_page(), yes, that's
> much better; I'd toyed with do_not_zap_page() and zap_skip_page(),
> but should_zap_page() for the inversion sounds good to me.

Ah sure, I'll rename it to should_zap_page() in a new patch before this.

>
> And I'm pleased to see in one of Matthew's diffs that he intends to
> bring zap_details and zap_skip_check_mapping() back into mm/memory.c
> from include/linux/mm.h.

Yeah it's only used in memory.c. I put it there initially because I wanted
zap_details user can reference what's that check mapping is all about, but
maybe that's not necessary.

If you or Matthew could provide a link to that patch, I'll be happy to pick
that up too with this series. Or I can also do nothing assuming Matthew will
handle it elsewhere.

>
> > rss[mm_counter(page)]--;
> > }
> > if (unlikely(!free_swap_and_cache(entry)))
> > --
> > 2.32.0
>
> History. zap_details came in 2.6.6, and it was mostly to manage
> truncation on the non-linear vmas we had at that time (remap_file_pages
> syscall packing non-sequential offsets into a single vma, with pte_file
> entries), where index could not be deduced from position of pte in vma:
> truncation range had to be passed down in zap_details; and an madvise
> needed it too, so it could not be private to mm/memory.c then.
>
> But at the same time, I added the even_cows argument to
> unmap_mapping_range(), to distinguish truncation (which POSIX requires
> to unmap even COWed pages) from invalidation (for page cache coherence,
> which shouldn't touch private COWs). However, there appear to be no
> users of that in 2.6.6, though I wouldn't have added that complication
> just for the fun of confusing everyone: best guess would be that there
> was parallel development, and the use for !even_cows got removed in
> the very same release that it was being added.
>
> (PageAnon was brand new in 2.6.6: maybe it could have been used instead
> of comparing details->check_mapping, or maybe there's some other case
> I've forgotten which actually needs the exact mapping check.)
>
> Eventually a few drivers came to use unmap_shared_mapping_range(),
> the !even_cows caller; but more to the point, hole-punching came in,
> and I felt very strongly that hole-punching on a file had no right
> to delete private user data. So then !even_cows became useful.
>
> IIRC, I've seen Linus say he *detests* zap_details. It had much better
> justification in the days of non-linear, and I was sad to add
> single_page to it quite recently; but hope that can go away later
> (when try_to_unmap_one() is properly extended to THPs).
>
> Now, here's what may clear up a lot of the confusion.
> The 2.6.7 zap_pte_range() got a line at the head of zap_pte_range()
> if (details && !details->check_mapping && !details->nonlinear_vma)
> details = NULL;
> which paired with the
> /*
> * If details->check_mapping, we leave swap entries;
> * if details->nonlinear_vma, we leave file entries.
> */
> if (unlikely(details))
> continue;
> lower down. I haven't followed up the actual commits, but ChangeLog-2.6.7
> implies that 2.6.6 had a "details = NULL;" placed elsewhere but buggily.
> In 2.6.12 it moved from zap_pte_range() to unmap_page_range().
> It was intended, not so much to optimize, as to simplify the flow;
> but in fact has caused all this confusion.
>
> When Kirill discontinued non-linear mapping support in 4.0 (no tears
> were shed and) the nonlinear_vma comment above got deleted, leaving
> just the then more puzzling check_mapping comment.
>
> Then in 4.6 the "details = NULL;" seems to have got deleted as part of
> aac453635549 ("mm, oom: introduce oom reaper"), which added some new
> details (ignore_dirty and check_swap_entries); which got reverted in
> 4.11, but apparently the "details = NULL;" not restored.
>
> My stamina is suddenly exhausted, without actually pinpointing a commit
> for "Fixes:" in your eventual cleanup. Sorry, I've had enough!

Yeah it's in most cases a pain for digging all these trivial details, thanks
for digging already most of it out of the mist.

That's really what I hope this patch can help: not only because the uffd work
will rely on it, but also on resolving this early (we do use some wordings like
"technical debt" sometimes, I think it's the same here but different form) when
the above "history.git" is still functional so we can still reference.

With your help and the history.git I can try a better commit message because
obviously some of the contents needs amending (it's not a pure optimization at
all), but I assume the patch content will be mostly the same, with the tweaks
applied.

Per stated so far I don't know any real reproducer so maybe it's not a real
issue in any production system? Maybe that'll make it a bit easier, because
then we don't strongly require a Fixes (which could be another hard question to
answer besides the issue itself).

Thanks,

--
Peter Xu

2022-01-12 13:26:17

On Mon, Jan 24, 2022 at 04:54:29PM +0800, Peter Xu wrote:
> I have a quick idea on reproducer now (perhaps file size shrinking on private
> pages being swapped out), I'll try to write a real reproducer and update later.

Here's the reproducer..

===8<===
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <stdio.h>
#include <assert.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>

int page_size;
int shmem_fd;
char *buffer;

void main(void)
{
int ret;
char val;

page_size = getpagesize();
shmem_fd = memfd_create("test", 0);
assert(shmem_fd >= 0);

ret = ftruncate(shmem_fd, page_size * 2);
assert(ret == 0);

buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE,
MAP_PRIVATE, shmem_fd, 0);
assert(buffer != MAP_FAILED);

/* Write private page, swap it out */
buffer[page_size] = 1;
madvise(buffer, page_size * 2, MADV_PAGEOUT);

/* This should drop private buffer[page_size] already */
ret = ftruncate(shmem_fd, page_size);
assert(ret == 0);
/* Recover the size */
ret = ftruncate(shmem_fd, page_size * 2);
assert(ret == 0);

/* Re-read the data, it should be all zero */
val = buffer[page_size];
if (val == 0)
printf("Good\n");
else
printf("BUG\n");
}
===8<===

I'm looking for a correct Fixes for the 1st patch, but afaict it seems not
right even in the initial git commit, which traces back to the "history" git
repo of your commit:

dd9fd0e03de6 ("[PATCH] rmap: nonlinear truncation", 2004-04-17)

Where we have:

/*
* If details->check_mapping, we leave swap entries;
* if details->nonlinear_vma, we leave file entries.
*/
if (unlikely(details))
continue;
if (!pte_file(ptent))
free_swap_and_cache(pte_to_swp_entry(ptent));

As I don't know how it could be right if the user only specified
details->nonlinear_vma but keeping details->check_mapping==NULL.. there, and
there're definitely callers for that, e.g.:

static long madvise_dontneed(struct vm_area_struct * vma,
unsigned long start, unsigned long end)
{
if ((vma->vm_flags & VM_LOCKED) || is_vm_hugetlb_page(vma))
return -EINVAL;

if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
struct zap_details details = {
.nonlinear_vma = vma,
.last_index = ULONG_MAX,
};
zap_page_range(vma, start, end - start, &details);
} else
zap_page_range(vma, start, end - start, NULL);
return 0;
}

So I plan to add the Fixes to the initial git commit, 1da177e4c3f4
("Linux-2.6.12-rc2", 2005-04-16).

Does it sound right to you?

Thanks,

--
Peter Xu