2020-12-09 16:44:08

by Will Deacon

Subject: [PATCH 0/2] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

Hi folks,

This series allows architectures to opt in at runtime to faultaround
mappings being created as 'old' instead of 'young'. Although there have
been previous attempts at this, they failed either because the decision
was deferred to userspace [1] or because it was done unconditionally and
shown to regress benchmarks for particular architectures [2].

Since this change has demonstrated considerable benefits for arm64-based
Android devices with hardware access flag support, patch two enables
this functionality there.

Cheers,

Will

[1] https://www.spinics.net/lists/linux-mm/msg143831.html
[2] 315d09bf30c2 ("Revert "mm: make faultaround produce old ptes"")

Cc: Catalin Marinas <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Vinayak Menon <[email protected]>
Cc: <[email protected]>

--->8

Will Deacon (2):
mm: Allow architectures to request 'old' entries when prefaulting
arm64: mm: Implement arch_wants_old_faultaround_pte()

arch/arm64/include/asm/cpufeature.h | 12 +++++++++++
arch/arm64/include/asm/pgtable.h | 8 +++++++-
include/linux/mm.h | 5 ++++-
mm/memory.c | 31 ++++++++++++++++++++++++++---
4 files changed, 51 insertions(+), 5 deletions(-)

--
2.29.2.576.ga3fc446d84-goog


2020-12-09 16:44:47

by Will Deacon

Subject: [PATCH 2/2] arm64: mm: Implement arch_wants_old_faultaround_pte()

On CPUs with hardware AF/DBM, initialising prefaulted PTEs as 'old'
improves vmscan behaviour and does not appear to introduce any overhead.

Implement arch_wants_old_faultaround_pte() to return 'true' if we detect
hardware access flag support at runtime. This can be extended in future
based on MIDR matching if necessary.

Cc: Catalin Marinas <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
arch/arm64/include/asm/cpufeature.h | 12 ++++++++++++
arch/arm64/include/asm/pgtable.h | 8 +++++++-
2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index da250e4741bd..3424f5881390 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -764,6 +764,18 @@ static inline bool cpu_has_hw_af(void)
ID_AA64MMFR1_HADBS_SHIFT);
}

+static inline bool system_has_hw_af(void)
+{
+ u64 mmfr1;
+
+ if (!IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
+ return false;
+
+ mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
+ return cpuid_feature_extract_unsigned_field(mmfr1,
+ ID_AA64MMFR1_HADBS_SHIFT);
+}
+
#ifdef CONFIG_ARM64_AMU_EXTN
/* Check whether the cpu supports the Activity Monitors Unit (AMU) */
extern bool cpu_has_amu_feat(int cpu);
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 5628289b9d5e..d5c2a7625e9a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -974,7 +974,13 @@ static inline bool arch_faults_on_old_pte(void)

return !cpu_has_hw_af();
}
-#define arch_faults_on_old_pte arch_faults_on_old_pte
+#define arch_faults_on_old_pte arch_faults_on_old_pte
+
+/*
+ * Experimentally, it's cheap to set the access flag in hardware and we
+ * benefit from prefaulting mappings as 'old' to start with.
+ */
+#define arch_wants_old_faultaround_pte system_has_hw_af

#endif /* !__ASSEMBLY__ */

--
2.29.2.576.ga3fc446d84-goog

2020-12-09 16:47:08

by Will Deacon

Subject: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

Commit 5c0a85fad949 ("mm: make faultaround produce old ptes") changed
the "faultaround" behaviour to initialise prefaulted PTEs as 'old',
since this avoids vmscan wrongly assuming that they are hot, despite
having never been explicitly accessed by userspace. The change has been
shown to benefit numerous arm64 micro-architectures (with hardware
access flag) running Android, where both application launch latency and
direct reclaim time are significantly reduced.

Unfortunately, commit 315d09bf30c2 ("Revert "mm: make faultaround produce
old ptes"") reverted the change due to it being identified as the cause of a
~6% regression in unixbench on x86. Experiments on a variety of recent
arm64 micro-architectures indicate that unixbench is not adversely affected
by the original change, instead yielding a 0-1% performance improvement.

Since one size does not fit all for the initial state of prefaulted PTEs,
introduce arch_wants_old_faultaround_pte(), which allows an architecture
to opt in to 'old' prefaulted PTEs at runtime based on whatever criteria
it may have.
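
As an illustration (this is not part of the patch; patch two provides arm64's
real implementation), an architecture opts in from its <asm/pgtable.h> along
these lines:

	/* arch/<arch>/include/asm/pgtable.h -- illustrative sketch only */
	static inline bool arch_wants_old_faultaround_pte(void)
	{
		/* Any runtime criterion works, e.g. a CPU feature check. */
		return true;
	}
	#define arch_wants_old_faultaround_pte arch_wants_old_faultaround_pte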

Cc: Jan Kara <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Reported-by: Vinayak Menon <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
include/linux/mm.h | 5 ++++-
mm/memory.c | 31 ++++++++++++++++++++++++++++---
2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..932886554586 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -426,6 +426,7 @@ extern pgprot_t protection_map[16];
* @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
* @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
* @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_PREFAULT_OLD: Initialise pre-faulted PTEs in the 'old' state.
*
* About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
* whether we would allow page faults to retry by specifying these two
@@ -456,6 +457,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_REMOTE 0x80
#define FAULT_FLAG_INSTRUCTION 0x100
#define FAULT_FLAG_INTERRUPTIBLE 0x200
+#define FAULT_FLAG_PREFAULT_OLD 0x400

/*
* The default fault flags that should be used by most of the
@@ -493,7 +495,8 @@ static inline bool fault_flag_allow_retry_first(unsigned int flags)
{ FAULT_FLAG_USER, "USER" }, \
{ FAULT_FLAG_REMOTE, "REMOTE" }, \
{ FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }, \
- { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
+ { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
+ { FAULT_FLAG_PREFAULT_OLD, "PREFAULT_OLD" }

/*
* vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..6b30c15120e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -134,6 +134,18 @@ static inline bool arch_faults_on_old_pte(void)
}
#endif

+#ifndef arch_wants_old_faultaround_pte
+static inline bool arch_wants_old_faultaround_pte(void)
+{
+ /*
+ * Transitioning a PTE from 'old' to 'young' can be expensive on
+ * some architectures, even if it's performed in hardware. By
+ * default, "false" means prefaulted entries will be 'young'.
+ */
+ return false;
+}
+#endif
+
static int __init disable_randmaps(char *s)
{
randomize_va_space = 0;
@@ -3788,6 +3800,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
+ bool old = vmf->flags & FAULT_FLAG_PREFAULT_OLD;
pte_t entry;
vm_fault_t ret;

@@ -3811,7 +3824,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
- entry = pte_sw_mkyoung(entry);
+ entry = old ? pte_mkold(entry) : pte_sw_mkyoung(entry);
if (write)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
/* copy-on-write page */
@@ -3964,6 +3977,9 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
smp_wmb(); /* See comment in __pte_alloc() */
}

+ if (arch_wants_old_faultaround_pte())
+ vmf->flags |= FAULT_FLAG_PREFAULT_OLD;
+
vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);

/* Huge page is mapped? Page fault is solved */
@@ -3978,8 +3994,17 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)

/* check if the page fault is solved */
vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
- if (!pte_none(*vmf->pte))
- ret = VM_FAULT_NOPAGE;
+ if (pte_none(*vmf->pte))
+ goto out_unlock;
+
+ if (vmf->flags & FAULT_FLAG_PREFAULT_OLD) {
+ pte_t pte = pte_mkyoung(*vmf->pte);
+ if (ptep_set_access_flags(vmf->vma, address, vmf->pte, pte, 0))
+ update_mmu_cache(vmf->vma, address, vmf->pte);
+ }
+
+ ret = VM_FAULT_NOPAGE;
+out_unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
vmf->address = address;
--
2.29.2.576.ga3fc446d84-goog

2020-12-09 18:45:13

by Will Deacon

Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Wed, Dec 09, 2020 at 09:58:12AM -0800, Linus Torvalds wrote:
> On Wed, Dec 9, 2020 at 8:40 AM Will Deacon <[email protected]> wrote:
> >
> > @@ -3978,8 +3994,17 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
> >
> > /* check if the page fault is solved */
> > vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
> > - if (!pte_none(*vmf->pte))
> > - ret = VM_FAULT_NOPAGE;
> > + if (pte_none(*vmf->pte))
> > + goto out_unlock;
> > +
> > + if (vmf->flags & FAULT_FLAG_PREFAULT_OLD) {
> > + pte_t pte = pte_mkyoung(*vmf->pte);
> > + if (ptep_set_access_flags(vmf->vma, address, vmf->pte, pte, 0))
> > + update_mmu_cache(vmf->vma, address, vmf->pte);
> > + }
>
> Oh, please dear God no.
>
> First you incorrectly set it old, and then you conditionally make it
> young again and as a result force an atomic rmw update and another TLB
> flush for no good reason.

There shouldn't be a TLB flush here, but I agree that it would have to
go and nobble the hash for PowerPC if they wanted to enable this.

> Just make sure that the FAULT_FLAG_PREFAULT_OLD never sets the
> *actual* address to old.
>
> And yes, that probably means that you need to change "alloc_set_pte()"
> to actually pass in the real address, and leave "vmf->address" alone -
> so that it can know which ones are prefaulted and which one is real,
> but that sounds like a good idea anyway.

Right, I deliberately avoided that based on the feedback from Jan on an
older version [1], but I can certainly look at it again.

> Then you can just make alloc_set_pte() do the right thing in the first
> place, instead of doing this nasty "lets do it wrong and fix it up
> later" horror.

I'll have a crack at this in v2.

Cheers,

Will

[1] https://patchwork.kernel.org/project/linux-arm-kernel/patch/[email protected]/

2020-12-09 18:52:05

by Will Deacon

Subject: Re: [PATCH 2/2] arm64: mm: Implement arch_wants_old_faultaround_pte()

On Wed, Dec 09, 2020 at 06:35:09PM +0000, Catalin Marinas wrote:
> On Wed, Dec 09, 2020 at 04:39:50PM +0000, Will Deacon wrote:
> > diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> > index da250e4741bd..3424f5881390 100644
> > --- a/arch/arm64/include/asm/cpufeature.h
> > +++ b/arch/arm64/include/asm/cpufeature.h
> > @@ -764,6 +764,18 @@ static inline bool cpu_has_hw_af(void)
> > ID_AA64MMFR1_HADBS_SHIFT);
> > }
> >
> > +static inline bool system_has_hw_af(void)
> > +{
> > + u64 mmfr1;
> > +
> > + if (!IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
> > + return false;
> > +
> > + mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
> > + return cpuid_feature_extract_unsigned_field(mmfr1,
> > + ID_AA64MMFR1_HADBS_SHIFT);
> > +}
>
> Could we not add a new system-wide cpu feature that checks for hardware
> AF? This read_sanitised_ftr_reg() does a binary search on each
> invocation.

I posted a diff [1] which would allow removing the binary search for cases
where we can pass the register coding as a constant (like this), but
honestly, it's not like we have many ID registers so I doubt it really
matters in the grand scheme of things.

That said, I'm spinning a v2 anyway so I can include it for comments
since I haven't posted it as a proper patch before.

Will

[1] https://lore.kernel.org/r/20201202172727.GC29813@willie-the-truck

2020-12-09 19:09:27

by Linus Torvalds

Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Wed, Dec 9, 2020 at 10:40 AM Will Deacon <[email protected]> wrote:
>
> > And yes, that probably means that you need to change "alloc_set_pte()"
> > to actually pass in the real address, and leave "vmf->address" alone -
> > so that it can know which ones are prefaulted and which one is real,
> > but that sounds like a good idea anyway.
>
> Right, I deliberately avoided that based on the feedback from Jan on an
> older version [1], but I can certainly look at it again.

Side note: I absolutely detest alloc_set_pte() in general, and it
would be very good to try to just change the rules of that function
entirely.

The alloc_set_pte() function is written to be some generic "iterate
over ptes setting them" function, but it has exactly two users, and
none of those users really want the semantics it gives them, I feel.
And the semantics that alloc_set_pte() does have actually *hurts*
them.

In particular, it made it a nightmare to read what do_fault_around()
does: it does that odd

if (pmd_none(*vmf->pmd)) {
vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);

and then it calls ->map_pages() (which is always filemap_map_pages(),
except for xfs, where it is also always filemap_map_pages but it takes
a lock first).

And it did that prealloc_pte, because that's what alloc_set_pte()
needs in order to be atomic - and it needs to be atomic because it's
all done under the RCU lock.

So essentially, the _major_ user of alloc_set_pte() requires atomicity
- but that's not at all obvious when you actually read alloc_set_pte()
itself, and in fact when you read it, you see that

if (!vmf->pte) {
ret = pte_alloc_one_map(vmf);

which can block. Except it won't block if we have vmf->prealloc_pte,
because then it will just take that instead.

In other words, ALL THAT IS COMPLETE GARBAGE. And it's written to be
as obtuse as humanly possible, and there is no way in hell anybody
understands it by just looking at the code - you have to follow all
these odd quirks back to see what's going on.

So it would be much better to get rid of alloc_set_pte() entirely, and
move the allocation and the pte locking into the caller, and just
clean it up (because it turns out the callers have their own models
for allocation anyway). And then each of the callers would be a lot
more understandable, instead of having this insanely subtle "I need to
be atomic, so I will pre-allocate in one place, and then take the RCU
lock in another place, in order to call a _third_ place that is only
atomic because of that first step".

The other user of alloc_set_pte() is "finish_fault()", and honestly,
that's what alloc_set_pte() was written for, and why alloc_set_pte()
does all those things that filemap_map_pages() doesn't even want.

But while that other user does want what alloc_set_pte() does, there's
no downside to just moving those things into the caller. It would once
again just clarify things to make the allocation and the setting be
separate operations - even in that place where it doesn't have the
same kind of very subtle behavior with pre-allocation and atomicity.

I think the problem with alloc_set_pte() is that it has had many
different people working on it over the years, and Kirill massaged
that old use to fit the new pre-alloc use. Before the pre-faulting, I
think both cases were much closer to each other.

So I'm not really blaming anybody here, but the ugly and very
hard-to-follow rules seem to come from historical behavior that was
massaged over time to "just work" rather than have that function be
made more logical.

Kirill - I think you know this code best. Would you be willing to look
at splitting out that "allocate and map" from alloc_set_pte() and into
the callers?

At that point, I think the current very special and odd
do_fault_around() pre-allocation could be made into just a _regular_
"allocate the pmd if it doesn't exist". And then the pte locking could
be moved into filemap_map_pages(), and suddenly the semantics and
rules around all that would be a whole lot more obvious.

Pretty please?

Linus

2020-12-09 20:21:21

by Catalin Marinas

Subject: Re: [PATCH 2/2] arm64: mm: Implement arch_wants_old_faultaround_pte()

On Wed, Dec 09, 2020 at 04:39:50PM +0000, Will Deacon wrote:
> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> index da250e4741bd..3424f5881390 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -764,6 +764,18 @@ static inline bool cpu_has_hw_af(void)
> ID_AA64MMFR1_HADBS_SHIFT);
> }
>
> +static inline bool system_has_hw_af(void)
> +{
> + u64 mmfr1;
> +
> + if (!IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
> + return false;
> +
> + mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
> + return cpuid_feature_extract_unsigned_field(mmfr1,
> + ID_AA64MMFR1_HADBS_SHIFT);
> +}

Could we not add a new system-wide cpu feature that checks for hardware
AF? This read_sanitised_ftr_reg() does a binary search on each
invocation.
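
Roughly something like the below (untested sketch; ARM64_HAS_HW_AF is a
made-up capability name and would need the usual arm64_features[] entry),
so the check compiles down to a static branch:

	static inline bool system_has_hw_af(void)
	{
		return IS_ENABLED(CONFIG_ARM64_HW_AFDBM) &&
		       cpus_have_const_cap(ARM64_HAS_HW_AF);
	}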

--
Catalin

2020-12-09 21:08:21

by Linus Torvalds

Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Wed, Dec 9, 2020 at 12:32 PM Matthew Wilcox <[email protected]> wrote:
>
> If a filesystem has put an Uptodate page into the page cache, the
> rest of the kernel can read it without telling the filesystem.

XFS does the same thing for xfs_file_read_iter() too.

Not that I disagree with you - when you mmap a file, once it's mapped
you see the data without any lock anyway. So it's all kinds of
pointless to serialize the page fault, because that's simply not
relevant. The lock will be gone by the time the user actually sees the
data.

But hey, the XFS people have their opinions.

Linus

2020-12-09 23:52:15

by Matthew Wilcox

Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Wed, Dec 09, 2020 at 11:04:13AM -0800, Linus Torvalds wrote:
> In particular, it made it a nightmare to read what do_fault_around()
> does: it does that odd
>
> if (pmd_none(*vmf->pmd)) {
> vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
>
> and then it calls ->map_pages() (which is always filemap_map_pages(),
> except for xfs, where it is also always filemap_map_pages but it takes
> a lock first).

... which is wrong. Dave's paranoia around other people
introducing bugs into XFS made him do this, but we should revert
cd647d5651c0b0deaa26c1acb9e1789437ba9bc7. Those operations he's
worried about are protected by the page lock.

If a filesystem has put an Uptodate page into the page cache, the
rest of the kernel can read it without telling the filesystem.

2020-12-10 01:47:25

by Linus Torvalds

Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Wed, Dec 9, 2020 at 8:40 AM Will Deacon <[email protected]> wrote:
>
> @@ -3978,8 +3994,17 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
>
> /* check if the page fault is solved */
> vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
> - if (!pte_none(*vmf->pte))
> - ret = VM_FAULT_NOPAGE;
> + if (pte_none(*vmf->pte))
> + goto out_unlock;
> +
> + if (vmf->flags & FAULT_FLAG_PREFAULT_OLD) {
> + pte_t pte = pte_mkyoung(*vmf->pte);
> + if (ptep_set_access_flags(vmf->vma, address, vmf->pte, pte, 0))
> + update_mmu_cache(vmf->vma, address, vmf->pte);
> + }

Oh, please dear God no.

First you incorrectly set it old, and then you conditionally make it
young again and as a result force an atomic rmw update and another TLB
flush for no good reason.

Just make sure that the FAULT_FLAG_PREFAULT_OLD never sets the
*actual* address to old.

And yes, that probably means that you need to change "alloc_set_pte()"
to actually pass in the real address, and leave "vmf->address" alone -
so that it can know which ones are prefaulted and which one is real,
but that sounds like a good idea anyway.

Then you can just make alloc_set_pte() do the right thing in the first
place, instead of doing this nasty "lets do it wrong and fix it up
later" horror.

Linus

2020-12-10 17:30:51

by Linus Torvalds

Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Thu, Dec 10, 2020 at 7:08 AM Kirill A. Shutemov
<[email protected]> wrote:
>
> See lightly tested patch below. Is it something you had in mind?

This is closer, in that at least it removes the ostensibly blocking
allocation (that can't happen) from the prefault path.

But the main issue remains:

> > At that point, I think the current very special and odd
> > do_fault_around() pre-allocation could be made into just a _regular_
> > "allocate the pmd if it doesn't exist". And then the pte locking could
> > be moved into filemap_map_pages(), and suddenly the semantics and
> > rules around all that would be a whole lot more obvious.
>
> No. It would stop faultaround code from mapping huge pages. We had to
> defer pte page table mapping until we know we don't have huge pages in
> page cache.

Can we please move that part to the callers too - possibly with a
separate helper function?

Because the real issue remains: as long as the map_set_pte() function
takes the pte lock, the caller cannot rely on it.

And the filemap_map_pages() code really would like to rely on it.
Because if the lock is taken there *above* the loop - or even in the
loop iteration at the top, the code can now do things that rely on "I
know I hold the page table lock".

In particular, we can get rid of that very very expensive page locking.

Which is the reason I know about the horrid current issue with
"pre-allocate in one place, lock in another, and know we are atomic in
a third place" issue. Because I had to walk down these paths and
realize that "this loop is run under the page table lock, EXCEPT for
the first iteration, where it's taken by the first time we do that
non-allocating alloc_set_pte()".

See?

Linus

2020-12-11 07:09:20

by Kirill A. Shutemov

Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Wed, Dec 09, 2020 at 11:04:13AM -0800, Linus Torvalds wrote:
>
> Kirill - I think you know this code best. Would you be willing to look
> at splitting out that "allocate and map" from alloc_set_pte() and into
> the callers?

See lightly tested patch below. Is it something you had in mind?

I don't like the map_set_pte() name. Any suggestions?

> At that point, I think the current very special and odd
> do_fault_around() pre-allocation could be made into just a _regular_
> "allocate the pmd if it doesn't exist". And then the pte locking could
> be moved into filemap_map_pages(), and suddenly the semantics and
> rules around all that would be a whole lot more obvious.

No. It would stop faultaround code from mapping huge pages. We had to
defer pte page table mapping until we know we don't have huge pages in
page cache.


diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..6b1bfa0fd488 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -535,7 +535,7 @@ struct vm_fault {
*/
pgtable_t prealloc_pte; /* Pre-allocated pte page table.
* vm_ops->map_pages() calls
- * alloc_set_pte() from atomic context.
+ * map_set_pte() from atomic context.
* do_fault_around() pre-allocates
* page table to avoid allocation from
* atomic context.
@@ -972,7 +972,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page);
+vm_fault_t map_set_pte(struct vm_fault *vmf, struct page *page);
vm_fault_t finish_fault(struct vm_fault *vmf);
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#endif
diff --git a/mm/filemap.c b/mm/filemap.c
index 331f4261d723..4446ae6d6d09 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2884,7 +2884,7 @@ void filemap_map_pages(struct vm_fault *vmf,
if (vmf->pte)
vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;
- if (alloc_set_pte(vmf, page))
+ if (map_set_pte(vmf, page))
goto unlock;
unlock_page(head);
goto next;
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..1db1000767fc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;

- /* See the comment in pte_alloc_one_map() */
+ /* See the comment in map_set_pte() */
if (unlikely(pmd_trans_unstable(vmf->pmd)))
return 0;

@@ -3641,55 +3641,6 @@ static int pmd_devmap_trans_unstable(pmd_t *pmd)
return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
}

-static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf)
-{
- struct vm_area_struct *vma = vmf->vma;
-
- if (!pmd_none(*vmf->pmd))
- goto map_pte;
- if (vmf->prealloc_pte) {
- vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- if (unlikely(!pmd_none(*vmf->pmd))) {
- spin_unlock(vmf->ptl);
- goto map_pte;
- }
-
- mm_inc_nr_ptes(vma->vm_mm);
- pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
- spin_unlock(vmf->ptl);
- vmf->prealloc_pte = NULL;
- } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
- return VM_FAULT_OOM;
- }
-map_pte:
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
- * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
- * under us and then back to pmd_none, as a result of MADV_DONTNEED
- * running immediately after a huge pmd fault in a different thread of
- * this mm, in turn leading to a misleading pmd_trans_huge() retval.
- * All we have to ensure is that it is a regular pmd that we can walk
- * with pte_offset_map() and we can do that through an atomic read in
- * C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return VM_FAULT_NOPAGE;
-
- /*
- * At this point we know that our vmf->pmd points to a page of ptes
- * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
- * for the duration of the fault. If a racing MADV_DONTNEED runs and
- * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
- * be valid and we will re-check to make sure the vmf->pte isn't
- * pte_none() under vmf->ptl protection when we return to
- * alloc_set_pte().
- */
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
- return 0;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void deposit_prealloc_pte(struct vm_fault *vmf)
{
@@ -3769,45 +3720,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
}
#endif

-/**
- * alloc_set_pte - setup new PTE entry for given page and add reverse page
- * mapping. If needed, the function allocates page table or use pre-allocated.
- *
- * @vmf: fault environment
- * @page: page to map
- *
- * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on
- * return.
- *
- * Target users are page handler itself and implementations of
- * vm_ops->map_pages.
- *
- * Return: %0 on success, %VM_FAULT_ code in case of error.
- */
-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
+static void do_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
pte_t entry;
- vm_fault_t ret;
-
- if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
- ret = do_set_pmd(vmf, page);
- if (ret != VM_FAULT_FALLBACK)
- return ret;
- }
-
- if (!vmf->pte) {
- ret = pte_alloc_one_map(vmf);
- if (ret)
- return ret;
- }
-
- /* Re-check under ptl */
- if (unlikely(!pte_none(*vmf->pte))) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
- return VM_FAULT_NOPAGE;
- }

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
@@ -3824,14 +3741,56 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+}

- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, vmf->address, vmf->pte);
+vm_fault_t map_set_pte(struct vm_fault *vmf, struct page *page)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ vm_fault_t ret;
+
+ if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
+ ret = do_set_pmd(vmf, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
+
+ if (!vmf->pte) {
+ vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+ if (likely(pmd_none(*vmf->pmd))) {
+ mm_inc_nr_ptes(vma->vm_mm);
+ pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
+ vmf->prealloc_pte = NULL;
+ }
+ spin_unlock(vmf->ptl);
+
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
+ * of pmd_trans_huge() to ensure the pmd didn't become
+ * pmd_trans_huge under us and then back to pmd_none, as a
+ * result of MADV_DONTNEED running immediately after a huge pmd
+ * fault in a different thread of this mm, in turn leading to a
+ * misleading pmd_trans_huge() retval. All we have to ensure is
+ * that it is a regular pmd that we can walk with
+ * pte_offset_map() and we can do that through an atomic read
+ * in C, which is what pmd_trans_unstable() provides.
+ */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return VM_FAULT_NOPAGE;
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ }
+
+ /* Re-check under ptl */
+ if (likely(pte_none(*vmf->pte)))
+ do_set_pte(vmf, page);

+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
return 0;
}

-
/**
* finish_fault - finish page fault once we have prepared the page to fault
*
@@ -3849,12 +3808,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
*/
vm_fault_t finish_fault(struct vm_fault *vmf)
{
+ struct vm_area_struct *vma = vmf->vma;
struct page *page;
- vm_fault_t ret = 0;
+ vm_fault_t ret;

/* Did we COW the page? */
- if ((vmf->flags & FAULT_FLAG_WRITE) &&
- !(vmf->vma->vm_flags & VM_SHARED))
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
page = vmf->cow_page;
else
page = vmf->page;
@@ -3863,13 +3822,34 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
* check even for read faults because we might have lost our CoWed
* page
*/
- if (!(vmf->vma->vm_flags & VM_SHARED))
- ret = check_stable_address_space(vmf->vma->vm_mm);
- if (!ret)
- ret = alloc_set_pte(vmf, page);
- if (vmf->pte)
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- return ret;
+ if (!(vma->vm_flags & VM_SHARED))
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ return ret;
+
+ if (pmd_none(*vmf->pmd)) {
+ if (PageTransCompound(page)) {
+ ret = do_set_pmd(vmf, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
+
+ if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+ return VM_FAULT_OOM;
+ }
+
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return 0;
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ /* Re-check under ptl */
+ if (likely(pte_none(*vmf->pte)))
+ do_set_pte(vmf, page);
+
+ update_mmu_tlb(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
}

static unsigned long fault_around_bytes __read_mostly =
@@ -4340,7 +4320,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
*/
vmf->pte = NULL;
} else {
- /* See comment in pte_alloc_one_map() */
+ /* See comment in map_set_pte() */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
/*
--
Kirill A. Shutemov

2020-12-14 16:10:43

by Kirill A. Shutemov

Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Thu, Dec 10, 2020 at 09:23:53AM -0800, Linus Torvalds wrote:
> Can we please move that part to the callers too - possibly with a
> separate helper function?

Here it is. Still barely tested.

I expected to hate it more, but it looks reasonable. Opencoded
xas_for_each() smells bad, but...

And diffstat is fine:

4 files changed, 153 insertions(+), 131 deletions(-)

Any comments?

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..2825153ad0d6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -534,8 +534,8 @@ struct vm_fault {
* is not NULL, otherwise pmd.
*/
pgtable_t prealloc_pte; /* Pre-allocated pte page table.
- * vm_ops->map_pages() calls
- * alloc_set_pte() from atomic context.
+ * vm_ops->map_pages() sets up a page
+ * table from atomic context.
* do_fault_around() pre-allocates
* page table to avoid allocation from
* atomic context.
@@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page);
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
+void do_set_pte(struct vm_fault *vmf, struct page *page);
+
vm_fault_t finish_fault(struct vm_fault *vmf);
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#endif
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e237004d498d..869c1921ceda 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}

+/*
+ * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
+ * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
+ * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
+ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
+ */
+static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
+{
+ return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
+}
+
#ifndef CONFIG_NUMA_BALANCING
/*
* Technically a PTE can be PROTNONE even when not doing NUMA balancing but
diff --git a/mm/filemap.c b/mm/filemap.c
index 0b2067b3c328..8fa2183ce10c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -42,6 +42,7 @@
#include <linux/psi.h>
#include <linux/ramfs.h>
#include <linux/page_idle.h>
+#include <asm/pgalloc.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
@@ -2831,10 +2832,53 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);

+static bool filemap_map_pages_pmd(struct vm_fault *vmf, struct page *page,
+ struct xa_state *xas)
+{
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+
+ if (xa_is_value(page))
+ return false;
+
+ if (!PageTransHuge(page) || PageLocked(page))
+ return false;
+
+ if (!page_cache_get_speculative(page))
+ return false;
+
+ if (page != xas_reload(xas))
+ goto skip;
+
+ if (!PageTransHuge(page))
+ goto skip;
+
+ if (!PageUptodate(page) || PageReadahead(page) || PageHWPoison(page))
+ goto skip;
+
+ if (!trylock_page(page))
+ goto skip;
+
+ if (page->mapping != mapping || !PageUptodate(page))
+ goto unlock;
+
+ if (xas->xa_index >= DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE))
+ goto unlock;
+
+ do_set_pmd(vmf, page);
+ unlock_page(page);
+ return true;
+unlock:
+ unlock_page(page);
+skip:
+ put_page(page);
+ return false;
+}
+
void filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff)
{
- struct file *file = vmf->vma->vm_file;
+ struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
unsigned long max_idx;
@@ -2843,20 +2887,54 @@ void filemap_map_pages(struct vm_fault *vmf,
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);

rcu_read_lock();
- xas_for_each(&xas, head, end_pgoff) {
+ head = xas_find(&xas, end_pgoff);
+ if (!head) {
+ rcu_read_unlock();
+ return;
+ }
+
+ while (xas_retry(&xas, head))
+ head = xas_next_entry(&xas, end_pgoff);
+
+ if (pmd_none(*vmf->pmd) && filemap_map_pages_pmd(vmf, head, &xas)) {
+ rcu_read_unlock();
+ return;
+ }
+
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*vmf->pmd)) {
+ rcu_read_unlock();
+ return;
+ }
+
+ vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+ if (likely(pmd_none(*vmf->pmd))) {
+ mm_inc_nr_ptes(vma->vm_mm);
+ pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
+ vmf->prealloc_pte = NULL;
+ }
+ spin_unlock(vmf->ptl);
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return rcu_read_unlock();
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+
+ for (; head; head = xas_next_entry(&xas, end_pgoff)) {
if (xas_retry(&xas, head))
continue;
if (xa_is_value(head))
- goto next;
-
+ continue;
/*
* Check for a locked page first, as a speculative
* reference may adversely influence page migration.
*/
if (PageLocked(head))
- goto next;
+ continue;
if (!page_cache_get_speculative(head))
- goto next;
+ continue;

/* Has the page moved or been split? */
if (unlikely(head != xas_reload(&xas)))
@@ -2884,19 +2962,18 @@ void filemap_map_pages(struct vm_fault *vmf,
if (vmf->pte)
vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;
- if (alloc_set_pte(vmf, page))
- goto unlock;
+ if (pte_none(*vmf->pte))
+ do_set_pte(vmf, page);
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, vmf->address, vmf->pte);
unlock_page(head);
- goto next;
+ continue;
unlock:
unlock_page(head);
skip:
put_page(head);
-next:
- /* Huge page is mapped? No need to proceed. */
- if (pmd_trans_huge(*vmf->pmd))
- break;
}
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
rcu_read_unlock();
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
}
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..96d62774096a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;

- /* See the comment in pte_alloc_one_map() */
+ /* See the comment in map_set_pte() */
if (unlikely(pmd_trans_unstable(vmf->pmd)))
return 0;

@@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
return ret;
}

-/*
- * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
- * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
- * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
- * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
- */
-static int pmd_devmap_trans_unstable(pmd_t *pmd)
-{
- return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
-}
-
-static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf)
-{
- struct vm_area_struct *vma = vmf->vma;
-
- if (!pmd_none(*vmf->pmd))
- goto map_pte;
- if (vmf->prealloc_pte) {
- vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- if (unlikely(!pmd_none(*vmf->pmd))) {
- spin_unlock(vmf->ptl);
- goto map_pte;
- }
-
- mm_inc_nr_ptes(vma->vm_mm);
- pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
- spin_unlock(vmf->ptl);
- vmf->prealloc_pte = NULL;
- } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
- return VM_FAULT_OOM;
- }
-map_pte:
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
- * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
- * under us and then back to pmd_none, as a result of MADV_DONTNEED
- * running immediately after a huge pmd fault in a different thread of
- * this mm, in turn leading to a misleading pmd_trans_huge() retval.
- * All we have to ensure is that it is a regular pmd that we can walk
- * with pte_offset_map() and we can do that through an atomic read in
- * C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return VM_FAULT_NOPAGE;
-
- /*
- * At this point we know that our vmf->pmd points to a page of ptes
- * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
- * for the duration of the fault. If a racing MADV_DONTNEED runs and
- * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
- * be valid and we will re-check to make sure the vmf->pte isn't
- * pte_none() under vmf->ptl protection when we return to
- * alloc_set_pte().
- */
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
- return 0;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void deposit_prealloc_pte(struct vm_fault *vmf)
{
@@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf)
vmf->prealloc_pte = NULL;
}

-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -3769,45 +3709,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
}
#endif

-/**
- * alloc_set_pte - setup new PTE entry for given page and add reverse page
- * mapping. If needed, the function allocates page table or use pre-allocated.
- *
- * @vmf: fault environment
- * @page: page to map
- *
- * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on
- * return.
- *
- * Target users are page handler itself and implementations of
- * vm_ops->map_pages.
- *
- * Return: %0 on success, %VM_FAULT_ code in case of error.
- */
-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
+void do_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
pte_t entry;
- vm_fault_t ret;
-
- if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
- ret = do_set_pmd(vmf, page);
- if (ret != VM_FAULT_FALLBACK)
- return ret;
- }
-
- if (!vmf->pte) {
- ret = pte_alloc_one_map(vmf);
- if (ret)
- return ret;
- }
-
- /* Re-check under ptl */
- if (unlikely(!pte_none(*vmf->pte))) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
- return VM_FAULT_NOPAGE;
- }

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
@@ -3824,14 +3730,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
-
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, vmf->address, vmf->pte);
-
- return 0;
}

-
/**
* finish_fault - finish page fault once we have prepared the page to fault
*
@@ -3849,12 +3749,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
*/
vm_fault_t finish_fault(struct vm_fault *vmf)
{
+ struct vm_area_struct *vma = vmf->vma;
struct page *page;
- vm_fault_t ret = 0;
+ vm_fault_t ret;

/* Did we COW the page? */
- if ((vmf->flags & FAULT_FLAG_WRITE) &&
- !(vmf->vma->vm_flags & VM_SHARED))
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
page = vmf->cow_page;
else
page = vmf->page;
@@ -3863,13 +3763,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
* check even for read faults because we might have lost our CoWed
* page
*/
- if (!(vmf->vma->vm_flags & VM_SHARED))
- ret = check_stable_address_space(vmf->vma->vm_mm);
- if (!ret)
- ret = alloc_set_pte(vmf, page);
- if (vmf->pte)
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- return ret;
+ if (!(vma->vm_flags & VM_SHARED))
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ return ret;
+
+ if (pmd_none(*vmf->pmd)) {
+ if (PageTransCompound(page)) {
+ ret = do_set_pmd(vmf, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
+
+ if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+ return VM_FAULT_OOM;
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return 0;
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ /* Re-check under ptl */
+ if (likely(pte_none(*vmf->pte)))
+ do_set_pte(vmf, page);
+
+ update_mmu_tlb(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
}

static unsigned long fault_around_bytes __read_mostly =
@@ -3980,7 +3902,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
if (!pte_none(*vmf->pte))
ret = VM_FAULT_NOPAGE;
- pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
vmf->address = address;
vmf->pte = NULL;
@@ -4340,7 +4261,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
*/
vmf->pte = NULL;
} else {
- /* See comment in pte_alloc_one_map() */
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
+ * of pmd_trans_huge() to ensure the pmd didn't become
+ * pmd_trans_huge under us and then back to pmd_none, as a
+ * result of MADV_DONTNEED running immediately after a huge pmd
+ * fault in a different thread of this mm, in turn leading to a
+ * misleading pmd_trans_huge() retval. All we have to ensure is
+ * that it is a regular pmd that we can walk with
+ * pte_offset_map() and we can do that through an atomic read
+ * in C, which is what pmd_trans_unstable() provides.
+ */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
/*
--
Kirill A. Shutemov

2020-12-14 18:42:30

by Linus Torvalds

Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Mon, Dec 14, 2020 at 8:07 AM Kirill A. Shutemov <[email protected]> wrote:
>
> Here it is. Still barely tested.

Ok, from looking at the patch (not applying it and looking at the end
result), I think the locking - at least for the filemap_map_pages()
case - is a lot easier to understand.

So you seem to have fixed the thing I personally found most confusing. Thanks.

> I expected to hate it more, but it looks reasonable. Opencoded
> xas_for_each() smells bad, but...

I think the open-coded xas_for_each() per se isn't a problem, but I
agree that the startup condition is a bit ugly. And I'm actually
personally more confused by why xas_retry() is needed here, but not in
many other places. That is perhaps more obvious now that it shows up
twice.

Adding Willy to the cc in case he has comments on that, and can
explain it to me in small words.

[ https://lore.kernel.org/lkml/20201214160724.ewhjqoi32chheone@box/
for context ]

And I actually think it might make even more sense if you moved more
of the pmd handling into "filemap_map_pages_pmd()".

Now it's a bit odd, with filemap_map_pages() containing part of the
pmd handling, and then part being in filemap_map_pages_pmd().

Could we have a "filemap_map_pmd()" that does it all, and then the
filemap_map_pages() logic would be more along the lines of

if (filemap_map_pmd(vmf, xas)) {
rcu_read_unlock();
return;
}

... then handle pte's ...

Hmm?

There may be some shared state thing why you didn't do it - the above
is mostly a syntactic issue.

Thanks,

Linus

2020-12-14 19:14:25

by Matthew Wilcox

Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Mon, Dec 14, 2020 at 09:54:06AM -0800, Linus Torvalds wrote:
> > I expected to hate it more, but it looks reasonable. Opencoded
> > xas_for_each() smells bad, but...
>
> I think the open-coded xas_for_each() per se isn't a problem, but I
> agree that the startup condition is a bit ugly. And I'm actually
> personally more confused by why xas_retry() is needed here, but not in
> many other places. That is perhaps more obvious now that it shows up
> twice.
>
> Adding Willy to the cc in case he has comments on that, and can
> explain it to me in small words.
>
> [ https://lore.kernel.org/lkml/20201214160724.ewhjqoi32chheone@box/
> for context ]

The xas_retry() is something I now regret, but haven't got annoyed enough
by it yet to fix (also, other projects). It originated in the radix
tree where we would get a radix_tree_node and then iterate over it in
header macros. If we're holding the rcu_read_lock() and somebody else
deletes an entry leaving the entry at index 0 as the only index in the
tree, we tell the RCU readers to rewalk the tree from the top by putting
a retry entry in place of the real entry.

It's not entirely clear to me now why we did that. Just leave the entry
alone and the RCU-walkers will see it, then the rest of the node is empty.

As to why we need to do this in some places and not others; you can
only see a retry entry if you're only protected by the RCU lock. If
you're protected by the spinlock, you can't see any nodes which
contain retry entries.
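
To spell the RCU-only pattern out, a walk looks roughly like this (sketch):

	XA_STATE(xas, &mapping->i_pages, start_pgoff);
	struct page *page;

	rcu_read_lock();
	xas_for_each(&xas, page, end_pgoff) {
		if (xas_retry(&xas, page))
			continue;	/* rewalks from the root next iteration */
		if (xa_is_value(page))
			continue;	/* shadow/swap entry, not a page */
		/* speculative ref, revalidate with xas_reload(), ... */
	}
	rcu_read_unlock();

whereas a walk done under the xa_lock never needs the xas_retry() check.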

But I now think we should just get rid of retry entries. Maybe I'm
missing a good reason to keep them.

2020-12-17 10:56:09

by Kirill A. Shutemov

Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Wed, Dec 16, 2020 at 10:41:36AM -0800, Linus Torvalds wrote:
> On Wed, Dec 16, 2020 at 9:07 AM Kirill A. Shutemov <[email protected]> wrote:
> >
> > If this looks fine, I'll submit a proper patch.
>
> That patch looks good to me.
>
> It would be good if somebody else looked it through - maybe I like it
> just because I got to pee in the snow and make my mark. But I think
> that filemap_map_pages() now looks a lot more understandable, and
> having that pte_offset_map_lock() outside the loop should be good.

It is worth noting that after the change, in the worst case scenario, we will
have an additional ref/unref and lock/unlock of the page if we get deep
enough into filemap_map_pmd() but fail to map the page.

Also, if the range doesn't have a mappable page we would still set up a page
table in the PMD entry. It means we cannot have a huge page mapped there
later. It may be a bummer: getting the page table back out of the page table
tree requires mmap_write_lock().

We also take ptl for cold page cache. It may increase ptl contention, but
it should be negligible with split-ptl.

--
Kirill A. Shutemov

2020-12-17 18:24:54

by Linus Torvalds

Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Thu, Dec 17, 2020 at 2:54 AM Kirill A. Shutemov <[email protected]> wrote:
>
> Also, if the range doesn't have a mappable page we would still set up a page
> table in the PMD entry. It means we cannot have a huge page mapped there
> later. It may be a bummer: getting the page table back out of the page table
> tree requires mmap_write_lock().
>
> We also take ptl for cold page cache. It may increase ptl contention, but
> it should be negligible with split-ptl.

Both good points.

I doubt the second one is really noticeable, since if it isn't cached
you're going to have all the costs of actually getting the page, but
the first one sounds fairly fundamental.

But I think both issues could be easily fixed by doing that
"xas_is_value()" and checking for 'head' being non-NULL early.

In fact, maybe that should be done as part of that early setup loop.
So that early code that is now

+ head = xas_find(&xas, end_pgoff);
+ if (!head) {
+ rcu_read_unlock();
+ return;
+ }
+
+ while (xas_retry(&xas, head))
+ head = xas_next_entry(&xas, end_pgoff);

could/should perhaps be something more along the lines of

+ head = xas_find(&xas, end_pgoff);
+ for (; ; head = xas_next_entry(&xas, end_pgoff)) {
+ if (!head) {
+ rcu_read_unlock();
+ return;
+ }
+ if (likely(!xas_retry(&xas, head)))
+ break;
+ }

instead. So that if we don't find any cached entries, we won't do
anything, rather than take locks and then not do anything.

Then that second loop very naturally becomes a "do { } while ()" one.

Hmm?

Linus

2020-12-18 11:06:01

by Kirill A. Shutemov

Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Thu, Dec 17, 2020 at 10:22:33AM -0800, Linus Torvalds wrote:
> + head = xas_find(&xas, end_pgoff);
> + for (; ; head = xas_next_entry(&xas, end_pgoff)) {
> + if (!head) {
> + rcu_read_unlock();
> + return;
> + }
> + if (likely(!xas_retry(&xas, head)))
> + break;
> + }
>
> instead. So that if we don't find any cached entries, we won't do
> anything, rather than take locks and then not do anything.

This should do. See below.

> Then that second loop very naturally becomes a "do { } while ()" one.

I don't see it. I haven't found a reasonable way to rework it into a do-while.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..2825153ad0d6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -534,8 +534,8 @@ struct vm_fault {
* is not NULL, otherwise pmd.
*/
pgtable_t prealloc_pte; /* Pre-allocated pte page table.
- * vm_ops->map_pages() calls
- * alloc_set_pte() from atomic context.
+ * vm_ops->map_pages() sets up a page
+ * table from atomic context.
* do_fault_around() pre-allocates
* page table to avoid allocation from
* atomic context.
@@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page);
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
+void do_set_pte(struct vm_fault *vmf, struct page *page);
+
vm_fault_t finish_fault(struct vm_fault *vmf);
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#endif
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e237004d498d..869c1921ceda 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}

+/*
+ * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
+ * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
+ * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
+ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
+ */
+static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
+{
+ return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
+}
+
#ifndef CONFIG_NUMA_BALANCING
/*
* Technically a PTE can be PROTNONE even when not doing NUMA balancing but
diff --git a/mm/filemap.c b/mm/filemap.c
index 0b2067b3c328..04975682296d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -42,6 +42,7 @@
#include <linux/psi.h>
#include <linux/ramfs.h>
#include <linux/page_idle.h>
+#include <asm/pgalloc.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
@@ -2831,10 +2832,74 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);

+static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page,
+ struct xa_state *xas)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct address_space *mapping = vma->vm_file->f_mapping;
+
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*vmf->pmd))
+ return true;
+
+ if (xa_is_value(page))
+ goto nohuge;
+
+ if (!pmd_none(*vmf->pmd))
+ goto nohuge;
+
+ if (!PageTransHuge(page) || PageLocked(page))
+ goto nohuge;
+
+ if (!page_cache_get_speculative(page))
+ goto nohuge;
+
+ if (page != xas_reload(xas))
+ goto unref;
+
+ if (!PageTransHuge(page))
+ goto unref;
+
+ if (!PageUptodate(page) || PageReadahead(page) || PageHWPoison(page))
+ goto unref;
+
+ if (!trylock_page(page))
+ goto unref;
+
+ if (page->mapping != mapping || !PageUptodate(page))
+ goto unlock;
+
+ if (xas->xa_index >= DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE))
+ goto unlock;
+
+ do_set_pmd(vmf, page);
+ unlock_page(page);
+ return true;
+unlock:
+ unlock_page(page);
+unref:
+ put_page(page);
+nohuge:
+ vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+ if (likely(pmd_none(*vmf->pmd))) {
+ mm_inc_nr_ptes(vma->vm_mm);
+ pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
+ vmf->prealloc_pte = NULL;
+ }
+ spin_unlock(vmf->ptl);
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return true;
+
+ return false;
+}
+
void filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff)
{
- struct file *file = vmf->vma->vm_file;
+ struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
unsigned long max_idx;
@@ -2843,20 +2908,37 @@ void filemap_map_pages(struct vm_fault *vmf,
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);

rcu_read_lock();
- xas_for_each(&xas, head, end_pgoff) {
+ head = xas_find(&xas, end_pgoff);
+ for (; ; head = xas_next_entry(&xas, end_pgoff)) {
+ if (!head) {
+ rcu_read_unlock();
+ return;
+ }
+ if (likely(!xas_retry(&xas, head)))
+ break;
+ }
+
+ if (filemap_map_pmd(vmf, head, &xas)) {
+ rcu_read_unlock();
+ return;
+ }
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+
+ for (; head; head = xas_next_entry(&xas, end_pgoff)) {
if (xas_retry(&xas, head))
continue;
if (xa_is_value(head))
- goto next;
-
+ continue;
/*
* Check for a locked page first, as a speculative
* reference may adversely influence page migration.
*/
if (PageLocked(head))
- goto next;
+ continue;
if (!page_cache_get_speculative(head))
- goto next;
+ continue;

/* Has the page moved or been split? */
if (unlikely(head != xas_reload(&xas)))
@@ -2884,19 +2966,18 @@ void filemap_map_pages(struct vm_fault *vmf,
if (vmf->pte)
vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;
- if (alloc_set_pte(vmf, page))
- goto unlock;
+ if (pte_none(*vmf->pte))
+ do_set_pte(vmf, page);
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, vmf->address, vmf->pte);
unlock_page(head);
- goto next;
+ continue;
unlock:
unlock_page(head);
skip:
put_page(head);
-next:
- /* Huge page is mapped? No need to proceed. */
- if (pmd_trans_huge(*vmf->pmd))
- break;
}
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
rcu_read_unlock();
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
}
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..96d62774096a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;

- /* See the comment in pte_alloc_one_map() */
+ /* See the comment in map_set_pte() */
if (unlikely(pmd_trans_unstable(vmf->pmd)))
return 0;

@@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
return ret;
}

-/*
- * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
- * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
- * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
- * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
- */
-static int pmd_devmap_trans_unstable(pmd_t *pmd)
-{
- return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
-}
-
-static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf)
-{
- struct vm_area_struct *vma = vmf->vma;
-
- if (!pmd_none(*vmf->pmd))
- goto map_pte;
- if (vmf->prealloc_pte) {
- vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- if (unlikely(!pmd_none(*vmf->pmd))) {
- spin_unlock(vmf->ptl);
- goto map_pte;
- }
-
- mm_inc_nr_ptes(vma->vm_mm);
- pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
- spin_unlock(vmf->ptl);
- vmf->prealloc_pte = NULL;
- } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
- return VM_FAULT_OOM;
- }
-map_pte:
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
- * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
- * under us and then back to pmd_none, as a result of MADV_DONTNEED
- * running immediately after a huge pmd fault in a different thread of
- * this mm, in turn leading to a misleading pmd_trans_huge() retval.
- * All we have to ensure is that it is a regular pmd that we can walk
- * with pte_offset_map() and we can do that through an atomic read in
- * C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return VM_FAULT_NOPAGE;
-
- /*
- * At this point we know that our vmf->pmd points to a page of ptes
- * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
- * for the duration of the fault. If a racing MADV_DONTNEED runs and
- * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
- * be valid and we will re-check to make sure the vmf->pte isn't
- * pte_none() under vmf->ptl protection when we return to
- * alloc_set_pte().
- */
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
- return 0;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void deposit_prealloc_pte(struct vm_fault *vmf)
{
@@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf)
vmf->prealloc_pte = NULL;
}

-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -3769,45 +3709,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
}
#endif

-/**
- * alloc_set_pte - setup new PTE entry for given page and add reverse page
- * mapping. If needed, the function allocates page table or use pre-allocated.
- *
- * @vmf: fault environment
- * @page: page to map
- *
- * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on
- * return.
- *
- * Target users are page handler itself and implementations of
- * vm_ops->map_pages.
- *
- * Return: %0 on success, %VM_FAULT_ code in case of error.
- */
-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
+void do_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
pte_t entry;
- vm_fault_t ret;
-
- if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
- ret = do_set_pmd(vmf, page);
- if (ret != VM_FAULT_FALLBACK)
- return ret;
- }
-
- if (!vmf->pte) {
- ret = pte_alloc_one_map(vmf);
- if (ret)
- return ret;
- }
-
- /* Re-check under ptl */
- if (unlikely(!pte_none(*vmf->pte))) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
- return VM_FAULT_NOPAGE;
- }

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
@@ -3824,14 +3730,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
-
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, vmf->address, vmf->pte);
-
- return 0;
}

-
/**
* finish_fault - finish page fault once we have prepared the page to fault
*
@@ -3849,12 +3749,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
*/
vm_fault_t finish_fault(struct vm_fault *vmf)
{
+ struct vm_area_struct *vma = vmf->vma;
struct page *page;
- vm_fault_t ret = 0;
+ vm_fault_t ret;

/* Did we COW the page? */
- if ((vmf->flags & FAULT_FLAG_WRITE) &&
- !(vmf->vma->vm_flags & VM_SHARED))
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
page = vmf->cow_page;
else
page = vmf->page;
@@ -3863,13 +3763,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
* check even for read faults because we might have lost our CoWed
* page
*/
- if (!(vmf->vma->vm_flags & VM_SHARED))
- ret = check_stable_address_space(vmf->vma->vm_mm);
- if (!ret)
- ret = alloc_set_pte(vmf, page);
- if (vmf->pte)
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- return ret;
+ if (!(vma->vm_flags & VM_SHARED))
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ return ret;
+
+ if (pmd_none(*vmf->pmd)) {
+ if (PageTransCompound(page)) {
+ ret = do_set_pmd(vmf, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
+
+ if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+ return VM_FAULT_OOM;
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return 0;
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ /* Re-check under ptl */
+ if (likely(pte_none(*vmf->pte)))
+ do_set_pte(vmf, page);
+
+ update_mmu_tlb(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
}

static unsigned long fault_around_bytes __read_mostly =
@@ -3980,7 +3902,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
if (!pte_none(*vmf->pte))
ret = VM_FAULT_NOPAGE;
- pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
vmf->address = address;
vmf->pte = NULL;
@@ -4340,7 +4261,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
*/
vmf->pte = NULL;
} else {
- /* See comment in pte_alloc_one_map() */
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
+ * of pmd_trans_huge() to ensure the pmd didn't become
+ * pmd_trans_huge under us and then back to pmd_none, as a
+ * result of MADV_DONTNEED running immediately after a huge pmd
+ * fault in a different thread of this mm, in turn leading to a
+ * misleading pmd_trans_huge() retval. All we have to ensure is
+ * that it is a regular pmd that we can walk with
+ * pte_offset_map() and we can do that through an atomic read
+ * in C, which is what pmd_trans_unstable() provides.
+ */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
/*
--
Kirill A. Shutemov

2020-12-18 20:48:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Fri, Dec 18, 2020 at 3:04 AM Kirill A. Shutemov <[email protected]> wrote:
>
> This should do. See below.

Looks fine.

> > Then that second loop very naturally becomes a "do { } while ()" one.
>
> I don't see it. I haven't found a reasonable way to rework it into a do-while.

Now that you return early for the "head == NULL" case, this loop:

for (; head; head = xas_next_entry(&xas, end_pgoff)) {
[...]
}

very naturally becomes

do {
[...]
} while ((head = xas_next_entry(&xas, end_pgoff)) != NULL);

because the initial test for 'head' being NULL is no longer needed,
and thus it's a lot more logical to just test it at the end of the
loop when we update it.

No?

Maybe I'm missing something silly.

Linus

2020-12-19 12:43:00

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Fri, Dec 18, 2020 at 10:56:55AM -0800, Linus Torvalds wrote:
> No?

Okay, but we only win the NULL check. xas_retry() and xa_is_value() have to
be repeated at the beginning of the loop.

From b4f4c0d32e654b8459a1e439a453373499b8946a Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Sat, 19 Dec 2020 15:19:23 +0300
Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths

alloc_set_pte() has two users with different requirements: in the
faultaround code, it is called from an atomic context and the PTE page
table has to be preallocated. finish_fault() can sleep and allocate the
page table as needed.

PTL locking rules are also strange, hard to follow and overkill for
finish_fault().

Let's untangle the mess. alloc_set_pte() is gone now. All locking is
explicit.

The price is some code duplication to handle huge pages in the
faultaround path, but it should be fine, as it is an overall improvement
in readability.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 8 +-
include/linux/pgtable.h | 11 +++
mm/filemap.c | 109 +++++++++++++++++++++++----
mm/memory.c | 162 ++++++++++++----------------------------
4 files changed, 158 insertions(+), 132 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..2825153ad0d6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -534,8 +534,8 @@ struct vm_fault {
* is not NULL, otherwise pmd.
*/
pgtable_t prealloc_pte; /* Pre-allocated pte page table.
- * vm_ops->map_pages() calls
- * alloc_set_pte() from atomic context.
+ * vm_ops->map_pages() sets up a page
+ * table from atomic context.
* do_fault_around() pre-allocates
* page table to avoid allocation from
* atomic context.
@@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page);
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
+void do_set_pte(struct vm_fault *vmf, struct page *page);
+
vm_fault_t finish_fault(struct vm_fault *vmf);
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#endif
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e237004d498d..869c1921ceda 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}

+/*
+ * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
+ * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
+ * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
+ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
+ */
+static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
+{
+ return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
+}
+
#ifndef CONFIG_NUMA_BALANCING
/*
* Technically a PTE can be PROTNONE even when not doing NUMA balancing but
diff --git a/mm/filemap.c b/mm/filemap.c
index 0b2067b3c328..a9b21ec52e73 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -42,6 +42,7 @@
#include <linux/psi.h>
#include <linux/ramfs.h>
#include <linux/page_idle.h>
+#include <asm/pgalloc.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
@@ -2831,10 +2832,74 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);

+static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page,
+ struct xa_state *xas)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct address_space *mapping = vma->vm_file->f_mapping;
+
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*vmf->pmd))
+ return true;
+
+ if (xa_is_value(page))
+ goto nohuge;
+
+ if (!pmd_none(*vmf->pmd))
+ goto nohuge;
+
+ if (!PageTransHuge(page) || PageLocked(page))
+ goto nohuge;
+
+ if (!page_cache_get_speculative(page))
+ goto nohuge;
+
+ if (page != xas_reload(xas))
+ goto unref;
+
+ if (!PageTransHuge(page))
+ goto unref;
+
+ if (!PageUptodate(page) || PageReadahead(page) || PageHWPoison(page))
+ goto unref;
+
+ if (!trylock_page(page))
+ goto unref;
+
+ if (page->mapping != mapping || !PageUptodate(page))
+ goto unlock;
+
+ if (xas->xa_index >= DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE))
+ goto unlock;
+
+ do_set_pmd(vmf, page);
+ unlock_page(page);
+ return true;
+unlock:
+ unlock_page(page);
+unref:
+ put_page(page);
+nohuge:
+ vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+ if (likely(pmd_none(*vmf->pmd))) {
+ mm_inc_nr_ptes(vma->vm_mm);
+ pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
+ vmf->prealloc_pte = NULL;
+ }
+ spin_unlock(vmf->ptl);
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return true;
+
+ return false;
+}
+
void filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff)
{
- struct file *file = vmf->vma->vm_file;
+ struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
unsigned long max_idx;
@@ -2843,20 +2908,37 @@ void filemap_map_pages(struct vm_fault *vmf,
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);

rcu_read_lock();
- xas_for_each(&xas, head, end_pgoff) {
+ head = xas_find(&xas, end_pgoff);
+ for (; ; head = xas_next_entry(&xas, end_pgoff)) {
+ if (!head) {
+ rcu_read_unlock();
+ return;
+ }
+ if (likely(!xas_retry(&xas, head)))
+ break;
+ }
+
+ if (filemap_map_pmd(vmf, head, &xas)) {
+ rcu_read_unlock();
+ return;
+ }
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+
+ do {
if (xas_retry(&xas, head))
continue;
if (xa_is_value(head))
- goto next;
-
+ continue;
/*
* Check for a locked page first, as a speculative
* reference may adversely influence page migration.
*/
if (PageLocked(head))
- goto next;
+ continue;
if (!page_cache_get_speculative(head))
- goto next;
+ continue;

/* Has the page moved or been split? */
if (unlikely(head != xas_reload(&xas)))
@@ -2884,19 +2966,18 @@ void filemap_map_pages(struct vm_fault *vmf,
if (vmf->pte)
vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;
- if (alloc_set_pte(vmf, page))
- goto unlock;
+ if (pte_none(*vmf->pte))
+ do_set_pte(vmf, page);
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, vmf->address, vmf->pte);
unlock_page(head);
- goto next;
+ continue;
unlock:
unlock_page(head);
skip:
put_page(head);
-next:
- /* Huge page is mapped? No need to proceed. */
- if (pmd_trans_huge(*vmf->pmd))
- break;
- }
+ } while ((head = xas_next_entry(&xas, end_pgoff)) != NULL);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
rcu_read_unlock();
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
}
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..96d62774096a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;

- /* See the comment in pte_alloc_one_map() */
+ /* See the comment in map_set_pte() */
if (unlikely(pmd_trans_unstable(vmf->pmd)))
return 0;

@@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
return ret;
}

-/*
- * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
- * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
- * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
- * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
- */
-static int pmd_devmap_trans_unstable(pmd_t *pmd)
-{
- return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
-}
-
-static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf)
-{
- struct vm_area_struct *vma = vmf->vma;
-
- if (!pmd_none(*vmf->pmd))
- goto map_pte;
- if (vmf->prealloc_pte) {
- vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- if (unlikely(!pmd_none(*vmf->pmd))) {
- spin_unlock(vmf->ptl);
- goto map_pte;
- }
-
- mm_inc_nr_ptes(vma->vm_mm);
- pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
- spin_unlock(vmf->ptl);
- vmf->prealloc_pte = NULL;
- } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
- return VM_FAULT_OOM;
- }
-map_pte:
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
- * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
- * under us and then back to pmd_none, as a result of MADV_DONTNEED
- * running immediately after a huge pmd fault in a different thread of
- * this mm, in turn leading to a misleading pmd_trans_huge() retval.
- * All we have to ensure is that it is a regular pmd that we can walk
- * with pte_offset_map() and we can do that through an atomic read in
- * C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return VM_FAULT_NOPAGE;
-
- /*
- * At this point we know that our vmf->pmd points to a page of ptes
- * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
- * for the duration of the fault. If a racing MADV_DONTNEED runs and
- * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
- * be valid and we will re-check to make sure the vmf->pte isn't
- * pte_none() under vmf->ptl protection when we return to
- * alloc_set_pte().
- */
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
- return 0;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void deposit_prealloc_pte(struct vm_fault *vmf)
{
@@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf)
vmf->prealloc_pte = NULL;
}

-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -3769,45 +3709,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
}
#endif

-/**
- * alloc_set_pte - setup new PTE entry for given page and add reverse page
- * mapping. If needed, the function allocates page table or use pre-allocated.
- *
- * @vmf: fault environment
- * @page: page to map
- *
- * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on
- * return.
- *
- * Target users are page handler itself and implementations of
- * vm_ops->map_pages.
- *
- * Return: %0 on success, %VM_FAULT_ code in case of error.
- */
-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
+void do_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
pte_t entry;
- vm_fault_t ret;
-
- if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
- ret = do_set_pmd(vmf, page);
- if (ret != VM_FAULT_FALLBACK)
- return ret;
- }
-
- if (!vmf->pte) {
- ret = pte_alloc_one_map(vmf);
- if (ret)
- return ret;
- }
-
- /* Re-check under ptl */
- if (unlikely(!pte_none(*vmf->pte))) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
- return VM_FAULT_NOPAGE;
- }

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
@@ -3824,14 +3730,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
-
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, vmf->address, vmf->pte);
-
- return 0;
}

-
/**
* finish_fault - finish page fault once we have prepared the page to fault
*
@@ -3849,12 +3749,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
*/
vm_fault_t finish_fault(struct vm_fault *vmf)
{
+ struct vm_area_struct *vma = vmf->vma;
struct page *page;
- vm_fault_t ret = 0;
+ vm_fault_t ret;

/* Did we COW the page? */
- if ((vmf->flags & FAULT_FLAG_WRITE) &&
- !(vmf->vma->vm_flags & VM_SHARED))
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
page = vmf->cow_page;
else
page = vmf->page;
@@ -3863,13 +3763,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
* check even for read faults because we might have lost our CoWed
* page
*/
- if (!(vmf->vma->vm_flags & VM_SHARED))
- ret = check_stable_address_space(vmf->vma->vm_mm);
- if (!ret)
- ret = alloc_set_pte(vmf, page);
- if (vmf->pte)
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- return ret;
+ if (!(vma->vm_flags & VM_SHARED))
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ return ret;
+
+ if (pmd_none(*vmf->pmd)) {
+ if (PageTransCompound(page)) {
+ ret = do_set_pmd(vmf, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
+
+ if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+ return VM_FAULT_OOM;
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return 0;
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ /* Re-check under ptl */
+ if (likely(pte_none(*vmf->pte)))
+ do_set_pte(vmf, page);
+
+ update_mmu_tlb(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
}

static unsigned long fault_around_bytes __read_mostly =
@@ -3980,7 +3902,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
if (!pte_none(*vmf->pte))
ret = VM_FAULT_NOPAGE;
- pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
vmf->address = address;
vmf->pte = NULL;
@@ -4340,7 +4261,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
*/
vmf->pte = NULL;
} else {
- /* See comment in pte_alloc_one_map() */
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
+ * of pmd_trans_huge() to ensure the pmd didn't become
+ * pmd_trans_huge under us and then back to pmd_none, as a
+ * result of MADV_DONTNEED running immediately after a huge pmd
+ * fault in a different thread of this mm, in turn leading to a
+ * misleading pmd_trans_huge() retval. All we have to ensure is
+ * that it is a regular pmd that we can walk with
+ * pte_offset_map() and we can do that through an atomic read
+ * in C, which is what pmd_trans_unstable() provides.
+ */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
/*
--
Kirill A. Shutemov

2020-12-19 20:11:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sat, Dec 19, 2020 at 4:41 AM Kirill A. Shutemov <[email protected]> wrote:
>
> Okay, but we only win the NULL check. xas_retry() and xa_is_value() have to
> be repeated at the beginning of the loop.

Yeah.

I wonder if it might make sense to have a "xas_next_entry_rcu()"
function that does something like

while ((entry = xas_next_entry(&xas, end)) != NULL) {
if (xas_retry(entry) || xa_is_value(entry))
continue;
return entry;
}
return NULL;

but that's a question for Willy, and independent of this patch.
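
(For reference, a fleshed-out sketch of that idea, written against the
existing XArray primitives; the name xas_next_entry_rcu() and its placement
are assumptions rather than an existing API, and note that the real
xas_retry() takes the xa_state as its first argument:)

static inline void *xas_next_entry_rcu(struct xa_state *xas, unsigned long max)
{
        void *entry;

        /* Hypothetical helper, not in <linux/xarray.h>: skip retry markers
         * and shadow/value entries, return the next real entry or NULL. */
        while ((entry = xas_next_entry(xas, max)) != NULL) {
                if (xas_retry(xas, entry) || xa_is_value(entry))
                        continue;
                return entry;
        }
        return NULL;
}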

I like the patch, and I'll think about it a bit more, but I might
decide to just apply it to get this thing over with.

Otherwise it should probably go into -mm for more testing.

Linus

2020-12-19 20:46:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sat, Dec 19, 2020 at 4:41 AM Kirill A. Shutemov <[email protected]> wrote:
>
> @@ -2884,19 +2966,18 @@ void filemap_map_pages(struct vm_fault *vmf,
> if (vmf->pte)
> vmf->pte += xas.xa_index - last_pgoff;
> last_pgoff = xas.xa_index;
> - if (alloc_set_pte(vmf, page))
> - goto unlock;
> + if (pte_none(*vmf->pte))
> + do_set_pte(vmf, page);
> + /* no need to invalidate: a not-present page won't be cached */
> + update_mmu_cache(vma, vmf->address, vmf->pte);
> unlock_page(head);
> - goto next;
> + continue;

This can't be right.

Look at what happens if "pte_none()" is not true. It won't install
the new pte, but it also won't drop the ref to the page.

So I think it needs to be

- if (alloc_set_pte(vmf, page))
+ if (!pte_none(*vmf->pte))
goto unlock;
+ do_set_pte(vmf, page);

instead, so that the "if somebody else already filled the page table"
case gets handled right.

Hmm?

Linus

2020-12-22 10:02:32

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sat, Dec 19, 2020 at 12:34:17PM -0800, Linus Torvalds wrote:
> On Sat, Dec 19, 2020 at 4:41 AM Kirill A. Shutemov <[email protected]> wrote:
> >
> > @@ -2884,19 +2966,18 @@ void filemap_map_pages(struct vm_fault *vmf,
> > if (vmf->pte)
> > vmf->pte += xas.xa_index - last_pgoff;
> > last_pgoff = xas.xa_index;
> > - if (alloc_set_pte(vmf, page))
> > - goto unlock;
> > + if (pte_none(*vmf->pte))
> > + do_set_pte(vmf, page);
> > + /* no need to invalidate: a not-present page won't be cached */
> > + update_mmu_cache(vma, vmf->address, vmf->pte);
> > unlock_page(head);
> > - goto next;
> > + continue;
>
> This can't be right.
>
> Look at what happens if "pte_none()" is not true. It won't install
> the new pte, but it also won't drop the ref to the page.

Ouch.

> So I think it needs to be
>
> - if (alloc_set_pte(vmf, page))
> + if (!pte_none(*vmf->pte))
> goto unlock;
> + do_set_pte(vmf, page);
>
> instead, so that the "if somebody else already filled the page table"
> case gets handled right.

Updated patch is below.

From 0ec1bc1fe95587350ac4f4c866d6482383740b36 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Sat, 19 Dec 2020 15:19:23 +0300
Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths

alloc_set_pte() has two users with different requirements: in the
faultaround code, it is called from an atomic context and the PTE page
table has to be preallocated. finish_fault() can sleep and allocate the
page table as needed.

PTL locking rules are also strange, hard to follow and overkill for
finish_fault().

Let's untangle the mess. alloc_set_pte() is gone now. All locking is
explicit.

The price is some code duplication to handle huge pages in the
faultaround path, but it should be fine, as it is an overall improvement
in readability.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 8 +-
include/linux/pgtable.h | 11 +++
mm/filemap.c | 109 +++++++++++++++++++++++----
mm/memory.c | 162 ++++++++++++----------------------------
4 files changed, 159 insertions(+), 131 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..2825153ad0d6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -534,8 +534,8 @@ struct vm_fault {
* is not NULL, otherwise pmd.
*/
pgtable_t prealloc_pte; /* Pre-allocated pte page table.
- * vm_ops->map_pages() calls
- * alloc_set_pte() from atomic context.
+ * vm_ops->map_pages() sets up a page
+ * table from atomic context.
* do_fault_around() pre-allocates
* page table to avoid allocation from
* atomic context.
@@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page);
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
+void do_set_pte(struct vm_fault *vmf, struct page *page);
+
vm_fault_t finish_fault(struct vm_fault *vmf);
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#endif
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e237004d498d..869c1921ceda 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}

+/*
+ * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
+ * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
+ * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
+ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
+ */
+static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
+{
+ return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
+}
+
#ifndef CONFIG_NUMA_BALANCING
/*
* Technically a PTE can be PROTNONE even when not doing NUMA balancing but
diff --git a/mm/filemap.c b/mm/filemap.c
index 0b2067b3c328..f8fdbe079375 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -42,6 +42,7 @@
#include <linux/psi.h>
#include <linux/ramfs.h>
#include <linux/page_idle.h>
+#include <asm/pgalloc.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
@@ -2831,10 +2832,74 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);

+static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page,
+ struct xa_state *xas)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct address_space *mapping = vma->vm_file->f_mapping;
+
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*vmf->pmd))
+ return true;
+
+ if (xa_is_value(page))
+ goto nohuge;
+
+ if (!pmd_none(*vmf->pmd))
+ goto nohuge;
+
+ if (!PageTransHuge(page) || PageLocked(page))
+ goto nohuge;
+
+ if (!page_cache_get_speculative(page))
+ goto nohuge;
+
+ if (page != xas_reload(xas))
+ goto unref;
+
+ if (!PageTransHuge(page))
+ goto unref;
+
+ if (!PageUptodate(page) || PageReadahead(page) || PageHWPoison(page))
+ goto unref;
+
+ if (!trylock_page(page))
+ goto unref;
+
+ if (page->mapping != mapping || !PageUptodate(page))
+ goto unlock;
+
+ if (xas->xa_index >= DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE))
+ goto unlock;
+
+ do_set_pmd(vmf, page);
+ unlock_page(page);
+ return true;
+unlock:
+ unlock_page(page);
+unref:
+ put_page(page);
+nohuge:
+ vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+ if (likely(pmd_none(*vmf->pmd))) {
+ mm_inc_nr_ptes(vma->vm_mm);
+ pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
+ vmf->prealloc_pte = NULL;
+ }
+ spin_unlock(vmf->ptl);
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return true;
+
+ return false;
+}
+
void filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff)
{
- struct file *file = vmf->vma->vm_file;
+ struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
unsigned long max_idx;
@@ -2843,20 +2908,37 @@ void filemap_map_pages(struct vm_fault *vmf,
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);

rcu_read_lock();
- xas_for_each(&xas, head, end_pgoff) {
+ head = xas_find(&xas, end_pgoff);
+ for (; ; head = xas_next_entry(&xas, end_pgoff)) {
+ if (!head) {
+ rcu_read_unlock();
+ return;
+ }
+ if (likely(!xas_retry(&xas, head)))
+ break;
+ }
+
+ if (filemap_map_pmd(vmf, head, &xas)) {
+ rcu_read_unlock();
+ return;
+ }
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+
+ do {
if (xas_retry(&xas, head))
continue;
if (xa_is_value(head))
- goto next;
-
+ continue;
/*
* Check for a locked page first, as a speculative
* reference may adversely influence page migration.
*/
if (PageLocked(head))
- goto next;
+ continue;
if (!page_cache_get_speculative(head))
- goto next;
+ continue;

/* Has the page moved or been split? */
if (unlikely(head != xas_reload(&xas)))
@@ -2884,19 +2966,20 @@ void filemap_map_pages(struct vm_fault *vmf,
if (vmf->pte)
vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;
- if (alloc_set_pte(vmf, page))
+ if (!pte_none(*vmf->pte))
goto unlock;
+
+ do_set_pte(vmf, page);
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, vmf->address, vmf->pte);
unlock_page(head);
- goto next;
+ continue;
unlock:
unlock_page(head);
skip:
put_page(head);
-next:
- /* Huge page is mapped? No need to proceed. */
- if (pmd_trans_huge(*vmf->pmd))
- break;
- }
+ } while ((head = xas_next_entry(&xas, end_pgoff)) != NULL);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
rcu_read_unlock();
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
}
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..96d62774096a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;

- /* See the comment in pte_alloc_one_map() */
+ /* See the comment in map_set_pte() */
if (unlikely(pmd_trans_unstable(vmf->pmd)))
return 0;

@@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
return ret;
}

-/*
- * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
- * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
- * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
- * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
- */
-static int pmd_devmap_trans_unstable(pmd_t *pmd)
-{
- return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
-}
-
-static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf)
-{
- struct vm_area_struct *vma = vmf->vma;
-
- if (!pmd_none(*vmf->pmd))
- goto map_pte;
- if (vmf->prealloc_pte) {
- vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- if (unlikely(!pmd_none(*vmf->pmd))) {
- spin_unlock(vmf->ptl);
- goto map_pte;
- }
-
- mm_inc_nr_ptes(vma->vm_mm);
- pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
- spin_unlock(vmf->ptl);
- vmf->prealloc_pte = NULL;
- } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
- return VM_FAULT_OOM;
- }
-map_pte:
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
- * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
- * under us and then back to pmd_none, as a result of MADV_DONTNEED
- * running immediately after a huge pmd fault in a different thread of
- * this mm, in turn leading to a misleading pmd_trans_huge() retval.
- * All we have to ensure is that it is a regular pmd that we can walk
- * with pte_offset_map() and we can do that through an atomic read in
- * C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return VM_FAULT_NOPAGE;
-
- /*
- * At this point we know that our vmf->pmd points to a page of ptes
- * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
- * for the duration of the fault. If a racing MADV_DONTNEED runs and
- * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
- * be valid and we will re-check to make sure the vmf->pte isn't
- * pte_none() under vmf->ptl protection when we return to
- * alloc_set_pte().
- */
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
- return 0;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void deposit_prealloc_pte(struct vm_fault *vmf)
{
@@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf)
vmf->prealloc_pte = NULL;
}

-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -3769,45 +3709,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
}
#endif

-/**
- * alloc_set_pte - setup new PTE entry for given page and add reverse page
- * mapping. If needed, the function allocates page table or use pre-allocated.
- *
- * @vmf: fault environment
- * @page: page to map
- *
- * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on
- * return.
- *
- * Target users are page handler itself and implementations of
- * vm_ops->map_pages.
- *
- * Return: %0 on success, %VM_FAULT_ code in case of error.
- */
-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
+void do_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
pte_t entry;
- vm_fault_t ret;
-
- if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
- ret = do_set_pmd(vmf, page);
- if (ret != VM_FAULT_FALLBACK)
- return ret;
- }
-
- if (!vmf->pte) {
- ret = pte_alloc_one_map(vmf);
- if (ret)
- return ret;
- }
-
- /* Re-check under ptl */
- if (unlikely(!pte_none(*vmf->pte))) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
- return VM_FAULT_NOPAGE;
- }

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
@@ -3824,14 +3730,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
-
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, vmf->address, vmf->pte);
-
- return 0;
}

-
/**
* finish_fault - finish page fault once we have prepared the page to fault
*
@@ -3849,12 +3749,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
*/
vm_fault_t finish_fault(struct vm_fault *vmf)
{
+ struct vm_area_struct *vma = vmf->vma;
struct page *page;
- vm_fault_t ret = 0;
+ vm_fault_t ret;

/* Did we COW the page? */
- if ((vmf->flags & FAULT_FLAG_WRITE) &&
- !(vmf->vma->vm_flags & VM_SHARED))
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
page = vmf->cow_page;
else
page = vmf->page;
@@ -3863,13 +3763,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
* check even for read faults because we might have lost our CoWed
* page
*/
- if (!(vmf->vma->vm_flags & VM_SHARED))
- ret = check_stable_address_space(vmf->vma->vm_mm);
- if (!ret)
- ret = alloc_set_pte(vmf, page);
- if (vmf->pte)
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- return ret;
+ if (!(vma->vm_flags & VM_SHARED))
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ return ret;
+
+ if (pmd_none(*vmf->pmd)) {
+ if (PageTransCompound(page)) {
+ ret = do_set_pmd(vmf, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
+
+ if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+ return VM_FAULT_OOM;
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return 0;
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ /* Re-check under ptl */
+ if (likely(pte_none(*vmf->pte)))
+ do_set_pte(vmf, page);
+
+ update_mmu_tlb(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
}

static unsigned long fault_around_bytes __read_mostly =
@@ -3980,7 +3902,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
if (!pte_none(*vmf->pte))
ret = VM_FAULT_NOPAGE;
- pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
vmf->address = address;
vmf->pte = NULL;
@@ -4340,7 +4261,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
*/
vmf->pte = NULL;
} else {
- /* See comment in pte_alloc_one_map() */
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
+ * of pmd_trans_huge() to ensure the pmd didn't become
+ * pmd_trans_huge under us and then back to pmd_none, as a
+ * result of MADV_DONTNEED running immediately after a huge pmd
+ * fault in a different thread of this mm, in turn leading to a
+ * misleading pmd_trans_huge() retval. All we have to ensure is
+ * that it is a regular pmd that we can walk with
+ * pte_offset_map() and we can do that through an atomic read
+ * in C, which is what pmd_trans_unstable() provides.
+ */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
/*
--
Kirill A. Shutemov

2020-12-24 04:07:17

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Tue, 22 Dec 2020, Kirill A. Shutemov wrote:
>
> Updated patch is below.
>
> From 0ec1bc1fe95587350ac4f4c866d6482383740b36 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <[email protected]>
> Date: Sat, 19 Dec 2020 15:19:23 +0300
> Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths
>
> alloc_set_pte() has two users with different requirements: in the
> faultaround code, it is called from an atomic context and the PTE page
> table has to be preallocated. finish_fault() can sleep and allocate the
> page table as needed.
>
> PTL locking rules are also strange, hard to follow and overkill for
> finish_fault().
>
> Let's untangle the mess. alloc_set_pte() is gone now. All locking is
> explicit.
>
> The price is some code duplication to handle huge pages in the
> faultaround path, but it should be fine, as it is an overall improvement
> in readability.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

It's not ready yet.

I won't pretend to have reviewed, but I did try applying and running
with it: mostly it seems to work fine, but turned out to be leaking
huge pages (with vmstat's thp_split_page_failed growing bigger and
bigger as page reclaim cannot get rid of them).

Aside from the actual bug, filemap_map_pmd() seems suboptimal at
present: comments below (plus one comment in do_anonymous_page()).

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 0b2067b3c328..f8fdbe079375 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2831,10 +2832,74 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> }
> EXPORT_SYMBOL(filemap_fault);
>
> +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page,
> + struct xa_state *xas)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + struct address_space *mapping = vma->vm_file->f_mapping;
> +
> + /* Huge page is mapped? No need to proceed. */
> + if (pmd_trans_huge(*vmf->pmd))
> + return true;
> +
> + if (xa_is_value(page))
> + goto nohuge;

I think it would be easier to follow if filemap_map_pages() never
passed this an xa_is_value(page): probably just skip them in its
initial xas_next_entry() loop.
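
(A minimal sketch of that suggestion against the initial loop of this
version's filemap_map_pages(); illustration only, the combined check is the
only change:)

        head = xas_find(&xas, end_pgoff);
        for (; ; head = xas_next_entry(&xas, end_pgoff)) {
                if (!head) {
                        rcu_read_unlock();
                        return;
                }
                /* Skip retry markers and shadow/value entries up front. */
                if (likely(!xas_retry(&xas, head) && !xa_is_value(head)))
                        break;
        }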

> +
> + if (!pmd_none(*vmf->pmd))
> + goto nohuge;

Then at nohuge it unconditionally takes pmd_lock(), finds !pmd_none,
and unlocks again: unnecessary overhead I believe we did not have before.
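
(One way to avoid that extra lock/unlock, roughly what the follow-up version
later in the thread ends up doing, is to only take the lock when the pmd
still looks empty; sketch only:)

nohuge:
        if (pmd_none(*vmf->pmd)) {
                vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
                /* Re-check under the lock before installing the preallocated table. */
                if (likely(pmd_none(*vmf->pmd))) {
                        mm_inc_nr_ptes(vma->vm_mm);
                        pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
                        vmf->prealloc_pte = NULL;
                }
                spin_unlock(vmf->ptl);
        }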

> +
> + if (!PageTransHuge(page) || PageLocked(page))
> + goto nohuge;

So if PageTransHuge, but someone else temporarily holds PageLocked,
we insert a page table at nohuge, sadly preventing it from being
mapped here later by huge pmd.

> +
> + if (!page_cache_get_speculative(page))
> + goto nohuge;
> +
> + if (page != xas_reload(xas))
> + goto unref;
> +
> + if (!PageTransHuge(page))
> + goto unref;
> +
> + if (!PageUptodate(page) || PageReadahead(page) || PageHWPoison(page))
> + goto unref;
> +
> + if (!trylock_page(page))
> + goto unref;
> +
> + if (page->mapping != mapping || !PageUptodate(page))
> + goto unlock;
> +
> + if (xas->xa_index >= DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE))
> + goto unlock;
> +
> + do_set_pmd(vmf, page);

Here is the source of the huge page leak: do_set_pmd() can fail
(and we would do better to have skipped most of its failure cases long
before getting this far). It worked without leaking once I patched it:

- do_set_pmd(vmf, page);
- unlock_page(page);
- return true;
+ if (do_set_pmd(vmf, page) == 0) {
+ unlock_page(page);
+ return true;
+ }

> + unlock_page(page);
> + return true;
> +unlock:
> + unlock_page(page);
> +unref:
> + put_page(page);
> +nohuge:
> + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> + if (likely(pmd_none(*vmf->pmd))) {
> + mm_inc_nr_ptes(vma->vm_mm);
> + pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
> + vmf->prealloc_pte = NULL;
> + }
> + spin_unlock(vmf->ptl);

I think it's a bit weird to hide this page table insertion inside
filemap_map_pmd() (I guess you're thinking that this function deals
with pmd level, but I'd find it easier to have a filemap_map_huge()
dealing with the huge mapping). Better to do it on return into
filemap_map_pages(); maybe filemap_map_pmd() or filemap_map_huge()
would then need to return vm_fault_t rather than bool, I didn't try.

> +
> + /* See comment in handle_pte_fault() */
> + if (pmd_devmap_trans_unstable(vmf->pmd))
> + return true;
> +
> + return false;
> +}
...
> diff --git a/mm/memory.c b/mm/memory.c
> index c48f8df6e502..96d62774096a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> if (pte_alloc(vma->vm_mm, vmf->pmd))
> return VM_FAULT_OOM;
>
> - /* See the comment in pte_alloc_one_map() */
> + /* See the comment in map_set_pte() */

No, no such function: probably should be like the others and say
/* See comment in handle_pte_fault() */

Hugh

2020-12-25 11:35:55

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Wed, Dec 23, 2020 at 08:04:54PM -0800, Hugh Dickins wrote:
> It's not ready yet.

Thanks for the feedback. It's very helpful.

The patch below should address the problems you've found.

The new helper next_page() returns a stabilized page, so filemap_map_pmd()
can clearly decide if we should set up a page table or a huge page.

The leak is fixed too.

Could you give it a try?


From 0a6a3a1c920d04f55bcb202b585c85a4494b2ae0 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Sat, 19 Dec 2020 15:19:23 +0300
Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths

alloc_set_pte() has two users with different requirements: in the
faultaround code, it is called from an atomic context and the PTE page
table has to be preallocated. finish_fault() can sleep and allocate the
page table as needed.

PTL locking rules are also strange, hard to follow and overkill for
finish_fault().

Let's untangle the mess. alloc_set_pte() is gone now. All locking is
explicit.

The price is some code duplication to handle huge pages in the
faultaround path, but it should be fine, as it is an overall improvement
in readability.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 8 +-
include/linux/pgtable.h | 11 +++
mm/filemap.c | 147 ++++++++++++++++++++++++++----------
mm/memory.c | 162 ++++++++++++----------------------------
4 files changed, 171 insertions(+), 157 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..2825153ad0d6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -534,8 +534,8 @@ struct vm_fault {
* is not NULL, otherwise pmd.
*/
pgtable_t prealloc_pte; /* Pre-allocated pte page table.
- * vm_ops->map_pages() calls
- * alloc_set_pte() from atomic context.
+ * vm_ops->map_pages() sets up a page
+ * table from atomic context.
* do_fault_around() pre-allocates
* page table to avoid allocation from
* atomic context.
@@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page);
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
+void do_set_pte(struct vm_fault *vmf, struct page *page);
+
vm_fault_t finish_fault(struct vm_fault *vmf);
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#endif
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e237004d498d..869c1921ceda 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}

+/*
+ * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
+ * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
+ * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
+ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
+ */
+static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
+{
+ return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
+}
+
#ifndef CONFIG_NUMA_BALANCING
/*
* Technically a PTE can be PROTNONE even when not doing NUMA balancing but
diff --git a/mm/filemap.c b/mm/filemap.c
index 0b2067b3c328..069b15778b4a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -42,6 +42,7 @@
#include <linux/psi.h>
#include <linux/ramfs.h>
#include <linux/page_idle.h>
+#include <asm/pgalloc.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
@@ -2831,50 +2832,117 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);

+static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
+{
+ struct mm_struct *mm = vmf->vma->vm_mm;
+
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*vmf->pmd))
+ return true;
+
+ if (pmd_none(*vmf->pmd) &&
+ PageTransHuge(page) &&
+ do_set_pmd(vmf, page)) {
+ unlock_page(page);
+ return true;
+ }
+
+ if (pmd_none(*vmf->pmd)) {
+ vmf->ptl = pmd_lock(mm, vmf->pmd);
+ if (likely(pmd_none(*vmf->pmd))) {
+ mm_inc_nr_ptes(mm);
+ pmd_populate(mm, vmf->pmd, vmf->prealloc_pte);
+ vmf->prealloc_pte = NULL;
+ }
+ spin_unlock(vmf->ptl);
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd)) {
+ unlock_page(page);
+ put_page(page);
+ return true;
+ }
+
+ return false;
+}
+
+static struct page *next_page(struct page *page, struct vm_fault *vmf,
+ struct xa_state *xas, pgoff_t end_pgoff)
+{
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ unsigned long max_idx;
+
+ if (!page)
+ page = xas_find(xas, end_pgoff);
+ else
+ page = xas_next_entry(xas, end_pgoff);
+
+ do {
+ if (!page)
+ return NULL;
+ if (xas_retry(xas, page))
+ continue;
+ if (xa_is_value(page))
+ continue;
+ if (PageLocked(page))
+ continue;
+ if (!page_cache_get_speculative(page))
+ continue;
+ /* Has the page moved or been split? */
+ if (unlikely(page != xas_reload(xas)))
+ goto skip;
+ if (!PageUptodate(page) || PageReadahead(page))
+ goto skip;
+ if (PageHWPoison(page))
+ goto skip;
+ if (!trylock_page(page))
+ goto skip;
+ if (page->mapping != mapping)
+ goto unlock;
+ if (!PageUptodate(page))
+ goto unlock;
+ max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
+ if (xas->xa_index >= max_idx)
+ goto unlock;
+ return page;
+unlock:
+ unlock_page(page);
+skip:
+ put_page(page);
+ } while ((page = xas_next_entry(xas, end_pgoff)) != NULL);
+
+ return NULL;
+}
+
void filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff)
{
- struct file *file = vmf->vma->vm_file;
+ struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
- unsigned long max_idx;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct page *head, *page;
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);

rcu_read_lock();
- xas_for_each(&xas, head, end_pgoff) {
- if (xas_retry(&xas, head))
- continue;
- if (xa_is_value(head))
- goto next;
+ head = next_page(NULL, vmf, &xas, end_pgoff);
+ if (!head) {
+ rcu_read_unlock();
+ return;
+ }

- /*
- * Check for a locked page first, as a speculative
- * reference may adversely influence page migration.
- */
- if (PageLocked(head))
- goto next;
- if (!page_cache_get_speculative(head))
- goto next;
+ if (filemap_map_pmd(vmf, head)) {
+ rcu_read_unlock();
+ return;
+ }

- /* Has the page moved or been split? */
- if (unlikely(head != xas_reload(&xas)))
- goto skip;
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ do {
page = find_subpage(head, xas.xa_index);
-
- if (!PageUptodate(head) ||
- PageReadahead(page) ||
- PageHWPoison(page))
- goto skip;
- if (!trylock_page(head))
- goto skip;
-
- if (head->mapping != mapping || !PageUptodate(head))
- goto unlock;
-
- max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
- if (xas.xa_index >= max_idx)
+ if (PageHWPoison(page))
goto unlock;

if (mmap_miss > 0)
@@ -2884,19 +2952,20 @@ void filemap_map_pages(struct vm_fault *vmf,
if (vmf->pte)
vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;
- if (alloc_set_pte(vmf, page))
+
+ if (!pte_none(*vmf->pte))
goto unlock;
+
+ do_set_pte(vmf, page);
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, vmf->address, vmf->pte);
unlock_page(head);
- goto next;
+ continue;
unlock:
unlock_page(head);
-skip:
put_page(head);
-next:
- /* Huge page is mapped? No need to proceed. */
- if (pmd_trans_huge(*vmf->pmd))
- break;
- }
+ } while ((head = next_page(head, vmf, &xas, end_pgoff)) != NULL);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
rcu_read_unlock();
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
}
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..98d340b45557 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;

- /* See the comment in pte_alloc_one_map() */
+ /* See comment in handle_pte_fault() */
if (unlikely(pmd_trans_unstable(vmf->pmd)))
return 0;

@@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
return ret;
}

-/*
- * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
- * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
- * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
- * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
- */
-static int pmd_devmap_trans_unstable(pmd_t *pmd)
-{
- return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
-}
-
-static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf)
-{
- struct vm_area_struct *vma = vmf->vma;
-
- if (!pmd_none(*vmf->pmd))
- goto map_pte;
- if (vmf->prealloc_pte) {
- vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- if (unlikely(!pmd_none(*vmf->pmd))) {
- spin_unlock(vmf->ptl);
- goto map_pte;
- }
-
- mm_inc_nr_ptes(vma->vm_mm);
- pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
- spin_unlock(vmf->ptl);
- vmf->prealloc_pte = NULL;
- } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
- return VM_FAULT_OOM;
- }
-map_pte:
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
- * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
- * under us and then back to pmd_none, as a result of MADV_DONTNEED
- * running immediately after a huge pmd fault in a different thread of
- * this mm, in turn leading to a misleading pmd_trans_huge() retval.
- * All we have to ensure is that it is a regular pmd that we can walk
- * with pte_offset_map() and we can do that through an atomic read in
- * C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return VM_FAULT_NOPAGE;
-
- /*
- * At this point we know that our vmf->pmd points to a page of ptes
- * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
- * for the duration of the fault. If a racing MADV_DONTNEED runs and
- * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
- * be valid and we will re-check to make sure the vmf->pte isn't
- * pte_none() under vmf->ptl protection when we return to
- * alloc_set_pte().
- */
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
- return 0;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void deposit_prealloc_pte(struct vm_fault *vmf)
{
@@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf)
vmf->prealloc_pte = NULL;
}

-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -3769,45 +3709,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
}
#endif

-/**
- * alloc_set_pte - setup new PTE entry for given page and add reverse page
- * mapping. If needed, the function allocates page table or use pre-allocated.
- *
- * @vmf: fault environment
- * @page: page to map
- *
- * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on
- * return.
- *
- * Target users are page handler itself and implementations of
- * vm_ops->map_pages.
- *
- * Return: %0 on success, %VM_FAULT_ code in case of error.
- */
-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
+void do_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
pte_t entry;
- vm_fault_t ret;
-
- if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
- ret = do_set_pmd(vmf, page);
- if (ret != VM_FAULT_FALLBACK)
- return ret;
- }
-
- if (!vmf->pte) {
- ret = pte_alloc_one_map(vmf);
- if (ret)
- return ret;
- }
-
- /* Re-check under ptl */
- if (unlikely(!pte_none(*vmf->pte))) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
- return VM_FAULT_NOPAGE;
- }

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
@@ -3824,14 +3730,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
-
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, vmf->address, vmf->pte);
-
- return 0;
}

-
/**
* finish_fault - finish page fault once we have prepared the page to fault
*
@@ -3849,12 +3749,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
*/
vm_fault_t finish_fault(struct vm_fault *vmf)
{
+ struct vm_area_struct *vma = vmf->vma;
struct page *page;
- vm_fault_t ret = 0;
+ vm_fault_t ret;

/* Did we COW the page? */
- if ((vmf->flags & FAULT_FLAG_WRITE) &&
- !(vmf->vma->vm_flags & VM_SHARED))
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
page = vmf->cow_page;
else
page = vmf->page;
@@ -3863,13 +3763,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
* check even for read faults because we might have lost our CoWed
* page
*/
- if (!(vmf->vma->vm_flags & VM_SHARED))
- ret = check_stable_address_space(vmf->vma->vm_mm);
- if (!ret)
- ret = alloc_set_pte(vmf, page);
- if (vmf->pte)
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- return ret;
+ if (!(vma->vm_flags & VM_SHARED))
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ return ret;
+
+ if (pmd_none(*vmf->pmd)) {
+ if (PageTransCompound(page)) {
+ ret = do_set_pmd(vmf, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
+
+ if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+ return VM_FAULT_OOM;
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return 0;
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ /* Re-check under ptl */
+ if (likely(pte_none(*vmf->pte)))
+ do_set_pte(vmf, page);
+
+ update_mmu_tlb(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
}

static unsigned long fault_around_bytes __read_mostly =
@@ -3980,7 +3902,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
if (!pte_none(*vmf->pte))
ret = VM_FAULT_NOPAGE;
- pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
vmf->address = address;
vmf->pte = NULL;
@@ -4340,7 +4261,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
*/
vmf->pte = NULL;
} else {
- /* See comment in pte_alloc_one_map() */
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
+ * of pmd_trans_huge() to ensure the pmd didn't become
+ * pmd_trans_huge under us and then back to pmd_none, as a
+ * result of MADV_DONTNEED running immediately after a huge pmd
+ * fault in a different thread of this mm, in turn leading to a
+ * misleading pmd_trans_huge() retval. All we have to ensure is
+ * that it is a regular pmd that we can walk with
+ * pte_offset_map() and we can do that through an atomic read
+ * in C, which is what pmd_trans_unstable() provides.
+ */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
/*
--
Kirill A. Shutemov

2020-12-26 17:59:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Fri, Dec 25, 2020 at 3:31 AM Kirill A. Shutemov <[email protected]> wrote:
>
> The new helper next_page() returns a stabilized page, so filemap_map_pmd()
> can clearly decide if we should set up a page table or a huge page.

I really like that next_page() abstraction, my only comment is that I
think it should be renamed as "next_stable_page()" or something, and
then this part:

+ if (!page)
+ page = xas_find(xas, end_pgoff);
+ else
+ page = xas_next_entry(xas, end_pgoff);

should be in the caller.

Then just have two helper functions like 'first_map_page()' and
'next_map_page()' which just do

next_stable_page(xas_find(xas, end_pgoff))

and

next_stable_page(xas_next_entry(xas, end_pgoff))

respectively.
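
Spelled out, those two wrappers might look something like this (only a
sketch, assuming next_stable_page() keeps taking the vmf, xas and
end_pgoff arguments as in the patch above):

static inline struct page *first_map_page(struct vm_fault *vmf,
					  struct xa_state *xas,
					  pgoff_t end_pgoff)
{
	/* Start the walk at the first cache entry in the range. */
	return next_stable_page(xas_find(xas, end_pgoff),
				vmf, xas, end_pgoff);
}

static inline struct page *next_map_page(struct vm_fault *vmf,
					 struct xa_state *xas,
					 pgoff_t end_pgoff)
{
	/* Advance to the next entry and stabilize it the same way. */
	return next_stable_page(xas_next_entry(xas, end_pgoff),
				vmf, xas, end_pgoff);
}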

Because not only does that get rid of the "if (page)" test, I think it
would make things a bit clearer. When I read the patch first, the
initial "next_page()" call confused me.

But maybe I'm just grasping at straws. Even in this format, I think
it's a nice cleanup and makes for more understandable code.

Linus

2020-12-26 20:47:06

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sat, Dec 26, 2020 at 09:57:13AM -0800, Linus Torvalds wrote:
> Because not only does that get rid of the "if (page)" test, I think it
> would make things a bit clearer. When I read the patch first, the
> initial "next_page()" call confused me.

Agreed. Here we go:

From d12dea4abe94dbc24b7945329b191ad7d29e213a Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Sat, 19 Dec 2020 15:19:23 +0300
Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths

alloc_set_pte() has two users with different requirements: in the
faultaround code, it is called from an atomic context and the PTE page
table has to be preallocated. finish_fault() can sleep and allocate the
page table as needed.

PTL locking rules are also strange, hard to follow and overkill for
finish_fault().

Let's untangle the mess. alloc_set_pte() has gone now. All locking is
explicit.

The price is some code duplication to handle huge pages in the faultaround
path, but it should be fine, given the overall improvement in readability.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 8 +-
include/linux/pgtable.h | 11 +++
mm/filemap.c | 157 ++++++++++++++++++++++++++++----------
mm/memory.c | 162 ++++++++++++----------------------------
4 files changed, 181 insertions(+), 157 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..2825153ad0d6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -534,8 +534,8 @@ struct vm_fault {
* is not NULL, otherwise pmd.
*/
pgtable_t prealloc_pte; /* Pre-allocated pte page table.
- * vm_ops->map_pages() calls
- * alloc_set_pte() from atomic context.
+ * vm_ops->map_pages() sets up a page
+ * table from atomic context.
* do_fault_around() pre-allocates
* page table to avoid allocation from
* atomic context.
@@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page);
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
+void do_set_pte(struct vm_fault *vmf, struct page *page);
+
vm_fault_t finish_fault(struct vm_fault *vmf);
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#endif
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e237004d498d..869c1921ceda 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}

+/*
+ * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
+ * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
+ * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
+ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
+ */
+static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
+{
+ return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
+}
+
#ifndef CONFIG_NUMA_BALANCING
/*
* Technically a PTE can be PROTNONE even when not doing NUMA balancing but
diff --git a/mm/filemap.c b/mm/filemap.c
index 0b2067b3c328..3a92aaa59b9b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -42,6 +42,7 @@
#include <linux/psi.h>
#include <linux/ramfs.h>
#include <linux/page_idle.h>
+#include <asm/pgalloc.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
@@ -2831,50 +2832,127 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);

+static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
+{
+ struct mm_struct *mm = vmf->vma->vm_mm;
+
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*vmf->pmd))
+ return true;
+
+ if (pmd_none(*vmf->pmd) &&
+ PageTransHuge(page) &&
+ do_set_pmd(vmf, page)) {
+ unlock_page(page);
+ return true;
+ }
+
+ if (pmd_none(*vmf->pmd)) {
+ vmf->ptl = pmd_lock(mm, vmf->pmd);
+ if (likely(pmd_none(*vmf->pmd))) {
+ mm_inc_nr_ptes(mm);
+ pmd_populate(mm, vmf->pmd, vmf->prealloc_pte);
+ vmf->prealloc_pte = NULL;
+ }
+ spin_unlock(vmf->ptl);
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd)) {
+ unlock_page(page);
+ put_page(page);
+ return true;
+ }
+
+ return false;
+}
+
+static struct page *next_stable_page(struct page *page, struct vm_fault *vmf,
+ struct xa_state *xas, pgoff_t end_pgoff)
+{
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ unsigned long max_idx;
+
+ do {
+ if (!page)
+ return NULL;
+ if (xas_retry(xas, page))
+ continue;
+ if (xa_is_value(page))
+ continue;
+ if (PageLocked(page))
+ continue;
+ if (!page_cache_get_speculative(page))
+ continue;
+ /* Has the page moved or been split? */
+ if (unlikely(page != xas_reload(xas)))
+ goto skip;
+ if (!PageUptodate(page) || PageReadahead(page))
+ goto skip;
+ if (PageHWPoison(page))
+ goto skip;
+ if (!trylock_page(page))
+ goto skip;
+ if (page->mapping != mapping)
+ goto unlock;
+ if (!PageUptodate(page))
+ goto unlock;
+ max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
+ if (xas->xa_index >= max_idx)
+ goto unlock;
+ return page;
+unlock:
+ unlock_page(page);
+skip:
+ put_page(page);
+ } while ((page = xas_next_entry(xas, end_pgoff)) != NULL);
+
+ return NULL;
+}
+
+static inline struct page *first_map_page(struct vm_fault *vmf,
+ struct xa_state *xas,
+ pgoff_t end_pgoff)
+{
+ return next_stable_page(xas_find(xas, end_pgoff), vmf, xas, end_pgoff);
+}
+
+static inline struct page *next_map_page(struct vm_fault *vmf,
+ struct xa_state *xas,
+ pgoff_t end_pgoff)
+{
+ return next_stable_page(xas_next_entry(xas, end_pgoff),
+ vmf, xas, end_pgoff);
+}
+
void filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff)
{
- struct file *file = vmf->vma->vm_file;
+ struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
- unsigned long max_idx;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct page *head, *page;
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);

rcu_read_lock();
- xas_for_each(&xas, head, end_pgoff) {
- if (xas_retry(&xas, head))
- continue;
- if (xa_is_value(head))
- goto next;
+ head = first_map_page(vmf, &xas, end_pgoff);
+ if (!head) {
+ rcu_read_unlock();
+ return;
+ }

- /*
- * Check for a locked page first, as a speculative
- * reference may adversely influence page migration.
- */
- if (PageLocked(head))
- goto next;
- if (!page_cache_get_speculative(head))
- goto next;
+ if (filemap_map_pmd(vmf, head)) {
+ rcu_read_unlock();
+ return;
+ }

- /* Has the page moved or been split? */
- if (unlikely(head != xas_reload(&xas)))
- goto skip;
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ do {
page = find_subpage(head, xas.xa_index);
-
- if (!PageUptodate(head) ||
- PageReadahead(page) ||
- PageHWPoison(page))
- goto skip;
- if (!trylock_page(head))
- goto skip;
-
- if (head->mapping != mapping || !PageUptodate(head))
- goto unlock;
-
- max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
- if (xas.xa_index >= max_idx)
+ if (PageHWPoison(page))
goto unlock;

if (mmap_miss > 0)
@@ -2884,19 +2962,20 @@ void filemap_map_pages(struct vm_fault *vmf,
if (vmf->pte)
vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;
- if (alloc_set_pte(vmf, page))
+
+ if (!pte_none(*vmf->pte))
goto unlock;
+
+ do_set_pte(vmf, page);
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, vmf->address, vmf->pte);
unlock_page(head);
- goto next;
+ continue;
unlock:
unlock_page(head);
-skip:
put_page(head);
-next:
- /* Huge page is mapped? No need to proceed. */
- if (pmd_trans_huge(*vmf->pmd))
- break;
- }
+ } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
rcu_read_unlock();
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
}
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..98d340b45557 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;

- /* See the comment in pte_alloc_one_map() */
+ /* See comment in handle_pte_fault() */
if (unlikely(pmd_trans_unstable(vmf->pmd)))
return 0;

@@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
return ret;
}

-/*
- * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
- * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
- * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
- * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
- */
-static int pmd_devmap_trans_unstable(pmd_t *pmd)
-{
- return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
-}
-
-static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf)
-{
- struct vm_area_struct *vma = vmf->vma;
-
- if (!pmd_none(*vmf->pmd))
- goto map_pte;
- if (vmf->prealloc_pte) {
- vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- if (unlikely(!pmd_none(*vmf->pmd))) {
- spin_unlock(vmf->ptl);
- goto map_pte;
- }
-
- mm_inc_nr_ptes(vma->vm_mm);
- pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
- spin_unlock(vmf->ptl);
- vmf->prealloc_pte = NULL;
- } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
- return VM_FAULT_OOM;
- }
-map_pte:
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
- * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
- * under us and then back to pmd_none, as a result of MADV_DONTNEED
- * running immediately after a huge pmd fault in a different thread of
- * this mm, in turn leading to a misleading pmd_trans_huge() retval.
- * All we have to ensure is that it is a regular pmd that we can walk
- * with pte_offset_map() and we can do that through an atomic read in
- * C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return VM_FAULT_NOPAGE;
-
- /*
- * At this point we know that our vmf->pmd points to a page of ptes
- * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
- * for the duration of the fault. If a racing MADV_DONTNEED runs and
- * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
- * be valid and we will re-check to make sure the vmf->pte isn't
- * pte_none() under vmf->ptl protection when we return to
- * alloc_set_pte().
- */
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
- return 0;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void deposit_prealloc_pte(struct vm_fault *vmf)
{
@@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf)
vmf->prealloc_pte = NULL;
}

-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -3769,45 +3709,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
}
#endif

-/**
- * alloc_set_pte - setup new PTE entry for given page and add reverse page
- * mapping. If needed, the function allocates page table or use pre-allocated.
- *
- * @vmf: fault environment
- * @page: page to map
- *
- * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on
- * return.
- *
- * Target users are page handler itself and implementations of
- * vm_ops->map_pages.
- *
- * Return: %0 on success, %VM_FAULT_ code in case of error.
- */
-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
+void do_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
pte_t entry;
- vm_fault_t ret;
-
- if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
- ret = do_set_pmd(vmf, page);
- if (ret != VM_FAULT_FALLBACK)
- return ret;
- }
-
- if (!vmf->pte) {
- ret = pte_alloc_one_map(vmf);
- if (ret)
- return ret;
- }
-
- /* Re-check under ptl */
- if (unlikely(!pte_none(*vmf->pte))) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
- return VM_FAULT_NOPAGE;
- }

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
@@ -3824,14 +3730,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
-
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, vmf->address, vmf->pte);
-
- return 0;
}

-
/**
* finish_fault - finish page fault once we have prepared the page to fault
*
@@ -3849,12 +3749,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
*/
vm_fault_t finish_fault(struct vm_fault *vmf)
{
+ struct vm_area_struct *vma = vmf->vma;
struct page *page;
- vm_fault_t ret = 0;
+ vm_fault_t ret;

/* Did we COW the page? */
- if ((vmf->flags & FAULT_FLAG_WRITE) &&
- !(vmf->vma->vm_flags & VM_SHARED))
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
page = vmf->cow_page;
else
page = vmf->page;
@@ -3863,13 +3763,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
* check even for read faults because we might have lost our CoWed
* page
*/
- if (!(vmf->vma->vm_flags & VM_SHARED))
- ret = check_stable_address_space(vmf->vma->vm_mm);
- if (!ret)
- ret = alloc_set_pte(vmf, page);
- if (vmf->pte)
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- return ret;
+ if (!(vma->vm_flags & VM_SHARED))
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ return ret;
+
+ if (pmd_none(*vmf->pmd)) {
+ if (PageTransCompound(page)) {
+ ret = do_set_pmd(vmf, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
+
+ if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+ return VM_FAULT_OOM;
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return 0;
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ /* Re-check under ptl */
+ if (likely(pte_none(*vmf->pte)))
+ do_set_pte(vmf, page);
+
+ update_mmu_tlb(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
}

static unsigned long fault_around_bytes __read_mostly =
@@ -3980,7 +3902,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
if (!pte_none(*vmf->pte))
ret = VM_FAULT_NOPAGE;
- pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
vmf->address = address;
vmf->pte = NULL;
@@ -4340,7 +4261,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
*/
vmf->pte = NULL;
} else {
- /* See comment in pte_alloc_one_map() */
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
+ * of pmd_trans_huge() to ensure the pmd didn't become
+ * pmd_trans_huge under us and then back to pmd_none, as a
+ * result of MADV_DONTNEED running immediately after a huge pmd
+ * fault in a different thread of this mm, in turn leading to a
+ * misleading pmd_trans_huge() retval. All we have to ensure is
+ * that it is a regular pmd that we can walk with
+ * pte_offset_map() and we can do that through an atomic read
+ * in C, which is what pmd_trans_unstable() provides.
+ */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
/*
--
Kirill A. Shutemov

2020-12-26 21:06:36

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sat, 26 Dec 2020, Kirill A. Shutemov wrote:
> On Sat, Dec 26, 2020 at 09:57:13AM -0800, Linus Torvalds wrote:
> > Because not only does that get rid of the "if (page)" test, I think it
> > would make things a bit clearer. When I read the patch first, the
> > initial "next_page()" call confused me.
>
> Agreed. Here we go:
>
> From d12dea4abe94dbc24b7945329b191ad7d29e213a Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <[email protected]>
> Date: Sat, 19 Dec 2020 15:19:23 +0300
> Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths
>
> alloc_set_pte() has two users with different requirements: in the
> faultaround code, it is called from an atomic context and the PTE page
> table has to be preallocated. finish_fault() can sleep and allocate the
> page table as needed.
>
> PTL locking rules are also strange, hard to follow and overkill for
> finish_fault().
>
> Let's untangle the mess. alloc_set_pte() has gone now. All locking is
> explicit.
>
> The price is some code duplication to handle huge pages in the faultaround
> path, but it should be fine, given the overall improvement in readability.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

Hold on. I guess this one will suffer from the same bug as the previous.
I was about to report back, after satisfactory overnight testing of that
version - provided that one big little bug is fixed:

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2919,7 +2919,7 @@ static bool filemap_map_pmd(struct vm_fa

if (pmd_none(*vmf->pmd) &&
PageTransHuge(page) &&
- do_set_pmd(vmf, page)) {
+ do_set_pmd(vmf, page) == 0) {
unlock_page(page);
return true;
}

(Yes, you can write that as !do_set_pmd(vmf, page), and maybe I'm odd,
but even though it's very common, I have a personal aversion to using
"!' on a positive-sounding function that returns 0 for success.)

I'll give the new patch a try now, but with that fix added in. Without it,
I got "Bad page" on compound_mapcount on file THP pages - but I run with
a BUG() inside of bad_page() so I cannot miss them: I did not look to see
what the eventual crash or page leak would look like without that.

Hugh

2020-12-26 21:09:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

I was going to just apply this patch, because I like it so much, but
then I decided to take one last look, and:

On Sat, Dec 26, 2020 at 12:43 PM Kirill A. Shutemov
<[email protected]> wrote:
>
> +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> +{
> + struct mm_struct *mm = vmf->vma->vm_mm;
> +
> + /* Huge page is mapped? No need to proceed. */
> + if (pmd_trans_huge(*vmf->pmd))
> + return true;

doesn't this cause us to leak a locked page?

I get the feeling that every single "return true" case here should
always unlock the page and - with the exception of a successful
do_set_pmd() - do a "put_page()".

Which kind of argues that we should just do it in the caller (and get
an extra ref in the do_set_pmd() case, so that the caller can always
do

if (filemap_map_pmd(..)) {
unlock_page(page);
put_page(page);
rcu_read_unlock();
return;
}

and then there are no odd cases inside that filemap_map_pmd() function. Hmm?

Other than that, I really find it all much more legible.

Of course, if I'm wrong about the above, that just proves that I'm
missing something and it wasn't so legible after all..

Linus

2020-12-26 21:19:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sat, Dec 26, 2020 at 1:04 PM Hugh Dickins <[email protected]> wrote:
>
>
> Hold on. I guess this one will suffer from the same bug as the previous.
> I was about to report back, after satisfactory overnight testing of that
> version - provided that one big little bug is fixed:
>
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2919,7 +2919,7 @@ static bool filemap_map_pmd(struct vm_fa
>
> if (pmd_none(*vmf->pmd) &&
> PageTransHuge(page) &&
> - do_set_pmd(vmf, page)) {
> + do_set_pmd(vmf, page) == 0) {
> unlock_page(page);
> return true;
> }

I missed that entirely, because when just reading the patch it looks
fine and I didn't look at what the do_set_pmd() function returns outside
the patch.

And maybe it would be better to write it as

if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
vm_fault_t ret = do_set_pmd(vmf, page);
if (!ret) {
...

instead to make it a bit more explicit about how that return value is
a vm_fault_t there...

And see my other email about how I suspect there is still a leak in
that patch for the previous test-case.

Linus

2020-12-26 21:44:35

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sat, Dec 26, 2020 at 11:43:35PM +0300, Kirill A. Shutemov wrote:
> +static struct page *next_stable_page(struct page *page, struct vm_fault *vmf,
> + struct xa_state *xas, pgoff_t end_pgoff)

A "stable page" means one that doesn't currently have an outstanding
write (see wait_for_stable_page()). It's really "Next trivially
insertable page". I might call it "next_uptodate_page()" and gloss
over the other reasons this might fail to return a page.

2020-12-26 22:43:18

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sat, Dec 26, 2020 at 01:16:09PM -0800, Linus Torvalds wrote:
> On Sat, Dec 26, 2020 at 1:04 PM Hugh Dickins <[email protected]> wrote:
> >
> >
> > Hold on. I guess this one will suffer from the same bug as the previous.
> > I was about to report back, after satisfactory overnight testing of that
> > version - provided that one big little bug is fixed:
> >
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2919,7 +2919,7 @@ static bool filemap_map_pmd(struct vm_fa
> >
> > if (pmd_none(*vmf->pmd) &&
> > PageTransHuge(page) &&
> > - do_set_pmd(vmf, page)) {
> > + do_set_pmd(vmf, page) == 0) {
> > unlock_page(page);
> > return true;
> > }
>
> I missed that entirely, because when just reading the patch it looks
> fine and I didn't look at what the do_set_pmd() function returns outside
> the patch.
>
> And maybe it would be better to write it as
>
> if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
> vm_fault_t ret = do_set_pmd(vmf, page);
> if (!ret) {
> ...
>
> instead to make it a bit more explicit about how that return value is
> a vm_fault_t there...
>
> And see my other email about how I suspect there is still a leak in
> that patch for the previous test-case.

Ughh...

Here's the fixup I have so far. It doesn't blow up immediately, but please
take a closer look. Who knows what stupid mistake I made this time. :/

diff --git a/mm/filemap.c b/mm/filemap.c
index 3a92aaa59b9b..c4b374678e7d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2837,16 +2837,21 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
struct mm_struct *mm = vmf->vma->vm_mm;

/* Huge page is mapped? No need to proceed. */
- if (pmd_trans_huge(*vmf->pmd))
- return true;
-
- if (pmd_none(*vmf->pmd) &&
- PageTransHuge(page) &&
- do_set_pmd(vmf, page)) {
+ if (pmd_trans_huge(*vmf->pmd)) {
unlock_page(page);
+ put_page(page);
return true;
}

+ if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
+ vm_fault_t ret = do_set_pmd(vmf, page);
+ if (!ret) {
+ /* The page is mapped successfully, reference consumed. */
+ unlock_page(page);
+ return true;
+ }
+ }
+
if (pmd_none(*vmf->pmd)) {
vmf->ptl = pmd_lock(mm, vmf->pmd);
if (likely(pmd_none(*vmf->pmd))) {
@@ -2867,7 +2872,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
return false;
}

-static struct page *next_stable_page(struct page *page, struct vm_fault *vmf,
+static struct page *next_uptodate_page(struct page *page, struct vm_fault *vmf,
struct xa_state *xas, pgoff_t end_pgoff)
{
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
@@ -2914,15 +2919,16 @@ static inline struct page *first_map_page(struct vm_fault *vmf,
struct xa_state *xas,
pgoff_t end_pgoff)
{
- return next_stable_page(xas_find(xas, end_pgoff), vmf, xas, end_pgoff);
+ return next_uptodate_page(xas_find(xas, end_pgoff),
+ vmf, xas, end_pgoff);
}

static inline struct page *next_map_page(struct vm_fault *vmf,
struct xa_state *xas,
pgoff_t end_pgoff)
{
- return next_stable_page(xas_next_entry(xas, end_pgoff),
- vmf, xas, end_pgoff);
+ return next_uptodate_page(xas_next_entry(xas, end_pgoff),
+ vmf, xas, end_pgoff);
}

void filemap_map_pages(struct vm_fault *vmf,
--
Kirill A. Shutemov

2020-12-27 00:48:45

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sun, 27 Dec 2020, Kirill A. Shutemov wrote:
>
> Here's the fixup I have so far. It doesn't blow up immediately, but please
> take a closer look. Who knows what stupid mistake I made this time. :/

It's been running fine on x86_64 for a couple of hours (but of course
my testing is deficient, in not detecting the case Linus spotted).

But I just thought I'd try it on i386 (hadn't tried previous versions)
and this has a new disappointment: crashes when booting, in the "check
if the page fault is solved" in do_fault_around(). I imagine a highmem
issue with kmap of the pte address, but I'm reporting now before looking
into it further (but verified that current linux.git i386 boots up fine).

Maybe easily fixed: but does suggest this needs exposure in linux-next.

Hugh

2020-12-27 02:41:40

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sat, 26 Dec 2020, Hugh Dickins wrote:
> On Sun, 27 Dec 2020, Kirill A. Shutemov wrote:
> >
> > Here's the fixup I have so far. It doesn't blow up immediately, but please
> > take a closer look. Who knows what stupid mistake I made this time. :/
>
> It's been running fine on x86_64 for a couple of hours (but of course
> my testing is deficient, in not detecting the case Linus spotted).
>
> But I just thought I'd try it on i386 (hadn't tried previous versions)
> and this has a new disappointment: crashes when booting, in the "check
> if the page fault is solved" in do_fault_around(). I imagine a highmem
> issue with kmap of the pte address, but I'm reporting now before looking
> into it further (but verified that current linux.git i386 boots up fine).

This patch (like its antecedents) moves the pte_unmap_unlock() from
after do_fault_around()'s "check if the page fault is solved" into
filemap_map_pages() itself (which apparently does not NULLify vmf->pte
after unmapping it, which is poor, but good for revealing this issue).
That looks cleaner, but of course there was a very good reason for its
original positioning.

Maybe you want to change the ->map_pages prototype, to pass down the
requested address too, so that it can report whether the requested
address was resolved or not. Or it could be left to __do_fault(),
or even to a repeated fault; but those would be less efficient.
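
For instance (only a sketch of one possible shape, where the extra address
argument and the vm_fault_t return are assumptions rather than a settled
interface), the hook might become:

	vm_fault_t (*map_pages)(struct vm_fault *vmf, unsigned long address,
				pgoff_t start_pgoff, pgoff_t end_pgoff);

and do_fault_around() could then do something like

	/* ->map_pages() reports whether 'address' itself got mapped. */
	ret = vmf->vma->vm_ops->map_pages(vmf, address, start_pgoff, end_pgoff);

instead of peeking at vmf->pte after the page table has been unlocked.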

>
> Maybe easily fixed: but does suggest this needs exposure in linux-next.
>
> Hugh

2020-12-27 19:42:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sat, Dec 26, 2020 at 6:38 PM Hugh Dickins <[email protected]> wrote:
>
> This patch (like its antecedents) moves the pte_unmap_unlock() from
> after do_fault_around()'s "check if the page fault is solved" into
> filemap_map_pages() itself (which apparently does not NULLify vmf->pte
> after unmapping it, which is poor, but good for revealing this issue).
> That looks cleaner, but of course there was a very good reason for its
> original positioning.

Good catch.

> Maybe you want to change the ->map_pages prototype, to pass down the
> requested address too, so that it can report whether the requested
> address was resolved or not. Or it could be left to __do_fault(),
> or even to a repeated fault; but those would be less efficient.

Let's keep the old really odd "let's unlock in the caller" for now,
and minimize the changes.

Adding a big big comment at the end of filemap_map_pages() to note the
odd delayed page table unlocking.

Here's an updated patch that combines Kirill's original patch, his
additional incremental patch, and the fix for the pte lock oddity into
one thing.

Does this finally pass your testing?

Linus


Attachments:
0001-mm-Cleanup-faultaround-and-finish_fault-codepaths.patch (15.34 kB)

2020-12-27 20:35:33

by Damian Tometzki

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sun, 27. Dec 11:38, Linus Torvalds wrote:
> On Sat, Dec 26, 2020 at 6:38 PM Hugh Dickins <[email protected]> wrote:
> >
> > This patch (like its antecedents) moves the pte_unmap_unlock() from
> > after do_fault_around()'s "check if the page fault is solved" into
> > filemap_map_pages() itself (which apparently does not NULLify vmf->pte
> > after unmapping it, which is poor, but good for revealing this issue).
> > That looks cleaner, but of course there was a very good reason for its
> > original positioning.
>
> Good catch.
>
> > Maybe you want to change the ->map_pages prototype, to pass down the
> > requested address too, so that it can report whether the requested
> > address was resolved or not. Or it could be left to __do_fault(),
> > or even to a repeated fault; but those would be less efficient.
>
> Let's keep the old really odd "let's unlock in the caller" for now,
> and minimize the changes.
>
> Adding a big big comment at the end of filemap_map_pages() to note the
> odd delayed page table unlocking.
>
> Here's an updated patch that combines Kirill's original patch, his
> additional incremental patch, and the fix for the pte lock oddity into
> one thing.
>
> Does this finally pass your testing?
>
> Linus
Hello everyone,

When I try to build this patch, I get the following error:

CC arch/x86/kernel/cpu/mce/threshold.o
mm/memory.c:3716:19: error: static declaration of ‘do_set_pmd’ follows non-static declaration
static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
^~~~~~~~~~
In file included from mm/memory.c:43:
./include/linux/mm.h:984:12: note: previous declaration of ‘do_set_pmd’ was here
vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
^~~~~~~~~~
make[3]: *** [scripts/Makefile.build:279: mm/memory.o] Error 1
make[2]: *** [Makefile:1805: mm] Error 2
make[2]: *** Waiting for unfinished jobs....
CC arch/x86/kernel/cpu/mce/therm_throt.o

Best regards
Damian

> From 4d221d934d112aa40c3f4978460be098fc9ce831 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <[email protected]>
> Date: Sat, 19 Dec 2020 15:19:23 +0300
> Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths
>
> alloc_set_pte() has two users with different requirements: in the
> faultaround code, it is called from an atomic context and the PTE page
> table has to be preallocated. finish_fault() can sleep and allocate the
> page table as needed.
>
> PTL locking rules are also strange, hard to follow and overkill for
> finish_fault().
>
> Let's untangle the mess. alloc_set_pte() has gone now. All locking is
> explicit.
>
> The price is some code duplication to handle huge pages in the faultaround
> path, but it should be fine, given the overall improvement in readability.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
> ---
> include/linux/mm.h | 8 +-
> include/linux/pgtable.h | 11 +++
> mm/filemap.c | 168 ++++++++++++++++++++++++++++++----------
> mm/memory.c | 161 +++++++++++---------------------------
> 4 files changed, 192 insertions(+), 156 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5299b90a6c40..c0643a0ad5ff 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -535,8 +535,8 @@ struct vm_fault {
> * is not NULL, otherwise pmd.
> */
> pgtable_t prealloc_pte; /* Pre-allocated pte page table.
> - * vm_ops->map_pages() calls
> - * alloc_set_pte() from atomic context.
> + * vm_ops->map_pages() sets up a page
> + * table from atomic context.
> * do_fault_around() pre-allocates
> * page table to avoid allocation from
> * atomic context.
> @@ -981,7 +981,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> return pte;
> }
>
> -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page);
> +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
> +void do_set_pte(struct vm_fault *vmf, struct page *page);
> +
> vm_fault_t finish_fault(struct vm_fault *vmf);
> vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
> #endif
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 8fcdfa52eb4b..36eb748f3c97 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1314,6 +1314,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
> #endif
> }
>
> +/*
> + * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
> + * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
> + * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
> + * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
> + */
> +static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
> +{
> + return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
> +}
> +
> #ifndef CONFIG_NUMA_BALANCING
> /*
> * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 5c9d564317a5..dbc2eda92a53 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -42,6 +42,7 @@
> #include <linux/psi.h>
> #include <linux/ramfs.h>
> #include <linux/page_idle.h>
> +#include <asm/pgalloc.h>
> #include "internal.h"
>
> #define CREATE_TRACE_POINTS
> @@ -2911,50 +2912,133 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> }
> EXPORT_SYMBOL(filemap_fault);
>
> +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
> +{
> + struct mm_struct *mm = vmf->vma->vm_mm;
> +
> + /* Huge page is mapped? No need to proceed. */
> + if (pmd_trans_huge(*vmf->pmd)) {
> + unlock_page(page);
> + put_page(page);
> + return true;
> + }
> +
> + if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
> + vm_fault_t ret = do_set_pmd(vmf, page);
> + if (!ret) {
> + /* The page is mapped successfully, reference consumed. */
> + unlock_page(page);
> + return true;
> + }
> + }
> +
> + if (pmd_none(*vmf->pmd)) {
> + vmf->ptl = pmd_lock(mm, vmf->pmd);
> + if (likely(pmd_none(*vmf->pmd))) {
> + mm_inc_nr_ptes(mm);
> + pmd_populate(mm, vmf->pmd, vmf->prealloc_pte);
> + vmf->prealloc_pte = NULL;
> + }
> + spin_unlock(vmf->ptl);
> + }
> +
> + /* See comment in handle_pte_fault() */
> + if (pmd_devmap_trans_unstable(vmf->pmd)) {
> + unlock_page(page);
> + put_page(page);
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static struct page *next_uptodate_page(struct page *page, struct vm_fault *vmf,
> + struct xa_state *xas, pgoff_t end_pgoff)
> +{
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> + unsigned long max_idx;
> +
> + do {
> + if (!page)
> + return NULL;
> + if (xas_retry(xas, page))
> + continue;
> + if (xa_is_value(page))
> + continue;
> + if (PageLocked(page))
> + continue;
> + if (!page_cache_get_speculative(page))
> + continue;
> + /* Has the page moved or been split? */
> + if (unlikely(page != xas_reload(xas)))
> + goto skip;
> + if (!PageUptodate(page) || PageReadahead(page))
> + goto skip;
> + if (PageHWPoison(page))
> + goto skip;
> + if (!trylock_page(page))
> + goto skip;
> + if (page->mapping != mapping)
> + goto unlock;
> + if (!PageUptodate(page))
> + goto unlock;
> + max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
> + if (xas->xa_index >= max_idx)
> + goto unlock;
> + return page;
> +unlock:
> + unlock_page(page);
> +skip:
> + put_page(page);
> + } while ((page = xas_next_entry(xas, end_pgoff)) != NULL);
> +
> + return NULL;
> +}
> +
> +static inline struct page *first_map_page(struct vm_fault *vmf,
> + struct xa_state *xas,
> + pgoff_t end_pgoff)
> +{
> + return next_uptodate_page(xas_find(xas, end_pgoff),
> + vmf, xas, end_pgoff);
> +}
> +
> +static inline struct page *next_map_page(struct vm_fault *vmf,
> + struct xa_state *xas,
> + pgoff_t end_pgoff)
> +{
> + return next_uptodate_page(xas_next_entry(xas, end_pgoff),
> + vmf, xas, end_pgoff);
> +}
> +
> void filemap_map_pages(struct vm_fault *vmf,
> pgoff_t start_pgoff, pgoff_t end_pgoff)
> {
> - struct file *file = vmf->vma->vm_file;
> + struct vm_area_struct *vma = vmf->vma;
> + struct file *file = vma->vm_file;
> struct address_space *mapping = file->f_mapping;
> pgoff_t last_pgoff = start_pgoff;
> - unsigned long max_idx;
> XA_STATE(xas, &mapping->i_pages, start_pgoff);
> struct page *head, *page;
> unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
>
> rcu_read_lock();
> - xas_for_each(&xas, head, end_pgoff) {
> - if (xas_retry(&xas, head))
> - continue;
> - if (xa_is_value(head))
> - goto next;
> + head = first_map_page(vmf, &xas, end_pgoff);
> + if (!head) {
> + rcu_read_unlock();
> + return;
> + }
>
> - /*
> - * Check for a locked page first, as a speculative
> - * reference may adversely influence page migration.
> - */
> - if (PageLocked(head))
> - goto next;
> - if (!page_cache_get_speculative(head))
> - goto next;
> + if (filemap_map_pmd(vmf, head)) {
> + rcu_read_unlock();
> + return;
> + }
>
> - /* Has the page moved or been split? */
> - if (unlikely(head != xas_reload(&xas)))
> - goto skip;
> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> + vmf->address, &vmf->ptl);
> + do {
> page = find_subpage(head, xas.xa_index);
> -
> - if (!PageUptodate(head) ||
> - PageReadahead(page) ||
> - PageHWPoison(page))
> - goto skip;
> - if (!trylock_page(head))
> - goto skip;
> -
> - if (head->mapping != mapping || !PageUptodate(head))
> - goto unlock;
> -
> - max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
> - if (xas.xa_index >= max_idx)
> + if (PageHWPoison(page))
> goto unlock;
>
> if (mmap_miss > 0)
> @@ -2964,19 +3048,25 @@ void filemap_map_pages(struct vm_fault *vmf,
> if (vmf->pte)
> vmf->pte += xas.xa_index - last_pgoff;
> last_pgoff = xas.xa_index;
> - if (alloc_set_pte(vmf, page))
> +
> + if (!pte_none(*vmf->pte))
> goto unlock;
> +
> + do_set_pte(vmf, page);
> + /* no need to invalidate: a not-present page won't be cached */
> + update_mmu_cache(vma, vmf->address, vmf->pte);
> unlock_page(head);
> - goto next;
> + continue;
> unlock:
> unlock_page(head);
> -skip:
> put_page(head);
> -next:
> - /* Huge page is mapped? No need to proceed. */
> - if (pmd_trans_huge(*vmf->pmd))
> - break;
> - }
> + } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
> +
> + /*
> + * NOTE! We return with the pte still locked! It is unlocked
> + * by do_fault_around() after it has tested whether the target
> + * address got filled in.
> + */
> rcu_read_unlock();
> WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
> }
> diff --git a/mm/memory.c b/mm/memory.c
> index 7d608765932b..07a408c7d38b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3501,7 +3501,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> if (pte_alloc(vma->vm_mm, vmf->pmd))
> return VM_FAULT_OOM;
>
> - /* See the comment in pte_alloc_one_map() */
> + /* See comment in handle_pte_fault() */
> if (unlikely(pmd_trans_unstable(vmf->pmd)))
> return 0;
>
> @@ -3641,66 +3641,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
> return ret;
> }
>
> -/*
> - * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
> - * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
> - * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
> - * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
> - */
> -static int pmd_devmap_trans_unstable(pmd_t *pmd)
> -{
> - return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
> -}
> -
> -static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf)
> -{
> - struct vm_area_struct *vma = vmf->vma;
> -
> - if (!pmd_none(*vmf->pmd))
> - goto map_pte;
> - if (vmf->prealloc_pte) {
> - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> - if (unlikely(!pmd_none(*vmf->pmd))) {
> - spin_unlock(vmf->ptl);
> - goto map_pte;
> - }
> -
> - mm_inc_nr_ptes(vma->vm_mm);
> - pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
> - spin_unlock(vmf->ptl);
> - vmf->prealloc_pte = NULL;
> - } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
> - return VM_FAULT_OOM;
> - }
> -map_pte:
> - /*
> - * If a huge pmd materialized under us just retry later. Use
> - * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
> - * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
> - * under us and then back to pmd_none, as a result of MADV_DONTNEED
> - * running immediately after a huge pmd fault in a different thread of
> - * this mm, in turn leading to a misleading pmd_trans_huge() retval.
> - * All we have to ensure is that it is a regular pmd that we can walk
> - * with pte_offset_map() and we can do that through an atomic read in
> - * C, which is what pmd_trans_unstable() provides.
> - */
> - if (pmd_devmap_trans_unstable(vmf->pmd))
> - return VM_FAULT_NOPAGE;
> -
> - /*
> - * At this point we know that our vmf->pmd points to a page of ptes
> - * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
> - * for the duration of the fault. If a racing MADV_DONTNEED runs and
> - * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
> - * be valid and we will re-check to make sure the vmf->pte isn't
> - * pte_none() under vmf->ptl protection when we return to
> - * alloc_set_pte().
> - */
> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> - &vmf->ptl);
> - return 0;
> -}
> -
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> static void deposit_prealloc_pte(struct vm_fault *vmf)
> {
> @@ -3715,7 +3655,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf)
> vmf->prealloc_pte = NULL;
> }
>
> -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> {
> struct vm_area_struct *vma = vmf->vma;
> bool write = vmf->flags & FAULT_FLAG_WRITE;
> @@ -3780,45 +3720,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> }
> #endif
>
> -/**
> - * alloc_set_pte - setup new PTE entry for given page and add reverse page
> - * mapping. If needed, the function allocates page table or use pre-allocated.
> - *
> - * @vmf: fault environment
> - * @page: page to map
> - *
> - * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on
> - * return.
> - *
> - * Target users are page handler itself and implementations of
> - * vm_ops->map_pages.
> - *
> - * Return: %0 on success, %VM_FAULT_ code in case of error.
> - */
> -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
> +void do_set_pte(struct vm_fault *vmf, struct page *page)
> {
> struct vm_area_struct *vma = vmf->vma;
> bool write = vmf->flags & FAULT_FLAG_WRITE;
> pte_t entry;
> - vm_fault_t ret;
> -
> - if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
> - ret = do_set_pmd(vmf, page);
> - if (ret != VM_FAULT_FALLBACK)
> - return ret;
> - }
> -
> - if (!vmf->pte) {
> - ret = pte_alloc_one_map(vmf);
> - if (ret)
> - return ret;
> - }
> -
> - /* Re-check under ptl */
> - if (unlikely(!pte_none(*vmf->pte))) {
> - update_mmu_tlb(vma, vmf->address, vmf->pte);
> - return VM_FAULT_NOPAGE;
> - }
>
> flush_icache_page(vma, page);
> entry = mk_pte(page, vma->vm_page_prot);
> @@ -3835,14 +3741,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
> page_add_file_rmap(page, false);
> }
> set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> -
> - /* no need to invalidate: a not-present page won't be cached */
> - update_mmu_cache(vma, vmf->address, vmf->pte);
> -
> - return 0;
> }
>
> -
> /**
> * finish_fault - finish page fault once we have prepared the page to fault
> *
> @@ -3860,12 +3760,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
> */
> vm_fault_t finish_fault(struct vm_fault *vmf)
> {
> + struct vm_area_struct *vma = vmf->vma;
> struct page *page;
> - vm_fault_t ret = 0;
> + vm_fault_t ret;
>
> /* Did we COW the page? */
> - if ((vmf->flags & FAULT_FLAG_WRITE) &&
> - !(vmf->vma->vm_flags & VM_SHARED))
> + if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> page = vmf->cow_page;
> else
> page = vmf->page;
> @@ -3874,13 +3774,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> * check even for read faults because we might have lost our CoWed
> * page
> */
> - if (!(vmf->vma->vm_flags & VM_SHARED))
> - ret = check_stable_address_space(vmf->vma->vm_mm);
> - if (!ret)
> - ret = alloc_set_pte(vmf, page);
> - if (vmf->pte)
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> - return ret;
> + if (!(vma->vm_flags & VM_SHARED))
> + ret = check_stable_address_space(vma->vm_mm);
> + if (ret)
> + return ret;
> +
> + if (pmd_none(*vmf->pmd)) {
> + if (PageTransCompound(page)) {
> + ret = do_set_pmd(vmf, page);
> + if (ret != VM_FAULT_FALLBACK)
> + return ret;
> + }
> +
> + if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
> + return VM_FAULT_OOM;
> + }
> +
> + /* See comment in handle_pte_fault() */
> + if (pmd_devmap_trans_unstable(vmf->pmd))
> + return 0;
> +
> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> + vmf->address, &vmf->ptl);
> + /* Re-check under ptl */
> + if (likely(pte_none(*vmf->pte)))
> + do_set_pte(vmf, page);
> +
> + update_mmu_tlb(vma, vmf->address, vmf->pte);
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + return 0;
> }
>
> static unsigned long fault_around_bytes __read_mostly =
> @@ -4351,7 +4273,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> */
> vmf->pte = NULL;
> } else {
> - /* See comment in pte_alloc_one_map() */
> + /*
> + * If a huge pmd materialized under us just retry later. Use
> + * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
> + * of pmd_trans_huge() to ensure the pmd didn't become
> + * pmd_trans_huge under us and then back to pmd_none, as a
> + * result of MADV_DONTNEED running immediately after a huge pmd
> + * fault in a different thread of this mm, in turn leading to a
> + * misleading pmd_trans_huge() retval. All we have to ensure is
> + * that it is a regular pmd that we can walk with
> + * pte_offset_map() and we can do that through an atomic read
> + * in C, which is what pmd_trans_unstable() provides.
> + */
> if (pmd_devmap_trans_unstable(vmf->pmd))
> return 0;
> /*
> --
> 2.29.2.157.g1d47791a39
>

2020-12-27 22:38:28

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sun, 27 Dec 2020, Damian Tometzki wrote:
> On Sun, 27. Dec 11:38, Linus Torvalds wrote:
> > On Sat, Dec 26, 2020 at 6:38 PM Hugh Dickins <[email protected]> wrote:
> > >
> > > This patch (like its antecedents) moves the pte_unmap_unlock() from
> > > after do_fault_around()'s "check if the page fault is solved" into
> > > filemap_map_pages() itself (which apparently does not NULLify vmf->pte
> > > after unmapping it, which is poor, but good for revealing this issue).
> > > That looks cleaner, but of course there was a very good reason for its
> > > original positioning.
> >
> > Good catch.
> >
> > > Maybe you want to change the ->map_pages prototype, to pass down the
> > > requested address too, so that it can report whether the requested
> > > address was resolved or not. Or it could be left to __do_fault(),
> > > or even to a repeated fault; but those would be less efficient.
> >
> > Let's keep the old really odd "let's unlock in the caller" for now,
> > and minimize the changes.
> >
> > Adding a big big comment at the end of filemap_map_pages() to note the
> > odd delayed page table unlocking.
> >
> > Here's an updated patch that combines Kirill's original patch, his
> > additional incremental patch, and the fix for the pte lock oddity into
> > one thing.
> >
> > Does this finally pass your testing?

Yes, this one passes my testing on x86_64 and on i386. But...

> >
> > Linus
> Hello everyone,
>
> When I try to build this patch, I get the following error:
>
> CC arch/x86/kernel/cpu/mce/threshold.o
> mm/memory.c:3716:19: error: static declaration of ‘do_set_pmd’ follows non-static declaration
> static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> ^~~~~~~~~~
> In file included from mm/memory.c:43:
> ./include/linux/mm.h:984:12: note: previous declaration of ‘do_set_pmd’ was here
> vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
> ^~~~~~~~~~
> make[3]: *** [scripts/Makefile.build:279: mm/memory.o] Error 1
> make[2]: *** [Makefile:1805: mm] Error 2
> make[2]: *** Waiting for unfinished jobs....
> CC arch/x86/kernel/cpu/mce/therm_throt.o

... Damian very helpfully reports that it does not build when
CONFIG_TRANSPARENT_HUGEPAGE is not set, since the "static " has
not been removed from the alternative definition of do_set_pmd().

And its BUILD_BUG() becomes invalid once it's globally available.
You don't like unnecessary BUG()s, and I don't like returning
success there: VM_FAULT_FALLBACK seems best.

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3713,10 +3713,9 @@ out:
return ret;
}
#else
-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
- BUILD_BUG();
- return 0;
+ return VM_FAULT_FALLBACK;
}
#endif


(I'm also a wee bit worried by filemap.c's +#include <asm/pgalloc.h>:
that's the kind of thing that might turn out not to work on some arch.)

Hugh

2020-12-27 23:16:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sun, Dec 27, 2020 at 2:35 PM Hugh Dickins <[email protected]> wrote:
>
> Yes, this one passes my testing on x86_64 and on i386.

Ok, good.

> But...
>
> ... Damian very helpfully reports that it does not build when
> CONFIG_TRANSPARENT_HUGEPAGE is not set, since the "static " has
> not been removed from the alternative definition of do_set_pmd().

Ok, your fix for that folded in, and here's yet another version.
Damian, hopefully your configuration builds and works too now.

But obviously it's too late for this merge window that I'm just about
to close, so I hope Andrew can pick this up - I hope it's now in good
enough shape to go into -mm and then we can do it next merge window or
something.

Andrew?

Linus


Attachments:
0001-mm-Cleanup-faultaround-and-finish_fault-codepaths.patch (15.63 kB)

2020-12-27 23:42:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sun, Dec 27, 2020 at 3:12 PM Linus Torvalds
<[email protected]> wrote:
>
> Ok, your fix for that folded in, and here's yet another version.

Still not good.

I don't know what happened, but the change of

- vm_fault_t ret = 0;
+ vm_fault_t ret;

is very very wrong. The next user is

+ if (!(vma->vm_flags & VM_SHARED))
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ return ret;

so now 'ret' will potentially be used uninitialized (although this is
the kind of thing that a compiler might almost accidentally end up
fixing - with a single dominating assignment, I could imagine the
compiler moving the test to that assignment and thus "fixing" the code
without really even meaning to).

I think Kirill was intending to move the "if (ret)" up into the path
that sets it, IOW something like

+ if (!(vma->vm_flags & VM_SHARED)) {
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ return ret;
+ }

instead. But that patch as-is is broken.

Kirill?

Linus

2020-12-27 23:50:54

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sun, Dec 27, 2020 at 11:38:22AM -0800, Linus Torvalds wrote:
> On Sat, Dec 26, 2020 at 6:38 PM Hugh Dickins <[email protected]> wrote:
> >
> > This patch (like its antecedents) moves the pte_unmap_unlock() from
> > after do_fault_around()'s "check if the page fault is solved" into
> > filemap_map_pages() itself (which apparently does not NULLify vmf->pte
> > after unmapping it, which is poor, but good for revealing this issue).
> > That looks cleaner, but of course there was a very good reason for its
> > original positioning.
>
> Good catch.
>
> > Maybe you want to change the ->map_pages prototype, to pass down the
> > requested address too, so that it can report whether the requested
> > address was resolved or not. Or it could be left to __do_fault(),
> > or even to a repeated fault; but those would be less efficient.
>
> Let's keep the old really odd "let's unlock in the caller" for now,
> and minimize the changes.

I did what Hugh proposed and it got clearer to my eyes. It gets somewhat
large, but take a look.

The patch below incorporates all fixups from the thread.

From 3496fc6bde3f6ed0c98d811dda02465bf86fdb90 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Sat, 19 Dec 2020 15:19:23 +0300
Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths

alloc_set_pte() has two users with different requirements: in the
faultaround code, it is called from an atomic context and the PTE page
table has to be preallocated. finish_fault() can sleep and allocate the
page table as needed.

The PTL locking rules are also strange, hard to follow and overkill for
finish_fault().

Let's untangle the mess. alloc_set_pte() is gone now. All locking is
explicit.

The price is some code duplication to handle huge pages in the
faultaround path, but that should be fine given the overall improvement
in readability.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/xfs/xfs_file.c | 7 +-
include/linux/mm.h | 12 +--
include/linux/pgtable.h | 11 +++
mm/filemap.c | 172 +++++++++++++++++++++++++++---------
mm/memory.c | 191 +++++++++++-----------------------------
5 files changed, 206 insertions(+), 187 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5b0f93f73837..040fa64291ed 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1319,17 +1319,20 @@ xfs_filemap_pfn_mkwrite(
return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true);
}

-static void
+static vm_fault_t
xfs_filemap_map_pages(
struct vm_fault *vmf,
+ unsigned long address,
pgoff_t start_pgoff,
pgoff_t end_pgoff)
{
struct inode *inode = file_inode(vmf->vma->vm_file);
+ vm_fault_t ret;

xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
- filemap_map_pages(vmf, start_pgoff, end_pgoff);
+ ret = filemap_map_pages(vmf, address, start_pgoff, end_pgoff);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+ return ret;
}

static const struct vm_operations_struct xfs_file_vm_ops = {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..4e1d2945a71a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -534,8 +534,8 @@ struct vm_fault {
* is not NULL, otherwise pmd.
*/
pgtable_t prealloc_pte; /* Pre-allocated pte page table.
- * vm_ops->map_pages() calls
- * alloc_set_pte() from atomic context.
+ * vm_ops->map_pages() sets up a page
+ * table from from atomic context.
* do_fault_around() pre-allocates
* page table to avoid allocation from
* atomic context.
@@ -562,7 +562,7 @@ struct vm_operations_struct {
vm_fault_t (*fault)(struct vm_fault *vmf);
vm_fault_t (*huge_fault)(struct vm_fault *vmf,
enum page_entry_size pe_size);
- void (*map_pages)(struct vm_fault *vmf,
+ vm_fault_t (*map_pages)(struct vm_fault *vmf, unsigned long address,
pgoff_t start_pgoff, pgoff_t end_pgoff);
unsigned long (*pagesize)(struct vm_area_struct * area);

@@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page);
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
+void do_set_pte(struct vm_fault *vmf, struct page *page);
+
vm_fault_t finish_fault(struct vm_fault *vmf);
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#endif
@@ -2621,7 +2623,7 @@ extern void truncate_inode_pages_final(struct address_space *);

/* generic vm_area_ops exported for stackable file systems */
extern vm_fault_t filemap_fault(struct vm_fault *vmf);
-extern void filemap_map_pages(struct vm_fault *vmf,
+extern vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address,
pgoff_t start_pgoff, pgoff_t end_pgoff);
extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e237004d498d..869c1921ceda 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}

+/*
+ * the ordering of these checks is important for pmds with _page_devmap set.
+ * if we check pmd_trans_unstable() first we will trip the bad_pmd() check
+ * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly
+ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
+ */
+static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
+{
+ return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
+}
+
#ifndef CONFIG_NUMA_BALANCING
/*
* Technically a PTE can be PROTNONE even when not doing NUMA balancing but
diff --git a/mm/filemap.c b/mm/filemap.c
index 0b2067b3c328..c5da09f3f363 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -42,6 +42,7 @@
#include <linux/psi.h>
#include <linux/ramfs.h>
#include <linux/page_idle.h>
+#include <asm/pgalloc.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
@@ -2831,50 +2832,134 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);

-void filemap_map_pages(struct vm_fault *vmf,
+static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
+{
+ struct mm_struct *mm = vmf->vma->vm_mm;
+
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*vmf->pmd)) {
+ unlock_page(page);
+ put_page(page);
+ return true;
+ }
+
+ if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
+ vm_fault_t ret = do_set_pmd(vmf, page);
+ if (!ret) {
+ /* The page is mapped successfully, reference consumed. */
+ unlock_page(page);
+ return true;
+ }
+ }
+
+ if (pmd_none(*vmf->pmd)) {
+ vmf->ptl = pmd_lock(mm, vmf->pmd);
+ if (likely(pmd_none(*vmf->pmd))) {
+ mm_inc_nr_ptes(mm);
+ pmd_populate(mm, vmf->pmd, vmf->prealloc_pte);
+ vmf->prealloc_pte = NULL;
+ }
+ spin_unlock(vmf->ptl);
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd)) {
+ unlock_page(page);
+ put_page(page);
+ return true;
+ }
+
+ return false;
+}
+
+static struct page *next_uptodate_page(struct page *page, struct vm_fault *vmf,
+ struct xa_state *xas, pgoff_t end_pgoff)
+{
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ unsigned long max_idx;
+
+ do {
+ if (!page)
+ return NULL;
+ if (xas_retry(xas, page))
+ continue;
+ if (xa_is_value(page))
+ continue;
+ if (PageLocked(page))
+ continue;
+ if (!page_cache_get_speculative(page))
+ continue;
+ /* Has the page moved or been split? */
+ if (unlikely(page != xas_reload(xas)))
+ goto skip;
+ if (!PageUptodate(page) || PageReadahead(page))
+ goto skip;
+ if (PageHWPoison(page))
+ goto skip;
+ if (!trylock_page(page))
+ goto skip;
+ if (page->mapping != mapping)
+ goto unlock;
+ if (!PageUptodate(page))
+ goto unlock;
+ max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
+ if (xas->xa_index >= max_idx)
+ goto unlock;
+ return page;
+unlock:
+ unlock_page(page);
+skip:
+ put_page(page);
+ } while ((page = xas_next_entry(xas, end_pgoff)) != NULL);
+
+ return NULL;
+}
+
+static inline struct page *first_map_page(struct vm_fault *vmf,
+ struct xa_state *xas,
+ pgoff_t end_pgoff)
+{
+ return next_uptodate_page(xas_find(xas, end_pgoff),
+ vmf, xas, end_pgoff);
+}
+
+static inline struct page *next_map_page(struct vm_fault *vmf,
+ struct xa_state *xas,
+ pgoff_t end_pgoff)
+{
+ return next_uptodate_page(xas_next_entry(xas, end_pgoff),
+ vmf, xas, end_pgoff);
+}
+
+vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address,
pgoff_t start_pgoff, pgoff_t end_pgoff)
{
- struct file *file = vmf->vma->vm_file;
+ struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
- unsigned long max_idx;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct page *head, *page;
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
+ vm_fault_t ret = 0;

rcu_read_lock();
- xas_for_each(&xas, head, end_pgoff) {
- if (xas_retry(&xas, head))
- continue;
- if (xa_is_value(head))
- goto next;
+ head = first_map_page(vmf, &xas, end_pgoff);
+ if (!head) {
+ rcu_read_unlock();
+ return 0;
+ }

- /*
- * Check for a locked page first, as a speculative
- * reference may adversely influence page migration.
- */
- if (PageLocked(head))
- goto next;
- if (!page_cache_get_speculative(head))
- goto next;
+ if (filemap_map_pmd(vmf, head)) {
+ rcu_read_unlock();
+ return VM_FAULT_NOPAGE;
+ }

- /* Has the page moved or been split? */
- if (unlikely(head != xas_reload(&xas)))
- goto skip;
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ do {
page = find_subpage(head, xas.xa_index);
-
- if (!PageUptodate(head) ||
- PageReadahead(page) ||
- PageHWPoison(page))
- goto skip;
- if (!trylock_page(head))
- goto skip;
-
- if (head->mapping != mapping || !PageUptodate(head))
- goto unlock;
-
- max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
- if (xas.xa_index >= max_idx)
+ if (PageHWPoison(page))
goto unlock;

if (mmap_miss > 0)
@@ -2884,21 +2969,28 @@ void filemap_map_pages(struct vm_fault *vmf,
if (vmf->pte)
vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;
- if (alloc_set_pte(vmf, page))
+
+ if (!pte_none(*vmf->pte))
goto unlock;
+
+ do_set_pte(vmf, page);
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, vmf->address, vmf->pte);
unlock_page(head);
- goto next;
+
+ /* The fault is handled */
+ if (vmf->address == address)
+ ret = VM_FAULT_NOPAGE;
+ continue;
unlock:
unlock_page(head);
-skip:
put_page(head);
-next:
- /* Huge page is mapped? No need to proceed. */
- if (pmd_trans_huge(*vmf->pmd))
- break;
- }
+ } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
rcu_read_unlock();
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
+
+ return ret;
}
EXPORT_SYMBOL(filemap_map_pages);

diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..247abfcd60d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;

- /* See the comment in pte_alloc_one_map() */
+ /* See comment in handle_pte_fault() */
if (unlikely(pmd_trans_unstable(vmf->pmd)))
return 0;

@@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
return ret;
}

-/*
- * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
- * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
- * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
- * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
- */
-static int pmd_devmap_trans_unstable(pmd_t *pmd)
-{
- return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
-}
-
-static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf)
-{
- struct vm_area_struct *vma = vmf->vma;
-
- if (!pmd_none(*vmf->pmd))
- goto map_pte;
- if (vmf->prealloc_pte) {
- vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- if (unlikely(!pmd_none(*vmf->pmd))) {
- spin_unlock(vmf->ptl);
- goto map_pte;
- }
-
- mm_inc_nr_ptes(vma->vm_mm);
- pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
- spin_unlock(vmf->ptl);
- vmf->prealloc_pte = NULL;
- } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
- return VM_FAULT_OOM;
- }
-map_pte:
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
- * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
- * under us and then back to pmd_none, as a result of MADV_DONTNEED
- * running immediately after a huge pmd fault in a different thread of
- * this mm, in turn leading to a misleading pmd_trans_huge() retval.
- * All we have to ensure is that it is a regular pmd that we can walk
- * with pte_offset_map() and we can do that through an atomic read in
- * C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return VM_FAULT_NOPAGE;
-
- /*
- * At this point we know that our vmf->pmd points to a page of ptes
- * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
- * for the duration of the fault. If a racing MADV_DONTNEED runs and
- * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
- * be valid and we will re-check to make sure the vmf->pte isn't
- * pte_none() under vmf->ptl protection when we return to
- * alloc_set_pte().
- */
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
- return 0;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void deposit_prealloc_pte(struct vm_fault *vmf)
{
@@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf)
vmf->prealloc_pte = NULL;
}

-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -3762,52 +3702,17 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
return ret;
}
#else
-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
- BUILD_BUG();
- return 0;
+ return VM_FAULT_FALLBACK;
}
#endif

-/**
- * alloc_set_pte - setup new PTE entry for given page and add reverse page
- * mapping. If needed, the function allocates page table or use pre-allocated.
- *
- * @vmf: fault environment
- * @page: page to map
- *
- * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on
- * return.
- *
- * Target users are page handler itself and implementations of
- * vm_ops->map_pages.
- *
- * Return: %0 on success, %VM_FAULT_ code in case of error.
- */
-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
+void do_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
pte_t entry;
- vm_fault_t ret;
-
- if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
- ret = do_set_pmd(vmf, page);
- if (ret != VM_FAULT_FALLBACK)
- return ret;
- }
-
- if (!vmf->pte) {
- ret = pte_alloc_one_map(vmf);
- if (ret)
- return ret;
- }
-
- /* Re-check under ptl */
- if (unlikely(!pte_none(*vmf->pte))) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
- return VM_FAULT_NOPAGE;
- }

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
@@ -3824,14 +3729,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
-
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, vmf->address, vmf->pte);
-
- return 0;
}

-
/**
* finish_fault - finish page fault once we have prepared the page to fault
*
@@ -3849,12 +3748,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
*/
vm_fault_t finish_fault(struct vm_fault *vmf)
{
+ struct vm_area_struct *vma = vmf->vma;
struct page *page;
- vm_fault_t ret = 0;
+ vm_fault_t ret;

/* Did we COW the page? */
- if ((vmf->flags & FAULT_FLAG_WRITE) &&
- !(vmf->vma->vm_flags & VM_SHARED))
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
page = vmf->cow_page;
else
page = vmf->page;
@@ -3863,13 +3762,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
* check even for read faults because we might have lost our CoWed
* page
*/
- if (!(vmf->vma->vm_flags & VM_SHARED))
- ret = check_stable_address_space(vmf->vma->vm_mm);
- if (!ret)
- ret = alloc_set_pte(vmf, page);
- if (vmf->pte)
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- return ret;
+ if (!(vma->vm_flags & VM_SHARED))
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ return ret;
+
+ if (pmd_none(*vmf->pmd)) {
+ if (PageTransCompound(page)) {
+ ret = do_set_pmd(vmf, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
+
+ if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+ return VM_FAULT_OOM;
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return 0;
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ /* Re-check under ptl */
+ if (likely(pte_none(*vmf->pte)))
+ do_set_pte(vmf, page);
+
+ update_mmu_tlb(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
}

static unsigned long fault_around_bytes __read_mostly =
@@ -3938,7 +3859,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
pgoff_t start_pgoff = vmf->pgoff;
pgoff_t end_pgoff;
int off;
- vm_fault_t ret = 0;

nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
@@ -3960,31 +3880,11 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
if (pmd_none(*vmf->pmd)) {
vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
if (!vmf->prealloc_pte)
- goto out;
+ return VM_FAULT_OOM;
smp_wmb(); /* See comment in __pte_alloc() */
}

- vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
-
- /* Huge page is mapped? Page fault is solved */
- if (pmd_trans_huge(*vmf->pmd)) {
- ret = VM_FAULT_NOPAGE;
- goto out;
- }
-
- /* ->map_pages() haven't done anything useful. Cold page cache? */
- if (!vmf->pte)
- goto out;
-
- /* check if the page fault is solved */
- vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
- if (!pte_none(*vmf->pte))
- ret = VM_FAULT_NOPAGE;
- pte_unmap_unlock(vmf->pte, vmf->ptl);
-out:
- vmf->address = address;
- vmf->pte = NULL;
- return ret;
+ return vmf->vma->vm_ops->map_pages(vmf, address, start_pgoff, end_pgoff);
}

static vm_fault_t do_read_fault(struct vm_fault *vmf)
@@ -4340,7 +4240,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
*/
vmf->pte = NULL;
} else {
- /* See comment in pte_alloc_one_map() */
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
+ * of pmd_trans_huge() to ensure the pmd didn't become
+ * pmd_trans_huge under us and then back to pmd_none, as a
+ * result of MADV_DONTNEED running immediately after a huge pmd
+ * fault in a different thread of this mm, in turn leading to a
+ * misleading pmd_trans_huge() retval. All we have to ensure is
+ * that it is a regular pmd that we can walk with
+ * pte_offset_map() and we can do that through an atomic read
+ * in C, which is what pmd_trans_unstable() provides.
+ */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
/*
--
Kirill A. Shutemov

2020-12-27 23:56:33

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sun, Dec 27, 2020 at 03:40:36PM -0800, Linus Torvalds wrote:
> I think Kirill was intending to move the "if (ret)" up into the path
> that sets it, IOW something like
>
> + if (!(vma->vm_flags & VM_SHARED)) {
> + ret = check_stable_address_space(vma->vm_mm);
> + if (ret)
> + return ret;
> + }
>
> instead. But that patch as-is is broken.
>
> Kirill?

Right. Once again, patch is below.

From d50eff470de025f602711ef546aa87d0da1727f5 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Sat, 19 Dec 2020 15:19:23 +0300
Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths

alloc_set_pte() has two users with different requirements: in the
faultaround code, it is called from an atomic context and the PTE page
table has to be preallocated. finish_fault() can sleep and allocate the
page table as needed.

The PTL locking rules are also strange, hard to follow and overkill for
finish_fault().

Let's untangle the mess. alloc_set_pte() is gone now. All locking is
explicit.

The price is some code duplication to handle huge pages in the
faultaround path, but that should be fine given the overall improvement
in readability.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/xfs/xfs_file.c | 7 +-
include/linux/mm.h | 12 +--
include/linux/pgtable.h | 11 +++
mm/filemap.c | 172 ++++++++++++++++++++++++++---------
mm/memory.c | 192 +++++++++++-----------------------------
5 files changed, 207 insertions(+), 187 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5b0f93f73837..040fa64291ed 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1319,17 +1319,20 @@ xfs_filemap_pfn_mkwrite(
return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true);
}

-static void
+static vm_fault_t
xfs_filemap_map_pages(
struct vm_fault *vmf,
+ unsigned long address,
pgoff_t start_pgoff,
pgoff_t end_pgoff)
{
struct inode *inode = file_inode(vmf->vma->vm_file);
+ vm_fault_t ret;

xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
- filemap_map_pages(vmf, start_pgoff, end_pgoff);
+ ret = filemap_map_pages(vmf, address, start_pgoff, end_pgoff);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+ return ret;
}

static const struct vm_operations_struct xfs_file_vm_ops = {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..4e1d2945a71a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -534,8 +534,8 @@ struct vm_fault {
* is not NULL, otherwise pmd.
*/
pgtable_t prealloc_pte; /* Pre-allocated pte page table.
- * vm_ops->map_pages() calls
- * alloc_set_pte() from atomic context.
+ * vm_ops->map_pages() sets up a page
+ * table from from atomic context.
* do_fault_around() pre-allocates
* page table to avoid allocation from
* atomic context.
@@ -562,7 +562,7 @@ struct vm_operations_struct {
vm_fault_t (*fault)(struct vm_fault *vmf);
vm_fault_t (*huge_fault)(struct vm_fault *vmf,
enum page_entry_size pe_size);
- void (*map_pages)(struct vm_fault *vmf,
+ vm_fault_t (*map_pages)(struct vm_fault *vmf, unsigned long address,
pgoff_t start_pgoff, pgoff_t end_pgoff);
unsigned long (*pagesize)(struct vm_area_struct * area);

@@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page);
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
+void do_set_pte(struct vm_fault *vmf, struct page *page);
+
vm_fault_t finish_fault(struct vm_fault *vmf);
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#endif
@@ -2621,7 +2623,7 @@ extern void truncate_inode_pages_final(struct address_space *);

/* generic vm_area_ops exported for stackable file systems */
extern vm_fault_t filemap_fault(struct vm_fault *vmf);
-extern void filemap_map_pages(struct vm_fault *vmf,
+extern vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address,
pgoff_t start_pgoff, pgoff_t end_pgoff);
extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e237004d498d..869c1921ceda 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}

+/*
+ * the ordering of these checks is important for pmds with _page_devmap set.
+ * if we check pmd_trans_unstable() first we will trip the bad_pmd() check
+ * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly
+ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
+ */
+static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
+{
+ return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
+}
+
#ifndef CONFIG_NUMA_BALANCING
/*
* Technically a PTE can be PROTNONE even when not doing NUMA balancing but
diff --git a/mm/filemap.c b/mm/filemap.c
index 0b2067b3c328..c5da09f3f363 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -42,6 +42,7 @@
#include <linux/psi.h>
#include <linux/ramfs.h>
#include <linux/page_idle.h>
+#include <asm/pgalloc.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
@@ -2831,50 +2832,134 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);

-void filemap_map_pages(struct vm_fault *vmf,
+static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
+{
+ struct mm_struct *mm = vmf->vma->vm_mm;
+
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*vmf->pmd)) {
+ unlock_page(page);
+ put_page(page);
+ return true;
+ }
+
+ if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
+ vm_fault_t ret = do_set_pmd(vmf, page);
+ if (!ret) {
+ /* The page is mapped successfully, reference consumed. */
+ unlock_page(page);
+ return true;
+ }
+ }
+
+ if (pmd_none(*vmf->pmd)) {
+ vmf->ptl = pmd_lock(mm, vmf->pmd);
+ if (likely(pmd_none(*vmf->pmd))) {
+ mm_inc_nr_ptes(mm);
+ pmd_populate(mm, vmf->pmd, vmf->prealloc_pte);
+ vmf->prealloc_pte = NULL;
+ }
+ spin_unlock(vmf->ptl);
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd)) {
+ unlock_page(page);
+ put_page(page);
+ return true;
+ }
+
+ return false;
+}
+
+static struct page *next_uptodate_page(struct page *page, struct vm_fault *vmf,
+ struct xa_state *xas, pgoff_t end_pgoff)
+{
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ unsigned long max_idx;
+
+ do {
+ if (!page)
+ return NULL;
+ if (xas_retry(xas, page))
+ continue;
+ if (xa_is_value(page))
+ continue;
+ if (PageLocked(page))
+ continue;
+ if (!page_cache_get_speculative(page))
+ continue;
+ /* Has the page moved or been split? */
+ if (unlikely(page != xas_reload(xas)))
+ goto skip;
+ if (!PageUptodate(page) || PageReadahead(page))
+ goto skip;
+ if (PageHWPoison(page))
+ goto skip;
+ if (!trylock_page(page))
+ goto skip;
+ if (page->mapping != mapping)
+ goto unlock;
+ if (!PageUptodate(page))
+ goto unlock;
+ max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
+ if (xas->xa_index >= max_idx)
+ goto unlock;
+ return page;
+unlock:
+ unlock_page(page);
+skip:
+ put_page(page);
+ } while ((page = xas_next_entry(xas, end_pgoff)) != NULL);
+
+ return NULL;
+}
+
+static inline struct page *first_map_page(struct vm_fault *vmf,
+ struct xa_state *xas,
+ pgoff_t end_pgoff)
+{
+ return next_uptodate_page(xas_find(xas, end_pgoff),
+ vmf, xas, end_pgoff);
+}
+
+static inline struct page *next_map_page(struct vm_fault *vmf,
+ struct xa_state *xas,
+ pgoff_t end_pgoff)
+{
+ return next_uptodate_page(xas_next_entry(xas, end_pgoff),
+ vmf, xas, end_pgoff);
+}
+
+vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address,
pgoff_t start_pgoff, pgoff_t end_pgoff)
{
- struct file *file = vmf->vma->vm_file;
+ struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
- unsigned long max_idx;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct page *head, *page;
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
+ vm_fault_t ret = 0;

rcu_read_lock();
- xas_for_each(&xas, head, end_pgoff) {
- if (xas_retry(&xas, head))
- continue;
- if (xa_is_value(head))
- goto next;
+ head = first_map_page(vmf, &xas, end_pgoff);
+ if (!head) {
+ rcu_read_unlock();
+ return 0;
+ }

- /*
- * Check for a locked page first, as a speculative
- * reference may adversely influence page migration.
- */
- if (PageLocked(head))
- goto next;
- if (!page_cache_get_speculative(head))
- goto next;
+ if (filemap_map_pmd(vmf, head)) {
+ rcu_read_unlock();
+ return VM_FAULT_NOPAGE;
+ }

- /* Has the page moved or been split? */
- if (unlikely(head != xas_reload(&xas)))
- goto skip;
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ do {
page = find_subpage(head, xas.xa_index);
-
- if (!PageUptodate(head) ||
- PageReadahead(page) ||
- PageHWPoison(page))
- goto skip;
- if (!trylock_page(head))
- goto skip;
-
- if (head->mapping != mapping || !PageUptodate(head))
- goto unlock;
-
- max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
- if (xas.xa_index >= max_idx)
+ if (PageHWPoison(page))
goto unlock;

if (mmap_miss > 0)
@@ -2884,21 +2969,28 @@ void filemap_map_pages(struct vm_fault *vmf,
if (vmf->pte)
vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;
- if (alloc_set_pte(vmf, page))
+
+ if (!pte_none(*vmf->pte))
goto unlock;
+
+ do_set_pte(vmf, page);
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, vmf->address, vmf->pte);
unlock_page(head);
- goto next;
+
+ /* The fault is handled */
+ if (vmf->address == address)
+ ret = VM_FAULT_NOPAGE;
+ continue;
unlock:
unlock_page(head);
-skip:
put_page(head);
-next:
- /* Huge page is mapped? No need to proceed. */
- if (pmd_trans_huge(*vmf->pmd))
- break;
- }
+ } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
rcu_read_unlock();
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
+
+ return ret;
}
EXPORT_SYMBOL(filemap_map_pages);

diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..e51638b92e7c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;

- /* See the comment in pte_alloc_one_map() */
+ /* See comment in handle_pte_fault() */
if (unlikely(pmd_trans_unstable(vmf->pmd)))
return 0;

@@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
return ret;
}

-/*
- * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
- * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
- * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
- * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
- */
-static int pmd_devmap_trans_unstable(pmd_t *pmd)
-{
- return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
-}
-
-static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf)
-{
- struct vm_area_struct *vma = vmf->vma;
-
- if (!pmd_none(*vmf->pmd))
- goto map_pte;
- if (vmf->prealloc_pte) {
- vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- if (unlikely(!pmd_none(*vmf->pmd))) {
- spin_unlock(vmf->ptl);
- goto map_pte;
- }
-
- mm_inc_nr_ptes(vma->vm_mm);
- pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
- spin_unlock(vmf->ptl);
- vmf->prealloc_pte = NULL;
- } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
- return VM_FAULT_OOM;
- }
-map_pte:
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
- * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
- * under us and then back to pmd_none, as a result of MADV_DONTNEED
- * running immediately after a huge pmd fault in a different thread of
- * this mm, in turn leading to a misleading pmd_trans_huge() retval.
- * All we have to ensure is that it is a regular pmd that we can walk
- * with pte_offset_map() and we can do that through an atomic read in
- * C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return VM_FAULT_NOPAGE;
-
- /*
- * At this point we know that our vmf->pmd points to a page of ptes
- * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
- * for the duration of the fault. If a racing MADV_DONTNEED runs and
- * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
- * be valid and we will re-check to make sure the vmf->pte isn't
- * pte_none() under vmf->ptl protection when we return to
- * alloc_set_pte().
- */
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
- return 0;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void deposit_prealloc_pte(struct vm_fault *vmf)
{
@@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf)
vmf->prealloc_pte = NULL;
}

-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -3762,52 +3702,17 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
return ret;
}
#else
-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
- BUILD_BUG();
- return 0;
+ return VM_FAULT_FALLBACK;
}
#endif

-/**
- * alloc_set_pte - setup new PTE entry for given page and add reverse page
- * mapping. If needed, the function allocates page table or use pre-allocated.
- *
- * @vmf: fault environment
- * @page: page to map
- *
- * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on
- * return.
- *
- * Target users are page handler itself and implementations of
- * vm_ops->map_pages.
- *
- * Return: %0 on success, %VM_FAULT_ code in case of error.
- */
-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
+void do_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
pte_t entry;
- vm_fault_t ret;
-
- if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
- ret = do_set_pmd(vmf, page);
- if (ret != VM_FAULT_FALLBACK)
- return ret;
- }
-
- if (!vmf->pte) {
- ret = pte_alloc_one_map(vmf);
- if (ret)
- return ret;
- }
-
- /* Re-check under ptl */
- if (unlikely(!pte_none(*vmf->pte))) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
- return VM_FAULT_NOPAGE;
- }

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
@@ -3824,14 +3729,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
-
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, vmf->address, vmf->pte);
-
- return 0;
}

-
/**
* finish_fault - finish page fault once we have prepared the page to fault
*
@@ -3849,12 +3748,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
*/
vm_fault_t finish_fault(struct vm_fault *vmf)
{
+ struct vm_area_struct *vma = vmf->vma;
struct page *page;
- vm_fault_t ret = 0;
+ vm_fault_t ret;

/* Did we COW the page? */
- if ((vmf->flags & FAULT_FLAG_WRITE) &&
- !(vmf->vma->vm_flags & VM_SHARED))
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
page = vmf->cow_page;
else
page = vmf->page;
@@ -3863,13 +3762,36 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
* check even for read faults because we might have lost our CoWed
* page
*/
- if (!(vmf->vma->vm_flags & VM_SHARED))
- ret = check_stable_address_space(vmf->vma->vm_mm);
- if (!ret)
- ret = alloc_set_pte(vmf, page);
- if (vmf->pte)
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- return ret;
+ if (!(vma->vm_flags & VM_SHARED)) {
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ return ret;
+ }
+
+ if (pmd_none(*vmf->pmd)) {
+ if (PageTransCompound(page)) {
+ ret = do_set_pmd(vmf, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
+
+ if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+ return VM_FAULT_OOM;
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return 0;
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ /* Re-check under ptl */
+ if (likely(pte_none(*vmf->pte)))
+ do_set_pte(vmf, page);
+
+ update_mmu_tlb(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
}

static unsigned long fault_around_bytes __read_mostly =
@@ -3938,7 +3860,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
pgoff_t start_pgoff = vmf->pgoff;
pgoff_t end_pgoff;
int off;
- vm_fault_t ret = 0;

nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
@@ -3960,31 +3881,11 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
if (pmd_none(*vmf->pmd)) {
vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
if (!vmf->prealloc_pte)
- goto out;
+ return VM_FAULT_OOM;
smp_wmb(); /* See comment in __pte_alloc() */
}

- vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
-
- /* Huge page is mapped? Page fault is solved */
- if (pmd_trans_huge(*vmf->pmd)) {
- ret = VM_FAULT_NOPAGE;
- goto out;
- }
-
- /* ->map_pages() haven't done anything useful. Cold page cache? */
- if (!vmf->pte)
- goto out;
-
- /* check if the page fault is solved */
- vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
- if (!pte_none(*vmf->pte))
- ret = VM_FAULT_NOPAGE;
- pte_unmap_unlock(vmf->pte, vmf->ptl);
-out:
- vmf->address = address;
- vmf->pte = NULL;
- return ret;
+ return vmf->vma->vm_ops->map_pages(vmf, address, start_pgoff, end_pgoff);
}

static vm_fault_t do_read_fault(struct vm_fault *vmf)
@@ -4340,7 +4241,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
*/
vmf->pte = NULL;
} else {
- /* See comment in pte_alloc_one_map() */
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
+ * of pmd_trans_huge() to ensure the pmd didn't become
+ * pmd_trans_huge under us and then back to pmd_none, as a
+ * result of MADV_DONTNEED running immediately after a huge pmd
+ * fault in a different thread of this mm, in turn leading to a
+ * misleading pmd_trans_huge() retval. All we have to ensure is
+ * that it is a regular pmd that we can walk with
+ * pte_offset_map() and we can do that through an atomic read
+ * in C, which is what pmd_trans_unstable() provides.
+ */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
/*
--
2.26.2

--
Kirill A. Shutemov

2020-12-28 01:56:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sun, Dec 27, 2020 at 3:48 PM Kirill A. Shutemov <[email protected]> wrote:
>
> I did what Hugh proposed and it got clear to my eyes. It gets somewhat
> large, but take a look.

Ok, it's not that much bigger, and the end result is certainly much
clearer wrt locking.

So that last version of yours with the fix for the uninitialized 'ret'
variable looks good to me.

Of course, I've said that before, and have been wrong. So ...

Linus

2020-12-28 06:46:35

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sun, 27 Dec 2020, Linus Torvalds wrote:
> On Sun, Dec 27, 2020 at 3:48 PM Kirill A. Shutemov <[email protected]> wrote:
> >
> > I did what Hugh proposed and it got clear to my eyes. It gets somewhat
> > large, but take a look.
>
> Ok, it's not that much bigger, and the end result is certainly much
> clearer wrt locking.
>
> So that last version of yours with the fix for the uninitialized 'ret'
> variable looks good to me.
>
> Of course, I've said that before, and have been wrong. So ...

And guess what... it's broken.

I folded it into testing rc1: segfault on cc1, systemd
"Journal file corrupted, rotating", seen on more than one machine.

I've backed it out, rc1 itself seems fine, I'll leave rc1 under
load overnight, then come back to the faultaround patch tomorrow;
won't glance at it tonight, but maybe Kirill will guess what's wrong.

Hugh

2020-12-28 12:56:43

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Sun, Dec 27, 2020 at 10:43:44PM -0800, Hugh Dickins wrote:
> On Sun, 27 Dec 2020, Linus Torvalds wrote:
> > On Sun, Dec 27, 2020 at 3:48 PM Kirill A. Shutemov <[email protected]> wrote:
> > >
> > > I did what Hugh proposed and it got clear to my eyes. It gets somewhat
> > > large, but take a look.
> >
> > Ok, it's not that much bigger, and the end result is certainly much
> > clearer wrt locking.
> >
> > So that last version of yours with the fix for the uninitialized 'ret'
> > variable looks good to me.
> >
> > Of course, I've said that before, and have been wrong. So ...
>
> And guess what... it's broken.
>
> I folded it into testing rc1: segfault on cc1, systemd
> "Journal file corrupted, rotating", seen on more than one machine.
>
> I've backed it out, rc1 itself seems fine, I'll leave rc1 under
> load overnight, then come back to the faultaround patch tomorrow;
> won't glance at it tonight, but maybe Kirill will guess what's wrong.

So far I have only found one more pin leak and an always-true check. I don't
see how it can lead to a crash or corruption. I'll keep looking.

diff --git a/mm/filemap.c b/mm/filemap.c
index c5da09f3f363..87671284de62 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2966,8 +2966,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address,
mmap_miss--;

vmf->address += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
- if (vmf->pte)
- vmf->pte += xas.xa_index - last_pgoff;
+ vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;

if (!pte_none(*vmf->pte))
diff --git a/mm/memory.c b/mm/memory.c
index e51638b92e7c..829f5056dd1c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3785,13 +3785,16 @@ vm_fault_t finish_fault(struct vm_fault *vmf)

vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
+ ret = 0;
/* Re-check under ptl */
if (likely(pte_none(*vmf->pte)))
do_set_pte(vmf, page);
+ else
+ ret = VM_FAULT_NOPAGE;

update_mmu_tlb(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
- return 0;
+ return ret;
}

static unsigned long fault_around_bytes __read_mostly =
--
Kirill A. Shutemov

2020-12-29 01:23:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov <[email protected]> wrote:
>
> So far I only found one more pin leak and always-true check. I don't see
> how can it lead to crash or corruption. Keep looking.

Well, I noticed that the nommu.c version of filemap_map_pages() needs
fixing, but that's obviously not the case Hugh sees.

No, I think the problem is the

pte_unmap_unlock(vmf->pte, vmf->ptl);

at the end of filemap_map_pages().

Why?

Because we've been updating vmf->pte as we go along:

vmf->pte += xas.xa_index - last_pgoff;

and I think that by the time we get to that "pte_unmap_unlock()",
vmf->pte potentially points to past the edge of the page directory.

I think that is the bug that Hugh sees - simply because we get
entirely confused about the page table locking. And it would match the
latest change, which was all about moving that unlock from the caller
to filemap_map_pages(), and now it's missing the pte fixup..

I personally think it's wrong to update vmf->pte at all. We should
just have a local 'ptep' pointer that we update as we walk along. But
that requires another change to the calling convention, namely to
"do_set_pte()".

Also, considering how complicated this patch is getting, I think it
might be good to try to split it up a bit.

In particular, I think the calling convention change for
"filemap_map_pages()" could be done first and independently. And then
as a second step, move the VM_FAULT_NOPAGE and "pte_unmap_lock()" from
the callers to filemap_map_pages().

And then only as the final step, do that nice re-organization with
first_map_page/next_map_page() and moving the locking from
alloc_set_pte() into filemap_map_pages()..

How does that sound?

Anyway, Hugh, if it's about overshooting the pte pointer, maybe this
absolutely horrendous and disgusting patch fixes it without the above
kinds of more extensive cleanups. Worth testing, perhaps, even if it's
too ugly for words?

Linus


Attachments:
patch (1.33 kB)

2020-12-29 01:28:22

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Tue, Dec 29, 2020 at 01:05:48AM +0300, Kirill A. Shutemov wrote:
> On Mon, Dec 28, 2020 at 10:47:36AM -0800, Linus Torvalds wrote:
> > On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov <[email protected]> wrote:
> > >
> > > So far I only found one more pin leak and always-true check. I don't see
> > > how can it lead to crash or corruption. Keep looking.
> >
> > Well, I noticed that the nommu.c version of filemap_map_pages() needs
> > fixing, but that's obviously not the case Hugh sees.
> >
> > No,m I think the problem is the
> >
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
> >
> > at the end of filemap_map_pages().
> >
> > Why?
> >
> > Because we've been updating vmf->pte as we go along:
> >
> > vmf->pte += xas.xa_index - last_pgoff;
> >
> > and I think that by the time we get to that "pte_unmap_unlock()",
> > vmf->pte potentially points to past the edge of the page directory.
>
> Well, if it's true we have bigger problem: we set up an pte entry without
> relevant PTL.
>
> But I *think* we should be fine here: do_fault_around() limits start_pgoff
> and end_pgoff to stay within the page table.
>
> It made mw looking at the code around pte_unmap_unlock() and I think that
> the bug is that we have to reset vmf->address and NULLify vmf->pte once we
> are done with faultaround:
>
> diff --git a/mm/memory.c b/mm/memory.c

Ugh.. Wrong place. Need to sleep.

I'll look into your idea tomorrow.

diff --git a/mm/filemap.c b/mm/filemap.c
index 87671284de62..e4daab80ed81 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2987,6 +2987,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address,
} while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
pte_unmap_unlock(vmf->pte, vmf->ptl);
rcu_read_unlock();
+ vmf->address = address;
+ vmf->pte = NULL;
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);

return ret;
--
Kirill A. Shutemov

2020-12-29 01:30:29

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Mon, Dec 28, 2020 at 10:47:36AM -0800, Linus Torvalds wrote:
> On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov <[email protected]> wrote:
> >
> > So far I only found one more pin leak and always-true check. I don't see
> > how can it lead to crash or corruption. Keep looking.
>
> Well, I noticed that the nommu.c version of filemap_map_pages() needs
> fixing, but that's obviously not the case Hugh sees.
>
> No,m I think the problem is the
>
> pte_unmap_unlock(vmf->pte, vmf->ptl);
>
> at the end of filemap_map_pages().
>
> Why?
>
> Because we've been updating vmf->pte as we go along:
>
> vmf->pte += xas.xa_index - last_pgoff;
>
> and I think that by the time we get to that "pte_unmap_unlock()",
> vmf->pte potentially points to past the edge of the page directory.

Well, if that's true we have a bigger problem: we set up a pte entry
without the relevant PTL.

But I *think* we should be fine here: do_fault_around() limits start_pgoff
and end_pgoff to stay within the page table.
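
For reference, the clamping there currently looks roughly like this
(trimmed sketch of do_fault_around()):

        nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
        mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;

        vmf->address = max(address & mask, vmf->vma->vm_start);
        off = ((address - vmf->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
        start_pgoff -= off;

        /*
         * end_pgoff is either the end of the page table, the end of
         * the vma or nr_pages from start_pgoff, whichever is nearest.
         */
        end_pgoff = start_pgoff -
                ((vmf->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
                PTRS_PER_PTE - 1;
        end_pgoff = min3(end_pgoff,
                        vma_pages(vmf->vma) + vmf->vma->vm_pgoff - 1,
                        start_pgoff + nr_pages - 1);

Both ends are clamped, so the range stays inside a single page table.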

It made me look at the code around pte_unmap_unlock() and I think that
the bug is that we have to reset vmf->address and NULLify vmf->pte once we
are done with faultaround:

diff --git a/mm/memory.c b/mm/memory.c
index 829f5056dd1c..405f5c73ce3e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3794,6 +3794,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)

update_mmu_tlb(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
+ vmf->address = address;
+ vmf->pte = NULL;
return ret;
}

--
Kirill A. Shutemov

2020-12-29 01:32:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Mon, Dec 28, 2020 at 10:47 AM Linus Torvalds
<[email protected]> wrote:
>
> I personally think it's wrong to update vmf->pte at all. We should
> just have a local 'ptep' pointer that we update as we walk along. But
> that requires another change to the calling convention, namely to
> "do_set_pte()".

Actually, I think we should not use do_set_pte() at all.

About half of do_set_pte() is about the FAULT_FLAG_WRITE case, which
the fault-around code never has set (and would be wrong if it did).

So I think do_set_pte() should be made local to mm/memory.c, and the
filemap_map_pages() code should do it's own simplified version that
just does the non-writable case, and that just gets passed the address
and the pte pointer.

At that point, there would no longer be any need to update the
address/pte fields in the vmf struct, and in fact I think it could be
made a "const" pointer in this cal chain.

This is all just from looking at the code, I haven't tried to write a
patch to do this, so I might be missing some case.

Linus

2020-12-29 01:48:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Mon, Dec 28, 2020 at 2:05 PM Kirill A. Shutemov <[email protected]> wrote:
>
>
> But I *think* we should be fine here: do_fault_around() limits start_pgoff
> and end_pgoff to stay within the page table.

Yeah, but I was thinking it would then update ->pte to just past the edge.

But looking at that logic, you're right - it will update ->pte and
->address only just before installing the pte, so it will never go
_to_ the edge, it will always stay inside.

So scratch my suspicion. It looked promising mainly because that ->pte
pointer update was one of the things that changed when you instead
compared against the address.

Linus

2020-12-29 04:38:49

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

Got it at last, sorry it's taken so long.

On Tue, 29 Dec 2020, Kirill A. Shutemov wrote:
> On Tue, Dec 29, 2020 at 01:05:48AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Dec 28, 2020 at 10:47:36AM -0800, Linus Torvalds wrote:
> > > On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov <[email protected]> wrote:
> > > >
> > > > So far I only found one more pin leak and always-true check. I don't see
> > > > how can it lead to crash or corruption. Keep looking.

Those mods look good in themselves, but, as you expected,
made no difference to the corruption I was seeing.

> > >
> > > Well, I noticed that the nommu.c version of filemap_map_pages() needs
> > > fixing, but that's obviously not the case Hugh sees.
> > >
> > > No,m I think the problem is the
> > >
> > > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > >
> > > at the end of filemap_map_pages().
> > >
> > > Why?
> > >
> > > Because we've been updating vmf->pte as we go along:
> > >
> > > vmf->pte += xas.xa_index - last_pgoff;
> > >
> > > and I think that by the time we get to that "pte_unmap_unlock()",
> > > vmf->pte potentially points to past the edge of the page directory.
> >
> > Well, if it's true we have bigger problem: we set up an pte entry without
> > relevant PTL.
> >
> > But I *think* we should be fine here: do_fault_around() limits start_pgoff
> > and end_pgoff to stay within the page table.

Yes, Linus's patch had made no difference,
the map_pages loop is safe in that respect.

> >
> > It made mw looking at the code around pte_unmap_unlock() and I think that
> > the bug is that we have to reset vmf->address and NULLify vmf->pte once we
> > are done with faultaround:
> >
> > diff --git a/mm/memory.c b/mm/memory.c
>
> Ugh.. Wrong place. Need to sleep.
>
> I'll look into your idea tomorrow.
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 87671284de62..e4daab80ed81 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2987,6 +2987,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address,
> } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> rcu_read_unlock();
> + vmf->address = address;
> + vmf->pte = NULL;
> WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
>
> return ret;
> --

And that made no (noticeable) difference either. But at last
I realized, it's absolutely on the right track, but missing the
couple of early returns at the head of filemap_map_pages(): add

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3025,14 +3025,12 @@ vm_fault_t filemap_map_pages(struct vm_f

rcu_read_lock();
head = first_map_page(vmf, &xas, end_pgoff);
- if (!head) {
- rcu_read_unlock();
- return 0;
- }
+ if (!head)
+ goto out;

if (filemap_map_pmd(vmf, head)) {
- rcu_read_unlock();
- return VM_FAULT_NOPAGE;
+ ret = VM_FAULT_NOPAGE;
+ goto out;
}

vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
@@ -3066,9 +3064,9 @@ unlock:
put_page(head);
} while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
rcu_read_unlock();
vmf->address = address;
- vmf->pte = NULL;
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);

return ret;
--

and then the corruption is fixed. It seems miraculous that the
machines even booted with that bad vmf->address going to __do_fault():
maybe that tells us what a good job map_pages does most of the time.

You'll see I've tried removing the "vmf->pte = NULL;" there. I did
criticize earlier that vmf->pte was being left set, but was either
thinking back to some earlier era of mm/memory.c, or else confusing
with vmf->prealloc_pte, which is NULLed when consumed: I could not
find anywhere in mm/memory.c which now needs vmf->pte to be cleared,
and I seem to run fine without it (even on i386 HIGHPTE).

So, the mystery is solved; but I don't think any of these patches
should be applied. Without thinking through Linus's suggestions
re do_set_pte() in particular, I do think this map_pages interface
is too ugly, and has given us lots of trouble: please take your time
to go over it all again, and come up with a cleaner patch.

I've grown rather jaded, and questioning the value of the rework:
I don't think I want to look at or test another for a week or so.

Hugh

2020-12-29 13:31:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Mon, Dec 28, 2020 at 01:58:40PM -0800, Linus Torvalds wrote:
> On Mon, Dec 28, 2020 at 10:47 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > I personally think it's wrong to update vmf->pte at all. We should
> > just have a local 'ptep' pointer that we update as we walk along. But
> > that requires another change to the calling convention, namely to
> > "do_set_pte()".
>
> Actually, I think we should not use do_set_pte() at all.
>
> About half of do_set_pte() is about the FAULT_FLAG_WRITE case, which
> the fault-around code never has set (and would be wrong if it did).
>
> So I think do_set_pte() should be made local to mm/memory.c, and the
> filemap_map_pages() code should do it's own simplified version that
> just does the non-writable case, and that just gets passed the address
> and the pte pointer.

Well, do_set_pte() also does rss accounting, and the _fast version of that
is local to mm/memory.c. That makes it cumbersome to open-code do_set_pte()
in filemap_map_pages(). We would need to make the fast rss accounting
visible outside memory.c, or do some kind of batching in filemap_map_pages().
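
For the batching option, the loop could count what it maps and account
once at the end, something like (sketch only; it relies on all pages in
the range charging the same counter, which holds since they come from a
single file mapping):

        int nr_mapped = 0;
        /* All pages come from the same mapping, so one counter type. */
        int counter = mm_counter_file(head);

        do {
                /* ... pick up, lock and validate the page as before ... */

                set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
                nr_mapped++;
        } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
        pte_unmap_unlock(vmf->pte, vmf->ptl);

        if (nr_mapped)
                add_mm_counter(vma->vm_mm, counter, nr_mapped);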

> At that point, there would no longer be any need to update the
> address/pte fields in the vmf struct, and in fact I think it could be
> made a "const" pointer in this cal chain.

Unfortunately, we would still need to NULLify vmf->prealloc_pte once it's
consumed. That kills the idea of making it const.

> This is all just from looking at the code, I haven't tried to write a
> patch to do this, so I might be missing some case.

I reworked do_fault_around() so that it doesn't touch vmf->address; the
original address therefore reaches ->map_pages() unchanged, and there is
no need for the new ->map_pages() argument. filemap_map_pages() calculates
the address based on the pgoff of the page being handled.
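
For reference, deriving the user address from the pgoff is just this (a
sketch; the patch below may spell it slightly differently):

        /* First address covered by the faultaround range. */
        addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);

        /* ... and per page, advance by the number of slots skipped: */
        addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
        last_pgoff = xas.xa_index;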

From 526b30b06d237cb1ed2fd51df0e4a6203df7bfa0 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Sat, 19 Dec 2020 15:19:23 +0300
Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths

alloc_set_pte() has two users with different requirements: in the
faultaround code, it is called from an atomic context and the PTE page
table has to be preallocated. finish_fault() can sleep and allocate the
page table as needed.

The PTL locking rules are also strange, hard to follow and overkill for
finish_fault().

Let's untangle the mess. alloc_set_pte() is gone now. All locking is
explicit.

The price is some code duplication to handle huge pages in the
faultaround path, but that should be fine given the overall improvement
in readability.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/xfs/xfs_file.c | 6 +-
include/linux/mm.h | 12 ++-
include/linux/pgtable.h | 11 +++
mm/filemap.c | 177 ++++++++++++++++++++++++++---------
mm/memory.c | 199 ++++++++++++----------------------------
5 files changed, 213 insertions(+), 192 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5b0f93f73837..111fe73bb8a7 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1319,17 +1319,19 @@ xfs_filemap_pfn_mkwrite(
return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true);
}

-static void
+static vm_fault_t
xfs_filemap_map_pages(
struct vm_fault *vmf,
pgoff_t start_pgoff,
pgoff_t end_pgoff)
{
struct inode *inode = file_inode(vmf->vma->vm_file);
+ vm_fault_t ret;

xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
- filemap_map_pages(vmf, start_pgoff, end_pgoff);
+ ret = filemap_map_pages(vmf, start_pgoff, end_pgoff);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+ return ret;
}

static const struct vm_operations_struct xfs_file_vm_ops = {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..d3d4e307fa09 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -534,8 +534,8 @@ struct vm_fault {
* is not NULL, otherwise pmd.
*/
pgtable_t prealloc_pte; /* Pre-allocated pte page table.
- * vm_ops->map_pages() calls
- * alloc_set_pte() from atomic context.
+ * vm_ops->map_pages() sets up a page
+ * table from from atomic context.
* do_fault_around() pre-allocates
* page table to avoid allocation from
* atomic context.
@@ -562,7 +562,7 @@ struct vm_operations_struct {
vm_fault_t (*fault)(struct vm_fault *vmf);
vm_fault_t (*huge_fault)(struct vm_fault *vmf,
enum page_entry_size pe_size);
- void (*map_pages)(struct vm_fault *vmf,
+ vm_fault_t (*map_pages)(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff);
unsigned long (*pagesize)(struct vm_area_struct * area);

@@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page);
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
+void do_set_pte(struct vm_fault *vmf, struct page *page);
+
vm_fault_t finish_fault(struct vm_fault *vmf);
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#endif
@@ -2621,7 +2623,7 @@ extern void truncate_inode_pages_final(struct address_space *);

/* generic vm_area_ops exported for stackable file systems */
extern vm_fault_t filemap_fault(struct vm_fault *vmf);
-extern void filemap_map_pages(struct vm_fault *vmf,
+extern vm_fault_t filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff);
extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e237004d498d..869c1921ceda 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}

+/*
+ * the ordering of these checks is important for pmds with _page_devmap set.
+ * if we check pmd_trans_unstable() first we will trip the bad_pmd() check
+ * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly
+ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
+ */
+static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
+{
+ return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
+}
+
#ifndef CONFIG_NUMA_BALANCING
/*
* Technically a PTE can be PROTNONE even when not doing NUMA balancing but
diff --git a/mm/filemap.c b/mm/filemap.c
index 0b2067b3c328..cc4c99f5a287 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -42,6 +42,7 @@
#include <linux/psi.h>
#include <linux/ramfs.h>
#include <linux/page_idle.h>
+#include <asm/pgalloc.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
@@ -2831,74 +2832,164 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);

-void filemap_map_pages(struct vm_fault *vmf,
- pgoff_t start_pgoff, pgoff_t end_pgoff)
+static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
{
- struct file *file = vmf->vma->vm_file;
+ struct mm_struct *mm = vmf->vma->vm_mm;
+
+ /* Huge page is mapped? No need to proceed. */
+ if (pmd_trans_huge(*vmf->pmd)) {
+ unlock_page(page);
+ put_page(page);
+ return true;
+ }
+
+ if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
+ vm_fault_t ret = do_set_pmd(vmf, page);
+ if (!ret) {
+ /* The page is mapped successfully, reference consumed. */
+ unlock_page(page);
+ return true;
+ }
+ }
+
+ if (pmd_none(*vmf->pmd)) {
+ vmf->ptl = pmd_lock(mm, vmf->pmd);
+ if (likely(pmd_none(*vmf->pmd))) {
+ mm_inc_nr_ptes(mm);
+ pmd_populate(mm, vmf->pmd, vmf->prealloc_pte);
+ vmf->prealloc_pte = NULL;
+ }
+ spin_unlock(vmf->ptl);
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd)) {
+ unlock_page(page);
+ put_page(page);
+ return true;
+ }
+
+ return false;
+}
+
+static struct page *next_uptodate_page(struct page *page,
+ struct address_space *mapping,
+ struct xa_state *xas, pgoff_t end_pgoff)
+{
+ unsigned long max_idx;
+
+ do {
+ if (!page)
+ return NULL;
+ if (xas_retry(xas, page))
+ continue;
+ if (xa_is_value(page))
+ continue;
+ if (PageLocked(page))
+ continue;
+ if (!page_cache_get_speculative(page))
+ continue;
+ /* Has the page moved or been split? */
+ if (unlikely(page != xas_reload(xas)))
+ goto skip;
+ if (!PageUptodate(page) || PageReadahead(page))
+ goto skip;
+ if (PageHWPoison(page))
+ goto skip;
+ if (!trylock_page(page))
+ goto skip;
+ if (page->mapping != mapping)
+ goto unlock;
+ if (!PageUptodate(page))
+ goto unlock;
+ max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
+ if (xas->xa_index >= max_idx)
+ goto unlock;
+ return page;
+unlock:
+ unlock_page(page);
+skip:
+ put_page(page);
+ } while ((page = xas_next_entry(xas, end_pgoff)) != NULL);
+
+ return NULL;
+}
+
+static inline struct page *first_map_page(struct address_space *mapping,
+ struct xa_state *xas,
+ pgoff_t end_pgoff)
+{
+ return next_uptodate_page(xas_find(xas, end_pgoff),
+ mapping, xas, end_pgoff);
+}
+
+static inline struct page *next_map_page(struct address_space *mapping,
+ struct xa_state *xas,
+ pgoff_t end_pgoff)
+{
+ return next_uptodate_page(xas_next_entry(xas, end_pgoff),
+ mapping, xas, end_pgoff);
+}
+
+vm_fault_t filemap_map_pages(struct vm_fault *vmf,
+ pgoff_t start_pgoff, pgoff_t end_pgoff)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
- unsigned long max_idx;
+ unsigned long address = vmf->address;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct page *head, *page;
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
+ vm_fault_t ret = 0;

rcu_read_lock();
- xas_for_each(&xas, head, end_pgoff) {
- if (xas_retry(&xas, head))
- continue;
- if (xa_is_value(head))
- goto next;
+ head = first_map_page(mapping, &xas, end_pgoff);
+ if (!head)
+ goto out;

- /*
- * Check for a locked page first, as a speculative
- * reference may adversely influence page migration.
- */
- if (PageLocked(head))
- goto next;
- if (!page_cache_get_speculative(head))
- goto next;
+ if (filemap_map_pmd(vmf, head)) {
+ ret = VM_FAULT_NOPAGE;
+ goto out;
+ }

- /* Has the page moved or been split? */
- if (unlikely(head != xas_reload(&xas)))
- goto skip;
+ vmf->address = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
+ do {
page = find_subpage(head, xas.xa_index);
-
- if (!PageUptodate(head) ||
- PageReadahead(page) ||
- PageHWPoison(page))
- goto skip;
- if (!trylock_page(head))
- goto skip;
-
- if (head->mapping != mapping || !PageUptodate(head))
- goto unlock;
-
- max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
- if (xas.xa_index >= max_idx)
+ if (PageHWPoison(page))
goto unlock;

if (mmap_miss > 0)
mmap_miss--;

vmf->address += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
- if (vmf->pte)
- vmf->pte += xas.xa_index - last_pgoff;
+ vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;
- if (alloc_set_pte(vmf, page))
+
+ if (!pte_none(*vmf->pte))
goto unlock;
+
+ do_set_pte(vmf, page);
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, vmf->address, vmf->pte);
unlock_page(head);
- goto next;
+
+ /* The fault is handled */
+ if (vmf->address == address)
+ ret = VM_FAULT_NOPAGE;
+ continue;
unlock:
unlock_page(head);
-skip:
put_page(head);
-next:
- /* Huge page is mapped? No need to proceed. */
- if (pmd_trans_huge(*vmf->pmd))
- break;
- }
+ } while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
rcu_read_unlock();
+ vmf->address = address;
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
+ return ret;
}
EXPORT_SYMBOL(filemap_map_pages);

diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..978798a6289b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;

- /* See the comment in pte_alloc_one_map() */
+ /* See comment in handle_pte_fault() */
if (unlikely(pmd_trans_unstable(vmf->pmd)))
return 0;

@@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
return ret;
}

-/*
- * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
- * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
- * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
- * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
- */
-static int pmd_devmap_trans_unstable(pmd_t *pmd)
-{
- return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
-}
-
-static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf)
-{
- struct vm_area_struct *vma = vmf->vma;
-
- if (!pmd_none(*vmf->pmd))
- goto map_pte;
- if (vmf->prealloc_pte) {
- vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- if (unlikely(!pmd_none(*vmf->pmd))) {
- spin_unlock(vmf->ptl);
- goto map_pte;
- }
-
- mm_inc_nr_ptes(vma->vm_mm);
- pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
- spin_unlock(vmf->ptl);
- vmf->prealloc_pte = NULL;
- } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
- return VM_FAULT_OOM;
- }
-map_pte:
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
- * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
- * under us and then back to pmd_none, as a result of MADV_DONTNEED
- * running immediately after a huge pmd fault in a different thread of
- * this mm, in turn leading to a misleading pmd_trans_huge() retval.
- * All we have to ensure is that it is a regular pmd that we can walk
- * with pte_offset_map() and we can do that through an atomic read in
- * C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return VM_FAULT_NOPAGE;
-
- /*
- * At this point we know that our vmf->pmd points to a page of ptes
- * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
- * for the duration of the fault. If a racing MADV_DONTNEED runs and
- * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
- * be valid and we will re-check to make sure the vmf->pte isn't
- * pte_none() under vmf->ptl protection when we return to
- * alloc_set_pte().
- */
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
- return 0;
-}
-
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static void deposit_prealloc_pte(struct vm_fault *vmf)
{
@@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf)
vmf->prealloc_pte = NULL;
}

-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -3762,52 +3702,17 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
return ret;
}
#else
-static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
+vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
{
- BUILD_BUG();
- return 0;
+ return VM_FAULT_FALLBACK;
}
#endif

-/**
- * alloc_set_pte - setup new PTE entry for given page and add reverse page
- * mapping. If needed, the function allocates page table or use pre-allocated.
- *
- * @vmf: fault environment
- * @page: page to map
- *
- * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on
- * return.
- *
- * Target users are page handler itself and implementations of
- * vm_ops->map_pages.
- *
- * Return: %0 on success, %VM_FAULT_ code in case of error.
- */
-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
+void do_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
pte_t entry;
- vm_fault_t ret;
-
- if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
- ret = do_set_pmd(vmf, page);
- if (ret != VM_FAULT_FALLBACK)
- return ret;
- }
-
- if (!vmf->pte) {
- ret = pte_alloc_one_map(vmf);
- if (ret)
- return ret;
- }
-
- /* Re-check under ptl */
- if (unlikely(!pte_none(*vmf->pte))) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
- return VM_FAULT_NOPAGE;
- }

flush_icache_page(vma, page);
entry = mk_pte(page, vma->vm_page_prot);
@@ -3824,14 +3729,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
-
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, vmf->address, vmf->pte);
-
- return 0;
}

-
/**
* finish_fault - finish page fault once we have prepared the page to fault
*
@@ -3849,12 +3748,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
*/
vm_fault_t finish_fault(struct vm_fault *vmf)
{
+ struct vm_area_struct *vma = vmf->vma;
struct page *page;
- vm_fault_t ret = 0;
+ vm_fault_t ret;

/* Did we COW the page? */
- if ((vmf->flags & FAULT_FLAG_WRITE) &&
- !(vmf->vma->vm_flags & VM_SHARED))
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
page = vmf->cow_page;
else
page = vmf->page;
@@ -3863,12 +3762,38 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
* check even for read faults because we might have lost our CoWed
* page
*/
- if (!(vmf->vma->vm_flags & VM_SHARED))
- ret = check_stable_address_space(vmf->vma->vm_mm);
- if (!ret)
- ret = alloc_set_pte(vmf, page);
- if (vmf->pte)
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (!(vma->vm_flags & VM_SHARED)) {
+ ret = check_stable_address_space(vma->vm_mm);
+ if (ret)
+ return ret;
+ }
+
+ if (pmd_none(*vmf->pmd)) {
+ if (PageTransCompound(page)) {
+ ret = do_set_pmd(vmf, page);
+ if (ret != VM_FAULT_FALLBACK)
+ return ret;
+ }
+
+ if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+ return VM_FAULT_OOM;
+ }
+
+ /* See comment in handle_pte_fault() */
+ if (pmd_devmap_trans_unstable(vmf->pmd))
+ return 0;
+
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ ret = 0;
+ /* Re-check under ptl */
+ if (likely(pte_none(*vmf->pte)))
+ do_set_pte(vmf, page);
+ else
+ ret = VM_FAULT_NOPAGE;
+
+ update_mmu_tlb(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
}

@@ -3938,13 +3863,12 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
pgoff_t start_pgoff = vmf->pgoff;
pgoff_t end_pgoff;
int off;
- vm_fault_t ret = 0;

nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;

- vmf->address = max(address & mask, vmf->vma->vm_start);
- off = ((address - vmf->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+ address = max(address & mask, vmf->vma->vm_start);
+ off = ((vmf->address - address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
start_pgoff -= off;

/*
@@ -3952,7 +3876,7 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
* the vma or nr_pages from start_pgoff, depending what is nearest.
*/
end_pgoff = start_pgoff -
- ((vmf->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+ ((address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
PTRS_PER_PTE - 1;
end_pgoff = min3(end_pgoff, vma_pages(vmf->vma) + vmf->vma->vm_pgoff - 1,
start_pgoff + nr_pages - 1);
@@ -3960,31 +3884,11 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
if (pmd_none(*vmf->pmd)) {
vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
if (!vmf->prealloc_pte)
- goto out;
+ return VM_FAULT_OOM;
smp_wmb(); /* See comment in __pte_alloc() */
}

- vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
-
- /* Huge page is mapped? Page fault is solved */
- if (pmd_trans_huge(*vmf->pmd)) {
- ret = VM_FAULT_NOPAGE;
- goto out;
- }
-
- /* ->map_pages() haven't done anything useful. Cold page cache? */
- if (!vmf->pte)
- goto out;
-
- /* check if the page fault is solved */
- vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
- if (!pte_none(*vmf->pte))
- ret = VM_FAULT_NOPAGE;
- pte_unmap_unlock(vmf->pte, vmf->ptl);
-out:
- vmf->address = address;
- vmf->pte = NULL;
- return ret;
+ return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
}

static vm_fault_t do_read_fault(struct vm_fault *vmf)
@@ -4340,7 +4244,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
*/
vmf->pte = NULL;
} else {
- /* See comment in pte_alloc_one_map() */
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
+ * of pmd_trans_huge() to ensure the pmd didn't become
+ * pmd_trans_huge under us and then back to pmd_none, as a
+ * result of MADV_DONTNEED running immediately after a huge pmd
+ * fault in a different thread of this mm, in turn leading to a
+ * misleading pmd_trans_huge() retval. All we have to ensure is
+ * that it is a regular pmd that we can walk with
+ * pte_offset_map() and we can do that through an atomic read
+ * in C, which is what pmd_trans_unstable() provides.
+ */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
/*
--
Kirill A. Shutemov

2020-12-29 15:20:56

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Tue, Dec 29, 2020 at 04:28:19PM +0300, Kirill A. Shutemov wrote:
> > At that point, there would no longer be any need to update the
> > address/pte fields in the vmf struct, and in fact I think it could be
> > made a "const" pointer in this cal chain.
>
> Unfortunately, we would still need to NULLify vmf->prealloc_pte once it's
> consumed. It kills idea with const.

We could abstract out a refcount for pgtable_t (usually it's a struct
page, but sometimes it's a portion of a struct page, or sometimes
it's multiple struct pages) and always put it instead of freeing it.
There'd be some details to sort out, and I'm not sure it's worth it just
to constify the vmf on this path, but it's something that might be worth
doing for a future case.
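
Something like this, very roughly (hypothetical, names invented for
illustration, and it glosses over the per-architecture pgtable_t
differences mentioned above):

        /*
         * If the preallocated table carried its own refcount, both
         * do_fault_around() and the consumer could simply "put" it, and
         * nothing would need to write vmf->prealloc_pte back to NULL --
         * which is what currently prevents a const vmf.
         */
        struct prealloc_pgtable {
                pgtable_t       table;
                refcount_t      ref;
        };

        static void prealloc_pgtable_put(struct mm_struct *mm,
                                         struct prealloc_pgtable *p)
        {
                if (refcount_dec_and_test(&p->ref))
                        pte_free(mm, p->table);
        }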

> +++ b/fs/xfs/xfs_file.c
> @@ -1319,17 +1319,19 @@ xfs_filemap_pfn_mkwrite(
> return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true);
> }
>
> -static void
> +static vm_fault_t
> xfs_filemap_map_pages(

Can we just ditch the ->map_pages callback and make the rule "if you've put
uptodate pages in the page cache, they can be mapped without informing
the filesystem"?

> +++ b/include/linux/mm.h
> @@ -534,8 +534,8 @@ struct vm_fault {
> * is not NULL, otherwise pmd.
> */
> pgtable_t prealloc_pte; /* Pre-allocated pte page table.
> - * vm_ops->map_pages() calls
> - * alloc_set_pte() from atomic context.
> + * vm_ops->map_pages() sets up a page
> + * table from from atomic context.

Doubled word "from" here.

2020-12-29 20:55:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

On Tue, Dec 29, 2020 at 5:28 AM Kirill A. Shutemov <[email protected]> wrote:
>
> I reworked do_fault_around() so it doesn't touch vmf->address and instead
> passes the original address down to ->map_pages(). There is no need for a new
> argument in ->map_pages(): filemap_map_pages() calculates the address based on
> the pgoff of the page being handled.

So I really like how the end result looks, but it worries me how many
problems this patch has had, and I'd love to try to make it more
incremental.

In particular, I really liked your re-write of the loop in
filemap_map_pages(), and how you separated out the pmd case to be a
separate early case. I think that was some of the nicest part of the
patch.

And I really like how you changed the vmf->address meaning in this
latest version, and pass the real address down, and pass the range
just in the start/end_pgoff. That just looks like the RightThing(tm)
to do.

But I think these nice cleanups could and should be independent of
some of the other changes.

The unlocking change, for example, that changed the return type from
void to vm_fault_t, also looks like a nice cleanup, but that's also
the one that caused the last series of odd problems for Hugh.

The fix for that then also caused that ugly "goto out" mess, which
undid some of the "this looks really nice" effects.

I don't even see the need for that "goto out". Yes, it resets the
'vmf->address' back to what it should be (so that __do_fault()
fallback can work right), but the first two "goto out" cases haven't
actually changed it (or the mmap_miss), so why did they need to do
that?
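
Something like this, against the patch above (untested sketch; the two early
paths haven't touched vmf->address or mmap_miss yet, so they can drop the
RCU lock and return directly instead of jumping to "out"):

        rcu_read_lock();
        head = first_map_page(mapping, &xas, end_pgoff);
        if (!head) {
                rcu_read_unlock();
                return 0;
        }

        if (filemap_map_pmd(vmf, head)) {
                rcu_read_unlock();
                return VM_FAULT_NOPAGE;
        }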

Anyway, I really like how it all looks (apart from that odd "goto out"
detail - nitpicking here), but it does worry me that it changes so much
and has had so many small problems...

Linus