There is interest in mapping zeropages via vm_insert_pages() [1] into
MAP_SHARED mappings.
For now, we only get zeropages in MAP_SHARED mappings via
vmf_insert_mixed() from FSDAX code, and I think it's a bit shaky in some
cases because we refcount the zeropage when mapping it but not necessarily
always when unmapping it ... and we should actually never refcount it.
It's all a bit tricky, especially how zeropages in MAP_SHARED mappings
interact with GUP (FOLL_LONGTERM), mprotect(), write-faults and s390x
forbidding the shared zeropage (rewrite on its way upstream [2]).
This series tries to take the careful approach of only allowing the
zeropage where it is likely safe to use (which should cover the existing
FSDAX use case and [1]), preventing it from accidentally getting mapped
writable during a write fault, mprotect() etc., and preventing issues
with FOLL_LONGTERM in the future with other users.
Only very lightly tested ... because I have yet to even get fsdax running
(I guess, file system on emulated DIMM in a VM? Hmmm :) ).
[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
Cc: Andrew Morton <[email protected]>
Cc: Vincent Donnefort <[email protected]>
Cc: Dan Williams <[email protected]>
David Hildenbrand (2):
mm/memory: cleanly support zeropage in vm_insert_page*(),
vm_map_pages*() and vmf_insert_mixed()
mm/rmap: sanity check that zeropages are not passed to RMAP
include/linux/rmap.h | 3 ++
mm/memory.c | 92 ++++++++++++++++++++++++++++++++++----------
mm/mprotect.c | 2 +
3 files changed, 76 insertions(+), 21 deletions(-)
--
2.44.0
Using insert_page() we might have previously ended up passing the zeropage
into rmap code. Make sure that won't happen again.
Note that we won't check the huge zeropage for now, which might still
end up in RMAP code.
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/rmap.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7229b9baf20d8..5cb0d419a1d75 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -200,6 +200,9 @@ static inline void __folio_rmap_sanity_checks(struct folio *folio,
/* hugetlb folios are handled separately. */
VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
+ /* When (un)mapping zeropages, we should never touch ref+mapcount. */
+ VM_WARN_ON_FOLIO(is_zero_folio(folio), folio);
+
/*
* TODO: we get driver-allocated folios that have nothing to do with
* the rmap using vm_insert_page(); therefore, we cannot assume that
--
2.44.0
For now we only get the (small) zeropage mapped to user space in four
cases (excluding VM_PFNMAP mappings, such as /dev/mem):
(1) Read page faults in anonymous VMAs (MAP_PRIVATE|MAP_ANON):
do_anonymous_page() will not refcount it and map it pte_mkspecial().
(2) UFFDIO_ZEROPAGE on anonymous VMA or COW mapping of shmem
(MAP_PRIVATE). mfill_atomic_pte_zeropage() will not refcount it and
map it pte_mkspecial().
(3) KSM in mergeable VMA (anonymous VMA or COW mapping).
cmp_and_merge_page() will not refcount it and map it
pte_mkspecial().
(4) FSDAX as an optimization for holes.
vmf_insert_mixed()->__vm_insert_mixed() might end up calling
insert_page() without CONFIG_ARCH_HAS_PTE_SPECIAL, refcounting the
zeropage and not mapping it pte_mkspecial(). With
CONFIG_ARCH_HAS_PTE_SPECIAL, we'll call insert_pfn() where we will
not refcount it and map it pte_mkspecial().
In case (4), we might not have VM_MIXEDMAP set: while fs/fuse/dax.c sets
VM_MIXEDMAP, we removed it for both ext4 and XFS fsdax in commit
e1fb4a086495 ("dax: remove VM_MIXEDMAP for fsdax and device dax").
Without CONFIG_ARCH_HAS_PTE_SPECIAL and with VM_MIXEDMAP, vm_normal_page()
would currently return the zeropage. We'll refcount the zeropage when
mapping and when unmapping.
Without CONFIG_ARCH_HAS_PTE_SPECIAL and without VM_MIXEDMAP,
vm_normal_page() would currently refuse to return the zeropage.
So we'd refcount it when mapping but not when unmapping it ... do we
have fsdax without CONFIG_ARCH_HAS_PTE_SPECIAL in practice? Hard to
tell.
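The refcount cases above can be modeled in userspace. This is only a
sketch, not kernel code: zeropage_ref_delta() and its parameters are
hypothetical names that mimic the pre-patch behavior as described in this
commit message.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model of the pre-patch behavior in case (4).
 * Returns the net refcount change of the zeropage after mapping it once
 * and unmapping it once.
 */
static int zeropage_ref_delta(bool has_pte_special, bool vm_mixedmap)
{
	int delta = 0;

	/* Mapping: without pte_special, insert_page() takes a reference. */
	if (!has_pte_special)
		delta++;

	/*
	 * Unmapping: the reference is only dropped if vm_normal_page()
	 * returns the page, which (for a non-special PTE) requires a
	 * VM_MIXEDMAP mapping.
	 */
	if (!has_pte_special && vm_mixedmap)
		delta--;

	return delta;
}
```

Under these assumptions, only the combination "no pte_special, no
VM_MIXEDMAP" leaves a reference behind per map/unmap cycle.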
Independent of that, we should never refcount the zeropage when we might
be holding that reference for a long time, because even without an
accounting imbalance we might overflow the refcount. As there is interest
in using the zeropage also in other VM_MIXEDMAP mappings, let's add clean
support for that in the cases where it makes sense:
(A) Never refcount the zeropage when mapping it:
In insert_page(), special-case the zeropage, do not refcount it, and
use pte_mkspecial(). Don't involve insert_pfn(), adjusting insert_page()
looks cleaner than branching off to insert_pfn().
(B) Never refcount the zeropage when unmapping it:
In vm_normal_page(), also don't return the zeropage in a
VM_MIXEDMAP mapping without CONFIG_ARCH_HAS_PTE_SPECIAL. Add a
VM_WARN_ON_ONCE() sanity check if we'd ever return the zeropage,
which could happen if someone forgets to set pte_mkspecial() when
mapping the zeropage. Document that.
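Rules (A) and (B) together can be sketched with a small userspace model.
All names (page_model, pte_model, insert_page_model(), unmap_model()) are
hypothetical stand-ins, and the model ignores rmap and mm counters; it
only shows that the zeropage's refcount is never touched.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct page_model {
	bool is_zero;	/* stands in for is_zero_folio()/is_zero_pfn() */
	int refcount;
};

struct pte_model {
	struct page_model *page;	/* NULL models pte_none() */
	bool special;
};

/* (A): map a page; the zeropage gets a special PTE and no reference. */
static int insert_page_model(struct pte_model *pte, struct page_model *page)
{
	assert(pte && page);
	if (pte->page)
		return -EBUSY;
	pte->page = page;
	pte->special = page->is_zero;
	if (!page->is_zero)
		page->refcount++;
	return 0;
}

/* (B): the page to unreference on unmap is never the zeropage. */
static struct page_model *normal_page_model(const struct pte_model *pte)
{
	if (!pte->page || pte->special || pte->page->is_zero)
		return NULL;
	return pte->page;
}

static void unmap_model(struct pte_model *pte)
{
	struct page_model *page = normal_page_model(pte);

	if (page)
		page->refcount--;
	pte->page = NULL;
	pte->special = false;
}
```

In this model, mapping and unmapping the zeropage any number of times
leaves its refcount unchanged, while an ordinary page is referenced on
map and unreferenced on unmap.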
(C) Allow the zeropage only where reasonable
s390x never wants the zeropage in some processes running legacy
KVM guests that make use of storage keys. So disallow that.
Further, using the zeropage in COW mappings is unproblematic (just what
we do for other COW mappings), because FAULT_FLAG_UNSHARE can just
unshare it and GUP with FOLL_LONGTERM would work as expected.
Similarly, mappings that can never have writable PTEs (implying no
write faults) are also not problematic, because nothing could end
up mapping the PTE writable by mistake later. But in case we could have
writable PTEs, we'll only allow the zeropage in VMAs that provide
pfn_mkwrite and are either FSDAX VMAs, which are incompatible with
FOLL_LONGTERM, or VM_IO VMAs, which GUP rejects completely.
We'll always require the zeropage to be mapped with pte_special().
GUP-fast will reject the zeropage that way, but GUP-slow will allow
it. (Note that GUP does not refcount the zeropage with FOLL_PIN,
because there were issues with overflowing the refcount in the past).
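The policy in (C), as described in prose here, can be sketched as a
userspace decision function. struct vma_model and its fields are
hypothetical stand-ins for the VMA properties involved; the sketch follows
the description above (mappings that can never have writable PTEs are
allowed) and is not a copy of the kernel code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the VMA properties that (C) looks at. */
struct vma_model {
	bool mm_forbids_zeropage;	/* e.g. s390x with storage keys in use */
	bool cow_mapping;		/* private mapping with VM_MAYWRITE */
	bool may_write;			/* VM_WRITE or VM_MAYWRITE set */
	bool has_pfn_mkwrite;		/* vm_ops->pfn_mkwrite provided */
	bool fsdax;			/* FSDAX VMA */
	bool vm_io;			/* VM_IO VMA */
};

static bool zeropage_allowed_model(const struct vma_model *v)
{
	if (v->mm_forbids_zeropage)
		return false;
	/* COW mappings can unshare the zeropage when needed. */
	if (v->cow_mapping)
		return true;
	/* No writable PTEs possible: nothing can map it writable later. */
	if (!v->may_write)
		return true;
	/* Possibly writable: only where long-term pinning cannot happen. */
	return v->has_pfn_mkwrite && (v->fsdax || v->vm_io);
}
```

A zero-initialized struct vma_model describes a shared mapping that can
never become writable, which this sketch allows; a shared, writable,
non-FSDAX and non-VM_IO mapping is rejected.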
Add sanity checks to can_change_pte_writable() and wp_page_reuse(), to
catch early during testing if we'd ever find a zeropage unexpectedly in
code that wants to upgrade write permissions.
Convert the BUG_ON(!vm_mixed_ok()) in __vm_insert_mixed() to an ordinary
check and simply fail with VM_FAULT_SIGBUS, like we do for other sanity
checks. Drop the stale comment regarding reserved pages from insert_page().
Note that:
* we won't mess with VM_PFNMAP mappings for now. remap_pfn_range() and
vmf_insert_pfn() would allow the zeropage in some cases and
not refcount it.
* vmf_insert_pfn*() will reject the zeropage in VM_MIXEDMAP
mappings and we'll leave that alone for now. People can simply use
one of the other interfaces.
* we won't bother with the huge zeropage for now. It's never
PTE-mapped and also GUP does not special-case it yet.
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 92 +++++++++++++++++++++++++++++++++++++++------------
mm/mprotect.c | 2 ++
2 files changed, 73 insertions(+), 21 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index eea6e4984eaef..5fffc9bd3febd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -575,10 +575,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
* VM_MIXEDMAP mappings can likewise contain memory with or without "struct
* page" backing, however the difference is that _all_ pages with a struct
* page (that is, those where pfn_valid is true) are refcounted and considered
- * normal pages by the VM. The disadvantage is that pages are refcounted
- * (which can be slower and simply not an option for some PFNMAP users). The
- * advantage is that we don't have to follow the strict linearity rule of
- * PFNMAP mappings in order to support COWable mappings.
+ * normal pages by the VM. The only exception are zeropages, which are
+ * *never* refcounted.
+ *
+ * The disadvantage is that pages are refcounted (which can be slower and
+ * simply not an option for some PFNMAP users). The advantage is that we
+ * don't have to follow the strict linearity rule of PFNMAP mappings in
+ * order to support COWable mappings.
*
*/
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
@@ -616,6 +619,8 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
if (vma->vm_flags & VM_MIXEDMAP) {
if (!pfn_valid(pfn))
return NULL;
+ if (is_zero_pfn(pfn))
+ return NULL;
goto out;
} else {
unsigned long off;
@@ -641,6 +646,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
* eg. VDSO mappings can cause them to exist.
*/
out:
+ VM_WARN_ON_ONCE(is_zero_pfn(pfn));
return pfn_to_page(pfn);
}
@@ -1983,10 +1989,47 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
return pte_alloc_map_lock(mm, pmd, addr, ptl);
}
-static int validate_page_before_insert(struct page *page)
+static bool vm_mixed_zeropage_allowed(struct vm_area_struct *vma)
+{
+ VM_WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP);
+ /*
+ * Whoever wants to forbid the zeropage after some zeropages
+ * might already have been mapped has to scan the page tables and
+ * bail out on any zeropages. Zeropages in COW mappings can
+ * be unshared using FAULT_FLAG_UNSHARE faults.
+ */
+ if (mm_forbids_zeropage(vma->vm_mm))
+ return false;
+ /* zeropages in COW mappings are common and unproblematic. */
+ if (is_cow_mapping(vma->vm_flags))
+ return true;
+ /* Mappings that do not allow for writable PTEs are unproblematic. */
+ if (!(vma->vm_flags & (VM_WRITE | VM_MAYWRITE)))
+ return false;
+ /*
+ * Why not allow any VMA that has vm_ops->pfn_mkwrite? GUP could
+ * find the shared zeropage and longterm-pin it, which would
+ * be problematic as soon as the zeropage gets replaced by a different
+ * page due to vma->vm_ops->pfn_mkwrite, because what's mapped would
+ * now differ to what GUP looked up. FSDAX is incompatible to
+ * FOLL_LONGTERM and VM_IO is incompatible to GUP completely (see
+ * check_vma_flags).
+ */
+ return vma->vm_ops && vma->vm_ops->pfn_mkwrite &&
+ (vma_is_fsdax(vma) || vma->vm_flags & VM_IO);
+}
+
+static int validate_page_before_insert(struct vm_area_struct *vma,
+ struct page *page)
{
struct folio *folio = page_folio(page);
+ if (unlikely(is_zero_folio(folio))) {
+ if (!vm_mixed_zeropage_allowed(vma))
+ return -EINVAL;
+ return 0;
+ }
+
if (folio_test_anon(folio) || folio_test_slab(folio) ||
page_has_type(page))
return -EINVAL;
@@ -1998,24 +2041,23 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
unsigned long addr, struct page *page, pgprot_t prot)
{
struct folio *folio = page_folio(page);
+ pte_t pteval;
if (!pte_none(ptep_get(pte)))
return -EBUSY;
/* Ok, finally just insert the thing.. */
- folio_get(folio);
- inc_mm_counter(vma->vm_mm, mm_counter_file(folio));
- folio_add_file_rmap_pte(folio, page, vma);
- set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot));
+ pteval = mk_pte(page, prot);
+ if (unlikely(is_zero_folio(folio))) {
+ pteval = pte_mkspecial(pteval);
+ } else {
+ folio_get(folio);
+ inc_mm_counter(vma->vm_mm, mm_counter_file(folio));
+ folio_add_file_rmap_pte(folio, page, vma);
+ }
+ set_pte_at(vma->vm_mm, addr, pte, pteval);
return 0;
}
-/*
- * This is the old fallback for page remapping.
- *
- * For historical reasons, it only allows reserved pages. Only
- * old drivers should use this, and they needed to mark their
- * pages reserved for the old functions anyway.
- */
static int insert_page(struct vm_area_struct *vma, unsigned long addr,
struct page *page, pgprot_t prot)
{
@@ -2023,7 +2065,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
pte_t *pte;
spinlock_t *ptl;
- retval = validate_page_before_insert(page);
+ retval = validate_page_before_insert(vma, page);
if (retval)
goto out;
retval = -ENOMEM;
@@ -2043,7 +2085,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
if (!page_count(page))
return -EINVAL;
- err = validate_page_before_insert(page);
+ err = validate_page_before_insert(vma, page);
if (err)
return err;
return insert_page_into_pte_locked(vma, pte, addr, page, prot);
@@ -2149,7 +2191,8 @@ EXPORT_SYMBOL(vm_insert_pages);
* @page: source kernel page
*
* This allows drivers to insert individual pages they've allocated
- * into a user vma.
+ * into a user vma. The zeropage is supported in some VMAs,
+ * see vm_mixed_zeropage_allowed().
*
* The page has to be a nice clean _individual_ kernel allocation.
* If you allocate a compound page, you need to have marked it as
@@ -2195,6 +2238,8 @@ EXPORT_SYMBOL(vm_insert_page);
* @offset: user's requested vm_pgoff
*
* This allows drivers to map range of kernel pages into a user vma.
+ * The zeropage is supported in some VMAs, see
+ * vm_mixed_zeropage_allowed().
*
* Return: 0 on success and error code otherwise.
*/
@@ -2410,8 +2455,11 @@ vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
}
EXPORT_SYMBOL(vmf_insert_pfn);
-static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
+static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn, bool mkwrite)
{
+ if (unlikely(is_zero_pfn(pfn_t_to_pfn(pfn))) &&
+ (mkwrite || !vm_mixed_zeropage_allowed(vma)))
+ return false;
/* these checks mirror the abort conditions in vm_normal_page */
if (vma->vm_flags & VM_MIXEDMAP)
return true;
@@ -2430,7 +2478,8 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
pgprot_t pgprot = vma->vm_page_prot;
int err;
- BUG_ON(!vm_mixed_ok(vma, pfn));
+ if (!vm_mixed_ok(vma, pfn, mkwrite))
+ return VM_FAULT_SIGBUS;
if (addr < vma->vm_start || addr >= vma->vm_end)
return VM_FAULT_SIGBUS;
@@ -3178,6 +3227,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio)
pte_t entry;
VM_BUG_ON(!(vmf->flags & FAULT_FLAG_WRITE));
+ VM_WARN_ON(is_zero_pfn(pte_pfn(vmf->orig_pte)));
if (folio) {
VM_BUG_ON(folio_test_anon(folio) &&
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8c6cd88252738..888ef66468dbd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -71,6 +71,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
return page && PageAnon(page) && PageAnonExclusive(page);
}
+ VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
+
/*
* Writable MAP_SHARED mapping: "clean" might indicate that the FS still
* needs a real write-fault for writenotify
--
2.44.0
Hi David,
[...]
> -static int validate_page_before_insert(struct page *page)
> +static bool vm_mixed_zeropage_allowed(struct vm_area_struct *vma)
> +{
> + VM_WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP);
> + /*
> + * Whoever wants to forbid the zeropage after some zeropages
> + * might already have been mapped has to scan the page tables and
> + * bail out on any zeropages. Zeropages in COW mappings can
> + * be unshared using FAULT_FLAG_UNSHARE faults.
> + */
> + if (mm_forbids_zeropage(vma->vm_mm))
> + return false;
> + /* zeropages in COW mappings are common and unproblematic. */
> + if (is_cow_mapping(vma->vm_flags))
> + return true;
> + /* Mappings that do not allow for writable PTEs are unproblematic. */
> + if (!(vma->vm_flags & (VM_WRITE | VM_MAYWRITE)))
> + return false;
Shouldn't we return true here?
> + /*
> + * Why not allow any VMA that has vm_ops->pfn_mkwrite? GUP could
> + * find the shared zeropage and longterm-pin it, which would
> + * be problematic as soon as the zeropage gets replaced by a different
> + * page due to vma->vm_ops->pfn_mkwrite, because what's mapped would
> + * now differ to what GUP looked up. FSDAX is incompatible to
> + * FOLL_LONGTERM and VM_IO is incompatible to GUP completely (see
> + * check_vma_flags).
> + */
> + return vma->vm_ops && vma->vm_ops->pfn_mkwrite &&
> + (vma_is_fsdax(vma) || vma->vm_flags & VM_IO);
> +}
> +
[...]
>
> -/*
> - * This is the old fallback for page remapping.
> - *
> - * For historical reasons, it only allows reserved pages. Only
> - * old drivers should use this, and they needed to mark their
> - * pages reserved for the old functions anyway.
> - */
> static int insert_page(struct vm_area_struct *vma, unsigned long addr,
> struct page *page, pgprot_t prot)
> {
> @@ -2023,7 +2065,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
> pte_t *pte;
> spinlock_t *ptl;
>
> - retval = validate_page_before_insert(page);
> + retval = validate_page_before_insert(vma, page);
> if (retval)
> goto out;
> retval = -ENOMEM;
> @@ -2043,7 +2085,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
>
> if (!page_count(page))
> return -EINVAL;
This test here prevents inserting the zero-page.
> - err = validate_page_before_insert(page);
> + err = validate_page_before_insert(vma, page);
> if (err)
> return err;
> return insert_page_into_pte_locked(vma, pte, addr, page, prot);
> @@ -2149,7 +2191,8 @@ EXPORT_SYMBOL(vm_insert_pages);
> * @page: source kernel page
> *
> * This allows drivers to insert individual pages they've allocated
> - * into a user vma.
> + * into a user vma. The zeropage is supported in some VMAs,
> + * see vm_mixed_zeropage_allowed().
> *
> * The page has to be a nice clean _individual_ kernel allocation.
> * If you allocate a compound page, you need to have marked it as
> @@ -2195,6 +2238,8 @@ EXPORT_SYMBOL(vm_insert_page);
> * @offset: user's requested vm_pgoff
> *
> * This allows drivers to map range of kernel pages into a user vma.
> + * The zeropage is supported in some VMAs, see
> + * vm_mixed_zeropage_allowed().
> *
> * Return: 0 on success and error code otherwise.
> */
> @@ -2410,8 +2455,11 @@ vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> }
> EXPORT_SYMBOL(vmf_insert_pfn);
>
> -static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
On 17.05.24 17:07, Vincent Donnefort wrote:
> Hi David,
>
> [...]
>
>> -static int validate_page_before_insert(struct page *page)
>> +static bool vm_mixed_zeropage_allowed(struct vm_area_struct *vma)
>> +{
>> + VM_WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP);
>> + /*
>> + * Whoever wants to forbid the zeropage after some zeropages
>> + * might already have been mapped has to scan the page tables and
>> + * bail out on any zeropages. Zeropages in COW mappings can
>> + * be unshared using FAULT_FLAG_UNSHARE faults.
>> + */
>> + if (mm_forbids_zeropage(vma->vm_mm))
>> + return false;
>> + /* zeropages in COW mappings are common and unproblematic. */
>> + if (is_cow_mapping(vma->vm_flags))
>> + return true;
>> + /* Mappings that do not allow for writable PTEs are unproblematic. */
>> + if (!(vma->vm_flags & (VM_WRITE | VM_MAYWRITE)))
>> + return false;
>
> Shouldn't we return true here?
Indeed, thanks! I wish we had a user in the tree already that could
exercise that code path.
[...]
>> @@ -2043,7 +2085,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
>>
>> if (!page_count(page))
>> return -EINVAL;
>
> This test here prevents inserting the zero-page.
You mean the existing page_count() check? or the (wrong) vma->vm_flags
check in vm_mixed_zeropage_allowed() ?
--
Cheers,
David / dhildenb
On Tue, May 21, 2024 at 10:25:43AM +0200, David Hildenbrand wrote:
> On 17.05.24 17:07, Vincent Donnefort wrote:
> > Hi David,
> >
> > [...]
> >
> > > -static int validate_page_before_insert(struct page *page)
> > > +static bool vm_mixed_zeropage_allowed(struct vm_area_struct *vma)
> > > +{
> > > + VM_WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP);
> > > + /*
> > > + * Whoever wants to forbid the zeropage after some zeropages
> > > + * might already have been mapped has to scan the page tables and
> > > + * bail out on any zeropages. Zeropages in COW mappings can
> > > + * be unshared using FAULT_FLAG_UNSHARE faults.
> > > + */
> > > + if (mm_forbids_zeropage(vma->vm_mm))
> > > + return false;
> > > + /* zeropages in COW mappings are common and unproblematic. */
> > > + if (is_cow_mapping(vma->vm_flags))
> > > + return true;
> > > + /* Mappings that do not allow for writable PTEs are unproblematic. */
> > > + if (!(vma->vm_flags & (VM_WRITE | VM_MAYWRITE)))
> > > + return false;
> >
> > Shouldn't we return true here?
>
> Indeed, thanks! I wish we had a user in the tree already that could
> exercise that code path.
I have a patch ready to use this path from the memory map tracing! I can either
send it once this one is picked-up or you can add it to your series?
>
> [...]
>
> > > @@ -2043,7 +2085,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
> > > if (!page_count(page))
> > > return -EINVAL;
> >
> > This test here prevents inserting the zero-page.
>
> You mean the existing page_count() check? or the (wrong) vma->vm_flags check
> in vm_mixed_zeropage_allowed() ?
I meant this page_count() here. As a quick test, I removed that check (also fixed
the vm_flags above) and the zero-page was properly mapped!
>
> --
> Cheers,
>
> David / dhildenb
>
On 21.05.24 11:06, Vincent Donnefort wrote:
> On Tue, May 21, 2024 at 10:25:43AM +0200, David Hildenbrand wrote:
>> On 17.05.24 17:07, Vincent Donnefort wrote:
>>> Hi David,
>>>
>>> [...]
>>>
>>>> -static int validate_page_before_insert(struct page *page)
>>>> +static bool vm_mixed_zeropage_allowed(struct vm_area_struct *vma)
>>>> +{
>>>> + VM_WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP);
>>>> + /*
>>>> + * Whoever wants to forbid the zeropage after some zeropages
>>>> + * might already have been mapped has to scan the page tables and
>>>> + * bail out on any zeropages. Zeropages in COW mappings can
>>>> + * be unshared using FAULT_FLAG_UNSHARE faults.
>>>> + */
>>>> + if (mm_forbids_zeropage(vma->vm_mm))
>>>> + return false;
>>>> + /* zeropages in COW mappings are common and unproblematic. */
>>>> + if (is_cow_mapping(vma->vm_flags))
>>>> + return true;
>>>> + /* Mappings that do not allow for writable PTEs are unproblematic. */
>>>> + if (!(vma->vm_flags & (VM_WRITE | VM_MAYWRITE)))
>>>> + return false;
>>>
>>> Shouldn't we return true here?
>>
>> Indeed, thanks! I wish we had a user in the tree already that could
>> exercise that code path.
>
> I have a patch ready to use this path from the memory map tracing! I can either
> send it once this one is picked-up or you can add it to your series?
Whatever works for you! To debug, it would be good if you could send me
the patch and simple instructions on how to test it (do we have a
selftest as well?).
>
>>
>> [...]
>>
>>>> @@ -2043,7 +2085,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
>>>> if (!page_count(page))
>>>> return -EINVAL;
>>>
>>> This test here prevents inserting the zero-page.
>>
>> You mean the existing page_count() check? or the (wrong) vma->vm_flags check
>> in vm_mixed_zeropage_allowed() ?
>
> I meant this page_count() here. As a quick test, I removed that check (also fixed
> the vm_flags above) and the zero-page was properly mapped!
That's weird and might indicate another issue.
The refcount of the shared zeropage should be initialized to 1, just
like for any other reserved pages
(mm/mm_init.c:__init_single_page()->init_page_count())
Hm ...
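A minimal userspace sketch of that expectation, with hypothetical names
(page_model, init_page_count_model(), batch_precheck_model()); it only
models the statement above that the refcount starts at 1 and does not
explain what was observed in the quick test:

```c
#include <errno.h>

struct page_model {
	int refcount;
};

/* Models init_page_count(): a freshly initialized page has refcount 1. */
static void init_page_count_model(struct page_model *page)
{
	page->refcount = 1;
}

/* Models the !page_count(page) check in insert_page_in_batch_locked(). */
static int batch_precheck_model(const struct page_model *page)
{
	if (!page->refcount)
		return -EINVAL;
	return 0;
}
```

Under that assumption, the check would only reject a page whose refcount
is 0.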
--
Cheers,
David / dhildenb
On Tue, May 21, 2024 at 11:18:41AM +0200, David Hildenbrand wrote:
> On 21.05.24 11:06, Vincent Donnefort wrote:
> > On Tue, May 21, 2024 at 10:25:43AM +0200, David Hildenbrand wrote:
> > > On 17.05.24 17:07, Vincent Donnefort wrote:
> > > > Hi David,
> > > >
> > > > [...]
> > > >
> > > > > -static int validate_page_before_insert(struct page *page)
> > > > > +static bool vm_mixed_zeropage_allowed(struct vm_area_struct *vma)
> > > > > +{
> > > > > + VM_WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP);
> > > > > + /*
> > > > > + * Whoever wants to forbid the zeropage after some zeropages
> > > > > + * might already have been mapped has to scan the page tables and
> > > > > + * bail out on any zeropages. Zeropages in COW mappings can
> > > > > + * be unshared using FAULT_FLAG_UNSHARE faults.
> > > > > + */
> > > > > + if (mm_forbids_zeropage(vma->vm_mm))
> > > > > + return false;
> > > > > + /* zeropages in COW mappings are common and unproblematic. */
> > > > > + if (is_cow_mapping(vma->vm_flags))
> > > > > + return true;
> > > > > + /* Mappings that do not allow for writable PTEs are unproblematic. */
> > > > > + if (!(vma->vm_flags & (VM_WRITE | VM_MAYWRITE)))
> > > > > + return false;
> > > >
> > > > Shouldn't we return true here?
> > >
> > > Indeed, thanks! I wish we had a user in the tree already that could
> > > exercise that code path.
> >
> > I have a patch ready to use this path from the memory map tracing! I can either
> > send it once this one is picked-up or you can add it to your series?
>
> Whatever works for you! To debug, it would be good if you could send me the
> patch and simple instructions on how to test it (do we have a selftest as
> well?).
Of course, I'll share something with you today! It includes an update to the
selftest to make sure we check the padding with the zero-page.
>
> >
> > >
> > > [...]
> > >
> > > > > @@ -2043,7 +2085,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
> > > > > if (!page_count(page))
> > > > > return -EINVAL;
> > > >
> > > > This test here prevents inserting the zero-page.
> > >
> > > You mean the existing page_count() check? or the (wrong) vma->vm_flags check
> > > in vm_mixed_zeropage_allowed() ?
> >
> > I meant this page_count() here. As a quick test, I removed that check (also fixed
> > the vm_flags above) and the zero-page was properly mapped!
>
> That's weird and might indicate another issue.
>
> The refcount of the shared zeropage should be initialized to 1, just like
> for any other reserved pages
> (mm/mm_init.c:__init_single_page()->init_page_count())
>
> Hm ...
>
> --
> Cheers,
>
> David / dhildenb
>