Whenever GUP currently ends up taking a R/O pin on an anonymous page that
might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
on the page table entry will end up replacing the mapped anonymous page
due to COW, resulting in the GUP pin no longer being consistent with the
page actually mapped into the page table.
The possible ways to deal with this situation are:
(1) Ignore and pin -- what we do right now.
(2) Fail to pin -- which would be rather surprising to callers and
could break user space.
(3) Trigger unsharing and pin the now exclusive page -- reliable R/O
pins.
Let's implement 3) because it provides the clearest semantics and
allows for checking in unpin_user_pages() and friends for possible BUGs:
when trying to unpin a page that's no longer exclusive, clearly
something went very wrong and might result in memory corruptions that
might be hard to debug. So we better have a nice way to spot such
issues.
This change implies that whenever user space *wrote* to a private
mapping (IOW, we have an anonymous page mapped), that GUP pins will
always remain consistent: reliable R/O GUP pins of anonymous pages.
As a side note, this commit fixes the COW security issue for hugetlb with
FOLL_PIN as documented in:
https://lore.kernel.org/r/[email protected]
The vmsplice reproducer still applies, because vmsplice uses FOLL_GET
instead of FOLL_PIN.
Note that follow_huge_pmd() doesn't apply because we cannot end up in
there with FOLL_PIN.
This commit is heavily based on prototype patches by Andrea.
Co-developed-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm.h | 39 +++++++++++++++++++++++++++++++++++++++
mm/gup.c | 32 +++++++++++++++++++++++++++++---
mm/huge_memory.c | 3 +++
mm/hugetlb.c | 25 ++++++++++++++++++++++---
4 files changed, 93 insertions(+), 6 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ffeb7f5fa32b..48ce9b7f903d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3003,6 +3003,45 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
return 0;
}
+/*
+ * Indicates for which pages that are write-protected in the page table,
+ * whether GUP has to trigger unsharing via FAULT_FLAG_UNSHARE such that the
+ * GUP pin will remain consistent with the pages mapped into the page tables
+ * of the MM.
+ *
+ * Temporary unmapping of PageAnonExclusive() pages or clearing of
+ * PageAnonExclusive() has to protect against concurrent GUP:
+ * * Ordinary GUP: Using the PT lock
+ * * GUP-fast and fork(): mm->write_protect_seq
+ * * GUP-fast and KSM or temporary unmapping (swap, migration):
+ * clear/invalidate+flush of the page table entry
+ *
+ * Must be called with the (sub)page that's actually referenced via the
+ * page table entry, which might not necessarily be the head page for a
+ * PTE-mapped THP.
+ */
+static inline bool gup_must_unshare(unsigned int flags, struct page *page)
+{
+ /*
+ * FOLL_WRITE is implicitly handled correctly as the page table entry
+ * has to be writable -- and if it references (part of) an anonymous
+ * folio, that part is required to be marked exclusive.
+ */
+ if ((flags & (FOLL_WRITE | FOLL_PIN)) != FOLL_PIN)
+ return false;
+ /*
+ * Note: PageAnon(page) is stable until the page is actually getting
+ * freed.
+ */
+ if (!PageAnon(page))
+ return false;
+ /*
+ * Note that PageKsm() pages cannot be exclusive, and consequently,
+ * cannot get pinned.
+ */
+ return !PageAnonExclusive(page);
+}
+
typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
unsigned long size, pte_fn_t fn, void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 7127e789d210..2cb7cbb1fc1f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -568,6 +568,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
}
}
+ if (!pte_write(pte) && gup_must_unshare(flags, page)) {
+ page = ERR_PTR(-EMLINK);
+ goto out;
+ }
/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
if (unlikely(!try_grab_page(page, flags))) {
page = ERR_PTR(-ENOMEM);
@@ -820,6 +824,11 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
* When getting pages from ZONE_DEVICE memory, the @ctx->pgmap caches
* the device's dev_pagemap metadata to avoid repeating expensive lookups.
*
+ * When getting an anonymous page and the caller has to trigger unsharing
+ * of a shared anonymous page first, -EMLINK is returned. The caller should
+ * trigger a fault with FAULT_FLAG_UNSHARE set. Note that unsharing is only
+ * relevant with FOLL_PIN and !FOLL_WRITE.
+ *
* On output, the @ctx->page_mask is set according to the size of the page.
*
* Return: the mapped (struct page *), %NULL if no mapping exists, or
@@ -943,7 +952,8 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
* is, *@locked will be set to 0 and -EBUSY returned.
*/
static int faultin_page(struct vm_area_struct *vma,
- unsigned long address, unsigned int *flags, int *locked)
+ unsigned long address, unsigned int *flags, bool unshare,
+ int *locked)
{
unsigned int fault_flags = 0;
vm_fault_t ret;
@@ -968,6 +978,11 @@ static int faultin_page(struct vm_area_struct *vma,
*/
fault_flags |= FAULT_FLAG_TRIED;
}
+ if (unshare) {
+ fault_flags |= FAULT_FLAG_UNSHARE;
+ /* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
+ VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE);
+ }
ret = handle_mm_fault(vma, address, fault_flags, NULL);
if (ret & VM_FAULT_ERROR) {
@@ -1189,8 +1204,9 @@ static long __get_user_pages(struct mm_struct *mm,
cond_resched();
page = follow_page_mask(vma, start, foll_flags, &ctx);
- if (!page) {
- ret = faultin_page(vma, start, &foll_flags, locked);
+ if (!page || PTR_ERR(page) == -EMLINK) {
+ ret = faultin_page(vma, start, &foll_flags,
+ PTR_ERR(page) == -EMLINK, locked);
switch (ret) {
case 0:
goto retry;
@@ -2346,6 +2362,11 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
goto pte_unmap;
}
+ if (!pte_write(pte) && gup_must_unshare(flags, page)) {
+ put_compound_head(head, 1, flags);
+ goto pte_unmap;
+ }
+
VM_BUG_ON_PAGE(compound_head(page) != head, page);
/*
@@ -2589,6 +2610,11 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
return 0;
}
+ if (!pmd_write(orig) && gup_must_unshare(flags, head)) {
+ put_compound_head(head, refs, flags);
+ return 0;
+ }
+
*nr += refs;
SetPageReferenced(head);
return 1;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2cb8fa31377a..334cd9381635 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1396,6 +1396,9 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+ if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
+ return ERR_PTR(-EMLINK);
+
if (!try_grab_page(page, flags))
return ERR_PTR(-ENOMEM);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ebcde7e9f704..6c2a7dd8a48d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5957,6 +5957,23 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
}
}
+static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte,
+ bool *unshare)
+{
+ pte_t pteval = huge_ptep_get(pte);
+
+ *unshare = false;
+ if (is_swap_pte(pteval))
+ return true;
+ if (huge_pte_write(pteval))
+ return false;
+ if (flags & FOLL_WRITE)
+ return true;
+ if (gup_must_unshare(flags, pte_page(pteval)))
+ *unshare = true;
+ return *unshare;
+}
+
long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
unsigned long *position, unsigned long *nr_pages,
@@ -5971,6 +5988,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
while (vaddr < vma->vm_end && remainder) {
pte_t *pte;
spinlock_t *ptl = NULL;
+ bool unshare;
int absent;
struct page *page;
@@ -6021,9 +6039,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
* both cases, and because we can't follow correct pages
* directly from any kind of swap entries.
*/
- if (absent || is_swap_pte(huge_ptep_get(pte)) ||
- ((flags & FOLL_WRITE) &&
- !huge_pte_write(huge_ptep_get(pte)))) {
+ if (absent ||
+ __follow_hugetlb_must_fault(flags, pte, &unshare)) {
vm_fault_t ret;
unsigned int fault_flags = 0;
@@ -6031,6 +6048,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(ptl);
if (flags & FOLL_WRITE)
fault_flags |= FAULT_FLAG_WRITE;
+ else if (unshare)
+ fault_flags |= FAULT_FLAG_UNSHARE;
if (locked)
fault_flags |= FAULT_FLAG_ALLOW_RETRY |
FAULT_FLAG_KILLABLE;
--
2.35.1
On 02.03.22 17:55, Jason Gunthorpe wrote:
> On Thu, Feb 24, 2022 at 01:26:13PM +0100, David Hildenbrand wrote:
>> Whenever GUP currently ends up taking a R/O pin on an anonymous page that
>> might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
>> on the page table entry will end up replacing the mapped anonymous page
>> due to COW, resulting in the GUP pin no longer being consistent with the
>> page actually mapped into the page table.
>>
>> The possible ways to deal with this situation are:
>> (1) Ignore and pin -- what we do right now.
>> (2) Fail to pin -- which would be rather surprising to callers and
>> could break user space.
>> (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
>> pins.
>
Hi Jason,
> How does this mesh with the common FOLL_FORCE|FOLL_WRITE|FOLL_PIN
> pattern used for requesting read access? Can they be converted to
> just FOLL_WRITE|FOLL_PIN after this?
Interesting question, I thought about this in detail yet, let me give it
a try:
IIRC, the sole purpose of FOLL_FORCE in the context of R/O pins is to
enforce the eventual COW -- meaning we COW (via FOLL_WRITE) even if we
don't have the permissions to write (via FOLL_FORCE), to make sure we
most certainly have an exclusive anonymoous page in a MAP_PRIVATE mapping.
Dropping only the FOLL_FORCE would make the FOLL_WRITE request fail if
the mapping is currently !VM_WRITE (but is VM_MAYWRITE), so that
wouldn't work.
I recall that we don't allow pinning the zero page ("special pte",
!vm_normal_page()). So if you have an ordinary MAP_PRIVATE|MAP_ANON
mapping, you will now only need a "FOLL_READ" and have a reliable pin,
even if not previously writing to every page.
It would we different with other MAP_PRIVATE file mappings I remember:
With FOLL_FORCE|FOLL_WRITE|FOLL_PIN we'd force placement of an anonymous
page, resulting in the R/O (long-term ?) pin not observing consecutive
file changes. With a pure FOLL_READ we'd still observe file changes as
we don't trigger a write fault.
BUT, once we actually write to the private mapping via the page table,
the GUP pin would go out of sync with the now-anonymous page mapped into
the page table. However, I'm having a hard time answering what's
actually expected?
It's really hard to tell what the user wants with MAP_PRIVATE file
mappings and stumbles over a !anon page (no modifications so far):
(a) I want a R/O pin to observe file modifications.
(b) I want the R/O pin to *not* observe file modifications but observe
my (eventual? if any) private modifications,
Of course, if we already wrote to that page and now have an anon page,
it's easy: we are already no longer following file changes.
Maybe FOLL_PIN would already do now what we'd expect from a R/O pin --
(a), maybe not. I'm wondering if FOLL_LONGTERM could give us an
indication whether (a) or (b) applies.
--
Thanks,
David / dhildenb
On Thu, Feb 24, 2022 at 01:26:13PM +0100, David Hildenbrand wrote:
> Whenever GUP currently ends up taking a R/O pin on an anonymous page that
> might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
> on the page table entry will end up replacing the mapped anonymous page
> due to COW, resulting in the GUP pin no longer being consistent with the
> page actually mapped into the page table.
>
> The possible ways to deal with this situation are:
> (1) Ignore and pin -- what we do right now.
> (2) Fail to pin -- which would be rather surprising to callers and
> could break user space.
> (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
> pins.
How does this mesh with the common FOLL_FORCE|FOLL_WRITE|FOLL_PIN
pattern used for requesting read access? Can they be converted to
just FOLL_WRITE|FOLL_PIN after this?
Jason
On Wed, Mar 02, 2022 at 09:38:09PM +0100, David Hildenbrand wrote:
> (a) I want a R/O pin to observe file modifications.
> (b) I want the R/O pin to *not* observe file modifications but observe
> my (eventual? if any) private modifications,
A scenario I know that motivated this is fairly straightfoward:
static char data[] = {};
ibv_reg_mr(data, READ_ONLY)
data[0] = 1
.. go touch data via DMA ..
We want to reliably observe the '1'
What is happening under the covers is that 'data' is placed in the
.data segment and becomes a file backed MAP_PRIVATE page. The write
COWs that page
It would work OK if it was in .bss instead
I think the FOLL_FORCE is there because otherwise the trick doesn't
work on true RO pages and the API becomes broken.
Jason
On 3/2/22 12:38, David Hildenbrand wrote:
...
> BUT, once we actually write to the private mapping via the page table,
> the GUP pin would go out of sync with the now-anonymous page mapped into
> the page table. However, I'm having a hard time answering what's
> actually expected?
>
> It's really hard to tell what the user wants with MAP_PRIVATE file
> mappings and stumbles over a !anon page (no modifications so far):
>
> (a) I want a R/O pin to observe file modifications.
> (b) I want the R/O pin to *not* observe file modifications but observe
> my (eventual? if any) private modifications,
>
On this aspect, I think it is easier than trying to discern user
intentions. Because it is less a question of what the user wants, and
more a question of how mmap(2) is specified. And the man page clearly
indicates that the user has no right to expect to see file
modifications. Here's the excerpt:
"MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the mapping are not
visible to other processes mapping the same file, and are not carried
through to the underlying file. It is unspecified whether changes made
to the file after the mmap() call are visible in the mapped region.
"
> Of course, if we already wrote to that page and now have an anon page,
> it's easy: we are already no longer following file changes.
Yes, and in fact, I've always thought that the way this was written
means that it should be treated as a snapshot of the file contents,
and no longer reliably connected in either direction to the page(s).
thanks,
--
John Hubbard
NVIDIA
On 03.03.22 02:47, John Hubbard wrote:
> On 3/2/22 12:38, David Hildenbrand wrote:
> ...
>> BUT, once we actually write to the private mapping via the page table,
>> the GUP pin would go out of sync with the now-anonymous page mapped into
>> the page table. However, I'm having a hard time answering what's
>> actually expected?
>>
>> It's really hard to tell what the user wants with MAP_PRIVATE file
>> mappings and stumbles over a !anon page (no modifications so far):
>>
>> (a) I want a R/O pin to observe file modifications.
>> (b) I want the R/O pin to *not* observe file modifications but observe
>> my (eventual? if any) private modifications,
>>
>
> On this aspect, I think it is easier than trying to discern user
> intentions. Because it is less a question of what the user wants, and
> more a question of how mmap(2) is specified. And the man page clearly
> indicates that the user has no right to expect to see file
> modifications. Here's the excerpt:
>
> "MAP_PRIVATE
>
> Create a private copy-on-write mapping. Updates to the mapping are not
> visible to other processes mapping the same file, and are not carried
> through to the underlying file. It is unspecified whether changes made
> to the file after the mmap() call are visible in the mapped region.
> "
>
>> Of course, if we already wrote to that page and now have an anon page,
>> it's easy: we are already no longer following file changes.
>
> Yes, and in fact, I've always thought that the way this was written
> means that it should be treated as a snapshot of the file contents,
> and no longer reliably connected in either direction to the page(s).
Thanks John, that's extremely helpful. I forgot about these MAP_PRIVATE
mmap() details -- they help a lot to clarify which semantics to provide.
So what we could do is:
a) Extend FAULT_FLAG_UNSHARE to also unshare an !anon page in
a MAP_RPIVATE mapping, replacing it with an (exclusive) anon page.
R/O PTE permissions are maintained, just like unsharing in the
context of this series.
b) Similarly trigger FAULT_FLAG_UNSHARE from GUP when trying to take a
R/O pin (FOLL_PIN) on a R/O-mapped !anon page in a MAP_PRIVATE
mapping.
c) Make R/O pins consistently use "FOLL_PIN" instead, getting rid of
FOLL_FORCE|FOLL_WRITE.
Of course, we can't detect MAP_PRIVATE vs. MAP_SHARED in GUP-fast (no
VMA), so we'd always have to fallback in GUP-fast in case we intend to
FOLL_PIN a R/O-mapped !anon page. That would imply that essentially any
R/O pins (FOLL_PIN) would have to fallback to ordinary GUP. BUT, I mean
we require FOLL_FORCE|FOLL_WRITE right now, which is not any different,
so ...
One optimization would be to trigger b) only for FOLL_LONGTERM. For
!FOLL_LONGTERM there are "in theory" absolutely no guarantees which data
will be observed if we modify concurrently to e.g., O_DIRECT IMHO. But
that would require some more thought.
Of course, that's all material for another journey, although it should
be mostly straight forward.
--
Thanks,
David / dhildenb
On 02.03.22 21:59, Jason Gunthorpe wrote:
> On Wed, Mar 02, 2022 at 09:38:09PM +0100, David Hildenbrand wrote:
>
>> (a) I want a R/O pin to observe file modifications.
>> (b) I want the R/O pin to *not* observe file modifications but observe
>> my (eventual? if any) private modifications,
>
> A scenario I know that motivated this is fairly straightfoward:
>
> static char data[] = {};
>
> ibv_reg_mr(data, READ_ONLY)
> data[0] = 1
> .. go touch data via DMA ..
>
> We want to reliably observe the '1'
>
> What is happening under the covers is that 'data' is placed in the
> .data segment and becomes a file backed MAP_PRIVATE page. The write
> COWs that page
>
> It would work OK if it was in .bss instead
>
> I think the FOLL_FORCE is there because otherwise the trick doesn't
> work on true RO pages and the API becomes broken.
Thanks for the nice example, it matches what John brought up in respect
to MAP_PRIVATE semantics.
--
Thanks,
David / dhildenb
On 03.03.22 09:06, David Hildenbrand wrote:
> On 03.03.22 02:47, John Hubbard wrote:
>> On 3/2/22 12:38, David Hildenbrand wrote:
>> ...
>>> BUT, once we actually write to the private mapping via the page table,
>>> the GUP pin would go out of sync with the now-anonymous page mapped into
>>> the page table. However, I'm having a hard time answering what's
>>> actually expected?
>>>
>>> It's really hard to tell what the user wants with MAP_PRIVATE file
>>> mappings and stumbles over a !anon page (no modifications so far):
>>>
>>> (a) I want a R/O pin to observe file modifications.
>>> (b) I want the R/O pin to *not* observe file modifications but observe
>>> my (eventual? if any) private modifications,
>>>
>>
>> On this aspect, I think it is easier than trying to discern user
>> intentions. Because it is less a question of what the user wants, and
>> more a question of how mmap(2) is specified. And the man page clearly
>> indicates that the user has no right to expect to see file
>> modifications. Here's the excerpt:
>>
>> "MAP_PRIVATE
>>
>> Create a private copy-on-write mapping. Updates to the mapping are not
>> visible to other processes mapping the same file, and are not carried
>> through to the underlying file. It is unspecified whether changes made
>> to the file after the mmap() call are visible in the mapped region.
>> "
>>
>>> Of course, if we already wrote to that page and now have an anon page,
>>> it's easy: we are already no longer following file changes.
>>
>> Yes, and in fact, I've always thought that the way this was written
>> means that it should be treated as a snapshot of the file contents,
>> and no longer reliably connected in either direction to the page(s).
>
> Thanks John, that's extremely helpful. I forgot about these MAP_PRIVATE
> mmap() details -- they help a lot to clarify which semantics to provide.
>
> So what we could do is:
>
> a) Extend FAULT_FLAG_UNSHARE to also unshare an !anon page in
> a MAP_RPIVATE mapping, replacing it with an (exclusive) anon page.
> R/O PTE permissions are maintained, just like unsharing in the
> context of this series.
>
> b) Similarly trigger FAULT_FLAG_UNSHARE from GUP when trying to take a
> R/O pin (FOLL_PIN) on a R/O-mapped !anon page in a MAP_PRIVATE
> mapping.
>
> c) Make R/O pins consistently use "FOLL_PIN" instead, getting rid of
> FOLL_FORCE|FOLL_WRITE.
>
>
> Of course, we can't detect MAP_PRIVATE vs. MAP_SHARED in GUP-fast (no
> VMA), so we'd always have to fallback in GUP-fast in case we intend to
> FOLL_PIN a R/O-mapped !anon page. That would imply that essentially any
> R/O pins (FOLL_PIN) would have to fallback to ordinary GUP. BUT, I mean
> we require FOLL_FORCE|FOLL_WRITE right now, which is not any different,
> so ...
>
> One optimization would be to trigger b) only for FOLL_LONGTERM. For
> !FOLL_LONGTERM there are "in theory" absolutely no guarantees which data
> will be observed if we modify concurrently to e.g., O_DIRECT IMHO. But
> that would require some more thought.
>
> Of course, that's all material for another journey, although it should
> be mostly straight forward.
>
Just a slight clarification after stumbling over shared zeropage code in
follow_page_pte(): we do seem to support pinning the shared zeropage at
least on the GUP-slow path. While I haven't played with it, I assume
we'd have to implement+trigger unsharing in case we'd want to take a R/O
pin on the shared zeropage.
Of course, similar to file-backed MAP_PRIVATE handling, this is out of
the scope of this series ("This change implies that whenever user space
wrote to a private mapping (IOW, we have an anonymous page mapped), that
GUP pins will
always remain consistent: reliable R/O GUP pins of anonymous pages.").
--
Thanks,
David / dhildenb