2021-05-01 14:42:07

by Peter Xu

Subject: [PATCH 0/2] mm/hugetlb: Fix issues on file sealing and fork

Hugh reported that F_SEAL_FUTURE_WRITE is not applied correctly to hugetlbfs.
I can easily reproduce this with the memfd_test program; it seems the test is
rarely run against hugetlbfs pages (it uses shmem by default).

Meanwhile I found another, probably even more severe, issue: hugetlb fork
won't wr-protect the child's cow pages, so the child can potentially write to
the parent's private pages. Patch 2 addresses that.

After this series is applied, "memfd_test hugetlbfs" should start to pass.

Please review, thanks.

Peter Xu (2):
mm/hugetlb: Fix F_SEAL_FUTURE_WRITE
mm/hugetlb: Fix cow where page writtable in child

fs/hugetlbfs/inode.c | 5 +++++
include/linux/mm.h | 32 ++++++++++++++++++++++++++++++++
mm/hugetlb.c | 2 ++
mm/shmem.c | 22 ++++------------------
4 files changed, 43 insertions(+), 18 deletions(-)

--
2.31.1
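
For readers who want to see the behaviour the cover letter describes from userspace, here is a minimal sketch of the kind of check the SEAL-FUTURE-WRITE step performs. It is not the memfd_test source; the function name future_write_seal_enforced() and its return convention are made up for illustration, and the fallback constant values are assumed to match the Linux UAPI headers. Passing use_hugetlb=1 additionally assumes huge pages are reserved and that len is a multiple of the huge page size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values assumed from the Linux UAPI. */
#ifndef F_SEAL_FUTURE_WRITE
#define F_SEAL_FUTURE_WRITE 0x0010
#endif
#ifndef MFD_HUGETLB
#define MFD_HUGETLB 0x0004U
#endif

/*
 * Hypothetical helper, for illustration only.
 * Returns  1 if a new shared writable mmap() is refused with EPERM once
 *            F_SEAL_FUTURE_WRITE is set (the expected behaviour),
 *          0 if the mapping was wrongly allowed,
 *         -1 if the check could not be set up (no memfd support, no huge
 *            pages reserved, kernel without this seal, ...).
 */
int future_write_seal_enforced(int use_hugetlb, size_t len)
{
	unsigned int flags = MFD_ALLOW_SEALING | (use_hugetlb ? MFD_HUGETLB : 0);
	int fd = memfd_create("seal-demo", flags);
	int result = -1;
	void *p;

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0)
		goto out;
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE) < 0)
		goto out;

	/* After sealing, a new PROT_WRITE + MAP_SHARED mapping must fail. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		result = (errno == EPERM) ? 1 : -1;
	} else {
		munmap(p, len);
		result = 0;
	}
out:
	close(fd);
	return result;
}
```

Calling future_write_seal_enforced(0, 4096) exercises the shmem path. On a kernel affected by the problem reported here, the hugetlb variant would be expected to return 0 rather than 1.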





2021-05-01 14:42:48

by Peter Xu

Subject: [PATCH 1/2] mm/hugetlb: Fix F_SEAL_FUTURE_WRITE

F_SEAL_FUTURE_WRITE has never been enforced for hugetlb, from the day the seal
was introduced. There is a test program for it, and it fails consistently.

$ ./memfd_test hugetlbfs
memfd-hugetlb: CREATE
memfd-hugetlb: BASIC
memfd-hugetlb: SEAL-WRITE
memfd-hugetlb: SEAL-FUTURE-WRITE
mmap() didn't fail as expected
Aborted (core dumped)

I think it's probably because no one is really running the hugetlbfs test.

Fix it by checking FUTURE_WRITE in hugetlbfs_file_mmap() as well, as we already
do in shmem_mmap(). Factor out a common helper for that.

Reported-by: Hugh Dickins <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
fs/hugetlbfs/inode.c | 5 +++++
include/linux/mm.h | 32 ++++++++++++++++++++++++++++++++
mm/shmem.c | 22 ++++------------------
3 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a2a42335e8fd2..39922c0f2fc8c 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -131,10 +131,15 @@ static void huge_pagevec_release(struct pagevec *pvec)
 static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct inode *inode = file_inode(file);
+	struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
 	loff_t len, vma_len;
 	int ret;
 	struct hstate *h = hstate_file(file);
 
+	ret = seal_check_future_write(info->seals, vma);
+	if (ret)
+		return ret;
+
 	/*
 	 * vma address alignment (but not the pgoff alignment) has
 	 * already been checked by prepare_hugepage_range. If you add
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 84fb1697b20ff..c3fd7d504a60e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3200,5 +3200,37 @@ extern int sysctl_nr_trim_pages;

 void mem_dump_obj(void *object);
 
+/**
+ * seal_check_future_write - Check for F_SEAL_FUTURE_WRITE flag and handle it
+ * @seals: the seals to check
+ * @vma: the vma to operate on
+ *
+ * Check whether F_SEAL_FUTURE_WRITE is set; if so, do proper check/handling on
+ * the vma flags. Return 0 if check pass, or <0 for errors.
+ */
+static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
+{
+	if (seals & F_SEAL_FUTURE_WRITE) {
+		/*
+		 * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
+		 * "future write" seal active.
+		 */
+		if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
+			return -EPERM;
+
+		/*
+		 * Since an F_SEAL_FUTURE_WRITE sealed memfd can be mapped as
+		 * MAP_SHARED and read-only, take care to not allow mprotect to
+		 * revert protections on such mappings. Do this only for shared
+		 * mappings. For private mappings, don't need to mask
+		 * VM_MAYWRITE as we still want them to be COW-writable.
+		 */
+		if (vma->vm_flags & VM_SHARED)
+			vma->vm_flags &= ~(VM_MAYWRITE);
+	}
+
+	return 0;
+}
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/shmem.c b/mm/shmem.c
index 26c76b13ad233..e86a230735b60 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2258,25 +2258,11 @@ int shmem_lock(struct file *file, int lock, struct user_struct *user)
 static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct shmem_inode_info *info = SHMEM_I(file_inode(file));
+	int ret;
 
-	if (info->seals & F_SEAL_FUTURE_WRITE) {
-		/*
-		 * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
-		 * "future write" seal active.
-		 */
-		if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
-			return -EPERM;
-
-		/*
-		 * Since an F_SEAL_FUTURE_WRITE sealed memfd can be mapped as
-		 * MAP_SHARED and read-only, take care to not allow mprotect to
-		 * revert protections on such mappings. Do this only for shared
-		 * mappings. For private mappings, don't need to mask
-		 * VM_MAYWRITE as we still want them to be COW-writable.
-		 */
-		if (vma->vm_flags & VM_SHARED)
-			vma->vm_flags &= ~(VM_MAYWRITE);
-	}
+	ret = seal_check_future_write(info->seals, vma);
+	if (ret)
+		return ret;
 
 	/* arm64 - allow memory tagging on RAM-based files */
 	vma->vm_flags |= VM_MTE_ALLOWED;
--
2.31.1
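
The decision logic of the helper in this patch can be followed in isolation with a small userspace model. This is only a sketch: struct demo_vma, the DEMO_* constants and demo_seal_check_future_write() are stand-ins invented for illustration, and the numeric flag values are assumptions rather than something to rely on. The branch structure mirrors the helper as posted above.

```c
#include <errno.h>

/* Stand-in flag values for illustration only; the kernel defines the real ones. */
#define DEMO_F_SEAL_FUTURE_WRITE 0x10u
#define DEMO_VM_WRITE            0x02ul
#define DEMO_VM_SHARED           0x08ul
#define DEMO_VM_MAYWRITE         0x20ul

/* Minimal stand-in for the one vm_area_struct field the helper touches. */
struct demo_vma {
	unsigned long vm_flags;
};

/* Userspace model of the helper's three outcomes. */
int demo_seal_check_future_write(unsigned int seals, struct demo_vma *vma)
{
	if (seals & DEMO_F_SEAL_FUTURE_WRITE) {
		/* Shared + writable: refuse the new mapping outright. */
		if ((vma->vm_flags & DEMO_VM_SHARED) &&
		    (vma->vm_flags & DEMO_VM_WRITE))
			return -EPERM;

		/*
		 * Shared + read-only: allow it, but drop MAYWRITE so a later
		 * mprotect(PROT_WRITE) cannot upgrade the mapping.
		 */
		if (vma->vm_flags & DEMO_VM_SHARED)
			vma->vm_flags &= ~DEMO_VM_MAYWRITE;

		/* Private mappings fall through untouched (still COW-writable). */
	}

	return 0;
}
```

The three cases are: a shared writable mapping is rejected with -EPERM, a shared read-only mapping loses its MAYWRITE bit, and a private mapping is left as it was.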

2021-05-01 14:45:38

by Peter Xu

Subject: [PATCH 2/2] mm/hugetlb: Fix cow where page writtable in child

When fork() copies a hugetlb page range, we remember to wrprotect the src pte
if needed, but we forget about the child! Without that, the child is able to
write to the parent's pages when they are mapped PROT_READ|PROT_WRITE and
MAP_PRIVATE, which causes data corruption in the parent process.

This issue can also be exposed by the "memfd_test hugetlbfs" kselftest
(provided it first gets past the F_SEAL_FUTURE_WRITE test).

Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 629aa4c2259c8..9978fb73b8caf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4056,6 +4056,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				 * See Documentation/vm/mmu_notifier.rst
 				 */
 				huge_ptep_set_wrprotect(src, addr, src_pte);
+				/* Child cannot write too! */
+				entry = huge_pte_wrprotect(entry);
 			}
 
 			page_dup_rmap(ptepage, true);
--
2.31.1
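
The invariant this patch restores can be checked from userspace: after fork(), a write by the child to a MAP_PRIVATE page must not become visible in the parent. The following is a sketch only, not the kselftest; parent_isolated_after_fork() is a hypothetical name, and it uses an anonymous mapping so it runs without any hugetlbfs setup. Passing MAP_HUGETLB in extra_flags would target huge pages, assuming some are reserved and len is a multiple of the huge page size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns  1 if the parent's private page is unchanged after the child
 *            wrote to it (copy-on-write worked),
 *          0 if the child's write leaked into the parent,
 *         -1 on setup failure.
 */
int parent_isolated_after_fork(size_t len, int extra_flags)
{
	volatile char *p;
	int status = 0;
	int result;
	pid_t pid;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Parent touches the page first, so its pte is present and writable. */
	p[0] = 'P';

	pid = fork();
	if (pid < 0) {
		munmap((void *)p, len);
		return -1;
	}
	if (pid == 0) {
		/* Child: this write must hit a private copy, not the parent's page. */
		p[0] = 'C';
		_exit(0);
	}

	if (waitpid(pid, &status, 0) < 0) {
		munmap((void *)p, len);
		return -1;
	}

	result = (p[0] == 'P') ? 1 : 0;
	munmap((void *)p, len);
	return result;
}
```

With ordinary pages this should return 1. The commit message above describes the case where, for hugetlb, the child's pte stays writable and the parent's data is corrupted instead.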

2021-05-03 18:58:49

by Mike Kravetz

Subject: Re: [PATCH 1/2] mm/hugetlb: Fix F_SEAL_FUTURE_WRITE

On 5/1/21 7:41 AM, Peter Xu wrote:
> F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day.
> There is a test program for that and it fails constantly.
>
> $ ./memfd_test hugetlbfs
> memfd-hugetlb: CREATE
> memfd-hugetlb: BASIC
> memfd-hugetlb: SEAL-WRITE
> memfd-hugetlb: SEAL-FUTURE-WRITE
> mmap() didn't fail as expected
> Aborted (core dumped)
>
> I think it's probably because no one is really running the hugetlbfs test.
>
> Fix it by checking FUTURE_WRITE also in hugetlbfs_file_mmap() as what we do in
> shmem_mmap(). Generalize a helper for that.
>
> Reported-by: Hugh Dickins <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> fs/hugetlbfs/inode.c | 5 +++++
> include/linux/mm.h | 32 ++++++++++++++++++++++++++++++++
> mm/shmem.c | 22 ++++------------------
> 3 files changed, 41 insertions(+), 18 deletions(-)

Thanks Peter and Hugh!

One question below,

>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a2a42335e8fd2..39922c0f2fc8c 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -131,10 +131,15 @@ static void huge_pagevec_release(struct pagevec *pvec)
> static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> {
> struct inode *inode = file_inode(file);
> + struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
> loff_t len, vma_len;
> int ret;
> struct hstate *h = hstate_file(file);
>
> + ret = seal_check_future_write(info->seals, vma);
> + if (ret)
> + return ret;
> +
> /*
> * vma address alignment (but not the pgoff alignment) has
> * already been checked by prepare_hugepage_range. If you add

The full comment below the code you added is:

	/*
	 * vma address alignment (but not the pgoff alignment) has
	 * already been checked by prepare_hugepage_range. If you add
	 * any error returns here, do so after setting VM_HUGETLB, so
	 * is_vm_hugetlb_page tests below unmap_region go the right
	 * way when do_mmap unwinds (may be important on powerpc
	 * and ia64).
	 */

This comment was added in commit 68589bc35303 by Hugh, although it
appears David Gibson added the reason for the comment in the commit
message:

"If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
because the given file offset is not hugepage aligned - then do_mmap_pgoff
will go to the unmap_and_free_vma backout path.

But at this stage the vma hasn't been marked as hugepage, and the backout path
will call unmap_region() on it. That will eventually call down to the
non-hugepage version of unmap_page_range(). On ppc64, at least, that will
cause serious problems if there are any existing hugepage pagetable entries in
the vicinity - for example if there are any other hugepage mappings under the
same PUD. unmap_page_range() will trigger a bad_pud() on the hugepage pud
entries. I suspect this will also cause bad problems on ia64, though I don't
have a machine to test it on."

There are still comments in the unmap code about special handling of
ppc64 PUDs. So, this may still be an issue.

I am trying to dig into the code to determine if this is still an
issue. Just curious if you looked into this? Might be simpler and
safer to just put the seal check after setting the VM_HUGETLB flag?

--
Mike Kravetz


2021-05-03 20:55:11

by Mike Kravetz

Subject: Re: [PATCH 2/2] mm/hugetlb: Fix cow where page writtable in child

On 5/1/21 7:41 AM, Peter Xu wrote:
> When fork() and copy hugetlb page range, we'll remember to wrprotect src pte if
> needed, however we forget about the child! Without it, the child will be able
> to write to parent's pages when mapped as PROT_READ|PROT_WRITE and MAP_PRIVATE,
> which will cause data corruption in the parent process.
>
> This issue can also be exposed by "memfd_test hugetlbfs" kselftest (if it can
> pass the F_SEAL_FUTURE_WRITE test first, though).
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/hugetlb.c | 2 ++
> 1 file changed, 2 insertions(+)

Reviewed-by: Mike Kravetz <[email protected]>

I think we need to add, "Fixes: 4eae4efa2c29" as this is now in v5.12
--
Mike Kravetz

2021-05-03 21:32:06

by Peter Xu

Subject: Re: [PATCH 1/2] mm/hugetlb: Fix F_SEAL_FUTURE_WRITE

Mike,

On Mon, May 03, 2021 at 11:55:41AM -0700, Mike Kravetz wrote:
> On 5/1/21 7:41 AM, Peter Xu wrote:
> > F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day.
> > There is a test program for that and it fails constantly.
> >
> > $ ./memfd_test hugetlbfs
> > memfd-hugetlb: CREATE
> > memfd-hugetlb: BASIC
> > memfd-hugetlb: SEAL-WRITE
> > memfd-hugetlb: SEAL-FUTURE-WRITE
> > mmap() didn't fail as expected
> > Aborted (core dumped)
> >
> > I think it's probably because no one is really running the hugetlbfs test.
> >
> > Fix it by checking FUTURE_WRITE also in hugetlbfs_file_mmap() as what we do in
> > shmem_mmap(). Generalize a helper for that.
> >
> > Reported-by: Hugh Dickins <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > fs/hugetlbfs/inode.c | 5 +++++
> > include/linux/mm.h | 32 ++++++++++++++++++++++++++++++++
> > mm/shmem.c | 22 ++++------------------
> > 3 files changed, 41 insertions(+), 18 deletions(-)
>
> Thanks Peter and Hugh!
>
> One question below,
>
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index a2a42335e8fd2..39922c0f2fc8c 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -131,10 +131,15 @@ static void huge_pagevec_release(struct pagevec *pvec)
> > static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> > {
> > struct inode *inode = file_inode(file);
> > + struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
> > loff_t len, vma_len;
> > int ret;
> > struct hstate *h = hstate_file(file);
> >
> > + ret = seal_check_future_write(info->seals, vma);
> > + if (ret)
> > + return ret;
> > +
> > /*
> > * vma address alignment (but not the pgoff alignment) has
> > * already been checked by prepare_hugepage_range. If you add
>
> The full comment below the code you added is:
>
> /*
> * vma address alignment (but not the pgoff alignment) has
> * already been checked by prepare_hugepage_range. If you add
> * any error returns here, do so after setting VM_HUGETLB, so
> * is_vm_hugetlb_page tests below unmap_region go the right
> * way when do_mmap unwinds (may be important on powerpc
> * and ia64).
> */
>
> This comment was added in commit 68589bc35303 by Hugh, although it
> appears David Gibson added the reason for the comment in the commit
> message:
>
> "If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
> because the given file offset is not hugepage aligned - then do_mmap_pgoff
> will go to the unmap_and_free_vma backout path.
>
> But at this stage the vma hasn't been marked as hugepage, and the backout path
> will call unmap_region() on it. That will eventually call down to the
> non-hugepage version of unmap_page_range(). On ppc64, at least, that will
> cause serious problems if there are any existing hugepage pagetable entries in
> the vicinity - for example if there are any other hugepage mappings under the
> same PUD. unmap_page_range() will trigger a bad_pud() on the hugepage pud
> entries. I suspect this will also cause bad problems on ia64, though I don't
> have a machine to test it on."
>
> There are still comments in the unmap code about special handling of
> ppc64 PUDs. So, this may still be an issue.
>
> I am trying to dig into the code to determine if this is still an
> issue. Just curious if you looked into this? Might be simpler and
> safer to just put the seal check after setting the VM_HUGETLB flag?

Good catch! I overlooked that, and I haven't looked into it yet. For now I'd
better move the check to after the flags are set, in all cases.

I'll also add:

Fixes: ab3948f58ff84 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")

Thanks,

--
Peter Xu

2021-05-03 21:43:22

by Peter Xu

Subject: Re: [PATCH 2/2] mm/hugetlb: Fix cow where page writtable in child

On Mon, May 03, 2021 at 01:53:03PM -0700, Mike Kravetz wrote:
> On 5/1/21 7:41 AM, Peter Xu wrote:
> > When fork() and copy hugetlb page range, we'll remember to wrprotect src pte if
> > needed, however we forget about the child! Without it, the child will be able
> > to write to parent's pages when mapped as PROT_READ|PROT_WRITE and MAP_PRIVATE,
> > which will cause data corruption in the parent process.
> >
> > This issue can also be exposed by "memfd_test hugetlbfs" kselftest (if it can
> > pass the F_SEAL_FUTURE_WRITE test first, though).
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > mm/hugetlb.c | 2 ++
> > 1 file changed, 2 insertions(+)
>
> Reviewed-by: Mike Kravetz <[email protected]>

Thanks!

>
> I think we need to add, "Fixes: 4eae4efa2c29" as this is now in v5.12

I could be mistaken, but my understanding is that it has been broken since the
very first cow support for hugetlbfs in 2006... So if we want a Fixes tag,
maybe this?

Fixes: 1e8f889b10d8d ("[PATCH] Hugetlb: Copy on Write support")

--
Peter Xu

2021-05-03 22:11:01

by Mike Kravetz

Subject: Re: [PATCH 2/2] mm/hugetlb: Fix cow where page writtable in child

On 5/3/21 2:41 PM, Peter Xu wrote:
> On Mon, May 03, 2021 at 01:53:03PM -0700, Mike Kravetz wrote:
>> On 5/1/21 7:41 AM, Peter Xu wrote:
>>> When fork() and copy hugetlb page range, we'll remember to wrprotect src pte if
>>> needed, however we forget about the child! Without it, the child will be able
>>> to write to parent's pages when mapped as PROT_READ|PROT_WRITE and MAP_PRIVATE,
>>> which will cause data corruption in the parent process.
>>>
>>> This issue can also be exposed by "memfd_test hugetlbfs" kselftest (if it can
>>> pass the F_SEAL_FUTURE_WRITE test first, though).
>>>
>>> Signed-off-by: Peter Xu <[email protected]>
>>> ---
>>> mm/hugetlb.c | 2 ++
>>> 1 file changed, 2 insertions(+)
>>
>> Reviewed-by: Mike Kravetz <[email protected]>
>
> Thanks!
>
>>
>> I think we need to add, "Fixes: 4eae4efa2c29" as this is now in v5.12
>
> I could be mistaken, but my understanding is it's broken from the most initial
> cow support of hugetlbfs in 2006... So if we want a fixes tag, maybe this?
>
> Fixes: 1e8f889b10d8d ("[PATCH] Hugetlb: Copy on Write support")
>

Here is why I think it was broken in 4eae4efa2c29. Prior to that commit
the code looked like this:

		if (cow) {
			/*
			 * No need to notify as we are downgrading page
			 * table protection not changing it to point
			 * to a new page.
			 *
			 * See Documentation/vm/mmu_notifier.rst
			 */
			huge_ptep_set_wrprotect(src, addr, src_pte);
		}
		entry = huge_ptep_get(src_pte);
		ptepage = pte_page(entry);
		get_page(ptepage);
		page_dup_rmap(ptepage, true);
		set_huge_pte_at(dst, addr, dst_pte, entry);
		hugetlb_count_add(pages_per_huge_page(h), dst);

After setting the wrprotect in the source pte, we 'huge_ptep_get' the
source to create the destination. Hence, wrprotect will be set in the
destination as well. It is perhaps not the most efficient, but
I think it 'works'.

It is subtle, or am I missing something?
--
Mike Kravetz
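
The ordering argument above can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: struct demo_pte and both functions are invented for illustration, and the second function assumes (as the context of patch 2/2 suggests) that after 4eae4efa2c29 the entry is read from the source before the source is write-protected.

```c
#include <stdbool.h>

/* Toy page table entry: just a frame number and a writable bit. */
struct demo_pte {
	unsigned long pfn;
	bool writable;
};

/*
 * Older ordering, as quoted above: write-protect the source first, then read
 * it. The child's entry inherits the read-only bit as a side effect.
 */
struct demo_pte demo_copy_protect_then_read(struct demo_pte *src, bool cow)
{
	if (cow)
		src->writable = false;
	return *src;
}

/*
 * Assumed newer ordering: the entry is read before the source is
 * write-protected, so the local copy keeps the writable bit unless it is
 * cleared explicitly (apply_fix models the line added by patch 2/2).
 */
struct demo_pte demo_copy_read_then_protect(struct demo_pte *src, bool cow,
					    bool apply_fix)
{
	struct demo_pte entry = *src;

	if (cow) {
		src->writable = false;
		if (apply_fix)
			entry.writable = false;
	}
	return entry;
}
```

In this model the first function always hands the child a read-only entry for a cow mapping, while the second does so only when apply_fix is true.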

2021-05-03 22:35:56

by Peter Xu

Subject: Re: [PATCH 2/2] mm/hugetlb: Fix cow where page writtable in child

On Mon, May 03, 2021 at 03:10:04PM -0700, Mike Kravetz wrote:
> On 5/3/21 2:41 PM, Peter Xu wrote:
> > On Mon, May 03, 2021 at 01:53:03PM -0700, Mike Kravetz wrote:
> >> On 5/1/21 7:41 AM, Peter Xu wrote:
> >>> When fork() and copy hugetlb page range, we'll remember to wrprotect src pte if
> >>> needed, however we forget about the child! Without it, the child will be able
> >>> to write to parent's pages when mapped as PROT_READ|PROT_WRITE and MAP_PRIVATE,
> >>> which will cause data corruption in the parent process.
> >>>
> >>> This issue can also be exposed by "memfd_test hugetlbfs" kselftest (if it can
> >>> pass the F_SEAL_FUTURE_WRITE test first, though).
> >>>
> >>> Signed-off-by: Peter Xu <[email protected]>
> >>> ---
> >>> mm/hugetlb.c | 2 ++
> >>> 1 file changed, 2 insertions(+)
> >>
> >> Reviewed-by: Mike Kravetz <[email protected]>
> >
> > Thanks!
> >
> >>
> >> I think we need to add, "Fixes: 4eae4efa2c29" as this is now in v5.12
> >
> > I could be mistaken, but my understanding is it's broken from the most initial
> > cow support of hugetlbfs in 2006... So if we want a fixes tag, maybe this?
> >
> > Fixes: 1e8f889b10d8d ("[PATCH] Hugetlb: Copy on Write support")
> >
>
> Here is why I think it was broken in 4eae4efa2c29. Prior to that commit
> the code looked like this:
>
> if (cow) {
> /*
> * No need to notify as we are downgrading page
> * table protection not changing it to point
> * to a new page.
> *
> * See Documentation/vm/mmu_notifier.rst
> */
> huge_ptep_set_wrprotect(src, addr, src_pte);
> }
> entry = huge_ptep_get(src_pte);
> ptepage = pte_page(entry);
> get_page(ptepage);
> page_dup_rmap(ptepage, true);
> set_huge_pte_at(dst, addr, dst_pte, entry);
> hugetlb_count_add(pages_per_huge_page(h), dst);
>
> After setting the wrprotect in the source pte, we 'huge_ptep_get' the
> source to create the destination. Hence, wrprotect will be set in the
> destination as well. It is perhaps not the most efficient, but
> I think it 'works'.
>
> It is subtle, or am I missing something?

You're right, thanks Mike. I'll repost and add the correct Fixes tag.

--
Peter Xu

2021-05-03 22:41:04

by Mike Kravetz

Subject: Re: [PATCH 1/2] mm/hugetlb: Fix F_SEAL_FUTURE_WRITE

On 5/3/21 2:31 PM, Peter Xu wrote:
> Mike,
>
> On Mon, May 03, 2021 at 11:55:41AM -0700, Mike Kravetz wrote:
>> On 5/1/21 7:41 AM, Peter Xu wrote:
>>> F_SEAL_FUTURE_WRITE is missing for hugetlb starting from the first day.
>>> There is a test program for that and it fails constantly.
>>>
>>> $ ./memfd_test hugetlbfs
>>> memfd-hugetlb: CREATE
>>> memfd-hugetlb: BASIC
>>> memfd-hugetlb: SEAL-WRITE
>>> memfd-hugetlb: SEAL-FUTURE-WRITE
>>> mmap() didn't fail as expected
>>> Aborted (core dumped)
>>>
>>> I think it's probably because no one is really running the hugetlbfs test.
>>>
>>> Fix it by checking FUTURE_WRITE also in hugetlbfs_file_mmap() as what we do in
>>> shmem_mmap(). Generalize a helper for that.
>>>
>>> Reported-by: Hugh Dickins <[email protected]>
>>> Signed-off-by: Peter Xu <[email protected]>
>>> ---
>>> fs/hugetlbfs/inode.c | 5 +++++
>>> include/linux/mm.h | 32 ++++++++++++++++++++++++++++++++
>>> mm/shmem.c | 22 ++++------------------
>>> 3 files changed, 41 insertions(+), 18 deletions(-)
>>
>> Thanks Peter and Hugh!
>>
>> One question below,
>>
>>>
>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>> index a2a42335e8fd2..39922c0f2fc8c 100644
>>> --- a/fs/hugetlbfs/inode.c
>>> +++ b/fs/hugetlbfs/inode.c
>>> @@ -131,10 +131,15 @@ static void huge_pagevec_release(struct pagevec *pvec)
>>> static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>>> {
>>> struct inode *inode = file_inode(file);
>>> + struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
>>> loff_t len, vma_len;
>>> int ret;
>>> struct hstate *h = hstate_file(file);
>>>
>>> + ret = seal_check_future_write(info->seals, vma);
>>> + if (ret)
>>> + return ret;
>>> +
>>> /*
>>> * vma address alignment (but not the pgoff alignment) has
>>> * already been checked by prepare_hugepage_range. If you add
>>
>> The full comment below the code you added is:
>>
>> /*
>> * vma address alignment (but not the pgoff alignment) has
>> * already been checked by prepare_hugepage_range. If you add
>> * any error returns here, do so after setting VM_HUGETLB, so
>> * is_vm_hugetlb_page tests below unmap_region go the right
>> * way when do_mmap unwinds (may be important on powerpc
>> * and ia64).
>> */
>>
>> This comment was added in commit 68589bc35303 by Hugh, although it
>> appears David Gibson added the reason for the comment in the commit
>> message:
>>
>> "If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
>> because the given file offset is not hugepage aligned - then do_mmap_pgoff
>> will go to the unmap_and_free_vma backout path.
>>
>> But at this stage the vma hasn't been marked as hugepage, and the backout path
>> will call unmap_region() on it. That will eventually call down to the
>> non-hugepage version of unmap_page_range(). On ppc64, at least, that will
>> cause serious problems if there are any existing hugepage pagetable entries in
>> the vicinity - for example if there are any other hugepage mappings under the
>> same PUD. unmap_page_range() will trigger a bad_pud() on the hugepage pud
>> entries. I suspect this will also cause bad problems on ia64, though I don't
>> have a machine to test it on."
>>
>> There are still comments in the unmap code about special handling of
>> ppc64 PUDs. So, this may still be an issue.
>>
>> I am trying to dig into the code to determine if this is still an
>> issue. Just curious if you looked into this? Might be simpler and
>> safer to just put the seal check after setting the VM_HUGETLB flag?
>
> Good catch! I overlooked on that, and I definitely didn't look into it yet.
> For now I'd better move that check to be after the flag settings in all cases.
>
> I'll also add:
>
> Fixes: ab3948f58ff84 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")
>

Thanks! With those changes, you can add,

Reviewed-by: Mike Kravetz <[email protected]>

--
Mike Kravetz