2022-09-21 08:14:31

by Liu Shixin

Subject: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault

The vma_lock and hugetlb_fault_mutex are dropped before handling
userfault and reacquired after handle_userfault(), but
reacquiring the vma_lock can lead to a UAF[1] due to the following
race:

hugetlb_fault
hugetlb_no_page
/*unlock vma_lock */
hugetlb_handle_userfault
handle_userfault
/* unlock mm->mmap_lock*/
vm_mmap_pgoff
do_mmap
mmap_region
munmap_vma_range
/* clean old vma */
/* lock vma_lock again <--- UAF */
/* unlock vma_lock */

Since the vma_lock is unlocked immediately after hugetlb_handle_userfault()
returns, drop the unneeded lock and unlock in hugetlb_handle_userfault() to
fix the issue.

[1] https://lore.kernel.org/linux-mm/[email protected]/
Reported-by: Liu Zixian <[email protected]>
Signed-off-by: Liu Shixin <[email protected]>
Signed-off-by: Kefeng Wang <[email protected]>
---
mm/hugetlb.c | 30 +++++++++++-------------------
1 file changed, 11 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9b8526d27c29..5a5d466692cf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5489,7 +5489,6 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
unsigned long addr,
unsigned long reason)
{
- vm_fault_t ret;
u32 hash;
struct vm_fault vmf = {
.vma = vma,
@@ -5508,17 +5507,12 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,

/*
* vma_lock and hugetlb_fault_mutex must be
- * dropped before handling userfault. Reacquire
- * after handling fault to make calling code simpler.
+ * dropped before handling userfault.
*/
hugetlb_vma_unlock_read(vma);
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- ret = handle_userfault(&vmf, reason);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
- hugetlb_vma_lock_read(vma);
-
- return ret;
+ return handle_userfault(&vmf, reason);
}

static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
@@ -5537,6 +5531,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
bool reserve_alloc = false;
+ u32 hash = hugetlb_fault_mutex_hash(mapping, idx);

/*
* Currently, we are forced to kill the process in the event the
@@ -5547,7 +5542,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) {
pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n",
current->pid);
- return ret;
+ goto out;
}

/*
@@ -5561,12 +5556,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (idx >= size)
goto out;
/* Check for page in userfault range */
- if (userfaultfd_missing(vma)) {
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ if (userfaultfd_missing(vma))
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MISSING);
- goto out;
- }

page = alloc_huge_page(vma, haddr, 0);
if (IS_ERR(page)) {
@@ -5634,10 +5627,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (userfaultfd_minor(vma)) {
unlock_page(page);
put_page(page);
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MINOR);
- goto out;
}
}

@@ -5695,6 +5687,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,

unlock_page(page);
out:
+ hugetlb_vma_unlock_read(vma);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
return ret;

backout:
@@ -5792,11 +5786,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,

entry = huge_ptep_get(ptep);
/* PTE markers should be handled the same way as none pte */
- if (huge_pte_none_mostly(entry)) {
- ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+ if (huge_pte_none_mostly(entry))
+ return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
entry, flags);
- goto out_mutex;
- }

ret = 0;

--
2.25.1


2022-09-21 18:09:01

by Mike Kravetz

Subject: Re: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault

On 09/21/22 16:34, Liu Shixin wrote:
> The vma_lock and hugetlb_fault_mutex are dropped before handling
> userfault and reacquired after handle_userfault(), but
> reacquiring the vma_lock can lead to a UAF[1] due to the following
> race:
>
> hugetlb_fault
> hugetlb_no_page
> /*unlock vma_lock */
> hugetlb_handle_userfault
> handle_userfault
> /* unlock mm->mmap_lock*/
> vm_mmap_pgoff
> do_mmap
> mmap_region
> munmap_vma_range
> /* clean old vma */
> /* lock vma_lock again <--- UAF */
> /* unlock vma_lock */
>
> Since the vma_lock is unlocked immediately after hugetlb_handle_userfault()
> returns, drop the unneeded lock and unlock in hugetlb_handle_userfault() to
> fix the issue.

Thank you very much!

When I saw this report, the obvious fix was to do something like what you have
done below. That looks fine with a few minor comments.

One question I have not yet answered is, "Does this same issue apply to
follow_hugetlb_page()?". I believe it does. follow_hugetlb_page calls
hugetlb_fault which could result in the fault being processed by userfaultfd.
If we experience the race above, then the associated vma could no longer be
valid when returning from hugetlb_fault. follow_hugetlb_page and callers
have a flag (locked) to deal with dropping mmap lock. However, I am not sure
if it is handled correctly WRT userfaultfd. I think this needs to be answered
before fixing. And, if the follow_hugetlb_page code needs to be fixed it
should be done at the same time.

> [1] https://lore.kernel.org/linux-mm/[email protected]/
> Reported-by: Liu Zixian <[email protected]>

Perhaps the Reported-by should be,
Reported-by: [email protected]
https://lore.kernel.org/linux-mm/[email protected]/

Should also add,
Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")

as well as,
Cc: <[email protected]>

> Signed-off-by: Liu Shixin <[email protected]>
> Signed-off-by: Kefeng Wang <[email protected]>
> ---
> mm/hugetlb.c | 30 +++++++++++-------------------
> 1 file changed, 11 insertions(+), 19 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9b8526d27c29..5a5d466692cf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
...
> @@ -5792,11 +5786,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>
> entry = huge_ptep_get(ptep);
> /* PTE markers should be handled the same way as none pte */
> - if (huge_pte_none_mostly(entry)) {
> - ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
> + if (huge_pte_none_mostly(entry))

We should add a big comment noting that hugetlb_no_page will drop the vma lock
and hugetlb fault mutex. This will make it easier for people reading the code
who might otherwise think we are returning without dropping the locks.
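
For example (just a sketch; exact wording is of course up to you):

	entry = huge_ptep_get(ptep);
	/* PTE markers should be handled the same way as none pte */
	if (huge_pte_none_mostly(entry))
		/*
		 * hugetlb_no_page() will drop the vma lock and hugetlb
		 * fault mutex internally, which is why we return here
		 * without dropping them ourselves.
		 */
		return hugetlb_no_page(mm, vma, mapping, idx, address,
				       ptep, entry, flags);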

> + return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
> entry, flags);
> - goto out_mutex;
> - }
>
> ret = 0;
>
> --
> 2.25.1
>

--
Mike Kravetz

2022-09-21 18:25:24

by Sidhartha Kumar

Subject: Re: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault



On 9/21/22 3:34 AM, Liu Shixin wrote:
> The vma_lock and hugetlb_fault_mutex are dropped before handling
> userfault and reacquired after handle_userfault(), but
> reacquiring the vma_lock can lead to a UAF[1] due to the following
> race:
>
> hugetlb_fault
> hugetlb_no_page
> /*unlock vma_lock */
> hugetlb_handle_userfault
> handle_userfault
> /* unlock mm->mmap_lock*/
> vm_mmap_pgoff
> do_mmap
> mmap_region
> munmap_vma_range
> /* clean old vma */
> /* lock vma_lock again <--- UAF */
> /* unlock vma_lock */
>
> Since the vma_lock is unlocked immediately after hugetlb_handle_userfault()
> returns, drop the unneeded lock and unlock in hugetlb_handle_userfault() to
> fix the issue.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> Reported-by: Liu Zixian <[email protected]>
> Signed-off-by: Liu Shixin <[email protected]>
> Signed-off-by: Kefeng Wang <[email protected]>
> ---
> mm/hugetlb.c | 30 +++++++++++-------------------
> 1 file changed, 11 insertions(+), 19 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9b8526d27c29..5a5d466692cf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5489,7 +5489,6 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
> unsigned long addr,
> unsigned long reason)
> {
> - vm_fault_t ret;
> u32 hash;
> struct vm_fault vmf = {
> .vma = vma,
> @@ -5508,17 +5507,12 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
>
> /*
> * vma_lock and hugetlb_fault_mutex must be
> - * dropped before handling userfault. Reacquire
> - * after handling fault to make calling code simpler.
> + * dropped before handling userfault.
> */
> hugetlb_vma_unlock_read(vma);
> hash = hugetlb_fault_mutex_hash(mapping, idx);
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> - ret = handle_userfault(&vmf, reason);
> - mutex_lock(&hugetlb_fault_mutex_table[hash]);
> - hugetlb_vma_lock_read(vma);
> -
> - return ret;
> + return handle_userfault(&vmf, reason);
> }
>
> static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> @@ -5537,6 +5531,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> unsigned long haddr = address & huge_page_mask(h);
> bool new_page, new_pagecache_page = false;
> bool reserve_alloc = false;
> + u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
>
> /*
> * Currently, we are forced to kill the process in the event the
> @@ -5547,7 +5542,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) {
> pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n",
> current->pid);
> - return ret;
> + goto out;
> }
>
> /*
> @@ -5561,12 +5556,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> if (idx >= size)
> goto out;
> /* Check for page in userfault range */
> - if (userfaultfd_missing(vma)) {
> - ret = hugetlb_handle_userfault(vma, mapping, idx,
> + if (userfaultfd_missing(vma))
> + return hugetlb_handle_userfault(vma, mapping, idx,
> flags, haddr, address,
> VM_UFFD_MISSING);
> - goto out;
> - }
>
> page = alloc_huge_page(vma, haddr, 0);
> if (IS_ERR(page)) {
> @@ -5634,10 +5627,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> if (userfaultfd_minor(vma)) {
> unlock_page(page);
> put_page(page);
> - ret = hugetlb_handle_userfault(vma, mapping, idx,
> + return hugetlb_handle_userfault(vma, mapping, idx,
> flags, haddr, address,
> VM_UFFD_MINOR);
> - goto out;
> }
> }
>
> @@ -5695,6 +5687,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>
> unlock_page(page);
> out:
> + hugetlb_vma_unlock_read(vma);
> + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> return ret;
>
> backout:
> @@ -5792,11 +5786,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>
> entry = huge_ptep_get(ptep);
> /* PTE markers should be handled the same way as none pte */
> - if (huge_pte_none_mostly(entry)) {
> - ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
> + if (huge_pte_none_mostly(entry))
> + return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
> entry, flags);
> - goto out_mutex;
> - }
>
> ret = 0;
>

I've been looking at this as well.
Reviewed-by: Sidhartha Kumar <[email protected]>

2022-09-21 20:42:20

by Andrew Morton

Subject: Re: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault

On Wed, 21 Sep 2022 16:34:40 +0800 Liu Shixin <[email protected]> wrote:

> The vma_lock and hugetlb_fault_mutex are dropped before handling
> userfault and reacquired after handle_userfault(), but
> reacquiring the vma_lock can lead to a UAF[1] due to the following
> race:
>
> hugetlb_fault
> hugetlb_no_page
> /*unlock vma_lock */
> hugetlb_handle_userfault
> handle_userfault
> /* unlock mm->mmap_lock*/
> vm_mmap_pgoff
> do_mmap
> mmap_region
> munmap_vma_range
> /* clean old vma */
> /* lock vma_lock again <--- UAF */
> /* unlock vma_lock */
>
> Since the vma_lock is unlocked immediately after hugetlb_handle_userfault()
> returns, drop the unneeded lock and unlock in hugetlb_handle_userfault() to
> fix the issue.
>
> @@ -5508,17 +5507,12 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
>
> /*
> * vma_lock and hugetlb_fault_mutex must be
> - * dropped before handling userfault. Reacquire
> - * after handling fault to make calling code simpler.
> + * dropped before handling userfault.
> */
> hugetlb_vma_unlock_read(vma);
> hash = hugetlb_fault_mutex_hash(mapping, idx);
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> - ret = handle_userfault(&vmf, reason);
> - mutex_lock(&hugetlb_fault_mutex_table[hash]);
> - hugetlb_vma_lock_read(vma);
> -
> - return ret;
> + return handle_userfault(&vmf, reason);
> }

Current code is rather different from this. So if the bug still exists
in current code, please verify this and redo the patch appropriately?

And hang on to this version to help with the -stable backporting.

Thanks.

2022-09-22 00:08:59

by Mike Kravetz

Subject: Re: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault

On 09/21/22 10:48, Mike Kravetz wrote:
> On 09/21/22 16:34, Liu Shixin wrote:
> > The vma_lock and hugetlb_fault_mutex are dropped before handling
> > userfault and reacquired after handle_userfault(), but
> > reacquiring the vma_lock can lead to a UAF[1] due to the following
> > race:
> >
> > hugetlb_fault
> > hugetlb_no_page
> > /*unlock vma_lock */
> > hugetlb_handle_userfault
> > handle_userfault
> > /* unlock mm->mmap_lock*/
> > vm_mmap_pgoff
> > do_mmap
> > mmap_region
> > munmap_vma_range
> > /* clean old vma */
> > /* lock vma_lock again <--- UAF */
> > /* unlock vma_lock */
> >
> > Since the vma_lock is unlocked immediately after hugetlb_handle_userfault()
> > returns, drop the unneeded lock and unlock in hugetlb_handle_userfault() to
> > fix the issue.
>
> Thank you very much!
>
> When I saw this report, the obvious fix was to do something like what you have
> done below. That looks fine with a few minor comments.
>
> One question I have not yet answered is, "Does this same issue apply to
> follow_hugetlb_page()?". I believe it does. follow_hugetlb_page calls
> hugetlb_fault which could result in the fault being processed by userfaultfd.
> If we experience the race above, then the associated vma could no longer be
> valid when returning from hugetlb_fault. follow_hugetlb_page and callers
> have a flag (locked) to deal with dropping mmap lock. However, I am not sure
> if it is handled correctly WRT userfaultfd. I think this needs to be answered
> before fixing. And, if the follow_hugetlb_page code needs to be fixed it
> should be done at the same time.
>

To at least verify this code path, I added userfaultfd handling to the gup_test
program in kernel selftests. When doing basic gup test on a hugetlb page in
a userfaultfd registered range, I hit this warning:

[ 6939.867796] FAULT_FLAG_ALLOW_RETRY missing 1
[ 6939.871503] CPU: 2 PID: 5720 Comm: gup_test Not tainted 6.0.0-rc6-next-20220921+ #72
[ 6939.874562] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
[ 6939.877707] Call Trace:
[ 6939.878745] <TASK>
[ 6939.879779] dump_stack_lvl+0x6c/0x9f
[ 6939.881199] handle_userfault.cold+0x14/0x1e
[ 6939.882830] ? find_held_lock+0x2b/0x80
[ 6939.884370] ? __mutex_unlock_slowpath+0x45/0x280
[ 6939.886145] hugetlb_handle_userfault+0x90/0xf0
[ 6939.887936] hugetlb_fault+0xb7e/0xda0
[ 6939.889409] ? vprintk_emit+0x118/0x3a0
[ 6939.890903] ? _printk+0x58/0x73
[ 6939.892279] follow_hugetlb_page.cold+0x59/0x145
[ 6939.894116] __get_user_pages+0x146/0x750
[ 6939.895580] __gup_longterm_locked+0x3e9/0x680
[ 6939.897023] ? seqcount_lockdep_reader_access.constprop.0+0xa5/0xb0
[ 6939.898939] ? lockdep_hardirqs_on+0x7d/0x100
[ 6939.901243] gup_test_ioctl+0x320/0x6e0
[ 6939.902202] __x64_sys_ioctl+0x87/0xc0
[ 6939.903220] do_syscall_64+0x38/0x90
[ 6939.904233] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 6939.905423] RIP: 0033:0x7fbb53830f7b

This is because userfaultfd is expecting FAULT_FLAG_ALLOW_RETRY which is not
set in this path.
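
For reference, the check in handle_userfault() that prints this looks
roughly like the following (trimmed; ret was initialized to
VM_FAULT_SIGBUS earlier in the function):

	if (unlikely(!(vmf->flags & FAULT_FLAG_ALLOW_RETRY))) {
		/*
		 * Validate the invariant that nowait must allow retry
		 * to be enabled.
		 */
		BUG_ON(vmf->flags & FAULT_FLAG_RETRY_NOWAIT);
#ifdef CONFIG_DEBUG_VM
		if (printk_ratelimit()) {
			printk(KERN_WARNING
			       "FAULT_FLAG_ALLOW_RETRY missing %x\n",
			       vmf->flags);
			dump_stack();
		}
#endif
		goto out;
	}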

Adding John, Peter and David on Cc: as they are much more fluent in all the
fault and FOLL combinations and might have immediate suggestions. It is going
to take me a little while to figure out:
1) How to make sure we get the right flags passed to handle_userfault
2) How to modify follow_hugetlb_page as userfaultfd can certainly drop
mmap_lock. So we can not assume vma still exists upon return.

--
Mike Kravetz

2022-09-22 01:32:15

by John Hubbard

Subject: Re: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault

On 9/21/22 16:57, Mike Kravetz wrote:
> On 09/21/22 10:48, Mike Kravetz wrote:
>> On 09/21/22 16:34, Liu Shixin wrote:
>>> The vma_lock and hugetlb_fault_mutex are dropped before handling
>>> userfault and reacquired after handle_userfault(), but
>>> reacquiring the vma_lock can lead to a UAF[1] due to the following
>>> race:
>>>
>>> hugetlb_fault
>>> hugetlb_no_page
>>> /*unlock vma_lock */
>>> hugetlb_handle_userfault
>>> handle_userfault
>>> /* unlock mm->mmap_lock*/
>>> vm_mmap_pgoff
>>> do_mmap
>>> mmap_region
>>> munmap_vma_range
>>> /* clean old vma */
>>> /* lock vma_lock again <--- UAF */
>>> /* unlock vma_lock */
>>>
>>> Since the vma_lock is unlocked immediately after hugetlb_handle_userfault(),
>>> returns, drop the unneeded lock and unlock in hugetlb_handle_userfault() to
>>> fix the issue.
>>
>> Thank you very much!
>>
>> When I saw this report, the obvious fix was to do something like what you have
>> done below. That looks fine with a few minor comments.
>>
>> One question I have not yet answered is, "Does this same issue apply to
>> follow_hugetlb_page()?". I believe it does. follow_hugetlb_page calls
>> hugetlb_fault which could result in the fault being processed by userfaultfd.
>> If we experience the race above, then the associated vma could no longer be
>> valid when returning from hugetlb_fault. follow_hugetlb_page and callers
>> have a flag (locked) to deal with dropping mmap lock. However, I am not sure
>> if it is handled correctly WRT userfaultfd. I think this needs to be answered
>> before fixing. And, if the follow_hugetlb_page code needs to be fixed it
>> should be done at the same time.
>>
>
> To at least verify this code path, I added userfaultfd handling to the gup_test
> program in kernel selftests. When doing basic gup test on a hugetlb page in

Just for those of us who are easily confused by userfaultfd cases, can you show
what that patch is? It would help me understand this a little faster.

Actually I'm expecting that Peter can easily answer this whole thing. :)

thanks,

--
John Hubbard
NVIDIA

2022-09-22 02:19:49

by Liu Shixin

Subject: Re: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault



On 2022/9/22 3:07, Andrew Morton wrote:
> On Wed, 21 Sep 2022 16:34:40 +0800 Liu Shixin <[email protected]> wrote:
>
>> The vma_lock and hugetlb_fault_mutex are dropped before handling
>> userfault and reacquired after handle_userfault(), but
>> reacquiring the vma_lock can lead to a UAF[1] due to the following
>> race:
>>
>> hugetlb_fault
>> hugetlb_no_page
>> /*unlock vma_lock */
>> hugetlb_handle_userfault
>> handle_userfault
>> /* unlock mm->mmap_lock*/
>> vm_mmap_pgoff
>> do_mmap
>> mmap_region
>> munmap_vma_range
>> /* clean old vma */
>> /* lock vma_lock again <--- UAF */
>> /* unlock vma_lock */
>>
>> Since the vma_lock is unlocked immediately after hugetlb_handle_userfault()
>> returns, drop the unneeded lock and unlock in hugetlb_handle_userfault() to
>> fix the issue.
>>
>> @@ -5508,17 +5507,12 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
>>
>> /*
>> * vma_lock and hugetlb_fault_mutex must be
>> - * dropped before handling userfault. Reacquire
>> - * after handling fault to make calling code simpler.
>> + * dropped before handling userfault.
>> */
>> hugetlb_vma_unlock_read(vma);
>> hash = hugetlb_fault_mutex_hash(mapping, idx);
>> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>> - ret = handle_userfault(&vmf, reason);
>> - mutex_lock(&hugetlb_fault_mutex_table[hash]);
>> - hugetlb_vma_lock_read(vma);
>> -
>> - return ret;
>> + return handle_userfault(&vmf, reason);
>> }
> Current code is rather different from this. So if the bug still exists
> in current code, please verify this and redo the patch appropriately?
>
> And hang on to this version to help with the -stable backporting.
>
> Thanks.
> .
This patch conflicts with the patch series "hugetlb: Use new vma lock for huge pmd sharing synchronization".
So I reproduced the problem on next-20220920, and this patch is based on next-20220920 instead of mainline.
This problem has existed since v4.11. I will send the stable version later.

2022-09-22 02:52:00

by Mike Kravetz

Subject: Re: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault

On 09/21/22 17:57, John Hubbard wrote:
> On 9/21/22 16:57, Mike Kravetz wrote:
> > On 09/21/22 10:48, Mike Kravetz wrote:
> >> On 09/21/22 16:34, Liu Shixin wrote:
> >>> The vma_lock and hugetlb_fault_mutex are dropped before handling
> >>> userfault and reacquired after handle_userfault(), but
> >>> reacquiring the vma_lock can lead to a UAF[1] due to the following
> >>> race:
> >>>
> >>> hugetlb_fault
> >>> hugetlb_no_page
> >>> /*unlock vma_lock */
> >>> hugetlb_handle_userfault
> >>> handle_userfault
> >>> /* unlock mm->mmap_lock*/
> >>> vm_mmap_pgoff
> >>> do_mmap
> >>> mmap_region
> >>> munmap_vma_range
> >>> /* clean old vma */
> >>> /* lock vma_lock again <--- UAF */
> >>> /* unlock vma_lock */
> >>>
> >>> Since the vma_lock is unlocked immediately after hugetlb_handle_userfault()
> >>> returns, drop the unneeded lock and unlock in hugetlb_handle_userfault() to
> >>> fix the issue.
> >>
> >> Thank you very much!
> >>
> >> When I saw this report, the obvious fix was to do something like what you have
> >> done below. That looks fine with a few minor comments.
> >>
> >> One question I have not yet answered is, "Does this same issue apply to
> >> follow_hugetlb_page()?". I believe it does. follow_hugetlb_page calls
> >> hugetlb_fault which could result in the fault being processed by userfaultfd.
> >> If we experience the race above, then the associated vma could no longer be
> >> valid when returning from hugetlb_fault. follow_hugetlb_page and callers
> >> have a flag (locked) to deal with dropping mmap lock. However, I am not sure
> >> if it is handled correctly WRT userfaultfd. I think this needs to be answered
> >> before fixing. And, if the follow_hugetlb_page code needs to be fixed it
> >> should be done at the same time.
> >>
> >
> > To at least verify this code path, I added userfaultfd handling to the gup_test
> > program in kernel selftests. When doing basic gup test on a hugetlb page in
>
> Just for those of us who are easily confused by userfaultfd cases, can you show
> what that patch is? It would help me understand this a little faster.

The ugly (just throw code at it to make it work) diff is below. All I did was
add the sample code from the userfaultfd man page.

With that change in place, I just run 'gup_test -U -z -m 2 -H'

> Actually I'm expecting that Peter can easily answer this whole thing. :)

diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
index e43879291dac..491424d0a039 100644
--- a/tools/testing/selftests/vm/gup_test.c
+++ b/tools/testing/selftests/vm/gup_test.c
@@ -8,8 +8,11 @@
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
+#include <sys/syscall.h>
#include <pthread.h>
#include <assert.h>
+#include <poll.h>
+#include <linux/userfaultfd.h>
#include <mm/gup_test.h>
#include "../kselftest.h"

@@ -48,6 +51,75 @@ static char *cmd_to_str(unsigned long cmd)
return "Unknown command";
}

+static void *fault_handler_thread(void *arg)
+{
+ static struct uffd_msg msg; /* Data read from userfaultfd */
+ static int fault_cnt = 0; /* Number of faults so far handled */
+ long uffd;
+ static char *page = NULL;
+ struct uffdio_copy uffdio_copy;
+ ssize_t nread;
+ size_t page_size = 2 * 1024 * 1024;
+
+ uffd = (long) arg;
+
+ if (page == NULL) {
+ page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (page == MAP_FAILED)
+ exit(1);
+ }
+
+ for (;;) {
+ struct pollfd pollfd;
+ int nready;
+ pollfd.fd = uffd;
+ pollfd.events = POLLIN;
+ nready = poll(&pollfd, 1, -1);
+ if (nready == -1)
+ exit(1);
+
+ printf("\nfault_handler_thread():\n");
+ printf(" poll() returns: nready = %d; "
+ "POLLIN = %d; POLLERR = %d\n", nready,
+ (pollfd.revents & POLLIN) != 0,
+ (pollfd.revents & POLLERR) != 0);
+
+ nread = read(uffd, &msg, sizeof(msg));
+ if (nread == 0) {
+ printf("EOF on userfaultfd!\n");
+ exit(1);
+ }
+
+ if (nread == -1) {
+ printf("nread == -1\n");
+ exit(1);
+ }
+
+ if (msg.event != UFFD_EVENT_PAGEFAULT) {
+ printf("Unexpected event on userfaultfd\n");
+ exit(1);
+ }
+
+ printf(" UFFD_EVENT_PAGEFAULT event: ");
+ printf("flags = %llx; ", msg.arg.pagefault.flags);
+ printf("address = %llx\n", msg.arg.pagefault.address);
+
+ memset(page, 'A' + fault_cnt % 20, page_size);
+ fault_cnt++;
+
+ uffdio_copy.src = (unsigned long) page;
+ uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
+ ~(page_size - 1);
+ uffdio_copy.len = page_size;
+ uffdio_copy.mode = 0;
+ uffdio_copy.copy = 0;
+ if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
+ exit(1);
+
+ }
+}
+
void *gup_thread(void *data)
{
struct gup_test gup = *(struct gup_test *)data;
@@ -94,7 +166,11 @@ int main(int argc, char **argv)
int flags = MAP_PRIVATE, touch = 0;
char *file = "/dev/zero";
pthread_t *tid;
+ pthread_t thr;
char *p;
+ long uffd;
+ struct uffdio_api uffdio_api;
+ struct uffdio_register uffdio_register;

while ((opt = getopt(argc, argv, "m:r:n:F:f:abcj:tTLUuwWSHpz")) != -1) {
switch (opt) {
@@ -230,6 +306,18 @@ int main(int argc, char **argv)
exit(KSFT_SKIP);
}

+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+ if (uffd == -1) {
+ perror("uffd open");
+ exit(1);
+ }
+ uffdio_api.api = UFFD_API;
+ uffdio_api.features = 0;
+ if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
+ perror("uffd ioctl API");
+ exit(1);
+ }
+
p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, filed, 0);
if (p == MAP_FAILED) {
perror("mmap");
@@ -237,6 +325,20 @@ int main(int argc, char **argv)
}
gup.addr = (unsigned long)p;

+ uffdio_register.range.start = (unsigned long)p;
+ uffdio_register.range.len = size;
+ uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+ if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
+ perror("uffd ioctl API");
+ exit(1);
+ }
+ ret = pthread_create(&thr, NULL, fault_handler_thread, (void *)uffd);
+ if (ret) {
+ exit(1);
+ }
+
+ sleep(5);
+
if (thp == 1)
madvise(p, size, MADV_HUGEPAGE);
else if (thp == 0)
--
Mike Kravetz

2022-09-22 08:19:22

by David Hildenbrand

Subject: Re: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault

On 22.09.22 01:57, Mike Kravetz wrote:
> On 09/21/22 10:48, Mike Kravetz wrote:
>> On 09/21/22 16:34, Liu Shixin wrote:
>>> The vma_lock and hugetlb_fault_mutex are dropped before handling
>>> userfault and reacquired after handle_userfault(), but
>>> reacquiring the vma_lock can lead to a UAF[1] due to the following
>>> race:
>>>
>>> hugetlb_fault
>>> hugetlb_no_page
>>> /*unlock vma_lock */
>>> hugetlb_handle_userfault
>>> handle_userfault
>>> /* unlock mm->mmap_lock*/
>>> vm_mmap_pgoff
>>> do_mmap
>>> mmap_region
>>> munmap_vma_range
>>> /* clean old vma */
>>> /* lock vma_lock again <--- UAF */
>>> /* unlock vma_lock */
>>>
>>> Since the vma_lock is unlocked immediately after hugetlb_handle_userfault()
>>> returns, drop the unneeded lock and unlock in hugetlb_handle_userfault() to
>>> fix the issue.
>>
>> Thank you very much!
>>
>> When I saw this report, the obvious fix was to do something like what you have
>> done below. That looks fine with a few minor comments.
>>
>> One question I have not yet answered is, "Does this same issue apply to
>> follow_hugetlb_page()?". I believe it does. follow_hugetlb_page calls
>> hugetlb_fault which could result in the fault being processed by userfaultfd.
>> If we experience the race above, then the associated vma could no longer be
>> valid when returning from hugetlb_fault. follow_hugetlb_page and callers
>> have a flag (locked) to deal with dropping mmap lock. However, I am not sure
>> if it is handled correctly WRT userfaultfd. I think this needs to be answered
>> before fixing. And, if the follow_hugetlb_page code needs to be fixed it
>> should be done at the same time.
>>
>
> To at least verify this code path, I added userfaultfd handling to the gup_test
> program in kernel selftests. When doing basic gup test on a hugetlb page in
> a userfaultfd registered range, I hit this warning:
>
> [ 6939.867796] FAULT_FLAG_ALLOW_RETRY missing 1
> [ 6939.871503] CPU: 2 PID: 5720 Comm: gup_test Not tainted 6.0.0-rc6-next-20220921+ #72
> [ 6939.874562] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> [ 6939.877707] Call Trace:
> [ 6939.878745] <TASK>
> [ 6939.879779] dump_stack_lvl+0x6c/0x9f
> [ 6939.881199] handle_userfault.cold+0x14/0x1e
> [ 6939.882830] ? find_held_lock+0x2b/0x80
> [ 6939.884370] ? __mutex_unlock_slowpath+0x45/0x280
> [ 6939.886145] hugetlb_handle_userfault+0x90/0xf0
> [ 6939.887936] hugetlb_fault+0xb7e/0xda0
> [ 6939.889409] ? vprintk_emit+0x118/0x3a0
> [ 6939.890903] ? _printk+0x58/0x73
> [ 6939.892279] follow_hugetlb_page.cold+0x59/0x145
> [ 6939.894116] __get_user_pages+0x146/0x750
> [ 6939.895580] __gup_longterm_locked+0x3e9/0x680
> [ 6939.897023] ? seqcount_lockdep_reader_access.constprop.0+0xa5/0xb0
> [ 6939.898939] ? lockdep_hardirqs_on+0x7d/0x100
> [ 6939.901243] gup_test_ioctl+0x320/0x6e0
> [ 6939.902202] __x64_sys_ioctl+0x87/0xc0
> [ 6939.903220] do_syscall_64+0x38/0x90
> [ 6939.904233] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [ 6939.905423] RIP: 0033:0x7fbb53830f7b
>
> This is because userfaultfd is expecting FAULT_FLAG_ALLOW_RETRY which is not
> set in this path.

Right. Without being able to drop the mmap lock, we cannot continue. And
we don't know if we can drop it without FAULT_FLAG_ALLOW_RETRY.

FAULT_FLAG_ALLOW_RETRY is only set when we can communicate to the caller
that we dropped the mmap lock [e.g., int *locked parameter].

All code paths that pass NULL -- surprisingly, that includes
pin_user_pages_fast() -- cannot trigger userfaultfd and
will result in this warning.
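
For illustration, a prepared caller looks roughly like this (an untested
sketch; "mm", "addr" and "page" are placeholders):

	int locked = 1;
	long nr;

	mmap_read_lock(mm);
	nr = get_user_pages_remote(mm, addr, 1, FOLL_WRITE, &page,
				   NULL, &locked);
	/*
	 * GUP may have dropped mmap_lock (e.g., to let userfaultfd
	 * resolve the fault) and cleared "locked"; any cached vma
	 * pointer must be looked up again before reuse.
	 */
	if (locked)
		mmap_read_unlock(mm);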


A "sane" example is access via /proc/self/mem via ptrace: we don't want
to trigger userfaultfd, but instead simply fail the GUP get/pin.


Now, this is just a printed *warning* (not a WARN/BUG/taint) that tells
us that there is a GUP user that isn't prepared for userfaultfd. So it
rather points out a missing GUP adaptation -- incomplete userfaultfd
support. And we seem to have plenty of that, judging by
pin_user_pages_fast_only().

Maybe the printed stack trace is a bit too much and makes this look very
scary.

>
> Adding John, Peter and David on Cc: as they are much more fluent in all the
> fault and FOLL combinations and might have immediate suggestions. It is going
> to take me a little while to figure out:
> 1) How to make sure we get the right flags passed to handle_userfault

This is a GUP caller problem -- or rather, how GUP has to deal with
userfaultfd.

> 2) How to modify follow_hugetlb_page as userfaultfd can certainly drop
> mmap_lock. So we can not assume vma still exists upon return.

Again, we have to communicate to the GUP caller that we dropped the mmap
lock. And that requires GUP caller changes.

--
Thanks,

David / dhildenb

2022-09-22 15:35:05

by Peter Xu

Subject: Re: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault

On Wed, Sep 21, 2022 at 04:57:39PM -0700, Mike Kravetz wrote:
> On 09/21/22 10:48, Mike Kravetz wrote:
> > On 09/21/22 16:34, Liu Shixin wrote:
> > > The vma_lock and hugetlb_fault_mutex are dropped before handling
> > > userfault and reacquired after handle_userfault(), but
> > > reacquiring the vma_lock can lead to a UAF[1] due to the following
> > > race:
> > >
> > > hugetlb_fault
> > > hugetlb_no_page
> > > /*unlock vma_lock */
> > > hugetlb_handle_userfault
> > > handle_userfault
> > > /* unlock mm->mmap_lock*/
> > > vm_mmap_pgoff
> > > do_mmap
> > > mmap_region
> > > munmap_vma_range
> > > /* clean old vma */
> > > /* lock vma_lock again <--- UAF */
> > > /* unlock vma_lock */
> > >
> > > Since the vma_lock is unlocked immediately after hugetlb_handle_userfault()
> > > returns, drop the unneeded lock and unlock in hugetlb_handle_userfault() to
> > > fix the issue.
> >
> > Thank you very much!
> >
> > When I saw this report, the obvious fix was to do something like what you have
> > done below. That looks fine with a few minor comments.
> >
> > One question I have not yet answered is, "Does this same issue apply to
> > follow_hugetlb_page()?". I believe it does. follow_hugetlb_page calls
> > hugetlb_fault which could result in the fault being processed by userfaultfd.
> > If we experience the race above, then the associated vma could no longer be
> > valid when returning from hugetlb_fault. follow_hugetlb_page and callers
> > have a flag (locked) to deal with dropping mmap lock. However, I am not sure
> > if it is handled correctly WRT userfaultfd. I think this needs to be answered
> > before fixing. And, if the follow_hugetlb_page code needs to be fixed it
> > should be done at the same time.
> >
>
> To at least verify this code path, I added userfaultfd handling to the gup_test
> program in kernel selftests.

IIRC vm/userfaultfd should already have GUP tested with pthread mutexes
(which iiuc uses futex, and futex uses GUP).

But indeed I didn't trigger any GUP paths after a quick run. I agree we
should have some unit test that can at least cover GUP with userfaultfd.
I'll check further from the vm/userfaultfd side later.

> When doing basic gup test on a hugetlb page in
> a userfaultfd registered range, I hit this warning:
>
> [ 6939.867796] FAULT_FLAG_ALLOW_RETRY missing 1
> [ 6939.871503] CPU: 2 PID: 5720 Comm: gup_test Not tainted 6.0.0-rc6-next-20220921+ #72
> [ 6939.874562] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> [ 6939.877707] Call Trace:
> [ 6939.878745] <TASK>
> [ 6939.879779] dump_stack_lvl+0x6c/0x9f
> [ 6939.881199] handle_userfault.cold+0x14/0x1e
> [ 6939.882830] ? find_held_lock+0x2b/0x80
> [ 6939.884370] ? __mutex_unlock_slowpath+0x45/0x280
> [ 6939.886145] hugetlb_handle_userfault+0x90/0xf0
> [ 6939.887936] hugetlb_fault+0xb7e/0xda0
> [ 6939.889409] ? vprintk_emit+0x118/0x3a0
> [ 6939.890903] ? _printk+0x58/0x73
> [ 6939.892279] follow_hugetlb_page.cold+0x59/0x145
> [ 6939.894116] __get_user_pages+0x146/0x750
> [ 6939.895580] __gup_longterm_locked+0x3e9/0x680
> [ 6939.897023] ? seqcount_lockdep_reader_access.constprop.0+0xa5/0xb0
> [ 6939.898939] ? lockdep_hardirqs_on+0x7d/0x100
> [ 6939.901243] gup_test_ioctl+0x320/0x6e0
> [ 6939.902202] __x64_sys_ioctl+0x87/0xc0
> [ 6939.903220] do_syscall_64+0x38/0x90
> [ 6939.904233] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [ 6939.905423] RIP: 0033:0x7fbb53830f7b
>
> This is because userfaultfd is expecting FAULT_FLAG_ALLOW_RETRY which is not
> set in this path.
>
> Adding John, Peter and David on Cc: as they are much more fluent in all the
> fault and FOLL combinations and might have immediate suggestions. It is going
> to take me a little while to figure out:
> 1) How to make sure we get the right flags passed to handle_userfault

As David mentioned, one way is to have "locked" passed in with non-NULL.

The other way is to have FOLL_NOWAIT even if locked==NULL.

Here IIUC the trick is that when the GUP caller neither wants to release
the mmap lock nor wants to stop quickly (i.e. it wants to wait for the
page fault with the mmap lock held), we'll have both locked==NULL and
!FOLL_NOWAIT. Userfaultfd currently doesn't think that's wise, so it
generates the warning with CONFIG_DEBUG_VM.
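
For context, faultin_page() derives the fault flags roughly like this
(trimmed from mm/gup.c):

	unsigned int fault_flags = 0;

	if (locked)
		/* we can tell the caller that mmap_lock was dropped */
		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
	if (*flags & FOLL_NOWAIT)
		/* start the fault, but do not wait for it to complete */
		fault_flags |= FAULT_FLAG_ALLOW_RETRY |
			       FAULT_FLAG_RETRY_NOWAIT;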

> 2) How to modify follow_hugetlb_page as userfaultfd can certainly drop
> mmap_lock. So we can not assume vma still exists upon return.

I think the FOLL_NOWAIT flag might work if the only thing we want to do is
to trigger the handle_userfault() path. But I'll also look into
vm/userfaultfd as mentioned above to make sure we'll have GUP covered there
too. I'll update if I find anything useful there.

Off-topic a bit: the whole discussion reminded me of a question about
whether userfaultfd is behaving correctly here. E.g., here userfaultfd
should really look like the case where a swap-in is needed for a file.
FOLL_NOWAIT on swap-in means:

#define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
				 * and return without waiting upon it */

Now userfaultfd returns VM_FAULT_RETRY immediately with FOLL_NOWAIT. I'm
wondering whether it should really generate the userfault message before
doing that, to match the semantics of the original use of FOLL_NOWAIT on
swapping.

Thanks,

--
Peter Xu

2022-09-22 17:42:15

by Mike Kravetz

Subject: Re: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault

On 09/22/22 09:46, David Hildenbrand wrote:
> On 22.09.22 01:57, Mike Kravetz wrote:
> > On 09/21/22 10:48, Mike Kravetz wrote:
> > > On 09/21/22 16:34, Liu Shixin wrote:
> > > > The vma_lock and hugetlb_fault_mutex are dropped before handling
> > > > userfault and reacquired after handle_userfault(), but
> > > > reacquiring the vma_lock can lead to a UAF[1] due to the following
> > > > race:
> > > >
> > > > hugetlb_fault
> > > > hugetlb_no_page
> > > > /*unlock vma_lock */
> > > > hugetlb_handle_userfault
> > > > handle_userfault
> > > > /* unlock mm->mmap_lock*/
> > > > vm_mmap_pgoff
> > > > do_mmap
> > > > mmap_region
> > > > munmap_vma_range
> > > > /* clean old vma */
> > > > /* lock vma_lock again <--- UAF */
> > > > /* unlock vma_lock */
> > > >
> > > > Since the vma_lock is unlocked immediately after hugetlb_handle_userfault()
> > > > returns, drop the unneeded lock and unlock in hugetlb_handle_userfault() to
> > > > fix the issue.
> > >
> > > Thank you very much!
> > >
> > > When I saw this report, the obvious fix was to do something like what you have
> > > done below. That looks fine with a few minor comments.
> > >
> > > One question I have not yet answered is, "Does this same issue apply to
> > > follow_hugetlb_page()?". I believe it does. follow_hugetlb_page calls
> > > hugetlb_fault which could result in the fault being processed by userfaultfd.
> > > If we experience the race above, then the associated vma could no longer be
> > > valid when returning from hugetlb_fault. follow_hugetlb_page and callers
> > > have a flag (locked) to deal with dropping mmap lock. However, I am not sure
> > > if it is handled correctly WRT userfaultfd. I think this needs to be answered
> > > before fixing. And, if the follow_hugetlb_page code needs to be fixed it
> > > should be done at the same time.
> > >
> >
> > To at least verify this code path, I added userfaultfd handling to the gup_test
> > program in kernel selftests. When doing basic gup test on a hugetlb page in
> > a userfaultfd registered range, I hit this warning:
> >
> > [ 6939.867796] FAULT_FLAG_ALLOW_RETRY missing 1
> > [ 6939.871503] CPU: 2 PID: 5720 Comm: gup_test Not tainted 6.0.0-rc6-next-20220921+ #72
> > [ 6939.874562] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> > [ 6939.877707] Call Trace:
> > [ 6939.878745] <TASK>
> > [ 6939.879779] dump_stack_lvl+0x6c/0x9f
> > [ 6939.881199] handle_userfault.cold+0x14/0x1e
> > [ 6939.882830] ? find_held_lock+0x2b/0x80
> > [ 6939.884370] ? __mutex_unlock_slowpath+0x45/0x280
> > [ 6939.886145] hugetlb_handle_userfault+0x90/0xf0
> > [ 6939.887936] hugetlb_fault+0xb7e/0xda0
> > [ 6939.889409] ? vprintk_emit+0x118/0x3a0
> > [ 6939.890903] ? _printk+0x58/0x73
> > [ 6939.892279] follow_hugetlb_page.cold+0x59/0x145
> > [ 6939.894116] __get_user_pages+0x146/0x750
> > [ 6939.895580] __gup_longterm_locked+0x3e9/0x680
> > [ 6939.897023] ? seqcount_lockdep_reader_access.constprop.0+0xa5/0xb0
> > [ 6939.898939] ? lockdep_hardirqs_on+0x7d/0x100
> > [ 6939.901243] gup_test_ioctl+0x320/0x6e0
> > [ 6939.902202] __x64_sys_ioctl+0x87/0xc0
> > [ 6939.903220] do_syscall_64+0x38/0x90
> > [ 6939.904233] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > [ 6939.905423] RIP: 0033:0x7fbb53830f7b
> >
> > This is because userfaultfd is expecting FAULT_FLAG_ALLOW_RETRY which is not
> > set in this path.
>
> Right. Without being able to drop the mmap lock, we cannot continue. And we
> don't know if we can drop it without FAULT_FLAG_ALLOW_RETRY.
>
> FAULT_FLAG_ALLOW_RETRY is only set when we can communicate to the caller
> that we dropped the mmap lock [e.g., int *locked parameter].
>
> All code paths that pass NULL -- surprisingly, that includes
> pin_user_pages_fast() -- cannot trigger userfaultfd and
> will result in this warning.
>
>
> A "sane" example is access via /proc/self/mem via ptrace: we don't want to
> trigger userfaultfd, but instead simply fail the GUP get/pin.
>
>
> Now, this is just a printed *warning* (not a WARN/BUG/taint) that tells us
> that there is a GUP user that isn't prepared for userfaultfd. So it rather
> points out a missing GUP adaptation -- incomplete userfaultfd support. And we
> seem to have plenty of that, judging by pin_user_pages_fast_only().
>
> Maybe the printed stack trace is a bit too much and makes this look very
> scary.
>
> >
> > Adding John, Peter and David on Cc: as they are much more fluent in all the
> > fault and FOLL combinations and might have immediate suggestions. It is going
> > to take me a little while to figure out:
> > 1) How to make sure we get the right flags passed to handle_userfault
>
> This is a GUP caller problem -- or rather, how GUP has to deal with
> userfaultfd.
>
> > 2) How to modify follow_hugetlb_page as userfaultfd can certainly drop
> > mmap_lock. So we can not assume vma still exists upon return.
>
> Again, we have to communicate to the GUP caller that we dropped the mmap
> lock. And that requires GUP caller changes.
>

Thank you and Peter for replying!

The 'good news' is that there does not appear to be a case where userfaultfd
(via hugetlb_fault) drops the lock and follow_hugetlb_page is not prepared for
the consequences. So, unlike hugetlb_handle_userfault, this is not an exposure
in need of an immediate fix; i.e., a fix like the one originally proposed here
is sufficient.

We can think about whether this specific calling sequence needs to be modified.
--
Mike Kravetz