2021-02-11 00:08:47

by Mike Kravetz

Subject: [RFC PATCH 2/5] hugetlb: enhance hugetlb fault processing to support soft dirty

The hugetlb fault processing code would COW all write faults where the
pte is not writable. Soft dirty write protects ptes as part of its
tracking mechanism. The existing hugetlb_cow code does the right thing
for PRIVATE mappings as it checks map_count. However, for SHARED
mappings it would actually allocate and install a COW page. Modify the
code so that hugetlb_cow is not called for SHARED mappings and the pte
is simply updated instead.

Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 47f3123afd1a..b561b6867ec1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4584,8 +4584,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* spinlock. For private mappings, we also lookup the pagecache
* page now as it is used to determine if a reservation has been
* consumed.
+ * Only non-shared mappings are sent to hugetlb_cow.
*/
- if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
+ if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry) &&
+ !(vma->vm_flags & VM_SHARED)) {
if (vma_needs_reservation(h, vma, haddr) < 0) {
ret = VM_FAULT_OOM;
goto out_mutex;
@@ -4593,9 +4595,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/* Just decrements count, does not deallocate */
vma_end_reservation(h, vma, haddr);

- if (!(vma->vm_flags & VM_MAYSHARE))
- pagecache_page = hugetlbfs_pagecache_page(h,
- vma, haddr);
+ pagecache_page = hugetlbfs_pagecache_page(h, vma, haddr);
}

ptl = huge_pte_lock(h, mm, ptep);
@@ -4620,9 +4620,18 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,

if (flags & FAULT_FLAG_WRITE) {
if (!huge_pte_write(entry)) {
- ret = hugetlb_cow(mm, vma, address, ptep,
- pagecache_page, ptl);
- goto out_put_page;
+ if (!(vma->vm_flags & VM_SHARED)) {
+ ret = hugetlb_cow(mm, vma, address, ptep,
+ pagecache_page, ptl);
+ goto out_put_page;
+ }
+
+ /* write protected for soft dirty processing */
+ if ((vma->vm_flags & VM_WRITE) &&
+ (vma->vm_flags & VM_SHARED))
+ entry = huge_pte_mkwrite(entry);
+
+ entry = huge_pte_mkdirty(entry);
}
entry = huge_pte_mkdirty(entry);
}
--
2.29.2
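
For context, here is a minimal userspace sketch of the soft-dirty flow this
series targets (a hypothetical test, not part of the patch: it assumes a 2MB
default huge page size, a hugetlbfs mount at /dev/hugepages, and the series
applied so the write fault on the SHARED hugetlb mapping is handled as
described in the commit message):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumes 2MB huge pages */

int main(void)
{
	/* hypothetical hugetlbfs file backing a SHARED hugetlb mapping */
	int fd = open("/dev/hugepages/sd-test", O_CREAT | O_RDWR, 0600);
	uint64_t ent;
	char *p;

	ftruncate(fd, HPAGE_SIZE);
	p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	p[0] = 1;	/* fault the huge page in */

	/* writing "4" clears soft-dirty bits and write protects the ptes */
	int clear = open("/proc/self/clear_refs", O_WRONLY);
	write(clear, "4", 1);
	close(clear);

	p[0] = 2;	/* write fault again, handled in hugetlb_fault() */

	/* bit 55 of a pagemap entry is the soft-dirty bit */
	int pm = open("/proc/self/pagemap", O_RDONLY);
	pread(pm, &ent, sizeof(ent),
	      ((uintptr_t)p / sysconf(_SC_PAGESIZE)) * sizeof(ent));
	printf("soft-dirty: %llu\n", (unsigned long long)((ent >> 55) & 1));
	return 0;
}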


2021-02-17 21:18:44

by Peter Xu

Subject: Re: [RFC PATCH 2/5] hugetlb: enhance hugetlb fault processing to support soft dirty

On Wed, Feb 10, 2021 at 04:03:19PM -0800, Mike Kravetz wrote:
> The hugetlb fault processing code would COW all write faults where the
> pte is not writable. Soft dirty write protects ptes as part of its
> tracking mechanism. The existing hugetlb_cow code does the right thing
> for PRIVATE mappings as it checks map_count. However, for SHARED
> mappings it would actually allocate and install a COW page. Modify the
> code so that hugetlb_cow is not called for SHARED mappings and the pte
> is simply updated instead.
>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb.c | 23 ++++++++++++++++-------
> 1 file changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 47f3123afd1a..b561b6867ec1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4584,8 +4584,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> * spinlock. For private mappings, we also lookup the pagecache
> * page now as it is used to determine if a reservation has been
> * consumed.
> + * Only non-shared mappings are sent to hugetlb_cow.
> */
> - if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
> + if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry) &&
> + !(vma->vm_flags & VM_SHARED)) {
> if (vma_needs_reservation(h, vma, haddr) < 0) {
> ret = VM_FAULT_OOM;
> goto out_mutex;
> @@ -4593,9 +4595,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> /* Just decrements count, does not deallocate */
> vma_end_reservation(h, vma, haddr);
>
> - if (!(vma->vm_flags & VM_MAYSHARE))
> - pagecache_page = hugetlbfs_pagecache_page(h,
> - vma, haddr);
> + pagecache_page = hugetlbfs_pagecache_page(h, vma, haddr);

Pure question: I see that the check actually changed from VM_MAYSHARE to
VM_SHARED, and I realized I'm unclear on the difference between the two.
When VM_MAYSHARE is set, can VM_SHARED ever be cleared? In other words, is
this change intended?

I see that vma_set_page_prot() tries to remove VM_SHARED if soft dirty is
enabled (which should cause vma_wants_writenotify() to return true, iiuc),
but that removal is temporary, just to calculate vm_page_prot, and is not
applied to vma->vm_flags. I failed to find a place where VM_SHARED of the
vma is cleared while VM_MAYSHARE is set...
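
For reference, this is roughly what vma_set_page_prot() does (paraphrased
from mm/mmap.c, not a verbatim quote): the VM_SHARED clearing happens only on
a local copy of the flags used to recompute vm_page_prot.

void vma_set_page_prot(struct vm_area_struct *vma)
{
	unsigned long vm_flags = vma->vm_flags;
	pgprot_t vm_page_prot;

	vm_page_prot = vm_pgprot_modify(vma->vm_page_prot, vm_flags);
	if (vma_wants_writenotify(vma, vm_page_prot)) {
		/* local copy only; vma->vm_flags keeps VM_SHARED */
		vm_flags &= ~VM_SHARED;
		vm_page_prot = vm_pgprot_modify(vm_page_prot, vm_flags);
	}
	WRITE_ONCE(vma->vm_page_prot, vm_page_prot);
}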

> }
>
> ptl = huge_pte_lock(h, mm, ptep);
> @@ -4620,9 +4620,18 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>
> if (flags & FAULT_FLAG_WRITE) {
> if (!huge_pte_write(entry)) {
> - ret = hugetlb_cow(mm, vma, address, ptep,
> - pagecache_page, ptl);
> - goto out_put_page;
> + if (!(vma->vm_flags & VM_SHARED)) {
> + ret = hugetlb_cow(mm, vma, address, ptep,
> + pagecache_page, ptl);
> + goto out_put_page;
> + }
> +
> + /* write protected for soft dirty processing */
> + if ((vma->vm_flags & VM_WRITE) &&

This VM_WRITE check seems to be redundant. For example, do_user_addr_fault()
in the x86 code checks this right after the vma lookup, via access_error().
So by the time we get here with "flags & FAULT_FLAG_WRITE", VM_WRITE must
already be set, imho.
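
Roughly, from arch/x86/mm/fault.c (simplified, not a verbatim quote): a write
fault against a vma lacking VM_WRITE never gets past access_error():

	/* access_error(), simplified */
	if (error_code & X86_PF_WRITE) {
		/* write fault on a vma without VM_WRITE -> bad area */
		if (unlikely(!(vma->vm_flags & VM_WRITE)))
			return 1;
		return 0;
	}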

> + (vma->vm_flags & VM_SHARED))
> + entry = huge_pte_mkwrite(entry);

Same question for VM_SHARED, since "(vma->vm_flags & VM_SHARED)" was just
checked above and we would have gone to hugetlb_cow() otherwise.

> +
> + entry = huge_pte_mkdirty(entry);

There's another huge_pte_mkdirty() right below; could we merge the two somehow?

Thanks,

> }
> entry = huge_pte_mkdirty(entry);
> }
> --
> 2.29.2
>

--
Peter Xu

2021-02-18 23:29:38

by Mike Kravetz

Subject: Re: [RFC PATCH 2/5] hugetlb: enhance hugetlb fault processing to support soft dirty

On 2/17/21 11:32 AM, Peter Xu wrote:
> On Wed, Feb 10, 2021 at 04:03:19PM -0800, Mike Kravetz wrote:
>> The hugetlb fault processing code would COW all write faults where the
>> pte is not writable. Soft dirty write protects ptes as part of its
>> tracking mechanism. The existing hugetlb_cow code does the right thing
>> for PRIVATE mappings as it checks map_count. However, for SHARED
>> mappings it would actually allocate and install a COW page. Modify the
>> code so that hugetlb_cow is not called for SHARED mappings and the pte
>> is simply updated instead.
>>
>> Signed-off-by: Mike Kravetz <[email protected]>
>> ---
>> mm/hugetlb.c | 23 ++++++++++++++++-------
>> 1 file changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 47f3123afd1a..b561b6867ec1 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -4584,8 +4584,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>> * spinlock. For private mappings, we also lookup the pagecache
>> * page now as it is used to determine if a reservation has been
>> * consumed.
>> + * Only non-shared mappings are sent to hugetlb_cow.
>> */
>> - if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
>> + if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry) &&
>> + !(vma->vm_flags & VM_SHARED)) {
>> if (vma_needs_reservation(h, vma, haddr) < 0) {
>> ret = VM_FAULT_OOM;
>> goto out_mutex;
>> @@ -4593,9 +4595,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>> /* Just decrements count, does not deallocate */
>> vma_end_reservation(h, vma, haddr);
>>
>> - if (!(vma->vm_flags & VM_MAYSHARE))
>> - pagecache_page = hugetlbfs_pagecache_page(h,
>> - vma, haddr);
>> + pagecache_page = hugetlbfs_pagecache_page(h, vma, haddr);
>
> Pure question: I see that the check actually changed from VM_MAYSHARE to
> VM_SHARED, and I realized I'm unclear on the difference between the two.
> When VM_MAYSHARE is set, can VM_SHARED ever be cleared? In other words, is
> this change intended?

The change was not intended. I will use VM_MAYSHARE.

>
> I see that vma_set_page_prot() tries to remove VM_SHARED if soft dirty is
> enabled (which should cause vma_wants_writenotify() to return true, iiuc),
> but that removal is temporary, just to calculate vm_page_prot, and is not
> applied to vma->vm_flags. I failed to find a place where VM_SHARED of the
> vma is cleared while VM_MAYSHARE is set...

I am not 100% sure about all the differences. Here is a snippet from do_mmap()
where you can end up with VM_MAYSHARE set but not VM_SHARED:

			vm_flags |= VM_SHARED | VM_MAYSHARE;
			if (!(file->f_mode & FMODE_WRITE))
				vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
			fallthrough;
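
So, for instance, a MAP_SHARED mapping of a read-only file descriptor keeps
VM_MAYSHARE but drops VM_SHARED because FMODE_WRITE is not set. A hypothetical
illustration (/etc/hostname is just an arbitrary readable file):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/etc/hostname", O_RDONLY);	/* no FMODE_WRITE */
	void *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);

	/*
	 * The resulting vma has VM_MAYSHARE set but VM_SHARED clear;
	 * VmFlags in /proc/<pid>/smaps should show "ms" without "sh".
	 */
	pause();
	return (p == MAP_FAILED);
}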

>
>> }
>>
>> ptl = huge_pte_lock(h, mm, ptep);
>> @@ -4620,9 +4620,18 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>
>> if (flags & FAULT_FLAG_WRITE) {
>> if (!huge_pte_write(entry)) {
>> - ret = hugetlb_cow(mm, vma, address, ptep,
>> - pagecache_page, ptl);
>> - goto out_put_page;
>> + if (!(vma->vm_flags & VM_SHARED)) {
>> + ret = hugetlb_cow(mm, vma, address, ptep,
>> + pagecache_page, ptl);
>> + goto out_put_page;
>> + }
>> +
>> + /* write protected for soft dirty processing */
>> + if ((vma->vm_flags & VM_WRITE) &&
>
> This VM_WRITE check seems to be redundant. For example, do_user_addr_fault()
> in the x86 code checks this right after the vma lookup, via access_error().
> So by the time we get here with "flags & FAULT_FLAG_WRITE", VM_WRITE must
> already be set, imho.

Thanks, that sounds reasonable. I will check to make sure and drop the
redundant check.

>
>> + (vma->vm_flags & VM_SHARED))
>> + entry = huge_pte_mkwrite(entry);
>
> Same question for VM_SHARED, since "(vma->vm_flags & VM_SHARED)" was just
> checked above and we would have gone to hugetlb_cow() otherwise.

Yes, certainly redundant here.

>
>> +
>> + entry = huge_pte_mkdirty(entry);
>
> There's another huge_pte_mkdirty() right below; could we merge the two somehow?
>

Yes, I will merge them in the next version.
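
A rough, untested sketch of what that branch could look like after merging
the mkdirty calls, dropping the redundant checks, and switching to
VM_MAYSHARE as discussed above:

	if (flags & FAULT_FLAG_WRITE) {
		if (!huge_pte_write(entry)) {
			if (!(vma->vm_flags & VM_MAYSHARE)) {
				ret = hugetlb_cow(mm, vma, address, ptep,
						  pagecache_page, ptl);
				goto out_put_page;
			}
			/* pte was write protected for soft dirty tracking */
			entry = huge_pte_mkwrite(entry);
		}
		entry = huge_pte_mkdirty(entry);
	}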

Thanks for taking a look!

--
Mike Kravetz

> Thanks,
>
>> }
>> entry = huge_pte_mkdirty(entry);
>> }
>> --
>> 2.29.2
>>
>