2022-04-27 11:44:58

by Baolin Wang

[permalink] [raw]
Subject: [PATCH v2 0/3] Fix cache flush issues considering PMD sharing

Hi,

This patch set fixes some cache flushing issues that can occur when
PMD sharing is possible for hugetlb pages; they were found by code
inspection. Meanwhile, Mike found that flush_cache_page() cannot cover
the whole size of a hugetlb page on some architectures [1], so I added
a new patch 3 to fix this issue; after some investigation I found that
only try_to_unmap_one() and try_to_migrate_one() need to be fixed.

Please help to review. Thanks.

[1] https://lore.kernel.org/linux-mm/[email protected]/

Changes from v1:
- Add reviewed-by tag from Mike.
- Update some comments suggested by Mike.
- Add a new patch to fix cache issue for hugetlb page.

Baolin Wang (3):
mm: hugetlb: Considering PMD sharing when flushing cache/TLBs
mm: rmap: Move the cache flushing to the correct place for hugetlb PMD
sharing
mm: rmap: Use flush_cache_range() to flush cache for hugetlb pages

mm/hugetlb.c | 17 ++++++++-
mm/mremap.c | 2 +-
mm/rmap.c | 118 ++++++++++++++++++++++++++++++++---------------------------
3 files changed, 80 insertions(+), 57 deletions(-)

--
1.8.3.1


2022-04-27 11:45:02

by Baolin Wang

[permalink] [raw]
Subject: [PATCH v2 1/3] mm: hugetlb: Considering PMD sharing when flushing cache/TLBs

When moving hugetlb page tables, the cache flushing is called in
move_page_tables() without considering the shared PMDs, which may
cause cache issues on some architectures.

Thus we should move the hugetlb cache flushing into
move_hugetlb_page_tables(), taking the shared PMD ranges calculated
by adjust_range_if_pmd_sharing_possible() into account. Meanwhile,
also expand the TLB flushing range in case of shared PMDs.

Note this was discovered via code inspection, and has not been hit
as a real problem in practice so far.
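
As a condensed, illustrative view, the ordering the diff below implements
looks roughly like this (a simplified sketch only; locking, the mmu notifier
start/end calls and the actual PTE moves are omitted):

        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
                                old_addr, old_end);
        /* Widen start/end if a shared PMD may cover a larger region. */
        adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
        /* Flush caches over the possibly widened range before unsharing. */
        flush_cache_range(vma, range.start, range.end);

        /* ... unshare and/or move the hugetlb page table entries ... */

        if (shared_pmd)
                /* A PMD was unshared: the TLB flush must cover the widened range. */
                flush_tlb_range(vma, range.start, range.end);
        else
                /* Otherwise flushing the moved range is enough. */
                flush_tlb_range(vma, old_end - len, old_end);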

Fixes: 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed vma")
Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 17 +++++++++++++++--
mm/mremap.c | 2 +-
2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1945dfb..d3a6094 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4937,10 +4937,17 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
unsigned long old_addr_copy;
pte_t *src_pte, *dst_pte;
struct mmu_notifier_range range;
+ bool shared_pmd = false;

mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, old_addr,
old_end);
adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
+ /*
+ * In case of shared PMDs, we should cover the maximum possible
+ * range.
+ */
+ flush_cache_range(vma, range.start, range.end);
+
mmu_notifier_invalidate_range_start(&range);
/* Prevent race with file truncation */
i_mmap_lock_write(mapping);
@@ -4957,8 +4964,10 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
*/
old_addr_copy = old_addr;

- if (huge_pmd_unshare(mm, vma, &old_addr_copy, src_pte))
+ if (huge_pmd_unshare(mm, vma, &old_addr_copy, src_pte)) {
+ shared_pmd = true;
continue;
+ }

dst_pte = huge_pte_alloc(mm, new_vma, new_addr, sz);
if (!dst_pte)
@@ -4966,7 +4975,11 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,

move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte);
}
- flush_tlb_range(vma, old_end - len, old_end);
+
+ if (shared_pmd)
+ flush_tlb_range(vma, range.start, range.end);
+ else
+ flush_tlb_range(vma, old_end - len, old_end);
mmu_notifier_invalidate_range_end(&range);
i_mmap_unlock_write(mapping);

diff --git a/mm/mremap.c b/mm/mremap.c
index 98f50e6..0970025 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -490,12 +490,12 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
return 0;

old_end = old_addr + len;
- flush_cache_range(vma, old_addr, old_end);

if (is_vm_hugetlb_page(vma))
return move_hugetlb_page_tables(vma, new_vma, old_addr,
new_addr, len);

+ flush_cache_range(vma, old_addr, old_end);
mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
old_addr, old_end);
mmu_notifier_invalidate_range_start(&range);
--
1.8.3.1

2022-04-27 11:45:05

by Baolin Wang

[permalink] [raw]
Subject: [PATCH v2 2/3] mm: rmap: Move the cache flushing to the correct place for hugetlb PMD sharing

The cache level flush will always be first when changing an existing
virtual->physical mapping to a new value, since this allows us to
properly handle systems whose caches are strict and require a
virtual->physical translation to exist for a virtual address. So we
should move the cache flushing before huge_pmd_unshare().

As Muchun pointed out [1], the architectures that support hugetlb PMD
sharing currently have no cache flush issues in practice. But I think
we should still follow the cache/TLB flushing rules when changing a
valid virtual address mapping, to guard against potential issues in
the future.

[1] https://lore.kernel.org/all/YmT%2F%[email protected]/
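
To make the rule concrete, here is a minimal sketch of the ordering this
patch moves to for the shared-PMD path (illustrative only, not the literal
rmap code; the rmap walk and error handling are omitted):

        /*
         * 1. Flush the virtual caches while the old translation still
         *    exists, over the range already adjusted for PMD sharing.
         */
        flush_cache_range(vma, range.start, range.end);

        /* 2. Only then change the page tables (possibly unsharing a PMD). */
        if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
                /* 3. Finally invalidate the stale TLB entries for the range. */
                flush_tlb_range(vma, range.start, range.end);
                mmu_notifier_invalidate_range(mm, range.start, range.end);
        }
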
Signed-off-by: Baolin Wang <[email protected]>
---
mm/rmap.c | 40 ++++++++++++++++++++++------------------
1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 61e63db..4f0d115 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1535,15 +1535,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* do this outside rmap routines.
*/
VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+ /*
+ * huge_pmd_unshare may unmap an entire PMD page.
+ * There is no way of knowing exactly which PMDs may
+ * be cached for this mm, so we must flush them all.
+ * start/end were already adjusted above to cover this
+ * range.
+ */
+ flush_cache_range(vma, range.start, range.end);
+
if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
- /*
- * huge_pmd_unshare unmapped an entire PMD
- * page. There is no way of knowing exactly
- * which PMDs may be cached for this mm, so
- * we must flush them all. start/end were
- * already adjusted above to cover this range.
- */
- flush_cache_range(vma, range.start, range.end);
flush_tlb_range(vma, range.start, range.end);
mmu_notifier_invalidate_range(mm, range.start,
range.end);
@@ -1560,13 +1561,14 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
page_vma_mapped_walk_done(&pvmw);
break;
}
+ } else {
+ flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
}

/*
* Nuke the page table entry. When having to clear
* PageAnonExclusive(), we always have to flush.
*/
- flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
if (should_defer_flush(mm, flags) && !anon_exclusive) {
/*
* We clear the PTE but do not flush so potentially
@@ -1890,15 +1892,16 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
* do this outside rmap routines.
*/
VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+ /*
+ * huge_pmd_unshare may unmap an entire PMD page.
+ * There is no way of knowing exactly which PMDs may
+ * be cached for this mm, so we must flush them all.
+ * start/end were already adjusted above to cover this
+ * range.
+ */
+ flush_cache_range(vma, range.start, range.end);
+
if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
- /*
- * huge_pmd_unshare unmapped an entire PMD
- * page. There is no way of knowing exactly
- * which PMDs may be cached for this mm, so
- * we must flush them all. start/end were
- * already adjusted above to cover this range.
- */
- flush_cache_range(vma, range.start, range.end);
flush_tlb_range(vma, range.start, range.end);
mmu_notifier_invalidate_range(mm, range.start,
range.end);
@@ -1915,10 +1918,11 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
page_vma_mapped_walk_done(&pvmw);
break;
}
+ } else {
+ flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
}

/* Nuke the page table entry. */
- flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
pteval = ptep_clear_flush(vma, address, pvmw.pte);

/* Set the dirty flag on the folio now the pte is gone. */
--
1.8.3.1

2022-04-27 11:45:13

by Baolin Wang

[permalink] [raw]
Subject: [PATCH v2 3/3] mm: rmap: Use flush_cache_range() to flush cache for hugetlb pages

Currently we use flush_cache_page() to flush the cache for anonymous
hugetlb pages when unmapping or migrating a hugetlb page mapping, but
flush_cache_page() only handles a PAGE_SIZE range on some architectures
(such as arm32 and arc), which can cause potential cache issues. Thus
change to flush_cache_range() to cover the whole size of a hugetlb page.
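
For illustration, the difference in coverage on such an architecture looks
roughly like this (a hypothetical snippet, only to show the ranges involved;
the patch itself flushes the already adjusted range.start/range.end):

        unsigned long sz = huge_page_size(hstate_vma(vma));

        /* Covers only the first PAGE_SIZE of the huge page on arm32/arc. */
        flush_cache_page(vma, address, pte_pfn(*pvmw.pte));

        /* Covers the whole huge page (address assumed huge-page aligned here). */
        flush_cache_range(vma, address, address + sz);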

Signed-off-by: Baolin Wang <[email protected]>
---
mm/rmap.c | 90 ++++++++++++++++++++++++++++++++++-----------------------------
1 file changed, 48 insertions(+), 42 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 4f0d115..6fdd198 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1528,13 +1528,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
anon_exclusive = folio_test_anon(folio) &&
PageAnonExclusive(subpage);

- if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) {
- /*
- * To call huge_pmd_unshare, i_mmap_rwsem must be
- * held in write mode. Caller needs to explicitly
- * do this outside rmap routines.
- */
- VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+ if (folio_test_hugetlb(folio)) {
/*
* huge_pmd_unshare may unmap an entire PMD page.
* There is no way of knowing exactly which PMDs may
@@ -1544,22 +1538,31 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*/
flush_cache_range(vma, range.start, range.end);

- if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
- flush_tlb_range(vma, range.start, range.end);
- mmu_notifier_invalidate_range(mm, range.start,
- range.end);
-
+ if (!folio_test_anon(folio)) {
/*
- * The ref count of the PMD page was dropped
- * which is part of the way map counting
- * is done for shared PMDs. Return 'true'
- * here. When there is no other sharing,
- * huge_pmd_unshare returns false and we will
- * unmap the actual page and drop map count
- * to zero.
+ * To call huge_pmd_unshare, i_mmap_rwsem must be
+ * held in write mode. Caller needs to explicitly
+ * do this outside rmap routines.
*/
- page_vma_mapped_walk_done(&pvmw);
- break;
+ VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+
+ if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
+ flush_tlb_range(vma, range.start, range.end);
+ mmu_notifier_invalidate_range(mm, range.start,
+ range.end);
+
+ /*
+ * The ref count of the PMD page was dropped
+ * which is part of the way map counting
+ * is done for shared PMDs. Return 'true'
+ * here. When there is no other sharing,
+ * huge_pmd_unshare returns false and we will
+ * unmap the actual page and drop map count
+ * to zero.
+ */
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
}
} else {
flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
@@ -1885,13 +1888,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
anon_exclusive = folio_test_anon(folio) &&
PageAnonExclusive(subpage);

- if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) {
- /*
- * To call huge_pmd_unshare, i_mmap_rwsem must be
- * held in write mode. Caller needs to explicitly
- * do this outside rmap routines.
- */
- VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+ if (folio_test_hugetlb(folio)) {
/*
* huge_pmd_unshare may unmap an entire PMD page.
* There is no way of knowing exactly which PMDs may
@@ -1901,22 +1898,31 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
*/
flush_cache_range(vma, range.start, range.end);

- if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
- flush_tlb_range(vma, range.start, range.end);
- mmu_notifier_invalidate_range(mm, range.start,
- range.end);
-
+ if (!folio_test_anon(folio)) {
/*
- * The ref count of the PMD page was dropped
- * which is part of the way map counting
- * is done for shared PMDs. Return 'true'
- * here. When there is no other sharing,
- * huge_pmd_unshare returns false and we will
- * unmap the actual page and drop map count
- * to zero.
+ * To call huge_pmd_unshare, i_mmap_rwsem must be
+ * held in write mode. Caller needs to explicitly
+ * do this outside rmap routines.
*/
- page_vma_mapped_walk_done(&pvmw);
- break;
+ VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+
+ if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
+ flush_tlb_range(vma, range.start, range.end);
+ mmu_notifier_invalidate_range(mm, range.start,
+ range.end);
+
+ /*
+ * The ref count of the PMD page was dropped
+ * which is part of the way map counting
+ * is done for shared PMDs. Return 'true'
+ * here. When there is no other sharing,
+ * huge_pmd_unshare returns false and we will
+ * unmap the actual page and drop map count
+ * to zero.
+ */
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
}
} else {
flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
--
1.8.3.1

2022-04-28 04:16:00

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] mm: hugetlb: Considering PMD sharing when flushing cache/TLBs

On Wed, Apr 27, 2022 at 06:52:05PM +0800, Baolin Wang wrote:
> When moving hugetlb page tables, the cache flushing is called in
> move_page_tables() without considering the shared PMDs, which may
> cause cache issues on some architectures.
>
> Thus we should move the hugetlb cache flushing into
> move_hugetlb_page_tables(), taking the shared PMD ranges calculated
> by adjust_range_if_pmd_sharing_possible() into account. Meanwhile,
> also expand the TLB flushing range in case of shared PMDs.
>
> Note this was discovered via code inspection, and has not been hit
> as a real problem in practice so far.
>
> Fixes: 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed vma")
> Signed-off-by: Baolin Wang <[email protected]>
> Reviewed-by: Mike Kravetz <[email protected]>

LGTM.

Reviewed-by: Muchun Song <[email protected]>

Thanks.

2022-04-28 09:20:35

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mm: rmap: Move the cache flushing to the correct place for hugetlb PMD sharing



On 4/28/2022 1:55 PM, Muchun Song wrote:
> On Wed, Apr 27, 2022 at 06:52:06PM +0800, Baolin Wang wrote:
>> The cache level flush will always be first when changing an existing
>> virtual->physical mapping to a new value, since this allows us to
>> properly handle systems whose caches are strict and require a
>> virtual->physical translation to exist for a virtual address. So we
>> should move the cache flushing before huge_pmd_unshare().
>>
>
> Right.
>
>> As Muchun pointed out [1], the architectures that support hugetlb PMD
>> sharing currently have no cache flush issues in practice. But I think
>> we should still follow the cache/TLB flushing rules when changing a
>> valid virtual address mapping, to guard against potential issues in
>> the future.
>
> Right. One point I need to clarify. I do not object to this change but
> want you to clarify this (not an issue in practice) in the commit log
> to let others know they do not need to backport this.
>
>>
>> [1] https://lore.kernel.org/all/YmT%2F%[email protected]/
>> Signed-off-by: Baolin Wang <[email protected]>
>> ---
>> mm/rmap.c | 40 ++++++++++++++++++++++------------------
>> 1 file changed, 22 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 61e63db..4f0d115 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1535,15 +1535,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> * do this outside rmap routines.
>> */
>> VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>> + /*
>> + * huge_pmd_unshare may unmap an entire PMD page.
>> + * There is no way of knowing exactly which PMDs may
>> + * be cached for this mm, so we must flush them all.
>> + * start/end were already adjusted above to cover this
>> + * range.
>> + */
>> + flush_cache_range(vma, range.start, range.end);
>> +
>
> flush_cache_range() is always called even if we do not need to flush.

Right, this is intended. In the original code, if it is not a shared
PMD, we use flush_cache_page() to do the cache flushing. However,
flush_cache_page() cannot cover the whole size of a hugetlb page on
some architectures, which is fixed by patch 3.

> How about introducing a new helper like hugetlb_pmd_shared() which
> returns true for shared PMD? Then:
>
> if (hugetlb_pmd_shared(mm, vma, pvmw.pte)) {
>         flush_cache_range(vma, range.start, range.end);
>         huge_pmd_unshare(mm, vma, &address, pvmw.pte);
>         flush_tlb_range(vma, range.start, range.end);
> }
>
> The code could be a little simpler. Right?

IMHO, after patch 3 the code will be changed as below, so there seems
to be no need to separate the validation of the shared PMDs from
huge_pmd_unshare() into a new function.

if (folio_test_hugetlb(folio)) {
        flush_cache_range(vma, range.start, range.end);

        if (!folio_test_anon(folio)) {
                VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));

                if (hugetlb_pmd_shared(mm, vma, pvmw.pte)) {
                        huge_pmd_unshare(mm, vma, &address, pvmw.pte);
                        flush_tlb_range(vma, range.start, range.end);
                        ......
                        break;
                }
        }
} else {
        ......
}
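
For reference, a minimal sketch of what the hugetlb_pmd_shared() helper
discussed above might look like; this is hypothetical (no such helper exists
in this series, and the arguments simply follow the signature used in the
discussion, so mm is unused here). It mirrors the PMD-page refcount check
that huge_pmd_unshare() performs internally:

        /*
         * Hypothetical helper: a hugetlb PMD table page gets an extra
         * reference from huge_pmd_share(), so a refcount above one means
         * the PMD is shared. Only meaningful while i_mmap_rwsem is held.
         */
        static bool hugetlb_pmd_shared(struct mm_struct *mm,
                                       struct vm_area_struct *vma, pte_t *ptep)
        {
                i_mmap_assert_write_locked(vma->vm_file->f_mapping);

                return page_count(virt_to_page(ptep)) > 1;
        }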

2022-04-29 06:57:15

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mm: rmap: Move the cache flushing to the correct place for hugetlb PMD sharing

On Wed, Apr 27, 2022 at 06:52:06PM +0800, Baolin Wang wrote:
> The cache level flush will always be first when changing an existing
> virtual->physical mapping to a new value, since this allows us to
> properly handle systems whose caches are strict and require a
> virtual->physical translation to exist for a virtual address. So we
> should move the cache flushing before huge_pmd_unshare().
>

Right.

> As Muchun pointed out [1], the architectures that support hugetlb PMD
> sharing currently have no cache flush issues in practice. But I think
> we should still follow the cache/TLB flushing rules when changing a
> valid virtual address mapping, to guard against potential issues in
> the future.

Right. One point I need to clarify. I do not object to this change but
want you to clarify this (not an issue in practice) in the commit log
to let others know they do not need to backport this.

>
> [1] https://lore.kernel.org/all/YmT%2F%[email protected]/
> Signed-off-by: Baolin Wang <[email protected]>
> ---
> mm/rmap.c | 40 ++++++++++++++++++++++------------------
> 1 file changed, 22 insertions(+), 18 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 61e63db..4f0d115 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1535,15 +1535,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> * do this outside rmap routines.
> */
> VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> + /*
> + * huge_pmd_unshare may unmap an entire PMD page.
> + * There is no way of knowing exactly which PMDs may
> + * be cached for this mm, so we must flush them all.
> + * start/end were already adjusted above to cover this
> + * range.
> + */
> + flush_cache_range(vma, range.start, range.end);
> +

flush_cache_range() is always called even if we do not need to flush.
How about introducing a new helper like hugetlb_pmd_shared() which
returns true for shared PMD? Then:

if (hugetlb_pmd_shared(mm, vma, pvmw.pte)) {
        flush_cache_range(vma, range.start, range.end);
        huge_pmd_unshare(mm, vma, &address, pvmw.pte);
        flush_tlb_range(vma, range.start, range.end);
}

The code could be a little simpler. Right?

Thanks.

2022-05-03 22:51:51

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mm: rmap: Move the cache flushing to the correct place for hugetlb PMD sharing

On 4/27/22 22:55, Muchun Song wrote:
> On Wed, Apr 27, 2022 at 06:52:06PM +0800, Baolin Wang wrote:
>> The cache level flush will always be first when changing an existing
>> virtual->physical mapping to a new value, since this allows us to
>> properly handle systems whose caches are strict and require a
>> virtual->physical translation to exist for a virtual address. So we
>> should move the cache flushing before huge_pmd_unshare().
>>
>
> Right.
>
>> As Muchun pointed out [1], the architectures that support hugetlb PMD
>> sharing currently have no cache flush issues in practice. But I think
>> we should still follow the cache/TLB flushing rules when changing a
>> valid virtual address mapping, to guard against potential issues in
>> the future.
>
> Right. One point I need to clarify. I do not object to this change but
> want you to clarify this (not an issue in practice) in the commit log
> to let others know they do not need to backport this.
>
>>
>> [1] https://lore.kernel.org/all/YmT%2F%[email protected]/
>> Signed-off-by: Baolin Wang <[email protected]>
>> ---
>> mm/rmap.c | 40 ++++++++++++++++++++++------------------
>> 1 file changed, 22 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 61e63db..4f0d115 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1535,15 +1535,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> * do this outside rmap routines.
>> */
>> VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>> + /*
>> + * huge_pmd_unshare may unmap an entire PMD page.
>> + * There is no way of knowing exactly which PMDs may
>> + * be cached for this mm, so we must flush them all.
>> + * start/end were already adjusted above to cover this
>> + * range.
>> + */
>> + flush_cache_range(vma, range.start, range.end);
>> +
>
> flush_cache_range() is always called even if we do not need to flush.
> How about introducing a new helper like hugetlb_pmd_shared() which
> returns true for shared PMD? Then:
>
> if (hugetlb_pmd_shared(mm, vma, pvmw.pte)) {
>         flush_cache_range(vma, range.start, range.end);
>         huge_pmd_unshare(mm, vma, &address, pvmw.pte);
>         flush_tlb_range(vma, range.start, range.end);
> }
>
> The code could be a little simpler. Right?
>
> Thanks.
>

I thought about adding a 'hugetlb_pmd_shared()' interface for another use.
I believe it could even be used earlier in this call sequence. Since we
hold i_mmap_rwsem, we could even test for shared BEFORE calling
adjust_range_if_pmd_sharing_possible. We cannot make an authoritative test
in adjust range... because not all callers will be holding i_mmap_rwsem.

I think we COULD optimize to minimize the flush range. However, I think
that would complicate this code even more, and it is difficult enough to
follow.

My preference would be to over-flush as is done here for correctness and
simplification. We can optimize later if desired.

With Muchun's comment that this is not an issue in practice today,
Reviewed-by: Mike Kravetz <[email protected]>
--
Mike Kravetz

2022-05-04 07:36:33

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] mm: rmap: Use flush_cache_range() to flush cache for hugetlb pages

On 4/27/22 03:52, Baolin Wang wrote:
> Currently we use flush_cache_page() to flush the cache for anonymous
> hugetlb pages when unmapping or migrating a hugetlb page mapping, but
> flush_cache_page() only handles a PAGE_SIZE range on some architectures
> (such as arm32 and arc), which can cause potential cache issues. Thus
> change to flush_cache_range() to cover the whole size of a hugetlb page.
>
> Signed-off-by: Baolin Wang <[email protected]>
> ---
> mm/rmap.c | 90 ++++++++++++++++++++++++++++++++++-----------------------------
> 1 file changed, 48 insertions(+), 42 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4f0d115..6fdd198 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1528,13 +1528,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> anon_exclusive = folio_test_anon(folio) &&
> PageAnonExclusive(subpage);
>
> - if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) {
> - /*
> - * To call huge_pmd_unshare, i_mmap_rwsem must be
> - * held in write mode. Caller needs to explicitly
> - * do this outside rmap routines.
> - */
> - VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> + if (folio_test_hugetlb(folio)) {
> /*
> * huge_pmd_unshare may unmap an entire PMD page.
> * There is no way of knowing exactly which PMDs may
> @@ -1544,22 +1538,31 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> */
> flush_cache_range(vma, range.start, range.end);
>
> - if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
> - flush_tlb_range(vma, range.start, range.end);
> - mmu_notifier_invalidate_range(mm, range.start,
> - range.end);
> -
> + if (!folio_test_anon(folio)) {
> /*
> - * The ref count of the PMD page was dropped
> - * which is part of the way map counting
> - * is done for shared PMDs. Return 'true'
> - * here. When there is no other sharing,
> - * huge_pmd_unshare returns false and we will
> - * unmap the actual page and drop map count
> - * to zero.
> + * To call huge_pmd_unshare, i_mmap_rwsem must be
> + * held in write mode. Caller needs to explicitly
> + * do this outside rmap routines.
> */
> - page_vma_mapped_walk_done(&pvmw);
> - break;
> + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> +
> + if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
> + flush_tlb_range(vma, range.start, range.end);
> + mmu_notifier_invalidate_range(mm, range.start,
> + range.end);
> +
> + /*
> + * The ref count of the PMD page was dropped
> + * which is part of the way map counting
> + * is done for shared PMDs. Return 'true'
> + * here. When there is no other sharing,
> + * huge_pmd_unshare returns false and we will
> + * unmap the actual page and drop map count
> + * to zero.
> + */
> + page_vma_mapped_walk_done(&pvmw);
> + break;
> + }
> }
> } else {
> flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> @@ -1885,13 +1888,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> anon_exclusive = folio_test_anon(folio) &&
> PageAnonExclusive(subpage);
>
> - if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) {
> - /*
> - * To call huge_pmd_unshare, i_mmap_rwsem must be
> - * held in write mode. Caller needs to explicitly
> - * do this outside rmap routines.
> - */
> - VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> + if (folio_test_hugetlb(folio)) {
> /*
> * huge_pmd_unshare may unmap an entire PMD page.
> * There is no way of knowing exactly which PMDs may
> @@ -1901,22 +1898,31 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> */
> flush_cache_range(vma, range.start, range.end);
>
> - if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
> - flush_tlb_range(vma, range.start, range.end);
> - mmu_notifier_invalidate_range(mm, range.start,
> - range.end);
> -
> + if (!folio_test_anon(folio)) {
> /*
> - * The ref count of the PMD page was dropped
> - * which is part of the way map counting
> - * is done for shared PMDs. Return 'true'
> - * here. When there is no other sharing,
> - * huge_pmd_unshare returns false and we will
> - * unmap the actual page and drop map count
> - * to zero.
> + * To call huge_pmd_unshare, i_mmap_rwsem must be
> + * held in write mode. Caller needs to explicitly
> + * do this outside rmap routines.
> */
> - page_vma_mapped_walk_done(&pvmw);
> - break;
> + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> +
> + if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
> + flush_tlb_range(vma, range.start, range.end);
> + mmu_notifier_invalidate_range(mm, range.start,
> + range.end);
> +
> + /*
> + * The ref count of the PMD page was dropped
> + * which is part of the way map counting
> + * is done for shared PMDs. Return 'true'
> + * here. When there is no other sharing,
> + * huge_pmd_unshare returns false and we will
> + * unmap the actual page and drop map count
> + * to zero.
> + */
> + page_vma_mapped_walk_done(&pvmw);
> + break;
> + }
> }
> } else {
> flush_cache_page(vma, address, pte_pfn(*pvmw.pte));

Thanks,
The code looks fine. It is unfortunate that we need so many levels of
indenting and exceed 80 columns. But, that is OK.

Reviewed-by: Mike Kravetz <[email protected]>

I see you have a followup series to address the call to ptep_clear_flush()
for hugetlb pages not unmapped via huge_pmd_share and will take a look at
that soon.
--
Mike Kravetz

2022-05-04 12:03:13

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mm: rmap: Move the cache flushing to the correct place for hugetlb PMD sharing



On 5/4/2022 2:42 AM, Mike Kravetz wrote:
> On 4/27/22 22:55, Muchun Song wrote:
>> On Wed, Apr 27, 2022 at 06:52:06PM +0800, Baolin Wang wrote:
>>> The cache level flush will always be first when changing an existing
>>> virtual->physical mapping to a new value, since this allows us to
>>> properly handle systems whose caches are strict and require a
>>> virtual->physical translation to exist for a virtual address. So we
>>> should move the cache flushing before huge_pmd_unshare().
>>>
>>
>> Right.
>>
>>> As Muchun pointed out [1], the architectures that support hugetlb PMD
>>> sharing currently have no cache flush issues in practice. But I think
>>> we should still follow the cache/TLB flushing rules when changing a
>>> valid virtual address mapping, to guard against potential issues in
>>> the future.
>>
>> Right. One point I need to clarify. I do not object to this change but
>> want you to clarify this (not an issue in practice) in the commit log
>> to let others know they do not need to backport this.
>>
>>>
>>> [1] https://lore.kernel.org/all/YmT%2F%[email protected]/
>>> Signed-off-by: Baolin Wang <[email protected]>
>>> ---
>>> mm/rmap.c | 40 ++++++++++++++++++++++------------------
>>> 1 file changed, 22 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 61e63db..4f0d115 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1535,15 +1535,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>> * do this outside rmap routines.
>>> */
>>> VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>>> + /*
>>> + * huge_pmd_unshare may unmap an entire PMD page.
>>> + * There is no way of knowing exactly which PMDs may
>>> + * be cached for this mm, so we must flush them all.
>>> + * start/end were already adjusted above to cover this
>>> + * range.
>>> + */
>>> + flush_cache_range(vma, range.start, range.end);
>>> +
>>
>> flush_cache_range() is always called even if we do not need to flush.
>> How about introducing a new helper like hugetlb_pmd_shared() which
>> returns true for shared PMD? Then:
>>
>> if (hugetlb_pmd_shared(mm, vma, pvmw.pte)) {
>>         flush_cache_range(vma, range.start, range.end);
>>         huge_pmd_unshare(mm, vma, &address, pvmw.pte);
>>         flush_tlb_range(vma, range.start, range.end);
>> }
>>
>> The code could be a little simpler. Right?
>>
>> Thanks.
>>
>
> I thought about adding a 'hugetlb_pmd_shared()' interface for another use.
> I believe it could even be used earlier in this call sequence. Since we
> hold i_mmap_rwsem, we could even test for shared BEFORE calling
> adjust_range_if_pmd_sharing_possible. We cannot make an authoritative test
> in adjust range... because not all callers will be holding i_mmap_rwsem.
>
> I think we COULD optimize to minimize the flush range. However, I think
> that would complicate this code even more, and it is difficult enough to
> follow.
>
> My preference would be to over-flush as is done here for correctness and
> simplification. We can optimize later if desired.

OK. Agree.

>
> With Muchun's comment that this is not an issue in practice today,
> Reviewed-by: Mike Kravetz <[email protected]>

Thanks.

2022-05-04 17:22:36

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] mm: rmap: Use flush_cache_range() to flush cache for hugetlb pages



On 5/4/2022 4:17 AM, Mike Kravetz wrote:
> On 4/27/22 03:52, Baolin Wang wrote:
>> Currently we use flush_cache_page() to flush the cache for anonymous
>> hugetlb pages when unmapping or migrating a hugetlb page mapping, but
>> flush_cache_page() only handles a PAGE_SIZE range on some architectures
>> (such as arm32 and arc), which can cause potential cache issues. Thus
>> change to flush_cache_range() to cover the whole size of a hugetlb page.
>>
>> Signed-off-by: Baolin Wang <[email protected]>
>> ---
>> mm/rmap.c | 90 ++++++++++++++++++++++++++++++++++-----------------------------
>> 1 file changed, 48 insertions(+), 42 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 4f0d115..6fdd198 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1528,13 +1528,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> anon_exclusive = folio_test_anon(folio) &&
>> PageAnonExclusive(subpage);
>>
>> - if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) {
>> - /*
>> - * To call huge_pmd_unshare, i_mmap_rwsem must be
>> - * held in write mode. Caller needs to explicitly
>> - * do this outside rmap routines.
>> - */
>> - VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>> + if (folio_test_hugetlb(folio)) {
>> /*
>> * huge_pmd_unshare may unmap an entire PMD page.
>> * There is no way of knowing exactly which PMDs may
>> @@ -1544,22 +1538,31 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> */
>> flush_cache_range(vma, range.start, range.end);
>>
>> - if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
>> - flush_tlb_range(vma, range.start, range.end);
>> - mmu_notifier_invalidate_range(mm, range.start,
>> - range.end);
>> -
>> + if (!folio_test_anon(folio)) {
>> /*
>> - * The ref count of the PMD page was dropped
>> - * which is part of the way map counting
>> - * is done for shared PMDs. Return 'true'
>> - * here. When there is no other sharing,
>> - * huge_pmd_unshare returns false and we will
>> - * unmap the actual page and drop map count
>> - * to zero.
>> + * To call huge_pmd_unshare, i_mmap_rwsem must be
>> + * held in write mode. Caller needs to explicitly
>> + * do this outside rmap routines.
>> */
>> - page_vma_mapped_walk_done(&pvmw);
>> - break;
>> + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>> +
>> + if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
>> + flush_tlb_range(vma, range.start, range.end);
>> + mmu_notifier_invalidate_range(mm, range.start,
>> + range.end);
>> +
>> + /*
>> + * The ref count of the PMD page was dropped
>> + * which is part of the way map counting
>> + * is done for shared PMDs. Return 'true'
>> + * here. When there is no other sharing,
>> + * huge_pmd_unshare returns false and we will
>> + * unmap the actual page and drop map count
>> + * to zero.
>> + */
>> + page_vma_mapped_walk_done(&pvmw);
>> + break;
>> + }
>> }
>> } else {
>> flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
>> @@ -1885,13 +1888,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>> anon_exclusive = folio_test_anon(folio) &&
>> PageAnonExclusive(subpage);
>>
>> - if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) {
>> - /*
>> - * To call huge_pmd_unshare, i_mmap_rwsem must be
>> - * held in write mode. Caller needs to explicitly
>> - * do this outside rmap routines.
>> - */
>> - VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>> + if (folio_test_hugetlb(folio)) {
>> /*
>> * huge_pmd_unshare may unmap an entire PMD page.
>> * There is no way of knowing exactly which PMDs may
>> @@ -1901,22 +1898,31 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>> */
>> flush_cache_range(vma, range.start, range.end);
>>
>> - if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
>> - flush_tlb_range(vma, range.start, range.end);
>> - mmu_notifier_invalidate_range(mm, range.start,
>> - range.end);
>> -
>> + if (!folio_test_anon(folio)) {
>> /*
>> - * The ref count of the PMD page was dropped
>> - * which is part of the way map counting
>> - * is done for shared PMDs. Return 'true'
>> - * here. When there is no other sharing,
>> - * huge_pmd_unshare returns false and we will
>> - * unmap the actual page and drop map count
>> - * to zero.
>> + * To call huge_pmd_unshare, i_mmap_rwsem must be
>> + * held in write mode. Caller needs to explicitly
>> + * do this outside rmap routines.
>> */
>> - page_vma_mapped_walk_done(&pvmw);
>> - break;
>> + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>> +
>> + if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
>> + flush_tlb_range(vma, range.start, range.end);
>> + mmu_notifier_invalidate_range(mm, range.start,
>> + range.end);
>> +
>> + /*
>> + * The ref count of the PMD page was dropped
>> + * which is part of the way map counting
>> + * is done for shared PMDs. Return 'true'
>> + * here. When there is no other sharing,
>> + * huge_pmd_unshare returns false and we will
>> + * unmap the actual page and drop map count
>> + * to zero.
>> + */
>> + page_vma_mapped_walk_done(&pvmw);
>> + break;
>> + }
>> }
>> } else {
>> flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
>
> Thanks,
> The code looks fine. It is unfortunate that we need so many levels of
> indenting and exceed 80 columns. But, that is OK.

I'll do a cleanup to make it more readable by factoring the hugetlb
case into a new function after the fix series [1].

[1]
https://lore.kernel.org/linux-arm-kernel/20220503120343.6264e126@thinkpad/T/

>
> Reviewed-by: Mike Kravetz <[email protected]>
>
> I see you have a followup series to address the call to ptep_clear_flush()
> for hugetlb pages not unmapped via huge_pmd_share and will take a look at
> that soon.

Thanks.