2021-10-21 19:57:25

by Nadav Amit

[permalink] [raw]
Subject: [PATCH] mm: use correct VMA flags when freeing page-tables

From: Nadav Amit <[email protected]>

Consistent use of the mmu_gather interface requires a call to
tlb_start_vma() and tlb_end_vma() for each VMA. free_pgtables() does not
follow this pattern.

Certain architectures need tlb_start_vma() to be called in order for
tlb_update_vma_flags() to update the VMA flags (tlb->vma_exec and
tlb->vma_huge), which are later used for the proper TLB flush to be
issued. Since tlb_start_vma() is not called, this can lead to the wrong
VMA flags being used when the flush is performed.

Specifically, the munmap syscall would call unmap_region(), which unmaps
the VMAs and then frees the page-tables. A flush is needed after
the page-tables are removed to prevent page-walk caches from holding
stale entries, but this flush would use the flags of the VMA flags of
the last VMA that was flushed. This does not appear to be right.

Use tlb_start_vma() and tlb_end_vma() to prevent this from happening.
This might lead to unnecessary calls to flush_cache_range() on certain
arch's. If needed, a new flag can be added to mmu_gather to indicate
that the flush is not needed.

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
mm/memory.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 12a7b2094434..056fbfdd3c1f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -412,6 +412,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
unlink_anon_vmas(vma);
unlink_file_vma(vma);

+ tlb_start_vma(tlb, vma);
+
if (is_vm_hugetlb_page(vma)) {
hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
floor, next ? next->vm_start : ceiling);
@@ -429,6 +431,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
free_pgd_range(tlb, addr, vma->vm_end,
floor, next ? next->vm_start : ceiling);
}
+
+ tlb_end_vma(tlb, vma);
vma = next;
}
}
--
2.25.1


2021-10-22 03:08:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm: use correct VMA flags when freeing page-tables

On Thu, 21 Oct 2021 05:23:22 -0700 Nadav Amit <[email protected]> wrote:

> From: Nadav Amit <[email protected]>
>
> Consistent use of the mmu_gather interface requires a call to
> tlb_start_vma() and tlb_end_vma() for each VMA. free_pgtables() does not
> follow this pattern.
>
> Certain architectures need tlb_start_vma() to be called in order for
> tlb_update_vma_flags() to update the VMA flags (tlb->vma_exec and
> tlb->vma_huge), which are later used for the proper TLB flush to be
> issued. Since tlb_start_vma() is not called, this can lead to the wrong
> VMA flags being used when the flush is performed.
>
> Specifically, the munmap syscall would call unmap_region(), which unmaps
> the VMAs and then frees the page-tables. A flush is needed after
> the page-tables are removed to prevent page-walk caches from holding
> stale entries, but this flush would use the flags of the VMA flags of
> the last VMA that was flushed. This does not appear to be right.

Any thoughts on what the worst-case end-user cisible effects of this
would be?

Again, I'm wondering about the desirability of a -stable backport.

> Use tlb_start_vma() and tlb_end_vma() to prevent this from happening.
> This might lead to unnecessary calls to flush_cache_range() on certain
> arch's. If needed, a new flag can be added to mmu_gather to indicate
> that the flush is not needed.

2021-10-22 15:39:38

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH] mm: use correct VMA flags when freeing page-tables



> On Oct 21, 2021, at 8:06 PM, Andrew Morton <[email protected]> wrote:
>
> On Thu, 21 Oct 2021 05:23:22 -0700 Nadav Amit <[email protected]> wrote:
>
>> From: Nadav Amit <[email protected]>
>>
>> Consistent use of the mmu_gather interface requires a call to
>> tlb_start_vma() and tlb_end_vma() for each VMA. free_pgtables() does not
>> follow this pattern.
>>
>> Certain architectures need tlb_start_vma() to be called in order for
>> tlb_update_vma_flags() to update the VMA flags (tlb->vma_exec and
>> tlb->vma_huge), which are later used for the proper TLB flush to be
>> issued. Since tlb_start_vma() is not called, this can lead to the wrong
>> VMA flags being used when the flush is performed.
>>
>> Specifically, the munmap syscall would call unmap_region(), which unmaps
>> the VMAs and then frees the page-tables. A flush is needed after
>> the page-tables are removed to prevent page-walk caches from holding
>> stale entries, but this flush would use the flags of the VMA flags of
>> the last VMA that was flushed. This does not appear to be right.
>
> Any thoughts on what the worst-case end-user cisible effects of this
> would be?
>
> Again, I'm wondering about the desirability of a -stable backport.

This issue is not relevant for x86, which I am most familiar with, hence my
limited understanding of the impact on specific architectures.

In general, a TLB flush is needed after the page-tables are released
(in contrast to PTE removal) to prevent a speculative page-walk that might
access IO pages or install bogus entries. Such speculative page-walks have
been reported (on x86) as causing a machine-check. [1]

If a certain architecture has different page-walk caches for executable and
non-executable pages (i.e., different page-walk caches for iTLB and dTLB) or
for different page sizes, it might not perform the necessary TLB flush on the
proper TLB.

Looking at the code, we can see MIPS’s flow of:

tlb_flush()
->flush_tlb_range()
->local_flush_tlb_page()
->flush_micro_tlb_vm()

which calls flush_micro_tlb() only if (vma->vm_flags & VM_EXEC). So
MIPS cares about the VM_EXEC. Yet, it is not certain MIPS might be
affected.

For an architecture to be affected it needs to have all the following
properties:

1. Be able to experience hardware failure/exceptions on speculative
page-walks.

2. Have separate page-walk caches for exec/data or huge/small pages.

3. Flush the TLBs based on VM_EXEC and VM_HUGETLB.

From the code, MIPS has the 3rd property and presumably the 2nd, but it is
not certain it has the 1st.

I did not mark it as stable, since I am not sure such an architecture
exists.



[1] https://lore.kernel.org/lkml/[email protected]/

2021-11-01 07:32:44

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm: use correct VMA flags when freeing page-tables

On Thu, 21 Oct 2021, Nadav Amit wrote:

> From: Nadav Amit <[email protected]>
>
> Consistent use of the mmu_gather interface requires a call to
> tlb_start_vma() and tlb_end_vma() for each VMA. free_pgtables() does not
> follow this pattern.
>
> Certain architectures need tlb_start_vma() to be called in order for
> tlb_update_vma_flags() to update the VMA flags (tlb->vma_exec and
> tlb->vma_huge), which are later used for the proper TLB flush to be
> issued. Since tlb_start_vma() is not called, this can lead to the wrong
> VMA flags being used when the flush is performed.
>
> Specifically, the munmap syscall would call unmap_region(), which unmaps
> the VMAs and then frees the page-tables. A flush is needed after
> the page-tables are removed to prevent page-walk caches from holding
> stale entries, but this flush would use the flags of the VMA flags of
> the last VMA that was flushed. This does not appear to be right.
>
> Use tlb_start_vma() and tlb_end_vma() to prevent this from happening.
> This might lead to unnecessary calls to flush_cache_range() on certain
> arch's. If needed, a new flag can be added to mmu_gather to indicate
> that the flush is not needed.
>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Andrew Cooper <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Yu Zhao <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Cc: [email protected]
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> mm/memory.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 12a7b2094434..056fbfdd3c1f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -412,6 +412,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
> unlink_anon_vmas(vma);
> unlink_file_vma(vma);
>
> + tlb_start_vma(tlb, vma);
> +
> if (is_vm_hugetlb_page(vma)) {
> hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
> floor, next ? next->vm_start : ceiling);
> @@ -429,6 +431,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
> free_pgd_range(tlb, addr, vma->vm_end,
> floor, next ? next->vm_start : ceiling);
> }
> +
> + tlb_end_vma(tlb, vma);
> vma = next;
> }
> }
> --
> 2.25.1

No.

This is an experiment to see whether reviewers look at a wider context
than is seen in the patch itself? Let's take a look:

tlb_start_vma(tlb, vma);

if (is_vm_hugetlb_page(vma)) {
hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
floor, next ? next->vm_start : ceiling);
} else {
/*
* Optimization: gather nearby vmas into one call down
*/
while (next && next->vm_start <= vma->vm_end + PMD_SIZE
&& !is_vm_hugetlb_page(next)) {
vma = next;
next = vma->vm_next;
unlink_anon_vmas(vma);
unlink_file_vma(vma);
}
free_pgd_range(tlb, addr, vma->vm_end,
floor, next ? next->vm_start : ceiling);
}

tlb_end_vma(tlb, vma);
vma = next;

So, the vma may well have changed in between the new tlb_start_vma()
and tlb_end_vma(): which defeats the intent of the patch.

And I doubt that optimization should be dropped to suit the patch:
you admit to limited understanding of those architectures which need
tlb_start_vma(), me too; but they seem to have survived many years
without it there in free_pgtables(), and I think that tlb_start_vma()
is for when freeing pages, not for when freeing page tables. Surely
all architectures have to accommodate the fact that a single page
table can be occupied by many different kinds of vma.

(Sorry, I'm being totally unresponsive at present, needing to focus
on something else; but thought I'd better get these comments in before
mmotm's mm-use-correct-vma-flags-when-freeing-page-tables.patch goes to 5.16.)

Hugh

2021-11-01 15:48:05

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH] mm: use correct VMA flags when freeing page-tables



> On Nov 1, 2021, at 12:28 AM, Hugh Dickins <[email protected]> wrote:
>
> On Thu, 21 Oct 2021, Nadav Amit wrote:
>
>> From: Nadav Amit <[email protected]>
>>
>> Consistent use of the mmu_gather interface requires a call to
>> tlb_start_vma() and tlb_end_vma() for each VMA. free_pgtables() does not
>> follow this pattern.
>>
>> Certain architectures need tlb_start_vma() to be called in order for
>> tlb_update_vma_flags() to update the VMA flags (tlb->vma_exec and
>> tlb->vma_huge), which are later used for the proper TLB flush to be
>> issued. Since tlb_start_vma() is not called, this can lead to the wrong
>> VMA flags being used when the flush is performed.
>>
>> Specifically, the munmap syscall would call unmap_region(), which unmaps
>> the VMAs and then frees the page-tables. A flush is needed after
>> the page-tables are removed to prevent page-walk caches from holding
>> stale entries, but this flush would use the flags of the VMA flags of
>> the last VMA that was flushed. This does not appear to be right.
>>
>> Use tlb_start_vma() and tlb_end_vma() to prevent this from happening.
>> This might lead to unnecessary calls to flush_cache_range() on certain
>> arch's. If needed, a new flag can be added to mmu_gather to indicate
>> that the flush is not needed.
>>
>> Cc: Andrea Arcangeli <[email protected]>
>> Cc: Andrew Cooper <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Will Deacon <[email protected]>
>> Cc: Yu Zhao <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>> Cc: [email protected]
>> Signed-off-by: Nadav Amit <[email protected]>
>> ---
>> mm/memory.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 12a7b2094434..056fbfdd3c1f 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -412,6 +412,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> unlink_anon_vmas(vma);
>> unlink_file_vma(vma);
>>
>> + tlb_start_vma(tlb, vma);
>> +
>> if (is_vm_hugetlb_page(vma)) {
>> hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
>> floor, next ? next->vm_start : ceiling);
>> @@ -429,6 +431,8 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> free_pgd_range(tlb, addr, vma->vm_end,
>> floor, next ? next->vm_start : ceiling);
>> }
>> +
>> + tlb_end_vma(tlb, vma);
>> vma = next;
>> }
>> }
>> --
>> 2.25.1
>
> No.
>
> This is an experiment to see whether reviewers look at a wider context
> than is seen in the patch itself? Let's take a look:
>
> tlb_start_vma(tlb, vma);
>
> if (is_vm_hugetlb_page(vma)) {
> hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
> floor, next ? next->vm_start : ceiling);
> } else {
> /*
> * Optimization: gather nearby vmas into one call down
> */
> while (next && next->vm_start <= vma->vm_end + PMD_SIZE
> && !is_vm_hugetlb_page(next)) {
> vma = next;
> next = vma->vm_next;
> unlink_anon_vmas(vma);
> unlink_file_vma(vma);
> }
> free_pgd_range(tlb, addr, vma->vm_end,
> floor, next ? next->vm_start : ceiling);
> }
>
> tlb_end_vma(tlb, vma);
> vma = next;
>
> So, the vma may well have changed in between the new tlb_start_vma()
> and tlb_end_vma(): which defeats the intent of the patch.

Indeed, I made a an embarrassing bug. Having said that, I do not
understand the experiment and whether I conducted it or was the
object of it.

>
> And I doubt that optimization should be dropped to suit the patch:
> you admit to limited understanding of those architectures which need
> tlb_start_vma(), me too; but they seem to have survived many years
> without it there in free_pgtables(), and I think that tlb_start_vma()
> is for when freeing pages, not for when freeing page tables. Surely
> all architectures have to accommodate the fact that a single page
> table can be occupied by many different kinds of vma.

When it comes to TLB flushing, the assumption that if something is
in the code for many years it must be working is questionable, and
I have already encountered (and fixed) many such cases before.

Freeing page-tables, as I mentioned before, is needed for the
invalidation the page-walk caches after the page-tables are dropped,
if a speculative page-walk takes place. I can post v2 and fix my
embarrassing bug. I am not going to force this patch - I just
encountered the issue as I was modifying a different piece of code
and the behavior seems very inconsistent.