2022-10-14 18:35:58

by Jann Horn

Subject: [BUG?] X86 arch_tlbbatch_flush() seems to be lacking mm_tlb_flush_nested() integration

Hi!

I haven't actually managed to reproduce this behavior, so maybe I'm
just misunderstanding how this works; but I think the
arch_tlbbatch_flush() path for batched TLB flushing in vmscan ought to
have some kind of integration with mm_tlb_flush_nested().

I think that currently, the following race could happen:

[initial situation: page P is mapped into a page table of task B, but
the page is not referenced, the PTE's A/D bits are clear]
A: vmscan begins
A: vmscan looks at P and P's PTEs, and concludes that P is not currently in use
B: reads from P through the PTE, setting the Accessed bit and creating
a TLB entry
A: vmscan enters try_to_unmap_one()
A: try_to_unmap_one() calls should_defer_flush(), which returns true
A: try_to_unmap_one() removes the PTE and queues a TLB flush
(arch_tlbbatch_add_mm())
A: try_to_unmap_one() returns, try_to_unmap() returns to shrink_folio_list()
B: calls munmap() on the VMA that mapped P
B: no PTEs are removed, so no TLB flush happens
B: munmap() returns
[at this point, the TLB entry still exists]
B: calls mmap(), which reuses the same area that was just unmapped
B: tries to access the newly created VMA, but instead the access goes
through the stale TLB entry
A: shrink_folio_list() calls try_to_unmap_flush(), which removes the
stale TLB entry

The effect would be that after process B removes a mapping with
munmap() and creates a new mapping in its place, it would still see
data from the old mapping when trying to access the new mapping.

Am I missing something that protects against this scenario?

munmap() uses the mmu_gather infrastructure, which tries to protect
against this kind of correctness bug with multiple racing TLB
invalidations in tlb_finish_mmu() by blowing away the whole TLB
whenever one TLB invalidation ends while another is still in progress
(tested with mm_tlb_flush_nested(tlb->mm)). But mmu_gather doesn't
seem to be aware of TLB flushes that are batched up in the
arch_tlbbatch_flush() infrastructure, so that doesn't help here.
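
For reference, that safety net looks roughly like this (a simplified
sketch; the exact code differs between kernel versions):

/* include/linux/mm_types.h, simplified: true if more than one
 * mmu_gather operation is in flight on this mm. */
static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
{
        return atomic_read(&mm->tlb_flush_pending) > 1;
}

/* mm/mmu_gather.c, simplified */
void tlb_finish_mmu(struct mmu_gather *tlb)
{
        if (mm_tlb_flush_nested(tlb->mm)) {
                /* Another invalidation on this mm is still in progress:
                 * give up on range tracking and flush everything. */
                tlb->fullmm = 1;
                __tlb_reset_range(tlb);
                tlb->freed_tables = 1;
        }
        tlb_flush_mmu(tlb);
        dec_tlb_flush_pending(tlb->mm);
}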

I think it might be necessary to add a new global counter of pending
arch_tlbbatch_flush() flushes, and query that in
mm_tlb_flush_nested(), or something like that.
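
Purely as an illustration of the kind of thing I mean (a hypothetical
sketch, not existing kernel code; the counter and the places where it
would be updated are made up):

/* Hypothetical: bumped in set_tlb_ubc_flush_pending(), dropped again
 * once the corresponding arch_tlbbatch_flush() has completed. */
static atomic_t tlbbatch_pending_flushes = ATOMIC_INIT(0);

static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
{
        /* Also treat a pending reclaim batch flush as "nested". */
        return atomic_read(&mm->tlb_flush_pending) > 1 ||
               atomic_read(&tlbbatch_pending_flushes) > 0;
}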


2022-10-14 19:15:03

by Linus Torvalds

Subject: Re: [BUG?] X86 arch_tlbbatch_flush() seems to be lacking mm_tlb_flush_nested() integration

On Fri, Oct 14, 2022 at 11:20 AM Jann Horn <[email protected]> wrote:
>
> The effect would be that after process B removes a mapping with
> munmap() and creates a new mapping in its place, it would still see
> data from the old mapping when trying to access the new mapping.
>
> Am I missing something that protects against this scenario?

While I don't think that scenario is something you can trigger in
practice without enormously bad luck, I don't see anything that would
protect against it.

Afaik, the whole vmscan thing never takes the mmap lock, only the file
lock (to protect mapping->i_mmap) or the anon_vma lock (to protect
anon_vma->rb_root).

And none of that ends up serializing with a new mmap() that doesn't
even install a new page in the page tables (and just gets an old TLB
entry). There are zero shared data structures outside of the mm
itself.

Now, munmap() *could* serialize with it, because at least munmap has
access to the data structures and their locks. But it doesn't do any
deferred flushes that I can see, so while it's serialized, it doesn't
help.

And it wouldn't help to do try_to_unmap_flush() from munmap either,
since the deferred flushing is per-thread, and the munmap is done from
a different thread.
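
For reference, the deferred state hangs off the reclaiming task itself,
roughly like this (simplified from mm/rmap.c, details vary by version):

/* Each task carries its own batch in task_struct (current->tlb_ubc). */
void try_to_unmap_flush(void)
{
        struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

        if (!tlb_ubc->flush_required)
                return;

        /* Only flushes what *this* task queued up. */
        arch_tlbbatch_flush(&tlb_ubc->arch);
        tlb_ubc->flush_required = false;
        tlb_ubc->writable = false;
}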

So if you're missing something, then I am too.

All this flushing is very careful to flush before actually releasing
the page - not flushing there is our really traditional TLB flush bug. But
yeah, that's not the only race - we should flush before replacing the
mapping too.

Mel? I think the batched flushing goes back to you many many years
ago. I hope Jann and I are just being stupid and missing something
obvious.

Linus

2022-10-14 22:48:55

by Kirill A. Shutemov

Subject: Re: [BUG?] X86 arch_tlbbatch_flush() seems to be lacking mm_tlb_flush_nested() integration

On Fri, Oct 14, 2022 at 08:19:42PM +0200, Jann Horn wrote:
> Hi!
>
> I haven't actually managed to reproduce this behavior, so maybe I'm
> just misunderstanding how this works; but I think the
> arch_tlbbatch_flush() path for batched TLB flushing in vmscan ought to
> have some kind of integration with mm_tlb_flush_nested().
>
> I think that currently, the following race could happen:
>
> [initial situation: page P is mapped into a page table of task B, but
> the page is not referenced, the PTE's A/D bits are clear]
> A: vmscan begins
> A: vmscan looks at P and P's PTEs, and concludes that P is not currently in use
> B: reads from P through the PTE, setting the Accessed bit and creating
> a TLB entry
> A: vmscan enters try_to_unmap_one()
> A: try_to_unmap_one() calls should_defer_flush(), which returns true
> A: try_to_unmap_one() removes the PTE and queues a TLB flush
> (arch_tlbbatch_add_mm())
> A: try_to_unmap_one() returns, try_to_unmap() returns to shrink_folio_list()
> B: calls munmap() on the VMA that mapped P
> B: no PTEs are removed, so no TLB flush happens
> B: munmap() returns

I think here we will serialize against anon_vma/i_mmap lock in
__do_munmap() -> unmap_region() -> free_pgtables() that A also holds.

So I believe munmap() is safe, but MADV_DONTNEED (and its flavours) is not.
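
That is, roughly (simplified from mm/memory.c, details vary by version),
free_pgtables() detaches each VMA from rmap under those locks:

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
                   unsigned long floor, unsigned long ceiling)
{
        while (vma) {
                struct vm_area_struct *next = vma->vm_next;

                /* Both of these take the rmap locks that reclaim also
                 * takes while walking the page's mappings. */
                unlink_anon_vmas(vma);          /* anon_vma lock */
                unlink_file_vma(vma);           /* i_mmap_rwsem */
                free_pgd_range(tlb, vma->vm_start, vma->vm_end,
                               floor, next ? next->vm_start : ceiling);
                vma = next;
        }
}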

> [at this point, the TLB entry still exists]
> B: calls mmap(), which reuses the same area that was just unmapped
> B: tries to access the newly created VMA, but instead the access goes
> through the stale TLB entry
> A: shrink_folio_list() calls try_to_unmap_flush(), which removes the
> stale TLB entry
>
> The effect would be that after process B removes a mapping with
> munmap() and creates a new mapping in its place, it would still see
> data from the old mapping when trying to access the new mapping.
>
> Am I missing something that protects against this scenario?
>
> munmap() uses the mmu_gather infrastructure, which tries to protect
> against this kind of correctness bug with multiple racing TLB
> invalidations in tlb_finish_mmu() by blowing away the whole TLB
> whenever one TLB invalidation ends while another is still in progress
> (tested with mm_tlb_flush_nested(tlb->mm)). But mmu_gather doesn't
> seem to be aware of TLB flushes that are batched up in the
> arch_tlbbatch_flush() infrastructure, so that doesn't help here.
>
> I think it might be necessary to add a new global counter of pending
> arch_tlbbatch_flush() flushes, and query that in
> mm_tlb_flush_nested(), or something like that.

--
Kiryl Shutsemau / Kirill A. Shutemov

2022-10-14 23:34:48

by Jann Horn

Subject: Re: [BUG?] X86 arch_tlbbatch_flush() seems to be lacking mm_tlb_flush_nested() integration

On Sat, Oct 15, 2022 at 12:23 AM Kirill A. Shutemov
<[email protected]> wrote:
> On Fri, Oct 14, 2022 at 08:19:42PM +0200, Jann Horn wrote:
> > Hi!
> >
> > I haven't actually managed to reproduce this behavior, so maybe I'm
> > just misunderstanding how this works; but I think the
> > arch_tlbbatch_flush() path for batched TLB flushing in vmscan ought to
> > have some kind of integration with mm_tlb_flush_nested().
> >
> > I think that currently, the following race could happen:
> >
> > [initial situation: page P is mapped into a page table of task B, but
> > the page is not referenced, the PTE's A/D bits are clear]
> > A: vmscan begins
> > A: vmscan looks at P and P's PTEs, and concludes that P is not currently in use
> > B: reads from P through the PTE, setting the Accessed bit and creating
> > a TLB entry
> > A: vmscan enters try_to_unmap_one()
> > A: try_to_unmap_one() calls should_defer_flush(), which returns true
> > A: try_to_unmap_one() removes the PTE and queues a TLB flush
> > (arch_tlbbatch_add_mm())
> > A: try_to_unmap_one() returns, try_to_unmap() returns to shrink_folio_list()
> > B: calls munmap() on the VMA that mapped P
> > B: no PTEs are removed, so no TLB flush happens
> > B: munmap() returns
>
> I think here we will serialize against anon_vma/i_mmap lock in
> __do_munmap() -> unmap_region() -> free_pgtables() that A also holds.
>
> So I believe munmap() is safe, but MADV_DONTNEED (and its flavours) is not.

shrink_folio_list() is not in a context that is operating on a
specific MM; it is operating on a list of pages that might be mapped
into different processes all over the system.

So A has temporarily held those locks somewhere inside
try_to_unmap_one(), but it will drop them before it reaches the point
where it issues the batched TLB flush.
And this batched TLB flush potentially covers multiple MMs at once; it
is not targeted towards a specific MM, but towards all of the CPUs on
which any of the touched MMs might be active.
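
Roughly, on the x86 side (condensed sketch; the real arch_tlbbatch_flush()
also special-cases the local CPU and builds its flush_tlb_info at runtime,
and helper names differ between kernel versions):

/* Queueing: bump the mm's TLB generation and OR its CPU mask into the
 * batch; after this point nothing in the batch is per-mm anymore. */
static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
                                        struct mm_struct *mm)
{
        inc_mm_tlb_gen(mm);
        cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
}

/* Flushing: one full flush sent to every CPU in the accumulated mask,
 * not targeted at any particular mm. */
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
        flush_tlb_multi(&batch->cpumask, &full_flush_tlb_info);
        cpumask_clear(&batch->cpumask);
}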

2022-10-14 23:54:12

by Kirill A. Shutemov

Subject: Re: [BUG?] X86 arch_tlbbatch_flush() seems to be lacking mm_tlb_flush_nested() integration

On Sat, Oct 15, 2022 at 12:29:57AM +0200, Jann Horn wrote:
> On Sat, Oct 15, 2022 at 12:23 AM Kirill A. Shutemov
> <[email protected]> wrote:
> > On Fri, Oct 14, 2022 at 08:19:42PM +0200, Jann Horn wrote:
> > > Hi!
> > >
> > > I haven't actually managed to reproduce this behavior, so maybe I'm
> > > just misunderstanding how this works; but I think the
> > > arch_tlbbatch_flush() path for batched TLB flushing in vmscan ought to
> > > have some kind of integration with mm_tlb_flush_nested().
> > >
> > > I think that currently, the following race could happen:
> > >
> > > [initial situation: page P is mapped into a page table of task B, but
> > > the page is not referenced, the PTE's A/D bits are clear]
> > > A: vmscan begins
> > > A: vmscan looks at P and P's PTEs, and concludes that P is not currently in use
> > > B: reads from P through the PTE, setting the Accessed bit and creating
> > > a TLB entry
> > > A: vmscan enters try_to_unmap_one()
> > > A: try_to_unmap_one() calls should_defer_flush(), which returns true
> > > A: try_to_unmap_one() removes the PTE and queues a TLB flush
> > > (arch_tlbbatch_add_mm())
> > > A: try_to_unmap_one() returns, try_to_unmap() returns to shrink_folio_list()
> > > B: calls munmap() on the VMA that mapped P
> > > B: no PTEs are removed, so no TLB flush happens
> > > B: munmap() returns
> >
> > I think here we will serialize against anon_vma/i_mmap lock in
> > __do_munmap() -> unmap_region() -> free_pgtables() that A also holds.
> >
> > So I believe munmap() is safe, but MADV_DONTNEED (and its flavours) is not.
>
> shrink_folio_list() is not in a context that is operating on a
> specific MM; it is operating on a list of pages that might be mapped
> into different processes all over the system.

s/specific MM/specific page/

> So A has temporarily held those locks somewhere inside
> try_to_unmap_one(), but it will drop them before it reaches the point

inside try_to_unmap(), which handles all mappings of the page.

> where it issues the batched TLB flush.
> And this batched TLB flush potentially covers multiple MMs at once; it
> is not targeted towards a specific MM, but towards all of the CPUs on
> which any of the touched MMs might be active.

But, yes, you are right. I thought that try_to_unmap_flush() was called
inside try_to_unmap(), under the lock.

--
Kiryl Shutsemau / Kirill A. Shutemov

2022-10-15 04:20:11

by Nadav Amit

Subject: Re: [BUG?] X86 arch_tlbbatch_flush() seems to be lacking mm_tlb_flush_nested() integration

On Oct 14, 2022, at 9:19 PM, Jann Horn <[email protected]> wrote:

> Hi!
>
> I haven't actually managed to reproduce this behavior, so maybe I'm
> just misunderstanding how this works; but I think the
> arch_tlbbatch_flush() path for batched TLB flushing in vmscan ought to
> have some kind of integration with mm_tlb_flush_nested().
>
> I think that currently, the following race could happen:
>
> [initial situation: page P is mapped into a page table of task B, but
> the page is not referenced, the PTE's A/D bits are clear]
> A: vmscan begins
> A: vmscan looks at P and P's PTEs, and concludes that P is not currently in use
> B: reads from P through the PTE, setting the Accessed bit and creating
> a TLB entry
> A: vmscan enters try_to_unmap_one()
> A: try_to_unmap_one() calls should_defer_flush(), which returns true
> A: try_to_unmap_one() removes the PTE and queues a TLB flush
> (arch_tlbbatch_add_mm())
> A: try_to_unmap_one() returns, try_to_unmap() returns to shrink_folio_list()
> B: calls munmap() on the VMA that mapped P
> B: no PTEs are removed, so no TLB flush happens

Unless I am missing something, flush_tlb_batched_pending() would be
called and do the flushing at this point, no?
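
For reference, that is roughly (simplified from mm/rmap.c, details vary
by kernel version):

/* Called from e.g. zap_pte_range() and the madvise() paths before they
 * rely on page-table state they have just read. */
void flush_tlb_batched_pending(struct mm_struct *mm)
{
        if (data_race(mm->tlb_flush_batched)) {
                /* Reclaim queued a deferred flush covering this mm but
                 * may not have issued it yet: flush synchronously now. */
                flush_tlb_mm(mm);

                /* Don't let the clear be reordered before the flush. */
                barrier();
                mm->tlb_flush_batched = false;
        }
}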

IIUC the scenario, we had some similar cases in the past [1]. Discussing
these scenarios required too many arguments for my liking, and I would’ve
preferred easier-to-reason-about coordination between the batching
mechanisms. I proposed some schemes in the past, but to be fair, I think
all of them would have some extra overhead.

[1] https://lore.kernel.org/linux-mm/[email protected]/T/#u

2022-10-16 01:13:47

by Linus Torvalds

Subject: Re: [BUG?] X86 arch_tlbbatch_flush() seems to be lacking mm_tlb_flush_nested() integration

On Fri, Oct 14, 2022 at 8:51 PM Nadav Amit <[email protected]> wrote:
>
> Unless I am missing something, flush_tlb_batched_pending() would be
> called and do the flushing at this point, no?

Ahh, yes.

That seems to be doing the right thing, although looking a bit more at
it, I think it might be improved.

At least in the zap_pte_range() case, instead of doing a synchronous
TLB flush if there are pending batched flushes, it might be better if
flush_tlb_batched_pending() would set the "need_flush_all" bit in the
mmu_gather structure.

That would possibly avoid that extra TLB flush entirely - since
*normally* zap_page_range() will cause a TLB flush anyway.
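
Something like this, as a rough sketch (hypothetical, not existing kernel
code: today flush_tlb_batched_pending() only takes the mm, so this assumes
a variant that is handed the caller's mmu_gather):

static void flush_tlb_batched_pending_gather(struct mmu_gather *tlb)
{
        if (data_race(tlb->mm->tlb_flush_batched)) {
                /* Fold the pending reclaim flush into the flush that
                 * tlb_finish_mmu() will do anyway, instead of doing a
                 * synchronous flush_tlb_mm() right here. */
                tlb->need_flush_all = 1;
                barrier();
                tlb->mm->tlb_flush_batched = false;
        }
}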

Maybe it doesn't matter.

Linus

2022-10-16 05:59:23

by Nadav Amit

Subject: Re: [BUG?] X86 arch_tlbbatch_flush() seems to be lacking mm_tlb_flush_nested() integration

On Oct 16, 2022, at 2:47 AM, Linus Torvalds <[email protected]> wrote:

> On Fri, Oct 14, 2022 at 8:51 PM Nadav Amit <[email protected]> wrote:
>> Unless I am missing something, flush_tlb_batched_pending() would be
>> called and do the flushing at this point, no?
>
> Ahh, yes.
>
> That seems to be doing the right thing, although looking a bit more at
> it, I think it might be improved.
>
> At least in the zap_pte_range() case, instead of doing a synchronous
> TLB flush if there are pending batched flushes, it might be better if
> flush_tlb_batched_pending() would set the "need_flush_all" bit in the
> mmu_gather structure.
>
> That would possibly avoid that extra TLB flush entirely - since
> *normally* zap_page_range() will cause a TLB flush anyway.
>
> Maybe it doesn't matter.

It seems possible and simple.

But in general, there are still various unnecessary TLB flushes due to the
TLB batching. Specifically, ptep_clear_flush() might flush unnecessarily
when pte_accessible() finds tlb_flush_pending holding a non-zero value.
Worse, the complexity of the code is high.

To simplify the TLB flushing mechanism and eliminate the unnecessary TLB
flushes, it is possible to track the “completed” TLB generation (i.e., one
that was flushed). Tracking pending TLB flushes can be done at VMA or
page-table granularity instead of mm granularity, to avoid unnecessary flushes
on ptep_clear_flush(). Andy also suggested having a queue of the pending TLB
flushes.
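
As a rough sketch of the generation idea (hypothetical, field and helper
names made up; x86 already keeps a per-mm tlb_gen in mm->context):

/* Hypothetical additions, not existing kernel code. */
struct mm_struct {
        /* ... existing fields ... */
        atomic64_t tlb_gen;             /* bumped when a flush becomes needed */
        atomic64_t tlb_gen_completed;   /* highest generation known to be flushed */
};

/* A PTE change tagged with a generation that has already completed needs
 * no further flush, e.g. when ptep_clear_flush() reconsiders. */
static inline bool tlb_flush_still_pending(struct mm_struct *mm, u64 gen)
{
        return gen > atomic64_read(&mm->tlb_gen_completed);
}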

The main problem is that each of the aforementioned enhancements can add
some cache references, and therefore might induce additional overheads. I
sent some patches before [1], which I can revive. The main question is
whether we can prioritize simplicity and unification of the various
TLB-flush batching mechanisms over (probably very small) performance gains.


[1] https://lore.kernel.org/linux-mm/[email protected]/

2022-10-17 11:19:53

by Jann Horn

Subject: Re: [BUG?] X86 arch_tlbbatch_flush() seems to be lacking mm_tlb_flush_nested() integration

On Sat, Oct 15, 2022 at 5:51 AM Nadav Amit <[email protected]> wrote:
> On Oct 14, 2022, at 9:19 PM, Jann Horn <[email protected]> wrote:
> > I haven't actually managed to reproduce this behavior, so maybe I'm
> > just misunderstanding how this works; but I think the
> > arch_tlbbatch_flush() path for batched TLB flushing in vmscan ought to
> > have some kind of integration with mm_tlb_flush_nested().
> >
> > I think that currently, the following race could happen:
> >
> > [initial situation: page P is mapped into a page table of task B, but
> > the page is not referenced, the PTE's A/D bits are clear]
> > A: vmscan begins
> > A: vmscan looks at P and P's PTEs, and concludes that P is not currently in use
> > B: reads from P through the PTE, setting the Accessed bit and creating
> > a TLB entry
> > A: vmscan enters try_to_unmap_one()
> > A: try_to_unmap_one() calls should_defer_flush(), which returns true
> > A: try_to_unmap_one() removes the PTE and queues a TLB flush
> > (arch_tlbbatch_add_mm())
> > A: try_to_unmap_one() returns, try_to_unmap() returns to shrink_folio_list()
> > B: calls munmap() on the VMA that mapped P
> > B: no PTEs are removed, so no TLB flush happens
>
> Unless I am missing something, flush_tlb_batched_pending() would be
> called and do the flushing at this point, no?

Ooooh! Thanks, I missed that.

2022-10-17 15:57:16

by Mel Gorman

Subject: Re: [BUG?] X86 arch_tlbbatch_flush() seems to be lacking mm_tlb_flush_nested() integration

On Sat, Oct 15, 2022 at 04:47:16PM -0700, Linus Torvalds wrote:
> On Fri, Oct 14, 2022 at 8:51 PM Nadav Amit <[email protected]> wrote:
> >
> > Unless I am missing something, flush_tlb_batched_pending() would be
> > called and do the flushing at this point, no?
>
> Ahh, yes.
>
> That seems to be doing the right thing, although looking a bit more at
> it, I think it might be improved.
>

To be fair, I originally got it wrong and Nadav caught it almost 2 years
later. However, I think the current behaviour is still ok.

> At least in the zap_pte_range() case, instead of doing a synchronous
> TLB flush if there are pending batched flushes, it might be better if
> flush_tlb_batched_pending() would set the "need_flush_all" bit in the
> mmu_gather structure.
>

I think setting need_flush_all would miss the case where no PTEs were updated
due to a race during unmap. I think it would be safer to check for a deferred
TLB flush in mm_tlb_flush_nested(), but I didn't dig too deep.

> That would possibly avoid that extra TLB flush entirely - since
> *normally* zap_page_range() will cause a TLB flush anyway.
>
> Maybe it doesn't matter.
>

While it could be better, I still think the simple approach is sufficient
and it can be used in each affected area. For example, move_ptes() does not
use mmu_gather, so either it would have to be converted to use mmu_gather,
or we would need both mmu_gather and !mmu_gather detection of deferred TLB
flushing from reclaim context, and I'm not sure it's worth it.

Once reclaim is active, the performance is slightly degraded as TLBs
are being flushed anyway, and it's possible that active pages are being
reclaimed that will have to be refaulted, which is even more costly. For the
scenario Jann was concerned with, pages belonging to the task are being
reclaimed while mmap/munmap operations are also happening. munmap/mmap
is sufficiently expensive that a spurious flush due to parallel reclaim
should have negligible additional overhead, and I'd be surprised if the
additional runtime cost could be reliably measured.

--
Mel Gorman
SUSE Labs