2023-08-10 11:55:41

by Yan Zhao

Subject: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

This is an RFC series trying to fix the issue of unnecessary NUMA
protection and TLB-shootdowns found in VMs with assigned devices or VFIO
mediated devices during NUMA balancing.

For VMs with assigned devices or VFIO mediated devices, all or part of
guest memory is pinned long-term.

Auto NUMA balancing periodically selects VMAs of a process and changes
their protection to PROT_NONE, even though some or all pages in the
selected ranges are long-term pinned for DMA, as is the case for VMs with
assigned devices or VFIO mediated devices.

Though this does not cause a functional problem, because NUMA migration
will ultimately refuse to migrate those kinds of pages and will restore
the PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped
periodically, with identical SPTEs eventually faulted back in, wasting CPU
cycles and generating unnecessary TLB-shootdowns.

This series first introduces a new flag, MMU_NOTIFIER_RANGE_NUMA, in patch
1 to work with the mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so
that subscribers of the mmu notifier (e.g. KVM) can tell that an
invalidation event is sent specifically for NUMA migration.

Patch 2 skips setting PROT_NONE on long-term pinned pages in the primary
MMU, avoiding the page faults introduced by NUMA protection and the
subsequent restoration of the old huge PMDs/PTEs in the primary MMU.
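
To make this concrete, here is a minimal sketch of the kind of check patch
2 is about; it is purely illustrative (not the actual diff), and the exact
placement and helpers in the patch may differ:

    /*
     * Illustrative sketch only, not the real patch 2: in the prot_numa
     * path of change_pte_range(), leave PTEs of maybe-long-term-pinned
     * pages untouched.
     */
    if (prot_numa) {
            struct folio *folio = vm_normal_folio(vma, addr, oldpte);

            /*
             * folio_maybe_dma_pinned() is a heuristic and can report
             * false positives for very heavily mapped pages, but that
             * only means NUMA protection is skipped for a few extra
             * pages.
             */
            if (folio && folio_maybe_dma_pinned(folio))
                    continue;
    }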

Patch 3 introduces a new mmu notifier callback, .numa_protect(), which
patch 4 then calls once a page is known to have been made PROT_NONE.

Then, in patch 5, KVM can recognize that an .invalidate_range_start()
notification is specific to NUMA balancing and defer unmapping the pages
from the secondary MMU until .numa_protect() arrives.
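
Put together, the intended flow on the KVM side might look roughly like
the sketch below. This is illustrative only, not the patch code: the
.numa_protect() signature is a guess, and
kvm_unmap_range_from_secondary_mmu() is a made-up helper name.

    /*
     * Illustrative sketch of the idea in patch 5: on an invalidation
     * that is only for NUMA balancing, skip the immediate zap and wait
     * for .numa_protect().
     */
    static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
                                    const struct mmu_notifier_range *range)
    {
            /*
             * MMU_NOTIFIER_RANGE_NUMA is the new flag from patch 1;
             * MMU_NOTIFY_PROTECTION_VMA is an existing event type.
             */
            if (range->event == MMU_NOTIFY_PROTECTION_VMA &&
                (range->flags & MMU_NOTIFIER_RANGE_NUMA))
                    return 0;       /* defer: wait for .numa_protect() */

            /* existing unmap path elided */
            return 0;
    }

    /*
     * New callback from patch 3, invoked (patch 4) only for pages that
     * really became PROT_NONE in the primary MMU.
     */
    static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
                                              struct mm_struct *mm,
                                              unsigned long start,
                                              unsigned long end)
    {
            /* zap only this range from the secondary MMU */
            kvm_unmap_range_from_secondary_mmu(mn, start, end);
    }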


Changelog:
RFC v1 --> v2:
1. Added patches 3-4 to introduce a new callback .numa_protect().
2. Rather than have KVM duplicate the logic for checking whether a page is
long-term pinned, let KVM rely on the new callback .numa_protect() to do
the page unmap in the secondary MMU for NUMA migration.

RFC v1:
https://lore.kernel.org/all/[email protected]/

Yan Zhao (5):
mm/mmu_notifier: introduce a new mmu notifier flag
MMU_NOTIFIER_RANGE_NUMA
mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate
purpose
mm/mmu_notifier: introduce a new callback .numa_protect
mm/autonuma: call .numa_protect() when page is protected for NUMA
migrate
KVM: Unmap pages only when it's indeed protected for NUMA migration

include/linux/mmu_notifier.h | 16 ++++++++++++++++
mm/huge_memory.c | 6 ++++++
mm/mmu_notifier.c | 18 ++++++++++++++++++
mm/mprotect.c | 10 +++++++++-
virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++---
5 files changed, 71 insertions(+), 4 deletions(-)

--
2.17.1



2023-08-10 15:12:11

by Chao Gao

Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

On Thu, Aug 10, 2023 at 04:56:36PM +0800, Yan Zhao wrote:
>This is an RFC series trying to fix the issue of unnecessary NUMA
>protection and TLB-shootdowns found in VMs with assigned devices or VFIO
>mediated devices during NUMA balance.
>
>For VMs with assigned devices or VFIO mediated devices, all or part of
>guest memory are pinned for long-term.
>
>Auto NUMA balancing will periodically selects VMAs of a process and change
>protections to PROT_NONE even though some or all pages in the selected
>ranges are long-term pinned for DMAs, which is true for VMs with assigned
>devices or VFIO mediated devices.
>
>Though this will not cause real problem because NUMA migration will
>ultimately reject migration of those kind of pages and restore those
>PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically
>with equal SPTEs finally faulted back, wasting CPU cycles and generating
>unnecessary TLB-shootdowns.

In my understanding, NUMA balancing also moves tasks closer to the memory
they are accessing. Can this still work with this series applied?

>
>This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1
>to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
>the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation
>event is sent for NUMA migration purpose in specific.
>
>Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary
>MMU to avoid NUMA protection introduced page faults and restoration of old
>huge PMDs/PTEs in primary MMU.
>
>Patch 3 introduces a new mmu notifier callback .numa_protect(), which
>will be called in patch 4 when a page is ensured to be PROT_NONE protected.
>
>Then in patch 5, KVM can recognize a .invalidate_range_start() notification
>is for NUMA balancing specific and do not do the page unmap in secondary
>MMU until .numa_protect() comes.
>
>
>Changelog:
>RFC v1 --> v2:
>1. added patch 3-4 to introduce a new callback .numa_protect()
>2. Rather than have KVM duplicate logic to check if a page is pinned for
>long-term, let KVM depend on the new callback .numa_protect() to do the
>page unmap in secondary MMU for NUMA migration purpose.
>
>RFC v1:
>https://lore.kernel.org/all/[email protected]/
>
>Yan Zhao (5):
> mm/mmu_notifier: introduce a new mmu notifier flag
> MMU_NOTIFIER_RANGE_NUMA
> mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate
> purpose
> mm/mmu_notifier: introduce a new callback .numa_protect
> mm/autonuma: call .numa_protect() when page is protected for NUMA
> migrate
> KVM: Unmap pages only when it's indeed protected for NUMA migration
>
> include/linux/mmu_notifier.h | 16 ++++++++++++++++
> mm/huge_memory.c | 6 ++++++
> mm/mmu_notifier.c | 18 ++++++++++++++++++
> mm/mprotect.c | 10 +++++++++-
> virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++---
> 5 files changed, 71 insertions(+), 4 deletions(-)
>
>--
>2.17.1
>

2023-08-11 06:43:50

by Yan Zhao

Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

On Thu, Aug 10, 2023 at 09:58:43PM +0800, Chao Gao wrote:
> On Thu, Aug 10, 2023 at 04:56:36PM +0800, Yan Zhao wrote:
> >This is an RFC series trying to fix the issue of unnecessary NUMA
> >protection and TLB-shootdowns found in VMs with assigned devices or VFIO
> >mediated devices during NUMA balance.
> >
> >For VMs with assigned devices or VFIO mediated devices, all or part of
> >guest memory are pinned for long-term.
> >
> >Auto NUMA balancing will periodically selects VMAs of a process and change
> >protections to PROT_NONE even though some or all pages in the selected
> >ranges are long-term pinned for DMAs, which is true for VMs with assigned
> >devices or VFIO mediated devices.
> >
> >Though this will not cause real problem because NUMA migration will
> >ultimately reject migration of those kind of pages and restore those
> >PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically
> >with equal SPTEs finally faulted back, wasting CPU cycles and generating
> >unnecessary TLB-shootdowns.
>
> In my understanding, NUMA balancing also moves tasks closer to the memory
> they are accessing. Can this still work with this series applied?
>
For pages protected with PROT_NONE in the primary MMU during the scanning
phase, yes; for pages not set to PROT_NONE, no.
It looks like task_numa_migrate() is only triggered by the next page
fault, i.e. when a PROT_NONE PTE is hit in an otherwise accessible VMA.
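
(For reference, the hinting fault is only taken from a path roughly like
this in mm/memory.c, so a PTE that was never made PROT_NONE never reaches
the migration logic; quoted from memory, so treat it as a sketch:)

    /* handle_pte_fault(), roughly: a NUMA hinting fault is raised only
     * when the PTE was made PROT_NONE while the VMA itself is still
     * accessible; do_numa_page() then ends up in task_numa_fault(). */
    if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
            return do_numa_page(vmf);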

2023-08-11 20:07:48

by David Hildenbrand

Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

>> Ah, okay I see, thanks. That's indeed unfortunate.
>
> Sigh. All this difficulty reminds me that this mechanism was created in
> the early days of NUMA. I wonder sometimes lately whether the cost, in
> complexity and CPU time, is still worth it on today's hardware.
>
> But of course I am deeply biased, so don't take that too seriously.
> See below. :)

:)

>>
>>>
>>> Then current KVM will unmap all notified pages from secondary MMU
>>> in .invalidate_range_start(), which could include pages that finally not
>>> set to PROT_NONE in primary MMU.
>>>
>>> For VMs with pass-through devices, though all guest pages are pinned,
>>> KVM still periodically unmap pages in response to the
>>> .invalidate_range_start() notification from auto NUMA balancing, which
>>> is a waste.
>>
>> Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
>> similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
>>
>
> Yes, NUMA balancing is incredibly harmful to performance, for GPU and
> accelerators that map memory...and VMs as well, it seems. Basically,
> anything that has its own processors and page tables needs to be left
> strictly alone by NUMA balancing. Because the kernel is (still, even
> today) unaware of what those processors are doing, and so it has no way
> to do productive NUMA balancing.

Is there any existing way we could handle that better on a per-VMA
level, or on the process level? Any magic toggles?

MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might
be better, but with things like io_uring still too restrictive eventually.

I recall that setting a mempolicy could prevent auto-numa from getting
active, but that might be undesired.

CCing Mel.

--
Cheers,

David / dhildenb


2023-08-11 20:12:40

by John Hubbard

Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

On 8/11/23 11:39, David Hildenbrand wrote:
...
>>> Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
>>> similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
>>>
>>
>> Yes, NUMA balancing is incredibly harmful to performance, for GPU and
>> accelerators that map memory...and VMs as well, it seems. Basically,
>> anything that has its own processors and page tables needs to be left
>> strictly alone by NUMA balancing. Because the kernel is (still, even
>> today) unaware of what those processors are doing, and so it has no way
>> to do productive NUMA balancing.
>
> Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles?
>
> MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually.
>
> I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired.
>
> CCing Mel.
>

Let's discern between page pinning situations, and HMM-style situations.
Page pinning of CPU memory is unnecessary when setting up for using that
memory by modern GPUs or accelerators, because the latter can handle
replayable page faults. So for such cases, the pages are in use by a GPU
or accelerator, but unpinned.

The performance problem occurs because for those pages, the NUMA
balancing causes unmapping, which generates callbacks to the device
driver, which dutifully unmaps the pages from the GPU or accelerator,
even if the GPU might be busy using those pages. The device promptly
causes a device page fault, and the driver then re-establishes the
device page table mapping, which is good until the next round of
unmapping from the NUMA balancer.

hmm_range_fault()-based memory management in particular might benefit
from having NUMA balancing disabled entirely for the memremap_pages()
region, come to think of it. That seems relatively easy and clean at
first glance anyway.

For other regions (allocated by the device driver), a per-VMA flag
seems about right: VM_NO_NUMA_BALANCING ?
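
(Purely for illustration, since no such flag exists today, the NUMA
scanner could then skip these VMAs with something like the following in
task_numa_work():)

    /* Hypothetical: VM_NO_NUMA_BALANCING is not an existing vm_flags bit. */
    if (vma->vm_flags & VM_NO_NUMA_BALANCING)
            continue;       /* never apply PROT_NONE hinting to this VMA */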

thanks,
--
John Hubbard
NVIDIA


2023-08-11 20:17:33

by John Hubbard

Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

On 8/11/23 10:25, David Hildenbrand wrote:
...
> One issue is that folio_maybe_pinned...() ... is unreliable as soon as your page is mapped more than 1024 times.
>
> One might argue that we also want to exclude pages that are mapped that often. That might possibly work.

Yes.

>>
>>> Staring at page #2, are we still missing something similar for THPs?
>> Yes.
>>
>>> Why is that MMU notifier thingy and touching KVM code required?
>> Because NUMA balancing code will firstly send .invalidate_range_start() with
>> event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range()
>> unconditionally, before it goes down into change_pte_range() and
>> change_huge_pmd() to check each page count and apply PROT_NONE.
>
> Ah, okay I see, thanks. That's indeed unfortunate.

Sigh. All this difficulty reminds me that this mechanism was created in
the early days of NUMA. I wonder sometimes lately whether the cost, in
complexity and CPU time, is still worth it on today's hardware.

But of course I am deeply biased, so don't take that too seriously.
See below. :)

>
>>
>> Then current KVM will unmap all notified pages from secondary MMU
>> in .invalidate_range_start(), which could include pages that finally not
>> set to PROT_NONE in primary MMU.
>>
>> For VMs with pass-through devices, though all guest pages are pinned,
>> KVM still periodically unmap pages in response to the
>> .invalidate_range_start() notification from auto NUMA balancing, which
>> is a waste.
>
> Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
> similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
>

Yes, NUMA balancing is incredibly harmful to performance, for GPU and
accelerators that map memory...and VMs as well, it seems. Basically,
anything that has its own processors and page tables needs to be left
strictly alone by NUMA balancing. Because the kernel is (still, even
today) unaware of what those processors are doing, and so it has no way
to do productive NUMA balancing.


thanks,
--
John Hubbard
NVIDIA


2023-08-14 10:59:58

by Yan Zhao

Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote:
> On 8/11/23 11:39, David Hildenbrand wrote:
> ...
> > > > Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
> > > > similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
> > > >
> > >
> > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and
> > > accelerators that map memory...and VMs as well, it seems. Basically,
> > > anything that has its own processors and page tables needs to be left
> > > strictly alone by NUMA balancing. Because the kernel is (still, even
> > > today) unaware of what those processors are doing, and so it has no way
> > > to do productive NUMA balancing.
> >
> > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles?
> >
> > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually.
> >
> > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired.
> >
> > CCing Mel.
> >
>
> Let's discern between page pinning situations, and HMM-style situations.
> Page pinning of CPU memory is unnecessary when setting up for using that
> memory by modern GPUs or accelerators, because the latter can handle
> replayable page faults. So for such cases, the pages are in use by a GPU
> or accelerator, but unpinned.
>
> The performance problem occurs because for those pages, the NUMA
> balancing causes unmapping, which generates callbacks to the device
> driver, which dutifully unmaps the pages from the GPU or accelerator,
> even if the GPU might be busy using those pages. The device promptly
> causes a device page fault, and the driver then re-establishes the
> device page table mapping, which is good until the next round of
> unmapping from the NUMA balancer.
>
> hmm_range_fault()-based memory management in particular might benefit
> from having NUMA balancing disabled entirely for the memremap_pages()
> region, come to think of it. That seems relatively easy and clean at
> first glance anyway.
>
> For other regions (allocated by the device driver), a per-VMA flag
> seems about right: VM_NO_NUMA_BALANCING ?
>
Thanks a lot for those good suggestions!
For VMs, when could a per-VMA flag be set?
It might be hard to do in mmap() in QEMU, because a VMA may not be used
for DMA until after it's mapped into VFIO.
Then, should VFIO set this flag after it maps a range?
And could this flag be unset after device hot-unplug?



2023-08-19 09:53:21

by Yan Zhao

Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

On Tue, Aug 15, 2023 at 10:36:18AM +0800, Yuan Yao wrote:
> On Mon, Aug 14, 2023 at 05:09:18PM +0800, Yan Zhao wrote:
> > On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote:
> > > On 8/11/23 11:39, David Hildenbrand wrote:
> > > ...
> > > > > > Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are
> > > > > > similar issues with GPU memory:  NUMA hinting is actually counter-productive and they end up disabling it.
> > > > > >
> > > > >
> > > > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and
> > > > > accelerators that map memory...and VMs as well, it seems. Basically,
> > > > > anything that has its own processors and page tables needs to be left
> > > > > strictly alone by NUMA balancing. Because the kernel is (still, even
> > > > > today) unaware of what those processors are doing, and so it has no way
> > > > > to do productive NUMA balancing.
> > > >
> > > > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles?
> > > >
> > > > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually.
> > > >
> > > > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired.
> > > >
> > > > CCing Mel.
> > > >
> > >
> > > Let's discern between page pinning situations, and HMM-style situations.
> > > Page pinning of CPU memory is unnecessary when setting up for using that
> > > memory by modern GPUs or accelerators, because the latter can handle
> > > replayable page faults. So for such cases, the pages are in use by a GPU
> > > or accelerator, but unpinned.
> > >
> > > The performance problem occurs because for those pages, the NUMA
> > > balancing causes unmapping, which generates callbacks to the device
> > > driver, which dutifully unmaps the pages from the GPU or accelerator,
> > > even if the GPU might be busy using those pages. The device promptly
> > > causes a device page fault, and the driver then re-establishes the
> > > device page table mapping, which is good until the next round of
> > > unmapping from the NUMA balancer.
> > >
> > > hmm_range_fault()-based memory management in particular might benefit
> > > from having NUMA balancing disabled entirely for the memremap_pages()
> > > region, come to think of it. That seems relatively easy and clean at
> > > first glance anyway.
> > >
> > > For other regions (allocated by the device driver), a per-VMA flag
> > > seems about right: VM_NO_NUMA_BALANCING ?
> > >
> > Thanks a lot for those good suggestions!
> > For VMs, when could a per-VMA flag be set?
> > Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
> > after it's mapped into VFIO.
> > Then, should VFIO set this flag on after it maps a range?
> > Could this flag be unset after device hot-unplug?
>
> Emm... syscall madvise() in my mind, it does things like change flags
> on VMA, e.g madvise(MADV_DONTFORK) adds VM_DONTCOPY to the VMA.
Yes, madvise() might work.
Setting this flag might be an easy decision, while unsetting it might be
hard unless some counters are introduced.
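
For illustration only, userspace opting a mapping out could look like the
snippet below, assuming a new advice value; MADV_NO_NUMA_BALANCING is
hypothetical and does not exist upstream:

    #include <sys/mman.h>
    #include <stdio.h>

    /* Hypothetical advice value -- an assumption for this sketch only. */
    #ifndef MADV_NO_NUMA_BALANCING
    #define MADV_NO_NUMA_BALANCING 30
    #endif

    static int opt_out_of_numa_balancing(void *addr, size_t len)
    {
            /* e.g. called by QEMU or VFIO-aware userspace once the
             * region is known to be a long-term DMA target */
            if (madvise(addr, len, MADV_NO_NUMA_BALANCING)) {
                    perror("madvise(MADV_NO_NUMA_BALANCING)");
                    return -1;
            }
            return 0;
    }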


2023-08-19 16:46:01

by Yan Zhao

Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

On Wed, Aug 16, 2023 at 11:00:36AM -0700, John Hubbard wrote:
> On 8/16/23 02:49, David Hildenbrand wrote:
> > But do 32bit architectures even care about NUMA hinting? If not, just
> > ignore them ...
>
> Probably not!
>
> ...
> > > So, do you mean that let kernel provide a per-VMA allow/disallow
> > > mechanism, and
> > > it's up to the user space to choose between per-VMA and complex way or
> > > global and simpler way?
> >
> > QEMU could do either way. The question would be if a per-vma settings
> > makes sense for NUMA hinting.
>
> From our experience with compute on GPUs, a per-mm setting would suffice.
> No need to go all the way to VMA granularity.
>
After an offline internal discussion, we think a per-mm setting is also
enough for device passthrough in VMs.

BTW, if we want a per-VMA flag, compared to VM_NO_NUMA_BALANCING, do you
think there is any value in providing a flag like VM_MAYDMA instead?
Auto NUMA balancing or other components could then decide how to use it
themselves.
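
As a sketch of what a per-mm knob could look like from userspace (the
prctl command below is purely hypothetical; nothing like it exists in
mainline as of this thread):

    #include <sys/prctl.h>
    #include <stdio.h>

    /* Hypothetical prctl command number -- an assumption for this sketch. */
    #ifndef PR_SET_NUMA_BALANCING
    #define PR_SET_NUMA_BALANCING 0x4e554d41
    #endif

    static int disable_numa_balancing_for_this_process(void)
    {
            /* arg2 == 0: opt this mm out of auto NUMA balancing */
            if (prctl(PR_SET_NUMA_BALANCING, 0, 0, 0, 0)) {
                    perror("prctl(PR_SET_NUMA_BALANCING)");
                    return -1;
            }
            return 0;
    }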

2023-08-19 19:34:33

by John Hubbard

Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM

On 8/14/23 02:09, Yan Zhao wrote:
...
>> hmm_range_fault()-based memory management in particular might benefit
>> from having NUMA balancing disabled entirely for the memremap_pages()
>> region, come to think of it. That seems relatively easy and clean at
>> first glance anyway.
>>
>> For other regions (allocated by the device driver), a per-VMA flag
>> seems about right: VM_NO_NUMA_BALANCING ?
>>
> Thanks a lot for those good suggestions!
> For VMs, when could a per-VMA flag be set?
> Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
> after it's mapped into VFIO.
> Then, should VFIO set this flag on after it maps a range?
> Could this flag be unset after device hot-unplug?
>

I'm hoping someone who thinks about VMs and VFIO often can chime in.


thanks,
--
John Hubbard
NVIDIA