tdp_mmu_alloc_sp_for_split() allocates page tables for Eager Page
Splitting. Currently it does not specify a NUMA node preference, so it
will try to allocate from the local node. The thread doing eager page
splitting is supplied by userspace and may not be running on the same
node where it would be best for page tables to be allocated.
We can improve TDP MMU eager page splitting by making
tdp_mmu_alloc_sp_for_split() NUMA-aware. Specifically, when splitting a
huge page, allocate the new lower level page tables on the same node as the
huge page.
__get_free_page() is replaced by alloc_pages_node(). This introduces two
functional changes.
1. __get_free_page() removes the __GFP_HIGHMEM gfp flag via its call to
__get_free_pages(). This should not be an issue, as the __GFP_HIGHMEM flag
is not passed to tdp_mmu_alloc_sp_for_split() anyway.
2. __get_free_page() calls alloc_pages(), which uses the thread's mempolicy
for the NUMA node selection. After this commit, the thread's mempolicy will
not be used; the first preference will be to allocate on the node where the
huge page resides (sketched below).
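For reference, roughly paraphrasing mm/page_alloc.c and include/linux/gfp.h
(a sketch, not a verbatim copy), the difference between the two call paths
looks like this:

/* Old path: __get_free_page() strips __GFP_HIGHMEM and lets alloc_pages()
 * pick the node based on the task's mempolicy.
 */
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
{
        struct page *page;

        page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
        if (!page)
                return 0;
        return (unsigned long)page_address(page);
}

/* New path: prefer the requested node (the node of the huge page being
 * split), regardless of the task's mempolicy.
 */
spt_page = alloc_pages_node(nid, gfp, 0);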
dirty_log_perf_test with a 416 vCPU, 1 GB/vCPU configuration on an 8 NUMA
node machine showed dirty memory time improvements between 2% and 35%
across multiple runs.
Suggested-by: David Matlack <[email protected]>
Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bf2ccf9debcaa..1e30e18fc6a03 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1402,9 +1402,19 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
return spte_set;
}
-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
+/*
+ * Caller's responsibility to pass a valid spte which has the shadow page
+ * present.
+ */
+static int tdp_mmu_spte_to_nid(u64 spte)
+{
+ return page_to_nid(pfn_to_page(spte_to_pfn(spte)));
+}
+
+static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
{
struct kvm_mmu_page *sp;
+ struct page *spt_page;
gfp |= __GFP_ZERO;
@@ -1412,11 +1422,12 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
if (!sp)
return NULL;
- sp->spt = (void *)__get_free_page(gfp);
- if (!sp->spt) {
+ spt_page = alloc_pages_node(nid, gfp, 0);
+ if (!spt_page) {
kmem_cache_free(mmu_page_header_cache, sp);
return NULL;
}
+ sp->spt = page_address(spt_page);
return sp;
}
@@ -1426,6 +1437,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
bool shared)
{
struct kvm_mmu_page *sp;
+ int nid;
+
+ nid = tdp_mmu_spte_to_nid(iter->old_spte);
/*
* Since we are allocating while under the MMU lock we have to be
@@ -1436,7 +1450,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
* If this allocation fails we drop the lock and retry with reclaim
* allowed.
*/
- sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+ sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
if (sp)
return sp;
@@ -1448,7 +1462,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
write_unlock(&kvm->mmu_lock);
iter->yielded = true;
- sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
+ sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
if (shared)
read_lock(&kvm->mmu_lock);
--
2.37.1.455.g008518b4e5-goog
On Mon, Aug 01, 2022 at 08:19:28AM -0700, Vipin Sharma wrote:
>
In the shortlog, clarify that this only applies to the TDP MMU.
> tdp_mmu_alloc_sp_for_split() allocates page tables for Eager Page
> Splitting. Currently it does not specify a NUMA node preference, so it
> will try to allocate from the local node. The thread doing eager page
nit: Instead of "try to allocate from the local node", say something
like "allocate using the current thread's mempolicy, which is most likely
going to be MPOL_LOCAL, i.e. allocate from the local node."
> splitting is supplied by the userspace and may not be running on the same
> node where it would be best for page tables to be allocated.
Suggest comparing eager page splitting to vCPU faults, which is the
other place that TDP page tables are allocated. e.g.
Note that TDP page tables allocated during fault handling are less
likely to suffer from the same NUMA locality issue because they will
be allocated on the same node as the vCPU thread (assuming its
mempolicy is MPOL_LOCAL), which is often the right node.
That being said, KVM currently has a gap where a guest doing a lot of
remote memory accesses when touching memory for the first time will
cause KVM to allocate the TDP page tables on the arguably wrong node.
>
> We can improve TDP MMU eager page splitting by making
> tdp_mmu_alloc_sp_for_split() NUMA-aware. Specifically, when splitting a
> huge page, allocate the new lower level page tables on the same node as the
> huge page.
>
> __get_free_page() is replaced by alloc_page_nodes(). This introduces two
> functional changes.
>
> 1. __get_free_page() removes gfp flag __GFP_HIGHMEM via call to
> __get_free_pages(). This should not be an issue as __GFP_HIGHMEM flag is
> not passed in tdp_mmu_alloc_sp_for_split() anyway.
>
> 2. __get_free_page() calls alloc_pages() and use thread's mempolicy for
> the NUMA node allocation. From this commit, thread's mempolicy will not
> be used and first preference will be to allocate on the node where huge
> page was present.
It would be worth noting that userspace could change the mempolicy of
the thread doing eager splitting to prefer allocating from the target
NUMA node, as an alternative approach.
I don't prefer the alternative though since it bleeds details from KVM
into userspace, such as the fact that enabling dirty logging does eager
page splitting, which allocates page tables. It's also unnecessary since
KVM can infer an appropriate NUMA placement without the help of
userspace, and I can't think of a reason for userspace to prefer a
different policy.
>
> dirty_log_perf_test for 416 vcpu and 1GB/vcpu configuration on a 8 NUMA
> node machine showed dirty memory time improvements between 2% - 35% in
> multiple runs.
That's a pretty wide range.
What's probably happening is vCPU threads are being migrated across NUMA
nodes during the test. So for example, a vCPU thread might first run on
Node 0 during the Populate memory phase, so the huge page will be
allocated on Node 0 as well. But then if the thread gets migrated to
Node 1 later, it will perform poorly because its memory is on a remote
node.
I would recommend pinning vCPUs to specific NUMA nodes to prevent
cross-node migrations. e.g. For a 416 vCPU VM, pin 52 to Node 0, 52 to
Node 1, etc. That should result in more consistent performance and will
be more in line with how large NUMA VMs are actually configured to run.
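For example, a rough sketch of that pinning in userspace (assumes libnuma;
vcpu_thread[] and pin_vcpus_to_nodes() are hypothetical, not existing
dirty_log_perf_test code):

#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>

/*
 * Pin each vCPU thread to the CPUs of its assigned NUMA node, e.g. for
 * 416 vCPUs on 8 nodes: vCPUs 0-51 on node 0, 52-103 on node 1, etc.
 */
static void pin_vcpus_to_nodes(pthread_t *vcpu_thread, int nr_vcpus)
{
        int nr_nodes = numa_num_configured_nodes();
        int nr_cpus = numa_num_configured_cpus();
        int vcpus_per_node = (nr_vcpus + nr_nodes - 1) / nr_nodes;

        for (int i = 0; i < nr_vcpus; i++) {
                int node = (i / vcpus_per_node) % nr_nodes;
                cpu_set_t set;

                CPU_ZERO(&set);
                for (int cpu = 0; cpu < nr_cpus; cpu++)
                        if (numa_node_of_cpu(cpu) == node)
                                CPU_SET(cpu, &set);

                pthread_setaffinity_np(vcpu_thread[i], sizeof(set), &set);
        }
}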
>
> Suggested-by: David Matlack <[email protected]>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 24 +++++++++++++++++++-----
> 1 file changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index bf2ccf9debcaa..1e30e18fc6a03 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1402,9 +1402,19 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
> return spte_set;
> }
>
> -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> +/*
> + * Caller's responsibility to pass a valid spte which has the shadow page
> + * present.
> + */
> +static int tdp_mmu_spte_to_nid(u64 spte)
I think this name is ambiguous because it could be getting the node ID
of the SPTE itself, or the node ID of the page the SPTE is pointing to.
Maybe tdp_mmu_spte_pfn_nid()? Or just drop the helper and open code this
calculation in tdp_mmu_alloc_sp_for_split().
> +{
> + return page_to_nid(pfn_to_page(spte_to_pfn(spte)));
> +}
> +
> +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
> {
> struct kvm_mmu_page *sp;
> + struct page *spt_page;
>
> gfp |= __GFP_ZERO;
>
> @@ -1412,11 +1422,12 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> if (!sp)
> return NULL;
>
> - sp->spt = (void *)__get_free_page(gfp);
> - if (!sp->spt) {
> + spt_page = alloc_pages_node(nid, gfp, 0);
> + if (!spt_page) {
> kmem_cache_free(mmu_page_header_cache, sp);
> return NULL;
> }
> + sp->spt = page_address(spt_page);
>
> return sp;
> }
> @@ -1426,6 +1437,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> bool shared)
> {
> struct kvm_mmu_page *sp;
> + int nid;
> +
> + nid = tdp_mmu_spte_to_nid(iter->old_spte);
>
> /*
> * Since we are allocating while under the MMU lock we have to be
> @@ -1436,7 +1450,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> * If this allocation fails we drop the lock and retry with reclaim
> * allowed.
> */
> - sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
> + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
> if (sp)
> return sp;
>
> @@ -1448,7 +1462,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> write_unlock(&kvm->mmu_lock);
>
> iter->yielded = true;
> - sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
> + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
>
> if (shared)
> read_lock(&kvm->mmu_lock);
> --
> 2.37.1.455.g008518b4e5-goog
>
On Mon, Aug 01, 2022, David Matlack wrote:
> On Mon, Aug 01, 2022 at 08:19:28AM -0700, Vipin Sharma wrote:
> That being said, KVM currently has a gap where a guest doing a lot of
> remote memory accesses when touching memory for the first time will
> cause KVM to allocate the TDP page tables on the arguably wrong node.
Userspace can solve this by setting the NUMA policy on a VMA or shared-object
basis. E.g. create dedicated memslots for each NUMA node, then bind each of the
backing stores to the appropriate host node.
If there is a gap, e.g. a backing store we want to use doesn't properly support
mempolicy for shared mappings, then we should enhance the backing store.
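As a rough sketch of that setup (hypothetical helper, anonymous memory as
the backing store, error handling omitted):

#include <numaif.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/*
 * Create one memslot per NUMA node and bind its backing store to that
 * host node.
 */
static void *add_node_bound_memslot(int vm_fd, int slot, int node,
                                    __u64 gpa, __u64 size)
{
        unsigned long nodemask = 1UL << node;
        void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct kvm_userspace_memory_region region = {
                .slot = slot,
                .guest_phys_addr = gpa,
                .memory_size = size,
                .userspace_addr = (unsigned long)mem,
        };

        mbind(mem, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);
        ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
        return mem;
}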
> > We can improve TDP MMU eager page splitting by making
> > tdp_mmu_alloc_sp_for_split() NUMA-aware. Specifically, when splitting a
> > huge page, allocate the new lower level page tables on the same node as the
> > huge page.
> >
> > __get_free_page() is replaced by alloc_page_nodes(). This introduces two
> > functional changes.
> >
> > 1. __get_free_page() removes gfp flag __GFP_HIGHMEM via call to
> > __get_free_pages(). This should not be an issue as __GFP_HIGHMEM flag is
> > not passed in tdp_mmu_alloc_sp_for_split() anyway.
> >
> > 2. __get_free_page() calls alloc_pages() and use thread's mempolicy for
> > the NUMA node allocation. From this commit, thread's mempolicy will not
> > be used and first preference will be to allocate on the node where huge
> > page was present.
>
> It would be worth noting that userspace could change the mempolicy of
> the thread doing eager splitting to prefer allocating from the target
> NUMA node, as an alternative approach.
>
> I don't prefer the alternative though since it bleeds details from KVM
> into userspace, such as the fact that enabling dirty logging does eager
> page splitting, which allocates page tables.
As above, if userspace cares about vNUMA, then it already needs to be aware of
some of KVM/kernel details. Separate memslots aren't strictly necessary, e.g.
userspace could stitch together contiguous VMAs to create a single mega-memslot,
but that seems like it'd be more work than just creating separate memslots.
And because eager page splitting for dirty logging runs with mmu_lock held,
userspace might also benefit from per-node memslots as it can do the splitting on
multiple tasks/CPUs.
Regardless of what we do, the behavior needs to be documented, i.e. KVM details will
bleed into userspace. E.g. if KVM is overriding the per-task NUMA policy, then
that should be documented.
> It's also unnecessary since KVM can infer an appropriate NUMA placement
> without the help of userspace, and I can't think of a reason for userspace to
> prefer a different policy.
I can't think of a reason why userspace would want to have a different policy for
the task that's enabling dirty logging, but I also can't think of a reason why
KVM should go out of its way to ignore that policy.
IMO this is a "bug" in dirty_log_perf_test, though it's probably a good idea to
document how to effectively configure vNUMA-aware memslots.
On Mon, Aug 01, 2022 at 11:56:21PM +0000, Sean Christopherson wrote:
> On Mon, Aug 01, 2022, David Matlack wrote:
> > On Mon, Aug 01, 2022 at 08:19:28AM -0700, Vipin Sharma wrote:
> > That being said, KVM currently has a gap where a guest doing a lot of
> > remote memory accesses when touching memory for the first time will
> > cause KVM to allocate the TDP page tables on the arguably wrong node.
>
> Userspace can solve this by setting the NUMA policy on a VMA or shared-object
> basis. E.g. create dedicated memslots for each NUMA node, then bind each of the
> backing stores to the appropriate host node.
>
> If there is a gap, e.g. a backing store we want to use doesn't properly support
> mempolicy for shared mappings, then we should enhance the backing store.
Just to clarify: this patch, and my comment here, is about the NUMA
locality of KVM's page tables, not guest memory.
KVM allocates all page tables through __get_free_page() which uses the
current task's mempolicy. This means the only way for userspace to
control the page table NUMA locality is to set mempolicies on the
threads on which KVM will allocate page tables (vCPUs and any thread
doing eager page splitting, i.e. enabling dirty logging or calling
KVM_CLEAR_DIRTY_LOG).
The ideal setup from a NUMA locality perspective would be that page
tables are always on the local node, but that would require KVM maintain
multiple copies of page tables (at minimum one per physical NUMA node),
which is extra memory and complexity.
With the current KVM MMU (one page table hierarchy shared by all vCPUs),
the next best setup, I'd argue, is to co-locate the page tables with the
memory they are mapping. This setup would ensure that a vCPU accessing
memory in its local virtual node would primarily be accessing page
tables in the local physical node when doing page walks. Obviously page
tables at levels 5, 4, and 3, which likely map memory spanning multiple
nodes, could not always be co-located with all the memory they map.
My comment here is saying that there is no way for userspace to actually
enforce the above policy. There is no mempolicy userspace can set on
vCPU threads to ensure that KVM co-locates page tables with the memory
they are mapping. All userspace can do is force KVM to allocate page
tables on the same node as the vCPU thread, or on a specific node.
For eager page splitting, userspace can control what range of memory is
going to be split by controlling the memslot layout, so it is possible
for userspace to set an appropriate mempolicy on that thread. But if you
agree about my point about vCPU threads above, I think it would be a
step in the right direction to have KVM start forcibly co-locating page
tables with memory for eager page splitting (and we can fix the vCPU
case later).
>
> > > We can improve TDP MMU eager page splitting by making
> > > tdp_mmu_alloc_sp_for_split() NUMA-aware. Specifically, when splitting a
> > > huge page, allocate the new lower level page tables on the same node as the
> > > huge page.
> > >
> > > __get_free_page() is replaced by alloc_page_nodes(). This introduces two
> > > functional changes.
> > >
> > > 1. __get_free_page() removes gfp flag __GFP_HIGHMEM via call to
> > > __get_free_pages(). This should not be an issue as __GFP_HIGHMEM flag is
> > > not passed in tdp_mmu_alloc_sp_for_split() anyway.
> > >
> > > 2. __get_free_page() calls alloc_pages() and use thread's mempolicy for
> > > the NUMA node allocation. From this commit, thread's mempolicy will not
> > > be used and first preference will be to allocate on the node where huge
> > > page was present.
> >
> > It would be worth noting that userspace could change the mempolicy of
> > the thread doing eager splitting to prefer allocating from the target
> > NUMA node, as an alternative approach.
> >
> > I don't prefer the alternative though since it bleeds details from KVM
> > into userspace, such as the fact that enabling dirty logging does eager
> > page splitting, which allocates page tables.
>
> As above, if userspace is cares about vNUMA, then it already needs to be aware of
> some of KVM/kernel details. Separate memslots aren't strictly necessary, e.g.
> userspace could stitch together contiguous VMAs to create a single mega-memslot,
> but that seems like it'd be more work than just creating separate memslots.
>
> And because eager page splitting for dirty logging runs with mmu_lock held for,
> userspace might also benefit from per-node memslots as it can do the splitting on
> multiple tasks/CPUs.
>
> Regardless of what we do, the behavior needs to be document, i.e. KVM details will
> bleed into userspace. E.g. if KVM is overriding the per-task NUMA policy, then
> that should be documented.
+1
>
> > It's also unnecessary since KVM can infer an appropriate NUMA placement
> > without the help of userspace, and I can't think of a reason for userspace to
> > prefer a different policy.
>
> I can't think of a reason why userspace would want to have a different policy for
> the task that's enabling dirty logging, but I also can't think of a reason why
> KVM should go out of its way to ignore that policy.
>
> IMO this is a "bug" in dirty_log_perf_test, though it's probably a good idea to
> document how to effectively configure vNUMA-aware memslots.
On Tue, Aug 02, 2022, David Matlack wrote:
> On Mon, Aug 01, 2022 at 11:56:21PM +0000, Sean Christopherson wrote:
> > On Mon, Aug 01, 2022, David Matlack wrote:
> > > On Mon, Aug 01, 2022 at 08:19:28AM -0700, Vipin Sharma wrote:
> > > That being said, KVM currently has a gap where a guest doing a lot of
> > > remote memory accesses when touching memory for the first time will
> > > cause KVM to allocate the TDP page tables on the arguably wrong node.
> >
> > Userspace can solve this by setting the NUMA policy on a VMA or shared-object
> > basis. E.g. create dedicated memslots for each NUMA node, then bind each of the
> > backing stores to the appropriate host node.
> >
> > If there is a gap, e.g. a backing store we want to use doesn't properly support
> > mempolicy for shared mappings, then we should enhance the backing store.
>
> Just to clarify: this patch, and my comment here, is about the NUMA
> locality of KVM's page tables, not guest memory.
Oooh, I overlooked the "TDP page tables" part in the above paragraph.
> KVM allocates all page tables through __get_free_page() which uses the
> current task's mempolicy. This means the only way for userspace to
> control the page table NUMA locality is to set mempolicies on the
> threads on which KVM will allocate page tables (vCPUs and any thread
> doing eager page splitting, i.e. enable dirty logging or
> KVM_CLEAR_DIRTY_LOG).
>
> The ideal setup from a NUMA locality perspective would be that page
> tables are always on the local node, but that would require KVM maintain
> multiple copies of page tables (at minimum one per physical NUMA node),
> which is extra memory and complexity.
Hmm, it actually wouldn't be much complexity, e.g. a few bits in the role, but
the memory would indeed be a problem.
> With the current KVM MMU (one page table hierarchy shared by all vCPUs),
> the next best setup, I'd argue, is to co-locate the page tables with the
> memory they are mapping. This setup would ensure that a vCPU accessing
> memory in its local virtual node would primarily be accessing page
> tables in the local physical node when doing page walks. Obviously page
> tables at levels 5, 4, and 3, which likely map memory spanning multiple
> nodes, could not always be co-located with all the memory they map.
>
> My comment here is saying that there is no way for userspace to actually
> enforce the above policy. There is no mempolicy userspace can set on
> vCPU threads to ensure that KVM co-locates page tables with the memory
> they are mapping. All userspace can do is force KVM to allocate page
> tables on the same node as the vCPU thread, or on a specific node.
>
> For eager page splitting, userspace can control what range of memory is
> going to be split by controlling the memslot layout, so it is possible
> for userspace to set an appropriate mempolicy on that thread. But if you
> agree about my point about vCPU threads above, I think it would be a
> step in the right direction to have KVM start forcibly co-locating page
> tables with memory for eager page splitting (and we can fix the vCPU
> case later).
I agree that there's a gap with respect to a vCPU being the first to touch a
remote node, but I disagree that forcibly co-locating page tables for eager page
splitting is necessary.
Userspace can already force the ideal setup for eager page splitting by configuring
vNUMA-aware memslots and using a task with appropriate policy to toggle dirty
logging. And userspace really should be encouraged to do that, because otherwise
walking the page tables in software to do the split is going to be constantly
accessing remote memory.
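E.g. a rough sketch of what the task toggling dirty logging could do per
node's memslot (hypothetical helper, error handling omitted; this relies on
KVM's current task-mempolicy behavior for page table allocations):

#include <numaif.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Bind this task's allocations to the memslot's node before enabling
 * dirty logging, so the page tables allocated by eager splitting land
 * on that node, then restore the default policy.
 */
static void enable_dirty_logging_on_node(int vm_fd, int node,
                                         struct kvm_userspace_memory_region *region)
{
        unsigned long nodemask = 1UL << node;

        set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8);

        region->flags |= KVM_MEM_LOG_DIRTY_PAGES;
        ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, region);

        set_mempolicy(MPOL_DEFAULT, NULL, 0);
}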
And if anyone ever wants to fix the aforementioned gap, the two fixes that come to
mind would be to tie the policy to the memslot, or to do page_to_nid() on the
underlying page (with the assumption that memory that's not backed by struct page
isn't NUMA-aware, so fall back to task policy). At that point, having one-off code
to pull the node from the existing page tables is also unnecessary.
On 8/2/22 19:22, Sean Christopherson wrote:
> Userspace can already force the ideal setup for eager page splitting by configuring
> vNUMA-aware memslots and using a task with appropriate policy to toggle dirty
> logging. And userspace really should be encouraged to do that, because otherwise
> walking the page tables in software to do the split is going to be constantly
> accessing remote memory.
Yes, it's possible to locate the page tables on the node that holds the
memory they're mapping by enabling dirty logging from different tasks for
different memslots, but that seems a bit weird.
Walking the page tables in software is going to do several remote memory
accesses, but it will do that in a thread that probably is devoted to
background tasks anyway. The relative impact of remote memory accesses
in the thread that enables dirty logging vs. in the vCPU thread should
also be visible in the overall performance of dirty_log_perf_test.
So I agree with Vipin's patch and would even extend it to all page table
allocations; however, dirty_log_perf_test should be run with fixed CPU
mappings to accurately measure the impact of the remote memory accesses.
Paolo
On Tue, Aug 02, 2022, Paolo Bonzini wrote:
> On 8/2/22 19:22, Sean Christopherson wrote:
> > Userspace can already force the ideal setup for eager page splitting by configuring
> > vNUMA-aware memslots and using a task with appropriate policy to toggle dirty
> > logging. And userspace really should be encouraged to do that, because otherwise
> > walking the page tables in software to do the split is going to be constantly
> > accessing remote memory.
>
> Yes, it's possible to locate the page tables on the node that holds the
> memory they're mapping by enable dirty logging from different tasks for
> different memslots, but that seems a bit weird.
Gah, I misread this patch. More below.
But FWIW, my understanding is that Google's userspace already creates per-node
memslots with dedicated tasks for dirty logging operations. There are multiple
advantages to such a setup, e.g. slots can be processed in parallel (with the
TDP MMU), and when the VM enters blackout the tasks can be affined to the physical
CPUs that were just vacated by the VM.
> Walking the page tables in software is going to do several remote memory
> accesses, but it will do that in a thread that probably is devoted to
> background tasks anyway.
I agree that enabling dirty logging isn't exactly performance critical, but reaping
the dirty log is very much in the critical path. My point is that userspace that
cares about NUMA should be encouraged to create an optimal setup, at which point
this one-off override is unnecessary.
> The relative impact of remote memory accesses in the thread that enables
> dirty logging vs. in the vCPU thread should also be visible in the overall
> performance of dirty_log_perf_test.
>
> So I agree with Vipin's patch and would even extend it to all page table
> allocations,
Ah crud, I misread the patch. I was thinking tdp_mmu_spte_to_nid() was getting
the node for the existing shadow page, but it's actually getting the node for the
underlying physical huge page.
That means this patch is broken now that KVM doesn't require a "struct page" to
create huge SPTEs (commit a8ac499bb6ab ("KVM: x86/mmu: Don't require refcounted
"struct page" to create huge SPTEs"). I.e. this will explode as pfn_to_page()
may return garbage.
return page_to_nid(pfn_to_page(spte_to_pfn(spte)));
That said, I still don't like this patch, at least not as a one-off thing. I do
like the idea of having page table allocations follow the node requested for the
target pfn, what I don't like is having one-off behavior that silently overrides
userspace policy.
I would much rather we solve the problem for all page table allocations, maybe
with a KVM_CAP or module param? Unfortunately, that's a very non-trivial change
because it will probably require having a per-node cache in each of the per-vCPU
caches.
Hmm, actually, a not-awful alternative would be to have the fault path fallback
to the current task's policy if allocation fails. I.e. make it best effort.
E.g. as a rough sketch...
---
arch/x86/kvm/mmu/tdp_mmu.c | 37 ++++++++++++++++++++++++++++++-------
1 file changed, 30 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bf2ccf9debca..e475f5b55794 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -273,10 +273,11 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
{
+ struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_shadow_page_cache;
struct kvm_mmu_page *sp;
sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
- sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+ sp->spt = kvm_mmu_alloc_shadow_page_table(cache, GFP_NOWAIT, pfn);
return sp;
}
@@ -1190,7 +1191,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (is_removed_spte(iter.old_spte))
break;
- sp = tdp_mmu_alloc_sp(vcpu);
+ sp = tdp_mmu_alloc_sp(vcpu, fault->pfn);
tdp_mmu_init_child_sp(sp, &iter);
if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
@@ -1402,17 +1403,39 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
return spte_set;
}
-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
+void *kvm_mmu_alloc_shadow_page_table(struct kvm_mmu_memory_cache *cache,
+ gfp_t gfp, kvm_pfn_t pfn)
+{
+ struct page *page = kvm_pfn_to_refcounted_page(pfn);
+ struct page *spt_page;
+ int node;
+
+ gfp |= __GFP_ZERO | __GFP_ACCOUNT;
+
+ if (page) {
+ spt_page = alloc_pages_node(page_to_nid(page), gfp, 0);
+ if (spt_page)
+ return page_address(spt_page);
+ } else if (!cache) {
+ return (void *)__get_free_page(gfp);
+ }
+
+ if (cache)
+ return kvm_mmu_memory_cache_alloc(cache);
+
+ return NULL;
+}
+
+static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp,
+ kvm_pfn_t pfn)
{
struct kvm_mmu_page *sp;
- gfp |= __GFP_ZERO;
-
sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
if (!sp)
return NULL;
- sp->spt = (void *)__get_free_page(gfp);
+ sp->spt = kvm_mmu_alloc_shadow_page_table(NULL, gfp, pfn);
if (!sp->spt) {
kmem_cache_free(mmu_page_header_cache, sp);
return NULL;
@@ -1436,7 +1459,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
* If this allocation fails we drop the lock and retry with reclaim
* allowed.
*/
- sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+ sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT);
if (sp)
return sp;
base-commit: f8990bfe1eab91c807ca8fc0d48705e8f986b951
--
On 8/2/22 21:12, Sean Christopherson wrote:
> Ah crud, I misread the patch. I was thinking tdp_mmu_spte_to_nid() was getting
> the node for the existing shadow page, but it's actually getting the node for the
> underlying physical huge page.
>
> That means this patch is broken now that KVM doesn't require a "struct page" to
> create huge SPTEs (commit a8ac499bb6ab ("KVM: x86/mmu: Don't require refcounted
> "struct page" to create huge SPTEs"). I.e. this will explode as pfn_to_page()
> may return garbage.
>
> return page_to_nid(pfn_to_page(spte_to_pfn(spte)));
I was about to say that yesterday. However my knowledge of struct page
things has proved to be spotty enough that I wasn't 100% sure of
that. But yeah, with a fresh mind it's quite obvious that anything that
goes through hva_to_pfn_remap and similar paths will fail.
> That said, I still don't like this patch, at least not as a one-off thing. I do
> like the idea of having page table allocations follow the node requested for the
> target pfn, what I don't like is having one-off behavior that silently overrides
> userspace policy.
Yes, I totally agree with making it all or nothing.
> I would much rather we solve the problem for all page table allocations, maybe
> with a KVM_CAP or module param? Unfortunately, that's a very non-trivial change
> because it will probably require having a per-node cache in each of the per-vCPU
> caches.
A module parameter is fine. If you care about performance to this
level, your userspace is probably homogeneous enough.
Paolo
> Hmm, actually, a not-awful alternative would be to have the fault path fallback
> to the current task's policy if allocation fails. I.e. make it best effort.
>
> E.g. as a rough sketch...
>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 37 ++++++++++++++++++++++++++++++-------
> 1 file changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index bf2ccf9debca..e475f5b55794 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -273,10 +273,11 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
>
> static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> {
> + struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_shadow_page_cache;
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> + sp->spt = kvm_mmu_alloc_shadow_page_table(cache, GFP_NOWAIT, pfn);
>
> return sp;
> }
> @@ -1190,7 +1191,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> if (is_removed_spte(iter.old_spte))
> break;
>
> - sp = tdp_mmu_alloc_sp(vcpu);
> + sp = tdp_mmu_alloc_sp(vcpu, fault->pfn);
> tdp_mmu_init_child_sp(sp, &iter);
>
> if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
> @@ -1402,17 +1403,39 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
> return spte_set;
> }
>
> -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> +void *kvm_mmu_alloc_shadow_page_table(struct kvm_mmu_memory_cache *cache,
> + gfp_t gfp, kvm_pfn_t pfn)
> +{
> + struct page *page = kvm_pfn_to_refcounted_page(pfn);
> + struct page *spt_page;
> + int node;
> +
> + gfp |= __GFP_ZERO | __GFP_ACCOUNT;
> +
> + if (page) {
> + spt_page = alloc_pages_node(page_to_nid(page), gfp, 0);
> + if (spt_page)
> + return page_address(spt_page);
> + } else if (!cache) {
> + return (void *)__get_free_page(gfp);
> + }
> +
> + if (cache)
> + return kvm_mmu_memory_cache_alloc(cache);
> +
> + return NULL;
> +}
> +
> +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp,
> + kvm_pfn_t pfn)
> {
> struct kvm_mmu_page *sp;
>
> - gfp |= __GFP_ZERO;
> -
> sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
> if (!sp)
> return NULL;
>
> - sp->spt = (void *)__get_free_page(gfp);
> + sp->spt = kvm_mmu_alloc_shadow_page_table(NULL, gfp, pfn);
> if (!sp->spt) {
> kmem_cache_free(mmu_page_header_cache, sp);
> return NULL;
> @@ -1436,7 +1459,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> * If this allocation fails we drop the lock and retry with reclaim
> * allowed.
> */
> - sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
> + sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT);
> if (sp)
> return sp;
>
>
> base-commit: f8990bfe1eab91c807ca8fc0d48705e8f986b951
> --
>
On Wed, Aug 3, 2022 at 8:08 AM Paolo Bonzini <[email protected]> wrote:
>
> On 8/2/22 21:12, Sean Christopherson wrote:
> > Ah crud, I misread the patch. I was thinking tdp_mmu_spte_to_nid() was getting
> > the node for the existing shadow page, but it's actually getting the node for the
> > underlying physical huge page.
> >
> > That means this patch is broken now that KVM doesn't require a "struct page" to
> > create huge SPTEs (commit a8ac499bb6ab ("KVM: x86/mmu: Don't require refcounted
> > "struct page" to create huge SPTEs"). I.e. this will explode as pfn_to_page()
> > may return garbage.
> >
> > return page_to_nid(pfn_to_page(spte_to_pfn(spte)));
>
> I was about to say that yesterday. However my knowledge of struct page
> things has been proved to be spotty enough, that I wasn't 100% sure of
> that. But yeah, with a fresh mind it's quite obvious that anything that
> goes through hva_to_pfn_remap and similar paths will fail.
>
> > That said, I still don't like this patch, at least not as a one-off thing. I do
> > like the idea of having page table allocations follow the node requested for the
> > target pfn, what I don't like is having one-off behavior that silently overrides
> > userspace policy.
>
> Yes, I totally agree with making it all or nothing.
>
> > I would much rather we solve the problem for all page table allocations, maybe
> > with a KVM_CAP or module param? Unfortunately, that's a very non-trivial change
> > because it will probably require having a per-node cache in each of the per-vCPU
> > caches.
>
> A module parameter is fine. If you care about performance to this
> level, your userspace is probably homogeneous enough.
>
Thank you all for the feedback and suggestions.
Regarding dirty_log_perf_test, I will send out a patch to add an
option which will pin vCPUs to physical CPUs using sched_setaffinity()
calls. This should increase the accuracy of the test.
It seems like we are agreeing on two things:
1. Keep the same behavior for page table page allocation both during
page faults and during eager page splitting.
2. Page table pages should be allocated on the same node where the
underlying guest memory page resides.
Here are the two approaches; please provide feedback on which one
looks more appropriate before I start spamming your inbox with my
patches.
Approach A:
Have a per-NUMA-node cache for page table page allocation
Instead of having only one mmu_shadow_page_cache per vCPU, we provide
multiple caches, one for each node,
either:
mmu_shadow_page_cache[MAX_NUMNODES]
or
mmu_shadow_page_cache->objects[MAX_NUMNODES * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE]
We can decrease KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE to some lower value
instead of 40 to control memory consumption.
When a fault happens, use the pfn to find which node the page should
belong to and use the corresponding cache to get page table pages.
struct page *page = kvm_pfn_to_refcounted_page(pfn);
int nid;
if (page) {
        nid = page_to_nid(page);
} else {
        nid = numa_node_id();
}
...
tdp_mmu_alloc_sp(nid, vcpu);
...
static struct kvm_mmu_page *tdp_mmu_alloc_sp(int nid, struct kvm_vcpu *vcpu) {
...
sp->spt = kvm_mmu_memory_cache_alloc(nid,
&vcpu->arch.mmu_shadow_page_cache);
...
}
Since we are changing cache allocation for page table pages, should we
also make similar changes for other caches like mmu_page_header_cache,
mmu_gfn_array_cache, and mmu_pte_list_desc_cache? I am not sure how
good this idea is.
Approach B:
Ask for a page from the specific node on the fault path, with the option
to fall back to the original cache and the default task policy.
This is what Sean's rough patch looks like.
On Fri, Aug 5, 2022 at 4:30 PM Vipin Sharma <[email protected]> wrote:
[...]
>
> Here are the two approaches, please provide feedback on which one
> looks more appropriate before I start spamming your inbox with my
> patches
>
> Approach A:
> Have per numa node cache for page table pages allocation
>
> Instead of having only one mmu_shadow_page_cache per vcpu, we provide
> multiple caches for each node
>
> either:
> mmu_shadow_page_cache[MAX_NUMNODES]
> or
> mmu_shadow_page_cache->objects[MAX_NUMNODES * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE]
I think the former approach will work better. The objects[] array is
allocated dynamically during top-up now, so if a vCPU never allocates
a page table to map memory on a given node, KVM will never have to
allocate an objects[] array for that node. Whereas with the latter
approach KVM would have to allocate the entire objects[] array
up-front.
>
> We can decrease KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE to some lower value
> instead of 40 to control memory consumption.
I'm not sure we are getting any performance benefit from the cache
size being so high. It doesn't fundamentally change the number of
times a vCPU thread will have to call __get_free_page(), it just
batches more of those calls together. Assuming reducing the cache size
doesn't impact performance, I think it's a good idea to reduce it as
part of this feature.
KVM needs at most PT64_ROOT_MAX_LEVEL (5) page tables to handle a
fault. So we could decrease the mmu_shadow_page_cache.objects[]
capacity to PT64_ROOT_MAX_LEVEL (5) and support up to 8 NUMA nodes
without increasing memory usage. If a user wants to run a VM on an
even larger machine, I think it's safe to consume a few extra bytes
for the vCPU shadow page caches at that point (the machine probably
has 10s of TiB of RAM).
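A sizing sketch under that assumption (hypothetical per-vCPU array; assumes
the per-cache capacity field added for eager page splitting):

/*
 * One shadow page cache per node, each capped at the number of page
 * tables a single fault can consume. On an 8-node machine that is
 * 5 * 8 = 40 cached pages per vCPU, i.e. no more than today's single
 * 40-object cache.
 */
struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];

static void mmu_init_shadow_page_caches(struct kvm_vcpu *vcpu)
{
        int node;

        for (node = 0; node < MAX_NUMNODES; node++) {
                vcpu->arch.mmu_shadow_page_cache[node].capacity =
                        PT64_ROOT_MAX_LEVEL;
                vcpu->arch.mmu_shadow_page_cache[node].gfp_zero = __GFP_ZERO;
        }
}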
>
> When a fault happens, use the pfn to find which node the page should
> belong to and use the corresponding cache to get page table pages.
>
> struct *page = kvm_pfn_to_refcounted_page(pfn);
> int nid;
> if(page) {
> nid = page_to_nid(page);
> } else {
> nid = numa_node_id();
> }
>
> ...
> tdp_mmu_alloc_sp(nid, vcpu);
> ...
>
> static struct kvm_mmu_page *tdp_mmu_alloc_sp(int nid, struct kvm_vcpu *vcpu) {
> ...
> sp->spt = kvm_mmu_memory_cache_alloc(nid,
> &vcpu->arch.mmu_shadow_page_cache);
> ...
> }
>
>
> Since we are changing cache allocation for page table pages, should we
> also make similar changes for other caches like mmu_page_header_cache,
> mmu_gfn_array_cache, and mmu_pte_list_desc_cache? I am not sure how
> good this idea is.
We don't currently have a reason to make these objects NUMA-aware, so
I would only recommend it if it somehow makes the code a lot simpler.
>
> Approach B:
> Ask page from the specific node on fault path with option to fallback
> to the original cache and default task policy.
>
> This is what Sean's rough patch looks like.
This would definitely be a simpler approach but could increase the
amount of time a vCPU thread holds the MMU lock when handling a fault,
since KVM would start performing GFP_NOWAIT allocations under the
lock. So my preference would be to try the cache approach first and
see how complex it turns out to be.
On Tue, Aug 09, 2022, David Matlack wrote:
> On Fri, Aug 5, 2022 at 4:30 PM Vipin Sharma <[email protected]> wrote:
> > Approach B:
> > Ask page from the specific node on fault path with option to fallback
> > to the original cache and default task policy.
> >
> > This is what Sean's rough patch looks like.
>
> This would definitely be a simpler approach but could increase the
> amount of time a vCPU thread holds the MMU lock when handling a fault,
> since KVM would start performing GFP_NOWAIT allocations under the
> lock. So my preference would be to try the cache approach first and
> see how complex it turns out to be.
Ya, as discussed off-list, I don't like my idea either :-)
The pfn and thus node information is available before mmu_lock is acquired, so I
don't see any reason to defer the allocation other than to reduce the memory
footprint, and that's a solvable problem one way or another.
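FWIW, a very rough sketch of the shape that could take (all names here are
hypothetical: the per-node mmu_shadow_page_cache[], last_fault_nid, and
kvm_mmu_topup_memory_cache_node() do not exist today):

/* Pick the node from the pfn and top up its cache before taking mmu_lock. */
static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu, kvm_pfn_t pfn)
{
        struct page *page = kvm_pfn_to_refcounted_page(pfn);
        int nid = page ? page_to_nid(page) : numa_node_id();

        vcpu->arch.last_fault_nid = nid;
        return kvm_mmu_topup_memory_cache_node(
                        &vcpu->arch.mmu_shadow_page_cache[nid],
                        PT64_ROOT_MAX_LEVEL, nid);
}

/* Under mmu_lock, allocate from the cache that was topped up. */
static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
{
        int nid = vcpu->arch.last_fault_nid;
        struct kvm_mmu_page *sp;

        sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
        sp->spt = kvm_mmu_memory_cache_alloc(
                        &vcpu->arch.mmu_shadow_page_cache[nid]);
        return sp;
}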