2019-12-11 21:33:08

by Barret Rhoden

Subject: [PATCH v4 0/2] kvm: Use huge pages for DAX-backed files

This patchset allows KVM to map huge pages for DAX-backed files.

I held previous versions in limbo while people were sorting out whether
or not DAX pages were going to remain PageReserved and how that relates
to KVM.

Now that that is sorted out (DAX pages are PageReserved, but they are
not kvm_is_reserved_pfn(), and DAX pages are considered on a
case-by-case basis for KVM), I can repost this.

v3 -> v4:
v3: https://lore.kernel.org/lkml/[email protected]/
- Rebased onto linus/master

v2 -> v3:
v2: https://lore.kernel.org/lkml/[email protected]/
- Updated Acks/Reviewed-by
- Rebased onto linux-next

v1 -> v2:
https://lore.kernel.org/lkml/[email protected]/
- Updated Acks/Reviewed-by
- Minor touchups
- Added patch to remove redundant PageReserved() check
- Rebased onto linux-next

RFC/discussion thread:
https://lore.kernel.org/lkml/[email protected]/

Barret Rhoden (2):
mm: make dev_pagemap_mapping_shift() externally visible
kvm: Use huge pages for DAX-backed files

arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
include/linux/mm.h | 3 +++
mm/memory-failure.c | 38 +++-----------------------------------
mm/util.c | 34 ++++++++++++++++++++++++++++++++++
4 files changed, 72 insertions(+), 39 deletions(-)

--
2.24.0.525.g8f36a354ae-goog


2019-12-11 21:33:25

by Barret Rhoden

Subject: [PATCH v4 1/2] mm: make dev_pagemap_mapping_shift() externally visible

KVM has a use case for determining the size of a dax mapping. The KVM
code has easy access to the address and the mm; hence the change in
parameters.

Signed-off-by: Barret Rhoden <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Dan Williams <[email protected]>
---
include/linux/mm.h | 3 +++
mm/memory-failure.c | 38 +++-----------------------------------
mm/util.c | 34 ++++++++++++++++++++++++++++++++++
3 files changed, 40 insertions(+), 35 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b0ef04b6d15..f88bcc6a3bd1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -998,6 +998,9 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
#define page_ref_zero_or_close_to_overflow(page) \
((unsigned int) page_ref_count(page) + 127u <= 127u)

+unsigned long dev_pagemap_mapping_shift(unsigned long address,
+ struct mm_struct *mm);
+
static inline void get_page(struct page *page)
{
page = compound_head(page);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 41c634f45d45..9dc487e73d9b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -261,40 +261,6 @@ void shake_page(struct page *p, int access)
}
EXPORT_SYMBOL_GPL(shake_page);

-static unsigned long dev_pagemap_mapping_shift(struct page *page,
- struct vm_area_struct *vma)
-{
- unsigned long address = vma_address(page, vma);
- pgd_t *pgd;
- p4d_t *p4d;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
-
- pgd = pgd_offset(vma->vm_mm, address);
- if (!pgd_present(*pgd))
- return 0;
- p4d = p4d_offset(pgd, address);
- if (!p4d_present(*p4d))
- return 0;
- pud = pud_offset(p4d, address);
- if (!pud_present(*pud))
- return 0;
- if (pud_devmap(*pud))
- return PUD_SHIFT;
- pmd = pmd_offset(pud, address);
- if (!pmd_present(*pmd))
- return 0;
- if (pmd_devmap(*pmd))
- return PMD_SHIFT;
- pte = pte_offset_map(pmd, address);
- if (!pte_present(*pte))
- return 0;
- if (pte_devmap(*pte))
- return PAGE_SHIFT;
- return 0;
-}
-
/*
* Failure handling: if we can't find or can't kill a process there's
* not much we can do. We just print a message and ignore otherwise.
@@ -318,7 +284,9 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,

tk->addr = page_address_in_vma(p, vma);
if (is_zone_device_page(p))
- tk->size_shift = dev_pagemap_mapping_shift(p, vma);
+ tk->size_shift =
+ dev_pagemap_mapping_shift(vma_address(page, vma),
+ vma->vm_mm);
else
tk->size_shift = page_shift(compound_head(p));

diff --git a/mm/util.c b/mm/util.c
index 988d11e6c17c..553fbe1692ed 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -911,3 +911,37 @@ int memcmp_pages(struct page *page1, struct page *page2)
kunmap_atomic(addr1);
return ret;
}
+
+unsigned long dev_pagemap_mapping_shift(unsigned long address,
+ struct mm_struct *mm)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ return 0;
+ p4d = p4d_offset(pgd, address);
+ if (!p4d_present(*p4d))
+ return 0;
+ pud = pud_offset(p4d, address);
+ if (!pud_present(*pud))
+ return 0;
+ if (pud_devmap(*pud))
+ return PUD_SHIFT;
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd))
+ return 0;
+ if (pmd_devmap(*pmd))
+ return PMD_SHIFT;
+ pte = pte_offset_map(pmd, address);
+ if (!pte_present(*pte))
+ return 0;
+ if (pte_devmap(*pte))
+ return PAGE_SHIFT;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_mapping_shift);
--
2.24.0.525.g8f36a354ae-goog

2019-12-11 21:33:54

by Barret Rhoden

Subject: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files

This change allows KVM to map DAX-backed files made of huge pages with
huge mappings in the EPT/TDP.

DAX pages are not PageTransCompound. The existing check is trying to
determine if the mapping for the pfn is a huge mapping or not. For
non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
For DAX, we can check the page table itself.

Note that KVM already faulted in the page (or huge page) in the host's
page table, and we hold the KVM mmu spinlock. We grabbed that lock in
kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.

Signed-off-by: Barret Rhoden <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
1 file changed, 32 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f92b40d798c..cd07bc4e595f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
return -EFAULT;
}

+static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+ unsigned long hva;
+
+ if (!is_zone_device_page(page))
+ return PageTransCompoundMap(page);
+
+ /*
+ * DAX pages do not use compound pages. The page should have already
+ * been mapped into the host-side page table during try_async_pf(), so
+ * we can check the page tables directly.
+ */
+ hva = gfn_to_hva(kvm, gfn);
+ if (kvm_is_error_hva(hva))
+ return false;
+
+ /*
+ * Our caller grabbed the KVM mmu_lock with a successful
+ * mmu_notifier_retry, so we're safe to walk the page table.
+ */
+ switch (dev_pagemap_mapping_shift(hva, current->mm)) {
+ case PMD_SHIFT:
+ case PUD_SIZE:
+ return true;
+ }
+ return false;
+}
+
static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
gfn_t gfn, kvm_pfn_t *pfnp,
int *levelp)
@@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
* here.
*/
if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
- !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
- PageTransCompoundMap(pfn_to_page(pfn)) &&
+ level == PT_PAGE_TABLE_LEVEL &&
+ pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
!mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
unsigned long mask;
/*
@@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
* mapping if the indirect sp has level = 1.
*/
if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
- !kvm_is_zone_device_pfn(pfn) &&
- PageTransCompoundMap(pfn_to_page(pfn))) {
+ pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
pte_list_remove(rmap_head, sptep);

if (kvm_available_flush_tlb_with_range())
--
2.24.0.525.g8f36a354ae-goog

2019-12-12 00:23:16

by Paolo Bonzini

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files

On 11/12/19 22:32, Barret Rhoden wrote:
> + /*
> + * Our caller grabbed the KVM mmu_lock with a successful
> + * mmu_notifier_retry, so we're safe to walk the page table.
> + */
> + switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> + case PMD_SHIFT:
> + case PUD_SIZE:
> + return true;
> + }
> + return false;

Should this simply be "> PAGE_SHIFT"?
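
That is, something along these lines (untested, just to illustrate the idea):

	return dev_pagemap_mapping_shift(hva, current->mm) > PAGE_SHIFT;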

Paolo

2019-12-12 00:24:05

by Paolo Bonzini

Subject: Re: [PATCH v4 0/2] kvm: Use huge pages for DAX-backed files

On 11/12/19 22:32, Barret Rhoden wrote:
> This patchset allows KVM to map huge pages for DAX-backed files.
>
> I held previous versions in limbo while people were sorting out whether
> or not DAX pages were going to remain PageReserved and how that relates
> to KVM.
>
> Now that that is sorted out (DAX pages are PageReserved, but they are
> not kvm_is_reserved_pfn(), and DAX pages are considered on a
> case-by-case basis for KVM), I can repost this.
>
> v3 -> v4:
> v3: https://lore.kernel.org/lkml/[email protected]/
> - Rebased onto linus/master
>
> v2 -> v3:
> v2: https://lore.kernel.org/lkml/[email protected]/
> - Updated Acks/Reviewed-by
> - Rebased onto linux-next
>
> v1 -> v2:
> https://lore.kernel.org/lkml/[email protected]/
> - Updated Acks/Reviewed-by
> - Minor touchups
> - Added patch to remove redundant PageReserved() check
> - Rebased onto linux-next
>
> RFC/discussion thread:
> https://lore.kernel.org/lkml/[email protected]/
>
> Barret Rhoden (2):
> mm: make dev_pagemap_mapping_shift() externally visible
> kvm: Use huge pages for DAX-backed files
>
> arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
> include/linux/mm.h | 3 +++
> mm/memory-failure.c | 38 +++-----------------------------------
> mm/util.c | 34 ++++++++++++++++++++++++++++++++++
> 4 files changed, 72 insertions(+), 39 deletions(-)
>

Thanks, with the Acked-by already in place for patch 1 I can pick this up.

Paolo

2019-12-12 12:23:18

by David Hildenbrand

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files

On 11.12.19 22:32, Barret Rhoden wrote:
> This change allows KVM to map DAX-backed files made of huge pages with
> huge mappings in the EPT/TDP.
>
> DAX pages are not PageTransCompound. The existing check is trying to
> determine if the mapping for the pfn is a huge mapping or not. For
> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> For DAX, we can check the page table itself.
>
> Note that KVM already faulted in the page (or huge page) in the host's
> page table, and we hold the KVM mmu spinlock. We grabbed that lock in
> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
>
> Signed-off-by: Barret Rhoden <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
> 1 file changed, 32 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f92b40d798c..cd07bc4e595f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> return -EFAULT;
> }
>
> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> + struct page *page = pfn_to_page(pfn);
> + unsigned long hva;
> +
> + if (!is_zone_device_page(page))
> + return PageTransCompoundMap(page);
> +
> + /*
> + * DAX pages do not use compound pages. The page should have already
> + * been mapped into the host-side page table during try_async_pf(), so
> + * we can check the page tables directly.
> + */
> + hva = gfn_to_hva(kvm, gfn);
> + if (kvm_is_error_hva(hva))
> + return false;
> +
> + /*
> + * Our caller grabbed the KVM mmu_lock with a successful
> + * mmu_notifier_retry, so we're safe to walk the page table.
> + */
> + switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> + case PMD_SHIFT:
> + case PUD_SIZE:

Shouldn't this be PUD_SHIFT?

But I agree with Paolo, that this is simply

return dev_pagemap_mapping_shift(hva, current->mm) > PAGE_SHIFT;

> + return true;
> + }
> + return false;
> +}
> +
> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> gfn_t gfn, kvm_pfn_t *pfnp,
> int *levelp)
> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> * here.
> */
> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> - !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> - PageTransCompoundMap(pfn_to_page(pfn)) &&
> + level == PT_PAGE_TABLE_LEVEL &&
> + pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
> !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
> unsigned long mask;
> /*
> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> * mapping if the indirect sp has level = 1.
> */
> if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> - !kvm_is_zone_device_pfn(pfn) &&
> - PageTransCompoundMap(pfn_to_page(pfn))) {
> + pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
> pte_list_remove(rmap_head, sptep);
>
> if (kvm_available_flush_tlb_with_range())
>

Patch itself looks good to me (especially as it cleans up these two places a
bit). I am not an expert on the locking part, so I can't give my RB.

--
Thanks,

David / dhildenb

2019-12-12 12:36:24

by Liran Alon

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files



> On 11 Dec 2019, at 23:32, Barret Rhoden <[email protected]> wrote:
>
> This change allows KVM to map DAX-backed files made of huge pages with
> huge mappings in the EPT/TDP.
>
> DAX pages are not PageTransCompound. The existing check is trying to
> determine if the mapping for the pfn is a huge mapping or not. For
> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> For DAX, we can check the page table itself.

For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
will return the page-size of the hugetlbfs without the need to parse the page-tables.
See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().

Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.

Therefore, it seems more logical to me that:
(a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables.
(b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.
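
(For reference, the vma_kernel_pagesize() that kvm_host_page_size() ends up calling is roughly the following -- quoting from memory, so treat it as a sketch:

	unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
	{
		if (vma->vm_ops && vma->vm_ops->pagesize)
			return vma->vm_ops->pagesize(vma);
		return PAGE_SIZE;
	}

i.e. a mapping whose vm_ops reports a larger page size is picked up without any page-table parsing.)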

>
> Note that KVM already faulted in the page (or huge page) in the host's
> page table, and we hold the KVM mmu spinlock. We grabbed that lock in
> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
>
> Signed-off-by: Barret Rhoden <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
> 1 file changed, 32 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f92b40d798c..cd07bc4e595f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> return -EFAULT;
> }
>
> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> + struct page *page = pfn_to_page(pfn);
> + unsigned long hva;
> +
> + if (!is_zone_device_page(page))
> + return PageTransCompoundMap(page);
> +
> + /*
> + * DAX pages do not use compound pages. The page should have already
> + * been mapped into the host-side page table during try_async_pf(), so
> + * we can check the page tables directly.
> + */
> + hva = gfn_to_hva(kvm, gfn);
> + if (kvm_is_error_hva(hva))
> + return false;
> +
> + /*
> + * Our caller grabbed the KVM mmu_lock with a successful
> + * mmu_notifier_retry, so we're safe to walk the page table.
> + */
> + switch (dev_pagemap_mapping_shift(hva, current->mm)) {

Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter?
Was this changed by a commit I missed?

-Liran

> + case PMD_SHIFT:
> + case PUD_SIZE:
> + return true;
> + }
> + return false;
> +}
> +
> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> gfn_t gfn, kvm_pfn_t *pfnp,
> int *levelp)
> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> * here.
> */
> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> - !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> - PageTransCompoundMap(pfn_to_page(pfn)) &&
> + level == PT_PAGE_TABLE_LEVEL &&
> + pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
> !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
> unsigned long mask;
> /*
> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> * mapping if the indirect sp has level = 1.
> */
> if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> - !kvm_is_zone_device_pfn(pfn) &&
> - PageTransCompoundMap(pfn_to_page(pfn))) {
> + pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
> pte_list_remove(rmap_head, sptep);
>
> if (kvm_available_flush_tlb_with_range())
> --
> 2.24.0.525.g8f36a354ae-goog
>

2019-12-12 16:32:46

by Barret Rhoden

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files

On 12/12/19 7:22 AM, David Hildenbrand wrote:
>> + /*
>> + * Our caller grabbed the KVM mmu_lock with a successful
>> + * mmu_notifier_retry, so we're safe to walk the page table.
>> + */
>> + switch (dev_pagemap_mapping_shift(hva, current->mm)) {
>> + case PMD_SHIFT:
>> + case PUD_SIZE:
>
> Shouldn't this be PUD_SHIFT?
>
> But I agree with Paolo, that this is simply
>
> return dev_pagemap_mapping_shift(hva, current->mm) > PAGE_SHIFT;

Yes, good call. I'll fix that in the next version.

2019-12-12 16:56:03

by Dan Williams

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files

On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <[email protected]> wrote:
>
>
>
> > On 11 Dec 2019, at 23:32, Barret Rhoden <[email protected]> wrote:
> >
> > This change allows KVM to map DAX-backed files made of huge pages with
> > huge mappings in the EPT/TDP.
> >
> > DAX pages are not PageTransCompound. The existing check is trying to
> > determine if the mapping for the pfn is a huge mapping or not. For
> > non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> > For DAX, we can check the page table itself.
>
> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
> will return the page-size of the hugetlbfs without the need to parse the page-tables.
> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().
>
> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.
>
> Therefore, it seems more logical to me that:
> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables.

A given dax-mapped vma may have mixed page sizes so ->page_size()
can't be used reliably to enumerate the mapping size.

> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.

DAX pages do not participate in THP and do not have the
PageTransCompound accounting. The only mechanism that records the
mapping size for dax is the page tables themselves.
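
Concretely, that means any check for a huge dax mapping has to branch on the
page type, roughly (just restating what the patch above does, not new code):

	if (is_zone_device_page(page))
		/* DAX: only the page tables record the mapping size */
		return dev_pagemap_mapping_shift(hva, current->mm) > PAGE_SHIFT;
	return PageTransCompoundMap(page);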


>
> >
> > Note that KVM already faulted in the page (or huge page) in the host's
> > page table, and we hold the KVM mmu spinlock. We grabbed that lock in
> > kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
> >
> > Signed-off-by: Barret Rhoden <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
> > 1 file changed, 32 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6f92b40d798c..cd07bc4e595f 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> > return -EFAULT;
> > }
> >
> > +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> > +{
> > + struct page *page = pfn_to_page(pfn);
> > + unsigned long hva;
> > +
> > + if (!is_zone_device_page(page))
> > + return PageTransCompoundMap(page);
> > +
> > + /*
> > + * DAX pages do not use compound pages. The page should have already
> > + * been mapped into the host-side page table during try_async_pf(), so
> > + * we can check the page tables directly.
> > + */
> > + hva = gfn_to_hva(kvm, gfn);
> > + if (kvm_is_error_hva(hva))
> > + return false;
> > +
> > + /*
> > + * Our caller grabbed the KVM mmu_lock with a successful
> > + * mmu_notifier_retry, so we're safe to walk the page table.
> > + */
> > + switch (dev_pagemap_mapping_shift(hva, current->mm)) {
>
> Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter?
> Was this changed by a commit I missed?
>
> -Liran
>
> > + case PMD_SHIFT:
> > + case PUD_SIZE:
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> > gfn_t gfn, kvm_pfn_t *pfnp,
> > int *levelp)
> > @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> > * here.
> > */
> > if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> > - !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> > - PageTransCompoundMap(pfn_to_page(pfn)) &&
> > + level == PT_PAGE_TABLE_LEVEL &&
> > + pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
> > !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
> > unsigned long mask;
> > /*
> > @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> > * mapping if the indirect sp has level = 1.
> > */
> > if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> > - !kvm_is_zone_device_pfn(pfn) &&
> > - PageTransCompoundMap(pfn_to_page(pfn))) {
> > + pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
> > pte_list_remove(rmap_head, sptep);
> >
> > if (kvm_available_flush_tlb_with_range())
> > --
> > 2.24.0.525.g8f36a354ae-goog
> >
>

2019-12-12 17:05:32

by Barret Rhoden

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files

On 12/12/19 7:33 AM, Liran Alon wrote:
>> + /*
>> + * Our caller grabbed the KVM mmu_lock with a successful
>> + * mmu_notifier_retry, so we're safe to walk the page table.
>> + */
>> + switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter?
> Was this changed by a commit I missed?

I changed this in Patch 1. The place I call it in KVM has the address
and mm available, which is the only thing dev_pagemap_mapping_shift()
really needs. (The first thing it did was convert page to address).

I'll add some more text to patch 1's commit message about that.
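
For reference, the only conversion the old interface did before walking the
page tables was roughly (per the code removed in patch 1):

	unsigned long address = vma_address(page, vma);
	struct mm_struct *mm = vma->vm_mm;

so taking (address, mm) directly lets KVM call it without having a vma in hand.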

Thanks,

Barret


2019-12-12 17:36:13

by Sean Christopherson

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files

On Wed, Dec 11, 2019 at 04:32:07PM -0500, Barret Rhoden wrote:
> This change allows KVM to map DAX-backed files made of huge pages with
> huge mappings in the EPT/TDP.
>
> DAX pages are not PageTransCompound. The existing check is trying to
> determine if the mapping for the pfn is a huge mapping or not. For
> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> For DAX, we can check the page table itself.
>
> Note that KVM already faulted in the page (or huge page) in the host's
> page table, and we hold the KVM mmu spinlock. We grabbed that lock in
> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
>
> Signed-off-by: Barret Rhoden <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
> 1 file changed, 32 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f92b40d798c..cd07bc4e595f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> return -EFAULT;
> }
>
> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> + struct page *page = pfn_to_page(pfn);
> + unsigned long hva;
> +
> + if (!is_zone_device_page(page))
> + return PageTransCompoundMap(page);
> +
> + /*
> + * DAX pages do not use compound pages. The page should have already
> + * been mapped into the host-side page table during try_async_pf(), so
> + * we can check the page tables directly.
> + */
> + hva = gfn_to_hva(kvm, gfn);
> + if (kvm_is_error_hva(hva))
> + return false;
> +
> + /*
> + * Our caller grabbed the KVM mmu_lock with a successful
> + * mmu_notifier_retry, so we're safe to walk the page table.
> + */
> + switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> + case PMD_SHIFT:
> + case PUD_SIZE:

I assume this means DAX can have 1GB pages? I ask because KVM's THP logic
has historically relied on THP only supporting 2MB. I cleaned this up in
a recent series[*], which is in kvm/queue, but I obviously didn't actually
test whether or not KVM would correctly handle 1GB non-hugetlbfs pages.

The easiest thing is probably to rebase on kvm/queue. You'll need to do
that anyways, and I suspect doing so will help shake out any hiccups.

[*] https://lkml.kernel.org/r/[email protected]

> + return true;
> + }
> + return false;
> +}
> +
> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> gfn_t gfn, kvm_pfn_t *pfnp,
> int *levelp)
> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> * here.
> */
> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> - !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
> - PageTransCompoundMap(pfn_to_page(pfn)) &&
> + level == PT_PAGE_TABLE_LEVEL &&
> + pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
> !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
> unsigned long mask;
> /*
> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> * mapping if the indirect sp has level = 1.
> */
> if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
> - !kvm_is_zone_device_pfn(pfn) &&
> - PageTransCompoundMap(pfn_to_page(pfn))) {
> + pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
> pte_list_remove(rmap_head, sptep);
>
> if (kvm_available_flush_tlb_with_range())
> --
> 2.24.0.525.g8f36a354ae-goog
>

2019-12-12 17:38:44

by Dan Williams

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files

On Thu, Dec 12, 2019 at 9:34 AM Sean Christopherson
<[email protected]> wrote:
>
> On Wed, Dec 11, 2019 at 04:32:07PM -0500, Barret Rhoden wrote:
> > This change allows KVM to map DAX-backed files made of huge pages with
> > huge mappings in the EPT/TDP.
> >
> > DAX pages are not PageTransCompound. The existing check is trying to
> > determine if the mapping for the pfn is a huge mapping or not. For
> > non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> > For DAX, we can check the page table itself.
> >
> > Note that KVM already faulted in the page (or huge page) in the host's
> > page table, and we hold the KVM mmu spinlock. We grabbed that lock in
> > kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
> >
> > Signed-off-by: Barret Rhoden <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
> > 1 file changed, 32 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6f92b40d798c..cd07bc4e595f 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> > return -EFAULT;
> > }
> >
> > +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> > +{
> > + struct page *page = pfn_to_page(pfn);
> > + unsigned long hva;
> > +
> > + if (!is_zone_device_page(page))
> > + return PageTransCompoundMap(page);
> > +
> > + /*
> > + * DAX pages do not use compound pages. The page should have already
> > + * been mapped into the host-side page table during try_async_pf(), so
> > + * we can check the page tables directly.
> > + */
> > + hva = gfn_to_hva(kvm, gfn);
> > + if (kvm_is_error_hva(hva))
> > + return false;
> > +
> > + /*
> > + * Our caller grabbed the KVM mmu_lock with a successful
> > + * mmu_notifier_retry, so we're safe to walk the page table.
> > + */
> > + switch (dev_pagemap_mapping_shift(hva, current->mm)) {
> > + case PMD_SHIFT:
> > + case PUD_SIZE:
>
> I assume this means DAX can have 1GB pages?

Correct, it can. Not in the filesystem-dax case, but device-dax
supports 1GB pages.

> I ask because KVM's THP logic
> has historically relied on THP only supporting 2MB. I cleaned this up in
> a recent series[*], which is in kvm/queue, but I obviously didn't actually
> test whether or not KVM would correctly handle 1GB non-hugetlbfs pages.

Yeah, since device-dax is the only path to support longterm page
pinning for vfio device assignment, testing with device-dax + 1GB
pages would be a useful sanity check.

2019-12-12 17:40:50

by Liran Alon

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files



> On 12 Dec 2019, at 18:54, Dan Williams <[email protected]> wrote:
>
> On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <[email protected]> wrote:
>>
>>
>>
>>> On 11 Dec 2019, at 23:32, Barret Rhoden <[email protected]> wrote:
>>>
>>> This change allows KVM to map DAX-backed files made of huge pages with
>>> huge mappings in the EPT/TDP.
>>>
>>> DAX pages are not PageTransCompound. The existing check is trying to
>>> determine if the mapping for the pfn is a huge mapping or not. For
>>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>>> For DAX, we can check the page table itself.
>>
>> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
>> will return the page-size of the hugetlbfs without the need to parse the page-tables.
>> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().
>>
>> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.
>>
>> Therefore, it seems more logical to me that:
>> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables.
>
> A given dax-mapped vma may have mixed page sizes so ->page_size()
> can't be used reliably to enumerate the mapping size.

Naive question: Why not split the VMA in this case into multiple VMAs with different results for ->page_size()?
What you are describing sounds like DAX is breaking this callback's semantics in an unpredictable manner.

>
>> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.
>
> DAX pages do not participate in THP and do not have the
> PageTransCompound accounting. The only mechanism that records the
> mapping size for dax is the page tables themselves.

What is the rationale behind this? Given that DAX pages can be described with “struct page” (i.e. ZONE_DEVICE), what prevents THP from manipulating page-tables to merge multiple DAX PFNs into a larger page?

-Liran

>
>
>>
>>>
>>> Note that KVM already faulted in the page (or huge page) in the host's
>>> page table, and we hold the KVM mmu spinlock. We grabbed that lock in
>>> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
>>>
>>> Signed-off-by: Barret Rhoden <[email protected]>
>>> ---
>>> arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
>>> 1 file changed, 32 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index 6f92b40d798c..cd07bc4e595f 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
>>> return -EFAULT;
>>> }
>>>
>>> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
>>> +{
>>> + struct page *page = pfn_to_page(pfn);
>>> + unsigned long hva;
>>> +
>>> + if (!is_zone_device_page(page))
>>> + return PageTransCompoundMap(page);
>>> +
>>> + /*
>>> + * DAX pages do not use compound pages. The page should have already
>>> + * been mapped into the host-side page table during try_async_pf(), so
>>> + * we can check the page tables directly.
>>> + */
>>> + hva = gfn_to_hva(kvm, gfn);
>>> + if (kvm_is_error_hva(hva))
>>> + return false;
>>> +
>>> + /*
>>> + * Our caller grabbed the KVM mmu_lock with a successful
>>> + * mmu_notifier_retry, so we're safe to walk the page table.
>>> + */
>>> + switch (dev_pagemap_mapping_shift(hva, current->mm)) {
>>
>> Doesn’t dev_pagemap_mapping_shift() get “struct page” as first parameter?
>> Was this changed by a commit I missed?
>>
>> -Liran
>>
>>> + case PMD_SHIFT:
>>> + case PUD_SIZE:
>>> + return true;
>>> + }
>>> + return false;
>>> +}
>>> +
>>> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>>> gfn_t gfn, kvm_pfn_t *pfnp,
>>> int *levelp)
>>> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>>> * here.
>>> */
>>> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
>>> - !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
>>> - PageTransCompoundMap(pfn_to_page(pfn)) &&
>>> + level == PT_PAGE_TABLE_LEVEL &&
>>> + pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
>>> !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>>> unsigned long mask;
>>> /*
>>> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>>> * mapping if the indirect sp has level = 1.
>>> */
>>> if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
>>> - !kvm_is_zone_device_pfn(pfn) &&
>>> - PageTransCompoundMap(pfn_to_page(pfn))) {
>>> + pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
>>> pte_list_remove(rmap_head, sptep);
>>>
>>> if (kvm_available_flush_tlb_with_range())
>>> --
>>> 2.24.0.525.g8f36a354ae-goog
>>>
>>

2019-12-12 17:46:35

by Liran Alon

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files



> On 12 Dec 2019, at 19:34, Sean Christopherson <[email protected]> wrote:
>
> On Wed, Dec 11, 2019 at 04:32:07PM -0500, Barret Rhoden wrote:
>> This change allows KVM to map DAX-backed files made of huge pages with
>> huge mappings in the EPT/TDP.
>>
>> DAX pages are not PageTransCompound. The existing check is trying to
>> determine if the mapping for the pfn is a huge mapping or not. For
>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>> For DAX, we can check the page table itself.
>>
>> Note that KVM already faulted in the page (or huge page) in the host's
>> page table, and we hold the KVM mmu spinlock. We grabbed that lock in
>> kvm_mmu_notifier_invalidate_range_end, before checking the mmu seq.
>>
>> Signed-off-by: Barret Rhoden <[email protected]>
>> ---
>> arch/x86/kvm/mmu/mmu.c | 36 ++++++++++++++++++++++++++++++++----
>> 1 file changed, 32 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 6f92b40d798c..cd07bc4e595f 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -3384,6 +3384,35 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
>> return -EFAULT;
>> }
>>
>> +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
>> +{
>> + struct page *page = pfn_to_page(pfn);
>> + unsigned long hva;
>> +
>> + if (!is_zone_device_page(page))
>> + return PageTransCompoundMap(page);
>> +
>> + /*
>> + * DAX pages do not use compound pages. The page should have already
>> + * been mapped into the host-side page table during try_async_pf(), so
>> + * we can check the page tables directly.
>> + */
>> + hva = gfn_to_hva(kvm, gfn);
>> + if (kvm_is_error_hva(hva))
>> + return false;
>> +
>> + /*
>> + * Our caller grabbed the KVM mmu_lock with a successful
>> + * mmu_notifier_retry, so we're safe to walk the page table.
>> + */
>> + switch (dev_pagemap_mapping_shift(hva, current->mm)) {
>> + case PMD_SHIFT:
>> + case PUD_SIZE:
>
> I assume this means DAX can have 1GB pages? I ask because KVM's THP logic
> has historically relied on THP only supporting 2MB. I cleaned this up in
> a recent series[*], which is in kvm/queue, but I obviously didn't actually
> test whether or not KVM would correctly handle 1GB non-hugetlbfs pages.

KVM doesn’t handle 1GB correctly for all types of non-hugetlbfs pages.
One example we have noticed internally but haven’t submitted an upstream patch for yet is
pages without “struct page”. In this case, hva_to_pfn() will notice that vma->vm_flags has VM_PFNMAP set
and call hva_to_pfn_remapped() -> follow_pfn().
However, follow_pfn() currently just calls follow_pte(), which uses __follow_pte_pmd(), which doesn’t handle a huge PUD entry.

>
> The easiest thing is probably to rebase on kvm/queue. You'll need to do
> that anyways, and I suspect doing so will help shake out any hiccups.
>
> [*] https://lkml.kernel.org/r/[email protected]
>
>> + return true;
>> + }
>> + return false;
>> +}
>> +
>> static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>> gfn_t gfn, kvm_pfn_t *pfnp,
>> int *levelp)
>> @@ -3398,8 +3427,8 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>> * here.
>> */
>> if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
>> - !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
>> - PageTransCompoundMap(pfn_to_page(pfn)) &&
>> + level == PT_PAGE_TABLE_LEVEL &&
>> + pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
>> !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
>> unsigned long mask;
>> /*
>> @@ -6015,8 +6044,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>> * mapping if the indirect sp has level = 1.
>> */
>> if (sp->role.direct && !kvm_is_reserved_pfn(pfn) &&
>> - !kvm_is_zone_device_pfn(pfn) &&
>> - PageTransCompoundMap(pfn_to_page(pfn))) {
>> + pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
>> pte_list_remove(rmap_head, sptep);
>>
>> if (kvm_available_flush_tlb_with_range())
>> --
>> 2.24.0.525.g8f36a354ae-goog
>>

2019-12-12 18:01:10

by Dan Williams

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files

On Thu, Dec 12, 2019 at 9:39 AM Liran Alon <[email protected]> wrote:
>
>
>
> > On 12 Dec 2019, at 18:54, Dan Williams <[email protected]> wrote:
> >
> > On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <[email protected]> wrote:
> >>
> >>
> >>
> >>> On 11 Dec 2019, at 23:32, Barret Rhoden <[email protected]> wrote:
> >>>
> >>> This change allows KVM to map DAX-backed files made of huge pages with
> >>> huge mappings in the EPT/TDP.
> >>>
> >>> DAX pages are not PageTransCompound. The existing check is trying to
> >>> determine if the mapping for the pfn is a huge mapping or not. For
> >>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
> >>> For DAX, we can check the page table itself.
> >>
> >> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
> >> will return the page-size of the hugetlbfs without the need to parse the page-tables.
> >> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().
> >>
> >> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.
> >>
> >> Therefore, it seems more logical to me that:
> >> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables.
> >
> > A given dax-mapped vma may have mixed page sizes so ->page_size()
> > can't be used reliably to enumerate the mapping size.
>
> Naive question: Why not split the VMA in this case into multiple VMAs with different results for ->page_size()?

Filesystems traditionally have not populated ->pagesize() in their
vm_operations, there was no compelling reason to go add it and the
complexity seems prohibitive.

> What you are describing sounds like DAX is breaking this callback's semantics in an unpredictable manner.

It's not unpredictable. vma_kernel_pagesize() returns PAGE_SIZE. Huge
pages in the page cache have a similar issue.

> >> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.
> >
> > DAX pages do not participate in THP and do not have the
> > PageTransCompound accounting. The only mechanism that records the
> > mapping size for dax is the page tables themselves.
>
> What is the rationale behind this? Given that DAX pages can be described with “struct page” (i.e. ZONE_DEVICE), what prevents THP from manipulating page-tables to merge multiple DAX PFNs into a larger page?

THP accounting is a function of the page allocator. ZONE_DEVICE pages
are excluded from the page allocator. ZONE_DEVICE is just enough
infrastructure to support pfn_to_page(), page_address(), and
get_user_pages(). Other page allocator services beyond that are not
present.

2019-12-12 18:33:23

by Liran Alon

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files



> On 12 Dec 2019, at 19:59, Dan Williams <[email protected]> wrote:
>
> On Thu, Dec 12, 2019 at 9:39 AM Liran Alon <[email protected]> wrote:
>>
>>
>>
>>> On 12 Dec 2019, at 18:54, Dan Williams <[email protected]> wrote:
>>>
>>> On Thu, Dec 12, 2019 at 4:34 AM Liran Alon <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>>> On 11 Dec 2019, at 23:32, Barret Rhoden <[email protected]> wrote:
>>>>>
>>>>> This change allows KVM to map DAX-backed files made of huge pages with
>>>>> huge mappings in the EPT/TDP.
>>>>>
>>>>> DAX pages are not PageTransCompound. The existing check is trying to
>>>>> determine if the mapping for the pfn is a huge mapping or not. For
>>>>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>>>>> For DAX, we can check the page table itself.
>>>>
>>>> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
>>>> will return the page-size of the hugetlbfs without the need to parse the page-tables.
>>>> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().
>>>>
>>>> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.
>>>>
>>>> Therefore, it seems more logical to me that:
>>>> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->page_size() of that mapping. Causing kvm_host_page_size() to return the right size without the need to parse the page-tables.
>>>
>>> A given dax-mapped vma may have mixed page sizes so ->page_size()
>>> can't be used reliably to enumerate the mapping size.
>>
>> Naive question: Why not split the VMA in this case into multiple VMAs with different results for ->page_size()?
>
> Filesystems traditionally have not populated ->pagesize() in their
> vm_operations, there was no compelling reason to go add it and the
> complexity seems prohibitive.

I understand. Though this is technical debt that breaks ->page_size() semantics which might cause a complex bug some day...

>
>> What you are describing sounds like DAX is breaking this callback's semantics in an unpredictable manner.
>
> It's not unpredictable. vma_kernel_pagesize() returns PAGE_SIZE.

Of course. :) I meant it may be unexpected by the caller.

> Huge
> pages in the page cache have a similar issue.

Ok. I haven’t known that. Thanks for the explanation.

>
>>>> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.
>>>
>>> DAX pages do not participate in THP and do not have the
>>> PageTransCompound accounting. The only mechanism that records the
>>> mapping size for dax is the page tables themselves.
>>
>> What is the rationale behind this? Given that DAX pages can be described with “struct page” (i.e. ZONE_DEVICE), what prevents THP from manipulating page-tables to merge multiple DAX PFNs into a larger page?
>
> THP accounting is a function of the page allocator. ZONE_DEVICE pages
> are excluded from the page allocator. ZONE_DEVICE is just enough
> infrastructure to support pfn_to_page(), page_address(), and
> get_user_pages(). Other page allocator services beyond that are not
> present.

Ok.


2019-12-12 19:17:30

by Barret Rhoden

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files

On 12/12/19 12:37 PM, Dan Williams wrote:
> Yeah, since device-dax is the only path to support longterm page
> pinning for vfio device assignment, testing with device-dax + 1GB
> pages would be a useful sanity check.

What are the issues with fs-dax and page pinning? Is that limitation
something that is permanent and unfixable (by me or anyone)?

I'd like to put a lot more in a DAX/pmem region than just a guest's
memory, and having a mountable filesystem would be extremely convenient.

Thanks,

Barret


2019-12-12 19:49:08

by Dan Williams

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files

On Thu, Dec 12, 2019 at 11:16 AM Barret Rhoden <[email protected]> wrote:
>
> On 12/12/19 12:37 PM, Dan Williams wrote:
> > Yeah, since device-dax is the only path to support longterm page
> > pinning for vfio device assignment, testing with device-dax + 1GB
> > pages would be a useful sanity check.
>
> What are the issues with fs-dax and page pinning? Is that limitation
> something that is permanent and unfixable (by me or anyone)?

It's a surprisingly painful point of contention...

File backed DAX pages cannot be truncated while the page is pinned
because the pin may indicate that DMA is ongoing to the file block /
DAX page. When that pin is from RDMA or VFIO that creates a situation
where filesystem operations are blocked indefinitely. More details
here: 94db151dc892 "vfio: disable filesystem-dax page pinning".

Currently, to prevent the deadlock, RDMA, VFIO, and IO_URING memory
registration is blocked if the mapping is filesystem-dax backed (see
the FOLL_LONGTERM flag to get_user_pages).
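
(All of those registration paths boil down to a gup call along these lines --
illustrative only, the exact entry point varies by subsystem:

	pinned = get_user_pages_fast(start, nr_pages,
				     FOLL_WRITE | FOLL_LONGTERM, pages);

and gup refuses the FOLL_LONGTERM request when the vma is fs-dax backed.)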

One of the proposals to break the impasse was to allow the filesystem
to forcibly revoke the mapping. I.e. to use the IOMMU to forcibly kick
the RDMA device out of its registration. That was rejected by RDMA
folks because RDMA applications are not prepared for this revocation
to happen and the application that performed the registration may not
be the application that uses the registration. There was an attempt to
use a file lease to indicate the presence of a file /
memory-registration that is blocking file-system operations, but that
was still less palatable to filesystem folks than just keeping the
status quo of blocking longterm pinning.

That said, the VFIO use case seems a different situation than RDMA.
There's often a 1:1 relationship between the application performing
the memory registration and the application consuming it, the VMM, and
there is always an IOMMU present that could revoke access and kill the
guest if the mapping got truncated. It seems in theory that VFIO could
tolerate a "revoke pin on truncate" mechanism where RDMA could not.

> I'd like to put a lot more in a DAX/pmem region than just a guest's
> memory, and having a mountable filesystem would be extremely convenient.

Why would page pinning be involved in allowing the guest to mount a
filesystem on guest-pmem? That already works today, it's just the
device-passthrough that causes guest memory to be pinned indefinitely.

2019-12-12 20:10:03

by Barret Rhoden

Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files

On 12/12/19 2:48 PM, Dan Williams wrote:
> On Thu, Dec 12, 2019 at 11:16 AM Barret Rhoden <[email protected]> wrote:
>>
>> On 12/12/19 12:37 PM, Dan Williams wrote:
>>> Yeah, since device-dax is the only path to support longterm page
>>> pinning for vfio device assignment, testing with device-dax + 1GB
>>> pages would be a useful sanity check.
>>
>> What are the issues with fs-dax and page pinning? Is that limitation
>> something that is permanent and unfixable (by me or anyone)?
>
> It's a surprisingly painful point of contention...

Thanks for the info; I'll check out those threads.

[snip]

>> I'd like to put a lot more in a DAX/pmem region than just a guest's
>> memory, and having a mountable filesystem would be extremely convenient.
>
> Why would page pinning be involved in allowing the guest to mount a
> filesystem on guest-pmem? That already works today, it's just the
> device-passthrough that causes guest memory to be pinned indefinitely.

I'd like to mount the pmem filesystem on the *host* and use its files
for the guest's memory. So far I've just been making an ext4 FS on
/dev/pmem0 and creating a bunch of files in the FS. Some of the files
are the guest memory: one file for each VM. Other files are just
metadata that the host uses.

That all works right now, but I'd also like to use VFIO with the guests.

Thanks,

Barret