2021-04-29 16:30:44

by Shanker Donthineni

[permalink] [raw]
Subject: [RFC 0/2] [RFC] Honor PCI prefetchable attributes for a virtual machine on ARM64

Problem statement: Virtual machine crashes when NVIDIA GPU driver access a prefetchable BAR space due to the unaligned reads/writes for pass-through devices. The same binary works fine as expected in the host kernel. Only one BAR has control & status registers (CSR) and other PCI BARs are marked as prefetchable. NVIDIA GPU driver uses the write-combine feature for mapping the prefetchable BARs to improve performance. This problem applies to all other drivers which want to enable WC.

Solution: Honor PCI prefetchable attributes for the guest operating systems.

Proposal: ARM64-KVM uses VMA struct for the needed information e.g. region physical address, size, and memory-type (struct page backed mapping or anonymous memory) for setting up a stage-2 page table. Right now memory region either can be mapped as DEVICE (strongly ordered) or NORMAL (write-back cache) depends on the flag VM_PFNMAP in VMA. VFIO-PCI will keep the prefetchable (write-combine) information in vma->vm_page_prot similar to other fields, and KVM will prepare stage-2 entries based on the memory-type attribute that was set in VMA.

Shanker Donthineni (2):
vfio/pci: keep the prefetchable attribute of a BAR region in VMA
KVM: arm64: Add write-combine support for stage-2 entries

arch/arm64/include/asm/kvm_mmu.h | 3 ++-
arch/arm64/include/asm/kvm_pgtable.h | 2 ++
arch/arm64/include/asm/memory.h | 4 +++-
arch/arm64/kvm/hyp/pgtable.c | 9 +++++++--
arch/arm64/kvm/mmu.c | 22 +++++++++++++++++++---
arch/arm64/kvm/vgic/vgic-v2.c | 2 +-
drivers/vfio/pci/vfio_pci.c | 6 +++++-
7 files changed, 39 insertions(+), 9 deletions(-)

--
2.17.1


2021-04-29 16:31:14

by Shanker Donthineni

[permalink] [raw]
Subject: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

For pass-through device assignment, the ARM64 KVM hypervisor retrieves
the memory region properties physical address, size, and whether a
region backed with struct page or not from VMA. The prefetchable
attribute of a BAR region isn't visible to KVM to make an optimal
decision for stage2 attributes.

This patch updates vma->vm_page_prot and maps with write-combine
attribute if the associated BAR is prefetchable. For ARM64
pgprot_writecombine() is mapped to memory-type MT_NORMAL_NC which
has no side effects on reads and multiple writes can be combined.

Signed-off-by: Shanker Donthineni <[email protected]>
---
drivers/vfio/pci/vfio_pci.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 5023e23db3bc..1b734fe1dd51 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1703,7 +1703,11 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
}

vma->vm_private_data = vdev;
- vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ if (IS_ENABLED(CONFIG_ARM64) &&
+ (pci_resource_flags(pdev, index) & IORESOURCE_PREFETCH))
+ vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
+ else
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;

/*
--
2.17.1

2021-04-29 16:32:29

by Shanker Donthineni

[permalink] [raw]
Subject: [RFC 2/2] KVM: arm64: Add write-combine support for stage-2 entries

In the current implementation, the device memory is always mapped as
DEVICE_nGnRE in stage-2. In the host kernel, device drivers have
flexibility whether to choose a memory-type device or write-combine
(Non-cacheable) depends on the use case. PCI specification has a
prefetchable BAR concept where multiple writes can be combined and
no side effects on reads. It provides huge performance improvement
and also allows unaligned access.

NVIDIA GPU PCIe devices have 3 BAR regions. Two regions are mapped to
video/compute memory and marked as prefetchable. The GPU driver takes
advantage of the write-combine feature for higher performance. The
same driver has no issues in the host kernel but crashes inside the
virtual machine because of unaligned accesses.

This patch finds the PTE attributes for device memory in VMA. It
updates the stage-2 attribute to NORMAL_NC for WC regions and
the default type DEVICE_nGnRE for non-WC regions.

Change-Id: Ibaea69c7a301df3c86609e871f6d066728391080
Signed-off-by: Shanker Donthineni <[email protected]>
---
arch/arm64/include/asm/kvm_mmu.h | 3 ++-
arch/arm64/include/asm/kvm_pgtable.h | 2 ++
arch/arm64/include/asm/memory.h | 4 +++-
arch/arm64/kvm/hyp/pgtable.c | 9 +++++++--
arch/arm64/kvm/mmu.c | 21 ++++++++++++++++++---
arch/arm64/kvm/vgic/vgic-v2.c | 2 +-
6 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 90873851f677..dec498a6ba2f 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -160,7 +160,8 @@ void stage2_unmap_vm(struct kvm *kvm);
int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu);
void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
- phys_addr_t pa, unsigned long size, bool writable);
+ phys_addr_t pa, unsigned long size, bool writable,
+ bool writecombine);

int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 8886d43cfb11..26f28220f6f3 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -35,6 +35,7 @@ struct kvm_pgtable {
* @KVM_PGTABLE_PROT_W: Write permission.
* @KVM_PGTABLE_PROT_R: Read permission.
* @KVM_PGTABLE_PROT_DEVICE: Device attributes.
+ * @KVM_PGTABLE_PROT_WC: Normal non-cacheable (WC).
*/
enum kvm_pgtable_prot {
KVM_PGTABLE_PROT_X = BIT(0),
@@ -42,6 +43,7 @@ enum kvm_pgtable_prot {
KVM_PGTABLE_PROT_R = BIT(2),

KVM_PGTABLE_PROT_DEVICE = BIT(3),
+ KVM_PGTABLE_PROT_WC = BIT(4),
};

#define PAGE_HYP (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 0aabc3be9a75..04a812b59437 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -144,13 +144,15 @@
* Memory types for Stage-2 translation
*/
#define MT_S2_NORMAL 0xf
+#define MT_S2_WRITE_COMBINE 5
#define MT_S2_DEVICE_nGnRE 0x1

/*
* Memory types for Stage-2 translation when ID_AA64MMFR2_EL1.FWB is 0001
- * Stage-2 enforces Normal-WB and Device-nGnRE
+ * Stage-2 enforces Normal-WB, Normal-NC and Device-nGnRE
*/
#define MT_S2_FWB_NORMAL 6
+#define MT_S2_FWB_WRITE_COMBINE 5
#define MT_S2_FWB_DEVICE_nGnRE 1

#ifdef CONFIG_ARM64_4K_PAGES
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 926fc07074f5..bdfed559eae2 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -444,9 +444,14 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
struct stage2_map_data *data)
{
bool device = prot & KVM_PGTABLE_PROT_DEVICE;
- kvm_pte_t attr = device ? PAGE_S2_MEMATTR(DEVICE_nGnRE) :
- PAGE_S2_MEMATTR(NORMAL);
u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;
+ kvm_pte_t attr = PAGE_S2_MEMATTR(NORMAL);
+
+ if (device) {
+ attr = (prot & KVM_PGTABLE_PROT_WC) ?
+ PAGE_S2_MEMATTR(WRITE_COMBINE) :
+ PAGE_S2_MEMATTR(DEVICE_nGnRE);
+ }

if (!(prot & KVM_PGTABLE_PROT_X))
attr |= KVM_PTE_LEAF_ATTR_HI_S2_XN;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..5b8ec1ab12e2 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -487,6 +487,16 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
}
}

+/**
+ * is_vma_write_combine - check if VMA is mapped with writecombine or not
+ * Return true if VMA mapped with MT_NORMAL_NC otherwise fasle
+ */
+static bool inline is_vma_write_combine(struct vm_area_struct *vma)
+{
+ pteval_t pteval = pgprot_val(vma->vm_page_prot);
+ return ((pteval & PTE_ATTRINDX_MASK) == PTE_ATTRINDX(MT_NORMAL_NC));
+}
+
/**
* kvm_phys_addr_ioremap - map a device range to guest IPA
*
@@ -495,9 +505,11 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
* @pa: The physical address of the device
* @size: The size of the mapping
* @writable: Whether or not to create a writable mapping
+ * @writecombine: Whether or not to create a writecombine mapping
*/
int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
- phys_addr_t pa, unsigned long size, bool writable)
+ phys_addr_t pa, unsigned long size, bool writable,
+ bool writecombine)
{
phys_addr_t addr;
int ret = 0;
@@ -505,6 +517,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
KVM_PGTABLE_PROT_R |
+ (writecombine ? KVM_PGTABLE_PROT_WC : 0) |
(writable ? KVM_PGTABLE_PROT_W : 0);

size += offset_in_page(guest_ipa);
@@ -891,7 +904,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
}

if (device)
- prot |= KVM_PGTABLE_PROT_DEVICE;
+ prot |= KVM_PGTABLE_PROT_DEVICE |
+ (is_vma_write_combine(vma) ? KVM_PGTABLE_PROT_WC : 0);
else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
prot |= KVM_PGTABLE_PROT_X;

@@ -1357,7 +1371,8 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,

ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
vm_end - vm_start,
- writable);
+ writable,
+ is_vma_write_combine(vma));
if (ret)
break;
}
diff --git a/arch/arm64/kvm/vgic/vgic-v2.c b/arch/arm64/kvm/vgic/vgic-v2.c
index 11934c2af2f4..6f921efea6c0 100644
--- a/arch/arm64/kvm/vgic/vgic-v2.c
+++ b/arch/arm64/kvm/vgic/vgic-v2.c
@@ -336,7 +336,7 @@ int vgic_v2_map_resources(struct kvm *kvm)
if (!static_branch_unlikely(&vgic_v2_cpuif_trap)) {
ret = kvm_phys_addr_ioremap(kvm, dist->vgic_cpu_base,
kvm_vgic_global_state.vcpu_base,
- KVM_VGIC_V2_CPU_SIZE, true);
+ KVM_VGIC_V2_CPU_SIZE, true, false);
if (ret) {
kvm_err("Unable to remap VGIC CPU to VCPU\n");
return ret;
--
2.17.1

2021-04-29 18:30:30

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

On Thu, 29 Apr 2021 11:29:05 -0500
Shanker Donthineni <[email protected]> wrote:

> For pass-through device assignment, the ARM64 KVM hypervisor retrieves
> the memory region properties physical address, size, and whether a
> region backed with struct page or not from VMA. The prefetchable
> attribute of a BAR region isn't visible to KVM to make an optimal
> decision for stage2 attributes.
>
> This patch updates vma->vm_page_prot and maps with write-combine
> attribute if the associated BAR is prefetchable. For ARM64
> pgprot_writecombine() is mapped to memory-type MT_NORMAL_NC which
> has no side effects on reads and multiple writes can be combined.
>
> Signed-off-by: Shanker Donthineni <[email protected]>
> ---
> drivers/vfio/pci/vfio_pci.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 5023e23db3bc..1b734fe1dd51 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1703,7 +1703,11 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> }
>
> vma->vm_private_data = vdev;
> - vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> + if (IS_ENABLED(CONFIG_ARM64) &&
> + (pci_resource_flags(pdev, index) & IORESOURCE_PREFETCH))
> + vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
> + else
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

If this were a valid thing to do, it should be done for all
architectures, not just ARM64. However, a prefetchable range only
necessarily allows merged writes, which seems like a subset of the
semantics implied by a WC attribute, therefore this doesn't seem
universally valid.

I'm also a bit confused by your problem statement that indicates that
without WC you're seeing unaligned accesses, does this suggest that
your driver is actually relying on WC semantics to perform merging to
achieve alignment? That seems rather like a driver bug, I'd expect UC
vs WC is largely a difference in performance, not a means to enforce
proper driver access patterns. Per the PCI spec, the bridge itself can
merge writes to prefetchable areas, presumably regardless of this
processor attribute, perhaps that's the feature your driver is relying
on that might be missing here. Thanks,

Alex

2021-04-29 19:25:11

by Shanker Donthineni

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

Thanks Alex for quick reply.

On 4/29/21 1:28 PM, Alex Williamson wrote:
> If this were a valid thing to do, it should be done for all
> architectures, not just ARM64. However, a prefetchable range only
> necessarily allows merged writes, which seems like a subset of the
> semantics implied by a WC attribute, therefore this doesn't seem
> universally valid.
>
> I'm also a bit confused by your problem statement that indicates that
> without WC you're seeing unaligned accesses, does this suggest that
> your driver is actually relying on WC semantics to perform merging to
> achieve alignment? That seems rather like a driver bug, I'd expect UC
> vs WC is largely a difference in performance, not a means to enforce
> proper driver access patterns. Per the PCI spec, the bridge itself can
> merge writes to prefetchable areas, presumably regardless of this
> processor attribute, perhaps that's the feature your driver is relying
> on that might be missing here. Thanks,
The driver uses WC semantics, It's mapping PCI prefetchable BARS using ioremap_wc().  We don't see any issue for x86 architecture,  driver works fine in the host and guest kernel. The same driver works on ARM64 kernel but crashes inside VM.
GPU driver uses the architecture agnostic function ioremap_wc() like other drivers. This limitation applies to all the drivers if they use WC memory and follow ARM64 NORMAL-NC access rules.

On ARM64, ioremap_wc() is mapped to non-cacheable memory-type, no side effects on reads and unaligned accesses are allowed as per ARM-ARM architecture. The driver behavior is different in host vs guest on ARM64. 

ARM CPU generating alignment faults before transaction reaches the PCI-RC/switch/end-point-device.

We've two concerns here:
   - Performance impacts for pass-through devices.
   - The definition of ioremap_wc() function doesn't match the host kernel on ARM64

 
> Alex
>

2021-04-29 20:02:08

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

On Thu, 29 Apr 2021 14:14:50 -0500
Shanker R Donthineni <[email protected]> wrote:

> Thanks Alex for quick reply.
>
> On 4/29/21 1:28 PM, Alex Williamson wrote:
> > If this were a valid thing to do, it should be done for all
> > architectures, not just ARM64. However, a prefetchable range only
> > necessarily allows merged writes, which seems like a subset of the
> > semantics implied by a WC attribute, therefore this doesn't seem
> > universally valid.
> >
> > I'm also a bit confused by your problem statement that indicates that
> > without WC you're seeing unaligned accesses, does this suggest that
> > your driver is actually relying on WC semantics to perform merging to
> > achieve alignment? That seems rather like a driver bug, I'd expect UC
> > vs WC is largely a difference in performance, not a means to enforce
> > proper driver access patterns. Per the PCI spec, the bridge itself can
> > merge writes to prefetchable areas, presumably regardless of this
> > processor attribute, perhaps that's the feature your driver is relying
> > on that might be missing here. Thanks,
> The driver uses WC semantics, It's mapping PCI prefetchable BARS
> using ioremap_wc().  We don't see any issue for x86 architecture,
> driver works fine in the host and guest kernel. The same driver works
> on ARM64 kernel but crashes inside VM. GPU driver uses the
> architecture agnostic function ioremap_wc() like other drivers. This
> limitation applies to all the drivers if they use WC memory and
> follow ARM64 NORMAL-NC access rules.

x86 KVM works for other reasons, KVM will trust the vCPU attributes for
the memory range rather than relying only on the host mapping.

> On ARM64, ioremap_wc() is mapped to non-cacheable memory-type, no
> side effects on reads and unaligned accesses are allowed as per
> ARM-ARM architecture. The driver behavior is different in host vs
> guest on ARM64. 

Per the PCI spec, prefetchable memory only necessarily allows the bridge
to merge writes. I believe this is only a subset of what WC mappings
allow, therefore I expect this is incompatible with drivers that do not
use WC mappings.

> ARM CPU generating alignment faults before transaction reaches the
> PCI-RC/switch/end-point-device.

If an alignment fault is fixed by configuring a WC mapping, doesn't
that suggest that the driver performed an unaligned access itself and
is relying on write combining by the processor to correct that error?
That's wrong. Fix the driver or please offer another explanation of
how the WC mapping resolves this. I suspect you could enable tracing
in QEMU, disable MMIO mmaps on the vfio-pci device and find the invalid
access.

> We've two concerns here:
>    - Performance impacts for pass-through devices.
>    - The definition of ioremap_wc() function doesn't match the host
> kernel on ARM64

Performance I can understand, but I think you're also using it to mask
a driver bug which should be resolved first. Thanks,

Alex

2021-04-29 22:12:29

by Vikram Sethi

[permalink] [raw]
Subject: RE: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

Hi Alex,

> From: Alex Williamson <[email protected]>
> Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region
> in VMA
> On Thu, 29 Apr 2021 14:14:50 -0500
> Shanker R Donthineni <[email protected]> wrote:
>
> > Thanks Alex for quick reply.
> >
> > On 4/29/21 1:28 PM, Alex Williamson wrote:
> > > If this were a valid thing to do, it should be done for all
> > > architectures, not just ARM64. However, a prefetchable range only
> > > necessarily allows merged writes, which seems like a subset of the
> > > semantics implied by a WC attribute, therefore this doesn't seem
> > > universally valid.
> > >
I didn't get your exact concern. If we removed the check for ARM arch
and simply stored that this is a prefetchable region in VMA, then each arch KVM
port could decide which PTE mappings are OK for prefetchable BAR.
KVM doesn't want to go through PCIe enumeration, and would rather
have the properties stored in VMA.
Beyond that, on arm64 specifically there is no WC Memtype, but we use
Normal Non Cacheable mapping for ioremap_wc which can be prefetched
and can be write combined. What semantics break for a device if
its prefetchable BAR is marked as Normal Noncacheable on arm64?

We need a way for write combining to work in a KVM-ARM guest, as it is
an important usecase for GPUs and NICs and also NVMe CMB IIRC. So
*some* way is needed of letting KVM know to map as write combine
(Normal NC) at stage2. Do you have a better solution in mind?

> > > I'm also a bit confused by your problem statement that indicates
> > > that without WC you're seeing unaligned accesses, does this suggest
> > > that your driver is actually relying on WC semantics to perform
> > > merging to achieve alignment? That seems rather like a driver bug,
> > > I'd expect UC vs WC is largely a difference in performance, not a
> > > means to enforce proper driver access patterns. Per the PCI spec,
> > > the bridge itself can merge writes to prefetchable areas, presumably
> > > regardless of this processor attribute, perhaps that's the feature
> > > your driver is relying on that might be missing here. Thanks,
> > The driver uses WC semantics, It's mapping PCI prefetchable BARS using
> > ioremap_wc(). We don't see any issue for x86 architecture, driver
> > works fine in the host and guest kernel. The same driver works on
> > ARM64 kernel but crashes inside VM. GPU driver uses the architecture
> > agnostic function ioremap_wc() like other drivers. This limitation
> > applies to all the drivers if they use WC memory and follow ARM64
> > NORMAL-NC access rules.
>
> x86 KVM works for other reasons, KVM will trust the vCPU attributes for the
> memory range rather than relying only on the host mapping.
>
> > On ARM64, ioremap_wc() is mapped to non-cacheable memory-type, no
> side
> > effects on reads and unaligned accesses are allowed as per ARM-ARM
> > architecture. The driver behavior is different in host vs guest on
> > ARM64.
>
> Per the PCI spec, prefetchable memory only necessarily allows the bridge to
> merge writes. I believe this is only a subset of what WC mappings allow,
> therefore I expect this is incompatible with drivers that do not use WC
> mappings.
>
> > ARM CPU generating alignment faults before transaction reaches the
> > PCI-RC/switch/end-point-device.
>
> If an alignment fault is fixed by configuring a WC mapping, doesn't that
> suggest that the driver performed an unaligned access itself and is relying on
> write combining by the processor to correct that error?
> That's wrong. Fix the driver or please offer another explanation of how the
> WC mapping resolves this. I suspect you could enable tracing in QEMU,
> disable MMIO mmaps on the vfio-pci device and find the invalid access.
>
> > We've two concerns here:
> > - Performance impacts for pass-through devices.
> > - The definition of ioremap_wc() function doesn't match the host
> > kernel on ARM64
>
> Performance I can understand, but I think you're also using it to mask a driver
> bug which should be resolved first. Thanks,
>
> Alex

2021-04-30 09:55:18

by Lorenzo Pieralisi

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

[+Jason, Ben]

On Thu, Apr 29, 2021 at 11:29:05AM -0500, Shanker Donthineni wrote:
> For pass-through device assignment, the ARM64 KVM hypervisor retrieves
> the memory region properties physical address, size, and whether a
> region backed with struct page or not from VMA. The prefetchable
> attribute of a BAR region isn't visible to KVM to make an optimal
> decision for stage2 attributes.
>
> This patch updates vma->vm_page_prot and maps with write-combine
> attribute if the associated BAR is prefetchable. For ARM64
> pgprot_writecombine() is mapped to memory-type MT_NORMAL_NC which
> has no side effects on reads and multiple writes can be combined.
>
> Signed-off-by: Shanker Donthineni <[email protected]>
> ---
> drivers/vfio/pci/vfio_pci.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)

A bit of background information that may be useful:

https://lore.kernel.org/linux-pci/2b539df4c9ec703458e46da2fc879ee3b310b31c.camel@kernel.crashing.org

Lorenzo

> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 5023e23db3bc..1b734fe1dd51 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1703,7 +1703,11 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> }
>
> vma->vm_private_data = vdev;
> - vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> + if (IS_ENABLED(CONFIG_ARM64) &&
> + (pci_resource_flags(pdev, index) & IORESOURCE_PREFETCH))
> + vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
> + else
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
>
> /*
> --
> 2.17.1
>

2021-04-30 11:27:55

by Shanker Donthineni

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

Hi Alex

On 4/29/21 2:46 PM, Alex Williamson wrote:
> If an alignment fault is fixed by configuring a WC mapping, doesn't
> that suggest that the driver performed an unaligned access itself and
> is relying on write combining by the processor to correct that error?
> That's wrong. Fix the driver or please offer another explanation of
> how the WC mapping resolves this. I suspect you could enable tracing
> in QEMU, disable MMIO mmaps on the vfio-pci device and find the invalid
> access.
>
>> We've two concerns here:
>> - Performance impacts for pass-through devices.
>> - The definition of ioremap_wc() function doesn't match the host
>> kernel on ARM64
> Performance I can understand, but I think you're also using it to mask
> a driver bug which should be resolved first. Thank

We’ve already instrumented the driver code and found the code path for the unaligned
accesses. We’ll fix this issue if it’s not following WC semantics.

Fixing the performance concern will be under KVM stage-2 page-table control. We're
looking for a guidance/solution for updating stage-2  PTE based on PCI-BAR attribute.


2021-04-30 12:39:39

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

On Fri, Apr 30, 2021 at 10:54:17AM +0100, Lorenzo Pieralisi wrote:
> [+Jason, Ben]
>
> On Thu, Apr 29, 2021 at 11:29:05AM -0500, Shanker Donthineni wrote:
> > For pass-through device assignment, the ARM64 KVM hypervisor retrieves
> > the memory region properties physical address, size, and whether a
> > region backed with struct page or not from VMA. The prefetchable
> > attribute of a BAR region isn't visible to KVM to make an optimal
> > decision for stage2 attributes.
> >
> > This patch updates vma->vm_page_prot and maps with write-combine
> > attribute if the associated BAR is prefetchable. For ARM64
> > pgprot_writecombine() is mapped to memory-type MT_NORMAL_NC which
> > has no side effects on reads and multiple writes can be combined.
> >
> > Signed-off-by: Shanker Donthineni <[email protected]>
> > drivers/vfio/pci/vfio_pci.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
>
> A bit of background information that may be useful:
>
> https://lore.kernel.org/linux-pci/2b539df4c9ec703458e46da2fc879ee3b310b31c.camel@kernel.crashing.org

This can't happen automatically.

writecombining or not is a uABI visible change. Userspace needs to
explicitly request and know that the mmap it gets back is
writecombining.

Jason

2021-04-30 13:09:50

by Shanker Donthineni

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA


Hi Marc,

On 4/30/21 6:47 AM, Marc Zyngi
>>>> We've two concerns here:
>>>> - Performance impacts for pass-through devices.
>>>> - The definition of ioremap_wc() function doesn't match the host
>>>> kernel on ARM64
>>> Performance I can understand, but I think you're also using it to mask
>>> a driver bug which should be resolved first. Thank
>> We’ve already instrumented the driver code and found the code path
>> for the unaligned accesses. We’ll fix this issue if it’s not
>> following WC semantics.
>>
>> Fixing the performance concern will be under KVM stage-2 page-table
>> control. We're looking for a guidance/solution for updating stage-2
>> PTE based on PCI-BAR attribute.
> Before we start discussing the *how*, I'd like to clearly understand
> what *arm64* memory attributes you are relying on. We already have
> established that the unaligned access was a bug, which was the biggest
> argument in favour of NORMAL_NC. What are the other requirements?
>
We are  relying  on NORMAL_NC mapping for PCI prefetchable BARs instead of DEVICE_nGnRE in baremetal and VM.
No other requirement other than this.

-Shanker

2021-04-30 15:01:19

by Shanker Donthineni

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

Hi Marc,

On 4/30/21 6:47 AM, Marc Zyngier wrote:
>
>>>> We've two concerns here:
>>>> - Performance impacts for pass-through devices.
>>>> - The definition of ioremap_wc() function doesn't match the host
>>>> kernel on ARM64
>>> Performance I can understand, but I think you're also using it to mask
>>> a driver bug which should be resolved first. Thank
>> We’ve already instrumented the driver code and found the code path
>> for the unaligned accesses. We’ll fix this issue if it’s not
>> following WC semantics.
>>
>> Fixing the performance concern will be under KVM stage-2 page-table
>> control. We're looking for a guidance/solution for updating stage-2
>> PTE based on PCI-BAR attribute.
> Before we start discussing the *how*, I'd like to clearly understand
> what *arm64* memory attributes you are relying on. We already have
> established that the unaligned access was a bug, which was the biggest
> argument in favour of NORMAL_NC. What are the other requirements?
Sorry, my earlier response was not complete...

ARMv8 architecture has two features Gathering and Reorder transactions, very
important from a performance point of view. Small inline packets for NIC cards
and accesses to GPU's frame buffer are CPU-bound operations. We want to take
advantages of GRE features to achieve higher performance.

Both these features are disabled for prefetchable BARs in VM because memory-type
MT_DEVICE_nGnRE enforced in stage-2.
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.

2021-04-30 16:59:24

by Vikram Sethi

[permalink] [raw]
Subject: RE: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

Hi Marc,

> -----Original Message-----
> From: Marc Zyngier <[email protected]>
> Sent: Friday, April 30, 2021 10:31 AM
> On Fri, 30 Apr 2021 15:58:14 +0100,
> Shanker R Donthineni <[email protected]> wrote:
> >
> > Hi Marc,
> >
> > On 4/30/21 6:47 AM, Marc Zyngier wrote:
> > >
> > >>>> We've two concerns here:
> > >>>> - Performance impacts for pass-through devices.
> > >>>> - The definition of ioremap_wc() function doesn't match the
> > >>>> host kernel on ARM64
> > >>> Performance I can understand, but I think you're also using it to
> > >>> mask a driver bug which should be resolved first. Thank
> > >> We’ve already instrumented the driver code and found the code path
> > >> for the unaligned accesses. We’ll fix this issue if it’s not
> > >> following WC semantics.
> > >>
> > >> Fixing the performance concern will be under KVM stage-2 page-table
> > >> control. We're looking for a guidance/solution for updating stage-2
> > >> PTE based on PCI-BAR attribute.
> > > Before we start discussing the *how*, I'd like to clearly understand
> > > what *arm64* memory attributes you are relying on. We already have
> > > established that the unaligned access was a bug, which was the
> > > biggest argument in favour of NORMAL_NC. What are the other
> requirements?
> > Sorry, my earlier response was not complete...
> >
> > ARMv8 architecture has two features Gathering and Reorder
> > transactions, very important from a performance point of view. Small
> > inline packets for NIC cards and accesses to GPU's frame buffer are
> > CPU-bound operations. We want to take advantages of GRE features to
> > achieve higher performance.
> >
> > Both these features are disabled for prefetchable BARs in VM because
> > memory-type MT_DEVICE_nGnRE enforced in stage-2.
>
> Right, so Normal_NC is a red herring, and it is Device_GRE that you really are
> after, right?
>
I think Device GRE has some practical problems.
1. A lot of userspace code which is used to getting write combined mappings
to GPU memory from kernel drivers does memcpy/memset on it which
can insert ldp/stp which can crash on Device Memory Type. From a quick search
I didn't find a memcpy_io or memset_io in glibc. Perhaps there are some
other functions available, but a lot of userspace applications that work on x86 and
ARM baremetal won't work on ARM VMs without such changes. Changes to all of
userspace may not always be practical, specially if linking to binaries

2. Sometimes even if application is not using memset/memcpy directly,
gcc may insert a builtin memcpy/memset.

3. Recompiling all applications with gcc -m strict-align has performance issues.
In our experiments that resulted in an increase in code size, and also 3-5%
performance decrease reliably.
Also, it is not always practical to recompile all of userspace, depending on
who owns the code/linked binaries etc.

From KVM-ARM point of view, what is it about Normal NC at stage 2 for
Prefetchable BAR (however KVM gets the hint, whether from userspace or VMA)
that is undesirable vs Device GRE? I couldn't think of a difference to devices
whether the combining or prefetching or reordering happened because of one or
the other.

> Now, I'm not convinced that we can do that directly from vfio in a device-
> agnostic manner. It is userspace that places the device in the guest's
> memory, and I have the ugly feeling that userspace needs to be in control of
> memory attributes.
>
> Otherwise, we change the behaviour for all existing devices that have
> prefetchable BARs, and I don't think that's an acceptable move (userspace
> ABI change).
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.

2021-05-01 09:33:31

by Marc Zyngier

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

Hi Vikram,

On Fri, 30 Apr 2021 17:57:14 +0100,
Vikram Sethi <[email protected]> wrote:
>
> Hi Marc,
>
> > -----Original Message-----
> > From: Marc Zyngier <[email protected]>
> > Sent: Friday, April 30, 2021 10:31 AM
> > On Fri, 30 Apr 2021 15:58:14 +0100,
> > Shanker R Donthineni <[email protected]> wrote:
> > >
> > > Hi Marc,
> > >
> > > On 4/30/21 6:47 AM, Marc Zyngier wrote:
> > > >
> > > >>>> We've two concerns here:
> > > >>>> - Performance impacts for pass-through devices.
> > > >>>> - The definition of ioremap_wc() function doesn't match the
> > > >>>> host kernel on ARM64
> > > >>> Performance I can understand, but I think you're also using it to
> > > >>> mask a driver bug which should be resolved first. Thank
> > > >> We’ve already instrumented the driver code and found the code path
> > > >> for the unaligned accesses. We’ll fix this issue if it’s not
> > > >> following WC semantics.
> > > >>
> > > >> Fixing the performance concern will be under KVM stage-2 page-table
> > > >> control. We're looking for a guidance/solution for updating stage-2
> > > >> PTE based on PCI-BAR attribute.
> > > > Before we start discussing the *how*, I'd like to clearly understand
> > > > what *arm64* memory attributes you are relying on. We already have
> > > > established that the unaligned access was a bug, which was the
> > > > biggest argument in favour of NORMAL_NC. What are the other
> > requirements?
> > > Sorry, my earlier response was not complete...
> > >
> > > ARMv8 architecture has two features Gathering and Reorder
> > > transactions, very important from a performance point of view. Small
> > > inline packets for NIC cards and accesses to GPU's frame buffer are
> > > CPU-bound operations. We want to take advantages of GRE features to
> > > achieve higher performance.
> > >
> > > Both these features are disabled for prefetchable BARs in VM because
> > > memory-type MT_DEVICE_nGnRE enforced in stage-2.
> >
> > Right, so Normal_NC is a red herring, and it is Device_GRE that
> > you really are after, right?
> >
> I think Device GRE has some practical problems.
> 1. A lot of userspace code which is used to getting write combined
> mappings to GPU memory from kernel drivers does memcpy/memset on it
> which can insert ldp/stp which can crash on Device Memory Type. From
> a quick search I didn't find a memcpy_io or memset_io in
> glibc. Perhaps there are some other functions available, but a lot
> of userspace applications that work on x86 and ARM baremetal won't
> work on ARM VMs without such changes. Changes to all of userspace
> may not always be practical, specially if linking to binaries

This seems to go against what Alex was hinting at earlier, which is
that unaligned accesses were not expected on prefetchable regions, and
Shanker latter confirming that it was an actual bug. Where do we stand
here?

>
> 2. Sometimes even if application is not using memset/memcpy directly,
> gcc may insert a builtin memcpy/memset.
>
> 3. Recompiling all applications with gcc -m strict-align has
> performance issues. In our experiments that resulted in an increase
> in code size, and also 3-5% performance decrease reliably. Also, it
> is not always practical to recompile all of userspace, depending on
> who owns the code/linked binaries etc.
>
> From KVM-ARM point of view, what is it about Normal NC at stage 2
> for Prefetchable BAR (however KVM gets the hint, whether from
> userspace or VMA) that is undesirable vs Device GRE? I couldn't
> think of a difference to devices whether the combining or
> prefetching or reordering happened because of one or the other.

The problem I see is that we have VM and userspace being written in
terms of Write-Combine, which is:

- loosely defined even on x86

- subject to interpretations in the way it maps to PCI

- has no direct equivalent in the ARMv8 collection of memory
attributes (and Normal_NC comes with speculation capabilities which
strikes me as extremely undesirable on arbitrary devices)

How do we translate this into something consistent? I'd like to see an
actual description of what we *really* expect from WC on prefetchable
PCI regions, turn that into a documented definition agreed across
architectures, and then we can look at implementing it with one memory
type or another on arm64.

Because once we expose that memory type at S2 for KVM guests, it
becomes ABI and there is no turning back. So I want to get it right
once and for all.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2021-05-01 11:52:33

by Shanker Donthineni

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

Hi Marc,

On 5/1/21 4:30 AM, Marc Zyngier wrote:
>> I think Device GRE has some practical problems.
>> 1. A lot of userspace code which is used to getting write combined
>> mappings to GPU memory from kernel drivers does memcpy/memset on it
>> which can insert ldp/stp which can crash on Device Memory Type. From
>> a quick search I didn't find a memcpy_io or memset_io in
>> glibc. Perhaps there are some other functions available, but a lot
>> of userspace applications that work on x86 and ARM baremetal won't
>> work on ARM VMs without such changes. Changes to all of userspace
>> may not always be practical, specially if linking to binaries
> This seems to go against what Alex was hinting at earlier, which is
> that unaligned accesses were not expected on prefetchable regions, and
> Shanker latter confirming that it was an actual bug. Where do we stand
> here?
>
We agreed to call it a driver bug if it's not following Linux write-combining
API ioremap_wc() semantics. So far I didn't find whether unaligned accesses
allowed or not for WC regions explicitly in Linux documentation. Page faults
due to driver unaligned accesses in kernel space will be under driver control,
we'll fix it.

Driver uses the architecture agnostic functions that are available in the Linux
kernel and expecting the same behavior in VM vs Baremetal. We would like
to keep the driver implementation is architecture-independent as much as
possible and support VM unaware. For ARM64, VM's ioremap_wc() definition
doesn't match baremetal.

We don't have any control over the userspace applications/drivers/libraries as
Vikram saying. Another example GCC memset() function uses 'DC ZVA' which
triggers an alignment fault if the actual memory type is device_xxx.



2021-05-02 17:59:30

by Vikram Sethi

[permalink] [raw]
Subject: RE: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

Hi Marc,

> From: Marc Zyngier <[email protected]>
> Hi Vikram,
>

> The problem I see is that we have VM and userspace being written in terms
> of Write-Combine, which is:
>
> - loosely defined even on x86
>
> - subject to interpretations in the way it maps to PCI
>
> - has no direct equivalent in the ARMv8 collection of memory
> attributes (and Normal_NC comes with speculation capabilities which
> strikes me as extremely undesirable on arbitrary devices)

If speculation with Normal NC to prefetchable BARs in devices was a problem,
those devices would already be broken in baremetal with ioremap_wc on arm64,
and we would need quirks there to not do Normal NC for them but Device GRE,
and if such a quirk was needed on baremetal, it could be picked up by vfio/KVM
as well. But we haven't seen any broken devices doing wc on baremetal on ARM64, have we?
I know we have tested NICs write combining on arm64 in baremetal, as well as GPU
and NVMe CMB without issues.

Further, I don't see why speculation to non cacheble would be an issue if prefetch
without side effects is allowed by the device, which is what a prefetchable BAR is.
If it is an issue for a device I would consider that a bug already needing a quirk in
Baremetal/host kernel already.
From PCI spec " A prefetchable address range may have write side effects,
but it may not have read side effects."
>
> How do we translate this into something consistent? I'd like to see an actual
> description of what we *really* expect from WC on prefetchable PCI regions,
> turn that into a documented definition agreed across architectures, and then
> we can look at implementing it with one memory type or another on arm64.
>
> Because once we expose that memory type at S2 for KVM guests, it
> becomes ABI and there is no turning back. So I want to get it right once and
> for all.
>
I agree that we need a precise definition for the Linux ioremap_wc API wrt what
drivers (kernel and userspace) can expect and whether memset/memcpy is expected
to work or not and whether aligned accesses are a requirement.
To the extent ABI is set, I would think that the ABI is also already set in the host kernel
for arm64 WC = Normal NC, so why should that not also be the ABI for same driver in VMs.

> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.

2021-05-03 07:03:19

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC 0/2] [RFC] Honor PCI prefetchable attributes for a virtual machine on ARM64

On Thu, Apr 29, 2021 at 11:29:04AM -0500, Shanker Donthineni wrote:
> Problem statement: Virtual machine crashes when NVIDIA GPU driver access a prefetchable BAR space due to the unaligned reads/writes for pass-through devices. The same binary works fine as expected in the host kernel. Only one BAR has control & status registers (CSR) and other PCI BARs are marked as prefetchable. NVIDIA GPU driver uses the write-combine feature for mapping the prefetchable BARs to improve performance. This problem applies to all other drivers which want to enable WC.

Unless you mean the noveau drivers this simply does not matter. Please
don't spam the kernel lists with issues with your broken and license
violating drivers.

2021-05-03 10:53:26

by Marc Zyngier

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

Hi Vikram,

On Sun, 02 May 2021 18:56:31 +0100,
Vikram Sethi <[email protected]> wrote:
>
> Hi Marc,
>
> > From: Marc Zyngier <[email protected]>
> > Hi Vikram,
> >
>
> > The problem I see is that we have VM and userspace being written in terms
> > of Write-Combine, which is:
> >
> > - loosely defined even on x86
> >
> > - subject to interpretations in the way it maps to PCI
> >
> > - has no direct equivalent in the ARMv8 collection of memory
> > attributes (and Normal_NC comes with speculation capabilities which
> > strikes me as extremely undesirable on arbitrary devices)
>
> If speculation with Normal NC to prefetchable BARs in devices was a
> problem, those devices would already be broken in baremetal with
> ioremap_wc on arm64, and we would need quirks there to not do Normal
> NC for them but Device GRE, and if such a quirk was needed on
> baremetal, it could be picked up by vfio/KVM as well. But we haven't
> seen any broken devices doing wc on baremetal on ARM64, have we?

The lack of evidence does not equate to a proof, and your devices not
misbehaving doesn't mean it is the right thing, specially when we have
such a wide range of CPU and interconnect implementation. Which is why
I really want an answer at the architecture level. Not a "it works for
me" type of answer.

Furthermore, as I replied to Shanker in a separate email, what
Linux/arm64 does is pretty much irrelevant. KVM/arm64 implements the
ARMv8 architecture, and it is at that level that we need to solve the
problem.

If, by enumerating the properties of Prefetchable, you can show that
they are a strict superset of Normal_NC, I'm on board. I haven't seen
such an enumeration so far.

> I know we have tested NICs write combining on arm64 in baremetal, as
> well as GPU and NVMe CMB without issues.
>
> Further, I don't see why speculation to non cacheble would be an
> issue if prefetch without side effects is allowed by the device,
> which is what a prefetchable BAR is.
> If it is an issue for a device I would consider that a bug already needing a quirk in
> Baremetal/host kernel already.
> From PCI spec " A prefetchable address range may have write side effects,
> but it may not have read side effects."

Right, so we have made a small step in the direction of mapping
"prefetchable" onto "Normal_NC", thanks for that. What about all the
other properties (unaligned accesses, ordering, gathering)?

> > How do we translate this into something consistent? I'd like to see an actual
> > description of what we *really* expect from WC on prefetchable PCI regions,
> > turn that into a documented definition agreed across architectures, and then
> > we can look at implementing it with one memory type or another on arm64.
> >
> > Because once we expose that memory type at S2 for KVM guests, it
> > becomes ABI and there is no turning back. So I want to get it right once and
> > for all.
> >
> I agree that we need a precise definition for the Linux ioremap_wc
> API wrt what drivers (kernel and userspace) can expect and whether
> memset/memcpy is expected to work or not and whether aligned
> accesses are a requirement.
> To the extent ABI is set, I would think that the ABI is also already
> set in the host kernel for arm64 WC = Normal NC, so why should that
> not also be the ABI for same driver in VMs.

KVM is an implementation of the ARM architecture, and doesn't really
care about what WC is. If we come to the conclusion that Normal_NC is
the natural match for Prefetchable attributes, than we're good and we
can have Normal_NC being set by userspace, or even VFIO. But I don't
want to set it only because "it works when bare-metal Linux uses it".
Remember KVM doesn't only run Linux as guests.

M.

--
Without deviation from the norm, progress is not possible.

2021-05-03 13:44:03

by Shanker Donthineni

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

Hi Marc,

On 5/3/21 4:50 AM, Marc Zyngier wrote:
> You are mixing two things: what Linux/arm64 gives to kernel drivers,
> and what KVM, as an implementation of the ARMv8 architecture, gives to
> virtual machines. There is zero reason for the two to match if there
> is no definition of what we need to provide.
Suggestion here is memory-types PROT_NORMAL_NC and PROT_DEVICE_nGnRE
are applicable to braemetal only and don't use for device memory inside VM.

2021-05-03 18:07:52

by Mark Kettenis

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

> Date: Mon, 03 May 2021 11:17:23 +0100
> From: Marc Zyngier <[email protected]>
>
> Hi Vikram,
>
> On Sun, 02 May 2021 18:56:31 +0100,
> Vikram Sethi <[email protected]> wrote:
> >
> > Hi Marc,
> >
> > > From: Marc Zyngier <[email protected]>
> > > Hi Vikram,
> > >
> >
> > > The problem I see is that we have VM and userspace being written in terms
> > > of Write-Combine, which is:
> > >
> > > - loosely defined even on x86
> > >
> > > - subject to interpretations in the way it maps to PCI
> > >
> > > - has no direct equivalent in the ARMv8 collection of memory
> > > attributes (and Normal_NC comes with speculation capabilities which
> > > strikes me as extremely undesirable on arbitrary devices)
> >
> > If speculation with Normal NC to prefetchable BARs in devices was a
> > problem, those devices would already be broken in baremetal with
> > ioremap_wc on arm64, and we would need quirks there to not do Normal
> > NC for them but Device GRE, and if such a quirk was needed on
> > baremetal, it could be picked up by vfio/KVM as well. But we haven't
> > seen any broken devices doing wc on baremetal on ARM64, have we?

I think the SC2A11 SoC used in the Socionext developerbox counts as
"broken":

https://www.96boards.org/documentation/enterprise/developerbox/support/known-issues.html

I'm not sure my understanding of the issue is 100% correct, but I
believe the firmware workaround described there uses the stage 2
translation tables to map "Normal NC" onto "Device nGRE" or something
even more restricted. Now this hardware may be classified as simply
broken. However...

On hardware based on the NXP LX2160A SoC we're seeing some weird
behaviour when using "Normal NC" mappings with an AMD GPU that
disappear by using "Device nGnRnE" mappings on OpenBSD. No such issue
was observed with hardware based on an Ampere eMAG SoC. I don't fully
understand this issue yet, and it may very well be a bug in OpenBSD
code, but it does show there are potential pitfalls with using "Normal
NC" for mapping prefetchable BARs of PCIe devices.

> The lack of evidence does not equate to a proof, and your devices not
> misbehaving doesn't mean it is the right thing, specially when we have
> such a wide range of CPU and interconnect implementation. Which is why
> I really want an answer at the architecture level. Not a "it works for
> me" type of answer.
>
> Furthermore, as I replied to Shanker in a separate email, what
> Linux/arm64 does is pretty much irrelevant. KVM/arm64 implements the
> ARMv8 architecture, and it is at that level that we need to solve the
> problem.
>
> If, by enumerating the properties of Prefetchable, you can show that
> they are a strict superset of Normal_NC, I'm on board. I haven't seen
> such an enumeration so far.
>
> > I know we have tested NICs write combining on arm64 in baremetal, as
> > well as GPU and NVMe CMB without issues.
> >
> > Further, I don't see why speculation to non cacheble would be an
> > issue if prefetch without side effects is allowed by the device,
> > which is what a prefetchable BAR is.
> > If it is an issue for a device I would consider that a bug already needing a quirk in
> > Baremetal/host kernel already.
> > From PCI spec " A prefetchable address range may have write side effects,
> > but it may not have read side effects."
>
> Right, so we have made a small step in the direction of mapping
> "prefetchable" onto "Normal_NC", thanks for that. What about all the
> other properties (unaligned accesses, ordering, gathering)?

On x86 WC:

1. Is not cached (but stores are buffered).

2. Allows unaligned access just like normal memory.

3. Allows speculative reads.

4. Has weaker ordering than normal memory; [lsm]fence instructions are
needed to guarantee a particular ordering of writes with respect to
other writes and reads.

5. Stores are buffered. This buffer isn't snooped so it has to be
flushed before changes are globally visible. The [sm]fence
instructions flush the store buffer.

6. The store buffer may combine multiple writes into a single write.

Now whether the fact the unaligned access is allowed is really part of
the semantics of WC mappings is debatable as x86 always allows
unaligned access, even for areas mapped with ioremap().

However, this is where userland comes in. The userland graphics stack
does assume that graphics memory mapped throug a prefetchable PCIe BAR
allows unaligned access if the architecture allows unaligned access
for cacheable memory. On arm64 this means that such memory needs to
be "Normal NC". And since kernel drivers tend to map such memory
using ioremap_wc() that pretty much implies ioremap_wc() shoul use
"Normal NC" as well isn't it?

> > > How do we translate this into something consistent? I'd like to
> > > see an actual description of what we *really* expect from WC on
> > > prefetchable PCI regions, turn that into a documented definition
> > > agreed across architectures, and then we can look at
> > > implementing it with one memory type or another on arm64.
> > >
> > > Because once we expose that memory type at S2 for KVM guests, it
> > > becomes ABI and there is no turning back. So I want to get it
> > > right once and for all.
> > >
> > I agree that we need a precise definition for the Linux ioremap_wc
> > API wrt what drivers (kernel and userspace) can expect and whether
> > memset/memcpy is expected to work or not and whether aligned
> > accesses are a requirement.
> > To the extent ABI is set, I would think that the ABI is also already
> > set in the host kernel for arm64 WC = Normal NC, so why should that
> > not also be the ABI for same driver in VMs.
>
> KVM is an implementation of the ARM architecture, and doesn't really
> care about what WC is. If we come to the conclusion that Normal_NC is
> the natural match for Prefetchable attributes, than we're good and we
> can have Normal_NC being set by userspace, or even VFIO. But I don't
> want to set it only because "it works when bare-metal Linux uses it".
> Remember KVM doesn't only run Linux as guests.
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>

2021-05-03 18:09:08

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

On Mon, 3 May 2021 13:59:43 +0000
Vikram Sethi <[email protected]> wrote:

> > From: Mark Kettenis <[email protected]>
> > > From: Marc Zyngier <[email protected]>
>
> snip
> > > If, by enumerating the properties of Prefetchable, you can show that
> > > they are a strict superset of Normal_NC, I'm on board. I haven't seen
> > > such an enumeration so far.
> > >
> snip
> > > Right, so we have made a small step in the direction of mapping
> > > "prefetchable" onto "Normal_NC", thanks for that. What about all the
> > > other properties (unaligned accesses, ordering, gathering)?
> >
> Regarding gathering/write combining, that is also allowed to prefetchable per PCI spec

As others have stated, gather/write combining itself is not well
defined.

> From 1.3.2.2 of 5/0 base spec:
> A PCI Express Endpoint requesting memory resources through a BAR must set the BAR's Prefetchable bit unless
> the range contains locations with read side-effects or locations in which the Function does not tolerate write
> merging.

"write merging" This is a very specific thing, per PCI 3.0, 3.2.6:

Byte Merging – occurs when a sequence of individual memory writes
(bytes or words) are merged into a single DWORD.

The semantics suggest quadword support in addition to dword, but don't
require it. Writes to bytes within a dword can be merged, but
duplicate writes cannot.

It seems like an extremely liberal application to suggest that this one
write semantic encompasses full write combining semantics, which itself
is not clearly defined.

> Further 7.5.1.2.1 says " A Function is permitted
> to mark a range as prefetchable if there are no side effects on reads, the Function returns all bytes on reads regardless of
> the byte enables, and host bridges can merge processor writes into this range139 without causing errors"
>
> The "regardless of byte enables" suggests to me that unaligned is OK, as only
> certain byte enables may be set, what do you think?
>
> So to me prefetchable in PCIe spec allows for write combining, read without

Ironically here, the above PCI spec section defining write merging has
separate sections for "combining", "merging", and "collapsing". Only
merging is indicated as a requirement for prefetchable resources.

> sideeffect (prefetch/speculative as long as uncached), and unaligned. Regarding
> ordering I didn't find a statement one way or other in PCIe prefetchable definition, but
> I think that goes beyond what PCIe says or doesn't say anyway since reordering can
> also happen in the CPU, and since driver must be aware of correctness issues in its
> producer/consumer models it will need the right barriers where they are required
> for correctness anyway (required for the driver/userspace to work on host w/ ioremap_wc).

A lot of hand waving here, imo. Thanks,

Alex

2021-05-03 18:10:02

by Vikram Sethi

[permalink] [raw]
Subject: RE: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA



> From: Mark Kettenis <[email protected]>
> > From: Marc Zyngier <[email protected]>

snip
> > If, by enumerating the properties of Prefetchable, you can show that
> > they are a strict superset of Normal_NC, I'm on board. I haven't seen
> > such an enumeration so far.
> >
snip
> > Right, so we have made a small step in the direction of mapping
> > "prefetchable" onto "Normal_NC", thanks for that. What about all the
> > other properties (unaligned accesses, ordering, gathering)?
>
Regarding gathering/write combining, that is also allowed to prefetchable per PCI spec
From 1.3.2.2 of 5/0 base spec:
A PCI Express Endpoint requesting memory resources through a BAR must set the BAR's Prefetchable bit unless
the range contains locations with read side-effects or locations in which the Function does not tolerate write
merging.
Further 7.5.1.2.1 says " A Function is permitted
to mark a range as prefetchable if there are no side effects on reads, the Function returns all bytes on reads regardless of
the byte enables, and host bridges can merge processor writes into this range139 without causing errors"

The "regardless of byte enables" suggests to me that unaligned is OK, as only
certain byte enables may be set, what do you think?

So to me prefetchable in PCIe spec allows for write combining, read without
sideeffect (prefetch/speculative as long as uncached), and unaligned. Regarding
ordering I didn't find a statement one way or other in PCIe prefetchable definition, but
I think that goes beyond what PCIe says or doesn't say anyway since reordering can
also happen in the CPU, and since driver must be aware of correctness issues in its
producer/consumer models it will need the right barriers where they are required
for correctness anyway (required for the driver/userspace to work on host w/ ioremap_wc).

But perhaps the bigger question is since WC doesn't exist as a Memory type
on armv8, yet we are trying to fit something onto ioremap_wc which came from
x86 world, shouldn't the arm64 MT we use for WC match the semantics of
whatever drivers and userspace expected from ioremap_wc as defined on
x86, which as Mark notes below includes unaligned? If we agree to that,
we can codify it in the documentation of ioremap_wc and allow for
Normal NC on arm64 for ioremap_wc in host or guest.
Beyond that, if we don't want to do it automatically based on prefetchable
but from explicit call from userspace is fine too.

> On x86 WC:
>
> 1. Is not cached (but stores are buffered).
>
> 2. Allows unaligned access just like normal memory.
>
> 3. Allows speculative reads.
>
> 4. Has weaker ordering than normal memory; [lsm]fence instructions are
> needed to guarantee a particular ordering of writes with respect to
> other writes and reads.
>
> 5. Stores are buffered. This buffer isn't snooped so it has to be
> flushed before changes are globally visible. The [sm]fence
> instructions flush the store buffer.
>
> 6. The store buffer may combine multiple writes into a single write.
>
> Now whether the fact the unaligned access is allowed is really part of the
> semantics of WC mappings is debatable as x86 always allows unaligned
> access, even for areas mapped with ioremap().
>
> However, this is where userland comes in. The userland graphics stack does
> assume that graphics memory mapped throug a prefetchable PCIe BAR
> allows unaligned access if the architecture allows unaligned access for
> cacheable memory. On arm64 this means that such memory needs to be
> "Normal NC". And since kernel drivers tend to map such memory using
> ioremap_wc() that pretty much implies ioremap_wc() shoul use "Normal NC"
> as well isn't it?
>
> > > > How do we translate this into something consistent? I'd like to
> > > > see an actual description of what we *really* expect from WC on
> > > > prefetchable PCI regions, turn that into a documented definition
> > > > agreed across architectures, and then we can look at implementing
> > > > it with one memory type or another on arm64.
> > > >
> > > > Because once we expose that memory type at S2 for KVM guests, it
> > > > becomes ABI and there is no turning back. So I want to get it
> > > > right once and for all.
> > > >
> > > I agree that we need a precise definition for the Linux ioremap_wc
> > > API wrt what drivers (kernel and userspace) can expect and whether
> > > memset/memcpy is expected to work or not and whether aligned
> > > accesses are a requirement.
> > > To the extent ABI is set, I would think that the ABI is also already
> > > set in the host kernel for arm64 WC = Normal NC, so why should that
> > > not also be the ABI for same driver in VMs.
> >
> > KVM is an implementation of the ARM architecture, and doesn't really
> > care about what WC is. If we come to the conclusion that Normal_NC is
> > the natural match for Prefetchable attributes, than we're good and we
> > can have Normal_NC being set by userspace, or even VFIO. But I don't
> > want to set it only because "it works when bare-metal Linux uses it".
> > Remember KVM doesn't only run Linux as guests.
> >
> > M.
> >
> > --
> > Without deviation from the norm, progress is not possible.
> >
> > _______________________________________________
> > linux-arm-kernel mailing list
> > [email protected]
> > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> >

2021-05-03 22:04:45

by Vikram Sethi

[permalink] [raw]
Subject: RE: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

Hi Alex,
> From: Alex Williamson <[email protected]>
> On Mon, 3 May 2021 13:59:43 +0000
> Vikram Sethi <[email protected]> wrote:
> > > From: Mark Kettenis <[email protected]>
> > > > From: Marc Zyngier <[email protected]>
> >
> > snip
> > > > If, by enumerating the properties of Prefetchable, you can show
> > > > that they are a strict superset of Normal_NC, I'm on board. I
> > > > haven't seen such an enumeration so far.
> > > >
> > snip
> > > > Right, so we have made a small step in the direction of mapping
> > > > "prefetchable" onto "Normal_NC", thanks for that. What about all
> > > > the other properties (unaligned accesses, ordering, gathering)?
> > >
> > Regarding gathering/write combining, that is also allowed to
> > prefetchable per PCI spec
>
> As others have stated, gather/write combining itself is not well defined.
>
> > From 1.3.2.2 of 5/0 base spec:
> > A PCI Express Endpoint requesting memory resources through a BAR must
> > set the BAR's Prefetchable bit unless the range contains locations
> > with read side-effects or locations in which the Function does not tolerate
> write merging.
>
> "write merging" This is a very specific thing, per PCI 3.0, 3.2.6:
>
> Byte Merging – occurs when a sequence of individual memory writes
> (bytes or words) are merged into a single DWORD.
>
> The semantics suggest quadword support in addition to dword, but don't
> require it. Writes to bytes within a dword can be merged, but duplicate
> writes cannot.
>
> It seems like an extremely liberal application to suggest that this one write
> semantic encompasses full write combining semantics, which itself is not
> clearly defined.
>
Talking to our PCIe SIG representative, PCIe switches are not allowed do any of the byte
Merging/combining etc as defined in the PCI spec, and per a rather poorly
worded Implementation note in the spec says that no known PCIe Host Briddges/Root
ports do it either.
So for PCIe we don't think believe there is any byte merging that happens in the PCIe
fabric so it's really a matter of what happens in the CPU core and interconnect
before it gets to the PCIe hierarchy.

Stepping back from this patchset, do you agree that it is desirable to support
Write combining as understood by ioremap_wc to work in all ISA guests including
ARMv8?

You note that x86 virtualization doesn't have this issue, but KVM-ARM does
because KVM maps all device BARs as Device Memory type nGnRE which
doesn't allow ioremap_wc from within the guest to get the actual semantics desired.

Marc and others have suggested that userspace should provide the hints. But the
question is how would qemu vfio do this either? We would be stuck in the same
arguments as here, as to what is the correct way to determine the desired attributes
for a given BAR such that eventually when a driver in the guest asks for
ioremap_wc it actually has a chance of working in the guest, in all ISAs.
Do you have any suggestions on how to make progress here?
A device specific list of which BARs are OK to allow ioremap_wc for seems terrible
and I'm not sure if a commandline qemu option is any better. Is the user of device
assignment/sysadmin supposed to know which BAR of which device is OK to allow
ioremap_wc for?

Will/Catalin, perhaps you could explain your thought process on why you chose
Normal NC for ioremap_wc on the armv8 linux port instead of Device GRE or other
Device Gxx.


Thanks,
Vikram

2021-05-04 08:31:43

by Will Deacon

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

On Mon, May 03, 2021 at 10:03:59PM +0000, Vikram Sethi wrote:
> Will/Catalin, perhaps you could explain your thought process on why you chose
> Normal NC for ioremap_wc on the armv8 linux port instead of Device GRE or other
> Device Gxx.

I think a combination of: compatibility with 32-bit Arm, the need to
support unaligned accesses and the potential for higher performance.

Furthermore, ioremap() already gives you a Device memory type, and we're
tight on MAIR space.

Will

2021-05-04 18:08:11

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

On Mon, 3 May 2021 22:03:59 +0000
Vikram Sethi <[email protected]> wrote:

> Hi Alex,
> > From: Alex Williamson <[email protected]>
> > On Mon, 3 May 2021 13:59:43 +0000
> > Vikram Sethi <[email protected]> wrote:
> > > > From: Mark Kettenis <[email protected]>
> > > > > From: Marc Zyngier <[email protected]>
> > >
> > > snip
> > > > > If, by enumerating the properties of Prefetchable, you can show
> > > > > that they are a strict superset of Normal_NC, I'm on board. I
> > > > > haven't seen such an enumeration so far.
> > > > >
> > > snip
> > > > > Right, so we have made a small step in the direction of mapping
> > > > > "prefetchable" onto "Normal_NC", thanks for that. What about all
> > > > > the other properties (unaligned accesses, ordering, gathering)?
> > > >
> > > Regarding gathering/write combining, that is also allowed to
> > > prefetchable per PCI spec
> >
> > As others have stated, gather/write combining itself is not well defined.
> >
> > > From 1.3.2.2 of 5/0 base spec:
> > > A PCI Express Endpoint requesting memory resources through a BAR must
> > > set the BAR's Prefetchable bit unless the range contains locations
> > > with read side-effects or locations in which the Function does not tolerate
> > write merging.
> >
> > "write merging" This is a very specific thing, per PCI 3.0, 3.2.6:
> >
> > Byte Merging – occurs when a sequence of individual memory writes
> > (bytes or words) are merged into a single DWORD.
> >
> > The semantics suggest quadword support in addition to dword, but don't
> > require it. Writes to bytes within a dword can be merged, but duplicate
> > writes cannot.
> >
> > It seems like an extremely liberal application to suggest that this one write
> > semantic encompasses full write combining semantics, which itself is not
> > clearly defined.
> >
> Talking to our PCIe SIG representative, PCIe switches are not allowed do any of the byte
> Merging/combining etc as defined in the PCI spec, and per a rather poorly
> worded Implementation note in the spec says that no known PCIe Host Briddges/Root
> ports do it either.
> So for PCIe we don't think believe there is any byte merging that happens in the PCIe
> fabric so it's really a matter of what happens in the CPU core and interconnect
> before it gets to the PCIe hierarchy.

Yes, but merged writes, no matter where they happen, are still the only
type of write combining that a prefetchable BAR on an endpoint is
required to support.

> Stepping back from this patchset, do you agree that it is desirable to support
> Write combining as understood by ioremap_wc to work in all ISA guests including
> ARMv8?

Yes, a userspace vfio driver should be able to take advantage of the
hardware capabilities. I think where we disagree is whether it's
universally safe to assume write combining based on the PCI
prefetchable capability of a BAR. If that's something that can be
assumed universally for ARMv8 based on the architecture specification
compatibility with the PCI definition of a prefetchable BAR, then I
would expect a helper somewhere in arch code that returns the right
page protection flags, so that arch maintainers don't need to scour
device drivers for architecture hacks. Otherwise, it needs to be
exposed through the vfio uAPI to allow the userspace device driver
itself to select these semantics.

> You note that x86 virtualization doesn't have this issue, but KVM-ARM does
> because KVM maps all device BARs as Device Memory type nGnRE which
> doesn't allow ioremap_wc from within the guest to get the actual semantics desired.
>
> Marc and others have suggested that userspace should provide the hints. But the
> question is how would qemu vfio do this either? We would be stuck in the same
> arguments as here, as to what is the correct way to determine the desired attributes
> for a given BAR such that eventually when a driver in the guest asks for
> ioremap_wc it actually has a chance of working in the guest, in all ISAs.
> Do you have any suggestions on how to make progress here?

We do need some way for userspace drivers to also make use of WC
semantics, there were some discussions in the past, I think others have
referenced them as well, but nothing has been proposed for a vfio API.

If we had that API, QEMU deciding to universally enable WC for all
vfio prefetchable BARs seems only marginally better than this approach.
Ultimately the mapping should be based on the guest driver semantics,
and if you don't have any visibility to that on KVM/arm like we have on
KVM/x86, then it seems like there's nothing to trigger a vfio API here
anyway.

If that's the case, I'd probably go back to letting the arch/arm64 folks
declare that WC is compatible with the definition of PCI prefetchable
and export some sort of pgprot_pci_prefetchable() helper where the
default would be to #define it as pgproc_noncached() #ifndef by the
arch.

> A device specific list of which BARs are OK to allow ioremap_wc for seems terrible
> and I'm not sure if a commandline qemu option is any better. Is the user of device
> assignment/sysadmin supposed to know which BAR of which device is OK to allow
> ioremap_wc for?

No, a device specific userspace driver should know such device
semantics, but QEMU is not such a driver. Burdening the hypervisor
user/admin is not a good solution either. I'd lean on KVM/arm64 folks
to know how the guest driver semantics can be exposed to the
hypervisor. Thanks,

Alex

2021-05-05 21:02:19

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

On Tue, May 04, 2021 at 09:30:05AM +0100, Will Deacon wrote:
> On Mon, May 03, 2021 at 10:03:59PM +0000, Vikram Sethi wrote:
> > Will/Catalin, perhaps you could explain your thought process on why you chose
> > Normal NC for ioremap_wc on the armv8 linux port instead of Device GRE or other
> > Device Gxx.
>
> I think a combination of: compatibility with 32-bit Arm, the need to
> support unaligned accesses and the potential for higher performance.

IIRC the _wc suffix also matches the pgprot_writecombine() used by some
drivers to map a video framebuffer into user space. Accesses to the
framebuffer are not guaranteed to be aligned (memset/memcpy don't ensure
alignment on arm64 and the user doesn't have a memset_io or memcpy_toio).

> Furthermore, ioremap() already gives you a Device memory type, and we're
> tight on MAIR space.

We have MT_DEVICE_GRE currently reserved though no in-kernel user, we
might as well remove it.

--
Catalin

2021-05-06 07:28:32

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

On Wed, May 05, 2021 at 07:02:31PM +0100, Catalin Marinas wrote:
> > Furthermore, ioremap() already gives you a Device memory type, and we're
> > tight on MAIR space.
>
> We have MT_DEVICE_GRE currently reserved though no in-kernel user, we
> might as well remove it.

Please do. The more we can cut down on different memory types, the
better.

2021-05-08 16:35:26

by Shanker Donthineni

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

Hi Marc,

On 5/5/21 1:02 PM, Catalin Marinas wrote:
>>> Will/Catalin, perhaps you could explain your thought process on why you chose
>>> Normal NC for ioremap_wc on the armv8 linux port instead of Device GRE or other
>>> Device Gxx.
>> I think a combination of: compatibility with 32-bit Arm, the need to
>> support unaligned accesses and the potential for higher performance.
> IIRC the _wc suffix also matches the pgprot_writecombine() used by some
> drivers to map a video framebuffer into user space. Accesses to the
> framebuffer are not guaranteed to be aligned (memset/memcpy don't ensure
> alignment on arm64 and the user doesn't have a memset_io or memcpy_toio).
>
>> Furthermore, ioremap() already gives you a Device memory type, and we're
>> tight on MAIR space.
> We have MT_DEVICE_GRE currently reserved though no in-kernel user, we
> might as well remove it.
@Marc, Could you provide your thoughts/guidance for the next step? The
proposal of getting hints for prefetchable regions from VFIO/QEMU is not
recommended, The only option left is to implement ARM64 dependent logic
in KVM.

Option-1: I think we could take advantage of stage-1/2 combining rules to
allow NORMAL_NC memory-type for device memory in VM. Always map
device memory at stage-2 as NORMAL-NC and trust VM's stage-1 MT.

---------------------------------------------------------------
Stage-2 MT     Stage-1 MT    Resultant MT (combining-rules/FWB)
---------------------------------------------------------------
Normal-NC      Normal-WT           Normal-NC
   -           Normal-WB              -
   -           Normal-NC              -
   -           Device-<attr>       Device-<attr>
---------------------------------------------------------------

We've been using this option internally for testing purpose and validated with
NVME/Mellanox/GPU pass-through devices on Marvell-Thundex2 platform.

Option-2: Get resource properties associated with MMIO using lookup_resource()
and map at stage-2 as Normal-NC if IORESOURCE_PREFETCH is set in flags.


2021-06-02 09:57:24

by Marc Zyngier

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

On Tue, 04 May 2021 19:03:48 +0100,
Alex Williamson <[email protected]> wrote:
>
> On Mon, 3 May 2021 22:03:59 +0000
> Vikram Sethi <[email protected]> wrote:
>
> > Hi Alex,
> > > From: Alex Williamson <[email protected]>
> > > On Mon, 3 May 2021 13:59:43 +0000
> > > Vikram Sethi <[email protected]> wrote:
> > > > > From: Mark Kettenis <[email protected]>
> > > > > > From: Marc Zyngier <[email protected]>
> > > >
> > > > snip
> > > > > > If, by enumerating the properties of Prefetchable, you can show
> > > > > > that they are a strict superset of Normal_NC, I'm on board. I
> > > > > > haven't seen such an enumeration so far.
> > > > > >
> > > > snip
> > > > > > Right, so we have made a small step in the direction of mapping
> > > > > > "prefetchable" onto "Normal_NC", thanks for that. What about all
> > > > > > the other properties (unaligned accesses, ordering, gathering)?
> > > > >
> > > > Regarding gathering/write combining, that is also allowed to
> > > > prefetchable per PCI spec
> > >
> > > As others have stated, gather/write combining itself is not well defined.
> > >
> > > > From 1.3.2.2 of 5/0 base spec:
> > > > A PCI Express Endpoint requesting memory resources through a BAR must
> > > > set the BAR's Prefetchable bit unless the range contains locations
> > > > with read side-effects or locations in which the Function does not tolerate
> > > write merging.
> > >
> > > "write merging" This is a very specific thing, per PCI 3.0, 3.2.6:
> > >
> > > Byte Merging – occurs when a sequence of individual memory writes
> > > (bytes or words) are merged into a single DWORD.
> > >
> > > The semantics suggest quadword support in addition to dword, but
> > > don't require it. Writes to bytes within a dword can be merged,
> > > but duplicate writes cannot.
> > >
> > > It seems like an extremely liberal application to suggest that
> > > this one write semantic encompasses full write combining
> > > semantics, which itself is not clearly defined.
> > >
> > Talking to our PCIe SIG representative, PCIe switches are not
> > allowed do any of the byte Merging/combining etc as defined in the
> > PCI spec, and per a rather poorly worded Implementation note in
> > the spec says that no known PCIe Host Briddges/Root ports do it
> > either. So for PCIe we don't think believe there is any byte
> > merging that happens in the PCIe fabric so it's really a matter of
> > what happens in the CPU core and interconnect before it gets to
> > the PCIe hierarchy.
>
> Yes, but merged writes, no matter where they happen, are still the only
> type of write combining that a prefetchable BAR on an endpoint is
> required to support.
>
> > Stepping back from this patchset, do you agree that it is
> > desirable to support Write combining as understood by ioremap_wc
> > to work in all ISA guests including ARMv8?
>
> Yes, a userspace vfio driver should be able to take advantage of the
> hardware capabilities. I think where we disagree is whether it's
> universally safe to assume write combining based on the PCI
> prefetchable capability of a BAR. If that's something that can be
> assumed universally for ARMv8 based on the architecture specification
> compatibility with the PCI definition of a prefetchable BAR, then I
> would expect a helper somewhere in arch code that returns the right
> page protection flags, so that arch maintainers don't need to scour
> device drivers for architecture hacks. Otherwise, it needs to be
> exposed through the vfio uAPI to allow the userspace device driver
> itself to select these semantics.
>
> > You note that x86 virtualization doesn't have this issue, but
> > KVM-ARM does because KVM maps all device BARs as Device Memory
> > type nGnRE which doesn't allow ioremap_wc from within the guest to
> > get the actual semantics desired.
> >
> > Marc and others have suggested that userspace should provide the
> > hints. But the question is how would qemu vfio do this either? We
> > would be stuck in the same arguments as here, as to what is the
> > correct way to determine the desired attributes for a given BAR
> > such that eventually when a driver in the guest asks for
> > ioremap_wc it actually has a chance of working in the guest, in
> > all ISAs. Do you have any suggestions on how to make progress
> > here?
>
> We do need some way for userspace drivers to also make use of WC
> semantics, there were some discussions in the past, I think others have
> referenced them as well, but nothing has been proposed for a vfio API.
>
> If we had that API, QEMU deciding to universally enable WC for all
> vfio prefetchable BARs seems only marginally better than this approach.
> Ultimately the mapping should be based on the guest driver semantics,
> and if you don't have any visibility to that on KVM/arm like we have on
> KVM/x86, then it seems like there's nothing to trigger a vfio API here
> anyway.

There isn't much KVM/arm64 can do here unless it is being told what to
do. We don't have visibility on the guest's page tables in a reliable
way, and trusting them is not something I want to entertain anyway.

> If that's the case, I'd probably go back to letting the arch/arm64 folks
> declare that WC is compatible with the definition of PCI prefetchable
> and export some sort of pgprot_pci_prefetchable() helper where the
> default would be to #define it as pgproc_noncached() #ifndef by the
> arch.
>
> > A device specific list of which BARs are OK to allow ioremap_wc
> > for seems terrible and I'm not sure if a commandline qemu option
> > is any better. Is the user of device assignment/sysadmin supposed
> > to know which BAR of which device is OK to allow ioremap_wc for?
>
> No, a device specific userspace driver should know such device
> semantics, but QEMU is not such a driver. Burdening the hypervisor
> user/admin is not a good solution either. I'd lean on KVM/arm64 folks
> to know how the guest driver semantics can be exposed to the
> hypervisor. Thanks,

I don't see a good way for that, unless we make it a per-guest buy-in
where all PCI prefetchable mappings get the same treatment. I'm
prepared to bet that this will break when two devices will have
different requirements. It would also require userspace to buy into
this scheme though, which is crap.

Exposing the guest's preference on a per-device basis seems difficult
(KVM knows nothing about the PCI devices) and would require some PV
interface that will quickly become unmaintainable.

M.

--
Without deviation from the norm, progress is not possible.

2021-06-02 10:04:52

by Marc Zyngier

[permalink] [raw]
Subject: Re: [RFC 1/2] vfio/pci: keep the prefetchable attribute of a BAR region in VMA

Hi Shanker,

On Sat, 08 May 2021 17:33:11 +0100,
Shanker R Donthineni <[email protected]> wrote:
>
> Hi Marc,
>
> On 5/5/21 1:02 PM, Catalin Marinas wrote:
> >>> Will/Catalin, perhaps you could explain your thought process on why you chose
> >>> Normal NC for ioremap_wc on the armv8 linux port instead of Device GRE or other
> >>> Device Gxx.
> >> I think a combination of: compatibility with 32-bit Arm, the need to
> >> support unaligned accesses and the potential for higher performance.
> > IIRC the _wc suffix also matches the pgprot_writecombine() used by some
> > drivers to map a video framebuffer into user space. Accesses to the
> > framebuffer are not guaranteed to be aligned (memset/memcpy don't ensure
> > alignment on arm64 and the user doesn't have a memset_io or memcpy_toio).
> >
> >> Furthermore, ioremap() already gives you a Device memory type, and we're
> >> tight on MAIR space.
> > We have MT_DEVICE_GRE currently reserved though no in-kernel user, we
> > might as well remove it.
> @Marc, Could you provide your thoughts/guidance for the next step? The
> proposal of getting hints for prefetchable regions from VFIO/QEMU is not
> recommended, The only option left is to implement ARM64 dependent logic
> in KVM.
>
> Option-1: I think we could take advantage of stage-1/2 combining rules to
> allow NORMAL_NC memory-type for device memory in VM. Always map
> device memory at stage-2 as NORMAL-NC and trust VM's stage-1 MT.
>
> ---------------------------------------------------------------
> Stage-2 MT     Stage-1 MT    Resultant MT (combining-rules/FWB)
> ---------------------------------------------------------------
> Normal-NC      Normal-WT           Normal-NC
>    -           Normal-WB              -
>    -           Normal-NC              -
>    -           Device-<attr>       Device-<attr>
> ---------------------------------------------------------------

I think this is unwise.

Will recently debugged a pretty horrible situation when doing exactly
that: when S1 is off and S2 is on, the I-side is allowed to generate
speculative accesses (see ARMv8 ARM G.a D5.2.9 for the details). And
yes, implementations definitely do that. Add side-effect reads to the
mix, and you're in for a treat.

> We've been using this option internally for testing purpose and
> validated with NVME/Mellanox/GPU pass-through devices on
> Marvell-Thundex2 platform.

See above. It *will* break eventually.

> Option-2: Get resource properties associated with MMIO using lookup_resource()
> and map at stage-2 as Normal-NC if IORESOURCE_PREFETCH is set in flags.

That's a pretty roundabout way of doing exactly the same thing you
initially proposed. And it suffers from the exact same problems, which
is that you change the semantics of the mapping without knowing what
the guest's intent is.

M.

--
Without deviation from the norm, progress is not possible.