From: Ankit Agrawal <[email protected]>
Currently, KVM for ARM64 maps memory that is considered device (i.e. it
is not RAM) at stage 2 with DEVICE_nGnRE memory attributes; as per the
ARM architecture [1], this setting overrides any device MMIO mapping
present at stage 1, resulting in a set-up whereby a guest operating
system cannot determine device MMIO mapping memory attributes on its
own: they are always overridden by the KVM stage 2 default.
This set-up does not allow guest operating systems to select device
memory attributes independently of the KVM stage-2 mappings
(refer to [1], "Combining stage 1 and stage 2 memory type attributes"),
which turns out to be an issue in that guest operating systems
(e.g. Linux) may request to map device MMIO regions with memory
attributes that guarantee better performance (e.g. the gathering
attribute, which for some devices can generate larger PCIe memory
write TLPs) and allow specific operations (e.g. unaligned transactions),
such as the Normal-NC memory type.
The default device stage 2 mapping was chosen in KVM for ARM64 since
it was considered safer (i.e. it would not allow guests to trigger
uncontained failures ultimately crashing the machine), but such failures
turned out to be asynchronous (SError) anyway, defeating the purpose.
Failure containment is a property of the platform and is independent
of the memory type used for MMIO device memory mappings.
Actually, the DEVICE_nGnRE memory type is even more problematic than
the Normal-NC memory type in terms of fault containment, in that e.g.
aborts triggered on DEVICE_nGnRE loads cannot, architecturally, be made
synchronous (that would imply that the processor issues at most one
load transaction at a time - it cannot pipeline them - otherwise the
synchronous abort semantics would break the no-speculation attribute
attached to DEVICE_XXX memory).
This means that, regardless of the combined stage1+stage2 mappings, a
platform is safe if and only if device transactions cannot trigger
uncontained failures, which in turn relies on platform capabilities
and on the device type being assigned (i.e. PCIe AER/DPC error
containment and the RAS architecture [3]); therefore the default KVM
device stage 2 memory attributes play no role in making device
assignment safer for a given platform (as long as the platform design
adheres to the design guidelines outlined in [3]) and can be relaxed.
For all these reasons, relax the KVM stage 2 device memory attributes
from DEVICE_nGnRE to Normal-NC. Add a new kvm_pgtable_prot flag for
Normal-NC.
Normal-NC was chosen over a different Normal memory type default at
stage 2 (e.g. Normal Write-Through) to avoid cache allocation/snooping.
Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
trigger any issue on guest device reclaim use cases either (i.e. device
MMIO unmap followed by a device reset), at least for PCIe devices, since
in PCIe a device reset is architected and carried out through PCI config
space transactions that are naturally ordered with respect to MMIO
transactions according to the PCI ordering rules.
Having a Normal-NC S2 default puts guests in control (thanks to the
stage1+stage2 combined memory attributes rules [1]) of device MMIO
region memory mappings, according to the rules described in [1]
and summarized here (S1 = stage 1, S2 = stage 2):
S1           | S2        | Result
NORMAL-WB    | NORMAL-NC | NORMAL-NC
NORMAL-WT    | NORMAL-NC | NORMAL-NC
NORMAL-NC    | NORMAL-NC | NORMAL-NC
DEVICE<attr> | NORMAL-NC | DEVICE<attr>
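For clarity, the table above can be read as the following sketch (an
illustrative helper with made-up names, not part of the patch):

/*
 * Illustrative sketch only (names invented for this example): resulting
 * memory type when stage 2 maps an MMIO region as Normal-NC, per the
 * combining rules in [1].  A Device type at stage 1 is preserved; any
 * Normal type at stage 1 degrades to Normal-NC.
 */
enum s1_mem_type { S1_DEVICE, S1_NORMAL_WB, S1_NORMAL_WT, S1_NORMAL_NC };
enum mem_type_result { RESULT_DEVICE, RESULT_NORMAL_NC };

static enum mem_type_result combine_with_s2_normal_nc(enum s1_mem_type s1)
{
	if (s1 == S1_DEVICE)		/* DEVICE<attr> wins over S2 Normal-NC */
		return RESULT_DEVICE;
	return RESULT_NORMAL_NC;	/* all Normal types collapse to Normal-NC */
}
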
It is worth noting that currently, to map device MMIO space to user
space in a device pass-through use case, the VFIO framework applies
memory attributes derived from pgprot_noncached() settings applied to
VMAs, which results in Device-nGnRnE memory attributes for the stage-1
VMM mappings.
This means that a userspace mapping for device MMIO space carried
out with the current VFIO framework and a guest OS mapping for the same
MMIO space may result in a mismatched alias as described in [2].
Defaulting KVM device stage-2 mappings to Normal-NC attributes does not
change anything in this respect, in that the mismatched aliases would
only affect (refer to [2] for a detailed explanation) the ordering
between the streams of transactions resulting from the userspace and
GuestOS mappings (i.e. it does not cause a loss of properties for either
stream of transactions on its own), which is harmless given that
userspace and GuestOS access to the device is carried out through
independent transaction streams.
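For reference, a minimal sketch of the stage-1 VFIO mapping path
mentioned above, heavily simplified and shown only for illustration
(the real vfio drivers use more bookkeeping and, for vfio-pci, a
fault-based mapping path):

#include <linux/mm.h>

/*
 * Simplified illustration (not actual vfio code): the VMM's stage-1
 * mapping of device MMIO ends up Device-nGnRnE because the vma
 * protection is forced to non-cached before the MMIO pfns are inserted.
 */
static int mmio_mmap_sketch(struct vm_area_struct *vma, unsigned long pfn)
{
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	return remap_pfn_range(vma, vma->vm_start, pfn,
			       vma->vm_end - vma->vm_start,
			       vma->vm_page_prot);
}
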
[1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
[2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
[3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf
Applied over next-20231201
History
=======
v1 -> v2
- Updated commit log to the one posted by
Lorenzo Pieralisi <[email protected]> (Thanks!)
- Added new flag to represent the NORMAL_NC setting. Updated
stage2_set_prot_attr() to handle new flag.
v1 Link:
https://lore.kernel.org/all/[email protected]/
Signed-off-by: Ankit Agrawal <[email protected]>
Suggested-by: Jason Gunthorpe <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Tested-by: Ankit Agrawal <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 2 ++
arch/arm64/include/asm/memory.h | 2 ++
arch/arm64/kvm/hyp/pgtable.c | 11 +++++++++--
arch/arm64/kvm/mmu.c | 4 ++--
4 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index cfdf40f734b1..19278dfe7978 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -197,6 +197,7 @@ enum kvm_pgtable_stage2_flags {
* @KVM_PGTABLE_PROT_W: Write permission.
* @KVM_PGTABLE_PROT_R: Read permission.
* @KVM_PGTABLE_PROT_DEVICE: Device attributes.
+ * @KVM_PGTABLE_PROT_NORMAL_NC: Normal noncacheable attributes.
* @KVM_PGTABLE_PROT_SW0: Software bit 0.
* @KVM_PGTABLE_PROT_SW1: Software bit 1.
* @KVM_PGTABLE_PROT_SW2: Software bit 2.
@@ -208,6 +209,7 @@ enum kvm_pgtable_prot {
KVM_PGTABLE_PROT_R = BIT(2),
KVM_PGTABLE_PROT_DEVICE = BIT(3),
+ KVM_PGTABLE_PROT_NORMAL_NC = BIT(4),
KVM_PGTABLE_PROT_SW0 = BIT(55),
KVM_PGTABLE_PROT_SW1 = BIT(56),
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index fde4186cc387..c247e5f29d5a 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -147,6 +147,7 @@
* Memory types for Stage-2 translation
*/
#define MT_S2_NORMAL 0xf
+#define MT_S2_NORMAL_NC 0x5
#define MT_S2_DEVICE_nGnRE 0x1
/*
@@ -154,6 +155,7 @@
* Stage-2 enforces Normal-WB and Device-nGnRE
*/
#define MT_S2_FWB_NORMAL 6
+#define MT_S2_FWB_NORMAL_NC 5
#define MT_S2_FWB_DEVICE_nGnRE 1
#ifdef CONFIG_ARM64_4K_PAGES
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index c651df904fe3..d4835d553c61 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -718,10 +718,17 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
kvm_pte_t *ptep)
{
bool device = prot & KVM_PGTABLE_PROT_DEVICE;
- kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) :
- KVM_S2_MEMATTR(pgt, NORMAL);
+ bool normal_nc = prot & KVM_PGTABLE_PROT_NORMAL_NC;
+ kvm_pte_t attr;
u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;
+ if (device)
+ attr = KVM_S2_MEMATTR(pgt, DEVICE_nGnRE);
+ else if (normal_nc)
+ attr = KVM_S2_MEMATTR(pgt, NORMAL_NC);
+ else
+ attr = KVM_S2_MEMATTR(pgt, NORMAL);
+
if (!(prot & KVM_PGTABLE_PROT_X))
attr |= KVM_PTE_LEAF_ATTR_HI_S2_XN;
else if (device)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d14504821b79..1cb302457d3f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1071,7 +1071,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
struct kvm_s2_mmu *mmu = &kvm->arch.mmu;
struct kvm_pgtable *pgt = mmu->pgt;
- enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
+ enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_NORMAL_NC |
KVM_PGTABLE_PROT_R |
(writable ? KVM_PGTABLE_PROT_W : 0);
@@ -1558,7 +1558,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
prot |= KVM_PGTABLE_PROT_X;
if (device)
- prot |= KVM_PGTABLE_PROT_DEVICE;
+ prot |= KVM_PGTABLE_PROT_NORMAL_NC;
else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC))
prot |= KVM_PGTABLE_PROT_X;
--
2.17.1
+ Shameer
On Tue, 05 Dec 2023 03:30:15 +0000,
<[email protected]> wrote:
>
> From: Ankit Agrawal <[email protected]>
>
> Currently, KVM for ARM64 maps at stage 2 memory that is considered device
> (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting
> overrides (as per the ARM architecture [1]) any device MMIO mapping
> present at stage 1, resulting in a set-up whereby a guest operating
> system cannot determine device MMIO mapping memory attributes on its
> own but it is always overridden by the KVM stage 2 default.
>
> This set-up does not allow guest operating systems to select device
> memory attributes independently from KVM stage-2 mappings
> (refer to [1], "Combining stage 1 and stage 2 memory type attributes"),
> which turns out to be an issue in that guest operating systems
> (e.g. Linux) may request to map devices MMIO regions with memory
> attributes that guarantee better performance (e.g. gathering
> attribute - that for some devices can generate larger PCIe memory
> writes TLPs) and specific operations (e.g. unaligned transactions)
> such as the NormalNC memory type.
>
> The default device stage 2 mapping was chosen in KVM for ARM64 since
> it was considered safer (i.e. it would not allow guests to trigger
> uncontained failures ultimately crashing the machine) but this
> turned out to be asynchronous (SError) defeating the purpose.
>
> Failures containability is a property of the platform and is independent
> from the memory type used for MMIO device memory mappings.
>
> Actually, DEVICE_nGnRE memory type is even more problematic than
> Normal-NC memory type in terms of faults containability in that e.g.
> aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally,
> synchronous (i.e. that would imply that the processor should issue at
> most 1 load transaction at a time - it cannot pipeline them - otherwise
> the synchronous abort semantics would break the no-speculation attribute
> attached to DEVICE_XXX memory).
>
> This means that regardless of the combined stage1+stage2 mappings a
> platform is safe if and only if device transactions cannot trigger
> uncontained failures and that in turn relies on platform capabilities
> and the device type being assigned (i.e. PCIe AER/DPC error containment
> and RAS architecture[3]); therefore the default KVM device stage 2
> memory attributes play no role in making device assignment safer
> for a given platform (if the platform design adheres to design
> guidelines outlined in [3]) and therefore can be relaxed.
>
> For all these reasons, relax the KVM stage 2 device memory attributes
> from DEVICE_nGnRE to Normal-NC. Add a new kvm_pgtable_prot flag for
> Normal-NC.
>
> The Normal-NC was chosen over a different Normal memory type default
> at stage-2 (e.g. Normal Write-through) to avoid cache allocation/snooping.
>
> Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
> trigger any issue on guest device reclaim use cases either (i.e. device
> MMIO unmap followed by a device reset) at least for PCIe devices, in that
> in PCIe a device reset is architected and carried out through PCI config
> space transactions that are naturally ordered with respect to MMIO
> transactions according to the PCI ordering rules.
>
> Having Normal-NC S2 default puts guests in control (thanks to
> stage1+stage2 combined memory attributes rules [1]) of device MMIO
> regions memory mappings, according to the rules described in [1]
> and summarized here ([(S1) - stage1], [(S2) - stage 2]):
>
> S1 | S2 | Result
> NORMAL-WB | NORMAL-NC | NORMAL-NC
> NORMAL-WT | NORMAL-NC | NORMAL-NC
> NORMAL-NC | NORMAL-NC | NORMAL-NC
> DEVICE<attr> | NORMAL-NC | DEVICE<attr>
>
> It is worth noting that currently, to map devices MMIO space to user
> space in a device pass-through use case the VFIO framework applies memory
> attributes derived from pgprot_noncached() settings applied to VMAs, which
> result in device-nGnRnE memory attributes for the stage-1 VMM mappings.
>
> This means that a userspace mapping for device MMIO space carried
> out with the current VFIO framework and a guest OS mapping for the same
> MMIO space may result in a mismatched alias as described in [2].
>
> Defaulting KVM device stage-2 mappings to Normal-NC attributes does not
> change anything in this respect, in that the mismatched aliases would
> only affect (refer to [2] for a detailed explanation) ordering between
> the userspace and GuestOS mappings resulting stream of transactions
> (i.e. it does not cause loss of property for either stream of
> transactions on its own), which is harmless given that the userspace
> and GuestOS access to the device is carried out through independent
> transactions streams.
>
> [1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
> [2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
Can you please quote the latest specs?
> [3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf
>
> Applied over next-20231201
>
> History
> =======
> v1 -> v2
> - Updated commit log to the one posted by
> Lorenzo Pieralisi <[email protected]> (Thanks!)
> - Added new flag to represent the NORMAL_NC setting. Updated
> stage2_set_prot_attr() to handle new flag.
>
> v1 Link:
> https://lore.kernel.org/all/[email protected]/
>
> Signed-off-by: Ankit Agrawal <[email protected]>
> Suggested-by: Jason Gunthorpe <[email protected]>
> Acked-by: Catalin Marinas <[email protected]>
> Tested-by: Ankit Agrawal <[email protected]>
Despite the considerable increase in the commit message length, a
number of questions are left unanswered:
- Shameer reported a regression on non-FWB systems, breaking device
assignment:
https://lore.kernel.org/all/[email protected]/
How has this been addressed?
- Will had unanswered questions in another part of the thread:
https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/
Can someone please help concluding it?
>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 2 ++
> arch/arm64/include/asm/memory.h | 2 ++
> arch/arm64/kvm/hyp/pgtable.c | 11 +++++++++--
> arch/arm64/kvm/mmu.c | 4 ++--
> 4 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index cfdf40f734b1..19278dfe7978 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -197,6 +197,7 @@ enum kvm_pgtable_stage2_flags {
> * @KVM_PGTABLE_PROT_W: Write permission.
> * @KVM_PGTABLE_PROT_R: Read permission.
> * @KVM_PGTABLE_PROT_DEVICE: Device attributes.
> + * @KVM_PGTABLE_PROT_NORMAL_NC: Normal noncacheable attributes.
> * @KVM_PGTABLE_PROT_SW0: Software bit 0.
> * @KVM_PGTABLE_PROT_SW1: Software bit 1.
> * @KVM_PGTABLE_PROT_SW2: Software bit 2.
> @@ -208,6 +209,7 @@ enum kvm_pgtable_prot {
> KVM_PGTABLE_PROT_R = BIT(2),
>
> KVM_PGTABLE_PROT_DEVICE = BIT(3),
> + KVM_PGTABLE_PROT_NORMAL_NC = BIT(4),
>
> KVM_PGTABLE_PROT_SW0 = BIT(55),
> KVM_PGTABLE_PROT_SW1 = BIT(56),
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index fde4186cc387..c247e5f29d5a 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -147,6 +147,7 @@
> * Memory types for Stage-2 translation
> */
> #define MT_S2_NORMAL 0xf
> +#define MT_S2_NORMAL_NC 0x5
> #define MT_S2_DEVICE_nGnRE 0x1
>
> /*
> @@ -154,6 +155,7 @@
> * Stage-2 enforces Normal-WB and Device-nGnRE
> */
> #define MT_S2_FWB_NORMAL 6
> +#define MT_S2_FWB_NORMAL_NC 5
> #define MT_S2_FWB_DEVICE_nGnRE 1
>
> #ifdef CONFIG_ARM64_4K_PAGES
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index c651df904fe3..d4835d553c61 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -718,10 +718,17 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
> kvm_pte_t *ptep)
> {
> bool device = prot & KVM_PGTABLE_PROT_DEVICE;
> - kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) :
> - KVM_S2_MEMATTR(pgt, NORMAL);
> + bool normal_nc = prot & KVM_PGTABLE_PROT_NORMAL_NC;
> + kvm_pte_t attr;
> u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;
>
> + if (device)
> + attr = KVM_S2_MEMATTR(pgt, DEVICE_nGnRE);
> + else if (normal_nc)
> + attr = KVM_S2_MEMATTR(pgt, NORMAL_NC);
> + else
> + attr = KVM_S2_MEMATTR(pgt, NORMAL);
> +
> if (!(prot & KVM_PGTABLE_PROT_X))
> attr |= KVM_PTE_LEAF_ATTR_HI_S2_XN;
> else if (device)
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index d14504821b79..1cb302457d3f 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1071,7 +1071,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> struct kvm_s2_mmu *mmu = &kvm->arch.mmu;
> struct kvm_pgtable *pgt = mmu->pgt;
> - enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_NORMAL_NC |
> KVM_PGTABLE_PROT_R |
> (writable ? KVM_PGTABLE_PROT_W : 0);
Doesn't this affect the GICv2 VCPU interface, which is effectively a
shared peripheral, now allowing a guest to affect another guest's
interrupt distribution? If that is the case, this needs to be fixed.
In general, I don't think this should be a blanket statement, but be
limited to devices that we presume can deal with this (i.e. PCIe, and
not much else).
>
> @@ -1558,7 +1558,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> prot |= KVM_PGTABLE_PROT_X;
>
> if (device)
> - prot |= KVM_PGTABLE_PROT_DEVICE;
> + prot |= KVM_PGTABLE_PROT_NORMAL_NC;
> else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC))
> prot |= KVM_PGTABLE_PROT_X;
>
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Tue, Dec 05, 2023 at 09:00:15AM +0530, [email protected] wrote:
> From: Ankit Agrawal <[email protected]>
>
> Currently, KVM for ARM64 maps at stage 2 memory that is considered device
> (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting
> overrides (as per the ARM architecture [1]) any device MMIO mapping
> present at stage 1, resulting in a set-up whereby a guest operating
> system cannot determine device MMIO mapping memory attributes on its
> own but it is always overridden by the KVM stage 2 default.
>
> This set-up does not allow guest operating systems to select device
> memory attributes independently from KVM stage-2 mappings
> (refer to [1], "Combining stage 1 and stage 2 memory type attributes"),
> which turns out to be an issue in that guest operating systems
> (e.g. Linux) may request to map devices MMIO regions with memory
> attributes that guarantee better performance (e.g. gathering
> attribute - that for some devices can generate larger PCIe memory
> writes TLPs) and specific operations (e.g. unaligned transactions)
> such as the NormalNC memory type.
>
> The default device stage 2 mapping was chosen in KVM for ARM64 since
> it was considered safer (i.e. it would not allow guests to trigger
> uncontained failures ultimately crashing the machine) but this
> turned out to be asynchronous (SError) defeating the purpose.
>
> Failures containability is a property of the platform and is independent
> from the memory type used for MMIO device memory mappings.
>
> Actually, DEVICE_nGnRE memory type is even more problematic than
> Normal-NC memory type in terms of faults containability in that e.g.
> aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally,
> synchronous (i.e. that would imply that the processor should issue at
> most 1 load transaction at a time - it cannot pipeline them - otherwise
> the synchronous abort semantics would break the no-speculation attribute
> attached to DEVICE_XXX memory).
>
> This means that regardless of the combined stage1+stage2 mappings a
> platform is safe if and only if device transactions cannot trigger
> uncontained failures and that in turn relies on platform capabilities
> and the device type being assigned (i.e. PCIe AER/DPC error containment
> and RAS architecture[3]); therefore the default KVM device stage 2
> memory attributes play no role in making device assignment safer
> for a given platform (if the platform design adheres to design
> guidelines outlined in [3]) and therefore can be relaxed.
>
> For all these reasons, relax the KVM stage 2 device memory attributes
> from DEVICE_nGnRE to Normal-NC. Add a new kvm_pgtable_prot flag for
> Normal-NC.
>
> The Normal-NC was chosen over a different Normal memory type default
> at stage-2 (e.g. Normal Write-through) to avoid cache allocation/snooping.
>
> Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
> trigger any issue on guest device reclaim use cases either (i.e. device
> MMIO unmap followed by a device reset) at least for PCIe devices, in that
> in PCIe a device reset is architected and carried out through PCI config
> space transactions that are naturally ordered with respect to MMIO
> transactions according to the PCI ordering rules.
>
> Having Normal-NC S2 default puts guests in control (thanks to
> stage1+stage2 combined memory attributes rules [1]) of device MMIO
> regions memory mappings, according to the rules described in [1]
> and summarized here ([(S1) - stage1], [(S2) - stage 2]):
>
> S1 | S2 | Result
> NORMAL-WB | NORMAL-NC | NORMAL-NC
> NORMAL-WT | NORMAL-NC | NORMAL-NC
> NORMAL-NC | NORMAL-NC | NORMAL-NC
> DEVICE<attr> | NORMAL-NC | DEVICE<attr>
>
> It is worth noting that currently, to map devices MMIO space to user
> space in a device pass-through use case the VFIO framework applies memory
> attributes derived from pgprot_noncached() settings applied to VMAs, which
> result in device-nGnRnE memory attributes for the stage-1 VMM mappings.
>
> This means that a userspace mapping for device MMIO space carried
> out with the current VFIO framework and a guest OS mapping for the same
> MMIO space may result in a mismatched alias as described in [2].
>
> Defaulting KVM device stage-2 mappings to Normal-NC attributes does not
> change anything in this respect, in that the mismatched aliases would
> only affect (refer to [2] for a detailed explanation) ordering between
> the userspace and GuestOS mappings resulting stream of transactions
> (i.e. it does not cause loss of property for either stream of
> transactions on its own), which is harmless given that the userspace
> and GuestOS access to the device is carried out through independent
> transactions streams.
>
> [1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
> [2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
> [3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf
>
> Applied over next-20231201
>
> History
> =======
> v1 -> v2
> - Updated commit log to the one posted by
> Lorenzo Pieralisi <[email protected]> (Thanks!)
> - Added new flag to represent the NORMAL_NC setting. Updated
> stage2_set_prot_attr() to handle new flag.
>
> v1 Link:
> https://lore.kernel.org/all/[email protected]/
>
> Signed-off-by: Ankit Agrawal <[email protected]>
> Suggested-by: Jason Gunthorpe <[email protected]>
> Acked-by: Catalin Marinas <[email protected]>
In the light of having to keep this relaxation only for PCIe devices, I
will withdraw my Ack until we come to a conclusion.
Thanks.
--
Catalin
+ Lorenzo, he really needs to be on this thread.
On Tue, Dec 05, 2023 at 09:21:28AM +0000, Marc Zyngier wrote:
> On Tue, 05 Dec 2023 03:30:15 +0000,
> <[email protected]> wrote:
> > From: Ankit Agrawal <[email protected]>
> >
> > Currently, KVM for ARM64 maps at stage 2 memory that is considered device
> > (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting
> > overrides (as per the ARM architecture [1]) any device MMIO mapping
> > present at stage 1, resulting in a set-up whereby a guest operating
> > system cannot determine device MMIO mapping memory attributes on its
> > own but it is always overridden by the KVM stage 2 default.
[...]
> Despite the considerable increase in the commit message length, a
> number of questions are left unanswered:
>
> - Shameer reported a regression on non-FWB systems, breaking device
> assignment:
>
> https://lore.kernel.org/all/[email protected]/
This referred to the first patch in the old series which was trying to
make the Stage 2 cacheable based on the vma attributes. That patch has
been dropped for now. It would be good if Shameer confirms this was the
problem.
> - Will had unanswered questions in another part of the thread:
>
> https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/
>
> Can someone please help concluding it?
Is this about reclaiming the device? I think we concluded that we can't
generalise this beyond PCIe, though not sure there was any formal
statement to that thread. The other point Will had was around stating
in the commit message why we only relax this to Normal NC. I haven't
checked the commit message yet, it needs careful reading ;).
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index d14504821b79..1cb302457d3f 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1071,7 +1071,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> > struct kvm_s2_mmu *mmu = &kvm->arch.mmu;
> > struct kvm_pgtable *pgt = mmu->pgt;
> > - enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> > + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_NORMAL_NC |
> > KVM_PGTABLE_PROT_R |
> > (writable ? KVM_PGTABLE_PROT_W : 0);
>
> Doesn't this affect the GICv2 VCPU interface, which is effectively a
> shared peripheral, now allowing a guest to affect another guest's
> interrupt distribution? If that is the case, this needs to be fixed.
>
> In general, I don't think this should be a blanket statement, but be
> limited to devices that we presume can deal with this (i.e. PCIe, and
> not much else).
Based on other on-list and off-line discussions, I came to the same
conclusion that we can't relax this beyond PCIe. How we do this, it's up
for debate. Some ideas:
- something in the vfio-pci driver like a new VM_* flag that KVM can
consume (hard sell for an arch-specific thing)
- changing the vfio-pci driver to allow WC and NC mappings (x86
terminology) with additional knowledge about what it can safely map
as NC. KVM would mimic the vma attributes (it doesn't eliminate the
alias though, the guest can always go for Device while the VMM for
Normal)
- some per-SoC pfn ranges that are permitted as Normal NC. Can the arch
code figure out where these PCIe BARs are or is it only the vfio-pci
driver? I guess we can sort something out around the PCIe but I'm not
familiar with this subsystem. The alternative is some "safe" ranges in
firmware tables, assuming the firmware configures the PCIe BARs
location
--
Catalin
On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote:
> > - Will had unanswered questions in another part of the thread:
> >
> > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/
> >
> > Can someone please help concluding it?
>
> Is this about reclaiming the device? I think we concluded that we can't
> generalise this beyond PCIe, though not sure there was any formal
> statement to that thread. The other point Will had was around stating
> in the commit message why we only relax this to Normal NC. I haven't
> checked the commit message yet, it needs careful reading ;).
Not quite, we said reclaiming is VFIO's problem and if VFIO can't
reliably reclaim a device it shouldn't create it in the first place.
Again, I think a lot of this is trying to take VFIO problems into KVM.
VFIO devices should not exist if they pose a harm to the system. If
VFIO decided to create the devices anyhow (eg admin override or
something) then it is not KVM's job to do any further enforcement.
Remember, the feedback we got from the CPU architects was that even
DEVICE_* will experience an uncontained failure if the device triggers
an error response in shipping ARM IP.
The reason PCIe is safe is because the PCI bridge does not generate
errors in the first place!
Thus, the way a platform device can actually be safe is if it too
never generates errors in the first place! Obviously this approach
works just as well with NORMAL_NC.
If a platform device does generate errors then we shouldn't expect
containment at all, and the memory type has no bearing on the
safety. The correct answer is to block these platform devices from
VFIO/KVM/etc because they can trigger uncontained failures.
If you have learned something different then please share it.
IOW, what is the practical scenario where DEVICE_* has contained
errors but NORMAL_NC does not?
Jason
[+James]
On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote:
[...]
> > - Will had unanswered questions in another part of the thread:
> >
> > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/
> >
> > Can someone please help concluding it?
>
> Is this about reclaiming the device? I think we concluded that we can't
> generalise this beyond PCIe, though not sure there was any formal
> statement to that thread. The other point Will had was around stating
> in the commit message why we only relax this to Normal NC. I haven't
> checked the commit message yet, it needs careful reading ;).
1) Reclaiming the device: I wrote a section that was reported in this
commit log hunk:
"Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
trigger any issue on guest device reclaim use cases either (i.e.
device MMIO unmap followed by a device reset) at least for PCIe
devices, in that in PCIe a device reset is architected and carried
out through PCI config space transactions that are naturally ordered
with respect to MMIO transactions according to the PCI ordering
rules."
It is not too verbose on purpose - I thought it was too
complicated to express the details in a commit log, but
we can elaborate on that if it is beneficial, probably
in /Documentation, not in the log itself.
2) On FWB: I added a full section to the log I posted here:
https://lore.kernel.org/all/ZUz78gFPgMupew+m@lpieralisi
but I think Ankit trimmed it and I certainly can't blame anyone
for that - it is a commit log and it is already hard to digest as it is.
I added James because I think that most of the points I made in the logs
are really RAS related (and without him I would not have understood half
of them :)). It would be very beneficial to everyone to have those
added to kernel documentation - especially to express in architectural
terms why this is a safe change to make (at least for PCIe devices).
I am happy to work on the documentation changes - let's just agree what
should be done next.
Thanks,
Lorenzo
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index d14504821b79..1cb302457d3f 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -1071,7 +1071,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > > struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> > > struct kvm_s2_mmu *mmu = &kvm->arch.mmu;
> > > struct kvm_pgtable *pgt = mmu->pgt;
> > > - enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> > > + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_NORMAL_NC |
> > > KVM_PGTABLE_PROT_R |
> > > (writable ? KVM_PGTABLE_PROT_W : 0);
> >
> > Doesn't this affect the GICv2 VCPU interface, which is effectively a
> > shared peripheral, now allowing a guest to affect another guest's
> > interrupt distribution? If that is the case, this needs to be fixed.
> >
> > In general, I don't think this should be a blanket statement, but be
> > limited to devices that we presume can deal with this (i.e. PCIe, and
> > not much else).
>
> Based on other on-list and off-line discussions, I came to the same
> conclusion that we can't relax this beyond PCIe. How we do this, it's up
> for debate. Some ideas:
>
> - something in the vfio-pci driver like a new VM_* flag that KVM can
> consume (hard sell for an arch-specific thing)
>
> - changing the vfio-pci driver to allow WC and NC mappings (x86
> terminology) with additional knowledge about what it can safely map
> as NC. KVM would mimic the vma attributes (it doesn't eliminate the
> alias though, the guest can always go for Device while the VMM for
> Normal)
>
> - some per-SoC pfn ranges that are permitted as Normal NC. Can the arch
> code figure out where these PCIe BARs are or is it only the vfio-pci
> driver? I guess we can sort something out around the PCIe but I'm not
> familiar with this subsystem. The alternative is some "safe" ranges in
> firmware tables, assuming the firmware configures the PCIe BARs
> location
>
> --
> Catalin
> -----Original Message-----
> From: Catalin Marinas <[email protected]>
> Sent: Tuesday, December 5, 2023 11:41 AM
> To: Marc Zyngier <[email protected]>
> Cc: [email protected]; Shameerali Kolothum Thodi
> <[email protected]>; [email protected];
> [email protected]; [email protected]; yuzenghui
> <[email protected]>; [email protected]; [email protected]; akpm@linux-
> foundation.org; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH v2 1/1] KVM: arm64: allow the VM to select DEVICE_*
> and NORMAL_NC for IO memory
>
> + Lorenzo, he really needs to be on this thread.
>
> On Tue, Dec 05, 2023 at 09:21:28AM +0000, Marc Zyngier wrote:
> > On Tue, 05 Dec 2023 03:30:15 +0000,
> > <[email protected]> wrote:
> > > From: Ankit Agrawal <[email protected]>
> > >
> > > Currently, KVM for ARM64 maps at stage 2 memory that is considered
> device
> > > (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting
> > > overrides (as per the ARM architecture [1]) any device MMIO mapping
> > > present at stage 1, resulting in a set-up whereby a guest operating
> > > system cannot determine device MMIO mapping memory attributes on
> its
> > > own but it is always overridden by the KVM stage 2 default.
> [...]
> > Despite the considerable increase in the commit message length, a
> > number of questions are left unanswered:
> >
> > - Shameer reported a regression on non-FWB systems, breaking device
> > assignment:
> >
> >
> https://lore.kernel.org/all/[email protected]
> m/
>
> This referred to the first patch in the old series which was trying to
> make the Stage 2 cacheable based on the vma attributes. That patch has
> been dropped for now. It would be good if Shameer confirms this was the
> problem.
>
Yes, that was related to the first patch. We will soon test this on both FWB and
non-FWB platforms and report back.
Thanks,
Shameer
On Tue, Dec 05, 2023 at 09:05:17AM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote:
> > > - Will had unanswered questions in another part of the thread:
> > >
> > > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/
> > >
> > > Can someone please help concluding it?
> >
> > Is this about reclaiming the device? I think we concluded that we can't
> > generalise this beyond PCIe, though not sure there was any formal
> > statement to that thread. The other point Will had was around stating
> > in the commit message why we only relax this to Normal NC. I haven't
> > checked the commit message yet, it needs careful reading ;).
>
> Not quite, we said reclaiming is VFIO's problem and if VFIO can't
> reliably reclaim a device it shouldn't create it in the first place.
I think that as far as device reclaiming was concerned the question
posed was related to memory attributes of transactions for guest
mappings and the related grouping/ordering with device reset MMIO
transactions - it was not (or wasn't only) about error containment.
Thanks,
Lorenzo
On Tue, Dec 05, 2023 at 03:37:13PM +0100, Lorenzo Pieralisi wrote:
> On Tue, Dec 05, 2023 at 09:05:17AM -0400, Jason Gunthorpe wrote:
> > On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote:
> > > > - Will had unanswered questions in another part of the thread:
> > > >
> > > > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/
> > > >
> > > > Can someone please help concluding it?
> > >
> > > Is this about reclaiming the device? I think we concluded that we can't
> > > generalise this beyond PCIe, though not sure there was any formal
> > > statement to that thread. The other point Will had was around stating
> > > in the commit message why we only relax this to Normal NC. I haven't
> > > checked the commit message yet, it needs careful reading ;).
> >
> > Not quite, we said reclaiming is VFIO's problem and if VFIO can't
> > reliably reclaim a device it shouldn't create it in the first place.
>
> I think that as far as device reclaiming was concerned the question
> posed was related to memory attributes of transactions for guest
> mappings and the related grouping/ordering with device reset MMIO
> transactions - it was not (or wasn't only) about error containment.
Yes. It is VFIO that issues the reset, it is VFIO that must provide
the ordering under the assumption that NORMAL_NC was used.
Jason
On Tue, Dec 05, 2023 at 10:44:17AM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 05, 2023 at 03:37:13PM +0100, Lorenzo Pieralisi wrote:
> > On Tue, Dec 05, 2023 at 09:05:17AM -0400, Jason Gunthorpe wrote:
> > > On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote:
> > > > > - Will had unanswered questions in another part of the thread:
> > > > >
> > > > > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/
> > > > >
> > > > > Can someone please help concluding it?
> > > >
> > > > Is this about reclaiming the device? I think we concluded that we can't
> > > > generalise this beyond PCIe, though not sure there was any formal
> > > > statement to that thread. The other point Will had was around stating
> > > > in the commit message why we only relax this to Normal NC. I haven't
> > > > checked the commit message yet, it needs careful reading ;).
> > >
> > > Not quite, we said reclaiming is VFIO's problem and if VFIO can't
> > > reliably reclaim a device it shouldn't create it in the first place.
> >
> > I think that as far as device reclaiming was concerned the question
> > posed was related to memory attributes of transactions for guest
> > mappings and the related grouping/ordering with device reset MMIO
> > transactions - it was not (or wasn't only) about error containment.
>
> Yes. It is VFIO that issues the reset, it is VFIO that must provide
> the ordering under the assumption that NORMAL_NC was used.
And does it? Because VFIO so far only assumes Device-nGnRnE. Do we need
to address this first before attempting to change KVM? Sorry, just
questions, trying to clear the roadblocks.
--
Catalin
On Tue, Dec 05, 2023 at 09:05:17AM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote:
> > > - Will had unanswered questions in another part of the thread:
> > >
> > > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/
> > >
> > > Can someone please help concluding it?
> >
> > Is this about reclaiming the device? I think we concluded that we can't
> > generalise this beyond PCIe, though not sure there was any formal
> > statement to that thread. The other point Will had was around stating
> > in the commit message why we only relax this to Normal NC. I haven't
> > checked the commit message yet, it needs careful reading ;).
>
> Not quite, we said reclaiming is VFIO's problem and if VFIO can't
> reliably reclaim a device it shouldn't create it in the first place.
>
> Again, I think alot of this is trying to take VFIO problems into KVM.
>
> VFIO devices should not exist if they pose a harm to the system. If
> VFIO decided to create the devices anyhow (eg admin override or
> something) then it is not KVM's job to do any further enforcement.
Yeah, I made this argument in the past. But it's a fair question to ask
since the Arm world is different from x86. Just reusing an existing
driver in a different context may break its expectations. Does a
Normal NC access complete by the time a TLBI (for Stage 2) and DSB
(DVMsync) is completed? It does reach some point of serialisation with
subsequent accesses to the same address, but I'm not sure how it is
ordered with an access to a different location like the config space
used for reset.
Maybe it's not a problem at all or it is safe only for PCIe but it would
be good to get to the bottom of this.
Device-nGnRnE has some stronger rules around end-point completion and
that's what the vfio-pci uses. KVM, however, went for the slightly more
relaxed nGnRE variant which, at least per the Arm ARM, doesn't have
these guarantees.
> Remember, the feedback we got from the CPU architects was that even
> DEVICE_* will experience an uncontained failure if the device tiggers
> an error response in shipping ARM IP.
>
> The reason PCIe is safe is because the PCI bridge does not generate
> errors in the first place!
That's an argument to restrict this feature to PCIe. It's really about
fewer arguments on the behaviour of other devices. Marc did raise
another issue with the GIC VCPU interface (does this even have a vma in
the host VMM?). That's a class of devices where the mapping is
context-switched, so the TLBI+DSB rules don't help.
> Thus, the way a platform device can actually be safe is if it too
> never generates errors in the first place! Obviously this approach
> works just as well with NORMAL_NC.
>
> If a platform device does generate errors then we shouldn't expect
> containment at all, and the memory type has no bearing on the
> safety. The correct answer is to block these platform devices from
> VFIO/KVM/etc because they can trigger uncontained failures.
Assuming the error containment is sorted, there are two other issues
with other types of devices:
1. Ordering guarantees on reclaim or context switch
2. Unaligned accesses
On (2), I think PCIe is fairly clear on how the TLPs are generated, so I
wouldn't expect additional errors here. But I have no idea what AMBA/AXI
does here in general. Perhaps it's fine, I don't think we looked into it
as the focus was mostly on PCIe.
So, I think it would be easier to get this patch upstream if we limit
the change to PCIe devices for now. We may relax this further in the
future. Do you actually have a need for non-PCIe devices to support WC
in the guest, or is it more about the complexity of the logic to detect
whether it's actually a PCIe BAR we are mapping into the guest? (I can
see some Arm GPU folk asking for this but those devices are not easily
virtualisable).
--
Catalin
On Tue, Dec 05, 2023 at 04:22:33PM +0000, Catalin Marinas wrote:
> Yeah, I made this argument in the past. But it's a fair question to ask
> since the Arm world is different from x86. Just reusing an existing
> driver in a different context may break its expectations. Does Normal NC
> access complete by the time a TLBI (for Stage 2) and DSB (DVMsync) is
> completed? It does reach some point of serialisation with subsequent
> accesses to the same address but not sure how it is ordered with an
> access to a different location like the config space used for reset.
> Maybe it's not a problem at all or it is safe only for PCIe but it would
> be good to get to the bottom of this.
IMHO, the answer is you can't know architecturally. The specific
vfio-platform driver must do an analysis of its specific SOC and
determine what exactly is required to order the reset. The primary
purpose of the vfio-platform drivers is to provide this reset!
In most cases I would expect some reads from the device to be required
before the reset.
> > Remember, the feedback we got from the CPU architects was that even
> > DEVICE_* will experience an uncontained failure if the device tiggers
> > an error response in shipping ARM IP.
> >
> > The reason PCIe is safe is because the PCI bridge does not generate
> > errors in the first place!
>
> That's an argument to restrict this feature to PCIe. It's really about
> fewer arguments on the behaviour of other devices. Marc did raise
> another issue with the GIC VCPU interface (does this even have a vma in
> the host VMM?). That's a class of devices where the mapping is
> context-switched, so the TLBI+DSB rules don't help.
I don't know anything about the GIC VCPU interface, so I can't really
comment, unfortunately. Since it seems there is something to fix here I
would appreciate some background.
When you say it is context switched do you mean kvm does a register
write on every vm entry to set the proper HW context for the vCPU?
Are we worrying that this register write will possibly not be ordered
after NORMAL_NC accesses?
> > Thus, the way a platform device can actually be safe is if it too
> > never generates errors in the first place! Obviously this approach
> > works just as well with NORMAL_NC.
> >
> > If a platform device does generate errors then we shouldn't expect
> > containment at all, and the memory type has no bearing on the
> > safety. The correct answer is to block these platform devices from
> > VFIO/KVM/etc because they can trigger uncontained failures.
>
> Assuming the error containment is sorted, there are two other issues
> with other types of devices:
>
> 1. Ordering guarantees on reclaim or context switch
Solved in VFIO
> 2. Unaligned accesses
>
> On (2), I think PCIe is fairly clear on how the TLPs are generated, so I
> wouldn't expect additional errors here. But I have no idea what AMBA/AXI
> does here in general. Perhaps it's fine, I don't think we looked into it
> as the focus was mostly on PCIe.
I would expect AXI devices to throw errors in all sorts of odd
cases. eg I would not be surprised at all to see carelessly built AXI
devices error if ST64B is pointed at them. At least when I was
building AXI logic years ago I was so lazy :P
This is mostly my point - if the devices under vfio-platform were not
designed to have contained errors then, IMHO, it is hard to believe
that DEVICE_X/NORMAL_NC is the only issue in that HW.
> So, I think it would be easier to get this patch upstream if we limit
> the change to PCIe devices for now. We may relax this further in the
> future. Do you actually have a need for non-PCIe devices to support WC
> in the guest or it's more about the complexity of the logic to detect
> whether it's actually a PCIe BAR we are mapping into the guest? (I can
> see some Arm GPU folk asking for this but those devices are not easily
> virtualisable).
The complexity is my concern, and the disruption to the ecosystem with
some of the ideas given.
If there was a trivial way to convey in the VMA that it is safe then
sure, no objection from me.
My worry is this has turned from fixing a real problem we have today
into a debate about theoretical issues that nobody may care about and
that are very disruptive to solve.
I would turn it around and ask we find a way to restrict platform
devices when someone comes with a platform device that wants to use
secure kvm and has a single well defined HW problem that is solved by
this work.
What if we change vfio-pci to use pgprot_device() like it already
really should and say the pgprot_noncached() is enforced as
DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC?
Would that be acceptable?
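A minimal sketch of this convention, assuming vfio-pci keeps its
current mmap structure (illustrative only, not a tested change; the
function name is invented for the example):

#include <linux/mm.h>

/*
 * Sketch only: BAR mappings would use pgprot_device() (Device-nGnRE on
 * arm64) while pgprot_noncached() keeps the strict Device-nGnRnE
 * meaning, so a stage-2 Normal-NC relaxation would only ever alias a
 * pgprot_device() user mapping.
 */
static void vfio_bar_vma_prot_sketch(struct vm_area_struct *vma)
{
	vma->vm_page_prot = pgprot_device(vma->vm_page_prot);
}
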
Jason
On Tue, 05 Dec 2023 16:43:18 +0000,
Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Dec 05, 2023 at 04:22:33PM +0000, Catalin Marinas wrote:
> > That's an argument to restrict this feature to PCIe. It's really about
> > fewer arguments on the behaviour of other devices. Marc did raise
> > another issue with the GIC VCPU interface (does this even have a vma in
> > the host VMM?). That's a class of devices where the mapping is
> > context-switched, so the TLBI+DSB rules don't help.
There is no vma. The CPU interface is entirely under control of KVM.
Userspace only provides the IPA for the mapping.
>
> I don't know anything about the GIC VCPU interface, to give any
> comment unfortunately. Since it seems there is something to fix here I
> would appreciate some background..
>
> When you say it is context switched do you mean kvm does a register
> write on every vm entry to set the proper HW context for the vCPU?
The CPU interface is mapped in every guest's S2 page tables as a per-CPU
device, and is under complete control of the guest. There is no KVM
register write to that frame (unless we're proxying it, but that's for
another day).
>
> We are worrying that register write will possibly not order after
> NORMAL_NC?
Guest maps the device as Normal-NC (because it now can), which means
that there is no control over the alignment or anything like that. The
accesses could also be reordered, and/or hit after a context switch to
another guest. Which is why KVM has so far used nGnRE as the mapping
type.
M.
--
Without deviation from the norm, progress is not possible.
On Tue, Dec 05, 2023 at 04:24:22PM +0000, Catalin Marinas wrote:
> On Tue, Dec 05, 2023 at 10:44:17AM -0400, Jason Gunthorpe wrote:
> > On Tue, Dec 05, 2023 at 03:37:13PM +0100, Lorenzo Pieralisi wrote:
> > > On Tue, Dec 05, 2023 at 09:05:17AM -0400, Jason Gunthorpe wrote:
> > > > On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote:
> > > > > > - Will had unanswered questions in another part of the thread:
> > > > > >
> > > > > > https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/
> > > > > >
> > > > > > Can someone please help concluding it?
> > > > >
> > > > > Is this about reclaiming the device? I think we concluded that we can't
> > > > > generalise this beyond PCIe, though not sure there was any formal
> > > > > statement to that thread. The other point Will had was around stating
> > > > > in the commit message why we only relax this to Normal NC. I haven't
> > > > > checked the commit message yet, it needs careful reading ;).
> > > >
> > > > Not quite, we said reclaiming is VFIO's problem and if VFIO can't
> > > > reliably reclaim a device it shouldn't create it in the first place.
> > >
> > > I think that as far as device reclaiming was concerned the question
> > > posed was related to memory attributes of transactions for guest
> > > mappings and the related grouping/ordering with device reset MMIO
> > > transactions - it was not (or wasn't only) about error containment.
> >
> > Yes. It is VFIO that issues the reset, it is VFIO that must provide
> > the ordering under the assumption that NORMAL_NC was used.
>
> And does it? Because VFIO so far only assumes Device-nGnRnE. Do we need
> to address this first before attempting to change KVM? Sorry, just
> questions, trying to clear the roadblocks.
There is no way to know. It is SOC specific what would be needed.
Could people have implemented their platform devices with a multi-path
bus architecture for the reset? Yes, definately. In fact, I've built
things like that. Low speed stuff like reset gets its own low speed
bus.
If that was done will NORMAL_NC vs DEVICE_nGnRnE make a difference?
I'm not sure. It depends a lot on how the SOC was designed and how
transactions flow on the high speed side. Posting writes, like PCIe
does, would destroy any practical ordering difference between the two
memory types. If the writes are not posted then the barriers in the
TLBI sequence should order it.
Fortunately, if some SOC has this issue we know how to solve it - you
must do flushing reads on all the multi-path interconnect segments to
serialize everything around the reset.
Regardless, getting this wrong is not a functional issue, it causes a
subtle time sensitive security race with VFIO close() that would be
hard to actually hit, and would already require privilege to open a
VFIO device to exploit. IOW, we don't care.
Jason
On Tue, Dec 05, 2023 at 05:01:28PM +0000, Marc Zyngier wrote:
> On Tue, 05 Dec 2023 16:43:18 +0000,
> Jason Gunthorpe <[email protected]> wrote:
> > On Tue, Dec 05, 2023 at 04:22:33PM +0000, Catalin Marinas wrote:
> > > That's an argument to restrict this feature to PCIe. It's really about
> > > fewer arguments on the behaviour of other devices. Marc did raise
> > > another issue with the GIC VCPU interface (does this even have a vma in
> > > the host VMM?). That's a class of devices where the mapping is
> > > context-switched, so the TLBI+DSB rules don't help.
>
> There is no vma. The CPU interface is entirely under control of KVM.
> Userspace only provides the IPA for the mapping.
That's good to know. We can solve the GIC issue by limiting the
relaxation to those mappings that have a user vma. Ideally we should do
this for vfio only but we don't have an easy way to convey this to KVM.
--
Catalin
On Tue, 05 Dec 2023 17:33:01 +0000,
Catalin Marinas <[email protected]> wrote:
>
> On Tue, Dec 05, 2023 at 05:01:28PM +0000, Marc Zyngier wrote:
> > On Tue, 05 Dec 2023 16:43:18 +0000,
> > Jason Gunthorpe <[email protected]> wrote:
> > > On Tue, Dec 05, 2023 at 04:22:33PM +0000, Catalin Marinas wrote:
> > > > That's an argument to restrict this feature to PCIe. It's really about
> > > > fewer arguments on the behaviour of other devices. Marc did raise
> > > > another issue with the GIC VCPU interface (does this even have a vma in
> > > > the host VMM?). That's a class of devices where the mapping is
> > > > context-switched, so the TLBI+DSB rules don't help.
> >
> > There is no vma. The CPU interface is entirely under control of KVM.
> > Userspace only provides the IPA for the mapping.
>
> That's good to know. We can solve the GIC issue by limiting the
> relaxation to those mappings that have a user vma.
Yes, this should definitely be part of the decision.
> Ideally we should do this for vfio only but we don't have an easy
> way to convey this to KVM.
But if we want to limit this to PCIe, we'll have to find out. The
initial proposal (a long while ago) had a flag conveying some
information, and I'd definitely feel more confident having something
like that.
M.
--
Without deviation from the norm, progress is not possible.
On Tue, Dec 05, 2023 at 05:50:27PM +0000, Marc Zyngier wrote:
> On Tue, 05 Dec 2023 17:33:01 +0000,
> Catalin Marinas <[email protected]> wrote:
> > Ideally we should do this for vfio only but we don't have an easy
> > way to convey this to KVM.
>
> But if we want to limit this to PCIe, we'll have to find out. The
> initial proposal (a long while ago) had a flag conveying some
> information, and I'd definitely feel more confident having something
> like that.
We can add a VM_PCI_IO in the high vma flags to be set by
vfio_pci_core_mmap(), though it limits it to 64-bit architectures. KVM
knows this is PCI and relaxes things a bit. It's not generic though if
we need this later for something else.
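A rough sketch of that idea, with the flag name taken from the
paragraph above and the bit choice purely an assumption (illustrative
only, not a proposed patch; assumes the usual kvm_pgtable.h types):

#include <linux/mm.h>

/*
 * Hypothetical sketch only: VM_PCI_IO is not an existing kernel flag
 * and the high VMA bit chosen here is an assumption for illustration.
 */
#define VM_PCI_IO	VM_HIGH_ARCH_4

/* vfio-pci side, when mmap()ing a BAR: mark the VMA as PCI I/O space */
static void vfio_pci_mark_vma_sketch(struct vm_area_struct *vma)
{
	vm_flags_set(vma, VM_PCI_IO);
}

/* KVM arm64 side (user_mem_abort()): relax stage 2 only for PCI I/O VMAs */
static enum kvm_pgtable_prot device_prot_sketch(struct vm_area_struct *vma)
{
	return (vma->vm_flags & VM_PCI_IO) ? KVM_PGTABLE_PROT_NORMAL_NC :
					     KVM_PGTABLE_PROT_DEVICE;
}
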
A question for Lorenzo: do these BARs appear in iomem_resource? We could
search that up instead of a flag, something like the page_is_ram()
helper.
--
Catalin
On Tue, Dec 05, 2023 at 12:43:18PM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 05, 2023 at 04:22:33PM +0000, Catalin Marinas wrote:
> > Yeah, I made this argument in the past. But it's a fair question to ask
> > since the Arm world is different from x86. Just reusing an existing
> > driver in a different context may break its expectations. Does Normal NC
> > access complete by the time a TLBI (for Stage 2) and DSB (DVMsync) is
> > completed? It does reach some point of serialisation with subsequent
> > accesses to the same address but not sure how it is ordered with an
> > access to a different location like the config space used for reset.
> > Maybe it's not a problem at all or it is safe only for PCIe but it would
> > be good to get to the bottom of this.
>
> IMHO, the answer is you can't know architecturally. The specific
> vfio-platform driver must do an analysis of it's specific SOC and
> determine what exactly is required to order the reset. The primary
> purpose of the vfio-platform drivers is to provide this reset!
>
> In most cases I would expect some reads from the device to be required
> before the reset.
I can see in the vfio_platform_common.c code that the reset is either
handled by an ACPI _RST method or some custom function in case of DT.
Let's consider the ACPI method for now, I assume the AML code pokes some
device registers but we can't say much about the ordering it expects
without knowing the details. The AML may assume that the ioaddr mapped
as Device-nGnRE (ioremap()) in the kernel has the same attributes
wherever else it is mapped in user or guests. Note that currently the
vfio_platform and vfio_pci drivers only allow pgprot_noncached() in
user, so they wouldn't worry about other mismatched aliases.
I think PCIe is slightly better documented but even here we'll have to
rely on the TLBI+DSB to clear any prior writes on different CPUs.
It can be argued that it's the responsibility of whoever grants device
access to know the details. However, it would help if we gave some
guidance: are any expectations broken if an alias is Normal-NC? It's easier
to start with PCIe first until we get some concrete request for other
types of devices.
> > So, I think it would be easier to get this patch upstream if we limit
> > the change to PCIe devices for now. We may relax this further in the
> > future. Do you actually have a need for non-PCIe devices to support WC
> > in the guest or is it more about the complexity of the logic to detect
> > whether it's actually a PCIe BAR we are mapping into the guest? (I can
> > see some Arm GPU folk asking for this but those devices are not easily
> > virtualisable).
>
> The complexity is my concern, and the disruption to the ecosystem with
> some of the ideas given.
>
> If there was a trivial way to convey in the VMA that it is safe then
> sure, no objection from me.
I suggested a new VM_* flag or some way to probe iomem_resource for
PCIe ranges (if they are described in there, not sure). We could also
invent some other tree search for ranges that get registered by the vfio
driver; I don't think it's that difficult.
The question is, do we need to do this for other types of devices or is
it mostly theoretical at this point (what's theoretical goes both ways
really).
A more complex way is to change vfio to allow Normal mappings and KVM
would mimic them. You were actually planning to do this for Cacheable
anyway.
> I would turn it around and ask we find a way to restrict platform
> devices when someone comes with a platform device that wants to use
> secure kvm and has a single well defined HW problem that is solved by
> this work.
We end up with similar search/validation mechanism, so not sure we gain
much.
> What if we change vfio-pci to use pgprot_device() like it already
> really should and say the pgprot_noncached() is enforced as
> DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC?
> Would that be acceptable?
pgprot_device() needs to stay as Device, otherwise you'd get speculative
reads with potential side-effects.
--
Catalin
On Tue, Dec 05, 2023 at 07:24:37PM +0000, Catalin Marinas wrote:
> On Tue, Dec 05, 2023 at 12:43:18PM -0400, Jason Gunthorpe wrote:
> > On Tue, Dec 05, 2023 at 04:22:33PM +0000, Catalin Marinas wrote:
> > > Yeah, I made this argument in the past. But it's a fair question to ask
> > > since the Arm world is different from x86. Just reusing an existing
> > > driver in a different context may break its expectations. Does Normal NC
> > > access complete by the time a TLBI (for Stage 2) and DSB (DVMsync) is
> > > completed? It does reach some point of serialisation with subsequent
> > > accesses to the same address but not sure how it is ordered with an
> > > access to a different location like the config space used for reset.
> > > Maybe it's not a problem at all or it is safe only for PCIe but it would
> > > be good to get to the bottom of this.
> >
> > IMHO, the answer is you can't know architecturally. The specific
> > vfio-platform driver must do an analysis of its specific SOC and
> > determine what exactly is required to order the reset. The primary
> > purpose of the vfio-platform drivers is to provide this reset!
> >
> > In most cases I would expect some reads from the device to be required
> > before the reset.
>
> I can see in the vfio_platform_common.c code that the reset is either
> handled by an ACPI _RST method or some custom function in case of DT.
> Let's consider the ACPI method for now, I assume the AML code pokes some
> device registers but we can't say much about the ordering it expects
> without knowing the details. The AML may assume that the ioaddr mapped
> as Device-nGnRE (ioremap()) in the kernel has the same attributes
> wherever else it is mapped in user or guests. Note that currently the
> vfio_platform and vfio_pci drivers only allow pgprot_noncached() in
> user, so they wouldn't worry about other mismatched aliases.
AML is OS agnostic. It technically shouldn't assume the OS never used
NORMAL_NC to access the device.
By the time the AML is invoked the VMAs are all revoked and all TLBs
are flushed so I think the mismatched alias problem is gone??
> It can be argued that it's the responsibility of whoever grants device
> access to know the details. However, it would help if we gave some
> guidance: are any expectations broken if an alias is Normal-NC? It's easier
> to start with PCIe first until we get some concrete request for other
> types of devices.
The issue was about NORMAL_NC access prior to the TLBI continuing to
float in the interconnect and not being ordered strictly before the reset
write.
Which, IMHO, is basically arguing that DSB doesn't order these
operations on some specific SOC - which I view with some doubt.
> > The complexity is my concern, and the disruption to the ecosystem with
> > some of the ideas given.
> >
> > If there was a trivial way to convey in the VMA that it is safe then
> > sure, no objection from me.
>
> I suggested a new VM_* flag or some way to probe iomem_resource for
> PCIe ranges (if they are described in there, not sure). We could also
> invent some other tree search for ranges that get registered by the vfio
> driver; I don't think it's that difficult.
I do not like iomem_resource probing on principle :(
A VM_ flag would be fine, but seems wasteful.
> A more complex way is to change vfio to allow Normal mappings and KVM
> would mimic them. You were actually planning to do this for Cacheable
> anyway.
This needs qemu changes, so I very much don't like it.
Cacheable is always cacheable, it is a different thing.
> > I would turn it around and ask we find a way to restrict platform
> > devices when someone comes with a platform device that wants to use
> > secure kvm and has a single well defined HW problem that is solved by
> > this work.
>
> We end up with similar search/validation mechanism, so not sure we gain
> much.
We gain by not doing it, ever. I'm expecting nobody will ever bring it
up.
The vfio-platform drivers all look to be rather old to me, and the
manifestation is, at worst, that a hostile VM that didn't crash the
machine before does crash it now.
I have serious doubts anyone will run into an issue.
> > What if we change vfio-pci to use pgprot_device() like it already
> > really should and say the pgprot_noncached() is enforced as
> > DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC?
> > Would that be acceptable?
>
> pgprot_device() needs to stay as Device, otherwise you'd get speculative
> reads with potential side-effects.
I do not mean to change pgprot_device() I mean to detect the
difference via pgprot_device() vs pgprot_noncached(). They put a
different value in the PTE that we can sense. It is very hacky.
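For illustration, the sensing would amount to comparing the memory
attribute index already encoded in the vma's page protection; an untested,
arm64-specific sketch (the helper name is made up):

#include <linux/mm.h>

/* pgprot_noncached() selects MT_DEVICE_nGnRnE while pgprot_device()
 * selects MT_DEVICE_nGnRE, so the two leave different attribute indices
 * in vma->vm_page_prot. */
static bool vma_prot_is_device_nGnRE(struct vm_area_struct *vma)
{
	u64 attridx = pgprot_val(vma->vm_page_prot) & PTE_ATTRINDX_MASK;

	return attridx == PTE_ATTRINDX(MT_DEVICE_nGnRE);
}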
Jason
> -----Original Message-----
> From: Shameerali Kolothum Thodi
> Sent: Tuesday, December 5, 2023 2:17 PM
> To: 'Catalin Marinas' <[email protected]>; Marc Zyngier <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; yuzenghui <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; linyufeng (A)
> <[email protected]>
> Subject: RE: [PATCH v2 1/1] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
>
>
>
> > -----Original Message-----
> > From: Catalin Marinas <[email protected]>
> > Sent: Tuesday, December 5, 2023 11:41 AM
> > To: Marc Zyngier <[email protected]>
> > Cc: [email protected]; Shameerali Kolothum Thodi
> > <[email protected]>; [email protected];
> > [email protected]; [email protected]; yuzenghui
> > <[email protected]>; [email protected]; [email protected];
> akpm@linux-
> > foundation.org; [email protected]; [email protected];
> [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]
> > Subject: Re: [PATCH v2 1/1] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
> >
> > + Lorenzo, he really needs to be on this thread.
> >
> > On Tue, Dec 05, 2023 at 09:21:28AM +0000, Marc Zyngier wrote:
> > > On Tue, 05 Dec 2023 03:30:15 +0000,
> > > <[email protected]> wrote:
> > > > From: Ankit Agrawal <[email protected]>
> > > >
> > > > Currently, KVM for ARM64 maps at stage 2 memory that is considered
> > device
> > > > (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting
> > > > overrides (as per the ARM architecture [1]) any device MMIO mapping
> > > > present at stage 1, resulting in a set-up whereby a guest operating
> > > > system cannot determine device MMIO mapping memory attributes on
> > its
> > > > own but it is always overridden by the KVM stage 2 default.
> > [...]
> > > Despite the considerable increase in the commit message length, a
> > > number of questions are left unanswered:
> > >
> > > - Shameer reported a regression on non-FWB systems, breaking device
> > > assignment:
> > >
> > >
> >
> https://lore.kernel.org/all/[email protected]
> > m/
> >
> > This referred to the first patch in the old series which was trying to
> > make the Stage 2 cacheable based on the vma attributes. That patch has
> > been dropped for now. It would be good if Shameer confirms this was the
> > problem.
> >
>
> Yes, that was related to the first patch. We will soon test this on both
> FWB and non-FWB platforms and report back.
>
Thanks to Yufeng, the test on both FWB and non-FWB platforms with a GPU
device assignment looks fine with this patch.
Shameer
On Tue, 05 Dec 2023 18:40:42 +0000,
Catalin Marinas <[email protected]> wrote:
>
> On Tue, Dec 05, 2023 at 05:50:27PM +0000, Marc Zyngier wrote:
> > On Tue, 05 Dec 2023 17:33:01 +0000,
> > Catalin Marinas <[email protected]> wrote:
> > > Ideally we should do this for vfio only but we don't have an easy
> > > way to convey this to KVM.
> >
> > But if we want to limit this to PCIe, we'll have to find out. The
> > initial proposal (a long while ago) had a flag conveying some
> > information, and I'd definitely feel more confident having something
> > like that.
>
> We can add a VM_PCI_IO in the high vma flags to be set by
> vfio_pci_core_mmap(), though it limits it to 64-bit architectures. KVM
> knows this is PCI and relaxes things a bit. It's not generic though if
> we need this later for something else.
Either that, or something actually describing the attributes that VFIO
wants.
And I very much want it to be a buy-in behaviour, not something that
automagically happens and changes the default behaviour for everyone
based on some hand-wavy assertions.
If that means a userspace change, fine by me. The VMM better know what
is happening.
M.
--
Without deviation from the norm, progress is not possible.
On Tue, Dec 05, 2023 at 06:40:42PM +0000, Catalin Marinas wrote:
> On Tue, Dec 05, 2023 at 05:50:27PM +0000, Marc Zyngier wrote:
> > On Tue, 05 Dec 2023 17:33:01 +0000,
> > Catalin Marinas <[email protected]> wrote:
> > > Ideally we should do this for vfio only but we don't have an easy
> > > way to convey this to KVM.
> >
> > But if we want to limit this to PCIe, we'll have to find out. The
> > initial proposal (a long while ago) had a flag conveying some
> > information, and I'd definitely feel more confident having something
> > like that.
>
> We can add a VM_PCI_IO in the high vma flags to be set by
> vfio_pci_core_mmap(), though it limits it to 64-bit architectures. KVM
> knows this is PCI and relaxes things a bit. It's not generic though if
> we need this later for something else.
>
> A question for Lorenzo: do these BARs appear in iomem_resource? We could
> search that up instead of a flag, something like the page_is_ram()
> helper.
They do, but an iomem_resource look-up alone, if I am not mistaken, would
not tell us whether it is PCI address space; we would need additional
checks (like detecting whether we are decoding an address within a PCI
host bridge window) to determine that, if we go down this route.
Lorenzo
On Wed, Dec 06, 2023 at 11:39:03AM +0000, Marc Zyngier wrote:
> On Tue, 05 Dec 2023 18:40:42 +0000,
> Catalin Marinas <[email protected]> wrote:
> > On Tue, Dec 05, 2023 at 05:50:27PM +0000, Marc Zyngier wrote:
> > > On Tue, 05 Dec 2023 17:33:01 +0000,
> > > Catalin Marinas <[email protected]> wrote:
> > > > Ideally we should do this for vfio only but we don't have an easy
> > > > way to convey this to KVM.
> > >
> > > But if we want to limit this to PCIe, we'll have to find out. The
> > > initial proposal (a long while ago) had a flag conveying some
> > > information, and I'd definitely feel more confident having something
> > > like that.
> >
> > We can add a VM_PCI_IO in the high vma flags to be set by
> > vfio_pci_core_mmap(), though it limits it to 64-bit architectures. KVM
> > knows this is PCI and relaxes things a bit. It's not generic though if
> > we need this later for something else.
>
> Either that, or something actually describing the attributes that VFIO
> wants.
>
> And I very much want it to be a buy-in behaviour, not something that
> automagically happens and changes the default behaviour for everyone
> based on some hand-wavy assertions.
>
> If that means a userspace change, fine by me. The VMM better know what
> is happening.
Driving the attributes from a single point like the VFIO driver is
indeed better. The problem is that write-combining on Arm doesn't come
without speculative loads, otherwise we would have solved it by now. I
also recall the VFIO maintainer pushing back on relaxing the
pgprot_noncached() for the user mapping but I don't remember the
reasons.
We could do with a pgprot_maybewritecombine() or
pgprot_writecombinenospec() (similar to Jason's idea but without
changing the semantics of pgprot_device()). For the user mapping on
arm64 this would be Device (even _GRE) since it can't disable
speculation but stage 2 would leave the decision to the guest since the
speculative loads aren't much different from committed loads done
wrongly.
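As a strawman, such a helper could be spelled like the existing arm64
pgprot_*() macros (sketch only, the name is made up; and since the
resulting PTE is indistinguishable from a pgprot_device() one, KVM would
still need a separate signal, e.g. a vma flag):

/* Stage 1 stays Device-nGnRE, i.e. no speculative loads for the user
 * mapping; the "maybe write-combine" intent cannot be expressed in the
 * attribute index alone. */
#define pgprot_maybewritecombine(prot) \
	__pgprot_modify(prot, PTE_ATTRINDX_MASK, \
			PTE_ATTRINDX(MT_DEVICE_nGnRE) | PTE_PXN | PTE_UXN)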
If we want the VMM to drive this entirely, we could add a new mmap()
flag like MAP_WRITECOMBINE or PROT_WRITECOMBINE. They do feel a bit
weird but there is precedent with PROT_MTE to describe a memory type.
One question is whether the VFIO driver still needs to have the
knowledge and sanitise the requests from the VMM within a single BAR. If
there are no security implications to such mappings, the VMM can map
parts of the BAR as pgprot_noncached(), other parts as
pgprot_writecombine() and KVM just follows them (similarly if we need a
cacheable mapping).
The latter has some benefits for DPDK but it's a lot more involved with
having to add device-specific knowledge into the VMM. The VMM would also
have to present the whole BAR contiguously to the guest even if there
are different mapping attributes within the range. So a lot of MAP_FIXED
uses. I'd rather leave this decision with the guest than the VMM; it
looks like more hassle to create those mappings. The VMM or the VFIO
could only state write-combine and speculation allowed.
--
Catalin
On Tue, Dec 05, 2023 at 03:48:22PM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 05, 2023 at 07:24:37PM +0000, Catalin Marinas wrote:
> > On Tue, Dec 05, 2023 at 12:43:18PM -0400, Jason Gunthorpe wrote:
> > > What if we change vfio-pci to use pgprot_device() like it already
> > > really should and say the pgprot_noncached() is enforced as
> > > DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC?
> > > Would that be acceptable?
> >
> > pgprot_device() needs to stay as Device, otherwise you'd get speculative
> > reads with potential side-effects.
>
> I do not mean to change pgprot_device() I mean to detect the
> difference via pgprot_device() vs pgprot_noncached(). They put a
> different value in the PTE that we can sense. It is very hacky.
Ah, ok, it does look hacky though (as is the alternative of coming up
with a new specific pgprot_*() that KVM can treat differently).
BTW, on those Mellanox devices that require different attributes within
a BAR, do they have a problem with speculative reads causing
side-effects? If not, we might as well map the whole BAR in user as
Normal NC but have the guest use the appropriate attributes within the
BAR. The VMM wouldn't explicitly access the BAR but we'd get the
occasional speculative reads.
--
Catalin
On Wed, Dec 06, 2023 at 02:49:02PM +0000, Catalin Marinas wrote:
> On Tue, Dec 05, 2023 at 03:48:22PM -0400, Jason Gunthorpe wrote:
> > On Tue, Dec 05, 2023 at 07:24:37PM +0000, Catalin Marinas wrote:
> > > On Tue, Dec 05, 2023 at 12:43:18PM -0400, Jason Gunthorpe wrote:
> > > > What if we change vfio-pci to use pgprot_device() like it already
> > > > really should and say the pgprot_noncached() is enforced as
> > > > DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC?
> > > > Would that be acceptable?
> > >
> > > pgprot_device() needs to stay as Device, otherwise you'd get speculative
> > > reads with potential side-effects.
> >
> > I do not mean to change pgprot_device() I mean to detect the
> > difference via pgprot_device() vs pgprot_noncached(). They put a
> > different value in the PTE that we can sense. It is very hacky.
>
> Ah, ok, it does look hacky though (as is the alternative of coming up
> with a new specific pgprot_*() that KVM can treat differently).
>
> BTW, on those Mellanox devices that require different attributes within
> a BAR, do they have a problem with speculative reads causing
> side-effects?
Yes. We definitely have had that problem in the past on older
devices. VFIO must map the BAR using pgprot_device/noncached() into
the VMM, no other choice is functionally OK.
Only some pages can safely tolerate speculative reads and the guest
driver knows which they are.
Jason
On Wed, Dec 06, 2023 at 12:14:18PM +0000, Catalin Marinas wrote:
> We could do with a pgprot_maybewritecombine() or
> pgprot_writecombinenospec() (similar to Jason's idea but without
> changing the semantics of pgprot_device()). For the user mapping on
> arm64 this would be Device (even _GRE) since it can't disable
> speculation but stage 2 would leave the decision to the guest since the
> speculative loads aren't much different from committed loads done
> wrongly.
This would be fine, as would a VMA flag. Please pick one :)
I think a VMA flag is simpler than messing with pgprot.
> If we want the VMM to drive this entirely, we could add a new mmap()
> flag like MAP_WRITECOMBINE or PROT_WRITECOMBINE. They do feel a bit
As in the other thread, we cannot unconditionally map NORMAL_NC into
the VMM.
> The latter has some benefits for DPDK but it's a lot more involved
> with
DPDK WC support will be solved with some VFIO-only change if anyone
ever cares to make it, if that is what you mean.
> having to add device-specific knowledge into the VMM. The VMM would also
> have to present the whole BAR contiguously to the guest even if there
> are different mapping attributes within the range. So a lot of MAP_FIXED
> uses. I'd rather leave this decision with the guest than the VMM; it
> looks like more hassle to create those mappings. The VMM or the VFIO
> could only state write-combine and speculation allowed.
We talked about this already, the guest must decide, the VMM doesn't
have the information to pre-predict which pages the guest will want to
use WC on.
Jason
On Wed, Dec 06, 2023 at 11:05:56AM -0400, Jason Gunthorpe wrote:
> On Wed, Dec 06, 2023 at 02:49:02PM +0000, Catalin Marinas wrote:
> > On Tue, Dec 05, 2023 at 03:48:22PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Dec 05, 2023 at 07:24:37PM +0000, Catalin Marinas wrote:
> > > > On Tue, Dec 05, 2023 at 12:43:18PM -0400, Jason Gunthorpe wrote:
> > > > > What if we change vfio-pci to use pgprot_device() like it already
> > > > > really should and say the pgprot_noncached() is enforced as
> > > > > DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC?
> > > > > Would that be acceptable?
> > > >
> > > > pgprot_device() needs to stay as Device, otherwise you'd get speculative
> > > > reads with potential side-effects.
> > >
> > > I do not mean to change pgprot_device() I mean to detect the
> > > difference via pgprot_device() vs pgprot_noncached(). They put a
> > > different value in the PTE that we can sense. It is very hacky.
> >
> > Ah, ok, it does look hacky though (as is the alternative of coming up
> > with a new specific pgprot_*() that KVM can treat differently).
> >
> > BTW, on those Mellanox devices that require different attributes within
> > a BAR, do they have a problem with speculative reads causing
> > side-effects?
>
> Yes. We definitely have had that problem in the past on older
> devices. VFIO must map the BAR using pgprot_device/noncached() into
> the VMM, no other choice is functionally OK.
Were those BARs tagged as prefetchable or non-prefetchable ? I assume the
latter but please let me know if I am guessing wrong.
Thanks,
Lorenzo
On Wed, Dec 06, 2023 at 04:18:05PM +0100, Lorenzo Pieralisi wrote:
> On Wed, Dec 06, 2023 at 11:05:56AM -0400, Jason Gunthorpe wrote:
> > On Wed, Dec 06, 2023 at 02:49:02PM +0000, Catalin Marinas wrote:
> > > On Tue, Dec 05, 2023 at 03:48:22PM -0400, Jason Gunthorpe wrote:
> > > > On Tue, Dec 05, 2023 at 07:24:37PM +0000, Catalin Marinas wrote:
> > > > > On Tue, Dec 05, 2023 at 12:43:18PM -0400, Jason Gunthorpe wrote:
> > > > > > What if we change vfio-pci to use pgprot_device() like it already
> > > > > > really should and say the pgprot_noncached() is enforced as
> > > > > > DEVICE_nGnRnE and pgprot_device() may be DEVICE_nGnRE or NORMAL_NC?
> > > > > > Would that be acceptable?
> > > > >
> > > > > pgprot_device() needs to stay as Device, otherwise you'd get speculative
> > > > > reads with potential side-effects.
> > > >
> > > > I do not mean to change pgprot_device() I mean to detect the
> > > > difference via pgprot_device() vs pgprot_noncached(). They put a
> > > > different value in the PTE that we can sense. It is very hacky.
> > >
> > > Ah, ok, it does look hacky though (as is the alternative of coming up
> > > with a new specific pgprot_*() that KVM can treat differently).
> > >
> > > BTW, on those Mellanox devices that require different attributes within
> > > a BAR, do they have a problem with speculative reads causing
> > > side-effects?
> >
> > Yes. We definitely have had that problem in the past on older
> > devices. VFIO must map the BAR using pgprot_device/noncached() into
> > the VMM, no other choice is functionally OK.
>
> Were those BARs tagged as prefetchable or non-prefetchable ? I assume the
> latter but please let me know if I am guessing wrong.
I don't know it was quite old HW. Probably.
Just because a BAR is not marked as prefetchable doesn't mean that the
device can't use NORMAL_NC on subsets of it.
Jason
On Wed, Dec 06, 2023 at 11:38:09AM -0400, Jason Gunthorpe wrote:
> On Wed, Dec 06, 2023 at 04:18:05PM +0100, Lorenzo Pieralisi wrote:
> > On Wed, Dec 06, 2023 at 11:05:56AM -0400, Jason Gunthorpe wrote:
> > > On Wed, Dec 06, 2023 at 02:49:02PM +0000, Catalin Marinas wrote:
> > > > BTW, on those Mellanox devices that require different attributes within
> > > > a BAR, do they have a problem with speculative reads causing
> > > > side-effects?
> > >
> > > Yes. We definitely have had that problem in the past on older
> > > devices. VFIO must map the BAR using pgprot_device/noncached() into
> > > the VMM, no other choice is functionally OK.
> >
> > Were those BARs tagged as prefetchable or non-prefetchable ? I assume the
> > latter but please let me know if I am guessing wrong.
>
> I don't know it was quite old HW. Probably.
>
> Just because a BAR is not marked as prefetchable doesn't mean that the
> device can't use NORMAL_NC on subsets of it.
What about the other way around - would we have a prefetchable BAR that
has portions which are unprefetchable?
--
Catalin
On Wed, Dec 06, 2023 at 11:16:03AM -0400, Jason Gunthorpe wrote:
> On Wed, Dec 06, 2023 at 12:14:18PM +0000, Catalin Marinas wrote:
> > We could do with a pgprot_maybewritecombine() or
> > pgprot_writecombinenospec() (similar to Jason's idea but without
> > changing the semantics of pgprot_device()). For the user mapping on
> > arm64 this would be Device (even _GRE) since it can't disable
> > speculation but stage 2 would leave the decision to the guest since the
> > speculative loads aren't much different from committed loads done
> > wrongly.
>
> This would be fine, as would a VMA flag. Please pick one :)
>
> I think a VMA flag is simpler than messing with pgprot.
I guess one could write a patch and see how it goes ;).
> > If we want the VMM to drive this entirely, we could add a new mmap()
> > flag like MAP_WRITECOMBINE or PROT_WRITECOMBINE. They do feel a bit
>
> As in the other thread, we cannot unconditionally map NORMAL_NC into
> the VMM.
I'm not suggesting this but rather the VMM map portions of the BAR with
either Device or Normal-NC, concatenate them (MAP_FIXED) and pass this
range as a memory slot (or multiple if a slot doesn't allow multiple
vmas).
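For illustration only, and assuming VFIO grew a way to hand out a
write-combine user mapping (it does not today), the stitching could look
roughly like this from the VMM side:

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Untested sketch: map the whole BAR with the default (Device) attributes,
 * re-map a sub-range with a hypothetical write-combine mapping, then expose
 * the stitched range to the guest as a single memory slot. */
static int map_stitched_bar(int device_fd, int vm_fd, uint64_t gpa,
			    uint64_t bar_off, size_t bar_size,
			    uint64_t wc_off, size_t wc_len)
{
	void *base = mmap(NULL, bar_size, PROT_READ | PROT_WRITE, MAP_SHARED,
			  device_fd, bar_off);
	if (base == MAP_FAILED)
		return -1;

	/* Hypothetical: a second mmap() that VFIO would install as Normal-NC. */
	mmap((char *)base + wc_off, wc_len, PROT_READ | PROT_WRITE,
	     MAP_SHARED | MAP_FIXED, device_fd, bar_off + wc_off);

	struct kvm_userspace_memory_region region = {
		.slot = 0,
		.guest_phys_addr = gpa,
		.memory_size = bar_size,
		.userspace_addr = (uint64_t)(uintptr_t)base,
	};
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}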
> > The latter has some benefits for DPDK but it's a lot more involved
> > with
>
> DPDK WC support will be solved with some VFIO-only change if anyone
> ever cares to make it, if that is what you mean.
Yeah. Some arguments I've heard in private and public discussions are
that the KVM device pass-through shouldn't be different from the DPDK
case. So fixing that would cover KVM as well, though we'd need
additional logic in the VMM. BenH had a short talk at Plumbers around
this - https://youtu.be/QLvN3KXCn0k?t=7010. There was some statement in
there that for x86, the guests are allowed to do WC without other KVM
restrictions (not sure whether that's the case, not familiar with it).
> > having to add device-specific knowledge into the VMM. The VMM would also
> > have to present the whole BAR contiguously to the guest even if there
> > are different mapping attributes within the range. So a lot of MAP_FIXED
> > uses. I'd rather leave this decision with the guest than the VMM; it
> > looks like more hassle to create those mappings. The VMM or the VFIO
> > could only state write-combine and speculation allowed.
>
> We talked about this already, the guest must decide, the VMM doesn't
> have the information to pre-predict which pages the guest will want to
> use WC on.
Are the Device/Normal offsets within a BAR fixed, documented in e.g. the
spec, or is this something configurable via some MMIO that the guest
does?
--
Catalin
On Wed, Dec 06, 2023 at 04:23:25PM +0000, Catalin Marinas wrote:
> On Wed, Dec 06, 2023 at 11:38:09AM -0400, Jason Gunthorpe wrote:
> > On Wed, Dec 06, 2023 at 04:18:05PM +0100, Lorenzo Pieralisi wrote:
> > > On Wed, Dec 06, 2023 at 11:05:56AM -0400, Jason Gunthorpe wrote:
> > > > On Wed, Dec 06, 2023 at 02:49:02PM +0000, Catalin Marinas wrote:
> > > > > BTW, on those Mellanox devices that require different attributes within
> > > > > a BAR, do they have a problem with speculative reads causing
> > > > > side-effects?
> > > >
> > > > Yes. We definitely have had that problem in the past on older
> > > > devices. VFIO must map the BAR using pgprot_device/noncached() into
> > > > the VMM, no other choice is functionally OK.
> > >
> > > Were those BARs tagged as prefetchable or non-prefetchable ? I assume the
> > > latter but please let me know if I am guessing wrong.
> >
> > I don't know it was quite old HW. Probably.
> >
> > Just because a BAR is not marked as prefetchable doesn't mean that the
> > device can't use NORMAL_NC on subsets of it.
>
> What about the other way around - would we have a prefetchable BAR that
> has portions which are unprefetchable?
I would say possibly.
Prefetch is a dead concept in PCIe, it was obsoleted in PCI-X about 20
years ago. No PCIe system has ever done prefetch.
There is a strong incentive to mark BAR's as prefetchable because it
allows 64 bit addressing in configurations with bridges.
So.. I would expect people have done interesting things here.
Jason
On Wed, Dec 06, 2023 at 04:31:48PM +0000, Catalin Marinas wrote:
> > This would be fine, as would a VMA flag. Please pick one :)
> >
> > I think a VMA flag is simpler than messing with pgprot.
>
> I guess one could write a patch and see how it goes ;).
A lot of patches have been sent on this already :(
> > > If we want the VMM to drive this entirely, we could add a new mmap()
> > > flag like MAP_WRITECOMBINE or PROT_WRITECOMBINE. They do feel a bit
> >
> > As in the other thread, we cannot unconditionally map NORMAL_NC into
> > the VMM.
>
> I'm not suggesting this but rather the VMM map portions of the BAR with
> either Device or Normal-NC, concatenate them (MAP_FIXED) and pass this
> range as a memory slot (or multiple if a slot doesn't allow multiple
> vmas).
The VMM can't know what to do. We already talked about this. The VMM
cannot be involved in the decision to make pages NORMAL_NC or
not. That idea ignores how actual devices work.
Either the VM decides directly as this patch proposes or the VM does
some new generic trap/hypercall to ask the VMM to change it on its
behalf. The VMM cannot do it independently.
AFAIK nobody wants to see a trap/hypercall solution.
That is why we have been exclusively focused on this approach.
> > > The latter has some benefits for DPDK but it's a lot more involved
> > > with
> >
> > DPDK WC support will be solved with some VFIO-only change if anyone
> > ever cares to make it, if that is what you mean.
>
> Yeah. Some arguments I've heard in private and public discussions are
> that the KVM device pass-through shouldn't be different from the DPDK
> case.
I strongly disagree with this.
The KVM case should be solved without the VMM being aware of what
mappings the VM is doing.
DPDK is in control and can directly ask VFIO to make the correct
pgprot with an ioctl.
You can hear Alex also articulate this position in that video.
> There was some statement in there that for x86, the guests are
> allowed to do WC without other KVM restrictions (not sure whether
> that's the case, not familiar with it).
x86 has a similar issue (Sean was talking about this and how he wants
to fix it) where the VMM can restrict things and on x86 there are
configurations where WC does and doesn't work in VMs too. Depends on
who made the hypervisor. :(
Nobody has pushed hard enough to see it resolved in upstream, but I
understand some of the cloud operator set have their own solutions.
> > We talked about this already, the guest must decide, the VMM doesn't
> > have the information to pre-predict which pages the guest will want to
> > use WC on.
>
> Are the Device/Normal offsets within a BAR fixed, documented in e.g. the
> spec, or is this something configurable via some MMIO that the guest
> does?
No, it is fully dynamic on demand with firmware RPCs.
Jason
On Wed, Dec 06, 2023 at 01:20:35PM -0400, Jason Gunthorpe wrote:
> On Wed, Dec 06, 2023 at 04:31:48PM +0000, Catalin Marinas wrote:
> > > This would be fine, as would a VMA flag. Please pick one :)
> > >
> > > I think a VMA flag is simpler than messing with pgprot.
> >
> > I guess one could write a patch and see how it goes ;).
>
> A lot of patches have been sent on this already :(
But not one with a VM_* flag. I guess we could also add a VM_VFIO flag
which implies KVM has fewer restrictions on the memory type. I think
that's more bike-shedding.
The key point is that we don't want to relax this for whatever KVM may
map in the guest but only for certain devices. Just having a vma may not
be sufficient; we can't tell where that vma came from.
So for the vfio bits, completely untested:
-------------8<----------------------------
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 1929103ee59a..b89d2dfcd534 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1863,7 +1863,7 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
* See remap_pfn_range(), called from vfio_pci_fault() but we can't
* change vm_flags within the fault handler. Set them now.
*/
- vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
+ vm_flags_set(vma, VM_VFIO | VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
vma->vm_ops = &vfio_pci_mmap_ops;
return 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 418d26608ece..6df46fd7836a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -391,6 +391,13 @@ extern unsigned int kobjsize(const void *objp);
# define VM_UFFD_MINOR VM_NONE
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_64BIT
+#define VM_VFIO_BIT 39
+#define VM_VFIO BIT(VM_VFIO_BIT)
+#else
+#define VM_VFIO VM_NONE
+#endif
+
/* Bits set in the VMA until the stack is in its final location */
#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ | VM_STACK_EARLY)
-------------8<----------------------------
In KVM, Ankit's patch would take this into account, not just rely on
"device==true".
> > > > If we want the VMM to drive this entirely, we could add a new mmap()
> > > > flag like MAP_WRITECOMBINE or PROT_WRITECOMBINE. They do feel a bit
> > >
> > > As in the other thread, we cannot unconditionally map NORMAL_NC into
> > > the VMM.
> >
> > I'm not suggesting this but rather the VMM map portions of the BAR with
> > either Device or Normal-NC, concatenate them (MAP_FIXED) and pass this
> > range as a memory slot (or multiple if a slot doesn't allow multiple
> > vmas).
>
> The VMM can't know what to do. We already talked about this. The VMM
> cannot be involved in the decision to make pages NORMAL_NC or
> not. That idea ignores how actual devices work.
[...]
> > Are the Device/Normal offsets within a BAR fixed, documented in e.g. the
> > spec, or is this something configurable via some MMIO that the guest
> > does?
>
> No, it is fully dynamic on demand with firmware RPCs.
I think that's a key argument. The VMM cannot, on its own, configure the
BAR and figure a way to communicate this to the guest. We could invent
some para-virtualisation/trapping mechanism but that's unnecessarily
complicated. In the DPDK case, DPDK both configures and interacts with
the device. In the VMM/VM case, we need the VM to do this, we can't
split the configuration in VMM and interaction with the device in the
VM.
--
Catalin
On Wed, Dec 06, 2023 at 06:58:44PM +0000, Catalin Marinas wrote:
> -------------8<----------------------------
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 1929103ee59a..b89d2dfcd534 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1863,7 +1863,7 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
> * See remap_pfn_range(), called from vfio_pci_fault() but we can't
> * change vm_flags within the fault handler. Set them now.
> */
> - vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
> + vm_flags_set(vma, VM_VFIO | VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
> vma->vm_ops = &vfio_pci_mmap_ops;
>
> return 0;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 418d26608ece..6df46fd7836a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -391,6 +391,13 @@ extern unsigned int kobjsize(const void *objp);
> # define VM_UFFD_MINOR VM_NONE
> #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
>
> +#ifdef CONFIG_64BIT
> +#define VM_VFIO_BIT 39
> +#define VM_VFIO BIT(VM_VFIO_BIT)
> +#else
> +#define VM_VFIO VM_NONE
> +#endif
> +
> /* Bits set in the VMA until the stack is in its final location */
> #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ | VM_STACK_EARLY)
> -------------8<----------------------------
>
> In KVM, Ankit's patch would take this into account, not just rely on
> "device==true".
Yes, Ankit let's try this please. I would not call it VM_VFIO though
VM_VFIO_ALLOW_WC ?
Introduce it in a separate patch and summarize this thread, with a
suggested-by from Catalin :)
Cc Sean Christopherson <[email protected]> too as x86 kvm might
support the idea
> I think that's a key argument. The VMM cannot, on its own, configure the
> BAR and figure a way to communicate this to the guest. We could invent
> some para-virtualisation/trapping mechanism but that's unnecessarily
> complicated. In the DPDK case, DPDK both configures and interacts with
> the device. In the VMM/VM case, we need the VM to do this, we can't
> split the configuration in VMM and interaction with the device in the
> VM.
Yes
Thanks,
Jason
On Wed, Dec 06, 2023 at 03:03:56PM -0400, Jason Gunthorpe wrote:
> On Wed, Dec 06, 2023 at 06:58:44PM +0000, Catalin Marinas wrote:
> > -------------8<----------------------------
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 1929103ee59a..b89d2dfcd534 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1863,7 +1863,7 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
> > * See remap_pfn_range(), called from vfio_pci_fault() but we can't
> > * change vm_flags within the fault handler. Set them now.
> > */
> > - vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
> > + vm_flags_set(vma, VM_VFIO | VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
> > vma->vm_ops = &vfio_pci_mmap_ops;
> >
> > return 0;
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 418d26608ece..6df46fd7836a 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -391,6 +391,13 @@ extern unsigned int kobjsize(const void *objp);
> > # define VM_UFFD_MINOR VM_NONE
> > #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
> >
> > +#ifdef CONFIG_64BIT
> > +#define VM_VFIO_BIT 39
> > +#define VM_VFIO BIT(VM_VFIO_BIT)
> > +#else
> > +#define VM_VFIO VM_NONE
> > +#endif
> > +
> > /* Bits set in the VMA until the stack is in its final location */
> > #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ | VM_STACK_EARLY)
> > -------------8<----------------------------
> >
> > In KVM, Ankit's patch would take this into account, not just rely on
> > "device==true".
>
> Yes, Ankit let's try this please. I would not call it VM_VFIO though
>
> VM_VFIO_ALLOW_WC ?
Yeah. I don't really care about the name.
--
Catalin
> On Wed, Dec 06, 2023 at 03:03:56PM -0400, Jason Gunthorpe wrote:
> > On Wed, Dec 06, 2023 at 06:58:44PM +0000, Catalin Marinas wrote:
> > > -------------8<----------------------------
> > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > > index 1929103ee59a..b89d2dfcd534 100644
> > > --- a/drivers/vfio/pci/vfio_pci_core.c
> > > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > > @@ -1863,7 +1863,7 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
> > > * See remap_pfn_range(), called from vfio_pci_fault() but we can't
> > > * change vm_flags within the fault handler. Set them now.
> > > */
> > > - vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
> > > + vm_flags_set(vma, VM_VFIO | VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
> > > vma->vm_ops = &vfio_pci_mmap_ops;
> > >
> > > return 0;
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 418d26608ece..6df46fd7836a 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -391,6 +391,13 @@ extern unsigned int kobjsize(const void *objp);
> > > # define VM_UFFD_MINOR VM_NONE
> > > #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
> > >
> > > +#ifdef CONFIG_64BIT
> > > +#define VM_VFIO_BIT 39
> > > +#define VM_VFIO BIT(VM_VFIO_BIT)
> > > +#else
> > > +#define VM_VFIO VM_NONE
> > > +#endif
> > > +
> > > /* Bits set in the VMA until the stack is in its final location */
> > > #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ | VM_STACK_EARLY)
> > > -------------8<----------------------------
> > >
> > > In KVM, Ankit's patch would take this into account, not just rely on
> > > "device==true".
> >
> > Yes, Ankit let's try this please. I would not call it VM_VFIO though
> >
> > VM_VFIO_ALLOW_WC ?
>
> Yeah. I don't really care about the name.
Thanks, I will make the change to set the VM_VFIO_ALLOW_WC flag in vfio.
Then make use of "device==true" and the presence of the flag to decide on
setting the S2 as NORMAL_NC.
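Something along these lines, as an untested sketch (VM_VFIO_ALLOW_WC being
the flag name suggested above and KVM_PGTABLE_PROT_NORMAL_NC the new
stage-2 flag from this series):

/* Stage-2 attribute decision for device (non-RAM) mappings: only relax to
 * Normal-NC when the backing VMA opted in via the VFIO-set flag. */
static enum kvm_pgtable_prot stage2_mmio_prot(struct vm_area_struct *vma)
{
	if (vma && (vma->vm_flags & VM_VFIO_ALLOW_WC))
		return KVM_PGTABLE_PROT_NORMAL_NC;

	return KVM_PGTABLE_PROT_DEVICE;
}

user_mem_abort() would then OR the result into the rest of the stage-2
permission bits as it does today for KVM_PGTABLE_PROT_DEVICE.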
On Wed, Dec 06, 2023 at 12:48:02PM -0400, Jason Gunthorpe wrote:
> On Wed, Dec 06, 2023 at 04:23:25PM +0000, Catalin Marinas wrote:
> > On Wed, Dec 06, 2023 at 11:38:09AM -0400, Jason Gunthorpe wrote:
> > > On Wed, Dec 06, 2023 at 04:18:05PM +0100, Lorenzo Pieralisi wrote:
> > > > On Wed, Dec 06, 2023 at 11:05:56AM -0400, Jason Gunthorpe wrote:
> > > > > On Wed, Dec 06, 2023 at 02:49:02PM +0000, Catalin Marinas wrote:
> > > > > > BTW, on those Mellanox devices that require different attributes within
> > > > > > a BAR, do they have a problem with speculative reads causing
> > > > > > side-effects?
> > > > >
> > > > > Yes. We definitely have had that problem in the past on older
> > > > > devices. VFIO must map the BAR using pgprot_device/noncached() into
> > > > > the VMM, no other choice is functionally OK.
> > > >
> > > > Were those BARs tagged as prefetchable or non-prefetchable ? I assume the
> > > > latter but please let me know if I am guessing wrong.
> > >
> > > I don't know it was quite old HW. Probably.
> > >
> > > Just because a BAR is not marked as prefetchable doesn't mean that the
> > > device can't use NORMAL_NC on subsets of it.
> >
> > What about the other way around - would we have a prefetchable BAR that
> > has portions which are unprefetchable?
>
> I would say possibly.
>
> Prefetch is a dead concept in PCIe, it was obsoleted in PCI-X about 20
> years ago. No PCIe system has ever done prefetch.
>
> There is a strong incentive to mark BAR's as prefetchable because it
> allows 64 bit addressing in configurations with bridges.
If by strong incentive you mean the "Additional guidance on the
Prefetchable Bit in Memory Space BARs" in the PCI express specifications,
I think it has been removed from the spec and the criteria that had to be
met to implement it were basically impossible to fulfill on ARM systems,
it did not make any sense in the first place.
I agree on your statement related to the prefetchable concept but I
believe that a prefetchable BAR containing regions that have
read side-effects is essentially a borked design unless at system level
speculative reads are prevented (as far as I understand the
implementation note this could only be an endpoint integrated in a
system where read speculation can't just happen (?)).
Long story short: a PCIe card/device that can be plugged on any PCIe
compliant system (x86, ARM or whatever) should not mark a
BAR region with memory read side-effects as prefetchable, either
that or I don't understand what the implementation note above
was all about.
AFAIK the prefetchable concept in PCIe is likely to be scrapped
altogether in the not too distant future.
Anyway, that was just for completeness (and to shed light
on the BAR prefetchable bit usage).
Lorenzo
On Thu, Dec 07, 2023 at 11:13:52AM +0100, Lorenzo Pieralisi wrote:
> > > What about the other way around - would we have a prefetchable BAR that
> > > has portions which are unprefetchable?
> >
> > I would say possibly.
> >
> > Prefetch is a dead concept in PCIe, it was obsoleted in PCI-X about 20
> > years ago. No PCIe system has ever done prefetch.
> >
> > There is a strong incentive to mark BAR's as prefetchable because it
> > allows 64 bit addressing in configurations with bridges.
>
> If by strong incentive you mean the "Additional guidance on the
> Prefetchable Bit in Memory Space BARs" in the PCI express specifications,
> I think it has been removed from the spec and the criteria that had to be
> met to implement it were basically impossible to fulfill on ARM systems,
> it did not make any sense in the first place.
No, I mean many systems don't have room to accommodate large 32 bit
BARs and the only real way to make stuff work is to have a 64 bit BAR
by setting prefetchable. Given mis-marking a read-side-effect region
as prefetchable has no actual consequence on PCI-E I would not be
surprised to learn people have done this.
> I agree on your statement related to the prefetchable concept but I
> believe that a prefetchable BAR containing regions that have
> read side-effects is essentially a borked design unless at system level
> speculative reads are prevented (as far as I understand the
> implementation note this could only be an endpoint integrated in a
> system where read speculation can't just happen (?)).
IMHO the PCIe concept of prefetchable has no intersection with the
CPU. The CPU chooses entirely on its own what rules to apply to the
PCI MMIO regions and no OS should drive that decision based on the
prefetchable BAR flag.
The *only* purpose of the prefetchable flag was to permit a classical
33/66MHz PCI bridge to prefetch reads because the physical bus
protocol back then did not include a read length.
For any system that does not have an ancient PCI bridge the indication
is totally useless.
Jason
On Thu, Dec 07, 2023 at 09:38:25AM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 07, 2023 at 11:13:52AM +0100, Lorenzo Pieralisi wrote:
> > > > What about the other way around - would we have a prefetchable BAR that
> > > > has portions which are unprefetchable?
> > >
> > > I would say possibly.
> > >
> > > Prefetch is a dead concept in PCIe, it was obsoleted in PCI-X about 20
> > > years ago. No PCIe system has ever done prefetch.
> > >
> > > There is a strong incentive to mark BAR's as prefetchable because it
> > > allows 64 bit addressing in configurations with bridges.
> >
> > If by strong incentive you mean the "Additional guidance on the
> > Prefetchable Bit in Memory Space BARs" in the PCI express specifications,
> > I think it has been removed from the spec and the criteria that had to be
> > met to implement it were basically impossible to fulfill on ARM systems,
> > it did not make any sense in the first place.
>
> No, I mean many systems don't have room to accommodate large 32 bit
> BARs and the only real way to make stuff work is to have a 64 bit BAR
> by setting prefetchable.
That's what the implementation note I mentioned referred to ;)
> Given mis-marking a read-side-effect region as prefetchable has no
> actual consequence on PCI-E I would not be surprised to learn people
> have done this.
PCIe specs 6.1, 7.5.1.2.1 "Base Address Registers"
"A function is permitted to mark a range as prefetchable if there are
no side effects on reads..."
I don't think an OS should use the prefetchable flag to infer anything
(even though we do at the moment for the sysfs mappings; I know the
prefetchable concept is being scrapped from the PCI specs altogether for
a reason), but I don't see how we can say that's a SW bug at present
given what I quoted above.
I'd agree that it is best not to use that flag for new code we are adding
(because it will be deprecated soon).
Thanks,
Lorenzo