2023-03-12 12:02:30

by Huang Rui

Subject: [RFC PATCH 0/5] Add Xen PVH dom0 support for GPU

Hi all,

Currently, we are working to add VirtIO GPU and passthrough GPU support on
Xen. We expect to use HVM for domU and PVH for dom0. The x86 PVH dom0
support needs a few modifications on our APU platform. These features
require support from multiple software components, including the kernel,
Xen, QEMU, Mesa, and virglrenderer. Please see the patch series for Xen
and QEMU below.

Xen: https://lists.xenproject.org/archives/html/xen-devel/2023-03/msg00714.html
QEMU: https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg03972.html

The kernel part mainly adds PVH dom0 support:

1) Enable Xen PVH dom0 for AMDGPU

Please check patches 1 to 3, which enable Xen PVH dom0 for amdgpu. We would
like to use the hardware IOMMU instead of swiotlb for buffer copies; PV
dom0 only supports swiotlb.

There are still some workarounds in the kernel that need to be sorted out,
such as the one below:
https://git.kernel.org/pub/scm/linux/kernel/git/rui/linux.git/commit/?h=upstream-fox-xen&id=9bee65dd3498dfc6aad283d22ff641198b5c91ed

2) Add PCIe Passthrough (GPU) on Xen PVH dom0

Please check patches 4 to 5. They implement the acpi_register_gsi_xen_pvh()
API to register the GSI for a guest domU, and add a new privcmd ioctl to
look up the GSI from the IRQ.
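
As a rough illustration of the new ioctl, here is a minimal userspace sketch
(not part of the series; it assumes the updated include/uapi/xen/privcmd.h
from patch 5 and the standard /dev/xen/privcmd node, with error handling
trimmed):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xen/privcmd.h>

/* Translate a dom0 Linux IRQ number back to its GSI via privcmd. */
static int gsi_from_irq(int irq)
{
        struct privcmd_gsi_from_irq arg = { .irq = irq };
        int fd = open("/dev/xen/privcmd", O_RDWR);

        if (fd < 0)
                return -1;
        if (ioctl(fd, IOCTL_PRIVCMD_GSI_FROM_IRQ, &arg) < 0) {
                close(fd);
                return -1;
        }
        close(fd);
        return arg.gsi;
}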

Below are screenshots of these functions; please take a look.

Passthrough GPU: https://drive.google.com/file/d/17onr5gvDK8KM_LniHTSQEI2hGJZlI09L/view?usp=share_link
Venus: https://drive.google.com/file/d/1_lPq6DMwHu1JQv7LUUVRx31dBj0HJYcL/view?usp=share_link
Zink: https://drive.google.com/file/d/1FxLmKu6X7uJOxx1ZzwOm1yA6IL5WMGzd/view?usp=share_link

Repositories:
Kernel: https://git.kernel.org/pub/scm/linux/kernel/git/rui/linux.git/log/?h=upstream-fox-xen
Xen: https://gitlab.com/huangrui123/xen/-/commits/upstream-for-xen
QEMU: https://gitlab.com/huangrui123/qemu/-/commits/upstream-for-xen
Mesa: https://gitlab.freedesktop.org/rui/mesa/-/commits/upstream-for-xen
Virglrenderer: https://gitlab.freedesktop.org/rui/virglrenderer/-/commits/upstream-for-xen

We are writing the documentation on the Xen wiki page and will update it in
a future version.

Thanks,
Ray

Chen Jiqian (2):
x86/xen: acpi registers gsi for xen pvh
xen/privcmd: add IOCTL_PRIVCMD_GSI_FROM_IRQ

Huang Rui (3):
x86/xen: disable swiotlb for xen pvh
xen/grants: update initialization order of xen grant table
drm/amdgpu: set passthrough mode for xen pvh/hvm

arch/x86/include/asm/apic.h | 7 ++++
arch/x86/include/asm/xen/pci.h | 5 +++
arch/x86/kernel/acpi/boot.c | 2 +-
arch/x86/kernel/pci-dma.c | 8 ++++-
arch/x86/pci/xen.c | 43 ++++++++++++++++++++++++
arch/x86/xen/grant-table.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 3 +-
drivers/xen/events/events_base.c | 39 +++++++++++++++++++++
drivers/xen/grant-table.c | 2 +-
drivers/xen/privcmd.c | 20 +++++++++++
include/uapi/xen/privcmd.h | 7 ++++
include/xen/events.h | 5 +++
12 files changed, 138 insertions(+), 5 deletions(-)

--
2.25.1



2023-03-12 12:02:39

by Huang Rui

Subject: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

Xen PVH is a paravirtualized mode that takes advantage of hardware
virtualization support when possible. It will use the hardware IOMMU
instead of xen-swiotlb, so disable swiotlb if the current domain is
Xen PVH.

Signed-off-by: Huang Rui <[email protected]>
---
arch/x86/kernel/pci-dma.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 30bbe4abb5d6..f5c73dd18f2a 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -74,6 +74,12 @@ static inline void __init pci_swiotlb_detect(void)
#ifdef CONFIG_SWIOTLB_XEN
static void __init pci_xen_swiotlb_init(void)
{
+ /* Xen PVH domain won't use swiotlb */
+ if (xen_pvh_domain()) {
+ x86_swiotlb_enable = false;
+ return;
+ }
+
if (!xen_initial_domain() && !x86_swiotlb_enable)
return;
x86_swiotlb_enable = true;
@@ -86,7 +92,7 @@ static void __init pci_xen_swiotlb_init(void)

int pci_xen_swiotlb_init_late(void)
{
- if (dma_ops == &xen_swiotlb_dma_ops)
+ if (xen_pvh_domain() || dma_ops == &xen_swiotlb_dma_ops)
return 0;

/* we can work with the default swiotlb */
--
2.25.1


2023-03-12 12:02:43

by Huang Rui

Subject: [RFC PATCH 2/5] xen/grants: update initialization order of xen grant table

The xen grant table will be initialized before the PCI resources are
parsed, so xen_alloc_unpopulated_pages() ends up using a range from the
PCI window because Linux hasn't parsed the PCI information yet.

So modify the initialization order to make sure the real PCI resources
are parsed first.

Signed-off-by: Huang Rui <[email protected]>
---
arch/x86/xen/grant-table.c | 2 +-
drivers/xen/grant-table.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/grant-table.c b/arch/x86/xen/grant-table.c
index 1e681bf62561..64a04d1e70f5 100644
--- a/arch/x86/xen/grant-table.c
+++ b/arch/x86/xen/grant-table.c
@@ -165,5 +165,5 @@ static int __init xen_pvh_gnttab_setup(void)
}
/* Call it _before_ __gnttab_init as we need to initialize the
* xen_auto_xlat_grant_frames first. */
-core_initcall(xen_pvh_gnttab_setup);
+fs_initcall_sync(xen_pvh_gnttab_setup);
#endif
diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index e1ec725c2819..6382112f3473 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -1680,4 +1680,4 @@ static int __gnttab_init(void)
}
/* Starts after core_initcall so that xen_pvh_gnttab_setup can be called
* beforehand to initialize xen_auto_xlat_grant_frames. */
-core_initcall_sync(__gnttab_init);
+rootfs_initcall(__gnttab_init);
--
2.25.1


2023-03-12 12:02:48

by Huang Rui

Subject: [RFC PATCH 3/5] drm/amdgpu: set passthrough mode for xen pvh/hvm

There is a second-stage translation between the guest machine address
and the host machine address in Xen PVH/HVM. The PCI BAR addresses seen
by the Xen guest kernel are not translated at the second stage on Xen
PVH/HVM, so they are not the real physical addresses that the hardware
expects, and we need to set passthrough mode for Xen PVH/HVM as well.

Signed-off-by: Huang Rui <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index f2e2cbaa7fde..7b4369eba19d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -743,7 +743,8 @@ void amdgpu_detect_virtualization(struct amdgpu_device *adev)

if (!reg) {
/* passthrough mode exclus sriov mod */
- if (is_virtual_machine() && !xen_initial_domain())
+ if (is_virtual_machine() &&
+ !(xen_initial_domain() && xen_pv_domain()))
adev->virt.caps |= AMDGPU_PASSTHROUGH_MODE;
}

--
2.25.1


2023-03-12 12:03:14

by Huang Rui

Subject: [RFC PATCH 5/5] xen/privcmd: add IOCTL_PRIVCMD_GSI_FROM_IRQ

From: Chen Jiqian <[email protected]>

When the hypervisor gets an interrupt, it needs the interrupt's
GSI number instead of the IRQ number. The GSI number is unique
in Xen, but an IRQ number is only unique within one domain.
So we need to record the relationship between IRQ and GSI when
dom0 initializes the PCI devices, and provide the ioctl
IOCTL_PRIVCMD_GSI_FROM_IRQ to translate an IRQ into a GSI, so
that the pirq can be mapped successfully on the hypervisor side.

Signed-off-by: Chen Jiqian <[email protected]>
Signed-off-by: Huang Rui <[email protected]>
---
arch/x86/pci/xen.c | 4 ++++
drivers/xen/events/events_base.c | 37 ++++++++++++++++++++++++++++++++
drivers/xen/privcmd.c | 20 +++++++++++++++++
include/uapi/xen/privcmd.h | 7 ++++++
include/xen/events.h | 5 +++++
5 files changed, 73 insertions(+)

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 43b8b6d7147b..3237961c7640 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -143,6 +143,10 @@ static int acpi_register_gsi_xen_pvh(struct device *dev, u32 gsi,
else if (rc)
printk(KERN_ERR "Failed to setup GSI :%u, err_code:%d\n", gsi, rc);

+ rc = xen_pvh_add_gsi_irq_map(gsi, irq);
+ if (rc == -EEXIST)
+ printk(KERN_INFO "Already map the GSI :%u and IRQ: %d\n", gsi, irq);
+
return irq;
}

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 48dff0ed9acd..39a57fed2de3 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -967,6 +967,43 @@ int xen_irq_from_gsi(unsigned gsi)
}
EXPORT_SYMBOL_GPL(xen_irq_from_gsi);

+int xen_gsi_from_irq(unsigned irq)
+{
+ struct irq_info *info;
+
+ list_for_each_entry(info, &xen_irq_list_head, list) {
+ if (info->type != IRQT_PIRQ)
+ continue;
+
+ if (info->irq == irq)
+ return info->u.pirq.gsi;
+ }
+
+ return -1;
+}
+EXPORT_SYMBOL_GPL(xen_gsi_from_irq);
+
+int xen_pvh_add_gsi_irq_map(unsigned gsi, unsigned irq)
+{
+ int tmp_irq;
+ struct irq_info *info;
+
+ tmp_irq = xen_irq_from_gsi(gsi);
+ if (tmp_irq != -1)
+ return -EEXIST;
+
+ info = kzalloc(sizeof(*info), GFP_KERNEL);
+ if (info == NULL)
+ panic("Unable to allocate metadata for GSI%d\n", gsi);
+
+ info->type = IRQT_PIRQ;
+ info->irq = irq;
+ info->u.pirq.gsi = gsi;
+ list_add_tail(&info->list, &xen_irq_list_head);
+
+ return 0;
+}
+
static void __unbind_from_irq(unsigned int irq)
{
evtchn_port_t evtchn = evtchn_from_irq(irq);
diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index e88e8f6f0a33..830e84451731 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -37,6 +37,7 @@
#include <xen/page.h>
#include <xen/xen-ops.h>
#include <xen/balloon.h>
+#include <xen/events.h>

#include "privcmd.h"

@@ -833,6 +834,21 @@ static long privcmd_ioctl_mmap_resource(struct file *file,
return rc;
}

+static long privcmd_ioctl_gsi_from_irq(struct file *file, void __user *udata)
+{
+ struct privcmd_gsi_from_irq kdata;
+
+ if (copy_from_user(&kdata, udata, sizeof(kdata)))
+ return -EFAULT;
+
+ kdata.gsi = xen_gsi_from_irq(kdata.irq);
+
+ if (copy_to_user(udata, &kdata, sizeof(kdata)))
+ return -EFAULT;
+
+ return 0;
+}
+
static long privcmd_ioctl(struct file *file,
unsigned int cmd, unsigned long data)
{
@@ -868,6 +884,10 @@ static long privcmd_ioctl(struct file *file,
ret = privcmd_ioctl_mmap_resource(file, udata);
break;

+ case IOCTL_PRIVCMD_GSI_FROM_IRQ:
+ ret = privcmd_ioctl_gsi_from_irq(file, udata);
+ break;
+
default:
break;
}
diff --git a/include/uapi/xen/privcmd.h b/include/uapi/xen/privcmd.h
index d2029556083e..55fe748bbfd7 100644
--- a/include/uapi/xen/privcmd.h
+++ b/include/uapi/xen/privcmd.h
@@ -98,6 +98,11 @@ struct privcmd_mmap_resource {
__u64 addr;
};

+struct privcmd_gsi_from_irq {
+ __u32 irq;
+ __u32 gsi;
+};
+
/*
* @cmd: IOCTL_PRIVCMD_HYPERCALL
* @arg: &privcmd_hypercall_t
@@ -125,5 +130,7 @@ struct privcmd_mmap_resource {
_IOC(_IOC_NONE, 'P', 6, sizeof(domid_t))
#define IOCTL_PRIVCMD_MMAP_RESOURCE \
_IOC(_IOC_NONE, 'P', 7, sizeof(struct privcmd_mmap_resource))
+#define IOCTL_PRIVCMD_GSI_FROM_IRQ \
+ _IOC(_IOC_NONE, 'P', 8, sizeof(struct privcmd_gsi_from_irq))

#endif /* __LINUX_PUBLIC_PRIVCMD_H__ */
diff --git a/include/xen/events.h b/include/xen/events.h
index 344081e71584..8377d8dfaa71 100644
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -133,6 +133,11 @@ int xen_pirq_from_irq(unsigned irq);
/* Return the irq allocated to the gsi */
int xen_irq_from_gsi(unsigned gsi);

+/* Return the gsi from irq */
+int xen_gsi_from_irq(unsigned irq);
+
+int xen_pvh_add_gsi_irq_map(unsigned gsi, unsigned irq);
+
/* Determine whether to ignore this IRQ if it is passed to a guest. */
int xen_test_irq_shared(int irq);

--
2.25.1


2023-03-12 12:03:14

by Huang Rui

Subject: [RFC PATCH 4/5] x86/xen: acpi registers gsi for xen pvh

From: Chen Jiqian <[email protected]>

Add acpi_register_gsi_xen_pvh() to register the GSI for PVH mode.
In addition to calling acpi_register_gsi_ioapic(), it also sets up
a mapping between the GSI and the vector on the hypervisor side, so
that when the dGPU raises an interrupt, the hypervisor can find the
right guest domain to handle the interrupt by vector.

Signed-off-by: Chen Jiqian <[email protected]>
Signed-off-by: Huang Rui <[email protected]>
---
arch/x86/include/asm/apic.h | 7 ++++++
arch/x86/include/asm/xen/pci.h | 5 ++++
arch/x86/kernel/acpi/boot.c | 2 +-
arch/x86/pci/xen.c | 39 ++++++++++++++++++++++++++++++++
drivers/xen/events/events_base.c | 2 ++
5 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 3415321c8240..f3bc5de1f1d4 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -179,6 +179,8 @@ extern bool apic_needs_pit(void);

extern void apic_send_IPI_allbutself(unsigned int vector);

+extern int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+ int trigger, int polarity);
#else /* !CONFIG_X86_LOCAL_APIC */
static inline void lapic_shutdown(void) { }
#define local_apic_timer_c2_ok 1
@@ -193,6 +195,11 @@ static inline void apic_intr_mode_init(void) { }
static inline void lapic_assign_system_vectors(void) { }
static inline void lapic_assign_legacy_vector(unsigned int i, bool r) { }
static inline bool apic_needs_pit(void) { return true; }
+static inline int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+ int trigger, int polarity)
+{
+ return (int)gsi;
+}
#endif /* !CONFIG_X86_LOCAL_APIC */

#ifdef CONFIG_X86_X2APIC
diff --git a/arch/x86/include/asm/xen/pci.h b/arch/x86/include/asm/xen/pci.h
index 9015b888edd6..aa8ded61fc2d 100644
--- a/arch/x86/include/asm/xen/pci.h
+++ b/arch/x86/include/asm/xen/pci.h
@@ -5,6 +5,7 @@
#if defined(CONFIG_PCI_XEN)
extern int __init pci_xen_init(void);
extern int __init pci_xen_hvm_init(void);
+extern int __init pci_xen_pvh_init(void);
#define pci_xen 1
#else
#define pci_xen 0
@@ -13,6 +14,10 @@ static inline int pci_xen_hvm_init(void)
{
return -1;
}
+static inline int pci_xen_pvh_init(void)
+{
+ return -1;
+}
#endif
#ifdef CONFIG_XEN_PV_DOM0
int __init pci_xen_initial_domain(void);
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 907cc98b1938..25ec48dd897e 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -718,7 +718,7 @@ static int acpi_register_gsi_pic(struct device *dev, u32 gsi,
}

#ifdef CONFIG_X86_LOCAL_APIC
-static int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
int trigger, int polarity)
{
int irq = gsi;
diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index b94f727251b6..43b8b6d7147b 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -114,6 +114,38 @@ static int acpi_register_gsi_xen_hvm(struct device *dev, u32 gsi,
false /* no mapping of GSI to PIRQ */);
}

+static int acpi_register_gsi_xen_pvh(struct device *dev, u32 gsi,
+ int trigger, int polarity)
+{
+ int irq;
+ int rc;
+ struct physdev_map_pirq map_irq;
+ struct physdev_setup_gsi setup_gsi;
+
+ irq = acpi_register_gsi_ioapic(dev, gsi, trigger, polarity);
+
+ map_irq.domid = DOMID_SELF;
+ map_irq.type = MAP_PIRQ_TYPE_GSI;
+ map_irq.index = gsi;
+ map_irq.pirq = gsi;
+
+ rc = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
+ if (rc)
+ printk(KERN_ERR "xen map GSI: %u failed %d\n", gsi, rc);
+
+ setup_gsi.gsi = gsi;
+ setup_gsi.triggering = (trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
+ setup_gsi.polarity = (polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
+
+ rc = HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
+ if (rc == -EEXIST)
+ printk(KERN_INFO "Already setup the GSI :%u\n", gsi);
+ else if (rc)
+ printk(KERN_ERR "Failed to setup GSI :%u, err_code:%d\n", gsi, rc);
+
+ return irq;
+}
+
#ifdef CONFIG_XEN_PV_DOM0
static int xen_register_gsi(u32 gsi, int triggering, int polarity)
{
@@ -554,6 +586,13 @@ int __init pci_xen_hvm_init(void)
return 0;
}

+int __init pci_xen_pvh_init(void)
+{
+ __acpi_register_gsi = acpi_register_gsi_xen_pvh;
+ __acpi_unregister_gsi = NULL;
+ return 0;
+}
+
#ifdef CONFIG_XEN_PV_DOM0
int __init pci_xen_initial_domain(void)
{
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c443f04aaad7..48dff0ed9acd 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -2317,6 +2317,8 @@ void __init xen_init_IRQ(void)
xen_init_setup_upcall_vector();
xen_alloc_callback_vector();

+ if (xen_pvh_domain())
+ pci_xen_pvh_init();

if (xen_hvm_domain()) {
native_init_IRQ();
--
2.25.1


2023-03-13 09:01:07

by Jan Beulich

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On 12.03.2023 13:01, Huang Rui wrote:
> Xen PVH is the paravirtualized mode and takes advantage of hardware
> virtualization support when possible. It will using the hardware IOMMU
> support instead of xen-swiotlb, so disable swiotlb if current domain is
> Xen PVH.

But the kernel has no way (yet) to drive the IOMMU, so how can it get
away without resorting to swiotlb in certain cases (like I/O to an
address-restricted device)?

Jan

2023-03-15 00:52:51

by Stefano Stabellini

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On Mon, 13 Mar 2023, Jan Beulich wrote:
> On 12.03.2023 13:01, Huang Rui wrote:
> > Xen PVH is the paravirtualized mode and takes advantage of hardware
> > virtualization support when possible. It will using the hardware IOMMU
> > support instead of xen-swiotlb, so disable swiotlb if current domain is
> > Xen PVH.
>
> But the kernel has no way (yet) to drive the IOMMU, so how can it get
> away without resorting to swiotlb in certain cases (like I/O to an
> address-restricted device)?

I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
so we can use guest physical addresses instead of machine addresses for
DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
(see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
case is XENFEAT_not_direct_mapped).

Juergen, what do you think? Would you rather make xen_swiotlb_detect
common between ARM and x86?
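
(For reference, the ARM-side helper mentioned above looks roughly like the
following; this is paraphrased and may differ in detail from the exact code
in include/xen/arm/swiotlb-xen.h:)

static inline int xen_swiotlb_detect(void)
{
        if (!xen_domain())
                return 0;
        if (xen_feature(XENFEAT_direct_mapped))
                return 1;
        /* Legacy case: a dom0 that advertises neither feature flag is
         * assumed to be direct mapped and still needs swiotlb-xen. */
        if (!xen_feature(XENFEAT_not_direct_mapped) && xen_initial_domain())
                return 1;
        return 0;
}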

2023-03-15 04:15:02

by Huang Rui

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On Wed, Mar 15, 2023 at 08:52:30AM +0800, Stefano Stabellini wrote:
> On Mon, 13 Mar 2023, Jan Beulich wrote:
> > On 12.03.2023 13:01, Huang Rui wrote:
> > > Xen PVH is the paravirtualized mode and takes advantage of hardware
> > > virtualization support when possible. It will using the hardware IOMMU
> > > support instead of xen-swiotlb, so disable swiotlb if current domain is
> > > Xen PVH.
> >
> > But the kernel has no way (yet) to drive the IOMMU, so how can it get
> > away without resorting to swiotlb in certain cases (like I/O to an
> > address-restricted device)?
>
> I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
> need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
> so we can use guest physical addresses instead of machine addresses for
> DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
> (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
> case is XENFEAT_not_direct_mapped).

Hi Jan, sorry for the late reply. We are using the native kernel amdgpu and
TTM drivers in Dom0. amdgpu/TTM would like to use the IOMMU to allocate
coherent buffers for userptr, which maps user-space memory for GPU access;
however, swiotlb doesn't support this. In other words, with swiotlb we can
only handle the buffer page by page.

Thanks,
Ray

>
> Jurgen, what do you think? Would you rather make xen_swiotlb_detect
> common between ARM and x86?

2023-03-15 06:53:42

by Jan Beulich

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On 15.03.2023 05:14, Huang Rui wrote:
> On Wed, Mar 15, 2023 at 08:52:30AM +0800, Stefano Stabellini wrote:
>> On Mon, 13 Mar 2023, Jan Beulich wrote:
>>> On 12.03.2023 13:01, Huang Rui wrote:
>>>> Xen PVH is the paravirtualized mode and takes advantage of hardware
>>>> virtualization support when possible. It will using the hardware IOMMU
>>>> support instead of xen-swiotlb, so disable swiotlb if current domain is
>>>> Xen PVH.
>>>
>>> But the kernel has no way (yet) to drive the IOMMU, so how can it get
>>> away without resorting to swiotlb in certain cases (like I/O to an
>>> address-restricted device)?
>>
>> I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
>> need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
>> so we can use guest physical addresses instead of machine addresses for
>> DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
>> (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
>> case is XENFEAT_not_direct_mapped).
>
> Hi Jan, sorry to late reply. We are using the native kernel amdgpu and ttm
> driver on Dom0, amdgpu/ttm would like to use IOMMU to allocate coherent
> buffers for userptr that map the user space memory to gpu access, however,
> swiotlb doesn't support this. In other words, with swiotlb, we only can
> handle the buffer page by page.

But how does outright disabling swiotlb help with this? There still wouldn't
be an IOMMU that your kernel has control over. Looks like you want something
like pvIOMMU, but that work was never completed. And even then the swiotlb
may continue to be needed for other purposes.

Jan

2023-03-15 06:55:41

by Jan Beulich

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On 15.03.2023 01:52, Stefano Stabellini wrote:
> On Mon, 13 Mar 2023, Jan Beulich wrote:
>> On 12.03.2023 13:01, Huang Rui wrote:
>>> Xen PVH is the paravirtualized mode and takes advantage of hardware
>>> virtualization support when possible. It will using the hardware IOMMU
>>> support instead of xen-swiotlb, so disable swiotlb if current domain is
>>> Xen PVH.
>>
>> But the kernel has no way (yet) to drive the IOMMU, so how can it get
>> away without resorting to swiotlb in certain cases (like I/O to an
>> address-restricted device)?
>
> I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
> need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
> so we can use guest physical addresses instead of machine addresses for
> DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
> (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
> case is XENFEAT_not_direct_mapped).

But how does Xen using an IOMMU help with, as said, address-restricted
devices? They may still need e.g. a 32-bit address to be programmed in,
and if the kernel has memory beyond the 4G boundary not all I/O buffers
may fulfill this requirement.

Jan

2023-03-15 12:32:28

by Roger Pau Monne

Subject: Re: [RFC PATCH 2/5] xen/grants: update initialization order of xen grant table

On Sun, Mar 12, 2023 at 08:01:54PM +0800, Huang Rui wrote:
> The xen grant table will be initialied before parsing the PCI resources,
> so xen_alloc_unpopulated_pages() ends up using a range from the PCI
> window because Linux hasn't parsed the PCI information yet.
>
> So modify the initialization order to make sure the real PCI resources
> are parsed before.

Has this been tested on a domU to make sure the late grant table init
doesn't interfere with PV devices getting setup?

> Signed-off-by: Huang Rui <[email protected]>
> ---
> arch/x86/xen/grant-table.c | 2 +-
> drivers/xen/grant-table.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/xen/grant-table.c b/arch/x86/xen/grant-table.c
> index 1e681bf62561..64a04d1e70f5 100644
> --- a/arch/x86/xen/grant-table.c
> +++ b/arch/x86/xen/grant-table.c
> @@ -165,5 +165,5 @@ static int __init xen_pvh_gnttab_setup(void)
> }
> /* Call it _before_ __gnttab_init as we need to initialize the
> * xen_auto_xlat_grant_frames first. */
> -core_initcall(xen_pvh_gnttab_setup);
> +fs_initcall_sync(xen_pvh_gnttab_setup);
> #endif
> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
> index e1ec725c2819..6382112f3473 100644
> --- a/drivers/xen/grant-table.c
> +++ b/drivers/xen/grant-table.c
> @@ -1680,4 +1680,4 @@ static int __gnttab_init(void)
> }
> /* Starts after core_initcall so that xen_pvh_gnttab_setup can be called
> * beforehand to initialize xen_auto_xlat_grant_frames. */

The comment needs to be updated, but I was wondering whether it wouldn't be
better to simply call xen_pvh_gnttab_setup() from __gnttab_init() itself
when running as a PVH guest?

Thanks, Roger.
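
For reference, the kernel initcall levels run in the order listed in the
comment block below (see include/linux/init.h); the patch keeps the relative
order of the two setup calls while pushing both past PCI resource parsing,
whereas Roger's alternative would avoid the level juggling entirely by
calling the PVH setup from __gnttab_init() directly.

/*
 * Initcall levels, in execution order:
 *   pure, core, core_sync, postcore, postcore_sync, arch, arch_sync,
 *   subsys, subsys_sync, fs, fs_sync, rootfs, device, device_sync,
 *   late, late_sync
 *
 * The series moves xen_pvh_gnttab_setup() from core to fs_sync and
 * __gnttab_init() from core_sync to rootfs.
 */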

2023-03-15 12:45:11

by Roger Pau Monne

Subject: Re: [RFC PATCH 3/5] drm/amdgpu: set passthrough mode for xen pvh/hvm

On Sun, Mar 12, 2023 at 08:01:55PM +0800, Huang Rui wrote:
> There is an second stage translation between the guest machine address
> and host machine address in Xen PVH/HVM. The PCI bar address in the xen
> guest kernel are not translated at the second stage on Xen PVH/HVM, so

I'm confused by the sentence above, do you think it could be reworded
or expanded to clarify?

PCI BAR addresses are not in the guest kernel, but rather in the
physical memory layout made available to the guest.

Also, I'm unsure why xen_initial_domain() needs to be used in the
conditional below: all PV domains handle addresses the same, so if
it's not needed for a PV dom0 it's likely not needed for a PV domU
either. Albeit it would help to know more about what
AMDGPU_PASSTHROUGH_MODE implies.

Thanks, Roger.

2023-03-15 14:01:02

by Roger Pau Monne

Subject: Re: [RFC PATCH 4/5] x86/xen: acpi registers gsi for xen pvh

On Sun, Mar 12, 2023 at 08:01:56PM +0800, Huang Rui wrote:
> From: Chen Jiqian <[email protected]>
>
> Add acpi_register_gsi_xen_pvh() to register gsi for PVH mode.
> In addition to call acpi_register_gsi_ioapic(), it also setup
> a map between gsi and vector in hypervisor side. So that,
> when dgpu create an interrupt, hypervisor can correctly find
> which guest domain to process interrupt by vector.

The term 'dgpu' needs clarifying or replacing by a more generic
naming.

Also, I would like to be able to get away from requiring dom0 to
register the GSIs in this way. If you take a look at Xen, there's
code in the emulated IO-APIC available to dom0 that already does this
registering (see vioapic_hwdom_map_gsi() in Xen).

I think the problem here is that the GSI used by the device you want
to passthrough has never had its pin unmasked in the IO-APIC, and
hence hasn't been registered.

It would be helpful if you could state in the commit message what
issue you are trying to solve by doing this registering here, I assume
it is done in order to map the IRQ to a PIRQ, so later calls by the
toolstack to bind it succeed.

Would it be possible instead to perform the call to PHYSDEVOP_map_pirq
in the toolstack itself if the PIRQ cannot be found?

Thanks, Roger.
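
If it helps to picture that last suggestion, a toolstack-side fallback could
look roughly like this (a hedged sketch using libxenctrl's existing
xc_physdev_map_pirq(); whether the toolstack is the right place to do this
is exactly the open question above):

#include <xenctrl.h>

/* Sketch: ask Xen to map a GSI to a pirq for the target domain when no
 * existing pirq is found; Xen may return a different pirq than requested. */
static int map_gsi_to_pirq(xc_interface *xch, uint32_t domid, int gsi)
{
        int pirq = gsi;  /* request an identity mapping */
        int rc = xc_physdev_map_pirq(xch, domid, gsi, &pirq);

        return rc ? rc : pirq;
}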

2023-03-15 14:26:38

by Roger Pau Monne

Subject: Re: [RFC PATCH 5/5] xen/privcmd: add IOCTL_PRIVCMD_GSI_FROM_IRQ

On Sun, Mar 12, 2023 at 08:01:57PM +0800, Huang Rui wrote:
> From: Chen Jiqian <[email protected]>
>
> When hypervisor get an interrupt, it needs interrupt's
> gsi number instead of irq number. Gsi number is unique
> in xen, but irq number is only unique in one domain.
> So, we need to record the relationship between irq and
> gsi when dom0 initialized pci devices, and provide syscall
> IOCTL_PRIVCMD_GSI_FROM_IRQ to translate irq to gsi. So
> that, we can map pirq successfully in hypervisor side.

GSI is not only unique in Xen, it's an ACPI provided value that's
unique in the platform. The text above makes it look like GSI is some
kind of internal Xen reference to an interrupt, but it's not.

How does a PV domain deal with this? I would assume there Linux will
also end up with IRQ != GSI, and hence will need some kind of
translation?

Thanks, Roger.

2023-03-15 23:25:19

by Stefano Stabellini

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On Wed, 15 Mar 2023, Jan Beulich wrote:
> On 15.03.2023 01:52, Stefano Stabellini wrote:
> > On Mon, 13 Mar 2023, Jan Beulich wrote:
> >> On 12.03.2023 13:01, Huang Rui wrote:
> >>> Xen PVH is the paravirtualized mode and takes advantage of hardware
> >>> virtualization support when possible. It will using the hardware IOMMU
> >>> support instead of xen-swiotlb, so disable swiotlb if current domain is
> >>> Xen PVH.
> >>
> >> But the kernel has no way (yet) to drive the IOMMU, so how can it get
> >> away without resorting to swiotlb in certain cases (like I/O to an
> >> address-restricted device)?
> >
> > I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
> > need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
> > so we can use guest physical addresses instead of machine addresses for
> > DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
> > (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
> > case is XENFEAT_not_direct_mapped).
>
> But how does Xen using an IOMMU help with, as said, address-restricted
> devices? They may still need e.g. a 32-bit address to be programmed in,
> and if the kernel has memory beyond the 4G boundary not all I/O buffers
> may fulfill this requirement.

In short, it is going to work as long as Linux has guest physical
addresses (not machine addresses, those could be anything) lower than
4GB.

If the address-restricted device does DMA via an IOMMU, then the device
gets programmed by Linux using its guest physical addresses (not machine
addresses).

The 32-bit restriction would be applied by Linux to its choice of guest
physical address to use to program the device, the same way it does on
native. The device would be fine as it always uses Linux-provided <4GB
addresses. After the IOMMU translation (pagetable setup by Xen), we
could get any address, including >4GB addresses, and that is expected to
work.

2023-03-16 07:50:44

by Jan Beulich

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On 16.03.2023 00:25, Stefano Stabellini wrote:
> On Wed, 15 Mar 2023, Jan Beulich wrote:
>> On 15.03.2023 01:52, Stefano Stabellini wrote:
>>> On Mon, 13 Mar 2023, Jan Beulich wrote:
>>>> On 12.03.2023 13:01, Huang Rui wrote:
>>>>> Xen PVH is the paravirtualized mode and takes advantage of hardware
>>>>> virtualization support when possible. It will using the hardware IOMMU
>>>>> support instead of xen-swiotlb, so disable swiotlb if current domain is
>>>>> Xen PVH.
>>>>
>>>> But the kernel has no way (yet) to drive the IOMMU, so how can it get
>>>> away without resorting to swiotlb in certain cases (like I/O to an
>>>> address-restricted device)?
>>>
>>> I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
>>> need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
>>> so we can use guest physical addresses instead of machine addresses for
>>> DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
>>> (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
>>> case is XENFEAT_not_direct_mapped).
>>
>> But how does Xen using an IOMMU help with, as said, address-restricted
>> devices? They may still need e.g. a 32-bit address to be programmed in,
>> and if the kernel has memory beyond the 4G boundary not all I/O buffers
>> may fulfill this requirement.
>
> In short, it is going to work as long as Linux has guest physical
> addresses (not machine addresses, those could be anything) lower than
> 4GB.
>
> If the address-restricted device does DMA via an IOMMU, then the device
> gets programmed by Linux using its guest physical addresses (not machine
> addresses).
>
> The 32-bit restriction would be applied by Linux to its choice of guest
> physical address to use to program the device, the same way it does on
> native. The device would be fine as it always uses Linux-provided <4GB
> addresses. After the IOMMU translation (pagetable setup by Xen), we
> could get any address, including >4GB addresses, and that is expected to
> work.

I understand that's the "normal" way of working. But whatever the swiotlb
is used for in baremetal Linux, that would similarly require its use in
PVH (or HVM) aiui. So unconditionally disabling it in PVH would look to
me like an incomplete attempt to disable its use altogether on x86. What
difference of PVH vs baremetal am I missing here?

Jan

2023-03-16 13:45:35

by Alex Deucher

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On Thu, Mar 16, 2023 at 3:50 AM Jan Beulich <[email protected]> wrote:
>
> On 16.03.2023 00:25, Stefano Stabellini wrote:
> > On Wed, 15 Mar 2023, Jan Beulich wrote:
> >> On 15.03.2023 01:52, Stefano Stabellini wrote:
> >>> On Mon, 13 Mar 2023, Jan Beulich wrote:
> >>>> On 12.03.2023 13:01, Huang Rui wrote:
> >>>>> Xen PVH is the paravirtualized mode and takes advantage of hardware
> >>>>> virtualization support when possible. It will using the hardware IOMMU
> >>>>> support instead of xen-swiotlb, so disable swiotlb if current domain is
> >>>>> Xen PVH.
> >>>>
> >>>> But the kernel has no way (yet) to drive the IOMMU, so how can it get
> >>>> away without resorting to swiotlb in certain cases (like I/O to an
> >>>> address-restricted device)?
> >>>
> >>> I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
> >>> need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
> >>> so we can use guest physical addresses instead of machine addresses for
> >>> DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
> >>> (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
> >>> case is XENFEAT_not_direct_mapped).
> >>
> >> But how does Xen using an IOMMU help with, as said, address-restricted
> >> devices? They may still need e.g. a 32-bit address to be programmed in,
> >> and if the kernel has memory beyond the 4G boundary not all I/O buffers
> >> may fulfill this requirement.
> >
> > In short, it is going to work as long as Linux has guest physical
> > addresses (not machine addresses, those could be anything) lower than
> > 4GB.
> >
> > If the address-restricted device does DMA via an IOMMU, then the device
> > gets programmed by Linux using its guest physical addresses (not machine
> > addresses).
> >
> > The 32-bit restriction would be applied by Linux to its choice of guest
> > physical address to use to program the device, the same way it does on
> > native. The device would be fine as it always uses Linux-provided <4GB
> > addresses. After the IOMMU translation (pagetable setup by Xen), we
> > could get any address, including >4GB addresses, and that is expected to
> > work.
>
> I understand that's the "normal" way of working. But whatever the swiotlb
> is used for in baremetal Linux, that would similarly require its use in
> PVH (or HVM) aiui. So unconditionally disabling it in PVH would look to
> me like an incomplete attempt to disable its use altogether on x86. What
> difference of PVH vs baremetal am I missing here?

swiotlb is not usable for GPUs even on bare metal. They often have
hundreds of megs or even gigs of memory mapped on the device at any
given time. Also, AMD GPUs support 44-48 bit DMA masks (depending on
the chip family).

Alex

2023-03-16 13:48:58

by Jürgen Groß

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On 16.03.23 14:45, Alex Deucher wrote:
> On Thu, Mar 16, 2023 at 3:50 AM Jan Beulich <[email protected]> wrote:
>>
>> On 16.03.2023 00:25, Stefano Stabellini wrote:
>>> On Wed, 15 Mar 2023, Jan Beulich wrote:
>>>> On 15.03.2023 01:52, Stefano Stabellini wrote:
>>>>> On Mon, 13 Mar 2023, Jan Beulich wrote:
>>>>>> On 12.03.2023 13:01, Huang Rui wrote:
>>>>>>> Xen PVH is the paravirtualized mode and takes advantage of hardware
>>>>>>> virtualization support when possible. It will using the hardware IOMMU
>>>>>>> support instead of xen-swiotlb, so disable swiotlb if current domain is
>>>>>>> Xen PVH.
>>>>>>
>>>>>> But the kernel has no way (yet) to drive the IOMMU, so how can it get
>>>>>> away without resorting to swiotlb in certain cases (like I/O to an
>>>>>> address-restricted device)?
>>>>>
>>>>> I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
>>>>> need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
>>>>> so we can use guest physical addresses instead of machine addresses for
>>>>> DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
>>>>> (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
>>>>> case is XENFEAT_not_direct_mapped).
>>>>
>>>> But how does Xen using an IOMMU help with, as said, address-restricted
>>>> devices? They may still need e.g. a 32-bit address to be programmed in,
>>>> and if the kernel has memory beyond the 4G boundary not all I/O buffers
>>>> may fulfill this requirement.
>>>
>>> In short, it is going to work as long as Linux has guest physical
>>> addresses (not machine addresses, those could be anything) lower than
>>> 4GB.
>>>
>>> If the address-restricted device does DMA via an IOMMU, then the device
>>> gets programmed by Linux using its guest physical addresses (not machine
>>> addresses).
>>>
>>> The 32-bit restriction would be applied by Linux to its choice of guest
>>> physical address to use to program the device, the same way it does on
>>> native. The device would be fine as it always uses Linux-provided <4GB
>>> addresses. After the IOMMU translation (pagetable setup by Xen), we
>>> could get any address, including >4GB addresses, and that is expected to
>>> work.
>>
>> I understand that's the "normal" way of working. But whatever the swiotlb
>> is used for in baremetal Linux, that would similarly require its use in
>> PVH (or HVM) aiui. So unconditionally disabling it in PVH would look to
>> me like an incomplete attempt to disable its use altogether on x86. What
>> difference of PVH vs baremetal am I missing here?
>
> swiotlb is not usable for GPUs even on bare metal. They often have
> hundreds or megs or even gigs of memory mapped on the device at any
> given time. Also, AMD GPUs support 44-48 bit DMA masks (depending on
> the chip family).

But the swiotlb isn't per device, but system global.


Juergen



2023-03-16 13:53:31

by Alex Deucher

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On Thu, Mar 16, 2023 at 9:48 AM Juergen Gross <[email protected]> wrote:
>
> On 16.03.23 14:45, Alex Deucher wrote:
> > On Thu, Mar 16, 2023 at 3:50 AM Jan Beulich <[email protected]> wrote:
> >>
> >> On 16.03.2023 00:25, Stefano Stabellini wrote:
> >>> On Wed, 15 Mar 2023, Jan Beulich wrote:
> >>>> On 15.03.2023 01:52, Stefano Stabellini wrote:
> >>>>> On Mon, 13 Mar 2023, Jan Beulich wrote:
> >>>>>> On 12.03.2023 13:01, Huang Rui wrote:
> >>>>>>> Xen PVH is the paravirtualized mode and takes advantage of hardware
> >>>>>>> virtualization support when possible. It will using the hardware IOMMU
> >>>>>>> support instead of xen-swiotlb, so disable swiotlb if current domain is
> >>>>>>> Xen PVH.
> >>>>>>
> >>>>>> But the kernel has no way (yet) to drive the IOMMU, so how can it get
> >>>>>> away without resorting to swiotlb in certain cases (like I/O to an
> >>>>>> address-restricted device)?
> >>>>>
> >>>>> I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
> >>>>> need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
> >>>>> so we can use guest physical addresses instead of machine addresses for
> >>>>> DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
> >>>>> (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
> >>>>> case is XENFEAT_not_direct_mapped).
> >>>>
> >>>> But how does Xen using an IOMMU help with, as said, address-restricted
> >>>> devices? They may still need e.g. a 32-bit address to be programmed in,
> >>>> and if the kernel has memory beyond the 4G boundary not all I/O buffers
> >>>> may fulfill this requirement.
> >>>
> >>> In short, it is going to work as long as Linux has guest physical
> >>> addresses (not machine addresses, those could be anything) lower than
> >>> 4GB.
> >>>
> >>> If the address-restricted device does DMA via an IOMMU, then the device
> >>> gets programmed by Linux using its guest physical addresses (not machine
> >>> addresses).
> >>>
> >>> The 32-bit restriction would be applied by Linux to its choice of guest
> >>> physical address to use to program the device, the same way it does on
> >>> native. The device would be fine as it always uses Linux-provided <4GB
> >>> addresses. After the IOMMU translation (pagetable setup by Xen), we
> >>> could get any address, including >4GB addresses, and that is expected to
> >>> work.
> >>
> >> I understand that's the "normal" way of working. But whatever the swiotlb
> >> is used for in baremetal Linux, that would similarly require its use in
> >> PVH (or HVM) aiui. So unconditionally disabling it in PVH would look to
> >> me like an incomplete attempt to disable its use altogether on x86. What
> >> difference of PVH vs baremetal am I missing here?
> >
> > swiotlb is not usable for GPUs even on bare metal. They often have
> > hundreds or megs or even gigs of memory mapped on the device at any
> > given time. Also, AMD GPUs support 44-48 bit DMA masks (depending on
> > the chip family).
>
> But the swiotlb isn't per device, but system global.

Sure, but if the swiotlb is in use, then you can't really use the GPU.
So you get to pick one.

Alex

2023-03-16 13:59:04

by Jan Beulich

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On 16.03.2023 14:53, Alex Deucher wrote:
> On Thu, Mar 16, 2023 at 9:48 AM Juergen Gross <[email protected]> wrote:
>>
>> On 16.03.23 14:45, Alex Deucher wrote:
>>> On Thu, Mar 16, 2023 at 3:50 AM Jan Beulich <[email protected]> wrote:
>>>>
>>>> On 16.03.2023 00:25, Stefano Stabellini wrote:
>>>>> On Wed, 15 Mar 2023, Jan Beulich wrote:
>>>>>> On 15.03.2023 01:52, Stefano Stabellini wrote:
>>>>>>> On Mon, 13 Mar 2023, Jan Beulich wrote:
>>>>>>>> On 12.03.2023 13:01, Huang Rui wrote:
>>>>>>>>> Xen PVH is the paravirtualized mode and takes advantage of hardware
>>>>>>>>> virtualization support when possible. It will using the hardware IOMMU
>>>>>>>>> support instead of xen-swiotlb, so disable swiotlb if current domain is
>>>>>>>>> Xen PVH.
>>>>>>>>
>>>>>>>> But the kernel has no way (yet) to drive the IOMMU, so how can it get
>>>>>>>> away without resorting to swiotlb in certain cases (like I/O to an
>>>>>>>> address-restricted device)?
>>>>>>>
>>>>>>> I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
>>>>>>> need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
>>>>>>> so we can use guest physical addresses instead of machine addresses for
>>>>>>> DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
>>>>>>> (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
>>>>>>> case is XENFEAT_not_direct_mapped).
>>>>>>
>>>>>> But how does Xen using an IOMMU help with, as said, address-restricted
>>>>>> devices? They may still need e.g. a 32-bit address to be programmed in,
>>>>>> and if the kernel has memory beyond the 4G boundary not all I/O buffers
>>>>>> may fulfill this requirement.
>>>>>
>>>>> In short, it is going to work as long as Linux has guest physical
>>>>> addresses (not machine addresses, those could be anything) lower than
>>>>> 4GB.
>>>>>
>>>>> If the address-restricted device does DMA via an IOMMU, then the device
>>>>> gets programmed by Linux using its guest physical addresses (not machine
>>>>> addresses).
>>>>>
>>>>> The 32-bit restriction would be applied by Linux to its choice of guest
>>>>> physical address to use to program the device, the same way it does on
>>>>> native. The device would be fine as it always uses Linux-provided <4GB
>>>>> addresses. After the IOMMU translation (pagetable setup by Xen), we
>>>>> could get any address, including >4GB addresses, and that is expected to
>>>>> work.
>>>>
>>>> I understand that's the "normal" way of working. But whatever the swiotlb
>>>> is used for in baremetal Linux, that would similarly require its use in
>>>> PVH (or HVM) aiui. So unconditionally disabling it in PVH would look to
>>>> me like an incomplete attempt to disable its use altogether on x86. What
>>>> difference of PVH vs baremetal am I missing here?
>>>
>>> swiotlb is not usable for GPUs even on bare metal. They often have
>>> hundreds or megs or even gigs of memory mapped on the device at any
>>> given time. Also, AMD GPUs support 44-48 bit DMA masks (depending on
>>> the chip family).
>>
>> But the swiotlb isn't per device, but system global.
>
> Sure, but if the swiotlb is in use, then you can't really use the GPU.
> So you get to pick one.

Yet that "pick one" then can't be an unconditional disable in the source code.
If there's no way to avoid swiotlb on a per-device basis, then users will need
to be told to arrange for this via command line option when they want to use
the GPU in certain ways.

Jan

2023-03-16 14:21:29

by Jürgen Groß

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On 16.03.23 14:53, Alex Deucher wrote:
> On Thu, Mar 16, 2023 at 9:48 AM Juergen Gross <[email protected]> wrote:
>>
>> On 16.03.23 14:45, Alex Deucher wrote:
>>> On Thu, Mar 16, 2023 at 3:50 AM Jan Beulich <[email protected]> wrote:
>>>>
>>>> On 16.03.2023 00:25, Stefano Stabellini wrote:
>>>>> On Wed, 15 Mar 2023, Jan Beulich wrote:
>>>>>> On 15.03.2023 01:52, Stefano Stabellini wrote:
>>>>>>> On Mon, 13 Mar 2023, Jan Beulich wrote:
>>>>>>>> On 12.03.2023 13:01, Huang Rui wrote:
>>>>>>>>> Xen PVH is the paravirtualized mode and takes advantage of hardware
>>>>>>>>> virtualization support when possible. It will using the hardware IOMMU
>>>>>>>>> support instead of xen-swiotlb, so disable swiotlb if current domain is
>>>>>>>>> Xen PVH.
>>>>>>>>
>>>>>>>> But the kernel has no way (yet) to drive the IOMMU, so how can it get
>>>>>>>> away without resorting to swiotlb in certain cases (like I/O to an
>>>>>>>> address-restricted device)?
>>>>>>>
>>>>>>> I think Ray meant that, thanks to the IOMMU setup by Xen, there is no
>>>>>>> need for swiotlb-xen in Dom0. Address translations are done by the IOMMU
>>>>>>> so we can use guest physical addresses instead of machine addresses for
>>>>>>> DMA. This is a similar case to Dom0 on ARM when the IOMMU is available
>>>>>>> (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the corresponding
>>>>>>> case is XENFEAT_not_direct_mapped).
>>>>>>
>>>>>> But how does Xen using an IOMMU help with, as said, address-restricted
>>>>>> devices? They may still need e.g. a 32-bit address to be programmed in,
>>>>>> and if the kernel has memory beyond the 4G boundary not all I/O buffers
>>>>>> may fulfill this requirement.
>>>>>
>>>>> In short, it is going to work as long as Linux has guest physical
>>>>> addresses (not machine addresses, those could be anything) lower than
>>>>> 4GB.
>>>>>
>>>>> If the address-restricted device does DMA via an IOMMU, then the device
>>>>> gets programmed by Linux using its guest physical addresses (not machine
>>>>> addresses).
>>>>>
>>>>> The 32-bit restriction would be applied by Linux to its choice of guest
>>>>> physical address to use to program the device, the same way it does on
>>>>> native. The device would be fine as it always uses Linux-provided <4GB
>>>>> addresses. After the IOMMU translation (pagetable setup by Xen), we
>>>>> could get any address, including >4GB addresses, and that is expected to
>>>>> work.
>>>>
>>>> I understand that's the "normal" way of working. But whatever the swiotlb
>>>> is used for in baremetal Linux, that would similarly require its use in
>>>> PVH (or HVM) aiui. So unconditionally disabling it in PVH would look to
>>>> me like an incomplete attempt to disable its use altogether on x86. What
>>>> difference of PVH vs baremetal am I missing here?
>>>
>>> swiotlb is not usable for GPUs even on bare metal. They often have
>>> hundreds or megs or even gigs of memory mapped on the device at any
>>> given time. Also, AMD GPUs support 44-48 bit DMA masks (depending on
>>> the chip family).
>>
>> But the swiotlb isn't per device, but system global.
>
> Sure, but if the swiotlb is in use, then you can't really use the GPU.
> So you get to pick one.

The swiotlb is used only for buffers which are not within the DMA mask of a
device (see dma_direct_map_page()). So an AMD GPU supporting a 44 bit DMA mask
won't use the swiotlb unless you have a buffer above guest physical address of
16TB (so basically never).

Disabling swiotlb in such a guest would OTOH mean, that a device with only
32 bit DMA mask passed through to this guest couldn't work with buffers
above 4GB.

I don't think this is acceptable.


Juergen



2023-03-16 16:29:18

by Roger Pau Monne

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On Sun, Mar 12, 2023 at 08:01:53PM +0800, Huang Rui wrote:
> Xen PVH is the paravirtualized mode and takes advantage of hardware
> virtualization support when possible. It will using the hardware IOMMU
> support instead of xen-swiotlb, so disable swiotlb if current domain is
> Xen PVH.
>
> Signed-off-by: Huang Rui <[email protected]>
> ---
> arch/x86/kernel/pci-dma.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index 30bbe4abb5d6..f5c73dd18f2a 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -74,6 +74,12 @@ static inline void __init pci_swiotlb_detect(void)
> #ifdef CONFIG_SWIOTLB_XEN
> static void __init pci_xen_swiotlb_init(void)
> {
> + /* Xen PVH domain won't use swiotlb */
> + if (xen_pvh_domain()) {
> + x86_swiotlb_enable = false;
> + return;
> + }

I'm very confused by this: pci_xen_swiotlb_init() is only called for
PV domains, see the only caller in pci_iommu_alloc(). So this is just
dead code.

> +
> if (!xen_initial_domain() && !x86_swiotlb_enable)
> return;
> x86_swiotlb_enable = true;
> @@ -86,7 +92,7 @@ static void __init pci_xen_swiotlb_init(void)
>
> int pci_xen_swiotlb_init_late(void)
> {
> - if (dma_ops == &xen_swiotlb_dma_ops)
> + if (xen_pvh_domain() || dma_ops == &xen_swiotlb_dma_ops)

Same here, this function is only called by
pcifront_connect_and_init_dma() and pcifront should never attach on a
PVH domain, hence it's also dead code.

Thanks, Roger.

2023-03-16 23:09:58

by Stefano Stabellini

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On Thu, 16 Mar 2023, Juergen Gross wrote:
> On 16.03.23 14:53, Alex Deucher wrote:
> > On Thu, Mar 16, 2023 at 9:48 AM Juergen Gross <[email protected]> wrote:
> > >
> > > On 16.03.23 14:45, Alex Deucher wrote:
> > > > On Thu, Mar 16, 2023 at 3:50 AM Jan Beulich <[email protected]> wrote:
> > > > >
> > > > > On 16.03.2023 00:25, Stefano Stabellini wrote:
> > > > > > On Wed, 15 Mar 2023, Jan Beulich wrote:
> > > > > > > On 15.03.2023 01:52, Stefano Stabellini wrote:
> > > > > > > > On Mon, 13 Mar 2023, Jan Beulich wrote:
> > > > > > > > > On 12.03.2023 13:01, Huang Rui wrote:
> > > > > > > > > > Xen PVH is the paravirtualized mode and takes advantage of
> > > > > > > > > > hardware
> > > > > > > > > > virtualization support when possible. It will using the
> > > > > > > > > > hardware IOMMU
> > > > > > > > > > support instead of xen-swiotlb, so disable swiotlb if
> > > > > > > > > > current domain is
> > > > > > > > > > Xen PVH.
> > > > > > > > >
> > > > > > > > > But the kernel has no way (yet) to drive the IOMMU, so how can
> > > > > > > > > it get
> > > > > > > > > away without resorting to swiotlb in certain cases (like I/O
> > > > > > > > > to an
> > > > > > > > > address-restricted device)?
> > > > > > > >
> > > > > > > > I think Ray meant that, thanks to the IOMMU setup by Xen, there
> > > > > > > > is no
> > > > > > > > need for swiotlb-xen in Dom0. Address translations are done by
> > > > > > > > the IOMMU
> > > > > > > > so we can use guest physical addresses instead of machine
> > > > > > > > addresses for
> > > > > > > > DMA. This is a similar case to Dom0 on ARM when the IOMMU is
> > > > > > > > available
> > > > > > > > (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the
> > > > > > > > corresponding
> > > > > > > > case is XENFEAT_not_direct_mapped).
> > > > > > >
> > > > > > > But how does Xen using an IOMMU help with, as said,
> > > > > > > address-restricted
> > > > > > > devices? They may still need e.g. a 32-bit address to be
> > > > > > > programmed in,
> > > > > > > and if the kernel has memory beyond the 4G boundary not all I/O
> > > > > > > buffers
> > > > > > > may fulfill this requirement.
> > > > > >
> > > > > > In short, it is going to work as long as Linux has guest physical
> > > > > > addresses (not machine addresses, those could be anything) lower
> > > > > > than
> > > > > > 4GB.
> > > > > >
> > > > > > If the address-restricted device does DMA via an IOMMU, then the
> > > > > > device
> > > > > > gets programmed by Linux using its guest physical addresses (not
> > > > > > machine
> > > > > > addresses).
> > > > > >
> > > > > > The 32-bit restriction would be applied by Linux to its choice of
> > > > > > guest
> > > > > > physical address to use to program the device, the same way it does
> > > > > > on
> > > > > > native. The device would be fine as it always uses Linux-provided
> > > > > > <4GB
> > > > > > addresses. After the IOMMU translation (pagetable setup by Xen), we
> > > > > > could get any address, including >4GB addresses, and that is
> > > > > > expected to
> > > > > > work.
> > > > >
> > > > > I understand that's the "normal" way of working. But whatever the
> > > > > swiotlb
> > > > > is used for in baremetal Linux, that would similarly require its use
> > > > > in
> > > > > PVH (or HVM) aiui. So unconditionally disabling it in PVH would look
> > > > > to
> > > > > me like an incomplete attempt to disable its use altogether on x86.
> > > > > What
> > > > > difference of PVH vs baremetal am I missing here?
> > > >
> > > > swiotlb is not usable for GPUs even on bare metal. They often have
> > > > hundreds or megs or even gigs of memory mapped on the device at any
> > > > given time. Also, AMD GPUs support 44-48 bit DMA masks (depending on
> > > > the chip family).
> > >
> > > But the swiotlb isn't per device, but system global.
> >
> > Sure, but if the swiotlb is in use, then you can't really use the GPU.
> > So you get to pick one.
>
> The swiotlb is used only for buffers which are not within the DMA mask of a
> device (see dma_direct_map_page()). So an AMD GPU supporting a 44 bit DMA mask
> won't use the swiotlb unless you have a buffer above guest physical address of
> 16TB (so basically never).
>
> Disabling swiotlb in such a guest would OTOH mean, that a device with only
> 32 bit DMA mask passed through to this guest couldn't work with buffers
> above 4GB.
>
> I don't think this is acceptable.

From the Xen subsystem in Linux point of view, the only thing we need to
do is to make sure *not* to enable swiotlb_xen (yes "swiotlb_xen", not
the global swiotlb) on PVH because it is not needed anyway.

I think we should leave the global "swiotlb" setting alone. The global
swiotlb is not relevant to Xen anyway, and surely baremetal Linux has to
have a way to deal with swiotlb/GPU incompatibilities.

We just have to avoid making things worse on Xen, and for that we just
need to avoid unconditionally enabling swiotlb-xen. If the Xen subsystem
doesn't enable swiotlb_xen/swiotlb, and no other subsystem enables
swiotlb, then we have a good Linux configuration capable of handling the
GPU properly.

Alex, please correct me if I am wrong. How is x86_swiotlb_enable set to
false on native (non-Xen) x86?

2023-03-17 10:19:21

by Roger Pau Monne

Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On Thu, Mar 16, 2023 at 04:09:44PM -0700, Stefano Stabellini wrote:
> On Thu, 16 Mar 2023, Juergen Gross wrote:
> > On 16.03.23 14:53, Alex Deucher wrote:
> > > On Thu, Mar 16, 2023 at 9:48 AM Juergen Gross <[email protected]> wrote:
> > > >
> > > > On 16.03.23 14:45, Alex Deucher wrote:
> > > > > On Thu, Mar 16, 2023 at 3:50 AM Jan Beulich <[email protected]> wrote:
> > > > > >
> > > > > > On 16.03.2023 00:25, Stefano Stabellini wrote:
> > > > > > > On Wed, 15 Mar 2023, Jan Beulich wrote:
> > > > > > > > On 15.03.2023 01:52, Stefano Stabellini wrote:
> > > > > > > > > On Mon, 13 Mar 2023, Jan Beulich wrote:
> > > > > > > > > > On 12.03.2023 13:01, Huang Rui wrote:
> > > > > > > > > > > Xen PVH is the paravirtualized mode and takes advantage of
> > > > > > > > > > > hardware
> > > > > > > > > > > virtualization support when possible. It will using the
> > > > > > > > > > > hardware IOMMU
> > > > > > > > > > > support instead of xen-swiotlb, so disable swiotlb if
> > > > > > > > > > > current domain is
> > > > > > > > > > > Xen PVH.
> > > > > > > > > >
> > > > > > > > > > But the kernel has no way (yet) to drive the IOMMU, so how can
> > > > > > > > > > it get
> > > > > > > > > > away without resorting to swiotlb in certain cases (like I/O
> > > > > > > > > > to an
> > > > > > > > > > address-restricted device)?
> > > > > > > > >
> > > > > > > > > I think Ray meant that, thanks to the IOMMU setup by Xen, there
> > > > > > > > > is no
> > > > > > > > > need for swiotlb-xen in Dom0. Address translations are done by
> > > > > > > > > the IOMMU
> > > > > > > > > so we can use guest physical addresses instead of machine
> > > > > > > > > addresses for
> > > > > > > > > DMA. This is a similar case to Dom0 on ARM when the IOMMU is
> > > > > > > > > available
> > > > > > > > > (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the
> > > > > > > > > corresponding
> > > > > > > > > case is XENFEAT_not_direct_mapped).
> > > > > > > >
> > > > > > > > But how does Xen using an IOMMU help with, as said,
> > > > > > > > address-restricted
> > > > > > > > devices? They may still need e.g. a 32-bit address to be
> > > > > > > > programmed in,
> > > > > > > > and if the kernel has memory beyond the 4G boundary not all I/O
> > > > > > > > buffers
> > > > > > > > may fulfill this requirement.
> > > > > > >
> > > > > > > In short, it is going to work as long as Linux has guest physical
> > > > > > > addresses (not machine addresses, those could be anything) lower
> > > > > > > than
> > > > > > > 4GB.
> > > > > > >
> > > > > > > If the address-restricted device does DMA via an IOMMU, then the
> > > > > > > device
> > > > > > > gets programmed by Linux using its guest physical addresses (not
> > > > > > > machine
> > > > > > > addresses).
> > > > > > >
> > > > > > > The 32-bit restriction would be applied by Linux to its choice of
> > > > > > > guest
> > > > > > > physical address to use to program the device, the same way it does
> > > > > > > on
> > > > > > > native. The device would be fine as it always uses Linux-provided
> > > > > > > <4GB
> > > > > > > addresses. After the IOMMU translation (pagetable setup by Xen), we
> > > > > > > could get any address, including >4GB addresses, and that is
> > > > > > > expected to
> > > > > > > work.
> > > > > >
> > > > > > I understand that's the "normal" way of working. But whatever the
> > > > > > swiotlb
> > > > > > is used for in baremetal Linux, that would similarly require its use
> > > > > > in
> > > > > > PVH (or HVM) aiui. So unconditionally disabling it in PVH would look
> > > > > > to
> > > > > > me like an incomplete attempt to disable its use altogether on x86.
> > > > > > What
> > > > > > difference of PVH vs baremetal am I missing here?
> > > > >
> > > > > swiotlb is not usable for GPUs even on bare metal. They often have
> > > > > hundreds of megs or even gigs of memory mapped on the device at any
> > > > > given time. Also, AMD GPUs support 44-48 bit DMA masks (depending on
> > > > > the chip family).
> > > >
> > > > But the swiotlb isn't per device, but system global.
> > >
> > > Sure, but if the swiotlb is in use, then you can't really use the GPU.
> > > So you get to pick one.
> >
> > The swiotlb is used only for buffers which are not within the DMA mask of a
> > device (see dma_direct_map_page()). So an AMD GPU supporting a 44 bit DMA mask
> > won't use the swiotlb unless you have a buffer above guest physical address of
> > 16TB (so basically never).
> >
> > Disabling swiotlb in such a guest would OTOH mean, that a device with only
> > 32 bit DMA mask passed through to this guest couldn't work with buffers
> > above 4GB.
> >
> > I don't think this is acceptable.
>
> From the Xen subsystem in Linux point of view, the only thing we need to
> do is to make sure *not* to enable swiotlb_xen (yes "swiotlb_xen", not
> the global swiotlb) on PVH because it is not needed anyway.

But this is already the case on PVH: swiotlb_xen won't be enabled.
swiotlb_xen is only enabled for PV domains; other domain types never
enable it on x86.

> I think we should leave the global "swiotlb" setting alone. The global
> swiotlb is not relevant to Xen anyway, and surely baremetal Linux has to
> have a way to deal with swiotlb/GPU incompatibilities.
>
> We just have to avoid making things worse on Xen, and for that we just
> need to avoid unconditionally enabling swiotlb-xen. If the Xen subsystem
> doesn't enable swiotlb_xen/swiotlb, and no other subsystem enables
> swiotlb, then we have a good Linux configuration capable of handling the
> GPU properly.

Given that this patch is basically a non-functional change (because
the modified functions are only called for PV domains), I think we all
agree that swiotlb_xen should never be used on PVH, and that native
swiotlb might be required depending on the DMA address restrictions of
the devices on the system. So no change is required.
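
To make the gating concrete, a simplified standalone sketch follows. The
predicate and strings are illustrative only, not the exact code in
arch/x86/kernel/pci-dma.c: on x86 the xen-swiotlb setup path is taken only
for PV domains, so a PVH dom0 falls through to the native DMA setup, where
swiotlb stays available for address-restricted devices.

#include <stdbool.h>
#include <stdio.h>

static bool xen_pv_domain = false;   /* PVH dom0: not a PV domain */

static const char *pick_dma_setup(void)
{
        if (xen_pv_domain)
                return "xen-swiotlb";                    /* PV only */
        /* PVH/HVM and bare metal: direct/IOMMU DMA, with the native
         * swiotlb kept around in case a device with a narrow DMA mask
         * needs bounce buffers. */
        return "native DMA, swiotlb as fallback";
}

int main(void)
{
        printf("DMA setup: %s\n", pick_dma_setup());
        return 0;
}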

Thanks, Roger.

2023-03-17 14:46:19

by Alex Deucher

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

On Thu, Mar 16, 2023 at 7:09 PM Stefano Stabellini
<[email protected]> wrote:
>
> On Thu, 16 Mar 2023, Juergen Gross wrote:
> > On 16.03.23 14:53, Alex Deucher wrote:
> > > On Thu, Mar 16, 2023 at 9:48 AM Juergen Gross <[email protected]> wrote:
> > > >
> > > > On 16.03.23 14:45, Alex Deucher wrote:
> > > > > On Thu, Mar 16, 2023 at 3:50 AM Jan Beulich <[email protected]> wrote:
> > > > > >
> > > > > > On 16.03.2023 00:25, Stefano Stabellini wrote:
> > > > > > > On Wed, 15 Mar 2023, Jan Beulich wrote:
> > > > > > > > On 15.03.2023 01:52, Stefano Stabellini wrote:
> > > > > > > > > On Mon, 13 Mar 2023, Jan Beulich wrote:
> > > > > > > > > > On 12.03.2023 13:01, Huang Rui wrote:
> > > > > > > > > > > Xen PVH is the paravirtualized mode and takes advantage of
> > > > > > > > > > > hardware
> > > > > > > > > > > virtualization support when possible. It will using the
> > > > > > > > > > > hardware IOMMU
> > > > > > > > > > > support instead of xen-swiotlb, so disable swiotlb if
> > > > > > > > > > > current domain is
> > > > > > > > > > > Xen PVH.
> > > > > > > > > >
> > > > > > > > > > But the kernel has no way (yet) to drive the IOMMU, so how can
> > > > > > > > > > it get
> > > > > > > > > > away without resorting to swiotlb in certain cases (like I/O
> > > > > > > > > > to an
> > > > > > > > > > address-restricted device)?
> > > > > > > > >
> > > > > > > > > I think Ray meant that, thanks to the IOMMU setup by Xen, there
> > > > > > > > > is no
> > > > > > > > > need for swiotlb-xen in Dom0. Address translations are done by
> > > > > > > > > the IOMMU
> > > > > > > > > so we can use guest physical addresses instead of machine
> > > > > > > > > addresses for
> > > > > > > > > DMA. This is a similar case to Dom0 on ARM when the IOMMU is
> > > > > > > > > available
> > > > > > > > > (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the
> > > > > > > > > corresponding
> > > > > > > > > case is XENFEAT_not_direct_mapped).
> > > > > > > >
> > > > > > > > But how does Xen using an IOMMU help with, as said,
> > > > > > > > address-restricted
> > > > > > > > devices? They may still need e.g. a 32-bit address to be
> > > > > > > > programmed in,
> > > > > > > > and if the kernel has memory beyond the 4G boundary not all I/O
> > > > > > > > buffers
> > > > > > > > may fulfill this requirement.
> > > > > > >
> > > > > > > In short, it is going to work as long as Linux has guest physical
> > > > > > > addresses (not machine addresses, those could be anything) lower
> > > > > > > than
> > > > > > > 4GB.
> > > > > > >
> > > > > > > If the address-restricted device does DMA via an IOMMU, then the
> > > > > > > device
> > > > > > > gets programmed by Linux using its guest physical addresses (not
> > > > > > > machine
> > > > > > > addresses).
> > > > > > >
> > > > > > > The 32-bit restriction would be applied by Linux to its choice of
> > > > > > > guest
> > > > > > > physical address to use to program the device, the same way it does
> > > > > > > on
> > > > > > > native. The device would be fine as it always uses Linux-provided
> > > > > > > <4GB
> > > > > > > addresses. After the IOMMU translation (pagetable setup by Xen), we
> > > > > > > could get any address, including >4GB addresses, and that is
> > > > > > > expected to
> > > > > > > work.
> > > > > >
> > > > > > I understand that's the "normal" way of working. But whatever the
> > > > > > swiotlb
> > > > > > is used for in baremetal Linux, that would similarly require its use
> > > > > > in
> > > > > > PVH (or HVM) aiui. So unconditionally disabling it in PVH would look
> > > > > > to
> > > > > > me like an incomplete attempt to disable its use altogether on x86.
> > > > > > What
> > > > > > difference of PVH vs baremetal am I missing here?
> > > > >
> > > > > swiotlb is not usable for GPUs even on bare metal. They often have
> > > > > hundreds of megs or even gigs of memory mapped on the device at any
> > > > > given time. Also, AMD GPUs support 44-48 bit DMA masks (depending on
> > > > > the chip family).
> > > >
> > > > But the swiotlb isn't per device, but system global.
> > >
> > > Sure, but if the swiotlb is in use, then you can't really use the GPU.
> > > So you get to pick one.
> >
> > The swiotlb is used only for buffers which are not within the DMA mask of a
> > device (see dma_direct_map_page()). So an AMD GPU supporting a 44 bit DMA mask
> > won't use the swiotlb unless you have a buffer above guest physical address of
> > 16TB (so basically never).
> >
> > Disabling swiotlb in such a guest would OTOH mean, that a device with only
> > 32 bit DMA mask passed through to this guest couldn't work with buffers
> > above 4GB.
> >
> > I don't think this is acceptable.
>
> From the Xen subsystem in Linux point of view, the only thing we need to
> do is to make sure *not* to enable swiotlb_xen (yes "swiotlb_xen", not
> the global swiotlb) on PVH because it is not needed anyway.
>
> I think we should leave the global "swiotlb" setting alone. The global
> swiotlb is not relevant to Xen anyway, and surely baremetal Linux has to
> have a way to deal with swiotlb/GPU incompatibilities.
>
> We just have to avoid making things worse on Xen, and for that we just
> need to avoid unconditionally enabling swiotlb-xen. If the Xen subsystem
> doesn't enable swiotlb_xen/swiotlb, and no other subsystem enables
> swiotlb, then we have a good Linux configuration capable of handling the
> GPU properly.
>
> Alex, please correct me if I am wrong. How is x86_swiotlb_enable set to
> false on native (non-Xen) x86?

In most cases we have an IOMMU enabled and IIRC, TTM has slightly
different behavior for memory allocation depending on whether swiotlb
would be needed or not.

Alex
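
As an illustration of what such a driver-side check could look like, here is
a hypothetical sketch; the names are not the real amdgpu/TTM interfaces. The
idea is that the driver determines whether bounce buffering could ever occur
for its DMA mask and adjusts its allocation policy accordingly.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct gpu_device {
        uint64_t dma_mask;        /* e.g. 44-bit for recent AMD GPUs     */
        uint64_t max_ram_addr;    /* highest RAM address in the machine  */
        bool     need_swiotlb;    /* drives the allocation policy        */
};

static void gpu_pick_alloc_policy(struct gpu_device *gpu)
{
        /* Bouncing is only possible if RAM exists above the DMA mask. */
        gpu->need_swiotlb = gpu->max_ram_addr > gpu->dma_mask;
}

int main(void)
{
        struct gpu_device gpu = {
                .dma_mask     = (1ULL << 44) - 1,   /* 16TB        */
                .max_ram_addr = 64ULL << 30,        /* 64GB of RAM */
        };

        gpu_pick_alloc_policy(&gpu);
        printf("restrict allocations to the DMA mask: %s\n",
               gpu.need_swiotlb ? "yes" : "no");
        return 0;
}

With these sample values the answer is "no", which is the common case Alex
describes: with an IOMMU and a wide DMA mask, swiotlb is simply never needed.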

2023-03-21 18:55:36

by Christian König

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86/xen: disable swiotlb for xen pvh

Am 17.03.23 um 15:45 schrieb Alex Deucher:
> On Thu, Mar 16, 2023 at 7:09 PM Stefano Stabellini
> <[email protected]> wrote:
>> On Thu, 16 Mar 2023, Juergen Gross wrote:
>>> On 16.03.23 14:53, Alex Deucher wrote:
>>>> On Thu, Mar 16, 2023 at 9:48 AM Juergen Gross <[email protected]> wrote:
>>>>> On 16.03.23 14:45, Alex Deucher wrote:
>>>>>> On Thu, Mar 16, 2023 at 3:50 AM Jan Beulich <[email protected]> wrote:
>>>>>>> On 16.03.2023 00:25, Stefano Stabellini wrote:
>>>>>>>> On Wed, 15 Mar 2023, Jan Beulich wrote:
>>>>>>>>> On 15.03.2023 01:52, Stefano Stabellini wrote:
>>>>>>>>>> On Mon, 13 Mar 2023, Jan Beulich wrote:
>>>>>>>>>>> On 12.03.2023 13:01, Huang Rui wrote:
>>>>>>>>>>>> Xen PVH is the paravirtualized mode and takes advantage of
>>>>>>>>>>>> hardware
>>>>>>>>>>>> virtualization support when possible. It will using the
>>>>>>>>>>>> hardware IOMMU
>>>>>>>>>>>> support instead of xen-swiotlb, so disable swiotlb if
>>>>>>>>>>>> current domain is
>>>>>>>>>>>> Xen PVH.
>>>>>>>>>>> But the kernel has no way (yet) to drive the IOMMU, so how can
>>>>>>>>>>> it get
>>>>>>>>>>> away without resorting to swiotlb in certain cases (like I/O
>>>>>>>>>>> to an
>>>>>>>>>>> address-restricted device)?
>>>>>>>>>> I think Ray meant that, thanks to the IOMMU setup by Xen, there
>>>>>>>>>> is no
>>>>>>>>>> need for swiotlb-xen in Dom0. Address translations are done by
>>>>>>>>>> the IOMMU
>>>>>>>>>> so we can use guest physical addresses instead of machine
>>>>>>>>>> addresses for
>>>>>>>>>> DMA. This is a similar case to Dom0 on ARM when the IOMMU is
>>>>>>>>>> available
>>>>>>>>>> (see include/xen/arm/swiotlb-xen.h:xen_swiotlb_detect, the
>>>>>>>>>> corresponding
>>>>>>>>>> case is XENFEAT_not_direct_mapped).
>>>>>>>>> But how does Xen using an IOMMU help with, as said,
>>>>>>>>> address-restricted
>>>>>>>>> devices? They may still need e.g. a 32-bit address to be
>>>>>>>>> programmed in,
>>>>>>>>> and if the kernel has memory beyond the 4G boundary not all I/O
>>>>>>>>> buffers
>>>>>>>>> may fulfill this requirement.
>>>>>>>> In short, it is going to work as long as Linux has guest physical
>>>>>>>> addresses (not machine addresses, those could be anything) lower
>>>>>>>> than
>>>>>>>> 4GB.
>>>>>>>>
>>>>>>>> If the address-restricted device does DMA via an IOMMU, then the
>>>>>>>> device
>>>>>>>> gets programmed by Linux using its guest physical addresses (not
>>>>>>>> machine
>>>>>>>> addresses).
>>>>>>>>
>>>>>>>> The 32-bit restriction would be applied by Linux to its choice of
>>>>>>>> guest
>>>>>>>> physical address to use to program the device, the same way it does
>>>>>>>> on
>>>>>>>> native. The device would be fine as it always uses Linux-provided
>>>>>>>> <4GB
>>>>>>>> addresses. After the IOMMU translation (pagetable setup by Xen), we
>>>>>>>> could get any address, including >4GB addresses, and that is
>>>>>>>> expected to
>>>>>>>> work.
>>>>>>> I understand that's the "normal" way of working. But whatever the
>>>>>>> swiotlb
>>>>>>> is used for in baremetal Linux, that would similarly require its use
>>>>>>> in
>>>>>>> PVH (or HVM) aiui. So unconditionally disabling it in PVH would look
>>>>>>> to
>>>>>>> me like an incomplete attempt to disable its use altogether on x86.
>>>>>>> What
>>>>>>> difference of PVH vs baremetal am I missing here?
>>>>>> swiotlb is not usable for GPUs even on bare metal. They often have
>>>>>> hundreds of megs or even gigs of memory mapped on the device at any
>>>>>> given time. Also, AMD GPUs support 44-48 bit DMA masks (depending on
>>>>>> the chip family).
>>>>> But the swiotlb isn't per device, but system global.
>>>> Sure, but if the swiotlb is in use, then you can't really use the GPU.
>>>> So you get to pick one.
>>> The swiotlb is used only for buffers which are not within the DMA mask of a
>>> device (see dma_direct_map_page()). So an AMD GPU supporting a 44 bit DMA mask
>>> won't use the swiotlb unless you have a buffer above guest physical address of
>>> 16TB (so basically never).
>>>
>>> Disabling swiotlb in such a guest would OTOH mean, that a device with only
>>> 32 bit DMA mask passed through to this guest couldn't work with buffers
>>> above 4GB.
>>>
>>> I don't think this is acceptable.
>> From the Xen subsystem in Linux point of view, the only thing we need to
>> do is to make sure *not* to enable swiotlb_xen (yes "swiotlb_xen", not
>> the global swiotlb) on PVH because it is not needed anyway.
>>
>> I think we should leave the global "swiotlb" setting alone. The global
>> swiotlb is not relevant to Xen anyway, and surely baremetal Linux has to
>> have a way to deal with swiotlb/GPU incompatibilities.
>>
>> We just have to avoid making things worse on Xen, and for that we just
>> need to avoid unconditionally enabling swiotlb-xen. If the Xen subsystem
>> doesn't enable swiotlb_xen/swiotlb, and no other subsystem enables
>> swiotlb, then we have a good Linux configuration capable of handling the
>> GPU properly.
>>
>> Alex, please correct me if I am wrong. How is x86_swiotlb_enable set to
>> false on native (non-Xen) x86?
> In most cases we have an IOMMU enabled and IIRC, TTM has slightly
> different behavior for memory allocation depending on whether swiotlb
> would be needed or not.

Well "slightly different" is an understatement. We need to disable quite
a bunch of features to make swiotlb work with GPUs.

Especially userptr and inter device sharing won't work any more.

Regards,
Christian.

>
> Alex