2023-11-24 10:33:39

by Jiqian Chen

[permalink] [raw]
Subject: [RFC KERNEL PATCH v2 0/3] Support device passthrough when dom0 is PVH on Xen

Hi All,

This series of patches are the v2 of the implementation of passthrough when dom0 is PVH on Xen.
We sent the v1 to upstream before, but the v1 had so many problems and we got lots of suggestions.
I will introduce all issues that these patches try to fix and the differences between v1 and v2.

Issues we encountered:
1. pci_stub failed to write bar for a passthrough device.
Problem: when we run “sudo xl pci-assignable-add <sbdf>” to assign a device, pci_stub will
call “pcistub_init_device() -> pci_restore_state() -> pci_restore_config_space() ->
pci_restore_config_space_range() -> pci_restore_config_dword() -> pci_write_config_dword()”,
the pci config write will trigger an io interrupt to bar_write() in the xen, but the
bar->enabled was set before, the write is not allowed now, and then when Qemu config the
passthrough device in xen_pt_realize(), it gets invalid bar values.

Reason: the reason is that we don't tell vPCI that the device has been reset, so the current
cached state in pdev->vpci is all out of date and is different from the real device state.

Solution: to solve this problem, the first patch of kernel(xen/pci: Add xen_reset_device_state
function) and the fist patch of xen(xen/vpci: Clear all vpci status of device) add a new
hypercall to reset the state stored in vPCI when the state of real device has changed.
Thank Roger for the suggestion of this v2, and it is different from v1
(https://lore.kernel.org/xen-devel/[email protected]/), v1 simply allow
domU to write pci bar, it does not comply with the design principles of vPCI.

2. failed to do PHYSDEVOP_map_pirq when dom0 is PVH
Problem: HVM domU will do PHYSDEVOP_map_pirq for a passthrough device by using gsi. See
xen_pt_realize->xc_physdev_map_pirq and pci_add_dm_done->xc_physdev_map_pirq. Then
xc_physdev_map_pirq will call into Xen, but in hvm_physdev_op(), PHYSDEVOP_map_pirq is not allowed.

Reason: In hvm_physdev_op(), the variable "currd" is PVH dom0 and PVH has no X86_EMU_USE_PIRQ flag,
it will fail at has_pirq check.

Solution: I think we may need to allow PHYSDEVOP_map_pirq when "currd" is dom0 (at present dom0 is
PVH). The second patch of xen(x86/pvh: Open PHYSDEVOP_map_pirq for PVH dom0) allow PVH dom0 do
PHYSDEVOP_map_pirq. This v2 patch is better than v1, v1 simply remove the has_pirq check(xen
https://lore.kernel.org/xen-devel/[email protected]/).

3. the gsi of a passthrough device doesn't be unmasked
3.1 failed to check the permission of pirq
3.2 the gsi of passthrough device was not registered in PVH dom0

Problem:
3.1 callback function pci_add_dm_done() will be called when qemu config a passthrough device for domU.
This function will call xc_domain_irq_permission()-> pirq_access_permitted() to check if the gsi has
corresponding mappings in dom0. But it didn’t, so failed. See
XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH dom0 and it return irq is 0.
3.2 it's possible for a gsi (iow: vIO-APIC pin) to never get registered on PVH dom0, because the
devices of PVH are using MSI(-X) interrupts. However, the IO-APIC pin must be configured for it to be
able to be mapped into a domU.

Reason: After searching codes, I find "map_pirq" and "register_gsi" will be done in function
vioapic_write_redirent->vioapic_hwdom_map_gsi when the gsi(aka ioapic's pin) is unmasked in PVH dom0.
So the two problems can be concluded to that the gsi of a passthrough device doesn't be unmasked.

Solution: to solve these problems, the second patch of kernel(xen/pvh: Unmask irq for passthrough
device in PVH dom0) call the unmask_irq() when we assign a device to be passthrough. So that
passthrough devices can have the mapping of gsi on PVH dom0 and gsi can be registered. This v2 patch
is different from the v1(
kernel https://lore.kernel.org/xen-devel/[email protected]/,
kernel https://lore.kernel.org/xen-devel/[email protected]/ and
xen https://lore.kernel.org/xen-devel/[email protected]/),
v1 performed "map_pirq" and "register_gsi" on all pci devices on PVH dom0, which is unnecessary and
may cause multiple registration.

4. failed to map pirq for gsi
Problem: qemu will call xc_physdev_map_pirq() to map a passthrough device’s gsi to pirq in function
xen_pt_realize(). But failed.

Reason: According to the implement of xc_physdev_map_pirq(), it needs gsi instead of irq, but qemu
pass irq to it and treat irq as gsi, it is got from file /sys/bus/pci/devices/xxxx:xx:xx.x/irq in
function xen_host_pci_device_get(). But actually the gsi number is not equal with irq. On PVH dom0,
when it allocates irq for a gsi in function acpi_register_gsi_ioapic(), allocation is dynamic, and
follow the principle of applying first, distributing first. And if you debug the kernel codes(see
function __irq_alloc_descs), you will find the irq number is allocated from small to large by order,
but the applying gsi number is not, gsi 38 may come before gsi 28, that causes gsi 38 get a smaller
irq number than gsi 28, and then gsi != irq.

Solution: we can record the relation between gsi and irq, then when userspace(qemu) want to use gsi,
we can do a translation. The third patch of kernel(xen/privcmd: Add new syscall to get gsi from irq)
records all the relations in acpi_register_gsi_xen_pvh() when dom0 initialize pci devices, and provide
a syscall for userspace to get the gsi from irq. The third patch of xen(tools: Add new function to get
gsi from irq) add a new function xc_physdev_gsi_from_irq() to call the new syscall added on kernel side.
And then userspace can use that function to get gsi. Then xc_physdev_map_pirq() will success. This v2
patch is the same as v1(
kernel https://lore.kernel.org/xen-devel/[email protected]/ and
xen https://lore.kernel.org/xen-devel/[email protected]/)

About the v2 patch of qemu, just change an included head file, other are similar to the v1 (
qemu https://lore.kernel.org/xen-devel/[email protected]/), just call
xc_physdev_gsi_from_irq() to get gsi from irq.


Jiqian Chen (3):
xen/pci: Add xen_reset_device_state function
xen/pvh: Unmask irq for passthrough device in PVH dom0
xen/privcmd: Add new syscall to get gsi from irq

arch/x86/include/asm/apic.h | 7 ++++++
arch/x86/include/asm/xen/pci.h | 5 ++++
arch/x86/kernel/acpi/boot.c | 2 +-
arch/x86/pci/xen.c | 21 ++++++++++++++++
drivers/xen/events/events_base.c | 39 ++++++++++++++++++++++++++++++
drivers/xen/pci.c | 12 +++++++++
drivers/xen/privcmd.c | 20 +++++++++++++++
drivers/xen/xen-pciback/pci_stub.c | 10 ++++++++
include/linux/irq.h | 1 +
include/uapi/xen/privcmd.h | 7 ++++++
include/xen/events.h | 5 ++++
include/xen/interface/physdev.h | 2 ++
include/xen/pci.h | 6 +++++
kernel/irq/chip.c | 1 +
kernel/irq/internals.h | 1 -
kernel/irq/irqdesc.c | 2 +-
16 files changed, 138 insertions(+), 3 deletions(-)

--
2.34.1


2023-11-24 10:33:46

by Jiqian Chen

[permalink] [raw]
Subject: [RFC KERNEL PATCH v2 1/3] xen/pci: Add xen_reset_device_state function

When device on dom0 side has been reset, the vpci on Xen side
won't get notification, so that the cached state in vpci is
all out of date with the real device state.
To solve that problem, this patch add a function to clear all
vpci device state when device is reset on dom0 side.

And call that function in pcistub_init_device. Because when
we use "pci-assignable-add" to assign a passthrough device in
Xen, it will reset passthrough device and the vpci state will
out of date, and then device will fail to restore bar state.

Signed-off-by: Jiqian Chen <[email protected]>
Signed-off-by: Huang Rui <[email protected]>
---
drivers/xen/pci.c | 12 ++++++++++++
drivers/xen/xen-pciback/pci_stub.c | 3 +++
include/xen/interface/physdev.h | 2 ++
include/xen/pci.h | 6 ++++++
4 files changed, 23 insertions(+)

diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
index 72d4e3f193af..e9b30bc09139 100644
--- a/drivers/xen/pci.c
+++ b/drivers/xen/pci.c
@@ -177,6 +177,18 @@ static int xen_remove_device(struct device *dev)
return r;
}

+int xen_reset_device_state(const struct pci_dev *dev)
+{
+ struct physdev_pci_device device = {
+ .seg = pci_domain_nr(dev->bus),
+ .bus = dev->bus->number,
+ .devfn = dev->devfn
+ };
+
+ return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_state_reset, &device);
+}
+EXPORT_SYMBOL_GPL(xen_reset_device_state);
+
static int xen_pci_notifier(struct notifier_block *nb,
unsigned long action, void *data)
{
diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index e34b623e4b41..5a96b6c66c07 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -421,6 +421,9 @@ static int pcistub_init_device(struct pci_dev *dev)
else {
dev_dbg(&dev->dev, "resetting (FLR, D3, etc) the device\n");
__pci_reset_function_locked(dev);
+ err = xen_reset_device_state(dev);
+ if (err)
+ goto config_release;
pci_restore_state(dev);
}
/* Now disable the device (this also ensures some private device
diff --git a/include/xen/interface/physdev.h b/include/xen/interface/physdev.h
index a237af867873..231526f80f6c 100644
--- a/include/xen/interface/physdev.h
+++ b/include/xen/interface/physdev.h
@@ -263,6 +263,8 @@ struct physdev_pci_device {
uint8_t devfn;
};

+#define PHYSDEVOP_pci_device_state_reset 32
+
#define PHYSDEVOP_DBGP_RESET_PREPARE 1
#define PHYSDEVOP_DBGP_RESET_DONE 2

diff --git a/include/xen/pci.h b/include/xen/pci.h
index b8337cf85fd1..b2e2e856efd6 100644
--- a/include/xen/pci.h
+++ b/include/xen/pci.h
@@ -4,10 +4,16 @@
#define __XEN_PCI_H__

#if defined(CONFIG_XEN_DOM0)
+int xen_reset_device_state(const struct pci_dev *dev);
int xen_find_device_domain_owner(struct pci_dev *dev);
int xen_register_device_domain_owner(struct pci_dev *dev, uint16_t domain);
int xen_unregister_device_domain_owner(struct pci_dev *dev);
#else
+static inline int xen_reset_device_state(const struct pci_dev *dev)
+{
+ return -1;
+}
+
static inline int xen_find_device_domain_owner(struct pci_dev *dev)
{
return -1;
--
2.34.1

2023-11-24 10:34:04

by Jiqian Chen

[permalink] [raw]
Subject: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

This patch is to solve two problems we encountered when we try to
passthrough a device to hvm domU base on Xen PVH dom0.

First, hvm guest will alloc a pirq and irq for a passthrough device
by using gsi, before that, the gsi must first has a mapping in dom0,
see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
into Xen and check whether dom0 has the mapping. See
XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
dom0 and it return irq is 0, and then return -EPERM.
This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
when thay are enabled.

Second, in PVH dom0, the gsi of a passthrough device doesn't get
registered, but gsi must be configured for it to be able to be
mapped into a domU.

After searching codes, we can find map_pirq and register_gsi will be
done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
can be conclude to that the gsi of a passthrough device doesn't be
unmasked.

To solve the unmaske problem, this patch call the unmask_irq when we
assign a device to be passthrough. So that the gsi can get registered
and mapped in PVH dom0.

Signed-off-by: Jiqian Chen <[email protected]>
Signed-off-by: Huang Rui <[email protected]>
---
drivers/xen/xen-pciback/pci_stub.c | 7 +++++++
include/linux/irq.h | 1 +
kernel/irq/chip.c | 1 +
kernel/irq/internals.h | 1 -
kernel/irq/irqdesc.c | 2 +-
5 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index 5a96b6c66c07..b83d02bcc76c 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -357,6 +357,7 @@ static int pcistub_match(struct pci_dev *dev)
static int pcistub_init_device(struct pci_dev *dev)
{
struct xen_pcibk_dev_data *dev_data;
+ struct irq_desc *desc = NULL;
int err = 0;

dev_dbg(&dev->dev, "initializing...\n");
@@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
if (err)
goto config_release;

+ if (xen_initial_domain() && xen_pvh_domain()) {
+ if (dev->irq <= 0 || !(desc = irq_to_desc(dev->irq)))
+ goto config_release;
+ unmask_irq(desc);
+ }
+
if (dev->msix_cap) {
struct physdev_pci_device ppdev = {
.seg = pci_domain_nr(dev->bus),
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 90081afa10ce..44650ca178d9 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -659,6 +659,7 @@ extern void handle_percpu_irq(struct irq_desc *desc);
extern void handle_percpu_devid_irq(struct irq_desc *desc);
extern void handle_bad_irq(struct irq_desc *desc);
extern void handle_nested_irq(unsigned int irq);
+extern void unmask_irq(struct irq_desc *desc);

extern void handle_fasteoi_nmi(struct irq_desc *desc);
extern void handle_percpu_devid_fasteoi_nmi(struct irq_desc *desc);
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index dc94e0bf2c94..fd67b40b678d 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -439,6 +439,7 @@ void unmask_irq(struct irq_desc *desc)
irq_state_clr_masked(desc);
}
}
+EXPORT_SYMBOL_GPL(unmask_irq);

void unmask_threaded_irq(struct irq_desc *desc)
{
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index bcc7f21db9ee..d08e3e7b2819 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -95,7 +95,6 @@ extern void irq_disable(struct irq_desc *desc);
extern void irq_percpu_enable(struct irq_desc *desc, unsigned int cpu);
extern void irq_percpu_disable(struct irq_desc *desc, unsigned int cpu);
extern void mask_irq(struct irq_desc *desc);
-extern void unmask_irq(struct irq_desc *desc);
extern void unmask_threaded_irq(struct irq_desc *desc);

#ifdef CONFIG_SPARSE_IRQ
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 27ca1c866f29..5977efed31b5 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -380,7 +380,7 @@ struct irq_desc *irq_to_desc(unsigned int irq)
{
return mtree_load(&sparse_irqs, irq);
}
-#ifdef CONFIG_KVM_BOOK3S_64_HV_MODULE
+#if defined CONFIG_KVM_BOOK3S_64_HV_MODULE || defined CONFIG_XEN_PVH
EXPORT_SYMBOL_GPL(irq_to_desc);
#endif

--
2.34.1

2023-11-24 10:34:05

by Jiqian Chen

[permalink] [raw]
Subject: [RFC KERNEL PATCH v2 3/3] xen/privcmd: Add new syscall to get gsi from irq

In PVH dom0, it uses the linux local interrupt mechanism,
when it allocs irq for a gsi, it is dynamic, and follow
the principle of applying first, distributing first. And
if you debug the codes, you will find the irq number is
alloced from small to large, but the applying gsi number
is not, may gsi 38 comes before gsi 28, it causes the irq
number is not equal with the gsi number.
And when we passthrough a device, QEMU will use device's gsi
number to do mapping actions, see xen_pt_realize->
xc_physdev_map_pirq, but the gsi number is got from file
/sys/bus/pci/devices/xxxx:xx:xx.x/irq, irq!= gsi, so it will
fail when mapping.
And in current linux codes, there is no method to translate
irq to gsi for userspace.

For above purpose, this patch record the relationship of gsi
and irq when PVH dom0 do acpi_register_gsi_ioapic for devices
and adds a new syscall into privcmd to let userspace can get
that translation when they have a need.

Signed-off-by: Jiqian Chen <[email protected]>
Signed-off-by: Huang Rui <[email protected]>
---
arch/x86/include/asm/apic.h | 7 ++++++
arch/x86/include/asm/xen/pci.h | 5 ++++
arch/x86/kernel/acpi/boot.c | 2 +-
arch/x86/pci/xen.c | 21 +++++++++++++++++
drivers/xen/events/events_base.c | 39 ++++++++++++++++++++++++++++++++
drivers/xen/privcmd.c | 20 ++++++++++++++++
include/uapi/xen/privcmd.h | 7 ++++++
include/xen/events.h | 5 ++++
8 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index d21f48f1c242..5646444285ac 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -169,6 +169,8 @@ extern bool apic_needs_pit(void);

extern void apic_send_IPI_allbutself(unsigned int vector);

+extern int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+ int trigger, int polarity);
#else /* !CONFIG_X86_LOCAL_APIC */
static inline void lapic_shutdown(void) { }
#define local_apic_timer_c2_ok 1
@@ -183,6 +185,11 @@ static inline void apic_intr_mode_init(void) { }
static inline void lapic_assign_system_vectors(void) { }
static inline void lapic_assign_legacy_vector(unsigned int i, bool r) { }
static inline bool apic_needs_pit(void) { return true; }
+static inline int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+ int trigger, int polarity)
+{
+ return (int)gsi;
+}
#endif /* !CONFIG_X86_LOCAL_APIC */

#ifdef CONFIG_X86_X2APIC
diff --git a/arch/x86/include/asm/xen/pci.h b/arch/x86/include/asm/xen/pci.h
index 9015b888edd6..aa8ded61fc2d 100644
--- a/arch/x86/include/asm/xen/pci.h
+++ b/arch/x86/include/asm/xen/pci.h
@@ -5,6 +5,7 @@
#if defined(CONFIG_PCI_XEN)
extern int __init pci_xen_init(void);
extern int __init pci_xen_hvm_init(void);
+extern int __init pci_xen_pvh_init(void);
#define pci_xen 1
#else
#define pci_xen 0
@@ -13,6 +14,10 @@ static inline int pci_xen_hvm_init(void)
{
return -1;
}
+static inline int pci_xen_pvh_init(void)
+{
+ return -1;
+}
#endif
#ifdef CONFIG_XEN_PV_DOM0
int __init pci_xen_initial_domain(void);
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index d0918a75cb00..45b157e18c0b 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -739,7 +739,7 @@ static int acpi_register_gsi_pic(struct device *dev, u32 gsi,
}

#ifdef CONFIG_X86_LOCAL_APIC
-static int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
int trigger, int polarity)
{
int irq = gsi;
diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 652cd53e77f6..f056ab5c0a06 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -114,6 +114,21 @@ static int acpi_register_gsi_xen_hvm(struct device *dev, u32 gsi,
false /* no mapping of GSI to PIRQ */);
}

+static int acpi_register_gsi_xen_pvh(struct device *dev, u32 gsi,
+ int trigger, int polarity)
+{
+ int irq;
+
+ irq = acpi_register_gsi_ioapic(dev, gsi, trigger, polarity);
+ if (irq < 0)
+ return irq;
+
+ if (xen_pvh_add_gsi_irq_map(gsi, irq) == -EEXIST)
+ printk(KERN_INFO "Already map the GSI :%u and IRQ: %d\n", gsi, irq);
+
+ return irq;
+}
+
#ifdef CONFIG_XEN_PV_DOM0
static int xen_register_gsi(u32 gsi, int triggering, int polarity)
{
@@ -558,6 +573,12 @@ int __init pci_xen_hvm_init(void)
return 0;
}

+int __init pci_xen_pvh_init(void)
+{
+ __acpi_register_gsi = acpi_register_gsi_xen_pvh;
+ return 0;
+}
+
#ifdef CONFIG_XEN_PV_DOM0
int __init pci_xen_initial_domain(void)
{
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 6de6b084ea60..a02d62955509 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -957,6 +957,43 @@ int xen_irq_from_gsi(unsigned gsi)
}
EXPORT_SYMBOL_GPL(xen_irq_from_gsi);

+int xen_gsi_from_irq(unsigned irq)
+{
+ struct irq_info *info;
+
+ list_for_each_entry(info, &xen_irq_list_head, list) {
+ if (info->type != IRQT_PIRQ)
+ continue;
+
+ if (info->irq == irq)
+ return info->u.pirq.gsi;
+ }
+
+ return -1;
+}
+EXPORT_SYMBOL_GPL(xen_gsi_from_irq);
+
+int xen_pvh_add_gsi_irq_map(unsigned gsi, unsigned irq)
+{
+ int tmp_irq;
+ struct irq_info *info;
+
+ tmp_irq = xen_irq_from_gsi(gsi);
+ if (tmp_irq != -1)
+ return -EEXIST;
+
+ info = kzalloc(sizeof(*info), GFP_KERNEL);
+ if (info == NULL)
+ panic("Unable to allocate metadata for GSI%d\n", gsi);
+
+ info->type = IRQT_PIRQ;
+ info->irq = irq;
+ info->u.pirq.gsi = gsi;
+ list_add_tail(&info->list, &xen_irq_list_head);
+
+ return 0;
+}
+
static void __unbind_from_irq(unsigned int irq)
{
evtchn_port_t evtchn = evtchn_from_irq(irq);
@@ -2303,6 +2340,8 @@ void __init xen_init_IRQ(void)
xen_init_setup_upcall_vector();
xen_alloc_callback_vector();

+ if (xen_pvh_domain())
+ pci_xen_pvh_init();

if (xen_hvm_domain()) {
native_init_IRQ();
diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 1ce7f3c7a950..6fa8a01d8ae6 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -45,6 +45,7 @@
#include <xen/page.h>
#include <xen/xen-ops.h>
#include <xen/balloon.h>
+#include <xen/events.h>

#include "privcmd.h"

@@ -842,6 +843,21 @@ static long privcmd_ioctl_mmap_resource(struct file *file,
return rc;
}

+static long privcmd_ioctl_gsi_from_irq(struct file *file, void __user *udata)
+{
+ struct privcmd_gsi_from_irq kdata;
+
+ if (copy_from_user(&kdata, udata, sizeof(kdata)))
+ return -EFAULT;
+
+ kdata.gsi = xen_gsi_from_irq(kdata.irq);
+
+ if (copy_to_user(udata, &kdata, sizeof(kdata)))
+ return -EFAULT;
+
+ return 0;
+}
+
#ifdef CONFIG_XEN_PRIVCMD_EVENTFD
/* Irqfd support */
static struct workqueue_struct *irqfd_cleanup_wq;
@@ -1534,6 +1550,10 @@ static long privcmd_ioctl(struct file *file,
ret = privcmd_ioctl_ioeventfd(file, udata);
break;

+ case IOCTL_PRIVCMD_GSI_FROM_IRQ:
+ ret = privcmd_ioctl_gsi_from_irq(file, udata);
+ break;
+
default:
break;
}
diff --git a/include/uapi/xen/privcmd.h b/include/uapi/xen/privcmd.h
index 8b8c5d1420fe..61f0ffbec077 100644
--- a/include/uapi/xen/privcmd.h
+++ b/include/uapi/xen/privcmd.h
@@ -126,6 +126,11 @@ struct privcmd_ioeventfd {
__u8 pad[2];
};

+struct privcmd_gsi_from_irq {
+ __u32 irq;
+ __u32 gsi;
+};
+
/*
* @cmd: IOCTL_PRIVCMD_HYPERCALL
* @arg: &privcmd_hypercall_t
@@ -157,5 +162,7 @@ struct privcmd_ioeventfd {
_IOW('P', 8, struct privcmd_irqfd)
#define IOCTL_PRIVCMD_IOEVENTFD \
_IOW('P', 9, struct privcmd_ioeventfd)
+#define IOCTL_PRIVCMD_GSI_FROM_IRQ \
+ _IOC(_IOC_NONE, 'P', 10, sizeof(struct privcmd_gsi_from_irq))

#endif /* __LINUX_PUBLIC_PRIVCMD_H__ */
diff --git a/include/xen/events.h b/include/xen/events.h
index 23932b0673dc..78377c936efe 100644
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -131,6 +131,11 @@ int xen_pirq_from_irq(unsigned irq);
/* Return the irq allocated to the gsi */
int xen_irq_from_gsi(unsigned gsi);

+/* Return the gsi from irq */
+int xen_gsi_from_irq(unsigned irq);
+
+int xen_pvh_add_gsi_irq_map(unsigned gsi, unsigned irq);
+
/* Determine whether to ignore this IRQ if it is passed to a guest. */
int xen_test_irq_shared(int irq);

--
2.34.1

2023-11-30 03:46:31

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 1/3] xen/pci: Add xen_reset_device_state function

On Fri, 24 Nov 2023, Jiqian Chen wrote:
> When device on dom0 side has been reset, the vpci on Xen side
> won't get notification, so that the cached state in vpci is
> all out of date with the real device state.
> To solve that problem, this patch add a function to clear all
> vpci device state when device is reset on dom0 side.
>
> And call that function in pcistub_init_device. Because when
> we use "pci-assignable-add" to assign a passthrough device in
> Xen, it will reset passthrough device and the vpci state will
> out of date, and then device will fail to restore bar state.
>
> Signed-off-by: Jiqian Chen <[email protected]>
> Signed-off-by: Huang Rui <[email protected]>
> ---
> drivers/xen/pci.c | 12 ++++++++++++
> drivers/xen/xen-pciback/pci_stub.c | 3 +++
> include/xen/interface/physdev.h | 2 ++
> include/xen/pci.h | 6 ++++++
> 4 files changed, 23 insertions(+)
>
> diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
> index 72d4e3f193af..e9b30bc09139 100644
> --- a/drivers/xen/pci.c
> +++ b/drivers/xen/pci.c
> @@ -177,6 +177,18 @@ static int xen_remove_device(struct device *dev)
> return r;
> }
>
> +int xen_reset_device_state(const struct pci_dev *dev)
> +{
> + struct physdev_pci_device device = {
> + .seg = pci_domain_nr(dev->bus),
> + .bus = dev->bus->number,
> + .devfn = dev->devfn
> + };
> +
> + return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_state_reset, &device);
> +}
> +EXPORT_SYMBOL_GPL(xen_reset_device_state);
> +
> static int xen_pci_notifier(struct notifier_block *nb,
> unsigned long action, void *data)
> {
> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
> index e34b623e4b41..5a96b6c66c07 100644
> --- a/drivers/xen/xen-pciback/pci_stub.c
> +++ b/drivers/xen/xen-pciback/pci_stub.c
> @@ -421,6 +421,9 @@ static int pcistub_init_device(struct pci_dev *dev)
> else {
> dev_dbg(&dev->dev, "resetting (FLR, D3, etc) the device\n");
> __pci_reset_function_locked(dev);
> + err = xen_reset_device_state(dev);
> + if (err)
> + goto config_release;

Older versions of Xen won't have the hypercall
PHYSDEVOP_pci_device_state_reset implemented. I think we should do
something like:

if (err && xen_pvh_domain())
goto config_release;


Or even:

if (xen_pvh_domain()) {
err = xen_reset_device_state(dev);
if (err)
goto config_release;
}

depending on whether we want to call xen_reset_device_state also for PV
guests or not. I am assuming we don't want to error out on failure such
as -ENOENT for PV guests.


> pci_restore_state(dev);
> }
> /* Now disable the device (this also ensures some private device
> diff --git a/include/xen/interface/physdev.h b/include/xen/interface/physdev.h
> index a237af867873..231526f80f6c 100644
> --- a/include/xen/interface/physdev.h
> +++ b/include/xen/interface/physdev.h
> @@ -263,6 +263,8 @@ struct physdev_pci_device {
> uint8_t devfn;
> };
>
> +#define PHYSDEVOP_pci_device_state_reset 32
> +
> #define PHYSDEVOP_DBGP_RESET_PREPARE 1
> #define PHYSDEVOP_DBGP_RESET_DONE 2
>
> diff --git a/include/xen/pci.h b/include/xen/pci.h
> index b8337cf85fd1..b2e2e856efd6 100644
> --- a/include/xen/pci.h
> +++ b/include/xen/pci.h
> @@ -4,10 +4,16 @@
> #define __XEN_PCI_H__
>
> #if defined(CONFIG_XEN_DOM0)
> +int xen_reset_device_state(const struct pci_dev *dev);
> int xen_find_device_domain_owner(struct pci_dev *dev);
> int xen_register_device_domain_owner(struct pci_dev *dev, uint16_t domain);
> int xen_unregister_device_domain_owner(struct pci_dev *dev);
> #else
> +static inline int xen_reset_device_state(const struct pci_dev *dev)
> +{
> + return -1;
> +}
> +
> static inline int xen_find_device_domain_owner(struct pci_dev *dev)
> {
> return -1;
> --
> 2.34.1
>

2023-11-30 03:54:16

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On Fri, 24 Nov 2023, Jiqian Chen wrote:
> This patch is to solve two problems we encountered when we try to
> passthrough a device to hvm domU base on Xen PVH dom0.
>
> First, hvm guest will alloc a pirq and irq for a passthrough device
> by using gsi, before that, the gsi must first has a mapping in dom0,
> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> into Xen and check whether dom0 has the mapping. See
> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> dom0 and it return irq is 0, and then return -EPERM.
> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> when thay are enabled.
>
> Second, in PVH dom0, the gsi of a passthrough device doesn't get
> registered, but gsi must be configured for it to be able to be
> mapped into a domU.
>
> After searching codes, we can find map_pirq and register_gsi will be
> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> can be conclude to that the gsi of a passthrough device doesn't be
> unmasked.
>
> To solve the unmaske problem, this patch call the unmask_irq when we
> assign a device to be passthrough. So that the gsi can get registered
> and mapped in PVH dom0.


Roger, this seems to be more of a Xen issue than a Linux issue. Why do
we need the unmask check in Xen? Couldn't we just do:


diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
index 4e40d3609a..df262a4a18 100644
--- a/xen/arch/x86/hvm/vioapic.c
+++ b/xen/arch/x86/hvm/vioapic.c
@@ -287,7 +287,7 @@ static void vioapic_write_redirent(
hvm_dpci_eoi(d, gsi);
}

- if ( is_hardware_domain(d) && unmasked )
+ if ( is_hardware_domain(d) )
{
/*
* NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock

2023-11-30 07:04:00

by Jiqian Chen

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 1/3] xen/pci: Add xen_reset_device_state function


On 2023/11/30 11:46, Stefano Stabellini wrote:
> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>> When device on dom0 side has been reset, the vpci on Xen side
>> won't get notification, so that the cached state in vpci is
>> all out of date with the real device state.
>> To solve that problem, this patch add a function to clear all
>> vpci device state when device is reset on dom0 side.
>>
>> And call that function in pcistub_init_device. Because when
>> we use "pci-assignable-add" to assign a passthrough device in
>> Xen, it will reset passthrough device and the vpci state will
>> out of date, and then device will fail to restore bar state.
>>
>> Signed-off-by: Jiqian Chen <[email protected]>
>> Signed-off-by: Huang Rui <[email protected]>
>> ---
>> drivers/xen/pci.c | 12 ++++++++++++
>> drivers/xen/xen-pciback/pci_stub.c | 3 +++
>> include/xen/interface/physdev.h | 2 ++
>> include/xen/pci.h | 6 ++++++
>> 4 files changed, 23 insertions(+)
>>
>> diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
>> index 72d4e3f193af..e9b30bc09139 100644
>> --- a/drivers/xen/pci.c
>> +++ b/drivers/xen/pci.c
>> @@ -177,6 +177,18 @@ static int xen_remove_device(struct device *dev)
>> return r;
>> }
>>
>> +int xen_reset_device_state(const struct pci_dev *dev)
>> +{
>> + struct physdev_pci_device device = {
>> + .seg = pci_domain_nr(dev->bus),
>> + .bus = dev->bus->number,
>> + .devfn = dev->devfn
>> + };
>> +
>> + return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_state_reset, &device);
>> +}
>> +EXPORT_SYMBOL_GPL(xen_reset_device_state);
>> +
>> static int xen_pci_notifier(struct notifier_block *nb,
>> unsigned long action, void *data)
>> {
>> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
>> index e34b623e4b41..5a96b6c66c07 100644
>> --- a/drivers/xen/xen-pciback/pci_stub.c
>> +++ b/drivers/xen/xen-pciback/pci_stub.c
>> @@ -421,6 +421,9 @@ static int pcistub_init_device(struct pci_dev *dev)
>> else {
>> dev_dbg(&dev->dev, "resetting (FLR, D3, etc) the device\n");
>> __pci_reset_function_locked(dev);
>> + err = xen_reset_device_state(dev);
>> + if (err)
>> + goto config_release;
>
> Older versions of Xen won't have the hypercall
> PHYSDEVOP_pci_device_state_reset implemented. I think we should do
> something like:
>
> if (err && xen_pvh_domain())
> goto config_release;
>
>
> Or even:
>
> if (xen_pvh_domain()) {
> err = xen_reset_device_state(dev);
> if (err)
> goto config_release;
> }
>
> depending on whether we want to call xen_reset_device_state also for PV
> guests or not. I am assuming we don't want to error out on failure such
> as -ENOENT for PV guests.
Yes, only for PVH dom0, I will add the condition in next version. Thank you!

>
>
>> pci_restore_state(dev);
>> }
>> /* Now disable the device (this also ensures some private device
>> diff --git a/include/xen/interface/physdev.h b/include/xen/interface/physdev.h
>> index a237af867873..231526f80f6c 100644
>> --- a/include/xen/interface/physdev.h
>> +++ b/include/xen/interface/physdev.h
>> @@ -263,6 +263,8 @@ struct physdev_pci_device {
>> uint8_t devfn;
>> };
>>
>> +#define PHYSDEVOP_pci_device_state_reset 32
>> +
>> #define PHYSDEVOP_DBGP_RESET_PREPARE 1
>> #define PHYSDEVOP_DBGP_RESET_DONE 2
>>
>> diff --git a/include/xen/pci.h b/include/xen/pci.h
>> index b8337cf85fd1..b2e2e856efd6 100644
>> --- a/include/xen/pci.h
>> +++ b/include/xen/pci.h
>> @@ -4,10 +4,16 @@
>> #define __XEN_PCI_H__
>>
>> #if defined(CONFIG_XEN_DOM0)
>> +int xen_reset_device_state(const struct pci_dev *dev);
>> int xen_find_device_domain_owner(struct pci_dev *dev);
>> int xen_register_device_domain_owner(struct pci_dev *dev, uint16_t domain);
>> int xen_unregister_device_domain_owner(struct pci_dev *dev);
>> #else
>> +static inline int xen_reset_device_state(const struct pci_dev *dev)
>> +{
>> + return -1;
>> +}
>> +
>> static inline int xen_find_device_domain_owner(struct pci_dev *dev)
>> {
>> return -1;
>> --
>> 2.34.1
>>

--
Best regards,
Jiqian Chen.

2023-11-30 15:03:36

by Stewart Hildebrand

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 1/3] xen/pci: Add xen_reset_device_state function

On 11/30/23 02:03, Chen, Jiqian wrote:
>
> On 2023/11/30 11:46, Stefano Stabellini wrote:
>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>>> When device on dom0 side has been reset, the vpci on Xen side
>>> won't get notification, so that the cached state in vpci is
>>> all out of date with the real device state.
>>> To solve that problem, this patch add a function to clear all
>>> vpci device state when device is reset on dom0 side.
>>>
>>> And call that function in pcistub_init_device. Because when
>>> we use "pci-assignable-add" to assign a passthrough device in
>>> Xen, it will reset passthrough device and the vpci state will
>>> out of date, and then device will fail to restore bar state.
>>>
>>> Signed-off-by: Jiqian Chen <[email protected]>
>>> Signed-off-by: Huang Rui <[email protected]>
>>> ---
>>> drivers/xen/pci.c | 12 ++++++++++++
>>> drivers/xen/xen-pciback/pci_stub.c | 3 +++
>>> include/xen/interface/physdev.h | 2 ++
>>> include/xen/pci.h | 6 ++++++
>>> 4 files changed, 23 insertions(+)
>>>
>>> diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
>>> index 72d4e3f193af..e9b30bc09139 100644
>>> --- a/drivers/xen/pci.c
>>> +++ b/drivers/xen/pci.c
>>> @@ -177,6 +177,18 @@ static int xen_remove_device(struct device *dev)
>>> return r;
>>> }
>>>
>>> +int xen_reset_device_state(const struct pci_dev *dev)
>>> +{
>>> + struct physdev_pci_device device = {
>>> + .seg = pci_domain_nr(dev->bus),
>>> + .bus = dev->bus->number,
>>> + .devfn = dev->devfn
>>> + };
>>> +
>>> + return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_state_reset, &device);
>>> +}
>>> +EXPORT_SYMBOL_GPL(xen_reset_device_state);
>>> +
>>> static int xen_pci_notifier(struct notifier_block *nb,
>>> unsigned long action, void *data)
>>> {
>>> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
>>> index e34b623e4b41..5a96b6c66c07 100644
>>> --- a/drivers/xen/xen-pciback/pci_stub.c
>>> +++ b/drivers/xen/xen-pciback/pci_stub.c
>>> @@ -421,6 +421,9 @@ static int pcistub_init_device(struct pci_dev *dev)
>>> else {
>>> dev_dbg(&dev->dev, "resetting (FLR, D3, etc) the device\n");
>>> __pci_reset_function_locked(dev);
>>> + err = xen_reset_device_state(dev);
>>> + if (err)
>>> + goto config_release;
>>
>> Older versions of Xen won't have the hypercall
>> PHYSDEVOP_pci_device_state_reset implemented. I think we should do
>> something like:
>>
>> if (err && xen_pvh_domain())
>> goto config_release;
>>
>>
>> Or even:
>>
>> if (xen_pvh_domain()) {
>> err = xen_reset_device_state(dev);
>> if (err)
>> goto config_release;
>> }
>>
>> depending on whether we want to call xen_reset_device_state also for PV
>> guests or not. I am assuming we don't want to error out on failure such
>> as -ENOENT for PV guests.
> Yes, only for PVH dom0, I will add the condition in next version. Thank you!

We will want to call xen_reset_device_state() for Arm dom0, too, so checking xen_pvh_domain() alone is not sufficient. I suggest instead to check !xen_pv_domain().

>
>>
>>
>>> pci_restore_state(dev);
>>> }
>>> /* Now disable the device (this also ensures some private device
>>> diff --git a/include/xen/interface/physdev.h b/include/xen/interface/physdev.h
>>> index a237af867873..231526f80f6c 100644
>>> --- a/include/xen/interface/physdev.h
>>> +++ b/include/xen/interface/physdev.h
>>> @@ -263,6 +263,8 @@ struct physdev_pci_device {
>>> uint8_t devfn;
>>> };
>>>
>>> +#define PHYSDEVOP_pci_device_state_reset 32
>>> +
>>> #define PHYSDEVOP_DBGP_RESET_PREPARE 1
>>> #define PHYSDEVOP_DBGP_RESET_DONE 2
>>>
>>> diff --git a/include/xen/pci.h b/include/xen/pci.h
>>> index b8337cf85fd1..b2e2e856efd6 100644
>>> --- a/include/xen/pci.h
>>> +++ b/include/xen/pci.h
>>> @@ -4,10 +4,16 @@
>>> #define __XEN_PCI_H__
>>>
>>> #if defined(CONFIG_XEN_DOM0)
>>> +int xen_reset_device_state(const struct pci_dev *dev);
>>> int xen_find_device_domain_owner(struct pci_dev *dev);
>>> int xen_register_device_domain_owner(struct pci_dev *dev, uint16_t domain);
>>> int xen_unregister_device_domain_owner(struct pci_dev *dev);
>>> #else
>>> +static inline int xen_reset_device_state(const struct pci_dev *dev)
>>> +{
>>> + return -1;
>>> +}
>>> +
>>> static inline int xen_find_device_domain_owner(struct pci_dev *dev)
>>> {
>>> return -1;
>>> --
>>> 2.34.1
>>>
>

2023-11-30 16:02:49

by Roger Pau Monné

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> On Fri, 24 Nov 2023, Jiqian Chen wrote:
> > This patch is to solve two problems we encountered when we try to
> > passthrough a device to hvm domU base on Xen PVH dom0.
> >
> > First, hvm guest will alloc a pirq and irq for a passthrough device
> > by using gsi, before that, the gsi must first has a mapping in dom0,
> > see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> > into Xen and check whether dom0 has the mapping. See
> > XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> > dom0 and it return irq is 0, and then return -EPERM.
> > This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> > when thay are enabled.
> >
> > Second, in PVH dom0, the gsi of a passthrough device doesn't get
> > registered, but gsi must be configured for it to be able to be
> > mapped into a domU.
> >
> > After searching codes, we can find map_pirq and register_gsi will be
> > done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> > the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> > can be conclude to that the gsi of a passthrough device doesn't be
> > unmasked.
> >
> > To solve the unmaske problem, this patch call the unmask_irq when we
> > assign a device to be passthrough. So that the gsi can get registered
> > and mapped in PVH dom0.
>
>
> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> we need the unmask check in Xen? Couldn't we just do:
>
>
> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> index 4e40d3609a..df262a4a18 100644
> --- a/xen/arch/x86/hvm/vioapic.c
> +++ b/xen/arch/x86/hvm/vioapic.c
> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> hvm_dpci_eoi(d, gsi);
> }
>
> - if ( is_hardware_domain(d) && unmasked )
> + if ( is_hardware_domain(d) )
> {
> /*
> * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock

There are some issues with this approach.

mp_register_gsi() will only setup the trigger and polarity of the
IO-APIC pin once, so we do so once the guest unmask the pin in order
to assert that the configuration is the intended one. A guest is
allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
that doesn't take effect unless the pin is unmasked.

Overall the question would be whether we have any guarantees that
the hardware domain has properly configured the pin, even if it's not
using it itself (as it hasn't been unmasked).

IIRC PCI legacy interrupts are level triggered and low polarity, so we
could configure any pins that are not setup at bind time?

Thanks, Roger.

2023-12-01 03:15:48

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On Thu, 30 Nov 2023, Roger Pau Monné wrote:
> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> > On Fri, 24 Nov 2023, Jiqian Chen wrote:
> > > This patch is to solve two problems we encountered when we try to
> > > passthrough a device to hvm domU base on Xen PVH dom0.
> > >
> > > First, hvm guest will alloc a pirq and irq for a passthrough device
> > > by using gsi, before that, the gsi must first has a mapping in dom0,
> > > see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> > > into Xen and check whether dom0 has the mapping. See
> > > XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> > > dom0 and it return irq is 0, and then return -EPERM.
> > > This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> > > when thay are enabled.
> > >
> > > Second, in PVH dom0, the gsi of a passthrough device doesn't get
> > > registered, but gsi must be configured for it to be able to be
> > > mapped into a domU.
> > >
> > > After searching codes, we can find map_pirq and register_gsi will be
> > > done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> > > the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> > > can be conclude to that the gsi of a passthrough device doesn't be
> > > unmasked.
> > >
> > > To solve the unmaske problem, this patch call the unmask_irq when we
> > > assign a device to be passthrough. So that the gsi can get registered
> > > and mapped in PVH dom0.
> >
> >
> > Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> > we need the unmask check in Xen? Couldn't we just do:
> >
> >
> > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > index 4e40d3609a..df262a4a18 100644
> > --- a/xen/arch/x86/hvm/vioapic.c
> > +++ b/xen/arch/x86/hvm/vioapic.c
> > @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> > hvm_dpci_eoi(d, gsi);
> > }
> >
> > - if ( is_hardware_domain(d) && unmasked )
> > + if ( is_hardware_domain(d) )
> > {
> > /*
> > * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
>
> There are some issues with this approach.
>
> mp_register_gsi() will only setup the trigger and polarity of the
> IO-APIC pin once, so we do so once the guest unmask the pin in order
> to assert that the configuration is the intended one. A guest is
> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
> that doesn't take effect unless the pin is unmasked.
>
> Overall the question would be whether we have any guarantees that
> the hardware domain has properly configured the pin, even if it's not
> using it itself (as it hasn't been unmasked).
>
> IIRC PCI legacy interrupts are level triggered and low polarity, so we
> could configure any pins that are not setup at bind time?

That could work.

Another idea is to move only the call to allocate_and_map_gsi_pirq at
bind time? That might be enough to pass a pirq_access_permitted check.

2023-12-01 08:58:30

by Roger Pau Monné

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
> > On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> > > On Fri, 24 Nov 2023, Jiqian Chen wrote:
> > > > This patch is to solve two problems we encountered when we try to
> > > > passthrough a device to hvm domU base on Xen PVH dom0.
> > > >
> > > > First, hvm guest will alloc a pirq and irq for a passthrough device
> > > > by using gsi, before that, the gsi must first has a mapping in dom0,
> > > > see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> > > > into Xen and check whether dom0 has the mapping. See
> > > > XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> > > > dom0 and it return irq is 0, and then return -EPERM.
> > > > This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> > > > when thay are enabled.
> > > >
> > > > Second, in PVH dom0, the gsi of a passthrough device doesn't get
> > > > registered, but gsi must be configured for it to be able to be
> > > > mapped into a domU.
> > > >
> > > > After searching codes, we can find map_pirq and register_gsi will be
> > > > done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> > > > the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> > > > can be conclude to that the gsi of a passthrough device doesn't be
> > > > unmasked.
> > > >
> > > > To solve the unmaske problem, this patch call the unmask_irq when we
> > > > assign a device to be passthrough. So that the gsi can get registered
> > > > and mapped in PVH dom0.
> > >
> > >
> > > Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> > > we need the unmask check in Xen? Couldn't we just do:
> > >
> > >
> > > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > > index 4e40d3609a..df262a4a18 100644
> > > --- a/xen/arch/x86/hvm/vioapic.c
> > > +++ b/xen/arch/x86/hvm/vioapic.c
> > > @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> > > hvm_dpci_eoi(d, gsi);
> > > }
> > >
> > > - if ( is_hardware_domain(d) && unmasked )
> > > + if ( is_hardware_domain(d) )
> > > {
> > > /*
> > > * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
> >
> > There are some issues with this approach.
> >
> > mp_register_gsi() will only setup the trigger and polarity of the
> > IO-APIC pin once, so we do so once the guest unmask the pin in order
> > to assert that the configuration is the intended one. A guest is
> > allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
> > that doesn't take effect unless the pin is unmasked.
> >
> > Overall the question would be whether we have any guarantees that
> > the hardware domain has properly configured the pin, even if it's not
> > using it itself (as it hasn't been unmasked).
> >
> > IIRC PCI legacy interrupts are level triggered and low polarity, so we
> > could configure any pins that are not setup at bind time?
>
> That could work.
>
> Another idea is to move only the call to allocate_and_map_gsi_pirq at
> bind time? That might be enough to pass a pirq_access_permitted check.

Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
parameter would be a GSI instead of a previously mapped IRQ). Such
difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
route I would recommend that we instead introduce a new dmop that has
this syntax regardless of the domain type it's called from.

Thanks, Roger.

2023-12-02 03:38:35

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
> > On Thu, 30 Nov 2023, Roger Pau Monné wrote:
> > > On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> > > > On Fri, 24 Nov 2023, Jiqian Chen wrote:
> > > > > This patch is to solve two problems we encountered when we try to
> > > > > passthrough a device to hvm domU base on Xen PVH dom0.
> > > > >
> > > > > First, hvm guest will alloc a pirq and irq for a passthrough device
> > > > > by using gsi, before that, the gsi must first has a mapping in dom0,
> > > > > see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> > > > > into Xen and check whether dom0 has the mapping. See
> > > > > XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> > > > > dom0 and it return irq is 0, and then return -EPERM.
> > > > > This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> > > > > when thay are enabled.
> > > > >
> > > > > Second, in PVH dom0, the gsi of a passthrough device doesn't get
> > > > > registered, but gsi must be configured for it to be able to be
> > > > > mapped into a domU.
> > > > >
> > > > > After searching codes, we can find map_pirq and register_gsi will be
> > > > > done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> > > > > the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> > > > > can be conclude to that the gsi of a passthrough device doesn't be
> > > > > unmasked.
> > > > >
> > > > > To solve the unmaske problem, this patch call the unmask_irq when we
> > > > > assign a device to be passthrough. So that the gsi can get registered
> > > > > and mapped in PVH dom0.
> > > >
> > > >
> > > > Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> > > > we need the unmask check in Xen? Couldn't we just do:
> > > >
> > > >
> > > > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > > > index 4e40d3609a..df262a4a18 100644
> > > > --- a/xen/arch/x86/hvm/vioapic.c
> > > > +++ b/xen/arch/x86/hvm/vioapic.c
> > > > @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> > > > hvm_dpci_eoi(d, gsi);
> > > > }
> > > >
> > > > - if ( is_hardware_domain(d) && unmasked )
> > > > + if ( is_hardware_domain(d) )
> > > > {
> > > > /*
> > > > * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
> > >
> > > There are some issues with this approach.
> > >
> > > mp_register_gsi() will only setup the trigger and polarity of the
> > > IO-APIC pin once, so we do so once the guest unmask the pin in order
> > > to assert that the configuration is the intended one. A guest is
> > > allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
> > > that doesn't take effect unless the pin is unmasked.
> > >
> > > Overall the question would be whether we have any guarantees that
> > > the hardware domain has properly configured the pin, even if it's not
> > > using it itself (as it hasn't been unmasked).
> > >
> > > IIRC PCI legacy interrupts are level triggered and low polarity, so we
> > > could configure any pins that are not setup at bind time?
> >
> > That could work.
> >
> > Another idea is to move only the call to allocate_and_map_gsi_pirq at
> > bind time? That might be enough to pass a pirq_access_permitted check.
>
> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
> parameter would be a GSI instead of a previously mapped IRQ). Such
> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
> route I would recommend that we instead introduce a new dmop that has
> this syntax regardless of the domain type it's called from.

Looking at the code it is certainly a bit confusing. My point was that
we don't need to wait until polarity and trigger are set appropriately
to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
should be able to figure out that Dom0 is permitted pirq access.

So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
polarity to be configured to work. But the suggestion of doing it a
"bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.

But maybe we can find another location, maybe within
xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
trigger and polarity are set and before the interrupt is unmasked.

Then we change the implementation of vioapic_hwdom_map_gsi to skip the
call to allocate_and_map_gsi_pirq, because by the time
vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
already been done.

I am not familiar with vioapic.c but to give you an idea of what I was
thinking:


diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
index 4e40d3609a..16d56fe851 100644
--- a/xen/arch/x86/hvm/vioapic.c
+++ b/xen/arch/x86/hvm/vioapic.c
@@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
return ret;
}

- ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
- if ( ret )
- {
- gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
- gsi, ret);
- return ret;
- }
-
pcidevs_lock();
ret = pt_irq_create_bind(currd, &pt_irq_bind);
if ( ret )
@@ -287,6 +279,17 @@ static void vioapic_write_redirent(
hvm_dpci_eoi(d, gsi);
}

+ if ( is_hardware_domain(d) )
+ {
+ int pirq = gsi, ret;
+ ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
+ if ( ret )
+ {
+ gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
+ gsi, ret);
+ return ret;
+ }
+ }
if ( is_hardware_domain(d) && unmasked )
{
/*

2023-12-04 03:46:03

by Jiqian Chen

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 1/3] xen/pci: Add xen_reset_device_state function

Hi Stewart,

On 2023/11/30 23:03, Stewart Hildebrand wrote:
> On 11/30/23 02:03, Chen, Jiqian wrote:
>>
>> On 2023/11/30 11:46, Stefano Stabellini wrote:
>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>>>> When device on dom0 side has been reset, the vpci on Xen side
>>>> won't get notification, so that the cached state in vpci is
>>>> all out of date with the real device state.
>>>> To solve that problem, this patch add a function to clear all
>>>> vpci device state when device is reset on dom0 side.
>>>>
>>>> And call that function in pcistub_init_device. Because when
>>>> we use "pci-assignable-add" to assign a passthrough device in
>>>> Xen, it will reset passthrough device and the vpci state will
>>>> out of date, and then device will fail to restore bar state.
>>>>
>>>> Signed-off-by: Jiqian Chen <[email protected]>
>>>> Signed-off-by: Huang Rui <[email protected]>
>>>> ---
>>>> drivers/xen/pci.c | 12 ++++++++++++
>>>> drivers/xen/xen-pciback/pci_stub.c | 3 +++
>>>> include/xen/interface/physdev.h | 2 ++
>>>> include/xen/pci.h | 6 ++++++
>>>> 4 files changed, 23 insertions(+)
>>>>
>>>> diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
>>>> index 72d4e3f193af..e9b30bc09139 100644
>>>> --- a/drivers/xen/pci.c
>>>> +++ b/drivers/xen/pci.c
>>>> @@ -177,6 +177,18 @@ static int xen_remove_device(struct device *dev)
>>>> return r;
>>>> }
>>>>
>>>> +int xen_reset_device_state(const struct pci_dev *dev)
>>>> +{
>>>> + struct physdev_pci_device device = {
>>>> + .seg = pci_domain_nr(dev->bus),
>>>> + .bus = dev->bus->number,
>>>> + .devfn = dev->devfn
>>>> + };
>>>> +
>>>> + return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_state_reset, &device);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(xen_reset_device_state);
>>>> +
>>>> static int xen_pci_notifier(struct notifier_block *nb,
>>>> unsigned long action, void *data)
>>>> {
>>>> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
>>>> index e34b623e4b41..5a96b6c66c07 100644
>>>> --- a/drivers/xen/xen-pciback/pci_stub.c
>>>> +++ b/drivers/xen/xen-pciback/pci_stub.c
>>>> @@ -421,6 +421,9 @@ static int pcistub_init_device(struct pci_dev *dev)
>>>> else {
>>>> dev_dbg(&dev->dev, "resetting (FLR, D3, etc) the device\n");
>>>> __pci_reset_function_locked(dev);
>>>> + err = xen_reset_device_state(dev);
>>>> + if (err)
>>>> + goto config_release;
>>>
>>> Older versions of Xen won't have the hypercall
>>> PHYSDEVOP_pci_device_state_reset implemented. I think we should do
>>> something like:
>>>
>>> if (err && xen_pvh_domain())
>>> goto config_release;
>>>
>>>
>>> Or even:
>>>
>>> if (xen_pvh_domain()) {
>>> err = xen_reset_device_state(dev);
>>> if (err)
>>> goto config_release;
>>> }
>>>
>>> depending on whether we want to call xen_reset_device_state also for PV
>>> guests or not. I am assuming we don't want to error out on failure such
>>> as -ENOENT for PV guests.
>> Yes, only for PVH dom0, I will add the condition in next version. Thank you!
>
> We will want to call xen_reset_device_state() for Arm dom0, too, so checking xen_pvh_domain() alone is not sufficient. I suggest instead to check !xen_pv_domain().
I am not using Arm. But is Arm dom0 not a PVH type dom0?

>
>>
>>>
>>>
>>>> pci_restore_state(dev);
>>>> }
>>>> /* Now disable the device (this also ensures some private device
>>>> diff --git a/include/xen/interface/physdev.h b/include/xen/interface/physdev.h
>>>> index a237af867873..231526f80f6c 100644
>>>> --- a/include/xen/interface/physdev.h
>>>> +++ b/include/xen/interface/physdev.h
>>>> @@ -263,6 +263,8 @@ struct physdev_pci_device {
>>>> uint8_t devfn;
>>>> };
>>>>
>>>> +#define PHYSDEVOP_pci_device_state_reset 32
>>>> +
>>>> #define PHYSDEVOP_DBGP_RESET_PREPARE 1
>>>> #define PHYSDEVOP_DBGP_RESET_DONE 2
>>>>
>>>> diff --git a/include/xen/pci.h b/include/xen/pci.h
>>>> index b8337cf85fd1..b2e2e856efd6 100644
>>>> --- a/include/xen/pci.h
>>>> +++ b/include/xen/pci.h
>>>> @@ -4,10 +4,16 @@
>>>> #define __XEN_PCI_H__
>>>>
>>>> #if defined(CONFIG_XEN_DOM0)
>>>> +int xen_reset_device_state(const struct pci_dev *dev);
>>>> int xen_find_device_domain_owner(struct pci_dev *dev);
>>>> int xen_register_device_domain_owner(struct pci_dev *dev, uint16_t domain);
>>>> int xen_unregister_device_domain_owner(struct pci_dev *dev);
>>>> #else
>>>> +static inline int xen_reset_device_state(const struct pci_dev *dev)
>>>> +{
>>>> + return -1;
>>>> +}
>>>> +
>>>> static inline int xen_find_device_domain_owner(struct pci_dev *dev)
>>>> {
>>>> return -1;
>>>> --
>>>> 2.34.1
>>>>
>>

--
Best regards,
Jiqian Chen.

2023-12-04 03:46:03

by Stewart Hildebrand

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 1/3] xen/pci: Add xen_reset_device_state function

On 12/3/23 22:25, Chen, Jiqian wrote:
> Hi Stewart,
>
> On 2023/11/30 23:03, Stewart Hildebrand wrote:
>> On 11/30/23 02:03, Chen, Jiqian wrote:
>>>
>>> On 2023/11/30 11:46, Stefano Stabellini wrote:
>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>>>>> When device on dom0 side has been reset, the vpci on Xen side
>>>>> won't get notification, so that the cached state in vpci is
>>>>> all out of date with the real device state.
>>>>> To solve that problem, this patch add a function to clear all
>>>>> vpci device state when device is reset on dom0 side.
>>>>>
>>>>> And call that function in pcistub_init_device. Because when
>>>>> we use "pci-assignable-add" to assign a passthrough device in
>>>>> Xen, it will reset passthrough device and the vpci state will
>>>>> out of date, and then device will fail to restore bar state.
>>>>>
>>>>> Signed-off-by: Jiqian Chen <[email protected]>
>>>>> Signed-off-by: Huang Rui <[email protected]>
>>>>> ---
>>>>> drivers/xen/pci.c | 12 ++++++++++++
>>>>> drivers/xen/xen-pciback/pci_stub.c | 3 +++
>>>>> include/xen/interface/physdev.h | 2 ++
>>>>> include/xen/pci.h | 6 ++++++
>>>>> 4 files changed, 23 insertions(+)
>>>>>
>>>>> diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
>>>>> index 72d4e3f193af..e9b30bc09139 100644
>>>>> --- a/drivers/xen/pci.c
>>>>> +++ b/drivers/xen/pci.c
>>>>> @@ -177,6 +177,18 @@ static int xen_remove_device(struct device *dev)
>>>>> return r;
>>>>> }
>>>>>
>>>>> +int xen_reset_device_state(const struct pci_dev *dev)
>>>>> +{
>>>>> + struct physdev_pci_device device = {
>>>>> + .seg = pci_domain_nr(dev->bus),
>>>>> + .bus = dev->bus->number,
>>>>> + .devfn = dev->devfn
>>>>> + };
>>>>> +
>>>>> + return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_state_reset, &device);
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(xen_reset_device_state);
>>>>> +
>>>>> static int xen_pci_notifier(struct notifier_block *nb,
>>>>> unsigned long action, void *data)
>>>>> {
>>>>> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
>>>>> index e34b623e4b41..5a96b6c66c07 100644
>>>>> --- a/drivers/xen/xen-pciback/pci_stub.c
>>>>> +++ b/drivers/xen/xen-pciback/pci_stub.c
>>>>> @@ -421,6 +421,9 @@ static int pcistub_init_device(struct pci_dev *dev)
>>>>> else {
>>>>> dev_dbg(&dev->dev, "resetting (FLR, D3, etc) the device\n");
>>>>> __pci_reset_function_locked(dev);
>>>>> + err = xen_reset_device_state(dev);
>>>>> + if (err)
>>>>> + goto config_release;
>>>>
>>>> Older versions of Xen won't have the hypercall
>>>> PHYSDEVOP_pci_device_state_reset implemented. I think we should do
>>>> something like:
>>>>
>>>> if (err && xen_pvh_domain())
>>>> goto config_release;
>>>>
>>>>
>>>> Or even:
>>>>
>>>> if (xen_pvh_domain()) {
>>>> err = xen_reset_device_state(dev);
>>>> if (err)
>>>> goto config_release;
>>>> }
>>>>
>>>> depending on whether we want to call xen_reset_device_state also for PV
>>>> guests or not. I am assuming we don't want to error out on failure such
>>>> as -ENOENT for PV guests.
>>> Yes, only for PVH dom0, I will add the condition in next version. Thank you!
>>
>> We will want to call xen_reset_device_state() for Arm dom0, too, so checking xen_pvh_domain() alone is not sufficient. I suggest instead to check !xen_pv_domain().
> I am not using Arm. But is Arm dom0 not a PVH type dom0?

Apparently not, Arm dom0 appears to be a xen_hvm_domain() from the kernel's perspective.

>
>>
>>>
>>>>
>>>>
>>>>> pci_restore_state(dev);
>>>>> }
>>>>> /* Now disable the device (this also ensures some private device
>>>>> diff --git a/include/xen/interface/physdev.h b/include/xen/interface/physdev.h
>>>>> index a237af867873..231526f80f6c 100644
>>>>> --- a/include/xen/interface/physdev.h
>>>>> +++ b/include/xen/interface/physdev.h
>>>>> @@ -263,6 +263,8 @@ struct physdev_pci_device {
>>>>> uint8_t devfn;
>>>>> };
>>>>>
>>>>> +#define PHYSDEVOP_pci_device_state_reset 32
>>>>> +
>>>>> #define PHYSDEVOP_DBGP_RESET_PREPARE 1
>>>>> #define PHYSDEVOP_DBGP_RESET_DONE 2
>>>>>
>>>>> diff --git a/include/xen/pci.h b/include/xen/pci.h
>>>>> index b8337cf85fd1..b2e2e856efd6 100644
>>>>> --- a/include/xen/pci.h
>>>>> +++ b/include/xen/pci.h
>>>>> @@ -4,10 +4,16 @@
>>>>> #define __XEN_PCI_H__
>>>>>
>>>>> #if defined(CONFIG_XEN_DOM0)
>>>>> +int xen_reset_device_state(const struct pci_dev *dev);
>>>>> int xen_find_device_domain_owner(struct pci_dev *dev);
>>>>> int xen_register_device_domain_owner(struct pci_dev *dev, uint16_t domain);
>>>>> int xen_unregister_device_domain_owner(struct pci_dev *dev);
>>>>> #else
>>>>> +static inline int xen_reset_device_state(const struct pci_dev *dev)
>>>>> +{
>>>>> + return -1;
>>>>> +}
>>>>> +
>>>>> static inline int xen_find_device_domain_owner(struct pci_dev *dev)
>>>>> {
>>>>> return -1;
>>>>> --
>>>>> 2.34.1
>>>>>
>>>
>

2023-12-04 07:58:37

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 1/3] xen/pci: Add xen_reset_device_state function

On Fri, Nov 24 2023 at 18:31, Jiqian Chen wrote:
> When device on dom0 side has been reset, the vpci on Xen side
> won't get notification, so that the cached state in vpci is
> all out of date with the real device state.
> To solve that problem, this patch add a function to clear all

Please get rid of 'this patch' all over the series.

# grep -r 'This patch' Documentation/process/

> vpci device state when device is reset on dom0 side.
>
> And call that function in pcistub_init_device. Because when
> we use "pci-assignable-add" to assign a passthrough device in
> Xen, it will reset passthrough device and the vpci state will
> out of date, and then device will fail to restore bar state.
>
> Signed-off-by: Jiqian Chen <[email protected]>
> Signed-off-by: Huang Rui <[email protected]>

This Signed-off-by chain is incorrect.

Documentation/process/submitting-patches.rst has a full chapter about
S-O-B and the correct usage.

Thanks,

tglx

2023-12-04 08:14:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On Fri, Nov 24 2023 at 18:31, Jiqian Chen wrote:
> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
> index 5a96b6c66c07..b83d02bcc76c 100644
> --- a/drivers/xen/xen-pciback/pci_stub.c
> +++ b/drivers/xen/xen-pciback/pci_stub.c
> @@ -357,6 +357,7 @@ static int pcistub_match(struct pci_dev *dev)
> static int pcistub_init_device(struct pci_dev *dev)
> {
> struct xen_pcibk_dev_data *dev_data;
> + struct irq_desc *desc = NULL;
> int err = 0;
>
> dev_dbg(&dev->dev, "initializing...\n");
> @@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
> if (err)
> goto config_release;
>
> + if (xen_initial_domain() && xen_pvh_domain()) {
> + if (dev->irq <= 0 || !(desc = irq_to_desc(dev->irq)))

Driver code has absolutely no business to access irq_desc.

> + goto config_release;
> + unmask_irq(desc);

Or to invoke any internal function.

> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -439,6 +439,7 @@ void unmask_irq(struct irq_desc *desc)
> irq_state_clr_masked(desc);
> }
> }
> +EXPORT_SYMBOL_GPL(unmask_irq);

Not going to happen.

> --- a/kernel/irq/irqdesc.c
> +++ b/kernel/irq/irqdesc.c
> @@ -380,7 +380,7 @@ struct irq_desc *irq_to_desc(unsigned int irq)
> {
> return mtree_load(&sparse_irqs, irq);
> }
> -#ifdef CONFIG_KVM_BOOK3S_64_HV_MODULE
> +#if defined CONFIG_KVM_BOOK3S_64_HV_MODULE || defined CONFIG_XEN_PVH

Neither that.

This all smells badly like a XEN internal issue and we are not going to
hack around it by exposing interrupt internals.

Thanks,

tglx

2023-12-04 08:50:54

by Jiqian Chen

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 1/3] xen/pci: Add xen_reset_device_state function


On 2023/12/4 15:58, Thomas Gleixner wrote:
> On Fri, Nov 24 2023 at 18:31, Jiqian Chen wrote:
>> When device on dom0 side has been reset, the vpci on Xen side
>> won't get notification, so that the cached state in vpci is
>> all out of date with the real device state.
>> To solve that problem, this patch add a function to clear all
>
> Please get rid of 'this patch' all over the series.
>
> # grep -r 'This patch' Documentation/process/
Thanks. I will remove "this patch" or "I/we" in next version.

>
>> vpci device state when device is reset on dom0 side.
>>
>> And call that function in pcistub_init_device. Because when
>> we use "pci-assignable-add" to assign a passthrough device in
>> Xen, it will reset passthrough device and the vpci state will
>> out of date, and then device will fail to restore bar state.
>>
>> Signed-off-by: Jiqian Chen <[email protected]>
>> Signed-off-by: Huang Rui <[email protected]>
>
> This Signed-off-by chain is incorrect.
>
> Documentation/process/submitting-patches.rst has a full chapter about
> S-O-B and the correct usage.
I am the author of this series of patches, and Huang Rui transported the v1 to upstream. And now I transport v2. I am not aware that the SOB chain is incorrect.
Do you have any suggestions?

>
> Thanks,
>
> tglx

--
Best regards,
Jiqian Chen.

2023-12-04 10:28:32

by Roger Pau Monné

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> > On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
> > > On Thu, 30 Nov 2023, Roger Pau Monné wrote:
> > > > On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> > > > > On Fri, 24 Nov 2023, Jiqian Chen wrote:
> > > > > > This patch is to solve two problems we encountered when we try to
> > > > > > passthrough a device to hvm domU base on Xen PVH dom0.
> > > > > >
> > > > > > First, hvm guest will alloc a pirq and irq for a passthrough device
> > > > > > by using gsi, before that, the gsi must first has a mapping in dom0,
> > > > > > see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> > > > > > into Xen and check whether dom0 has the mapping. See
> > > > > > XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> > > > > > dom0 and it return irq is 0, and then return -EPERM.
> > > > > > This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> > > > > > when thay are enabled.
> > > > > >
> > > > > > Second, in PVH dom0, the gsi of a passthrough device doesn't get
> > > > > > registered, but gsi must be configured for it to be able to be
> > > > > > mapped into a domU.
> > > > > >
> > > > > > After searching codes, we can find map_pirq and register_gsi will be
> > > > > > done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> > > > > > the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> > > > > > can be conclude to that the gsi of a passthrough device doesn't be
> > > > > > unmasked.
> > > > > >
> > > > > > To solve the unmaske problem, this patch call the unmask_irq when we
> > > > > > assign a device to be passthrough. So that the gsi can get registered
> > > > > > and mapped in PVH dom0.
> > > > >
> > > > >
> > > > > Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> > > > > we need the unmask check in Xen? Couldn't we just do:
> > > > >
> > > > >
> > > > > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > > > > index 4e40d3609a..df262a4a18 100644
> > > > > --- a/xen/arch/x86/hvm/vioapic.c
> > > > > +++ b/xen/arch/x86/hvm/vioapic.c
> > > > > @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> > > > > hvm_dpci_eoi(d, gsi);
> > > > > }
> > > > >
> > > > > - if ( is_hardware_domain(d) && unmasked )
> > > > > + if ( is_hardware_domain(d) )
> > > > > {
> > > > > /*
> > > > > * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
> > > >
> > > > There are some issues with this approach.
> > > >
> > > > mp_register_gsi() will only setup the trigger and polarity of the
> > > > IO-APIC pin once, so we do so once the guest unmask the pin in order
> > > > to assert that the configuration is the intended one. A guest is
> > > > allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
> > > > that doesn't take effect unless the pin is unmasked.
> > > >
> > > > Overall the question would be whether we have any guarantees that
> > > > the hardware domain has properly configured the pin, even if it's not
> > > > using it itself (as it hasn't been unmasked).
> > > >
> > > > IIRC PCI legacy interrupts are level triggered and low polarity, so we
> > > > could configure any pins that are not setup at bind time?
> > >
> > > That could work.
> > >
> > > Another idea is to move only the call to allocate_and_map_gsi_pirq at
> > > bind time? That might be enough to pass a pirq_access_permitted check.
> >
> > Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
> > just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
> > parameter would be a GSI instead of a previously mapped IRQ). Such
> > difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
> > route I would recommend that we instead introduce a new dmop that has
> > this syntax regardless of the domain type it's called from.
>
> Looking at the code it is certainly a bit confusing. My point was that
> we don't need to wait until polarity and trigger are set appropriately
> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
> should be able to figure out that Dom0 is permitted pirq access.

The logic is certainly not straightforward, and it could benefit from
some comments.

The irq permissions are a bit special, in that they get setup when the
IRQ is mapped.

The problem however is not so much with IRQ permissions, that we can
indeed sort out internally in Xen. Such check in dom0 has the side
effect of preventing the IRQ from being assigned to a domU without the
hardware source being properly configured AFAICT.

>
> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
> polarity to be configured to work. But the suggestion of doing it a
> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
>
> But maybe we can find another location, maybe within
> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
> trigger and polarity are set and before the interrupt is unmasked.
>
> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
> call to allocate_and_map_gsi_pirq, because by the time
> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
> already been done.

But then we would end up in a situation where the
pirq_access_permitted() check will pass, but the IO-APIC pin won't be
configured, which I think it's not what we want.

One option would be to allow mp_register_gsi() to be called multiple
times, and update the IO-APIC pin configuration as long as the pin is
not unmasked. That would propagate each dom0 RTE update to the
underlying IO-APIC. However such approach relies on dom0 configuring
all possible IO-APIC pins, even if no device on dom0 is using them, I
think it's not a very reliable option.

Another option would be to modify the toolstack to setup the GSI
itself using the PHYSDEVOP_setup_gsi hypercall. As said in a previous
email, since we only care about PCI device passthrough the legacy INTx
should always be level triggered and low polarity.

> I am not familiar with vioapic.c but to give you an idea of what I was
> thinking:
>
>
> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> index 4e40d3609a..16d56fe851 100644
> --- a/xen/arch/x86/hvm/vioapic.c
> +++ b/xen/arch/x86/hvm/vioapic.c
> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
> return ret;
> }
>
> - ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> - if ( ret )
> - {
> - gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> - gsi, ret);
> - return ret;
> - }
> -
> pcidevs_lock();
> ret = pt_irq_create_bind(currd, &pt_irq_bind);
> if ( ret )
> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
> hvm_dpci_eoi(d, gsi);
> }
>
> + if ( is_hardware_domain(d) )
> + {
> + int pirq = gsi, ret;
> + ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> + if ( ret )
> + {
> + gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> + gsi, ret);
> + return ret;
> + }
> + }
> if ( is_hardware_domain(d) && unmasked )
> {
> /*

As said above, such approach relies on dom0 writing to the IO-APIC RTE
of likely each IO-APIC pin, which is IMO not quite reliable. In there
are two different issues here that need to be fixed for PVH dom0:

- Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
succeed for a PVH dom0, even if dom0 is not using the GSI itself.

- Configure IO-APIC pins for PCI interrupts even if dom0 is not using
the IO-APIC pin itself.

First one needs to be fixed internally in Xen, second one will require
the toolstack to issue an extra hypercall in order to ensure the
IO-APIC pin is properly configured.

Thanks, Roger.

2023-12-04 21:31:43

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 1/3] xen/pci: Add xen_reset_device_state function

On Mon, 3 Dec 2023, Chen, Jiqian wrote:
> >> vpci device state when device is reset on dom0 side.
> >>
> >> And call that function in pcistub_init_device. Because when
> >> we use "pci-assignable-add" to assign a passthrough device in
> >> Xen, it will reset passthrough device and the vpci state will
> >> out of date, and then device will fail to restore bar state.
> >>
> >> Signed-off-by: Jiqian Chen <[email protected]>
> >> Signed-off-by: Huang Rui <[email protected]>
> >
> > This Signed-off-by chain is incorrect.
> >
> > Documentation/process/submitting-patches.rst has a full chapter about
> > S-O-B and the correct usage.
> I am the author of this series of patches, and Huang Rui transported the v1 to upstream. And now I transport v2. I am not aware that the SOB chain is incorrect.
> Do you have any suggestions?

I think he means that your Signed-off-by should be the second one of the
two as you are the one submitting the patch to the LKML

2023-12-04 22:20:05

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On Mon, 4 Dec 2023, Roger Pau Monné wrote:
> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
> > On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> > > On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
> > > > On Thu, 30 Nov 2023, Roger Pau Monné wrote:
> > > > > On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> > > > > > On Fri, 24 Nov 2023, Jiqian Chen wrote:
> > > > > > > This patch is to solve two problems we encountered when we try to
> > > > > > > passthrough a device to hvm domU base on Xen PVH dom0.
> > > > > > >
> > > > > > > First, hvm guest will alloc a pirq and irq for a passthrough device
> > > > > > > by using gsi, before that, the gsi must first has a mapping in dom0,
> > > > > > > see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> > > > > > > into Xen and check whether dom0 has the mapping. See
> > > > > > > XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> > > > > > > dom0 and it return irq is 0, and then return -EPERM.
> > > > > > > This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> > > > > > > when thay are enabled.
> > > > > > >
> > > > > > > Second, in PVH dom0, the gsi of a passthrough device doesn't get
> > > > > > > registered, but gsi must be configured for it to be able to be
> > > > > > > mapped into a domU.
> > > > > > >
> > > > > > > After searching codes, we can find map_pirq and register_gsi will be
> > > > > > > done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> > > > > > > the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> > > > > > > can be conclude to that the gsi of a passthrough device doesn't be
> > > > > > > unmasked.
> > > > > > >
> > > > > > > To solve the unmaske problem, this patch call the unmask_irq when we
> > > > > > > assign a device to be passthrough. So that the gsi can get registered
> > > > > > > and mapped in PVH dom0.
> > > > > >
> > > > > >
> > > > > > Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> > > > > > we need the unmask check in Xen? Couldn't we just do:
> > > > > >
> > > > > >
> > > > > > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > > > > > index 4e40d3609a..df262a4a18 100644
> > > > > > --- a/xen/arch/x86/hvm/vioapic.c
> > > > > > +++ b/xen/arch/x86/hvm/vioapic.c
> > > > > > @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> > > > > > hvm_dpci_eoi(d, gsi);
> > > > > > }
> > > > > >
> > > > > > - if ( is_hardware_domain(d) && unmasked )
> > > > > > + if ( is_hardware_domain(d) )
> > > > > > {
> > > > > > /*
> > > > > > * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
> > > > >
> > > > > There are some issues with this approach.
> > > > >
> > > > > mp_register_gsi() will only setup the trigger and polarity of the
> > > > > IO-APIC pin once, so we do so once the guest unmask the pin in order
> > > > > to assert that the configuration is the intended one. A guest is
> > > > > allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
> > > > > that doesn't take effect unless the pin is unmasked.
> > > > >
> > > > > Overall the question would be whether we have any guarantees that
> > > > > the hardware domain has properly configured the pin, even if it's not
> > > > > using it itself (as it hasn't been unmasked).
> > > > >
> > > > > IIRC PCI legacy interrupts are level triggered and low polarity, so we
> > > > > could configure any pins that are not setup at bind time?
> > > >
> > > > That could work.
> > > >
> > > > Another idea is to move only the call to allocate_and_map_gsi_pirq at
> > > > bind time? That might be enough to pass a pirq_access_permitted check.
> > >
> > > Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
> > > just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
> > > parameter would be a GSI instead of a previously mapped IRQ). Such
> > > difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
> > > route I would recommend that we instead introduce a new dmop that has
> > > this syntax regardless of the domain type it's called from.
> >
> > Looking at the code it is certainly a bit confusing. My point was that
> > we don't need to wait until polarity and trigger are set appropriately
> > to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
> > should be able to figure out that Dom0 is permitted pirq access.
>
> The logic is certainly not straightforward, and it could benefit from
> some comments.
>
> The irq permissions are a bit special, in that they get setup when the
> IRQ is mapped.
>
> The problem however is not so much with IRQ permissions, that we can
> indeed sort out internally in Xen. Such check in dom0 has the side
> effect of preventing the IRQ from being assigned to a domU without the
> hardware source being properly configured AFAICT.

Now I understand why you made a comment previously about Xen having to
configure trigger and polarity for these interrupts on its own.


> > So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
> > somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
> > polarity to be configured to work. But the suggestion of doing it a
> > "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
> >
> > But maybe we can find another location, maybe within
> > xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
> > trigger and polarity are set and before the interrupt is unmasked.
> >
> > Then we change the implementation of vioapic_hwdom_map_gsi to skip the
> > call to allocate_and_map_gsi_pirq, because by the time
> > vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
> > already been done.
>
> But then we would end up in a situation where the
> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
> configured, which I think it's not what we want.
>
> One option would be to allow mp_register_gsi() to be called multiple
> times, and update the IO-APIC pin configuration as long as the pin is
> not unmasked. That would propagate each dom0 RTE update to the
> underlying IO-APIC. However such approach relies on dom0 configuring
> all possible IO-APIC pins, even if no device on dom0 is using them, I
> think it's not a very reliable option.
>
> Another option would be to modify the toolstack to setup the GSI
> itself using the PHYSDEVOP_setup_gsi hypercall. As said in a previous
> email, since we only care about PCI device passthrough the legacy INTx
> should always be level triggered and low polarity.
>
> > I am not familiar with vioapic.c but to give you an idea of what I was
> > thinking:
> >
> >
> > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > index 4e40d3609a..16d56fe851 100644
> > --- a/xen/arch/x86/hvm/vioapic.c
> > +++ b/xen/arch/x86/hvm/vioapic.c
> > @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
> > return ret;
> > }
> >
> > - ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> > - if ( ret )
> > - {
> > - gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> > - gsi, ret);
> > - return ret;
> > - }
> > -
> > pcidevs_lock();
> > ret = pt_irq_create_bind(currd, &pt_irq_bind);
> > if ( ret )
> > @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
> > hvm_dpci_eoi(d, gsi);
> > }
> >
> > + if ( is_hardware_domain(d) )
> > + {
> > + int pirq = gsi, ret;
> > + ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> > + if ( ret )
> > + {
> > + gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> > + gsi, ret);
> > + return ret;
> > + }
> > + }
> > if ( is_hardware_domain(d) && unmasked )
> > {
> > /*
>
> As said above, such approach relies on dom0 writing to the IO-APIC RTE
> of likely each IO-APIC pin, which is IMO not quite reliable. In there
> are two different issues here that need to be fixed for PVH dom0:
>
> - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
> succeed for a PVH dom0, even if dom0 is not using the GSI itself.

Yes makes sense


> - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
> the IO-APIC pin itself.
>
> First one needs to be fixed internally in Xen, second one will require
> the toolstack to issue an extra hypercall in order to ensure the
> IO-APIC pin is properly configured.

On ARM, Xen doesn't need to wait for dom0 to configure interrupts
correctly. Xen configures them all on its own at boot based on Device
Tree information. I guess it is not possible to do the same on x86? If
not, then I can see why we would need 1 extra toolstack hypercall for
that (or to bundle the operation of configuring IO-APIC pins together
with an existing toolstack hypercall).

2023-12-05 06:47:17

by Jiqian Chen

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On 2023/12/4 18:28, Roger Pau Monné wrote:
> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>>>>>>> This patch is to solve two problems we encountered when we try to
>>>>>>> passthrough a device to hvm domU base on Xen PVH dom0.
>>>>>>>
>>>>>>> First, hvm guest will alloc a pirq and irq for a passthrough device
>>>>>>> by using gsi, before that, the gsi must first has a mapping in dom0,
>>>>>>> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
>>>>>>> into Xen and check whether dom0 has the mapping. See
>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
>>>>>>> dom0 and it return irq is 0, and then return -EPERM.
>>>>>>> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
>>>>>>> when thay are enabled.
>>>>>>>
>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get
>>>>>>> registered, but gsi must be configured for it to be able to be
>>>>>>> mapped into a domU.
>>>>>>>
>>>>>>> After searching codes, we can find map_pirq and register_gsi will be
>>>>>>> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
>>>>>>> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
>>>>>>> can be conclude to that the gsi of a passthrough device doesn't be
>>>>>>> unmasked.
>>>>>>>
>>>>>>> To solve the unmaske problem, this patch call the unmask_irq when we
>>>>>>> assign a device to be passthrough. So that the gsi can get registered
>>>>>>> and mapped in PVH dom0.
>>>>>>
>>>>>>
>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
>>>>>> we need the unmask check in Xen? Couldn't we just do:
>>>>>>
>>>>>>
>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>>>> index 4e40d3609a..df262a4a18 100644
>>>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
>>>>>> hvm_dpci_eoi(d, gsi);
>>>>>> }
>>>>>>
>>>>>> - if ( is_hardware_domain(d) && unmasked )
>>>>>> + if ( is_hardware_domain(d) )
>>>>>> {
>>>>>> /*
>>>>>> * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
>>>>>
>>>>> There are some issues with this approach.
>>>>>
>>>>> mp_register_gsi() will only setup the trigger and polarity of the
>>>>> IO-APIC pin once, so we do so once the guest unmask the pin in order
>>>>> to assert that the configuration is the intended one. A guest is
>>>>> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
>>>>> that doesn't take effect unless the pin is unmasked.
>>>>>
>>>>> Overall the question would be whether we have any guarantees that
>>>>> the hardware domain has properly configured the pin, even if it's not
>>>>> using it itself (as it hasn't been unmasked).
>>>>>
>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we
>>>>> could configure any pins that are not setup at bind time?
>>>>
>>>> That could work.
>>>>
>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at
>>>> bind time? That might be enough to pass a pirq_access_permitted check.
>>>
>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
>>> parameter would be a GSI instead of a previously mapped IRQ). Such
>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
>>> route I would recommend that we instead introduce a new dmop that has
>>> this syntax regardless of the domain type it's called from.
>>
>> Looking at the code it is certainly a bit confusing. My point was that
>> we don't need to wait until polarity and trigger are set appropriately
>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
>> should be able to figure out that Dom0 is permitted pirq access.
>
> The logic is certainly not straightforward, and it could benefit from
> some comments.
>
> The irq permissions are a bit special, in that they get setup when the
> IRQ is mapped.
>
> The problem however is not so much with IRQ permissions, that we can
> indeed sort out internally in Xen. Such check in dom0 has the side
> effect of preventing the IRQ from being assigned to a domU without the
> hardware source being properly configured AFAICT.
>
>>
>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
>> polarity to be configured to work. But the suggestion of doing it a
>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
>>
>> But maybe we can find another location, maybe within
>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
>> trigger and polarity are set and before the interrupt is unmasked.
>>
>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
>> call to allocate_and_map_gsi_pirq, because by the time
>> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
>> already been done.
>
> But then we would end up in a situation where the
> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
> configured, which I think it's not what we want.
>
> One option would be to allow mp_register_gsi() to be called multiple
> times, and update the IO-APIC pin configuration as long as the pin is
> not unmasked. That would propagate each dom0 RTE update to the
> underlying IO-APIC. However such approach relies on dom0 configuring
> all possible IO-APIC pins, even if no device on dom0 is using them, I
> think it's not a very reliable option.
>
> Another option would be to modify the toolstack to setup the GSI
> itself using the PHYSDEVOP_setup_gsi hypercall. As said in a previous
> email, since we only care about PCI device passthrough the legacy INTx
> should always be level triggered and low polarity.

I am thinking if we can do PHYSDEVOP_map_pirq and PHYSDEVOP_setup_gsi only for passthrough devices(in function pcistub_init_device).
Then it can pass permission check and setup gsi without affecting other devices that do not use IO-APIC pins.
What do you think?

>
>> I am not familiar with vioapic.c but to give you an idea of what I was
>> thinking:
>>
>>
>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>> index 4e40d3609a..16d56fe851 100644
>> --- a/xen/arch/x86/hvm/vioapic.c
>> +++ b/xen/arch/x86/hvm/vioapic.c
>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>> return ret;
>> }
>>
>> - ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>> - if ( ret )
>> - {
>> - gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>> - gsi, ret);
>> - return ret;
>> - }
>> -
>> pcidevs_lock();
>> ret = pt_irq_create_bind(currd, &pt_irq_bind);
>> if ( ret )
>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
>> hvm_dpci_eoi(d, gsi);
>> }
>>
>> + if ( is_hardware_domain(d) )
>> + {
>> + int pirq = gsi, ret;
>> + ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>> + if ( ret )
>> + {
>> + gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>> + gsi, ret);
>> + return ret;
>> + }
>> + }
>> if ( is_hardware_domain(d) && unmasked )
>> {
>> /*
>
> As said above, such approach relies on dom0 writing to the IO-APIC RTE
> of likely each IO-APIC pin, which is IMO not quite reliable. In there
> are two different issues here that need to be fixed for PVH dom0:
>
> - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
> succeed for a PVH dom0, even if dom0 is not using the GSI itself.
>
> - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
> the IO-APIC pin itself.
>
> First one needs to be fixed internally in Xen, second one will require
> the toolstack to issue an extra hypercall in order to ensure the
> IO-APIC pin is properly configured.
>
> Thanks, Roger.

--
Best regards,
Jiqian Chen.

2023-12-05 06:51:20

by Jiqian Chen

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 1/3] xen/pci: Add xen_reset_device_state function

On 2023/12/5 05:31, Stefano Stabellini wrote:
> On Mon, 3 Dec 2023, Chen, Jiqian wrote:
>>>> vpci device state when device is reset on dom0 side.
>>>>
>>>> And call that function in pcistub_init_device. Because when
>>>> we use "pci-assignable-add" to assign a passthrough device in
>>>> Xen, it will reset passthrough device and the vpci state will
>>>> out of date, and then device will fail to restore bar state.
>>>>
>>>> Signed-off-by: Jiqian Chen <[email protected]>
>>>> Signed-off-by: Huang Rui <[email protected]>
>>>
>>> This Signed-off-by chain is incorrect.
>>>
>>> Documentation/process/submitting-patches.rst has a full chapter about
>>> S-O-B and the correct usage.
>> I am the author of this series of patches, and Huang Rui transported the v1 to upstream. And now I transport v2. I am not aware that the SOB chain is incorrect.
>> Do you have any suggestions?
>
> I think he means that your Signed-off-by should be the second one of the
> two as you are the one submitting the patch to the LKML
Got it. Thanks Stefano! I will adjust the sequence in next version.

--
Best regards,
Jiqian Chen.

2023-12-05 07:03:49

by Jiqian Chen

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

Hi Thomas Gleixner,

Thank you for review, and you are right, it seems more like a XEN internal issue. We are discussing it and maybe will fix it in Xen code next version.

On 2023/12/4 16:13, Thomas Gleixner wrote:
> On Fri, Nov 24 2023 at 18:31, Jiqian Chen wrote:
>> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
>> index 5a96b6c66c07..b83d02bcc76c 100644
>> --- a/drivers/xen/xen-pciback/pci_stub.c
>> +++ b/drivers/xen/xen-pciback/pci_stub.c
>> @@ -357,6 +357,7 @@ static int pcistub_match(struct pci_dev *dev)
>> static int pcistub_init_device(struct pci_dev *dev)
>> {
>> struct xen_pcibk_dev_data *dev_data;
>> + struct irq_desc *desc = NULL;
>> int err = 0;
>>
>> dev_dbg(&dev->dev, "initializing...\n");
>> @@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
>> if (err)
>> goto config_release;
>>
>> + if (xen_initial_domain() && xen_pvh_domain()) {
>> + if (dev->irq <= 0 || !(desc = irq_to_desc(dev->irq)))
>
> Driver code has absolutely no business to access irq_desc.
>
>> + goto config_release;
>> + unmask_irq(desc);
>
> Or to invoke any internal function.
>
>> --- a/kernel/irq/chip.c
>> +++ b/kernel/irq/chip.c
>> @@ -439,6 +439,7 @@ void unmask_irq(struct irq_desc *desc)
>> irq_state_clr_masked(desc);
>> }
>> }
>> +EXPORT_SYMBOL_GPL(unmask_irq);
>
> Not going to happen.
>
>> --- a/kernel/irq/irqdesc.c
>> +++ b/kernel/irq/irqdesc.c
>> @@ -380,7 +380,7 @@ struct irq_desc *irq_to_desc(unsigned int irq)
>> {
>> return mtree_load(&sparse_irqs, irq);
>> }
>> -#ifdef CONFIG_KVM_BOOK3S_64_HV_MODULE
>> +#if defined CONFIG_KVM_BOOK3S_64_HV_MODULE || defined CONFIG_XEN_PVH
>
> Neither that.
>
> This all smells badly like a XEN internal issue and we are not going to
> hack around it by exposing interrupt internals.
>
> Thanks,
>
> tglx

--
Best regards,
Jiqian Chen.

2023-12-05 09:19:32

by Roger Pau Monné

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote:
> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
> > On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
> > > On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> > > > On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
> > > > > On Thu, 30 Nov 2023, Roger Pau Monné wrote:
> > > > > > On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> > > > > > > On Fri, 24 Nov 2023, Jiqian Chen wrote:
> > > > > > > > This patch is to solve two problems we encountered when we try to
> > > > > > > > passthrough a device to hvm domU base on Xen PVH dom0.
> > > > > > > >
> > > > > > > > First, hvm guest will alloc a pirq and irq for a passthrough device
> > > > > > > > by using gsi, before that, the gsi must first has a mapping in dom0,
> > > > > > > > see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> > > > > > > > into Xen and check whether dom0 has the mapping. See
> > > > > > > > XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> > > > > > > > dom0 and it return irq is 0, and then return -EPERM.
> > > > > > > > This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> > > > > > > > when thay are enabled.
> > > > > > > >
> > > > > > > > Second, in PVH dom0, the gsi of a passthrough device doesn't get
> > > > > > > > registered, but gsi must be configured for it to be able to be
> > > > > > > > mapped into a domU.
> > > > > > > >
> > > > > > > > After searching codes, we can find map_pirq and register_gsi will be
> > > > > > > > done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> > > > > > > > the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> > > > > > > > can be conclude to that the gsi of a passthrough device doesn't be
> > > > > > > > unmasked.
> > > > > > > >
> > > > > > > > To solve the unmaske problem, this patch call the unmask_irq when we
> > > > > > > > assign a device to be passthrough. So that the gsi can get registered
> > > > > > > > and mapped in PVH dom0.
> > > > > > >
> > > > > > >
> > > > > > > Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> > > > > > > we need the unmask check in Xen? Couldn't we just do:
> > > > > > >
> > > > > > >
> > > > > > > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > > > > > > index 4e40d3609a..df262a4a18 100644
> > > > > > > --- a/xen/arch/x86/hvm/vioapic.c
> > > > > > > +++ b/xen/arch/x86/hvm/vioapic.c
> > > > > > > @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> > > > > > > hvm_dpci_eoi(d, gsi);
> > > > > > > }
> > > > > > >
> > > > > > > - if ( is_hardware_domain(d) && unmasked )
> > > > > > > + if ( is_hardware_domain(d) )
> > > > > > > {
> > > > > > > /*
> > > > > > > * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
> > > > > >
> > > > > > There are some issues with this approach.
> > > > > >
> > > > > > mp_register_gsi() will only setup the trigger and polarity of the
> > > > > > IO-APIC pin once, so we do so once the guest unmask the pin in order
> > > > > > to assert that the configuration is the intended one. A guest is
> > > > > > allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
> > > > > > that doesn't take effect unless the pin is unmasked.
> > > > > >
> > > > > > Overall the question would be whether we have any guarantees that
> > > > > > the hardware domain has properly configured the pin, even if it's not
> > > > > > using it itself (as it hasn't been unmasked).
> > > > > >
> > > > > > IIRC PCI legacy interrupts are level triggered and low polarity, so we
> > > > > > could configure any pins that are not setup at bind time?
> > > > >
> > > > > That could work.
> > > > >
> > > > > Another idea is to move only the call to allocate_and_map_gsi_pirq at
> > > > > bind time? That might be enough to pass a pirq_access_permitted check.
> > > >
> > > > Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
> > > > just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
> > > > parameter would be a GSI instead of a previously mapped IRQ). Such
> > > > difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
> > > > route I would recommend that we instead introduce a new dmop that has
> > > > this syntax regardless of the domain type it's called from.
> > >
> > > Looking at the code it is certainly a bit confusing. My point was that
> > > we don't need to wait until polarity and trigger are set appropriately
> > > to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
> > > should be able to figure out that Dom0 is permitted pirq access.
> >
> > The logic is certainly not straightforward, and it could benefit from
> > some comments.
> >
> > The irq permissions are a bit special, in that they get setup when the
> > IRQ is mapped.
> >
> > The problem however is not so much with IRQ permissions, that we can
> > indeed sort out internally in Xen. Such check in dom0 has the side
> > effect of preventing the IRQ from being assigned to a domU without the
> > hardware source being properly configured AFAICT.
>
> Now I understand why you made a comment previously about Xen having to
> configure trigger and polarity for these interrupts on its own.
>
>
> > > So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
> > > somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
> > > polarity to be configured to work. But the suggestion of doing it a
> > > "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
> > >
> > > But maybe we can find another location, maybe within
> > > xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
> > > trigger and polarity are set and before the interrupt is unmasked.
> > >
> > > Then we change the implementation of vioapic_hwdom_map_gsi to skip the
> > > call to allocate_and_map_gsi_pirq, because by the time
> > > vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
> > > already been done.
> >
> > But then we would end up in a situation where the
> > pirq_access_permitted() check will pass, but the IO-APIC pin won't be
> > configured, which I think it's not what we want.
> >
> > One option would be to allow mp_register_gsi() to be called multiple
> > times, and update the IO-APIC pin configuration as long as the pin is
> > not unmasked. That would propagate each dom0 RTE update to the
> > underlying IO-APIC. However such approach relies on dom0 configuring
> > all possible IO-APIC pins, even if no device on dom0 is using them, I
> > think it's not a very reliable option.
> >
> > Another option would be to modify the toolstack to setup the GSI
> > itself using the PHYSDEVOP_setup_gsi hypercall. As said in a previous
> > email, since we only care about PCI device passthrough the legacy INTx
> > should always be level triggered and low polarity.
> >
> > > I am not familiar with vioapic.c but to give you an idea of what I was
> > > thinking:
> > >
> > >
> > > diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> > > index 4e40d3609a..16d56fe851 100644
> > > --- a/xen/arch/x86/hvm/vioapic.c
> > > +++ b/xen/arch/x86/hvm/vioapic.c
> > > @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
> > > return ret;
> > > }
> > >
> > > - ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> > > - if ( ret )
> > > - {
> > > - gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> > > - gsi, ret);
> > > - return ret;
> > > - }
> > > -
> > > pcidevs_lock();
> > > ret = pt_irq_create_bind(currd, &pt_irq_bind);
> > > if ( ret )
> > > @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
> > > hvm_dpci_eoi(d, gsi);
> > > }
> > >
> > > + if ( is_hardware_domain(d) )
> > > + {
> > > + int pirq = gsi, ret;
> > > + ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> > > + if ( ret )
> > > + {
> > > + gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> > > + gsi, ret);
> > > + return ret;
> > > + }
> > > + }
> > > if ( is_hardware_domain(d) && unmasked )
> > > {
> > > /*
> >
> > As said above, such approach relies on dom0 writing to the IO-APIC RTE
> > of likely each IO-APIC pin, which is IMO not quite reliable. In there
> > are two different issues here that need to be fixed for PVH dom0:
> >
> > - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
> > succeed for a PVH dom0, even if dom0 is not using the GSI itself.
>
> Yes makes sense
>
>
> > - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
> > the IO-APIC pin itself.
> >
> > First one needs to be fixed internally in Xen, second one will require
> > the toolstack to issue an extra hypercall in order to ensure the
> > IO-APIC pin is properly configured.
>
> On ARM, Xen doesn't need to wait for dom0 to configure interrupts
> correctly. Xen configures them all on its own at boot based on Device
> Tree information. I guess it is not possible to do the same on x86?

No, not exactly. There's some interrupt information in the ACPI MADT,
but that's just for very specific sources (Interrupt Source Override
Structures)

Then on AML devices can have resource descriptors that contain
information about how interrupts are setup. However Xen is not able
to read any of this information on AML.

Legacy PCI interrupts are (always?) level triggered and low polarity,
because it's assumed that an interrupt source can be shared between
multiple devices.

I'm however not able to find any reference to this in the PCI spec,
hence I'm reluctant to take this for granted in Xen, and default all
GSIs >= 16 to such mode.

OTOH legacy PCI interrupts are not that used anymore, as almost all
devices will support MSI(-X) (because PCIe mandates it) and OSes
should prefer the latter. SR-IOV VF don't even support legacy PCI
interrupts anymore.

> If
> not, then I can see why we would need 1 extra toolstack hypercall for
> that (or to bundle the operation of configuring IO-APIC pins together
> with an existing toolstack hypercall).

One suitable compromise would be to default unconfigured GSIs >= 16 to
level-triggered and low-polarity, as I would expect that to work in
almost all cases. We can always introduce the usage of
PHYSDEVOP_setup_gsi later if required.

Maybe Jan has more input here, would you agree to defaulting non-ISA
GSIs to level-triggered, low-polarity in the absence of a specific
setup provided by dom0?

Thanks, Roger.

2023-12-05 09:39:43

by Jiqian Chen

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0


On 2023/12/5 17:19, Roger Pau Monné wrote:
> On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote:
>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
>>> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
>>>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
>>>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
>>>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>>>>>>>>> This patch is to solve two problems we encountered when we try to
>>>>>>>>> passthrough a device to hvm domU base on Xen PVH dom0.
>>>>>>>>>
>>>>>>>>> First, hvm guest will alloc a pirq and irq for a passthrough device
>>>>>>>>> by using gsi, before that, the gsi must first has a mapping in dom0,
>>>>>>>>> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
>>>>>>>>> into Xen and check whether dom0 has the mapping. See
>>>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
>>>>>>>>> dom0 and it return irq is 0, and then return -EPERM.
>>>>>>>>> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
>>>>>>>>> when thay are enabled.
>>>>>>>>>
>>>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get
>>>>>>>>> registered, but gsi must be configured for it to be able to be
>>>>>>>>> mapped into a domU.
>>>>>>>>>
>>>>>>>>> After searching codes, we can find map_pirq and register_gsi will be
>>>>>>>>> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
>>>>>>>>> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
>>>>>>>>> can be conclude to that the gsi of a passthrough device doesn't be
>>>>>>>>> unmasked.
>>>>>>>>>
>>>>>>>>> To solve the unmaske problem, this patch call the unmask_irq when we
>>>>>>>>> assign a device to be passthrough. So that the gsi can get registered
>>>>>>>>> and mapped in PVH dom0.
>>>>>>>>
>>>>>>>>
>>>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
>>>>>>>> we need the unmask check in Xen? Couldn't we just do:
>>>>>>>>
>>>>>>>>
>>>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>>>>>> index 4e40d3609a..df262a4a18 100644
>>>>>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
>>>>>>>> hvm_dpci_eoi(d, gsi);
>>>>>>>> }
>>>>>>>>
>>>>>>>> - if ( is_hardware_domain(d) && unmasked )
>>>>>>>> + if ( is_hardware_domain(d) )
>>>>>>>> {
>>>>>>>> /*
>>>>>>>> * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
>>>>>>>
>>>>>>> There are some issues with this approach.
>>>>>>>
>>>>>>> mp_register_gsi() will only setup the trigger and polarity of the
>>>>>>> IO-APIC pin once, so we do so once the guest unmask the pin in order
>>>>>>> to assert that the configuration is the intended one. A guest is
>>>>>>> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
>>>>>>> that doesn't take effect unless the pin is unmasked.
>>>>>>>
>>>>>>> Overall the question would be whether we have any guarantees that
>>>>>>> the hardware domain has properly configured the pin, even if it's not
>>>>>>> using it itself (as it hasn't been unmasked).
>>>>>>>
>>>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we
>>>>>>> could configure any pins that are not setup at bind time?
>>>>>>
>>>>>> That could work.
>>>>>>
>>>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at
>>>>>> bind time? That might be enough to pass a pirq_access_permitted check.
>>>>>
>>>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
>>>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
>>>>> parameter would be a GSI instead of a previously mapped IRQ). Such
>>>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
>>>>> route I would recommend that we instead introduce a new dmop that has
>>>>> this syntax regardless of the domain type it's called from.
>>>>
>>>> Looking at the code it is certainly a bit confusing. My point was that
>>>> we don't need to wait until polarity and trigger are set appropriately
>>>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
>>>> should be able to figure out that Dom0 is permitted pirq access.
>>>
>>> The logic is certainly not straightforward, and it could benefit from
>>> some comments.
>>>
>>> The irq permissions are a bit special, in that they get setup when the
>>> IRQ is mapped.
>>>
>>> The problem however is not so much with IRQ permissions, that we can
>>> indeed sort out internally in Xen. Such check in dom0 has the side
>>> effect of preventing the IRQ from being assigned to a domU without the
>>> hardware source being properly configured AFAICT.
>>
>> Now I understand why you made a comment previously about Xen having to
>> configure trigger and polarity for these interrupts on its own.
>>
>>
>>>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
>>>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
>>>> polarity to be configured to work. But the suggestion of doing it a
>>>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
>>>>
>>>> But maybe we can find another location, maybe within
>>>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
>>>> trigger and polarity are set and before the interrupt is unmasked.
>>>>
>>>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
>>>> call to allocate_and_map_gsi_pirq, because by the time
>>>> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
>>>> already been done.
>>>
>>> But then we would end up in a situation where the
>>> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
>>> configured, which I think it's not what we want.
>>>
>>> One option would be to allow mp_register_gsi() to be called multiple
>>> times, and update the IO-APIC pin configuration as long as the pin is
>>> not unmasked. That would propagate each dom0 RTE update to the
>>> underlying IO-APIC. However such approach relies on dom0 configuring
>>> all possible IO-APIC pins, even if no device on dom0 is using them, I
>>> think it's not a very reliable option.
>>>
>>> Another option would be to modify the toolstack to setup the GSI
>>> itself using the PHYSDEVOP_setup_gsi hypercall. As said in a previous
>>> email, since we only care about PCI device passthrough the legacy INTx
>>> should always be level triggered and low polarity.
>>>
>>>> I am not familiar with vioapic.c but to give you an idea of what I was
>>>> thinking:
>>>>
>>>>
>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>> index 4e40d3609a..16d56fe851 100644
>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>>>> return ret;
>>>> }
>>>>
>>>> - ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>> - if ( ret )
>>>> - {
>>>> - gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>> - gsi, ret);
>>>> - return ret;
>>>> - }
>>>> -
>>>> pcidevs_lock();
>>>> ret = pt_irq_create_bind(currd, &pt_irq_bind);
>>>> if ( ret )
>>>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
>>>> hvm_dpci_eoi(d, gsi);
>>>> }
>>>>
>>>> + if ( is_hardware_domain(d) )
>>>> + {
>>>> + int pirq = gsi, ret;
>>>> + ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>> + if ( ret )
>>>> + {
>>>> + gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>> + gsi, ret);
>>>> + return ret;
>>>> + }
>>>> + }
>>>> if ( is_hardware_domain(d) && unmasked )
>>>> {
>>>> /*
>>>
>>> As said above, such approach relies on dom0 writing to the IO-APIC RTE
>>> of likely each IO-APIC pin, which is IMO not quite reliable. In there
>>> are two different issues here that need to be fixed for PVH dom0:
>>>
>>> - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
>>> succeed for a PVH dom0, even if dom0 is not using the GSI itself.
>>
>> Yes makes sense
>>
>>
>>> - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
>>> the IO-APIC pin itself.
>>>
>>> First one needs to be fixed internally in Xen, second one will require
>>> the toolstack to issue an extra hypercall in order to ensure the
>>> IO-APIC pin is properly configured.
>>
>> On ARM, Xen doesn't need to wait for dom0 to configure interrupts
>> correctly. Xen configures them all on its own at boot based on Device
>> Tree information. I guess it is not possible to do the same on x86?
>
> No, not exactly. There's some interrupt information in the ACPI MADT,
> but that's just for very specific sources (Interrupt Source Override
> Structures)
>
> Then on AML devices can have resource descriptors that contain
> information about how interrupts are setup. However Xen is not able
> to read any of this information on AML.
>
> Legacy PCI interrupts are (always?) level triggered and low polarity,
> because it's assumed that an interrupt source can be shared between
> multiple devices.
>
> I'm however not able to find any reference to this in the PCI spec,
> hence I'm reluctant to take this for granted in Xen, and default all
> GSIs >= 16 to such mode.
>
> OTOH legacy PCI interrupts are not that used anymore, as almost all
> devices will support MSI(-X) (because PCIe mandates it) and OSes
> should prefer the latter. SR-IOV VF don't even support legacy PCI
> interrupts anymore.
>
>> If
>> not, then I can see why we would need 1 extra toolstack hypercall for
>> that (or to bundle the operation of configuring IO-APIC pins together
>> with an existing toolstack hypercall).
>
> One suitable compromise would be to default unconfigured GSIs >= 16 to
> level-triggered and low-polarity, as I would expect that to work in
> almost all cases. We can always introduce the usage of
> PHYSDEVOP_setup_gsi later if required.
>
> Maybe Jan has more input here, would you agree to defaulting non-ISA
> GSIs to level-triggered, low-polarity in the absence of a specific
> setup provided by dom0?
>
> Thanks, Roger.

No intention to disturb if I am incorrect, just a little input. On dom0 PVH, when it enables devices, it will call acpi_pci_irq_enable, and in that function, its default trigger is level and polarity is low for pci interrupt.

--
Best regards,
Jiqian Chen.

2023-12-05 10:33:08

by Jan Beulich

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On 05.12.2023 10:19, Roger Pau Monné wrote:
> On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote:
>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
>>> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
>>>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
>>>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
>>>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>>>>>>>>> This patch is to solve two problems we encountered when we try to
>>>>>>>>> passthrough a device to hvm domU base on Xen PVH dom0.
>>>>>>>>>
>>>>>>>>> First, hvm guest will alloc a pirq and irq for a passthrough device
>>>>>>>>> by using gsi, before that, the gsi must first has a mapping in dom0,
>>>>>>>>> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
>>>>>>>>> into Xen and check whether dom0 has the mapping. See
>>>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
>>>>>>>>> dom0 and it return irq is 0, and then return -EPERM.
>>>>>>>>> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
>>>>>>>>> when thay are enabled.
>>>>>>>>>
>>>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get
>>>>>>>>> registered, but gsi must be configured for it to be able to be
>>>>>>>>> mapped into a domU.
>>>>>>>>>
>>>>>>>>> After searching codes, we can find map_pirq and register_gsi will be
>>>>>>>>> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
>>>>>>>>> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
>>>>>>>>> can be conclude to that the gsi of a passthrough device doesn't be
>>>>>>>>> unmasked.
>>>>>>>>>
>>>>>>>>> To solve the unmaske problem, this patch call the unmask_irq when we
>>>>>>>>> assign a device to be passthrough. So that the gsi can get registered
>>>>>>>>> and mapped in PVH dom0.
>>>>>>>>
>>>>>>>>
>>>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
>>>>>>>> we need the unmask check in Xen? Couldn't we just do:
>>>>>>>>
>>>>>>>>
>>>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>>>>>> index 4e40d3609a..df262a4a18 100644
>>>>>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
>>>>>>>> hvm_dpci_eoi(d, gsi);
>>>>>>>> }
>>>>>>>>
>>>>>>>> - if ( is_hardware_domain(d) && unmasked )
>>>>>>>> + if ( is_hardware_domain(d) )
>>>>>>>> {
>>>>>>>> /*
>>>>>>>> * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
>>>>>>>
>>>>>>> There are some issues with this approach.
>>>>>>>
>>>>>>> mp_register_gsi() will only setup the trigger and polarity of the
>>>>>>> IO-APIC pin once, so we do so once the guest unmask the pin in order
>>>>>>> to assert that the configuration is the intended one. A guest is
>>>>>>> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
>>>>>>> that doesn't take effect unless the pin is unmasked.
>>>>>>>
>>>>>>> Overall the question would be whether we have any guarantees that
>>>>>>> the hardware domain has properly configured the pin, even if it's not
>>>>>>> using it itself (as it hasn't been unmasked).
>>>>>>>
>>>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we
>>>>>>> could configure any pins that are not setup at bind time?
>>>>>>
>>>>>> That could work.
>>>>>>
>>>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at
>>>>>> bind time? That might be enough to pass a pirq_access_permitted check.
>>>>>
>>>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
>>>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
>>>>> parameter would be a GSI instead of a previously mapped IRQ). Such
>>>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
>>>>> route I would recommend that we instead introduce a new dmop that has
>>>>> this syntax regardless of the domain type it's called from.
>>>>
>>>> Looking at the code it is certainly a bit confusing. My point was that
>>>> we don't need to wait until polarity and trigger are set appropriately
>>>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
>>>> should be able to figure out that Dom0 is permitted pirq access.
>>>
>>> The logic is certainly not straightforward, and it could benefit from
>>> some comments.
>>>
>>> The irq permissions are a bit special, in that they get setup when the
>>> IRQ is mapped.
>>>
>>> The problem however is not so much with IRQ permissions, that we can
>>> indeed sort out internally in Xen. Such check in dom0 has the side
>>> effect of preventing the IRQ from being assigned to a domU without the
>>> hardware source being properly configured AFAICT.
>>
>> Now I understand why you made a comment previously about Xen having to
>> configure trigger and polarity for these interrupts on its own.
>>
>>
>>>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
>>>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
>>>> polarity to be configured to work. But the suggestion of doing it a
>>>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
>>>>
>>>> But maybe we can find another location, maybe within
>>>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
>>>> trigger and polarity are set and before the interrupt is unmasked.
>>>>
>>>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
>>>> call to allocate_and_map_gsi_pirq, because by the time
>>>> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
>>>> already been done.
>>>
>>> But then we would end up in a situation where the
>>> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
>>> configured, which I think it's not what we want.
>>>
>>> One option would be to allow mp_register_gsi() to be called multiple
>>> times, and update the IO-APIC pin configuration as long as the pin is
>>> not unmasked. That would propagate each dom0 RTE update to the
>>> underlying IO-APIC. However such approach relies on dom0 configuring
>>> all possible IO-APIC pins, even if no device on dom0 is using them, I
>>> think it's not a very reliable option.
>>>
>>> Another option would be to modify the toolstack to setup the GSI
>>> itself using the PHYSDEVOP_setup_gsi hypercall. As said in a previous
>>> email, since we only care about PCI device passthrough the legacy INTx
>>> should always be level triggered and low polarity.
>>>
>>>> I am not familiar with vioapic.c but to give you an idea of what I was
>>>> thinking:
>>>>
>>>>
>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>> index 4e40d3609a..16d56fe851 100644
>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>>>> return ret;
>>>> }
>>>>
>>>> - ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>> - if ( ret )
>>>> - {
>>>> - gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>> - gsi, ret);
>>>> - return ret;
>>>> - }
>>>> -
>>>> pcidevs_lock();
>>>> ret = pt_irq_create_bind(currd, &pt_irq_bind);
>>>> if ( ret )
>>>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
>>>> hvm_dpci_eoi(d, gsi);
>>>> }
>>>>
>>>> + if ( is_hardware_domain(d) )
>>>> + {
>>>> + int pirq = gsi, ret;
>>>> + ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>> + if ( ret )
>>>> + {
>>>> + gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>> + gsi, ret);
>>>> + return ret;
>>>> + }
>>>> + }
>>>> if ( is_hardware_domain(d) && unmasked )
>>>> {
>>>> /*
>>>
>>> As said above, such approach relies on dom0 writing to the IO-APIC RTE
>>> of likely each IO-APIC pin, which is IMO not quite reliable. In there
>>> are two different issues here that need to be fixed for PVH dom0:
>>>
>>> - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
>>> succeed for a PVH dom0, even if dom0 is not using the GSI itself.
>>
>> Yes makes sense
>>
>>
>>> - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
>>> the IO-APIC pin itself.
>>>
>>> First one needs to be fixed internally in Xen, second one will require
>>> the toolstack to issue an extra hypercall in order to ensure the
>>> IO-APIC pin is properly configured.
>>
>> On ARM, Xen doesn't need to wait for dom0 to configure interrupts
>> correctly. Xen configures them all on its own at boot based on Device
>> Tree information. I guess it is not possible to do the same on x86?
>
> No, not exactly. There's some interrupt information in the ACPI MADT,
> but that's just for very specific sources (Interrupt Source Override
> Structures)
>
> Then on AML devices can have resource descriptors that contain
> information about how interrupts are setup. However Xen is not able
> to read any of this information on AML.
>
> Legacy PCI interrupts are (always?) level triggered and low polarity,
> because it's assumed that an interrupt source can be shared between
> multiple devices.

Except that as per what you said just in the earlier paragraph ACPI can
tell us otherwise.

> I'm however not able to find any reference to this in the PCI spec,
> hence I'm reluctant to take this for granted in Xen, and default all
> GSIs >= 16 to such mode.
>
> OTOH legacy PCI interrupts are not that used anymore, as almost all
> devices will support MSI(-X) (because PCIe mandates it) and OSes
> should prefer the latter. SR-IOV VF don't even support legacy PCI
> interrupts anymore.
>
>> If
>> not, then I can see why we would need 1 extra toolstack hypercall for
>> that (or to bundle the operation of configuring IO-APIC pins together
>> with an existing toolstack hypercall).
>
> One suitable compromise would be to default unconfigured GSIs >= 16 to
> level-triggered and low-polarity, as I would expect that to work in
> almost all cases. We can always introduce the usage of
> PHYSDEVOP_setup_gsi later if required.
>
> Maybe Jan has more input here, would you agree to defaulting non-ISA
> GSIs to level-triggered, low-polarity in the absence of a specific
> setup provided by dom0?

Well, such defaulting is an option, but in case it's wrong we might
end up with hard to diagnose issues. Personally I'd prefer if we
didn't take shortcuts here, i.e. if we followed what Dom0 is able
to read from ACPI.

Jan

2023-12-05 17:03:21

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 1/3] xen/pci: Add xen_reset_device_state function

On Mon, Dec 04 2023 at 13:31, Stefano Stabellini wrote:
> On Mon, 3 Dec 2023, Chen, Jiqian wrote:
>> >> vpci device state when device is reset on dom0 side.
>> >>
>> >> And call that function in pcistub_init_device. Because when
>> >> we use "pci-assignable-add" to assign a passthrough device in
>> >> Xen, it will reset passthrough device and the vpci state will
>> >> out of date, and then device will fail to restore bar state.
>> >>
>> >> Signed-off-by: Jiqian Chen <[email protected]>
>> >> Signed-off-by: Huang Rui <[email protected]>
>> >
>> > This Signed-off-by chain is incorrect.
>> >
>> > Documentation/process/submitting-patches.rst has a full chapter about
>> > S-O-B and the correct usage.
>> I am the author of this series of patches, and Huang Rui transported the v1 to upstream. And now I transport v2. I am not aware that the SOB chain is incorrect.
>> Do you have any suggestions?
>
> I think he means that your Signed-off-by should be the second one of the
> two as you are the one submitting the patch to the LKML

No.

Mailfrom: Jiqian Chen <[email protected]>
<body>

Changelog-text

Signed-off-by: Huang Rui <[email protected]>
Signed-off-by: Jiqian Chen <[email protected]>

is equally wrong because that would end up with Chen as author and Huang
as first S-O-B which is required to be the author's S-O-B

To make the above correct this would require:

Mailfrom: Jiqian Chen <[email protected]>
<body>

From: Huang Rui <[email protected]>

Changelog-text

Signed-off-by: Huang Rui <[email protected]>
Signed-off-by: Jiqian Chen <[email protected]>

which tells that Huang is the author and Chen is the 'transporter',
which unfortunately does not reflect reality.

Or:

Mailfrom: Jiqian Chen <[email protected]>
<body>

Changelog-text

Co-developed-by: Huang Rui <[email protected]>
Signed-off-by: Huang Rui <[email protected]>
Signed-off-by: Jiqian Chen <[email protected]>

which tells that Checn is the author and Huang co-developed the
patch, which might be true or not.


V1 which was sent by Huang has the ordering is correct:

Mailfrom: Huang Rui <[email protected]>
<body>

From: Jiqian Chen <[email protected]>

Changelog-text

Signed-off-by: Jiqian Chen <[email protected]>
Signed-off-by: Huang Rui <[email protected]>

i.e. Chen authored and Huang transported

Now this V2 has not really much to do with V1 and is a new
implementation to solve the problem, which was authored by Chen, so
Huang is not involved at all if I understand correctly.

So what does his S-O-B mean here? Nothing...

It's very well documented how the whole S-O-B business works and it's
not really rocket science to get it straight.

It has a meaning and is not just for decoration purposes.

Thanks,

tglx

2023-12-06 06:08:28

by Jiqian Chen

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On 2023/12/5 18:32, Jan Beulich wrote:
> On 05.12.2023 10:19, Roger Pau Monné wrote:
>> On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote:
>>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
>>>> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
>>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>>>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
>>>>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
>>>>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
>>>>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>>>>>>>>>> This patch is to solve two problems we encountered when we try to
>>>>>>>>>> passthrough a device to hvm domU base on Xen PVH dom0.
>>>>>>>>>>
>>>>>>>>>> First, hvm guest will alloc a pirq and irq for a passthrough device
>>>>>>>>>> by using gsi, before that, the gsi must first has a mapping in dom0,
>>>>>>>>>> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
>>>>>>>>>> into Xen and check whether dom0 has the mapping. See
>>>>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
>>>>>>>>>> dom0 and it return irq is 0, and then return -EPERM.
>>>>>>>>>> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
>>>>>>>>>> when thay are enabled.
>>>>>>>>>>
>>>>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get
>>>>>>>>>> registered, but gsi must be configured for it to be able to be
>>>>>>>>>> mapped into a domU.
>>>>>>>>>>
>>>>>>>>>> After searching codes, we can find map_pirq and register_gsi will be
>>>>>>>>>> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
>>>>>>>>>> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
>>>>>>>>>> can be conclude to that the gsi of a passthrough device doesn't be
>>>>>>>>>> unmasked.
>>>>>>>>>>
>>>>>>>>>> To solve the unmaske problem, this patch call the unmask_irq when we
>>>>>>>>>> assign a device to be passthrough. So that the gsi can get registered
>>>>>>>>>> and mapped in PVH dom0.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
>>>>>>>>> we need the unmask check in Xen? Couldn't we just do:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>>>>>>> index 4e40d3609a..df262a4a18 100644
>>>>>>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
>>>>>>>>> hvm_dpci_eoi(d, gsi);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> - if ( is_hardware_domain(d) && unmasked )
>>>>>>>>> + if ( is_hardware_domain(d) )
>>>>>>>>> {
>>>>>>>>> /*
>>>>>>>>> * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
>>>>>>>>
>>>>>>>> There are some issues with this approach.
>>>>>>>>
>>>>>>>> mp_register_gsi() will only setup the trigger and polarity of the
>>>>>>>> IO-APIC pin once, so we do so once the guest unmask the pin in order
>>>>>>>> to assert that the configuration is the intended one. A guest is
>>>>>>>> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
>>>>>>>> that doesn't take effect unless the pin is unmasked.
>>>>>>>>
>>>>>>>> Overall the question would be whether we have any guarantees that
>>>>>>>> the hardware domain has properly configured the pin, even if it's not
>>>>>>>> using it itself (as it hasn't been unmasked).
>>>>>>>>
>>>>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we
>>>>>>>> could configure any pins that are not setup at bind time?
>>>>>>>
>>>>>>> That could work.
>>>>>>>
>>>>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at
>>>>>>> bind time? That might be enough to pass a pirq_access_permitted check.
>>>>>>
>>>>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
>>>>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
>>>>>> parameter would be a GSI instead of a previously mapped IRQ). Such
>>>>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
>>>>>> route I would recommend that we instead introduce a new dmop that has
>>>>>> this syntax regardless of the domain type it's called from.
>>>>>
>>>>> Looking at the code it is certainly a bit confusing. My point was that
>>>>> we don't need to wait until polarity and trigger are set appropriately
>>>>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
>>>>> should be able to figure out that Dom0 is permitted pirq access.
>>>>
>>>> The logic is certainly not straightforward, and it could benefit from
>>>> some comments.
>>>>
>>>> The irq permissions are a bit special, in that they get setup when the
>>>> IRQ is mapped.
>>>>
>>>> The problem however is not so much with IRQ permissions, that we can
>>>> indeed sort out internally in Xen. Such check in dom0 has the side
>>>> effect of preventing the IRQ from being assigned to a domU without the
>>>> hardware source being properly configured AFAICT.
>>>
>>> Now I understand why you made a comment previously about Xen having to
>>> configure trigger and polarity for these interrupts on its own.
>>>
>>>
>>>>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
>>>>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
>>>>> polarity to be configured to work. But the suggestion of doing it a
>>>>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
>>>>>
>>>>> But maybe we can find another location, maybe within
>>>>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
>>>>> trigger and polarity are set and before the interrupt is unmasked.
>>>>>
>>>>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
>>>>> call to allocate_and_map_gsi_pirq, because by the time
>>>>> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
>>>>> already been done.
>>>>
>>>> But then we would end up in a situation where the
>>>> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
>>>> configured, which I think it's not what we want.
>>>>
>>>> One option would be to allow mp_register_gsi() to be called multiple
>>>> times, and update the IO-APIC pin configuration as long as the pin is
>>>> not unmasked. That would propagate each dom0 RTE update to the
>>>> underlying IO-APIC. However such approach relies on dom0 configuring
>>>> all possible IO-APIC pins, even if no device on dom0 is using them, I
>>>> think it's not a very reliable option.
>>>>
>>>> Another option would be to modify the toolstack to setup the GSI
>>>> itself using the PHYSDEVOP_setup_gsi hypercall. As said in a previous
>>>> email, since we only care about PCI device passthrough the legacy INTx
>>>> should always be level triggered and low polarity.
>>>>
>>>>> I am not familiar with vioapic.c but to give you an idea of what I was
>>>>> thinking:
>>>>>
>>>>>
>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>>> index 4e40d3609a..16d56fe851 100644
>>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>>>>> return ret;
>>>>> }
>>>>>
>>>>> - ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>>> - if ( ret )
>>>>> - {
>>>>> - gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>>> - gsi, ret);
>>>>> - return ret;
>>>>> - }
>>>>> -
>>>>> pcidevs_lock();
>>>>> ret = pt_irq_create_bind(currd, &pt_irq_bind);
>>>>> if ( ret )
>>>>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
>>>>> hvm_dpci_eoi(d, gsi);
>>>>> }
>>>>>
>>>>> + if ( is_hardware_domain(d) )
>>>>> + {
>>>>> + int pirq = gsi, ret;
>>>>> + ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>>> + if ( ret )
>>>>> + {
>>>>> + gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>>> + gsi, ret);
>>>>> + return ret;
>>>>> + }
>>>>> + }
>>>>> if ( is_hardware_domain(d) && unmasked )
>>>>> {
>>>>> /*
>>>>
>>>> As said above, such approach relies on dom0 writing to the IO-APIC RTE
>>>> of likely each IO-APIC pin, which is IMO not quite reliable. In there
>>>> are two different issues here that need to be fixed for PVH dom0:
>>>>
>>>> - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
>>>> succeed for a PVH dom0, even if dom0 is not using the GSI itself.
>>>
>>> Yes makes sense
>>>
>>>
>>>> - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
>>>> the IO-APIC pin itself.
>>>>
>>>> First one needs to be fixed internally in Xen, second one will require
>>>> the toolstack to issue an extra hypercall in order to ensure the
>>>> IO-APIC pin is properly configured.
>>>
>>> On ARM, Xen doesn't need to wait for dom0 to configure interrupts
>>> correctly. Xen configures them all on its own at boot based on Device
>>> Tree information. I guess it is not possible to do the same on x86?
>>
>> No, not exactly. There's some interrupt information in the ACPI MADT,
>> but that's just for very specific sources (Interrupt Source Override
>> Structures)
>>
>> Then on AML devices can have resource descriptors that contain
>> information about how interrupts are setup. However Xen is not able
>> to read any of this information on AML.
>>
>> Legacy PCI interrupts are (always?) level triggered and low polarity,
>> because it's assumed that an interrupt source can be shared between
>> multiple devices.
>
> Except that as per what you said just in the earlier paragraph ACPI can
> tell us otherwise.
>
>> I'm however not able to find any reference to this in the PCI spec,
>> hence I'm reluctant to take this for granted in Xen, and default all
>> GSIs >= 16 to such mode.
>>
>> OTOH legacy PCI interrupts are not that used anymore, as almost all
>> devices will support MSI(-X) (because PCIe mandates it) and OSes
>> should prefer the latter. SR-IOV VF don't even support legacy PCI
>> interrupts anymore.
>>
>>> If
>>> not, then I can see why we would need 1 extra toolstack hypercall for
>>> that (or to bundle the operation of configuring IO-APIC pins together
>>> with an existing toolstack hypercall).
>>
>> One suitable compromise would be to default unconfigured GSIs >= 16 to
>> level-triggered and low-polarity, as I would expect that to work in
>> almost all cases. We can always introduce the usage of
>> PHYSDEVOP_setup_gsi later if required.
>>
>> Maybe Jan has more input here, would you agree to defaulting non-ISA
>> GSIs to level-triggered, low-polarity in the absence of a specific
>> setup provided by dom0?
>
> Well, such defaulting is an option, but in case it's wrong we might
> end up with hard to diagnose issues. Personally I'd prefer if we
> didn't take shortcuts here, i.e. if we followed what Dom0 is able
> to read from ACPI.

When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
I have a version of patch which tried that way, see below:

diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
index ada3868c02c2..43e1bda9f946 100644
--- a/arch/x86/xen/enlighten_pvh.c
+++ b/arch/x86/xen/enlighten_pvh.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/acpi.h>
#include <linux/export.h>
+#include <linux/pci.h>

#include <xen/hvc-console.h>

@@ -25,6 +26,127 @@
bool __ro_after_init xen_pvh;
EXPORT_SYMBOL_GPL(xen_pvh);

+typedef struct gsi_info {
+ int gsi;
+ int trigger;
+ int polarity;
+ int pirq;
+} gsi_info_t;
+
+struct acpi_prt_entry {
+ struct acpi_pci_id id;
+ u8 pin;
+ acpi_handle link;
+ u32 index; /* GSI, or link _CRS index */
+};
+
+static int xen_pvh_get_gsi_info(struct pci_dev *dev,
+ gsi_info_t *gsi_info)
+{
+ int gsi;
+ u8 pin = 0;
+ struct acpi_prt_entry *entry;
+ int trigger = ACPI_LEVEL_SENSITIVE;
+ int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
+ ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
+
+ if (dev)
+ pin = dev->pin;
+ if (!pin) {
+ xen_raw_printk("No interrupt pin configured\n");
+ return -EINVAL;
+ }
+
+ entry = acpi_pci_irq_lookup(dev, pin);
+ if (entry) {
+ if (entry->link)
+ gsi = acpi_pci_link_allocate_irq(entry->link,
+ entry->index,
+ &trigger, &polarity,
+ NULL);
+ else
+ gsi = entry->index;
+ } else
+ return -EINVAL;
+
+ gsi_info->gsi = gsi;
+ gsi_info->trigger = trigger;
+ gsi_info->polarity = polarity;
+
+ return 0;
+}
+
+static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
+{
+ struct physdev_map_pirq map_irq;
+ int ret;
+
+ map_irq.domid = DOMID_SELF;
+ map_irq.type = MAP_PIRQ_TYPE_GSI;
+ map_irq.index = gsi_info->gsi;
+ map_irq.pirq = gsi_info->gsi;
+
+ ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
+ gsi_info->pirq = map_irq.pirq;
+
+ return ret;
+}
+
+static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
+{
+ struct physdev_unmap_pirq unmap_irq;
+
+ unmap_irq.domid = DOMID_SELF;
+ unmap_irq.pirq = gsi_info->pirq;
+
+ return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
+}
+
+static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
+{
+ struct physdev_setup_gsi setup_gsi;
+
+ setup_gsi.gsi = gsi_info->gsi;
+ setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
+ setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
+
+ return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
+}
+
+int xen_pvh_passthrough_gsi(struct pci_dev *dev)
+{
+ int ret;
+ gsi_info_t gsi_info;
+
+ if (!dev) {
+ return -EINVAL;
+ }
+
+ ret = xen_pvh_get_gsi_info(dev, &gsi_info);
+ if (ret) {
+ xen_raw_printk("Fail to get gsi info!\n");
+ return ret;
+ }
+
+ ret = xen_pvh_map_pirq(&gsi_info);
+ if (ret) {
+ xen_raw_printk("Fail to map pirq for gsi (%d)!\n", gsi_info.gsi);
+ return ret;
+ }
+
+ ret = xen_pvh_setup_gsi(&gsi_info);
+ if (ret == -EEXIST) {
+ ret = 0;
+ xen_raw_printk("Already setup the GSI :%u\n", gsi_info.gsi);
+ } else if (ret) {
+ xen_raw_printk("Fail to setup gsi (%d)!\n", gsi_info.gsi);
+ xen_pvh_unmap_pirq(&gsi_info);
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
+
void __init xen_pvh_init(struct boot_params *boot_params)
{
u32 msr;
diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
index ff30ceca2203..630fe0a34bc6 100644
--- a/drivers/acpi/pci_irq.c
+++ b/drivers/acpi/pci_irq.c
@@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
}
#endif /* CONFIG_X86_IO_APIC */

-static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
+struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
{
struct acpi_prt_entry *entry = NULL;
struct pci_dev *bridge;
diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index e34b623e4b41..1abd4dad6f40 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -20,6 +20,7 @@
#include <linux/atomic.h>
#include <xen/events.h>
#include <xen/pci.h>
+#include <xen/acpi.h>
#include <xen/xen.h>
#include <asm/xen/hypervisor.h>
#include <xen/interface/physdev.h>
@@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
if (err)
goto config_release;

+ if (xen_initial_domain() && xen_pvh_domain()) {
+ err = xen_pvh_passthrough_gsi(dev);
+ if (err)
+ goto config_release;
+ }
+
if (dev->msix_cap) {
struct physdev_pci_device ppdev = {
.seg = pci_domain_nr(dev->bus),
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 641dc4843987..368d56ba2c5e 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -375,6 +375,7 @@ void acpi_unregister_gsi (u32 gsi);

struct pci_dev;

+struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin);
int acpi_pci_irq_enable (struct pci_dev *dev);
void acpi_penalize_isa_irq(int irq, int active);
bool acpi_isa_irq_available(int irq);
diff --git a/include/xen/acpi.h b/include/xen/acpi.h
index b1e11863144d..ce7f5554f88e 100644
--- a/include/xen/acpi.h
+++ b/include/xen/acpi.h
@@ -67,6 +67,7 @@ static inline void xen_acpi_sleep_register(void)
acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
}
}
+int xen_pvh_passthrough_gsi(struct pci_dev *dev);
#else
static inline void xen_acpi_sleep_register(void)
{

>
> Jan

--
Best regards,
Jiqian Chen.

2023-12-06 06:38:23

by Jiqian Chen

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 1/3] xen/pci: Add xen_reset_device_state function

Hi Thomas Gleixner,

On 2023/12/6 01:02, Thomas Gleixner wrote:
> On Mon, Dec 04 2023 at 13:31, Stefano Stabellini wrote:
>> On Mon, 3 Dec 2023, Chen, Jiqian wrote:
>>>>> vpci device state when device is reset on dom0 side.
>>>>>
>>>>> And call that function in pcistub_init_device. Because when
>>>>> we use "pci-assignable-add" to assign a passthrough device in
>>>>> Xen, it will reset passthrough device and the vpci state will
>>>>> out of date, and then device will fail to restore bar state.
>>>>>
>>>>> Signed-off-by: Jiqian Chen <[email protected]>
>>>>> Signed-off-by: Huang Rui <[email protected]>
>>>>
>>>> This Signed-off-by chain is incorrect.
>>>>
>>>> Documentation/process/submitting-patches.rst has a full chapter about
>>>> S-O-B and the correct usage.
>>> I am the author of this series of patches, and Huang Rui transported the v1 to upstream. And now I transport v2. I am not aware that the SOB chain is incorrect.
>>> Do you have any suggestions?
>>
>> I think he means that your Signed-off-by should be the second one of the
>> two as you are the one submitting the patch to the LKML
>
> No.
>
> Mailfrom: Jiqian Chen <[email protected]>
> <body>
>
> Changelog-text
>
> Signed-off-by: Huang Rui <[email protected]>
> Signed-off-by: Jiqian Chen <[email protected]>
>
> is equally wrong because that would end up with Chen as author and Huang
> as first S-O-B which is required to be the author's S-O-B
>
> To make the above correct this would require:
>
> Mailfrom: Jiqian Chen <[email protected]>
> <body>
>
> From: Huang Rui <[email protected]>
>
> Changelog-text
>
> Signed-off-by: Huang Rui <[email protected]>
> Signed-off-by: Jiqian Chen <[email protected]>
>
> which tells that Huang is the author and Chen is the 'transporter',
> which unfortunately does not reflect reality.
>
> Or:
>
> Mailfrom: Jiqian Chen <[email protected]>
> <body>
>
> Changelog-text
>
> Co-developed-by: Huang Rui <[email protected]>
> Signed-off-by: Huang Rui <[email protected]>
> Signed-off-by: Jiqian Chen <[email protected]>
>
> which tells that Checn is the author and Huang co-developed the
> patch, which might be true or not.
>
>
> V1 which was sent by Huang has the ordering is correct:
>
> Mailfrom: Huang Rui <[email protected]>
> <body>
>
> From: Jiqian Chen <[email protected]>
>
> Changelog-text
>
> Signed-off-by: Jiqian Chen <[email protected]>
> Signed-off-by: Huang Rui <[email protected]>
>
> i.e. Chen authored and Huang transported
>
> Now this V2 has not really much to do with V1 and is a new
> implementation to solve the problem, which was authored by Chen, so
> Huang is not involved at all if I understand correctly.
Not exactly, V2 is modified based on the V1. I am the author of this series. Huang Rui upstream the V1, and he also helped me to improve the first patch of this V2 series.
Maybe in next version, below is more suitable:

Co-developed-by: Huang Rui <[email protected]>
Signed-off-by: Jiqian Chen <[email protected]>

Which can reflect that Huang Rui is co-developer, and I am the author and the last people to send patches.

>
> So what does his S-O-B mean here? Nothing...
>
> It's very well documented how the whole S-O-B business works and it's
> not really rocket science to get it straight.
>
> It has a meaning and is not just for decoration purposes.
>
> Thanks,
>
> tglx

--
Best regards,
Jiqian Chen.

2023-12-07 02:18:33

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On Tue, 5 Dec 2023, Chen, Jiqian wrote:
> When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
> I have a version of patch which tried that way, see below:

This approach looks much better. I think this patch is OKish. Juergen,
what do you think?


> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
> index ada3868c02c2..43e1bda9f946 100644
> --- a/arch/x86/xen/enlighten_pvh.c
> +++ b/arch/x86/xen/enlighten_pvh.c
> @@ -1,6 +1,7 @@
> // SPDX-License-Identifier: GPL-2.0
> #include <linux/acpi.h>
> #include <linux/export.h>
> +#include <linux/pci.h>
>
> #include <xen/hvc-console.h>
>
> @@ -25,6 +26,127 @@
> bool __ro_after_init xen_pvh;
> EXPORT_SYMBOL_GPL(xen_pvh);
>
> +typedef struct gsi_info {
> + int gsi;
> + int trigger;
> + int polarity;
> + int pirq;
> +} gsi_info_t;
> +
> +struct acpi_prt_entry {
> + struct acpi_pci_id id;
> + u8 pin;
> + acpi_handle link;
> + u32 index; /* GSI, or link _CRS index */
> +};
> +
> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
> + gsi_info_t *gsi_info)
> +{
> + int gsi;
> + u8 pin = 0;
> + struct acpi_prt_entry *entry;
> + int trigger = ACPI_LEVEL_SENSITIVE;
> + int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
> + ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
> +
> + if (dev)
> + pin = dev->pin;
> + if (!pin) {
> + xen_raw_printk("No interrupt pin configured\n");
> + return -EINVAL;
> + }
> +
> + entry = acpi_pci_irq_lookup(dev, pin);
> + if (entry) {
> + if (entry->link)
> + gsi = acpi_pci_link_allocate_irq(entry->link,
> + entry->index,
> + &trigger, &polarity,
> + NULL);
> + else
> + gsi = entry->index;
> + } else
> + return -EINVAL;
> +
> + gsi_info->gsi = gsi;
> + gsi_info->trigger = trigger;
> + gsi_info->polarity = polarity;
> +
> + return 0;
> +}
> +
> +static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
> +{
> + struct physdev_map_pirq map_irq;
> + int ret;
> +
> + map_irq.domid = DOMID_SELF;
> + map_irq.type = MAP_PIRQ_TYPE_GSI;
> + map_irq.index = gsi_info->gsi;
> + map_irq.pirq = gsi_info->gsi;
> +
> + ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
> + gsi_info->pirq = map_irq.pirq;
> +
> + return ret;
> +}
> +
> +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
> +{
> + struct physdev_unmap_pirq unmap_irq;
> +
> + unmap_irq.domid = DOMID_SELF;
> + unmap_irq.pirq = gsi_info->pirq;
> +
> + return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
> +}
> +
> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
> +{
> + struct physdev_setup_gsi setup_gsi;
> +
> + setup_gsi.gsi = gsi_info->gsi;
> + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
> + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
> +
> + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
> +}
> +
> +int xen_pvh_passthrough_gsi(struct pci_dev *dev)
> +{
> + int ret;
> + gsi_info_t gsi_info;
> +
> + if (!dev) {
> + return -EINVAL;
> + }
> +
> + ret = xen_pvh_get_gsi_info(dev, &gsi_info);
> + if (ret) {
> + xen_raw_printk("Fail to get gsi info!\n");
> + return ret;
> + }
> +
> + ret = xen_pvh_map_pirq(&gsi_info);
> + if (ret) {
> + xen_raw_printk("Fail to map pirq for gsi (%d)!\n", gsi_info.gsi);
> + return ret;
> + }
> +
> + ret = xen_pvh_setup_gsi(&gsi_info);
> + if (ret == -EEXIST) {
> + ret = 0;
> + xen_raw_printk("Already setup the GSI :%u\n", gsi_info.gsi);
> + } else if (ret) {
> + xen_raw_printk("Fail to setup gsi (%d)!\n", gsi_info.gsi);
> + xen_pvh_unmap_pirq(&gsi_info);
> + }
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
> +
> void __init xen_pvh_init(struct boot_params *boot_params)
> {
> u32 msr;
> diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
> index ff30ceca2203..630fe0a34bc6 100644
> --- a/drivers/acpi/pci_irq.c
> +++ b/drivers/acpi/pci_irq.c
> @@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
> }
> #endif /* CONFIG_X86_IO_APIC */
>
> -static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
> {
> struct acpi_prt_entry *entry = NULL;
> struct pci_dev *bridge;
> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
> index e34b623e4b41..1abd4dad6f40 100644
> --- a/drivers/xen/xen-pciback/pci_stub.c
> +++ b/drivers/xen/xen-pciback/pci_stub.c
> @@ -20,6 +20,7 @@
> #include <linux/atomic.h>
> #include <xen/events.h>
> #include <xen/pci.h>
> +#include <xen/acpi.h>
> #include <xen/xen.h>
> #include <asm/xen/hypervisor.h>
> #include <xen/interface/physdev.h>
> @@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
> if (err)
> goto config_release;
>
> + if (xen_initial_domain() && xen_pvh_domain()) {
> + err = xen_pvh_passthrough_gsi(dev);
> + if (err)
> + goto config_release;
> + }
> +
> if (dev->msix_cap) {
> struct physdev_pci_device ppdev = {
> .seg = pci_domain_nr(dev->bus),
> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> index 641dc4843987..368d56ba2c5e 100644
> --- a/include/linux/acpi.h
> +++ b/include/linux/acpi.h
> @@ -375,6 +375,7 @@ void acpi_unregister_gsi (u32 gsi);
>
> struct pci_dev;
>
> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin);
> int acpi_pci_irq_enable (struct pci_dev *dev);
> void acpi_penalize_isa_irq(int irq, int active);
> bool acpi_isa_irq_available(int irq);
> diff --git a/include/xen/acpi.h b/include/xen/acpi.h
> index b1e11863144d..ce7f5554f88e 100644
> --- a/include/xen/acpi.h
> +++ b/include/xen/acpi.h
> @@ -67,6 +67,7 @@ static inline void xen_acpi_sleep_register(void)
> acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
> }
> }
> +int xen_pvh_passthrough_gsi(struct pci_dev *dev);
> #else
> static inline void xen_acpi_sleep_register(void)
> {
>
> >
> > Jan
>
> --
> Best regards,
> Jiqian Chen.
>

2023-12-07 03:42:42

by Jiqian Chen

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

(Adding Juergen to the "To" list.)

Hi Juergen,
Looking forward to your opinions.

On 2023/12/7 10:18, Stefano Stabellini wrote:
> On Tue, 5 Dec 2023, Chen, Jiqian wrote:
>> When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
>> I have a version of patch which tried that way, see below:
>
> This approach looks much better. I think this patch is OKish. Juergen,
> what do you think?
>
>
>> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
>> index ada3868c02c2..43e1bda9f946 100644
>> --- a/arch/x86/xen/enlighten_pvh.c
>> +++ b/arch/x86/xen/enlighten_pvh.c
>> @@ -1,6 +1,7 @@
>> // SPDX-License-Identifier: GPL-2.0
>> #include <linux/acpi.h>
>> #include <linux/export.h>
>> +#include <linux/pci.h>
>>
>> #include <xen/hvc-console.h>
>>
>> @@ -25,6 +26,127 @@
>> bool __ro_after_init xen_pvh;
>> EXPORT_SYMBOL_GPL(xen_pvh);
>>
>> +typedef struct gsi_info {
>> + int gsi;
>> + int trigger;
>> + int polarity;
>> + int pirq;
>> +} gsi_info_t;
>> +
>> +struct acpi_prt_entry {
>> + struct acpi_pci_id id;
>> + u8 pin;
>> + acpi_handle link;
>> + u32 index; /* GSI, or link _CRS index */
>> +};
>> +
>> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
>> + gsi_info_t *gsi_info)
>> +{
>> + int gsi;
>> + u8 pin = 0;
>> + struct acpi_prt_entry *entry;
>> + int trigger = ACPI_LEVEL_SENSITIVE;
>> + int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
>> + ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
>> +
>> + if (dev)
>> + pin = dev->pin;
>> + if (!pin) {
>> + xen_raw_printk("No interrupt pin configured\n");
>> + return -EINVAL;
>> + }
>> +
>> + entry = acpi_pci_irq_lookup(dev, pin);
>> + if (entry) {
>> + if (entry->link)
>> + gsi = acpi_pci_link_allocate_irq(entry->link,
>> + entry->index,
>> + &trigger, &polarity,
>> + NULL);
>> + else
>> + gsi = entry->index;
>> + } else
>> + return -EINVAL;
>> +
>> + gsi_info->gsi = gsi;
>> + gsi_info->trigger = trigger;
>> + gsi_info->polarity = polarity;
>> +
>> + return 0;
>> +}
>> +
>> +static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
>> +{
>> + struct physdev_map_pirq map_irq;
>> + int ret;
>> +
>> + map_irq.domid = DOMID_SELF;
>> + map_irq.type = MAP_PIRQ_TYPE_GSI;
>> + map_irq.index = gsi_info->gsi;
>> + map_irq.pirq = gsi_info->gsi;
>> +
>> + ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
>> + gsi_info->pirq = map_irq.pirq;
>> +
>> + return ret;
>> +}
>> +
>> +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
>> +{
>> + struct physdev_unmap_pirq unmap_irq;
>> +
>> + unmap_irq.domid = DOMID_SELF;
>> + unmap_irq.pirq = gsi_info->pirq;
>> +
>> + return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
>> +}
>> +
>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>> +{
>> + struct physdev_setup_gsi setup_gsi;
>> +
>> + setup_gsi.gsi = gsi_info->gsi;
>> + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>> + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>> +
>> + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>> +}
>> +
>> +int xen_pvh_passthrough_gsi(struct pci_dev *dev)
>> +{
>> + int ret;
>> + gsi_info_t gsi_info;
>> +
>> + if (!dev) {
>> + return -EINVAL;
>> + }
>> +
>> + ret = xen_pvh_get_gsi_info(dev, &gsi_info);
>> + if (ret) {
>> + xen_raw_printk("Fail to get gsi info!\n");
>> + return ret;
>> + }
>> +
>> + ret = xen_pvh_map_pirq(&gsi_info);
>> + if (ret) {
>> + xen_raw_printk("Fail to map pirq for gsi (%d)!\n", gsi_info.gsi);
>> + return ret;
>> + }
>> +
>> + ret = xen_pvh_setup_gsi(&gsi_info);
>> + if (ret == -EEXIST) {
>> + ret = 0;
>> + xen_raw_printk("Already setup the GSI :%u\n", gsi_info.gsi);
>> + } else if (ret) {
>> + xen_raw_printk("Fail to setup gsi (%d)!\n", gsi_info.gsi);
>> + xen_pvh_unmap_pirq(&gsi_info);
>> + }
>> +
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
>> +
>> void __init xen_pvh_init(struct boot_params *boot_params)
>> {
>> u32 msr;
>> diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
>> index ff30ceca2203..630fe0a34bc6 100644
>> --- a/drivers/acpi/pci_irq.c
>> +++ b/drivers/acpi/pci_irq.c
>> @@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
>> }
>> #endif /* CONFIG_X86_IO_APIC */
>>
>> -static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>> {
>> struct acpi_prt_entry *entry = NULL;
>> struct pci_dev *bridge;
>> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
>> index e34b623e4b41..1abd4dad6f40 100644
>> --- a/drivers/xen/xen-pciback/pci_stub.c
>> +++ b/drivers/xen/xen-pciback/pci_stub.c
>> @@ -20,6 +20,7 @@
>> #include <linux/atomic.h>
>> #include <xen/events.h>
>> #include <xen/pci.h>
>> +#include <xen/acpi.h>
>> #include <xen/xen.h>
>> #include <asm/xen/hypervisor.h>
>> #include <xen/interface/physdev.h>
>> @@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
>> if (err)
>> goto config_release;
>>
>> + if (xen_initial_domain() && xen_pvh_domain()) {
>> + err = xen_pvh_passthrough_gsi(dev);
>> + if (err)
>> + goto config_release;
>> + }
>> +
>> if (dev->msix_cap) {
>> struct physdev_pci_device ppdev = {
>> .seg = pci_domain_nr(dev->bus),
>> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
>> index 641dc4843987..368d56ba2c5e 100644
>> --- a/include/linux/acpi.h
>> +++ b/include/linux/acpi.h
>> @@ -375,6 +375,7 @@ void acpi_unregister_gsi (u32 gsi);
>>
>> struct pci_dev;
>>
>> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin);
>> int acpi_pci_irq_enable (struct pci_dev *dev);
>> void acpi_penalize_isa_irq(int irq, int active);
>> bool acpi_isa_irq_available(int irq);
>> diff --git a/include/xen/acpi.h b/include/xen/acpi.h
>> index b1e11863144d..ce7f5554f88e 100644
>> --- a/include/xen/acpi.h
>> +++ b/include/xen/acpi.h
>> @@ -67,6 +67,7 @@ static inline void xen_acpi_sleep_register(void)
>> acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
>> }
>> }
>> +int xen_pvh_passthrough_gsi(struct pci_dev *dev);
>> #else
>> static inline void xen_acpi_sleep_register(void)
>> {
>>
>>>
>>> Jan
>>
>> --
>> Best regards,
>> Jiqian Chen.
>>

--
Best regards,
Jiqian Chen.

2023-12-07 06:44:08

by Juergen Gross

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On 07.12.23 03:18, Stefano Stabellini wrote:
> On Tue, 5 Dec 2023, Chen, Jiqian wrote:
>> When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
>> I have a version of patch which tried that way, see below:
>
> This approach looks much better. I think this patch is OKish. Juergen,
> what do you think?

The approach seems to be fine.


Juergen

>
>
>> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
>> index ada3868c02c2..43e1bda9f946 100644
>> --- a/arch/x86/xen/enlighten_pvh.c
>> +++ b/arch/x86/xen/enlighten_pvh.c
>> @@ -1,6 +1,7 @@
>> // SPDX-License-Identifier: GPL-2.0
>> #include <linux/acpi.h>
>> #include <linux/export.h>
>> +#include <linux/pci.h>
>>
>> #include <xen/hvc-console.h>
>>
>> @@ -25,6 +26,127 @@
>> bool __ro_after_init xen_pvh;
>> EXPORT_SYMBOL_GPL(xen_pvh);
>>
>> +typedef struct gsi_info {
>> + int gsi;
>> + int trigger;
>> + int polarity;
>> + int pirq;
>> +} gsi_info_t;
>> +
>> +struct acpi_prt_entry {
>> + struct acpi_pci_id id;
>> + u8 pin;
>> + acpi_handle link;
>> + u32 index; /* GSI, or link _CRS index */
>> +};
>> +
>> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
>> + gsi_info_t *gsi_info)
>> +{
>> + int gsi;
>> + u8 pin = 0;
>> + struct acpi_prt_entry *entry;
>> + int trigger = ACPI_LEVEL_SENSITIVE;
>> + int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
>> + ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
>> +
>> + if (dev)
>> + pin = dev->pin;
>> + if (!pin) {
>> + xen_raw_printk("No interrupt pin configured\n");
>> + return -EINVAL;
>> + }
>> +
>> + entry = acpi_pci_irq_lookup(dev, pin);
>> + if (entry) {
>> + if (entry->link)
>> + gsi = acpi_pci_link_allocate_irq(entry->link,
>> + entry->index,
>> + &trigger, &polarity,
>> + NULL);
>> + else
>> + gsi = entry->index;
>> + } else
>> + return -EINVAL;
>> +
>> + gsi_info->gsi = gsi;
>> + gsi_info->trigger = trigger;
>> + gsi_info->polarity = polarity;
>> +
>> + return 0;
>> +}
>> +
>> +static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
>> +{
>> + struct physdev_map_pirq map_irq;
>> + int ret;
>> +
>> + map_irq.domid = DOMID_SELF;
>> + map_irq.type = MAP_PIRQ_TYPE_GSI;
>> + map_irq.index = gsi_info->gsi;
>> + map_irq.pirq = gsi_info->gsi;
>> +
>> + ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
>> + gsi_info->pirq = map_irq.pirq;
>> +
>> + return ret;
>> +}
>> +
>> +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
>> +{
>> + struct physdev_unmap_pirq unmap_irq;
>> +
>> + unmap_irq.domid = DOMID_SELF;
>> + unmap_irq.pirq = gsi_info->pirq;
>> +
>> + return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
>> +}
>> +
>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>> +{
>> + struct physdev_setup_gsi setup_gsi;
>> +
>> + setup_gsi.gsi = gsi_info->gsi;
>> + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>> + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>> +
>> + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>> +}
>> +
>> +int xen_pvh_passthrough_gsi(struct pci_dev *dev)
>> +{
>> + int ret;
>> + gsi_info_t gsi_info;
>> +
>> + if (!dev) {
>> + return -EINVAL;
>> + }
>> +
>> + ret = xen_pvh_get_gsi_info(dev, &gsi_info);
>> + if (ret) {
>> + xen_raw_printk("Fail to get gsi info!\n");
>> + return ret;
>> + }
>> +
>> + ret = xen_pvh_map_pirq(&gsi_info);
>> + if (ret) {
>> + xen_raw_printk("Fail to map pirq for gsi (%d)!\n", gsi_info.gsi);
>> + return ret;
>> + }
>> +
>> + ret = xen_pvh_setup_gsi(&gsi_info);
>> + if (ret == -EEXIST) {
>> + ret = 0;
>> + xen_raw_printk("Already setup the GSI :%u\n", gsi_info.gsi);
>> + } else if (ret) {
>> + xen_raw_printk("Fail to setup gsi (%d)!\n", gsi_info.gsi);
>> + xen_pvh_unmap_pirq(&gsi_info);
>> + }
>> +
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
>> +
>> void __init xen_pvh_init(struct boot_params *boot_params)
>> {
>> u32 msr;
>> diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
>> index ff30ceca2203..630fe0a34bc6 100644
>> --- a/drivers/acpi/pci_irq.c
>> +++ b/drivers/acpi/pci_irq.c
>> @@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
>> }
>> #endif /* CONFIG_X86_IO_APIC */
>>
>> -static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>> {
>> struct acpi_prt_entry *entry = NULL;
>> struct pci_dev *bridge;
>> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
>> index e34b623e4b41..1abd4dad6f40 100644
>> --- a/drivers/xen/xen-pciback/pci_stub.c
>> +++ b/drivers/xen/xen-pciback/pci_stub.c
>> @@ -20,6 +20,7 @@
>> #include <linux/atomic.h>
>> #include <xen/events.h>
>> #include <xen/pci.h>
>> +#include <xen/acpi.h>
>> #include <xen/xen.h>
>> #include <asm/xen/hypervisor.h>
>> #include <xen/interface/physdev.h>
>> @@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
>> if (err)
>> goto config_release;
>>
>> + if (xen_initial_domain() && xen_pvh_domain()) {
>> + err = xen_pvh_passthrough_gsi(dev);
>> + if (err)
>> + goto config_release;
>> + }
>> +
>> if (dev->msix_cap) {
>> struct physdev_pci_device ppdev = {
>> .seg = pci_domain_nr(dev->bus),
>> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
>> index 641dc4843987..368d56ba2c5e 100644
>> --- a/include/linux/acpi.h
>> +++ b/include/linux/acpi.h
>> @@ -375,6 +375,7 @@ void acpi_unregister_gsi (u32 gsi);
>>
>> struct pci_dev;
>>
>> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin);
>> int acpi_pci_irq_enable (struct pci_dev *dev);
>> void acpi_penalize_isa_irq(int irq, int active);
>> bool acpi_isa_irq_available(int irq);
>> diff --git a/include/xen/acpi.h b/include/xen/acpi.h
>> index b1e11863144d..ce7f5554f88e 100644
>> --- a/include/xen/acpi.h
>> +++ b/include/xen/acpi.h
>> @@ -67,6 +67,7 @@ static inline void xen_acpi_sleep_register(void)
>> acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
>> }
>> }
>> +int xen_pvh_passthrough_gsi(struct pci_dev *dev);
>> #else
>> static inline void xen_acpi_sleep_register(void)
>> {
>>
>>>
>>> Jan
>>
>> --
>> Best regards,
>> Jiqian Chen.
>>


Attachments:
OpenPGP_0xB0DE9DD628BF132F.asc (3.66 kB)
OpenPGP public key
OpenPGP_signature.asc (505.00 B)
OpenPGP digital signature
Download all attachments

2023-12-08 05:53:53

by Jiqian Chen

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

Thank Stefano and Juergen, I will use this approach in next version.

On 2023/12/7 14:43, Juergen Gross wrote:
> On 07.12.23 03:18, Stefano Stabellini wrote:
>> On Tue, 5 Dec 2023, Chen, Jiqian wrote:
>>> When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
>>> I have a version of patch which tried that way, see below:
>>
>> This approach looks much better. I think this patch is OKish. Juergen,
>> what do you think?
>
> The approach seems to be fine.
>
>
> Juergen
>
>>
>>
>>> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
>>> index ada3868c02c2..43e1bda9f946 100644
>>> --- a/arch/x86/xen/enlighten_pvh.c
>>> +++ b/arch/x86/xen/enlighten_pvh.c
>>> @@ -1,6 +1,7 @@
>>>   // SPDX-License-Identifier: GPL-2.0
>>>   #include <linux/acpi.h>
>>>   #include <linux/export.h>
>>> +#include <linux/pci.h>
>>>
>>>   #include <xen/hvc-console.h>
>>>
>>> @@ -25,6 +26,127 @@
>>>   bool __ro_after_init xen_pvh;
>>>   EXPORT_SYMBOL_GPL(xen_pvh);
>>>
>>> +typedef struct gsi_info {
>>> +       int gsi;
>>> +       int trigger;
>>> +       int polarity;
>>> +       int pirq;
>>> +} gsi_info_t;
>>> +
>>> +struct acpi_prt_entry {
>>> +       struct acpi_pci_id      id;
>>> +       u8                      pin;
>>> +       acpi_handle             link;
>>> +       u32                     index;          /* GSI, or link _CRS index */
>>> +};
>>> +
>>> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
>>> +                                                               gsi_info_t *gsi_info)
>>> +{
>>> +       int gsi;
>>> +       u8 pin = 0;
>>> +       struct acpi_prt_entry *entry;
>>> +       int trigger = ACPI_LEVEL_SENSITIVE;
>>> +       int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
>>> +                                     ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
>>> +
>>> +       if (dev)
>>> +               pin = dev->pin;
>>> +       if (!pin) {
>>> +               xen_raw_printk("No interrupt pin configured\n");
>>> +               return -EINVAL;
>>> +       }
>>> +
>>> +       entry = acpi_pci_irq_lookup(dev, pin);
>>> +       if (entry) {
>>> +               if (entry->link)
>>> +                       gsi = acpi_pci_link_allocate_irq(entry->link,
>>> +                                                        entry->index,
>>> +                                                        &trigger, &polarity,
>>> +                                                        NULL);
>>> +               else
>>> +                       gsi = entry->index;
>>> +       } else
>>> +               return -EINVAL;
>>> +
>>> +       gsi_info->gsi = gsi;
>>> +       gsi_info->trigger = trigger;
>>> +       gsi_info->polarity = polarity;
>>> +
>>> +       return 0;
>>> +}
>>> +
>>> +static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
>>> +{
>>> +       struct physdev_map_pirq map_irq;
>>> +       int ret;
>>> +
>>> +       map_irq.domid = DOMID_SELF;
>>> +       map_irq.type = MAP_PIRQ_TYPE_GSI;
>>> +       map_irq.index = gsi_info->gsi;
>>> +       map_irq.pirq = gsi_info->gsi;
>>> +
>>> +       ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
>>> +       gsi_info->pirq = map_irq.pirq;
>>> +
>>> +       return ret;
>>> +}
>>> +
>>> +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
>>> +{
>>> +       struct physdev_unmap_pirq unmap_irq;
>>> +
>>> +       unmap_irq.domid = DOMID_SELF;
>>> +       unmap_irq.pirq = gsi_info->pirq;
>>> +
>>> +       return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
>>> +}
>>> +
>>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>>> +{
>>> +       struct physdev_setup_gsi setup_gsi;
>>> +
>>> +       setup_gsi.gsi = gsi_info->gsi;
>>> +       setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>>> +       setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>>> +
>>> +       return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>>> +}
>>> +
>>> +int xen_pvh_passthrough_gsi(struct pci_dev *dev)
>>> +{
>>> +       int ret;
>>> +       gsi_info_t gsi_info;
>>> +
>>> +       if (!dev) {
>>> +               return -EINVAL;
>>> +       }
>>> +
>>> +       ret = xen_pvh_get_gsi_info(dev, &gsi_info);
>>> +       if (ret) {
>>> +               xen_raw_printk("Fail to get gsi info!\n");
>>> +               return ret;
>>> +       }
>>> +
>>> +       ret = xen_pvh_map_pirq(&gsi_info);
>>> +       if (ret) {
>>> +               xen_raw_printk("Fail to map pirq for gsi (%d)!\n", gsi_info.gsi);
>>> +               return ret;
>>> +       }
>>> +
>>> +       ret = xen_pvh_setup_gsi(&gsi_info);
>>> +       if (ret == -EEXIST) {
>>> +               ret = 0;
>>> +               xen_raw_printk("Already setup the GSI :%u\n", gsi_info.gsi);
>>> +       } else if (ret) {
>>> +               xen_raw_printk("Fail to setup gsi (%d)!\n", gsi_info.gsi);
>>> +               xen_pvh_unmap_pirq(&gsi_info);
>>> +       }
>>> +
>>> +       return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
>>> +
>>>   void __init xen_pvh_init(struct boot_params *boot_params)
>>>   {
>>>          u32 msr;
>>> diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
>>> index ff30ceca2203..630fe0a34bc6 100644
>>> --- a/drivers/acpi/pci_irq.c
>>> +++ b/drivers/acpi/pci_irq.c
>>> @@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
>>>   }
>>>   #endif /* CONFIG_X86_IO_APIC */
>>>
>>> -static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>>> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>>>   {
>>>          struct acpi_prt_entry *entry = NULL;
>>>          struct pci_dev *bridge;
>>> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
>>> index e34b623e4b41..1abd4dad6f40 100644
>>> --- a/drivers/xen/xen-pciback/pci_stub.c
>>> +++ b/drivers/xen/xen-pciback/pci_stub.c
>>> @@ -20,6 +20,7 @@
>>>   #include <linux/atomic.h>
>>>   #include <xen/events.h>
>>>   #include <xen/pci.h>
>>> +#include <xen/acpi.h>
>>>   #include <xen/xen.h>
>>>   #include <asm/xen/hypervisor.h>
>>>   #include <xen/interface/physdev.h>
>>> @@ -399,6 +400,12 @@ static int pcistub_init_device(struct pci_dev *dev)
>>>          if (err)
>>>                  goto config_release;
>>>
>>> +       if (xen_initial_domain() && xen_pvh_domain()) {
>>> +               err = xen_pvh_passthrough_gsi(dev);
>>> +               if (err)
>>> +                       goto config_release;
>>> +       }
>>> +
>>>          if (dev->msix_cap) {
>>>                  struct physdev_pci_device ppdev = {
>>>                          .seg = pci_domain_nr(dev->bus),
>>> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
>>> index 641dc4843987..368d56ba2c5e 100644
>>> --- a/include/linux/acpi.h
>>> +++ b/include/linux/acpi.h
>>> @@ -375,6 +375,7 @@ void acpi_unregister_gsi (u32 gsi);
>>>
>>>   struct pci_dev;
>>>
>>> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin);
>>>   int acpi_pci_irq_enable (struct pci_dev *dev);
>>>   void acpi_penalize_isa_irq(int irq, int active);
>>>   bool acpi_isa_irq_available(int irq);
>>> diff --git a/include/xen/acpi.h b/include/xen/acpi.h
>>> index b1e11863144d..ce7f5554f88e 100644
>>> --- a/include/xen/acpi.h
>>> +++ b/include/xen/acpi.h
>>> @@ -67,6 +67,7 @@ static inline void xen_acpi_sleep_register(void)
>>>                  acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
>>>          }
>>>   }
>>> +int xen_pvh_passthrough_gsi(struct pci_dev *dev);
>>>   #else
>>>   static inline void xen_acpi_sleep_register(void)
>>>   {
>>>
>>>>
>>>> Jan
>>>
>>> -- 
>>> Best regards,
>>> Jiqian Chen.
>>>
>

--
Best regards,
Jiqian Chen.

2023-12-11 15:46:09

by Roger Pau Monné

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
> On 2023/12/5 18:32, Jan Beulich wrote:
> > On 05.12.2023 10:19, Roger Pau Monné wrote:
> >> On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote:
> >>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
> >>>> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
> >>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
> >>>>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
> >>>>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
> >>>>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
> >>>>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
> >>>>>>>>>> This patch is to solve two problems we encountered when we try to
> >>>>>>>>>> passthrough a device to hvm domU base on Xen PVH dom0.
> >>>>>>>>>>
> >>>>>>>>>> First, hvm guest will alloc a pirq and irq for a passthrough device
> >>>>>>>>>> by using gsi, before that, the gsi must first has a mapping in dom0,
> >>>>>>>>>> see Xen code pci_add_dm_done->xc_domain_irq_permission, it will call
> >>>>>>>>>> into Xen and check whether dom0 has the mapping. See
> >>>>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted, "current" is PVH
> >>>>>>>>>> dom0 and it return irq is 0, and then return -EPERM.
> >>>>>>>>>> This is because the passthrough device doesn't do PHYSDEVOP_map_pirq
> >>>>>>>>>> when thay are enabled.
> >>>>>>>>>>
> >>>>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get
> >>>>>>>>>> registered, but gsi must be configured for it to be able to be
> >>>>>>>>>> mapped into a domU.
> >>>>>>>>>>
> >>>>>>>>>> After searching codes, we can find map_pirq and register_gsi will be
> >>>>>>>>>> done in function vioapic_write_redirent->vioapic_hwdom_map_gsi when
> >>>>>>>>>> the gsi(aka ioapic's pin) is unmasked in PVH dom0. So the problems
> >>>>>>>>>> can be conclude to that the gsi of a passthrough device doesn't be
> >>>>>>>>>> unmasked.
> >>>>>>>>>>
> >>>>>>>>>> To solve the unmaske problem, this patch call the unmask_irq when we
> >>>>>>>>>> assign a device to be passthrough. So that the gsi can get registered
> >>>>>>>>>> and mapped in PVH dom0.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
> >>>>>>>>> we need the unmask check in Xen? Couldn't we just do:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> >>>>>>>>> index 4e40d3609a..df262a4a18 100644
> >>>>>>>>> --- a/xen/arch/x86/hvm/vioapic.c
> >>>>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
> >>>>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
> >>>>>>>>> hvm_dpci_eoi(d, gsi);
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> - if ( is_hardware_domain(d) && unmasked )
> >>>>>>>>> + if ( is_hardware_domain(d) )
> >>>>>>>>> {
> >>>>>>>>> /*
> >>>>>>>>> * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
> >>>>>>>>
> >>>>>>>> There are some issues with this approach.
> >>>>>>>>
> >>>>>>>> mp_register_gsi() will only setup the trigger and polarity of the
> >>>>>>>> IO-APIC pin once, so we do so once the guest unmask the pin in order
> >>>>>>>> to assert that the configuration is the intended one. A guest is
> >>>>>>>> allowed to write all kind of nonsense stuff to the IO-APIC RTE, but
> >>>>>>>> that doesn't take effect unless the pin is unmasked.
> >>>>>>>>
> >>>>>>>> Overall the question would be whether we have any guarantees that
> >>>>>>>> the hardware domain has properly configured the pin, even if it's not
> >>>>>>>> using it itself (as it hasn't been unmasked).
> >>>>>>>>
> >>>>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we
> >>>>>>>> could configure any pins that are not setup at bind time?
> >>>>>>>
> >>>>>>> That could work.
> >>>>>>>
> >>>>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at
> >>>>>>> bind time? That might be enough to pass a pirq_access_permitted check.
> >>>>>>
> >>>>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
> >>>>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
> >>>>>> parameter would be a GSI instead of a previously mapped IRQ). Such
> >>>>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
> >>>>>> route I would recommend that we instead introduce a new dmop that has
> >>>>>> this syntax regardless of the domain type it's called from.
> >>>>>
> >>>>> Looking at the code it is certainly a bit confusing. My point was that
> >>>>> we don't need to wait until polarity and trigger are set appropriately
> >>>>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
> >>>>> should be able to figure out that Dom0 is permitted pirq access.
> >>>>
> >>>> The logic is certainly not straightforward, and it could benefit from
> >>>> some comments.
> >>>>
> >>>> The irq permissions are a bit special, in that they get setup when the
> >>>> IRQ is mapped.
> >>>>
> >>>> The problem however is not so much with IRQ permissions, that we can
> >>>> indeed sort out internally in Xen. Such check in dom0 has the side
> >>>> effect of preventing the IRQ from being assigned to a domU without the
> >>>> hardware source being properly configured AFAICT.
> >>>
> >>> Now I understand why you made a comment previously about Xen having to
> >>> configure trigger and polarity for these interrupts on its own.
> >>>
> >>>
> >>>>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
> >>>>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
> >>>>> polarity to be configured to work. But the suggestion of doing it a
> >>>>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
> >>>>>
> >>>>> But maybe we can find another location, maybe within
> >>>>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
> >>>>> trigger and polarity are set and before the interrupt is unmasked.
> >>>>>
> >>>>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
> >>>>> call to allocate_and_map_gsi_pirq, because by the time
> >>>>> vioapic_hwdom_map_gsi we assume that allocate_and_map_gsi_pirq had
> >>>>> already been done.
> >>>>
> >>>> But then we would end up in a situation where the
> >>>> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
> >>>> configured, which I think it's not what we want.
> >>>>
> >>>> One option would be to allow mp_register_gsi() to be called multiple
> >>>> times, and update the IO-APIC pin configuration as long as the pin is
> >>>> not unmasked. That would propagate each dom0 RTE update to the
> >>>> underlying IO-APIC. However such approach relies on dom0 configuring
> >>>> all possible IO-APIC pins, even if no device on dom0 is using them, I
> >>>> think it's not a very reliable option.
> >>>>
> >>>> Another option would be to modify the toolstack to setup the GSI
> >>>> itself using the PHYSDEVOP_setup_gsi hypercall. As said in a previous
> >>>> email, since we only care about PCI device passthrough the legacy INTx
> >>>> should always be level triggered and low polarity.
> >>>>
> >>>>> I am not familiar with vioapic.c but to give you an idea of what I was
> >>>>> thinking:
> >>>>>
> >>>>>
> >>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
> >>>>> index 4e40d3609a..16d56fe851 100644
> >>>>> --- a/xen/arch/x86/hvm/vioapic.c
> >>>>> +++ b/xen/arch/x86/hvm/vioapic.c
> >>>>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
> >>>>> return ret;
> >>>>> }
> >>>>>
> >>>>> - ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> >>>>> - if ( ret )
> >>>>> - {
> >>>>> - gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> >>>>> - gsi, ret);
> >>>>> - return ret;
> >>>>> - }
> >>>>> -
> >>>>> pcidevs_lock();
> >>>>> ret = pt_irq_create_bind(currd, &pt_irq_bind);
> >>>>> if ( ret )
> >>>>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
> >>>>> hvm_dpci_eoi(d, gsi);
> >>>>> }
> >>>>>
> >>>>> + if ( is_hardware_domain(d) )
> >>>>> + {
> >>>>> + int pirq = gsi, ret;
> >>>>> + ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
> >>>>> + if ( ret )
> >>>>> + {
> >>>>> + gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
> >>>>> + gsi, ret);
> >>>>> + return ret;
> >>>>> + }
> >>>>> + }
> >>>>> if ( is_hardware_domain(d) && unmasked )
> >>>>> {
> >>>>> /*
> >>>>
> >>>> As said above, such approach relies on dom0 writing to the IO-APIC RTE
> >>>> of likely each IO-APIC pin, which is IMO not quite reliable. In there
> >>>> are two different issues here that need to be fixed for PVH dom0:
> >>>>
> >>>> - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
> >>>> succeed for a PVH dom0, even if dom0 is not using the GSI itself.
> >>>
> >>> Yes makes sense
> >>>
> >>>
> >>>> - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
> >>>> the IO-APIC pin itself.
> >>>>
> >>>> First one needs to be fixed internally in Xen, second one will require
> >>>> the toolstack to issue an extra hypercall in order to ensure the
> >>>> IO-APIC pin is properly configured.
> >>>
> >>> On ARM, Xen doesn't need to wait for dom0 to configure interrupts
> >>> correctly. Xen configures them all on its own at boot based on Device
> >>> Tree information. I guess it is not possible to do the same on x86?
> >>
> >> No, not exactly. There's some interrupt information in the ACPI MADT,
> >> but that's just for very specific sources (Interrupt Source Override
> >> Structures)
> >>
> >> Then on AML devices can have resource descriptors that contain
> >> information about how interrupts are setup. However Xen is not able
> >> to read any of this information on AML.
> >>
> >> Legacy PCI interrupts are (always?) level triggered and low polarity,
> >> because it's assumed that an interrupt source can be shared between
> >> multiple devices.
> >
> > Except that as per what you said just in the earlier paragraph ACPI can
> > tell us otherwise.
> >
> >> I'm however not able to find any reference to this in the PCI spec,
> >> hence I'm reluctant to take this for granted in Xen, and default all
> >> GSIs >= 16 to such mode.
> >>
> >> OTOH legacy PCI interrupts are not that used anymore, as almost all
> >> devices will support MSI(-X) (because PCIe mandates it) and OSes
> >> should prefer the latter. SR-IOV VF don't even support legacy PCI
> >> interrupts anymore.
> >>
> >>> If
> >>> not, then I can see why we would need 1 extra toolstack hypercall for
> >>> that (or to bundle the operation of configuring IO-APIC pins together
> >>> with an existing toolstack hypercall).
> >>
> >> One suitable compromise would be to default unconfigured GSIs >= 16 to
> >> level-triggered and low-polarity, as I would expect that to work in
> >> almost all cases. We can always introduce the usage of
> >> PHYSDEVOP_setup_gsi later if required.
> >>
> >> Maybe Jan has more input here, would you agree to defaulting non-ISA
> >> GSIs to level-triggered, low-polarity in the absence of a specific
> >> setup provided by dom0?
> >
> > Well, such defaulting is an option, but in case it's wrong we might
> > end up with hard to diagnose issues. Personally I'd prefer if we
> > didn't take shortcuts here, i.e. if we followed what Dom0 is able
> > to read from ACPI.
>
> When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
> I have a version of patch which tried that way, see below:
>
> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
> index ada3868c02c2..43e1bda9f946 100644
> --- a/arch/x86/xen/enlighten_pvh.c
> +++ b/arch/x86/xen/enlighten_pvh.c
> @@ -1,6 +1,7 @@
> // SPDX-License-Identifier: GPL-2.0
> #include <linux/acpi.h>
> #include <linux/export.h>
> +#include <linux/pci.h>
>
> #include <xen/hvc-console.h>
>
> @@ -25,6 +26,127 @@
> bool __ro_after_init xen_pvh;
> EXPORT_SYMBOL_GPL(xen_pvh);
>
> +typedef struct gsi_info {
> + int gsi;
> + int trigger;
> + int polarity;
> + int pirq;
> +} gsi_info_t;
> +
> +struct acpi_prt_entry {
> + struct acpi_pci_id id;
> + u8 pin;
> + acpi_handle link;
> + u32 index; /* GSI, or link _CRS index */
> +};
> +
> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
> + gsi_info_t *gsi_info)
> +{
> + int gsi;
> + u8 pin = 0;
> + struct acpi_prt_entry *entry;
> + int trigger = ACPI_LEVEL_SENSITIVE;
> + int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
> + ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
> +
> + if (dev)
> + pin = dev->pin;
> + if (!pin) {
> + xen_raw_printk("No interrupt pin configured\n");
> + return -EINVAL;
> + }
> +
> + entry = acpi_pci_irq_lookup(dev, pin);
> + if (entry) {
> + if (entry->link)
> + gsi = acpi_pci_link_allocate_irq(entry->link,
> + entry->index,
> + &trigger, &polarity,
> + NULL);
> + else
> + gsi = entry->index;
> + } else
> + return -EINVAL;
> +
> + gsi_info->gsi = gsi;
> + gsi_info->trigger = trigger;
> + gsi_info->polarity = polarity;
> +
> + return 0;
> +}
> +
> +static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
> +{
> + struct physdev_map_pirq map_irq;
> + int ret;
> +
> + map_irq.domid = DOMID_SELF;
> + map_irq.type = MAP_PIRQ_TYPE_GSI;
> + map_irq.index = gsi_info->gsi;
> + map_irq.pirq = gsi_info->gsi;
> +
> + ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
> + gsi_info->pirq = map_irq.pirq;
> +
> + return ret;
> +}
> +
> +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
> +{
> + struct physdev_unmap_pirq unmap_irq;
> +
> + unmap_irq.domid = DOMID_SELF;
> + unmap_irq.pirq = gsi_info->pirq;
> +
> + return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
> +}
> +
> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
> +{
> + struct physdev_setup_gsi setup_gsi;
> +
> + setup_gsi.gsi = gsi_info->gsi;
> + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
> + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
> +
> + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
> +}

Hm, why not simply call pcibios_enable_device() from pciback? What
you are doing here using the hypercalls is a backdoor into what's done
automatically by Xen on IO-APIC accesses by a PVH dom0.

It will be much more natural for the PVH dom0 model to simply use the
native way to configure and unmask the IO-APIC pin, and that would
correctly setup the triggering/polarity and bind it to dom0 without
requiring the usage of any hypercalls.

Is that an issue since in that case the gsi will get mapped and bound
to dom0?

Otherwise I would prefer if the gsi is just configured from pciback
(PHYSDEVOP_setup_gsi) but not mapped, as allowing a PVH dom0 to map
interrupts using PHYSDEVOP_{,un}map_pirq to itself introduces a new
interface to manage interrupts that clashes with the native way that a
PVH dom0 uses.

Thanks, Roger.

2023-12-12 06:24:10

by Jiqian Chen

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On 2023/12/11 23:45, Roger Pau Monné wrote:
> On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
>>
>> When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
>> I have a version of patch which tried that way, see below:
>>
>> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
>> index ada3868c02c2..43e1bda9f946 100644
>> --- a/arch/x86/xen/enlighten_pvh.c
>> +++ b/arch/x86/xen/enlighten_pvh.c
>> @@ -1,6 +1,7 @@
>> // SPDX-License-Identifier: GPL-2.0
>> #include <linux/acpi.h>
>> #include <linux/export.h>
>> +#include <linux/pci.h>
>>
>> #include <xen/hvc-console.h>
>>
>> @@ -25,6 +26,127 @@
>> bool __ro_after_init xen_pvh;
>> EXPORT_SYMBOL_GPL(xen_pvh);
>>
>> +typedef struct gsi_info {
>> + int gsi;
>> + int trigger;
>> + int polarity;
>> + int pirq;
>> +} gsi_info_t;
>> +
>> +struct acpi_prt_entry {
>> + struct acpi_pci_id id;
>> + u8 pin;
>> + acpi_handle link;
>> + u32 index; /* GSI, or link _CRS index */
>> +};
>> +
>> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
>> + gsi_info_t *gsi_info)
>> +{
>> + int gsi;
>> + u8 pin = 0;
>> + struct acpi_prt_entry *entry;
>> + int trigger = ACPI_LEVEL_SENSITIVE;
>> + int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
>> + ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
>> +
>> + if (dev)
>> + pin = dev->pin;
>> + if (!pin) {
>> + xen_raw_printk("No interrupt pin configured\n");
>> + return -EINVAL;
>> + }
>> +
>> + entry = acpi_pci_irq_lookup(dev, pin);
>> + if (entry) {
>> + if (entry->link)
>> + gsi = acpi_pci_link_allocate_irq(entry->link,
>> + entry->index,
>> + &trigger, &polarity,
>> + NULL);
>> + else
>> + gsi = entry->index;
>> + } else
>> + return -EINVAL;
>> +
>> + gsi_info->gsi = gsi;
>> + gsi_info->trigger = trigger;
>> + gsi_info->polarity = polarity;
>> +
>> + return 0;
>> +}
>> +
>> +static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
>> +{
>> + struct physdev_map_pirq map_irq;
>> + int ret;
>> +
>> + map_irq.domid = DOMID_SELF;
>> + map_irq.type = MAP_PIRQ_TYPE_GSI;
>> + map_irq.index = gsi_info->gsi;
>> + map_irq.pirq = gsi_info->gsi;
>> +
>> + ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
>> + gsi_info->pirq = map_irq.pirq;
>> +
>> + return ret;
>> +}
>> +
>> +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
>> +{
>> + struct physdev_unmap_pirq unmap_irq;
>> +
>> + unmap_irq.domid = DOMID_SELF;
>> + unmap_irq.pirq = gsi_info->pirq;
>> +
>> + return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
>> +}
>> +
>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>> +{
>> + struct physdev_setup_gsi setup_gsi;
>> +
>> + setup_gsi.gsi = gsi_info->gsi;
>> + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>> + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>> +
>> + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>> +}
>
> Hm, why not simply call pcibios_enable_device() from pciback? What
pcibios_enable_device had been called when using cmd "xl pci-assignable-add sbdf" from pciback. But it didn't do map_pirq and setup_gsi.
Because pcibios_enable_device-> pcibios_enable_irq-> __acpi_register_gsi(acpi_register_gsi_ioapic PVH specific)
> you are doing here using the hypercalls is a backdoor into what's done
> automatically by Xen on IO-APIC accesses by a PVH dom0.
But the gsi didn't be unmasked, and vioapic_hwdom_map_gsi is never called.
So, I think in pciback, if we can do what vioapic_hwdom_map_gsi does.
>
> It will be much more natural for the PVH dom0 model to simply use the
> native way to configure and unmask the IO-APIC pin, and that would
> correctly setup the triggering/polarity and bind it to dom0 without
> requiring the usage of any hypercalls.
Do you still prefer that I called unmask_irq in pcistub_init_device, as this v2 patch do?
But Thomas Gleixner think it is not suitable to export unmask_irq.
>
> Is that an issue since in that case the gsi will get mapped and bound
> to dom0?
Dom0 do map_pirq is to pass the check xc_domain_irq_permission()-> pirq_access_permitted(),
>
> Otherwise I would prefer if the gsi is just configured from pciback
> (PHYSDEVOP_setup_gsi) but not mapped, as allowing a PVH dom0 to map
> interrupts using PHYSDEVOP_{,un}map_pirq to itself introduces a new
> interface to manage interrupts that clashes with the native way that a
> PVH dom0 uses.
This method does map_pirq and setup_gsi only when a device is assigned to passthrough, it won't affect the other device using native way.

>
> Thanks, Roger.

--
Best regards,
Jiqian Chen.

2023-12-12 08:49:31

by Roger Pau Monné

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On Tue, Dec 12, 2023 at 06:16:43AM +0000, Chen, Jiqian wrote:
> On 2023/12/11 23:45, Roger Pau Monné wrote:
> > On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
> >>
> >> When PVH dom0 enable a device, it will get trigger and polarity from ACPI (see acpi_pci_irq_enable)
> >> I have a version of patch which tried that way, see below:
> >>
> >> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
> >> index ada3868c02c2..43e1bda9f946 100644
> >> --- a/arch/x86/xen/enlighten_pvh.c
> >> +++ b/arch/x86/xen/enlighten_pvh.c
> >> @@ -1,6 +1,7 @@
> >> // SPDX-License-Identifier: GPL-2.0
> >> #include <linux/acpi.h>
> >> #include <linux/export.h>
> >> +#include <linux/pci.h>
> >>
> >> #include <xen/hvc-console.h>
> >>
> >> @@ -25,6 +26,127 @@
> >> bool __ro_after_init xen_pvh;
> >> EXPORT_SYMBOL_GPL(xen_pvh);
> >>
> >> +typedef struct gsi_info {
> >> + int gsi;
> >> + int trigger;
> >> + int polarity;
> >> + int pirq;
> >> +} gsi_info_t;
> >> +
> >> +struct acpi_prt_entry {
> >> + struct acpi_pci_id id;
> >> + u8 pin;
> >> + acpi_handle link;
> >> + u32 index; /* GSI, or link _CRS index */
> >> +};
> >> +
> >> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
> >> + gsi_info_t *gsi_info)
> >> +{
> >> + int gsi;
> >> + u8 pin = 0;
> >> + struct acpi_prt_entry *entry;
> >> + int trigger = ACPI_LEVEL_SENSITIVE;
> >> + int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
> >> + ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
> >> +
> >> + if (dev)
> >> + pin = dev->pin;
> >> + if (!pin) {
> >> + xen_raw_printk("No interrupt pin configured\n");
> >> + return -EINVAL;
> >> + }
> >> +
> >> + entry = acpi_pci_irq_lookup(dev, pin);
> >> + if (entry) {
> >> + if (entry->link)
> >> + gsi = acpi_pci_link_allocate_irq(entry->link,
> >> + entry->index,
> >> + &trigger, &polarity,
> >> + NULL);
> >> + else
> >> + gsi = entry->index;
> >> + } else
> >> + return -EINVAL;
> >> +
> >> + gsi_info->gsi = gsi;
> >> + gsi_info->trigger = trigger;
> >> + gsi_info->polarity = polarity;
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +static int xen_pvh_map_pirq(gsi_info_t *gsi_info)
> >> +{
> >> + struct physdev_map_pirq map_irq;
> >> + int ret;
> >> +
> >> + map_irq.domid = DOMID_SELF;
> >> + map_irq.type = MAP_PIRQ_TYPE_GSI;
> >> + map_irq.index = gsi_info->gsi;
> >> + map_irq.pirq = gsi_info->gsi;
> >> +
> >> + ret = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map_irq);
> >> + gsi_info->pirq = map_irq.pirq;
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +static int xen_pvh_unmap_pirq(gsi_info_t *gsi_info)
> >> +{
> >> + struct physdev_unmap_pirq unmap_irq;
> >> +
> >> + unmap_irq.domid = DOMID_SELF;
> >> + unmap_irq.pirq = gsi_info->pirq;
> >> +
> >> + return HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
> >> +}
> >> +
> >> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
> >> +{
> >> + struct physdev_setup_gsi setup_gsi;
> >> +
> >> + setup_gsi.gsi = gsi_info->gsi;
> >> + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
> >> + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
> >> +
> >> + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
> >> +}
> >
> > Hm, why not simply call pcibios_enable_device() from pciback? What
> pcibios_enable_device had been called when using cmd "xl pci-assignable-add sbdf" from pciback. But it didn't do map_pirq and setup_gsi.
> Because pcibios_enable_device-> pcibios_enable_irq-> __acpi_register_gsi(acpi_register_gsi_ioapic PVH specific)
> > you are doing here using the hypercalls is a backdoor into what's done
> > automatically by Xen on IO-APIC accesses by a PVH dom0.
> But the gsi didn't be unmasked, and vioapic_hwdom_map_gsi is never called.
> So, I think in pciback, if we can do what vioapic_hwdom_map_gsi does.
>

I see, it does setup the IO-APIC pin but doesn't unmask it, that's
what I feared.

> > It will be much more natural for the PVH dom0 model to simply use the
> > native way to configure and unmask the IO-APIC pin, and that would
> > correctly setup the triggering/polarity and bind it to dom0 without
> > requiring the usage of any hypercalls.
> Do you still prefer that I called unmask_irq in pcistub_init_device, as this v2 patch do?
> But Thomas Gleixner think it is not suitable to export unmask_irq.

Yeah, that wasn't good.

> >
> > Is that an issue since in that case the gsi will get mapped and bound
> > to dom0?
> Dom0 do map_pirq is to pass the check xc_domain_irq_permission()-> pirq_access_permitted(),

Can we see about finding another way to fix this check?

One option would be granting permissions over the IRQ in
PHYSDEVOP_setup_gsi?

Otherwise we could see about modifying the logic in PHYSDEVOP_map_pirq
so that the hardware domain can map IRQs to other domains without
having it mapped to itself first?

I think the call to PHYSDEVOP_setup_gsi in pciback is fine, but I
would really like to avoid the usage of PHYSDEVOP_{,un}map_pirq on a
PVH dom0 against itself.

> >
> > Otherwise I would prefer if the gsi is just configured from pciback
> > (PHYSDEVOP_setup_gsi) but not mapped, as allowing a PVH dom0 to map
> > interrupts using PHYSDEVOP_{,un}map_pirq to itself introduces a new
> > interface to manage interrupts that clashes with the native way that a
> > PVH dom0 uses.
> This method does map_pirq and setup_gsi only when a device is assigned to passthrough, it won't affect the other device using native way.

It's not affected because of the specific usage in Linux, but allowing
the interface to be used against itself (so to manage interrupts
from assigned to dom0) is opening a whole new way to setup interrupts,
and it's unclear to me how that will affect the current way we use to
manage interrupts on a PVH dom0.

Thanks, Roger.

2023-12-12 09:38:44

by Jan Beulich

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

(I think the Cc list is too long here, but then I don't know who to
keep and who to possibly drop.)

On 12.12.2023 09:49, Roger Pau Monné wrote:
> On Tue, Dec 12, 2023 at 06:16:43AM +0000, Chen, Jiqian wrote:
>> On 2023/12/11 23:45, Roger Pau Monné wrote:
>>> On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
>>>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>>>> +{
>>>> + struct physdev_setup_gsi setup_gsi;
>>>> +
>>>> + setup_gsi.gsi = gsi_info->gsi;
>>>> + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>>>> + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>>>> +
>>>> + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>>>> +}
>>>
>>> Hm, why not simply call pcibios_enable_device() from pciback? What
>> pcibios_enable_device had been called when using cmd "xl pci-assignable-add sbdf" from pciback. But it didn't do map_pirq and setup_gsi.
>> Because pcibios_enable_device-> pcibios_enable_irq-> __acpi_register_gsi(acpi_register_gsi_ioapic PVH specific)
>>> you are doing here using the hypercalls is a backdoor into what's done
>>> automatically by Xen on IO-APIC accesses by a PVH dom0.
>> But the gsi didn't be unmasked, and vioapic_hwdom_map_gsi is never called.
>> So, I think in pciback, if we can do what vioapic_hwdom_map_gsi does.
>>
>
> I see, it does setup the IO-APIC pin but doesn't unmask it, that's
> what I feared.
>
>>> It will be much more natural for the PVH dom0 model to simply use the
>>> native way to configure and unmask the IO-APIC pin, and that would
>>> correctly setup the triggering/polarity and bind it to dom0 without
>>> requiring the usage of any hypercalls.
>> Do you still prefer that I called unmask_irq in pcistub_init_device, as this v2 patch do?
>> But Thomas Gleixner think it is not suitable to export unmask_irq.
>
> Yeah, that wasn't good.
>
>>>
>>> Is that an issue since in that case the gsi will get mapped and bound
>>> to dom0?
>> Dom0 do map_pirq is to pass the check xc_domain_irq_permission()-> pirq_access_permitted(),
>
> Can we see about finding another way to fix this check?
>
> One option would be granting permissions over the IRQ in
> PHYSDEVOP_setup_gsi?

There's no domain available there, and imo it's also the wrong interface to
possibly grant any permissions.

> Otherwise we could see about modifying the logic in PHYSDEVOP_map_pirq
> so that the hardware domain can map IRQs to other domains without
> having it mapped to itself first?

While this may be possible to arrange for, it still would feel wrong. How
would you express the same in a disaggregated environment? I.e. how would
you make sure a domain trying to grant permission is actually permitted to
do so for the resource in question?

> I think the call to PHYSDEVOP_setup_gsi in pciback is fine, but I
> would really like to avoid the usage of PHYSDEVOP_{,un}map_pirq on a
> PVH dom0 against itself.

+1

Jan

2023-12-12 11:18:39

by Roger Pau Monné

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On Tue, Dec 12, 2023 at 10:38:08AM +0100, Jan Beulich wrote:
> (I think the Cc list is too long here, but then I don't know who to
> keep and who to possibly drop.)
>
> On 12.12.2023 09:49, Roger Pau Monné wrote:
> > On Tue, Dec 12, 2023 at 06:16:43AM +0000, Chen, Jiqian wrote:
> >> On 2023/12/11 23:45, Roger Pau Monné wrote:
> >>> On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
> >>>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
> >>>> +{
> >>>> + struct physdev_setup_gsi setup_gsi;
> >>>> +
> >>>> + setup_gsi.gsi = gsi_info->gsi;
> >>>> + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
> >>>> + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
> >>>> +
> >>>> + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
> >>>> +}
> >>>
> >>> Hm, why not simply call pcibios_enable_device() from pciback? What
> >> pcibios_enable_device had been called when using cmd "xl pci-assignable-add sbdf" from pciback. But it didn't do map_pirq and setup_gsi.
> >> Because pcibios_enable_device-> pcibios_enable_irq-> __acpi_register_gsi(acpi_register_gsi_ioapic PVH specific)
> >>> you are doing here using the hypercalls is a backdoor into what's done
> >>> automatically by Xen on IO-APIC accesses by a PVH dom0.
> >> But the gsi didn't be unmasked, and vioapic_hwdom_map_gsi is never called.
> >> So, I think in pciback, if we can do what vioapic_hwdom_map_gsi does.
> >>
> >
> > I see, it does setup the IO-APIC pin but doesn't unmask it, that's
> > what I feared.
> >
> >>> It will be much more natural for the PVH dom0 model to simply use the
> >>> native way to configure and unmask the IO-APIC pin, and that would
> >>> correctly setup the triggering/polarity and bind it to dom0 without
> >>> requiring the usage of any hypercalls.
> >> Do you still prefer that I called unmask_irq in pcistub_init_device, as this v2 patch do?
> >> But Thomas Gleixner think it is not suitable to export unmask_irq.
> >
> > Yeah, that wasn't good.
> >
> >>>
> >>> Is that an issue since in that case the gsi will get mapped and bound
> >>> to dom0?
> >> Dom0 do map_pirq is to pass the check xc_domain_irq_permission()-> pirq_access_permitted(),
> >
> > Can we see about finding another way to fix this check?
> >
> > One option would be granting permissions over the IRQ in
> > PHYSDEVOP_setup_gsi?
>
> There's no domain available there, and imo it's also the wrong interface to
> possibly grant any permissions.

Well, the domain is the caller.

> > Otherwise we could see about modifying the logic in PHYSDEVOP_map_pirq
> > so that the hardware domain can map IRQs to other domains without
> > having it mapped to itself first?
>
> While this may be possible to arrange for, it still would feel wrong. How
> would you express the same in a disaggregated environment? I.e. how would
> you make sure a domain trying to grant permission is actually permitted to
> do so for the resource in question?

I've been looking again at the original issue, and I think I was
confused. The issue is not that dom0 doesn't have permissions over
the GSIs (as we do grant those in dom0_setup_permissions()), but
rather that XEN_DOMCTL_irq_permission attempts to use
domain_pirq_to_irq() in order to get the IRQ from the PIRQ parameter.

I've always been a bit confused with the PIRQ GSI relation, but IIRC
GSIs are always identity mapped to PIRQs, and hence you could possibly
adjust XEN_DOMCTL_irq_permission to use irq_permit_access() to check
if the caller domain has permissions over the target PIRQ.

Thanks, Roger.

2023-12-12 11:20:13

by Jan Beulich

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On 12.12.2023 12:18, Roger Pau Monné wrote:
> On Tue, Dec 12, 2023 at 10:38:08AM +0100, Jan Beulich wrote:
>> (I think the Cc list is too long here, but then I don't know who to
>> keep and who to possibly drop.)
>>
>> On 12.12.2023 09:49, Roger Pau Monné wrote:
>>> On Tue, Dec 12, 2023 at 06:16:43AM +0000, Chen, Jiqian wrote:
>>>> On 2023/12/11 23:45, Roger Pau Monné wrote:
>>>>> On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
>>>>>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>>>>>> +{
>>>>>> + struct physdev_setup_gsi setup_gsi;
>>>>>> +
>>>>>> + setup_gsi.gsi = gsi_info->gsi;
>>>>>> + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>>>>>> + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>>>>>> +
>>>>>> + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>>>>>> +}
>>>>>
>>>>> Hm, why not simply call pcibios_enable_device() from pciback? What
>>>> pcibios_enable_device had been called when using cmd "xl pci-assignable-add sbdf" from pciback. But it didn't do map_pirq and setup_gsi.
>>>> Because pcibios_enable_device-> pcibios_enable_irq-> __acpi_register_gsi(acpi_register_gsi_ioapic PVH specific)
>>>>> you are doing here using the hypercalls is a backdoor into what's done
>>>>> automatically by Xen on IO-APIC accesses by a PVH dom0.
>>>> But the gsi didn't be unmasked, and vioapic_hwdom_map_gsi is never called.
>>>> So, I think in pciback, if we can do what vioapic_hwdom_map_gsi does.
>>>>
>>>
>>> I see, it does setup the IO-APIC pin but doesn't unmask it, that's
>>> what I feared.
>>>
>>>>> It will be much more natural for the PVH dom0 model to simply use the
>>>>> native way to configure and unmask the IO-APIC pin, and that would
>>>>> correctly setup the triggering/polarity and bind it to dom0 without
>>>>> requiring the usage of any hypercalls.
>>>> Do you still prefer that I called unmask_irq in pcistub_init_device, as this v2 patch do?
>>>> But Thomas Gleixner think it is not suitable to export unmask_irq.
>>>
>>> Yeah, that wasn't good.
>>>
>>>>>
>>>>> Is that an issue since in that case the gsi will get mapped and bound
>>>>> to dom0?
>>>> Dom0 do map_pirq is to pass the check xc_domain_irq_permission()-> pirq_access_permitted(),
>>>
>>> Can we see about finding another way to fix this check?
>>>
>>> One option would be granting permissions over the IRQ in
>>> PHYSDEVOP_setup_gsi?
>>
>> There's no domain available there, and imo it's also the wrong interface to
>> possibly grant any permissions.
>
> Well, the domain is the caller.

Granting permission to itself?

Jan

2023-12-12 11:40:06

by Roger Pau Monné

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On Tue, Dec 12, 2023 at 12:19:49PM +0100, Jan Beulich wrote:
> On 12.12.2023 12:18, Roger Pau Monné wrote:
> > On Tue, Dec 12, 2023 at 10:38:08AM +0100, Jan Beulich wrote:
> >> (I think the Cc list is too long here, but then I don't know who to
> >> keep and who to possibly drop.)
> >>
> >> On 12.12.2023 09:49, Roger Pau Monné wrote:
> >>> On Tue, Dec 12, 2023 at 06:16:43AM +0000, Chen, Jiqian wrote:
> >>>> On 2023/12/11 23:45, Roger Pau Monné wrote:
> >>>>> On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
> >>>>>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
> >>>>>> +{
> >>>>>> + struct physdev_setup_gsi setup_gsi;
> >>>>>> +
> >>>>>> + setup_gsi.gsi = gsi_info->gsi;
> >>>>>> + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
> >>>>>> + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
> >>>>>> +
> >>>>>> + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
> >>>>>> +}
> >>>>>
> >>>>> Hm, why not simply call pcibios_enable_device() from pciback? What
> >>>> pcibios_enable_device had been called when using cmd "xl pci-assignable-add sbdf" from pciback. But it didn't do map_pirq and setup_gsi.
> >>>> Because pcibios_enable_device-> pcibios_enable_irq-> __acpi_register_gsi(acpi_register_gsi_ioapic PVH specific)
> >>>>> you are doing here using the hypercalls is a backdoor into what's done
> >>>>> automatically by Xen on IO-APIC accesses by a PVH dom0.
> >>>> But the gsi didn't be unmasked, and vioapic_hwdom_map_gsi is never called.
> >>>> So, I think in pciback, if we can do what vioapic_hwdom_map_gsi does.
> >>>>
> >>>
> >>> I see, it does setup the IO-APIC pin but doesn't unmask it, that's
> >>> what I feared.
> >>>
> >>>>> It will be much more natural for the PVH dom0 model to simply use the
> >>>>> native way to configure and unmask the IO-APIC pin, and that would
> >>>>> correctly setup the triggering/polarity and bind it to dom0 without
> >>>>> requiring the usage of any hypercalls.
> >>>> Do you still prefer that I called unmask_irq in pcistub_init_device, as this v2 patch do?
> >>>> But Thomas Gleixner think it is not suitable to export unmask_irq.
> >>>
> >>> Yeah, that wasn't good.
> >>>
> >>>>>
> >>>>> Is that an issue since in that case the gsi will get mapped and bound
> >>>>> to dom0?
> >>>> Dom0 do map_pirq is to pass the check xc_domain_irq_permission()-> pirq_access_permitted(),
> >>>
> >>> Can we see about finding another way to fix this check?
> >>>
> >>> One option would be granting permissions over the IRQ in
> >>> PHYSDEVOP_setup_gsi?
> >>
> >> There's no domain available there, and imo it's also the wrong interface to
> >> possibly grant any permissions.
> >
> > Well, the domain is the caller.
>
> Granting permission to itself?

See below in the previous email, the issue is not with the
permissions, which are correctly assigned from
dom0_setup_permissions(), but the usage of domain_pirq_to_irq() in
pirq_access_permitted() as called by XEN_DOMCTL_irq_permission.
There's no need to play with the permissions at all.

Regards, Roger.

2023-12-13 07:15:40

by Jiqian Chen

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On 2023/12/12 19:39, Roger Pau Monné wrote:
> On Tue, Dec 12, 2023 at 12:19:49PM +0100, Jan Beulich wrote:
>> On 12.12.2023 12:18, Roger Pau Monné wrote:
>>> On Tue, Dec 12, 2023 at 10:38:08AM +0100, Jan Beulich wrote:
>>>> (I think the Cc list is too long here, but then I don't know who to
>>>> keep and who to possibly drop.)
>>>>
>>>> On 12.12.2023 09:49, Roger Pau Monné wrote:
>>>>> On Tue, Dec 12, 2023 at 06:16:43AM +0000, Chen, Jiqian wrote:
>>>>>> On 2023/12/11 23:45, Roger Pau Monné wrote:
>>>>>>> On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
>>>>>>>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>>>>>>>> +{
>>>>>>>> + struct physdev_setup_gsi setup_gsi;
>>>>>>>> +
>>>>>>>> + setup_gsi.gsi = gsi_info->gsi;
>>>>>>>> + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>>>>>>>> + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>>>>>>>> +
>>>>>>>> + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>>>>>>>> +}
>>>>>>>
>>>>>>> Hm, why not simply call pcibios_enable_device() from pciback? What
>>>>>> pcibios_enable_device had been called when using cmd "xl pci-assignable-add sbdf" from pciback. But it didn't do map_pirq and setup_gsi.
>>>>>> Because pcibios_enable_device-> pcibios_enable_irq-> __acpi_register_gsi(acpi_register_gsi_ioapic PVH specific)
>>>>>>> you are doing here using the hypercalls is a backdoor into what's done
>>>>>>> automatically by Xen on IO-APIC accesses by a PVH dom0.
>>>>>> But the gsi didn't be unmasked, and vioapic_hwdom_map_gsi is never called.
>>>>>> So, I think in pciback, if we can do what vioapic_hwdom_map_gsi does.
>>>>>>
>>>>>
>>>>> I see, it does setup the IO-APIC pin but doesn't unmask it, that's
>>>>> what I feared.
>>>>>
>>>>>>> It will be much more natural for the PVH dom0 model to simply use the
>>>>>>> native way to configure and unmask the IO-APIC pin, and that would
>>>>>>> correctly setup the triggering/polarity and bind it to dom0 without
>>>>>>> requiring the usage of any hypercalls.
>>>>>> Do you still prefer that I called unmask_irq in pcistub_init_device, as this v2 patch do?
>>>>>> But Thomas Gleixner think it is not suitable to export unmask_irq.
>>>>>
>>>>> Yeah, that wasn't good.
>>>>>
>>>>>>>
>>>>>>> Is that an issue since in that case the gsi will get mapped and bound
>>>>>>> to dom0?
>>>>>> Dom0 do map_pirq is to pass the check xc_domain_irq_permission()-> pirq_access_permitted(),
>>>>>
>>>>> Can we see about finding another way to fix this check?
>>>>>
>>>>> One option would be granting permissions over the IRQ in
>>>>> PHYSDEVOP_setup_gsi?
>>>>
>>>> There's no domain available there, and imo it's also the wrong interface to
>>>> possibly grant any permissions.
>>>
>>> Well, the domain is the caller.
>>
>> Granting permission to itself?
>
> See below in the previous email, the issue is not with the
> permissions, which are correctly assigned from
> dom0_setup_permissions(), but the usage of domain_pirq_to_irq() in
> pirq_access_permitted() as called by XEN_DOMCTL_irq_permission.
> There's no need to play with the permissions at all.
Yes, the problem is pci_add_dm_done-> xc_domain_irq_permission-> XEN_DOMCTL_irq_permission-> pirq_access_permitted->domain_pirq_to_irq->return irq is 0, so it failed.
I am think that since the PVH doesn't use pirq, can we just skip this irq_permission check for PVH?

>
> Regards, Roger.

--
Best regards,
Jiqian Chen.

2023-12-13 07:41:25

by Jan Beulich

[permalink] [raw]
Subject: Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0

On 13.12.2023 08:14, Chen, Jiqian wrote:
> On 2023/12/12 19:39, Roger Pau Monné wrote:
>> On Tue, Dec 12, 2023 at 12:19:49PM +0100, Jan Beulich wrote:
>>> On 12.12.2023 12:18, Roger Pau Monné wrote:
>>>> On Tue, Dec 12, 2023 at 10:38:08AM +0100, Jan Beulich wrote:
>>>>> (I think the Cc list is too long here, but then I don't know who to
>>>>> keep and who to possibly drop.)
>>>>>
>>>>> On 12.12.2023 09:49, Roger Pau Monné wrote:
>>>>>> On Tue, Dec 12, 2023 at 06:16:43AM +0000, Chen, Jiqian wrote:
>>>>>>> On 2023/12/11 23:45, Roger Pau Monné wrote:
>>>>>>>> On Wed, Dec 06, 2023 at 06:07:26AM +0000, Chen, Jiqian wrote:
>>>>>>>>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>>>>>>>>> +{
>>>>>>>>> + struct physdev_setup_gsi setup_gsi;
>>>>>>>>> +
>>>>>>>>> + setup_gsi.gsi = gsi_info->gsi;
>>>>>>>>> + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>>>>>>>>> + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>>>>>>>>> +
>>>>>>>>> + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>>>>>>>>> +}
>>>>>>>>
>>>>>>>> Hm, why not simply call pcibios_enable_device() from pciback? What
>>>>>>> pcibios_enable_device had been called when using cmd "xl pci-assignable-add sbdf" from pciback. But it didn't do map_pirq and setup_gsi.
>>>>>>> Because pcibios_enable_device-> pcibios_enable_irq-> __acpi_register_gsi(acpi_register_gsi_ioapic PVH specific)
>>>>>>>> you are doing here using the hypercalls is a backdoor into what's done
>>>>>>>> automatically by Xen on IO-APIC accesses by a PVH dom0.
>>>>>>> But the gsi didn't be unmasked, and vioapic_hwdom_map_gsi is never called.
>>>>>>> So, I think in pciback, if we can do what vioapic_hwdom_map_gsi does.
>>>>>>>
>>>>>>
>>>>>> I see, it does setup the IO-APIC pin but doesn't unmask it, that's
>>>>>> what I feared.
>>>>>>
>>>>>>>> It will be much more natural for the PVH dom0 model to simply use the
>>>>>>>> native way to configure and unmask the IO-APIC pin, and that would
>>>>>>>> correctly setup the triggering/polarity and bind it to dom0 without
>>>>>>>> requiring the usage of any hypercalls.
>>>>>>> Do you still prefer that I called unmask_irq in pcistub_init_device, as this v2 patch do?
>>>>>>> But Thomas Gleixner think it is not suitable to export unmask_irq.
>>>>>>
>>>>>> Yeah, that wasn't good.
>>>>>>
>>>>>>>>
>>>>>>>> Is that an issue since in that case the gsi will get mapped and bound
>>>>>>>> to dom0?
>>>>>>> Dom0 do map_pirq is to pass the check xc_domain_irq_permission()-> pirq_access_permitted(),
>>>>>>
>>>>>> Can we see about finding another way to fix this check?
>>>>>>
>>>>>> One option would be granting permissions over the IRQ in
>>>>>> PHYSDEVOP_setup_gsi?
>>>>>
>>>>> There's no domain available there, and imo it's also the wrong interface to
>>>>> possibly grant any permissions.
>>>>
>>>> Well, the domain is the caller.
>>>
>>> Granting permission to itself?
>>
>> See below in the previous email, the issue is not with the
>> permissions, which are correctly assigned from
>> dom0_setup_permissions(), but the usage of domain_pirq_to_irq() in
>> pirq_access_permitted() as called by XEN_DOMCTL_irq_permission.
>> There's no need to play with the permissions at all.
> Yes, the problem is pci_add_dm_done-> xc_domain_irq_permission-> XEN_DOMCTL_irq_permission-> pirq_access_permitted->domain_pirq_to_irq->return irq is 0, so it failed.
> I am think that since the PVH doesn't use pirq, can we just skip this irq_permission check for PVH?

If not the pIRQ, then the real IRQ would need checking for permissions.
You won't get away without any checking at all, I'm afraid.

Jan