Hi All,
This is the v4 series to support passthrough on Xen when dom0 is PVH.
v3->v4 changes:
* patch#1: change the comment of PHYSDEVOP_pci_device_state_reset; use a new function pcistub_reset_device_state to wrap __pci_reset_function_locked and xen_reset_device_state, and call pcistub_reset_device_state in pci_stub.c
* patch#2: remove map_pirq from xen_pvh_passthrough_gsi
v3 link:
https://lore.kernel.org/lkml/[email protected]/T/#t
v2->v3 changes:
* patch#1: add a condition so that xen_reset_device_state is only called for non-PV domains in pcistub_init_device.
* patch#2: abandon the previous implementation that called unmask_irq. Instead, set up the gsi and map the pirq for the passthrough device in pcistub_init_device.
* patch#3: abandon the previous implementation that added a new syscall to get the gsi from the irq. Instead, add a new sysfs attribute for the gsi, so that userspace can get the gsi number from sysfs.
v2 link:
https://lore.kernel.org/lkml/[email protected]/T/#t
Below is the description from the v2 cover letter:
This patch series is the v2 of the implementation of passthrough when dom0 is PVH on Xen.
We sent v1 upstream before, but it had many problems and we received many suggestions.
I will introduce the issues that these patches try to fix and the differences between v1 and v2.
Issues we encountered:
1. pci_stub failed to write bar for a passthrough device.
Problem: when we run "sudo xl pci-assignable-add <sbdf>" to assign a device, pci_stub will call "pcistub_init_device() -> pci_restore_state() -> pci_restore_config_space() -> pci_restore_config_space_range() -> pci_restore_config_dword() -> pci_write_config_dword()". The pci config write triggers an io interrupt to bar_write() in Xen, but because bar->enabled was set before, the write is not allowed. Then, when Qemu configures the passthrough device in xen_pt_realize(), it gets invalid bar values.
Reason: we don't tell vPCI that the device has been reset, so the cached state in pdev->vpci is out of date and differs from the real device state.
Solution: to solve this problem, the first patch of kernel (xen/pci: Add xen_reset_device_state function) and the first patch of xen (xen/vpci: Clear all vpci status of device) add a new hypercall to reset the state stored in vPCI when the state of the real device has changed.
Thanks to Roger for suggesting this v2 approach. It differs from v1 (https://lore.kernel.org/xen-devel/[email protected]/), which simply allowed domU to write the pci bar and did not comply with the design principles of vPCI.
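To make the failure mode in issue 1 concrete, here is a toy model in C. It is only a sketch: struct toy_vpci_bar, toy_bar_write() and toy_state_reset() are hypothetical names invented for illustration and are not Xen's vPCI code. It assumes only what is described above, namely that a BAR write is refused while the cached "enabled" flag is set, and that a state reset clears the cache.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-BAR state cached by the hypervisor. */
struct toy_vpci_bar {
    bool enabled;   /* cached "memory decoding enabled" flag */
    uint32_t addr;  /* cached BAR value */
};

/* A BAR write is refused while the cache still says the BAR is enabled. */
static int toy_bar_write(struct toy_vpci_bar *bar, uint32_t val)
{
    if (bar->enabled)
        return -1;
    bar->addr = val;
    return 0;
}

/* Toy equivalent of the proposed "device state reset" notification. */
static void toy_state_reset(struct toy_vpci_bar *bar)
{
    bar->enabled = false;
    bar->addr = 0;
}
```

In this model, a write issued after the real device was reset fails as long as the cache is stale, and succeeds once toy_state_reset() has been called first.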
2. failed to do PHYSDEVOP_map_pirq when dom0 is PVH
Problem: HVM domU will do PHYSDEVOP_map_pirq for a passthrough device by using gsi. See xen_pt_realize->xc_physdev_map_pirq and pci_add_dm_done->xc_physdev_map_pirq. xc_physdev_map_pirq then calls into Xen, but in hvm_physdev_op(), PHYSDEVOP_map_pirq is not allowed.
Reason: In hvm_physdev_op(), the variable "currd" is PVH dom0, and PVH has no X86_EMU_USE_PIRQ flag, so it fails the has_pirq check.
Solution: I think we may need to allow PHYSDEVOP_map_pirq when "currd" is dom0 (at present dom0 is PVH). The second patch of xen (x86/pvh: Open PHYSDEVOP_map_pirq for PVH dom0) allows PVH dom0 to do PHYSDEVOP_map_pirq. This v2 patch is better than v1, which simply removed the has_pirq check (xen https://lore.kernel.org/xen-devel/[email protected]/).
3. the gsi of a passthrough device is not unmasked
3.1 failed to check the permission of pirq
3.2 the gsi of passthrough device was not registered in PVH dom0
Problem:
3.1 The callback function pci_add_dm_done() is called when qemu configures a passthrough device for domU.
This function calls xc_domain_irq_permission()-> pirq_access_permitted() to check whether the gsi has a corresponding mapping in dom0. It does not, so the check fails. See XEN_DOMCTL_irq_permission->pirq_access_permitted: "current" is PVH dom0 and the returned irq is 0.
3.2 It's possible for a gsi (iow: vIO-APIC pin) to never get registered on PVH dom0, because the devices of PVH are using MSI(-X) interrupts. However, the IO-APIC pin must be configured for it to be able to be mapped into a domU.
Reason: After searching the code, I found that "map_pirq" and "register_gsi" are done in vioapic_write_redirent->vioapic_hwdom_map_gsi when the gsi (aka ioapic's pin) is unmasked in PVH dom0.
So both problems come down to the gsi of a passthrough device not being unmasked.
Solution: to solve these problems, the second patch of kernel (xen/pvh: Unmask irq for passthrough device in PVH dom0) calls unmask_irq() when we assign a device for passthrough, so that passthrough devices have the gsi mapping on PVH dom0 and the gsi can be registered. This v2 patch is different from v1 (kernel https://lore.kernel.org/xen-devel/[email protected]/,
kernel https://lore.kernel.org/xen-devel/[email protected]/ and xen https://lore.kernel.org/xen-devel/[email protected]/),
which performed "map_pirq" and "register_gsi" on all pci devices on PVH dom0; that is unnecessary and may cause multiple registrations.
4. failed to map pirq for gsi
Problem: qemu calls xc_physdev_map_pirq() to map a passthrough device's gsi to a pirq in function xen_pt_realize(), but the call fails.
Reason: According to the implementation of xc_physdev_map_pirq(), it needs the gsi instead of the irq, but qemu passes the irq to it and treats the irq as the gsi. The irq is read from the file /sys/bus/pci/devices/xxxx:xx:xx.x/irq in function xen_host_pci_device_get(). But the gsi number is actually not equal to the irq. On PVH dom0, when an irq is allocated for a gsi in function acpi_register_gsi_ioapic(), the allocation is dynamic and first come, first served. If you debug the kernel code (see function __irq_alloc_descs), you will find that irq numbers are allocated in ascending order, but the gsi numbers are not requested in ascending order: gsi 38 may come before gsi 28, which gives gsi 38 a smaller irq number than gsi 28, and then gsi != irq.
Solution: we can record the relation between gsi and irq, then when userspace (qemu) wants to use the gsi, we can do a translation. The third patch of kernel (xen/privcmd: Add new syscall to get gsi from irq) records all the relations in acpi_register_gsi_xen_pvh() when dom0 initializes pci devices, and provides a syscall for userspace to get the gsi from the irq. The third patch of xen (tools: Add new function to get gsi from irq) adds a new function xc_physdev_gsi_from_irq() to call the new syscall added on the kernel side.
Userspace can then use that function to get the gsi, and xc_physdev_map_pirq() will succeed. This v2 patch is the same as v1 (kernel https://lore.kernel.org/xen-devel/[email protected]/ and xen https://lore.kernel.org/xen-devel/[email protected]/).
As for the v2 patch of qemu, it only changes an included header file; the rest is similar to v1 (qemu https://lore.kernel.org/xen-devel/[email protected]/) and just calls
xc_physdev_gsi_from_irq() to get the gsi from the irq.
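The gsi != irq effect described in issue 4 can be reproduced with a toy allocator. This is a minimal sketch and not kernel code: struct irq_map, toy_register_gsi() and the starting value FIRST_DYNAMIC_IRQ are hypothetical names and numbers chosen for illustration. The only assumption taken from the text above is that irq numbers are handed out in ascending order, in the order the gsis are registered.

```c
#define MAX_GSI 64
#define FIRST_DYNAMIC_IRQ 24   /* hypothetical starting point, illustration only */

struct irq_map {
    int next_irq;               /* next free irq number, handed out in ascending order */
    int gsi_to_irq[MAX_GSI];    /* -1 while the gsi is unregistered */
};

static void irq_map_init(struct irq_map *m)
{
    m->next_irq = FIRST_DYNAMIC_IRQ;
    for (int i = 0; i < MAX_GSI; i++)
        m->gsi_to_irq[i] = -1;
}

/* Toy stand-in for gsi registration: first come, first served. */
static int toy_register_gsi(struct irq_map *m, int gsi)
{
    if (gsi < 0 || gsi >= MAX_GSI)
        return -1;
    if (m->gsi_to_irq[gsi] < 0)
        m->gsi_to_irq[gsi] = m->next_irq++;
    return m->gsi_to_irq[gsi];
}
```

Registering gsi 38 before gsi 28 in this model gives gsi 38 the smaller irq number, so neither irq equals its gsi, and a consumer that treats the irq as a gsi would pass the wrong value.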
Jiqian Chen (3):
xen/pci: Add xen_reset_device_state function
xen/pvh: Setup gsi for passthrough device
PCI/sysfs: Add gsi sysfs for pci_dev
arch/x86/xen/enlighten_pvh.c | 90 ++++++++++++++++++++++++++++++
drivers/acpi/pci_irq.c | 3 +-
drivers/pci/pci-sysfs.c | 11 ++++
drivers/xen/pci.c | 12 ++++
drivers/xen/xen-pciback/pci_stub.c | 26 ++++++++-
include/linux/acpi.h | 1 +
include/linux/pci.h | 2 +
include/xen/acpi.h | 6 ++
include/xen/interface/physdev.h | 7 +++
include/xen/pci.h | 6 ++
10 files changed, 160 insertions(+), 4 deletions(-)
--
2.34.1
When a device has been reset on the dom0 side, the vpci on the Xen
side won't get notified, so the cached state in vpci is out of
date compared with the real device state.
To solve that problem, add a new function to clear all vpci
device state when the device is reset on the dom0 side.
Call that function in pcistub_init_device, because using
"pci-assignable-add" to assign a passthrough device in Xen
resets the device, which leaves the vpci state out of date
and makes the device fail to restore its bar state.
Co-developed-by: Huang Rui <[email protected]>
Signed-off-by: Jiqian Chen <[email protected]>
---
drivers/xen/pci.c | 12 ++++++++++++
drivers/xen/xen-pciback/pci_stub.c | 18 +++++++++++++++---
include/xen/interface/physdev.h | 7 +++++++
include/xen/pci.h | 6 ++++++
4 files changed, 40 insertions(+), 3 deletions(-)
diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
index 72d4e3f193af..e9b30bc09139 100644
--- a/drivers/xen/pci.c
+++ b/drivers/xen/pci.c
@@ -177,6 +177,18 @@ static int xen_remove_device(struct device *dev)
return r;
}
+int xen_reset_device_state(const struct pci_dev *dev)
+{
+ struct physdev_pci_device device = {
+ .seg = pci_domain_nr(dev->bus),
+ .bus = dev->bus->number,
+ .devfn = dev->devfn
+ };
+
+ return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_state_reset, &device);
+}
+EXPORT_SYMBOL_GPL(xen_reset_device_state);
+
static int xen_pci_notifier(struct notifier_block *nb,
unsigned long action, void *data)
{
diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index e34b623e4b41..46c40ec8a18e 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -89,6 +89,16 @@ static struct pcistub_device *pcistub_device_alloc(struct pci_dev *dev)
return psdev;
}
+static int pcistub_reset_device_state(struct pci_dev *dev)
+{
+ __pci_reset_function_locked(dev);
+
+ if (!xen_pv_domain())
+ return xen_reset_device_state(dev);
+ else
+ return 0;
+}
+
/* Don't call this directly as it's called by pcistub_device_put */
static void pcistub_device_release(struct kref *kref)
{
@@ -107,7 +117,7 @@ static void pcistub_device_release(struct kref *kref)
/* Call the reset function which does not take lock as this
* is called from "unbind" which takes a device_lock mutex.
*/
- __pci_reset_function_locked(dev);
+ pcistub_reset_device_state(dev);
if (dev_data &&
pci_load_and_free_saved_state(dev, &dev_data->pci_saved_state))
dev_info(&dev->dev, "Could not reload PCI state\n");
@@ -284,7 +294,7 @@ void pcistub_put_pci_dev(struct pci_dev *dev)
* (so it's ready for the next domain)
*/
device_lock_assert(&dev->dev);
- __pci_reset_function_locked(dev);
+ pcistub_reset_device_state(dev);
dev_data = pci_get_drvdata(dev);
ret = pci_load_saved_state(dev, dev_data->pci_saved_state);
@@ -420,7 +430,9 @@ static int pcistub_init_device(struct pci_dev *dev)
dev_err(&dev->dev, "Could not store PCI conf saved state!\n");
else {
dev_dbg(&dev->dev, "resetting (FLR, D3, etc) the device\n");
- __pci_reset_function_locked(dev);
+ err = pcistub_reset_device_state(dev);
+ if (err)
+ goto config_release;
pci_restore_state(dev);
}
/* Now disable the device (this also ensures some private device
diff --git a/include/xen/interface/physdev.h b/include/xen/interface/physdev.h
index a237af867873..8609770e28f5 100644
--- a/include/xen/interface/physdev.h
+++ b/include/xen/interface/physdev.h
@@ -256,6 +256,13 @@ struct physdev_pci_device_add {
*/
#define PHYSDEVOP_prepare_msix 30
#define PHYSDEVOP_release_msix 31
+/*
+ * Notify the hypervisor that a PCI device has been reset, so that any
+ * internally cached state is regenerated. Should be called after any
+ * device reset performed by the hardware domain.
+ */
+#define PHYSDEVOP_pci_device_state_reset 32
+
struct physdev_pci_device {
/* IN */
uint16_t seg;
diff --git a/include/xen/pci.h b/include/xen/pci.h
index b8337cf85fd1..b2e2e856efd6 100644
--- a/include/xen/pci.h
+++ b/include/xen/pci.h
@@ -4,10 +4,16 @@
#define __XEN_PCI_H__
#if defined(CONFIG_XEN_DOM0)
+int xen_reset_device_state(const struct pci_dev *dev);
int xen_find_device_domain_owner(struct pci_dev *dev);
int xen_register_device_domain_owner(struct pci_dev *dev, uint16_t domain);
int xen_unregister_device_domain_owner(struct pci_dev *dev);
#else
+static inline int xen_reset_device_state(const struct pci_dev *dev)
+{
+ return -1;
+}
+
static inline int xen_find_device_domain_owner(struct pci_dev *dev)
{
return -1;
--
2.34.1
In PVH dom0, the gsis don't get registered, but the gsi of
a passthrough device must be configured for it to be able to be
mapped into a domU.
When assigning a device for passthrough, proactively set up the
gsi of the device during that process.
Co-developed-by: Huang Rui <[email protected]>
Signed-off-by: Jiqian Chen <[email protected]>
---
arch/x86/xen/enlighten_pvh.c | 90 ++++++++++++++++++++++++++++++
drivers/acpi/pci_irq.c | 2 +-
drivers/xen/xen-pciback/pci_stub.c | 8 +++
include/linux/acpi.h | 1 +
include/xen/acpi.h | 6 ++
5 files changed, 106 insertions(+), 1 deletion(-)
diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
index ada3868c02c2..ecadd966c684 100644
--- a/arch/x86/xen/enlighten_pvh.c
+++ b/arch/x86/xen/enlighten_pvh.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/acpi.h>
#include <linux/export.h>
+#include <linux/pci.h>
#include <xen/hvc-console.h>
@@ -25,6 +26,95 @@
bool __ro_after_init xen_pvh;
EXPORT_SYMBOL_GPL(xen_pvh);
+typedef struct gsi_info {
+ int gsi;
+ int trigger;
+ int polarity;
+} gsi_info_t;
+
+struct acpi_prt_entry {
+ struct acpi_pci_id id;
+ u8 pin;
+ acpi_handle link;
+ u32 index; /* GSI, or link _CRS index */
+};
+
+static int xen_pvh_get_gsi_info(struct pci_dev *dev,
+ gsi_info_t *gsi_info)
+{
+ int gsi;
+ u8 pin = 0;
+ struct acpi_prt_entry *entry;
+ int trigger = ACPI_LEVEL_SENSITIVE;
+ int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
+ ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
+
+ if (dev)
+ pin = dev->pin;
+ if (!dev || !pin || !gsi_info)
+ return -EINVAL;
+
+ entry = acpi_pci_irq_lookup(dev, pin);
+ if (entry) {
+ if (entry->link)
+ gsi = acpi_pci_link_allocate_irq(entry->link,
+ entry->index,
+ &trigger, &polarity,
+ NULL);
+ else
+ gsi = entry->index;
+ } else
+ gsi = -1;
+
+ if (gsi < 0)
+ return -EINVAL;
+
+ gsi_info->gsi = gsi;
+ gsi_info->trigger = trigger;
+ gsi_info->polarity = polarity;
+
+ return 0;
+}
+
+static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
+{
+ struct physdev_setup_gsi setup_gsi;
+
+ if (!gsi_info)
+ return -EINVAL;
+
+ setup_gsi.gsi = gsi_info->gsi;
+ setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
+ setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
+
+ return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
+}
+
+int xen_pvh_passthrough_gsi(struct pci_dev *dev)
+{
+ int ret;
+ gsi_info_t gsi_info;
+
+ if (!dev)
+ return -EINVAL;
+
+ ret = xen_pvh_get_gsi_info(dev, &gsi_info);
+ if (ret) {
+ xen_raw_printk("Fail to get gsi info!\n");
+ return ret;
+ }
+
+ ret = xen_pvh_setup_gsi(&gsi_info);
+ if (ret == -EEXIST) {
+ xen_raw_printk("Already setup the GSI :%d\n", gsi_info.gsi);
+ ret = 0;
+ } else if (ret)
+ xen_raw_printk("Fail to setup GSI (%d)!\n", gsi_info.gsi);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
+
void __init xen_pvh_init(struct boot_params *boot_params)
{
u32 msr;
diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
index ff30ceca2203..630fe0a34bc6 100644
--- a/drivers/acpi/pci_irq.c
+++ b/drivers/acpi/pci_irq.c
@@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
}
#endif /* CONFIG_X86_IO_APIC */
-static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
+struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
{
struct acpi_prt_entry *entry = NULL;
struct pci_dev *bridge;
diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index 46c40ec8a18e..22d4380d2b04 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -20,6 +20,7 @@
#include <linux/atomic.h>
#include <xen/events.h>
#include <xen/pci.h>
+#include <xen/acpi.h>
#include <xen/xen.h>
#include <asm/xen/hypervisor.h>
#include <xen/interface/physdev.h>
@@ -435,6 +436,13 @@ static int pcistub_init_device(struct pci_dev *dev)
goto config_release;
pci_restore_state(dev);
}
+
+ if (xen_initial_domain() && xen_pvh_domain()) {
+ err = xen_pvh_passthrough_gsi(dev);
+ if (err)
+ goto config_release;
+ }
+
/* Now disable the device (this also ensures some private device
* data is setup before we export)
*/
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 4db54e928b36..7ea3be981cc3 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -360,6 +360,7 @@ void acpi_unregister_gsi (u32 gsi);
struct pci_dev;
+struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin);
int acpi_pci_irq_enable (struct pci_dev *dev);
void acpi_penalize_isa_irq(int irq, int active);
bool acpi_isa_irq_available(int irq);
diff --git a/include/xen/acpi.h b/include/xen/acpi.h
index b1e11863144d..17c4d37f1e60 100644
--- a/include/xen/acpi.h
+++ b/include/xen/acpi.h
@@ -67,10 +67,16 @@ static inline void xen_acpi_sleep_register(void)
acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
}
}
+int xen_pvh_passthrough_gsi(struct pci_dev *dev);
#else
static inline void xen_acpi_sleep_register(void)
{
}
+
+static inline int xen_pvh_passthrough_gsi(struct pci_dev *dev)
+{
+ return -1;
+}
#endif
#endif /* _XEN_ACPI_H */
--
2.34.1
There is a need for some scenarios to use gsi sysfs.
For example, when xen passthrough a device to dumU, it will
use gsi to map pirq, but currently userspace can't get gsi
number.
So, add gsi sysfs for that and for other potential scenarios.
Co-developed-by: Huang Rui <[email protected]>
Signed-off-by: Jiqian Chen <[email protected]>
---
drivers/acpi/pci_irq.c | 1 +
drivers/pci/pci-sysfs.c | 11 +++++++++++
include/linux/pci.h | 2 ++
3 files changed, 14 insertions(+)
diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
index 630fe0a34bc6..739a58755df2 100644
--- a/drivers/acpi/pci_irq.c
+++ b/drivers/acpi/pci_irq.c
@@ -449,6 +449,7 @@ int acpi_pci_irq_enable(struct pci_dev *dev)
kfree(entry);
return 0;
}
+ dev->gsi = gsi;
rc = acpi_register_gsi(&dev->dev, gsi, triggering, polarity);
if (rc < 0) {
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 2321fdfefd7d..c51df88d079e 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -71,6 +71,16 @@ static ssize_t irq_show(struct device *dev,
}
static DEVICE_ATTR_RO(irq);
+static ssize_t gsi_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ return sysfs_emit(buf, "%u\n", pdev->gsi);
+}
+static DEVICE_ATTR_RO(gsi);
+
static ssize_t broken_parity_status_show(struct device *dev,
struct device_attribute *attr,
char *buf)
@@ -596,6 +606,7 @@ static struct attribute *pci_dev_attrs[] = {
&dev_attr_revision.attr,
&dev_attr_class.attr,
&dev_attr_irq.attr,
+ &dev_attr_gsi.attr,
&dev_attr_local_cpus.attr,
&dev_attr_local_cpulist.attr,
&dev_attr_modalias.attr,
diff --git a/include/linux/pci.h b/include/linux/pci.h
index dea043bc1e38..0618d4a87a50 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -529,6 +529,8 @@ struct pci_dev {
/* These methods index pci_reset_fn_methods[] */
u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
+
+ unsigned int gsi;
};
static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
--
2.34.1
Hi Bjorn Helgaas,
Do you have any comments on this patch?
On 2024/1/5 14:22, Chen, Jiqian wrote:
> There is a need for some scenarios to use gsi sysfs.
> For example, when xen passthrough a device to dumU, it will
> use gsi to map pirq, but currently userspace can't get gsi
> number.
> So, add gsi sysfs for that and for other potential scenarios.
>
> Co-developed-by: Huang Rui <[email protected]>
> Signed-off-by: Jiqian Chen <[email protected]>
> ---
> drivers/acpi/pci_irq.c | 1 +
> drivers/pci/pci-sysfs.c | 11 +++++++++++
> include/linux/pci.h | 2 ++
> 3 files changed, 14 insertions(+)
>
> diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
> index 630fe0a34bc6..739a58755df2 100644
> --- a/drivers/acpi/pci_irq.c
> +++ b/drivers/acpi/pci_irq.c
> @@ -449,6 +449,7 @@ int acpi_pci_irq_enable(struct pci_dev *dev)
> kfree(entry);
> return 0;
> }
> + dev->gsi = gsi;
>
> rc = acpi_register_gsi(&dev->dev, gsi, triggering, polarity);
> if (rc < 0) {
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 2321fdfefd7d..c51df88d079e 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -71,6 +71,16 @@ static ssize_t irq_show(struct device *dev,
> }
> static DEVICE_ATTR_RO(irq);
>
> +static ssize_t gsi_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + return sysfs_emit(buf, "%u\n", pdev->gsi);
> +}
> +static DEVICE_ATTR_RO(gsi);
> +
> static ssize_t broken_parity_status_show(struct device *dev,
> struct device_attribute *attr,
> char *buf)
> @@ -596,6 +606,7 @@ static struct attribute *pci_dev_attrs[] = {
> &dev_attr_revision.attr,
> &dev_attr_class.attr,
> &dev_attr_irq.attr,
> + &dev_attr_gsi.attr,
> &dev_attr_local_cpus.attr,
> &dev_attr_local_cpulist.attr,
> &dev_attr_modalias.attr,
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index dea043bc1e38..0618d4a87a50 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -529,6 +529,8 @@ struct pci_dev {
>
> /* These methods index pci_reset_fn_methods[] */
> u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
> +
> + unsigned int gsi;
> };
>
> static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
--
Best regards,
Jiqian Chen.
On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> There is a need for some scenarios to use gsi sysfs.
> For example, when xen passthrough a device to dumU, it will
> use gsi to map pirq, but currently userspace can't get gsi
> number.
> So, add gsi sysfs for that and for other potential scenarios.
Isn't GSI really an ACPI-specific concept?
I don't know enough about Xen to know why it needs the GSI in
userspace. Is this passthrough brand new functionality that can't be
done today because we don't expose the GSI yet?
How does userspace use the GSI? I see "to map pirq", but I think we
should have more concrete details about exactly what is needed and how
it is used before adding something new in sysfs.
Is there some more generic kernel interface we could use
for this?
s/dumU/DomU/ ? (I dunno, but https://www.google.com/search?q=xen+dumu
suggests it :))
On 2024/1/23 07:37, Bjorn Helgaas wrote:
> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
>> There is a need for some scenarios to use gsi sysfs.
>> For example, when xen passthrough a device to dumU, it will
>> use gsi to map pirq, but currently userspace can't get gsi
>> number.
>> So, add gsi sysfs for that and for other potential scenarios.
>
> Isn't GSI really an ACPI-specific concept?
I have also added the ACPI maintainers to get some input.
Hi Rafael J. Wysocki and Len Brown, do you have any suggestions about this patch?
>
> I don't know enough about Xen to know why it needs the GSI in
> userspace. Is this passthrough brand new functionality that can't be
> done today because we don't expose the GSI yet?
In Xen architecture, there is a privileged domain named Dom0 that has ACPI support and is responsible for detecting and controlling the hardware, also it performs privileged operations such as the creation of normal (unprivileged) domains DomUs. When we give to a DomU direct access to a device, we need also to route the physical interrupts to the DomU. In order to do so Xen needs to setup and map the interrupts appropriately. For the case of GSI interrupts, since Xen does not have support to get the ACPI routing info in the hypervisor itself, it needs to get this info from Dom0. One way would be for this info to be exposed in sysfs and the xen toolstack that runs in Dom0's userspace to get this info reading sysfs and then pass it to Xen.
I also tried another approach in an earlier version of the patches: the kernel kept irq-to-gsi mappings, and the xen tool consulted the map via a syscall and passed the info to Xen. It was rejected by the Xen maintainers because they considered the mappings and translations to be Linux-internal and unrelated to Xen, so they suggested exposing the GSI in sysfs, which is cleaner and easier to retrieve from userspace.
This is my past version:
Kernel: https://lore.kernel.org/lkml/[email protected]/T/#m8d20edd326cf7735c2804f0371e8a63b6beec60c
Xen: https://lore.kernel.org/xen-devel/[email protected]/T/#m9f9068d558822af0a5b28cd241cab4d779e36974
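For illustration, this is roughly how a toolstack in Dom0 userspace could consume the attribute. It is a sketch only: the "gsi" file exists only with this patch applied, and pci_gsi_path() and read_gsi_attr() are hypothetical helper names, not functions from libxc or qemu.

```c
#include <stdio.h>

/* Build the sysfs path of the proposed "gsi" attribute for one device. */
static int pci_gsi_path(char *buf, size_t len, const char *sbdf)
{
    int n = snprintf(buf, len, "/sys/bus/pci/devices/%s/gsi", sbdf);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read one unsigned decimal value from an attribute file. */
static int read_gsi_attr(const char *path, unsigned int *gsi)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = fscanf(f, "%u", gsi) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

The value read this way would then be handed to xc_physdev_map_pirq() in place of the number read from the existing "irq" attribute.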
>
> How does userspace use the GSI? I see "to map pirq", but I think we
> should have more concrete details about exactly what is needed and how
> it is used before adding something new in sysfs.
For the reason described above.
>
> Is there some more generic kernel interface we could use
> for this?
No, I don't think there is such an interface at the moment.
>
> s/dumU/DomU/ ? (I dunno, but https://www.google.com/search?q=xen+dumu
> suggests it :))
>
--
Best regards,
Jiqian Chen.
On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> >> There is a need for some scenarios to use gsi sysfs.
> >> For example, when xen passthrough a device to dumU, it will
> >> use gsi to map pirq, but currently userspace can't get gsi
> >> number.
> >> So, add gsi sysfs for that and for other potential scenarios.
> ...
> > I don't know enough about Xen to know why it needs the GSI in
> > userspace. Is this passthrough brand new functionality that can't be
> > done today because we don't expose the GSI yet?
>
> In Xen architecture, there is a privileged domain named Dom0 that
> has ACPI support and is responsible for detecting and controlling
> the hardware, also it performs privileged operations such as the
> creation of normal (unprivileged) domains DomUs. When we give to a
> DomU direct access to a device, we need also to route the physical
> interrupts to the DomU. In order to do so Xen needs to setup and map
> the interrupts appropriately.
What kernel interfaces are used for this setup and mapping?
On 2024/1/24 00:02, Bjorn Helgaas wrote:
> On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
>> On 2024/1/23 07:37, Bjorn Helgaas wrote:
>>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
>>>> There is a need for some scenarios to use gsi sysfs.
>>>> For example, when xen passthrough a device to dumU, it will
>>>> use gsi to map pirq, but currently userspace can't get gsi
>>>> number.
>>>> So, add gsi sysfs for that and for other potential scenarios.
>> ...
>
>>> I don't know enough about Xen to know why it needs the GSI in
>>> userspace. Is this passthrough brand new functionality that can't be
>>> done today because we don't expose the GSI yet?
>>
>> In Xen architecture, there is a privileged domain named Dom0 that
>> has ACPI support and is responsible for detecting and controlling
>> the hardware, also it performs privileged operations such as the
>> creation of normal (unprivileged) domains DomUs. When we give to a
>> DomU direct access to a device, we need also to route the physical
>> interrupts to the DomU. In order to do so Xen needs to setup and map
>> the interrupts appropriately.
>
> What kernel interfaces are used for this setup and mapping?
For passthrough devices, the setup and mapping of routing physical interrupts to DomU are done on Xen hypervisor side, hypervisor only need userspace to provide the GSI info, see Xen code: xc_physdev_map_pirq require GSI and then will call hypercall to pass GSI into hypervisor and then hypervisor will do the mapping and routing, kernel doesn't do the setup and mapping.
For devices on PVH Dom0, Dom0 setups interrupts for devices as the baremetal Linux kernel does, through using acpi_pci_irq_enable-> acpi_register_gsi-> __acpi_register_gsi->acpi_register_gsi_ioapic.
--
Best regards,
Jiqian Chen.
On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> >>>> There is a need for some scenarios to use gsi sysfs.
> >>>> For example, when xen passthrough a device to dumU, it will
> >>>> use gsi to map pirq, but currently userspace can't get gsi
> >>>> number.
> >>>> So, add gsi sysfs for that and for other potential scenarios.
> >> ...
> >
> >>> I don't know enough about Xen to know why it needs the GSI in
> >>> userspace. Is this passthrough brand new functionality that can't be
> >>> done today because we don't expose the GSI yet?
I assume this must be new functionality, i.e., this kind of
passthrough does not work today, right?
> >> has ACPI support and is responsible for detecting and controlling
> >> the hardware, also it performs privileged operations such as the
> >> creation of normal (unprivileged) domains DomUs. When we give to a
> >> DomU direct access to a device, we need also to route the physical
> >> interrupts to the DomU. In order to do so Xen needs to setup and map
> >> the interrupts appropriately.
> >
> > What kernel interfaces are used for this setup and mapping?
>
> For passthrough devices, the setup and mapping of routing physical
> interrupts to DomU are done on Xen hypervisor side, hypervisor only
> need userspace to provide the GSI info, see Xen code:
> xc_physdev_map_pirq require GSI and then will call hypercall to pass
> GSI into hypervisor and then hypervisor will do the mapping and
> routing, kernel doesn't do the setup and mapping.
So we have to expose the GSI to userspace not because userspace itself
uses it, but so userspace can turn around and pass it back into the
kernel?
It seems like it would be better for userspace to pass an identifier
of the PCI device itself back into the hypervisor. Then the interface
could be generic and potentially work even on non-ACPI systems where
the GSI concept doesn't apply.
> For devices on PVH Dom0, Dom0 setups interrupts for devices as the
> baremetal Linux kernel does, through using acpi_pci_irq_enable->
> acpi_register_gsi-> __acpi_register_gsi->acpi_register_gsi_ioapic.
This case sounds like it's all inside Linux, so I assume there's no
problem with this one? If you can call acpi_pci_irq_enable(), you
have the pci_dev, so I assume there's no need to expose the GSI in
sysfs?
Bjorn
On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > >>>> There is a need for some scenarios to use gsi sysfs.
> > >>>> For example, when xen passthrough a device to dumU, it will
> > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > >>>> number.
> > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > >> ...
> > >
> > >>> I don't know enough about Xen to know why it needs the GSI in
> > >>> userspace. Is this passthrough brand new functionality that can't be
> > >>> done today because we don't expose the GSI yet?
>
> I assume this must be new functionality, i.e., this kind of
> passthrough does not work today, right?
>
> > >> has ACPI support and is responsible for detecting and controlling
> > >> the hardware, also it performs privileged operations such as the
> > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > >> DomU direct access to a device, we need also to route the physical
> > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > >> the interrupts appropriately.
> > >
> > > What kernel interfaces are used for this setup and mapping?
> >
> > For passthrough devices, the setup and mapping of routing physical
> > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > need userspace to provide the GSI info, see Xen code:
> > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > GSI into hypervisor and then hypervisor will do the mapping and
> > routing, kernel doesn't do the setup and mapping.
>
> So we have to expose the GSI to userspace not because userspace itself
> uses it, but so userspace can turn around and pass it back into the
> kernel?
No, the point is to pass it back to Xen, which doesn't know the
mapping between GSIs and PCI devices because it can't execute the ACPI
AML resource methods that provide such information.
The (Linux) kernel is just a proxy that forwards the hypercalls from
user-space tools into Xen.
> It seems like it would be better for userspace to pass an identifier
> of the PCI device itself back into the hypervisor. Then the interface
> could be generic and potentially work even on non-ACPI systems where
> the GSI concept doesn't apply.
We would still need a way to pass the GSI to PCI device relation to
the hypervisor, and then cache such data in the hypervisor.
I don't think we have any preference of where such information should
be exposed, but given GSIs are an ACPI concept not specific to Xen
they should be exposed by a non-Xen specific interface.
Thanks, Roger.
On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > >>>> For example, when xen passthrough a device to dumU, it will
> > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > >>>> number.
> > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > >> ...
> > > >
> > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > >>> userspace. Is this passthrough brand new functionality that can't be
> > > >>> done today because we don't expose the GSI yet?
> >
> > I assume this must be new functionality, i.e., this kind of
> > passthrough does not work today, right?
> >
> > > >> has ACPI support and is responsible for detecting and controlling
> > > >> the hardware, also it performs privileged operations such as the
> > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > >> DomU direct access to a device, we need also to route the physical
> > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > >> the interrupts appropriately.
> > > >
> > > > What kernel interfaces are used for this setup and mapping?
> > >
> > > For passthrough devices, the setup and mapping of routing physical
> > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > need userspace to provide the GSI info, see Xen code:
> > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > GSI into hypervisor and then hypervisor will do the mapping and
> > > routing, kernel doesn't do the setup and mapping.
> >
> > So we have to expose the GSI to userspace not because userspace itself
> > uses it, but so userspace can turn around and pass it back into the
> > kernel?
>
> No, the point is to pass it back to Xen, which doesn't know the
> mapping between GSIs and PCI devices because it can't execute the ACPI
> AML resource methods that provide such information.
>
> The (Linux) kernel is just a proxy that forwards the hypercalls from
> user-space tools into Xen.
But I guess Xen knows how to interpret a GSI even though it doesn't
have access to AML?
> > It seems like it would be better for userspace to pass an identifier
> > of the PCI device itself back into the hypervisor. Then the interface
> > could be generic and potentially work even on non-ACPI systems where
> > the GSI concept doesn't apply.
>
> We would still need a way to pass the GSI to PCI device relation to
> the hypervisor, and then cache such data in the hypervisor.
>
> I don't think we have any preference of where such information should
> be exposed, but given GSIs are an ACPI concept not specific to Xen
> they should be exposed by a non-Xen specific interface.
AFAIK Linux doesn't expose GSIs directly to userspace yet. The GSI
concept relies on ACPI MADT, _MAT, _PRT, etc. A GSI is associated
with some device (PCI in this case) and some interrupt controller
entry. I don't understand how a GSI value is useful without knowing
something about that framework in which GSIs exist.
Obviously I know less than nothing about Xen, so I apologize for
asking all these stupid questions, but it just doesn't all make sense
to me yet.
Bjorn
On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
> On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> > On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > > >>>> For example, when xen passthrough a device to dumU, it will
> > > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > > >>>> number.
> > > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > > >> ...
> > > > >
> > > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > > >>> userspace. Is this passthrough brand new functionality that can't be
> > > > >>> done today because we don't expose the GSI yet?
> > >
> > > I assume this must be new functionality, i.e., this kind of
> > > passthrough does not work today, right?
> > >
> > > > >> has ACPI support and is responsible for detecting and controlling
> > > > >> the hardware, also it performs privileged operations such as the
> > > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > > >> DomU direct access to a device, we need also to route the physical
> > > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > > >> the interrupts appropriately.
> > > > >
> > > > > What kernel interfaces are used for this setup and mapping?
> > > >
> > > > For passthrough devices, the setup and mapping of routing physical
> > > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > > need userspace to provide the GSI info, see Xen code:
> > > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > > GSI into hypervisor and then hypervisor will do the mapping and
> > > > routing, kernel doesn't do the setup and mapping.
> > >
> > > So we have to expose the GSI to userspace not because userspace itself
> > > uses it, but so userspace can turn around and pass it back into the
> > > kernel?
> >
> > No, the point is to pass it back to Xen, which doesn't know the
> > mapping between GSIs and PCI devices because it can't execute the ACPI
> > AML resource methods that provide such information.
> >
> > The (Linux) kernel is just a proxy that forwards the hypercalls from
> > user-space tools into Xen.
>
> But I guess Xen knows how to interpret a GSI even though it doesn't
> have access to AML?
On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
to configure the RTE as requested.
> > > It seems like it would be better for userspace to pass an identifier
> > > of the PCI device itself back into the hypervisor. Then the interface
> > > could be generic and potentially work even on non-ACPI systems where
> > > the GSI concept doesn't apply.
> >
> > We would still need a way to pass the GSI to PCI device relation to
> > the hypervisor, and then cache such data in the hypervisor.
> >
> > I don't think we have any preference of where such information should
> > be exposed, but given GSIs are an ACPI concept not specific to Xen
> > they should be exposed by a non-Xen specific interface.
>
> AFAIK Linux doesn't expose GSIs directly to userspace yet. The GSI
> concept relies on ACPI MADT, _MAT, _PRT, etc. A GSI is associated
> with some device (PCI in this case) and some interrupt controller
> entry. I don't understand how a GSI value is useful without knowing
> something about that framework in which GSIs exist.
I wouldn't say it's strictly associated with PCI. A GSI is a way for
ACPI to have a single space that unifies all possible IO-APIC pins in
the system in a flat way. A GSI is useful in itself because there's
a single GSI space for the whole host.
> Obviously I know less than nothing about Xen, so I apologize for
> asking all these stupid questions, but it just doesn't all make sense
> to me yet.
That's all fine, maybe there's a better path or way to expose this ACPI
information. Maybe introduce a per-device acpi directory and expose
it there? Or rename the entry to acpi_gsi?
Thanks, Roger.
On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
> On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
> > On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> > > On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > > > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > > > >>>> For example, when xen passthrough a device to dumU, it will
> > > > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > > > >>>> number.
> > > > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > > > >> ...
> > > > > >
> > > > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > > > >>> userspace. Is this passthrough brand new functionality that can't be
> > > > > >>> done today because we don't expose the GSI yet?
> > > >
> > > > I assume this must be new functionality, i.e., this kind of
> > > > passthrough does not work today, right?
> > > >
> > > > > >> has ACPI support and is responsible for detecting and controlling
> > > > > >> the hardware, also it performs privileged operations such as the
> > > > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > > > >> DomU direct access to a device, we need also to route the physical
> > > > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > > > >> the interrupts appropriately.
> > > > > >
> > > > > > What kernel interfaces are used for this setup and mapping?
> > > > >
> > > > > For passthrough devices, the setup and mapping of routing physical
> > > > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > > > need userspace to provide the GSI info, see Xen code:
> > > > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > > > GSI into hypervisor and then hypervisor will do the mapping and
> > > > > routing, kernel doesn't do the setup and mapping.
> > > >
> > > > So we have to expose the GSI to userspace not because userspace itself
> > > > uses it, but so userspace can turn around and pass it back into the
> > > > kernel?
> > >
> > > No, the point is to pass it back to Xen, which doesn't know the
> > > mapping between GSIs and PCI devices because it can't execute the ACPI
> > > AML resource methods that provide such information.
> > >
> > > The (Linux) kernel is just a proxy that forwards the hypercalls from
> > > user-space tools into Xen.
> >
> > But I guess Xen knows how to interpret a GSI even though it doesn't
> > have access to AML?
>
> On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
> configure the RTE as requested.
IIUC, mapping a GSI to an IO-APIC pin requires information from the
MADT. So I guess Xen does use the static ACPI tables, but not the AML
_PRT methods that would connect a GSI with a PCI device?
I guess this means Xen would not be able to deal with _MAT methods,
which also contain MADT entries? I don't know the implications of
this -- maybe it means Xen might not be able to work with hot-added
devices?
The tables (including DSDT and SSDTs that contain the AML) are exposed
to userspace via /sys/firmware/acpi/tables/, but of course that
doesn't mean Xen knows how to interpret the AML, and even if it did,
Xen probably wouldn't be able to *evaluate* it since that could
conflict with the host kernel's use of AML.
Bjorn
On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
> On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
> > On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
> > > On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> > > > On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > > > > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > > > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > > > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > > > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > > > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > > > > >>>> For example, when xen passthrough a device to dumU, it will
> > > > > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > > > > >>>> number.
> > > > > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > > > > >> ...
> > > > > > >
> > > > > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > > > > >>> userspace. Is this passthrough brand new functionality that can't be
> > > > > > >>> done today because we don't expose the GSI yet?
> > > > >
> > > > > I assume this must be new functionality, i.e., this kind of
> > > > > passthrough does not work today, right?
> > > > >
> > > > > > >> has ACPI support and is responsible for detecting and controlling
> > > > > > >> the hardware, also it performs privileged operations such as the
> > > > > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > > > > >> DomU direct access to a device, we need also to route the physical
> > > > > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > > > > >> the interrupts appropriately.
> > > > > > >
> > > > > > > What kernel interfaces are used for this setup and mapping?
> > > > > >
> > > > > > For passthrough devices, the setup and mapping of routing physical
> > > > > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > > > > need userspace to provide the GSI info, see Xen code:
> > > > > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > > > > GSI into hypervisor and then hypervisor will do the mapping and
> > > > > > routing, kernel doesn't do the setup and mapping.
> > > > >
> > > > > So we have to expose the GSI to userspace not because userspace itself
> > > > > uses it, but so userspace can turn around and pass it back into the
> > > > > kernel?
> > > >
> > > > No, the point is to pass it back to Xen, which doesn't know the
> > > > mapping between GSIs and PCI devices because it can't execute the ACPI
> > > > AML resource methods that provide such information.
> > > >
> > > > The (Linux) kernel is just a proxy that forwards the hypercalls from
> > > > user-space tools into Xen.
> > >
> > > But I guess Xen knows how to interpret a GSI even though it doesn't
> > > have access to AML?
> >
> > On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
> > configure the RTE as requested.
>
> IIUC, mapping a GSI to an IO-APIC pin requires information from the
> MADT. So I guess Xen does use the static ACPI tables, but not the AML
> _PRT methods that would connect a GSI with a PCI device?
Yes, Xen can parse the static tables, and knows the base GSI of
IO-APICs from the MADT.
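For illustration, a simplified sketch of how the GSI base can be pulled
out of the MADT without any AML: walk the interrupt controller
structures that follow the MADT fixed fields and pick out the I/O APIC
entries (type 1, length 12). This is not Xen's actual parser; the
function and struct names are hypothetical, and the caller is assumed to
pass a pointer to the first interrupt controller structure.

```c
#include <stddef.h>
#include <stdint.h>

#define MADT_TYPE_IO_APIC 1   /* I/O APIC structure type in the MADT */
#define MADT_IO_APIC_LEN  12  /* size of the I/O APIC structure */

/* Fields of interest from one I/O APIC structure (illustrative type). */
struct ioapic_info {
    uint8_t  id;
    uint32_t address;   /* physical base address of the I/O APIC */
    uint32_t gsi_base;  /* first GSI served by this I/O APIC */
};

/* ACPI tables are little-endian; read a 32-bit field byte by byte. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Walk the interrupt controller structures in `entries` and collect up
 * to `max` I/O APIC entries. Returns the number collected. */
static size_t madt_collect_ioapics(const uint8_t *entries, size_t len,
                                   struct ioapic_info *out, size_t max)
{
    size_t off = 0, found = 0;

    while (off + 2 <= len) {
        uint8_t type = entries[off];
        uint8_t elen = entries[off + 1];

        /* Stop on a malformed or truncated structure. */
        if (elen < 2 || off + elen > len)
            break;
        if (type == MADT_TYPE_IO_APIC && elen >= MADT_IO_APIC_LEN &&
            found < max) {
            out[found].id = entries[off + 2];
            out[found].address = read_le32(&entries[off + 4]);
            out[found].gsi_base = read_le32(&entries[off + 8]);
            found++;
        }
        off += elen;
    }
    return found;
}
```

Everything needed here comes from the static table, which is why no
OSPM involvement is required for this step.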
> I guess this means Xen would not be able to deal with _MAT methods,
> which also contains MADT entries? I don't know the implications of
> this -- maybe it means Xen might not be able to use with hot-added
> devices?
It's my understanding _MAT will only be present on some very specific
devices (IO-APIC or CPU objects). Xen doesn't support hotplug of
IO-APICs, but hotplug of CPUs should in principle be supported with
cooperation from the control domain OS (albeit it's not something that
we test in our CI). I don't expect however that a CPU object _MAT
method will return IO APIC entries.
> The tables (including DSDT and SSDTS that contain the AML) are exposed
> to userspace via /sys/firmware/acpi/tables/, but of course that
> doesn't mean Xen knows how to interpret the AML, and even if it did,
> Xen probably wouldn't be able to *evaluate* it since that could
> conflict with the host kernel's use of AML.
Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
in our context).
Getting back to our context though, what would be a suitable place for
exposing the GSI assigned to each device?
Thanks, Roger.
On Thu, Feb 01, 2024 at 09:39:49AM +0100, Roger Pau Monné wrote:
> On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
> > On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
> > > On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
> > > > On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> > > > > On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > > > > > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > > > > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > > > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > > > > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > > > > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > > > > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > > > > > >>>> For example, when xen passthrough a device to dumU, it will
> > > > > > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > > > > > >>>> number.
> > > > > > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > > > > > >> ...
> > > > > > > >
> > > > > > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > > > > > >>> userspace. Is this passthrough brand new functionality that can't be
> > > > > > > >>> done today because we don't expose the GSI yet?
> > > > > >
> > > > > > I assume this must be new functionality, i.e., this kind of
> > > > > > passthrough does not work today, right?
> > > > > >
> > > > > > > >> has ACPI support and is responsible for detecting and controlling
> > > > > > > >> the hardware, also it performs privileged operations such as the
> > > > > > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > > > > > >> DomU direct access to a device, we need also to route the physical
> > > > > > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > > > > > >> the interrupts appropriately.
> > > > > > > >
> > > > > > > > What kernel interfaces are used for this setup and mapping?
> > > > > > >
> > > > > > > For passthrough devices, the setup and mapping of routing physical
> > > > > > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > > > > > need userspace to provide the GSI info, see Xen code:
> > > > > > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > > > > > GSI into hypervisor and then hypervisor will do the mapping and
> > > > > > > routing, kernel doesn't do the setup and mapping.
> > > > > >
> > > > > > So we have to expose the GSI to userspace not because userspace itself
> > > > > > uses it, but so userspace can turn around and pass it back into the
> > > > > > kernel?
> > > > >
> > > > > No, the point is to pass it back to Xen, which doesn't know the
> > > > > mapping between GSIs and PCI devices because it can't execute the ACPI
> > > > > AML resource methods that provide such information.
> > > > >
> > > > > The (Linux) kernel is just a proxy that forwards the hypercalls from
> > > > > user-space tools into Xen.
> > > >
> > > > But I guess Xen knows how to interpret a GSI even though it doesn't
> > > > have access to AML?
> > >
> > > On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
> > > configure the RTE as requested.
> >
> > IIUC, mapping a GSI to an IO-APIC pin requires information from the
> > MADT. So I guess Xen does use the static ACPI tables, but not the AML
> > _PRT methods that would connect a GSI with a PCI device?
>
> Yes, Xen can parse the static tables, and knows the base GSI of
> IO-APICs from the MADT.
>
> > I guess this means Xen would not be able to deal with _MAT methods,
> > which also contains MADT entries? I don't know the implications of
> > this -- maybe it means Xen might not be able to use with hot-added
> > devices?
>
> It's my understanding _MAT will only be present on some very specific
> devices (IO-APIC or CPU objects). Xen doesn't support hotplug of
> IO-APICs, but hotplug of CPUs should in principle be supported with
> cooperation from the control domain OS (albeit it's not something that
> we tests on our CI). I don't expect however that a CPU object _MAT
> method will return IO APIC entries.
>
> > The tables (including DSDT and SSDTS that contain the AML) are exposed
> > to userspace via /sys/firmware/acpi/tables/, but of course that
> > doesn't mean Xen knows how to interpret the AML, and even if it did,
> > Xen probably wouldn't be able to *evaluate* it since that could
> > conflict with the host kernel's use of AML.
>
> Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
> in our context).
>
> Getting back to our context though, what would be a suitable place for
> exposing the GSI assigned to each device?
IIUC, the Xen hypervisor:
- Interprets /sys/firmware/acpi/tables/APIC (or gets this via
something running on the Dom0 kernel) to find the physical base
address and GSI base, e.g., from I/O APIC, I/O SAPIC.
- Needs the GSI to locate the APIC and pin within the APIC. The
Dom0 kernel is the OSPM, so only it can evaluate the AML _PRT to
learn the PCI device -> GSI mapping.
- Has direct access to the APIC physical base address to program the
Redirection Table.
The patch seems a little messy to me because the PCI core has to keep
track of the GSI even though it doesn't need it itself. And the
current patch exposes it on all arches, even non-ACPI ones or when
ACPI is disabled (easily fixable).
We only call acpi_pci_irq_enable() in the pci_enable_device() path, so
we don't know the GSI unless a Dom0 driver has claimed the device and
called pci_enable_device() for it, which seems like it might not be
desirable.
I was hoping we could put it in /sys/firmware/acpi/interrupts, but
that looks like it's only for SCI statistics. I guess we could moot a
new /sys/firmware/acpi/gsi/ directory, but then each file there would
have to identify a device, which might not be as convenient as the
/sys/devices/ directory that already exists. I guess there may be
GSIs for things other than PCI devices; will you ever care about any
of those?
Bjorn
On Fri, Feb 09, 2024 at 03:05:49PM -0600, Bjorn Helgaas wrote:
> On Thu, Feb 01, 2024 at 09:39:49AM +0100, Roger Pau Monné wrote:
> > On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
> > > On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
> > > > On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
> > > > > On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> > > > > > On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > > > > > > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > > > > > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > > > > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > > > > > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > > > > > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > > > > > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > > > > > > >>>> For example, when xen passthrough a device to dumU, it will
> > > > > > > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > > > > > > >>>> number.
> > > > > > > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > > > > > > >> ...
> > > > > > > > >
> > > > > > > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > > > > > > >>> userspace. Is this passthrough brand new functionality that can't be
> > > > > > > > >>> done today because we don't expose the GSI yet?
> > > > > > >
> > > > > > > I assume this must be new functionality, i.e., this kind of
> > > > > > > passthrough does not work today, right?
> > > > > > >
> > > > > > > > >> has ACPI support and is responsible for detecting and controlling
> > > > > > > > >> the hardware, also it performs privileged operations such as the
> > > > > > > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > > > > > > >> DomU direct access to a device, we need also to route the physical
> > > > > > > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > > > > > > >> the interrupts appropriately.
> > > > > > > > >
> > > > > > > > > What kernel interfaces are used for this setup and mapping?
> > > > > > > >
> > > > > > > > For passthrough devices, the setup and mapping of routing physical
> > > > > > > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > > > > > > need userspace to provide the GSI info, see Xen code:
> > > > > > > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > > > > > > GSI into hypervisor and then hypervisor will do the mapping and
> > > > > > > > routing, kernel doesn't do the setup and mapping.
> > > > > > >
> > > > > > > So we have to expose the GSI to userspace not because userspace itself
> > > > > > > uses it, but so userspace can turn around and pass it back into the
> > > > > > > kernel?
> > > > > >
> > > > > > No, the point is to pass it back to Xen, which doesn't know the
> > > > > > mapping between GSIs and PCI devices because it can't execute the ACPI
> > > > > > AML resource methods that provide such information.
> > > > > >
> > > > > > The (Linux) kernel is just a proxy that forwards the hypercalls from
> > > > > > user-space tools into Xen.
> > > > >
> > > > > But I guess Xen knows how to interpret a GSI even though it doesn't
> > > > > have access to AML?
> > > >
> > > > On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
> > > > configure the RTE as requested.
> > >
> > > IIUC, mapping a GSI to an IO-APIC pin requires information from the
> > > MADT. So I guess Xen does use the static ACPI tables, but not the AML
> > > _PRT methods that would connect a GSI with a PCI device?
> >
> > Yes, Xen can parse the static tables, and knows the base GSI of
> > IO-APICs from the MADT.
> >
> > > I guess this means Xen would not be able to deal with _MAT methods,
> > > which also contains MADT entries? I don't know the implications of
> > > this -- maybe it means Xen might not be able to use with hot-added
> > > devices?
> >
> > It's my understanding _MAT will only be present on some very specific
> > devices (IO-APIC or CPU objects). Xen doesn't support hotplug of
> > IO-APICs, but hotplug of CPUs should in principle be supported with
> > cooperation from the control domain OS (albeit it's not something that
> > we tests on our CI). I don't expect however that a CPU object _MAT
> > method will return IO APIC entries.
> >
> > > The tables (including DSDT and SSDTS that contain the AML) are exposed
> > > to userspace via /sys/firmware/acpi/tables/, but of course that
> > > doesn't mean Xen knows how to interpret the AML, and even if it did,
> > > Xen probably wouldn't be able to *evaluate* it since that could
> > > conflict with the host kernel's use of AML.
> >
> > Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
> > in our context).
> >
> > Getting back to our context though, what would be a suitable place for
> > exposing the GSI assigned to each device?
>
> IIUC, the Xen hypervisor:
>
> - Interprets /sys/firmware/acpi/tables/APIC (or gets this via
> something running on the Dom0 kernel) to find the physical base
> address and GSI base, e.g., from I/O APIC, I/O SAPIC.
No, Xen parses the MADT directly from memory, before starting dom0.
That's a static table so it's fine for Xen to parse it and obtain the
I/O APIC GSI base.
> - Needs the GSI to locate the APIC and pin within the APIC. The
> Dom0 kernel is the OSPM, so only it can evaluate the AML _PRT to
> learn the PCI device -> GSI mapping.
Yes, Xen doesn't know the PCI device -> GSI mapping. Dom0 needs to
parse the ACPI methods and signal Xen to configure a GSI with a
given trigger and polarity.
> - Has direct access to the APIC physical base address to program the
> Redirection Table.
Yes, the hardware (native) I/O APIC is owned by Xen, and not directly
accessible by dom0.
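
To make the GSI -> I/O APIC pin step concrete, here is a minimal sketch
of the kind of lookup that only needs the static MADT data. The struct
and function names are made up for illustration and are not Xen's
actual code; the pin count is assumed to have been read from each
I/O APIC beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one I/O APIC as described by the MADT. */
struct ioapic_desc {
	uint32_t gsi_base;	/* Global System Interrupt Base from the MADT entry */
	uint32_t nr_pins;	/* number of redirection entries (assumed known) */
};

/*
 * Find which I/O APIC covers @gsi and which pin on it.
 * Returns 0 and fills *apic and *pin on success, -1 if no I/O APIC matches.
 */
static int gsi_to_ioapic_pin(const struct ioapic_desc *apics, size_t n,
			     uint32_t gsi, size_t *apic, uint32_t *pin)
{
	assert(apics != NULL && apic != NULL && pin != NULL);

	for (size_t i = 0; i < n; i++) {
		if (gsi >= apics[i].gsi_base &&
		    gsi - apics[i].gsi_base < apics[i].nr_pins) {
			*apic = i;
			*pin = gsi - apics[i].gsi_base;
			return 0;
		}
	}
	return -1;
}
```

Nothing here requires AML: the only missing input is the GSI itself,
which is what dom0 has to supply for a given PCI device.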
> The patch seems a little messy to me because the PCI core has to keep
> track of the GSI even though it doesn't need it itself. And the
> current patch exposes it on all arches, even non-ACPI ones or when
> ACPI is disabled (easily fixable).
>
> We only call acpi_pci_irq_enable() in the pci_enable_device() path, so
> we don't know the GSI unless a Dom0 driver has claimed the device and
> called pci_enable_device() for it, which seems like it might not be
> desirable.
I think that's always the case, as on dom0 devices to be passed
through are handled by pciback which does enable them.
I agree it might be best to not tie exposing the node to
pci_enable_device() having been called. Is _PRT only evaluated as
part of acpi_pci_irq_enable()? (or pci_enable_device()).
> I was hoping we could put it in /sys/firmware/acpi/interrupts, but
> that looks like it's only for SCI statistics. I guess we could moot a
> new /sys/firmware/acpi/gsi/ directory, but then each file there would
> have to identify a device, which might not be as convenient as the
> /sys/devices/ directory that already exists. I guess there may be
> GSIs for things other than PCI devices; will you ever care about any
> of those?
We only support passthrough of PCI devices so far, but I guess if any
such non-PCI devices ever appear, use a GSI, and Xen supports
passthrough for them, then yes, we would need to fetch that GSI
somehow.
Thanks, Roger.
On Mon, Feb 12, 2024 at 10:13:28AM +0100, Roger Pau Monné wrote:
> On Fri, Feb 09, 2024 at 03:05:49PM -0600, Bjorn Helgaas wrote:
> > On Thu, Feb 01, 2024 at 09:39:49AM +0100, Roger Pau Monné wrote:
> > > On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
> > > > On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
> > > > > On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
> > > > > > On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> > > > > > > On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > > > > > > > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > > > > > > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > > > > > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > > > > > > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > > > > > > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > > > > > > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > > > > > > > > >>>> For example, when xen passthrough a device to domU, it will
> > > > > > > > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > > > > > > > >>>> number.
> > > > > > > > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > > > > > > > >> ...
> > > > > > > > > >
> > > > > > > > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > > > > > > > >>> userspace. Is this passthrough brand new functionality that can't be
> > > > > > > > > >>> done today because we don't expose the GSI yet?
> > > > > > > >
> > > > > > > > I assume this must be new functionality, i.e., this kind of
> > > > > > > > passthrough does not work today, right?
> > > > > > > >
> > > > > > > > > >> has ACPI support and is responsible for detecting and controlling
> > > > > > > > > >> the hardware, also it performs privileged operations such as the
> > > > > > > > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > > > > > > > >> DomU direct access to a device, we need also to route the physical
> > > > > > > > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > > > > > > > >> the interrupts appropriately.
> > > > > > > > > >
> > > > > > > > > > What kernel interfaces are used for this setup and mapping?
> > > > > > > > >
> > > > > > > > > For passthrough devices, the setup and mapping of routing physical
> > > > > > > > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > > > > > > > need userspace to provide the GSI info, see Xen code:
> > > > > > > > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > > > > > > > GSI into hypervisor and then hypervisor will do the mapping and
> > > > > > > > > routing, kernel doesn't do the setup and mapping.
> > > > > > > >
> > > > > > > > So we have to expose the GSI to userspace not because userspace itself
> > > > > > > > uses it, but so userspace can turn around and pass it back into the
> > > > > > > > kernel?
> > > > > > >
> > > > > > > No, the point is to pass it back to Xen, which doesn't know the
> > > > > > > mapping between GSIs and PCI devices because it can't execute the ACPI
> > > > > > > AML resource methods that provide such information.
> > > > > > >
> > > > > > > The (Linux) kernel is just a proxy that forwards the hypercalls from
> > > > > > > user-space tools into Xen.
> > > > > >
> > > > > > But I guess Xen knows how to interpret a GSI even though it doesn't
> > > > > > have access to AML?
> > > > >
> > > > > On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
> > > > > to configure the RTE as requested.
> > > >
> > > > IIUC, mapping a GSI to an IO-APIC pin requires information from the
> > > > MADT. So I guess Xen does use the static ACPI tables, but not the AML
> > > > _PRT methods that would connect a GSI with a PCI device?
> > >
> > > Yes, Xen can parse the static tables, and knows the base GSI of
> > > IO-APICs from the MADT.
> > >
> > > > I guess this means Xen would not be able to deal with _MAT methods,
> > > > which also contains MADT entries? I don't know the implications of
> > > > this -- maybe it means Xen might not be able to use with hot-added
> > > > devices?
> > >
> > > It's my understanding _MAT will only be present on some very specific
> > > devices (IO-APIC or CPU objects). Xen doesn't support hotplug of
> > > IO-APICs, but hotplug of CPUs should in principle be supported with
> > > cooperation from the control domain OS (albeit it's not something that
> > > we test on our CI). I don't expect however that a CPU object _MAT
> > > method will return IO APIC entries.
> > >
> > > > The tables (including DSDT and SSDTS that contain the AML) are exposed
> > > > to userspace via /sys/firmware/acpi/tables/, but of course that
> > > > doesn't mean Xen knows how to interpret the AML, and even if it did,
> > > > Xen probably wouldn't be able to *evaluate* it since that could
> > > > conflict with the host kernel's use of AML.
> > >
> > > Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
> > > in our context).
> > >
> > > Getting back to our context though, what would be a suitable place for
> > > exposing the GSI assigned to each device?
> >
> > IIUC, the Xen hypervisor:
> >
> > - Interprets /sys/firmware/acpi/tables/APIC (or gets this via
> > something running on the Dom0 kernel) to find the physical base
> > address and GSI base, e.g., from I/O APIC, I/O SAPIC.
>
> No, Xen parses the MADT directly from memory, before starting dom0.
> That's a static table so it's fine for Xen to parse it and obtain the
> I/O APIC GSI base.
It's an interesting split to consume ACPI static tables directly but
put the AML interpreter elsewhere. I doubt the ACPI spec envisioned
that, which makes me wonder what other things we could trip over, but
that's just a tangent.
> > - Needs the GSI to locate the APIC and pin within the APIC. The
> > Dom0 kernel is the OSPM, so only it can evaluate the AML _PRT to
> > learn the PCI device -> GSI mapping.
>
> Yes, Xen doesn't know the PCI device -> GSI mapping. Dom0 needs to
> parse the ACPI methods and signal Xen to configure a GSI with a
> given trigger and polarity.
>
> > - Has direct access to the APIC physical base address to program the
> > Redirection Table.
>
> Yes, the hardware (native) I/O APIC is owned by Xen, and not directly
> accessible by dom0.
>
> > The patch seems a little messy to me because the PCI core has to keep
> > track of the GSI even though it doesn't need it itself. And the
> > current patch exposes it on all arches, even non-ACPI ones or when
> > ACPI is disabled (easily fixable).
> >
> > We only call acpi_pci_irq_enable() in the pci_enable_device() path, so
> > we don't know the GSI unless a Dom0 driver has claimed the device and
> > called pci_enable_device() for it, which seems like it might not be
> > desirable.
>
> I think that's always the case, as on dom0 devices to be passed
> through are handled by pciback which does enable them.
pcistub_init_device() labels the pci_enable_device() as a "HACK"
related to determining the IRQ, which makes me think there's not
really a requirement for the device to be *enabled* (BAR decoding
enabled) by dom0.
> I agree it might be best to not tie exposing the node to
> pci_enable_device() having been called. Is _PRT only evaluated as
> part of acpi_pci_irq_enable()? (or pci_enable_device()).
Yes. AFAICT, acpi_pci_irq_enable() is the only path that evaluates
_PRT (except for a debugger interface). I don't think it *needs* to
be that way, and the fact that we do it per-device like that means we
evaluate _PRT many times even though I think the results never change.
I could imagine evaluating _PRT once as part of enumerating a PCI host
bridge (and maybe PCI-PCI bridge, per acpi_pci_irq_find_prt_entry()
comment), but that looks like a fair bit of work to implement. And of
course it doesn't really affect the question of how to expose the
result, although it does suggest /sys/bus/acpi/devices/PNP0A03:00/ as
a possible location.
Bjorn
On Mon, Feb 12, 2024 at 01:18:58PM -0600, Bjorn Helgaas wrote:
> On Mon, Feb 12, 2024 at 10:13:28AM +0100, Roger Pau Monné wrote:
> > On Fri, Feb 09, 2024 at 03:05:49PM -0600, Bjorn Helgaas wrote:
> > > On Thu, Feb 01, 2024 at 09:39:49AM +0100, Roger Pau Monné wrote:
> > > > On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
> > > > > On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
> > > > > > On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
> > > > > > > On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> > > > > > > > On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > > > > > > > > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > > > > > > > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > > > > > > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > > > > > > > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > > > > > > > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > > > > > > > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > > > > > > > > > >>>> For example, when xen passthrough a device to domU, it will
> > > > > > > > > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > > > > > > > > >>>> number.
> > > > > > > > > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > > > > > > > > >> ...
> > > > > > > > > > >
> > > > > > > > > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > > > > > > > > >>> userspace. Is this passthrough brand new functionality that can't be
> > > > > > > > > > >>> done today because we don't expose the GSI yet?
> > > > > > > > >
> > > > > > > > > I assume this must be new functionality, i.e., this kind of
> > > > > > > > > passthrough does not work today, right?
> > > > > > > > >
> > > > > > > > > > >> has ACPI support and is responsible for detecting and controlling
> > > > > > > > > > >> the hardware, also it performs privileged operations such as the
> > > > > > > > > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > > > > > > > > >> DomU direct access to a device, we need also to route the physical
> > > > > > > > > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > > > > > > > > >> the interrupts appropriately.
> > > > > > > > > > >
> > > > > > > > > > > What kernel interfaces are used for this setup and mapping?
> > > > > > > > > >
> > > > > > > > > > For passthrough devices, the setup and mapping of routing physical
> > > > > > > > > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > > > > > > > > need userspace to provide the GSI info, see Xen code:
> > > > > > > > > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > > > > > > > > GSI into hypervisor and then hypervisor will do the mapping and
> > > > > > > > > > routing, kernel doesn't do the setup and mapping.
> > > > > > > > >
> > > > > > > > > So we have to expose the GSI to userspace not because userspace itself
> > > > > > > > > uses it, but so userspace can turn around and pass it back into the
> > > > > > > > > kernel?
> > > > > > > >
> > > > > > > > No, the point is to pass it back to Xen, which doesn't know the
> > > > > > > > mapping between GSIs and PCI devices because it can't execute the ACPI
> > > > > > > > AML resource methods that provide such information.
> > > > > > > >
> > > > > > > > The (Linux) kernel is just a proxy that forwards the hypercalls from
> > > > > > > > user-space tools into Xen.
> > > > > > >
> > > > > > > But I guess Xen knows how to interpret a GSI even though it doesn't
> > > > > > > have access to AML?
> > > > > >
> > > > > > On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
> > > > > > to configure the RTE as requested.
> > > > >
> > > > > IIUC, mapping a GSI to an IO-APIC pin requires information from the
> > > > > MADT. So I guess Xen does use the static ACPI tables, but not the AML
> > > > > _PRT methods that would connect a GSI with a PCI device?
> > > >
> > > > Yes, Xen can parse the static tables, and knows the base GSI of
> > > > IO-APICs from the MADT.
> > > >
> > > > > I guess this means Xen would not be able to deal with _MAT methods,
> > > > > which also contains MADT entries? I don't know the implications of
> > > > > this -- maybe it means Xen might not be able to use with hot-added
> > > > > devices?
> > > >
> > > > It's my understanding _MAT will only be present on some very specific
> > > > devices (IO-APIC or CPU objects). Xen doesn't support hotplug of
> > > > IO-APICs, but hotplug of CPUs should in principle be supported with
> > > > cooperation from the control domain OS (albeit it's not something that
> > > > we test on our CI). I don't expect however that a CPU object _MAT
> > > > method will return IO APIC entries.
> > > >
> > > > > The tables (including DSDT and SSDTS that contain the AML) are exposed
> > > > > to userspace via /sys/firmware/acpi/tables/, but of course that
> > > > > doesn't mean Xen knows how to interpret the AML, and even if it did,
> > > > > Xen probably wouldn't be able to *evaluate* it since that could
> > > > > conflict with the host kernel's use of AML.
> > > >
> > > > Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
> > > > in our context).
> > > >
> > > > Getting back to our context though, what would be a suitable place for
> > > > exposing the GSI assigned to each device?
> > >
> > > IIUC, the Xen hypervisor:
> > >
> > > - Interprets /sys/firmware/acpi/tables/APIC (or gets this via
> > > something running on the Dom0 kernel) to find the physical base
> > > address and GSI base, e.g., from I/O APIC, I/O SAPIC.
> >
> > No, Xen parses the MADT directly from memory, before starting dom0.
> > That's a static table so it's fine for Xen to parse it and obtain the
> > I/O APIC GSI base.
>
> It's an interesting split to consume ACPI static tables directly but
> put the AML interpreter elsewhere.
Well, static tables can be consumed by Xen, because they don't require
an AML parser (obviously), and parsing them doesn't have any
side-effects that would prevent dom0 from being the OSPM (no methods
or similar are evaluated).
> I doubt the ACPI spec envisioned
> that, which makes me wonder what other things we could trip over, but
> that's just a tangent.
Indeed, ACPI is not the best interface for the Xen/dom0 split model.
> > > - Needs the GSI to locate the APIC and pin within the APIC. The
> > > Dom0 kernel is the OSPM, so only it can evaluate the AML _PRT to
> > > learn the PCI device -> GSI mapping.
> >
> > Yes, Xen doesn't know the PCI device -> GSI mapping. Dom0 needs to
> > parse the ACPI methods and signal Xen to configure a GSI with a
> > given trigger and polarity.
> >
> > > - Has direct access to the APIC physical base address to program the
> > > Redirection Table.
> >
> > Yes, the hardware (native) I/O APIC is owned by Xen, and not directly
> > accessible by dom0.
> >
> > > The patch seems a little messy to me because the PCI core has to keep
> > > track of the GSI even though it doesn't need it itself. And the
> > > current patch exposes it on all arches, even non-ACPI ones or when
> > > ACPI is disabled (easily fixable).
> > >
> > > We only call acpi_pci_irq_enable() in the pci_enable_device() path, so
> > > we don't know the GSI unless a Dom0 driver has claimed the device and
> > > called pci_enable_device() for it, which seems like it might not be
> > > desirable.
> >
> > I think that's always the case, as on dom0 devices to be passed
> > through are handled by pciback which does enable them.
>
> pcistub_init_device() labels the pci_enable_device() as a "HACK"
> related to determining the IRQ, which makes me think there's not
> really a requirement for the device to be *enabled* (BAR decoding
> enabled) by dom0.
No, I would assume there's no need for memory decoding to be enabled
in order to get the GSI from the ACPI method. I'm confused by that
pci_enable_device() call. Is the purpose perhaps to make sure the
device is powered up so that reading the PCI header Interrupt Line and
Pin fields returns valid values? No idea whether reading those fields
requires the device to be in a certain (active) power state.
> > I agree it might be best to not tie exposing the node to
> > pci_enable_device() having been called. Is _PRT only evaluated as
> > part of acpi_pci_irq_enable()? (or pci_enable_device()).
>
> Yes. AFAICT, acpi_pci_irq_enable() is the only path that evaluates
> _PRT (except for a debugger interface). I don't think it *needs* to
> be that way, and the fact that we do it per-device like that means we
> evaluate _PRT many times even though I think the results never change.
>
> I could imagine evaluating _PRT once as part of enumerating a PCI host
> bridge (and maybe PCI-PCI bridge, per acpi_pci_irq_find_prt_entry()
> comment), but that looks like a fair bit of work to implement. And of
> course it doesn't really affect the question of how to expose the
> result, although it does suggest /sys/bus/acpi/devices/PNP0A03:00/ as
> a possible location.
So you suggest exposing the GSI as part of the PCI host bridge? I'm
afraid I'm not following how we could then map PCI SBDFs from devices
to their assigned GSI.
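
For reference, this is roughly the lookup I would expect to need if the
_PRT results were cached per host bridge. It is only a sketch under
assumptions: the row layout and names are invented, only hardwired GSIs
are modelled (no link devices), and devices behind a PCI-PCI bridge
without its own _PRT would additionally need pin swizzling, which is
not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached _PRT row; a row matches every function of a device. */
struct prt_row {
	uint16_t seg;		/* PCI segment of the host bridge */
	uint8_t bus;		/* bus this _PRT applies to */
	uint8_t device;		/* PCI device (slot) number */
	uint8_t pin;		/* 0 = INTA ... 3 = INTD, as encoded in _PRT */
	uint32_t gsi;		/* hardwired GSI for that device/pin */
};

/*
 * Map an SBDF plus the PCI Interrupt Pin register value (1 = INTA ...
 * 4 = INTD, 0 = no pin) to a GSI. Returns 0 on success, -1 otherwise.
 */
static int prt_lookup_gsi(const struct prt_row *rows, size_t n,
			  uint16_t seg, uint8_t bus, uint8_t devfn,
			  uint8_t pci_pin, uint32_t *gsi)
{
	uint8_t device = devfn >> 3;	/* upper five bits of devfn */

	assert(rows != NULL && gsi != NULL);

	if (pci_pin < 1 || pci_pin > 4)
		return -1;

	for (size_t i = 0; i < n; i++) {
		if (rows[i].seg == seg && rows[i].bus == bus &&
		    rows[i].device == device && rows[i].pin == pci_pin - 1) {
			*gsi = rows[i].gsi;
			return 0;
		}
	}
	return -1;
}
```

So the table alone is not enough: user-space would still need the
device's bus, slot and Interrupt Pin to pick the right row, which is
why a per-device node looks simpler to consume.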
Thanks, Roger.
On Fri, 5 Jan 2024, Jiqian Chen wrote:
> When device on dom0 side has been reset, the vpci on Xen side
> won't get notification, so that the cached state in vpci is
> all out of date with the real device state.
> To solve that problem, add a new function to clear all vpci
> device state when device is reset on dom0 side.
>
> Call that function in pcistub_init_device, because using
> "pci-assignable-add" to assign a passthrough device in Xen
> resets the device, which leaves the vpci state out of date,
> and the device then fails to restore its BAR state.
>
> Co-developed-by: Huang Rui <[email protected]>
> Signed-off-by: Jiqian Chen <[email protected]>
Reviewed-by: Stefano Stabellini <[email protected]>
> ---
> drivers/xen/pci.c | 12 ++++++++++++
> drivers/xen/xen-pciback/pci_stub.c | 18 +++++++++++++++---
> include/xen/interface/physdev.h | 7 +++++++
> include/xen/pci.h | 6 ++++++
> 4 files changed, 40 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
> index 72d4e3f193af..e9b30bc09139 100644
> --- a/drivers/xen/pci.c
> +++ b/drivers/xen/pci.c
> @@ -177,6 +177,18 @@ static int xen_remove_device(struct device *dev)
> return r;
> }
>
> +int xen_reset_device_state(const struct pci_dev *dev)
> +{
> + struct physdev_pci_device device = {
> + .seg = pci_domain_nr(dev->bus),
> + .bus = dev->bus->number,
> + .devfn = dev->devfn
> + };
> +
> + return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_state_reset, &device);
> +}
> +EXPORT_SYMBOL_GPL(xen_reset_device_state);
> +
> static int xen_pci_notifier(struct notifier_block *nb,
> unsigned long action, void *data)
> {
> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
> index e34b623e4b41..46c40ec8a18e 100644
> --- a/drivers/xen/xen-pciback/pci_stub.c
> +++ b/drivers/xen/xen-pciback/pci_stub.c
> @@ -89,6 +89,16 @@ static struct pcistub_device *pcistub_device_alloc(struct pci_dev *dev)
> return psdev;
> }
>
> +static int pcistub_reset_device_state(struct pci_dev *dev)
> +{
> + __pci_reset_function_locked(dev);
> +
> + if (!xen_pv_domain())
> + return xen_reset_device_state(dev);
> + else
> + return 0;
> +}
> +
> /* Don't call this directly as it's called by pcistub_device_put */
> static void pcistub_device_release(struct kref *kref)
> {
> @@ -107,7 +117,7 @@ static void pcistub_device_release(struct kref *kref)
> /* Call the reset function which does not take lock as this
> * is called from "unbind" which takes a device_lock mutex.
> */
> - __pci_reset_function_locked(dev);
> + pcistub_reset_device_state(dev);
> if (dev_data &&
> pci_load_and_free_saved_state(dev, &dev_data->pci_saved_state))
> dev_info(&dev->dev, "Could not reload PCI state\n");
> @@ -284,7 +294,7 @@ void pcistub_put_pci_dev(struct pci_dev *dev)
> * (so it's ready for the next domain)
> */
> device_lock_assert(&dev->dev);
> - __pci_reset_function_locked(dev);
> + pcistub_reset_device_state(dev);
>
> dev_data = pci_get_drvdata(dev);
> ret = pci_load_saved_state(dev, dev_data->pci_saved_state);
> @@ -420,7 +430,9 @@ static int pcistub_init_device(struct pci_dev *dev)
> dev_err(&dev->dev, "Could not store PCI conf saved state!\n");
> else {
> dev_dbg(&dev->dev, "resetting (FLR, D3, etc) the device\n");
> - __pci_reset_function_locked(dev);
> + err = pcistub_reset_device_state(dev);
> + if (err)
> + goto config_release;
> pci_restore_state(dev);
> }
> /* Now disable the device (this also ensures some private device
> diff --git a/include/xen/interface/physdev.h b/include/xen/interface/physdev.h
> index a237af867873..8609770e28f5 100644
> --- a/include/xen/interface/physdev.h
> +++ b/include/xen/interface/physdev.h
> @@ -256,6 +256,13 @@ struct physdev_pci_device_add {
> */
> #define PHYSDEVOP_prepare_msix 30
> #define PHYSDEVOP_release_msix 31
> +/*
> + * Notify the hypervisor that a PCI device has been reset, so that any
> + * internally cached state is regenerated. Should be called after any
> + * device reset performed by the hardware domain.
> + */
> +#define PHYSDEVOP_pci_device_state_reset 32
> +
> struct physdev_pci_device {
> /* IN */
> uint16_t seg;
> diff --git a/include/xen/pci.h b/include/xen/pci.h
> index b8337cf85fd1..b2e2e856efd6 100644
> --- a/include/xen/pci.h
> +++ b/include/xen/pci.h
> @@ -4,10 +4,16 @@
> #define __XEN_PCI_H__
>
> #if defined(CONFIG_XEN_DOM0)
> +int xen_reset_device_state(const struct pci_dev *dev);
> int xen_find_device_domain_owner(struct pci_dev *dev);
> int xen_register_device_domain_owner(struct pci_dev *dev, uint16_t domain);
> int xen_unregister_device_domain_owner(struct pci_dev *dev);
> #else
> +static inline int xen_reset_device_state(const struct pci_dev *dev)
> +{
> + return -1;
> +}
> +
> static inline int xen_find_device_domain_owner(struct pci_dev *dev)
> {
> return -1;
> --
> 2.34.1
>
On Fri, 5 Jan 2024, Jiqian Chen wrote:
> In PVH dom0, the gsis don't get registered, but the gsi of
> a passthrough device must be configured for it to be able to be
> mapped into a domU.
>
> When assigning a device for passthrough, proactively set up the gsi
> of the device during that process.
>
> Co-developed-by: Huang Rui <[email protected]>
> Signed-off-by: Jiqian Chen <[email protected]>
> ---
> arch/x86/xen/enlighten_pvh.c | 90 ++++++++++++++++++++++++++++++
> drivers/acpi/pci_irq.c | 2 +-
> drivers/xen/xen-pciback/pci_stub.c | 8 +++
> include/linux/acpi.h | 1 +
> include/xen/acpi.h | 6 ++
> 5 files changed, 106 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
> index ada3868c02c2..ecadd966c684 100644
> --- a/arch/x86/xen/enlighten_pvh.c
> +++ b/arch/x86/xen/enlighten_pvh.c
> @@ -1,6 +1,7 @@
> // SPDX-License-Identifier: GPL-2.0
> #include <linux/acpi.h>
> #include <linux/export.h>
> +#include <linux/pci.h>
>
> #include <xen/hvc-console.h>
>
> @@ -25,6 +26,95 @@
> bool __ro_after_init xen_pvh;
> EXPORT_SYMBOL_GPL(xen_pvh);
>
> +typedef struct gsi_info {
> + int gsi;
> + int trigger;
> + int polarity;
> +} gsi_info_t;
> +
> +struct acpi_prt_entry {
> + struct acpi_pci_id id;
> + u8 pin;
> + acpi_handle link;
> + u32 index; /* GSI, or link _CRS index */
> +};
> +
> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
> + gsi_info_t *gsi_info)
> +{
> + int gsi;
> + u8 pin = 0;
> + struct acpi_prt_entry *entry;
> + int trigger = ACPI_LEVEL_SENSITIVE;
> + int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
> + ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
> +
> + if (dev)
> + pin = dev->pin;
This is minor, but you can just move the pin = dev->pin after the !dev
check below.
With that change, and assuming the Xen-side and QEMU-side patches are
accepted:
Reviewed-by: Stefano Stabellini <[email protected]>
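
One possible reading of the reordering, as a standalone sketch: the
struct and function names below are stand-ins so it compiles outside
the kernel, not the names used in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-ins for struct pci_dev and gsi_info_t, for illustration only. */
struct pci_dev_stub {
	unsigned char pin;	/* Interrupt Pin, 0 means the device uses none */
};

struct gsi_info_stub {
	int gsi;
	int trigger;
	int polarity;
};

/* Validate the arguments first, then read dev->pin without a NULL guard. */
static int validate_gsi_args(const struct pci_dev_stub *dev,
			     const struct gsi_info_stub *gsi_info,
			     unsigned char *pin_out)
{
	unsigned char pin;

	if (!dev || !gsi_info)
		return -EINVAL;

	pin = dev->pin;		/* safe: dev was checked just above */
	if (!pin)
		return -EINVAL;

	*pin_out = pin;
	return 0;
}
```

This drops the "u8 pin = 0" initialiser and the separate "if (dev)"
branch while keeping the same error returns.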
> + if (!dev || !pin || !gsi_info)
> + return -EINVAL;
> +
> + entry = acpi_pci_irq_lookup(dev, pin);
> + if (entry) {
> + if (entry->link)
> + gsi = acpi_pci_link_allocate_irq(entry->link,
> + entry->index,
> + &trigger, &polarity,
> + NULL);
> + else
> + gsi = entry->index;
> + } else
> + gsi = -1;
> +
> + if (gsi < 0)
> + return -EINVAL;
> +
> + gsi_info->gsi = gsi;
> + gsi_info->trigger = trigger;
> + gsi_info->polarity = polarity;
> +
> + return 0;
> +}
> +
> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
> +{
> + struct physdev_setup_gsi setup_gsi;
> +
> + if (!gsi_info)
> + return -EINVAL;
> +
> + setup_gsi.gsi = gsi_info->gsi;
> + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
> + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
> +
> + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
> +}
> +
> +int xen_pvh_passthrough_gsi(struct pci_dev *dev)
> +{
> + int ret;
> + gsi_info_t gsi_info;
> +
> + if (!dev)
> + return -EINVAL;
> +
> + ret = xen_pvh_get_gsi_info(dev, &gsi_info);
> + if (ret) {
> + xen_raw_printk("Fail to get gsi info!\n");
> + return ret;
> + }
> +
> + ret = xen_pvh_setup_gsi(&gsi_info);
> + if (ret == -EEXIST) {
> + xen_raw_printk("Already setup the GSI :%d\n", gsi_info.gsi);
> + ret = 0;
> + } else if (ret)
> + xen_raw_printk("Fail to setup GSI (%d)!\n", gsi_info.gsi);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
> +
> void __init xen_pvh_init(struct boot_params *boot_params)
> {
> u32 msr;
> diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
> index ff30ceca2203..630fe0a34bc6 100644
> --- a/drivers/acpi/pci_irq.c
> +++ b/drivers/acpi/pci_irq.c
> @@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
> }
> #endif /* CONFIG_X86_IO_APIC */
>
> -static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
> {
> struct acpi_prt_entry *entry = NULL;
> struct pci_dev *bridge;
> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
> index 46c40ec8a18e..22d4380d2b04 100644
> --- a/drivers/xen/xen-pciback/pci_stub.c
> +++ b/drivers/xen/xen-pciback/pci_stub.c
> @@ -20,6 +20,7 @@
> #include <linux/atomic.h>
> #include <xen/events.h>
> #include <xen/pci.h>
> +#include <xen/acpi.h>
> #include <xen/xen.h>
> #include <asm/xen/hypervisor.h>
> #include <xen/interface/physdev.h>
> @@ -435,6 +436,13 @@ static int pcistub_init_device(struct pci_dev *dev)
> goto config_release;
> pci_restore_state(dev);
> }
> +
> + if (xen_initial_domain() && xen_pvh_domain()) {
> + err = xen_pvh_passthrough_gsi(dev);
> + if (err)
> + goto config_release;
> + }
> +
> /* Now disable the device (this also ensures some private device
> * data is setup before we export)
> */
> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> index 4db54e928b36..7ea3be981cc3 100644
> --- a/include/linux/acpi.h
> +++ b/include/linux/acpi.h
> @@ -360,6 +360,7 @@ void acpi_unregister_gsi (u32 gsi);
>
> struct pci_dev;
>
> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin);
> int acpi_pci_irq_enable (struct pci_dev *dev);
> void acpi_penalize_isa_irq(int irq, int active);
> bool acpi_isa_irq_available(int irq);
> diff --git a/include/xen/acpi.h b/include/xen/acpi.h
> index b1e11863144d..17c4d37f1e60 100644
> --- a/include/xen/acpi.h
> +++ b/include/xen/acpi.h
> @@ -67,10 +67,16 @@ static inline void xen_acpi_sleep_register(void)
> acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
> }
> }
> +int xen_pvh_passthrough_gsi(struct pci_dev *dev);
> #else
> static inline void xen_acpi_sleep_register(void)
> {
> }
> +
> +static inline int xen_pvh_passthrough_gsi(struct pci_dev *dev)
> +{
> + return -1;
> +}
> #endif
>
> #endif /* _XEN_ACPI_H */
> --
> 2.34.1
>
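
As a side note on the trigger/polarity translation in
xen_pvh_setup_gsi(), the encoding can be checked in isolation with a
sketch like the one below. The enum values are local stand-ins that I
assume mirror the ACPICA definitions (level = 0, edge = 1, active
high = 0, active low = 1); the struct is a simplified placeholder, not
struct physdev_setup_gsi.

```c
#include <assert.h>

/* Local stand-ins; assumed to match the ACPICA constants, not verified here. */
enum { LEVEL_SENSITIVE = 0, EDGE_SENSITIVE = 1 };
enum { ACTIVE_HIGH = 0, ACTIVE_LOW = 1 };

/* Simplified placeholder for the hypercall argument. */
struct setup_gsi_args {
	int gsi;
	unsigned char triggering;	/* 0 = edge, 1 = level */
	unsigned char polarity;		/* 0 = active high, 1 = active low */
};

/* Same translation as in the patch: anything not edge/high maps to 1. */
static struct setup_gsi_args encode_setup_gsi(int gsi, int trigger, int polarity)
{
	struct setup_gsi_args args;

	args.gsi = gsi;
	args.triggering = (trigger == EDGE_SENSITIVE) ? 0 : 1;
	args.polarity = (polarity == ACTIVE_HIGH) ? 0 : 1;
	return args;
}
```

With the defaults used in xen_pvh_get_gsi_info() on x86 (level
sensitive, active low), this yields triggering = 1 and polarity = 1,
which is the usual configuration for PCI INTx.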
On 2024/2/23 08:23, Stefano Stabellini wrote:
> On Fri, 5 Jan 2024, Jiqian Chen wrote:
>> In PVH dom0, the gsis don't get registered, but the gsi of
>> a passthrough device must be configured for it to be able to be
>> mapped into a domU.
>>
>> When assigning a device for passthrough, proactively set up the gsi
>> of the device during that process.
>>
>> Co-developed-by: Huang Rui <[email protected]>
>> Signed-off-by: Jiqian Chen <[email protected]>
>> ---
>> arch/x86/xen/enlighten_pvh.c | 90 ++++++++++++++++++++++++++++++
>> drivers/acpi/pci_irq.c | 2 +-
>> drivers/xen/xen-pciback/pci_stub.c | 8 +++
>> include/linux/acpi.h | 1 +
>> include/xen/acpi.h | 6 ++
>> 5 files changed, 106 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
>> index ada3868c02c2..ecadd966c684 100644
>> --- a/arch/x86/xen/enlighten_pvh.c
>> +++ b/arch/x86/xen/enlighten_pvh.c
>> @@ -1,6 +1,7 @@
>> // SPDX-License-Identifier: GPL-2.0
>> #include <linux/acpi.h>
>> #include <linux/export.h>
>> +#include <linux/pci.h>
>>
>> #include <xen/hvc-console.h>
>>
>> @@ -25,6 +26,95 @@
>> bool __ro_after_init xen_pvh;
>> EXPORT_SYMBOL_GPL(xen_pvh);
>>
>> +typedef struct gsi_info {
>> + int gsi;
>> + int trigger;
>> + int polarity;
>> +} gsi_info_t;
>> +
>> +struct acpi_prt_entry {
>> + struct acpi_pci_id id;
>> + u8 pin;
>> + acpi_handle link;
>> + u32 index; /* GSI, or link _CRS index */
>> +};
>> +
>> +static int xen_pvh_get_gsi_info(struct pci_dev *dev,
>> + gsi_info_t *gsi_info)
>> +{
>> + int gsi;
>> + u8 pin = 0;
>> + struct acpi_prt_entry *entry;
>> + int trigger = ACPI_LEVEL_SENSITIVE;
>> + int polarity = acpi_irq_model == ACPI_IRQ_MODEL_GIC ?
>> + ACPI_ACTIVE_HIGH : ACPI_ACTIVE_LOW;
>> +
>> + if (dev)
>> + pin = dev->pin;
>
> This is minor, but you can just move the pin = dev->pin after the !dev
> check below.
Will change in next version.
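For reference, a minimal standalone sketch of the reordering suggested above. This is not the next-version patch: the struct and function names are hypothetical stand-ins for the kernel types, and only the argument checks are modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct pci_dev and gsi_info_t. */
struct sketch_pci_dev { unsigned char pin; };
struct sketch_gsi_info { int gsi; int trigger; int polarity; };

/*
 * Check the pointers first, then read dev->pin, so that no separate
 * "if (dev)" guard is needed before the validation.
 * Returns 0 and stores the pin on success, -EINVAL otherwise.
 */
int sketch_read_pin(const struct sketch_pci_dev *dev,
                    const struct sketch_gsi_info *gsi_info,
                    unsigned char *pin_out)
{
    unsigned char pin;

    assert(pin_out != NULL);        /* caller contract of this sketch */

    if (!dev || !gsi_info)
        return -EINVAL;

    pin = dev->pin;                 /* safe: dev was checked above */
    if (!pin)
        return -EINVAL;             /* pin 0: device uses no INTx pin */

    *pin_out = pin;
    return 0;
}
```

The behaviour is the same as in the quoted hunk; only the order of the checks differs.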
>
> With that change, and assuming the Xen-side and QEMU-side patches are
> accepted:
>
> Reviewed-by: Stefano Stabellini <[email protected]>
Thank you very much!
>
>
>
>
>> + if (!dev || !pin || !gsi_info)
>> + return -EINVAL;
>> +
>> + entry = acpi_pci_irq_lookup(dev, pin);
>> + if (entry) {
>> + if (entry->link)
>> + gsi = acpi_pci_link_allocate_irq(entry->link,
>> + entry->index,
>> + &trigger, &polarity,
>> + NULL);
>> + else
>> + gsi = entry->index;
>> + } else
>> + gsi = -1;
>> +
>> + if (gsi < 0)
>> + return -EINVAL;
>> +
>> + gsi_info->gsi = gsi;
>> + gsi_info->trigger = trigger;
>> + gsi_info->polarity = polarity;
>> +
>> + return 0;
>> +}
>> +
>> +static int xen_pvh_setup_gsi(gsi_info_t *gsi_info)
>> +{
>> + struct physdev_setup_gsi setup_gsi;
>> +
>> + if (!gsi_info)
>> + return -EINVAL;
>> +
>> + setup_gsi.gsi = gsi_info->gsi;
>> + setup_gsi.triggering = (gsi_info->trigger == ACPI_EDGE_SENSITIVE ? 0 : 1);
>> + setup_gsi.polarity = (gsi_info->polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
>> +
>> + return HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup_gsi);
>> +}
>> +
>> +int xen_pvh_passthrough_gsi(struct pci_dev *dev)
>> +{
>> + int ret;
>> + gsi_info_t gsi_info;
>> +
>> + if (!dev)
>> + return -EINVAL;
>> +
>> + ret = xen_pvh_get_gsi_info(dev, &gsi_info);
>> + if (ret) {
>> + xen_raw_printk("Fail to get gsi info!\n");
>> + return ret;
>> + }
>> +
>> + ret = xen_pvh_setup_gsi(&gsi_info);
>> + if (ret == -EEXIST) {
>> + xen_raw_printk("Already setup the GSI :%d\n", gsi_info.gsi);
>> + ret = 0;
>> + } else if (ret)
>> + xen_raw_printk("Fail to setup GSI (%d)!\n", gsi_info.gsi);
>> +
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(xen_pvh_passthrough_gsi);
>> +
>> void __init xen_pvh_init(struct boot_params *boot_params)
>> {
>> u32 msr;
>> diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
>> index ff30ceca2203..630fe0a34bc6 100644
>> --- a/drivers/acpi/pci_irq.c
>> +++ b/drivers/acpi/pci_irq.c
>> @@ -288,7 +288,7 @@ static int acpi_reroute_boot_interrupt(struct pci_dev *dev,
>> }
>> #endif /* CONFIG_X86_IO_APIC */
>>
>> -static struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin)
>> {
>> struct acpi_prt_entry *entry = NULL;
>> struct pci_dev *bridge;
>> diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
>> index 46c40ec8a18e..22d4380d2b04 100644
>> --- a/drivers/xen/xen-pciback/pci_stub.c
>> +++ b/drivers/xen/xen-pciback/pci_stub.c
>> @@ -20,6 +20,7 @@
>> #include <linux/atomic.h>
>> #include <xen/events.h>
>> #include <xen/pci.h>
>> +#include <xen/acpi.h>
>> #include <xen/xen.h>
>> #include <asm/xen/hypervisor.h>
>> #include <xen/interface/physdev.h>
>> @@ -435,6 +436,13 @@ static int pcistub_init_device(struct pci_dev *dev)
>> goto config_release;
>> pci_restore_state(dev);
>> }
>> +
>> + if (xen_initial_domain() && xen_pvh_domain()) {
>> + err = xen_pvh_passthrough_gsi(dev);
>> + if (err)
>> + goto config_release;
>> + }
>> +
>> /* Now disable the device (this also ensures some private device
>> * data is setup before we export)
>> */
>> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
>> index 4db54e928b36..7ea3be981cc3 100644
>> --- a/include/linux/acpi.h
>> +++ b/include/linux/acpi.h
>> @@ -360,6 +360,7 @@ void acpi_unregister_gsi (u32 gsi);
>>
>> struct pci_dev;
>>
>> +struct acpi_prt_entry *acpi_pci_irq_lookup(struct pci_dev *dev, int pin);
>> int acpi_pci_irq_enable (struct pci_dev *dev);
>> void acpi_penalize_isa_irq(int irq, int active);
>> bool acpi_isa_irq_available(int irq);
>> diff --git a/include/xen/acpi.h b/include/xen/acpi.h
>> index b1e11863144d..17c4d37f1e60 100644
>> --- a/include/xen/acpi.h
>> +++ b/include/xen/acpi.h
>> @@ -67,10 +67,16 @@ static inline void xen_acpi_sleep_register(void)
>> acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
>> }
>> }
>> +int xen_pvh_passthrough_gsi(struct pci_dev *dev);
>> #else
>> static inline void xen_acpi_sleep_register(void)
>> {
>> }
>> +
>> +static inline int xen_pvh_passthrough_gsi(struct pci_dev *dev)
>> +{
>> + return -1;
>> +}
>> #endif
>>
>> #endif /* _XEN_ACPI_H */
>> --
>> 2.34.1
>>
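To make the encoding in the quoted xen_pvh_setup_gsi() and the -EEXIST handling in xen_pvh_passthrough_gsi() easier to follow, here is a small user-space sketch of just that logic. All names are hypothetical; the enums merely stand in for the ACPI trigger and polarity constants, and no hypercall is made.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the ACPI trigger/polarity constants. */
enum sketch_trigger  { SKETCH_LEVEL, SKETCH_EDGE };
enum sketch_polarity { SKETCH_ACTIVE_HIGH, SKETCH_ACTIVE_LOW };

/* Hypothetical stand-in for struct physdev_setup_gsi. */
struct sketch_setup_gsi {
    int gsi;
    unsigned char triggering;   /* 0 = edge, 1 = level, as in the patch */
    unsigned char polarity;     /* 0 = active high, 1 = active low */
};

/* Mirror the two ternaries from the quoted xen_pvh_setup_gsi(). */
void sketch_encode_gsi(int gsi, enum sketch_trigger trigger,
                       enum sketch_polarity polarity,
                       struct sketch_setup_gsi *out)
{
    assert(out != NULL);
    out->gsi = gsi;
    out->triggering = (trigger == SKETCH_EDGE) ? 0 : 1;
    out->polarity = (polarity == SKETCH_ACTIVE_HIGH) ? 0 : 1;
}

/* "Already set up" counts as success; other results pass through. */
int sketch_normalize_setup_result(int ret)
{
    return ret == -EEXIST ? 0 : ret;
}
```

With the defaults used in the patch on x86 (level triggered, active low), both encoded fields end up as 1.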
--
Best regards,
Jiqian Chen.
Hi Bjorn,
I am looking forward to more input and suggestions from you.
It seems /sys/bus/acpi/devices/PNP0A03:00/ is not a good place to create the gsi sysfs node.
On 2024/2/15 16:37, Roger Pau Monné wrote:
> On Mon, Feb 12, 2024 at 01:18:58PM -0600, Bjorn Helgaas wrote:
>> On Mon, Feb 12, 2024 at 10:13:28AM +0100, Roger Pau Monné wrote:
>>> On Fri, Feb 09, 2024 at 03:05:49PM -0600, Bjorn Helgaas wrote:
>>>> On Thu, Feb 01, 2024 at 09:39:49AM +0100, Roger Pau Monné wrote:
>>>>> On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
>>>>>> On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
>>>>>>> On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
>>>>>>>> On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
>>>>>>>>> On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
>>>>>>>>>> On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
>>>>>>>>>>> On 2024/1/24 00:02, Bjorn Helgaas wrote:
>>>>>>>>>>>> On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
>>>>>>>>>>>>> On 2024/1/23 07:37, Bjorn Helgaas wrote:
>>>>>>>>>>>>>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
>>>>>>>>>>>>>>> There is a need for some scenarios to use gsi sysfs.
>>>>>>>>>>>>>>> For example, when xen passthrough a device to dumU, it will
>>>>>>>>>>>>>>> use gsi to map pirq, but currently userspace can't get gsi
>>>>>>>>>>>>>>> number.
>>>>>>>>>>>>>>> So, add gsi sysfs for that and for other potential scenarios.
>>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't know enough about Xen to know why it needs the GSI in
>>>>>>>>>>>>>> userspace. Is this passthrough brand new functionality that can't be
>>>>>>>>>>>>>> done today because we don't expose the GSI yet?
>>>>>>>>>>
>>>>>>>>>> I assume this must be new functionality, i.e., this kind of
>>>>>>>>>> passthrough does not work today, right?
>>>>>>>>>>
>>>>>>>>>>>>> has ACPI support and is responsible for detecting and controlling
>>>>>>>>>>>>> the hardware, also it performs privileged operations such as the
>>>>>>>>>>>>> creation of normal (unprivileged) domains DomUs. When we give to a
>>>>>>>>>>>>> DomU direct access to a device, we need also to route the physical
>>>>>>>>>>>>> interrupts to the DomU. In order to do so Xen needs to setup and map
>>>>>>>>>>>>> the interrupts appropriately.
>>>>>>>>>>>>
>>>>>>>>>>>> What kernel interfaces are used for this setup and mapping?
>>>>>>>>>>>
>>>>>>>>>>> For passthrough devices, the setup and mapping of routing physical
>>>>>>>>>>> interrupts to DomU are done on Xen hypervisor side, hypervisor only
>>>>>>>>>>> need userspace to provide the GSI info, see Xen code:
>>>>>>>>>>> xc_physdev_map_pirq require GSI and then will call hypercall to pass
>>>>>>>>>>> GSI into hypervisor and then hypervisor will do the mapping and
>>>>>>>>>>> routing, kernel doesn't do the setup and mapping.
>>>>>>>>>>
>>>>>>>>>> So we have to expose the GSI to userspace not because userspace itself
>>>>>>>>>> uses it, but so userspace can turn around and pass it back into the
>>>>>>>>>> kernel?
>>>>>>>>>
>>>>>>>>> No, the point is to pass it back to Xen, which doesn't know the
>>>>>>>>> mapping between GSIs and PCI devices because it can't execute the ACPI
>>>>>>>>> AML resource methods that provide such information.
>>>>>>>>>
>>>>>>>>> The (Linux) kernel is just a proxy that forwards the hypercalls from
>>>>>>>>> user-space tools into Xen.
>>>>>>>>
>>>>>>>> But I guess Xen knows how to interpret a GSI even though it doesn't
>>>>>>>> have access to AML?
>>>>>>>
>>>>>>> On x86 Xen does know how to map a GSI into an IO-APIC pin, in order to
>>>>>>> configure the RTE as requested.
>>>>>>
>>>>>> IIUC, mapping a GSI to an IO-APIC pin requires information from the
>>>>>> MADT. So I guess Xen does use the static ACPI tables, but not the AML
>>>>>> _PRT methods that would connect a GSI with a PCI device?
>>>>>
>>>>> Yes, Xen can parse the static tables, and knows the base GSI of
>>>>> IO-APICs from the MADT.
>>>>>
>>>>>> I guess this means Xen would not be able to deal with _MAT methods,
>>>>>> which also contain MADT entries? I don't know the implications of
>>>>>> this -- maybe it means Xen might not be able to use with hot-added
>>>>>> devices?
>>>>>
>>>>> It's my understanding _MAT will only be present on some very specific
>>>>> devices (IO-APIC or CPU objects). Xen doesn't support hotplug of
>>>>> IO-APICs, but hotplug of CPUs should in principle be supported with
>>>>> cooperation from the control domain OS (albeit it's not something that
>>>>> we test on our CI). I don't expect however that a CPU object _MAT
>>>>> method will return IO APIC entries.
>>>>>
>>>>>> The tables (including DSDT and SSDTs that contain the AML) are exposed
>>>>>> to userspace via /sys/firmware/acpi/tables/, but of course that
>>>>>> doesn't mean Xen knows how to interpret the AML, and even if it did,
>>>>>> Xen probably wouldn't be able to *evaluate* it since that could
>>>>>> conflict with the host kernel's use of AML.
>>>>>
>>>>> Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
>>>>> in our context).
>>>>>
>>>>> Getting back to our context though, what would be a suitable place for
>>>>> exposing the GSI assigned to each device?
>>>>
>>>> IIUC, the Xen hypervisor:
>>>>
>>>> - Interprets /sys/firmware/acpi/tables/APIC (or gets this via
>>>> something running on the Dom0 kernel) to find the physical base
>>>> address and GSI base, e.g., from I/O APIC, I/O SAPIC.
>>>
>>> No, Xen parses the MADT directly from memory, before starting dom0.
>>> That's a static table so it's fine for Xen to parse it and obtain the
>>> I/O APIC GSI base.
>>
>> It's an interesting split to consume ACPI static tables directly but
>> put the AML interpreter elsewhere.
>
> Well, static tables can be consumed by Xen, because they don't require
> an AML parser (obviously), and parsing them doesn't have any
> side-effects that would prevent dom0 from being the OSPM (no methods
> or similar are evaluated).
>
>> I doubt the ACPI spec envisioned
>> that, which makes me wonder what other things we could trip over, but
>> that's just a tangent.
>
> Indeed, ACPI is not the best interface for the Xen/dom0 split model.
>
>>>> - Needs the GSI to locate the APIC and pin within the APIC. The
>>>> Dom0 kernel is the OSPM, so only it can evaluate the AML _PRT to
>>>> learn the PCI device -> GSI mapping.
>>>
>>> Yes, Xen doesn't know the PCI device -> GSI mapping. Dom0 needs to
>>> parse the ACPI methods and signal Xen to configure a GSI with a
>>> given trigger and polarity.
>>>
>>>> - Has direct access to the APIC physical base address to program the
>>>> Redirection Table.
>>>
>>> Yes, the hardware (native) I/O APIC is owned by Xen, and not directly
>>> accessible by dom0.
>>>
>>>> The patch seems a little messy to me because the PCI core has to keep
>>>> track of the GSI even though it doesn't need it itself. And the
>>>> current patch exposes it on all arches, even non-ACPI ones or when
>>>> ACPI is disabled (easily fixable).
>>>>
>>>> We only call acpi_pci_irq_enable() in the pci_enable_device() path, so
>>>> we don't know the GSI unless a Dom0 driver has claimed the device and
>>>> called pci_enable_device() for it, which seems like it might not be
>>>> desirable.
>>>
>>> I think that's always the case, as on dom0 devices to be passed
>>> through are handled by pciback which does enable them.
>>
>> pcistub_init_device() labels the pci_enable_device() as a "HACK"
>> related to determining the IRQ, which makes me think there's not
>> really a requirement for the device to be *enabled* (BAR decoding
>> enabled) by dom0.
>
> No, there's no need for memory decoding to be enabled for getting the
> GSI from the ACPI method I would assume. I'm confused by that
> pci_enable_device() call. Is maybe the purpose to make sure the
> device is powered up so that reading the PCI header Interrupt Line and
> Pin fields returns valid values? No idea whether reading those fields
> requires the device to be in certain (active) power states.
>
>>> I agree it might be best to not tie exposing the node to
>>> pci_enable_device() having been called. Is _PRT only evaluated as
>>> part of acpi_pci_irq_enable()? (or pci_enable_device()).
>>
>> Yes. AFAICT, acpi_pci_irq_enable() is the only path that evaluates
>> _PRT (except for a debugger interface). I don't think it *needs* to
>> be that way, and the fact that we do it per-device like that means we
>> evaluate _PRT many times even though I think the results never change.
>>
>> I could imagine evaluating _PRT once as part of enumerating a PCI host
>> bridge (and maybe PCI-PCI bridge, per acpi_pci_irq_find_prt_entry()
>> comment), but that looks like a fair bit of work to implement. And of
>> course it doesn't really affect the question of how to expose the
>> result, although it does suggest /sys/bus/acpi/devices/PNP0A03:00/ as
>> a possible location.
>
> So you suggest exposing the GSI as part of the PCI host bridge? I'm
> afraid I'm not following how we could then map PCI SBDFs from devices
> to their assigned GSI.
>
> Thanks, Roger.
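As an illustration of the GSI to IO-APIC pin mapping discussed above (Xen knowing the GSI base of each IO-APIC from the MADT), here is a minimal sketch. It is not Xen code: the struct, the function name and the per-IO-APIC pin count field are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-IO-APIC record: GSI base plus number of pins. */
struct sketch_ioapic {
    unsigned int gsi_base;
    unsigned int nr_pins;
};

/*
 * Find the IO-APIC whose GSI range covers 'gsi'.
 * Returns its index and stores the pin (gsi - gsi_base) in *pin,
 * or returns -1 when no IO-APIC in the array covers the GSI.
 */
int sketch_gsi_to_ioapic_pin(const struct sketch_ioapic *apics, size_t n,
                             unsigned int gsi, unsigned int *pin)
{
    assert(pin != NULL);

    for (size_t i = 0; i < n; i++) {
        if (gsi >= apics[i].gsi_base &&
            gsi - apics[i].gsi_base < apics[i].nr_pins) {
            *pin = gsi - apics[i].gsi_base;
            return (int)i;
        }
    }
    return -1;
}
```

This only covers the static part that Xen can do on its own. Which GSI belongs to which PCI device still has to come from dom0, since that needs _PRT.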
--
Best regards,
Jiqian Chen.
Hi Bjorn,
It has been almost two months since your last reply.
This series is blocked on this patch, since patches on the Xen and QEMU sides depend on it.
Is anything about this patch still unclear, or do you have other suggestions?
If not, may I get your Reviewed-by?
--
Best regards,
Jiqian Chen.
[+to Rafael]
On Mon, Apr 08, 2024 at 06:42:31AM +0000, Chen, Jiqian wrote:
> Hi Bjorn,
> It has been almost two months since your last reply.
> This series is blocked on this patch, since patches on the Xen and QEMU sides depend on it.
> Is anything about this patch still unclear, or do you have other suggestions?
> If not, may I get your Reviewed-by?
  - This is ACPI-specific, but exposes /sys/.../gsi for all systems,
    including non-ACPI systems. I don't think we want that.

  - Do you care about similar Xen configurations on non-ACPI systems?
    If so, maybe the commit log could mention how you learn about PCI
    INTx routing on them in case there's some way to unify this in the
    future.

  - Missing an update to Documentation/ABI/.

  - A nit: I asked about s/dumU/DomU/ in the commit log earlier,
    haven't seen any response.

  - Commit log mentions "and for other potential scenarios." It's
    another nit, but unless you have another concrete use for this,
    that phrase is meaningless hand waving and should be dropped.

  - A _PRT entry may refer directly to a GSI or to an interrupt link
    device (PNP0C0F) that can be routed to one of several GSIs:

      ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)

    I don't think the kernel reconfigures interrupt links after
    enumeration, but if they are reconfigured at run-time (via _SRS),
    the cached GSI will be wrong. I think setpnp could do this, but
    that tool is dead. So maybe this isn't a concern anymore, but I
    *would* like to get Rafael's take on this. If we don't care
    enough, I think we should mention it in the commit log just in
    case.

Bjorn
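
To illustrate the last point, a small sketch of why a GSI cached at enumeration time can go stale when a _PRT entry points at an interrupt link. The types and names are hypothetical, and re-routing via _SRS is only simulated by changing a field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interrupt link device (think LNKA) and its current route. */
struct sketch_link { int current_gsi; };

/* Hypothetical _PRT entry: either a direct GSI or a reference to a link. */
struct sketch_prt_entry {
    const struct sketch_link *link;  /* NULL when 'index' is the GSI itself */
    unsigned int index;
};

/* Resolve the GSI the entry routes to right now. */
int sketch_resolve_gsi(const struct sketch_prt_entry *entry)
{
    assert(entry != NULL);

    if (entry->link)
        return entry->link->current_gsi;
    return (int)entry->index;
}
```

A value saved from sketch_resolve_gsi() at enumeration keeps the old number if the link is later re-routed, while a fresh call returns the new one. A direct entry is not affected.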