2023-10-03 04:45:09

by Anup Patel

Subject: [PATCH v10 00/15] Linux RISC-V AIA Support

The RISC-V AIA specification is ratified as per the RISC-V International
process. The latest ratified AIA specification can be found at:
https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf

At a high level, the AIA specification adds three things:
1) AIA CSRs
- Improved local interrupt support
2) Incoming Message Signaled Interrupt Controller (IMSIC)
- Per-HART MSI controller (see the claim-loop sketch below)
- Supports MSI virtualization
- Supports IPIs along with virtualization
3) Advanced Platform-Level Interrupt Controller (APLIC)
- Wired interrupt controller
- In MSI-mode, converts wired interrupts into MSIs (i.e. MSI generator)
- In Direct-mode, injects external interrupts directly into HARTs
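
As a hedged illustration (not part of this series), here is roughly how
a CPU claims pending MSIs from its IMSIC interrupt file. CSR_TOPEI,
TOPEI_ID_SHIFT and csr_swap() are assumed to be available as in the
arch/riscv asm/csr.h definitions, and handle_msi_id() is a hypothetical
consumer:

/*
 * Sketch: claim-and-handle loop for pending IMSIC MSIs on the
 * local CPU. csr_swap() atomically reads the top external
 * interrupt CSR and claims (clears) that pending identity.
 */
static void imsic_claim_pending_msis(void)
{
	unsigned long topei;

	while ((topei = csr_swap(CSR_TOPEI, 0)))
		handle_msi_id(topei >> TOPEI_ID_SHIFT);	/* hypothetical */
}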

For an overview of the AIA specification, refer to the AIA virtualization
talk at KVM Forum 2022:
https://static.sched.com/hosted_files/kvmforum2022/a1/AIA_Virtualization_in_KVM_RISCV_final.pdf
https://www.youtube.com/watch?v=r071dL8Z0yo

To test this series, use QEMU v7.2 (or higher) and OpenSBI v1.2 (or higher).

These patches can also be found in the riscv_aia_v10 branch at:
https://github.com/avpatel/linux.git

Changes since v9:
- Rebased on Linux-6.6-rc4
- Use builtin_platform_driver() in PATCH5, PATCH9, and PATCH12

Changes since v8:
- Rebased on Linux-6.6-rc3
- Dropped PATCH2 of the v8 series since, based on Marc Z's comments on
ACPI AIA support, we won't be requiring riscv_get_intc_hartid().
- Addressed Saravana's comments in PATCH3 of the v8 series
- Updated PATCH9 and PATCH13 of the v8 series based on comments from Sunil

Changes since v7:
- Rebased on Linux-6.6-rc1
- Addressed comments on PATCH1 of the v7 series and split it into two PATCHes
- Use DEFINE_SIMPLE_PROP() in PATCH2 of the v7 series

Changes since v6:
- Rebased on Linux-6.5-rc4
- Updated PATCH2 to use IS_ENABLED(CONFIG_SPARC) instead of
!IS_ENABLED(CONFIG_OF_IRQ)
- Added new PATCH4 to fix syscore registration in PLIC driver
- Updated PATCH5 to convert the PLIC driver into a full-blown platform
driver with a re-written probe function.

Changes since v5:
- Rebased on Linux-6.5-rc2
- Updated the overall series to ensure that only the IPI, timer, and
INTC drivers are probed very early whereas the rest of the interrupt
controllers (such as the PLIC, APLIC, and IMSIC) are probed as
regular platform drivers.
- Renamed riscv_fw_parent_hartid() to riscv_get_intc_hartid()
- New PATCH1 to add fw_devlink support for msi-parent DT property
- New PATCH2 to ensure all INTC suppliers are initialized, which in turn
fixes the probing issue for the PLIC, APLIC, and IMSIC platform drivers
- New PATCH3 to use platform driver probing for PLIC
- Re-structured the IMSIC driver into two separate drivers: early and
platform. The IMSIC early driver (PATCH7) only initializes the IMSIC
state and provides IPIs, whereas the IMSIC platform driver (PATCH8) is
probed as a regular platform driver and provides the MSI domain for
platform devices.
- Re-structured the APLIC platform driver into three separate sources:
main, direct mode, and MSI mode.

Changes since v4:
- Rebased on Linux-6.5-rc1
- Added "Dependencies" in the APLIC bindings (PATCH6 in v4)
- Dropped the PATCH6 which was changing the IOMMU DMA domain APIs
- Dropped use of IOMMU DMA APIs in the IMSIC driver (PATCH4)

Changes since v3:
- Rebased on Linux-6.4-rc6
- Dropped PATCH2 of the v3 series; instead we now set
FWNODE_FLAG_BEST_EFFORT via IRQCHIP_DECLARE()
- Extended riscv_fw_parent_hartid() to support both DT and ACPI in PATCH1
- Extended iommu_dma_compose_msi_msg() instead of adding
iommu_dma_select_msi() in PATCH6
- Addressed Conor's comments in PATCH3
- Addressed Conor's and Rob's comments in PATCH7

Changes since v2:
- Rebased on Linux-6.4-rc1
- Addressed Rob's comments on DT bindings patches 4 and 8.
- Addressed Marc's comments on IMSIC driver PATCH5
- Replaced use of OF APIs in the APLIC and IMSIC drivers with FWNODE
APIs; this makes both drivers easily portable for ACPI support and
also removes unnecessary indirection from the two drivers.
- PATCH1 is a new patch for portability with ACPI support
- PATCH2 is a new patch to fix probing in the APLIC driver for APLIC-only systems.
- PATCH7 is a new patch which addresses the IOMMU DMA domain issues pointed
out by SiFive

Changes since v1:
- Rebased on Linux-6.2-rc2
- Addressed comments on IMSIC DT bindings for PATCH4
- Use raw_spin_lock_irqsave() on ids_lock for PATCH5
- Improved MMIO alignment checks in PATCH5 to allow MMIO regions
with holes.
- Addressed comments on APLIC DT bindings for PATCH6
- Fixed warning splat in aplic_msi_write_msg() caused by
zeroed MSI message in PATCH7
- Dropped the DT property riscv,slow-ipi; a module parameter will be
added instead in the future.

Anup Patel (15):
RISC-V: Don't fail in riscv_of_parent_hartid() for disabled HARTs
of: property: Add fw_devlink support for msi-parent
drivers: irqchip/riscv-intc: Mark all INTC nodes as initialized
irqchip/sifive-plic: Fix syscore registration for multi-socket systems
irqchip/sifive-plic: Convert PLIC driver into a platform driver
irqchip/riscv-intc: Add support for RISC-V AIA
dt-bindings: interrupt-controller: Add RISC-V incoming MSI controller
irqchip: Add RISC-V incoming MSI controller early driver
irqchip/riscv-imsic: Add support for platform MSI irqdomain
irqchip/riscv-imsic: Add support for PCI MSI irqdomain
dt-bindings: interrupt-controller: Add RISC-V advanced PLIC
irqchip: Add RISC-V advanced PLIC driver for direct-mode
irqchip/riscv-aplic: Add support for MSI-mode
RISC-V: Select APLIC and IMSIC drivers
MAINTAINERS: Add entry for RISC-V AIA drivers

.../interrupt-controller/riscv,aplic.yaml | 172 ++++++
.../interrupt-controller/riscv,imsics.yaml | 172 ++++++
MAINTAINERS | 14 +
arch/riscv/Kconfig | 2 +
arch/riscv/kernel/cpu.c | 11 +-
drivers/irqchip/Kconfig | 24 +
drivers/irqchip/Makefile | 3 +
drivers/irqchip/irq-riscv-aplic-direct.c | 343 +++++++++++
drivers/irqchip/irq-riscv-aplic-main.c | 232 +++++++
drivers/irqchip/irq-riscv-aplic-main.h | 53 ++
drivers/irqchip/irq-riscv-aplic-msi.c | 285 +++++++++
drivers/irqchip/irq-riscv-imsic-early.c | 259 ++++++++
drivers/irqchip/irq-riscv-imsic-platform.c | 319 ++++++++++
drivers/irqchip/irq-riscv-imsic-state.c | 570 ++++++++++++++++++
drivers/irqchip/irq-riscv-imsic-state.h | 67 ++
drivers/irqchip/irq-riscv-intc.c | 44 +-
drivers/irqchip/irq-sifive-plic.c | 242 +++++---
drivers/of/property.c | 2 +
include/linux/irqchip/riscv-aplic.h | 119 ++++
include/linux/irqchip/riscv-imsic.h | 86 +++
20 files changed, 2915 insertions(+), 104 deletions(-)
create mode 100644 Documentation/devicetree/bindings/interrupt-controller/riscv,aplic.yaml
create mode 100644 Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml
create mode 100644 drivers/irqchip/irq-riscv-aplic-direct.c
create mode 100644 drivers/irqchip/irq-riscv-aplic-main.c
create mode 100644 drivers/irqchip/irq-riscv-aplic-main.h
create mode 100644 drivers/irqchip/irq-riscv-aplic-msi.c
create mode 100644 drivers/irqchip/irq-riscv-imsic-early.c
create mode 100644 drivers/irqchip/irq-riscv-imsic-platform.c
create mode 100644 drivers/irqchip/irq-riscv-imsic-state.c
create mode 100644 drivers/irqchip/irq-riscv-imsic-state.h
create mode 100644 include/linux/irqchip/riscv-aplic.h
create mode 100644 include/linux/irqchip/riscv-imsic.h

--
2.34.1


2023-10-03 04:45:45

by Anup Patel

Subject: [PATCH v10 06/15] irqchip/riscv-intc: Add support for RISC-V AIA

The RISC-V advanced interrupt architecture (AIA) extends the per-HART
local interrupts in the following ways:
1. A minimum of 64 local interrupts for both RV32 and RV64
2. Ability to process multiple pending local interrupts in the same
interrupt handler
3. Priority configuration for each local interrupt
4. Special CSRs to configure/access the per-HART MSI controller

We add support for #1 and #2 described above in the RISC-V INTC driver.
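
For context, a hedged sketch of how a per-CPU local interrupt is
consumed through this INTC irqdomain, modelled on what the RISC-V timer
driver does; timer_handler and the per-CPU variable timer_percpu_dev
are hypothetical placeholders:

/*
 * Sketch: mapping and requesting a per-CPU local interrupt
 * (here the timer, hwirq RV_IRQ_TIMER) from the INTC domain.
 */
static int __init example_map_local_irq(void)
{
	struct irq_domain *domain;
	int irq;

	domain = irq_find_matching_fwnode(riscv_get_intc_hwnode(),
					  DOMAIN_BUS_ANY);
	if (!domain)
		return -ENODEV;

	irq = irq_create_mapping(domain, RV_IRQ_TIMER);
	if (!irq)
		return -ENODEV;

	/* Per-CPU local interrupts use the percpu request API */
	return request_percpu_irq(irq, timer_handler, "riscv-timer",
				  &timer_percpu_dev);
}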

Signed-off-by: Anup Patel <[email protected]>
---
drivers/irqchip/irq-riscv-intc.c | 34 ++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/drivers/irqchip/irq-riscv-intc.c b/drivers/irqchip/irq-riscv-intc.c
index e8d01b14ccdd..bab536bbaf2c 100644
--- a/drivers/irqchip/irq-riscv-intc.c
+++ b/drivers/irqchip/irq-riscv-intc.c
@@ -17,6 +17,7 @@
#include <linux/module.h>
#include <linux/of.h>
#include <linux/smp.h>
+#include <asm/hwcap.h>

static struct irq_domain *intc_domain;

@@ -30,6 +31,15 @@ static asmlinkage void riscv_intc_irq(struct pt_regs *regs)
generic_handle_domain_irq(intc_domain, cause);
}

+static asmlinkage void riscv_intc_aia_irq(struct pt_regs *regs)
+{
+ unsigned long topi;
+
+ while ((topi = csr_read(CSR_TOPI)))
+ generic_handle_domain_irq(intc_domain,
+ topi >> TOPI_IID_SHIFT);
+}
+
/*
* On RISC-V systems local interrupts are masked or unmasked by writing
* the SIE (Supervisor Interrupt Enable) CSR. As CSRs can only be written
@@ -39,12 +49,18 @@ static asmlinkage void riscv_intc_irq(struct pt_regs *regs)

static void riscv_intc_irq_mask(struct irq_data *d)
{
- csr_clear(CSR_IE, BIT(d->hwirq));
+ if (IS_ENABLED(CONFIG_32BIT) && d->hwirq >= BITS_PER_LONG)
+ csr_clear(CSR_IEH, BIT(d->hwirq - BITS_PER_LONG));
+ else
+ csr_clear(CSR_IE, BIT(d->hwirq));
}

static void riscv_intc_irq_unmask(struct irq_data *d)
{
- csr_set(CSR_IE, BIT(d->hwirq));
+ if (IS_ENABLED(CONFIG_32BIT) && d->hwirq >= BITS_PER_LONG)
+ csr_set(CSR_IEH, BIT(d->hwirq - BITS_PER_LONG));
+ else
+ csr_set(CSR_IE, BIT(d->hwirq));
}

static void riscv_intc_irq_eoi(struct irq_data *d)
@@ -115,16 +131,20 @@ static struct fwnode_handle *riscv_intc_hwnode(void)

static int __init riscv_intc_init_common(struct fwnode_handle *fn)
{
- int rc;
+ int rc, nr_irqs = riscv_isa_extension_available(NULL, SxAIA) ?
+ 64 : BITS_PER_LONG;

- intc_domain = irq_domain_create_linear(fn, BITS_PER_LONG,
+ intc_domain = irq_domain_create_linear(fn, nr_irqs,
&riscv_intc_domain_ops, NULL);
if (!intc_domain) {
pr_err("unable to add IRQ domain\n");
return -ENXIO;
}

- rc = set_handle_irq(&riscv_intc_irq);
+ if (riscv_isa_extension_available(NULL, SxAIA))
+ rc = set_handle_irq(&riscv_intc_aia_irq);
+ else
+ rc = set_handle_irq(&riscv_intc_irq);
if (rc) {
pr_err("failed to set irq handler\n");
return rc;
@@ -132,7 +152,9 @@ static int __init riscv_intc_init_common(struct fwnode_handle *fn)

riscv_set_intc_hwnode_fn(riscv_intc_hwnode);

- pr_info("%d local interrupts mapped\n", BITS_PER_LONG);
+ pr_info("%d local interrupts mapped%s\n",
+ nr_irqs, riscv_isa_extension_available(NULL, SxAIA) ?
+ " using AIA" : "");

return 0;
}
--
2.34.1

2023-10-03 04:46:08

by Anup Patel

Subject: [PATCH v10 05/15] irqchip/sifive-plic: Convert PLIC driver into a platform driver

The PLIC driver does not require very early initialization, so let
us convert it into a platform driver.

As part of the conversion, the PLIC probing undergoes the following
changes:
1. Use dev_info(), dev_err() and dev_warn() instead of pr_info(),
pr_err() and pr_warn()
2. Use devm_xyz() APIs wherever applicable
3. The PLIC is now probed after CPUs are brought up, so we have to
set up the cpuhp state only after the context handlers of all online
CPUs are initialized; otherwise we see a crash on multi-socket
systems (see the sketch below)
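
Point 3 deserves a closer look: the cpuhp "starting" callback runs on
every online CPU and touches that CPU's per-CPU context handler, so it
must not be registered until every handler is in place. A minimal
restatement of the check the new probe function performs, not a
verbatim excerpt:

/*
 * Sketch: only do cpuhp/syscore setup once the context handlers
 * of all online CPUs have been initialized. A multi-socket system
 * has one PLIC instance per socket, and a single probe only
 * initializes the contexts belonging to its own PLIC.
 */
static bool plic_handlers_ready(void)
{
	struct plic_handler *handler;
	int cpu;

	for_each_online_cpu(cpu) {
		handler = per_cpu_ptr(&plic_handlers, cpu);
		if (!handler->present)
			return false;	/* another PLIC owns this CPU */
	}
	return true;
}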

Signed-off-by: Anup Patel <[email protected]>
---
drivers/irqchip/irq-sifive-plic.c | 239 ++++++++++++++++++------------
1 file changed, 148 insertions(+), 91 deletions(-)

diff --git a/drivers/irqchip/irq-sifive-plic.c b/drivers/irqchip/irq-sifive-plic.c
index 5b7bc4fd9517..c8f8a8cdcce1 100644
--- a/drivers/irqchip/irq-sifive-plic.c
+++ b/drivers/irqchip/irq-sifive-plic.c
@@ -3,7 +3,6 @@
* Copyright (C) 2017 SiFive
* Copyright (C) 2018 Christoph Hellwig
*/
-#define pr_fmt(fmt) "plic: " fmt
#include <linux/cpu.h>
#include <linux/interrupt.h>
#include <linux/io.h>
@@ -64,6 +63,7 @@
#define PLIC_QUIRK_EDGE_INTERRUPT 0

struct plic_priv {
+ struct device *dev;
struct cpumask lmask;
struct irq_domain *irqdomain;
void __iomem *regs;
@@ -85,7 +85,6 @@ struct plic_handler {
struct plic_priv *priv;
};
static int plic_parent_irq __ro_after_init;
-static bool plic_cpuhp_setup_done __ro_after_init;
static DEFINE_PER_CPU(struct plic_handler, plic_handlers);

static int plic_irq_set_type(struct irq_data *d, unsigned int type);
@@ -371,7 +370,8 @@ static void plic_handle_irq(struct irq_desc *desc)
int err = generic_handle_domain_irq(handler->priv->irqdomain,
hwirq);
if (unlikely(err))
- pr_warn_ratelimited("can't find mapping for hwirq %lu\n",
+ dev_warn_ratelimited(handler->priv->dev,
+ "can't find mapping for hwirq %lu\n",
hwirq);
}

@@ -406,57 +406,126 @@ static int plic_starting_cpu(unsigned int cpu)
return 0;
}

-static int __init __plic_init(struct device_node *node,
- struct device_node *parent,
- unsigned long plic_quirks)
+static const struct of_device_id plic_match[] = {
+ { .compatible = "sifive,plic-1.0.0" },
+ { .compatible = "riscv,plic0" },
+ { .compatible = "andestech,nceplic100",
+ .data = (const void *)BIT(PLIC_QUIRK_EDGE_INTERRUPT) },
+ { .compatible = "thead,c900-plic",
+ .data = (const void *)BIT(PLIC_QUIRK_EDGE_INTERRUPT) },
+ {}
+};
+
+static int plic_parse_nr_irqs_and_contexts(struct platform_device *pdev,
+ u32 *nr_irqs, u32 *nr_contexts)
{
- int error = 0, nr_contexts, nr_handlers = 0, i;
- u32 nr_irqs;
- struct plic_priv *priv;
+ struct device *dev = &pdev->dev;
+ int rc;
+
+ /*
+ * Currently, only an OF fwnode is supported; extend this
+ * function when adding ACPI support.
+ */
+ if (!is_of_node(dev->fwnode))
+ return -EINVAL;
+
+ rc = of_property_read_u32(to_of_node(dev->fwnode),
+ "riscv,ndev", nr_irqs);
+ if (rc) {
+ dev_err(dev, "riscv,ndev property not available\n");
+ return rc;
+ }
+
+ *nr_contexts = of_irq_count(to_of_node(dev->fwnode));
+ if (WARN_ON(!(*nr_contexts))) {
+ dev_err(dev, "no PLIC context available\n");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int plic_parse_context_parent_hwirq(struct platform_device *pdev,
+ u32 context, u32 *parent_hwirq,
+ unsigned long *parent_hartid)
+{
+ struct device *dev = &pdev->dev;
+ struct of_phandle_args parent;
+ int rc;
+
+ /*
+ * Currently, only an OF fwnode is supported; extend this
+ * function when adding ACPI support.
+ */
+ if (!is_of_node(dev->fwnode))
+ return -EINVAL;
+
+ rc = of_irq_parse_one(to_of_node(dev->fwnode), context, &parent);
+ if (rc)
+ return rc;
+
+ rc = riscv_of_parent_hartid(parent.np, parent_hartid);
+ if (rc)
+ return rc;
+
+ *parent_hwirq = parent.args[0];
+ return 0;
+}
+
+static int plic_probe(struct platform_device *pdev)
+{
+ int rc, nr_contexts, nr_handlers = 0, i, cpu;
+ unsigned long plic_quirks = 0, hartid;
+ struct device *dev = &pdev->dev;
struct plic_handler *handler;
- unsigned int cpu;
+ u32 nr_irqs, parent_hwirq;
+ struct irq_domain *domain;
+ struct plic_priv *priv;
+ irq_hw_number_t hwirq;
+ struct resource *res;
+ bool cpuhp_setup;
+
+ if (is_of_node(dev->fwnode)) {
+ const struct of_device_id *id;
+
+ id = of_match_node(plic_match, to_of_node(dev->fwnode));
+ if (id)
+ plic_quirks = (unsigned long)id->data;
+ }

- priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
if (!priv)
return -ENOMEM;
-
+ priv->dev = dev;
priv->plic_quirks = plic_quirks;

- priv->regs = of_iomap(node, 0);
- if (WARN_ON(!priv->regs)) {
- error = -EIO;
- goto out_free_priv;
+ res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+ if (!res) {
+ dev_err(dev, "failed to get MMIO resource\n");
+ return -EINVAL;
+ }
+ priv->regs = devm_ioremap(dev, res->start, resource_size(res));
+ if (!priv->regs) {
+ dev_err(dev, "failed to map MMIO registers\n");
+ return -EIO;
}

- error = -EINVAL;
- of_property_read_u32(node, "riscv,ndev", &nr_irqs);
- if (WARN_ON(!nr_irqs))
- goto out_iounmap;
-
+ rc = plic_parse_nr_irqs_and_contexts(pdev, &nr_irqs, &nr_contexts);
+ if (rc) {
+ dev_err(dev, "failed to parse irqs and contexts\n");
+ return rc;
+ }
priv->nr_irqs = nr_irqs;

- priv->prio_save = bitmap_alloc(nr_irqs, GFP_KERNEL);
+ priv->prio_save = devm_bitmap_zalloc(dev, nr_irqs, GFP_KERNEL);
if (!priv->prio_save)
- goto out_free_priority_reg;
-
- nr_contexts = of_irq_count(node);
- if (WARN_ON(!nr_contexts))
- goto out_free_priority_reg;
-
- error = -ENOMEM;
- priv->irqdomain = irq_domain_add_linear(node, nr_irqs + 1,
- &plic_irqdomain_ops, priv);
- if (WARN_ON(!priv->irqdomain))
- goto out_free_priority_reg;
+ return -ENOMEM;

for (i = 0; i < nr_contexts; i++) {
- struct of_phandle_args parent;
- irq_hw_number_t hwirq;
- int cpu;
- unsigned long hartid;
-
- if (of_irq_parse_one(node, i, &parent)) {
- pr_err("failed to parse parent for context %d.\n", i);
+ rc = plic_parse_context_parent_hwirq(pdev, i,
+ &parent_hwirq, &hartid);
+ if (rc) {
+ dev_warn(dev, "hwirq for context%d not found\n", i);
continue;
}

@@ -464,7 +533,7 @@ static int __init __plic_init(struct device_node *node,
* Skip contexts other than external interrupts for our
* privilege level.
*/
- if (parent.args[0] != RV_IRQ_EXT) {
+ if (parent_hwirq != RV_IRQ_EXT) {
/* Disable S-mode enable bits if running in M-mode. */
if (IS_ENABLED(CONFIG_RISCV_M_MODE)) {
void __iomem *enable_base = priv->regs +
@@ -477,21 +546,17 @@ static int __init __plic_init(struct device_node *node,
continue;
}

- error = riscv_of_parent_hartid(parent.np, &hartid);
- if (error < 0) {
- pr_warn("failed to parse hart ID for context %d.\n", i);
- continue;
- }
-
cpu = riscv_hartid_to_cpuid(hartid);
if (cpu < 0) {
- pr_warn("Invalid cpuid for context %d\n", i);
+ dev_warn(dev, "Invalid cpuid for context %d\n", i);
continue;
}

/* Find parent domain and register chained handler */
- if (!plic_parent_irq && irq_find_host(parent.np)) {
- plic_parent_irq = irq_of_parse_and_map(node, i);
+ domain = irq_find_matching_fwnode(riscv_get_intc_hwnode(),
+ DOMAIN_BUS_ANY);
+ if (!plic_parent_irq && domain) {
+ plic_parent_irq = irq_create_mapping(domain, RV_IRQ_EXT);
if (plic_parent_irq)
irq_set_chained_handler(plic_parent_irq,
plic_handle_irq);
@@ -504,7 +569,7 @@ static int __init __plic_init(struct device_node *node,
*/
handler = per_cpu_ptr(&plic_handlers, cpu);
if (handler->present) {
- pr_warn("handler already present for context %d.\n", i);
+ dev_warn(dev, "handler already present for context%d.\n", i);
plic_set_threshold(handler, PLIC_DISABLE_THRESHOLD);
goto done;
}
@@ -518,10 +583,13 @@ static int __init __plic_init(struct device_node *node,
i * CONTEXT_ENABLE_SIZE;
handler->priv = priv;

- handler->enable_save = kcalloc(DIV_ROUND_UP(nr_irqs, 32),
- sizeof(*handler->enable_save), GFP_KERNEL);
+ handler->enable_save = devm_kcalloc(dev,
+ DIV_ROUND_UP(nr_irqs, 32),
+ sizeof(*handler->enable_save),
+ GFP_KERNEL);
if (!handler->enable_save)
- goto out_free_enable_reg;
+ return -ENOMEM;
+
done:
for (hwirq = 1; hwirq <= nr_irqs; hwirq++) {
plic_toggle(handler, hwirq, 0);
@@ -531,52 +599,41 @@ static int __init __plic_init(struct device_node *node,
nr_handlers++;
}

+ priv->irqdomain = irq_domain_create_linear(dev->fwnode, nr_irqs + 1,
+ &plic_irqdomain_ops, priv);
+ if (WARN_ON(!priv->irqdomain))
+ return -ENOMEM;
+
/*
* We can have multiple PLIC instances so setup cpuhp state
- * and register syscore operations only when context handler
- * for current/boot CPU is present.
+ * and register syscore operations only after context handlers
+ * of all online CPUs are initialized.
*/
- handler = this_cpu_ptr(&plic_handlers);
- if (handler->present && !plic_cpuhp_setup_done) {
+ cpuhp_setup = true;
+ for_each_online_cpu(cpu) {
+ handler = per_cpu_ptr(&plic_handlers, cpu);
+ if (!handler->present) {
+ cpuhp_setup = false;
+ break;
+ }
+ }
+ if (cpuhp_setup) {
cpuhp_setup_state(CPUHP_AP_IRQ_SIFIVE_PLIC_STARTING,
"irqchip/sifive/plic:starting",
plic_starting_cpu, plic_dying_cpu);
register_syscore_ops(&plic_irq_syscore_ops);
- plic_cpuhp_setup_done = true;
}

- pr_info("%pOFP: mapped %d interrupts with %d handlers for"
- " %d contexts.\n", node, nr_irqs, nr_handlers, nr_contexts);
+ dev_info(dev, "mapped %d interrupts with %d handlers for"
+ " %d contexts.\n", nr_irqs, nr_handlers, nr_contexts);
return 0;
-
-out_free_enable_reg:
- for_each_cpu(cpu, cpu_present_mask) {
- handler = per_cpu_ptr(&plic_handlers, cpu);
- kfree(handler->enable_save);
- }
-out_free_priority_reg:
- kfree(priv->prio_save);
-out_iounmap:
- iounmap(priv->regs);
-out_free_priv:
- kfree(priv);
- return error;
}

-static int __init plic_init(struct device_node *node,
- struct device_node *parent)
-{
- return __plic_init(node, parent, 0);
-}
-
-IRQCHIP_DECLARE(sifive_plic, "sifive,plic-1.0.0", plic_init);
-IRQCHIP_DECLARE(riscv_plic0, "riscv,plic0", plic_init); /* for legacy systems */
-
-static int __init plic_edge_init(struct device_node *node,
- struct device_node *parent)
-{
- return __plic_init(node, parent, BIT(PLIC_QUIRK_EDGE_INTERRUPT));
-}
-
-IRQCHIP_DECLARE(andestech_nceplic100, "andestech,nceplic100", plic_edge_init);
-IRQCHIP_DECLARE(thead_c900_plic, "thead,c900-plic", plic_edge_init);
+static struct platform_driver plic_driver = {
+ .driver = {
+ .name = "riscv-plic",
+ .of_match_table = plic_match,
+ },
+ .probe = plic_probe,
+};
+builtin_platform_driver(plic_driver);
--
2.34.1

2023-10-03 04:46:12

by Anup Patel

Subject: [PATCH v10 10/15] irqchip/riscv-imsic: Add support for PCI MSI irqdomain

The Linux PCI framework requires its own dedicated MSI irqdomain, so
let us create a PCI MSI irqdomain as a child of the IMSIC base irqdomain.
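
With this domain in place, PCI drivers need nothing IMSIC-specific; a
hedged usage sketch (mydev_handler and the vector count of 4 are
hypothetical):

/* Sketch: a PCI driver allocating MSI-X/MSI vectors backed by IMSIC */
static int mydev_setup_irq(struct pci_dev *pdev, void *drvdata)
{
	int nvec;

	nvec = pci_alloc_irq_vectors(pdev, 1, 4,
				     PCI_IRQ_MSIX | PCI_IRQ_MSI);
	if (nvec < 0)
		return nvec;

	/* Each vector is backed by an IMSIC interrupt identity */
	return request_irq(pci_irq_vector(pdev, 0), mydev_handler, 0,
			   "mydev", drvdata);
}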

Signed-off-by: Anup Patel <[email protected]>
---
drivers/irqchip/Kconfig | 7 ++++
drivers/irqchip/irq-riscv-imsic-platform.c | 48 ++++++++++++++++++++++
drivers/irqchip/irq-riscv-imsic-state.h | 1 +
3 files changed, 56 insertions(+)

diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
index bdd80716114d..c1d69b418dfb 100644
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -552,6 +552,13 @@ config RISCV_IMSIC
select IRQ_DOMAIN_HIERARCHY
select GENERIC_MSI_IRQ

+config RISCV_IMSIC_PCI
+ bool
+ depends on RISCV_IMSIC
+ depends on PCI
+ depends on PCI_MSI
+ default RISCV_IMSIC
+
config EXYNOS_IRQ_COMBINER
bool "Samsung Exynos IRQ combiner support" if COMPILE_TEST
depends on (ARCH_EXYNOS && ARM) || COMPILE_TEST
diff --git a/drivers/irqchip/irq-riscv-imsic-platform.c b/drivers/irqchip/irq-riscv-imsic-platform.c
index 809e338bf9c3..ff46b85ca45a 100644
--- a/drivers/irqchip/irq-riscv-imsic-platform.c
+++ b/drivers/irqchip/irq-riscv-imsic-platform.c
@@ -12,6 +12,7 @@
#include <linux/irqdomain.h>
#include <linux/module.h>
#include <linux/msi.h>
+#include <linux/pci.h>
#include <linux/platform_device.h>
#include <linux/spinlock.h>
#include <linux/smp.h>
@@ -184,6 +185,39 @@ static const struct irq_domain_ops imsic_base_domain_ops = {
.free = imsic_irq_domain_free,
};

+#ifdef CONFIG_RISCV_IMSIC_PCI
+
+static void imsic_pci_mask_irq(struct irq_data *d)
+{
+ pci_msi_mask_irq(d);
+ irq_chip_mask_parent(d);
+}
+
+static void imsic_pci_unmask_irq(struct irq_data *d)
+{
+ pci_msi_unmask_irq(d);
+ irq_chip_unmask_parent(d);
+}
+
+static struct irq_chip imsic_pci_irq_chip = {
+ .name = "IMSIC-PCI",
+ .irq_mask = imsic_pci_mask_irq,
+ .irq_unmask = imsic_pci_unmask_irq,
+ .irq_eoi = irq_chip_eoi_parent,
+};
+
+static struct msi_domain_ops imsic_pci_domain_ops = {
+};
+
+static struct msi_domain_info imsic_pci_domain_info = {
+ .flags = (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
+ MSI_FLAG_PCI_MSIX | MSI_FLAG_MULTI_PCI_MSI),
+ .ops = &imsic_pci_domain_ops,
+ .chip = &imsic_pci_irq_chip,
+};
+
+#endif
+
static struct irq_chip imsic_plat_irq_chip = {
.name = "IMSIC-PLAT",
};
@@ -208,12 +242,26 @@ static int imsic_irq_domains_init(struct device *dev)
}
irq_domain_update_bus_token(imsic->base_domain, DOMAIN_BUS_NEXUS);

+#ifdef CONFIG_RISCV_IMSIC_PCI
+ /* Create PCI MSI domain */
+ imsic->pci_domain = pci_msi_create_irq_domain(dev->fwnode,
+ &imsic_pci_domain_info,
+ imsic->base_domain);
+ if (!imsic->pci_domain) {
+ dev_err(dev, "failed to create IMSIC PCI domain\n");
+ irq_domain_remove(imsic->base_domain);
+ return -ENOMEM;
+ }
+#endif
+
/* Create Platform MSI domain */
imsic->plat_domain = platform_msi_create_irq_domain(dev->fwnode,
&imsic_plat_domain_info,
imsic->base_domain);
if (!imsic->plat_domain) {
dev_err(dev, "failed to create IMSIC platform domain\n");
+ if (imsic->pci_domain)
+ irq_domain_remove(imsic->pci_domain);
irq_domain_remove(imsic->base_domain);
return -ENOMEM;
}
diff --git a/drivers/irqchip/irq-riscv-imsic-state.h b/drivers/irqchip/irq-riscv-imsic-state.h
index 3170018949a8..ff3c377b9b33 100644
--- a/drivers/irqchip/irq-riscv-imsic-state.h
+++ b/drivers/irqchip/irq-riscv-imsic-state.h
@@ -31,6 +31,7 @@ struct imsic_priv {

/* IRQ domains (created by platform driver) */
struct irq_domain *base_domain;
+ struct irq_domain *pci_domain;
struct irq_domain *plat_domain;
};

--
2.34.1

2023-10-03 04:46:17

by Anup Patel

Subject: [PATCH v10 11/15] dt-bindings: interrupt-controller: Add RISC-V advanced PLIC

We add a DT bindings document for the RISC-V advanced platform-level
interrupt controller (APLIC) defined by the RISC-V advanced interrupt
architecture (AIA) specification.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Conor Dooley <[email protected]>
---
.../interrupt-controller/riscv,aplic.yaml | 172 ++++++++++++++++++
1 file changed, 172 insertions(+)
create mode 100644 Documentation/devicetree/bindings/interrupt-controller/riscv,aplic.yaml

diff --git a/Documentation/devicetree/bindings/interrupt-controller/riscv,aplic.yaml b/Documentation/devicetree/bindings/interrupt-controller/riscv,aplic.yaml
new file mode 100644
index 000000000000..190a6499c932
--- /dev/null
+++ b/Documentation/devicetree/bindings/interrupt-controller/riscv,aplic.yaml
@@ -0,0 +1,172 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/interrupt-controller/riscv,aplic.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: RISC-V Advanced Platform Level Interrupt Controller (APLIC)
+
+maintainers:
+ - Anup Patel <[email protected]>
+
+description:
+ The RISC-V advanced interrupt architecture (AIA) defines an advanced
+ platform level interrupt controller (APLIC) for handling wired interrupts
+ in a RISC-V platform. The RISC-V AIA specification can be found at
+ https://github.com/riscv/riscv-aia.
+
+ The RISC-V APLIC is implemented as hierarchical APLIC domains where all
+ interrupt sources connect to the root APLIC domain and a parent APLIC
+ domain can delegate interrupt sources to its child APLIC domains. There
+ is one device tree node for each APLIC domain.
+
+allOf:
+ - $ref: /schemas/interrupt-controller.yaml#
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - qemu,aplic
+ - const: riscv,aplic
+
+ reg:
+ maxItems: 1
+
+ interrupt-controller: true
+
+ "#interrupt-cells":
+ const: 2
+
+ interrupts-extended:
+ minItems: 1
+ maxItems: 16384
+ description:
+ The given APLIC domain directly injects external interrupts into a set
+ of RISC-V HARTs (or CPUs). Each node pointed to should be a riscv,cpu-intc
+ node, which has a CPU node (i.e. RISC-V HART) as parent.
+
+ msi-parent:
+ description:
+ The given APLIC domain forwards wired interrupts as MSIs to an AIA incoming
+ message signaled interrupt controller (IMSIC). If both "msi-parent" and
+ "interrupts-extended" properties are present then it means the APLIC
+ domain supports both MSI mode and Direct mode in HW. In this case, the
+ APLIC driver has to choose between MSI mode or Direct mode.
+
+ riscv,num-sources:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ minimum: 1
+ maximum: 1023
+ description:
+ Specifies the number of wired interrupt sources supported by this
+ APLIC domain.
+
+ riscv,children:
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ minItems: 1
+ maxItems: 1024
+ items:
+ maxItems: 1
+ description:
+ A list of child APLIC domains for the given APLIC domain. Each child
+ APLIC domain is assigned a child index in increasing order, with the
+ first child APLIC domain assigned child index 0. The APLIC domain child
+ index is used by firmware to delegate interrupts from the given APLIC
+ domain to a particular child APLIC domain.
+
+ riscv,delegation:
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ minItems: 1
+ maxItems: 1024
+ items:
+ items:
+ - description: child APLIC domain phandle
+ - description: first interrupt number of the parent APLIC domain (inclusive)
+ - description: last interrupt number of the parent APLIC domain (inclusive)
+ description:
+ An interrupt delegation list where each entry is a triple consisting
+ of a child APLIC domain phandle, the first interrupt number of the
+ parent APLIC domain, and the last interrupt number of the parent APLIC
+ domain. Firmware must configure the interrupt delegation registers
+ based on this interrupt delegation list.
+
+dependencies:
+ riscv,delegation: [ "riscv,children" ]
+
+required:
+ - compatible
+ - reg
+ - interrupt-controller
+ - "#interrupt-cells"
+ - riscv,num-sources
+
+anyOf:
+ - required:
+ - interrupts-extended
+ - required:
+ - msi-parent
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ // Example 1 (APLIC domains directly injecting interrupt to HARTs):
+
+ interrupt-controller@c000000 {
+ compatible = "qemu,aplic", "riscv,aplic";
+ interrupts-extended = <&cpu1_intc 11>,
+ <&cpu2_intc 11>,
+ <&cpu3_intc 11>,
+ <&cpu4_intc 11>;
+ reg = <0xc000000 0x4080>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ riscv,num-sources = <63>;
+ riscv,children = <&aplic1>, <&aplic2>;
+ riscv,delegation = <&aplic1 1 63>;
+ };
+
+ aplic1: interrupt-controller@d000000 {
+ compatible = "qemu,aplic", "riscv,aplic";
+ interrupts-extended = <&cpu1_intc 9>,
+ <&cpu2_intc 9>;
+ reg = <0xd000000 0x4080>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ riscv,num-sources = <63>;
+ };
+
+ aplic2: interrupt-controller@e000000 {
+ compatible = "qemu,aplic", "riscv,aplic";
+ interrupts-extended = <&cpu3_intc 9>,
+ <&cpu4_intc 9>;
+ reg = <0xe000000 0x4080>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ riscv,num-sources = <63>;
+ };
+
+ - |
+ // Example 2 (APLIC domains forwarding interrupts as MSIs):
+
+ interrupt-controller@c000000 {
+ compatible = "qemu,aplic", "riscv,aplic";
+ msi-parent = <&imsic_mlevel>;
+ reg = <0xc000000 0x4000>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ riscv,num-sources = <63>;
+ riscv,children = <&aplic3>;
+ riscv,delegation = <&aplic3 1 63>;
+ };
+
+ aplic3: interrupt-controller@d000000 {
+ compatible = "qemu,aplic", "riscv,aplic";
+ msi-parent = <&imsic_slevel>;
+ reg = <0xd000000 0x4000>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ riscv,num-sources = <63>;
+ };
+...
--
2.34.1

2023-10-03 04:46:23

by Anup Patel

Subject: [PATCH v10 13/15] irqchip/riscv-aplic: Add support for MSI-mode

The RISC-V advanced platform-level interrupt controller (APLIC) has
two modes of operation: 1) Direct mode and 2) MSI mode.
(For more details, refer to https://github.com/riscv/riscv-aia)

In APLIC MSI-mode, wired interrupts are forwarded as message signaled
interrupts (MSIs) to CPUs via the IMSIC.

We extend the existing APLIC irqchip driver to support MSI-mode for
RISC-V platforms having both wired interrupts and MSIs.
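
Conceptually, each forwarded interrupt is just a 32-bit write of the
programmed external interrupt identity (EIID) into the target hart's
IMSIC interrupt file. A hedged sketch of the equivalent software
operation: imsic_files_base is a hypothetical base address,
IMSIC_MMIO_PAGE_SZ is the per-file 4KB page size from the IMSIC header,
and SETEIPNUM_LE sits at offset 0x0 of an interrupt file per the AIA
specification:

/*
 * Sketch: what APLIC MSI-mode delivery amounts to. When a wired
 * source fires, the APLIC performs the equivalent of writing the
 * EIID to the SETEIPNUM_LE register of the target interrupt file.
 */
static void aplic_deliver_as_msi(void __iomem *imsic_files_base,
				 u32 hart_index, u32 eiid)
{
	void __iomem *file = imsic_files_base +
			     hart_index * IMSIC_MMIO_PAGE_SZ;

	writel(eiid, file);	/* offset 0x0: SETEIPNUM_LE */
}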

Signed-off-by: Anup Patel <[email protected]>
---
drivers/irqchip/Kconfig | 6 +
drivers/irqchip/Makefile | 1 +
drivers/irqchip/irq-riscv-aplic-main.c | 2 +-
drivers/irqchip/irq-riscv-aplic-main.h | 8 +
drivers/irqchip/irq-riscv-aplic-msi.c | 285 +++++++++++++++++++++++++
5 files changed, 301 insertions(+), 1 deletion(-)
create mode 100644 drivers/irqchip/irq-riscv-aplic-msi.c

diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
index 1996cc6f666a..7adc4dbe07ff 100644
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -551,6 +551,12 @@ config RISCV_APLIC
depends on RISCV
select IRQ_DOMAIN_HIERARCHY

+config RISCV_APLIC_MSI
+ bool
+ depends on RISCV_APLIC
+ select GENERIC_MSI_IRQ
+ default RISCV_APLIC
+
config RISCV_IMSIC
bool
depends on RISCV
diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile
index 7f8289790ed8..47995fdb2c60 100644
--- a/drivers/irqchip/Makefile
+++ b/drivers/irqchip/Makefile
@@ -96,6 +96,7 @@ obj-$(CONFIG_CSKY_MPINTC) += irq-csky-mpintc.o
obj-$(CONFIG_CSKY_APB_INTC) += irq-csky-apb-intc.o
obj-$(CONFIG_RISCV_INTC) += irq-riscv-intc.o
obj-$(CONFIG_RISCV_APLIC) += irq-riscv-aplic-main.o irq-riscv-aplic-direct.o
+obj-$(CONFIG_RISCV_APLIC_MSI) += irq-riscv-aplic-msi.o
obj-$(CONFIG_RISCV_IMSIC) += irq-riscv-imsic-state.o irq-riscv-imsic-early.o irq-riscv-imsic-platform.o
obj-$(CONFIG_SIFIVE_PLIC) += irq-sifive-plic.o
obj-$(CONFIG_IMX_IRQSTEER) += irq-imx-irqsteer.o
diff --git a/drivers/irqchip/irq-riscv-aplic-main.c b/drivers/irqchip/irq-riscv-aplic-main.c
index 87450708a733..d1b342b66551 100644
--- a/drivers/irqchip/irq-riscv-aplic-main.c
+++ b/drivers/irqchip/irq-riscv-aplic-main.c
@@ -205,7 +205,7 @@ static int aplic_probe(struct platform_device *pdev)
msi_mode = of_property_present(to_of_node(dev->fwnode),
"msi-parent");
if (msi_mode)
- rc = -ENODEV;
+ rc = aplic_msi_setup(dev, regs);
else
rc = aplic_direct_setup(dev, regs);
if (rc) {
diff --git a/drivers/irqchip/irq-riscv-aplic-main.h b/drivers/irqchip/irq-riscv-aplic-main.h
index 474a04229334..78267ec58098 100644
--- a/drivers/irqchip/irq-riscv-aplic-main.h
+++ b/drivers/irqchip/irq-riscv-aplic-main.h
@@ -41,5 +41,13 @@ void aplic_init_hw_global(struct aplic_priv *priv, bool msi_mode);
int aplic_setup_priv(struct aplic_priv *priv, struct device *dev,
void __iomem *regs);
int aplic_direct_setup(struct device *dev, void __iomem *regs);
+#ifdef CONFIG_RISCV_APLIC_MSI
+int aplic_msi_setup(struct device *dev, void __iomem *regs);
+#else
+static inline int aplic_msi_setup(struct device *dev, void __iomem *regs)
+{
+ return -ENODEV;
+}
+#endif

#endif
diff --git a/drivers/irqchip/irq-riscv-aplic-msi.c b/drivers/irqchip/irq-riscv-aplic-msi.c
new file mode 100644
index 000000000000..086d00e0429e
--- /dev/null
+++ b/drivers/irqchip/irq-riscv-aplic-msi.c
@@ -0,0 +1,285 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+
+#include <linux/bitops.h>
+#include <linux/cpu.h>
+#include <linux/interrupt.h>
+#include <linux/irqchip.h>
+#include <linux/irqchip/riscv-aplic.h>
+#include <linux/irqchip/riscv-imsic.h>
+#include <linux/module.h>
+#include <linux/msi.h>
+#include <linux/of_irq.h>
+#include <linux/platform_device.h>
+#include <linux/printk.h>
+#include <linux/smp.h>
+
+#include "irq-riscv-aplic-main.h"
+
+static void aplic_msi_irq_unmask(struct irq_data *d)
+{
+ aplic_irq_unmask(d);
+ irq_chip_unmask_parent(d);
+}
+
+static void aplic_msi_irq_mask(struct irq_data *d)
+{
+ aplic_irq_mask(d);
+ irq_chip_mask_parent(d);
+}
+
+static void aplic_msi_irq_eoi(struct irq_data *d)
+{
+ struct aplic_priv *priv = irq_data_get_irq_chip_data(d);
+ u32 reg_off, reg_mask;
+
+ /*
+ * EOI handling is only required for level-triggered
+ * interrupts in APLIC MSI mode.
+ */
+
+ reg_off = APLIC_CLRIP_BASE + ((d->hwirq / APLIC_IRQBITS_PER_REG) * 4);
+ reg_mask = BIT(d->hwirq % APLIC_IRQBITS_PER_REG);
+ switch (irqd_get_trigger_type(d)) {
+ case IRQ_TYPE_LEVEL_LOW:
+ if (!(readl(priv->regs + reg_off) & reg_mask))
+ writel(d->hwirq, priv->regs + APLIC_SETIPNUM_LE);
+ break;
+ case IRQ_TYPE_LEVEL_HIGH:
+ if (readl(priv->regs + reg_off) & reg_mask)
+ writel(d->hwirq, priv->regs + APLIC_SETIPNUM_LE);
+ break;
+ }
+}
+
+static struct irq_chip aplic_msi_chip = {
+ .name = "APLIC-MSI",
+ .irq_mask = aplic_msi_irq_mask,
+ .irq_unmask = aplic_msi_irq_unmask,
+ .irq_set_type = aplic_irq_set_type,
+ .irq_eoi = aplic_msi_irq_eoi,
+#ifdef CONFIG_SMP
+ .irq_set_affinity = irq_chip_set_affinity_parent,
+#endif
+ .flags = IRQCHIP_SET_TYPE_MASKED |
+ IRQCHIP_SKIP_SET_WAKE |
+ IRQCHIP_MASK_ON_SUSPEND,
+};
+
+static int aplic_msi_irqdomain_translate(struct irq_domain *d,
+ struct irq_fwspec *fwspec,
+ unsigned long *hwirq,
+ unsigned int *type)
+{
+ struct aplic_priv *priv = platform_msi_get_host_data(d);
+
+ return aplic_irqdomain_translate(fwspec, priv->gsi_base, hwirq, type);
+}
+
+static int aplic_msi_irqdomain_alloc(struct irq_domain *domain,
+ unsigned int virq, unsigned int nr_irqs,
+ void *arg)
+{
+ int i, ret;
+ unsigned int type;
+ irq_hw_number_t hwirq;
+ struct irq_fwspec *fwspec = arg;
+ struct aplic_priv *priv = platform_msi_get_host_data(domain);
+
+ ret = aplic_irqdomain_translate(fwspec, priv->gsi_base, &hwirq, &type);
+ if (ret)
+ return ret;
+
+ ret = platform_msi_device_domain_alloc(domain, virq, nr_irqs);
+ if (ret)
+ return ret;
+
+ for (i = 0; i < nr_irqs; i++) {
+ irq_domain_set_info(domain, virq + i, hwirq + i,
+ &aplic_msi_chip, priv, handle_fasteoi_irq,
+ NULL, NULL);
+ /*
+ * APLIC does not implement irq_disable() so Linux interrupt
+ * subsystem will take a lazy approach for disabling an APLIC
+ * interrupt. This means APLIC interrupts are left unmasked
+ * upon system suspend and interrupts are not processed
+ * immediately upon system wake up. To tackle this, we disable
+ * the lazy approach for all APLIC interrupts.
+ */
+ irq_set_status_flags(virq + i, IRQ_DISABLE_UNLAZY);
+ }
+
+ return 0;
+}
+
+static const struct irq_domain_ops aplic_msi_irqdomain_ops = {
+ .translate = aplic_msi_irqdomain_translate,
+ .alloc = aplic_msi_irqdomain_alloc,
+ .free = platform_msi_device_domain_free,
+};
+
+static void aplic_msi_write_msg(struct msi_desc *desc, struct msi_msg *msg)
+{
+ unsigned int group_index, hart_index, guest_index, val;
+ struct irq_data *d = irq_get_irq_data(desc->irq);
+ struct aplic_priv *priv = irq_data_get_irq_chip_data(d);
+ struct aplic_msicfg *mc = &priv->msicfg;
+ phys_addr_t tppn, tbppn, msg_addr;
+ void __iomem *target;
+
+ /* For zeroed MSI, simply write zero into the target register */
+ if (!msg->address_hi && !msg->address_lo && !msg->data) {
+ target = priv->regs + APLIC_TARGET_BASE;
+ target += (d->hwirq - 1) * sizeof(u32);
+ writel(0, target);
+ return;
+ }
+
+ /* Sanity check on message data */
+ WARN_ON(msg->data > APLIC_TARGET_EIID_MASK);
+
+ /* Compute target MSI address */
+ msg_addr = (((u64)msg->address_hi) << 32) | msg->address_lo;
+ tppn = msg_addr >> APLIC_xMSICFGADDR_PPN_SHIFT;
+
+ /* Compute target HART Base PPN */
+ tbppn = tppn;
+ tbppn &= ~APLIC_xMSICFGADDR_PPN_HART(mc->lhxs);
+ tbppn &= ~APLIC_xMSICFGADDR_PPN_LHX(mc->lhxw, mc->lhxs);
+ tbppn &= ~APLIC_xMSICFGADDR_PPN_HHX(mc->hhxw, mc->hhxs);
+ WARN_ON(tbppn != mc->base_ppn);
+
+ /* Compute target group and hart indexes */
+ group_index = (tppn >> APLIC_xMSICFGADDR_PPN_HHX_SHIFT(mc->hhxs)) &
+ APLIC_xMSICFGADDR_PPN_HHX_MASK(mc->hhxw);
+ hart_index = (tppn >> APLIC_xMSICFGADDR_PPN_LHX_SHIFT(mc->lhxs)) &
+ APLIC_xMSICFGADDR_PPN_LHX_MASK(mc->lhxw);
+ hart_index |= (group_index << mc->lhxw);
+ WARN_ON(hart_index > APLIC_TARGET_HART_IDX_MASK);
+
+ /* Compute target guest index */
+ guest_index = tppn & APLIC_xMSICFGADDR_PPN_HART(mc->lhxs);
+ WARN_ON(guest_index > APLIC_TARGET_GUEST_IDX_MASK);
+
+ /* Update IRQ TARGET register */
+ target = priv->regs + APLIC_TARGET_BASE;
+ target += (d->hwirq - 1) * sizeof(u32);
+ val = (hart_index & APLIC_TARGET_HART_IDX_MASK)
+ << APLIC_TARGET_HART_IDX_SHIFT;
+ val |= (guest_index & APLIC_TARGET_GUEST_IDX_MASK)
+ << APLIC_TARGET_GUEST_IDX_SHIFT;
+ val |= (msg->data & APLIC_TARGET_EIID_MASK);
+ writel(val, target);
+}
+
+int aplic_msi_setup(struct device *dev, void __iomem *regs)
+{
+ const struct imsic_global_config *imsic_global;
+ struct irq_domain *irqdomain;
+ struct aplic_priv *priv;
+ struct aplic_msicfg *mc;
+ phys_addr_t pa;
+ int rc;
+
+ priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
+ if (!priv)
+ return -ENOMEM;
+
+ rc = aplic_setup_priv(priv, dev, regs);
+ if (rc) {
+ dev_err(dev, "failed to create APLIC context\n");
+ return rc;
+ }
+ mc = &priv->msicfg;
+
+ /*
+ * The APLIC outgoing MSI config registers assume target MSI
+ * controller to be RISC-V AIA IMSIC controller.
+ */
+ imsic_global = imsic_get_global_config();
+ if (!imsic_global) {
+ dev_err(dev, "IMSIC global config not found\n");
+ return -ENODEV;
+ }
+
+ /* Find number of guest index bits (LHXS) */
+ mc->lhxs = imsic_global->guest_index_bits;
+ if (APLIC_xMSICFGADDRH_LHXS_MASK < mc->lhxs) {
+ dev_err(dev, "IMSIC guest index bits too big for APLIC LHXS\n");
+ return -EINVAL;
+ }
+
+ /* Find number of HART index bits (LHXW) */
+ mc->lhxw = imsic_global->hart_index_bits;
+ if (APLIC_xMSICFGADDRH_LHXW_MASK < mc->lhxw) {
+ dev_err(dev, "IMSIC hart index bits too big for APLIC LHXW\n");
+ return -EINVAL;
+ }
+
+ /* Find number of group index bits (HHXW) */
+ mc->hhxw = imsic_global->group_index_bits;
+ if (APLIC_xMSICFGADDRH_HHXW_MASK < mc->hhxw) {
+ dev_err(dev, "IMSIC group index bits too big for APLIC HHXW\n");
+ return -EINVAL;
+ }
+
+ /* Find first bit position of group index (HHXS) */
+ mc->hhxs = imsic_global->group_index_shift;
+ if (mc->hhxs < (2 * APLIC_xMSICFGADDR_PPN_SHIFT)) {
+ dev_err(dev, "IMSIC group index shift should be >= %d\n",
+ (2 * APLIC_xMSICFGADDR_PPN_SHIFT));
+ return -EINVAL;
+ }
+ mc->hhxs -= (2 * APLIC_xMSICFGADDR_PPN_SHIFT);
+ if (APLIC_xMSICFGADDRH_HHXS_MASK < mc->hhxs) {
+ dev_err(dev, "IMSIC group index shift too big for APLIC HHXS\n");
+ return -EINVAL;
+ }
+
+ /* Compute PPN base */
+ mc->base_ppn = imsic_global->base_addr >> APLIC_xMSICFGADDR_PPN_SHIFT;
+ mc->base_ppn &= ~APLIC_xMSICFGADDR_PPN_HART(mc->lhxs);
+ mc->base_ppn &= ~APLIC_xMSICFGADDR_PPN_LHX(mc->lhxw, mc->lhxs);
+ mc->base_ppn &= ~APLIC_xMSICFGADDR_PPN_HHX(mc->hhxw, mc->hhxs);
+
+ /* Setup global config and interrupt delivery */
+ aplic_init_hw_global(priv, true);
+
+ /* Set the APLIC device MSI domain if not available */
+ if (!dev_get_msi_domain(dev)) {
+ /*
+ * The device MSI domain for OF devices is only set at the
+ * time of populating/creating OF device. If the device MSI
+ * domain is discovered later after the OF device is created
+ * then we need to set it explicitly before using any platform
+ * MSI functions.
+ *
+ * In case of APLIC device, the parent MSI domain is always
+ * IMSIC and the IMSIC MSI domains are created later through
+ * the platform driver probing so we set it explicitly here.
+ */
+ if (is_of_node(dev->fwnode))
+ of_msi_configure(dev, to_of_node(dev->fwnode));
+ }
+
+ /* Create irq domain instance for the APLIC MSI-mode */
+ irqdomain = platform_msi_create_device_domain(
+ dev, priv->nr_irqs + 1,
+ aplic_msi_write_msg,
+ &aplic_msi_irqdomain_ops,
+ priv);
+ if (!irqdomain) {
+ dev_err(dev, "failed to create MSI irq domain\n");
+ return -ENOMEM;
+ }
+
+ /* Advertise the interrupt controller */
+ pa = priv->msicfg.base_ppn << APLIC_xMSICFGADDR_PPN_SHIFT;
+ dev_info(dev, "%d interrupts forwarded to MSI base %pa\n",
+ priv->nr_irqs, &pa);
+
+ return 0;
+}
--
2.34.1

2023-10-03 04:46:37

by Anup Patel

Subject: [PATCH v10 15/15] MAINTAINERS: Add entry for RISC-V AIA drivers

Add myself as the maintainer for the RISC-V AIA drivers, including the
RISC-V INTC driver, which supports both AIA and non-AIA platforms.

Signed-off-by: Anup Patel <[email protected]>
---
MAINTAINERS | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index c335ff64ecf1..89f0e7905b59 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18412,6 +18412,20 @@ S: Maintained
F: drivers/mtd/nand/raw/r852.c
F: drivers/mtd/nand/raw/r852.h

+RISC-V AIA DRIVERS
+M: Anup Patel <[email protected]>
+L: [email protected]
+S: Maintained
+F: Documentation/devicetree/bindings/interrupt-controller/riscv,aplic.yaml
+F: Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml
+F: drivers/irqchip/irq-riscv-aplic-*.c
+F: drivers/irqchip/irq-riscv-aplic-*.h
+F: drivers/irqchip/irq-riscv-imsic-*.c
+F: drivers/irqchip/irq-riscv-imsic-*.h
+F: drivers/irqchip/irq-riscv-intc.c
+F: include/linux/irqchip/riscv-aplic.h
+F: include/linux/irqchip/riscv-imsic.h
+
RISC-V ARCHITECTURE
M: Paul Walmsley <[email protected]>
M: Palmer Dabbelt <[email protected]>
--
2.34.1

2023-10-03 04:46:45

by Anup Patel

Subject: [PATCH v10 14/15] RISC-V: Select APLIC and IMSIC drivers

The QEMU virt machine supports AIA emulation, and we also have
quite a few RISC-V platforms with AIA support under development,
so let us select the APLIC and IMSIC drivers for all RISC-V platforms.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Conor Dooley <[email protected]>
---
arch/riscv/Kconfig | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index d607ab0f7c6d..45c660f1219d 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -153,6 +153,8 @@ config RISCV
select PCI_DOMAINS_GENERIC if PCI
select PCI_MSI if PCI
select RISCV_ALTERNATIVE if !XIP_KERNEL
+ select RISCV_APLIC
+ select RISCV_IMSIC
select RISCV_INTC
select RISCV_TIMER if RISCV_SBI
select SIFIVE_PLIC
--
2.34.1

2023-10-03 04:46:54

by Anup Patel

Subject: [PATCH v10 09/15] irqchip/riscv-imsic: Add support for platform MSI irqdomain

The Linux platform MSI support requires a platform MSI irqdomain, so
let us add a platform irqchip driver for the RISC-V IMSIC which provides
a base IRQ domain and a platform MSI domain. This driver assumes that
the IMSIC state has already been initialized by the IMSIC early driver.
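
Once the platform MSI domain exists, a platform device driver can
allocate MSIs through the generic platform-MSI API; a hedged usage
sketch (mydev_write_msg, mydev_handler, and the vector count of 4 are
hypothetical):

/* Sketch: device-specific callback that programs the MSI doorbell */
static void mydev_write_msg(struct msi_desc *desc, struct msi_msg *msg)
{
	/* program msg->address_hi/lo and msg->data into the device */
}

static int mydev_alloc_msis(struct device *dev)
{
	int rc;

	rc = platform_msi_domain_alloc_irqs(dev, 4, mydev_write_msg);
	if (rc)
		return rc;

	/* Linux irq number for MSI index 0 */
	return request_irq(msi_get_virq(dev, 0), mydev_handler, 0,
			   "mydev", NULL);
}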

Signed-off-by: Anup Patel <[email protected]>
---
drivers/irqchip/Makefile | 2 +-
drivers/irqchip/irq-riscv-imsic-platform.c | 271 +++++++++++++++++++++
2 files changed, 272 insertions(+), 1 deletion(-)
create mode 100644 drivers/irqchip/irq-riscv-imsic-platform.c

diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile
index d714724387ce..abca445a3229 100644
--- a/drivers/irqchip/Makefile
+++ b/drivers/irqchip/Makefile
@@ -95,7 +95,7 @@ obj-$(CONFIG_QCOM_MPM) += irq-qcom-mpm.o
obj-$(CONFIG_CSKY_MPINTC) += irq-csky-mpintc.o
obj-$(CONFIG_CSKY_APB_INTC) += irq-csky-apb-intc.o
obj-$(CONFIG_RISCV_INTC) += irq-riscv-intc.o
-obj-$(CONFIG_RISCV_IMSIC) += irq-riscv-imsic-state.o irq-riscv-imsic-early.o
+obj-$(CONFIG_RISCV_IMSIC) += irq-riscv-imsic-state.o irq-riscv-imsic-early.o irq-riscv-imsic-platform.o
obj-$(CONFIG_SIFIVE_PLIC) += irq-sifive-plic.o
obj-$(CONFIG_IMX_IRQSTEER) += irq-imx-irqsteer.o
obj-$(CONFIG_IMX_INTMUX) += irq-imx-intmux.o
diff --git a/drivers/irqchip/irq-riscv-imsic-platform.c b/drivers/irqchip/irq-riscv-imsic-platform.c
new file mode 100644
index 000000000000..809e338bf9c3
--- /dev/null
+++ b/drivers/irqchip/irq-riscv-imsic-platform.c
@@ -0,0 +1,271 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+
+#include <linux/bitmap.h>
+#include <linux/cpu.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/irqchip.h>
+#include <linux/irqdomain.h>
+#include <linux/module.h>
+#include <linux/msi.h>
+#include <linux/platform_device.h>
+#include <linux/spinlock.h>
+#include <linux/smp.h>
+
+#include "irq-riscv-imsic-state.h"
+
+static int imsic_cpu_page_phys(unsigned int cpu,
+ unsigned int guest_index,
+ phys_addr_t *out_msi_pa)
+{
+ struct imsic_global_config *global;
+ struct imsic_local_config *local;
+
+ global = &imsic->global;
+ local = per_cpu_ptr(global->local, cpu);
+
+ if (BIT(global->guest_index_bits) <= guest_index)
+ return -EINVAL;
+
+ if (out_msi_pa)
+ *out_msi_pa = local->msi_pa +
+ (guest_index * IMSIC_MMIO_PAGE_SZ);
+
+ return 0;
+}
+
+static int imsic_get_cpu(const struct cpumask *mask_val, bool force,
+ unsigned int *out_target_cpu)
+{
+ unsigned int cpu;
+
+ if (force)
+ cpu = cpumask_first(mask_val);
+ else
+ cpu = cpumask_any_and(mask_val, cpu_online_mask);
+
+ if (cpu >= nr_cpu_ids)
+ return -EINVAL;
+
+ if (out_target_cpu)
+ *out_target_cpu = cpu;
+
+ return 0;
+}
+
+static void imsic_irq_mask(struct irq_data *d)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&imsic->ids_lock, flags);
+ bitmap_clear(imsic->ids_enabled_bimap, d->hwirq, 1);
+ __imsic_id_disable(d->hwirq);
+ raw_spin_unlock_irqrestore(&imsic->ids_lock, flags);
+
+ imsic_ids_remote_sync();
+}
+
+static void imsic_irq_unmask(struct irq_data *d)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&imsic->ids_lock, flags);
+ bitmap_set(imsic->ids_enabled_bimap, d->hwirq, 1);
+ __imsic_id_enable(d->hwirq);
+ raw_spin_unlock_irqrestore(&imsic->ids_lock, flags);
+
+ imsic_ids_remote_sync();
+}
+
+static void imsic_irq_compose_msi_msg(struct irq_data *d,
+ struct msi_msg *msg)
+{
+ phys_addr_t msi_addr;
+ unsigned int cpu;
+ int err;
+
+ cpu = imsic_id_get_target(d->hwirq);
+ if (WARN_ON(cpu == UINT_MAX))
+ return;
+
+ err = imsic_cpu_page_phys(cpu, 0, &msi_addr);
+ if (WARN_ON(err))
+ return;
+
+ msg->address_hi = upper_32_bits(msi_addr);
+ msg->address_lo = lower_32_bits(msi_addr);
+ msg->data = d->hwirq;
+}
+
+#ifdef CONFIG_SMP
+static int imsic_irq_set_affinity(struct irq_data *d,
+ const struct cpumask *mask_val,
+ bool force)
+{
+ unsigned int target_cpu;
+ int rc;
+
+ rc = imsic_get_cpu(mask_val, force, &target_cpu);
+ if (rc)
+ return rc;
+
+ imsic_id_set_target(d->hwirq, target_cpu);
+ irq_data_update_effective_affinity(d, cpumask_of(target_cpu));
+
+ return IRQ_SET_MASK_OK;
+}
+#endif
+
+static struct irq_chip imsic_irq_base_chip = {
+ .name = "IMSIC-BASE",
+ .irq_mask = imsic_irq_mask,
+ .irq_unmask = imsic_irq_unmask,
+#ifdef CONFIG_SMP
+ .irq_set_affinity = imsic_irq_set_affinity,
+#endif
+ .irq_compose_msi_msg = imsic_irq_compose_msi_msg,
+ .flags = IRQCHIP_SKIP_SET_WAKE |
+ IRQCHIP_MASK_ON_SUSPEND,
+};
+
+static int imsic_irq_domain_alloc(struct irq_domain *domain,
+ unsigned int virq,
+ unsigned int nr_irqs,
+ void *args)
+{
+ int i, hwirq, err = 0;
+ unsigned int cpu;
+
+ err = imsic_get_cpu(cpu_online_mask, false, &cpu);
+ if (err)
+ return err;
+
+ hwirq = imsic_ids_alloc(get_count_order(nr_irqs));
+ if (hwirq < 0)
+ return hwirq;
+
+ for (i = 0; i < nr_irqs; i++) {
+ imsic_id_set_target(hwirq + i, cpu);
+ irq_domain_set_info(domain, virq + i, hwirq + i,
+ &imsic_irq_base_chip, imsic,
+ handle_simple_irq, NULL, NULL);
+ irq_set_noprobe(virq + i);
+ irq_set_affinity(virq + i, cpu_online_mask);
+ /*
+ * IMSIC does not implement irq_disable() so Linux interrupt
+ * subsystem will take a lazy approach for disabling an IMSIC
+ * interrupt. This means IMSIC interrupts are left unmasked
+ * upon system suspend and interrupts are not processed
+ * immediately upon system wake up. To tackle this, we disable
+ * the lazy approach for all IMSIC interrupts.
+ */
+ irq_set_status_flags(virq + i, IRQ_DISABLE_UNLAZY);
+ }
+
+ return 0;
+}
+
+static void imsic_irq_domain_free(struct irq_domain *domain,
+ unsigned int virq,
+ unsigned int nr_irqs)
+{
+ struct irq_data *d = irq_domain_get_irq_data(domain, virq);
+
+ imsic_ids_free(d->hwirq, get_count_order(nr_irqs));
+ irq_domain_free_irqs_parent(domain, virq, nr_irqs);
+}
+
+static const struct irq_domain_ops imsic_base_domain_ops = {
+ .alloc = imsic_irq_domain_alloc,
+ .free = imsic_irq_domain_free,
+};
+
+static struct irq_chip imsic_plat_irq_chip = {
+ .name = "IMSIC-PLAT",
+};
+
+static struct msi_domain_ops imsic_plat_domain_ops = {
+};
+
+static struct msi_domain_info imsic_plat_domain_info = {
+ .flags = (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS),
+ .ops = &imsic_plat_domain_ops,
+ .chip = &imsic_plat_irq_chip,
+};
+
+static int imsic_irq_domains_init(struct device *dev)
+{
+ /* Create Base IRQ domain */
+ imsic->base_domain = irq_domain_create_tree(dev->fwnode,
+ &imsic_base_domain_ops, imsic);
+ if (!imsic->base_domain) {
+ dev_err(dev, "failed to create IMSIC base domain\n");
+ return -ENOMEM;
+ }
+ irq_domain_update_bus_token(imsic->base_domain, DOMAIN_BUS_NEXUS);
+
+ /* Create Platform MSI domain */
+ imsic->plat_domain = platform_msi_create_irq_domain(dev->fwnode,
+ &imsic_plat_domain_info,
+ imsic->base_domain);
+ if (!imsic->plat_domain) {
+ dev_err(dev, "failed to create IMSIC platform domain\n");
+ irq_domain_remove(imsic->base_domain);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static int imsic_platform_probe(struct platform_device *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct imsic_global_config *global;
+ int rc;
+
+ if (!imsic) {
+ dev_err(dev, "early driver not probed\n");
+ return -ENODEV;
+ }
+
+ if (imsic->base_domain) {
+ dev_err(dev, "irq domain already created\n");
+ return -ENODEV;
+ }
+
+ global = &imsic->global;
+
+ /* Initialize IRQ and MSI domains */
+ rc = imsic_irq_domains_init(dev);
+ if (rc) {
+ dev_err(dev, "failed to initialize IRQ and MSI domains\n");
+ return rc;
+ }
+
+ dev_info(dev, " hart-index-bits: %d, guest-index-bits: %d\n",
+ global->hart_index_bits, global->guest_index_bits);
+ dev_info(dev, " group-index-bits: %d, group-index-shift: %d\n",
+ global->group_index_bits, global->group_index_shift);
+ dev_info(dev, " mapped %d interrupts at base address %pa\n",
+ global->nr_ids, &global->base_addr);
+
+ return 0;
+}
+
+static const struct of_device_id imsic_platform_match[] = {
+ { .compatible = "riscv,imsics" },
+ {}
+};
+
+static struct platform_driver imsic_platform_driver = {
+ .driver = {
+ .name = "riscv-imsic",
+ .of_match_table = imsic_platform_match,
+ },
+ .probe = imsic_platform_probe,
+};
+builtin_platform_driver(imsic_platform_driver);
--
2.34.1

2023-10-03 04:47:20

by Anup Patel

Subject: [PATCH v10 03/15] drivers: irqchip/riscv-intc: Mark all INTC nodes as initialized

The RISC-V INTC local interrupts are per-HART (or per-CPU), so we
create an INTC IRQ domain only for the INTC node belonging to the boot
HART. This means only the boot HART's INTC node will be marked as
initialized while the other INTC nodes won't be, which results in
downstream interrupt controllers (such as the PLIC, IMSIC, and APLIC
direct-mode) not being probed due to missing device suppliers.

To address this issue, we mark every INTC node for which we don't
create an IRQ domain as initialized.

Signed-off-by: Anup Patel <[email protected]>
---
drivers/irqchip/irq-riscv-intc.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-riscv-intc.c b/drivers/irqchip/irq-riscv-intc.c
index 4adeee1bc391..e8d01b14ccdd 100644
--- a/drivers/irqchip/irq-riscv-intc.c
+++ b/drivers/irqchip/irq-riscv-intc.c
@@ -155,8 +155,16 @@ static int __init riscv_intc_init(struct device_node *node,
* for each INTC DT node. We only need to do INTC initialization
* for the INTC DT node belonging to boot CPU (or boot HART).
*/
- if (riscv_hartid_to_cpuid(hartid) != smp_processor_id())
+ if (riscv_hartid_to_cpuid(hartid) != smp_processor_id()) {
+ /*
+ * The INTC nodes of each CPU are suppliers for downstream
+ * interrupt controllers (such as PLIC, IMSIC and APLIC
+ * direct-mode) so we should mark an INTC node as initialized
+ * if we are not creating IRQ domain for it.
+ */
+ fwnode_dev_initialized(of_fwnode_handle(node), true);
return 0;
+ }

return riscv_intc_init_common(of_node_to_fwnode(node));
}
--
2.34.1

2023-10-03 04:47:32

by Anup Patel

Subject: [PATCH v10 07/15] dt-bindings: interrupt-controller: Add RISC-V incoming MSI controller

We add a DT bindings document for the RISC-V incoming MSI controller
(IMSIC) defined by the RISC-V advanced interrupt architecture (AIA)
specification.

Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Conor Dooley <[email protected]>
Acked-by: Krzysztof Kozlowski <[email protected]>
---
.../interrupt-controller/riscv,imsics.yaml | 172 ++++++++++++++++++
1 file changed, 172 insertions(+)
create mode 100644 Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml

diff --git a/Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml b/Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml
new file mode 100644
index 000000000000..84976f17a4a1
--- /dev/null
+++ b/Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml
@@ -0,0 +1,172 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: RISC-V Incoming MSI Controller (IMSIC)
+
+maintainers:
+ - Anup Patel <[email protected]>
+
+description: |
+ The RISC-V advanced interrupt architecture (AIA) defines a per-CPU incoming
+ MSI controller (IMSIC) for handling MSIs in a RISC-V platform. The RISC-V
+ AIA specification can be found at https://github.com/riscv/riscv-aia.
+
+ The IMSIC is a per-CPU (or per-HART) device with a separate interrupt
+ file for each privilege level (machine or supervisor). An IMSIC interrupt
+ file is configured using AIA CSRs and also has a 4KB MMIO space to
+ receive MSIs from devices. Each IMSIC interrupt file supports a fixed
+ number of interrupt identities (to distinguish MSIs from devices), which
+ is the same for a given privilege level across CPUs (or HARTs).
+
+ The device tree of a RISC-V platform will have one IMSIC device tree node
+ for each privilege level (machine or supervisor), which collectively describes
+ the IMSIC interrupt files at that privilege level across CPUs (or HARTs).
+
+ The arrangement of IMSIC interrupt files in the MMIO space of a RISC-V
+ platform follows a particular scheme defined by the RISC-V AIA
+ specification. An IMSIC group is a set of IMSIC interrupt files co-located
+ in MMIO space, and we can have multiple IMSIC groups (i.e. clusters,
+ sockets, chiplets, etc.) in a RISC-V platform. The MSI target address of
+ an IMSIC interrupt file at a given privilege level (machine or supervisor)
+ encodes the group index, HART index, and guest index (shown below).
+
+  XLEN-1             > (HART Index MSB)                12    0
+  |                  |                                 |     |
+  -------------------------------------------------------------
+  |xxxxxx|Group Index|xxxxxxxxxxx|HART Index|Guest Index|  0  |
+  -------------------------------------------------------------
+
+allOf:
+ - $ref: /schemas/interrupt-controller.yaml#
+ - $ref: /schemas/interrupt-controller/msi-controller.yaml#
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - qemu,imsics
+ - const: riscv,imsics
+
+ reg:
+ minItems: 1
+ maxItems: 16384
+ description:
+ Base address of each IMSIC group.
+
+ interrupt-controller: true
+
+ "#interrupt-cells":
+ const: 0
+
+ msi-controller: true
+
+ "#msi-cells":
+ const: 0
+
+ interrupts-extended:
+ minItems: 1
+ maxItems: 16384
+ description:
+ This property represents the set of CPUs (or HARTs) for which the given
+ device tree node describes the IMSIC interrupt files. Each node pointed
+ to should be a riscv,cpu-intc node, which has a CPU node (i.e. RISC-V
+ HART) as parent.
+
+ riscv,num-ids:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ minimum: 63
+ maximum: 2047
+ description:
+ Number of interrupt identities supported by the IMSIC interrupt file.
+
+ riscv,num-guest-ids:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ minimum: 63
+ maximum: 2047
+ description:
+ Number of interrupt identities supported by the IMSIC guest interrupt
+ file. When not specified, it is assumed to be the same as specified by
+ the riscv,num-ids property.
+
+ riscv,guest-index-bits:
+ minimum: 0
+ maximum: 7
+ default: 0
+ description:
+ Number of guest index bits in the MSI target address.
+
+ riscv,hart-index-bits:
+ minimum: 0
+ maximum: 15
+ description:
+ Number of HART index bits in the MSI target address. When not
+ specified, it is calculated based on the interrupts-extended property.
+
+ riscv,group-index-bits:
+ minimum: 0
+ maximum: 7
+ default: 0
+ description:
+ Number of group index bits in the MSI target address.
+
+ riscv,group-index-shift:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ minimum: 0
+ maximum: 55
+ default: 24
+ description:
+ The least significant bit position of the group index bits in the
+ MSI target address.
+
+required:
+ - compatible
+ - reg
+ - interrupt-controller
+ - msi-controller
+ - "#msi-cells"
+ - interrupts-extended
+ - riscv,num-ids
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ // Example 1 (Machine-level IMSIC files with just one group):
+
+ interrupt-controller@24000000 {
+ compatible = "qemu,imsics", "riscv,imsics";
+ interrupts-extended = <&cpu1_intc 11>,
+ <&cpu2_intc 11>,
+ <&cpu3_intc 11>,
+ <&cpu4_intc 11>;
+ reg = <0x24000000 0x4000>;
+ interrupt-controller;
+ #interrupt-cells = <0>;
+ msi-controller;
+ #msi-cells = <0>;
+ riscv,num-ids = <127>;
+ };
+
+ - |
+ // Example 2 (Supervisor-level IMSIC files with two groups):
+
+ interrupt-controller@28000000 {
+ compatible = "qemu,imsics", "riscv,imsics";
+ interrupts-extended = <&cpu1_intc 9>,
+ <&cpu2_intc 9>,
+ <&cpu3_intc 9>,
+ <&cpu4_intc 9>;
+ reg = <0x28000000 0x2000>, /* Group0 IMSICs */
+ <0x29000000 0x2000>; /* Group1 IMSICs */
+ interrupt-controller;
+ #interrupt-cells = <0>;
+ msi-controller;
+ #msi-cells = <0>;
+ riscv,num-ids = <127>;
+ riscv,group-index-bits = <1>;
+ riscv,group-index-shift = <24>;
+ };
+...
--
2.34.1

2023-10-03 04:47:42

by Anup Patel

[permalink] [raw]
Subject: [PATCH v10 12/15] irqchip: Add RISC-V advanced PLIC driver for direct-mode

The RISC-V advanced interrupt architecture (AIA) specification defines
the advanced platform-level interrupt controller (APLIC), which has two
modes of operation: 1) Direct mode and 2) MSI mode.
(For more details, refer https://github.com/riscv/riscv-aia)

In APLIC direct-mode, wired interrupts are forwarded to CPUs (or HARTs)
as local external interrupts.

We add a platform irqchip driver for the RISC-V APLIC direct-mode to
support RISC-V platforms that have only wired interrupts.
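
For context, routing a wired interrupt in direct mode boils down to
programming the per-source target register with a HART index and a
priority. A minimal sketch using the APLIC_TARGET_* definitions added
by this patch (the variables regs, hart_index, and hwirq are
hypothetical here):

  /* Route interrupt source 'hwirq' to the IDC with 'hart_index'. */
  u32 val = (hart_index & APLIC_TARGET_HART_IDX_MASK)
                  << APLIC_TARGET_HART_IDX_SHIFT;
  val |= APLIC_DEFAULT_PRIORITY;
  writel(val, regs + APLIC_TARGET_BASE + (hwirq - 1) * sizeof(u32));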

Signed-off-by: Anup Patel <[email protected]>
---
drivers/irqchip/Kconfig | 5 +
drivers/irqchip/Makefile | 1 +
drivers/irqchip/irq-riscv-aplic-direct.c | 343 +++++++++++++++++++++++
drivers/irqchip/irq-riscv-aplic-main.c | 232 +++++++++++++++
drivers/irqchip/irq-riscv-aplic-main.h | 45 +++
include/linux/irqchip/riscv-aplic.h | 119 ++++++++
6 files changed, 745 insertions(+)
create mode 100644 drivers/irqchip/irq-riscv-aplic-direct.c
create mode 100644 drivers/irqchip/irq-riscv-aplic-main.c
create mode 100644 drivers/irqchip/irq-riscv-aplic-main.h
create mode 100644 include/linux/irqchip/riscv-aplic.h

diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
index c1d69b418dfb..1996cc6f666a 100644
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -546,6 +546,11 @@ config SIFIVE_PLIC
select IRQ_DOMAIN_HIERARCHY
select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP

+config RISCV_APLIC
+ bool
+ depends on RISCV
+ select IRQ_DOMAIN_HIERARCHY
+
config RISCV_IMSIC
bool
depends on RISCV
diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile
index abca445a3229..7f8289790ed8 100644
--- a/drivers/irqchip/Makefile
+++ b/drivers/irqchip/Makefile
@@ -95,6 +95,7 @@ obj-$(CONFIG_QCOM_MPM) += irq-qcom-mpm.o
obj-$(CONFIG_CSKY_MPINTC) += irq-csky-mpintc.o
obj-$(CONFIG_CSKY_APB_INTC) += irq-csky-apb-intc.o
obj-$(CONFIG_RISCV_INTC) += irq-riscv-intc.o
+obj-$(CONFIG_RISCV_APLIC) += irq-riscv-aplic-main.o irq-riscv-aplic-direct.o
obj-$(CONFIG_RISCV_IMSIC) += irq-riscv-imsic-state.o irq-riscv-imsic-early.o irq-riscv-imsic-platform.o
obj-$(CONFIG_SIFIVE_PLIC) += irq-sifive-plic.o
obj-$(CONFIG_IMX_IRQSTEER) += irq-imx-irqsteer.o
diff --git a/drivers/irqchip/irq-riscv-aplic-direct.c b/drivers/irqchip/irq-riscv-aplic-direct.c
new file mode 100644
index 000000000000..9ed2666bfb5e
--- /dev/null
+++ b/drivers/irqchip/irq-riscv-aplic-direct.c
@@ -0,0 +1,343 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+
+#include <linux/bitops.h>
+#include <linux/cpu.h>
+#include <linux/interrupt.h>
+#include <linux/irqchip.h>
+#include <linux/irqchip/chained_irq.h>
+#include <linux/irqchip/riscv-aplic.h>
+#include <linux/module.h>
+#include <linux/of_address.h>
+#include <linux/printk.h>
+#include <linux/smp.h>
+
+#include "irq-riscv-aplic-main.h"
+
+#define APLIC_DISABLE_IDELIVERY 0
+#define APLIC_ENABLE_IDELIVERY 1
+#define APLIC_DISABLE_ITHRESHOLD 1
+#define APLIC_ENABLE_ITHRESHOLD 0
+
+struct aplic_direct {
+ struct aplic_priv priv;
+ struct irq_domain *irqdomain;
+ struct cpumask lmask;
+};
+
+struct aplic_idc {
+ unsigned int hart_index;
+ void __iomem *regs;
+ struct aplic_direct *direct;
+};
+
+static unsigned int aplic_direct_parent_irq;
+static DEFINE_PER_CPU(struct aplic_idc, aplic_idcs);
+
+static void aplic_direct_irq_eoi(struct irq_data *d)
+{
+ /*
+ * The fasteoi_handler requires an irq_eoi() callback, hence
+ * provide a dummy handler.
+ */
+}
+
+#ifdef CONFIG_SMP
+static int aplic_direct_set_affinity(struct irq_data *d,
+ const struct cpumask *mask_val, bool force)
+{
+ struct aplic_priv *priv = irq_data_get_irq_chip_data(d);
+ struct aplic_direct *direct =
+ container_of(priv, struct aplic_direct, priv);
+ struct aplic_idc *idc;
+ unsigned int cpu, val;
+ struct cpumask amask;
+ void __iomem *target;
+
+ cpumask_and(&amask, &direct->lmask, mask_val);
+
+ if (force)
+ cpu = cpumask_first(&amask);
+ else
+ cpu = cpumask_any_and(&amask, cpu_online_mask);
+
+ if (cpu >= nr_cpu_ids)
+ return -EINVAL;
+
+ idc = per_cpu_ptr(&aplic_idcs, cpu);
+ target = priv->regs + APLIC_TARGET_BASE;
+ target += (d->hwirq - 1) * sizeof(u32);
+ val = idc->hart_index & APLIC_TARGET_HART_IDX_MASK;
+ val <<= APLIC_TARGET_HART_IDX_SHIFT;
+ val |= APLIC_DEFAULT_PRIORITY;
+ writel(val, target);
+
+ irq_data_update_effective_affinity(d, cpumask_of(cpu));
+
+ return IRQ_SET_MASK_OK_DONE;
+}
+#endif
+
+static struct irq_chip aplic_direct_chip = {
+ .name = "APLIC-DIRECT",
+ .irq_mask = aplic_irq_mask,
+ .irq_unmask = aplic_irq_unmask,
+ .irq_set_type = aplic_irq_set_type,
+ .irq_eoi = aplic_direct_irq_eoi,
+#ifdef CONFIG_SMP
+ .irq_set_affinity = aplic_direct_set_affinity,
+#endif
+ .flags = IRQCHIP_SET_TYPE_MASKED |
+ IRQCHIP_SKIP_SET_WAKE |
+ IRQCHIP_MASK_ON_SUSPEND,
+};
+
+static int aplic_direct_irqdomain_translate(struct irq_domain *d,
+ struct irq_fwspec *fwspec,
+ unsigned long *hwirq,
+ unsigned int *type)
+{
+ struct aplic_priv *priv = d->host_data;
+
+ return aplic_irqdomain_translate(fwspec, priv->gsi_base,
+ hwirq, type);
+}
+
+static int aplic_direct_irqdomain_alloc(struct irq_domain *domain,
+ unsigned int virq, unsigned int nr_irqs,
+ void *arg)
+{
+ int i, ret;
+ unsigned int type;
+ irq_hw_number_t hwirq;
+ struct irq_fwspec *fwspec = arg;
+ struct aplic_priv *priv = domain->host_data;
+ struct aplic_direct *direct =
+ container_of(priv, struct aplic_direct, priv);
+
+ ret = aplic_irqdomain_translate(fwspec, priv->gsi_base,
+ &hwirq, &type);
+ if (ret)
+ return ret;
+
+ for (i = 0; i < nr_irqs; i++) {
+ irq_domain_set_info(domain, virq + i, hwirq + i,
+ &aplic_direct_chip, priv,
+ handle_fasteoi_irq, NULL, NULL);
+ irq_set_affinity(virq + i, &direct->lmask);
+ /* See the reason described in aplic_msi_irqdomain_alloc() */
+ irq_set_status_flags(virq + i, IRQ_DISABLE_UNLAZY);
+ }
+
+ return 0;
+}
+
+static const struct irq_domain_ops aplic_direct_irqdomain_ops = {
+ .translate = aplic_direct_irqdomain_translate,
+ .alloc = aplic_direct_irqdomain_alloc,
+ .free = irq_domain_free_irqs_top,
+};
+
+/*
+ * To handle APLIC direct interrupts, we just read the CLAIMI register,
+ * which returns the highest-priority pending interrupt and clears its
+ * pending bit. This process is repeated until the CLAIMI register
+ * returns zero.
+ */
+static void aplic_direct_handle_irq(struct irq_desc *desc)
+{
+ struct aplic_idc *idc = this_cpu_ptr(&aplic_idcs);
+ struct irq_chip *chip = irq_desc_get_chip(desc);
+ struct irq_domain *irqdomain = idc->direct->irqdomain;
+ irq_hw_number_t hw_irq;
+ int irq;
+
+ chained_irq_enter(chip, desc);
+
+ while ((hw_irq = readl(idc->regs + APLIC_IDC_CLAIMI))) {
+ hw_irq = hw_irq >> APLIC_IDC_TOPI_ID_SHIFT;
+ irq = irq_find_mapping(irqdomain, hw_irq);
+
+ if (unlikely(irq <= 0))
+ dev_warn_ratelimited(idc->direct->priv.dev,
+ "hw_irq %lu mapping not found\n",
+ hw_irq);
+ else
+ generic_handle_irq(irq);
+ }
+
+ chained_irq_exit(chip, desc);
+}
+
+static void aplic_idc_set_delivery(struct aplic_idc *idc, bool en)
+{
+ u32 de = (en) ? APLIC_ENABLE_IDELIVERY : APLIC_DISABLE_IDELIVERY;
+ u32 th = (en) ? APLIC_ENABLE_ITHRESHOLD : APLIC_DISABLE_ITHRESHOLD;
+
+ /* Priority must be less than threshold for interrupt triggering */
+ writel(th, idc->regs + APLIC_IDC_ITHRESHOLD);
+
+ /* Delivery must be set to 1 for interrupt triggering */
+ writel(de, idc->regs + APLIC_IDC_IDELIVERY);
+}
+
+static int aplic_direct_dying_cpu(unsigned int cpu)
+{
+ if (aplic_direct_parent_irq)
+ disable_percpu_irq(aplic_direct_parent_irq);
+
+ return 0;
+}
+
+static int aplic_direct_starting_cpu(unsigned int cpu)
+{
+ if (aplic_direct_parent_irq)
+ enable_percpu_irq(aplic_direct_parent_irq,
+ irq_get_trigger_type(aplic_direct_parent_irq));
+
+ return 0;
+}
+
+static int aplic_direct_parse_parent_hwirq(struct device *dev,
+ u32 index, u32 *parent_hwirq,
+ unsigned long *parent_hartid)
+{
+ struct of_phandle_args parent;
+ int rc;
+
+ /*
+ * Currently, only an OF fwnode is supported; extend this
+ * function for ACPI support.
+ */
+ if (!is_of_node(dev->fwnode))
+ return -EINVAL;
+
+ rc = of_irq_parse_one(to_of_node(dev->fwnode), index, &parent);
+ if (rc)
+ return rc;
+
+ rc = riscv_of_parent_hartid(parent.np, parent_hartid);
+ if (rc)
+ return rc;
+
+ *parent_hwirq = parent.args[0];
+ return 0;
+}
+
+int aplic_direct_setup(struct device *dev, void __iomem *regs)
+{
+ int i, j, rc, cpu, setup_count = 0;
+ struct aplic_direct *direct;
+ struct aplic_priv *priv;
+ struct irq_domain *domain;
+ unsigned long hartid;
+ struct aplic_idc *idc;
+ u32 val, hwirq;
+
+ direct = kzalloc(sizeof(*direct), GFP_KERNEL);
+ if (!direct)
+ return -ENOMEM;
+ priv = &direct->priv;
+
+ rc = aplic_setup_priv(priv, dev, regs);
+ if (rc) {
+ dev_err(dev, "failed to create APLIC context\n");
+ kfree(direct);
+ return rc;
+ }
+
+ /* Setup per-CPU IDC and target CPU mask */
+ for (i = 0; i < priv->nr_idcs; i++) {
+ rc = aplic_direct_parse_parent_hwirq(dev, i, &hwirq, &hartid);
+ if (rc) {
+ dev_warn(dev, "parent irq for IDC%d not found\n", i);
+ continue;
+ }
+
+ /*
+ * Skip interrupts other than external interrupts for the
+ * current privilege level.
+ */
+ if (hwirq != RV_IRQ_EXT)
+ continue;
+
+ cpu = riscv_hartid_to_cpuid(hartid);
+ if (cpu < 0) {
+ dev_warn(dev, "invalid cpuid for IDC%d\n", i);
+ continue;
+ }
+
+ cpumask_set_cpu(cpu, &direct->lmask);
+
+ idc = per_cpu_ptr(&aplic_idcs, cpu);
+ idc->hart_index = i;
+ idc->regs = priv->regs + APLIC_IDC_BASE + i * APLIC_IDC_SIZE;
+ idc->direct = direct;
+
+ aplic_idc_set_delivery(idc, true);
+
+ /*
+ * The boot CPU might not have APLIC hart_index = 0, so check
+ * and update the target registers of all interrupts.
+ */
+ if (cpu == smp_processor_id() && idc->hart_index) {
+ val = idc->hart_index & APLIC_TARGET_HART_IDX_MASK;
+ val <<= APLIC_TARGET_HART_IDX_SHIFT;
+ val |= APLIC_DEFAULT_PRIORITY;
+ for (j = 1; j <= priv->nr_irqs; j++)
+ writel(val, priv->regs + APLIC_TARGET_BASE +
+ (j - 1) * sizeof(u32));
+ }
+
+ setup_count++;
+ }
+
+ /* Find parent domain and register chained handler */
+ domain = irq_find_matching_fwnode(riscv_get_intc_hwnode(),
+ DOMAIN_BUS_ANY);
+ if (!aplic_direct_parent_irq && domain) {
+ aplic_direct_parent_irq = irq_create_mapping(domain, RV_IRQ_EXT);
+ if (aplic_direct_parent_irq) {
+ irq_set_chained_handler(aplic_direct_parent_irq,
+ aplic_direct_handle_irq);
+
+ /*
+ * Set up a CPUHP notifier to enable the parent
+ * interrupt on all CPUs.
+ */
+ cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+ "irqchip/riscv/aplic:starting",
+ aplic_direct_starting_cpu,
+ aplic_direct_dying_cpu);
+ }
+ }
+
+ /* Fail if we were not able to setup IDC for any CPU */
+ if (!setup_count) {
+ kfree(direct);
+ return -ENODEV;
+ }
+
+ /* Setup global config and interrupt delivery */
+ aplic_init_hw_global(priv, false);
+
+ /* Create irq domain instance for the APLIC */
+ direct->irqdomain = irq_domain_create_linear(dev->fwnode,
+ priv->nr_irqs + 1,
+ &aplic_direct_irqdomain_ops,
+ priv);
+ if (!direct->irqdomain) {
+ dev_err(dev, "failed to create direct irq domain\n");
+ kfree(direct);
+ return -ENOMEM;
+ }
+
+ /* Advertise the interrupt controller */
+ dev_info(dev, "%d interrupts directly connected to %d CPUs\n",
+ priv->nr_irqs, priv->nr_idcs);
+
+ return 0;
+}
diff --git a/drivers/irqchip/irq-riscv-aplic-main.c b/drivers/irqchip/irq-riscv-aplic-main.c
new file mode 100644
index 000000000000..87450708a733
--- /dev/null
+++ b/drivers/irqchip/irq-riscv-aplic-main.c
@@ -0,0 +1,232 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+
+#include <linux/of.h>
+#include <linux/of_irq.h>
+#include <linux/printk.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/irqchip/riscv-aplic.h>
+
+#include "irq-riscv-aplic-main.h"
+
+void aplic_irq_unmask(struct irq_data *d)
+{
+ struct aplic_priv *priv = irq_data_get_irq_chip_data(d);
+
+ writel(d->hwirq, priv->regs + APLIC_SETIENUM);
+}
+
+void aplic_irq_mask(struct irq_data *d)
+{
+ struct aplic_priv *priv = irq_data_get_irq_chip_data(d);
+
+ writel(d->hwirq, priv->regs + APLIC_CLRIENUM);
+}
+
+int aplic_irq_set_type(struct irq_data *d, unsigned int type)
+{
+ u32 val = 0;
+ void __iomem *sourcecfg;
+ struct aplic_priv *priv = irq_data_get_irq_chip_data(d);
+
+ switch (type) {
+ case IRQ_TYPE_NONE:
+ val = APLIC_SOURCECFG_SM_INACTIVE;
+ break;
+ case IRQ_TYPE_LEVEL_LOW:
+ val = APLIC_SOURCECFG_SM_LEVEL_LOW;
+ break;
+ case IRQ_TYPE_LEVEL_HIGH:
+ val = APLIC_SOURCECFG_SM_LEVEL_HIGH;
+ break;
+ case IRQ_TYPE_EDGE_FALLING:
+ val = APLIC_SOURCECFG_SM_EDGE_FALL;
+ break;
+ case IRQ_TYPE_EDGE_RISING:
+ val = APLIC_SOURCECFG_SM_EDGE_RISE;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ sourcecfg = priv->regs + APLIC_SOURCECFG_BASE;
+ sourcecfg += (d->hwirq - 1) * sizeof(u32);
+ writel(val, sourcecfg);
+
+ return 0;
+}
+
+int aplic_irqdomain_translate(struct irq_fwspec *fwspec, u32 gsi_base,
+ unsigned long *hwirq, unsigned int *type)
+{
+ if (WARN_ON(fwspec->param_count < 2))
+ return -EINVAL;
+ if (WARN_ON(!fwspec->param[0]))
+ return -EINVAL;
+
+ /* For DT, gsi_base is always zero. */
+ *hwirq = fwspec->param[0] - gsi_base;
+ *type = fwspec->param[1] & IRQ_TYPE_SENSE_MASK;
+
+ WARN_ON(*type == IRQ_TYPE_NONE);
+
+ return 0;
+}
+
+void aplic_init_hw_global(struct aplic_priv *priv, bool msi_mode)
+{
+ u32 val;
+#ifdef CONFIG_RISCV_M_MODE
+ u32 valH;
+
+ if (msi_mode) {
+ val = priv->msicfg.base_ppn;
+ valH = ((u64)priv->msicfg.base_ppn >> 32) &
+ APLIC_xMSICFGADDRH_BAPPN_MASK;
+ valH |= (priv->msicfg.lhxw & APLIC_xMSICFGADDRH_LHXW_MASK)
+ << APLIC_xMSICFGADDRH_LHXW_SHIFT;
+ valH |= (priv->msicfg.hhxw & APLIC_xMSICFGADDRH_HHXW_MASK)
+ << APLIC_xMSICFGADDRH_HHXW_SHIFT;
+ valH |= (priv->msicfg.lhxs & APLIC_xMSICFGADDRH_LHXS_MASK)
+ << APLIC_xMSICFGADDRH_LHXS_SHIFT;
+ valH |= (priv->msicfg.hhxs & APLIC_xMSICFGADDRH_HHXS_MASK)
+ << APLIC_xMSICFGADDRH_HHXS_SHIFT;
+ writel(val, priv->regs + APLIC_xMSICFGADDR);
+ writel(valH, priv->regs + APLIC_xMSICFGADDRH);
+ }
+#endif
+
+ /* Setup APLIC domaincfg register */
+ val = readl(priv->regs + APLIC_DOMAINCFG);
+ val |= APLIC_DOMAINCFG_IE;
+ if (msi_mode)
+ val |= APLIC_DOMAINCFG_DM;
+ writel(val, priv->regs + APLIC_DOMAINCFG);
+ if (readl(priv->regs + APLIC_DOMAINCFG) != val)
+ dev_warn(priv->dev, "unable to write 0x%x in domaincfg\n",
+ val);
+}
+
+static void aplic_init_hw_irqs(struct aplic_priv *priv)
+{
+ int i;
+
+ /* Disable all interrupts */
+ for (i = 0; i <= priv->nr_irqs; i += 32)
+ writel(-1U, priv->regs + APLIC_CLRIE_BASE +
+ (i / 32) * sizeof(u32));
+
+ /* Set interrupt type and default priority for all interrupts */
+ for (i = 1; i <= priv->nr_irqs; i++) {
+ writel(0, priv->regs + APLIC_SOURCECFG_BASE +
+ (i - 1) * sizeof(u32));
+ writel(APLIC_DEFAULT_PRIORITY,
+ priv->regs + APLIC_TARGET_BASE +
+ (i - 1) * sizeof(u32));
+ }
+
+ /* Clear APLIC domaincfg */
+ writel(0, priv->regs + APLIC_DOMAINCFG);
+}
+
+int aplic_setup_priv(struct aplic_priv *priv, struct device *dev,
+ void __iomem *regs)
+{
+ struct of_phandle_args parent;
+ int rc;
+
+ /*
+ * Currently, only an OF fwnode is supported; extend this
+ * function for ACPI support.
+ */
+ if (!is_of_node(dev->fwnode))
+ return -EINVAL;
+
+ /* Save device pointer and register base */
+ priv->dev = dev;
+ priv->regs = regs;
+
+ /* Find out number of interrupt sources */
+ rc = of_property_read_u32(to_of_node(dev->fwnode),
+ "riscv,num-sources",
+ &priv->nr_irqs);
+ if (rc) {
+ dev_err(dev, "failed to get number of interrupt sources\n");
+ return rc;
+ }
+
+ /*
+ * Find out number of IDCs based on parent interrupts
+ *
+ * If "msi-parent" property is present then we ignore the
+ * APLIC IDCs which forces the APLIC driver to use MSI mode.
+ */
+ if (!of_property_present(to_of_node(dev->fwnode), "msi-parent")) {
+ while (!of_irq_parse_one(to_of_node(dev->fwnode),
+ priv->nr_idcs, &parent))
+ priv->nr_idcs++;
+ }
+
+ /* Setup initial state of APLIC interrupts */
+ aplic_init_hw_irqs(priv);
+
+ return 0;
+}
+
+static int aplic_probe(struct platform_device *pdev)
+{
+ struct device *dev = &pdev->dev;
+ bool msi_mode = false;
+ struct resource *res;
+ void __iomem *regs;
+ int rc;
+
+ /* Map the MMIO registers */
+ res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+ if (!res) {
+ dev_err(dev, "failed to get MMIO resource\n");
+ return -EINVAL;
+ }
+ regs = devm_ioremap(&pdev->dev, res->start, resource_size(res));
+ if (!regs) {
+ dev_err(dev, "failed map MMIO registers\n");
+ return -ENOMEM;
+ }
+
+ /*
+ * If the msi-parent property is present then setup APLIC
+ * MSI mode, otherwise setup APLIC direct mode.
+ */
+ if (is_of_node(dev->fwnode))
+ msi_mode = of_property_present(to_of_node(dev->fwnode),
+ "msi-parent");
+ if (msi_mode)
+ rc = -ENODEV;
+ else
+ rc = aplic_direct_setup(dev, regs);
+ if (rc) {
+ dev_err(dev, "failed setup APLIC in %s mode\n",
+ msi_mode ? "MSI" : "direct");
+ return rc;
+ }
+
+ return 0;
+}
+
+static const struct of_device_id aplic_match[] = {
+ { .compatible = "riscv,aplic" },
+ {}
+};
+
+static struct platform_driver aplic_driver = {
+ .driver = {
+ .name = "riscv-aplic",
+ .of_match_table = aplic_match,
+ },
+ .probe = aplic_probe,
+};
+builtin_platform_driver(aplic_driver);
diff --git a/drivers/irqchip/irq-riscv-aplic-main.h b/drivers/irqchip/irq-riscv-aplic-main.h
new file mode 100644
index 000000000000..474a04229334
--- /dev/null
+++ b/drivers/irqchip/irq-riscv-aplic-main.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+
+#ifndef _IRQ_RISCV_APLIC_MAIN_H
+#define _IRQ_RISCV_APLIC_MAIN_H
+
+#include <linux/device.h>
+#include <linux/io.h>
+#include <linux/irq.h>
+#include <linux/irqdomain.h>
+#include <linux/fwnode.h>
+
+#define APLIC_DEFAULT_PRIORITY 1
+
+struct aplic_msicfg {
+ phys_addr_t base_ppn;
+ u32 hhxs;
+ u32 hhxw;
+ u32 lhxs;
+ u32 lhxw;
+};
+
+struct aplic_priv {
+ struct device *dev;
+ u32 gsi_base;
+ u32 nr_irqs;
+ u32 nr_idcs;
+ void __iomem *regs;
+ struct aplic_msicfg msicfg;
+};
+
+void aplic_irq_unmask(struct irq_data *d);
+void aplic_irq_mask(struct irq_data *d);
+int aplic_irq_set_type(struct irq_data *d, unsigned int type);
+int aplic_irqdomain_translate(struct irq_fwspec *fwspec, u32 gsi_base,
+ unsigned long *hwirq, unsigned int *type);
+void aplic_init_hw_global(struct aplic_priv *priv, bool msi_mode);
+int aplic_setup_priv(struct aplic_priv *priv, struct device *dev,
+ void __iomem *regs);
+int aplic_direct_setup(struct device *dev, void __iomem *regs);
+
+#endif
diff --git a/include/linux/irqchip/riscv-aplic.h b/include/linux/irqchip/riscv-aplic.h
new file mode 100644
index 000000000000..97e198ea0109
--- /dev/null
+++ b/include/linux/irqchip/riscv-aplic.h
@@ -0,0 +1,119 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+#ifndef __LINUX_IRQCHIP_RISCV_APLIC_H
+#define __LINUX_IRQCHIP_RISCV_APLIC_H
+
+#include <linux/bitops.h>
+
+#define APLIC_MAX_IDC BIT(14)
+#define APLIC_MAX_SOURCE 1024
+
+#define APLIC_DOMAINCFG 0x0000
+#define APLIC_DOMAINCFG_RDONLY 0x80000000
+#define APLIC_DOMAINCFG_IE BIT(8)
+#define APLIC_DOMAINCFG_DM BIT(2)
+#define APLIC_DOMAINCFG_BE BIT(0)
+
+#define APLIC_SOURCECFG_BASE 0x0004
+#define APLIC_SOURCECFG_D BIT(10)
+#define APLIC_SOURCECFG_CHILDIDX_MASK 0x000003ff
+#define APLIC_SOURCECFG_SM_MASK 0x00000007
+#define APLIC_SOURCECFG_SM_INACTIVE 0x0
+#define APLIC_SOURCECFG_SM_DETACH 0x1
+#define APLIC_SOURCECFG_SM_EDGE_RISE 0x4
+#define APLIC_SOURCECFG_SM_EDGE_FALL 0x5
+#define APLIC_SOURCECFG_SM_LEVEL_HIGH 0x6
+#define APLIC_SOURCECFG_SM_LEVEL_LOW 0x7
+
+#define APLIC_MMSICFGADDR 0x1bc0
+#define APLIC_MMSICFGADDRH 0x1bc4
+#define APLIC_SMSICFGADDR 0x1bc8
+#define APLIC_SMSICFGADDRH 0x1bcc
+
+#ifdef CONFIG_RISCV_M_MODE
+#define APLIC_xMSICFGADDR APLIC_MMSICFGADDR
+#define APLIC_xMSICFGADDRH APLIC_MMSICFGADDRH
+#else
+#define APLIC_xMSICFGADDR APLIC_SMSICFGADDR
+#define APLIC_xMSICFGADDRH APLIC_SMSICFGADDRH
+#endif
+
+#define APLIC_xMSICFGADDRH_L BIT(31)
+#define APLIC_xMSICFGADDRH_HHXS_MASK 0x1f
+#define APLIC_xMSICFGADDRH_HHXS_SHIFT 24
+#define APLIC_xMSICFGADDRH_LHXS_MASK 0x7
+#define APLIC_xMSICFGADDRH_LHXS_SHIFT 20
+#define APLIC_xMSICFGADDRH_HHXW_MASK 0x7
+#define APLIC_xMSICFGADDRH_HHXW_SHIFT 16
+#define APLIC_xMSICFGADDRH_LHXW_MASK 0xf
+#define APLIC_xMSICFGADDRH_LHXW_SHIFT 12
+#define APLIC_xMSICFGADDRH_BAPPN_MASK 0xfff
+
+#define APLIC_xMSICFGADDR_PPN_SHIFT 12
+
+#define APLIC_xMSICFGADDR_PPN_HART(__lhxs) \
+ (BIT(__lhxs) - 1)
+
+#define APLIC_xMSICFGADDR_PPN_LHX_MASK(__lhxw) \
+ (BIT(__lhxw) - 1)
+#define APLIC_xMSICFGADDR_PPN_LHX_SHIFT(__lhxs) \
+ ((__lhxs))
+#define APLIC_xMSICFGADDR_PPN_LHX(__lhxw, __lhxs) \
+ (APLIC_xMSICFGADDR_PPN_LHX_MASK(__lhxw) << \
+ APLIC_xMSICFGADDR_PPN_LHX_SHIFT(__lhxs))
+
+#define APLIC_xMSICFGADDR_PPN_HHX_MASK(__hhxw) \
+ (BIT(__hhxw) - 1)
+#define APLIC_xMSICFGADDR_PPN_HHX_SHIFT(__hhxs) \
+ ((__hhxs) + APLIC_xMSICFGADDR_PPN_SHIFT)
+#define APLIC_xMSICFGADDR_PPN_HHX(__hhxw, __hhxs) \
+ (APLIC_xMSICFGADDR_PPN_HHX_MASK(__hhxw) << \
+ APLIC_xMSICFGADDR_PPN_HHX_SHIFT(__hhxs))
+
+#define APLIC_IRQBITS_PER_REG 32
+
+#define APLIC_SETIP_BASE 0x1c00
+#define APLIC_SETIPNUM 0x1cdc
+
+#define APLIC_CLRIP_BASE 0x1d00
+#define APLIC_CLRIPNUM 0x1ddc
+
+#define APLIC_SETIE_BASE 0x1e00
+#define APLIC_SETIENUM 0x1edc
+
+#define APLIC_CLRIE_BASE 0x1f00
+#define APLIC_CLRIENUM 0x1fdc
+
+#define APLIC_SETIPNUM_LE 0x2000
+#define APLIC_SETIPNUM_BE 0x2004
+
+#define APLIC_GENMSI 0x3000
+
+#define APLIC_TARGET_BASE 0x3004
+#define APLIC_TARGET_HART_IDX_SHIFT 18
+#define APLIC_TARGET_HART_IDX_MASK 0x3fff
+#define APLIC_TARGET_GUEST_IDX_SHIFT 12
+#define APLIC_TARGET_GUEST_IDX_MASK 0x3f
+#define APLIC_TARGET_IPRIO_MASK 0xff
+#define APLIC_TARGET_EIID_MASK 0x7ff
+
+#define APLIC_IDC_BASE 0x4000
+#define APLIC_IDC_SIZE 32
+
+#define APLIC_IDC_IDELIVERY 0x00
+
+#define APLIC_IDC_IFORCE 0x04
+
+#define APLIC_IDC_ITHRESHOLD 0x08
+
+#define APLIC_IDC_TOPI 0x18
+#define APLIC_IDC_TOPI_ID_SHIFT 16
+#define APLIC_IDC_TOPI_ID_MASK 0x3ff
+#define APLIC_IDC_TOPI_PRIO_MASK 0xff
+
+#define APLIC_IDC_CLAIMI 0x1c
+
+#endif
--
2.34.1

2023-10-03 04:48:11

by Anup Patel

[permalink] [raw]
Subject: [PATCH v10 08/15] irqchip: Add RISC-V incoming MSI controller early driver

The RISC-V advanced interrupt architecture (AIA) specification
defines a new MSI controller called the incoming message signalled
interrupt controller (IMSIC) which manages MSIs on a per-HART (or
per-CPU) basis. It also supports IPIs as software-injected MSIs.
(For more details, refer https://github.com/riscv/riscv-aia)

Let us add an early irqchip driver for the RISC-V IMSIC which sets
up the IMSIC state and provides IPIs.
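
Since IPIs are software-injected MSIs, sending an IPI amounts to an
MMIO write of the reserved IPI interrupt identity into the target
CPU's interrupt file. A minimal sketch mirroring imsic_ipi_send()
below (the variable cpu is hypothetical here):

  struct imsic_local_config *local =
          per_cpu_ptr(imsic->global.local, cpu);

  /* Writing the IPI identity raises the IPI on that CPU. */
  writel(imsic->ipi_id, local->msi_va);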

Signed-off-by: Anup Patel <[email protected]>
---
drivers/irqchip/Kconfig | 6 +
drivers/irqchip/Makefile | 1 +
drivers/irqchip/irq-riscv-imsic-early.c | 259 +++++++++++
drivers/irqchip/irq-riscv-imsic-state.c | 570 ++++++++++++++++++++++++
drivers/irqchip/irq-riscv-imsic-state.h | 66 +++
include/linux/irqchip/riscv-imsic.h | 86 ++++
6 files changed, 988 insertions(+)
create mode 100644 drivers/irqchip/irq-riscv-imsic-early.c
create mode 100644 drivers/irqchip/irq-riscv-imsic-state.c
create mode 100644 drivers/irqchip/irq-riscv-imsic-state.h
create mode 100644 include/linux/irqchip/riscv-imsic.h

diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
index f7149d0f3d45..bdd80716114d 100644
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -546,6 +546,12 @@ config SIFIVE_PLIC
select IRQ_DOMAIN_HIERARCHY
select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP

+config RISCV_IMSIC
+ bool
+ depends on RISCV
+ select IRQ_DOMAIN_HIERARCHY
+ select GENERIC_MSI_IRQ
+
config EXYNOS_IRQ_COMBINER
bool "Samsung Exynos IRQ combiner support" if COMPILE_TEST
depends on (ARCH_EXYNOS && ARM) || COMPILE_TEST
diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile
index ffd945fe71aa..d714724387ce 100644
--- a/drivers/irqchip/Makefile
+++ b/drivers/irqchip/Makefile
@@ -95,6 +95,7 @@ obj-$(CONFIG_QCOM_MPM) += irq-qcom-mpm.o
obj-$(CONFIG_CSKY_MPINTC) += irq-csky-mpintc.o
obj-$(CONFIG_CSKY_APB_INTC) += irq-csky-apb-intc.o
obj-$(CONFIG_RISCV_INTC) += irq-riscv-intc.o
+obj-$(CONFIG_RISCV_IMSIC) += irq-riscv-imsic-state.o irq-riscv-imsic-early.o
obj-$(CONFIG_SIFIVE_PLIC) += irq-sifive-plic.o
obj-$(CONFIG_IMX_IRQSTEER) += irq-imx-irqsteer.o
obj-$(CONFIG_IMX_INTMUX) += irq-imx-intmux.o
diff --git a/drivers/irqchip/irq-riscv-imsic-early.c b/drivers/irqchip/irq-riscv-imsic-early.c
new file mode 100644
index 000000000000..68561ca385e8
--- /dev/null
+++ b/drivers/irqchip/irq-riscv-imsic-early.c
@@ -0,0 +1,259 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+
+#define pr_fmt(fmt) "riscv-imsic: " fmt
+#include <linux/cpu.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/irq.h>
+#include <linux/irqchip.h>
+#include <linux/irqchip/chained_irq.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/smp.h>
+
+#include "irq-riscv-imsic-state.h"
+
+/*
+ * The IMSIC driver uses 1 IPI for ID synchronization and
+ * arch/riscv/kernel/smp.c requires 6 IPIs, so we fix the
+ * total number of IPIs to 8.
+ */
+#define IMSIC_NR_IPI 8
+
+static int imsic_parent_irq;
+
+#ifdef CONFIG_SMP
+static irqreturn_t imsic_ids_sync_handler(int irq, void *data)
+{
+ imsic_ids_local_sync();
+ return IRQ_HANDLED;
+}
+
+void imsic_ids_remote_sync(void)
+{
+ struct cpumask amask;
+
+ /*
+ * We simply inject the ID synchronization IPI to all target CPUs
+ * except the current CPU. The ipi_send_mask() implementation of
+ * the IPI mux will inject the ID synchronization IPI only to CPUs
+ * that have enabled it, so offline CPUs won't receive the IPI.
+ * An offline CPU will unconditionally synchronize IDs through
+ * imsic_starting_cpu() when the CPU is brought up.
+ */
+ cpumask_andnot(&amask, cpu_online_mask, cpumask_of(smp_processor_id()));
+ __ipi_send_mask(imsic->ipi_lsync_desc, &amask);
+}
+
+static void imsic_ipi_send(unsigned int cpu)
+{
+ struct imsic_local_config *local =
+ per_cpu_ptr(imsic->global.local, cpu);
+
+ writel(imsic->ipi_id, local->msi_va);
+}
+
+static void imsic_ipi_starting_cpu(void)
+{
+ /* Enable IPIs for current CPU. */
+ __imsic_id_enable(imsic->ipi_id);
+
+ /* Enable virtual IPI used for IMSIC ID synchronization */
+ enable_percpu_irq(imsic->ipi_virq, 0);
+}
+
+static void imsic_ipi_dying_cpu(void)
+{
+ /*
+ * Disable virtual IPI used for IMSIC ID synchronization so
+ * that we don't receive ID synchronization requests.
+ */
+ disable_percpu_irq(imsic->ipi_virq);
+}
+
+static int __init imsic_ipi_domain_init(void)
+{
+ int virq;
+
+ /* Allocate interrupt identity for IPIs */
+ virq = imsic_ids_alloc(get_count_order(1));
+ if (virq < 0)
+ return virq;
+ imsic->ipi_id = virq;
+
+ /* Create IMSIC IPI multiplexing */
+ virq = ipi_mux_create(IMSIC_NR_IPI, imsic_ipi_send);
+ if (virq <= 0) {
+ imsic_ids_free(imsic->ipi_id, get_count_order(1));
+ return (virq < 0) ? virq : -ENOMEM;
+ }
+ imsic->ipi_virq = virq;
+
+ /* First vIRQ is used for IMSIC ID synchronization */
+ virq = request_percpu_irq(imsic->ipi_virq, imsic_ids_sync_handler,
+ "riscv-imsic-lsync", imsic->global.local);
+ if (virq) {
+ imsic_ids_free(imsic->ipi_id, get_count_order(1));
+ return virq;
+ }
+ irq_set_status_flags(imsic->ipi_virq, IRQ_HIDDEN);
+ imsic->ipi_lsync_desc = irq_to_desc(imsic->ipi_virq);
+
+ /* Set vIRQ range */
+ riscv_ipi_set_virq_range(imsic->ipi_virq + 1, IMSIC_NR_IPI - 1, true);
+
+ /* Announce that IMSIC is providing IPIs */
+ pr_info("%pfwP: providing IPIs using interrupt %d\n",
+ imsic->fwnode, imsic->ipi_id);
+
+ return 0;
+}
+#else
+static void imsic_ipi_starting_cpu(void)
+{
+}
+
+static void imsic_ipi_dying_cpu(void)
+{
+}
+
+static int __init imsic_ipi_domain_init(void)
+{
+ /* Clear the IPI id because we are not using IPIs */
+ imsic->ipi_id = 0;
+ return 0;
+}
+#endif
+
+/*
+ * To handle an interrupt, we read the TOPEI CSR and write zero in one
+ * instruction. If the TOPEI CSR is non-zero, we translate TOPEI.ID to
+ * a Linux interrupt number and let the Linux IRQ subsystem handle it.
+ */
+static void imsic_handle_irq(struct irq_desc *desc)
+{
+ struct irq_chip *chip = irq_desc_get_chip(desc);
+ irq_hw_number_t hwirq;
+ int err;
+
+ chained_irq_enter(chip, desc);
+
+ while ((hwirq = csr_swap(CSR_TOPEI, 0))) {
+ hwirq = hwirq >> TOPEI_ID_SHIFT;
+
+ if (hwirq == imsic->ipi_id) {
+#ifdef CONFIG_SMP
+ ipi_mux_process();
+#endif
+ continue;
+ }
+
+ if (unlikely(!imsic->base_domain))
+ continue;
+
+ err = generic_handle_domain_irq(imsic->base_domain, hwirq);
+ if (unlikely(err))
+ pr_warn_ratelimited(
+ "hwirq %lu mapping not found\n", hwirq);
+ }
+
+ chained_irq_exit(chip, desc);
+}
+
+static int imsic_starting_cpu(unsigned int cpu)
+{
+ /* Enable per-CPU parent interrupt */
+ enable_percpu_irq(imsic_parent_irq,
+ irq_get_trigger_type(imsic_parent_irq));
+
+ /* Setup IPIs */
+ imsic_ipi_starting_cpu();
+
+ /*
+ * Interrupt identities might have been enabled/disabled while
+ * this CPU was not running, so sync up the local enable/disable state.
+ */
+ imsic_ids_local_sync();
+
+ /* Enable local interrupt delivery */
+ imsic_ids_local_delivery(true);
+
+ return 0;
+}
+
+static int imsic_dying_cpu(unsigned int cpu)
+{
+ /* Cleanup IPIs */
+ imsic_ipi_dying_cpu();
+
+ return 0;
+}
+
+static int __init imsic_early_probe(struct fwnode_handle *fwnode)
+{
+ int rc;
+ struct irq_domain *domain;
+
+ /* Find parent domain and register chained handler */
+ domain = irq_find_matching_fwnode(riscv_get_intc_hwnode(),
+ DOMAIN_BUS_ANY);
+ if (!domain) {
+ pr_err("%pfwP: Failed to find INTC domain\n", fwnode);
+ return -ENOENT;
+ }
+ imsic_parent_irq = irq_create_mapping(domain, RV_IRQ_EXT);
+ if (!imsic_parent_irq) {
+ pr_err("%pfwP: Failed to create INTC mapping\n", fwnode);
+ return -ENOENT;
+ }
+ irq_set_chained_handler(imsic_parent_irq, imsic_handle_irq);
+
+ /* Initialize IPI domain */
+ rc = imsic_ipi_domain_init();
+ if (rc) {
+ pr_err("%pfwP: Failed to initialize IPI domain\n", fwnode);
+ return rc;
+ }
+
+ /*
+ * Setup cpuhp state (must be done after setting imsic_parent_irq)
+ *
+ * Don't disable the per-CPU IMSIC file when a CPU goes offline,
+ * because this affects IPIs, and the masking/unmasking of
+ * virtual IPIs is done via the generic IPI mux.
+ */
+ cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+ "irqchip/riscv/imsic:starting",
+ imsic_starting_cpu, imsic_dying_cpu);
+
+ return 0;
+}
+
+static int __init imsic_early_dt_init(struct device_node *node,
+ struct device_node *parent)
+{
+ int rc;
+ struct fwnode_handle *fwnode = &node->fwnode;
+
+ /* Setup IMSIC state */
+ rc = imsic_setup_state(fwnode);
+ if (rc) {
+ pr_err("%pfwP: failed to setup state (error %d)\n",
+ fwnode, rc);
+ return rc;
+ }
+
+ /* Do early setup of IPIs */
+ rc = imsic_early_probe(fwnode);
+ if (rc)
+ return rc;
+
+ /* Ensure that OF platform device gets probed */
+ of_node_clear_flag(node, OF_POPULATED);
+ return 0;
+}
+IRQCHIP_DECLARE(riscv_imsic, "riscv,imsics", imsic_early_dt_init);
diff --git a/drivers/irqchip/irq-riscv-imsic-state.c b/drivers/irqchip/irq-riscv-imsic-state.c
new file mode 100644
index 000000000000..aedd0bf34d2d
--- /dev/null
+++ b/drivers/irqchip/irq-riscv-imsic-state.c
@@ -0,0 +1,570 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+
+#define pr_fmt(fmt) "riscv-imsic: " fmt
+#include <linux/bitmap.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/of_address.h>
+#include <linux/of_irq.h>
+#include <linux/spinlock.h>
+#include <linux/smp.h>
+#include <asm/hwcap.h>
+
+#include "irq-riscv-imsic-state.h"
+
+#define IMSIC_DISABLE_EIDELIVERY 0
+#define IMSIC_ENABLE_EIDELIVERY 1
+#define IMSIC_DISABLE_EITHRESHOLD 1
+#define IMSIC_ENABLE_EITHRESHOLD 0
+
+#define imsic_csr_write(__c, __v) \
+do { \
+ csr_write(CSR_ISELECT, __c); \
+ csr_write(CSR_IREG, __v); \
+} while (0)
+
+#define imsic_csr_read(__c) \
+({ \
+ unsigned long __v; \
+ csr_write(CSR_ISELECT, __c); \
+ __v = csr_read(CSR_IREG); \
+ __v; \
+})
+
+#define imsic_csr_set(__c, __v) \
+do { \
+ csr_write(CSR_ISELECT, __c); \
+ csr_set(CSR_IREG, __v); \
+} while (0)
+
+#define imsic_csr_clear(__c, __v) \
+do { \
+ csr_write(CSR_ISELECT, __c); \
+ csr_clear(CSR_IREG, __v); \
+} while (0)
+
+struct imsic_priv *imsic;
+
+const struct imsic_global_config *imsic_get_global_config(void)
+{
+ return (imsic) ? &imsic->global : NULL;
+}
+EXPORT_SYMBOL_GPL(imsic_get_global_config);
+
+void __imsic_eix_update(unsigned long base_id,
+ unsigned long num_id, bool pend, bool val)
+{
+ unsigned long i, isel, ireg;
+ unsigned long id = base_id, last_id = base_id + num_id;
+
+ while (id < last_id) {
+ isel = id / BITS_PER_LONG;
+ isel *= BITS_PER_LONG / IMSIC_EIPx_BITS;
+ isel += (pend) ? IMSIC_EIP0 : IMSIC_EIE0;
+
+ ireg = 0;
+ for (i = id & (__riscv_xlen - 1);
+ (id < last_id) && (i < __riscv_xlen); i++) {
+ ireg |= BIT(i);
+ id++;
+ }
+
+ /*
+ * The IMSIC EIEx and EIPx registers are indirectly
+ * accessed using the ISELECT and IREG CSRs, so we
+ * need to access these CSRs without getting preempted.
+ *
+ * All existing users of this function call this
+ * function with local IRQs disabled, so we don't
+ * need to do anything special here.
+ */
+ if (val)
+ imsic_csr_set(isel, ireg);
+ else
+ imsic_csr_clear(isel, ireg);
+ }
+}
+
+void imsic_id_set_target(unsigned int id, unsigned int target_cpu)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&imsic->ids_lock, flags);
+ imsic->ids_target_cpu[id] = target_cpu;
+ raw_spin_unlock_irqrestore(&imsic->ids_lock, flags);
+}
+
+unsigned int imsic_id_get_target(unsigned int id)
+{
+ unsigned int ret;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&imsic->ids_lock, flags);
+ ret = imsic->ids_target_cpu[id];
+ raw_spin_unlock_irqrestore(&imsic->ids_lock, flags);
+
+ return ret;
+}
+
+void imsic_ids_local_sync(void)
+{
+ int i;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&imsic->ids_lock, flags);
+ for (i = 1; i <= imsic->global.nr_ids; i++) {
+ if (imsic->ipi_id == i)
+ continue;
+
+ if (test_bit(i, imsic->ids_enabled_bimap))
+ __imsic_id_enable(i);
+ else
+ __imsic_id_disable(i);
+ }
+ raw_spin_unlock_irqrestore(&imsic->ids_lock, flags);
+}
+
+void imsic_ids_local_delivery(bool enable)
+{
+ if (enable) {
+ imsic_csr_write(IMSIC_EITHRESHOLD, IMSIC_ENABLE_EITHRESHOLD);
+ imsic_csr_write(IMSIC_EIDELIVERY, IMSIC_ENABLE_EIDELIVERY);
+ } else {
+ imsic_csr_write(IMSIC_EIDELIVERY, IMSIC_DISABLE_EIDELIVERY);
+ imsic_csr_write(IMSIC_EITHRESHOLD, IMSIC_DISABLE_EITHRESHOLD);
+ }
+}
+
+int imsic_ids_alloc(unsigned int order)
+{
+ int ret;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&imsic->ids_lock, flags);
+ ret = bitmap_find_free_region(imsic->ids_used_bimap,
+ imsic->global.nr_ids + 1, order);
+ raw_spin_unlock_irqrestore(&imsic->ids_lock, flags);
+
+ return ret;
+}
+
+void imsic_ids_free(unsigned int base_id, unsigned int order)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&imsic->ids_lock, flags);
+ bitmap_release_region(imsic->ids_used_bimap, base_id, order);
+ raw_spin_unlock_irqrestore(&imsic->ids_lock, flags);
+}
+
+static int __init imsic_ids_init(void)
+{
+ int i;
+ struct imsic_global_config *global = &imsic->global;
+
+ raw_spin_lock_init(&imsic->ids_lock);
+
+ /* Allocate used bitmap */
+ imsic->ids_used_bimap = bitmap_zalloc(global->nr_ids + 1, GFP_KERNEL);
+ if (!imsic->ids_used_bimap)
+ return -ENOMEM;
+
+ /* Allocate enabled bitmap */
+ imsic->ids_enabled_bimap = bitmap_zalloc(global->nr_ids + 1,
+ GFP_KERNEL);
+ if (!imsic->ids_enabled_bimap) {
+ kfree(imsic->ids_used_bimap);
+ return -ENOMEM;
+ }
+
+ /* Allocate target CPU array */
+ imsic->ids_target_cpu = kcalloc(global->nr_ids + 1,
+ sizeof(unsigned int), GFP_KERNEL);
+ if (!imsic->ids_target_cpu) {
+ bitmap_free(imsic->ids_enabled_bimap);
+ bitmap_free(imsic->ids_used_bimap);
+ return -ENOMEM;
+ }
+ for (i = 0; i <= global->nr_ids; i++)
+ imsic->ids_target_cpu[i] = UINT_MAX;
+
+ /* Reserve ID#0 because it is special and never implemented */
+ bitmap_set(imsic->ids_used_bimap, 0, 1);
+
+ return 0;
+}
+
+static void __init imsic_ids_cleanup(void)
+{
+ kfree(imsic->ids_target_cpu);
+ bitmap_free(imsic->ids_enabled_bimap);
+ bitmap_free(imsic->ids_used_bimap);
+}
+
+static int __init imsic_get_parent_hartid(struct fwnode_handle *fwnode,
+ u32 index, unsigned long *hartid)
+{
+ int rc;
+ struct of_phandle_args parent;
+
+ /*
+ * Currently, only an OF fwnode is supported; extend this
+ * function for ACPI support.
+ */
+ if (!is_of_node(fwnode))
+ return -EINVAL;
+
+ rc = of_irq_parse_one(to_of_node(fwnode), index, &parent);
+ if (rc)
+ return rc;
+
+ /*
+ * Skip interrupts other than external interrupts for the
+ * current privilege level.
+ */
+ if (parent.args[0] != RV_IRQ_EXT)
+ return -EINVAL;
+
+ return riscv_of_parent_hartid(parent.np, hartid);
+}
+
+static int __init imsic_get_mmio_resource(struct fwnode_handle *fwnode,
+ u32 index, struct resource *res)
+{
+ /*
+ * Currently, only an OF fwnode is supported; extend this
+ * function for ACPI support.
+ */
+ if (!is_of_node(fwnode))
+ return -EINVAL;
+
+ return of_address_to_resource(to_of_node(fwnode), index, res);
+}
+
+static int __init imsic_parse_fwnode(struct fwnode_handle *fwnode,
+ struct imsic_global_config *global,
+ u32 *nr_parent_irqs,
+ u32 *nr_mmios)
+{
+ unsigned long hartid;
+ struct resource res;
+ int rc;
+ u32 i;
+
+ /*
+ * Currently, only an OF fwnode is supported; extend this
+ * function for ACPI support.
+ */
+ if (!is_of_node(fwnode))
+ return -EINVAL;
+
+ *nr_parent_irqs = 0;
+ *nr_mmios = 0;
+
+ /* Find number of parent interrupts */
+ *nr_parent_irqs = 0;
+ while (!imsic_get_parent_hartid(fwnode, *nr_parent_irqs, &hartid))
+ (*nr_parent_irqs)++;
+ if (!(*nr_parent_irqs)) {
+ pr_err("%pfwP: no parent irqs available\n", fwnode);
+ return -EINVAL;
+ }
+
+ /* Find number of guest index bits in MSI address */
+ rc = of_property_read_u32(to_of_node(fwnode),
+ "riscv,guest-index-bits",
+ &global->guest_index_bits);
+ if (rc)
+ global->guest_index_bits = 0;
+
+ /* Find number of HART index bits */
+ rc = of_property_read_u32(to_of_node(fwnode),
+ "riscv,hart-index-bits",
+ &global->hart_index_bits);
+ if (rc) {
+ /* Assume default value */
+ global->hart_index_bits = __fls(*nr_parent_irqs);
+ if (BIT(global->hart_index_bits) < *nr_parent_irqs)
+ global->hart_index_bits++;
+ }
+
+ /* Find number of group index bits */
+ rc = of_property_read_u32(to_of_node(fwnode),
+ "riscv,group-index-bits",
+ &global->group_index_bits);
+ if (rc)
+ global->group_index_bits = 0;
+
+ /*
+ * Find first bit position of group index.
+ * If not specified, assume the default APLIC-IMSIC configuration.
+ */
+ rc = of_property_read_u32(to_of_node(fwnode),
+ "riscv,group-index-shift",
+ &global->group_index_shift);
+ if (rc)
+ global->group_index_shift = IMSIC_MMIO_PAGE_SHIFT * 2;
+
+ /* Find number of interrupt identities */
+ rc = of_property_read_u32(to_of_node(fwnode),
+ "riscv,num-ids",
+ &global->nr_ids);
+ if (rc) {
+ pr_err("%pfwP: number of interrupt identities not found\n",
+ fwnode);
+ return rc;
+ }
+
+ /* Find number of guest interrupt identities */
+ rc = of_property_read_u32(to_of_node(fwnode),
+ "riscv,num-guest-ids",
+ &global->nr_guest_ids);
+ if (rc)
+ global->nr_guest_ids = global->nr_ids;
+
+ /* Sanity check guest index bits */
+ i = BITS_PER_LONG - IMSIC_MMIO_PAGE_SHIFT;
+ if (i < global->guest_index_bits) {
+ pr_err("%pfwP: guest index bits too big\n", fwnode);
+ return -EINVAL;
+ }
+
+ /* Sanity check HART index bits */
+ i = BITS_PER_LONG - IMSIC_MMIO_PAGE_SHIFT - global->guest_index_bits;
+ if (i < global->hart_index_bits) {
+ pr_err("%pfwP: HART index bits too big\n", fwnode);
+ return -EINVAL;
+ }
+
+ /* Sanity check group index bits */
+ i = BITS_PER_LONG - IMSIC_MMIO_PAGE_SHIFT -
+ global->guest_index_bits - global->hart_index_bits;
+ if (i < global->group_index_bits) {
+ pr_err("%pfwP: group index bits too big\n", fwnode);
+ return -EINVAL;
+ }
+
+ /* Sanity check group index shift */
+ i = global->group_index_bits + global->group_index_shift - 1;
+ if (i >= BITS_PER_LONG) {
+ pr_err("%pfwP: group index shift too big\n", fwnode);
+ return -EINVAL;
+ }
+
+ /* Sanity check number of interrupt identities */
+ if ((global->nr_ids < IMSIC_MIN_ID) ||
+ (global->nr_ids >= IMSIC_MAX_ID) ||
+ ((global->nr_ids & IMSIC_MIN_ID) != IMSIC_MIN_ID)) {
+ pr_err("%pfwP: invalid number of interrupt identities\n",
+ fwnode);
+ return -EINVAL;
+ }
+
+ /* Sanity check number of guest interrupt identities */
+ if ((global->nr_guest_ids < IMSIC_MIN_ID) ||
+ (global->nr_guest_ids >= IMSIC_MAX_ID) ||
+ ((global->nr_guest_ids & IMSIC_MIN_ID) != IMSIC_MIN_ID)) {
+ pr_err("%pfwP: invalid number of guest interrupt identities\n",
+ fwnode);
+ return -EINVAL;
+ }
+
+ /* Compute base address */
+ rc = imsic_get_mmio_resource(fwnode, 0, &res);
+ if (rc) {
+ pr_err("%pfwP: first MMIO resource not found\n", fwnode);
+ return -EINVAL;
+ }
+ global->base_addr = res.start;
+ global->base_addr &= ~(BIT(global->guest_index_bits +
+ global->hart_index_bits +
+ IMSIC_MMIO_PAGE_SHIFT) - 1);
+ global->base_addr &= ~((BIT(global->group_index_bits) - 1) <<
+ global->group_index_shift);
+
+ /* Find number of MMIO register sets */
+ while (!imsic_get_mmio_resource(fwnode, *nr_mmios, &res))
+ (*nr_mmios)++;
+
+ return 0;
+}
+
+int __init imsic_setup_state(struct fwnode_handle *fwnode)
+{
+ int rc, cpu;
+ phys_addr_t base_addr;
+ void __iomem **mmios_va = NULL;
+ struct resource *mmios = NULL;
+ struct imsic_local_config *local;
+ struct imsic_global_config *global;
+ unsigned long reloff, hartid;
+ u32 i, j, index, nr_parent_irqs, nr_mmios, nr_handlers = 0;
+
+ /*
+ * Only one IMSIC instance is allowed in a platform for a clean
+ * implementation of SMP IRQ affinity and per-CPU IPIs.
+ *
+ * This means on a multi-socket (or multi-die) platform we
+ * will have multiple MMIO regions for one IMSIC instance.
+ */
+ if (imsic) {
+ pr_err("%pfwP: already initialized hence ignoring\n",
+ fwnode);
+ return -EALREADY;
+ }
+
+ if (!riscv_isa_extension_available(NULL, SxAIA)) {
+ pr_err("%pfwP: AIA support not available\n", fwnode);
+ return -ENODEV;
+ }
+
+ imsic = kzalloc(sizeof(*imsic), GFP_KERNEL);
+ if (!imsic)
+ return -ENOMEM;
+ imsic->fwnode = fwnode;
+ global = &imsic->global;
+
+ global->local = alloc_percpu(typeof(*(global->local)));
+ if (!global->local) {
+ rc = -ENOMEM;
+ goto out_free_priv;
+ }
+
+ /* Parse IMSIC fwnode */
+ rc = imsic_parse_fwnode(fwnode, global, &nr_parent_irqs, &nr_mmios);
+ if (rc)
+ goto out_free_local;
+
+ /* Allocate MMIO resource array */
+ mmios = kcalloc(nr_mmios, sizeof(*mmios), GFP_KERNEL);
+ if (!mmios) {
+ rc = -ENOMEM;
+ goto out_free_local;
+ }
+
+ /* Allocate MMIO virtual address array */
+ mmios_va = kcalloc(nr_mmios, sizeof(*mmios_va), GFP_KERNEL);
+ if (!mmios_va) {
+ rc = -ENOMEM;
+ goto out_iounmap;
+ }
+
+ /* Parse and map MMIO register sets */
+ for (i = 0; i < nr_mmios; i++) {
+ rc = imsic_get_mmio_resource(fwnode, i, &mmios[i]);
+ if (rc) {
+ pr_err("%pfwP: unable to parse MMIO regset %d\n",
+ fwnode, i);
+ goto out_iounmap;
+ }
+
+ base_addr = mmios[i].start;
+ base_addr &= ~(BIT(global->guest_index_bits +
+ global->hart_index_bits +
+ IMSIC_MMIO_PAGE_SHIFT) - 1);
+ base_addr &= ~((BIT(global->group_index_bits) - 1) <<
+ global->group_index_shift);
+ if (base_addr != global->base_addr) {
+ rc = -EINVAL;
+ pr_err("%pfwP: address mismatch for regset %d\n",
+ fwnode, i);
+ goto out_iounmap;
+ }
+
+ mmios_va[i] = ioremap(mmios[i].start, resource_size(&mmios[i]));
+ if (!mmios_va[i]) {
+ rc = -EIO;
+ pr_err("%pfwP: unable to map MMIO regset %d\n",
+ fwnode, i);
+ goto out_iounmap;
+ }
+ }
+
+ /* Initialize interrupt identity management */
+ rc = imsic_ids_init();
+ if (rc) {
+ pr_err("%pfwP: failed to initialize interrupt management\n",
+ fwnode);
+ goto out_iounmap;
+ }
+
+ /* Configure handlers for target CPUs */
+ for (i = 0; i < nr_parent_irqs; i++) {
+ rc = imsic_get_parent_hartid(fwnode, i, &hartid);
+ if (rc) {
+ pr_warn("%pfwP: hart ID for parent irq%d not found\n",
+ fwnode, i);
+ continue;
+ }
+
+ cpu = riscv_hartid_to_cpuid(hartid);
+ if (cpu < 0) {
+ pr_warn("%pfwP: invalid cpuid for parent irq%d\n",
+ fwnode, i);
+ continue;
+ }
+
+ /* Find MMIO location of MSI page */
+ index = nr_mmios;
+ reloff = i * BIT(global->guest_index_bits) *
+ IMSIC_MMIO_PAGE_SZ;
+ for (j = 0; j < nr_mmios; j++) {
+ if (reloff < resource_size(&mmios[j])) {
+ index = j;
+ break;
+ }
+
+ /*
+ * MMIO region size may not be aligned to
+ * BIT(global->guest_index_bits) * IMSIC_MMIO_PAGE_SZ
+ * if holes are present.
+ */
+ reloff -= ALIGN(resource_size(&mmios[j]),
+ BIT(global->guest_index_bits) * IMSIC_MMIO_PAGE_SZ);
+ }
+ if (index >= nr_mmios) {
+ pr_warn("%pfwP: MMIO not found for parent irq%d\n",
+ fwnode, i);
+ continue;
+ }
+
+ local = per_cpu_ptr(global->local, cpu);
+ local->msi_pa = mmios[index].start + reloff;
+ local->msi_va = mmios_va[index] + reloff;
+
+ nr_handlers++;
+ }
+
+ /* If no CPU handlers were found then we can't take interrupts */
+ if (!nr_handlers) {
+ pr_err("%pfwP: No CPU handlers found\n", fwnode);
+ rc = -ENODEV;
+ goto out_ids_cleanup;
+ }
+
+ /* We don't need the MMIO arrays anymore, so free them up */
+ kfree(mmios_va);
+ kfree(mmios);
+
+ return 0;
+
+out_ids_cleanup:
+ imsic_ids_cleanup();
+out_iounmap:
+ for (i = 0; mmios_va && i < nr_mmios; i++) {
+ if (mmios_va[i])
+ iounmap(mmios_va[i]);
+ }
+ kfree(mmios_va);
+ kfree(mmios);
+out_free_local:
+ free_percpu(imsic->global.local);
+out_free_priv:
+ kfree(imsic);
+ imsic = NULL;
+ return rc;
+}
diff --git a/drivers/irqchip/irq-riscv-imsic-state.h b/drivers/irqchip/irq-riscv-imsic-state.h
new file mode 100644
index 000000000000..3170018949a8
--- /dev/null
+++ b/drivers/irqchip/irq-riscv-imsic-state.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+
+#ifndef _IRQ_RISCV_IMSIC_STATE_H
+#define _IRQ_RISCV_IMSIC_STATE_H
+
+#include <linux/irqchip/riscv-imsic.h>
+#include <linux/irqdomain.h>
+#include <linux/fwnode.h>
+
+struct imsic_priv {
+ /* Device details */
+ struct fwnode_handle *fwnode;
+
+ /* Global configuration common for all HARTs */
+ struct imsic_global_config global;
+
+ /* Global state of interrupt identities */
+ raw_spinlock_t ids_lock;
+ unsigned long *ids_used_bimap;
+ unsigned long *ids_enabled_bimap;
+ unsigned int *ids_target_cpu;
+
+ /* IPI interrupt identity and synchronization */
+ u32 ipi_id;
+ int ipi_virq;
+ struct irq_desc *ipi_lsync_desc;
+
+ /* IRQ domains (created by platform driver) */
+ struct irq_domain *base_domain;
+ struct irq_domain *plat_domain;
+};
+
+extern struct imsic_priv *imsic;
+
+void __imsic_eix_update(unsigned long base_id,
+ unsigned long num_id, bool pend, bool val);
+
+#define __imsic_id_enable(__id) \
+ __imsic_eix_update((__id), 1, false, true)
+#define __imsic_id_disable(__id) \
+ __imsic_eix_update((__id), 1, false, false)
+
+void imsic_id_set_target(unsigned int id, unsigned int target_cpu);
+unsigned int imsic_id_get_target(unsigned int id);
+
+void imsic_ids_local_sync(void);
+void imsic_ids_local_delivery(bool enable);
+
+#ifdef CONFIG_SMP
+void imsic_ids_remote_sync(void);
+#else
+static inline void imsic_ids_remote_sync(void)
+{
+}
+#endif
+
+int imsic_ids_alloc(unsigned int order);
+void imsic_ids_free(unsigned int base_id, unsigned int order);
+
+int imsic_setup_state(struct fwnode_handle *fwnode);
+
+#endif
diff --git a/include/linux/irqchip/riscv-imsic.h b/include/linux/irqchip/riscv-imsic.h
new file mode 100644
index 000000000000..1f6fc9a57218
--- /dev/null
+++ b/include/linux/irqchip/riscv-imsic.h
@@ -0,0 +1,86 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ * Copyright (C) 2022 Ventana Micro Systems Inc.
+ */
+#ifndef __LINUX_IRQCHIP_RISCV_IMSIC_H
+#define __LINUX_IRQCHIP_RISCV_IMSIC_H
+
+#include <linux/types.h>
+#include <asm/csr.h>
+
+#define IMSIC_MMIO_PAGE_SHIFT 12
+#define IMSIC_MMIO_PAGE_SZ (1UL << IMSIC_MMIO_PAGE_SHIFT)
+#define IMSIC_MMIO_PAGE_LE 0x00
+#define IMSIC_MMIO_PAGE_BE 0x04
+
+#define IMSIC_MIN_ID 63
+#define IMSIC_MAX_ID 2048
+
+#define IMSIC_EIDELIVERY 0x70
+
+#define IMSIC_EITHRESHOLD 0x72
+
+#define IMSIC_EIP0 0x80
+#define IMSIC_EIP63 0xbf
+#define IMSIC_EIPx_BITS 32
+
+#define IMSIC_EIE0 0xc0
+#define IMSIC_EIE63 0xff
+#define IMSIC_EIEx_BITS 32
+
+#define IMSIC_FIRST IMSIC_EIDELIVERY
+#define IMSIC_LAST IMSIC_EIE63
+
+#define IMSIC_MMIO_SETIPNUM_LE 0x00
+#define IMSIC_MMIO_SETIPNUM_BE 0x04
+
+struct imsic_local_config {
+ phys_addr_t msi_pa;
+ void __iomem *msi_va;
+};
+
+struct imsic_global_config {
+ /*
+ * MSI Target Address Scheme
+ *
+ * XLEN-1                                               12    0
+ * |                                                    |     |
+ * -------------------------------------------------------------
+ * |xxxxxx|Group Index|xxxxxxxxxxx|HART Index|Guest Index|  0  |
+ * -------------------------------------------------------------
+ */
+
+ /* Bits representing Guest index, HART index, and Group index */
+ u32 guest_index_bits;
+ u32 hart_index_bits;
+ u32 group_index_bits;
+ u32 group_index_shift;
+
+ /* Global base address matching all target MSI addresses */
+ phys_addr_t base_addr;
+
+ /* Number of interrupt identities */
+ u32 nr_ids;
+
+ /* Number of guest interrupt identities */
+ u32 nr_guest_ids;
+
+ /* Per-CPU IMSIC addresses */
+ struct imsic_local_config __percpu *local;
+};
+
+#ifdef CONFIG_RISCV_IMSIC
+
+extern const struct imsic_global_config *imsic_get_global_config(void);
+
+#else
+
+static inline const struct imsic_global_config *imsic_get_global_config(void)
+{
+ return NULL;
+}
+
+#endif
+
+#endif
--
2.34.1

2023-10-12 16:36:30

by Conor Dooley

[permalink] [raw]
Subject: Re: [PATCH v10 07/15] dt-bindings: interrupt-controller: Add RISC-V incoming MSI controller

Hey,

On Tue, Oct 03, 2023 at 10:13:55AM +0530, Anup Patel wrote:
> We add DT bindings document for the RISC-V incoming MSI controller
> (IMSIC) defined by the RISC-V advanced interrupt architecture (AIA)
> specification.
>
> Signed-off-by: Anup Patel <[email protected]>
> Reviewed-by: Conor Dooley <[email protected]>

Just FYI, since they'll reply to this themselves, but some of the
Microchip folks have run into problems with sparse hart indexes while
trying to use the imsic binding to describe some configurations they
have. I think there were also some problems with how to describe to a
linux guest which file to use, when the first hart available to the
guest does not use the first file. They'll do a better job of describing
their problems than I will, so I shall leave it to them!

Cheers,
Conor.

> Acked-by: Krzysztof Kozlowski <[email protected]>
> ---
> .../interrupt-controller/riscv,imsics.yaml | 172 ++++++++++++++++++
> 1 file changed, 172 insertions(+)
> create mode 100644 Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml
>
> diff --git a/Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml b/Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml
> new file mode 100644
> index 000000000000..84976f17a4a1
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml
> @@ -0,0 +1,172 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: RISC-V Incoming MSI Controller (IMSIC)
> +
> +maintainers:
> + - Anup Patel <[email protected]>
> +
> +description: |
> + The RISC-V advanced interrupt architecture (AIA) defines a per-CPU incoming
> + MSI controller (IMSIC) for handling MSIs in a RISC-V platform. The RISC-V
> + AIA specification can be found at https://github.com/riscv/riscv-aia.
> +
> + The IMSIC is a per-CPU (or per-HART) device with a separate interrupt file
> + for each privilege level (machine or supervisor). The configuration of an
> + IMSIC interrupt file is done using AIA CSRs, and it also has a 4KB MMIO
> + space to receive MSIs from devices. Each IMSIC interrupt file supports a
> + fixed number of interrupt identities (to distinguish MSIs from devices)
> + which is the same for a given privilege level across CPUs (or HARTs).
> +
> + The device tree of a RISC-V platform will have one IMSIC device tree node
> + for each privilege level (machine or supervisor), which collectively describes
> + the IMSIC interrupt files at that privilege level across CPUs (or HARTs).
> +
> + The arrangement of IMSIC interrupt files in the MMIO space of a RISC-V
> + platform follows a particular scheme defined by the RISC-V AIA
> + specification. An IMSIC group is a set of IMSIC interrupt files co-located
> + in MMIO space, and we can have multiple IMSIC groups (i.e. clusters,
> + sockets, chiplets, etc.) in a RISC-V platform. The MSI target address of
> + an IMSIC interrupt file at a given privilege level (machine or supervisor)
> + encodes the group index, HART index, and guest index (shown below).
> +
> + XLEN-1 > (HART Index MSB) 12 0
> + | | | |
> + -------------------------------------------------------------
> + |xxxxxx|Group Index|xxxxxxxxxxx|HART Index|Guest Index| 0 |
> + -------------------------------------------------------------
> +
> +allOf:
> + - $ref: /schemas/interrupt-controller.yaml#
> + - $ref: /schemas/interrupt-controller/msi-controller.yaml#
> +
> +properties:
> + compatible:
> + items:
> + - enum:
> + - qemu,imsics
> + - const: riscv,imsics
> +
> + reg:
> + minItems: 1
> + maxItems: 16384
> + description:
> + Base address of each IMSIC group.
> +
> + interrupt-controller: true
> +
> + "#interrupt-cells":
> + const: 0
> +
> + msi-controller: true
> +
> + "#msi-cells":
> + const: 0
> +
> + interrupts-extended:
> + minItems: 1
> + maxItems: 16384
> + description:
> + This property represents the set of CPUs (or HARTs) for which given
> + device tree node describes the IMSIC interrupt files. Each node pointed
> + to should be a riscv,cpu-intc node, which has a CPU node (i.e. RISC-V
> + HART) as parent.
> +
> + riscv,num-ids:
> + $ref: /schemas/types.yaml#/definitions/uint32
> + minimum: 63
> + maximum: 2047
> + description:
> + Number of interrupt identities supported by IMSIC interrupt file.
> +
> + riscv,num-guest-ids:
> + $ref: /schemas/types.yaml#/definitions/uint32
> + minimum: 63
> + maximum: 2047
> + description:
> + Number of interrupt identities are supported by IMSIC guest interrupt
> + file. When not specified it is assumed to be same as specified by the
> + riscv,num-ids property.
> +
> + riscv,guest-index-bits:
> + minimum: 0
> + maximum: 7
> + default: 0
> + description:
> + Number of guest index bits in the MSI target address.
> +
> + riscv,hart-index-bits:
> + minimum: 0
> + maximum: 15
> + description:
> + Number of HART index bits in the MSI target address. When not
> + specified it is calculated based on the interrupts-extended property.
> +
> + riscv,group-index-bits:
> + minimum: 0
> + maximum: 7
> + default: 0
> + description:
> + Number of group index bits in the MSI target address.
> +
> + riscv,group-index-shift:
> + $ref: /schemas/types.yaml#/definitions/uint32
> + minimum: 0
> + maximum: 55
> + default: 24
> + description:
> + The least significant bit position of the group index bits in the
> + MSI target address.
> +
> +required:
> + - compatible
> + - reg
> + - interrupt-controller
> + - msi-controller
> + - "#msi-cells"
> + - interrupts-extended
> + - riscv,num-ids
> +
> +unevaluatedProperties: false
> +
> +examples:
> + - |
> + // Example 1 (Machine-level IMSIC files with just one group):
> +
> + interrupt-controller@24000000 {
> + compatible = "qemu,imsics", "riscv,imsics";
> + interrupts-extended = <&cpu1_intc 11>,
> + <&cpu2_intc 11>,
> + <&cpu3_intc 11>,
> + <&cpu4_intc 11>;
> + reg = <0x28000000 0x4000>;
> + interrupt-controller;
> + #interrupt-cells = <0>;
> + msi-controller;
> + #msi-cells = <0>;
> + riscv,num-ids = <127>;
> + };
> +
> + - |
> + // Example 2 (Supervisor-level IMSIC files with two groups):
> +
> + interrupt-controller@28000000 {
> + compatible = "qemu,imsics", "riscv,imsics";
> + interrupts-extended = <&cpu1_intc 9>,
> + <&cpu2_intc 9>,
> + <&cpu3_intc 9>,
> + <&cpu4_intc 9>;
> + reg = <0x28000000 0x2000>, /* Group0 IMSICs */
> + <0x29000000 0x2000>; /* Group1 IMSICs */
> + interrupt-controller;
> + #interrupt-cells = <0>;
> + msi-controller;
> + #msi-cells = <0>;
> + riscv,num-ids = <127>;
> + riscv,group-index-bits = <1>;
> + riscv,group-index-shift = <24>;
> + };
> +...
> --
> 2.34.1
>



2023-10-13 06:47:14

by Anup Patel

[permalink] [raw]
Subject: Re: [PATCH v10 07/15] dt-bindings: interrupt-controller: Add RISC-V incoming MSI controller

On Thu, Oct 12, 2023 at 10:05 PM Conor Dooley <[email protected]> wrote:
>
> Hey,
>
> On Tue, Oct 03, 2023 at 10:13:55AM +0530, Anup Patel wrote:
> > We add DT bindings document for the RISC-V incoming MSI controller
> > (IMSIC) defined by the RISC-V advanced interrupt architecture (AIA)
> > specification.
> >
> > Signed-off-by: Anup Patel <[email protected]>
> > Reviewed-by: Conor Dooley <[email protected]>
>
> Just FYI, since they'll reply to this themselves, but some of the
> Microchip folks have run into problems with sparse hart indexes while
> trying to use the imsic binding to describe some configurations they
> have. I think there were also some problems with how to describe to a
> Linux guest which file to use, when the first hart available to the
> guest does not use the first file. They'll do a better job of describing
> their problems than I will, so I shall leave it to them!

Quoting AIA spec:
"For the purpose of locating the memory pages of interrupt files in the
address space, assume each hart (or each hart within a group) has a
unique hart number that may or may not be related to the unique hart
identifiers (“hart IDs”) that the RISC-V Privileged Architecture assigns
to harts."

It is very easy to get confused between the AIA "hart index" and
"hart IDs" defined by the RISC-V Privileged specification but these
are two very different things. The AIA "hart index" over here is the
bits in the address of an IMSIC file.

This DT binding follows the IMSIC file arrangement in the address
space as defined by the section "3.6 Arrangement of the memory
regions of multiple interrupt files" of the AIA specification. This
arrangement is MANDATORY for platforms having both APLIC
and IMSIC because in MSI-mode the APLIC generates the target
MSI address based on the IMSIC file arrangement described in the
section "3.6 Arrangement of the memory regions of multiple
interrupt files". In fact, this also applies to virtual platforms
created by hypervisors (KVM, Xen, ...).
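
To make the address arithmetic concrete, here is a rough sketch of the
section 3.6 layout using the imsic_global_config fields from earlier in
this series (imsic_file_pa() is a hypothetical helper, not driver code).
Each interrupt file is a 4KB page, so bits 11:0 of the target address
are zero:

static phys_addr_t imsic_file_pa(const struct imsic_global_config *global,
				 u32 group, u32 hart, u32 guest)
{
	phys_addr_t pa = global->base_addr;

	/* Group index sits at a platform-chosen shift (default 24). */
	pa |= (phys_addr_t)group << global->group_index_shift;
	/* Hart index starts just above the guest index bits. */
	pa |= (phys_addr_t)hart << (12 + global->guest_index_bits);
	/* Guest index starts at bit 12 (one 4KB page per file). */
	pa |= (phys_addr_t)guest << 12;

	return pa;
}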

Regards,
Anup


>
> Cheers,
> Conor.

2023-10-13 07:41:32

by Conor Dooley

[permalink] [raw]
Subject: Re: [PATCH v10 07/15] dt-bindings: interrupt-controller: Add RISC-V incoming MSI controller

Hey Anup,

On Fri, Oct 13, 2023 at 12:16:45PM +0530, Anup Patel wrote:
> On Thu, Oct 12, 2023 at 10:05 PM Conor Dooley <[email protected]> wrote:
> > On Tue, Oct 03, 2023 at 10:13:55AM +0530, Anup Patel wrote:
> > > We add DT bindings document for the RISC-V incoming MSI controller
> > > (IMSIC) defined by the RISC-V advanced interrupt architecture (AIA)
> > > specification.
> > >
> > > Signed-off-by: Anup Patel <[email protected]>
> > > Reviewed-by: Conor Dooley <[email protected]>
> >
> > Just FYI, since they'll reply to this themselves, but some of the
> > Microchip folks have run into problems with sparse hart indexes while
> > trying to use the imsic binding to describe some configurations they
> > have. I think there were also some problems with how to describe to a
> > Linux guest which file to use, when the first hart available to the
> > guest does not use the first file. They'll do a better job of describing
> > their problems than I will, so I shall leave it to them!
>
> Quoting AIA spec:
> "For the purpose of locating the memory pages of interrupt files in the
> address space, assume each hart (or each hart within a group) has a
> unique hart number that may or may not be related to the unique hart
> identifiers (“hart IDs”) that the RISC-V Privileged Architecture assigns
> to harts."
>
> It is very easy to get confused between the AIA "hart index" and
> "hart IDs" defined by the RISC-V Privileged specification but these
> are two very different things. The AIA "hart index" over here is the
> bits in the address of an IMSIC file.
>
> This DT binding follows the IMSIC file arrangement in the address
> space as defined by the section "3.6 Arrangement of the memory
> regions of multiple interrupt files" of the AIA specification. This
> arrangement is MANDATORY for platforms having both APLIC
> and IMSIC because in MSI-mode the APLIC generates the target
> MSI address based on the IMSIC file arrangement described in the
> section "3.6 Arrangement of the memory regions of multiple
> interrupt files". In fact, this also applies to virtual platforms
> created by hypervisors (KVM, Xen, ...)

Thanks for pointing this out - I'll pass it on and hopefully it is
helpful to them. If not, I expect that you'll hear :)

Cheers,
Conor.



2023-10-19 13:43:39

by Björn Töpel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

Hi Anup,

Anup Patel <[email protected]> writes:

> The RISC-V AIA specification is ratified as per the RISC-V international
> process. The latest ratified AIA specification can be found at:
> https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
>
> At a high-level, the AIA specification adds three things:
> 1) AIA CSRs
> - Improved local interrupt support
> 2) Incoming Message Signaled Interrupt Controller (IMSIC)
> - Per-HART MSI controller
> - Support MSI virtualization
> - Support IPI along with virtualization
> 3) Advanced Platform-Level Interrupt Controller (APLIC)
> - Wired interrupt controller
> - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
> - In Direct-mode, injects external interrupts directly into HARTs

Thanks for working on the AIA support! I had a look at the series, and
have some concerns about interrupt ID abstraction.

A bit of background, for readers not familiar with the AIA details.

IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
each MSI is dedicated to a certain hart. The series takes the approach
to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
slice only *one* msi-irq is actually used.

This scheme makes affinity changes more robust, because the interrupt
sources on "other" harts are pre-allocated. On the other hand it
requires to propagate irq masking to other harts via IPIs (this is
mostly done up setup/tear down). It's also wasteful, because msi-irqs
are hogged, and cannot be used.

Contemporary storage/networking drivers usually use queues per core
(or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
If we instead used a scheme where "msi-irq == lnx-irq", instead of
"lnx-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
a lot of MSIs left for other users. 1-1 vs 1-N. E.g., if a storage device
would like to use 5 queues (5 cores) on a 128 core system, the current
scheme would consume 5 * 128 MSIs, instead of just 5.
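
A back-of-the-envelope sketch of that cost (helper names made up for
illustration):

/* 1-N scheme: each lnx-irq reserves the same msi-irq slot on every hart. */
static unsigned int msis_used_1n(unsigned int nr_queues, unsigned int nr_harts)
{
	return nr_queues * nr_harts;	/* 5 queues x 128 harts = 640 slots */
}

/* 1-1 scheme: one msi-irq on the hart that actually services each queue. */
static unsigned int msis_used_11(unsigned int nr_queues)
{
	return nr_queues;		/* 5 queues = 5 slots */
}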

On the plus side:
* Changing interrupt affinity will never fail, because the interrupts
on each hart are pre-allocated.

On the negative side:
* Wasteful interrupt usage, and a system can potentially "run out" of
interrupts. Especially for many-core systems.
* Interrupt masking needs to propagate to harts via IPIs (there's no
broadcast CSR in IMSIC), and a more complex locking scheme is needed.

Summary:
The current series caps the number of global interrupts to maximum
2047 MSIs for all cores (whole system). A better scheme, IMO, would be
to expose 2047 * #harts unique MSIs.

I think this could simplify/remove(?) the locking as well.

I realize that the series in v10, and coming with a change like this
now might be a bit of a pain...

Finally, another question related to APLIC/IMSIC. AFAIU the memory map
of the IMSIC regions is constrained by the APLIC, which requires a
certain layout for MSI forwarding (group/hart/guest bits). Say that a
system doesn't have an APLIC, couldn't the layout requirement be
simplified?


Again, thanks for the hard work!
Björn

2023-10-19 16:11:49

by Anup Patel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <[email protected]> wrote:
>
> Hi Anup,
>
> Anup Patel <[email protected]> writes:
>
> > [...]
>
> Thanks for working on the AIA support! I had a look at the series, and
> have some concerns about interrupt ID abstraction.
>
> A bit of background, for readers not familiar with the AIA details.
>
> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
> each MSI is dedicated to a certain hart. The series takes the approach
> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
> slice only *one* msi-irq is actually used.
>
> This scheme makes affinity changes more robust, because the interrupt
> sources on "other" harts are pre-allocated. On the other hand, it
> requires propagating irq masking to other harts via IPIs (this is
> mostly done at setup/tear down). It's also wasteful, because msi-irqs
> are hogged and cannot be used for anything else.
>
> Contemporary storage/networking drivers usually use queues per core
> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
> If we instead used a scheme where "msi-irq == lnx-irq", instead of
> "lnx-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
> a lot of MSIs left for other users. 1-1 vs 1-N. E.g., if a storage device
> would like to use 5 queues (5 cores) on a 128 core system, the current
> scheme would consume 5 * 128 MSIs, instead of just 5.
>
> On the plus side:
> * Changing interrupt affinity will never fail, because the interrupts
> on each hart are pre-allocated.
>
> On the negative side:
> * Wasteful interrupt usage, and a system can potentially "run out" of
> interrupts. Especially for many-core systems.
> * Interrupt masking needs to propagate to harts via IPIs (there's no
> broadcast CSR in IMSIC), and a more complex locking scheme is needed.
>
> Summary:
> The current series caps the number of global interrupts to maximum
> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
> to expose 2047 * #harts unique MSIs.
>
> I think this could simplify/remove(?) the locking as well.

Exposing 2047 * #harts unique MSIs has multiple issues:
1) The irq_set_affinity() does not work for MSIs because each
IRQ is now tied to a particular HART. This means we can't
balance the IRQ processing load among HARTs.
2) All wired IRQs for APLIC MSI-mode will also target a
fixed HART hence irq_set_affinity() won't work for wired
IRQs as well.
3) Contemporary storage/networking drivers which use per-core
queues use irq_set_affinity() on queue IRQs to balance
across cores but this will fail.
4) HART hotplug breaks because kernel irq-subsystem can't
migrate the IRQs (both MSIs and Wired) targeting HART X
to another HART Y when the HART X goes down.

The idea of treating per-HART MSIs as separate IRQs has
been discussed in the past. The current approach works nicely
with all kernel use-cases at the cost of additional work on the
driver side.

Also, the current approach is very similar to the ARM GICv3
driver where ITS LPIs across CPUs are treated as a single IRQ.
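
As a rough sketch (hypothetical helpers, not the code in this series),
an affinity change in the 1-N scheme boils down to retargeting the
device's MSI address while the Linux IRQ number and the ID stay fixed,
which is why it cannot run out of IDs and cannot fail:

#include <linux/irq.h>
#include <linux/msi.h>

static int sketch_1n_set_affinity(struct irq_data *d,
				  const struct cpumask *mask, bool force)
{
	unsigned int cpu = cpumask_first(mask);
	struct msi_msg msg;

	/* Same interrupt ID; only the target interrupt file changes. */
	sketch_compose_msi_msg(d, cpu, &msg);	/* hypothetical */
	irq_chip_write_msi_msg(d, &msg);	/* reprogram the device */
	irq_data_update_effective_affinity(d, cpumask_of(cpu));

	return IRQ_SET_MASK_OK_DONE;
}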

>
> I realize that the series in v10, and coming with a change like this
> now might be a bit of a pain...
>
> Finally, another question related to APLIC/IMSIC. AFAIU the memory map
> of the IMSIC regions is constrained by the APLIC, which requires a
> certain layout for MSI forwarding (group/hart/guest bits). Say that a
> system doesn't have an APLIC, couldn't the layout requirement be
> simplified?

Yes, this is already taken care of in the current IMSIC driver based
on feedback from Atish. We can certainly improve flexibility on the
IMSIC driver side if some case is missed out.

The APLIC driver is certainly very strict about the arrangement of
IMSIC files, so we do additional sanity checks on the APLIC driver
side at the time of probing.

>
>
> Again, thanks for the hard work!
> Björn

Regards,
Anup

2023-10-20 08:47:51

by Björn Töpel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

Thanks for the quick reply!

Anup Patel <[email protected]> writes:

> On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <[email protected]> wrote:
>>
>> [...]
>
> Exposing 2047 * #harts unique MSIs has multiple issues:
> 1) The irq_set_affinity() does not work for MSIs because each
> > IRQ is now tied to a particular HART. This means we can't
> balance the IRQ processing load among HARTs.

Yes, you can balance. In your code, each *active* MSI is still
bound/active to a specific hart together with the affinity mask. In a
1-1 model you would still need to track the affinity mask, but the
irq_set_affinity() would be different. It would try to allocate a new
MSI from the target CPU, and then switch to having that MSI active.

That's what x86 does AFAIU, which is also constrained by the # of
available MSIs.

The downside, as I pointed out, is that the set affinity action can
fail for a certain target CPU.
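
Sketched out (hypothetical helpers again), the 1-1 style set-affinity
described above would look roughly like this, with the -ENOSPC path
being the price of 1-1 ID usage:

static int sketch_11_set_affinity(struct irq_data *d,
				  const struct cpumask *mask, bool force)
{
	unsigned int cpu = cpumask_first(mask);
	struct msi_msg msg;
	int id;

	/* Try to allocate a free ID on the target hart; this can fail. */
	id = sketch_alloc_id_on_cpu(cpu);		/* hypothetical */
	if (id < 0)
		return -ENOSPC;

	sketch_enable_id_on_cpu(cpu, id);		/* needs an IPI */
	sketch_compose_msi_msg_for_id(d, cpu, id, &msg);/* hypothetical */
	irq_chip_write_msi_msg(d, &msg);
	sketch_free_old_id(d);				/* release old ID */
	irq_data_update_effective_affinity(d, cpumask_of(cpu));

	return IRQ_SET_MASK_OK_DONE;
}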

> 2) All wired IRQs for APLIC MSI-mode will also target a
> fixed HART hence irq_set_affinity() won't work for wired
> IRQs as well.

I'm not following here. Why would APLIC put a constraint here? I had a
look at the specs, and I didn't see anything supporting the current
scheme explicitly.

> 3) Contemporary storage/networking drivers which use per-core
> queues use irq_set_affinity() on queue IRQs to balance
> across cores but this will fail.

Or via managed interrupts. But this is a non-issue, as pointed
out in my reply to 1.

> 4) HART hotplug breaks because kernel irq-subsystem can't
> migrate the IRQs (both MSIs and Wired) targeting HART X
> to another HART Y when the HART X goes down.

Yes, we might end up in scenarios where we can't move to a certain
target cpu, but I wouldn't expect that to be a common scenario.

> The idea of treating per-HART MSIs as separate IRQs has
> been discussed in the past.

Aha! I tried to look for it in lore, but didn't find any. Could you
point me to those discussions?

> Also, the current approach is very similar to the ARM GICv3 driver
> > where ITS LPIs across CPUs are treated as a single IRQ.

I'm not familiar with the GIC. Is the GICv3 design similar to IMSIC? I
had the impression that the GIC had a more advanced interrupt routing
mechanism than what IMSIC exposes. I think x86 APIC takes the 1-1
approach (the folks on the To: list definitely know! ;-)).

My concern is interrupts become a scarce resource with this
implementation, but maybe my view is incorrect. I've seen bare-metal
x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
that is considered "a lot of interrupts".

As long as we don't get into scenarios where we're running out of
interrupts, due to the software design.


Björn

2023-10-20 11:01:42

by Anup Patel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <[email protected]> wrote:
>
> Thanks for the quick reply!
>
> Anup Patel <[email protected]> writes:
>
> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <[email protected]> wrote:
> >>
> >> [...]
> >
> > Exposing 2047 * #harts unique MSIs has multiple issues:
> > 1) The irq_set_affinity() does not work for MSIs because each
> > IRQ is now tied to a particular HART. This means we can't
> > balance the IRQ processing load among HARTs.
>
> Yes, you can balance. In your code, each *active* MSI is still
> bound/active to a specific hart together with the affinity mask. In a
> 1-1 model you would still need to track the affinity mask, but the
> irq_set_affinity() would be different. It would try to allocate a new
> MSI from the target CPU, and then switch to having that MSI active.
>
> That's what x86 does AFAIU, which is also constrained by the # of
> available MSIs.
>
> The downside, as I pointed out, is that the set affinity action can
> fail for a certain target CPU.

Yes, irq_set_affinity() can fail for the suggested approach; plus, for
RISC-V AIA, one HART does not have access to another HART's
MSI enable/disable bits, so the approach will also involve an IPI.

>
> > 2) All wired IRQs for APLIC MSI-mode will also target a
> > fixed HART hence irq_set_affinity() won't work for wired
> > IRQs as well.
>
> I'm not following here. Why would APLIC put a constraint here? I had a
> look at the specs, and I didn't see anything supporting the current
> scheme explicitly.

Let's say the number of APLIC wired interrupts is greater than the
number of per-CPU IMSIC IDs. In this case, if all wired interrupts are
moved to a particular CPU, then irq_set_affinity() will fail for some of
the wired interrupts.

>
> > 3) Contemporary storage/networking drivers which use per-core
> > queues use irq_set_affinity() on queue IRQs to balance
> > across cores but this will fail.
>
> Or via managed interrupts. But this is a non-issue, as pointed
> out in my reply to 1.
>
> > 4) HART hotplug breaks because kernel irq-subsystem can't
> > migrate the IRQs (both MSIs and Wired) targeting HART X
> > to another HART Y when the HART X goes down.
>
> Yes, we might end up in scenarios where we can't move to a certain
> target cpu, but I wouldn't expect that to be a common scenario.
>
> > The idea of treating per-HART MSIs as separate IRQs has
> > been discussed in the past.
>
> Aha! I tried to look for it in lore, but didn't find any. Could you
> point me to those discussions?

This was done 2 years back in the AIA TG meeting when we were
doing the PoC for AIA spec.

>
> > Also, the current approach is very similar to the ARM GICv3 driver
> > where ITS LPIs across CPUs are treated as a single IRQ.
>
> I'm not familiar with the GIC. Is the GICv3 design similar to IMSIC? I
> had the impression that the GIC had a more advanced interrupt routing
> mechanism, than what IMSIC exposes. I think x86 APIC takes the 1-1
> approach (the folks on the To: list definitely knows! ;-)).

GIC has a per-CPU redistributor which handles LPIs. The MSIs are
taken by the GIC ITS and forwarded as LPIs to the redistributor of a CPU.

The GIC driver treats the LPI numbering space as global and not per-CPU.
Also, the limit on the maximum number of LPIs is quite high because the
LPI INTID can be 32 bits wide.

>
> My concern is interrupts become a scarce resource with this
> implementation, but maybe my view is incorrect. I've seen bare-metal
> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
> that is considered "a lot of interrupts".
>
> As long as we don't get into scenarios where we're running out of
> interrupts, due to the software design.
>

The current approach is simpler and ensures irq_set_affinity
always works. The limit of max 2047 IDs is sufficient for many
systems (if not all).

When we encounter a system requiring a large number of MSIs,
we can either:
1) Extend the AIA spec to support greater than 2047 IDs
2) Re-think the approach in the IMSIC driver

The choice between #1 and #2 above depends on the
guarantees we want for irq_set_affinity().

Regards,
Anup

2023-10-20 14:40:51

by Björn Töpel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

Anup Patel <[email protected]> writes:

> On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <[email protected]> wrote:
>>
>> Thanks for the quick reply!
>>
>> Anup Patel <[email protected]> writes:
>>
>> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <[email protected]> wrote:
>> >>
>> >> [...]
>> >
>> > Exposing 2047 * #harts unique MSIs has multiple issues:
>> > 1) The irq_set_affinity() does not work for MSIs because each
>> > IRQ is now tied to a particular HART. This means we can't
>> > balance the IRQ processing load among HARTs.
>>
>> Yes, you can balance. In your code, each *active* MSI is still
>> bound/active to a specific hart together with the affinity mask. In a
>> 1-1 model you would still need to track the affinity mask, but the
>> irq_set_affinity() would be different. It would try to allocate a new
>> MSI from the target CPU, and then switch to having that MSI active.
>>
>> That's what x86 does AFAIU, which is also constrained by the # of
>> available MSIs.
>>
>> The downside, as I pointed out, is that the set affinity action can
>> fail for a certain target CPU.
>
> Yes, irq_set_affinity() can fail for the suggested approach; plus, for
> RISC-V AIA, one HART does not have access to another HART's
> MSI enable/disable bits, so the approach will also involve an IPI.

Correct, but the current series does a broadcast to all cores, whereas the
1-1 approach is at most an IPI to a single core.

128+c machines are getting more common, and you have devices that you
bring up/down on a per-core basis. Broadcasting IPIs to all cores when
dealing with a per-core activity is a pretty noisy neighbor.

This could be fixed in the existing 1-n approach by not requiring a sync
of the cores that are not handling the MSI in question: "lazy disable".
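
A sketch of what such a lazy disable could look like (hypothetical
names; not part of this series): mask only flips software state, and a
hart that still receives the MSI disables its own enable bit at
delivery time, so no cross-hart IPI is needed:

static DECLARE_BITMAP(sketch_soft_masked, 2048);	/* hypothetical */

static void sketch_lazy_mask(struct irq_data *d)
{
	/* No IPI broadcast: just record that the ID is logically masked. */
	set_bit(irqd_to_hwirq(d), sketch_soft_masked);
}

static void sketch_handle_id(struct irq_domain *domain, unsigned long id)
{
	if (test_bit(id, sketch_soft_masked)) {
		/*
		 * Late delivery on this hart: each hart can only write
		 * its own enable bits, so disable the ID locally now and
		 * drop the interrupt.
		 */
		sketch_local_disable_id(id);	/* hypothetical */
		return;
	}
	generic_handle_domain_irq(domain, id);
}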

>> > 2) All wired IRQs for APLIC MSI-mode will also target a
>> > fixed HART hence irq_set_affinity() won't work for wired
>> > IRQs as well.
>>
>> I'm not following here. Why would APLIC put a constraint here? I had a
>> look at the specs, and I didn't see anything supporting the current
>> scheme explicitly.
>
> Let's say the number of APLIC wired interrupts is greater than the
> number of per-CPU IMSIC IDs. In this case, if all wired interrupts are
> moved to a particular CPU, then irq_set_affinity() will fail for some of
> the wired interrupts.

Right, it's the case of "full remote CPU" again. Thanks for clearing
that up.

>> > The idea of treating per-HART MSIs as separate IRQs has
>> > been discussed in the past.
>>
>> Aha! I tried to look for it in lore, but didn't find any. Could you
>> point me to those discussions?
>
> This was done 2 years back in the AIA TG meeting when we were
> doing the PoC for AIA spec.

Ah, too bad. Thanks regardless.

>> My concern is interrupts become a scarce resource with this
>> implementation, but maybe my view is incorrect. I've seen bare-metal
>> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
>> that is considered "a lot of interrupts".
>>
>> As long as we don't get into scenarios where we're running out of
>> interrupts, due to the software design.
>>
>
> The current approach is simpler and ensures irq_set_affinity
> always works. The limit of max 2047 IDs is sufficient for many
> systems (if not all).

Let me give you another view. On a 128c system each core has ~16 unique
interrupts at its disposal. E.g., the Intel E800 NIC has more than 2048
network queue pairs for each PF.

> When we encounter a system requiring a large number of MSIs,
> we can either:
> 1) Extend the AIA spec to support greater than 2047 IDs
> 2) Re-think the approach in the IMSIC driver
>
> The choice between #1 and #2 above depends on the
> guarantees we want for irq_set_affinity().

The irq_set_affinity() behavior is better with this series, but I think
the other downsides (number of available interrupt sources, and IPI
broadcast) are worse.


Björn

2023-10-20 15:34:51

by Anup Patel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <[email protected]> wrote:
>
> Anup Patel <[email protected]> writes:
>
> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <[email protected]> wrote:
> >>
> >> Thanks for the quick reply!
> >>
> >> Anup Patel <[email protected]> writes:
> >>
> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <[email protected]> wrote:
> >> >>
> >> >> [...]
> >> >
> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
> >> > 1) The irq_set_affinity() does not work for MSIs because each
> >> > IRQ is now tied to a particular HART. This means we can't
> >> > balance the IRQ processing load among HARTs.
> >>
> >> Yes, you can balance. In your code, each *active* MSI is still
> >> bound/active to a specific hart together with the affinity mask. In a
> >> 1-1 model you would still need to track the affinity mask, but the
> >> irq_set_affinity() would be different. It would try to allocate a new
> >> MSI from the target CPU, and then switch to having that MSI active.
> >>
> >> That's what x86 does AFAIU, which is also constrained by the # of
> >> available MSIs.
> >>
> >> The downside, as I pointed out, is that the set affinity action can
> >> fail for a certain target CPU.
> >
> > Yes, irq_set_affinity() can fail for the suggested approach; plus, for
> > RISC-V AIA, one HART does not have access to another HART's
> > MSI enable/disable bits, so the approach will also involve an IPI.
>
> Correct, but the current series does a broadcast to all cores, whereas the
> 1-1 approach is at most an IPI to a single core.
>
> 128+c machines are getting more common, and you have devices that you
> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
> dealing with a per-core activity is a pretty noisy neighbor.

Broadcast IPI in the current approach is only done upon an MSI mask/unmask
operation. It is not done upon irq_set_affinity().

>
> This could be fixed in the existing 1-n approach by not requiring a sync
> of the cores that are not handling the MSI in question: "lazy disable".

Incorrect. The approach you are suggesting involves an IPI upon every
irq_set_affinity(). This is because a HART can only enable its own
MSI IDs, so when an IRQ is moved from HART A to HART B with
a different ID X on HART B, we will need an IPI in irq_set_affinity()
to enable ID X on HART B.
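
Concretely, a sketch of that IPI (hypothetical helpers):
smp_call_function_single() runs the callback on the target hart,
because only that hart can write its own enable bits:

#include <linux/smp.h>

static void sketch_remote_enable(void *info)
{
	/* Runs on HART B: only it can enable its own MSI IDs. */
	sketch_local_enable_id((unsigned long)info);	/* hypothetical */
}

/* Called from irq_set_affinity(): enable ID X on HART B. */
static void sketch_enable_id_on_cpu(unsigned int cpu, unsigned long id)
{
	smp_call_function_single(cpu, sketch_remote_enable, (void *)id, 1);
}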

>
> >> > 2) All wired IRQs for APLIC MSI-mode will also target a
> >> > fixed HART hence irq_set_affinity() won't work for wired
> >> > IRQs as well.
> >>
> >> I'm not following here. Why would APLIC put a constraint here? I had a
> >> look at the specs, and I didn't see anything supporting the current
> >> scheme explicitly.
> >
> > Let's say the number of APLIC wired interrupts is greater than the
> > number of per-CPU IMSIC IDs. In this case, if all wired interrupts are
> > moved to a particular CPU, then irq_set_affinity() will fail for some of
> > the wired interrupts.
>
> Right, it's the case of "full remote CPU" again. Thanks for clearing
> that up.
>
> >> > The idea of treating per-HART MSIs as separate IRQs has
> >> > been discussed in the past.
> >>
> >> Aha! I tried to look for it in lore, but didn't find any. Could you
> >> point me to those discussions?
> >
> > This was done 2 years back in the AIA TG meeting when we were
> > doing the PoC for AIA spec.
>
> Ah, too bad. Thanks regardless.
>
> >> My concern is interrupts become a scarce resource with this
> >> implementation, but maybe my view is incorrect. I've seen bare-metal
> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
> >> that is considered "a lot of interrupts".
> >>
> >> As long as we don't get into scenarios where we're running out of
> >> interrupts, due to the software design.
> >>
> >
> > The current approach is simpler and ensures irq_set_affinity
> > always works. The limit of max 2047 IDs is sufficient for many
> > systems (if not all).
>
> Let me give you another view. On a 128c system each core has ~16 unique
> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
> network queue pairs for each PF.

Clearly, this example is hypothetical and represents a poorly
designed platform.

Having just 16 IDs per-core is a very poor design choice. In fact, the
Server SoC spec mandates a minimum of 255 IDs.

Regarding NICs which support a large number of queues, the driver
will typically enable only one queue per-core and set the affinity to
separate cores. We have user-space data plane applications based
on DPDK which are capable of using a large number of NIC queues,
but these applications are polling-based and don't use MSIs.

>
> > When we encounter a system requiring a large number of MSIs,
> > we can either:
> > 1) Extend the AIA spec to support greater than 2047 IDs
> > 2) Re-think the approach in the IMSIC driver
> >
> > The choice between #1 and #2 above depends on the
> > guarantees we want for irq_set_affinity().
>
> The irq_set_affinity() behavior is better with this series, but I think
> the other downsides (number of available interrupt sources, and IPI
> broadcast) are worse.

The IPI overhead in the approach you are suggesting will be
even worse than the IPI overhead of the current approach
because we will end up doing an IPI upon every irq_set_affinity()
in the suggested approach, compared to doing an IPI upon every
mask/unmask in the current approach.

The biggest advantage of the current approach is a reliable
irq_set_affinity(), which is a very valuable thing to have.

ARM systems easily support a large number of LPIs per-core.
For example, GIC-700 supports 56000 LPIs per-core.
(Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)

In the RISC-V world, we can easily define a small fast-track
extension based on the S*csrind extension which can allow a
large number of IMSIC IDs per-core.

Instead of addressing problems on a hypothetical system,
I suggest we go ahead with the current approach and deal
with a system having MSI over-subscription when such a
system shows up.

Regards,
Anup

2023-10-20 16:37:28

by Björn Töpel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

Anup Patel <[email protected]> writes:

> On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <[email protected]> wrote:
>>
>> Anup Patel <[email protected]> writes:
>>
>> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <[email protected]> wrote:
>> >>
>> >> Thanks for the quick reply!
>> >>
>> >> Anup Patel <[email protected]> writes:
>> >>
>> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <[email protected]> wrote:
>> >> >>
>> >> >> [...]
>> >> >
>> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
>> >> > 1) The irq_set_affinity() does not work for MSIs because each
>> >> > IRQ is now tied to a particular HART. This means we can't
>> >> > balance the IRQ processing load among HARTs.
>> >>
>> >> Yes, you can balance. In your code, each *active* MSI is still
>> >> bound/active to a specific hart together with the affinity mask. In a
>> >> 1-1 model you would still need to track the affinity mask, but the
>> >> irq_set_affinity() would be different. It would try to allocate a new
>> >> MSI from the target CPU, and then switch to having that MSI active.
>> >>
>> >> That's what x86 does AFAIU, which is also constrained by the # of
>> >> available MSIs.
>> >>
>> >> The downside, as I pointed out, is that the set affinity action can
>> >> fail for a certain target CPU.
>> >
>> > Yes, irq_set_affinity() can fail for the suggested approach; plus, for
>> > RISC-V AIA, one HART does not have access to another HART's
>> > MSI enable/disable bits, so the approach will also involve IPIs.
>>
>> Correct, but the current series does a broadcast to all cores, whereas
>> the 1-1 approach needs at most an IPI to a single core.
>>
>> 128+c machines are getting more common, and you have devices that you
>> bring up/down on a per-core basis. Broadcasting IPIs to all cores when
>> dealing with a per-core activity is a pretty noisy neighbor.
>
> The broadcast IPI in the current approach is only done upon an MSI
> mask/unmask operation. It is not done upon set_affinity() of an interrupt.

I'm aware. We're on the same page here.

>>
>> This could be fixed in the existing 1-n approach, by not requiring a
>> sync of the cores that are not handling the MSI in question. "Lazy disable"
>
> Incorrect. The approach you are suggesting involves an IPI upon every
> irq_set_affinity(). This is because a HART can only enable its own
> MSI IDs, so when an IRQ is moved from HART A to HART B with
> a different ID X on HART B, then we will need an IPI in irq_set_affinity()
> to enable ID X on HART B.

Yes, the 1-1 approach will require an IPI to one target cpu on affinity
changes, and similarly on mask/unmask.

The 1-n approach requires no IPI on affinity changes (nice!), but an IPI
broadcast to all cores on mask/unmask (not so nice).

>> >> My concern is that interrupts become a scarce resource with this
>> >> implementation, but maybe my view is incorrect. I've seen bare-metal
>> >> x86 systems (no VMs) with ~200 cores and ~2000 interrupts, but maybe
>> >> that is considered "a lot of interrupts".
>> >>
>> >> As long as we don't get into scenarios where we're running out of
>> >> interrupts due to the software design.
>> >>
>> >
>> > The current approach is simpler and ensures irq_set_affinity()
>> > always works. The limit of max 2047 IDs is sufficient for many
>> > systems (if not all).
>>
>> Let me give you another view. On a 128c system each core has ~16 unique
>> interrupts at its disposal. E.g. the Intel E800 NIC has more than 2048
>> network queue pairs for each PF.
>
> Clearly, this example is hypothetical and represents a poorly
> designed platform.
>
> Having just 16 IDs per-core is a very poor design choice. In fact, the
> Server SoC spec mandates a minimum of 255 IDs.

You are misreading. A 128c system with 2047 MSIs per-core will only
have 16 *per-core unique* (2047/128) interrupts with the current series.

I'm not saying that each IMSIC has 16 IDs; I'm saying that in a 128c
system with the maximum number of MSIs possible in the spec, you'll end
up with 16 *unique* interrupts per core.

> Regarding NICs which support a large number of queues, the driver
> will typically enable only one queue per-core and set the affinity to
> separate cores. We have user-space data plane applications based
> on DPDK which are capable of using a large number of NIC queues
> but these applications are polling based and don't use MSIs.

That's one sample point, and clearly not the only one. There are *many*
different usage models. Just because you *assign* MSIs doesn't mean they
are firing all the time.

I can show you a couple of networking setups where this is clearly not
enough. Each core has a large number of QoS queues, and each queue would
very much like to have a dedicated MSI.

>> > When we encounter a system requiring a large number of MSIs,
>> > we can either:
>> > 1) Extend the AIA spec to support greater than 2047 IDs
>> > 2) Re-think the approach in the IMSIC driver
>> >
>> > The choice between #1 and #2 above depends on the
>> > guarantees we want for irq_set_affinity().
>>
>> The irq_set_affinity() behavior is better with this series, but I think
>> the other downsides, the number of available interrupt sources and the
>> IPI broadcast, are worse.
>
> The IPI overhead in the approach you are suggesting will be
> even worse than the IPI overhead of the current approach,
> because we will end up doing an IPI upon every irq_set_affinity()
> in the suggested approach, compared to doing an IPI only upon
> mask/unmask in the current approach.

Again, very workload dependent.

This series does IPI broadcast on masking/unmasking, which means that
cores that don't care get interrupted because, say, a network queue-pair
is set up on another core.

Some workloads never change the irq affinity.

I'm just pointing out that there are pro/cons with both variants.

> The biggest advantage of the current approach is a reliable
> irq_set_affinity(), which is a very valuable thing to have.

...and I'm arguing that we're paying a big price for that.

> ARM systems easily support a large number of LPIs per-core.
> For example, GIC-700 supports 56000 LPIs per-core.
> (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)

Yeah, but this is not the GIC. This is something that looks more like
the x86 world. We'll be stuck with a lot of implementations of the AIA
1.0 spec, and many cores.

> In the RISC-V world, we can easily define a small fast-track
> extension based on the S*csrind extension which can allow a
> large number of IMSIC IDs per-core.
>
> Instead of addressing problems on a hypothetical system,
> I suggest we go ahead with the current approach and deal
> with a system having MSI over-subscription when such a
> system shows up.

I've pointed out my concerns. We're not agreeing, but hey, I'm just one
sample point here! I'll leave it here for others to chime in!

Still much appreciate all the hard work on the series!


Have a nice weekend,
Björn

2023-10-20 17:14:18

by Anup Patel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

On Fri, Oct 20, 2023 at 10:07 PM Björn Töpel <[email protected]> wrote:
>
> [...]
>
> You are misreading. A 128c system with 2047 MSIs per-core will only
> have 16 *per-core unique* (2047/128) interrupts with the current series.
>
> I'm not saying that each IMSIC has 16 IDs; I'm saying that in a 128c
> system with the maximum number of MSIs possible in the spec, you'll end
> up with 16 *unique* interrupts per core.

-ENOPARSE

I don't see how this applies to the current approach, because we treat
the MSI ID space as global across cores, so if a system has 2047 MSIs
per-core then we have 2047 MSIs across all cores.

>
> [...]
>
> Again, very workload dependent.
>
> This series does IPI broadcast on masking/unmasking, which means that
> cores that don't care get interrupted because, say, a network queue-pair
> is set up on another core.
>
> Some workloads never change the irq affinity.

There are various events which change irq affinity, such as irq
balancing, CPU hotplug, system suspend, etc.

Also, the 1-1 approach does an IPI upon set_affinity, mask, and
unmask, whereas the 1-n approach does an IPI only upon mask
and unmask.

>
> [...]
>
> > ARM systems easily support a large number of LPIs per-core.
> > For example, GIC-700 supports 56000 LPIs per-core.
> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)
>
> Yeah, but this is not the GIC. This is something that looks more like
> the x86 world. We'll be stuck with a lot of implementations of the AIA
> 1.0 spec, and many cores.

Well, RISC-V AIA is neither ARM GIC nor x86 APIC. All I am saying
is that there are systems with a large number of per-core interrupt IDs
for handling MSIs.

>
> [...]
>
> I've pointed out my concerns. We're not agreeing, but hey, I'm just one
> sample point here! I'll leave it here for others to chime in!
>
> Still much appreciate all the hard work on the series!

Thanks. We have disagreements on this topic, but this is
certainly a good discussion.

>
>
> Have a nice weekend,

Regards,
Anup

2023-10-20 19:46:45

by Björn Töpel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

Anup Patel <[email protected]> writes:

> [...]
>
> -ENOPARSE
>
> I don't see how this applies to the current approach, because we treat
> the MSI ID space as global across cores, so if a system has 2047 MSIs
> per-core then we have 2047 MSIs across all cores.

Ok, I'll try again! :-)

Let's assume that each core in the 128c system has some per-core
resources, say two NIC queue pairs and a storage queue pair. This
will consume, e.g., 2*2 + 2 (6) MSI sources from the global namespace.

If each core does this, it'll consume 6*128 MSI sources from the global
namespace.

The maximum number of "private" MSI sources a core can utilize is 16.

I'm trying (it doesn't seem to go that well ;-)) to point out that it's
only 16 unique sources per core. For, say, a 256 core system it would be
8. 2047 MSI sources in a system is not much.

Say that I want to spin up 24 NIC queues with one MSI each on each core
on my 128c system. That's not possible with this series, while with a
1-1 system it wouldn't be an issue.

Clearer, or still weird?
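
Here's the same budget math as a trivial C snippet (the numbers are
the hypothetical 128c example above, nothing measured):

#include <stdio.h>

int main(void)
{
	const int total_ids = 2047;      /* AIA 1.0 per-file maximum */
	const int nr_cpus = 128;
	const int queues_per_cpu = 24;   /* the NIC example above */

	/* ~16 in the text; 15 with integer division */
	printf("private IDs per core: %d\n", total_ids / nr_cpus);
	/* 1-n: each ID is reserved on every hart */
	printf("1-n demand: %d IDs\n", queues_per_cpu * nr_cpus); /* 3072 */
	/* 1-1: each ID is consumed on one hart only */
	printf("1-1 demand: %d IDs per core\n", queues_per_cpu);  /* 24 */
	return 0;
}

3072 > 2047, so the 1-n scheme runs out of the global namespace.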

>
> [...]
>
> There are various events which change irq affinity, such as irq
> balancing, CPU hotplug, system suspend, etc.
>
> Also, the 1-1 approach does an IPI upon set_affinity, mask, and
> unmask, whereas the 1-n approach does an IPI only upon mask
> and unmask.

An important distinction: when you say IPI on mask/unmask, it is a
broadcast IPI to *all* cores, which is pretty intrusive.

The 1-1 variant does an IPI to *one* target core.
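
Sketching what that difference looks like in kernel terms (this is an
illustration, not the driver's actual code; imsic_local_sync() is an
assumed helper that replays the enable/disable bits into the local
interrupt file):

#include <linux/smp.h>

static void imsic_local_sync(void *info)
{
	/* runs on the receiving hart: sync its IMSIC enable bits */
}

/* 1-n style: any hart may hold state for this ID, so sync all of
 * them -- a broadcast IPI on every mask/unmask. */
static void imsic_mask_sync_1n(void)
{
	on_each_cpu(imsic_local_sync, NULL, 1);
}

/* 1-1 style: the ID lives on exactly one hart, so poke only it. */
static void imsic_mask_sync_11(unsigned int cpu)
{
	smp_call_function_single(cpu, imsic_local_sync, NULL, 1);
}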

> [...]
>
> Well, RISC-V AIA is neither ARM GIC nor x86 APIC. All I am saying
> is that there are systems with a large number of per-core interrupt IDs
> for handling MSIs.

Yes, and while that is nice, it's not what IMSIC is.


Now, back to the weekend for real! ;-) (https://xkcd.com/386/)
Björn

2023-10-23 07:02:41

by Björn Töpel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

Björn Töpel <[email protected]> writes:

> [...]
>
> Yes, and while that is nice, it's not what IMSIC is.

Some follow-ups, after thinking about it more over the weekend.

* Does one really need an IPI for irq_set_affinity() in the 1-1 model?
Why touch the enable/disable bits when moving interrupts?

* In my book the IMSIC looks very much like the x86 LAPIC, which also
has few interrupts (IMSIC <2048, LAPIC 256). The IRQ matrix allocator
[1], and a scheme similar to LAPIC [2], would be a good fit. This is
the 1-1 model, but more sophisticated than what I've been describing
(e.g. properly handling managed/regular irqs). As a bonus we would
get the IRQ matrix debugfs/tracepoint support (a rough usage sketch
follows below).
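
For reference, a rough sketch of how the matrix allocator API from
kernel/irq/matrix.c is typically driven; the IMSIC-side names
(IMSIC_NR_IDS, imsic_matrix, the init/alloc/free wrappers) are made
up for illustration:

#include <linux/init.h>
#include <linux/irq.h>
#include <linux/errno.h>
#include <linux/cpumask.h>

#define IMSIC_NR_IDS	2048	/* identities 1..2047, 0 is reserved */

static struct irq_matrix *imsic_matrix;

static int __init imsic_matrix_init(void)
{
	imsic_matrix = irq_alloc_matrix(IMSIC_NR_IDS, 1, IMSIC_NR_IDS);
	return imsic_matrix ? 0 : -ENOMEM;
}

/* Returns an ID and picks the least-loaded CPU from @mask, which is
 * exactly the per-CPU accounting the 1-1 model needs. */
static int imsic_matrix_alloc_id(const struct cpumask *mask,
				 unsigned int *cpu)
{
	return irq_matrix_alloc(imsic_matrix, mask, false, cpu);
}

static void imsic_matrix_free_id(unsigned int cpu, unsigned int id)
{
	irq_matrix_free(imsic_matrix, cpu, id, false);
}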


Björn

2023-10-23 08:35:08

by Anup Patel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

On Mon, Oct 23, 2023 at 12:32 PM Björn Töpel <[email protected]> wrote:
>
> [...]
>
> Some follow-ups, after thinking about it more over the weekend.
>
> * Does one really need an IPI for irq_set_affinity() in the 1-1 model?
> Why touch the enable/disable bits when moving interrupts?

In the 1-1 model, the ID on the current HART and on the target HART upon
irq_set_affinity() will be different, so we can't leave the unused ID on
the current HART enabled, because that can lead to spurious interrupts
when the ID on the current HART is re-used for some other device.

There is also a possibility of receiving an interrupt while the ID is
being moved to a new target HART, in which case we have to detect this
and re-trigger the interrupt on the new target HART. In fact, the x86
LAPIC code does an IPI to take care of this case.
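
A sketch of that move sequence (per the AIA spec, writing an identity
to an interrupt file's seteipnum_le register sets it pending, which is
what makes the re-trigger possible; the helpers and the exact flow
below are illustrative assumptions, not the eventual driver code):

#include <linux/io.h>
#include <linux/smp.h>

#define IMSIC_MMIO_SETIPNUM_LE	0x00	/* offset in the AIA spec */

/* assumed helpers that act on the *local* hart's interrupt file */
extern bool imsic_id_read_clear_pending(unsigned int id);
extern void imsic_id_disable(unsigned int id);
extern void imsic_id_enable(unsigned int id);

struct imsic_move {
	unsigned int old_id;
	bool was_pending;
};

static void imsic_ipi_enable_new(void *info)
{
	imsic_id_enable(*(unsigned int *)info);	/* runs on new hart */
}

static void imsic_ipi_disable_old(void *info)
{
	struct imsic_move *mv = info;

	/* runs on the old hart: latch pending state, then disable so
	 * a re-used ID cannot fire spuriously */
	mv->was_pending = imsic_id_read_clear_pending(mv->old_id);
	imsic_id_disable(mv->old_id);
}

static void imsic_move_irq(unsigned int old_cpu, unsigned int new_cpu,
			   unsigned int old_id, unsigned int new_id,
			   void __iomem *new_hart_file)
{
	struct imsic_move mv = { .old_id = old_id };

	/* a hart can only flip its own enable bits, hence the IPIs */
	smp_call_function_single(new_cpu, imsic_ipi_enable_new, &new_id, 1);
	smp_call_function_single(old_cpu, imsic_ipi_disable_old, &mv, 1);

	/* an MSI may have landed on the old ID mid-move: re-inject it
	 * on the new hart's interrupt file */
	if (mv.was_pending)
		writel(new_id, new_hart_file + IMSIC_MMIO_SETIPNUM_LE);
}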

>
> * In my book the IMSIC looks very much like the x86 LAPIC, which also
> has few interrupts (IMSIC <2048, LAPIC 256). The IRQ matrix allocator
> [1], and a scheme similar to LAPIC [2], would be a good fit. This is
> the 1-1 model, but more sophisticated than what I've been describing
> (e.g. properly handling managed/regular irqs). As a bonus we would
> get the IRQ matrix debugfs/tracepoint support.
>

Yes, I have been evaluating the 1-1 model for the past few days. I also
have a working implementation with a simple per-CPU bitmap-based
allocator which handles both legacy MSI (blocks of 1, 2, 4, 8, 16, or
32 IDs) and MSI-X.

The irq matrix allocator needs to be improved to handle legacy MSI, so
initially I will post a v11 series which works for me; converging with
the irq matrix allocator can be future work.
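
For the curious, a minimal sketch of such a per-CPU bitmap allocator
(names and sizes are assumptions, not the v11 code). Legacy MSI wants
a naturally aligned power-of-two block of 1..32 IDs, which maps nicely
onto bitmap_find_next_zero_area() with align_mask = nr_ids - 1; MSI-X
simply allocates with nr_ids == 1:

#include <linux/bitmap.h>
#include <linux/errno.h>
#include <linux/spinlock.h>

#define IMSIC_NR_IDS	2048	/* assumed ID space; identity 0 unused */

struct imsic_ids {
	raw_spinlock_t lock;
	DECLARE_BITMAP(used, IMSIC_NR_IDS);
};

static int imsic_ids_alloc(struct imsic_ids *ids, unsigned int nr_ids)
{
	unsigned long flags, id;

	raw_spin_lock_irqsave(&ids->lock, flags);
	/* start at 1 to skip the reserved identity 0; align_mask gives
	 * the natural alignment legacy MSI blocks require */
	id = bitmap_find_next_zero_area(ids->used, IMSIC_NR_IDS, 1,
					nr_ids, nr_ids - 1);
	if (id >= IMSIC_NR_IDS) {
		raw_spin_unlock_irqrestore(&ids->lock, flags);
		return -ENOSPC;
	}
	bitmap_set(ids->used, id, nr_ids);
	raw_spin_unlock_irqrestore(&ids->lock, flags);

	return id;
}

static void imsic_ids_free(struct imsic_ids *ids, unsigned int id,
			   unsigned int nr_ids)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&ids->lock, flags);
	bitmap_clear(ids->used, id, nr_ids);
	raw_spin_unlock_irqrestore(&ids->lock, flags);
}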

Regards,
Anup

2023-10-23 14:07:39

by Björn Töpel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

Anup Patel <[email protected]> writes:

> [...]
>> >>> >> >> >> On the negative side:
>> >>> >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of
>> >>> >> >> >> interrupts. Especially for many core systems.
>> >>> >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no
>> >>> >> >> >> broadcast csr in IMSIC), and a more complex locking scheme IMSIC
>> >>> >> >> >>
>> >>> >> >> >> Summary:
>> >>> >> >> >> The current series caps the number of global interrupts to maximum
>> >>> >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
>> >>> >> >> >> to expose 2047 * #harts unique MSIs.
>> >>> >> >> >>
>> >>> >> >> >> I think this could simplify/remove(?) the locking as well.
>> >>> >> >> >
>> >>> >> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
>> >>> >> >> > 1) The irq_set_affinity() does not work for MSIs because each
>> >>> >> >> > IRQ is not tied to a particular HART. This means we can't
>> >>> >> >> > balance the IRQ processing load among HARTs.
>> >>> >> >>
>> >>> >> >> Yes, you can balance. In your code, each *active* MSI is still
>> >>> >> >> bound/active to a specific hard together with the affinity mask. In an
>> >>> >> >> 1-1 model you would still need to track the affinity mask, but the
>> >>> >> >> irq_set_affinity() would be different. It would try to allocate a new
>> >>> >> >> MSI from the target CPU, and then switch to having that MSI active.
>> >>> >> >>
>> >>> >> >> That's what x86 does AFAIU, which is also constrained by the # of
>> >>> >> >> available MSIs.
>> >>> >> >>
>> >>> >> >> The downside, as I pointed out, is that the set affinity action can
>> >>> >> >> fail for a certain target CPU.
>> >>> >> >
>> >>> >> > Yes, irq_set_affinity() can fail for the suggested approach plus for
>> >>> >> > RISC-V AIA, one HART does not have access to other HARTs
>> >>> >> > MSI enable/disable bits so the approach will also involve IPI.
>> >>> >>
>> >>> >> Correct, but the current series does a broadcast to all cores, where the
>> >>> >> 1-1 approach is at most an IPI to a single core.
>> >>> >>
>> >>> >> 128+c machines are getting more common, and you have devices that you
>> >>> >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
>> >>> >> dealing with a per-core activity is a pretty noisy neighbor.
>> >>> >
>> >>> > Broadcast IPI in the current approach is only done upon MSI mask/unmask
>> >>> > operation. It is not done upon set_affinity() of interrupt handling.
>> >>>
>> >>> I'm aware. We're on the same page here.
>> >>>
>> >>> >>
>> >>> >> This could be fixed in the existing 1-n approach, by not require to sync
>> >>> >> the cores that are not handling the MSI in question. "Lazy disable"
>> >>> >
>> >>> > Incorrect. The approach you are suggesting involves an IPI upon every
>> >>> > irq_set_affinity(). This is because a HART can only enable it's own
>> >>> > MSI ID so when an IRQ is moved to from HART A to HART B with
>> >>> > a different ID X on HART B then we will need an IPI in irq_set_affinit()
>> >>> > to enable ID X on HART B.
>> >>>
>> >>> Yes, the 1-1 approach will require an IPI to one target cpu on affinity
>> >>> changes, and similar on mask/unmask.
>> >>>
>> >>> The 1-n approach, require no-IPI on affinity changes (nice!), but IPI
>> >>> broadcast to all cores on mask/unmask (not so nice).
>> >>>
>> >>> >> >> My concern is interrupts become a scarce resource with this
>> >>> >> >> implementation, but maybe my view is incorrect. I've seen bare-metal
>> >>> >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
>> >>> >> >> that is considered "a lot of interrupts".
>> >>> >> >>
>> >>> >> >> As long as we don't get into scenarios where we're running out of
>> >>> >> >> interrupts, due to the software design.
>> >>> >> >>
>> >>> >> >
>> >>> >> > The current approach is simpler and ensures irq_set_affinity
>> >>> >> > always works. The limit of max 2047 IDs is sufficient for many
>> >>> >> > systems (if not all).
>> >>> >>
>> >>> >> Let me give you another view. On a 128c system each core has ~16 unique
>> >>> >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
>> >>> >> network queue pairs for each PF.
>> >>> >
>> >>> > Clearly, this example is a hypothetical and represents a poorly
>> >>> > designed platform.
>> >>> >
>> >>> > Having just 16 IDs per-Core is a very poor design choice. In fact, the
>> >>> > Server SoC spec mandates a minimum 255 IDs.
>> >>>
>> >>> You are misreading. A 128c system with 2047 MSIs per-core, will only
>> >>> have 16 *per-core unique* (2047/128) interrupts with the current series.
>> >>>
>> >>> I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c
>> >>> system with the maximum amount of MSIs possible in the spec, you'll end
>> >>> up with 16 *unique* interrupts per core.
>> >>
>> >> -ENOPARSE
>> >>
>> >> I don't see how this applies to the current approach because we treat
>> >> MSI ID space as global across cores so if a system has 2047 MSIs
>> >> per-core then we have 2047 MSIs across all cores.
>> >
>> > Ok, I'll try again! :-)
>> >
>> > Let's assume that each core in the 128c system has some per-core
>> > resources, say a two NIC queue pairs, and a storage queue pair. This
>> > will consume, e.g., 2*2 + 2 (6) MSI sources from the global namespace.
>> >
>> > If each core does this it'll be 6*128 MSI sources of the global
>> > namespace.
>> >
>> > The maximum number of "privates" MSI sources a core can utilize is 16.
>> >
>> > I'm trying (it's does seem to go that well ;-)) to point out that it's
>> > only 16 unique sources per core. For, say, a 256 core system it would be
>> > 8. 2047 MSI sources in a system is not much.
>> >
>> > Say that I want to spin up 24 NIC queues with one MSI each on each core
>> > on my 128c system. That's not possible with this series, while with an
>> > 1-1 system it wouldn't be an issue.
>> >
>> > Clearer, or still weird?
>> >
>> >>
>> >>>
>> >>> > Regarding NICs which support a large number of queues, the driver
>> >>> > will typically enable only one queue per-core and set the affinity to
>> >>> > separate cores. We have user-space data plane applications based
>> >>> > on DPDK which are capable of using a large number of NIC queues
>> >>> > but these applications are polling based and don't use MSIs.
>> >>>
>> >>> That's one sample point, and clearly not the only one. There are *many*
>> >>> different usage models. Just because you *assign* MSI, doesn't mean they
>> >>> are firing all the time.
>> >>>
>> >>> I can show you a couple of networking setups where this is clearly not
>> >>> enough. Each core has a large number of QoS queues, and each queue would
>> >>> very much like to have a dedicated MSI.
>> >>>
>> >>> >> > When we encounter a system requiring a large number of MSIs,
>> >>> >> > we can either:
>> >>> >> > 1) Extend the AIA spec to support greater than 2047 IDs
>> >>> >> > 2) Re-think the approach in the IMSIC driver
>> >>> >> >
>> >>> >> > The choice between #1 and #2 above depends on the
>> >>> >> > guarantees we want for irq_set_affinity().
>> >>> >>
>> >>> >> The irq_set_affinity() behavior is better with this series, but I think
>> >>> >> the other downsides: number of available interrupt sources, and IPI
>> >>> >> broadcast are worse.
>> >>> >
>> >>> > The IPI overhead in the approach you are suggesting will be
>> >>> > even bad compared to the IPI overhead of the current approach
>> >>> > because we will end-up doing IPI upon every irq_set_affinity()
>> >>> > in the suggested approach compared to doing IPI upon every
>> >>> > mask/unmask in the current approach.
>> >>>
>> >>> Again, very workload dependent.
>> >>>
>> >>> This series does IPI broadcast on masking/unmasking, which means that
>> >>> cores that don't care get interrupted because, say, a network queue-pair
>> >>> is setup on another core.
>> >>>
>> >>> Some workloads never change the irq affinity.
>> >>
>> >> There are various events which irq affinity such as irq balance,
>> >> CPU hotplug, system suspend, etc.
>> >>
>> >> Also, the 1-1 approach does IPI upon set_affinity, mask and
>> >> unmask whereas the 1-n approach does IPI only upon mask
>> >> and unmask.
>> >
>> > An important distinction; When you say IPI on mask/unmask it is a
>> > broadcast IPI to *all* cores, which is pretty instrusive.
>> >
>> > The 1-1 variant does an IPI to a *one* target core.
>> >
>> >>> I'm just pointing out that there are pro/cons with both variants.
>> >>>
>> >>> > The biggest advantage of the current approach is a reliable
>> >>> > irq_set_affinity() which is a very valuable thing to have.
>> >>>
>> >>> ...and I'm arguing that we're paying a big price for that.
>> >>>
>> >>> > ARM systems easily support a large number of LPIs per-core.
>> >>> > For example, GIC-700 supports 56000 LPIs per-core.
>> >>> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)
>> >>>
>> >>> Yeah, but this is not the GIC. This is something that looks more like
>> >>> the x86 world. We'll be stuck with a lot of implementations with AIA 1.0
>> >>> spec, and many cores.
>> >>
>> >> Well, RISC-V AIA is neigher ARM GIG not x86 APIC. All I am saying
>> >> is that there are systems with large number per-core interrupt IDs
>> >> for handling MSIs.
>> >
>> > Yes, and while that is nice, it's not what IMSIC is.
>>
>> Some follow-ups, after thinking more about it more over the weekend.
>>
>> * Do one really need an IPI for irq_set_affinity() for the 1-1 model?
>> Why touch the enable/disable bits when moving interrupts?
>
> In the 1-1 model, the ID on the current HART and the ID on the target
> HART of irq_set_affinity() will be different, so we can't leave the
> unused ID on the current HART enabled because it can lead to spurious
> interrupts when that ID is re-used for some other device.

Hmm, is this really an actual problem, or a theoretical one? The
implementation needs to track what's in use, so can we ever get into
this situation?

Somewhat related; I had a similar question for imsic_pci_{un,}mask_irq()
-- why not just do the default mask operation (pci_msi_{un,}mask_irq())
instead of propagating to the IMSIC mask/unmask?

> There is also a possibility of receiving an interrupt while the ID is
> being moved to a new target HART, in which case we have to detect this
> and re-trigger the interrupt on the new target HART. In fact, the x86
> APIC does an IPI to take care of this case.

This case I get, and the implementation can track that both are in use.
It's the spurious one that I'm dubious of (don't get).

>>
>> * In my book the IMSIC looks very much like the x86 LAPIC, which also
>> has few interrupts (IMSIC <2048, LAPIC 256). The IRQ matrix allocator
>> [1], and a scheme similar to LAPIC [2] would be a good fit. This is
>> the 1-1 model, but more sophisticated than what I've been describing
>> (e.g. properly handling mangaged/regular irqs). As a bonus we would
>> get the IRQ matrix debugfs/tracepoint support.
>>
>
> Yes, I have been evaluating the 1-1 model for the past few days. I also
> have a working implementation with a simple per-CPU bitmap based
> allocator which handles both legacy MSI (block of 1,2,4,8,16, or 32 IDs)
> and MSI-X.
>
> The irq matrix allocator needs to be improved to handle legacy MSI,
> so initially I will post a v11 series which works for me; converging
> with the irq matrix allocator can be future work.

What's missing/needs to be improved for legacy MSI (legacy MSI ==
!MSI-X, right?) in the matrix allocator?


Björn

2023-10-23 14:43:15

by Anup Patel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

On Mon, Oct 23, 2023 at 7:37 PM Björn Töpel <[email protected]> wrote:
>
> Anup Patel <[email protected]> writes:
>
> > On Mon, Oct 23, 2023 at 12:32 PM Björn Töpel <[email protected]> wrote:
> >>
> >> Björn Töpel <[email protected]> writes:
> >>
> >> > Anup Patel <[email protected]> writes:
> >> >
> >> >> On Fri, Oct 20, 2023 at 10:07 PM Björn Töpel <[email protected]> wrote:
> >> >>>
> >> >>> Anup Patel <[email protected]> writes:
> >> >>>
> >> >>> > On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <[email protected]> wrote:
> >> >>> >>
> >> >>> >> Anup Patel <[email protected]> writes:
> >> >>> >>
> >> >>> >> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <[email protected]> wrote:
> >> >>> >> >>
> >> >>> >> >> Thanks for the quick reply!
> >> >>> >> >>
> >> >>> >> >> Anup Patel <[email protected]> writes:
> >> >>> >> >>
> >> >>> >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <[email protected]> wrote:
> >> >>> >> >> >>
> >> >>> >> >> >> Hi Anup,
> >> >>> >> >> >>
> >> >>> >> >> >> Anup Patel <[email protected]> writes:
> >> >>> >> >> >>
> >> >>> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
> >> >>> >> >> >> > process. The latest ratified AIA specifcation can be found at:
> >> >>> >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
> >> >>> >> >> >> >
> >> >>> >> >> >> > At a high-level, the AIA specification adds three things:
> >> >>> >> >> >> > 1) AIA CSRs
> >> >>> >> >> >> > - Improved local interrupt support
> >> >>> >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
> >> >>> >> >> >> > - Per-HART MSI controller
> >> >>> >> >> >> > - Support MSI virtualization
> >> >>> >> >> >> > - Support IPI along with virtualization
> >> >>> >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
> >> >>> >> >> >> > - Wired interrupt controller
> >> >>> >> >> >> > - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
> >> >>> >> >> >> > - In Direct-mode, injects external interrupts directly into HARTs
> >> >>> >> >> >>
> >> >>> >> >> >> Thanks for working on the AIA support! I had a look at the series, and
> >> >>> >> >> >> have some concerns about interrupt ID abstraction.
> >> >>> >> >> >>
> >> >>> >> >> >> A bit of background, for readers not familiar with the AIA details.
> >> >>> >> >> >>
> >> >>> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
> >> >>> >> >> >> each MSI is dedicated to a certain hart. The series takes the approach
> >> >>> >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
> >> >>> >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
> >> >>> >> >> >> slice only *one* msi-irq is acutally used.
> >> >>> >> >> >>
> >> >>> >> >> >> This scheme makes affinity changes more robust, because the interrupt
> >> >>> >> >> >> sources on "other" harts are pre-allocated. On the other hand it
> >> >>> >> >> >> requires to propagate irq masking to other harts via IPIs (this is
> >> >>> >> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
> >> >>> >> >> >> are hogged, and cannot be used.
> >> >>> >> >> >>
> >> >>> >> >> >> Contemporary storage/networking drivers usually uses queues per core
> >> >>> >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
> >> >>> >> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
> >> >>> >> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
> >> >>> >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
> >> >>> >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current
> >> >>> >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5.
> >> >>> >> >> >>
> >> >>> >> >> >> On the plus side:
> >> >>> >> >> >> * Changing interrupts affinity will never fail, because the interrupts
> >> >>> >> >> >> on each hart is pre-allocated.
> >> >>> >> >> >>
> >> >>> >> >> >> On the negative side:
> >> >>> >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of
> >> >>> >> >> >> interrupts. Especially for many core systems.
> >> >>> >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no
> >> >>> >> >> >> broadcast csr in IMSIC), and a more complex locking scheme IMSIC
> >> >>> >> >> >>
> >> >>> >> >> >> Summary:
> >> >>> >> >> >> The current series caps the number of global interrupts to maximum
> >> >>> >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
> >> >>> >> >> >> to expose 2047 * #harts unique MSIs.
> >> >>> >> >> >>
> >> >>> >> >> >> I think this could simplify/remove(?) the locking as well.
> >> >>> >> >> >
> >> >>> >> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
> >> >>> >> >> > 1) The irq_set_affinity() does not work for MSIs because each
> >> >>> >> >> > IRQ is not tied to a particular HART. This means we can't
> >> >>> >> >> > balance the IRQ processing load among HARTs.
> >> >>> >> >>
> >> >>> >> >> Yes, you can balance. In your code, each *active* MSI is still
> >> >>> >> >> bound/active to a specific hard together with the affinity mask. In an
> >> >>> >> >> 1-1 model you would still need to track the affinity mask, but the
> >> >>> >> >> irq_set_affinity() would be different. It would try to allocate a new
> >> >>> >> >> MSI from the target CPU, and then switch to having that MSI active.
> >> >>> >> >>
> >> >>> >> >> That's what x86 does AFAIU, which is also constrained by the # of
> >> >>> >> >> available MSIs.
> >> >>> >> >>
> >> >>> >> >> The downside, as I pointed out, is that the set affinity action can
> >> >>> >> >> fail for a certain target CPU.
> >> >>> >> >
> >> >>> >> > Yes, irq_set_affinity() can fail for the suggested approach plus for
> >> >>> >> > RISC-V AIA, one HART does not have access to other HARTs
> >> >>> >> > MSI enable/disable bits so the approach will also involve IPI.
> >> >>> >>
> >> >>> >> Correct, but the current series does a broadcast to all cores, where the
> >> >>> >> 1-1 approach is at most an IPI to a single core.
> >> >>> >>
> >> >>> >> 128+c machines are getting more common, and you have devices that you
> >> >>> >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
> >> >>> >> dealing with a per-core activity is a pretty noisy neighbor.
> >> >>> >
> >> >>> > Broadcast IPI in the current approach is only done upon MSI mask/unmask
> >> >>> > operation. It is not done upon set_affinity() of interrupt handling.
> >> >>>
> >> >>> I'm aware. We're on the same page here.
> >> >>>
> >> >>> >>
> >> >>> >> This could be fixed in the existing 1-n approach, by not require to sync
> >> >>> >> the cores that are not handling the MSI in question. "Lazy disable"
> >> >>> >
> >> >>> > Incorrect. The approach you are suggesting involves an IPI upon every
> >> >>> > irq_set_affinity(). This is because a HART can only enable it's own
> >> >>> > MSI ID so when an IRQ is moved to from HART A to HART B with
> >> >>> > a different ID X on HART B then we will need an IPI in irq_set_affinit()
> >> >>> > to enable ID X on HART B.
> >> >>>
> >> >>> Yes, the 1-1 approach will require an IPI to one target cpu on affinity
> >> >>> changes, and similar on mask/unmask.
> >> >>>
> >> >>> The 1-n approach, require no-IPI on affinity changes (nice!), but IPI
> >> >>> broadcast to all cores on mask/unmask (not so nice).
> >> >>>
> >> >>> >> >> My concern is interrupts become a scarce resource with this
> >> >>> >> >> implementation, but maybe my view is incorrect. I've seen bare-metal
> >> >>> >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
> >> >>> >> >> that is considered "a lot of interrupts".
> >> >>> >> >>
> >> >>> >> >> As long as we don't get into scenarios where we're running out of
> >> >>> >> >> interrupts, due to the software design.
> >> >>> >> >>
> >> >>> >> >
> >> >>> >> > The current approach is simpler and ensures irq_set_affinity
> >> >>> >> > always works. The limit of max 2047 IDs is sufficient for many
> >> >>> >> > systems (if not all).
> >> >>> >>
> >> >>> >> Let me give you another view. On a 128c system each core has ~16 unique
> >> >>> >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
> >> >>> >> network queue pairs for each PF.
> >> >>> >
> >> >>> > Clearly, this example is a hypothetical and represents a poorly
> >> >>> > designed platform.
> >> >>> >
> >> >>> > Having just 16 IDs per-Core is a very poor design choice. In fact, the
> >> >>> > Server SoC spec mandates a minimum 255 IDs.
> >> >>>
> >> >>> You are misreading. A 128c system with 2047 MSIs per-core, will only
> >> >>> have 16 *per-core unique* (2047/128) interrupts with the current series.
> >> >>>
> >> >>> I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c
> >> >>> system with the maximum amount of MSIs possible in the spec, you'll end
> >> >>> up with 16 *unique* interrupts per core.
> >> >>
> >> >> -ENOPARSE
> >> >>
> >> >> I don't see how this applies to the current approach because we treat
> >> >> MSI ID space as global across cores so if a system has 2047 MSIs
> >> >> per-core then we have 2047 MSIs across all cores.
> >> >
> >> > Ok, I'll try again! :-)
> >> >
> >> > Let's assume that each core in the 128c system has some per-core
> >> > resources, say a two NIC queue pairs, and a storage queue pair. This
> >> > will consume, e.g., 2*2 + 2 (6) MSI sources from the global namespace.
> >> >
> >> > If each core does this it'll be 6*128 MSI sources of the global
> >> > namespace.
> >> >
> >> > The maximum number of "privates" MSI sources a core can utilize is 16.
> >> >
> >> > I'm trying (it's does seem to go that well ;-)) to point out that it's
> >> > only 16 unique sources per core. For, say, a 256 core system it would be
> >> > 8. 2047 MSI sources in a system is not much.
> >> >
> >> > Say that I want to spin up 24 NIC queues with one MSI each on each core
> >> > on my 128c system. That's not possible with this series, while with an
> >> > 1-1 system it wouldn't be an issue.
> >> >
> >> > Clearer, or still weird?
> >> >
> >> >>
> >> >>>
> >> >>> > Regarding NICs which support a large number of queues, the driver
> >> >>> > will typically enable only one queue per-core and set the affinity to
> >> >>> > separate cores. We have user-space data plane applications based
> >> >>> > on DPDK which are capable of using a large number of NIC queues
> >> >>> > but these applications are polling based and don't use MSIs.
> >> >>>
> >> >>> That's one sample point, and clearly not the only one. There are *many*
> >> >>> different usage models. Just because you *assign* MSI, doesn't mean they
> >> >>> are firing all the time.
> >> >>>
> >> >>> I can show you a couple of networking setups where this is clearly not
> >> >>> enough. Each core has a large number of QoS queues, and each queue would
> >> >>> very much like to have a dedicated MSI.
> >> >>>
> >> >>> >> > When we encounter a system requiring a large number of MSIs,
> >> >>> >> > we can either:
> >> >>> >> > 1) Extend the AIA spec to support greater than 2047 IDs
> >> >>> >> > 2) Re-think the approach in the IMSIC driver
> >> >>> >> >
> >> >>> >> > The choice between #1 and #2 above depends on the
> >> >>> >> > guarantees we want for irq_set_affinity().
> >> >>> >>
> >> >>> >> The irq_set_affinity() behavior is better with this series, but I think
> >> >>> >> the other downsides: number of available interrupt sources, and IPI
> >> >>> >> broadcast are worse.
> >> >>> >
> >> >>> > The IPI overhead in the approach you are suggesting will be
> >> >>> > even bad compared to the IPI overhead of the current approach
> >> >>> > because we will end-up doing IPI upon every irq_set_affinity()
> >> >>> > in the suggested approach compared to doing IPI upon every
> >> >>> > mask/unmask in the current approach.
> >> >>>
> >> >>> Again, very workload dependent.
> >> >>>
> >> >>> This series does IPI broadcast on masking/unmasking, which means that
> >> >>> cores that don't care get interrupted because, say, a network queue-pair
> >> >>> is setup on another core.
> >> >>>
> >> >>> Some workloads never change the irq affinity.
> >> >>
> >> >> There are various events which irq affinity such as irq balance,
> >> >> CPU hotplug, system suspend, etc.
> >> >>
> >> >> Also, the 1-1 approach does IPI upon set_affinity, mask and
> >> >> unmask whereas the 1-n approach does IPI only upon mask
> >> >> and unmask.
> >> >
> >> > An important distinction; When you say IPI on mask/unmask it is a
> >> > broadcast IPI to *all* cores, which is pretty instrusive.
> >> >
> >> > The 1-1 variant does an IPI to a *one* target core.
> >> >
> >> >>> I'm just pointing out that there are pro/cons with both variants.
> >> >>>
> >> >>> > The biggest advantage of the current approach is a reliable
> >> >>> > irq_set_affinity() which is a very valuable thing to have.
> >> >>>
> >> >>> ...and I'm arguing that we're paying a big price for that.
> >> >>>
> >> >>> > ARM systems easily support a large number of LPIs per-core.
> >> >>> > For example, GIC-700 supports 56000 LPIs per-core.
> >> >>> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)
> >> >>>
> >> >>> Yeah, but this is not the GIC. This is something that looks more like
> >> >>> the x86 world. We'll be stuck with a lot of implementations with AIA 1.0
> >> >>> spec, and many cores.
> >> >>
> >> >> Well, RISC-V AIA is neigher ARM GIG not x86 APIC. All I am saying
> >> >> is that there are systems with large number per-core interrupt IDs
> >> >> for handling MSIs.
> >> >
> >> > Yes, and while that is nice, it's not what IMSIC is.
> >>
> >> Some follow-ups, after thinking more about it more over the weekend.
> >>
> >> * Do one really need an IPI for irq_set_affinity() for the 1-1 model?
> >> Why touch the enable/disable bits when moving interrupts?
> >
> > In the 1-1 model, the ID on the current HART and the ID on the target
> > HART of irq_set_affinity() will be different, so we can't leave the
> > unused ID on the current HART enabled because it can lead to spurious
> > interrupts when that ID is re-used for some other device.
>
> Hmm, is this really an actual problem, or a theoretical one? The
> implementation needs to track what's in use, so can we ever get into
> this situation?

As of now, it is theoretical but it is certainly possible to hit this issue.

>
> Somewhat related; I had a similar question for imsic_pci_{un,}mask_irq()
> -- why not just do the default mask operation (pci_msi_{un,}mask_irq())
> instead of propagating to the IMSIC mask/unmask?

We have a hierarchical IMSIC PCI irq domain whose parent irq domain
is the IMSIC base domain. Unfortunately, pci_msi_[un]mask_irq() don't
work for hierarchical irq domains.
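
For reference, the mask path in such a hierarchy ends up looking
roughly like this (a sketch of the pattern, not necessarily the exact
code in the series): the PCI-level mask is applied and the operation
is also propagated to the parent IMSIC domain.

#include <linux/irq.h>
#include <linux/msi.h>

static void imsic_pci_mask_irq(struct irq_data *d)
{
	/* Mask at the PCI level, then walk up to the parent
	 * (IMSIC base) domain, which pci_msi_mask_irq() alone
	 * would not do. */
	pci_msi_mask_irq(d);
	irq_chip_mask_parent(d);
}

static void imsic_pci_unmask_irq(struct irq_data *d)
{
	irq_chip_unmask_parent(d);
	pci_msi_unmask_irq(d);
}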

>
> > There is also a possibility of receiving an interrupt while the ID is
> > being moved to a new target HART, in which case we have to detect this
> > and re-trigger the interrupt on the new target HART. In fact, the x86
> > APIC does an IPI to take care of this case.
>
> This case I get, and the implementation can track that both are in use.
> It's the spurious one that I'm dubious of (don't get).
>
> >>
> >> * In my book the IMSIC looks very much like the x86 LAPIC, which also
> >> has few interrupts (IMSIC <2048, LAPIC 256). The IRQ matrix allocator
> >> [1], and a scheme similar to LAPIC [2] would be a good fit. This is
> >> the 1-1 model, but more sophisticated than what I've been describing
> >> (e.g. properly handling mangaged/regular irqs). As a bonus we would
> >> get the IRQ matrix debugfs/tracepoint support.
> >>
> >
> > Yes, I have been evaluating the 1-1 model for the past few days. I also
> > have a working implementation with a simple per-CPU bitmap based
> > allocator which handles both legacy MSI (block of 1,2,4,8,16, or 32 IDs)
> > and MSI-X.
> >
> > The irq matrix allocator needs to be improved to handle legacy MSI,
> > so initially I will post a v11 series which works for me; converging
> > with the irq matrix allocator can be future work.
>
> What's missing/needs to be improved for legacy MSI (legacy MSI ==
> !MSI-X, right?) in the matrix allocator?

For legacy MSI, a block of IDs needs to be contiguous and the number
of IDs must be a power of 2 (up to 32).

For a short summary of the differences between MSI and MSI-X, refer to:
https://en.wikipedia.org/wiki/Message_Signaled_Interrupts
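
To spell out the constraint (a hypothetical helper, purely for
illustration): with multi-message MSI the device ORs the vector index
into the low bits of the message data, so the base ID must be aligned
to the power-of-2 block size:

#include <stdbool.h>
#include <stdio.h>

static bool legacy_msi_block_ok(unsigned int base_id, unsigned int nvec)
{
	/* nvec must be a power of 2 between 1 and 32 */
	if (nvec == 0 || nvec > 32 || (nvec & (nvec - 1)))
		return false;
	/* base must have the low log2(nvec) bits clear */
	return (base_id & (nvec - 1)) == 0;
}

int main(void)
{
	printf("%d\n", legacy_msi_block_ok(32, 8));	/* 1: aligned */
	printf("%d\n", legacy_msi_block_ok(20, 8));	/* 0: unaligned */
	return 0;
}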

Regards,
Anup

2023-10-23 15:45:42

by Björn Töpel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

Anup Patel <[email protected]> writes:

> On Mon, Oct 23, 2023 at 7:37 PM Björn Töpel <[email protected]> wrote:
>>
>> Anup Patel <[email protected]> writes:
>>
>> > On Mon, Oct 23, 2023 at 12:32 PM Björn Töpel <[email protected]> wrote:
>> >>
>> >> Björn Töpel <[email protected]> writes:
>> >>
>> >> > Anup Patel <[email protected]> writes:
>> >> >
>> >> >> On Fri, Oct 20, 2023 at 10:07 PM Björn Töpel <[email protected]> wrote:
>> >> >>>
>> >> >>> Anup Patel <[email protected]> writes:
>> >> >>>
>> >> >>> > On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <[email protected]> wrote:
>> >> >>> >>
>> >> >>> >> Anup Patel <[email protected]> writes:
>> >> >>> >>
>> >> >>> >> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <[email protected]> wrote:
>> >> >>> >> >>
>> >> >>> >> >> Thanks for the quick reply!
>> >> >>> >> >>
>> >> >>> >> >> Anup Patel <[email protected]> writes:
>> >> >>> >> >>
>> >> >>> >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <[email protected]> wrote:
>> >> >>> >> >> >>
>> >> >>> >> >> >> Hi Anup,
>> >> >>> >> >> >>
>> >> >>> >> >> >> Anup Patel <[email protected]> writes:
>> >> >>> >> >> >>
>> >> >>> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
>> >> >>> >> >> >> > process. The latest ratified AIA specifcation can be found at:
>> >> >>> >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
>> >> >>> >> >> >> >
>> >> >>> >> >> >> > At a high-level, the AIA specification adds three things:
>> >> >>> >> >> >> > 1) AIA CSRs
>> >> >>> >> >> >> > - Improved local interrupt support
>> >> >>> >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
>> >> >>> >> >> >> > - Per-HART MSI controller
>> >> >>> >> >> >> > - Support MSI virtualization
>> >> >>> >> >> >> > - Support IPI along with virtualization
>> >> >>> >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
>> >> >>> >> >> >> > - Wired interrupt controller
>> >> >>> >> >> >> > - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
>> >> >>> >> >> >> > - In Direct-mode, injects external interrupts directly into HARTs
>> >> >>> >> >> >>
>> >> >>> >> >> >> Thanks for working on the AIA support! I had a look at the series, and
>> >> >>> >> >> >> have some concerns about interrupt ID abstraction.
>> >> >>> >> >> >>
>> >> >>> >> >> >> A bit of background, for readers not familiar with the AIA details.
>> >> >>> >> >> >>
>> >> >>> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
>> >> >>> >> >> >> each MSI is dedicated to a certain hart. The series takes the approach
>> >> >>> >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
>> >> >>> >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
>> >> >>> >> >> >> slice only *one* msi-irq is acutally used.
>> >> >>> >> >> >>
>> >> >>> >> >> >> This scheme makes affinity changes more robust, because the interrupt
>> >> >>> >> >> >> sources on "other" harts are pre-allocated. On the other hand it
>> >> >>> >> >> >> requires to propagate irq masking to other harts via IPIs (this is
>> >> >>> >> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
>> >> >>> >> >> >> are hogged, and cannot be used.
>> >> >>> >> >> >>
>> >> >>> >> >> >> Contemporary storage/networking drivers usually uses queues per core
>> >> >>> >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
>> >> >>> >> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
>> >> >>> >> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
>> >> >>> >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
>> >> >>> >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current
>> >> >>> >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5.
>> >> >>> >> >> >>
>> >> >>> >> >> >> On the plus side:
>> >> >>> >> >> >> * Changing interrupts affinity will never fail, because the interrupts
>> >> >>> >> >> >> on each hart is pre-allocated.
>> >> >>> >> >> >>
>> >> >>> >> >> >> On the negative side:
>> >> >>> >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of
>> >> >>> >> >> >> interrupts. Especially for many core systems.
>> >> >>> >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no
>> >> >>> >> >> >> broadcast csr in IMSIC), and a more complex locking scheme IMSIC
>> >> >>> >> >> >>
>> >> >>> >> >> >> Summary:
>> >> >>> >> >> >> The current series caps the number of global interrupts to maximum
>> >> >>> >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
>> >> >>> >> >> >> to expose 2047 * #harts unique MSIs.
>> >> >>> >> >> >>
>> >> >>> >> >> >> I think this could simplify/remove(?) the locking as well.
>> >> >>> >> >> >
>> >> >>> >> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
>> >> >>> >> >> > 1) The irq_set_affinity() does not work for MSIs because each
>> >> >>> >> >> > IRQ is not tied to a particular HART. This means we can't
>> >> >>> >> >> > balance the IRQ processing load among HARTs.
>> >> >>> >> >>
>> >> >>> >> >> Yes, you can balance. In your code, each *active* MSI is still
>> >> >>> >> >> bound/active to a specific hard together with the affinity mask. In an
>> >> >>> >> >> 1-1 model you would still need to track the affinity mask, but the
>> >> >>> >> >> irq_set_affinity() would be different. It would try to allocate a new
>> >> >>> >> >> MSI from the target CPU, and then switch to having that MSI active.
>> >> >>> >> >>
>> >> >>> >> >> That's what x86 does AFAIU, which is also constrained by the # of
>> >> >>> >> >> available MSIs.
>> >> >>> >> >>
>> >> >>> >> >> The downside, as I pointed out, is that the set affinity action can
>> >> >>> >> >> fail for a certain target CPU.
>> >> >>> >> >
>> >> >>> >> > Yes, irq_set_affinity() can fail for the suggested approach plus for
>> >> >>> >> > RISC-V AIA, one HART does not have access to other HARTs
>> >> >>> >> > MSI enable/disable bits so the approach will also involve IPI.
>> >> >>> >>
>> >> >>> >> Correct, but the current series does a broadcast to all cores, where the
>> >> >>> >> 1-1 approach is at most an IPI to a single core.
>> >> >>> >>
>> >> >>> >> 128+c machines are getting more common, and you have devices that you
>> >> >>> >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
>> >> >>> >> dealing with a per-core activity is a pretty noisy neighbor.
>> >> >>> >
>> >> >>> > Broadcast IPI in the current approach is only done upon MSI mask/unmask
>> >> >>> > operation. It is not done upon set_affinity() of interrupt handling.
>> >> >>>
>> >> >>> I'm aware. We're on the same page here.
>> >> >>>
>> >> >>> >>
>> >> >>> >> This could be fixed in the existing 1-n approach, by not require to sync
>> >> >>> >> the cores that are not handling the MSI in question. "Lazy disable"
>> >> >>> >
>> >> >>> > Incorrect. The approach you are suggesting involves an IPI upon every
>> >> >>> > irq_set_affinity(). This is because a HART can only enable it's own
>> >> >>> > MSI ID so when an IRQ is moved to from HART A to HART B with
>> >> >>> > a different ID X on HART B then we will need an IPI in irq_set_affinit()
>> >> >>> > to enable ID X on HART B.
>> >> >>>
>> >> >>> Yes, the 1-1 approach will require an IPI to one target cpu on affinity
>> >> >>> changes, and similar on mask/unmask.
>> >> >>>
>> >> >>> The 1-n approach, require no-IPI on affinity changes (nice!), but IPI
>> >> >>> broadcast to all cores on mask/unmask (not so nice).
>> >> >>>
>> >> >>> >> >> My concern is interrupts become a scarce resource with this
>> >> >>> >> >> implementation, but maybe my view is incorrect. I've seen bare-metal
>> >> >>> >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
>> >> >>> >> >> that is considered "a lot of interrupts".
>> >> >>> >> >>
>> >> >>> >> >> As long as we don't get into scenarios where we're running out of
>> >> >>> >> >> interrupts, due to the software design.
>> >> >>> >> >>
>> >> >>> >> >
>> >> >>> >> > The current approach is simpler and ensures irq_set_affinity
>> >> >>> >> > always works. The limit of max 2047 IDs is sufficient for many
>> >> >>> >> > systems (if not all).
>> >> >>> >>
>> >> >>> >> Let me give you another view. On a 128c system each core has ~16 unique
>> >> >>> >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
>> >> >>> >> network queue pairs for each PF.
>> >> >>> >
>> >> >>> > Clearly, this example is a hypothetical and represents a poorly
>> >> >>> > designed platform.
>> >> >>> >
>> >> >>> > Having just 16 IDs per-Core is a very poor design choice. In fact, the
>> >> >>> > Server SoC spec mandates a minimum 255 IDs.
>> >> >>>
>> >> >>> You are misreading. A 128c system with 2047 MSIs per-core, will only
>> >> >>> have 16 *per-core unique* (2047/128) interrupts with the current series.
>> >> >>>
>> >> >>> I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c
>> >> >>> system with the maximum amount of MSIs possible in the spec, you'll end
>> >> >>> up with 16 *unique* interrupts per core.
>> >> >>
>> >> >> -ENOPARSE
>> >> >>
>> >> >> I don't see how this applies to the current approach because we treat
>> >> >> MSI ID space as global across cores so if a system has 2047 MSIs
>> >> >> per-core then we have 2047 MSIs across all cores.
>> >> >
>> >> > Ok, I'll try again! :-)
>> >> >
>> >> > Let's assume that each core in the 128c system has some per-core
>> >> > resources, say a two NIC queue pairs, and a storage queue pair. This
>> >> > will consume, e.g., 2*2 + 2 (6) MSI sources from the global namespace.
>> >> >
>> >> > If each core does this it'll be 6*128 MSI sources of the global
>> >> > namespace.
>> >> >
>> >> > The maximum number of "privates" MSI sources a core can utilize is 16.
>> >> >
>> >> > I'm trying (it's does seem to go that well ;-)) to point out that it's
>> >> > only 16 unique sources per core. For, say, a 256 core system it would be
>> >> > 8. 2047 MSI sources in a system is not much.
>> >> >
>> >> > Say that I want to spin up 24 NIC queues with one MSI each on each core
>> >> > on my 128c system. That's not possible with this series, while with an
>> >> > 1-1 system it wouldn't be an issue.
>> >> >
>> >> > Clearer, or still weird?
>> >> >
>> >> >>
>> >> >>>
>> >> >>> > Regarding NICs which support a large number of queues, the driver
>> >> >>> > will typically enable only one queue per-core and set the affinity to
>> >> >>> > separate cores. We have user-space data plane applications based
>> >> >>> > on DPDK which are capable of using a large number of NIC queues
>> >> >>> > but these applications are polling based and don't use MSIs.
>> >> >>>
>> >> >>> That's one sample point, and clearly not the only one. There are *many*
>> >> >>> different usage models. Just because you *assign* MSI, doesn't mean they
>> >> >>> are firing all the time.
>> >> >>>
>> >> >>> I can show you a couple of networking setups where this is clearly not
>> >> >>> enough. Each core has a large number of QoS queues, and each queue would
>> >> >>> very much like to have a dedicated MSI.
>> >> >>>
>> >> >>> >> > When we encounter a system requiring a large number of MSIs,
>> >> >>> >> > we can either:
>> >> >>> >> > 1) Extend the AIA spec to support greater than 2047 IDs
>> >> >>> >> > 2) Re-think the approach in the IMSIC driver
>> >> >>> >> >
>> >> >>> >> > The choice between #1 and #2 above depends on the
>> >> >>> >> > guarantees we want for irq_set_affinity().
>> >> >>> >>
>> >> >>> >> The irq_set_affinity() behavior is better with this series, but I think
>> >> >>> >> the other downsides: number of available interrupt sources, and IPI
>> >> >>> >> broadcast are worse.
>> >> >>> >
>> >> >>> > The IPI overhead in the approach you are suggesting will be
>> >> >>> > even bad compared to the IPI overhead of the current approach
>> >> >>> > because we will end-up doing IPI upon every irq_set_affinity()
>> >> >>> > in the suggested approach compared to doing IPI upon every
>> >> >>> > mask/unmask in the current approach.
>> >> >>>
>> >> >>> Again, very workload dependent.
>> >> >>>
>> >> >>> This series does IPI broadcast on masking/unmasking, which means that
>> >> >>> cores that don't care get interrupted because, say, a network queue-pair
>> >> >>> is setup on another core.
>> >> >>>
>> >> >>> Some workloads never change the irq affinity.
>> >> >>
>> >> >> There are various events which irq affinity such as irq balance,
>> >> >> CPU hotplug, system suspend, etc.
>> >> >>
>> >> >> Also, the 1-1 approach does IPI upon set_affinity, mask and
>> >> >> unmask whereas the 1-n approach does IPI only upon mask
>> >> >> and unmask.
>> >> >
>> >> > An important distinction; When you say IPI on mask/unmask it is a
>> >> > broadcast IPI to *all* cores, which is pretty instrusive.
>> >> >
>> >> > The 1-1 variant does an IPI to a *one* target core.
>> >> >
>> >> >>> I'm just pointing out that there are pro/cons with both variants.
>> >> >>>
>> >> >>> > The biggest advantage of the current approach is a reliable
>> >> >>> > irq_set_affinity() which is a very valuable thing to have.
>> >> >>>
>> >> >>> ...and I'm arguing that we're paying a big price for that.
>> >> >>>
>> >> >>> > ARM systems easily support a large number of LPIs per-core.
>> >> >>> > For example, GIC-700 supports 56000 LPIs per-core.
>> >> >>> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)
>> >> >>>
>> >> >>> Yeah, but this is not the GIC. This is something that looks more like
>> >> >>> the x86 world. We'll be stuck with a lot of implementations with AIA 1.0
>> >> >>> spec, and many cores.
>> >> >>
>> >> >> Well, RISC-V AIA is neigher ARM GIG not x86 APIC. All I am saying
>> >> >> is that there are systems with large number per-core interrupt IDs
>> >> >> for handling MSIs.
>> >> >
>> >> > Yes, and while that is nice, it's not what IMSIC is.
>> >>
>> >> Some follow-ups, after thinking more about it more over the weekend.
>> >>
>> >> * Do one really need an IPI for irq_set_affinity() for the 1-1 model?
>> >> Why touch the enable/disable bits when moving interrupts?
>> >
>> > In the 1-1 model, the ID on the current HART and the ID on the target
>> > HART of irq_set_affinity() will be different, so we can't leave the
>> > unused ID on the current HART enabled because it can lead to spurious
>> > interrupts when that ID is re-used for some other device.
>>
>> Hmm, is this really an actual problem, or a theoretical one? The
>> implementation needs to track what's in use, so can we ever get into
>> this situation?
>
> As of now, it is theoretical but it is certainly possible to hit this issue.

Sorry for being slow here, Anup, but could you give an example of how
this could happen? To me it sounds like it could only be caused by a
broken (buggy) implementation?

>> Somewhat related; I had a similar question for imsic_pci_{un,}mask_irq()
>> -- why not just do the default mask operation (pci_msi_{un,}mask_irq())
>> instead of propagating to the IMSIC mask/unmask?
>
> We have a hierarchical IMSIC PCI irq domain whose parent irq domain
> is the IMSIC base domain. Unfortunately, pci_msi_[un]mask_irq() don't
> work for hierarchical irq domains.

Ok! Thanks for the explanation!

>> > There is also a possibility of receiving an interrupt while the ID is
>> > being moved to a new target HART, in which case we have to detect this
>> > and re-trigger the interrupt on the new target HART. In fact, the x86
>> > APIC does an IPI to take care of this case.
>>
>> This case I get, and the implementation can track that both are in use.
>> It's the spurious one that I'm dubious of (don't get).
>>
>> >>
>> >> * In my book the IMSIC looks very much like the x86 LAPIC, which also
>> >> has few interrupts (IMSIC <2048, LAPIC 256). The IRQ matrix allocator
>> >> [1], and a scheme similar to LAPIC [2] would be a good fit. This is
>> >> the 1-1 model, but more sophisticated than what I've been describing
>> >> (e.g. properly handling mangaged/regular irqs). As a bonus we would
>> >> get the IRQ matrix debugfs/tracepoint support.
>> >>
>> >
>> > Yes, I have been evaluating the 1-1 model for the past few days. I also
>> > have a working implementation with a simple per-CPU bitmap based
>> > allocator which handles both legacy MSI (block of 1,2,4,8,16, or 32 IDs)
>> > and MSI-X.
>> >
>> > The irq matrix allocator needs to be improved to handle legacy MSI,
>> > so initially I will post a v11 series which works for me; converging
>> > with the irq matrix allocator can be future work.
>>
>> What's missing/needs to be improved for legacy MSI (legacy MSI ==
>> !MSI-X, right?) in the matrix allocator?
>
> For legacy MSI, a block of IDs needs to be contiguous and the number
> of IDs must be a power of 2 (up to 32).

Oh, so this is not supported by the matrix allocator?


Björn

2023-10-23 17:26:22

by Anup Patel

[permalink] [raw]
Subject: Re: [PATCH v10 00/15] Linux RISC-V AIA Support

On Mon, Oct 23, 2023 at 9:15 PM Björn Töpel <[email protected]> wrote:
>
> Anup Patel <[email protected]> writes:
>
> > On Mon, Oct 23, 2023 at 7:37 PM Björn Töpel <[email protected]> wrote:
> >>
> >> Anup Patel <[email protected]> writes:
> >>
> >> > On Mon, Oct 23, 2023 at 12:32 PM Björn Töpel <[email protected]> wrote:
> >> >>
> >> >> Björn Töpel <[email protected]> writes:
> >> >>
> >> >> > Anup Patel <[email protected]> writes:
> >> >> >
> >> >> >> On Fri, Oct 20, 2023 at 10:07 PM Björn Töpel <[email protected]> wrote:
> >> >> >>>
> >> >> >>> Anup Patel <[email protected]> writes:
> >> >> >>>
> >> >> >>> > On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <[email protected]> wrote:
> >> >> >>> >>
> >> >> >>> >> Anup Patel <[email protected]> writes:
> >> >> >>> >>
> >> >> >>> >> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <[email protected]> wrote:
> >> >> >>> >> >>
> >> >> >>> >> >> Thanks for the quick reply!
> >> >> >>> >> >>
> >> >> >>> >> >> Anup Patel <[email protected]> writes:
> >> >> >>> >> >>
> >> >> >>> >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <[email protected]> wrote:
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> Hi Anup,
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> Anup Patel <[email protected]> writes:
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
> >> >> >>> >> >> >> > process. The latest ratified AIA specifcation can be found at:
> >> >> >>> >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
> >> >> >>> >> >> >> >
> >> >> >>> >> >> >> > At a high-level, the AIA specification adds three things:
> >> >> >>> >> >> >> > 1) AIA CSRs
> >> >> >>> >> >> >> > - Improved local interrupt support
> >> >> >>> >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
> >> >> >>> >> >> >> > - Per-HART MSI controller
> >> >> >>> >> >> >> > - Support MSI virtualization
> >> >> >>> >> >> >> > - Support IPI along with virtualization
> >> >> >>> >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
> >> >> >>> >> >> >> > - Wired interrupt controller
> >> >> >>> >> >> >> > - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
> >> >> >>> >> >> >> > - In Direct-mode, injects external interrupts directly into HARTs
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> Thanks for working on the AIA support! I had a look at the series, and
> >> >> >>> >> >> >> have some concerns about interrupt ID abstraction.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> A bit of background, for readers not familiar with the AIA details.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and
> >> >> >>> >> >> >> each MSI is dedicated to a certain hart. The series takes the approach
> >> >> >>> >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally.
> >> >> >>> >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the
> >> >> >>> >> >> >> slice only *one* msi-irq is acutally used.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> This scheme makes affinity changes more robust, because the interrupt
> >> >> >>> >> >> >> sources on "other" harts are pre-allocated. On the other hand it
> >> >> >>> >> >> >> requires to propagate irq masking to other harts via IPIs (this is
> >> >> >>> >> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs
> >> >> >>> >> >> >> are hogged, and cannot be used.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> Contemporary storage/networking drivers usually uses queues per core
> >> >> >>> >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs.
> >> >> >>> >> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of
> >> >> >>> >> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be
> >> >> >>> >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device
> >> >> >>> >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current
> >> >> >>> >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> On the plus side:
> >> >> >>> >> >> >> * Changing interrupts affinity will never fail, because the interrupts
> >> >> >>> >> >> >> on each hart is pre-allocated.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> On the negative side:
> >> >> >>> >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of
> >> >> >>> >> >> >> interrupts. Especially for many core systems.
> >> >> >>> >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no
> >> >> >>> >> >> >> broadcast csr in IMSIC), and a more complex locking scheme IMSIC
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> Summary:
> >> >> >>> >> >> >> The current series caps the number of global interrupts to maximum
> >> >> >>> >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
> >> >> >>> >> >> >> to expose 2047 * #harts unique MSIs.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> I think this could simplify/remove(?) the locking as well.
> >> >> >>> >> >> >
> >> >> >>> >> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
> >> >> >>> >> >> > 1) The irq_set_affinity() does not work for MSIs because each
> >> >> >>> >> >> > IRQ is not tied to a particular HART. This means we can't
> >> >> >>> >> >> > balance the IRQ processing load among HARTs.
> >> >> >>> >> >>
> >> >> >>> >> >> Yes, you can balance. In your code, each *active* MSI is still
> >> >> >>> >> >> bound/active to a specific hard together with the affinity mask. In an
> >> >> >>> >> >> 1-1 model you would still need to track the affinity mask, but the
> >> >> >>> >> >> irq_set_affinity() would be different. It would try to allocate a new
> >> >> >>> >> >> MSI from the target CPU, and then switch to having that MSI active.
> >> >> >>> >> >>
> >> >> >>> >> >> That's what x86 does AFAIU, which is also constrained by the # of
> >> >> >>> >> >> available MSIs.
> >> >> >>> >> >>
> >> >> >>> >> >> The downside, as I pointed out, is that the set affinity action can
> >> >> >>> >> >> fail for a certain target CPU.
> >> >> >>> >> >
> >> >> >>> >> > Yes, irq_set_affinity() can fail for the suggested approach plus for
> >> >> >>> >> > RISC-V AIA, one HART does not have access to other HARTs
> >> >> >>> >> > MSI enable/disable bits so the approach will also involve IPI.
> >> >> >>> >>
> >> >> >>> >> Correct, but the current series does a broadcast to all cores, where the
> >> >> >>> >> 1-1 approach is at most an IPI to a single core.
> >> >> >>> >>
> >> >> >>> >> 128+c machines are getting more common, and you have devices that you
> >> >> >>> >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
> >> >> >>> >> dealing with a per-core activity is a pretty noisy neighbor.
> >> >> >>> >
> >> >> >>> > Broadcast IPI in the current approach is only done upon MSI mask/unmask
> >> >> >>> > operation. It is not done upon set_affinity() of interrupt handling.
> >> >> >>>
> >> >> >>> I'm aware. We're on the same page here.
> >> >> >>>
> >> >> >>> >>
> >> >> >>> >> This could be fixed in the existing 1-n approach, by not require to sync
> >> >> >>> >> the cores that are not handling the MSI in question. "Lazy disable"
> >> >> >>> >
> >> >> >>> > Incorrect. The approach you are suggesting involves an IPI upon every
> >> >> >>> > irq_set_affinity(). This is because a HART can only enable it's own
> >> >> >>> > MSI ID so when an IRQ is moved to from HART A to HART B with
> >> >> >>> > a different ID X on HART B then we will need an IPI in irq_set_affinit()
> >> >> >>> > to enable ID X on HART B.
> >> >> >>>
> >> >> >>> Yes, the 1-1 approach will require an IPI to one target cpu on affinity
> >> >> >>> changes, and similar on mask/unmask.
> >> >> >>>
> >> >> >>> The 1-n approach, require no-IPI on affinity changes (nice!), but IPI
> >> >> >>> broadcast to all cores on mask/unmask (not so nice).
> >> >> >>>
> >> >> >>> >> >> My concern is interrupts become a scarce resource with this
> >> >> >>> >> >> implementation, but maybe my view is incorrect. I've seen bare-metal
> >> >> >>> >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
> >> >> >>> >> >> that is considered "a lot of interrupts".
> >> >> >>> >> >>
> >> >> >>> >> >> As long as we don't get into scenarios where we're running out of
> >> >> >>> >> >> interrupts, due to the software design.
> >> >> >>> >> >>
> >> >> >>> >> >
> >> >> >>> >> > The current approach is simpler and ensures irq_set_affinity
> >> >> >>> >> > always works. The limit of at most 2047 IDs is sufficient for many
> >> >> >>> >> > systems (if not all).
> >> >> >>> >>
> >> >> >>> >> Let me give you another view. On a 128c system each core has ~16 unique
> >> >> >>> >> interrupts at its disposal. E.g. the Intel E800 NIC has more than 2048
> >> >> >>> >> network queue pairs for each PF.
> >> >> >>> >
> >> >> >>> > Clearly, this example is hypothetical and represents a poorly
> >> >> >>> > designed platform.
> >> >> >>> >
> >> >> >>> > Having just 16 IDs per-Core is a very poor design choice. In fact, the
> >> >> >>> > Server SoC spec mandates a minimum 255 IDs.
> >> >> >>>
> >> >> >>> You are misreading. A 128c system with 2047 MSIs per-core will only
> >> >> >>> have 16 *per-core unique* (2047/128) interrupts with the current series.
> >> >> >>>
> >> >> >>> I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c
> >> >> >>> system with the maximum amount of MSIs possible in the spec, you'll end
> >> >> >>> up with 16 *unique* interrupts per core.
> >> >> >>
> >> >> >> -ENOPARSE
> >> >> >>
> >> >> >> I don't see how this applies to the current approach because we treat
> >> >> >> the MSI ID space as global across cores, so if a system has 2047 MSIs
> >> >> >> per-core then we have 2047 MSIs across all cores.
> >> >> >
> >> >> > Ok, I'll try again! :-)
> >> >> >
> >> >> > Let's assume that each core in the 128c system has some per-core
> >> >> > resources, say two NIC queue pairs and a storage queue pair. This
> >> >> > will consume, e.g., 2*2 + 2 = 6 MSI sources from the global namespace.
> >> >> >
> >> >> > If each core does this, it'll be 6*128 = 768 MSI sources of the global
> >> >> > namespace.
> >> >> >
> >> >> > The maximum number of "private" MSI sources a core can utilize is 16.
> >> >> >
> >> >> > I'm trying (it doesn't seem to go that well ;-)) to point out that it's
> >> >> > only 16 unique sources per core. For, say, a 256-core system it would be
> >> >> > 8. 2047 MSI sources in a system is not much.
> >> >> >
> >> >> > Say that I want to spin up 24 NIC queues with one MSI each on each core
> >> >> > on my 128c system. That's not possible with this series, while with a
> >> >> > 1-1 system it wouldn't be an issue.
> >> >> >
> >> >> > Clearer, or still weird?
> >> >> >
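
To put rough numbers on the scenario quoted above (assuming the 128c
example and the spec maximum of 2047 IDs):

  2047 / 128 ~= 16 unique MSIs per core with a global ID space
  6 * 128     = 768 of the 2047 IDs consumed by the per-core basics
  24 * 128    = 3072 IDs needed for 24 per-core NIC queues, > 2047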
> >> >> >>
> >> >> >>>
> >> >> >>> > Regarding NICs which support a large number of queues, the driver
> >> >> >>> > will typically enable only one queue per-core and set the affinity to
> >> >> >>> > separate cores. We have user-space data plane applications based
> >> >> >>> > on DPDK which are capable of using a large number of NIC queues
> >> >> >>> > but these applications are polling based and don't use MSIs.
> >> >> >>>
> >> >> >>> That's one sample point, and clearly not the only one. There are *many*
> >> >> >>> different usage models. Just because you *assign* MSIs doesn't mean they
> >> >> >>> are firing all the time.
> >> >> >>>
> >> >> >>> I can show you a couple of networking setups where this is clearly not
> >> >> >>> enough. Each core has a large number of QoS queues, and each queue would
> >> >> >>> very much like to have a dedicated MSI.
> >> >> >>>
> >> >> >>> >> > When we encounter a system requiring a large number of MSIs,
> >> >> >>> >> > we can either:
> >> >> >>> >> > 1) Extend the AIA spec to support greater than 2047 IDs
> >> >> >>> >> > 2) Re-think the approach in the IMSIC driver
> >> >> >>> >> >
> >> >> >>> >> > The choice between #1 and #2 above depends on the
> >> >> >>> >> > guarantees we want for irq_set_affinity().
> >> >> >>> >>
> >> >> >>> >> The irq_set_affinity() behavior is better with this series, but I think
> >> >> >>> >> the other downsides: number of available interrupt sources, and IPI
> >> >> >>> >> broadcast are worse.
> >> >> >>> >
> >> >> >>> > The IPI overhead in the approach you are suggesting will be
> >> >> >>> > even worse compared to the IPI overhead of the current approach,
> >> >> >>> > because we will end up doing an IPI upon every irq_set_affinity()
> >> >> >>> > in the suggested approach, compared to doing an IPI upon every
> >> >> >>> > mask/unmask in the current approach.
> >> >> >>>
> >> >> >>> Again, very workload dependent.
> >> >> >>>
> >> >> >>> This series does IPI broadcast on masking/unmasking, which means that
> >> >> >>> cores that don't care get interrupted because, say, a network queue-pair
> >> >> >>> is set up on another core.
> >> >> >>>
> >> >> >>> Some workloads never change the irq affinity.
> >> >> >>
> >> >> >> There are various events which change irq affinity, such as irq
> >> >> >> balancing, CPU hotplug, system suspend, etc.
> >> >> >>
> >> >> >> Also, the 1-1 approach does an IPI upon set_affinity, mask, and
> >> >> >> unmask, whereas the 1-n approach does an IPI only upon mask
> >> >> >> and unmask.
> >> >> >
> >> >> > An important distinction: when you say IPI on mask/unmask, it is a
> >> >> > broadcast IPI to *all* cores, which is pretty intrusive.
> >> >> >
> >> >> > The 1-1 variant does an IPI to *one* target core.
> >> >> >
> >> >> >>> I'm just pointing out that there are pro/cons with both variants.
> >> >> >>>
> >> >> >>> > The biggest advantage of the current approach is a reliable
> >> >> >>> > irq_set_affinity() which is a very valuable thing to have.
> >> >> >>>
> >> >> >>> ...and I'm arguing that we're paying a big price for that.
> >> >> >>>
> >> >> >>> > ARM systems easily support a large number of LPIs per-core.
> >> >> >>> > For example, GIC-700 supports 56000 LPIs per-core.
> >> >> >>> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)
> >> >> >>>
> >> >> >>> Yeah, but this is not the GIC. This is something that looks more like
> >> >> >>> the x86 world. We'll be stuck with a lot of implementations of the AIA 1.0
> >> >> >>> spec, and many cores.
> >> >> >>
> >> >> >> Well, RISC-V AIA is neither ARM GIC nor x86 APIC. All I am saying
> >> >> >> is that there are systems with a large number of per-core interrupt IDs
> >> >> >> for handling MSIs.
> >> >> >
> >> >> > Yes, and while that is nice, it's not what IMSIC is.
> >> >>
> >> >> Some follow-ups, after thinking about it more over the weekend.
> >> >>
> >> >> * Does one really need an IPI for irq_set_affinity() in the 1-1 model?
> >> >> Why touch the enable/disable bits when moving interrupts?
> >> >
> >> > In the 1-1 model, the IDs on the current HART and the target HART will
> >> > differ upon irq_set_affinity(), so we can't leave the unused ID on the
> >> > current HART enabled, because it can lead to spurious interrupts
> >> > when the ID on the current HART is re-used for some other device.
> >>
> >> Hmm, is this really an actual problem, or a theoretical one? The
> >> implementation needs to track what's in-use, so can we ever get into this
> >> situation?
> >
> > As of now, it is theoretical but it is certainly possible to hit this issue.
>
> Sorry for being slow here, Anup, but could you give an example how this
> could happen? For me it sounds like this could only be caused by a
> broken (buggy) implementation?

Let me re-state the problem: say ID X on HART A is moved to ID Y
on HART B. The movement might have been initiated by some
HART C doing irq_set_affinity(). Once the movement is done,
ID X on HART A should be disabled, because if it is not disabled then
some device (possibly buggy) can still make HART A take an MSI with
ID X, but this MSI won't map to any Linux IRQ since it has already
been moved to ID Y on HART B.
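
In code, the tail end of such a move would look roughly like this
(hypothetical helper names, only to illustrate why the IPI to HART A
is unavoidable):

	/* HART C has re-pointed the device at (HART B, ID Y). Only
	 * HART A can clear ID X's enable bit in its own interrupt
	 * file, hence the IPI; otherwise a stale or buggy write of
	 * ID X would be taken on HART A with no Linux IRQ behind it. */
	imsic_msi_update_msg(d, hart_b, id_y);
	smp_call_function_single(hart_a, imsic_local_disable_id,
				 (void *)(unsigned long)id_x, 1);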

>
> >> Somewhat related; I had a similar question for imsic_pci_{un,}mask_irq()
> >> -- why not only do the default mask operation (only
> >> pci_msi_{un,}mask_irq()), but instead propagate to the IMSIC
> >> mask/unmask?
> >
> > We have a hierarchical IMSIC PCI irq domain whose parent irq domain
> > is the IMSIC base domain. Unfortunately, pci_msi_[un]mask_irq() alone
> > doesn't work for a hierarchical irq domain.
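
So the mask callback has to do both by hand, roughly like this sketch
(pci_msi_mask_irq() and irq_chip_mask_parent() are existing kernel
helpers; the function body here is only illustrative):

#include <linux/msi.h>
#include <linux/irq.h>

static void imsic_pci_mask_irq(struct irq_data *d)
{
	/* Mask at the PCI MSI capability level ... */
	pci_msi_mask_irq(d);
	/* ... and also propagate to the parent (IMSIC base) domain,
	 * which pci_msi_mask_irq() alone would never reach. */
	irq_chip_mask_parent(d);
}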
>
> Ok! Thanks for the explanation!
>
> >> > There is also a possibility of receiving an interrupt while the ID is
> >> > being moved to a new target HART, in which case we have to detect this
> >> > and re-trigger the interrupt on the new target HART. In fact, the x86
> >> > APIC does an IPI to take care of this case.
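
A sketch of that detect-and-retrigger step (helper names assumed; per
the AIA spec an MSI is raised by writing the interrupt identity into
the target interrupt file, so the replay is an ordinary MMIO write):

	/* If the old ID was already pending on the old hart when the
	 * move completed, replay it at the new location. */
	if (imsic_local_id_pending(old_cpu, old_id))
		writel(new_id, imsic_intfile_base(new_cpu));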
> >>
> >> This case I get, and the implementation can track that both are in use.
> >> It's the spurious one that I'm dubious of (don't get).
> >>
> >> >>
> >> >> * In my book the IMSIC looks very much like the x86 LAPIC, which also
> >> >> has few interrupts (IMSIC <2048, LAPIC 256). The IRQ matrix allocator
> >> >> [1], and a scheme similar to LAPIC [2] would be a good fit. This is
> >> >> the 1-1 model, but more sophisticated than what I've been describing
> >> >> (e.g. properly handling managed/regular irqs). As a bonus we would
> >> >> get the IRQ matrix debugfs/tracepoint support.
> >> >>
> >> >
> >> > Yes, I have been evaluating the 1-1 model for the past few days. I also
> >> > have a working implementation with a simple per-CPU bitmap-based
> >> > allocator which handles both legacy MSI (block of 1,2,4,8,16, or 32 IDs)
> >> > and MSI-X.
> >> >
> >> > The irq matrix allocator needs to be improved for handling legacy MSI,
> >> > so initially I will post a v11 series which works for me; converging
> >> > with the irq matrix allocator can be future work.
> >>
> >> What's missing/needs to be improved for legacy MSI (legacy MSI ==
> >> !MSI-X, right?) in the matrix allocator?
> >
> > For legacy MSI, a block of IDs needs to be contiguous and the number
> > of IDs must be a power of 2.
>
> Oh, so this is not supported by the matrix allocator?

Yes, that's correct. The IRQ matrix allocator only supports allocating
IDs suitable for MSI-X. Improving the IRQ matrix allocator is a
separate effort in my opinion.
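
For reference, a minimal sketch of such a per-CPU allocation (assuming
a per-CPU ids_bitmap protected by the caller): legacy MSI needs the
block contiguous, power-of-two sized (1..32), and naturally aligned,
which bitmap_find_next_zero_area() can express via its align_mask:

#include <linux/bitmap.h>
#include <linux/errno.h>

static int imsic_ids_alloc(unsigned long *ids_bitmap,
			   unsigned int nr_ids, unsigned int nr_irqs)
{
	unsigned long id;

	/* align_mask = nr_irqs - 1 gives the natural alignment that
	 * multi-message MSI requires; MSI-X is simply nr_irqs == 1. */
	id = bitmap_find_next_zero_area(ids_bitmap, nr_ids, 0,
					nr_irqs, nr_irqs - 1);
	if (id >= nr_ids)
		return -ENOSPC;

	bitmap_set(ids_bitmap, id, nr_irqs);
	return id;
}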

The ARM GICv3/v4 driver supports both MSI and MSI-X.

Regards,
Anup