2024-05-03 16:12:58

by Tomasz Jeznach

Subject: [PATCH v4 0/7] Linux RISC-V IOMMU Support

This patch series introduces support for RISC-V IOMMU architected
hardware into the Linux kernel.

The RISC-V IOMMU specification, which this series is based on, is
ratified and available at GitHub/riscv-non-isa [1].

At a high level, the RISC-V IOMMU specification defines:

1) Data structures:
- Device-context: Associates devices with address spaces and holds
per-device parameters for address translations.
- Process-contexts: Associates different virtual address spaces based
on device-provided process identification numbers.
- MSI page table configuration used to direct an MSI to a guest
interrupt file in an IMSIC.
2) In-memory queue interface:
- Command-queue for issuing commands to the IOMMU.
- Fault/event queue for reporting faults and events.
- Page-request queue for reporting "Page Request" messages received
from PCIe devices.
- Message-signaled and wire-signaled interrupt mechanisms.
3) Memory-mapped programming interface:
- Mandatory and optional register layout and description.
- Software guidelines for device initialization and capabilities discovery.
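
As a rough illustration of the in-memory queue model in item 2, the sketch
below shows a power-of-two ring indexed by head and tail, with the producer
advancing the tail and the consumer advancing the head. The struct and
function names are hypothetical, and the "one slot stays free" full condition
is an assumption of this sketch. It is not code from this series, and the
specification remains the authority on the exact queue rules.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software view of one queue; not taken from the driver. */
struct ring {
	uint32_t head;	/* consumer index */
	uint32_t tail;	/* producer index */
	uint32_t mask;	/* entries - 1; entries is a power of two */
};

static inline void ring_init(struct ring *r, unsigned int log2_entries)
{
	r->head = 0;
	r->tail = 0;
	r->mask = (1u << log2_entries) - 1;
}

/* Empty when both indices are equal. */
static inline bool ring_empty(const struct ring *r)
{
	return r->head == r->tail;
}

/* Full when advancing the tail would make it equal to the head. */
static inline bool ring_full(const struct ring *r)
{
	return ((r->tail + 1) & r->mask) == r->head;
}

/* Producer: claim the slot at the tail; returns its index or -1 if full. */
static inline int ring_produce(struct ring *r)
{
	uint32_t slot;

	if (ring_full(r))
		return -1;
	slot = r->tail;
	r->tail = (r->tail + 1) & r->mask;
	return (int)slot;
}

/* Consumer: release the slot at the head; returns its index or -1 if empty. */
static inline int ring_consume(struct ring *r)
{
	uint32_t slot;

	if (ring_empty(r))
		return -1;
	slot = r->head;
	r->head = (r->head + 1) & r->mask;
	return (int)slot;
}
```

With ring_init(&r, 2) the ring has four slots, of which three can be in
flight at once under the assumed full condition.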


This series introduces RISC-V IOMMU hardware initialization and complete
single-stage translation with paging domain support.

The patches are organized as follows:

Patch 1: Introduces minimal required device tree bindings for the driver.
Patch 2: Defines RISC-V IOMMU data structures, hardware programming interface
registers layout, and minimal initialization code for enabling global
pass-through for all connected masters.
Patch 3: Implements the device driver for PCIe implementation of RISC-V IOMMU
architected hardware.
Patch 4: Introduces IOMMU interfaces to the kernel subsystem.
Patch 5: Implements device directory management with discovery sequences for
I/O mapped or in-memory device directory table location, hardware
capabilities discovery, and device to domain attach implementation.
Patch 6: Implements command and fault queue, and introduces directory cache
invalidation sequences.
Patch 7: Implements paging domain, using highest page-table mode advertised
by the hardware. This series enables only 4K mappings; complete support
for large page mappings will be introduced in follow-up patch series.
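
One way to read "highest page-table mode advertised by the hardware" in
Patch 7 is a simple priority check over the capability bits. The sketch below
assumes the bit positions and mode encodings defined in iommu-bits.h from
Patch 2 (Sv39/Sv48/Sv57 at capability bits 9/10/11, mode values 8/9/10). The
macro and function names are shortened, hypothetical stand-ins, and the
actual selection logic in Patch 7 may differ.

```c
#include <stdint.h>

/* First-stage capability bits, mirroring RISCV_IOMMU_CAP_S_SV* in iommu-bits.h. */
#define CAP_S_SV39	(1ULL << 9)
#define CAP_S_SV48	(1ULL << 10)
#define CAP_S_SV57	(1ULL << 11)

/* Mode encodings, mirroring RISCV_IOMMU_DC_FSC_IOSATP_MODE_* in iommu-bits.h. */
enum {
	FSC_MODE_BARE = 0,
	FSC_MODE_SV39 = 8,
	FSC_MODE_SV48 = 9,
	FSC_MODE_SV57 = 10,
};

/*
 * Hypothetical helper: return the deepest 64-bit page-table mode the
 * capability register advertises, or FSC_MODE_BARE if none is set.
 */
static inline unsigned int pick_highest_fsc_mode(uint64_t cap)
{
	if (cap & CAP_S_SV57)
		return FSC_MODE_SV57;
	if (cap & CAP_S_SV48)
		return FSC_MODE_SV48;
	if (cap & CAP_S_SV39)
		return FSC_MODE_SV39;
	return FSC_MODE_BARE;
}
```

For a capability value with only Sv39 and Sv48 set, this helper returns the
Sv48 encoding.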

Follow-up patch series, providing large page support and updated walk cache
management based on the revised specification, and complete ATS/PRI/SVA support,
will be posted to GitHub [2].

Changes from v3:
- dt-bindings: s/qemu,iommu/qemu,riscv-iommu/, fix iommu-map sample
- device probe will fail if IOMMU is running in restricted BARE mode
- synchronize_rcu moved to release_device, fixes for bonds locking, iotlb_inval fix
- page table radix tree selection based on IOMMU capabilities, failover to use SATP
- private iommu per device data structure added
- Editorial changes: rename goto labels, blocking_domain/blocking_domain, reformat
to fit mostly under 80 characters per line

Patch series depends on (applied to iommu-next):
IOMMU memory observability, v6 [3]
iommu, dma-mapping: Simplify arch_setup_dma_ops(), v4 [4]

Best regards,
Tomasz Jeznach

[1] link: https://github.com/riscv-non-isa/riscv-iommu
[2] link: https://github.com/tjeznach/linux
[3] link: https://lore.kernel.org/linux-iommu/[email protected]/
[4] link: https://lore.kernel.org/linux-iommu/[email protected]/
v3 link: https://lore.kernel.org/linux-iommu/[email protected]/
v2 link: https://lore.kernel.org/linux-iommu/[email protected]/
v1 link: https://lore.kernel.org/linux-iommu/[email protected]/


Tomasz Jeznach (7):
dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU
iommu/riscv: Add RISC-V IOMMU platform device driver
iommu/riscv: Add RISC-V IOMMU PCIe device driver
iommu/riscv: Enable IOMMU registration and device probe.
iommu/riscv: Device directory management.
iommu/riscv: Command and fault queue support
iommu/riscv: Paging domain support

.../bindings/iommu/riscv,iommu.yaml | 147 ++
MAINTAINERS | 8 +
drivers/iommu/Kconfig | 1 +
drivers/iommu/Makefile | 2 +-
drivers/iommu/riscv/Kconfig | 20 +
drivers/iommu/riscv/Makefile | 3 +
drivers/iommu/riscv/iommu-bits.h | 782 ++++++++
drivers/iommu/riscv/iommu-pci.c | 119 ++
drivers/iommu/riscv/iommu-platform.c | 92 +
drivers/iommu/riscv/iommu.c | 1616 +++++++++++++++++
drivers/iommu/riscv/iommu.h | 88 +
11 files changed, 2877 insertions(+), 1 deletion(-)
create mode 100644 Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
create mode 100644 drivers/iommu/riscv/Kconfig
create mode 100644 drivers/iommu/riscv/Makefile
create mode 100644 drivers/iommu/riscv/iommu-bits.h
create mode 100644 drivers/iommu/riscv/iommu-pci.c
create mode 100644 drivers/iommu/riscv/iommu-platform.c
create mode 100644 drivers/iommu/riscv/iommu.c
create mode 100644 drivers/iommu/riscv/iommu.h


base-commit: e67572cd2204894179d89bd7b984072f19313b03
message-id: [email protected]
--
2.34.1



2024-05-03 16:13:09

by Tomasz Jeznach

Subject: [PATCH v4 1/7] dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU

Add bindings for the RISC-V IOMMU device drivers.

Co-developed-by: Anup Patel <[email protected]>
Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Conor Dooley <[email protected]>
Signed-off-by: Tomasz Jeznach <[email protected]>
---
.../bindings/iommu/riscv,iommu.yaml | 147 ++++++++++++++++++
MAINTAINERS | 7 +
2 files changed, 154 insertions(+)
create mode 100644 Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
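
For readers unfamiliar with iommu-map, the following sketch shows how the two
entries in Example 4 of this binding skip the IOMMU's own function at 00:01.0.
It assumes the generic PCI IOMMU binding semantics, where each entry is
(rid-base, iommu, iommu-base, length) and a requester ID inside
[rid-base, rid-base + length) maps to iommu-base plus its offset. The struct
and helper names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One iommu-map entry without the phandle: (rid-base, iommu-base, length). */
struct iommu_map_entry {
	uint32_t rid_base;
	uint32_t iommu_base;
	uint32_t length;
};

/* The two entries from Example 4; RID 0x8 (BDF 00:01.0) is left out. */
static const struct iommu_map_entry example4_map[] = {
	{ 0x0, 0x0, 0x8 },
	{ 0x9, 0x9, 0xfff7 },
};

/* PCI requester ID: bus in bits 15:8, device in bits 7:3, function in 2:0. */
static inline uint32_t pci_rid(uint32_t bus, uint32_t dev, uint32_t fn)
{
	return (bus << 8) | (dev << 3) | fn;
}

/* Translate a requester ID; returns false if no entry covers it. */
static inline bool iommu_map_lookup(const struct iommu_map_entry *map, size_t n,
				    uint32_t rid, uint32_t *id)
{
	for (size_t i = 0; i < n; i++) {
		if (rid >= map[i].rid_base &&
		    rid - map[i].rid_base < map[i].length) {
			*id = map[i].iommu_base + (rid - map[i].rid_base);
			return true;
		}
	}
	return false;
}
```

Looking up pci_rid(0, 1, 0) in example4_map finds no entry, while every other
requester ID from 0x0 to 0xffff maps to an identical device ID.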

diff --git a/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml b/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
new file mode 100644
index 000000000000..5d015eeb06d0
--- /dev/null
+++ b/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
@@ -0,0 +1,147 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/iommu/riscv,iommu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: RISC-V IOMMU Architecture Implementation
+
+maintainers:
+ - Tomasz Jeznach <[email protected]>
+
+description: |
+ The RISC-V IOMMU provides memory address translation and isolation for
+ input and output devices, supporting per-device translation context,
+ shared process address spaces including the ATS and PRI components of
+ the PCIe specification, two stage address translation and MSI remapping.
+ It supports identical translation table format to the RISC-V address
+ translation tables with page level access and protection attributes.
+ Hardware uses in-memory command and fault reporting queues with wired
+ interrupt or MSI notifications.
+
+ Visit https://github.com/riscv-non-isa/riscv-iommu for more details.
+
+ For information on assigning RISC-V IOMMU to its peripheral devices,
+ see generic IOMMU bindings.
+
+properties:
+ # For PCIe IOMMU hardware compatible property should contain the vendor
+ # and device ID according to the PCI Bus Binding specification.
+ # Since PCI provides built-in identification methods, compatible is not
+ # actually required. For non-PCIe hardware implementations 'riscv,iommu'
+ # should be specified along with 'reg' property providing MMIO location.
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - qemu,riscv-iommu
+ - const: riscv,iommu
+ - items:
+ - enum:
+ - pci1efd,edf1
+ - const: riscv,pci-iommu
+
+ reg:
+ maxItems: 1
+ description:
+ For non-PCI devices this represents base address and size of the
+ IOMMU memory mapped registers interface.
+ For PCI IOMMU hardware implementation this should represent an address
+ of the IOMMU, as defined in the PCI Bus Binding reference.
+
+ '#iommu-cells':
+ const: 1
+ description:
+ The single cell describes the requester id emitted by a master to the
+ IOMMU.
+
+ interrupts:
+ minItems: 1
+ maxItems: 4
+ description:
+ Wired interrupt vectors available for RISC-V IOMMU to notify the
+ RISC-V HARTS. The cause-to-vector mapping is software defined
+ using the IVEC IOMMU register.
+
+ msi-parent: true
+
+ power-domains:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - '#iommu-cells'
+
+additionalProperties: false
+
+examples:
+ - |+
+ /* Example 1 (IOMMU device with wired interrupts) */
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ iommu1: iommu@1bccd000 {
+ compatible = "qemu,riscv-iommu", "riscv,iommu";
+ reg = <0x1bccd000 0x1000>;
+ interrupt-parent = <&aplic_smode>;
+ interrupts = <32 IRQ_TYPE_LEVEL_HIGH>,
+ <33 IRQ_TYPE_LEVEL_HIGH>,
+ <34 IRQ_TYPE_LEVEL_HIGH>,
+ <35 IRQ_TYPE_LEVEL_HIGH>;
+ #iommu-cells = <1>;
+ };
+
+ /* Device with two IOMMU device IDs, 0 and 7 */
+ master1 {
+ iommus = <&iommu1 0>, <&iommu1 7>;
+ };
+
+ - |+
+ /* Example 2 (IOMMU device with shared wired interrupt) */
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ iommu2: iommu@1bccd000 {
+ compatible = "qemu,riscv-iommu", "riscv,iommu";
+ reg = <0x1bccd000 0x1000>;
+ interrupt-parent = <&aplic_smode>;
+ interrupts = <32 IRQ_TYPE_LEVEL_HIGH>;
+ #iommu-cells = <1>;
+ };
+
+ - |+
+ /* Example 3 (IOMMU device with MSIs) */
+ iommu3: iommu@1bccd000 {
+ compatible = "qemu,riscv-iommu", "riscv,iommu";
+ reg = <0x1bccd000 0x1000>;
+ msi-parent = <&imsics_smode>;
+ #iommu-cells = <1>;
+ };
+
+ - |+
+ /* Example 4 (IOMMU PCIe device with MSIs) */
+ bus {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ pcie@30000000 {
+ device_type = "pci";
+ #address-cells = <3>;
+ #size-cells = <2>;
+ reg = <0x0 0x30000000 0x0 0x1000000>;
+ ranges = <0x02000000 0x0 0x41000000 0x0 0x41000000 0x0 0x0f000000>;
+
+ /*
+ * The IOMMU manages all functions in this PCI domain except
+ * itself. Omit BDF 00:01.0.
+ */
+ iommu-map = <0x0 &iommu0 0x0 0x8>,
+ <0x9 &iommu0 0x9 0xfff7>;
+
+ /* The IOMMU programming interface uses slot 00:01.0 */
+ iommu0: iommu@1,0 {
+ compatible = "pci1efd,edf1", "riscv,pci-iommu";
+ reg = <0x800 0 0 0 0>;
+ #iommu-cells = <1>;
+ };
+ };
+ };
diff --git a/MAINTAINERS b/MAINTAINERS
index f6dc90559341..7fcf7c27ef6b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18958,6 +18958,13 @@ F: arch/riscv/
N: riscv
K: riscv

+RISC-V IOMMU
+M: Tomasz Jeznach <[email protected]>
+L: [email protected]
+L: [email protected]
+S: Maintained
+F: Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
+
RISC-V MICROCHIP FPGA SUPPORT
M: Conor Dooley <[email protected]>
M: Daire McNamara <[email protected]>
--
2.34.1


2024-05-03 16:13:46

by Tomasz Jeznach

Subject: [PATCH v4 4/7] iommu/riscv: Enable IOMMU registration and device probe.

Advertise the IOMMU device and its core API.
This is only a minimal implementation supporting a single identity domain
type, without per-group domain protection.

Reviewed-by: Lu Baolu <[email protected]>
Signed-off-by: Tomasz Jeznach <[email protected]>
---
drivers/iommu/riscv/iommu.c | 66 +++++++++++++++++++++++++++++++++++++
1 file changed, 66 insertions(+)
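
The error path added to riscv_iommu_init() follows the usual kernel
goto-unwind pattern: if a later step fails, the earlier steps are undone in
reverse order. The user-space sketch below mimics that shape with
hypothetical stand-in functions (fake_sysfs_add(), fake_register() and the
others are not kernel APIs) so the control flow can be exercised in
isolation.

```c
#include <stdbool.h>

/* Hypothetical state tracker standing in for the real device structure. */
struct fake_iommu {
	bool sysfs_added;
	bool registered;
	int fail_register;	/* non-zero: error code the second step returns */
};

static inline int fake_sysfs_add(struct fake_iommu *i)
{
	i->sysfs_added = true;
	return 0;
}

static inline void fake_sysfs_remove(struct fake_iommu *i)
{
	i->sysfs_added = false;
}

static inline int fake_register(struct fake_iommu *i)
{
	if (i->fail_register)
		return i->fail_register;
	i->registered = true;
	return 0;
}

/* Two-step init: the second step's failure unwinds the first. */
static inline int fake_iommu_init(struct fake_iommu *i)
{
	int rc;

	rc = fake_sysfs_add(i);
	if (rc)
		return rc;

	rc = fake_register(i);
	if (rc)
		goto err_remove_sysfs;

	return 0;

err_remove_sysfs:
	fake_sysfs_remove(i);
	return rc;
}
```

When fail_register is set, fake_iommu_init() returns that error code and
leaves both flags cleared, so a failed init leaves nothing half-registered.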

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 3c5a6b49669d..1f889daffb0e 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -17,6 +17,7 @@
#include <linux/init.h>
#include <linux/iommu.h>
#include <linux/kernel.h>
+#include <linux/pci.h>

#include "iommu-bits.h"
#include "iommu.h"
@@ -36,6 +37,60 @@ static void riscv_iommu_disable(struct riscv_iommu_device *iommu)
riscv_iommu_writel(iommu, RISCV_IOMMU_REG_PQCSR, 0);
}

+static int riscv_iommu_attach_identity_domain(struct iommu_domain *iommu_domain,
+ struct device *dev)
+{
+ /* Global pass-through already enabled, do nothing for now. */
+ return 0;
+}
+
+static struct iommu_domain riscv_iommu_identity_domain = {
+ .type = IOMMU_DOMAIN_IDENTITY,
+ .ops = &(const struct iommu_domain_ops) {
+ .attach_dev = riscv_iommu_attach_identity_domain,
+ }
+};
+
+static int riscv_iommu_device_domain_type(struct device *dev)
+{
+ return IOMMU_DOMAIN_IDENTITY;
+}
+
+static struct iommu_group *riscv_iommu_device_group(struct device *dev)
+{
+ if (dev_is_pci(dev))
+ return pci_device_group(dev);
+ return generic_device_group(dev);
+}
+
+static int riscv_iommu_of_xlate(struct device *dev, const struct of_phandle_args *args)
+{
+ return iommu_fwspec_add_ids(dev, args->args, 1);
+}
+
+static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
+{
+ struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+ struct riscv_iommu_device *iommu;
+
+ if (!fwspec->iommu_fwnode->dev || !fwspec->num_ids)
+ return ERR_PTR(-ENODEV);
+
+ iommu = dev_get_drvdata(fwspec->iommu_fwnode->dev);
+ if (!iommu)
+ return ERR_PTR(-ENODEV);
+
+ return &iommu->iommu;
+}
+
+static const struct iommu_ops riscv_iommu_ops = {
+ .of_xlate = riscv_iommu_of_xlate,
+ .identity_domain = &riscv_iommu_identity_domain,
+ .def_domain_type = riscv_iommu_device_domain_type,
+ .device_group = riscv_iommu_device_group,
+ .probe_device = riscv_iommu_probe_device,
+};
+
static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
{
u64 ddtp;
@@ -71,6 +126,7 @@ static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)

void riscv_iommu_remove(struct riscv_iommu_device *iommu)
{
+ iommu_device_unregister(&iommu->iommu);
iommu_device_sysfs_remove(&iommu->iommu);
}

@@ -95,5 +151,15 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
return dev_err_probe(iommu->dev, rc,
"cannot register sysfs interface\n");

+ rc = iommu_device_register(&iommu->iommu, &riscv_iommu_ops, iommu->dev);
+ if (rc) {
+ dev_err_probe(iommu->dev, rc, "cannot register iommu interface\n");
+ goto err_remove_sysfs;
+ }
+
return 0;
+
+err_remove_sysfs:
+ iommu_device_sysfs_remove(&iommu->iommu);
+ return rc;
}
--
2.34.1


2024-05-03 16:13:50

by Tomasz Jeznach

Subject: [PATCH v4 2/7] iommu/riscv: Add RISC-V IOMMU platform device driver

Introduce platform device driver for implementation of RISC-V IOMMU
architected hardware.

Hardware interface definition located in file iommu-bits.h is based on
ratified RISC-V IOMMU Architecture Specification version 1.0.0.

This patch implements platform device initialization, early check and
configuration of the IOMMU interfaces and enables global pass-through
address translation mode (iommu_mode == BARE), without registering
hardware instance in the IOMMU subsystem.

Link: https://github.com/riscv-non-isa/riscv-iommu
Co-developed-by: Nick Kossifidis <[email protected]>
Signed-off-by: Nick Kossifidis <[email protected]>
Co-developed-by: Sebastien Boeuf <[email protected]>
Signed-off-by: Sebastien Boeuf <[email protected]>
Signed-off-by: Tomasz Jeznach <[email protected]>
---
MAINTAINERS | 1 +
drivers/iommu/Kconfig | 1 +
drivers/iommu/Makefile | 2 +-
drivers/iommu/riscv/Kconfig | 15 +
drivers/iommu/riscv/Makefile | 2 +
drivers/iommu/riscv/iommu-bits.h | 707 +++++++++++++++++++++++++++
drivers/iommu/riscv/iommu-platform.c | 92 ++++
drivers/iommu/riscv/iommu.c | 99 ++++
drivers/iommu/riscv/iommu.h | 62 +++
9 files changed, 980 insertions(+), 1 deletion(-)
create mode 100644 drivers/iommu/riscv/Kconfig
create mode 100644 drivers/iommu/riscv/Makefile
create mode 100644 drivers/iommu/riscv/iommu-bits.h
create mode 100644 drivers/iommu/riscv/iommu-platform.c
create mode 100644 drivers/iommu/riscv/iommu.c
create mode 100644 drivers/iommu/riscv/iommu.h
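
To make the register encoding concrete, here is a small user-space sketch of
how a ddtp value selecting BARE mode could be composed and decoded. The field
positions (MODE in bits 3:0, BUSY in bit 4, PPN in bits 53:10) and the mode
values are taken from iommu-bits.h in this patch. genmask_ull(), field_prep()
and field_get() are simplified, hypothetical stand-ins for the kernel's
GENMASK_ULL(), FIELD_PREP() and FIELD_GET() macros, not the kernel
implementations.

```c
#include <stdint.h>

/* Contiguous bit mask covering bits high..low (inclusive), 0 <= low <= high <= 63. */
static inline uint64_t genmask_ull(unsigned int high, unsigned int low)
{
	return (~0ULL >> (63 - high)) & (~0ULL << low);
}

/* Position of the lowest set bit; mask must be non-zero. */
static inline unsigned int mask_shift(uint64_t mask)
{
	unsigned int shift = 0;

	while (!(mask & 1)) {
		mask >>= 1;
		shift++;
	}
	return shift;
}

/* Place val into the field described by mask. */
static inline uint64_t field_prep(uint64_t mask, uint64_t val)
{
	return (val << mask_shift(mask)) & mask;
}

/* Extract the field described by mask from reg. */
static inline uint64_t field_get(uint64_t mask, uint64_t reg)
{
	return (reg & mask) >> mask_shift(mask);
}

/* Field layout mirroring RISCV_IOMMU_DDTP_* in iommu-bits.h. */
#define DDTP_MODE	genmask_ull(3, 0)
#define DDTP_BUSY	(1ULL << 4)
#define DDTP_PPN	genmask_ull(53, 10)

enum {
	DDTP_MODE_OFF = 0,
	DDTP_MODE_BARE = 1,
	DDTP_MODE_1LVL = 2,
};

/* Compose a ddtp value from a mode and a physical page number. */
static inline uint64_t ddtp_make(unsigned int mode, uint64_t ppn)
{
	return field_prep(DDTP_MODE, mode) | field_prep(DDTP_PPN, ppn);
}
```

In this sketch ddtp_make(DDTP_MODE_BARE, 0) evaluates to 0x1, meaning
pass-through mode with no device directory table attached.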

diff --git a/MAINTAINERS b/MAINTAINERS
index 7fcf7c27ef6b..42df3b3871f4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18964,6 +18964,7 @@ L: [email protected]
L: [email protected]
S: Maintained
F: Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
+F: drivers/iommu/riscv/

RISC-V MICROCHIP FPGA SUPPORT
M: Conor Dooley <[email protected]>
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 0af39bbbe3a3..ae762db0365e 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -195,6 +195,7 @@ config MSM_IOMMU
source "drivers/iommu/amd/Kconfig"
source "drivers/iommu/intel/Kconfig"
source "drivers/iommu/iommufd/Kconfig"
+source "drivers/iommu/riscv/Kconfig"

config IRQ_REMAP
bool "Support for Interrupt Remapping"
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 542760d963ec..5e5a83c6c2aa 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -1,5 +1,5 @@
# SPDX-License-Identifier: GPL-2.0
-obj-y += amd/ intel/ arm/ iommufd/
+obj-y += amd/ intel/ arm/ iommufd/ riscv/
obj-$(CONFIG_IOMMU_API) += iommu.o
obj-$(CONFIG_IOMMU_API) += iommu-traces.o
obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
diff --git a/drivers/iommu/riscv/Kconfig b/drivers/iommu/riscv/Kconfig
new file mode 100644
index 000000000000..5dcc5c45aa50
--- /dev/null
+++ b/drivers/iommu/riscv/Kconfig
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: GPL-2.0-only
+# RISC-V IOMMU support
+
+config RISCV_IOMMU
+ bool "RISC-V IOMMU Support"
+ depends on RISCV && 64BIT
+ default y
+ select IOMMU_API
+ help
+ Support for implementations of the RISC-V IOMMU architecture that
+ complements the RISC-V MMU capabilities, providing similar address
+ translation and protection functions for accesses from I/O devices.
+
+ Say Y here if your SoC includes an IOMMU device implementing
+ the RISC-V IOMMU architecture.
diff --git a/drivers/iommu/riscv/Makefile b/drivers/iommu/riscv/Makefile
new file mode 100644
index 000000000000..e4c189de58d3
--- /dev/null
+++ b/drivers/iommu/riscv/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_RISCV_IOMMU) += iommu.o iommu-platform.o
diff --git a/drivers/iommu/riscv/iommu-bits.h b/drivers/iommu/riscv/iommu-bits.h
new file mode 100644
index 000000000000..ba093c29de9f
--- /dev/null
+++ b/drivers/iommu/riscv/iommu-bits.h
@@ -0,0 +1,707 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright © 2022-2024 Rivos Inc.
+ * Copyright © 2023 FORTH-ICS/CARV
+ * Copyright © 2023 RISC-V IOMMU Task Group
+ *
+ * RISC-V IOMMU - Register Layout and Data Structures.
+ *
+ * Based on the 'RISC-V IOMMU Architecture Specification', Version 1.0
+ * Published at https://github.com/riscv-non-isa/riscv-iommu
+ *
+ */
+
+#ifndef _RISCV_IOMMU_BITS_H_
+#define _RISCV_IOMMU_BITS_H_
+
+#include <linux/types.h>
+#include <linux/bitfield.h>
+#include <linux/bits.h>
+
+/*
+ * Chapter 5: Memory Mapped register interface
+ */
+
+/* Common field positions */
+#define RISCV_IOMMU_PPN_FIELD GENMASK_ULL(53, 10)
+#define RISCV_IOMMU_QUEUE_LOGSZ_FIELD GENMASK_ULL(4, 0)
+#define RISCV_IOMMU_QUEUE_INDEX_FIELD GENMASK_ULL(31, 0)
+#define RISCV_IOMMU_QUEUE_ENABLE BIT(0)
+#define RISCV_IOMMU_QUEUE_INTR_ENABLE BIT(1)
+#define RISCV_IOMMU_QUEUE_MEM_FAULT BIT(8)
+#define RISCV_IOMMU_QUEUE_OVERFLOW BIT(9)
+#define RISCV_IOMMU_QUEUE_ACTIVE BIT(16)
+#define RISCV_IOMMU_QUEUE_BUSY BIT(17)
+
+#define RISCV_IOMMU_ATP_PPN_FIELD GENMASK_ULL(43, 0)
+#define RISCV_IOMMU_ATP_MODE_FIELD GENMASK_ULL(63, 60)
+
+/* 5.3 IOMMU Capabilities (64bits) */
+#define RISCV_IOMMU_REG_CAP 0x0000
+#define RISCV_IOMMU_CAP_VERSION GENMASK_ULL(7, 0)
+#define RISCV_IOMMU_CAP_S_SV32 BIT_ULL(8)
+#define RISCV_IOMMU_CAP_S_SV39 BIT_ULL(9)
+#define RISCV_IOMMU_CAP_S_SV48 BIT_ULL(10)
+#define RISCV_IOMMU_CAP_S_SV57 BIT_ULL(11)
+#define RISCV_IOMMU_CAP_SVPBMT BIT_ULL(15)
+#define RISCV_IOMMU_CAP_G_SV32 BIT_ULL(16)
+#define RISCV_IOMMU_CAP_G_SV39 BIT_ULL(17)
+#define RISCV_IOMMU_CAP_G_SV48 BIT_ULL(18)
+#define RISCV_IOMMU_CAP_G_SV57 BIT_ULL(19)
+#define RISCV_IOMMU_CAP_AMO_MRIF BIT_ULL(21)
+#define RISCV_IOMMU_CAP_MSI_FLAT BIT_ULL(22)
+#define RISCV_IOMMU_CAP_MSI_MRIF BIT_ULL(23)
+#define RISCV_IOMMU_CAP_AMO_HWAD BIT_ULL(24)
+#define RISCV_IOMMU_CAP_ATS BIT_ULL(25)
+#define RISCV_IOMMU_CAP_T2GPA BIT_ULL(26)
+#define RISCV_IOMMU_CAP_END BIT_ULL(27)
+#define RISCV_IOMMU_CAP_IGS GENMASK_ULL(29, 28)
+#define RISCV_IOMMU_CAP_HPM BIT_ULL(30)
+#define RISCV_IOMMU_CAP_DBG BIT_ULL(31)
+#define RISCV_IOMMU_CAP_PAS GENMASK_ULL(37, 32)
+#define RISCV_IOMMU_CAP_PD8 BIT_ULL(38)
+#define RISCV_IOMMU_CAP_PD17 BIT_ULL(39)
+#define RISCV_IOMMU_CAP_PD20 BIT_ULL(40)
+
+#define RISCV_IOMMU_CAP_VERSION_VER_MASK 0xF0
+#define RISCV_IOMMU_CAP_VERSION_REV_MASK 0x0F
+
+/**
+ * enum riscv_iommu_igs_settings - Interrupt Generation Support Settings
+ * @RISCV_IOMMU_CAP_IGS_MSI: I/O MMU supports only MSI generation
+ * @RISCV_IOMMU_CAP_IGS_WSI: I/O MMU supports only Wired-Signaled interrupt
+ * @RISCV_IOMMU_CAP_IGS_BOTH: I/O MMU supports both MSI and WSI generation
+ * @RISCV_IOMMU_CAP_IGS_RSRV: Reserved for standard use
+ */
+enum riscv_iommu_igs_settings {
+ RISCV_IOMMU_CAP_IGS_MSI = 0,
+ RISCV_IOMMU_CAP_IGS_WSI = 1,
+ RISCV_IOMMU_CAP_IGS_BOTH = 2,
+ RISCV_IOMMU_CAP_IGS_RSRV = 3
+};
+
+/* 5.4 Features control register (32bits) */
+#define RISCV_IOMMU_REG_FCTL 0x0008
+#define RISCV_IOMMU_FCTL_BE BIT(0)
+#define RISCV_IOMMU_FCTL_WSI BIT(1)
+#define RISCV_IOMMU_FCTL_GXL BIT(2)
+
+/* 5.5 Device-directory-table pointer (64bits) */
+#define RISCV_IOMMU_REG_DDTP 0x0010
+#define RISCV_IOMMU_DDTP_MODE GENMASK_ULL(3, 0)
+#define RISCV_IOMMU_DDTP_BUSY BIT_ULL(4)
+#define RISCV_IOMMU_DDTP_PPN RISCV_IOMMU_PPN_FIELD
+
+/**
+ * enum riscv_iommu_ddtp_modes - I/O MMU translation modes
+ * @RISCV_IOMMU_DDTP_MODE_OFF: No inbound transactions allowed
+ * @RISCV_IOMMU_DDTP_MODE_BARE: Pass-through mode
+ * @RISCV_IOMMU_DDTP_MODE_1LVL: One-level DDT
+ * @RISCV_IOMMU_DDTP_MODE_2LVL: Two-level DDT
+ * @RISCV_IOMMU_DDTP_MODE_3LVL: Three-level DDT
+ * @RISCV_IOMMU_DDTP_MODE_MAX: Max value allowed by specification
+ */
+enum riscv_iommu_ddtp_modes {
+ RISCV_IOMMU_DDTP_MODE_OFF = 0,
+ RISCV_IOMMU_DDTP_MODE_BARE = 1,
+ RISCV_IOMMU_DDTP_MODE_1LVL = 2,
+ RISCV_IOMMU_DDTP_MODE_2LVL = 3,
+ RISCV_IOMMU_DDTP_MODE_3LVL = 4,
+ RISCV_IOMMU_DDTP_MODE_MAX = 4
+};
+
+/* 5.6 Command Queue Base (64bits) */
+#define RISCV_IOMMU_REG_CQB 0x0018
+#define RISCV_IOMMU_CQB_ENTRIES RISCV_IOMMU_QUEUE_LOGSZ_FIELD
+#define RISCV_IOMMU_CQB_PPN RISCV_IOMMU_PPN_FIELD
+
+/* 5.7 Command Queue head (32bits) */
+#define RISCV_IOMMU_REG_CQH 0x0020
+#define RISCV_IOMMU_CQH_INDEX RISCV_IOMMU_QUEUE_INDEX_FIELD
+
+/* 5.8 Command Queue tail (32bits) */
+#define RISCV_IOMMU_REG_CQT 0x0024
+#define RISCV_IOMMU_CQT_INDEX RISCV_IOMMU_QUEUE_INDEX_FIELD
+
+/* 5.9 Fault Queue Base (64bits) */
+#define RISCV_IOMMU_REG_FQB 0x0028
+#define RISCV_IOMMU_FQB_ENTRIES RISCV_IOMMU_QUEUE_LOGSZ_FIELD
+#define RISCV_IOMMU_FQB_PPN RISCV_IOMMU_PPN_FIELD
+
+/* 5.10 Fault Queue Head (32bits) */
+#define RISCV_IOMMU_REG_FQH 0x0030
+#define RISCV_IOMMU_FQH_INDEX RISCV_IOMMU_QUEUE_INDEX_FIELD
+
+/* 5.11 Fault Queue tail (32bits) */
+#define RISCV_IOMMU_REG_FQT 0x0034
+#define RISCV_IOMMU_FQT_INDEX RISCV_IOMMU_QUEUE_INDEX_FIELD
+
+/* 5.12 Page Request Queue base (64bits) */
+#define RISCV_IOMMU_REG_PQB 0x0038
+#define RISCV_IOMMU_PQB_ENTRIES RISCV_IOMMU_QUEUE_LOGSZ_FIELD
+#define RISCV_IOMMU_PQB_PPN RISCV_IOMMU_PPN_FIELD
+
+/* 5.13 Page Request Queue head (32bits) */
+#define RISCV_IOMMU_REG_PQH 0x0040
+#define RISCV_IOMMU_PQH_INDEX RISCV_IOMMU_QUEUE_INDEX_FIELD
+
+/* 5.14 Page Request Queue tail (32bits) */
+#define RISCV_IOMMU_REG_PQT 0x0044
+#define RISCV_IOMMU_PQT_INDEX_MASK RISCV_IOMMU_QUEUE_INDEX_FIELD
+
+/* 5.15 Command Queue CSR (32bits) */
+#define RISCV_IOMMU_REG_CQCSR 0x0048
+#define RISCV_IOMMU_CQCSR_CQEN RISCV_IOMMU_QUEUE_ENABLE
+#define RISCV_IOMMU_CQCSR_CIE RISCV_IOMMU_QUEUE_INTR_ENABLE
+#define RISCV_IOMMU_CQCSR_CQMF RISCV_IOMMU_QUEUE_MEM_FAULT
+#define RISCV_IOMMU_CQCSR_CMD_TO BIT(9)
+#define RISCV_IOMMU_CQCSR_CMD_ILL BIT(10)
+#define RISCV_IOMMU_CQCSR_FENCE_W_IP BIT(11)
+#define RISCV_IOMMU_CQCSR_CQON RISCV_IOMMU_QUEUE_ACTIVE
+#define RISCV_IOMMU_CQCSR_BUSY RISCV_IOMMU_QUEUE_BUSY
+
+/* 5.16 Fault Queue CSR (32bits) */
+#define RISCV_IOMMU_REG_FQCSR 0x004C
+#define RISCV_IOMMU_FQCSR_FQEN RISCV_IOMMU_QUEUE_ENABLE
+#define RISCV_IOMMU_FQCSR_FIE RISCV_IOMMU_QUEUE_INTR_ENABLE
+#define RISCV_IOMMU_FQCSR_FQMF RISCV_IOMMU_QUEUE_MEM_FAULT
+#define RISCV_IOMMU_FQCSR_FQOF RISCV_IOMMU_QUEUE_OVERFLOW
+#define RISCV_IOMMU_FQCSR_FQON RISCV_IOMMU_QUEUE_ACTIVE
+#define RISCV_IOMMU_FQCSR_BUSY RISCV_IOMMU_QUEUE_BUSY
+
+/* 5.17 Page Request Queue CSR (32bits) */
+#define RISCV_IOMMU_REG_PQCSR 0x0050
+#define RISCV_IOMMU_PQCSR_PQEN RISCV_IOMMU_QUEUE_ENABLE
+#define RISCV_IOMMU_PQCSR_PIE RISCV_IOMMU_QUEUE_INTR_ENABLE
+#define RISCV_IOMMU_PQCSR_PQMF RISCV_IOMMU_QUEUE_MEM_FAULT
+#define RISCV_IOMMU_PQCSR_PQOF RISCV_IOMMU_QUEUE_OVERFLOW
+#define RISCV_IOMMU_PQCSR_PQON RISCV_IOMMU_QUEUE_ACTIVE
+#define RISCV_IOMMU_PQCSR_BUSY RISCV_IOMMU_QUEUE_BUSY
+
+/* 5.18 Interrupt Pending Status (32bits) */
+#define RISCV_IOMMU_REG_IPSR 0x0054
+
+#define RISCV_IOMMU_INTR_CQ 0
+#define RISCV_IOMMU_INTR_FQ 1
+#define RISCV_IOMMU_INTR_PM 2
+#define RISCV_IOMMU_INTR_PQ 3
+#define RISCV_IOMMU_INTR_COUNT 4
+
+#define RISCV_IOMMU_IPSR_CIP BIT(RISCV_IOMMU_INTR_CQ)
+#define RISCV_IOMMU_IPSR_FIP BIT(RISCV_IOMMU_INTR_FQ)
+#define RISCV_IOMMU_IPSR_PMIP BIT(RISCV_IOMMU_INTR_PM)
+#define RISCV_IOMMU_IPSR_PIP BIT(RISCV_IOMMU_INTR_PQ)
+
+/* 5.19 Performance monitoring counter overflow status (32bits) */
+#define RISCV_IOMMU_REG_IOCOUNTOVF 0x0058
+#define RISCV_IOMMU_IOCOUNTOVF_CY BIT(0)
+#define RISCV_IOMMU_IOCOUNTOVF_HPM GENMASK_ULL(31, 1)
+
+/* 5.20 Performance monitoring counter inhibits (32bits) */
+#define RISCV_IOMMU_REG_IOCOUNTINH 0x005C
+#define RISCV_IOMMU_IOCOUNTINH_CY BIT(0)
+#define RISCV_IOMMU_IOCOUNTINH_HPM GENMASK(31, 1)
+
+/* 5.21 Performance monitoring cycles counter (64bits) */
+#define RISCV_IOMMU_REG_IOHPMCYCLES 0x0060
+#define RISCV_IOMMU_IOHPMCYCLES_COUNTER GENMASK_ULL(62, 0)
+#define RISCV_IOMMU_IOHPMCYCLES_OVF BIT_ULL(63)
+
+/* 5.22 Performance monitoring event counters (31 * 64bits) */
+#define RISCV_IOMMU_REG_IOHPMCTR_BASE 0x0068
+#define RISCV_IOMMU_REG_IOHPMCTR(_n) (RISCV_IOMMU_REG_IOHPMCTR_BASE + ((_n) * 0x8))
+
+/* 5.23 Performance monitoring event selectors (31 * 64bits) */
+#define RISCV_IOMMU_REG_IOHPMEVT_BASE 0x0160
+#define RISCV_IOMMU_REG_IOHPMEVT(_n) (RISCV_IOMMU_REG_IOHPMEVT_BASE + ((_n) * 0x8))
+#define RISCV_IOMMU_IOHPMEVT_CNT 31
+#define RISCV_IOMMU_IOHPMEVT_EVENT_ID GENMASK_ULL(14, 0)
+#define RISCV_IOMMU_IOHPMEVT_DMASK BIT_ULL(15)
+#define RISCV_IOMMU_IOHPMEVT_PID_PSCID GENMASK_ULL(35, 16)
+#define RISCV_IOMMU_IOHPMEVT_DID_GSCID GENMASK_ULL(59, 36)
+#define RISCV_IOMMU_IOHPMEVT_PV_PSCV BIT_ULL(60)
+#define RISCV_IOMMU_IOHPMEVT_DV_GSCV BIT_ULL(61)
+#define RISCV_IOMMU_IOHPMEVT_IDT BIT_ULL(62)
+#define RISCV_IOMMU_IOHPMEVT_OF BIT_ULL(63)
+
+/**
+ * enum riscv_iommu_hpmevent_id - Performance-monitoring event identifier
+ *
+ * @RISCV_IOMMU_HPMEVENT_INVALID: Invalid event, do not count
+ * @RISCV_IOMMU_HPMEVENT_URQ: Untranslated requests
+ * @RISCV_IOMMU_HPMEVENT_TRQ: Translated requests
+ * @RISCV_IOMMU_HPMEVENT_ATS_RQ: ATS translation requests
+ * @RISCV_IOMMU_HPMEVENT_TLB_MISS: TLB misses
+ * @RISCV_IOMMU_HPMEVENT_DD_WALK: Device directory walks
+ * @RISCV_IOMMU_HPMEVENT_PD_WALK: Process directory walks
+ * @RISCV_IOMMU_HPMEVENT_S_VS_WALKS: S/VS-Stage page table walks
+ * @RISCV_IOMMU_HPMEVENT_G_WALKS: G-Stage page table walks
+ * @RISCV_IOMMU_HPMEVENT_MAX: Value to denote maximum Event IDs
+ */
+enum riscv_iommu_hpmevent_id {
+ RISCV_IOMMU_HPMEVENT_INVALID = 0,
+ RISCV_IOMMU_HPMEVENT_URQ = 1,
+ RISCV_IOMMU_HPMEVENT_TRQ = 2,
+ RISCV_IOMMU_HPMEVENT_ATS_RQ = 3,
+ RISCV_IOMMU_HPMEVENT_TLB_MISS = 4,
+ RISCV_IOMMU_HPMEVENT_DD_WALK = 5,
+ RISCV_IOMMU_HPMEVENT_PD_WALK = 6,
+ RISCV_IOMMU_HPMEVENT_S_VS_WALKS = 7,
+ RISCV_IOMMU_HPMEVENT_G_WALKS = 8,
+ RISCV_IOMMU_HPMEVENT_MAX = 9
+};
+
+/* 5.24 Translation request IOVA (64bits) */
+#define RISCV_IOMMU_REG_TR_REQ_IOVA 0x0258
+#define RISCV_IOMMU_TR_REQ_IOVA_VPN GENMASK_ULL(63, 12)
+
+/* 5.25 Translation request control (64bits) */
+#define RISCV_IOMMU_REG_TR_REQ_CTL 0x0260
+#define RISCV_IOMMU_TR_REQ_CTL_GO_BUSY BIT_ULL(0)
+#define RISCV_IOMMU_TR_REQ_CTL_PRIV BIT_ULL(1)
+#define RISCV_IOMMU_TR_REQ_CTL_EXE BIT_ULL(2)
+#define RISCV_IOMMU_TR_REQ_CTL_NW BIT_ULL(3)
+#define RISCV_IOMMU_TR_REQ_CTL_PID GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_TR_REQ_CTL_PV BIT_ULL(32)
+#define RISCV_IOMMU_TR_REQ_CTL_DID GENMASK_ULL(63, 40)
+
+/* 5.26 Translation request response (64bits) */
+#define RISCV_IOMMU_REG_TR_RESPONSE 0x0268
+#define RISCV_IOMMU_TR_RESPONSE_FAULT BIT_ULL(0)
+#define RISCV_IOMMU_TR_RESPONSE_PBMT GENMASK_ULL(8, 7)
+#define RISCV_IOMMU_TR_RESPONSE_SZ BIT_ULL(9)
+#define RISCV_IOMMU_TR_RESPONSE_PPN RISCV_IOMMU_PPN_FIELD
+
+/* 5.27 Interrupt cause to vector (64bits) */
+#define RISCV_IOMMU_REG_IVEC 0x02F8
+#define RISCV_IOMMU_IVEC_CIV GENMASK_ULL(3, 0)
+#define RISCV_IOMMU_IVEC_FIV GENMASK_ULL(7, 4)
+#define RISCV_IOMMU_IVEC_PMIV GENMASK_ULL(11, 8)
+#define RISCV_IOMMU_IVEC_PIV GENMASK_ULL(15, 12)
+
+/* 5.28 MSI Configuration table (32 * 64bits) */
+#define RISCV_IOMMU_REG_MSI_CONFIG 0x0300
+#define RISCV_IOMMU_REG_MSI_ADDR(_n) (RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10))
+#define RISCV_IOMMU_MSI_ADDR GENMASK_ULL(55, 2)
+#define RISCV_IOMMU_REG_MSI_DATA(_n) (RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10) + 0x08)
+#define RISCV_IOMMU_MSI_DATA GENMASK_ULL(31, 0)
+#define RISCV_IOMMU_REG_MSI_VEC_CTL(_n) (RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10) + 0x0C)
+#define RISCV_IOMMU_MSI_VEC_CTL_M BIT_ULL(0)
+
+#define RISCV_IOMMU_REG_SIZE 0x1000
+
+/*
+ * Chapter 2: Data structures
+ */
+
+/*
+ * Device Directory Table macros for non-leaf nodes
+ */
+#define RISCV_IOMMU_DDTE_VALID BIT_ULL(0)
+#define RISCV_IOMMU_DDTE_PPN RISCV_IOMMU_PPN_FIELD
+
+/**
+ * struct riscv_iommu_dc - Device Context
+ * @tc: Translation Control
+ * @iohgatp: I/O Hypervisor guest address translation and protection
+ * (Second stage context)
+ * @ta: Translation Attributes
+ * @fsc: First stage context
+ * @msiptp: MSI page table pointer
+ * @msi_addr_mask: MSI address mask
+ * @msi_addr_pattern: MSI address pattern
+ * @_reserved: Reserved for future use, padding
+ *
+ * This structure is used for leaf nodes on the Device Directory Table,
+ * in case RISCV_IOMMU_CAP_MSI_FLAT is not set, the bottom 4 fields are
+ * not present and are skipped with pointer arithmetic to avoid
+ * casting, check out riscv_iommu_get_dc().
+ * See section 2.1 for more details
+ */
+struct riscv_iommu_dc {
+ u64 tc;
+ u64 iohgatp;
+ u64 ta;
+ u64 fsc;
+ u64 msiptp;
+ u64 msi_addr_mask;
+ u64 msi_addr_pattern;
+ u64 _reserved;
+};
+
+/* Translation control fields */
+#define RISCV_IOMMU_DC_TC_V BIT_ULL(0)
+#define RISCV_IOMMU_DC_TC_EN_ATS BIT_ULL(1)
+#define RISCV_IOMMU_DC_TC_EN_PRI BIT_ULL(2)
+#define RISCV_IOMMU_DC_TC_T2GPA BIT_ULL(3)
+#define RISCV_IOMMU_DC_TC_DTF BIT_ULL(4)
+#define RISCV_IOMMU_DC_TC_PDTV BIT_ULL(5)
+#define RISCV_IOMMU_DC_TC_PRPR BIT_ULL(6)
+#define RISCV_IOMMU_DC_TC_GADE BIT_ULL(7)
+#define RISCV_IOMMU_DC_TC_SADE BIT_ULL(8)
+#define RISCV_IOMMU_DC_TC_DPE BIT_ULL(9)
+#define RISCV_IOMMU_DC_TC_SBE BIT_ULL(10)
+#define RISCV_IOMMU_DC_TC_SXL BIT_ULL(11)
+
+/* Second-stage (aka G-stage) context fields */
+#define RISCV_IOMMU_DC_IOHGATP_PPN RISCV_IOMMU_ATP_PPN_FIELD
+#define RISCV_IOMMU_DC_IOHGATP_GSCID GENMASK_ULL(59, 44)
+#define RISCV_IOMMU_DC_IOHGATP_MODE RISCV_IOMMU_ATP_MODE_FIELD
+
+/**
+ * enum riscv_iommu_dc_iohgatp_modes - Guest address translation/protection modes
+ * @RISCV_IOMMU_DC_IOHGATP_MODE_BARE: No translation/protection
+ * @RISCV_IOMMU_DC_IOHGATP_MODE_SV32X4: Sv32x4 (2-bit extension of Sv32), when fctl.GXL == 1
+ * @RISCV_IOMMU_DC_IOHGATP_MODE_SV39X4: Sv39x4 (2-bit extension of Sv39), when fctl.GXL == 0
+ * @RISCV_IOMMU_DC_IOHGATP_MODE_SV48X4: Sv48x4 (2-bit extension of Sv48), when fctl.GXL == 0
+ * @RISCV_IOMMU_DC_IOHGATP_MODE_SV57X4: Sv57x4 (2-bit extension of Sv57), when fctl.GXL == 0
+ */
+enum riscv_iommu_dc_iohgatp_modes {
+ RISCV_IOMMU_DC_IOHGATP_MODE_BARE = 0,
+ RISCV_IOMMU_DC_IOHGATP_MODE_SV32X4 = 8,
+ RISCV_IOMMU_DC_IOHGATP_MODE_SV39X4 = 8,
+ RISCV_IOMMU_DC_IOHGATP_MODE_SV48X4 = 9,
+ RISCV_IOMMU_DC_IOHGATP_MODE_SV57X4 = 10
+};
+
+/* Translation attributes fields */
+#define RISCV_IOMMU_DC_TA_PSCID GENMASK_ULL(31, 12)
+
+/* First-stage context fields */
+#define RISCV_IOMMU_DC_FSC_PPN RISCV_IOMMU_ATP_PPN_FIELD
+#define RISCV_IOMMU_DC_FSC_MODE RISCV_IOMMU_ATP_MODE_FIELD
+
+/**
+ * enum riscv_iommu_dc_fsc_atp_modes - First stage address translation/protection modes
+ * @RISCV_IOMMU_DC_FSC_MODE_BARE: No translation/protection
+ * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV32: Sv32, when dc.tc.SXL == 1
+ * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39: Sv39, when dc.tc.SXL == 0
+ * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48: Sv48, when dc.tc.SXL == 0
+ * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57: Sv57, when dc.tc.SXL == 0
+ * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8: 1lvl PDT, 8bit process ids
+ * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD17: 2lvl PDT, 17bit process ids
+ * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20: 3lvl PDT, 20bit process ids
+ *
+ * FSC holds IOSATP when RISCV_IOMMU_DC_TC_PDTV is 0 and PDTP otherwise.
+ * IOSATP controls the first stage address translation (same as the satp register on
+ * the RISC-V MMU), and PDTP holds the process directory table, used to select a
+ * first stage page table based on a process id (for devices that support multiple
+ * process ids).
+ */
+enum riscv_iommu_dc_fsc_atp_modes {
+ RISCV_IOMMU_DC_FSC_MODE_BARE = 0,
+ RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV32 = 8,
+ RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39 = 8,
+ RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48 = 9,
+ RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57 = 10,
+ RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8 = 1,
+ RISCV_IOMMU_DC_FSC_PDTP_MODE_PD17 = 2,
+ RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20 = 3
+};
+
+/* MSI page table pointer */
+#define RISCV_IOMMU_DC_MSIPTP_PPN RISCV_IOMMU_ATP_PPN_FIELD
+#define RISCV_IOMMU_DC_MSIPTP_MODE RISCV_IOMMU_ATP_MODE_FIELD
+#define RISCV_IOMMU_DC_MSIPTP_MODE_OFF 0
+#define RISCV_IOMMU_DC_MSIPTP_MODE_FLAT 1
+
+/* MSI address mask */
+#define RISCV_IOMMU_DC_MSI_ADDR_MASK GENMASK_ULL(51, 0)
+
+/* MSI address pattern */
+#define RISCV_IOMMU_DC_MSI_PATTERN GENMASK_ULL(51, 0)
+
+/**
+ * struct riscv_iommu_pc - Process Context
+ * @ta: Translation Attributes
+ * @fsc: First stage context
+ *
+ * This structure is used for leaf nodes on the Process Directory Table
+ * See section 2.3 for more details
+ */
+struct riscv_iommu_pc {
+ u64 ta;
+ u64 fsc;
+};
+
+/* Translation attributes fields */
+#define RISCV_IOMMU_PC_TA_V BIT_ULL(0)
+#define RISCV_IOMMU_PC_TA_ENS BIT_ULL(1)
+#define RISCV_IOMMU_PC_TA_SUM BIT_ULL(2)
+#define RISCV_IOMMU_PC_TA_PSCID GENMASK_ULL(31, 12)
+
+/* First stage context fields */
+#define RISCV_IOMMU_PC_FSC_PPN RISCV_IOMMU_ATP_PPN_FIELD
+#define RISCV_IOMMU_PC_FSC_MODE RISCV_IOMMU_ATP_MODE_FIELD
+
+/*
+ * Chapter 3: In-memory queue interface
+ */
+
+/**
+ * struct riscv_iommu_command - Generic I/O MMU command structure
+ * @dword0: Includes the opcode and the function identifier
+ * @dword1: Opcode specific data
+ *
+ * The commands are interpreted as two 64bit fields, where the first
+ * 7bits of the first field are the opcode which also defines the
+ * command's format, followed by a 3bit field that specifies the
+ * function invoked by that command, and the rest is opcode-specific.
+ * This is a generic struct which will be populated differently
+ * according to each command. For more information on the commands
+ * and the command queue, see section 3.1.
+ */
+struct riscv_iommu_command {
+ u64 dword0;
+ u64 dword1;
+};
+
+/* Fields on dword0, common for all commands */
+#define RISCV_IOMMU_CMD_OPCODE GENMASK_ULL(6, 0)
+#define RISCV_IOMMU_CMD_FUNC GENMASK_ULL(9, 7)
+
+/* 3.1.1 I/O MMU Page-table cache invalidation */
+/* Fields on dword0 */
+#define RISCV_IOMMU_CMD_IOTINVAL_OPCODE 1
+#define RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA 0
+#define RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA 1
+#define RISCV_IOMMU_CMD_IOTINVAL_AV BIT_ULL(10)
+#define RISCV_IOMMU_CMD_IOTINVAL_PSCID GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_CMD_IOTINVAL_PSCV BIT_ULL(32)
+#define RISCV_IOMMU_CMD_IOTINVAL_GV BIT_ULL(33)
+#define RISCV_IOMMU_CMD_IOTINVAL_GSCID GENMASK_ULL(59, 44)
+/* dword1[61:10] is the 4K-aligned page address */
+#define RISCV_IOMMU_CMD_IOTINVAL_ADDR GENMASK_ULL(61, 10)
+
+/* 3.1.2 I/O MMU Command Queue Fences */
+/* Fields on dword0 */
+#define RISCV_IOMMU_CMD_IOFENCE_OPCODE 2
+#define RISCV_IOMMU_CMD_IOFENCE_FUNC_C 0
+#define RISCV_IOMMU_CMD_IOFENCE_AV BIT_ULL(10)
+#define RISCV_IOMMU_CMD_IOFENCE_WSI BIT_ULL(11)
+#define RISCV_IOMMU_CMD_IOFENCE_PR BIT_ULL(12)
+#define RISCV_IOMMU_CMD_IOFENCE_PW BIT_ULL(13)
+#define RISCV_IOMMU_CMD_IOFENCE_DATA GENMASK_ULL(63, 32)
+/* dword1 is the address, word-size aligned and shifted to the right by two bits. */
+
+/* 3.1.3 I/O MMU Directory cache invalidation */
+/* Fields on dword0 */
+#define RISCV_IOMMU_CMD_IODIR_OPCODE 3
+#define RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT 0
+#define RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT 1
+#define RISCV_IOMMU_CMD_IODIR_PID GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_CMD_IODIR_DV BIT_ULL(33)
+#define RISCV_IOMMU_CMD_IODIR_DID GENMASK_ULL(63, 40)
+/* dword1 is reserved for standard use */
+
+/* 3.1.4 I/O MMU PCIe ATS */
+/* Fields on dword0 */
+#define RISCV_IOMMU_CMD_ATS_OPCODE 4
+#define RISCV_IOMMU_CMD_ATS_FUNC_INVAL 0
+#define RISCV_IOMMU_CMD_ATS_FUNC_PRGR 1
+#define RISCV_IOMMU_CMD_ATS_PID GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_CMD_ATS_PV BIT_ULL(32)
+#define RISCV_IOMMU_CMD_ATS_DSV BIT_ULL(33)
+#define RISCV_IOMMU_CMD_ATS_RID GENMASK_ULL(55, 40)
+#define RISCV_IOMMU_CMD_ATS_DSEG GENMASK_ULL(63, 56)
+/* dword1 is the ATS payload, two different payload types for INVAL and PRGR */
+
+/* ATS.INVAL payload */
+#define RISCV_IOMMU_CMD_ATS_INVAL_G BIT_ULL(0)
+/* Bits 1 - 10 are zeroed */
+#define RISCV_IOMMU_CMD_ATS_INVAL_S BIT_ULL(11)
+#define RISCV_IOMMU_CMD_ATS_INVAL_UADDR GENMASK_ULL(63, 12)
+
+/* ATS.PRGR payload */
+/* Bits 0 - 31 are zeroed */
+#define RISCV_IOMMU_CMD_ATS_PRGR_PRG_INDEX GENMASK_ULL(40, 32)
+/* Bits 41 - 43 are zeroed */
+#define RISCV_IOMMU_CMD_ATS_PRGR_RESP_CODE GENMASK_ULL(47, 44)
+#define RISCV_IOMMU_CMD_ATS_PRGR_DST_ID GENMASK_ULL(63, 48)
+
+/**
+ * struct riscv_iommu_fq_record - Fault/Event Queue Record
+ * @hdr: Header, includes fault/event cause, PID/DID, transaction type etc
+ * @_reserved: Low 32bits for custom use, high 32bits for standard use
+ * @iotval: Transaction-type/cause specific format
+ * @iotval2: Cause specific format
+ *
+ * The fault/event queue reports events and failures raised when
+ * processing transactions. Each record is a 32byte structure where
+ * the first dword has a fixed format for providing generic information
+ * regarding the fault/event, and two more dwords are there for
+ * fault/event-specific information. For more details see section
+ * 3.2.
+ */
+struct riscv_iommu_fq_record {
+ u64 hdr;
+ u64 _reserved;
+ u64 iotval;
+ u64 iotval2;
+};
+
+/* Fields on header */
+#define RISCV_IOMMU_FQ_HDR_CAUSE GENMASK_ULL(11, 0)
+#define RISCV_IOMMU_FQ_HDR_PID GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_FQ_HDR_PV BIT_ULL(32)
+#define RISCV_IOMMU_FQ_HDR_PRIV BIT_ULL(33)
+#define RISCV_IOMMU_FQ_HDR_TTYPE GENMASK_ULL(39, 34)
+#define RISCV_IOMMU_FQ_HDR_DID GENMASK_ULL(63, 40)
+
+/**
+ * enum riscv_iommu_fq_causes - Fault/event cause values
+ * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT: Instruction access fault
+ * @RISCV_IOMMU_FQ_CAUSE_RD_ADDR_MISALIGNED: Read address misaligned
+ * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT: Read access fault
+ * @RISCV_IOMMU_FQ_CAUSE_WR_ADDR_MISALIGNED: Write/AMO address misaligned
+ * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT: Write/AMO access fault
+ * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT_S: Instruction page fault
+ * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT_S: Read page fault
+ * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT_S: Write/AMO page fault
+ * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT_VS: Instruction guest page fault
+ * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT_VS: Read guest page fault
+ * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT_VS: Write/AMO guest page fault
+ * @RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED: All inbound transactions disallowed
+ * @RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT: DDT entry load access fault
+ * @RISCV_IOMMU_FQ_CAUSE_DDT_INVALID: DDT entry invalid
+ * @RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED: DDT entry misconfigured
+ * @RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED: Transaction type disallowed
+ * @RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT: MSI PTE load access fault
+ * @RISCV_IOMMU_FQ_CAUSE_MSI_INVALID: MSI PTE invalid
+ * @RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED: MSI PTE misconfigured
+ * @RISCV_IOMMU_FQ_CAUSE_MRIF_FAULT: MRIF access fault
+ * @RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT: PDT entry load access fault
+ * @RISCV_IOMMU_FQ_CAUSE_PDT_INVALID: PDT entry invalid
+ * @RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED: PDT entry misconfigured
+ * @RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED: DDT data corruption
+ * @RISCV_IOMMU_FQ_CAUSE_PDT_CORRUPTED: PDT data corruption
+ * @RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED: MSI page table data corruption
+ * @RISCV_IOMMU_FQ_CAUSE_MRIF_CORRUIPTED: MRIF data corruption
+ * @RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR: Internal data path error
+ * @RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT: IOMMU MSI write access fault
+ * @RISCV_IOMMU_FQ_CAUSE_PT_CORRUPTED: First/second stage page table data corruption
+ *
+ * Values are on table 11 of the spec, encodings 275 - 2047 are reserved for standard
+ * use, and 2048 - 4095 for custom use.
+ */
+enum riscv_iommu_fq_causes {
+ RISCV_IOMMU_FQ_CAUSE_INST_FAULT = 1,
+ RISCV_IOMMU_FQ_CAUSE_RD_ADDR_MISALIGNED = 4,
+ RISCV_IOMMU_FQ_CAUSE_RD_FAULT = 5,
+ RISCV_IOMMU_FQ_CAUSE_WR_ADDR_MISALIGNED = 6,
+ RISCV_IOMMU_FQ_CAUSE_WR_FAULT = 7,
+ RISCV_IOMMU_FQ_CAUSE_INST_FAULT_S = 12,
+ RISCV_IOMMU_FQ_CAUSE_RD_FAULT_S = 13,
+ RISCV_IOMMU_FQ_CAUSE_WR_FAULT_S = 15,
+ RISCV_IOMMU_FQ_CAUSE_INST_FAULT_VS = 20,
+ RISCV_IOMMU_FQ_CAUSE_RD_FAULT_VS = 21,
+ RISCV_IOMMU_FQ_CAUSE_WR_FAULT_VS = 23,
+ RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED = 256,
+ RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT = 257,
+ RISCV_IOMMU_FQ_CAUSE_DDT_INVALID = 258,
+ RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED = 259,
+ RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED = 260,
+ RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT = 261,
+ RISCV_IOMMU_FQ_CAUSE_MSI_INVALID = 262,
+ RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED = 263,
+ RISCV_IOMMU_FQ_CAUSE_MRIF_FAULT = 264,
+ RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT = 265,
+ RISCV_IOMMU_FQ_CAUSE_PDT_INVALID = 266,
+ RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED = 267,
+ RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED = 268,
+ RISCV_IOMMU_FQ_CAUSE_PDT_CORRUPTED = 269,
+ RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED = 270,
+ RISCV_IOMMU_FQ_CAUSE_MRIF_CORRUIPTED = 271,
+ RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR = 272,
+ RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT = 273,
+ RISCV_IOMMU_FQ_CAUSE_PT_CORRUPTED = 274
+};
+
+/**
+ * enum riscv_iommu_fq_ttypes - Fault/event transaction types
+ * @RISCV_IOMMU_FQ_TTYPE_NONE: None. Fault not caused by an inbound transaction.
+ * @RISCV_IOMMU_FQ_TTYPE_UADDR_INST_FETCH: Instruction fetch from untranslated address
+ * @RISCV_IOMMU_FQ_TTYPE_UADDR_RD: Read from untranslated address
+ * @RISCV_IOMMU_FQ_TTYPE_UADDR_WR: Write/AMO to untranslated address
+ * @RISCV_IOMMU_FQ_TTYPE_TADDR_INST_FETCH: Instruction fetch from translated address
+ * @RISCV_IOMMU_FQ_TTYPE_TADDR_RD: Read from translated address
+ * @RISCV_IOMMU_FQ_TTYPE_TADDR_WR: Write/AMO to translated address
+ * @RISCV_IOMMU_FQ_TTYPE_PCIE_ATS_REQ: PCIe ATS translation request
+ * @RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ: PCIe message request
+ *
+ * Values are on table 12 of the spec, type 4 and 10 - 31 are reserved for standard use
+ * and 32 - 63 for custom use.
+ */
+enum riscv_iommu_fq_ttypes {
+ RISCV_IOMMU_FQ_TTYPE_NONE = 0,
+ RISCV_IOMMU_FQ_TTYPE_UADDR_INST_FETCH = 1,
+ RISCV_IOMMU_FQ_TTYPE_UADDR_RD = 2,
+ RISCV_IOMMU_FQ_TTYPE_UADDR_WR = 3,
+ RISCV_IOMMU_FQ_TTYPE_TADDR_INST_FETCH = 5,
+ RISCV_IOMMU_FQ_TTYPE_TADDR_RD = 6,
+ RISCV_IOMMU_FQ_TTYPE_TADDR_WR = 7,
+ RISCV_IOMMU_FQ_TTYPE_PCIE_ATS_REQ = 8,
+ RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ = 9,
+};
+
+/**
+ * struct riscv_iommu_pq_record - PCIe Page Request record
+ * @hdr: Header, includes PID, DID etc
+ * @payload: Holds the page address, request group and permission bits
+ *
+ * For more information on the PCIe Page Request queue, see chapter 3.3.
+ */
+struct riscv_iommu_pq_record {
+ u64 hdr;
+ u64 payload;
+};
+
+/* Header fields */
+#define RISCV_IOMMU_PREQ_HDR_PID GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_PREQ_HDR_PV BIT_ULL(32)
+#define RISCV_IOMMU_PREQ_HDR_PRIV BIT_ULL(33)
+#define RISCV_IOMMU_PREQ_HDR_EXEC BIT_ULL(34)
+#define RISCV_IOMMU_PREQ_HDR_DID GENMASK_ULL(63, 40)
+
+/* Payload fields */
+#define RISCV_IOMMU_PREQ_PAYLOAD_R BIT_ULL(0)
+#define RISCV_IOMMU_PREQ_PAYLOAD_W BIT_ULL(1)
+#define RISCV_IOMMU_PREQ_PAYLOAD_L BIT_ULL(2)
+#define RISCV_IOMMU_PREQ_PAYLOAD_M GENMASK_ULL(2, 0) /* Mask of RWL for convenience */
+#define RISCV_IOMMU_PREQ_PRG_INDEX GENMASK_ULL(11, 3)
+#define RISCV_IOMMU_PREQ_UADDR GENMASK_ULL(63, 12)
+
+/**
+ * struct riscv_iommu_msi_pte - MSI Page Table Entry
+ * @pte: MSI PTE
+ * @mrif_info: Memory-resident interrupt file info
+ *
+ * The MSI Page Table is used for virtualizing MSIs, so that when
+ * a device sends an MSI to a guest, the IOMMU can reroute it
+ * by translating the MSI address, either to a guest interrupt file
+ * or a memory resident interrupt file (MRIF). Note that this page table
+ * is an array of MSI PTEs, not a multi-level page table; each entry
+ * is a leaf entry. For more information, see the AIA spec, chapter 9.5.
+ *
+ * Also, in basic mode the mrif_info field is ignored by the IOMMU and can
+ * be used by software; any other reserved fields on pte must be zeroed out
+ * by software.
+ */
+struct riscv_iommu_msi_pte {
+ u64 pte;
+ u64 mrif_info;
+};
+
+/* Fields on pte */
+#define RISCV_IOMMU_MSI_PTE_V BIT_ULL(0)
+#define RISCV_IOMMU_MSI_PTE_M GENMASK_ULL(2, 1)
+#define RISCV_IOMMU_MSI_PTE_MRIF_ADDR GENMASK_ULL(53, 7) /* When M == 1 (MRIF mode) */
+#define RISCV_IOMMU_MSI_PTE_PPN RISCV_IOMMU_PPN_FIELD /* When M == 3 (basic mode) */
+#define RISCV_IOMMU_MSI_PTE_C BIT_ULL(63)
+
+/* Fields on mrif_info */
+#define RISCV_IOMMU_MSI_MRIF_NID GENMASK_ULL(9, 0)
+#define RISCV_IOMMU_MSI_MRIF_NPPN RISCV_IOMMU_PPN_FIELD
+#define RISCV_IOMMU_MSI_MRIF_NID_MSB BIT_ULL(60)
+
+#endif /* _RISCV_IOMMU_BITS_H_ */
diff --git a/drivers/iommu/riscv/iommu-platform.c b/drivers/iommu/riscv/iommu-platform.c
new file mode 100644
index 000000000000..1b453334fbbe
--- /dev/null
+++ b/drivers/iommu/riscv/iommu-platform.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * RISC-V IOMMU as a platform device
+ *
+ * Copyright © 2023 FORTH-ICS/CARV
+ * Copyright © 2023-2024 Rivos Inc.
+ *
+ * Authors
+ * Nick Kossifidis <[email protected]>
+ * Tomasz Jeznach <[email protected]>
+ */
+
+#include <linux/kernel.h>
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+
+#include "iommu-bits.h"
+#include "iommu.h"
+
+static int riscv_iommu_platform_probe(struct platform_device *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct riscv_iommu_device *iommu = NULL;
+ struct resource *res = NULL;
+ int vec;
+
+ iommu = devm_kzalloc(dev, sizeof(*iommu), GFP_KERNEL);
+ if (!iommu)
+ return -ENOMEM;
+
+ iommu->dev = dev;
+ iommu->reg = devm_platform_get_and_ioremap_resource(pdev, 0, &res);
+ if (IS_ERR(iommu->reg))
+ return dev_err_probe(dev, PTR_ERR(iommu->reg),
+ "could not map register region\n");
+
+ dev_set_drvdata(dev, iommu);
+
+ /* Check device reported capabilities / features. */
+ iommu->caps = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_CAP);
+ iommu->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
+
+ /* For now we only support WSI */
+ switch (FIELD_GET(RISCV_IOMMU_CAP_IGS, iommu->caps)) {
+ case RISCV_IOMMU_CAP_IGS_WSI:
+ case RISCV_IOMMU_CAP_IGS_BOTH:
+ break;
+ default:
+ return dev_err_probe(dev, -ENODEV,
+ "unable to use wire-signaled interrupts\n");
+ }
+
+ iommu->irqs_count = platform_irq_count(pdev);
+ if (iommu->irqs_count <= 0)
+ return dev_err_probe(dev, -ENODEV,
+ "no IRQ resources provided\n");
+ if (iommu->irqs_count > RISCV_IOMMU_INTR_COUNT)
+ iommu->irqs_count = RISCV_IOMMU_INTR_COUNT;
+
+ for (vec = 0; vec < iommu->irqs_count; vec++)
+ iommu->irqs[vec] = platform_get_irq(pdev, vec);
+
+ /* Enable wire-signaled interrupts, fctl.WSI */
+ if (!(iommu->fctl & RISCV_IOMMU_FCTL_WSI)) {
+ iommu->fctl |= RISCV_IOMMU_FCTL_WSI;
+ riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FCTL, iommu->fctl);
+ }
+
+ return riscv_iommu_init(iommu);
+}
+
+static void riscv_iommu_platform_remove(struct platform_device *pdev)
+{
+ riscv_iommu_remove(dev_get_drvdata(&pdev->dev));
+}
+
+static const struct of_device_id riscv_iommu_of_match[] = {
+ {.compatible = "riscv,iommu",},
+ {},
+};
+
+static struct platform_driver riscv_iommu_platform_driver = {
+ .probe = riscv_iommu_platform_probe,
+ .remove_new = riscv_iommu_platform_remove,
+ .driver = {
+ .name = "riscv,iommu",
+ .of_match_table = riscv_iommu_of_match,
+ .suppress_bind_attrs = true,
+ },
+};
+
+builtin_platform_driver(riscv_iommu_platform_driver);
diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
new file mode 100644
index 000000000000..3c5a6b49669d
--- /dev/null
+++ b/drivers/iommu/riscv/iommu.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * IOMMU API for RISC-V IOMMU implementations.
+ *
+ * Copyright © 2022-2024 Rivos Inc.
+ * Copyright © 2023 FORTH-ICS/CARV
+ *
+ * Authors
+ * Tomasz Jeznach <[email protected]>
+ * Nick Kossifidis <[email protected]>
+ */
+
+#define pr_fmt(fmt) "riscv-iommu: " fmt
+
+#include <linux/compiler.h>
+#include <linux/crash_dump.h>
+#include <linux/init.h>
+#include <linux/iommu.h>
+#include <linux/kernel.h>
+
+#include "iommu-bits.h"
+#include "iommu.h"
+
+/* Timeouts in [us] */
+#define RISCV_IOMMU_DDTP_TIMEOUT 50000
+
+/*
+ * This is a best-effort IOMMU translation shutdown flow.
+ * Disable IOMMU without waiting for hardware response.
+ */
+static void riscv_iommu_disable(struct riscv_iommu_device *iommu)
+{
+ riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP, 0);
+ riscv_iommu_writel(iommu, RISCV_IOMMU_REG_CQCSR, 0);
+ riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FQCSR, 0);
+ riscv_iommu_writel(iommu, RISCV_IOMMU_REG_PQCSR, 0);
+}
+
+static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
+{
+ u64 ddtp;
+
+ /*
+ * Make sure the IOMMU is switched off or in pass-through mode during
+ * regular boot flow and disable translation when we boot into a kexec
+ * kernel and the previous kernel left it enabled.
+ */
+ ddtp = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_DDTP);
+ if (ddtp & RISCV_IOMMU_DDTP_BUSY)
+ return -EBUSY;
+
+ if (FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp) > RISCV_IOMMU_DDTP_MODE_BARE) {
+ if (!is_kdump_kernel())
+ return -EBUSY;
+ riscv_iommu_disable(iommu);
+ }
+
+ /* Configure accesses to in-memory data structures for CPU-native byte order. */
+ if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) != !!(iommu->fctl & RISCV_IOMMU_FCTL_BE)) {
+ if (!(iommu->caps & RISCV_IOMMU_CAP_END))
+ return -EINVAL;
+ riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FCTL,
+ iommu->fctl ^ RISCV_IOMMU_FCTL_BE);
+ iommu->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
+ if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) != !!(iommu->fctl & RISCV_IOMMU_FCTL_BE))
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+void riscv_iommu_remove(struct riscv_iommu_device *iommu)
+{
+ iommu_device_sysfs_remove(&iommu->iommu);
+}
+
+int riscv_iommu_init(struct riscv_iommu_device *iommu)
+{
+ int rc;
+
+ rc = riscv_iommu_init_check(iommu);
+ if (rc)
+ return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
+
+ /*
+ * Placeholder for a complete IOMMU device initialization. For now,
+ * only bare minimum: enable global identity mapping mode and register sysfs.
+ */
+ riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
+ FIELD_PREP(RISCV_IOMMU_DDTP_MODE, RISCV_IOMMU_DDTP_MODE_BARE));
+
+ rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
+ dev_name(iommu->dev));
+ if (rc)
+ return dev_err_probe(iommu->dev, rc,
+ "cannot register sysfs interface\n");
+
+ return 0;
+}
diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
new file mode 100644
index 000000000000..700e33dc2446
--- /dev/null
+++ b/drivers/iommu/riscv/iommu.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright © 2022-2024 Rivos Inc.
+ * Copyright © 2023 FORTH-ICS/CARV
+ *
+ * Authors
+ * Tomasz Jeznach <[email protected]>
+ * Nick Kossifidis <[email protected]>
+ */
+
+#ifndef _RISCV_IOMMU_H_
+#define _RISCV_IOMMU_H_
+
+#include <linux/iommu.h>
+#include <linux/types.h>
+#include <linux/iopoll.h>
+
+#include "iommu-bits.h"
+
+struct riscv_iommu_device {
+ /* iommu core interface */
+ struct iommu_device iommu;
+
+ /* iommu hardware */
+ struct device *dev;
+
+ /* hardware control register space */
+ void __iomem *reg;
+
+ /* supported and enabled hardware capabilities */
+ u64 caps;
+ u32 fctl;
+
+ /* available interrupt numbers, MSI or WSI */
+ unsigned int irqs[RISCV_IOMMU_INTR_COUNT];
+ unsigned int irqs_count;
+};
+
+int riscv_iommu_init(struct riscv_iommu_device *iommu);
+void riscv_iommu_remove(struct riscv_iommu_device *iommu);
+
+#define riscv_iommu_readl(iommu, addr) \
+ readl_relaxed((iommu)->reg + (addr))
+
+#define riscv_iommu_readq(iommu, addr) \
+ readq_relaxed((iommu)->reg + (addr))
+
+#define riscv_iommu_writel(iommu, addr, val) \
+ writel_relaxed((val), (iommu)->reg + (addr))
+
+#define riscv_iommu_writeq(iommu, addr, val) \
+ writeq_relaxed((val), (iommu)->reg + (addr))
+
+#define riscv_iommu_readq_timeout(iommu, addr, val, cond, delay_us, timeout_us) \
+ readx_poll_timeout(readq_relaxed, (iommu)->reg + (addr), val, cond, \
+ delay_us, timeout_us)
+
+#define riscv_iommu_readl_timeout(iommu, addr, val, cond, delay_us, timeout_us) \
+ readx_poll_timeout(readl_relaxed, (iommu)->reg + (addr), val, cond, \
+ delay_us, timeout_us)
+
+#endif
--
2.34.1
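
For readers following the command layout documented in iommu-bits.h above
(7-bit opcode, 3-bit function, then opcode-specific fields), here is a
minimal userspace sketch of how an IOTINVAL.VMA command could be encoded.
The bit positions mirror the RISCV_IOMMU_CMD_* definitions in the patch;
the struct and function names (iommu_cmd, iotinval_vma) and the 4K page
shift are assumptions made for illustration, not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions copied from the RISCV_IOMMU_CMD_* masks in the patch. */
#define CMD_OPCODE_SHIFT	0		/* dword0 bits 6:0 */
#define CMD_FUNC_SHIFT		7		/* dword0 bits 9:7 */
#define IOTINVAL_OPCODE		1ULL
#define IOTINVAL_FUNC_VMA	0ULL
#define IOTINVAL_AV		(1ULL << 10)	/* address valid */
#define IOTINVAL_PSCID_SHIFT	12		/* dword0 bits 31:12 */
#define IOTINVAL_PSCV		(1ULL << 32)	/* PSCID valid */
#define IOTINVAL_ADDR_SHIFT	10		/* dword1 bits 61:10 */
#define PAGE_SHIFT_4K		12		/* assumption: 4K pages */

/* Hypothetical stand-in for struct riscv_iommu_command. */
struct iommu_cmd {
	uint64_t dword0;
	uint64_t dword1;
};

/* Encode an IOTINVAL.VMA for one page of one process soft-context. */
static struct iommu_cmd iotinval_vma(uint32_t pscid, uint64_t iova)
{
	struct iommu_cmd cmd;
	uint64_t pfn = (iova >> PAGE_SHIFT_4K) & ((1ULL << 52) - 1);

	assert(pscid < (1U << 20));	/* PSCID is a 20-bit field */

	cmd.dword0 = (IOTINVAL_OPCODE << CMD_OPCODE_SHIFT) |
		     (IOTINVAL_FUNC_VMA << CMD_FUNC_SHIFT) |
		     ((uint64_t)pscid << IOTINVAL_PSCID_SHIFT) |
		     IOTINVAL_PSCV | IOTINVAL_AV;
	cmd.dword1 = pfn << IOTINVAL_ADDR_SHIFT;
	return cmd;
}
```

Calling iotinval_vma(5, 0x1000) sets the opcode, AV, PSCV and PSCID = 5 in
dword0 and places page frame number 1 at bit 10 of dword1.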


2024-05-03 16:15:12

by Tomasz Jeznach

Subject: [PATCH v4 6/7] iommu/riscv: Command and fault queue support

Introduce device command submission and fault reporting queues,
as described in Chapter 3.1 and 3.2 of the RISC-V IOMMU Architecture
Specification.

Command and fault queues are instantiated in contiguous system memory
local to the IOMMU device domain, or mapped from fixed I/O space provided
by the hardware implementation. Detection of the location and maximum
allowed size of the queue utilizes the WARL properties of the queue base
control register. The driver will try to allocate up to 128KB of
system memory, while respecting the hardware-supported maximum queue size.
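
The size negotiation described above can be sketched in userspace C. This
is only an illustration under stated assumptions: fake_qbr is a
hypothetical model of a WARL queue base register that clamps LOGSZ to a
hardware maximum, the 5-bit LOGSZ field position is assumed, and
ilog2_u32 stands in for the kernel's ilog2(). The clamping logic follows
riscv_iommu_queue_alloc(), where the queue length is 2^(LOGSZ + 1).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: LOGSZ occupies the low 5 bits of the queue base register. */
#define LOGSZ_MASK	0x1fULL

/* Hypothetical hardware model: WARL register that clamps LOGSZ on write. */
struct fake_qbr {
	uint64_t value;
	unsigned int hw_max_logsz;
};

static void fake_qbr_write(struct fake_qbr *r, uint64_t v)
{
	uint64_t logsz = v & LOGSZ_MASK;

	if (logsz > r->hw_max_logsz)
		logsz = r->hw_max_logsz;
	r->value = (v & ~LOGSZ_MASK) | logsz;
}

static uint64_t fake_qbr_read(const struct fake_qbr *r)
{
	return r->value;
}

/* floor(log2(v)) for v > 0; userspace stand-in for the kernel's ilog2(). */
static unsigned int ilog2_u32(uint32_t v)
{
	unsigned int n = 0;

	while (v >>= 1)
		n++;
	return n;
}

/* Return the number of queue entries after clamping to the hardware limit. */
static uint32_t negotiate_entries(struct fake_qbr *r, uint32_t wanted)
{
	unsigned int logsz, hw;

	assert(wanted >= 2 && (wanted & (wanted - 1)) == 0);

	/* Write all-ones to LOGSZ and read back the largest legal value. */
	fake_qbr_write(r, LOGSZ_MASK);
	hw = (unsigned int)(fake_qbr_read(r) & LOGSZ_MASK);

	/* mask = wanted - 1, so ilog2(mask) equals log2(wanted) - 1. */
	logsz = ilog2_u32(wanted - 1);
	if (logsz > hw)
		logsz = hw;

	return 2U << logsz;
}
```

With the defaults from this patch, 8192 command entries of 16 bytes and
4096 fault records of 32 bytes both come to 128KB, which matches the
allocation limit mentioned above.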

Interrupt allocation is based on interrupt vector availability, and vectors
are distributed to all queues in a simple round-robin fashion. For hardware
implementations with a fixed event type to interrupt vector assignment, the
IVEC WARL property is used to discover such mappings.
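
As a rough illustration of the round-robin assignment, the sketch below
packs one 4-bit vector index per interrupt source into an IVEC-style
value and decodes it the same way riscv_iommu_queue_vec() does
((ivec >> (n * 4)) & mask). The number of sources, the 0xf mask value and
the helper names are assumptions for this example only.

```c
#include <assert.h>
#include <stdint.h>

#define IVEC_FIELD_BITS	4	/* one field per source, matching the n * 4 shift */
#define IVEC_FIELD_MASK	0xfULL	/* assumed value of the CIV mask */
#define INTR_SOURCES	4	/* assumption: command, fault, perf-monitor, page-request */

/* Assign source n to vector (n % irqs_count), i.e. simple round-robin. */
static uint64_t ivec_round_robin(unsigned int irqs_count)
{
	uint64_t ivec = 0;
	unsigned int n;

	assert(irqs_count > 0 && irqs_count <= 16);

	for (n = 0; n < INTR_SOURCES; n++)
		ivec |= (uint64_t)(n % irqs_count) << (n * IVEC_FIELD_BITS);

	return ivec;
}

/* Decode the vector index assigned to interrupt source n. */
static unsigned int ivec_vector_for(uint64_t ivec, unsigned int n)
{
	return (unsigned int)((ivec >> (n * IVEC_FIELD_BITS)) & IVEC_FIELD_MASK);
}
```

On hardware with fixed assignments, the value read back after writing the
register may differ from what was written; decoding the read-back value
with ivec_vector_for() would then yield the mapping the hardware actually
uses.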

Address translation, command and queue fault handling in this change
is limited to simple fault reporting without taking any action.
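
To make "simple fault reporting" concrete, here is a hedged userspace
sketch that splits a fault record header into its fields. The bit ranges
are taken from the RISCV_IOMMU_FQ_HDR_* masks in iommu-bits.h; the
fault_info struct and the bits()/decode_fault_hdr() helpers are
hypothetical and only stand in for FIELD_GET() on the real masks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the decoded header fields. */
struct fault_info {
	unsigned int cause;	/* hdr[11:0]  */
	unsigned int pid;	/* hdr[31:12] */
	unsigned int pv;	/* hdr[32]    */
	unsigned int priv;	/* hdr[33]    */
	unsigned int ttype;	/* hdr[39:34] */
	unsigned int did;	/* hdr[63:40] */
};

/* Extract bits hi..lo (inclusive) of v; stand-in for FIELD_GET(). */
static uint64_t bits(uint64_t v, unsigned int hi, unsigned int lo)
{
	return (v >> lo) & (~0ULL >> (63 - (hi - lo)));
}

static struct fault_info decode_fault_hdr(uint64_t hdr)
{
	struct fault_info f;

	f.cause = (unsigned int)bits(hdr, 11, 0);
	f.pid   = (unsigned int)bits(hdr, 31, 12);
	f.pv    = (unsigned int)bits(hdr, 32, 32);
	f.priv  = (unsigned int)bits(hdr, 33, 33);
	f.ttype = (unsigned int)bits(hdr, 39, 34);
	f.did   = (unsigned int)bits(hdr, 63, 40);
	return f;
}
```

For example, a header with cause 258 (RISCV_IOMMU_FQ_CAUSE_DDT_INVALID)
and ttype 2 (RISCV_IOMMU_FQ_TTYPE_UADDR_RD) decodes to those values plus
the device ID that triggered the fault, which is the information a
report-only handler would log.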

Reviewed-by: Lu Baolu <[email protected]>
Signed-off-by: Tomasz Jeznach <[email protected]>
---
drivers/iommu/riscv/iommu-bits.h | 75 +++++
drivers/iommu/riscv/iommu.c | 496 ++++++++++++++++++++++++++++++-
drivers/iommu/riscv/iommu.h | 21 ++
3 files changed, 590 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/riscv/iommu-bits.h b/drivers/iommu/riscv/iommu-bits.h
index ba093c29de9f..40c379222821 100644
--- a/drivers/iommu/riscv/iommu-bits.h
+++ b/drivers/iommu/riscv/iommu-bits.h
@@ -704,4 +704,79 @@ struct riscv_iommu_msi_pte {
#define RISCV_IOMMU_MSI_MRIF_NPPN RISCV_IOMMU_PPN_FIELD
#define RISCV_IOMMU_MSI_MRIF_NID_MSB BIT_ULL(60)

+/* Helper functions: command structure builders. */
+
+static inline void riscv_iommu_cmd_inval_vma(struct riscv_iommu_command *cmd)
+{
+ cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOTINVAL_OPCODE) |
+ FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA);
+ cmd->dword1 = 0;
+}
+
+static inline void riscv_iommu_cmd_inval_set_addr(struct riscv_iommu_command *cmd,
+ u64 addr)
+{
+ cmd->dword1 = FIELD_PREP(RISCV_IOMMU_CMD_IOTINVAL_ADDR, phys_to_pfn(addr));
+ cmd->dword0 |= RISCV_IOMMU_CMD_IOTINVAL_AV;
+}
+
+static inline void riscv_iommu_cmd_inval_set_pscid(struct riscv_iommu_command *cmd,
+ int pscid)
+{
+ cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IOTINVAL_PSCID, pscid) |
+ RISCV_IOMMU_CMD_IOTINVAL_PSCV;
+}
+
+static inline void riscv_iommu_cmd_inval_set_gscid(struct riscv_iommu_command *cmd,
+ int gscid)
+{
+ cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IOTINVAL_GSCID, gscid) |
+ RISCV_IOMMU_CMD_IOTINVAL_GV;
+}
+
+static inline void riscv_iommu_cmd_iofence(struct riscv_iommu_command *cmd)
+{
+ cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOFENCE_OPCODE) |
+ FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOFENCE_FUNC_C) |
+ RISCV_IOMMU_CMD_IOFENCE_PR | RISCV_IOMMU_CMD_IOFENCE_PW;
+ cmd->dword1 = 0;
+}
+
+static inline void riscv_iommu_cmd_iofence_set_av(struct riscv_iommu_command *cmd,
+ u64 addr, u32 data)
+{
+ cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOFENCE_OPCODE) |
+ FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOFENCE_FUNC_C) |
+ FIELD_PREP(RISCV_IOMMU_CMD_IOFENCE_DATA, data) |
+ RISCV_IOMMU_CMD_IOFENCE_AV;
+ cmd->dword1 = addr >> 2;
+}
+
+static inline void riscv_iommu_cmd_iodir_inval_ddt(struct riscv_iommu_command *cmd)
+{
+ cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IODIR_OPCODE) |
+ FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT);
+ cmd->dword1 = 0;
+}
+
+static inline void riscv_iommu_cmd_iodir_inval_pdt(struct riscv_iommu_command *cmd)
+{
+ cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IODIR_OPCODE) |
+ FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT);
+ cmd->dword1 = 0;
+}
+
+static inline void riscv_iommu_cmd_iodir_set_did(struct riscv_iommu_command *cmd,
+ unsigned int devid)
+{
+ cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IODIR_DID, devid) |
+ RISCV_IOMMU_CMD_IODIR_DV;
+}
+
+static inline void riscv_iommu_cmd_iodir_set_pid(struct riscv_iommu_command *cmd,
+ unsigned int pasid)
+{
+ cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IODIR_PID, pasid);
+}
+
#endif /* _RISCV_IOMMU_BITS_H_ */
diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 71b7903d83d4..4349ac8a3990 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -25,7 +25,14 @@
#include "iommu.h"

/* Timeouts in [us] */
-#define RISCV_IOMMU_DDTP_TIMEOUT 50000
+#define RISCV_IOMMU_QCSR_TIMEOUT 150000
+#define RISCV_IOMMU_QUEUE_TIMEOUT 150000
+#define RISCV_IOMMU_DDTP_TIMEOUT 10000000
+#define RISCV_IOMMU_IOTINVAL_TIMEOUT 90000000
+
+/* Number of entries per CMD/FLT queue, should be <= INT_MAX */
+#define RISCV_IOMMU_DEF_CQ_COUNT 8192
+#define RISCV_IOMMU_DEF_FQ_COUNT 4096

/* RISC-V IOMMU PPN <> PHYS address conversions, PHYS <=> PPN[53:10] */
#define phys_to_ppn(va) (((va) >> 2) & (((1ULL << 44) - 1) << 10))
@@ -89,6 +96,432 @@ static void riscv_iommu_free_pages(struct riscv_iommu_device *iommu, void *addr)
riscv_iommu_devres_pages_match, &devres);
}

+/*
+ * Hardware queue allocation and management.
+ */
+
+/* Setup queue base, control registers and default queue length */
+#define RISCV_IOMMU_QUEUE_INIT(q, name) do { \
+ struct riscv_iommu_queue *_q = q; \
+ _q->qid = RISCV_IOMMU_INTR_ ## name; \
+ _q->qbr = RISCV_IOMMU_REG_ ## name ## B; \
+ _q->qcr = RISCV_IOMMU_REG_ ## name ## CSR; \
+ _q->mask = _q->mask ?: (RISCV_IOMMU_DEF_ ## name ## _COUNT) - 1;\
+} while (0)
+
+/* Note: offsets are the same for all queues */
+#define Q_HEAD(q) ((q)->qbr + (RISCV_IOMMU_REG_CQH - RISCV_IOMMU_REG_CQB))
+#define Q_TAIL(q) ((q)->qbr + (RISCV_IOMMU_REG_CQT - RISCV_IOMMU_REG_CQB))
+#define Q_ITEM(q, index) ((q)->mask & (index))
+#define Q_IPSR(q) BIT((q)->qid)
+
+/*
+ * Discover queue ring buffer hardware configuration, allocate in-memory
+ * ring buffer or use fixed I/O memory location, configure queue base register.
+ * Must be called before hardware queue is enabled.
+ *
+ * @queue - data structure, configured with RISCV_IOMMU_QUEUE_INIT()
+ * @entry_size - queue single element size in bytes.
+ */
+static int riscv_iommu_queue_alloc(struct riscv_iommu_device *iommu,
+ struct riscv_iommu_queue *queue,
+ size_t entry_size)
+{
+ unsigned int logsz;
+ u64 qb, rb;
+
+ /*
+ * Use WARL base register property to discover maximum allowed
+ * number of entries and optional fixed IO address for queue location.
+ */
+ riscv_iommu_writeq(iommu, queue->qbr, RISCV_IOMMU_QUEUE_LOGSZ_FIELD);
+ qb = riscv_iommu_readq(iommu, queue->qbr);
+
+ /*
+ * Calculate and verify hardware supported queue length, as reported
+ * by the field LOGSZ, where max queue length is equal to 2^(LOGSZ + 1).
+ * Update queue size based on hardware supported value.
+ */
+ logsz = ilog2(queue->mask);
+ if (logsz > FIELD_GET(RISCV_IOMMU_QUEUE_LOGSZ_FIELD, qb))
+ logsz = FIELD_GET(RISCV_IOMMU_QUEUE_LOGSZ_FIELD, qb);
+
+ /*
+ * Use WARL base register property to discover an optional fixed IO
+ * address for queue ring buffer location. Otherwise allocate contiguous
+ * system memory.
+ */
+ if (FIELD_GET(RISCV_IOMMU_PPN_FIELD, qb)) {
+ const size_t queue_size = entry_size << (logsz + 1);
+
+ queue->phys = ppn_to_phys(FIELD_GET(RISCV_IOMMU_PPN_FIELD, qb));
+ queue->base = devm_ioremap(iommu->dev, queue->phys, queue_size);
+ } else {
+ do {
+ const size_t queue_size = entry_size << (logsz + 1);
+ const int order = get_order(queue_size);
+
+ queue->base = riscv_iommu_get_pages(iommu, order);
+ queue->phys = __pa(queue->base);
+ } while (!queue->base && logsz-- > 0);
+ }
+
+ if (!queue->base)
+ return -ENOMEM;
+
+ qb = phys_to_ppn(queue->phys) |
+ FIELD_PREP(RISCV_IOMMU_QUEUE_LOGSZ_FIELD, logsz);
+
+ /* Update base register and read back to verify hw accepted our write */
+ riscv_iommu_writeq(iommu, queue->qbr, qb);
+ rb = riscv_iommu_readq(iommu, queue->qbr);
+ if (rb != qb) {
+ dev_err(iommu->dev, "queue #%u allocation failed\n", queue->qid);
+ return -ENODEV;
+ }
+
+ /* Update actual queue mask */
+ queue->mask = (2U << logsz) - 1;
+
+ dev_dbg(iommu->dev, "queue #%u allocated 2^%u entries\n",
+ queue->qid, logsz + 1);
+
+ return 0;
+}
+
+/* Check interrupt queue status, IPSR */
+static irqreturn_t riscv_iommu_queue_ipsr(int irq, void *data)
+{
+ struct riscv_iommu_queue *queue = (struct riscv_iommu_queue *)data;
+
+ if (riscv_iommu_readl(queue->iommu, RISCV_IOMMU_REG_IPSR) & Q_IPSR(queue))
+ return IRQ_WAKE_THREAD;
+
+ return IRQ_NONE;
+}
+
+static int riscv_iommu_queue_vec(struct riscv_iommu_device *iommu, int n)
+{
+ /* Reuse IVEC.CIV mask for all interrupt vectors mapping. */
+ return (iommu->ivec >> (n * 4)) & RISCV_IOMMU_IVEC_CIV;
+}
+
+/*
+ * Enable queue processing in the hardware, register interrupt handler.
+ *
+ * @queue - data structure, already allocated with riscv_iommu_queue_alloc()
+ * @irq_handler - threaded interrupt handler.
+ */
+static int riscv_iommu_queue_enable(struct riscv_iommu_device *iommu,
+ struct riscv_iommu_queue *queue,
+ irq_handler_t irq_handler)
+{
+ const unsigned int irq = iommu->irqs[riscv_iommu_queue_vec(iommu, queue->qid)];
+ u32 csr;
+ int rc;
+
+ if (queue->iommu)
+ return -EBUSY;
+
+ /* Polling not implemented */
+ if (!irq)
+ return -ENODEV;
+
+ queue->iommu = iommu;
+ rc = request_threaded_irq(irq, riscv_iommu_queue_ipsr, irq_handler,
+ IRQF_ONESHOT | IRQF_SHARED,
+ dev_name(iommu->dev), queue);
+ if (rc) {
+ queue->iommu = NULL;
+ return rc;
+ }
+
+ /*
+ * Enable queue with interrupts, clear any pending memory fault.
+ * Wait for the hardware to acknowledge request and activate queue
+ * processing.
+ * Note: All CSR bitfields are in the same offsets for all queues.
+ */
+ riscv_iommu_writel(iommu, queue->qcr,
+ RISCV_IOMMU_QUEUE_ENABLE |
+ RISCV_IOMMU_QUEUE_INTR_ENABLE |
+ RISCV_IOMMU_QUEUE_MEM_FAULT);
+
+ riscv_iommu_readl_timeout(iommu, queue->qcr,
+ csr, !(csr & RISCV_IOMMU_QUEUE_BUSY),
+ 10, RISCV_IOMMU_QCSR_TIMEOUT);
+
+ if (RISCV_IOMMU_QUEUE_ACTIVE != (csr & (RISCV_IOMMU_QUEUE_ACTIVE |
+ RISCV_IOMMU_QUEUE_BUSY |
+ RISCV_IOMMU_QUEUE_MEM_FAULT))) {
+ /* Best effort to stop and disable failing hardware queue. */
+ riscv_iommu_writel(iommu, queue->qcr, 0);
+ free_irq(irq, queue);
+ queue->iommu = NULL;
+ dev_err(iommu->dev, "queue #%u failed to start\n", queue->qid);
+ return -EBUSY;
+ }
+
+ /* Clear any pending interrupt flag. */
+ riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, Q_IPSR(queue));
+
+ return 0;
+}
+
+/*
+ * Disable queue. Wait for the hardware to acknowledge request and
+ * stop processing enqueued requests. Report errors but continue.
+ */
+static void riscv_iommu_queue_disable(struct riscv_iommu_queue *queue)
+{
+ struct riscv_iommu_device *iommu = queue->iommu;
+ u32 csr;
+
+ if (!iommu)
+ return;
+
+ free_irq(iommu->irqs[riscv_iommu_queue_vec(iommu, queue->qid)], queue);
+ riscv_iommu_writel(iommu, queue->qcr, 0);
+ riscv_iommu_readl_timeout(iommu, queue->qcr,
+ csr, !(csr & RISCV_IOMMU_QUEUE_BUSY),
+ 10, RISCV_IOMMU_QCSR_TIMEOUT);
+
+ if (csr & (RISCV_IOMMU_QUEUE_ACTIVE | RISCV_IOMMU_QUEUE_BUSY))
+ dev_err(iommu->dev, "failed to disable hardware queue #%u, csr 0x%x\n",
+ queue->qid, csr);
+
+ queue->iommu = NULL;
+}
+
+/*
+ * Returns number of available valid queue entries and the first item index.
+ * Update shadow producer index if necessary.
+ */
+static int riscv_iommu_queue_consume(struct riscv_iommu_queue *queue,
+ unsigned int *index)
+{
+ unsigned int head = atomic_read(&queue->head);
+ unsigned int tail = atomic_read(&queue->tail);
+ unsigned int last = Q_ITEM(queue, tail);
+ int available = (int)(tail - head);
+
+ *index = head;
+
+ if (available > 0)
+ return available;
+
+ /* read hardware producer index, check reserved register bits are not set. */
+ if (riscv_iommu_readl_timeout(queue->iommu, Q_TAIL(queue),
+ tail, (tail & ~queue->mask) == 0,
+ 0, RISCV_IOMMU_QUEUE_TIMEOUT)) {
+ dev_err_once(queue->iommu->dev,
+ "Hardware error: queue access timeout\n");
+ return 0;
+ }
+
+ if (tail == last)
+ return 0;
+
+ /* update shadow producer index */
+ return (int)(atomic_add_return((tail - last) & queue->mask, &queue->tail) - head);
+}
+
+/*
+ * Release processed queue entries, should match riscv_iommu_queue_consume() calls.
+ */
+static void riscv_iommu_queue_release(struct riscv_iommu_queue *queue, int count)
+{
+ const unsigned int head = atomic_add_return(count, &queue->head);
+
+ riscv_iommu_writel(queue->iommu, Q_HEAD(queue), Q_ITEM(queue, head));
+}
+
+/* Return actual consumer index based on hardware reported queue head index. */
+static unsigned int riscv_iommu_queue_cons(struct riscv_iommu_queue *queue)
+{
+ const unsigned int cons = atomic_read(&queue->head);
+ const unsigned int last = Q_ITEM(queue, cons);
+ unsigned int head;
+
+ if (riscv_iommu_readl_timeout(queue->iommu, Q_HEAD(queue), head,
+ !(head & ~queue->mask),
+ 0, RISCV_IOMMU_QUEUE_TIMEOUT))
+ return cons;
+
+ return cons + ((head - last) & queue->mask);
+}
+
+/* Wait for submitted item to be processed. */
+static int riscv_iommu_queue_wait(struct riscv_iommu_queue *queue,
+ unsigned int index,
+ unsigned int timeout_us)
+{
+ unsigned int cons = atomic_read(&queue->head);
+
+ /* Already processed by the consumer */
+ if ((int)(cons - index) > 0)
+ return 0;
+
+ /* Monitor consumer index */
+ return readx_poll_timeout(riscv_iommu_queue_cons, queue, cons,
+ (int)(cons - index) > 0, 0, timeout_us);
+}
+
+/* Enqueue an entry and wait to be processed if timeout_us > 0
+ *
+ * Error handling for IOMMU hardware not responding in a reasonable time
+ * will be added as a separate patch series along with other RAS features.
+ * For now, only report hardware failure and continue.
+ */
+static void riscv_iommu_queue_send(struct riscv_iommu_queue *queue,
+ void *entry, size_t entry_size,
+ unsigned int timeout_us)
+{
+ unsigned int prod;
+ unsigned int head;
+ unsigned int tail;
+ unsigned long flags;
+
+ /* Do not preempt submission flow. */
+ local_irq_save(flags);
+
+ /* 1. Allocate some space in the queue */
+ prod = atomic_inc_return(&queue->prod) - 1;
+ head = atomic_read(&queue->head);
+
+ /* 2. Wait for space availability. */
+ if ((prod - head) > queue->mask) {
+ if (readx_poll_timeout(atomic_read, &queue->head,
+ head, (prod - head) < queue->mask,
+ 0, RISCV_IOMMU_QUEUE_TIMEOUT))
+ goto err_busy;
+ } else if ((prod - head) == queue->mask) {
+ const unsigned int last = Q_ITEM(queue, head);
+
+ if (riscv_iommu_readl_timeout(queue->iommu, Q_HEAD(queue), head,
+ !(head & ~queue->mask) && head != last,
+ 0, RISCV_IOMMU_QUEUE_TIMEOUT))
+ goto err_busy;
+ atomic_add((head - last) & queue->mask, &queue->head);
+ }
+
+ /* 3. Store entry in the ring buffer. */
+ memcpy(queue->base + Q_ITEM(queue, prod) * entry_size, entry, entry_size);
+
+ /* 4. Wait for all previous entries to be ready */
+ if (readx_poll_timeout(atomic_read, &queue->tail, tail, prod == tail,
+ 0, RISCV_IOMMU_QUEUE_TIMEOUT))
+ goto err_busy;
+
+ /* 5. Complete submission and restore local interrupts */
+ dma_wmb();
+ riscv_iommu_writel(queue->iommu, Q_TAIL(queue), Q_ITEM(queue, prod + 1));
+ atomic_inc(&queue->tail);
+ local_irq_restore(flags);
+
+ if (timeout_us && riscv_iommu_queue_wait(queue, prod, timeout_us))
+ dev_err_once(queue->iommu->dev,
+ "Hardware error: command execution timeout\n");
+
+ return;
+
+err_busy:
+ local_irq_restore(flags);
+ dev_err_once(queue->iommu->dev, "Hardware error: command enqueue failed\n");
+}
+
+/*
+ * IOMMU Command queue chapter 3.1
+ */
+
+/* Command queue interrupt handler thread function */
+static irqreturn_t riscv_iommu_cmdq_process(int irq, void *data)
+{
+ const struct riscv_iommu_queue *queue = (struct riscv_iommu_queue *)data;
+ unsigned int ctrl;
+
+ /* Clear MF/CQ errors, complete error recovery to be implemented. */
+ ctrl = riscv_iommu_readl(queue->iommu, queue->qcr);
+ if (ctrl & (RISCV_IOMMU_CQCSR_CQMF | RISCV_IOMMU_CQCSR_CMD_TO |
+ RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_FENCE_W_IP)) {
+ riscv_iommu_writel(queue->iommu, queue->qcr, ctrl);
+ dev_warn(queue->iommu->dev,
+ "Queue #%u error; fault:%d timeout:%d illegal:%d fence_w_ip:%d\n",
+ queue->qid,
+ !!(ctrl & RISCV_IOMMU_CQCSR_CQMF),
+ !!(ctrl & RISCV_IOMMU_CQCSR_CMD_TO),
+ !!(ctrl & RISCV_IOMMU_CQCSR_CMD_ILL),
+ !!(ctrl & RISCV_IOMMU_CQCSR_FENCE_W_IP));
+ }
+
+ /* Placeholder for command queue interrupt notifiers */
+
+ /* Clear command interrupt pending. */
+ riscv_iommu_writel(queue->iommu, RISCV_IOMMU_REG_IPSR, Q_IPSR(queue));
+
+ return IRQ_HANDLED;
+}
+
+/* Send command to the IOMMU command queue */
+static void riscv_iommu_cmd_send(struct riscv_iommu_device *iommu,
+ struct riscv_iommu_command *cmd,
+ unsigned int timeout_us)
+{
+ riscv_iommu_queue_send(&iommu->cmdq, cmd, sizeof(*cmd), timeout_us);
+}
+
+/*
+ * IOMMU Fault/Event queue chapter 3.2
+ */
+
+static void riscv_iommu_fault(struct riscv_iommu_device *iommu,
+ struct riscv_iommu_fq_record *event)
+{
+ unsigned int err = FIELD_GET(RISCV_IOMMU_FQ_HDR_CAUSE, event->hdr);
+ unsigned int devid = FIELD_GET(RISCV_IOMMU_FQ_HDR_DID, event->hdr);
+
+ /* Placeholder for future fault handling implementation, report only. */
+ if (err)
+ dev_warn_ratelimited(iommu->dev,
+ "Fault %d devid: 0x%x iotval: %llx iotval2: %llx\n",
+ err, devid, event->iotval, event->iotval2);
+}
+
+/* Fault queue interrupt handler thread function */
+static irqreturn_t riscv_iommu_fltq_process(int irq, void *data)
+{
+ struct riscv_iommu_queue *queue = (struct riscv_iommu_queue *)data;
+ struct riscv_iommu_device *iommu = queue->iommu;
+ struct riscv_iommu_fq_record *events;
+ unsigned int ctrl, idx;
+ int cnt, len;
+
+ events = (struct riscv_iommu_fq_record *)queue->base;
+
+ /* Clear fault interrupt pending and process all received fault events. */
+ riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, Q_IPSR(queue));
+
+ do {
+ cnt = riscv_iommu_queue_consume(queue, &idx);
+ for (len = 0; len < cnt; idx++, len++)
+ riscv_iommu_fault(iommu, &events[Q_ITEM(queue, idx)]);
+ riscv_iommu_queue_release(queue, cnt);
+ } while (cnt > 0);
+
+ /* Clear MF/OF errors, complete error recovery to be implemented. */
+ ctrl = riscv_iommu_readl(iommu, queue->qcr);
+ if (ctrl & (RISCV_IOMMU_FQCSR_FQMF | RISCV_IOMMU_FQCSR_FQOF)) {
+ riscv_iommu_writel(iommu, queue->qcr, ctrl);
+ dev_warn(iommu->dev,
+ "Queue #%u error; memory fault:%d overflow:%d\n",
+ queue->qid,
+ !!(ctrl & RISCV_IOMMU_FQCSR_FQMF),
+ !!(ctrl & RISCV_IOMMU_FQCSR_FQOF));
+ }
+
+ return IRQ_HANDLED;
+}
+
/* Lookup and initialize device context info structure. */
static struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iommu,
unsigned int devid)
@@ -250,6 +683,7 @@ static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
struct device *dev = iommu->dev;
u64 ddtp, rq_ddtp;
unsigned int mode, rq_mode = ddtp_mode;
+ struct riscv_iommu_command cmd;

ddtp = riscv_iommu_read_ddtp(iommu);
if (ddtp & RISCV_IOMMU_DDTP_BUSY)
@@ -317,6 +751,18 @@ static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
if (mode != ddtp_mode)
dev_dbg(dev, "DDTP hw mode %u, requested %u\n", mode, ddtp_mode);

+ /* Invalidate device context cache */
+ riscv_iommu_cmd_iodir_inval_ddt(&cmd);
+ riscv_iommu_cmd_send(iommu, &cmd, 0);
+
+ /* Invalidate address translation cache */
+ riscv_iommu_cmd_inval_vma(&cmd);
+ riscv_iommu_cmd_send(iommu, &cmd, 0);
+
+ /* IOFENCE.C */
+ riscv_iommu_cmd_iofence(&cmd);
+ riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_IOTINVAL_TIMEOUT);
+
return 0;
}

@@ -492,6 +938,26 @@ static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
return -EINVAL;
}

+ /*
+ * Distribute interrupt vectors, always use first vector for CIV.
+ * At least one interrupt is required. Read back and verify.
+ */
+ if (!iommu->irqs_count)
+ return -EINVAL;
+
+ iommu->ivec = 0;
+ iommu->ivec |= FIELD_PREP(RISCV_IOMMU_IVEC_FIV, 1 % iommu->irqs_count);
+ iommu->ivec |= FIELD_PREP(RISCV_IOMMU_IVEC_PIV, 2 % iommu->irqs_count);
+ iommu->ivec |= FIELD_PREP(RISCV_IOMMU_IVEC_PMIV, 3 % iommu->irqs_count);
+ riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_IVEC, iommu->ivec);
+
+ iommu->ivec = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_IVEC);
+ if (FIELD_GET(RISCV_IOMMU_IVEC_CIV, iommu->ivec) >= RISCV_IOMMU_INTR_COUNT ||
+ FIELD_GET(RISCV_IOMMU_IVEC_FIV, iommu->ivec) >= RISCV_IOMMU_INTR_COUNT ||
+ FIELD_GET(RISCV_IOMMU_IVEC_PIV, iommu->ivec) >= RISCV_IOMMU_INTR_COUNT ||
+ FIELD_GET(RISCV_IOMMU_IVEC_PMIV, iommu->ivec) >= RISCV_IOMMU_INTR_COUNT)
+ return -EINVAL;
+
return 0;
}

@@ -500,12 +966,17 @@ void riscv_iommu_remove(struct riscv_iommu_device *iommu)
iommu_device_unregister(&iommu->iommu);
iommu_device_sysfs_remove(&iommu->iommu);
riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
+ riscv_iommu_queue_disable(&iommu->cmdq);
+ riscv_iommu_queue_disable(&iommu->fltq);
}

int riscv_iommu_init(struct riscv_iommu_device *iommu)
{
int rc;

+ RISCV_IOMMU_QUEUE_INIT(&iommu->cmdq, CQ);
+ RISCV_IOMMU_QUEUE_INIT(&iommu->fltq, FQ);
+
rc = riscv_iommu_init_check(iommu);
if (rc)
return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
@@ -514,10 +985,28 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
if (rc)
return rc;

- rc = riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
+ rc = riscv_iommu_queue_alloc(iommu, &iommu->cmdq,
+ sizeof(struct riscv_iommu_command));
+ if (rc)
+ return rc;
+
+ rc = riscv_iommu_queue_alloc(iommu, &iommu->fltq,
+ sizeof(struct riscv_iommu_fq_record));
+ if (rc)
+ return rc;
+
+ rc = riscv_iommu_queue_enable(iommu, &iommu->cmdq, riscv_iommu_cmdq_process);
if (rc)
return rc;

+ rc = riscv_iommu_queue_enable(iommu, &iommu->fltq, riscv_iommu_fltq_process);
+ if (rc)
+ goto err_queue_disable;
+
+ rc = riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
+ if (rc)
+ goto err_queue_disable;
+
rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
dev_name(iommu->dev));
if (rc) {
@@ -537,5 +1026,8 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
iommu_device_sysfs_remove(&iommu->iommu);
err_iodir_off:
riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
+err_queue_disable:
+ riscv_iommu_queue_disable(&iommu->fltq);
+ riscv_iommu_queue_disable(&iommu->cmdq);
return rc;
}
diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
index f1696926582c..03e0c45bc7e1 100644
--- a/drivers/iommu/riscv/iommu.h
+++ b/drivers/iommu/riscv/iommu.h
@@ -17,6 +17,22 @@

#include "iommu-bits.h"

+struct riscv_iommu_device;
+
+struct riscv_iommu_queue {
+ atomic_t prod; /* unbounded producer allocation index */
+ atomic_t head; /* unbounded shadow ring buffer consumer index */
+ atomic_t tail; /* unbounded shadow ring buffer producer index */
+ unsigned int mask; /* index mask, queue length - 1 */
+ unsigned int irq; /* allocated interrupt number */
+ struct riscv_iommu_device *iommu; /* iommu device handling the queue when active */
+ void *base; /* ring buffer kernel pointer */
+ dma_addr_t phys; /* ring buffer physical address */
+ u16 qbr; /* base register offset, head and tail reference */
+ u16 qcr; /* control and status register offset */
+ u8 qid; /* queue identifier, same as RISCV_IOMMU_INTR_XX */
+};
+
struct riscv_iommu_device {
/* iommu core interface */
struct iommu_device iommu;
@@ -34,6 +50,11 @@ struct riscv_iommu_device {
/* available interrupt numbers, MSI or WSI */
unsigned int irqs[RISCV_IOMMU_INTR_COUNT];
unsigned int irqs_count;
+ unsigned int ivec;
+
+ /* hardware queues */
+ struct riscv_iommu_queue cmdq;
+ struct riscv_iommu_queue fltq;

/* device directory */
unsigned int ddt_mode;
--
2.34.1
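
A note on the index arithmetic used by riscv_iommu_queue_consume() in the
patch above: the shadow head and tail are unbounded counters, and only the
masked value is exchanged with the hardware. The user-space sketch below
illustrates that idea under stated assumptions. struct ring, ring_item() and
ring_available() are hypothetical names, not part of the driver, and the
hardware register read is replaced by a plain function argument.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct riscv_iommu_queue: index state only. */
struct ring {
	uint32_t head;	/* unbounded consumer index */
	uint32_t tail;	/* unbounded shadow producer index */
	uint32_t mask;	/* ring length - 1, length must be a power of two */
};

/* Ring slot for an unbounded index, the role Q_ITEM() plays in the patch. */
static uint32_t ring_item(const struct ring *r, uint32_t index)
{
	return index & r->mask;
}

/*
 * Fold a hardware-reported (masked) producer index into the unbounded
 * shadow tail and return the number of entries ready to be consumed.
 */
static int ring_available(struct ring *r, uint32_t hw_tail)
{
	uint32_t last = ring_item(r, r->tail);

	assert((r->mask & (r->mask + 1)) == 0);	/* power-of-two length */
	assert((hw_tail & ~r->mask) == 0);	/* no reserved bits set */

	/* Modular distance from the last seen slot to the reported slot. */
	r->tail += (hw_tail - last) & r->mask;

	/* Unsigned subtraction stays correct across 32-bit wrap-around. */
	return (int32_t)(r->tail - r->head);
}
```

Because the counters are only ever compared by subtraction, the sketch keeps
working when they overflow, which is why the patch can store them in plain
atomic_t fields without resetting them.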


2024-05-03 16:15:58

by Tomasz Jeznach

Subject: [PATCH v4 3/7] iommu/riscv: Add RISC-V IOMMU PCIe device driver

Introduce a device driver for the PCIe implementation
of RISC-V IOMMU architected hardware.

IOMMU hardware and system support for MSI or MSI-X is
required by this implementation.

The vendor and device identifiers used in this patch
match the QEMU implementation of the RISC-V IOMMU PCIe
device, from the Rivos VID (0x1efd) range allocated by the PCI-SIG.

MAINTAINERS: the added iommu-pci.c is already covered by the existing matching pattern.

Link: https://lore.kernel.org/qemu-devel/[email protected]/
Co-developed-by: Nick Kossifidis <[email protected]>
Signed-off-by: Nick Kossifidis <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
Signed-off-by: Tomasz Jeznach <[email protected]>
---
drivers/iommu/riscv/Kconfig | 5 ++
drivers/iommu/riscv/Makefile | 1 +
drivers/iommu/riscv/iommu-pci.c | 119 ++++++++++++++++++++++++++++++++
3 files changed, 125 insertions(+)
create mode 100644 drivers/iommu/riscv/iommu-pci.c

diff --git a/drivers/iommu/riscv/Kconfig b/drivers/iommu/riscv/Kconfig
index 5dcc5c45aa50..c071816f59a6 100644
--- a/drivers/iommu/riscv/Kconfig
+++ b/drivers/iommu/riscv/Kconfig
@@ -13,3 +13,8 @@ config RISCV_IOMMU

Say Y here if your SoC includes an IOMMU device implementing
the RISC-V IOMMU architecture.
+
+config RISCV_IOMMU_PCI
+ def_bool y if RISCV_IOMMU && PCI_MSI
+ help
+ Support for the PCIe implementation of RISC-V IOMMU architecture.
diff --git a/drivers/iommu/riscv/Makefile b/drivers/iommu/riscv/Makefile
index e4c189de58d3..f54c9ed17d41 100644
--- a/drivers/iommu/riscv/Makefile
+++ b/drivers/iommu/riscv/Makefile
@@ -1,2 +1,3 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-$(CONFIG_RISCV_IOMMU) += iommu.o iommu-platform.o
+obj-$(CONFIG_RISCV_IOMMU_PCI) += iommu-pci.o
diff --git a/drivers/iommu/riscv/iommu-pci.c b/drivers/iommu/riscv/iommu-pci.c
new file mode 100644
index 000000000000..0a60e068fdc9
--- /dev/null
+++ b/drivers/iommu/riscv/iommu-pci.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * Copyright © 2022-2024 Rivos Inc.
+ * Copyright © 2023 FORTH-ICS/CARV
+ *
+ * RISCV IOMMU as a PCIe device
+ *
+ * Authors
+ * Tomasz Jeznach <[email protected]>
+ * Nick Kossifidis <[email protected]>
+ */
+
+#include <linux/compiler.h>
+#include <linux/init.h>
+#include <linux/iommu.h>
+#include <linux/kernel.h>
+#include <linux/pci.h>
+
+#include "iommu-bits.h"
+#include "iommu.h"
+
+/* Rivos Inc. assigned PCI Vendor and Device IDs */
+#ifndef PCI_VENDOR_ID_RIVOS
+#define PCI_VENDOR_ID_RIVOS 0x1efd
+#endif
+
+#ifndef PCI_DEVICE_ID_RIVOS_IOMMU
+#define PCI_DEVICE_ID_RIVOS_IOMMU 0xedf1
+#endif
+
+static int riscv_iommu_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
+{
+ struct device *dev = &pdev->dev;
+ struct riscv_iommu_device *iommu;
+ int rc, vec;
+
+ rc = pcim_enable_device(pdev);
+ if (rc)
+ return rc;
+
+ if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM))
+ return -ENODEV;
+
+ if (pci_resource_len(pdev, 0) < RISCV_IOMMU_REG_SIZE)
+ return -ENODEV;
+
+ rc = pcim_iomap_regions(pdev, BIT(0), pci_name(pdev));
+ if (rc)
+ return dev_err_probe(dev, rc, "pcim_iomap_regions failed\n");
+
+ iommu = devm_kzalloc(dev, sizeof(*iommu), GFP_KERNEL);
+ if (!iommu)
+ return -ENOMEM;
+
+ iommu->dev = dev;
+ iommu->reg = pcim_iomap_table(pdev)[0];
+
+ pci_set_master(pdev);
+ dev_set_drvdata(dev, iommu);
+
+ /* Check device reported capabilities / features. */
+ iommu->caps = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_CAP);
+ iommu->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
+
+ /* The PCI driver only uses MSIs, make sure the IOMMU supports this */
+ switch (FIELD_GET(RISCV_IOMMU_CAP_IGS, iommu->caps)) {
+ case RISCV_IOMMU_CAP_IGS_MSI:
+ case RISCV_IOMMU_CAP_IGS_BOTH:
+ break;
+ default:
+ return dev_err_probe(dev, -ENODEV,
+ "unable to use message-signaled interrupts\n");
+ }
+
+ /* Allocate and assign IRQ vectors for the various events */
+ rc = pci_alloc_irq_vectors(pdev, 1, RISCV_IOMMU_INTR_COUNT,
+ PCI_IRQ_MSIX | PCI_IRQ_MSI);
+ if (rc <= 0)
+ return dev_err_probe(dev, -ENODEV,
+ "unable to allocate irq vectors\n");
+
+ iommu->irqs_count = rc;
+ for (vec = 0; vec < iommu->irqs_count; vec++)
+ iommu->irqs[vec] = msi_get_virq(dev, vec);
+
+ /* Enable message-signaled interrupts, fctl.WSI */
+ if (iommu->fctl & RISCV_IOMMU_FCTL_WSI) {
+ iommu->fctl ^= RISCV_IOMMU_FCTL_WSI;
+ riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FCTL, iommu->fctl);
+ }
+
+ return riscv_iommu_init(iommu);
+}
+
+static void riscv_iommu_pci_remove(struct pci_dev *pdev)
+{
+ struct riscv_iommu_device *iommu = dev_get_drvdata(&pdev->dev);
+
+ riscv_iommu_remove(iommu);
+}
+
+static const struct pci_device_id riscv_iommu_pci_tbl[] = {
+ {PCI_VENDOR_ID_RIVOS, PCI_DEVICE_ID_RIVOS_IOMMU,
+ PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0},
+ {0,}
+};
+
+static struct pci_driver riscv_iommu_pci_driver = {
+ .name = KBUILD_MODNAME,
+ .id_table = riscv_iommu_pci_tbl,
+ .probe = riscv_iommu_pci_probe,
+ .remove = riscv_iommu_pci_remove,
+ .driver = {
+ .suppress_bind_attrs = true,
+ },
+};
+
+builtin_pci_driver(riscv_iommu_pci_driver);
--
2.34.1
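
The probe routine above accepts anywhere between one and
RISCV_IOMMU_INTR_COUNT MSI vectors, so the core driver has to share vectors
between interrupt sources when fewer than four are granted. The sketch below
shows the modulo distribution used for the IVEC register elsewhere in this
series. It is a user-space illustration only: the shift constants are
assumptions that follow the CIV/FIV/PMIV/PIV field order, and
ivec_distribute() and ivec_vector() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4-bit field positions, in IVEC.CIV/FIV/PMIV/PIV order. */
#define IVEC_SHIFT_CIV	0	/* command queue */
#define IVEC_SHIFT_FIV	4	/* fault queue */
#define IVEC_SHIFT_PMIV	8	/* performance monitoring */
#define IVEC_SHIFT_PIV	12	/* page-request queue */
#define IVEC_FIELD_MASK	0xfu

/* Spread the four interrupt sources over irqs_count available vectors. */
static uint32_t ivec_distribute(unsigned int irqs_count)
{
	uint32_t ivec = 0;

	assert(irqs_count >= 1 && irqs_count <= 4);

	ivec |= (0 % irqs_count) << IVEC_SHIFT_CIV;	/* always vector 0 */
	ivec |= (1 % irqs_count) << IVEC_SHIFT_FIV;
	ivec |= (2 % irqs_count) << IVEC_SHIFT_PIV;
	ivec |= (3 % irqs_count) << IVEC_SHIFT_PMIV;

	return ivec;
}

/* Extract the vector number stored in one 4-bit field. */
static unsigned int ivec_vector(uint32_t ivec, unsigned int shift)
{
	return (ivec >> shift) & IVEC_FIELD_MASK;
}
```

With a single vector every source maps to vector 0, and with four vectors
each source gets its own, so the same code path covers both MSI and MSI-X
allocations of any size the PCI core grants.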


2024-05-03 16:17:25

by Tomasz Jeznach

Subject: [PATCH v4 7/7] iommu/riscv: Paging domain support

Introduce first-stage address translation support.

The page table configured by the IOMMU driver uses the highest mode
implemented by the hardware. If the hardware is not known at domain
allocation time, the driver falls back to the CPU’s MMU page mode.
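
The selection rule can be sketched as follows. This is a minimal user-space
illustration, not the driver code: the CAP_* flags, enum pt_mode,
pick_pt_mode() and pt_va_bits() are hypothetical stand-ins for the
capability bits and mode constants the patch actually uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability flags, for illustration only. */
#define CAP_SV39	(1u << 0)
#define CAP_SV48	(1u << 1)
#define CAP_SV57	(1u << 2)

/* Hypothetical mode names; PT_NONE means no usable mode was found. */
enum pt_mode { PT_NONE, PT_SV39, PT_SV48, PT_SV57 };

/*
 * Pick the deepest page table mode advertised in caps. When the IOMMU is
 * not known yet (caps_known == false), fall back to the CPU MMU mode.
 */
static enum pt_mode pick_pt_mode(bool caps_known, unsigned int caps,
				 enum pt_mode cpu_mode)
{
	if (!caps_known)
		return cpu_mode;
	if (caps & CAP_SV57)
		return PT_SV57;
	if (caps & CAP_SV48)
		return PT_SV48;
	if (caps & CAP_SV39)
		return PT_SV39;
	return PT_NONE;
}

/* Virtual address width implied by a mode: 39, 48 or 57 bits, 0 for none. */
static int pt_va_bits(enum pt_mode mode)
{
	assert(mode >= PT_NONE && mode <= PT_SV57);

	/* Each additional level adds 9 bits of index. */
	return mode == PT_NONE ? 0 : 39 + 9 * ((int)mode - (int)PT_SV39);
}
```

A caller would reject the domain when PT_NONE is returned and otherwise use
pt_va_bits() to size the IOVA aperture.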

This change introduces the IOTINVAL.VMA command, required to invalidate
any cached IOATC entries after a mapping is updated and/or removed from
the paging domain. On hardware that does not support more granular
non-leaf page table cache invalidations, invalidations for non-leaf page
entries use IOTINVAL for all addresses assigned to the protection
domain.
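
As a rough illustration of the invalidation policy, the sketch below plans
the commands for one inclusive IOVA range: small ranges get one
address-specific command per 4 KiB page, anything of 2 MiB or more (or the
full address space) gets a single command covering all addresses. It follows
the intent stated in the comment above RISCV_IOMMU_IOTLB_INVAL_LIMIT in this
patch. struct inval_cmd and plan_inval() are hypothetical, and no PSCID or
hardware queue is modelled.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE	4096ul
#define INVAL_LIMIT	(2ul << 20)	/* 2 MiB threshold */

/* Hypothetical command record: addr_valid == 0 means "all addresses". */
struct inval_cmd {
	int addr_valid;
	unsigned long addr;
};

/*
 * Fill cmds[] (capacity max) with the invalidations for the inclusive
 * range [start, end] and return how many were produced.
 */
static size_t plan_inval(unsigned long start, unsigned long end,
			 struct inval_cmd *cmds, size_t max)
{
	unsigned long len = end - start + 1;	/* wraps to 0 for the full range */
	size_t n = 0;

	assert(cmds != NULL);

	if (len && len < INVAL_LIMIT) {
		/* Small range: one address-specific command per page. */
		for (unsigned long iova = start; iova < end && n < max;
		     iova += PAGE_SIZE)
			cmds[n++] = (struct inval_cmd){ 1, iova };
		return n;
	}

	/* Large or full range: a single command without an address. */
	if (max)
		cmds[n++] = (struct inval_cmd){ 0, 0 };
	return n;
}
```

In the driver each planned command would be followed by one IOFENCE.C per
IOMMU so that the caller only returns after the invalidations completed.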

Signed-off-by: Tomasz Jeznach <[email protected]>
---
drivers/iommu/riscv/iommu.c | 587 +++++++++++++++++++++++++++++++++++-
1 file changed, 585 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 4349ac8a3990..ec701fde520f 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -41,6 +41,10 @@
#define dev_to_iommu(dev) \
iommu_get_iommu_dev(dev, struct riscv_iommu_device, iommu)

+/* IOMMU PSCID allocation namespace. */
+static DEFINE_IDA(riscv_iommu_pscids);
+#define RISCV_IOMMU_MAX_PSCID (BIT(20) - 1)
+
/* Device resource-managed allocations */
struct riscv_iommu_devres {
void *addr;
@@ -766,6 +770,143 @@ static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
return 0;
}

+/* This struct contains protection domain specific IOMMU driver data. */
+struct riscv_iommu_domain {
+ struct iommu_domain domain;
+ struct list_head bonds;
+ spinlock_t lock; /* protect bonds list updates. */
+ int pscid;
+ int numa_node;
+ unsigned int amo_enabled:1;
+ unsigned int pgd_mode;
+ unsigned long *pgd_root;
+};
+
+#define iommu_domain_to_riscv(iommu_domain) \
+ container_of(iommu_domain, struct riscv_iommu_domain, domain)
+
+/* Private IOMMU data for managed devices, dev_iommu_priv_* */
+struct riscv_iommu_info {
+ struct riscv_iommu_domain *domain;
+};
+
+/* Linkage between an iommu_domain and attached devices. */
+struct riscv_iommu_bond {
+ struct list_head list;
+ struct rcu_head rcu;
+ struct device *dev;
+};
+
+static int riscv_iommu_bond_link(struct riscv_iommu_domain *domain,
+ struct device *dev)
+{
+ struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+ struct riscv_iommu_bond *bond;
+ struct list_head *bonds;
+
+ bond = kzalloc(sizeof(*bond), GFP_KERNEL);
+ if (!bond)
+ return -ENOMEM;
+ bond->dev = dev;
+
+ /*
+ * Linked device pointer and iommu data remain stable in bond struct
+ * as long as the device is attached to the managed IOMMU at _probe_device(),
+ * up to completion of _release_device() call. Release of the bond will be
+ * synchronized at the device release, to guarantee no stale pointer is
+ * used inside rcu protected sections.
+ *
+ * List of devices attached to the domain is arranged based on
+ * managed IOMMU device.
+ */
+
+ spin_lock(&domain->lock);
+ list_for_each_rcu(bonds, &domain->bonds)
+ if (dev_to_iommu(list_entry(bonds, struct riscv_iommu_bond, list)->dev) == iommu)
+ break;
+ list_add_rcu(&bond->list, bonds);
+ spin_unlock(&domain->lock);
+
+ return 0;
+}
+
+static void riscv_iommu_bond_unlink(struct riscv_iommu_domain *domain,
+ struct device *dev)
+{
+ struct riscv_iommu_bond *bond, *found = NULL;
+
+ if (!domain)
+ return;
+
+ spin_lock(&domain->lock);
+ list_for_each_entry_rcu(bond, &domain->bonds, list) {
+ if (bond->dev == dev) {
+ list_del_rcu(&bond->list);
+ found = bond;
+ }
+ }
+ spin_unlock(&domain->lock);
+ kfree_rcu(found, rcu);
+}
+
+/*
+ * Send IOTINVAL.VMA for the whole address space for ranges larger than 2MB.
+ * This limit will be replaced with range invalidations, if supported by
+ * the hardware, once a RISC-V IOMMU architecture specification update
+ * covering range invalidations is available.
+ */
+#define RISCV_IOMMU_IOTLB_INVAL_LIMIT (2 << 20)
+
+static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
+ unsigned long start, unsigned long end)
+{
+ struct riscv_iommu_bond *bond;
+ struct riscv_iommu_device *iommu, *prev;
+ struct riscv_iommu_command cmd;
+ unsigned long len = end - start + 1;
+ unsigned long iova;
+
+ rcu_read_lock();
+
+ prev = NULL;
+ list_for_each_entry_rcu(bond, &domain->bonds, list) {
+ iommu = dev_to_iommu(bond->dev);
+
+ /*
+ * IOTLB invalidation request can be safely omitted if already sent
+ * to the IOMMU for the same PSCID, and with domain->bonds list
+ * arranged based on the device's IOMMU, it's sufficient to check
+ * last device the invalidation was sent to.
+ */
+ if (iommu == prev)
+ continue;
+
+ riscv_iommu_cmd_inval_vma(&cmd);
+ riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
+ if (len && len < RISCV_IOMMU_IOTLB_INVAL_LIMIT) {
+ for (iova = start; iova < end; iova += PAGE_SIZE) {
+ riscv_iommu_cmd_inval_set_addr(&cmd, iova);
+ riscv_iommu_cmd_send(iommu, &cmd, 0);
+ }
+ } else {
+ riscv_iommu_cmd_send(iommu, &cmd, 0);
+ }
+ prev = iommu;
+ }
+
+ prev = NULL;
+ list_for_each_entry_rcu(bond, &domain->bonds, list) {
+ iommu = dev_to_iommu(bond->dev);
+ if (iommu == prev)
+ continue;
+
+ riscv_iommu_cmd_iofence(&cmd);
+ riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_QUEUE_TIMEOUT);
+ prev = iommu;
+ }
+ rcu_read_unlock();
+}
+
#define RISCV_IOMMU_FSC_BARE 0

/*
@@ -785,10 +926,29 @@ static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
{
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
struct riscv_iommu_dc *dc;
+ struct riscv_iommu_command cmd;
+ bool sync_required = false;
u64 tc;
int i;

- /* Device context invalidation ignored for now. */
+ for (i = 0; i < fwspec->num_ids; i++) {
+ dc = riscv_iommu_get_dc(iommu, fwspec->ids[i]);
+ tc = READ_ONCE(dc->tc);
+ if (tc & RISCV_IOMMU_DC_TC_V) {
+ WRITE_ONCE(dc->tc, tc & ~RISCV_IOMMU_DC_TC_V);
+
+ /* Invalidate device context cached values */
+ riscv_iommu_cmd_iodir_inval_ddt(&cmd);
+ riscv_iommu_cmd_iodir_set_did(&cmd, fwspec->ids[i]);
+ riscv_iommu_cmd_send(iommu, &cmd, 0);
+ sync_required = true;
+ }
+ }
+
+ if (sync_required) {
+ riscv_iommu_cmd_iofence(&cmd);
+ riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_IOTINVAL_TIMEOUT);
+ }

/*
* For device context with DC_TC_PDTV = 0, translation attributes valid bit
@@ -806,12 +966,415 @@ static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
}
}

+/*
+ * IOVA page translation tree management.
+ */
+
+#define IOMMU_PAGE_SIZE_4K BIT_ULL(12)
+#define IOMMU_PAGE_SIZE_2M BIT_ULL(21)
+#define IOMMU_PAGE_SIZE_1G BIT_ULL(30)
+#define IOMMU_PAGE_SIZE_512G BIT_ULL(39)
+
+#define PT_SHIFT (PAGE_SHIFT - ilog2(sizeof(pte_t)))
+
+static void riscv_iommu_iotlb_flush_all(struct iommu_domain *iommu_domain)
+{
+ struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+
+ riscv_iommu_iotlb_inval(domain, 0, ULONG_MAX);
+}
+
+static void riscv_iommu_iotlb_sync(struct iommu_domain *iommu_domain,
+ struct iommu_iotlb_gather *gather)
+{
+ struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+
+ riscv_iommu_iotlb_inval(domain, gather->start, gather->end);
+}
+
+static inline size_t get_page_size(size_t size)
+{
+ if (size >= IOMMU_PAGE_SIZE_512G)
+ return IOMMU_PAGE_SIZE_512G;
+ if (size >= IOMMU_PAGE_SIZE_1G)
+ return IOMMU_PAGE_SIZE_1G;
+ if (size >= IOMMU_PAGE_SIZE_2M)
+ return IOMMU_PAGE_SIZE_2M;
+ return IOMMU_PAGE_SIZE_4K;
+}
+
+#define _io_pte_present(pte) ((pte) & (_PAGE_PRESENT | _PAGE_PROT_NONE))
+#define _io_pte_leaf(pte) ((pte) & _PAGE_LEAF)
+#define _io_pte_none(pte) ((pte) == 0)
+#define _io_pte_entry(pn, prot) ((_PAGE_PFN_MASK & ((pn) << _PAGE_PFN_SHIFT)) | (prot))
+
+static void riscv_iommu_pte_free(struct riscv_iommu_domain *domain,
+ unsigned long pte, struct list_head *freelist)
+{
+ unsigned long *ptr;
+ int i;
+
+ if (!_io_pte_present(pte) || _io_pte_leaf(pte))
+ return;
+
+ ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
+
+ /* Recursively free all sub page table pages */
+ for (i = 0; i < PTRS_PER_PTE; i++) {
+ pte = READ_ONCE(ptr[i]);
+ if (!_io_pte_none(pte) && cmpxchg_relaxed(ptr + i, pte, 0) == pte)
+ riscv_iommu_pte_free(domain, pte, freelist);
+ }
+
+ if (freelist)
+ list_add_tail(&virt_to_page(ptr)->lru, freelist);
+ else
+ iommu_free_page(ptr);
+}
+
+static unsigned long *riscv_iommu_pte_alloc(struct riscv_iommu_domain *domain,
+ unsigned long iova, size_t pgsize,
+ gfp_t gfp)
+{
+ unsigned long *ptr = domain->pgd_root;
+ unsigned long pte, old;
+ int level = domain->pgd_mode - RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39 + 2;
+ void *addr;
+
+ do {
+ const int shift = PAGE_SHIFT + PT_SHIFT * level;
+
+ ptr += ((iova >> shift) & (PTRS_PER_PTE - 1));
+ /*
+ * Note: returned entry might be a non-leaf if there was
+ * existing mapping with smaller granularity. Up to the caller
+ * to replace and invalidate.
+ */
+ if (((size_t)1 << shift) == pgsize)
+ return ptr;
+pte_retry:
+ pte = READ_ONCE(*ptr);
+ /*
+ * This is very likely incorrect as we should not be adding
+ * new mapping with smaller granularity on top
+ * of existing 2M/1G mapping. Fail.
+ */
+ if (_io_pte_present(pte) && _io_pte_leaf(pte))
+ return NULL;
+ /*
+ * Non-leaf entry is missing, allocate and try to add to the
+ * page table. This might race with other mappings, retry.
+ */
+ if (_io_pte_none(pte)) {
+ addr = iommu_alloc_page_node(domain->numa_node, gfp);
+ if (!addr)
+ return NULL;
+ old = pte;
+ pte = _io_pte_entry(virt_to_pfn(addr), _PAGE_TABLE);
+ if (cmpxchg_relaxed(ptr, old, pte) != old) {
+ iommu_free_page(addr);
+ goto pte_retry;
+ }
+ }
+ ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
+ } while (level-- > 0);
+
+ return NULL;
+}
+
+static unsigned long *riscv_iommu_pte_fetch(struct riscv_iommu_domain *domain,
+ unsigned long iova, size_t *pte_pgsize)
+{
+ unsigned long *ptr = domain->pgd_root;
+ unsigned long pte;
+ int level = domain->pgd_mode - RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39 + 2;
+
+ do {
+ const int shift = PAGE_SHIFT + PT_SHIFT * level;
+
+ ptr += ((iova >> shift) & (PTRS_PER_PTE - 1));
+ pte = READ_ONCE(*ptr);
+ if (_io_pte_present(pte) && _io_pte_leaf(pte)) {
+ *pte_pgsize = (size_t)1 << shift;
+ return ptr;
+ }
+ if (_io_pte_none(pte))
+ return NULL;
+ ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
+ } while (level-- > 0);
+
+ return NULL;
+}
+
+static int riscv_iommu_map_pages(struct iommu_domain *iommu_domain,
+ unsigned long iova, phys_addr_t phys,
+ size_t pgsize, size_t pgcount, int prot,
+ gfp_t gfp, size_t *mapped)
+{
+ struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+ size_t size = 0;
+ size_t page_size = get_page_size(pgsize);
+ unsigned long *ptr;
+ unsigned long pte, old, pte_prot;
+ int rc = 0;
+ LIST_HEAD(freelist);
+
+ if (!(prot & IOMMU_WRITE))
+ pte_prot = _PAGE_BASE | _PAGE_READ;
+ else if (domain->amo_enabled)
+ pte_prot = _PAGE_BASE | _PAGE_READ | _PAGE_WRITE;
+ else
+ pte_prot = _PAGE_BASE | _PAGE_READ | _PAGE_WRITE | _PAGE_DIRTY;
+
+ while (pgcount) {
+ ptr = riscv_iommu_pte_alloc(domain, iova, page_size, gfp);
+ if (!ptr) {
+ rc = -ENOMEM;
+ break;
+ }
+
+ old = READ_ONCE(*ptr);
+ pte = _io_pte_entry(phys_to_pfn(phys), pte_prot);
+ if (cmpxchg_relaxed(ptr, old, pte) != old)
+ continue;
+
+ riscv_iommu_pte_free(domain, old, &freelist);
+
+ size += page_size;
+ iova += page_size;
+ phys += page_size;
+ --pgcount;
+ }
+
+ *mapped = size;
+
+ if (!list_empty(&freelist)) {
+ /*
+ * In 1.0 spec version, the smallest scope we can use to
+ * invalidate all levels of page table (i.e. leaf and non-leaf)
+ * is an invalidate-all-PSCID IOTINVAL.VMA with AV=0.
+ * This will be updated with hardware support for
+ * capability.NL (non-leaf) IOTINVAL command.
+ */
+ riscv_iommu_iotlb_inval(domain, 0, ULONG_MAX);
+ iommu_put_pages_list(&freelist);
+ }
+
+ return rc;
+}
+
+static size_t riscv_iommu_unmap_pages(struct iommu_domain *iommu_domain,
+ unsigned long iova, size_t pgsize,
+ size_t pgcount,
+ struct iommu_iotlb_gather *gather)
+{
+ struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+ size_t size = pgcount << __ffs(pgsize);
+ unsigned long *ptr, old;
+ size_t unmapped = 0;
+ size_t pte_size;
+
+ while (unmapped < size) {
+ ptr = riscv_iommu_pte_fetch(domain, iova, &pte_size);
+ if (!ptr)
+ return unmapped;
+
+ /* partial unmap is not allowed, fail. */
+ if (iova & (pte_size - 1))
+ return unmapped;
+
+ old = READ_ONCE(*ptr);
+ if (cmpxchg_relaxed(ptr, old, 0) != old)
+ continue;
+
+ iommu_iotlb_gather_add_page(&domain->domain, gather, iova,
+ pte_size);
+
+ iova += pte_size;
+ unmapped += pte_size;
+ }
+
+ return unmapped;
+}
+
+static phys_addr_t riscv_iommu_iova_to_phys(struct iommu_domain *iommu_domain,
+ dma_addr_t iova)
+{
+ struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+ size_t pte_size;
+ unsigned long *ptr;
+
+ ptr = riscv_iommu_pte_fetch(domain, iova, &pte_size);
+ if (!ptr || _io_pte_none(*ptr) || !_io_pte_present(*ptr))
+ return 0;
+
+ return pfn_to_phys(__page_val_to_pfn(*ptr)) | (iova & (pte_size - 1));
+}
+
+static void riscv_iommu_free_paging_domain(struct iommu_domain *iommu_domain)
+{
+ struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+ const unsigned long pfn = virt_to_pfn(domain->pgd_root);
+
+ WARN_ON(!list_empty(&domain->bonds));
+
+ if ((int)domain->pscid > 0)
+ ida_free(&riscv_iommu_pscids, domain->pscid);
+
+ riscv_iommu_pte_free(domain, _io_pte_entry(pfn, _PAGE_TABLE), NULL);
+ kfree(domain);
+}
+
+static bool riscv_iommu_pt_supported(struct riscv_iommu_device *iommu, int pgd_mode)
+{
+ switch (pgd_mode) {
+ case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39:
+ return iommu->caps & RISCV_IOMMU_CAP_S_SV39;
+
+ case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48:
+ return iommu->caps & RISCV_IOMMU_CAP_S_SV48;
+
+ case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57:
+ return iommu->caps & RISCV_IOMMU_CAP_S_SV57;
+ }
+ return false;
+}
+
+static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
+ struct device *dev)
+{
+ struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+ struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+ struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
+ u64 fsc, ta;
+
+ if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
+ return -ENODEV;
+
+ if (domain->numa_node == NUMA_NO_NODE)
+ domain->numa_node = dev_to_node(iommu->dev);
+
+ if (riscv_iommu_bond_link(domain, dev))
+ return -ENOMEM;
+
+ /*
+ * Invalidate PSCID.
+ * This invalidation might be redundant if IOATC has been already cleared,
+ * however we are not keeping track for domains not linked to a device.
+ */
+ riscv_iommu_iotlb_inval(domain, 0, ULONG_MAX);
+
+ fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, domain->pgd_mode) |
+ FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, virt_to_pfn(domain->pgd_root));
+ ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid) |
+ RISCV_IOMMU_PC_TA_V;
+
+ riscv_iommu_iodir_update(iommu, dev, fsc, ta);
+ riscv_iommu_bond_unlink(info->domain, dev);
+ info->domain = domain;
+
+ return 0;
+}
+
+static const struct iommu_domain_ops riscv_iommu_paging_domain_ops = {
+ .attach_dev = riscv_iommu_attach_paging_domain,
+ .free = riscv_iommu_free_paging_domain,
+ .map_pages = riscv_iommu_map_pages,
+ .unmap_pages = riscv_iommu_unmap_pages,
+ .iova_to_phys = riscv_iommu_iova_to_phys,
+ .iotlb_sync = riscv_iommu_iotlb_sync,
+ .flush_iotlb_all = riscv_iommu_iotlb_flush_all,
+};
+
+static struct iommu_domain *riscv_iommu_alloc_paging_domain(struct device *dev)
+{
+ struct riscv_iommu_domain *domain;
+ struct riscv_iommu_device *iommu;
+ unsigned int pgd_mode;
+ int va_bits;
+
+ iommu = dev ? dev_to_iommu(dev) : NULL;
+
+ /*
+ * In unlikely case when dev or iommu is not known, use system
+ * SATP mode to configure paging domain radix tree depth.
+ * Use highest available if actual IOMMU hardware capabilities
+ * are known here.
+ */
+ if (!iommu) {
+ pgd_mode = satp_mode >> SATP_MODE_SHIFT;
+ va_bits = VA_BITS;
+ } else if (iommu->caps & RISCV_IOMMU_CAP_S_SV57) {
+ pgd_mode = RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57;
+ va_bits = 57;
+ } else if (iommu->caps & RISCV_IOMMU_CAP_S_SV48) {
+ pgd_mode = RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48;
+ va_bits = 48;
+ } else if (iommu->caps & RISCV_IOMMU_CAP_S_SV39) {
+ pgd_mode = RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39;
+ va_bits = 39;
+ } else {
+ dev_err(dev, "cannot find supported page table mode\n");
+ return ERR_PTR(-ENODEV);
+ }
+
+ domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+ if (!domain)
+ return ERR_PTR(-ENOMEM);
+
+ INIT_LIST_HEAD_RCU(&domain->bonds);
+ spin_lock_init(&domain->lock);
+ domain->numa_node = NUMA_NO_NODE;
+
+ if (iommu) {
+ domain->numa_node = dev_to_node(iommu->dev);
+ domain->amo_enabled = !!(iommu->caps & RISCV_IOMMU_CAP_AMO_HWAD);
+ }
+
+ domain->pgd_mode = pgd_mode;
+ domain->pgd_root = iommu_alloc_page_node(domain->numa_node,
+ GFP_KERNEL_ACCOUNT);
+ if (!domain->pgd_root) {
+ kfree(domain);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ domain->pscid = ida_alloc_range(&riscv_iommu_pscids, 1,
+ RISCV_IOMMU_MAX_PSCID, GFP_KERNEL);
+ if (domain->pscid < 0) {
+ iommu_free_page(domain->pgd_root);
+ kfree(domain);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /*
+ * Note: RISC-V Privilege spec mandates that virtual addresses
+ * need to be sign-extended, so if (VA_BITS - 1) is set, all
+ * bits >= VA_BITS need to also be set or else we'll get a
+ * page fault. However the code that creates the mappings
+ * above us (e.g. iommu_dma_alloc_iova()) won't do that for us
+ * for now, so we'll end up with invalid virtual addresses
+ * to map. As a workaround until we get this sorted out
+ * limit the available virtual addresses to VA_BITS - 1.
+ */
+ domain->domain.geometry.aperture_start = 0;
+ domain->domain.geometry.aperture_end = DMA_BIT_MASK(va_bits - 1);
+ domain->domain.geometry.force_aperture = true;
+
+ domain->domain.ops = &riscv_iommu_paging_domain_ops;
+
+ return &domain->domain;
+}
+
static int riscv_iommu_attach_blocking_domain(struct iommu_domain *iommu_domain,
struct device *dev)
{
struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+ struct riscv_iommu_info *info = dev_iommu_priv_get(dev);

riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, 0);
+ riscv_iommu_bond_unlink(info->domain, dev);
+ info->domain = NULL;

return 0;
}
@@ -827,8 +1390,11 @@ static int riscv_iommu_attach_identity_domain(struct iommu_domain *iommu_domain,
struct device *dev)
{
struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+ struct riscv_iommu_info *info = dev_iommu_priv_get(dev);

riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, RISCV_IOMMU_PC_TA_V);
+ riscv_iommu_bond_unlink(info->domain, dev);
+ info->domain = NULL;

return 0;
}
@@ -842,7 +1408,7 @@ static struct iommu_domain riscv_iommu_identity_domain = {

static int riscv_iommu_device_domain_type(struct device *dev)
{
- return IOMMU_DOMAIN_IDENTITY;
+ return 0;
}

static struct iommu_group *riscv_iommu_device_group(struct device *dev)
@@ -861,6 +1427,7 @@ static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
{
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
struct riscv_iommu_device *iommu;
+ struct riscv_iommu_info *info;
struct riscv_iommu_dc *dc;
u64 tc;
int i;
@@ -895,17 +1462,33 @@ static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
WRITE_ONCE(dc->tc, tc);
}

+ info = kzalloc(sizeof(*info), GFP_KERNEL);
+ if (!info)
+ return ERR_PTR(-ENOMEM);
+ dev_iommu_priv_set(dev, info);
+
return &iommu->iommu;
}

+static void riscv_iommu_release_device(struct device *dev)
+{
+ struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
+
+ synchronize_rcu();
+ kfree(info);
+}
+
static const struct iommu_ops riscv_iommu_ops = {
+ .pgsize_bitmap = SZ_4K,
.of_xlate = riscv_iommu_of_xlate,
.identity_domain = &riscv_iommu_identity_domain,
.blocked_domain = &riscv_iommu_blocking_domain,
.release_domain = &riscv_iommu_blocking_domain,
+ .domain_alloc_paging = riscv_iommu_alloc_paging_domain,
.def_domain_type = riscv_iommu_device_domain_type,
.device_group = riscv_iommu_device_group,
.probe_device = riscv_iommu_probe_device,
+ .release_device = riscv_iommu_release_device,
};

static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
--
2.34.1
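
Editor's note on the aperture comment in riscv_iommu_alloc_paging_domain()
above: the sketch below is a standalone, user-space illustration of why the
aperture is capped at DMA_BIT_MASK(va_bits - 1). It is not part of the patch.
The helper names (low_bits_mask, iova_is_canonical, aperture_end) are
hypothetical and exist only for this example. An address is treated as
canonical when bits [63:va_bits-1] are all zero or all one, which is how I
read the sign-extension rule quoted in the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask with the low n bits set, for 1 <= n <= 64 (similar in spirit to DMA_BIT_MASK). */
static uint64_t low_bits_mask(unsigned int n)
{
	assert(n >= 1 && n <= 64);
	return n == 64 ? ~0ULL : (1ULL << n) - 1;
}

/*
 * True if addr is sign-extended from bit (va_bits - 1), i.e. bits
 * [63 : va_bits - 1] are either all zero or all one.
 */
static bool iova_is_canonical(uint64_t addr, unsigned int va_bits)
{
	assert(va_bits >= 2 && va_bits <= 64);

	uint64_t upper = addr >> (va_bits - 1);
	uint64_t all_ones = low_bits_mask(64 - (va_bits - 1));

	return upper == 0 || upper == all_ones;
}

/* Highest IOVA that is canonical without needing any sign extension. */
static uint64_t aperture_end(unsigned int va_bits)
{
	assert(va_bits >= 2 && va_bits <= 64);
	return low_bits_mask(va_bits - 1);
}
```

With va_bits = 39, aperture_end() yields 0x3FFFFFFFFF. Every address at or
below that value passes iova_is_canonical(), while 1ULL << 38 does not unless
the upper bits are also set, which is the case the workaround avoids.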


2024-05-03 16:22:20

by Tomasz Jeznach

Subject: [PATCH v4 5/7] iommu/riscv: Device directory management.

Introduce device context allocation and device directory tree
management including capabilities discovery sequence, as described
in Chapter 2.1 of the RISC-V IOMMU Architecture Specification.

Device directory mode is auto-detected using the WARL property of the DDTP
register, selecting the highest mode supported by both the driver and the
hardware. If no supported mode can be configured, the driver falls back to
global pass-through.

The first-level device directory page can be located either in I/O memory
(detected using DDTP WARL) or in system memory.

Only simple identity and blocking protection domains are supported by
this implementation.

Co-developed-by: Nick Kossifidis <[email protected]>
Signed-off-by: Nick Kossifidis <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
Signed-off-by: Tomasz Jeznach <[email protected]>
---
drivers/iommu/riscv/iommu.c | 396 +++++++++++++++++++++++++++++++++++-
drivers/iommu/riscv/iommu.h | 5 +
2 files changed, 391 insertions(+), 10 deletions(-)
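
Editor's note: the mode auto-detection described in the commit message can be
sketched in user space as below. This is only an illustration under stated
assumptions, not the driver code. struct fake_iommu and the fake_* helpers are
hypothetical stand-ins for the DDTP register; the model assumes that a write
of an unsupported level leaves the previous legal mode in place, whereas real
hardware may read back any legal value. The mode encodings are assumed to
follow the spec's ddtp.iommu_mode field.

```c
#include <assert.h>

/* Assumed ddtp.iommu_mode encodings. */
enum ddtp_mode {
	MODE_OFF  = 0,
	MODE_BARE = 1,
	MODE_1LVL = 2,
	MODE_2LVL = 3,
	MODE_3LVL = 4,
};

/* Hypothetical register model: modes above max_mode are not implemented. */
struct fake_iommu {
	unsigned int max_mode;
	unsigned int cur_mode;
};

/* WARL write: an unsupported mode is ignored and the old value is kept. */
static void fake_write_mode(struct fake_iommu *hw, unsigned int mode)
{
	assert(mode <= MODE_3LVL);
	if (mode <= hw->max_mode)
		hw->cur_mode = mode;
}

static unsigned int fake_read_mode(const struct fake_iommu *hw)
{
	return hw->cur_mode;
}

/*
 * Try the requested mode first, then step down one level at a time while
 * the hardware keeps reading back OFF/BARE. Returns the accepted mode, or
 * -1 if nothing could be configured.
 */
static int negotiate_mode(struct fake_iommu *hw, unsigned int requested)
{
	unsigned int rq = requested;

	for (;;) {
		fake_write_mode(hw, rq);
		unsigned int mode = fake_read_mode(hw);

		if (mode == rq)
			return (int)mode;

		/* OFF and BARE are mandatory, so a rejection here is an error. */
		if (rq < MODE_1LVL)
			return -1;

		/* Read back OFF/BARE for an xLVL request: retry one level lower. */
		if (mode < MODE_1LVL && rq > MODE_1LVL) {
			rq--;
			continue;
		}

		return -1;
	}
}
```

For a model that implements at most two levels, a request for MODE_3LVL ends
up at MODE_2LVL. A model that implements no directory levels at all makes the
loop give up, which corresponds to the driver's error path.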

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 1f889daffb0e..71b7903d83d4 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -16,15 +16,168 @@
#include <linux/crash_dump.h>
#include <linux/init.h>
#include <linux/iommu.h>
+#include <linux/iopoll.h>
#include <linux/kernel.h>
#include <linux/pci.h>

+#include "../iommu-pages.h"
#include "iommu-bits.h"
#include "iommu.h"

/* Timeouts in [us] */
#define RISCV_IOMMU_DDTP_TIMEOUT 50000

+/* RISC-V IOMMU PPN <> PHYS address conversions, PHYS <=> PPN[53:10] */
+#define phys_to_ppn(va) (((va) >> 2) & (((1ULL << 44) - 1) << 10))
+#define ppn_to_phys(pn) (((pn) << 2) & (((1ULL << 44) - 1) << 12))
+
+#define dev_to_iommu(dev) \
+ iommu_get_iommu_dev(dev, struct riscv_iommu_device, iommu)
+
+/* Device resource-managed allocations */
+struct riscv_iommu_devres {
+ void *addr;
+ int order;
+};
+
+static void riscv_iommu_devres_pages_release(struct device *dev, void *res)
+{
+ struct riscv_iommu_devres *devres = res;
+
+ iommu_free_pages(devres->addr, devres->order);
+}
+
+static int riscv_iommu_devres_pages_match(struct device *dev, void *res, void *p)
+{
+ struct riscv_iommu_devres *devres = res;
+ struct riscv_iommu_devres *target = p;
+
+ return devres->addr == target->addr;
+}
+
+static void *riscv_iommu_get_pages(struct riscv_iommu_device *iommu, int order)
+{
+ struct riscv_iommu_devres *devres;
+ void *addr;
+
+ addr = iommu_alloc_pages_node(dev_to_node(iommu->dev),
+ GFP_KERNEL_ACCOUNT, order);
+ if (unlikely(!addr))
+ return NULL;
+
+ devres = devres_alloc(riscv_iommu_devres_pages_release,
+ sizeof(struct riscv_iommu_devres), GFP_KERNEL);
+
+ if (unlikely(!devres)) {
+ iommu_free_pages(addr, order);
+ return NULL;
+ }
+
+ devres->addr = addr;
+ devres->order = order;
+
+ devres_add(iommu->dev, devres);
+
+ return addr;
+}
+
+static void riscv_iommu_free_pages(struct riscv_iommu_device *iommu, void *addr)
+{
+ struct riscv_iommu_devres devres = { .addr = addr };
+
+ devres_release(iommu->dev, riscv_iommu_devres_pages_release,
+ riscv_iommu_devres_pages_match, &devres);
+}
+
+/* Lookup and initialize device context info structure. */
+static struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iommu,
+ unsigned int devid)
+{
+ const bool base_format = !(iommu->caps & RISCV_IOMMU_CAP_MSI_FLAT);
+ unsigned int depth;
+ unsigned long ddt, old, new;
+ void *ptr;
+ u8 ddi_bits[3] = { 0 };
+ u64 *ddtp = NULL;
+
+ /* Make sure the mode is valid */
+ if (iommu->ddt_mode < RISCV_IOMMU_DDTP_MODE_1LVL ||
+ iommu->ddt_mode > RISCV_IOMMU_DDTP_MODE_3LVL)
+ return NULL;
+
+ /*
+ * Device id partitioning for base format:
+ * DDI[0]: bits 0 - 6 (1st level) (7 bits)
+ * DDI[1]: bits 7 - 15 (2nd level) (9 bits)
+ * DDI[2]: bits 16 - 23 (3rd level) (8 bits)
+ *
+ * For extended format:
+ * DDI[0]: bits 0 - 5 (1st level) (6 bits)
+ * DDI[1]: bits 6 - 14 (2nd level) (9 bits)
+ * DDI[2]: bits 15 - 23 (3rd level) (9 bits)
+ */
+ if (base_format) {
+ ddi_bits[0] = 7;
+ ddi_bits[1] = 7 + 9;
+ ddi_bits[2] = 7 + 9 + 8;
+ } else {
+ ddi_bits[0] = 6;
+ ddi_bits[1] = 6 + 9;
+ ddi_bits[2] = 6 + 9 + 9;
+ }
+
+ /* Make sure device id is within range */
+ depth = iommu->ddt_mode - RISCV_IOMMU_DDTP_MODE_1LVL;
+ if (devid >= (1 << ddi_bits[depth]))
+ return NULL;
+
+ /* Get to the level of the non-leaf node that holds the device context */
+ for (ddtp = iommu->ddt_root; depth-- > 0;) {
+ const int split = ddi_bits[depth];
+ /*
+ * Each non-leaf node is 64bits wide and on each level
+ * nodes are indexed by DDI[depth].
+ */
+ ddtp += (devid >> split) & 0x1FF;
+
+ /*
+ * Check if this node has been populated and if not
+ * allocate a new level and populate it.
+ */
+ do {
+ ddt = READ_ONCE(*(unsigned long *)ddtp);
+ if (ddt & RISCV_IOMMU_DDTE_VALID) {
+ ddtp = __va(ppn_to_phys(ddt));
+ break;
+ }
+
+ ptr = riscv_iommu_get_pages(iommu, 0);
+ if (!ptr)
+ return NULL;
+
+ new = phys_to_ppn(__pa(ptr)) | RISCV_IOMMU_DDTE_VALID;
+ old = cmpxchg_relaxed((unsigned long *)ddtp, ddt, new);
+
+ if (old == ddt) {
+ ddtp = (u64 *)ptr;
+ break;
+ }
+
+ /* Race setting DDT detected, re-read and retry. */
+ riscv_iommu_free_pages(iommu, ptr);
+ } while (1);
+ }
+
+ /*
+ * Grab the node that matches DDI[depth], note that when using base
+ * format the device context is 4 * 64bits, and the extended format
+ * is 8 * 64bits, hence the (3 - base_format) below.
+ */
+ ddtp += (devid & ((64 << base_format) - 1)) << (3 - base_format);
+
+ return (struct riscv_iommu_dc *)ddtp;
+}
+
/*
* This is best effort IOMMU translation shutdown flow.
* Disable IOMMU without waiting for hardware response.
@@ -37,10 +190,200 @@ static void riscv_iommu_disable(struct riscv_iommu_device *iommu)
riscv_iommu_writel(iommu, RISCV_IOMMU_REG_PQCSR, 0);
}

+#define riscv_iommu_read_ddtp(iommu) ({ \
+ u64 ddtp; \
+ riscv_iommu_readq_timeout((iommu), RISCV_IOMMU_REG_DDTP, ddtp, \
+ !(ddtp & RISCV_IOMMU_DDTP_BUSY), 10, \
+ RISCV_IOMMU_DDTP_TIMEOUT); \
+ ddtp; })
+
+static int riscv_iommu_iodir_alloc(struct riscv_iommu_device *iommu)
+{
+ u64 ddtp;
+ unsigned int mode;
+
+ ddtp = riscv_iommu_read_ddtp(iommu);
+ if (ddtp & RISCV_IOMMU_DDTP_BUSY)
+ return -EBUSY;
+
+ /*
+ * It is optional for the hardware to report a fixed address for device
+ * directory root page when DDT.MODE is OFF or BARE.
+ */
+ mode = FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp);
+ if (mode == RISCV_IOMMU_DDTP_MODE_BARE ||
+ mode == RISCV_IOMMU_DDTP_MODE_OFF) {
+ /* Use WARL to discover hardware fixed DDT PPN */
+ riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
+ FIELD_PREP(RISCV_IOMMU_DDTP_MODE, mode));
+ ddtp = riscv_iommu_read_ddtp(iommu);
+ if (ddtp & RISCV_IOMMU_DDTP_BUSY)
+ return -EBUSY;
+
+ iommu->ddt_phys = ppn_to_phys(ddtp);
+ if (iommu->ddt_phys)
+ iommu->ddt_root = devm_ioremap(iommu->dev,
+ iommu->ddt_phys, PAGE_SIZE);
+ if (iommu->ddt_root)
+ memset(iommu->ddt_root, 0, PAGE_SIZE);
+ }
+
+ if (!iommu->ddt_root) {
+ iommu->ddt_root = riscv_iommu_get_pages(iommu, 0);
+ iommu->ddt_phys = __pa(iommu->ddt_root);
+ }
+
+ if (!iommu->ddt_root)
+ return -ENOMEM;
+
+ return 0;
+}
+
+/*
+ * Discover supported DDT modes starting from requested value,
+ * configure DDTP register with accepted mode and root DDT address.
+ * Accepted iommu->ddt_mode is updated on success.
+ */
+static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
+ unsigned int ddtp_mode)
+{
+ struct device *dev = iommu->dev;
+ u64 ddtp, rq_ddtp;
+ unsigned int mode, rq_mode = ddtp_mode;
+
+ ddtp = riscv_iommu_read_ddtp(iommu);
+ if (ddtp & RISCV_IOMMU_DDTP_BUSY)
+ return -EBUSY;
+
+ /* Disallow state transition from xLVL to xLVL. */
+ mode = FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp);
+ if (mode != RISCV_IOMMU_DDTP_MODE_BARE &&
+ mode != RISCV_IOMMU_DDTP_MODE_OFF &&
+ rq_mode != RISCV_IOMMU_DDTP_MODE_BARE &&
+ rq_mode != RISCV_IOMMU_DDTP_MODE_OFF)
+ return -EINVAL;
+
+ do {
+ rq_ddtp = FIELD_PREP(RISCV_IOMMU_DDTP_MODE, rq_mode);
+ if (rq_mode > RISCV_IOMMU_DDTP_MODE_BARE)
+ rq_ddtp |= phys_to_ppn(iommu->ddt_phys);
+
+ riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP, rq_ddtp);
+ ddtp = riscv_iommu_read_ddtp(iommu);
+ if (ddtp & RISCV_IOMMU_DDTP_BUSY) {
+ dev_err(dev, "timeout when setting ddtp (ddt mode: %u, read: %llx)\n",
+ rq_mode, ddtp);
+ return -EBUSY;
+ }
+
+ /* Verify IOMMU hardware accepts new DDTP config. */
+ mode = FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp);
+
+ if (rq_mode == mode)
+ break;
+
+ /* Hardware mandatory DDTP mode has not been accepted. */
+ if (rq_mode < RISCV_IOMMU_DDTP_MODE_1LVL && rq_ddtp != ddtp) {
+ dev_err(dev, "DDTP update failed hw: %llx vs %llx\n",
+ ddtp, rq_ddtp);
+ return -EINVAL;
+ }
+
+ /*
+ * Mode field is WARL, an IOMMU may support a subset of
+ * directory table levels in which case if we tried to set
+ * an unsupported number of levels we'll readback either
+ * a valid xLVL or off/bare. If we got off/bare, try again
+ * with a smaller xLVL.
+ */
+ if (mode < RISCV_IOMMU_DDTP_MODE_1LVL &&
+ rq_mode > RISCV_IOMMU_DDTP_MODE_1LVL) {
+ dev_dbg(dev, "DDTP hw mode %u vs %u\n", mode, rq_mode);
+ rq_mode--;
+ continue;
+ }
+
+ /*
+ * We tried all supported modes and IOMMU hardware failed to
+ * accept new settings, something went very wrong since off/bare
+ * and at least one xLVL must be supported.
+ */
+ dev_err(dev, "DDTP hw mode %u, failed to set %u\n",
+ mode, ddtp_mode);
+ return -EINVAL;
+ } while (1);
+
+ iommu->ddt_mode = mode;
+ if (mode != ddtp_mode)
+ dev_dbg(dev, "DDTP hw mode %u, requested %u\n", mode, ddtp_mode);
+
+ return 0;
+}
+
+#define RISCV_IOMMU_FSC_BARE 0
+
+/*
+ * Update IODIR for the device.
+ *
+ * During the execution of riscv_iommu_probe_device(), IODIR entries are
+ * allocated for the device's identifiers. Device context invalidation
+ * becomes necessary only if one of the updated entries was previously
+ * marked as valid, given that invalid device context entries are not
+ * cached by the IOMMU hardware.
+ * In this implementation, updating a valid device context while the
+ * device is not quiesced might be disruptive, potentially causing
+ * interim translation faults.
+ */
+static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
+ struct device *dev, u64 fsc, u64 ta)
+{
+ struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+ struct riscv_iommu_dc *dc;
+ u64 tc;
+ int i;
+
+ /* Device context invalidation ignored for now. */
+
+ /*
+ * For device context with DC_TC_PDTV = 0, translation attributes valid bit
+ * is stored as DC_TC_V bit (both sharing the same location at BIT(0)).
+ */
+ for (i = 0; i < fwspec->num_ids; i++) {
+ dc = riscv_iommu_get_dc(iommu, fwspec->ids[i]);
+ tc = READ_ONCE(dc->tc);
+ tc |= ta & RISCV_IOMMU_DC_TC_V;
+
+ /* Update device context, write TC.V as the last step. */
+ WRITE_ONCE(dc->fsc, fsc);
+ WRITE_ONCE(dc->ta, ta & RISCV_IOMMU_PC_TA_PSCID);
+ WRITE_ONCE(dc->tc, tc);
+ }
+}
+
+static int riscv_iommu_attach_blocking_domain(struct iommu_domain *iommu_domain,
+ struct device *dev)
+{
+ struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+
+ riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, 0);
+
+ return 0;
+}
+
+static struct iommu_domain riscv_iommu_blocking_domain = {
+ .type = IOMMU_DOMAIN_BLOCKED,
+ .ops = &(const struct iommu_domain_ops) {
+ .attach_dev = riscv_iommu_attach_blocking_domain,
+ }
+};
+
static int riscv_iommu_attach_identity_domain(struct iommu_domain *iommu_domain,
struct device *dev)
{
- /* Global pass-through already enabled, do nothing for now. */
+ struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+
+ riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, RISCV_IOMMU_PC_TA_V);
+
return 0;
}

@@ -72,6 +415,9 @@ static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
{
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
struct riscv_iommu_device *iommu;
+ struct riscv_iommu_dc *dc;
+ u64 tc;
+ int i;

if (!fwspec->iommu_fwnode->dev || !fwspec->num_ids)
return ERR_PTR(-ENODEV);
@@ -80,12 +426,37 @@ static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
if (!iommu)
return ERR_PTR(-ENODEV);

+ /*
+ * IOMMU hardware operating in fail-over BARE mode will provide
+ * identity translation for all connected devices anyway...
+ */
+ if (iommu->ddt_mode <= RISCV_IOMMU_DDTP_MODE_BARE)
+ return ERR_PTR(-ENODEV);
+
+ /*
+ * Allocate and pre-configure device context entries in
+ * the device directory. Do not mark the context valid yet.
+ */
+ tc = 0;
+ if (iommu->caps & RISCV_IOMMU_CAP_AMO_HWAD)
+ tc |= RISCV_IOMMU_DC_TC_SADE;
+ for (i = 0; i < fwspec->num_ids; i++) {
+ dc = riscv_iommu_get_dc(iommu, fwspec->ids[i]);
+ if (!dc)
+ return ERR_PTR(-ENODEV);
+ if (READ_ONCE(dc->tc) & RISCV_IOMMU_DC_TC_V)
+ dev_warn(dev, "already attached to IOMMU device directory\n");
+ WRITE_ONCE(dc->tc, tc);
+ }
+
return &iommu->iommu;
}

static const struct iommu_ops riscv_iommu_ops = {
.of_xlate = riscv_iommu_of_xlate,
.identity_domain = &riscv_iommu_identity_domain,
+ .blocked_domain = &riscv_iommu_blocking_domain,
+ .release_domain = &riscv_iommu_blocking_domain,
.def_domain_type = riscv_iommu_device_domain_type,
.device_group = riscv_iommu_device_group,
.probe_device = riscv_iommu_probe_device,
@@ -128,6 +499,7 @@ void riscv_iommu_remove(struct riscv_iommu_device *iommu)
{
iommu_device_unregister(&iommu->iommu);
iommu_device_sysfs_remove(&iommu->iommu);
+ riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
}

int riscv_iommu_init(struct riscv_iommu_device *iommu)
@@ -138,18 +510,20 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
if (rc)
return dev_err_probe(iommu->dev, rc, "unexpected device state\n");

- /*
- * Placeholder for a complete IOMMU device initialization. For now,
- * only bare minimum: enable global identity mapping mode and register sysfs.
- */
- riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
- FIELD_PREP(RISCV_IOMMU_DDTP_MODE, RISCV_IOMMU_DDTP_MODE_BARE));
+ rc = riscv_iommu_iodir_alloc(iommu);
+ if (rc)
+ return rc;
+
+ rc = riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
+ if (rc)
+ return rc;

rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
dev_name(iommu->dev));
- if (rc)
- return dev_err_probe(iommu->dev, rc,
- "cannot register sysfs interface\n");
+ if (rc) {
+ dev_err_probe(iommu->dev, rc, "cannot register sysfs interface\n");
+ goto err_iodir_off;
+ }

rc = iommu_device_register(&iommu->iommu, &riscv_iommu_ops, iommu->dev);
if (rc) {
@@ -161,5 +535,7 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)

err_remove_sysfs:
iommu_device_sysfs_remove(&iommu->iommu);
+err_iodir_off:
+ riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
return rc;
}
diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
index 700e33dc2446..f1696926582c 100644
--- a/drivers/iommu/riscv/iommu.h
+++ b/drivers/iommu/riscv/iommu.h
@@ -34,6 +34,11 @@ struct riscv_iommu_device {
/* available interrupt numbers, MSI or WSI */
unsigned int irqs[RISCV_IOMMU_INTR_COUNT];
unsigned int irqs_count;
+
+ /* device directory */
+ unsigned int ddt_mode;
+ dma_addr_t ddt_phys;
+ u64 *ddt_root;
};

int riscv_iommu_init(struct riscv_iommu_device *iommu);
--
2.34.1
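
Editor's note: the device id partitioning documented in the comment inside
riscv_iommu_get_dc() can be checked with a small standalone sketch. The field
widths (7/9/8 bits for the base format, 6/9/9 bits for the extended format)
are taken from that comment. struct ddi and split_devid() are hypothetical
names used only for this illustration, and the code does not walk any tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-level device directory indexes: idx[0] selects the leaf entry. */
struct ddi {
	unsigned int idx[3];
};

/* Split a 24-bit device id into DDI[0], DDI[1] and DDI[2]. */
static struct ddi split_devid(uint32_t devid, bool base_format)
{
	const unsigned int w0 = base_format ? 7 : 6;	/* DDI[0] width */
	const unsigned int w1 = 9;			/* DDI[1] width */
	const unsigned int w2 = base_format ? 8 : 9;	/* DDI[2] width */
	struct ddi out;

	assert(devid < (1u << 24));

	out.idx[0] = devid & ((1u << w0) - 1);
	out.idx[1] = (devid >> w0) & ((1u << w1) - 1);
	out.idx[2] = (devid >> (w0 + w1)) & ((1u << w2) - 1);

	return out;
}
```

For devid 0x123456 this gives DDI[2..0] = 18, 104, 86 in the base format and
36, 209, 22 in the extended format. The extended format has one bit less in
DDI[0] because each device context is twice as large, so a 4 KiB leaf page
holds 64 contexts instead of 128.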


2024-05-04 03:21:16

by Baolu Lu

Subject: Re: [PATCH v4 7/7] iommu/riscv: Paging domain support

On 5/4/24 12:12 AM, Tomasz Jeznach wrote:
> Introduce first-stage address translation support.
>
> Page table configured by the IOMMU driver will use the highest mode
> implemented by the hardware, unless not known at the domain allocation
> time falling back to the CPU’s MMU page mode.
>
> This change introduces IOTINVAL.VMA command, required to invalidate
> any cached IOATC entries after mapping is updated and/or removed from
> the paging domain. Invalidations for the non-leaf page entries use
> IOTINVAL for all addresses assigned to the protection domain for
> hardware not supporting more granular non-leaf page table cache
> invalidations.
>
> Signed-off-by: Tomasz Jeznach<[email protected]>

Reviewed-by: Lu Baolu <[email protected]>

Best regards,
baolu

2024-05-04 05:11:35

by Baolu Lu

Subject: Re: [PATCH v4 2/7] iommu/riscv: Add RISC-V IOMMU platform device driver

On 5/4/24 12:12 AM, Tomasz Jeznach wrote:
> Introduce platform device driver for implementation of RISC-V IOMMU
> architected hardware.
>
> Hardware interface definition located in file iommu-bits.h is based on
> ratified RISC-V IOMMU Architecture Specification version 1.0.0.
>
> This patch implements platform device initialization, early check and
> configuration of the IOMMU interfaces and enables global pass-through
> address translation mode (iommu_mode == BARE), without registering
> hardware instance in the IOMMU subsystem.
>
> Link:https://github.com/riscv-non-isa/riscv-iommu
> Co-developed-by: Nick Kossifidis<[email protected]>
> Signed-off-by: Nick Kossifidis<[email protected]>
> Co-developed-by: Sebastien Boeuf<[email protected]>
> Signed-off-by: Sebastien Boeuf<[email protected]>
> Signed-off-by: Tomasz Jeznach<[email protected]>

Reviewed-by: Lu Baolu <[email protected]>

Best regards,
baolu

2024-05-07 14:48:46

by Rob Herring

Subject: Re: [PATCH v4 1/7] dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU


On Fri, 03 May 2024 09:12:34 -0700, Tomasz Jeznach wrote:
> Add bindings for the RISC-V IOMMU device drivers.
>
> Co-developed-by: Anup Patel <[email protected]>
> Signed-off-by: Anup Patel <[email protected]>
> Reviewed-by: Conor Dooley <[email protected]>
> Signed-off-by: Tomasz Jeznach <[email protected]>
> ---
> .../bindings/iommu/riscv,iommu.yaml | 147 ++++++++++++++++++
> MAINTAINERS | 7 +
> 2 files changed, 154 insertions(+)
> create mode 100644 Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
>

Reviewed-by: Rob Herring (Arm) <[email protected]>


2024-05-08 15:35:42

by Zong Li

Subject: Re: [PATCH v4 5/7] iommu/riscv: Device directory management.

On Sat, May 4, 2024 at 12:13 AM Tomasz Jeznach <[email protected]> wrote:
>
> Introduce device context allocation and device directory tree
> management including capabilities discovery sequence, as described
> in Chapter 2.1 of the RISC-V IOMMU Architecture Specification.
>
> Device directory mode will be auto detected using DDTP WARL property,
> using highest mode supported by the driver and hardware. If none
> supported can be configured, driver will fall back to global pass-through.
>
> First level DDTP page can be located in I/O (detected using DDTP WARL)
> and system memory.
>
> Only simple identity and blocking protection domains are supported by
> this implementation.
>
> Co-developed-by: Nick Kossifidis <[email protected]>
> Signed-off-by: Nick Kossifidis <[email protected]>
> Reviewed-by: Lu Baolu <[email protected]>
> Signed-off-by: Tomasz Jeznach <[email protected]>
> ---
> drivers/iommu/riscv/iommu.c | 396 +++++++++++++++++++++++++++++++++++-
> drivers/iommu/riscv/iommu.h | 5 +
> 2 files changed, 391 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> index 1f889daffb0e..71b7903d83d4 100644
> --- a/drivers/iommu/riscv/iommu.c
> +++ b/drivers/iommu/riscv/iommu.c
> @@ -16,15 +16,168 @@
> #include <linux/crash_dump.h>
> #include <linux/init.h>
> #include <linux/iommu.h>
> +#include <linux/iopoll.h>
> #include <linux/kernel.h>
> #include <linux/pci.h>
>
> +#include "../iommu-pages.h"
> #include "iommu-bits.h"
> #include "iommu.h"
>
> /* Timeouts in [us] */
> #define RISCV_IOMMU_DDTP_TIMEOUT 50000
>
> +/* RISC-V IOMMU PPN <> PHYS address conversions, PHYS <=> PPN[53:10] */
> +#define phys_to_ppn(va) (((va) >> 2) & (((1ULL << 44) - 1) << 10))

Should the parameter be 'pa' instead of 'va'?
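
For reference, here is a minimal standalone sketch of the two conversions with
the argument named as a physical address. It only mirrors the arithmetic of
the quoted macros in user space. pa_to_ppn() and ppn_to_pa() are hypothetical
function names for this example, and the page-alignment assert is an extra
precondition of the sketch, not something the macros enforce.

```c
#include <assert.h>
#include <stdint.h>

/* PPN field occupies bits [53:10] of the register or table entry. */
#define PPN_FIELD_MASK	(((1ULL << 44) - 1) << 10)
/* The matching physical address occupies bits [55:12]. */
#define PHYS_ADDR_MASK	(((1ULL << 44) - 1) << 12)

/* Physical address -> PPN field, same arithmetic as phys_to_ppn(). */
static uint64_t pa_to_ppn(uint64_t pa)
{
	assert((pa & 0xFFFULL) == 0);	/* sketch-only: expect a 4 KiB aligned page */
	return (pa >> 2) & PPN_FIELD_MASK;
}

/* PPN field -> physical address, same arithmetic as ppn_to_phys(). */
static uint64_t ppn_to_pa(uint64_t ppn)
{
	return (ppn << 2) & PHYS_ADDR_MASK;
}
```

A page at physical address 0x80001000 maps to a PPN field of 0x20000400.
Converting back drops any low control bits, such as a valid bit in bit 0,
because of the mask.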


> +#define ppn_to_phys(pn) (((pn) << 2) & (((1ULL << 44) - 1) << 12))
> +
> +#define dev_to_iommu(dev) \
> + iommu_get_iommu_dev(dev, struct riscv_iommu_device, iommu)
> +
> +/* Device resource-managed allocations */
> +struct riscv_iommu_devres {
> + void *addr;
> + int order;
> +};
> +
> +static void riscv_iommu_devres_pages_release(struct device *dev, void *res)
> +{
> + struct riscv_iommu_devres *devres = res;
> +
> + iommu_free_pages(devres->addr, devres->order);
> +}
> +
> +static int riscv_iommu_devres_pages_match(struct device *dev, void *res, void *p)
> +{
> + struct riscv_iommu_devres *devres = res;
> + struct riscv_iommu_devres *target = p;
> +
> + return devres->addr == target->addr;
> +}
> +
> +static void *riscv_iommu_get_pages(struct riscv_iommu_device *iommu, int order)
> +{
> + struct riscv_iommu_devres *devres;
> + void *addr;
> +
> + addr = iommu_alloc_pages_node(dev_to_node(iommu->dev),
> + GFP_KERNEL_ACCOUNT, order);
> + if (unlikely(!addr))
> + return NULL;
> +
> + devres = devres_alloc(riscv_iommu_devres_pages_release,
> + sizeof(struct riscv_iommu_devres), GFP_KERNEL);
> +
> + if (unlikely(!devres)) {
> + iommu_free_pages(addr, order);
> + return NULL;
> + }
> +
> + devres->addr = addr;
> + devres->order = order;
> +
> + devres_add(iommu->dev, devres);
> +
> + return addr;
> +}
> +
> +static void riscv_iommu_free_pages(struct riscv_iommu_device *iommu, void *addr)
> +{
> + struct riscv_iommu_devres devres = { .addr = addr };
> +
> + devres_release(iommu->dev, riscv_iommu_devres_pages_release,
> + riscv_iommu_devres_pages_match, &devres);
> +}
> +
> +/* Lookup and initialize device context info structure. */
> +static struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iommu,
> + unsigned int devid)
> +{
> + const bool base_format = !(iommu->caps & RISCV_IOMMU_CAP_MSI_FLAT);
> + unsigned int depth;
> + unsigned long ddt, old, new;
> + void *ptr;
> + u8 ddi_bits[3] = { 0 };
> + u64 *ddtp = NULL;
> +
> + /* Make sure the mode is valid */
> + if (iommu->ddt_mode < RISCV_IOMMU_DDTP_MODE_1LVL ||
> + iommu->ddt_mode > RISCV_IOMMU_DDTP_MODE_3LVL)
> + return NULL;
> +
> + /*
> + * Device id partitioning for base format:
> + * DDI[0]: bits 0 - 6 (1st level) (7 bits)
> + * DDI[1]: bits 7 - 15 (2nd level) (9 bits)
> + * DDI[2]: bits 16 - 23 (3rd level) (8 bits)
> + *
> + * For extended format:
> + * DDI[0]: bits 0 - 5 (1st level) (6 bits)
> + * DDI[1]: bits 6 - 14 (2nd level) (9 bits)
> + * DDI[2]: bits 15 - 23 (3rd level) (9 bits)
> + */
> + if (base_format) {
> + ddi_bits[0] = 7;
> + ddi_bits[1] = 7 + 9;
> + ddi_bits[2] = 7 + 9 + 8;
> + } else {
> + ddi_bits[0] = 6;
> + ddi_bits[1] = 6 + 9;
> + ddi_bits[2] = 6 + 9 + 9;
> + }
> +
> + /* Make sure device id is within range */
> + depth = iommu->ddt_mode - RISCV_IOMMU_DDTP_MODE_1LVL;
> + if (devid >= (1 << ddi_bits[depth]))
> + return NULL;
> +
> + /* Get to the level of the non-leaf node that holds the device context */
> + for (ddtp = iommu->ddt_root; depth-- > 0;) {
> + const int split = ddi_bits[depth];
> + /*
> + * Each non-leaf node is 64bits wide and on each level
> + * nodes are indexed by DDI[depth].
> + */
> + ddtp += (devid >> split) & 0x1FF;
> +
> + /*
> + * Check if this node has been populated and if not
> + * allocate a new level and populate it.
> + */
> + do {
> + ddt = READ_ONCE(*(unsigned long *)ddtp);
> + if (ddt & RISCV_IOMMU_DDTE_VALID) {
> + ddtp = __va(ppn_to_phys(ddt));
> + break;
> + }
> +
> + ptr = riscv_iommu_get_pages(iommu, 0);
> + if (!ptr)
> + return NULL;
> +
> + new = phys_to_ppn(__pa(ptr)) | RISCV_IOMMU_DDTE_VALID;
> + old = cmpxchg_relaxed((unsigned long *)ddtp, ddt, new);
> +
> + if (old == ddt) {
> + ddtp = (u64 *)ptr;
> + break;
> + }
> +
> + /* Race setting DDT detected, re-read and retry. */
> + riscv_iommu_free_pages(iommu, ptr);
> + } while (1);
> + }
> +
> + /*
> + * Grab the node that matches DDI[depth], note that when using base
> + * format the device context is 4 * 64bits, and the extended format
> + * is 8 * 64bits, hence the (3 - base_format) below.
> + */
> + ddtp += (devid & ((64 << base_format) - 1)) << (3 - base_format);
> +
> + return (struct riscv_iommu_dc *)ddtp;
> +}
> +
> /*
> * This is best effort IOMMU translation shutdown flow.
> * Disable IOMMU without waiting for hardware response.
> @@ -37,10 +190,200 @@ static void riscv_iommu_disable(struct riscv_iommu_device *iommu)
> riscv_iommu_writel(iommu, RISCV_IOMMU_REG_PQCSR, 0);
> }
>
> +#define riscv_iommu_read_ddtp(iommu) ({ \
> + u64 ddtp; \
> + riscv_iommu_readq_timeout((iommu), RISCV_IOMMU_REG_DDTP, ddtp, \
> + !(ddtp & RISCV_IOMMU_DDTP_BUSY), 10, \
> + RISCV_IOMMU_DDTP_TIMEOUT); \
> + ddtp; })
> +
> +static int riscv_iommu_iodir_alloc(struct riscv_iommu_device *iommu)
> +{
> + u64 ddtp;
> + unsigned int mode;
> +
> + ddtp = riscv_iommu_read_ddtp(iommu);
> + if (ddtp & RISCV_IOMMU_DDTP_BUSY)
> + return -EBUSY;
> +
> + /*
> + * It is optional for the hardware to report a fixed address for device
> + * directory root page when DDT.MODE is OFF or BARE.
> + */
> + mode = FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp);
> + if (mode == RISCV_IOMMU_DDTP_MODE_BARE ||
> + mode == RISCV_IOMMU_DDTP_MODE_OFF) {
> + /* Use WARL to discover hardware fixed DDT PPN */
> + riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
> + FIELD_PREP(RISCV_IOMMU_DDTP_MODE, mode));
> + ddtp = riscv_iommu_read_ddtp(iommu);
> + if (ddtp & RISCV_IOMMU_DDTP_BUSY)
> + return -EBUSY;
> +
> + iommu->ddt_phys = ppn_to_phys(ddtp);
> + if (iommu->ddt_phys)
> + iommu->ddt_root = devm_ioremap(iommu->dev,
> + iommu->ddt_phys, PAGE_SIZE);
> + if (iommu->ddt_root)
> + memset(iommu->ddt_root, 0, PAGE_SIZE);
> + }
> +
> + if (!iommu->ddt_root) {
> + iommu->ddt_root = riscv_iommu_get_pages(iommu, 0);
> + iommu->ddt_phys = __pa(iommu->ddt_root);
> + }
> +
> + if (!iommu->ddt_root)
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> +/*
> + * Discover supported DDT modes starting from requested value,
> + * configure DDTP register with accepted mode and root DDT address.
> + * Accepted iommu->ddt_mode is updated on success.
> + */
> +static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
> + unsigned int ddtp_mode)
> +{
> + struct device *dev = iommu->dev;
> + u64 ddtp, rq_ddtp;
> + unsigned int mode, rq_mode = ddtp_mode;
> +
> + ddtp = riscv_iommu_read_ddtp(iommu);
> + if (ddtp & RISCV_IOMMU_DDTP_BUSY)
> + return -EBUSY;
> +
> + /* Disallow state transition from xLVL to xLVL. */
> + mode = FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp);
> + if (mode != RISCV_IOMMU_DDTP_MODE_BARE &&
> + mode != RISCV_IOMMU_DDTP_MODE_OFF &&
> + rq_mode != RISCV_IOMMU_DDTP_MODE_BARE &&
> + rq_mode != RISCV_IOMMU_DDTP_MODE_OFF)
> + return -EINVAL;
> +
> + do {
> + rq_ddtp = FIELD_PREP(RISCV_IOMMU_DDTP_MODE, rq_mode);
> + if (rq_mode > RISCV_IOMMU_DDTP_MODE_BARE)
> + rq_ddtp |= phys_to_ppn(iommu->ddt_phys);
> +
> + riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP, rq_ddtp);
> + ddtp = riscv_iommu_read_ddtp(iommu);
> + if (ddtp & RISCV_IOMMU_DDTP_BUSY) {
> + dev_err(dev, "timeout when setting ddtp (ddt mode: %u, read: %llx)\n",
> + rq_mode, ddtp);
> + return -EBUSY;
> + }
> +
> + /* Verify IOMMU hardware accepts new DDTP config. */
> + mode = FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp);
> +
> + if (rq_mode == mode)
> + break;
> +
> + /* Hardware mandatory DDTP mode has not been accepted. */
> + if (rq_mode < RISCV_IOMMU_DDTP_MODE_1LVL && rq_ddtp != ddtp) {
> + dev_err(dev, "DDTP update failed hw: %llx vs %llx\n",
> + ddtp, rq_ddtp);
> + return -EINVAL;
> + }
> +
> + /*
> + * Mode field is WARL, an IOMMU may support a subset of
> + * directory table levels in which case if we tried to set
> + * an unsupported number of levels we'll readback either
> + * a valid xLVL or off/bare. If we got off/bare, try again
> + * with a smaller xLVL.
> + */
> + if (mode < RISCV_IOMMU_DDTP_MODE_1LVL &&
> + rq_mode > RISCV_IOMMU_DDTP_MODE_1LVL) {
> + dev_dbg(dev, "DDTP hw mode %u vs %u\n", mode, rq_mode);
> + rq_mode--;
> + continue;
> + }
> +
> + /*
> + * We tried all supported modes and IOMMU hardware failed to
> + * accept new settings, something went very wrong since off/bare
> + * and at least one xLVL must be supported.
> + */
> + dev_err(dev, "DDTP hw mode %u, failed to set %u\n",
> + mode, ddtp_mode);
> + return -EINVAL;
> + } while (1);
> +
> + iommu->ddt_mode = mode;
> + if (mode != ddtp_mode)
> + dev_dbg(dev, "DDTP hw mode %u, requested %u\n", mode, ddtp_mode);
> +
> + return 0;
> +}
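
The WARL fallback in the loop above can be sketched in user space as follows. This is only an illustration under stated assumptions: struct fake_iommu, fake_write_mode() and negotiate_ddt_mode() are hypothetical names, and the toy hardware model simply leaves the mode unchanged when an unsupported value is written. Real hardware may read back a different legal value, and the BUSY polling and PPN handling of the patch are omitted.

```c
#include <stdint.h>

enum ddtp_mode {
	MODE_OFF = 0, MODE_BARE = 1, MODE_1LVL = 2, MODE_2LVL = 3, MODE_3LVL = 4
};

/* Toy hardware model (an assumption): unsupported modes are ignored on write. */
struct fake_iommu {
	unsigned int max_mode;	/* deepest directory mode the model accepts */
	unsigned int mode;	/* current DDTP.MODE value */
};

static void fake_write_mode(struct fake_iommu *hw, unsigned int mode)
{
	if (mode <= hw->max_mode)
		hw->mode = mode;
}

/* Returns the accepted mode, or -1 if no acceptable mode was found. */
static int negotiate_ddt_mode(struct fake_iommu *hw, unsigned int requested)
{
	unsigned int rq_mode = requested;

	for (;;) {
		fake_write_mode(hw, rq_mode);
		if (hw->mode == rq_mode)
			return (int)rq_mode;

		/* Read back OFF/BARE after asking for xLVL: retry one level lower. */
		if (hw->mode < MODE_1LVL && rq_mode > MODE_1LVL) {
			rq_mode--;
			continue;
		}

		/* 1LVL (or OFF/BARE) was rejected as well, give up. */
		return -1;
	}
}
```

A model that supports at most a two-level table and is asked for MODE_3LVL settles on MODE_2LVL after one retry.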
> +
> +#define RISCV_IOMMU_FSC_BARE 0
> +
> +/*
> + * Update IODIR for the device.
> + *
> + * During the execution of riscv_iommu_probe_device(), IODIR entries are
> + * allocated for the device's identifiers. Device context invalidation
> + * becomes necessary only if one of the updated entries was previously
> + * marked as valid, given that invalid device context entries are not
> + * cached by the IOMMU hardware.
> + * In this implementation, updating a valid device context while the
> + * device is not quiesced might be disruptive, potentially causing
> + * interim translation faults.
> + */
> +static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
> + struct device *dev, u64 fsc, u64 ta)
> +{
> + struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> + struct riscv_iommu_dc *dc;
> + u64 tc;
> + int i;
> +
> + /* Device context invalidation ignored for now. */
> +
> + /*
> + * For device context with DC_TC_PDTV = 0, translation attributes valid bit
> + * is stored as DC_TC_V bit (both sharing the same location at BIT(0)).
> + */
> + for (i = 0; i < fwspec->num_ids; i++) {
> + dc = riscv_iommu_get_dc(iommu, fwspec->ids[i]);
> + tc = READ_ONCE(dc->tc);
> + tc |= ta & RISCV_IOMMU_DC_TC_V;
> +
> + /* Update device context, write TC.V as the last step. */
> + WRITE_ONCE(dc->fsc, fsc);
> + WRITE_ONCE(dc->ta, ta & RISCV_IOMMU_PC_TA_PSCID);
> + WRITE_ONCE(dc->tc, tc);
> + }
> +}
> +
> +static int riscv_iommu_attach_blocking_domain(struct iommu_domain *iommu_domain,
> + struct device *dev)
> +{
> + struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> +
> + riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, 0);
> +
> + return 0;
> +}
> +
> +static struct iommu_domain riscv_iommu_blocking_domain = {
> + .type = IOMMU_DOMAIN_BLOCKED,
> + .ops = &(const struct iommu_domain_ops) {
> + .attach_dev = riscv_iommu_attach_blocking_domain,
> + }
> +};
> +
> static int riscv_iommu_attach_identity_domain(struct iommu_domain *iommu_domain,
> struct device *dev)
> {
> - /* Global pass-through already enabled, do nothing for now. */
> + struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> +
> + riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, RISCV_IOMMU_PC_TA_V);
> +
> return 0;
> }
>
> @@ -72,6 +415,9 @@ static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
> {
> struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> struct riscv_iommu_device *iommu;
> + struct riscv_iommu_dc *dc;
> + u64 tc;
> + int i;
>
> if (!fwspec->iommu_fwnode->dev || !fwspec->num_ids)
> return ERR_PTR(-ENODEV);
> @@ -80,12 +426,37 @@ static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
> if (!iommu)
> return ERR_PTR(-ENODEV);
>
> + /*
> + * IOMMU hardware operating in fail-over BARE mode will provide
> + * identity translation for all connected devices anyway...
> + */
> + if (iommu->ddt_mode <= RISCV_IOMMU_DDTP_MODE_BARE)
> + return ERR_PTR(-ENODEV);
> +
> + /*
> + * Allocate and pre-configure device context entries in
> + * the device directory. Do not mark the context valid yet.
> + */
> + tc = 0;
> + if (iommu->caps & RISCV_IOMMU_CAP_AMO_HWAD)
> + tc |= RISCV_IOMMU_DC_TC_SADE;
> + for (i = 0; i < fwspec->num_ids; i++) {
> + dc = riscv_iommu_get_dc(iommu, fwspec->ids[i]);
> + if (!dc)
> + return ERR_PTR(-ENODEV);
> + if (READ_ONCE(dc->tc) & RISCV_IOMMU_DC_TC_V)
> + dev_warn(dev, "already attached to IOMMU device directory\n");
> + WRITE_ONCE(dc->tc, tc);
> + }
> +
> return &iommu->iommu;
> }
>
> static const struct iommu_ops riscv_iommu_ops = {
> .of_xlate = riscv_iommu_of_xlate,
> .identity_domain = &riscv_iommu_identity_domain,
> + .blocked_domain = &riscv_iommu_blocking_domain,
> + .release_domain = &riscv_iommu_blocking_domain,
> .def_domain_type = riscv_iommu_device_domain_type,
> .device_group = riscv_iommu_device_group,
> .probe_device = riscv_iommu_probe_device,
> @@ -128,6 +499,7 @@ void riscv_iommu_remove(struct riscv_iommu_device *iommu)
> {
> iommu_device_unregister(&iommu->iommu);
> iommu_device_sysfs_remove(&iommu->iommu);
> + riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
> }
>
> int riscv_iommu_init(struct riscv_iommu_device *iommu)
> @@ -138,18 +510,20 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
> if (rc)
> return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
>
> - /*
> - * Placeholder for a complete IOMMU device initialization. For now,
> - * only bare minimum: enable global identity mapping mode and register sysfs.
> - */
> - riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
> - FIELD_PREP(RISCV_IOMMU_DDTP_MODE, RISCV_IOMMU_DDTP_MODE_BARE));
> + rc = riscv_iommu_iodir_alloc(iommu);
> + if (rc)
> + return rc;
> +
> + rc = riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
> + if (rc)
> + return rc;
>
> rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
> dev_name(iommu->dev));
> - if (rc)
> - return dev_err_probe(iommu->dev, rc,
> - "cannot register sysfs interface\n");
> + if (rc) {
> + dev_err_probe(iommu->dev, rc, "cannot register sysfs interface\n");
> + goto err_iodir_off;
> + }
>
> rc = iommu_device_register(&iommu->iommu, &riscv_iommu_ops, iommu->dev);
> if (rc) {
> @@ -161,5 +535,7 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
>
> err_remove_sysfs:
> iommu_device_sysfs_remove(&iommu->iommu);
> +err_iodir_off:
> + riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
> return rc;
> }
> diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
> index 700e33dc2446..f1696926582c 100644
> --- a/drivers/iommu/riscv/iommu.h
> +++ b/drivers/iommu/riscv/iommu.h
> @@ -34,6 +34,11 @@ struct riscv_iommu_device {
> /* available interrupt numbers, MSI or WSI */
> unsigned int irqs[RISCV_IOMMU_INTR_COUNT];
> unsigned int irqs_count;
> +
> + /* device directory */
> + unsigned int ddt_mode;
> + dma_addr_t ddt_phys;
> + u64 *ddt_root;
> };
>
> int riscv_iommu_init(struct riscv_iommu_device *iommu);
> --
> 2.34.1
>
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv

2024-05-08 15:42:52

by Zong Li

Subject: Re: [PATCH v4 2/7] iommu/riscv: Add RISC-V IOMMU platform device driver

On Sat, May 4, 2024 at 12:13 AM Tomasz Jeznach <[email protected]> wrote:
>
> Introduce platform device driver for implementation of RISC-V IOMMU
> architected hardware.
>
> Hardware interface definition located in file iommu-bits.h is based on
> ratified RISC-V IOMMU Architecture Specification version 1.0.0.
>
> This patch implements platform device initialization, early check and
> configuration of the IOMMU interfaces and enables global pass-through
> address translation mode (iommu_mode == BARE), without registering
> hardware instance in the IOMMU subsystem.
>
> Link: https://github.com/riscv-non-isa/riscv-iommu
> Co-developed-by: Nick Kossifidis <[email protected]>
> Signed-off-by: Nick Kossifidis <[email protected]>
> Co-developed-by: Sebastien Boeuf <[email protected]>
> Signed-off-by: Sebastien Boeuf <[email protected]>
> Signed-off-by: Tomasz Jeznach <[email protected]>
> ---
> MAINTAINERS | 1 +
> drivers/iommu/Kconfig | 1 +
> drivers/iommu/Makefile | 2 +-
> drivers/iommu/riscv/Kconfig | 15 +
> drivers/iommu/riscv/Makefile | 2 +
> drivers/iommu/riscv/iommu-bits.h | 707 +++++++++++++++++++++++++++
> drivers/iommu/riscv/iommu-platform.c | 92 ++++
> drivers/iommu/riscv/iommu.c | 99 ++++
> drivers/iommu/riscv/iommu.h | 62 +++
> 9 files changed, 980 insertions(+), 1 deletion(-)
> create mode 100644 drivers/iommu/riscv/Kconfig
> create mode 100644 drivers/iommu/riscv/Makefile
> create mode 100644 drivers/iommu/riscv/iommu-bits.h
> create mode 100644 drivers/iommu/riscv/iommu-platform.c
> create mode 100644 drivers/iommu/riscv/iommu.c
> create mode 100644 drivers/iommu/riscv/iommu.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7fcf7c27ef6b..42df3b3871f4 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -18964,6 +18964,7 @@ L: [email protected]
> L: [email protected]
> S: Maintained
> F: Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
> +F: drivers/iommu/riscv/
>
> RISC-V MICROCHIP FPGA SUPPORT
> M: Conor Dooley <[email protected]>
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 0af39bbbe3a3..ae762db0365e 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -195,6 +195,7 @@ config MSM_IOMMU
> source "drivers/iommu/amd/Kconfig"
> source "drivers/iommu/intel/Kconfig"
> source "drivers/iommu/iommufd/Kconfig"
> +source "drivers/iommu/riscv/Kconfig"
>
> config IRQ_REMAP
> bool "Support for Interrupt Remapping"
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 542760d963ec..5e5a83c6c2aa 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -1,5 +1,5 @@
> # SPDX-License-Identifier: GPL-2.0
> -obj-y += amd/ intel/ arm/ iommufd/
> +obj-y += amd/ intel/ arm/ iommufd/ riscv/
> obj-$(CONFIG_IOMMU_API) += iommu.o
> obj-$(CONFIG_IOMMU_API) += iommu-traces.o
> obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
> diff --git a/drivers/iommu/riscv/Kconfig b/drivers/iommu/riscv/Kconfig
> new file mode 100644
> index 000000000000..5dcc5c45aa50
> --- /dev/null
> +++ b/drivers/iommu/riscv/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +# RISC-V IOMMU support
> +
> +config RISCV_IOMMU
> + bool "RISC-V IOMMU Support"
> + depends on RISCV && 64BIT
> + default y
> + select IOMMU_API
> + help
> + Support for implementations of the RISC-V IOMMU architecture that
> + complements the RISC-V MMU capabilities, providing similar address
> + translation and protection functions for accesses from I/O devices.
> +
> + Say Y here if your SoC includes an IOMMU device implementing
> + the RISC-V IOMMU architecture.
> diff --git a/drivers/iommu/riscv/Makefile b/drivers/iommu/riscv/Makefile
> new file mode 100644
> index 000000000000..e4c189de58d3
> --- /dev/null
> +++ b/drivers/iommu/riscv/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_RISCV_IOMMU) += iommu.o iommu-platform.o
> diff --git a/drivers/iommu/riscv/iommu-bits.h b/drivers/iommu/riscv/iommu-bits.h
> new file mode 100644
> index 000000000000..ba093c29de9f
> --- /dev/null
> +++ b/drivers/iommu/riscv/iommu-bits.h
> @@ -0,0 +1,707 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright © 2022-2024 Rivos Inc.
> + * Copyright © 2023 FORTH-ICS/CARV
> + * Copyright © 2023 RISC-V IOMMU Task Group
> + *
> + * RISC-V IOMMU - Register Layout and Data Structures.
> + *
> + * Based on the 'RISC-V IOMMU Architecture Specification', Version 1.0
> + * Published at https://github.com/riscv-non-isa/riscv-iommu
> + *
> + */
> +
> +#ifndef _RISCV_IOMMU_BITS_H_
> +#define _RISCV_IOMMU_BITS_H_
> +
> +#include <linux/types.h>
> +#include <linux/bitfield.h>
> +#include <linux/bits.h>
> +
> +/*
> + * Chapter 5: Memory Mapped register interface
> + */
> +
> +/* Common field positions */
> +#define RISCV_IOMMU_PPN_FIELD GENMASK_ULL(53, 10)
> +#define RISCV_IOMMU_QUEUE_LOGSZ_FIELD GENMASK_ULL(4, 0)
> +#define RISCV_IOMMU_QUEUE_INDEX_FIELD GENMASK_ULL(31, 0)
> +#define RISCV_IOMMU_QUEUE_ENABLE BIT(0)
> +#define RISCV_IOMMU_QUEUE_INTR_ENABLE BIT(1)
> +#define RISCV_IOMMU_QUEUE_MEM_FAULT BIT(8)
> +#define RISCV_IOMMU_QUEUE_OVERFLOW BIT(9)
> +#define RISCV_IOMMU_QUEUE_ACTIVE BIT(16)
> +#define RISCV_IOMMU_QUEUE_BUSY BIT(17)
> +
> +#define RISCV_IOMMU_ATP_PPN_FIELD GENMASK_ULL(43, 0)
> +#define RISCV_IOMMU_ATP_MODE_FIELD GENMASK_ULL(63, 60)
> +
> +/* 5.3 IOMMU Capabilities (64bits) */
> +#define RISCV_IOMMU_REG_CAP 0x0000
> +#define RISCV_IOMMU_CAP_VERSION GENMASK_ULL(7, 0)
> +#define RISCV_IOMMU_CAP_S_SV32 BIT_ULL(8)
> +#define RISCV_IOMMU_CAP_S_SV39 BIT_ULL(9)
> +#define RISCV_IOMMU_CAP_S_SV48 BIT_ULL(10)
> +#define RISCV_IOMMU_CAP_S_SV57 BIT_ULL(11)
> +#define RISCV_IOMMU_CAP_SVPBMT BIT_ULL(15)
> +#define RISCV_IOMMU_CAP_G_SV32 BIT_ULL(16)
> +#define RISCV_IOMMU_CAP_G_SV39 BIT_ULL(17)
> +#define RISCV_IOMMU_CAP_G_SV48 BIT_ULL(18)
> +#define RISCV_IOMMU_CAP_G_SV57 BIT_ULL(19)
> +#define RISCV_IOMMU_CAP_AMO_MRIF BIT_ULL(21)
> +#define RISCV_IOMMU_CAP_MSI_FLAT BIT_ULL(22)
> +#define RISCV_IOMMU_CAP_MSI_MRIF BIT_ULL(23)
> +#define RISCV_IOMMU_CAP_AMO_HWAD BIT_ULL(24)
> +#define RISCV_IOMMU_CAP_ATS BIT_ULL(25)
> +#define RISCV_IOMMU_CAP_T2GPA BIT_ULL(26)
> +#define RISCV_IOMMU_CAP_END BIT_ULL(27)
> +#define RISCV_IOMMU_CAP_IGS GENMASK_ULL(29, 28)
> +#define RISCV_IOMMU_CAP_HPM BIT_ULL(30)
> +#define RISCV_IOMMU_CAP_DBG BIT_ULL(31)
> +#define RISCV_IOMMU_CAP_PAS GENMASK_ULL(37, 32)
> +#define RISCV_IOMMU_CAP_PD8 BIT_ULL(38)
> +#define RISCV_IOMMU_CAP_PD17 BIT_ULL(39)
> +#define RISCV_IOMMU_CAP_PD20 BIT_ULL(40)
> +
> +#define RISCV_IOMMU_CAP_VERSION_VER_MASK 0xF0
> +#define RISCV_IOMMU_CAP_VERSION_REV_MASK 0x0F
> +
> +/**
> + * enum riscv_iommu_igs_settings - Interrupt Generation Support Settings
> + * @RISCV_IOMMU_CAP_IGS_MSI: I/O MMU supports only MSI generation
> + * @RISCV_IOMMU_CAP_IGS_WSI: I/O MMU supports only Wired-Signaled interrupt
> + * @RISCV_IOMMU_CAP_IGS_BOTH: I/O MMU supports both MSI and WSI generation
> + * @RISCV_IOMMU_CAP_IGS_RSRV: Reserved for standard use
> + */
> +enum riscv_iommu_igs_settings {
> + RISCV_IOMMU_CAP_IGS_MSI = 0,
> + RISCV_IOMMU_CAP_IGS_WSI = 1,
> + RISCV_IOMMU_CAP_IGS_BOTH = 2,
> + RISCV_IOMMU_CAP_IGS_RSRV = 3
> +};
> +
> +/* 5.4 Features control register (32bits) */
> +#define RISCV_IOMMU_REG_FCTL 0x0008
> +#define RISCV_IOMMU_FCTL_BE BIT(0)
> +#define RISCV_IOMMU_FCTL_WSI BIT(1)
> +#define RISCV_IOMMU_FCTL_GXL BIT(2)
> +
> +/* 5.5 Device-directory-table pointer (64bits) */
> +#define RISCV_IOMMU_REG_DDTP 0x0010
> +#define RISCV_IOMMU_DDTP_MODE GENMASK_ULL(3, 0)
> +#define RISCV_IOMMU_DDTP_BUSY BIT_ULL(4)
> +#define RISCV_IOMMU_DDTP_PPN RISCV_IOMMU_PPN_FIELD
> +
> +/**
> + * enum riscv_iommu_ddtp_modes - I/O MMU translation modes
> + * @RISCV_IOMMU_DDTP_MODE_OFF: No inbound transactions allowed
> + * @RISCV_IOMMU_DDTP_MODE_BARE: Pass-through mode
> + * @RISCV_IOMMU_DDTP_MODE_1LVL: One-level DDT
> + * @RISCV_IOMMU_DDTP_MODE_2LVL: Two-level DDT
> + * @RISCV_IOMMU_DDTP_MODE_3LVL: Three-level DDT
> + * @RISCV_IOMMU_DDTP_MODE_MAX: Max value allowed by specification
> + */
> +enum riscv_iommu_ddtp_modes {
> + RISCV_IOMMU_DDTP_MODE_OFF = 0,
> + RISCV_IOMMU_DDTP_MODE_BARE = 1,
> + RISCV_IOMMU_DDTP_MODE_1LVL = 2,
> + RISCV_IOMMU_DDTP_MODE_2LVL = 3,
> + RISCV_IOMMU_DDTP_MODE_3LVL = 4,
> + RISCV_IOMMU_DDTP_MODE_MAX = 4
> +};
> +
> +/* 5.6 Command Queue Base (64bits) */
> +#define RISCV_IOMMU_REG_CQB 0x0018
> +#define RISCV_IOMMU_CQB_ENTRIES RISCV_IOMMU_QUEUE_LOGSZ_FIELD
> +#define RISCV_IOMMU_CQB_PPN RISCV_IOMMU_PPN_FIELD
> +
> +/* 5.7 Command Queue head (32bits) */
> +#define RISCV_IOMMU_REG_CQH 0x0020
> +#define RISCV_IOMMU_CQH_INDEX RISCV_IOMMU_QUEUE_INDEX_FIELD
> +
> +/* 5.8 Command Queue tail (32bits) */
> +#define RISCV_IOMMU_REG_CQT 0x0024
> +#define RISCV_IOMMU_CQT_INDEX RISCV_IOMMU_QUEUE_INDEX_FIELD
> +
> +/* 5.9 Fault Queue Base (64bits) */
> +#define RISCV_IOMMU_REG_FQB 0x0028
> +#define RISCV_IOMMU_FQB_ENTRIES RISCV_IOMMU_QUEUE_LOGSZ_FIELD
> +#define RISCV_IOMMU_FQB_PPN RISCV_IOMMU_PPN_FIELD
> +
> +/* 5.10 Fault Queue Head (32bits) */
> +#define RISCV_IOMMU_REG_FQH 0x0030
> +#define RISCV_IOMMU_FQH_INDEX RISCV_IOMMU_QUEUE_INDEX_FIELD
> +
> +/* 5.11 Fault Queue tail (32bits) */
> +#define RISCV_IOMMU_REG_FQT 0x0034
> +#define RISCV_IOMMU_FQT_INDEX RISCV_IOMMU_QUEUE_INDEX_FIELD
> +
> +/* 5.12 Page Request Queue base (64bits) */
> +#define RISCV_IOMMU_REG_PQB 0x0038
> +#define RISCV_IOMMU_PQB_ENTRIES RISCV_IOMMU_QUEUE_LOGSZ_FIELD
> +#define RISCV_IOMMU_PQB_PPN RISCV_IOMMU_PPN_FIELD
> +
> +/* 5.13 Page Request Queue head (32bits) */
> +#define RISCV_IOMMU_REG_PQH 0x0040
> +#define RISCV_IOMMU_PQH_INDEX RISCV_IOMMU_QUEUE_INDEX_FIELD
> +
> +/* 5.14 Page Request Queue tail (32bits) */
> +#define RISCV_IOMMU_REG_PQT 0x0044
> +#define RISCV_IOMMU_PQT_INDEX_MASK RISCV_IOMMU_QUEUE_INDEX_FIELD
> +
> +/* 5.15 Command Queue CSR (32bits) */
> +#define RISCV_IOMMU_REG_CQCSR 0x0048
> +#define RISCV_IOMMU_CQCSR_CQEN RISCV_IOMMU_QUEUE_ENABLE
> +#define RISCV_IOMMU_CQCSR_CIE RISCV_IOMMU_QUEUE_INTR_ENABLE
> +#define RISCV_IOMMU_CQCSR_CQMF RISCV_IOMMU_QUEUE_MEM_FAULT
> +#define RISCV_IOMMU_CQCSR_CMD_TO BIT(9)
> +#define RISCV_IOMMU_CQCSR_CMD_ILL BIT(10)
> +#define RISCV_IOMMU_CQCSR_FENCE_W_IP BIT(11)
> +#define RISCV_IOMMU_CQCSR_CQON RISCV_IOMMU_QUEUE_ACTIVE
> +#define RISCV_IOMMU_CQCSR_BUSY RISCV_IOMMU_QUEUE_BUSY
> +
> +/* 5.16 Fault Queue CSR (32bits) */
> +#define RISCV_IOMMU_REG_FQCSR 0x004C
> +#define RISCV_IOMMU_FQCSR_FQEN RISCV_IOMMU_QUEUE_ENABLE
> +#define RISCV_IOMMU_FQCSR_FIE RISCV_IOMMU_QUEUE_INTR_ENABLE
> +#define RISCV_IOMMU_FQCSR_FQMF RISCV_IOMMU_QUEUE_MEM_FAULT
> +#define RISCV_IOMMU_FQCSR_FQOF RISCV_IOMMU_QUEUE_OVERFLOW
> +#define RISCV_IOMMU_FQCSR_FQON RISCV_IOMMU_QUEUE_ACTIVE
> +#define RISCV_IOMMU_FQCSR_BUSY RISCV_IOMMU_QUEUE_BUSY
> +
> +/* 5.17 Page Request Queue CSR (32bits) */
> +#define RISCV_IOMMU_REG_PQCSR 0x0050
> +#define RISCV_IOMMU_PQCSR_PQEN RISCV_IOMMU_QUEUE_ENABLE
> +#define RISCV_IOMMU_PQCSR_PIE RISCV_IOMMU_QUEUE_INTR_ENABLE
> +#define RISCV_IOMMU_PQCSR_PQMF RISCV_IOMMU_QUEUE_MEM_FAULT
> +#define RISCV_IOMMU_PQCSR_PQOF RISCV_IOMMU_QUEUE_OVERFLOW
> +#define RISCV_IOMMU_PQCSR_PQON RISCV_IOMMU_QUEUE_ACTIVE
> +#define RISCV_IOMMU_PQCSR_BUSY RISCV_IOMMU_QUEUE_BUSY
> +
> +/* 5.18 Interrupt Pending Status (32bits) */
> +#define RISCV_IOMMU_REG_IPSR 0x0054
> +
> +#define RISCV_IOMMU_INTR_CQ 0
> +#define RISCV_IOMMU_INTR_FQ 1
> +#define RISCV_IOMMU_INTR_PM 2
> +#define RISCV_IOMMU_INTR_PQ 3
> +#define RISCV_IOMMU_INTR_COUNT 4
> +
> +#define RISCV_IOMMU_IPSR_CIP BIT(RISCV_IOMMU_INTR_CQ)
> +#define RISCV_IOMMU_IPSR_FIP BIT(RISCV_IOMMU_INTR_FQ)
> +#define RISCV_IOMMU_IPSR_PMIP BIT(RISCV_IOMMU_INTR_PM)
> +#define RISCV_IOMMU_IPSR_PIP BIT(RISCV_IOMMU_INTR_PQ)
> +
> +/* 5.19 Performance monitoring counter overflow status (32bits) */
> +#define RISCV_IOMMU_REG_IOCOUNTOVF 0x0058
> +#define RISCV_IOMMU_IOCOUNTOVF_CY BIT(0)
> +#define RISCV_IOMMU_IOCOUNTOVF_HPM GENMASK_ULL(31, 1)
> +
> +/* 5.20 Performance monitoring counter inhibits (32bits) */
> +#define RISCV_IOMMU_REG_IOCOUNTINH 0x005C
> +#define RISCV_IOMMU_IOCOUNTINH_CY BIT(0)
> +#define RISCV_IOMMU_IOCOUNTINH_HPM GENMASK(31, 1)
> +
> +/* 5.21 Performance monitoring cycles counter (64bits) */
> +#define RISCV_IOMMU_REG_IOHPMCYCLES 0x0060
> +#define RISCV_IOMMU_IOHPMCYCLES_COUNTER GENMASK_ULL(62, 0)
> +#define RISCV_IOMMU_IOHPMCYCLES_OVF BIT_ULL(63)
> +
> +/* 5.22 Performance monitoring event counters (31 * 64bits) */
> +#define RISCV_IOMMU_REG_IOHPMCTR_BASE 0x0068
> +#define RISCV_IOMMU_REG_IOHPMCTR(_n) (RISCV_IOMMU_REG_IOHPMCTR_BASE + ((_n) * 0x8))
> +
> +/* 5.23 Performance monitoring event selectors (31 * 64bits) */
> +#define RISCV_IOMMU_REG_IOHPMEVT_BASE 0x0160
> +#define RISCV_IOMMU_REG_IOHPMEVT(_n) (RISCV_IOMMU_REG_IOHPMEVT_BASE + ((_n) * 0x8))
> +#define RISCV_IOMMU_IOHPMEVT_CNT 31
> +#define RISCV_IOMMU_IOHPMEVT_EVENT_ID GENMASK_ULL(14, 0)
> +#define RISCV_IOMMU_IOHPMEVT_DMASK BIT_ULL(15)
> +#define RISCV_IOMMU_IOHPMEVT_PID_PSCID GENMASK_ULL(35, 16)
> +#define RISCV_IOMMU_IOHPMEVT_DID_GSCID GENMASK_ULL(59, 36)
> +#define RISCV_IOMMU_IOHPMEVT_PV_PSCV BIT_ULL(60)
> +#define RISCV_IOMMU_IOHPMEVT_DV_GSCV BIT_ULL(61)
> +#define RISCV_IOMMU_IOHPMEVT_IDT BIT_ULL(62)
> +#define RISCV_IOMMU_IOHPMEVT_OF BIT_ULL(63)
> +
> +/**
> + * enum riscv_iommu_hpmevent_id - Performance-monitoring event identifier
> + *
> + * @RISCV_IOMMU_HPMEVENT_INVALID: Invalid event, do not count
> + * @RISCV_IOMMU_HPMEVENT_URQ: Untranslated requests
> + * @RISCV_IOMMU_HPMEVENT_TRQ: Translated requests
> + * @RISCV_IOMMU_HPMEVENT_ATS_RQ: ATS translation requests
> + * @RISCV_IOMMU_HPMEVENT_TLB_MISS: TLB misses
> + * @RISCV_IOMMU_HPMEVENT_DD_WALK: Device directory walks
> + * @RISCV_IOMMU_HPMEVENT_PD_WALK: Process directory walks
> + * @RISCV_IOMMU_HPMEVENT_S_VS_WALKS: S/VS-Stage page table walks
> + * @RISCV_IOMMU_HPMEVENT_G_WALKS: G-Stage page table walks
> + * @RISCV_IOMMU_HPMEVENT_MAX: Value to denote maximum Event IDs
> + */
> +enum riscv_iommu_hpmevent_id {
> + RISCV_IOMMU_HPMEVENT_INVALID = 0,
> + RISCV_IOMMU_HPMEVENT_URQ = 1,
> + RISCV_IOMMU_HPMEVENT_TRQ = 2,
> + RISCV_IOMMU_HPMEVENT_ATS_RQ = 3,
> + RISCV_IOMMU_HPMEVENT_TLB_MISS = 4,
> + RISCV_IOMMU_HPMEVENT_DD_WALK = 5,
> + RISCV_IOMMU_HPMEVENT_PD_WALK = 6,
> + RISCV_IOMMU_HPMEVENT_S_VS_WALKS = 7,
> + RISCV_IOMMU_HPMEVENT_G_WALKS = 8,
> + RISCV_IOMMU_HPMEVENT_MAX = 9
> +};
> +
> +/* 5.24 Translation request IOVA (64bits) */
> +#define RISCV_IOMMU_REG_TR_REQ_IOVA 0x0258
> +#define RISCV_IOMMU_TR_REQ_IOVA_VPN GENMASK_ULL(63, 12)
> +
> +/* 5.25 Translation request control (64bits) */
> +#define RISCV_IOMMU_REG_TR_REQ_CTL 0x0260
> +#define RISCV_IOMMU_TR_REQ_CTL_GO_BUSY BIT_ULL(0)
> +#define RISCV_IOMMU_TR_REQ_CTL_PRIV BIT_ULL(1)
> +#define RISCV_IOMMU_TR_REQ_CTL_EXE BIT_ULL(2)
> +#define RISCV_IOMMU_TR_REQ_CTL_NW BIT_ULL(3)
> +#define RISCV_IOMMU_TR_REQ_CTL_PID GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_TR_REQ_CTL_PV BIT_ULL(32)
> +#define RISCV_IOMMU_TR_REQ_CTL_DID GENMASK_ULL(63, 40)
> +
> +/* 5.26 Translation request response (64bits) */
> +#define RISCV_IOMMU_REG_TR_RESPONSE 0x0268
> +#define RISCV_IOMMU_TR_RESPONSE_FAULT BIT_ULL(0)
> +#define RISCV_IOMMU_TR_RESPONSE_PBMT GENMASK_ULL(8, 7)
> +#define RISCV_IOMMU_TR_RESPONSE_SZ BIT_ULL(9)
> +#define RISCV_IOMMU_TR_RESPONSE_PPN RISCV_IOMMU_PPN_FIELD
> +
> +/* 5.27 Interrupt cause to vector (64bits) */
> +#define RISCV_IOMMU_REG_IVEC 0x02F8
> +#define RISCV_IOMMU_IVEC_CIV GENMASK_ULL(3, 0)
> +#define RISCV_IOMMU_IVEC_FIV GENMASK_ULL(7, 4)
> +#define RISCV_IOMMU_IVEC_PMIV GENMASK_ULL(11, 8)
> +#define RISCV_IOMMU_IVEC_PIV GENMASK_ULL(15, 12)
> +

Could we use ICVEC instead of IVEC, to match the term used in the spec?
If that is OK, you would also need to rename it in the
riscv_iommu_device structure.
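
For illustration only, the rename could look roughly like the sketch below. The offset and bit ranges are copied from the quoted hunk; the ICVEC spellings are a suggestion, GENMASK_ULL() is redefined locally so the snippet builds outside the kernel, and icvec_field() is a hypothetical stand-in for FIELD_GET(). The matching member of struct riscv_iommu_device is not shown in this hunk, so it is left out.

```c
#include <stdint.h>

/* Local stand-in for the kernel's GENMASK_ULL(), for a user-space build. */
#define GENMASK_ULL(h, l) \
	(((~0ULL) >> (63 - (h))) & ((~0ULL) << (l)))

/* 5.27 Interrupt cause to vector (64bits), using the spec's icvec name */
#define RISCV_IOMMU_REG_ICVEC		0x02F8
#define RISCV_IOMMU_ICVEC_CIV		GENMASK_ULL(3, 0)
#define RISCV_IOMMU_ICVEC_FIV		GENMASK_ULL(7, 4)
#define RISCV_IOMMU_ICVEC_PMIV		GENMASK_ULL(11, 8)
#define RISCV_IOMMU_ICVEC_PIV		GENMASK_ULL(15, 12)

/* Hypothetical helper: extract one field, similar in spirit to FIELD_GET(). */
static uint64_t icvec_field(uint64_t mask, uint64_t reg)
{
	/* (mask & -mask) isolates the lowest set bit of the mask. */
	return (reg & mask) / (mask & -mask);
}
```

Only the names change here; the register offset and the four 4-bit fields stay exactly as in the patch.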


> +/* 5.28 MSI Configuration table (32 * 64bits) */
> +#define RISCV_IOMMU_REG_MSI_CONFIG 0x0300
> +#define RISCV_IOMMU_REG_MSI_ADDR(_n) (RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10))
> +#define RISCV_IOMMU_MSI_ADDR GENMASK_ULL(55, 2)
> +#define RISCV_IOMMU_REG_MSI_DATA(_n) (RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10) + 0x08)
> +#define RISCV_IOMMU_MSI_DATA GENMASK_ULL(31, 0)
> +#define RISCV_IOMMU_REG_MSI_VEC_CTL(_n) (RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10) + 0x0C)
> +#define RISCV_IOMMU_MSI_VEC_CTL_M BIT_ULL(0)
> +
> +#define RISCV_IOMMU_REG_SIZE 0x1000
> +
> +/*
> + * Chapter 2: Data structures
> + */
> +
> +/*
> + * Device Directory Table macros for non-leaf nodes
> + */
> +#define RISCV_IOMMU_DDTE_VALID BIT_ULL(0)
> +#define RISCV_IOMMU_DDTE_PPN RISCV_IOMMU_PPN_FIELD
> +
> +/**
> + * struct riscv_iommu_dc - Device Context
> + * @tc: Translation Control
> + * @iohgatp: I/O Hypervisor guest address translation and protection
> + * (Second stage context)
> + * @ta: Translation Attributes
> + * @fsc: First stage context
> + * @msiptp: MSI page table pointer
> + * @msi_addr_mask: MSI address mask
> + * @msi_addr_pattern: MSI address pattern
> + * @_reserved: Reserved for future use, padding
> + *
> + * This structure is used for leaf nodes on the Device Directory Table,
> + * in case RISCV_IOMMU_CAP_MSI_FLAT is not set, the bottom 4 fields are
> + * not present and are skipped with pointer arithmetic to avoid
> + * casting, check out riscv_iommu_get_dc().
> + * See section 2.1 for more details
> + */
> +struct riscv_iommu_dc {
> + u64 tc;
> + u64 iohgatp;
> + u64 ta;
> + u64 fsc;
> + u64 msiptp;
> + u64 msi_addr_mask;
> + u64 msi_addr_pattern;
> + u64 _reserved;
> +};
> +
> +/* Translation control fields */
> +#define RISCV_IOMMU_DC_TC_V BIT_ULL(0)
> +#define RISCV_IOMMU_DC_TC_EN_ATS BIT_ULL(1)
> +#define RISCV_IOMMU_DC_TC_EN_PRI BIT_ULL(2)
> +#define RISCV_IOMMU_DC_TC_T2GPA BIT_ULL(3)
> +#define RISCV_IOMMU_DC_TC_DTF BIT_ULL(4)
> +#define RISCV_IOMMU_DC_TC_PDTV BIT_ULL(5)
> +#define RISCV_IOMMU_DC_TC_PRPR BIT_ULL(6)
> +#define RISCV_IOMMU_DC_TC_GADE BIT_ULL(7)
> +#define RISCV_IOMMU_DC_TC_SADE BIT_ULL(8)
> +#define RISCV_IOMMU_DC_TC_DPE BIT_ULL(9)
> +#define RISCV_IOMMU_DC_TC_SBE BIT_ULL(10)
> +#define RISCV_IOMMU_DC_TC_SXL BIT_ULL(11)
> +
> +/* Second-stage (aka G-stage) context fields */
> +#define RISCV_IOMMU_DC_IOHGATP_PPN RISCV_IOMMU_ATP_PPN_FIELD
> +#define RISCV_IOMMU_DC_IOHGATP_GSCID GENMASK_ULL(59, 44)
> +#define RISCV_IOMMU_DC_IOHGATP_MODE RISCV_IOMMU_ATP_MODE_FIELD
> +
> +/**
> + * enum riscv_iommu_dc_iohgatp_modes - Guest address translation/protection modes
> + * @RISCV_IOMMU_DC_IOHGATP_MODE_BARE: No translation/protection
> + * @RISCV_IOMMU_DC_IOHGATP_MODE_SV32X4: Sv32x4 (2-bit extension of Sv32), when fctl.GXL == 1
> + * @RISCV_IOMMU_DC_IOHGATP_MODE_SV39X4: Sv39x4 (2-bit extension of Sv39), when fctl.GXL == 0
> + * @RISCV_IOMMU_DC_IOHGATP_MODE_SV48X4: Sv48x4 (2-bit extension of Sv48), when fctl.GXL == 0
> + * @RISCV_IOMMU_DC_IOHGATP_MODE_SV57X4: Sv57x4 (2-bit extension of Sv57), when fctl.GXL == 0
> + */
> +enum riscv_iommu_dc_iohgatp_modes {
> + RISCV_IOMMU_DC_IOHGATP_MODE_BARE = 0,
> + RISCV_IOMMU_DC_IOHGATP_MODE_SV32X4 = 8,
> + RISCV_IOMMU_DC_IOHGATP_MODE_SV39X4 = 8,
> + RISCV_IOMMU_DC_IOHGATP_MODE_SV48X4 = 9,
> + RISCV_IOMMU_DC_IOHGATP_MODE_SV57X4 = 10
> +};
> +
> +/* Translation attributes fields */
> +#define RISCV_IOMMU_DC_TA_PSCID GENMASK_ULL(31, 12)
> +
> +/* First-stage context fields */
> +#define RISCV_IOMMU_DC_FSC_PPN RISCV_IOMMU_ATP_PPN_FIELD
> +#define RISCV_IOMMU_DC_FSC_MODE RISCV_IOMMU_ATP_MODE_FIELD
> +
> +/**
> + * enum riscv_iommu_dc_fsc_atp_modes - First stage address translation/protection modes
> + * @RISCV_IOMMU_DC_FSC_MODE_BARE: No translation/protection
> + * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV32: Sv32, when dc.tc.SXL == 1
> + * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39: Sv39, when dc.tc.SXL == 0
> + * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48: Sv48, when dc.tc.SXL == 0
> + * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57: Sv57, when dc.tc.SXL == 0
> + * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8: 1lvl PDT, 8bit process ids
> + * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD17: 2lvl PDT, 17bit process ids
> + * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20: 3lvl PDT, 20bit process ids
> + *
> + * FSC holds IOSATP when RISCV_IOMMU_DC_TC_PDTV is 0 and PDTP otherwise.
> + * IOSATP controls the first stage address translation (same as the satp register on
> + * the RISC-V MMU), and PDTP holds the process directory table, used to select a
> + * first stage page table based on a process id (for devices that support multiple
> + * process ids).
> + */
> +enum riscv_iommu_dc_fsc_atp_modes {
> + RISCV_IOMMU_DC_FSC_MODE_BARE = 0,
> + RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV32 = 8,
> + RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39 = 8,
> + RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48 = 9,
> + RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57 = 10,
> + RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8 = 1,
> + RISCV_IOMMU_DC_FSC_PDTP_MODE_PD17 = 2,
> + RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20 = 3
> +};
> +
> +/* MSI page table pointer */
> +#define RISCV_IOMMU_DC_MSIPTP_PPN RISCV_IOMMU_ATP_PPN_FIELD
> +#define RISCV_IOMMU_DC_MSIPTP_MODE RISCV_IOMMU_ATP_MODE_FIELD
> +#define RISCV_IOMMU_DC_MSIPTP_MODE_OFF 0
> +#define RISCV_IOMMU_DC_MSIPTP_MODE_FLAT 1
> +
> +/* MSI address mask */
> +#define RISCV_IOMMU_DC_MSI_ADDR_MASK GENMASK_ULL(51, 0)
> +
> +/* MSI address pattern */
> +#define RISCV_IOMMU_DC_MSI_PATTERN GENMASK_ULL(51, 0)
> +
> +/**
> + * struct riscv_iommu_pc - Process Context
> + * @ta: Translation Attributes
> + * @fsc: First stage context
> + *
> + * This structure is used for leaf nodes on the Process Directory Table
> + * See section 2.3 for more details
> + */
> +struct riscv_iommu_pc {
> + u64 ta;
> + u64 fsc;
> +};
> +
> +/* Translation attributes fields */
> +#define RISCV_IOMMU_PC_TA_V BIT_ULL(0)
> +#define RISCV_IOMMU_PC_TA_ENS BIT_ULL(1)
> +#define RISCV_IOMMU_PC_TA_SUM BIT_ULL(2)
> +#define RISCV_IOMMU_PC_TA_PSCID GENMASK_ULL(31, 12)
> +
> +/* First stage context fields */
> +#define RISCV_IOMMU_PC_FSC_PPN RISCV_IOMMU_ATP_PPN_FIELD
> +#define RISCV_IOMMU_PC_FSC_MODE RISCV_IOMMU_ATP_MODE_FIELD
> +
> +/*
> + * Chapter 3: In-memory queue interface
> + */
> +
> +/**
> + * struct riscv_iommu_command - Generic I/O MMU command structure
> + * @dword0: Includes the opcode and the function identifier
> + * @dword1: Opcode specific data
> + *
> + * The commands are interpreted as two 64bit fields, where the first
> + * 7bits of the first field are the opcode which also defines the
> + * command's format, followed by a 3bit field that specifies the
> + * function invoked by that command, and the rest is opcode-specific.
> + * This is a generic struct which will be populated differently
> + * according to each command. For more information on the commands and
> + * the command queue, see section 3.1.
> + */
> +struct riscv_iommu_command {
> + u64 dword0;
> + u64 dword1;
> +};
> +
> +/* Fields on dword0, common for all commands */
> +#define RISCV_IOMMU_CMD_OPCODE GENMASK_ULL(6, 0)
> +#define RISCV_IOMMU_CMD_FUNC GENMASK_ULL(9, 7)
> +
> +/* 3.1.1 I/O MMU Page-table cache invalidation */
> +/* Fields on dword0 */
> +#define RISCV_IOMMU_CMD_IOTINVAL_OPCODE 1
> +#define RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA 0
> +#define RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA 1
> +#define RISCV_IOMMU_CMD_IOTINVAL_AV BIT_ULL(10)
> +#define RISCV_IOMMU_CMD_IOTINVAL_PSCID GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_CMD_IOTINVAL_PSCV BIT_ULL(32)
> +#define RISCV_IOMMU_CMD_IOTINVAL_GV BIT_ULL(33)
> +#define RISCV_IOMMU_CMD_IOTINVAL_GSCID GENMASK_ULL(59, 44)
> +/* dword1[61:10] is the 4K-aligned page address */
> +#define RISCV_IOMMU_CMD_IOTINVAL_ADDR GENMASK_ULL(61, 10)
> +
> +/* 3.1.2 I/O MMU Command Queue Fences */
> +/* Fields on dword0 */
> +#define RISCV_IOMMU_CMD_IOFENCE_OPCODE 2
> +#define RISCV_IOMMU_CMD_IOFENCE_FUNC_C 0
> +#define RISCV_IOMMU_CMD_IOFENCE_AV BIT_ULL(10)
> +#define RISCV_IOMMU_CMD_IOFENCE_WSI BIT_ULL(11)
> +#define RISCV_IOMMU_CMD_IOFENCE_PR BIT_ULL(12)
> +#define RISCV_IOMMU_CMD_IOFENCE_PW BIT_ULL(13)
> +#define RISCV_IOMMU_CMD_IOFENCE_DATA GENMASK_ULL(63, 32)
> +/* dword1 is the address, word-size aligned and shifted to the right by two bits. */
> +
> +/* 3.1.3 I/O MMU Directory cache invalidation */
> +/* Fields on dword0 */
> +#define RISCV_IOMMU_CMD_IODIR_OPCODE 3
> +#define RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT 0
> +#define RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT 1
> +#define RISCV_IOMMU_CMD_IODIR_PID GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_CMD_IODIR_DV BIT_ULL(33)
> +#define RISCV_IOMMU_CMD_IODIR_DID GENMASK_ULL(63, 40)
> +/* dword1 is reserved for standard use */
> +
> +/* 3.1.4 I/O MMU PCIe ATS */
> +/* Fields on dword0 */
> +#define RISCV_IOMMU_CMD_ATS_OPCODE 4
> +#define RISCV_IOMMU_CMD_ATS_FUNC_INVAL 0
> +#define RISCV_IOMMU_CMD_ATS_FUNC_PRGR 1
> +#define RISCV_IOMMU_CMD_ATS_PID GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_CMD_ATS_PV BIT_ULL(32)
> +#define RISCV_IOMMU_CMD_ATS_DSV BIT_ULL(33)
> +#define RISCV_IOMMU_CMD_ATS_RID GENMASK_ULL(55, 40)
> +#define RISCV_IOMMU_CMD_ATS_DSEG GENMASK_ULL(63, 56)
> +/* dword1 is the ATS payload, two different payload types for INVAL and PRGR */
> +
> +/* ATS.INVAL payload*/
> +#define RISCV_IOMMU_CMD_ATS_INVAL_G BIT_ULL(0)
> +/* Bits 1 - 10 are zeroed */
> +#define RISCV_IOMMU_CMD_ATS_INVAL_S BIT_ULL(11)
> +#define RISCV_IOMMU_CMD_ATS_INVAL_UADDR GENMASK_ULL(63, 12)
> +
> +/* ATS.PRGR payload */
> +/* Bits 0 - 31 are zeroed */
> +#define RISCV_IOMMU_CMD_ATS_PRGR_PRG_INDEX GENMASK_ULL(40, 32)
> +/* Bits 41 - 43 are zeroed */
> +#define RISCV_IOMMU_CMD_ATS_PRGR_RESP_CODE GENMASK_ULL(47, 44)
> +#define RISCV_IOMMU_CMD_ATS_PRGR_DST_ID GENMASK_ULL(63, 48)
> +
> +/**
> + * struct riscv_iommu_fq_record - Fault/Event Queue Record
> + * @hdr: Header, includes fault/event cause, PID/DID, transaction type etc
> + * @_reserved: Low 32bits for custom use, high 32bits for standard use
> + * @iotval: Transaction-type/cause specific format
> + * @iotval2: Cause specific format
> + *
> + * The fault/event queue reports events and failures raised when
> + * processing transactions. Each record is a 32byte structure where
> + * the first dword has a fixed format for providing generic infos
> + * regarding the fault/event, and two more dwords are there for
> + * fault/event-specific information. For more details see section
> + * 3.2.
> + */
> +struct riscv_iommu_fq_record {
> + u64 hdr;
> + u64 _reserved;
> + u64 iotval;
> + u64 iotval2;
> +};
> +
> +/* Fields on header */
> +#define RISCV_IOMMU_FQ_HDR_CAUSE GENMASK_ULL(11, 0)
> +#define RISCV_IOMMU_FQ_HDR_PID GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_FQ_HDR_PV BIT_ULL(32)
> +#define RISCV_IOMMU_FQ_HDR_PRIV BIT_ULL(33)
> +#define RISCV_IOMMU_FQ_HDR_TTYPE GENMASK_ULL(39, 34)
> +#define RISCV_IOMMU_FQ_HDR_DID GENMASK_ULL(63, 40)
> +
> +/**
> + * enum riscv_iommu_fq_causes - Fault/event cause values
> + * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT: Instruction access fault
> + * @RISCV_IOMMU_FQ_CAUSE_RD_ADDR_MISALIGNED: Read address misaligned
> + * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT: Read load fault
> + * @RISCV_IOMMU_FQ_CAUSE_WR_ADDR_MISALIGNED: Write/AMO address misaligned
> + * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT: Write/AMO access fault
> + * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT_S: Instruction page fault
> + * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT_S: Read page fault
> + * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT_S: Write/AMO page fault
> + * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT_VS: Instruction guest page fault
> + * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT_VS: Read guest page fault
> + * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT_VS: Write/AMO guest page fault
> + * @RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED: All inbound transactions disallowed
> + * @RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT: DDT entry load access fault
> + * @RISCV_IOMMU_FQ_CAUSE_DDT_INVALID: DDT entry invalid
> + * @RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED: DDT entry misconfigured
> + * @RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED: Transaction type disallowed
> + * @RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT: MSI PTE load access fault
> + * @RISCV_IOMMU_FQ_CAUSE_MSI_INVALID: MSI PTE invalid
> + * @RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED: MSI PTE misconfigured
> + * @RISCV_IOMMU_FQ_CAUSE_MRIF_FAULT: MRIF access fault
> + * @RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT: PDT entry load access fault
> + * @RISCV_IOMMU_FQ_CAUSE_PDT_INVALID: PDT entry invalid
> + * @RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED: PDT entry misconfigured
> + * @RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED: DDT data corruption
> + * @RISCV_IOMMU_FQ_CAUSE_PDT_CORRUPTED: PDT data corruption
> + * @RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED: MSI page table data corruption
> + * @RISCV_IOMMU_FQ_CAUSE_MRIF_CORRUIPTED: MRIF data corruption
> + * @RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR: Internal data path error
> + * @RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT: IOMMU MSI write access fault
> + * @RISCV_IOMMU_FQ_CAUSE_PT_CORRUPTED: First/second stage page table data corruption
> + *
> + * Values are on table 11 of the spec, encodings 275 - 2047 are reserved for standard
> + * use, and 2048 - 4095 for custom use.
> + */
> +enum riscv_iommu_fq_causes {
> + RISCV_IOMMU_FQ_CAUSE_INST_FAULT = 1,
> + RISCV_IOMMU_FQ_CAUSE_RD_ADDR_MISALIGNED = 4,
> + RISCV_IOMMU_FQ_CAUSE_RD_FAULT = 5,
> + RISCV_IOMMU_FQ_CAUSE_WR_ADDR_MISALIGNED = 6,
> + RISCV_IOMMU_FQ_CAUSE_WR_FAULT = 7,
> + RISCV_IOMMU_FQ_CAUSE_INST_FAULT_S = 12,
> + RISCV_IOMMU_FQ_CAUSE_RD_FAULT_S = 13,
> + RISCV_IOMMU_FQ_CAUSE_WR_FAULT_S = 15,
> + RISCV_IOMMU_FQ_CAUSE_INST_FAULT_VS = 20,
> + RISCV_IOMMU_FQ_CAUSE_RD_FAULT_VS = 21,
> + RISCV_IOMMU_FQ_CAUSE_WR_FAULT_VS = 23,
> + RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED = 256,
> + RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT = 257,
> + RISCV_IOMMU_FQ_CAUSE_DDT_INVALID = 258,
> + RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED = 259,
> + RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED = 260,
> + RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT = 261,
> + RISCV_IOMMU_FQ_CAUSE_MSI_INVALID = 262,
> + RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED = 263,
> + RISCV_IOMMU_FQ_CAUSE_MRIF_FAULT = 264,
> + RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT = 265,
> + RISCV_IOMMU_FQ_CAUSE_PDT_INVALID = 266,
> + RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED = 267,
> + RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED = 268,
> + RISCV_IOMMU_FQ_CAUSE_PDT_CORRUPTED = 269,
> + RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED = 270,
> + RISCV_IOMMU_FQ_CAUSE_MRIF_CORRUIPTED = 271,
> + RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR = 272,
> + RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT = 273,
> + RISCV_IOMMU_FQ_CAUSE_PT_CORRUPTED = 274
> +};
> +
> +/**
> + * enum riscv_iommu_fq_ttypes - Fault/event transaction types
> + * @RISCV_IOMMU_FQ_TTYPE_NONE: None. Fault not caused by an inbound transaction.
> + * @RISCV_IOMMU_FQ_TTYPE_UADDR_INST_FETCH: Instruction fetch from untranslated address
> + * @RISCV_IOMMU_FQ_TTYPE_UADDR_RD: Read from untranslated address
> + * @RISCV_IOMMU_FQ_TTYPE_UADDR_WR: Write/AMO to untranslated address
> + * @RISCV_IOMMU_FQ_TTYPE_TADDR_INST_FETCH: Instruction fetch from translated address
> + * @RISCV_IOMMU_FQ_TTYPE_TADDR_RD: Read from translated address
> + * @RISCV_IOMMU_FQ_TTYPE_TADDR_WR: Write/AMO to translated address
> + * @RISCV_IOMMU_FQ_TTYPE_PCIE_ATS_REQ: PCIe ATS translation request
> + * @RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ: PCIe message request
> + *
> + * Values are on table 12 of the spec, types 4 and 10 - 31 are reserved for standard use
> + * and 32 - 63 for custom use.
> + */
> +enum riscv_iommu_fq_ttypes {
> + RISCV_IOMMU_FQ_TTYPE_NONE = 0,
> + RISCV_IOMMU_FQ_TTYPE_UADDR_INST_FETCH = 1,
> + RISCV_IOMMU_FQ_TTYPE_UADDR_RD = 2,
> + RISCV_IOMMU_FQ_TTYPE_UADDR_WR = 3,
> + RISCV_IOMMU_FQ_TTYPE_TADDR_INST_FETCH = 5,
> + RISCV_IOMMU_FQ_TTYPE_TADDR_RD = 6,
> + RISCV_IOMMU_FQ_TTYPE_TADDR_WR = 7,
> + RISCV_IOMMU_FQ_TTYPE_PCIE_ATS_REQ = 8,
> + RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ = 9,
> +};
> +
> +/**
> + * struct riscv_iommu_pq_record - PCIe Page Request record
> + * @hdr: Header, includes PID, DID etc
> + * @payload: Holds the page address, request group and permission bits
> + *
> + * For more infos on the PCIe Page Request queue see chapter 3.3.
> + */
> +struct riscv_iommu_pq_record {
> + u64 hdr;
> + u64 payload;
> +};
> +
> +/* Header fields */
> +#define RISCV_IOMMU_PREQ_HDR_PID GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_PREQ_HDR_PV BIT_ULL(32)
> +#define RISCV_IOMMU_PREQ_HDR_PRIV BIT_ULL(33)
> +#define RISCV_IOMMU_PREQ_HDR_EXEC BIT_ULL(34)
> +#define RISCV_IOMMU_PREQ_HDR_DID GENMASK_ULL(63, 40)
> +
> +/* Payload fields */
> +#define RISCV_IOMMU_PREQ_PAYLOAD_R BIT_ULL(0)
> +#define RISCV_IOMMU_PREQ_PAYLOAD_W BIT_ULL(1)
> +#define RISCV_IOMMU_PREQ_PAYLOAD_L BIT_ULL(2)
> +#define RISCV_IOMMU_PREQ_PAYLOAD_M GENMASK_ULL(2, 0) /* Mask of RWL for convenience */
> +#define RISCV_IOMMU_PREQ_PRG_INDEX GENMASK_ULL(11, 3)
> +#define RISCV_IOMMU_PREQ_UADDR GENMASK_ULL(63, 12)
> +
> +/**
> + * struct riscv_iommu_msi_pte - MSI Page Table Entry
> + * @pte: MSI PTE
> + * @mrif_info: Memory-resident interrupt file info
> + *
> + * The MSI Page Table is used for virtualizing MSIs, so that when
> + * a device sends an MSI to a guest, the IOMMU can reroute it
> + * by translating the MSI address, either to a guest interrupt file
> + * or a memory resident interrupt file (MRIF). Note that this page table
> + * is an array of MSI PTEs, not a multi-level pt, each entry
> + * is a leaf entry. For more infos check out the AIA spec, chapter 9.5.
> + *
> + * Also in basic mode the mrif_info field is ignored by the IOMMU and can
> + * be used by software, any other reserved fields on pte must be zeroed-out
> + * by software.
> + */
> +struct riscv_iommu_msi_pte {
> + u64 pte;
> + u64 mrif_info;
> +};
> +
> +/* Fields on pte */
> +#define RISCV_IOMMU_MSI_PTE_V BIT_ULL(0)
> +#define RISCV_IOMMU_MSI_PTE_M GENMASK_ULL(2, 1)
> +#define RISCV_IOMMU_MSI_PTE_MRIF_ADDR GENMASK_ULL(53, 7) /* When M == 1 (MRIF mode) */
> +#define RISCV_IOMMU_MSI_PTE_PPN RISCV_IOMMU_PPN_FIELD /* When M == 3 (basic mode) */
> +#define RISCV_IOMMU_MSI_PTE_C BIT_ULL(63)
> +
> +/* Fields on mrif_info */
> +#define RISCV_IOMMU_MSI_MRIF_NID GENMASK_ULL(9, 0)
> +#define RISCV_IOMMU_MSI_MRIF_NPPN RISCV_IOMMU_PPN_FIELD
> +#define RISCV_IOMMU_MSI_MRIF_NID_MSB BIT_ULL(60)
> +
> +#endif /* _RISCV_IOMMU_BITS_H_ */
> diff --git a/drivers/iommu/riscv/iommu-platform.c b/drivers/iommu/riscv/iommu-platform.c
> new file mode 100644
> index 000000000000..1b453334fbbe
> --- /dev/null
> +++ b/drivers/iommu/riscv/iommu-platform.c
> @@ -0,0 +1,92 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * RISC-V IOMMU as a platform device
> + *
> + * Copyright © 2023 FORTH-ICS/CARV
> + * Copyright © 2023-2024 Rivos Inc.
> + *
> + * Authors
> + * Nick Kossifidis <[email protected]>
> + * Tomasz Jeznach <[email protected]>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/of_platform.h>
> +#include <linux/platform_device.h>
> +
> +#include "iommu-bits.h"
> +#include "iommu.h"
> +
> +static int riscv_iommu_platform_probe(struct platform_device *pdev)
> +{
> + struct device *dev = &pdev->dev;
> + struct riscv_iommu_device *iommu = NULL;
> + struct resource *res = NULL;
> + int vec;
> +
> + iommu = devm_kzalloc(dev, sizeof(*iommu), GFP_KERNEL);
> + if (!iommu)
> + return -ENOMEM;
> +
> + iommu->dev = dev;
> + iommu->reg = devm_platform_get_and_ioremap_resource(pdev, 0, &res);
> + if (IS_ERR(iommu->reg))
> + return dev_err_probe(dev, PTR_ERR(iommu->reg),
> + "could not map register region\n");
> +
> + dev_set_drvdata(dev, iommu);
> +
> + /* Check device reported capabilities / features. */
> + iommu->caps = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_CAP);
> + iommu->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
> +
> + /* For now we only support WSI */
> + switch (FIELD_GET(RISCV_IOMMU_CAP_IGS, iommu->caps)) {
> + case RISCV_IOMMU_CAP_IGS_WSI:
> + case RISCV_IOMMU_CAP_IGS_BOTH:
> + break;
> + default:
> + return dev_err_probe(dev, -ENODEV,
> + "unable to use wire-signaled interrupts\n");
> + }
> +
> + iommu->irqs_count = platform_irq_count(pdev);
> + if (iommu->irqs_count <= 0)
> + return dev_err_probe(dev, -ENODEV,
> + "no IRQ resources provided\n");
> + if (iommu->irqs_count > RISCV_IOMMU_INTR_COUNT)
> + iommu->irqs_count = RISCV_IOMMU_INTR_COUNT;
> +
> + for (vec = 0; vec < iommu->irqs_count; vec++)
> + iommu->irqs[vec] = platform_get_irq(pdev, vec);
> +
> + /* Enable wire-signaled interrupts, fctl.WSI */
> + if (!(iommu->fctl & RISCV_IOMMU_FCTL_WSI)) {
> + iommu->fctl |= RISCV_IOMMU_FCTL_WSI;
> + riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FCTL, iommu->fctl);
> + }
> +
> + return riscv_iommu_init(iommu);
> +}
> +
> +static void riscv_iommu_platform_remove(struct platform_device *pdev)
> +{
> + riscv_iommu_remove(dev_get_drvdata(&pdev->dev));
> +}
> +
> +static const struct of_device_id riscv_iommu_of_match[] = {
> + {.compatible = "riscv,iommu",},
> + {},
> +};
> +
> +static struct platform_driver riscv_iommu_platform_driver = {
> + .probe = riscv_iommu_platform_probe,
> + .remove_new = riscv_iommu_platform_remove,
> + .driver = {
> + .name = "riscv,iommu",
> + .of_match_table = riscv_iommu_of_match,
> + .suppress_bind_attrs = true,
> + },
> +};
> +
> +builtin_platform_driver(riscv_iommu_platform_driver);
> diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> new file mode 100644
> index 000000000000..3c5a6b49669d
> --- /dev/null
> +++ b/drivers/iommu/riscv/iommu.c
> @@ -0,0 +1,99 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * IOMMU API for RISC-V IOMMU implementations.
> + *
> + * Copyright © 2022-2024 Rivos Inc.
> + * Copyright © 2023 FORTH-ICS/CARV
> + *
> + * Authors
> + * Tomasz Jeznach <[email protected]>
> + * Nick Kossifidis <[email protected]>
> + */
> +
> +#define pr_fmt(fmt) "riscv-iommu: " fmt
> +
> +#include <linux/compiler.h>
> +#include <linux/crash_dump.h>
> +#include <linux/init.h>
> +#include <linux/iommu.h>
> +#include <linux/kernel.h>
> +
> +#include "iommu-bits.h"
> +#include "iommu.h"
> +
> +/* Timeouts in [us] */
> +#define RISCV_IOMMU_DDTP_TIMEOUT 50000
> +
> +/*
> + * This is best effort IOMMU translation shutdown flow.
> + * Disable IOMMU without waiting for hardware response.
> + */
> +static void riscv_iommu_disable(struct riscv_iommu_device *iommu)
> +{
> + riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP, 0);
> + riscv_iommu_writel(iommu, RISCV_IOMMU_REG_CQCSR, 0);
> + riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FQCSR, 0);
> + riscv_iommu_writel(iommu, RISCV_IOMMU_REG_PQCSR, 0);
> +}
> +
> +static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
> +{
> + u64 ddtp;
> +
> + /*
> + * Make sure the IOMMU is switched off or in pass-through mode during
> + * regular boot flow and disable translation when we boot into a kexec
> + * kernel and the previous kernel left them enabled.
> + */
> + ddtp = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_DDTP);
> + if (ddtp & RISCV_IOMMU_DDTP_BUSY)
> + return -EBUSY;
> +
> + if (FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp) > RISCV_IOMMU_DDTP_MODE_BARE) {
> + if (!is_kdump_kernel())
> + return -EBUSY;
> + riscv_iommu_disable(iommu);
> + }
> +
> + /* Configure accesses to in-memory data structures for CPU-native byte order. */
> + if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) != !!(iommu->fctl & RISCV_IOMMU_FCTL_BE)) {
> + if (!(iommu->caps & RISCV_IOMMU_CAP_END))
> + return -EINVAL;
> + riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FCTL,
> + iommu->fctl ^ RISCV_IOMMU_FCTL_BE);
> + iommu->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
> + if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) != !!(iommu->fctl & RISCV_IOMMU_FCTL_BE))
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
> +void riscv_iommu_remove(struct riscv_iommu_device *iommu)
> +{
> + iommu_device_sysfs_remove(&iommu->iommu);
> +}
> +
> +int riscv_iommu_init(struct riscv_iommu_device *iommu)
> +{
> + int rc;
> +
> + rc = riscv_iommu_init_check(iommu);
> + if (rc)
> + return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
> +
> + /*
> + * Placeholder for a complete IOMMU device initialization. For now,
> + * only bare minimum: enable global identity mapping mode and register sysfs.
> + */
> + riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
> + FIELD_PREP(RISCV_IOMMU_DDTP_MODE, RISCV_IOMMU_DDTP_MODE_BARE));
> +
> + rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
> + dev_name(iommu->dev));
> + if (rc)
> + return dev_err_probe(iommu->dev, rc,
> + "cannot register sysfs interface\n");
> +
> + return 0;
> +}
> diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
> new file mode 100644
> index 000000000000..700e33dc2446
> --- /dev/null
> +++ b/drivers/iommu/riscv/iommu.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright © 2022-2024 Rivos Inc.
> + * Copyright © 2023 FORTH-ICS/CARV
> + *
> + * Authors
> + * Tomasz Jeznach <[email protected]>
> + * Nick Kossifidis <[email protected]>
> + */
> +
> +#ifndef _RISCV_IOMMU_H_
> +#define _RISCV_IOMMU_H_
> +
> +#include <linux/iommu.h>
> +#include <linux/types.h>
> +#include <linux/iopoll.h>
> +
> +#include "iommu-bits.h"
> +
> +struct riscv_iommu_device {
> + /* iommu core interface */
> + struct iommu_device iommu;
> +
> + /* iommu hardware */
> + struct device *dev;
> +
> + /* hardware control register space */
> + void __iomem *reg;
> +
> + /* supported and enabled hardware capabilities */
> + u64 caps;
> + u32 fctl;
> +
> + /* available interrupt numbers, MSI or WSI */
> + unsigned int irqs[RISCV_IOMMU_INTR_COUNT];
> + unsigned int irqs_count;
> +};
> +
> +int riscv_iommu_init(struct riscv_iommu_device *iommu);
> +void riscv_iommu_remove(struct riscv_iommu_device *iommu);
> +
> +#define riscv_iommu_readl(iommu, addr) \
> + readl_relaxed((iommu)->reg + (addr))
> +
> +#define riscv_iommu_readq(iommu, addr) \
> + readq_relaxed((iommu)->reg + (addr))
> +
> +#define riscv_iommu_writel(iommu, addr, val) \
> + writel_relaxed((val), (iommu)->reg + (addr))
> +
> +#define riscv_iommu_writeq(iommu, addr, val) \
> + writeq_relaxed((val), (iommu)->reg + (addr))
> +
> +#define riscv_iommu_readq_timeout(iommu, addr, val, cond, delay_us, timeout_us) \
> + readx_poll_timeout(readq_relaxed, (iommu)->reg + (addr), val, cond, \
> + delay_us, timeout_us)
> +
> +#define riscv_iommu_readl_timeout(iommu, addr, val, cond, delay_us, timeout_us) \
> + readx_poll_timeout(readl_relaxed, (iommu)->reg + (addr), val, cond, \
> + delay_us, timeout_us)
> +
> +#endif
> --
> 2.34.1
>
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv
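
To illustrate the command layout described in the quoted iommu-bits.h (opcode in dword0[6:0], function in dword0[9:7], the rest opcode-specific), here is a minimal userspace sketch that packs an IOTINVAL.VMA command. The field positions are copied from the quoted definitions. Everything else is an assumption made for illustration: field_prep(), field_get(), struct cmd and iotinval_vma() are hypothetical stand-ins for the kernel helpers, and storing the page frame number (address >> 12) in dword1[61:10] is inferred from the "4K-aligned page address" comment rather than confirmed by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's BIT_ULL() and GENMASK_ULL(). */
#define BIT_ULL(n)        (1ULL << (n))
#define GENMASK_ULL(h, l) ((~0ULL >> (63 - (h))) & (~0ULL << (l)))

/* Field positions copied from the quoted iommu-bits.h. */
#define CMD_OPCODE            GENMASK_ULL(6, 0)
#define CMD_FUNC              GENMASK_ULL(9, 7)
#define CMD_IOTINVAL_OPCODE   1
#define CMD_IOTINVAL_FUNC_VMA 0
#define CMD_IOTINVAL_AV       BIT_ULL(10)
#define CMD_IOTINVAL_PSCID    GENMASK_ULL(31, 12)
#define CMD_IOTINVAL_PSCV     BIT_ULL(32)
#define CMD_IOTINVAL_ADDR     GENMASK_ULL(61, 10)

/* Hypothetical helper: shift val into the field selected by mask. */
static uint64_t field_prep(uint64_t mask, uint64_t val)
{
	/* mask & (~mask + 1) isolates the lowest set bit of the mask. */
	return (val * (mask & (~mask + 1))) & mask;
}

/* Hypothetical helper: extract the field selected by mask from reg. */
static uint64_t field_get(uint64_t mask, uint64_t reg)
{
	return (reg & mask) / (mask & (~mask + 1));
}

/* Two 64-bit words, mirroring struct riscv_iommu_command. */
struct cmd {
	uint64_t dword0;
	uint64_t dword1;
};

/* Build an address- and PSCID-qualified IOTINVAL.VMA command. */
static struct cmd iotinval_vma(uint32_t pscid, uint64_t iova)
{
	struct cmd c;

	assert(pscid < (1u << 20));	/* PSCID field is 20 bits wide */

	c.dword0 = field_prep(CMD_OPCODE, CMD_IOTINVAL_OPCODE) |
		   field_prep(CMD_FUNC, CMD_IOTINVAL_FUNC_VMA) |
		   field_prep(CMD_IOTINVAL_PSCID, pscid) |
		   CMD_IOTINVAL_PSCV | CMD_IOTINVAL_AV;
	/* Assumption: the address field carries the 4K page frame number. */
	c.dword1 = field_prep(CMD_IOTINVAL_ADDR, iova >> 12);
	return c;
}
```

Calling iotinval_vma(5, 0x12345000) and reading the fields back with field_get() is a quick way to check that the opcode, function and PSCID land in the bit ranges the header defines.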

2024-05-08 15:57:28

by Zong Li

Subject: Re: [PATCH v4 7/7] iommu/riscv: Paging domain support

On Sat, May 4, 2024 at 12:13 AM Tomasz Jeznach <[email protected]> wrote:
>
> Introduce first-stage address translation support.
>
> Page tables configured by the IOMMU driver use the highest mode
> implemented by the hardware. If that mode is not known at domain
> allocation time, the driver falls back to the CPU’s MMU page mode.
>
> This change introduces IOTINVAL.VMA command, required to invalidate
> any cached IOATC entries after mapping is updated and/or removed from
> the paging domain. Invalidations for the non-leaf page entries use
> IOTINVAL for all addresses assigned to the protection domain for
> hardware not supporting more granular non-leaf page table cache
> invalidations.
>
> Signed-off-by: Tomasz Jeznach <[email protected]>
> ---
> drivers/iommu/riscv/iommu.c | 587 +++++++++++++++++++++++++++++++++++-
> 1 file changed, 585 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> index 4349ac8a3990..ec701fde520f 100644
> --- a/drivers/iommu/riscv/iommu.c
> +++ b/drivers/iommu/riscv/iommu.c
> @@ -41,6 +41,10 @@
> #define dev_to_iommu(dev) \
> iommu_get_iommu_dev(dev, struct riscv_iommu_device, iommu)
>
> +/* IOMMU PSCID allocation namespace. */
> +static DEFINE_IDA(riscv_iommu_pscids);
> +#define RISCV_IOMMU_MAX_PSCID (BIT(20) - 1)
> +
> /* Device resource-managed allocations */
> struct riscv_iommu_devres {
> void *addr;
> @@ -766,6 +770,143 @@ static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
> return 0;
> }
>
> +/* This struct contains protection domain specific IOMMU driver data. */
> +struct riscv_iommu_domain {
> + struct iommu_domain domain;
> + struct list_head bonds;
> + spinlock_t lock; /* protect bonds list updates. */
> + int pscid;
> + int numa_node;
> + int amo_enabled:1;
> + unsigned int pgd_mode;
> + unsigned long *pgd_root;
> +};
> +
> +#define iommu_domain_to_riscv(iommu_domain) \
> + container_of(iommu_domain, struct riscv_iommu_domain, domain)
> +
> +/* Private IOMMU data for managed devices, dev_iommu_priv_* */
> +struct riscv_iommu_info {
> + struct riscv_iommu_domain *domain;
> +};
> +
> +/* Linkage between an iommu_domain and attached devices. */
> +struct riscv_iommu_bond {
> + struct list_head list;
> + struct rcu_head rcu;
> + struct device *dev;
> +};
> +
> +static int riscv_iommu_bond_link(struct riscv_iommu_domain *domain,
> + struct device *dev)
> +{
> + struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> + struct riscv_iommu_bond *bond;
> + struct list_head *bonds;
> +
> + bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> + if (!bond)
> + return -ENOMEM;
> + bond->dev = dev;
> +
> + /*
> + * The linked device pointer and IOMMU data remain stable in the bond struct
> + * as long as the device is attached to the managed IOMMU, from _probe_device()
> + * up to completion of the _release_device() call. Release of the bond is
> + * synchronized at device release, to guarantee that no stale pointer is
> + * used inside RCU-protected sections.
> + *
> + * List of devices attached to the domain is arranged based on
> + * managed IOMMU device.
> + */
> +
> + spin_lock(&domain->lock);
> + list_for_each_rcu(bonds, &domain->bonds)
> + if (dev_to_iommu(list_entry(bonds, struct riscv_iommu_bond, list)->dev) == iommu)
> + break;
> + list_add_rcu(&bond->list, bonds);
> + spin_unlock(&domain->lock);
> +
> + return 0;
> +}
> +
> +static void riscv_iommu_bond_unlink(struct riscv_iommu_domain *domain,
> + struct device *dev)
> +{
> + struct riscv_iommu_bond *bond, *found = NULL;
> +
> + if (!domain)
> + return;
> +
> + spin_lock(&domain->lock);
> + list_for_each_entry_rcu(bond, &domain->bonds, list) {
> + if (bond->dev == dev) {
> + list_del_rcu(&bond->list);
> + found = bond;
> + }
> + }
> + spin_unlock(&domain->lock);
> + kfree_rcu(found, rcu);
> +}
> +
> +/*
> + * Send IOTINVAL.VMA for the whole address space for ranges larger than 2MB.
> + * This limit will be replaced with range invalidations, if supported by
> + * the hardware, once the RISC-V IOMMU architecture specification is
> + * updated to define range invalidations.
> + */
> +#define RISCV_IOMMU_IOTLB_INVAL_LIMIT (2 << 20)
> +
> +static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
> + unsigned long start, unsigned long end)
> +{
> + struct riscv_iommu_bond *bond;
> + struct riscv_iommu_device *iommu, *prev;
> + struct riscv_iommu_command cmd;
> + unsigned long len = end - start + 1;
> + unsigned long iova;
> +
> + rcu_read_lock();
> +
> + prev = NULL;
> + list_for_each_entry_rcu(bond, &domain->bonds, list) {
> + iommu = dev_to_iommu(bond->dev);
> +
> + /*
> + * IOTLB invalidation request can be safely omitted if already sent
> + * to the IOMMU for the same PSCID, and with domain->bonds list
> + * arranged based on the device's IOMMU, it's sufficient to check
> + * last device the invalidation was sent to.
> + */
> + if (iommu == prev)
> + continue;
> +
> + riscv_iommu_cmd_inval_vma(&cmd);
> + riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
> + if (len && len < RISCV_IOMMU_IOTLB_INVAL_LIMIT) {
> + for (iova = start; iova < end; iova += PAGE_SIZE) {
> + riscv_iommu_cmd_inval_set_addr(&cmd, iova);
> + riscv_iommu_cmd_send(iommu, &cmd, 0);
> + }
> + } else {
> + riscv_iommu_cmd_send(iommu, &cmd, 0);
> + }
> + prev = iommu;
> + }
> +
> + prev = NULL;
> + list_for_each_entry_rcu(bond, &domain->bonds, list) {
> + iommu = dev_to_iommu(bond->dev);
> + if (iommu == prev)
> + continue;
> +
> + riscv_iommu_cmd_iofence(&cmd);

I was wondering why we need so many 'iofence' commands following the
invalidation commands. Could we just use one iofence to guarantee
that all previous invalidation commands have completed?
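
For reference, the quoted loop skips a bond whenever its IOMMU matches the previous one, so the number of fences follows the number of distinct IOMMUs in the domain rather than the number of attached devices (presumably because each IOMMU has its own command queue). A minimal userspace sketch of that counting, assuming the bonds list is grouped by IOMMU as the patch comment states; count_fences() and the integer IOMMU indices are hypothetical stand-ins, not driver code:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the domain->bonds walk: each array entry is the
 * index of the IOMMU serving one attached device. Returns how many fences
 * the "skip if same as previous" rule would issue.
 */
static size_t count_fences(const int *iommu_of_bond, size_t n)
{
	size_t fences = 0;
	size_t i;
	int prev = -1;	/* no IOMMU seen yet; the driver uses a NULL pointer */

	assert(iommu_of_bond != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (iommu_of_bond[i] == prev)
			continue;	/* this IOMMU already got its fence */
		fences++;		/* one fence per run of equal IOMMUs */
		prev = iommu_of_bond[i];
	}
	return fences;
}
```

With six devices behind three IOMMUs, grouped as {0, 0, 1, 1, 1, 2}, the rule yields three fences. If the list were not grouped, for example {0, 1, 0}, the same rule would fence IOMMU 0 twice, which is why the ordering of the list matters.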

> + riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_QUEUE_TIMEOUT);
> + prev = iommu;
> + }
> + rcu_read_unlock();
> +}
> +
> #define RISCV_IOMMU_FSC_BARE 0
>
> /*
> @@ -785,10 +926,29 @@ static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
> {
> struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> struct riscv_iommu_dc *dc;
> + struct riscv_iommu_command cmd;
> + bool sync_required = false;
> u64 tc;
> int i;
>
> - /* Device context invalidation ignored for now. */
> + for (i = 0; i < fwspec->num_ids; i++) {
> + dc = riscv_iommu_get_dc(iommu, fwspec->ids[i]);
> + tc = READ_ONCE(dc->tc);
> + if (tc & RISCV_IOMMU_DC_TC_V) {
> + WRITE_ONCE(dc->tc, tc & ~RISCV_IOMMU_DC_TC_V);
> +
> + /* Invalidate device context cached values */
> + riscv_iommu_cmd_iodir_inval_ddt(&cmd);
> + riscv_iommu_cmd_iodir_set_did(&cmd, fwspec->ids[i]);
> + riscv_iommu_cmd_send(iommu, &cmd, 0);
> + sync_required = true;
> + }
> + }
> +
> + if (sync_required) {
> + riscv_iommu_cmd_iofence(&cmd);
> + riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_IOTINVAL_TIMEOUT);
> + }
>
> /*
> * For device context with DC_TC_PDTV = 0, translation attributes valid bit
> @@ -806,12 +966,415 @@ static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
> }
> }
>
> +/*
> + * IOVA page translation tree management.
> + */
> +
> +#define IOMMU_PAGE_SIZE_4K BIT_ULL(12)
> +#define IOMMU_PAGE_SIZE_2M BIT_ULL(21)
> +#define IOMMU_PAGE_SIZE_1G BIT_ULL(30)
> +#define IOMMU_PAGE_SIZE_512G BIT_ULL(39)
> +
> +#define PT_SHIFT (PAGE_SHIFT - ilog2(sizeof(pte_t)))
> +
> +static void riscv_iommu_iotlb_flush_all(struct iommu_domain *iommu_domain)
> +{
> + struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> +
> + riscv_iommu_iotlb_inval(domain, 0, ULONG_MAX);
> +}
> +
> +static void riscv_iommu_iotlb_sync(struct iommu_domain *iommu_domain,
> + struct iommu_iotlb_gather *gather)
> +{
> + struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> +
> + riscv_iommu_iotlb_inval(domain, gather->start, gather->end);
> +}
> +
> +static inline size_t get_page_size(size_t size)
> +{
> + if (size >= IOMMU_PAGE_SIZE_512G)
> + return IOMMU_PAGE_SIZE_512G;
> + if (size >= IOMMU_PAGE_SIZE_1G)
> + return IOMMU_PAGE_SIZE_1G;
> + if (size >= IOMMU_PAGE_SIZE_2M)
> + return IOMMU_PAGE_SIZE_2M;
> + return IOMMU_PAGE_SIZE_4K;
> +}
> +
> +#define _io_pte_present(pte) ((pte) & (_PAGE_PRESENT | _PAGE_PROT_NONE))
> +#define _io_pte_leaf(pte) ((pte) & _PAGE_LEAF)
> +#define _io_pte_none(pte) ((pte) == 0)
> +#define _io_pte_entry(pn, prot) ((_PAGE_PFN_MASK & ((pn) << _PAGE_PFN_SHIFT)) | (prot))
> +
> +static void riscv_iommu_pte_free(struct riscv_iommu_domain *domain,
> + unsigned long pte, struct list_head *freelist)
> +{
> + unsigned long *ptr;
> + int i;
> +
> + if (!_io_pte_present(pte) || _io_pte_leaf(pte))
> + return;
> +
> + ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
> +
> + /* Recursively free all sub page table pages */
> + for (i = 0; i < PTRS_PER_PTE; i++) {
> + pte = READ_ONCE(ptr[i]);
> + if (!_io_pte_none(pte) && cmpxchg_relaxed(ptr + i, pte, 0) == pte)
> + riscv_iommu_pte_free(domain, pte, freelist);
> + }
> +
> + if (freelist)
> + list_add_tail(&virt_to_page(ptr)->lru, freelist);
> + else
> + iommu_free_page(ptr);
> +}
> +
> +static unsigned long *riscv_iommu_pte_alloc(struct riscv_iommu_domain *domain,
> + unsigned long iova, size_t pgsize,
> + gfp_t gfp)
> +{
> + unsigned long *ptr = domain->pgd_root;
> + unsigned long pte, old;
> + int level = domain->pgd_mode - RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39 + 2;
> + void *addr;
> +
> + do {
> + const int shift = PAGE_SHIFT + PT_SHIFT * level;
> +
> + ptr += ((iova >> shift) & (PTRS_PER_PTE - 1));
> + /*
> + * Note: returned entry might be a non-leaf if there was
> + * existing mapping with smaller granularity. Up to the caller
> + * to replace and invalidate.
> + */
> + if (((size_t)1 << shift) == pgsize)
> + return ptr;
> +pte_retry:
> + pte = READ_ONCE(*ptr);
> + /*
> + * This is very likely incorrect as we should not be adding
> + * new mapping with smaller granularity on top
> + * of existing 2M/1G mapping. Fail.
> + */
> + if (_io_pte_present(pte) && _io_pte_leaf(pte))
> + return NULL;
> + /*
> + * Non-leaf entry is missing, allocate and try to add to the
> + * page table. This might race with other mappings, retry.
> + */
> + if (_io_pte_none(pte)) {
> + addr = iommu_alloc_page_node(domain->numa_node, gfp);
> + if (!addr)
> + return NULL;
> + old = pte;
> + pte = _io_pte_entry(virt_to_pfn(addr), _PAGE_TABLE);
> + if (cmpxchg_relaxed(ptr, old, pte) != old) {
> + iommu_free_page(addr);
> + goto pte_retry;
> + }
> + }
> + ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
> + } while (level-- > 0);
> +
> + return NULL;
> +}
> +
> +static unsigned long *riscv_iommu_pte_fetch(struct riscv_iommu_domain *domain,
> + unsigned long iova, size_t *pte_pgsize)
> +{
> + unsigned long *ptr = domain->pgd_root;
> + unsigned long pte;
> + int level = domain->pgd_mode - RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39 + 2;
> +
> + do {
> + const int shift = PAGE_SHIFT + PT_SHIFT * level;
> +
> + ptr += ((iova >> shift) & (PTRS_PER_PTE - 1));
> + pte = READ_ONCE(*ptr);
> + if (_io_pte_present(pte) && _io_pte_leaf(pte)) {
> + *pte_pgsize = (size_t)1 << shift;
> + return ptr;
> + }
> + if (_io_pte_none(pte))
> + return NULL;
> + ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
> + } while (level-- > 0);
> +
> + return NULL;
> +}
> +
> +static int riscv_iommu_map_pages(struct iommu_domain *iommu_domain,
> + unsigned long iova, phys_addr_t phys,
> + size_t pgsize, size_t pgcount, int prot,
> + gfp_t gfp, size_t *mapped)
> +{
> + struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> + size_t size = 0;
> + size_t page_size = get_page_size(pgsize);
> + unsigned long *ptr;
> + unsigned long pte, old, pte_prot;
> + int rc = 0;
> + LIST_HEAD(freelist);
> +
> + if (!(prot & IOMMU_WRITE))
> + pte_prot = _PAGE_BASE | _PAGE_READ;
> + else if (domain->amo_enabled)
> + pte_prot = _PAGE_BASE | _PAGE_READ | _PAGE_WRITE;
> + else
> + pte_prot = _PAGE_BASE | _PAGE_READ | _PAGE_WRITE | _PAGE_DIRTY;
> +
> + while (pgcount) {
> + ptr = riscv_iommu_pte_alloc(domain, iova, page_size, gfp);
> + if (!ptr) {
> + rc = -ENOMEM;
> + break;
> + }
> +
> + old = READ_ONCE(*ptr);
> + pte = _io_pte_entry(phys_to_pfn(phys), pte_prot);
> + if (cmpxchg_relaxed(ptr, old, pte) != old)
> + continue;
> +
> + riscv_iommu_pte_free(domain, old, &freelist);
> +
> + size += page_size;
> + iova += page_size;
> + phys += page_size;
> + --pgcount;
> + }
> +
> + *mapped = size;
> +
> + if (!list_empty(&freelist)) {
> + /*
> + * In 1.0 spec version, the smallest scope we can use to
> + * invalidate all levels of page table (i.e. leaf and non-leaf)
> + * is an invalidate-all-PSCID IOTINVAL.VMA with AV=0.
> + * This will be updated with hardware support for
> + * capability.NL (non-leaf) IOTINVAL command.
> + */
> + riscv_iommu_iotlb_inval(domain, 0, ULONG_MAX);
> + iommu_put_pages_list(&freelist);
> + }
> +
> + return rc;
> +}
> +
> +static size_t riscv_iommu_unmap_pages(struct iommu_domain *iommu_domain,
> + unsigned long iova, size_t pgsize,
> + size_t pgcount,
> + struct iommu_iotlb_gather *gather)
> +{
> + struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> + size_t size = pgcount << __ffs(pgsize);
> + unsigned long *ptr, old;
> + size_t unmapped = 0;
> + size_t pte_size;
> +
> + while (unmapped < size) {
> + ptr = riscv_iommu_pte_fetch(domain, iova, &pte_size);
> + if (!ptr)
> + return unmapped;
> +
> + /* partial unmap is not allowed, fail. */
> + if (iova & (pte_size - 1))
> + return unmapped;
> +
> + old = READ_ONCE(*ptr);
> + if (cmpxchg_relaxed(ptr, old, 0) != old)
> + continue;
> +
> + iommu_iotlb_gather_add_page(&domain->domain, gather, iova,
> + pte_size);
> +
> + iova += pte_size;
> + unmapped += pte_size;
> + }
> +
> + return unmapped;
> +}
> +
> +static phys_addr_t riscv_iommu_iova_to_phys(struct iommu_domain *iommu_domain,
> + dma_addr_t iova)
> +{
> + struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> + unsigned long pte_size;
> + unsigned long *ptr;
> +
> + ptr = riscv_iommu_pte_fetch(domain, iova, &pte_size);
> + if (_io_pte_none(*ptr) || !_io_pte_present(*ptr))
> + return 0;
> +
> + return pfn_to_phys(__page_val_to_pfn(*ptr)) | (iova & (pte_size - 1));
> +}
> +
> +static void riscv_iommu_free_paging_domain(struct iommu_domain *iommu_domain)
> +{
> + struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> + const unsigned long pfn = virt_to_pfn(domain->pgd_root);
> +
> + WARN_ON(!list_empty(&domain->bonds));
> +
> + if ((int)domain->pscid > 0)
> + ida_free(&riscv_iommu_pscids, domain->pscid);
> +
> + riscv_iommu_pte_free(domain, _io_pte_entry(pfn, _PAGE_TABLE), NULL);
> + kfree(domain);
> +}
> +
> +static bool riscv_iommu_pt_supported(struct riscv_iommu_device *iommu, int pgd_mode)
> +{
> + switch (pgd_mode) {
> + case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39:
> + return iommu->caps & RISCV_IOMMU_CAP_S_SV39;
> +
> + case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48:
> + return iommu->caps & RISCV_IOMMU_CAP_S_SV48;
> +
> + case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57:
> + return iommu->caps & RISCV_IOMMU_CAP_S_SV57;
> + }
> + return false;
> +}
> +
> +static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
> + struct device *dev)
> +{
> + struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> + struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> + struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
> + u64 fsc, ta;
> +
> + if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
> + return -ENODEV;
> +
> + if (domain->numa_node == NUMA_NO_NODE)
> + domain->numa_node = dev_to_node(iommu->dev);
> +
> + if (riscv_iommu_bond_link(domain, dev))
> + return -ENOMEM;
> +
> + /*
> + * Invalidate PSCID.
> + * This invalidation might be redundant if IOATC has been already cleared,
> + * however we are not keeping track for domains not linked to a device.
> + */
> + riscv_iommu_iotlb_inval(domain, 0, ULONG_MAX);
> +
> + fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, domain->pgd_mode) |
> + FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, virt_to_pfn(domain->pgd_root));
> + ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid) |
> + RISCV_IOMMU_PC_TA_V;
> +
> + riscv_iommu_iodir_update(iommu, dev, fsc, ta);
> + riscv_iommu_bond_unlink(info->domain, dev);

Should we unlink the bond before linking it?

> + info->domain = domain;
> +
> + return 0;
> +}
> +
> +static const struct iommu_domain_ops riscv_iommu_paging_domain_ops = {
> + .attach_dev = riscv_iommu_attach_paging_domain,
> + .free = riscv_iommu_free_paging_domain,
> + .map_pages = riscv_iommu_map_pages,
> + .unmap_pages = riscv_iommu_unmap_pages,
> + .iova_to_phys = riscv_iommu_iova_to_phys,
> + .iotlb_sync = riscv_iommu_iotlb_sync,
> + .flush_iotlb_all = riscv_iommu_iotlb_flush_all,
> +};
> +
> +static struct iommu_domain *riscv_iommu_alloc_paging_domain(struct device *dev)
> +{
> + struct riscv_iommu_domain *domain;
> + struct riscv_iommu_device *iommu;
> + unsigned int pgd_mode;
> + int va_bits;
> +
> + iommu = dev ? dev_to_iommu(dev) : NULL;
> +
> + /*
> + * In unlikely case when dev or iommu is not known, use system
> + * SATP mode to configure paging domain radix tree depth.
> + * Use highest available if actual IOMMU hardware capabilities
> + * are known here.
> + */
> + if (!iommu) {
> + pgd_mode = satp_mode >> SATP_MODE_SHIFT;
> + va_bits = VA_BITS;
> + } else if (iommu->caps & RISCV_IOMMU_CAP_S_SV57) {
> + pgd_mode = RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57;
> + va_bits = 57;
> + } else if (iommu->caps & RISCV_IOMMU_CAP_S_SV48) {
> + pgd_mode = RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48;
> + va_bits = 48;
> + } else if (iommu->caps & RISCV_IOMMU_CAP_S_SV39) {
> + pgd_mode = RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39;
> + va_bits = 39;
> + } else {
> + dev_err(dev, "cannot find supported page table mode\n");
> + return ERR_PTR(-ENODEV);
> + }
> +
> + domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> + if (!domain)
> + return ERR_PTR(-ENOMEM);
> +
> + INIT_LIST_HEAD_RCU(&domain->bonds);
> + spin_lock_init(&domain->lock);
> + domain->numa_node = NUMA_NO_NODE;
> +
> + if (iommu) {
> + domain->numa_node = dev_to_node(iommu->dev);
> + domain->amo_enabled = !!(iommu->caps & RISCV_IOMMU_CAP_AMO_HWAD);
> + }
> +
> + domain->pgd_mode = pgd_mode;
> + domain->pgd_root = iommu_alloc_page_node(domain->numa_node,
> + GFP_KERNEL_ACCOUNT);
> + if (!domain->pgd_root) {
> + kfree(domain);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + domain->pscid = ida_alloc_range(&riscv_iommu_pscids, 1,
> + RISCV_IOMMU_MAX_PSCID, GFP_KERNEL);
> + if (domain->pscid < 0) {
> + iommu_free_page(domain->pgd_root);
> + kfree(domain);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + /*
> + * Note: RISC-V Privilege spec mandates that virtual addresses
> + * need to be sign-extended, so if (VA_BITS - 1) is set, all
> + * bits >= VA_BITS need to also be set or else we'll get a
> + * page fault. However the code that creates the mappings
> + * above us (e.g. iommu_dma_alloc_iova()) won't do that for us
> + * for now, so we'll end up with invalid virtual addresses
> + * to map. As a workaround until we get this sorted out
> + * limit the available virtual addresses to VA_BITS - 1.
> + */
> + domain->domain.geometry.aperture_start = 0;
> + domain->domain.geometry.aperture_end = DMA_BIT_MASK(va_bits - 1);
> + domain->domain.geometry.force_aperture = true;
> +
> + domain->domain.ops = &riscv_iommu_paging_domain_ops;
> +
> + return &domain->domain;
> +}
> +
> static int riscv_iommu_attach_blocking_domain(struct iommu_domain *iommu_domain,
> struct device *dev)
> {
> struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> + struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
>
> riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, 0);
> + riscv_iommu_bond_unlink(info->domain, dev);
> + info->domain = NULL;
>
> return 0;
> }
> @@ -827,8 +1390,11 @@ static int riscv_iommu_attach_identity_domain(struct iommu_domain *iommu_domain,
> struct device *dev)
> {
> struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> + struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
>
> riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, RISCV_IOMMU_PC_TA_V);
> + riscv_iommu_bond_unlink(info->domain, dev);
> + info->domain = NULL;
>
> return 0;
> }
> @@ -842,7 +1408,7 @@ static struct iommu_domain riscv_iommu_identity_domain = {
>
> static int riscv_iommu_device_domain_type(struct device *dev)
> {
> - return IOMMU_DOMAIN_IDENTITY;
> + return 0;
> }
>
> static struct iommu_group *riscv_iommu_device_group(struct device *dev)
> @@ -861,6 +1427,7 @@ static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
> {
> struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> struct riscv_iommu_device *iommu;
> + struct riscv_iommu_info *info;
> struct riscv_iommu_dc *dc;
> u64 tc;
> int i;
> @@ -895,17 +1462,33 @@ static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
> WRITE_ONCE(dc->tc, tc);
> }
>
> + info = kzalloc(sizeof(*info), GFP_KERNEL);
> + if (!info)
> + return ERR_PTR(-ENOMEM);
> + dev_iommu_priv_set(dev, info);
> +
> return &iommu->iommu;
> }
>
> +static void riscv_iommu_release_device(struct device *dev)
> +{
> + struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
> +
> + synchronize_rcu();
> + kfree(info);
> +}
> +
> static const struct iommu_ops riscv_iommu_ops = {
> + .pgsize_bitmap = SZ_4K,
> .of_xlate = riscv_iommu_of_xlate,
> .identity_domain = &riscv_iommu_identity_domain,
> .blocked_domain = &riscv_iommu_blocking_domain,
> .release_domain = &riscv_iommu_blocking_domain,
> + .domain_alloc_paging = riscv_iommu_alloc_paging_domain,
> .def_domain_type = riscv_iommu_device_domain_type,
> .device_group = riscv_iommu_device_group,
> .probe_device = riscv_iommu_probe_device,
> + .release_device = riscv_iommu_release_device,
> };
>
> static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
> --
> 2.34.1
>
>

2024-05-08 16:05:28

by Tomasz Jeznach

[permalink] [raw]
Subject: Re: [PATCH v4 6/7] iommu/riscv: Command and fault queue support

On Wed, May 8, 2024 at 8:39 AM Zong Li <[email protected]> wrote:
>
> On Sat, May 4, 2024 at 12:13 AM Tomasz Jeznach <tjeznach@rivosinc.com> wrote:
> >
> > Introduce device command submission and fault reporting queues,
> > as described in Chapter 3.1 and 3.2 of the RISC-V IOMMU Architecture
> > Specification.
> >
> > Command and fault queues are instantiated in contiguous system memory
> > local to IOMMU device domain, or mapped from fixed I/O space provided
> > by the hardware implementation. Detection of the location and maximum
> > allowed size of the queue utilizes WARL properties of the queue base
> > control register. The driver implementation will try to allocate up to
> > 128KB of system memory, while respecting the hardware-supported maximum
> > queue size.
> >
> > Interrupt allocation is based on interrupt vector availability, and
> > vectors are distributed to all queues in simple round-robin fashion.
> > For hardware implementations with a fixed event type to interrupt
> > vector assignment, the IVEC WARL property is used to discover such
> > mappings.
> >
> > Address translation, command and queue fault handling in this change
> > is limited to simple fault reporting without taking any action.
> >
> > Reviewed-by: Lu Baolu <[email protected]>
> > Signed-off-by: Tomasz Jeznach <[email protected]>
> > ---
> > drivers/iommu/riscv/iommu-bits.h | 75 +++++
> > drivers/iommu/riscv/iommu.c | 496 ++++++++++++++++++++++++++++++-
> > drivers/iommu/riscv/iommu.h | 21 ++
> > 3 files changed, 590 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/riscv/iommu-bits.h b/drivers/iommu/riscv/iommu-bits.h
> > index ba093c29de9f..40c379222821 100644
> > --- a/drivers/iommu/riscv/iommu-bits.h
> > +++ b/drivers/iommu/riscv/iommu-bits.h
> > @@ -704,4 +704,79 @@ struct riscv_iommu_msi_pte {
> > #define RISCV_IOMMU_MSI_MRIF_NPPN RISCV_IOMMU_PPN_FIELD
> > #define RISCV_IOMMU_MSI_MRIF_NID_MSB BIT_ULL(60)
> >
> > +/* Helper functions: command structure builders. */
> > +
> > +static inline void riscv_iommu_cmd_inval_vma(struct riscv_iommu_command *cmd)
> > +{
> > + cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOTINVAL_OPCODE) |
> > + FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA);
> > + cmd->dword1 = 0;
> > +}
> > +
> > +static inline void riscv_iommu_cmd_inval_set_addr(struct riscv_iommu_command *cmd,
> > + u64 addr)
> > +{
> > + cmd->dword1 = FIELD_PREP(RISCV_IOMMU_CMD_IOTINVAL_ADDR, phys_to_pfn(addr));
> > + cmd->dword0 |= RISCV_IOMMU_CMD_IOTINVAL_AV;
> > +}
> > +
> > +static inline void riscv_iommu_cmd_inval_set_pscid(struct riscv_iommu_command *cmd,
> > + int pscid)
> > +{
> > + cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IOTINVAL_PSCID, pscid) |
> > + RISCV_IOMMU_CMD_IOTINVAL_PSCV;
> > +}
> > +
> > +static inline void riscv_iommu_cmd_inval_set_gscid(struct riscv_iommu_command *cmd,
> > + int gscid)
> > +{
> > + cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IOTINVAL_GSCID, gscid) |
> > + RISCV_IOMMU_CMD_IOTINVAL_GV;
> > +}
> > +
> > +static inline void riscv_iommu_cmd_iofence(struct riscv_iommu_command *cmd)
> > +{
> > + cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOFENCE_OPCODE) |
> > + FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOFENCE_FUNC_C) |
> > + RISCV_IOMMU_CMD_IOFENCE_PR | RISCV_IOMMU_CMD_IOFENCE_PW;
> > + cmd->dword1 = 0;
> > +}
> > +
> > +static inline void riscv_iommu_cmd_iofence_set_av(struct riscv_iommu_command *cmd,
> > + u64 addr, u32 data)
> > +{
> > + cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOFENCE_OPCODE) |
> > + FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOFENCE_FUNC_C) |
> > + FIELD_PREP(RISCV_IOMMU_CMD_IOFENCE_DATA, data) |
> > + RISCV_IOMMU_CMD_IOFENCE_AV;
> > + cmd->dword1 = addr >> 2;
> > +}
> > +
> > +static inline void riscv_iommu_cmd_iodir_inval_ddt(struct riscv_iommu_command *cmd)
> > +{
> > + cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IODIR_OPCODE) |
> > + FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT);
> > + cmd->dword1 = 0;
> > +}
> > +
> > +static inline void riscv_iommu_cmd_iodir_inval_pdt(struct riscv_iommu_command *cmd)
> > +{
> > + cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IODIR_OPCODE) |
> > + FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT);
> > + cmd->dword1 = 0;
> > +}
> > +
> > +static inline void riscv_iommu_cmd_iodir_set_did(struct riscv_iommu_command *cmd,
> > + unsigned int devid)
> > +{
> > + cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IODIR_DID, devid) |
> > + RISCV_IOMMU_CMD_IODIR_DV;
> > +}
> > +
> > +static inline void riscv_iommu_cmd_iodir_set_pid(struct riscv_iommu_command *cmd,
> > + unsigned int pasid)
> > +{
> > + cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IODIR_PID, pasid);
> > +}
> > +
> > #endif /* _RISCV_IOMMU_BITS_H_ */
> > diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> > index 71b7903d83d4..4349ac8a3990 100644
> > --- a/drivers/iommu/riscv/iommu.c
> > +++ b/drivers/iommu/riscv/iommu.c
> > @@ -25,7 +25,14 @@
> > #include "iommu.h"
> >
> > /* Timeouts in [us] */
> > -#define RISCV_IOMMU_DDTP_TIMEOUT 50000
> > +#define RISCV_IOMMU_QCSR_TIMEOUT 150000
> > +#define RISCV_IOMMU_QUEUE_TIMEOUT 150000
> > +#define RISCV_IOMMU_DDTP_TIMEOUT 10000000
> > +#define RISCV_IOMMU_IOTINVAL_TIMEOUT 90000000
> > +
> > +/* Number of entries per CMD/FLT queue, should be <= INT_MAX */
> > +#define RISCV_IOMMU_DEF_CQ_COUNT 8192
> > +#define RISCV_IOMMU_DEF_FQ_COUNT 4096
> >
> > /* RISC-V IOMMU PPN <> PHYS address conversions, PHYS <=> PPN[53:10] */
> > #define phys_to_ppn(va) (((va) >> 2) & (((1ULL << 44) - 1) << 10))
> > @@ -89,6 +96,432 @@ static void riscv_iommu_free_pages(struct riscv_iommu_device *iommu, void *addr)
> > riscv_iommu_devres_pages_match, &devres);
> > }
> >
> > +/*
> > + * Hardware queue allocation and management.
> > + */
> > +
> > +/* Setup queue base, control registers and default queue length */
> > +#define RISCV_IOMMU_QUEUE_INIT(q, name) do { \
> > + struct riscv_iommu_queue *_q = q; \
> > + _q->qid = RISCV_IOMMU_INTR_ ## name; \
> > + _q->qbr = RISCV_IOMMU_REG_ ## name ## B; \
> > + _q->qcr = RISCV_IOMMU_REG_ ## name ## CSR; \
> > + _q->mask = _q->mask ?: (RISCV_IOMMU_DEF_ ## name ## _COUNT) - 1;\
> > +} while (0)
> > +
> > +/* Note: offsets are the same for all queues */
> > +#define Q_HEAD(q) ((q)->qbr + (RISCV_IOMMU_REG_CQH - RISCV_IOMMU_REG_CQB))
> > +#define Q_TAIL(q) ((q)->qbr + (RISCV_IOMMU_REG_CQT - RISCV_IOMMU_REG_CQB))
> > +#define Q_ITEM(q, index) ((q)->mask & (index))
> > +#define Q_IPSR(q) BIT((q)->qid)
> > +
> > +/*
> > + * Discover queue ring buffer hardware configuration, allocate in-memory
> > + * ring buffer or use fixed I/O memory location, configure queue base register.
> > + * Must be called before hardware queue is enabled.
> > + *
> > + * @queue - data structure, configured with RISCV_IOMMU_QUEUE_INIT()
> > + * @entry_size - queue single element size in bytes.
> > + */
> > +static int riscv_iommu_queue_alloc(struct riscv_iommu_device *iommu,
> > + struct riscv_iommu_queue *queue,
> > + size_t entry_size)
> > +{
> > + unsigned int logsz;
> > + u64 qb, rb;
> > +
> > + /*
> > + * Use WARL base register property to discover maximum allowed
> > + * number of entries and optional fixed IO address for queue location.
> > + */
> > + riscv_iommu_writeq(iommu, queue->qbr, RISCV_IOMMU_QUEUE_LOGSZ_FIELD);
> > + qb = riscv_iommu_readq(iommu, queue->qbr);
> > +
> > + /*
> > + * Calculate and verify hardware supported queue length, as reported
> > + * by the field LOGSZ, where max queue length is equal to 2^(LOGSZ + 1).
> > + * Update queue size based on hardware supported value.
> > + */
> > + logsz = ilog2(queue->mask);
> > + if (logsz > FIELD_GET(RISCV_IOMMU_QUEUE_LOGSZ_FIELD, qb))
> > + logsz = FIELD_GET(RISCV_IOMMU_QUEUE_LOGSZ_FIELD, qb);
> > +
> > + /*
> > + * Use WARL base register property to discover an optional fixed IO
> > + * address for queue ring buffer location. Otherwise allocate contiguous
> > + * system memory.
> > + */
> > + if (FIELD_GET(RISCV_IOMMU_PPN_FIELD, qb)) {
> > + const size_t queue_size = entry_size << (logsz + 1);
> > +
> > + queue->phys = ppn_to_phys(FIELD_GET(RISCV_IOMMU_PPN_FIELD, qb));
> > + queue->base = devm_ioremap(iommu->dev, queue->phys, queue_size);
> > + } else {
> > + do {
> > + const size_t queue_size = entry_size << (logsz + 1);
> > + const int order = get_order(queue_size);
> > +
> > + queue->base = riscv_iommu_get_pages(iommu, order);
> > + queue->phys = __pa(queue->base);
> > + } while (!queue->base && logsz-- > 0);
> > + }
> > +
> > + if (!queue->base)
> > + return -ENOMEM;
> > +
> > + qb = phys_to_ppn(queue->phys) |
> > + FIELD_PREP(RISCV_IOMMU_QUEUE_LOGSZ_FIELD, logsz);
> > +
> > + /* Update base register and read back to verify hw accepted our write */
> > + riscv_iommu_writeq(iommu, queue->qbr, qb);
> > + rb = riscv_iommu_readq(iommu, queue->qbr);
> > + if (rb != qb) {
> > + dev_err(iommu->dev, "queue #%u allocation failed\n", queue->qid);
> > + return -ENODEV;
> > + }
> > +
> > + /* Update actual queue mask */
> > + queue->mask = (2U << logsz) - 1;
> > +
> > + dev_dbg(iommu->dev, "queue #%u allocated 2^%u entries",
> > + queue->qid, logsz + 1);
> > +
> > + return 0;
> > +}
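
For readers following the sizing logic in riscv_iommu_queue_alloc(), here is a minimal userspace sketch of the LOGSZ clamping, not the driver code itself. All sketch_* names are hypothetical, and the 16-byte entry size used in the usage note is an assumption (two 64-bit dwords per command); the only parts taken from the patch are the rule that the queue holds 2^(LOGSZ + 1) entries and the default of 8192 command queue entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* floor(log2(v)) for v > 0; stands in for the kernel's ilog2(). */
static unsigned int sketch_ilog2(uint32_t v)
{
	unsigned int n = 0;

	assert(v > 0);
	while (v >>= 1)
		n++;
	return n;
}

/*
 * Pick the LOGSZ value to program: start from the requested entry count
 * (a power of two, at least 2) and clamp it to what the hardware
 * reported after the WARL probe write. The queue holds 2^(LOGSZ + 1)
 * entries.
 */
static unsigned int sketch_pick_logsz(uint32_t entries, unsigned int hw_max_logsz)
{
	unsigned int logsz;

	assert(entries >= 2 && (entries & (entries - 1)) == 0);

	/* Same as ilog2(queue->mask) with mask == entries - 1. */
	logsz = sketch_ilog2(entries - 1);
	if (logsz > hw_max_logsz)
		logsz = hw_max_logsz;
	return logsz;
}

/* Ring buffer size in bytes for a given entry size and LOGSZ. */
static size_t sketch_queue_bytes(size_t entry_size, unsigned int logsz)
{
	return entry_size << (logsz + 1);
}
```

With the default of 8192 entries and the assumed 16-byte entry, sketch_pick_logsz(8192, 15) gives 12 and sketch_queue_bytes(16, 12) gives 131072 bytes, which matches the 128KB figure in the commit message. If the hardware only accepts LOGSZ up to 7, the same request shrinks to 256 entries.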
> > +
> > +/* Check interrupt queue status, IPSR */
> > +static irqreturn_t riscv_iommu_queue_ipsr(int irq, void *data)
> > +{
> > + struct riscv_iommu_queue *queue = (struct riscv_iommu_queue *)data;
> > +
> > + if (riscv_iommu_readl(queue->iommu, RISCV_IOMMU_REG_IPSR) & Q_IPSR(queue))
> > + return IRQ_WAKE_THREAD;
> > +
> > + return IRQ_NONE;
> > +}
> > +
> > +static int riscv_iommu_queue_vec(struct riscv_iommu_device *iommu, int n)
> > +{
> > + /* Reuse IVEC.CIV mask for all interrupt vectors mapping. */
> > + return (iommu->ivec >> (n * 4)) & RISCV_IOMMU_IVEC_CIV;
> > +}
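
As an illustration of the 4-bit vector packing that riscv_iommu_queue_vec() decodes, here is a small standalone sketch. It assumes four interrupt sources, each with a 4-bit field at bit offset n * 4, and it assigns source n the vector n % irqs_count, which is a simplification of the round-robin assignment in the patch rather than a copy of it. All sketch_* names and SKETCH_* constants are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_VEC_MASK    0xfULL /* one 4-bit vector field */
#define SKETCH_NUM_SOURCES 4      /* assumed number of interrupt sources */

/* Spread the available vectors over all sources in round-robin order. */
static uint64_t sketch_ivec_build(unsigned int irqs_count)
{
	uint64_t ivec = 0;
	unsigned int n;

	assert(irqs_count > 0 && irqs_count <= 16);
	for (n = 0; n < SKETCH_NUM_SOURCES; n++)
		ivec |= (uint64_t)(n % irqs_count) << (n * 4);
	return ivec;
}

/* Extract the vector assigned to source n (n is a source index, not a mask). */
static unsigned int sketch_ivec_get(uint64_t ivec, unsigned int n)
{
	assert(n < SKETCH_NUM_SOURCES);
	return (unsigned int)((ivec >> (n * 4)) & SKETCH_VEC_MASK);
}
```

With a single vector every source shares vector 0; with two vectors the sources alternate between 0 and 1. The second argument of the getter is a small source index used as a shift count, so passing a field mask there would shift by the wrong amount.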
> > +
> > +/*
> > + * Enable queue processing in the hardware, register interrupt handler.
> > + *
> > + * @queue - data structure, already allocated with riscv_iommu_queue_alloc()
> > + * @irq_handler - threaded interrupt handler.
> > + */
> > +static int riscv_iommu_queue_enable(struct riscv_iommu_device *iommu,
> > + struct riscv_iommu_queue *queue,
> > + irq_handler_t irq_handler)
> > +{
> > + const unsigned int irq = iommu->irqs[riscv_iommu_queue_vec(iommu, queue->qid)];
> > + u32 csr;
> > + int rc;
> > +
> > + if (queue->iommu)
> > + return -EBUSY;
> > +
> > + /* Polling not implemented */
> > + if (!irq)
> > + return -ENODEV;
> > +
> > + queue->iommu = iommu;
> > + rc = request_threaded_irq(irq, riscv_iommu_queue_ipsr, irq_handler,
> > + IRQF_ONESHOT | IRQF_SHARED,
> > + dev_name(iommu->dev), queue);
> > + if (rc) {
> > + queue->iommu = NULL;
> > + return rc;
> > + }
> > +
> > + /*
> > + * Enable queue with interrupts, clear any memory fault if any.
> > + * Wait for the hardware to acknowledge request and activate queue
> > + * processing.
> > + * Note: All CSR bitfields are in the same offsets for all queues.
> > + */
> > + riscv_iommu_writel(iommu, queue->qcr,
> > + RISCV_IOMMU_QUEUE_ENABLE |
> > + RISCV_IOMMU_QUEUE_INTR_ENABLE |
> > + RISCV_IOMMU_QUEUE_MEM_FAULT);
> > +
> > + riscv_iommu_readl_timeout(iommu, queue->qcr,
> > + csr, !(csr & RISCV_IOMMU_QUEUE_BUSY),
> > + 10, RISCV_IOMMU_QCSR_TIMEOUT);
> > +
> > + if (RISCV_IOMMU_QUEUE_ACTIVE != (csr & (RISCV_IOMMU_QUEUE_ACTIVE |
> > + RISCV_IOMMU_QUEUE_BUSY |
> > + RISCV_IOMMU_QUEUE_MEM_FAULT))) {
> > + /* Best effort to stop and disable failing hardware queue. */
> > + riscv_iommu_writel(iommu, queue->qcr, 0);
> > + free_irq(irq, queue);
> > + queue->iommu = NULL;
> > + dev_err(iommu->dev, "queue #%u failed to start\n", queue->qid);
> > + return -EBUSY;
> > + }
> > +
> > + /* Clear any pending interrupt flag. */
> > + riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, Q_IPSR(queue));
> > +
> > + return 0;
> > +}
> > +
> > +/*
> > + * Disable queue. Wait for the hardware to acknowledge request and
> > + * stop processing enqueued requests. Report errors but continue.
> > + */
> > +static void riscv_iommu_queue_disable(struct riscv_iommu_queue *queue)
> > +{
> > + struct riscv_iommu_device *iommu = queue->iommu;
> > + u32 csr;
> > +
> > + if (!iommu)
> > + return;
> > +
> > + free_irq(iommu->irqs[riscv_iommu_queue_vec(iommu, queue->qid)], queue);
> > + riscv_iommu_writel(iommu, queue->qcr, 0);
> > + riscv_iommu_readl_timeout(iommu, queue->qcr,
> > + csr, !(csr & RISCV_IOMMU_QUEUE_BUSY),
> > + 10, RISCV_IOMMU_QCSR_TIMEOUT);
> > +
> > + if (csr & (RISCV_IOMMU_QUEUE_ACTIVE | RISCV_IOMMU_QUEUE_BUSY))
> > + dev_err(iommu->dev, "fail to disable hardware queue #%u, csr 0x%x\n",
> > + queue->qid, csr);
> > +
> > + queue->iommu = NULL;
> > +}
> > +
> > +/*
> > + * Returns number of available valid queue entries and the first item index.
> > + * Update shadow producer index if necessary.
> > + */
> > +static int riscv_iommu_queue_consume(struct riscv_iommu_queue *queue,
> > + unsigned int *index)
> > +{
> > + unsigned int head = atomic_read(&queue->head);
> > + unsigned int tail = atomic_read(&queue->tail);
> > + unsigned int last = Q_ITEM(queue, tail);
> > + int available = (int)(tail - head);
> > +
> > + *index = head;
> > +
> > + if (available > 0)
> > + return available;
> > +
> > + /* read hardware producer index, check reserved register bits are not set. */
> > + if (riscv_iommu_readl_timeout(queue->iommu, Q_TAIL(queue),
> > + tail, (tail & ~queue->mask) == 0,
> > + 0, RISCV_IOMMU_QUEUE_TIMEOUT)) {
> > + dev_err_once(queue->iommu->dev,
> > + "Hardware error: queue access timeout\n");
> > + return 0;
> > + }
> > +
> > + if (tail == last)
> > + return 0;
> > +
> > + /* update shadow producer index */
> > + return (int)(atomic_add_return((tail - last) & queue->mask, &queue->tail) - head);
> > +}
> > +
> > +/*
> > + * Release processed queue entries, should match riscv_iommu_queue_consume() calls.
> > + */
> > +static void riscv_iommu_queue_release(struct riscv_iommu_queue *queue, int count)
> > +{
> > + const unsigned int head = atomic_add_return(count, &queue->head);
> > +
> > + riscv_iommu_writel(queue->iommu, Q_HEAD(queue), Q_ITEM(queue, head));
> > +}
> > +
> > +/* Return actual consumer index based on hardware reported queue head index. */
> > +static unsigned int riscv_iommu_queue_cons(struct riscv_iommu_queue *queue)
> > +{
> > + const unsigned int cons = atomic_read(&queue->head);
> > + const unsigned int last = Q_ITEM(queue, cons);
> > + unsigned int head;
> > +
> > + if (riscv_iommu_readl_timeout(queue->iommu, Q_HEAD(queue), head,
> > + !(head & ~queue->mask),
> > + 0, RISCV_IOMMU_QUEUE_TIMEOUT))
> > + return cons;
> > +
> > + return cons + ((head - last) & queue->mask);
> > +}
> > +
> > +/* Wait for submitted item to be processed. */
> > +static int riscv_iommu_queue_wait(struct riscv_iommu_queue *queue,
> > + unsigned int index,
> > + unsigned int timeout_us)
> > +{
> > + unsigned int cons = atomic_read(&queue->head);
> > +
> > + /* Already processed by the consumer */
> > + if ((int)(cons - index) > 0)
> > + return 0;
> > +
> > + /* Monitor consumer index */
> > + return readx_poll_timeout(riscv_iommu_queue_cons, queue, cons,
> > + (int)(cons - index) > 0, 0, timeout_us);
> > +}
>
> Apart from the iofence command, it seems to me that we might not be able
> to use the head pointer to determine whether a command has been executed,
> because the spec mentions that the IOMMU advancing cqh is not a guarantee
> that the commands fetched by the IOMMU have been executed or
> committed. For iofence completion, perhaps using ADDR and AV to write
> memory would be better as well.
>

I'll update the function comment to state that it may *only* be used to
track IOFENCE.C completion. riscv_iommu_queue_wait() is called only for
that purpose (unless the specification is extended to allow tracking
other commands through head/tail updates).

A previous version of the command tracking used the address/AV notifier,
but after several iterations and testing we settled on head/tail
tracking as the best approach.
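
To make the head/tail tracking concrete, here is a minimal userspace sketch of the wraparound-safe arithmetic that riscv_iommu_queue_cons() and riscv_iommu_queue_wait() rely on. The struct and the sketch_* names are hypothetical stand-ins, not the driver's types; the two expressions, cons + ((head - last) & mask) and (int)(cons - index) > 0, are the ones from the patch. The signed cast assumes the usual two's complement conversion that the kernel's compilers provide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's queue state. */
struct sketch_queue {
	uint32_t mask; /* number of entries - 1, entries is a power of two */
	uint32_t head; /* free-running shadow consumer index */
};

/*
 * Fold the hardware head register (always in 0..mask) into the
 * free-running shadow index. The masked difference is the number of
 * entries the hardware consumed since the shadow index was last updated.
 */
static uint32_t sketch_queue_cons(const struct sketch_queue *q, uint32_t hw_head)
{
	uint32_t last = q->head & q->mask;

	assert((q->mask & (q->mask + 1)) == 0);
	assert((hw_head & ~q->mask) == 0);
	return q->head + ((hw_head - last) & q->mask);
}

/*
 * True once the consumer index has moved past the entry submitted at
 * 'index'. The signed difference stays correct across 32-bit wraparound
 * as long as the two indices are less than 2^31 apart.
 */
static bool sketch_entry_consumed(uint32_t cons, uint32_t index)
{
	return (int32_t)(cons - index) > 0;
}
```

For example, with an 8-entry queue whose shadow head is 0xfffffffe, a hardware head of 1 means three entries were consumed, so the shadow index wraps to 1 and an IOFENCE.C submitted at index 0xffffffff is reported as consumed. As discussed above, this only says the entry was fetched, which is why the check is meaningful for IOFENCE.C alone.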

> > +
> > +/* Enqueue an entry and wait to be processed if timeout_us > 0
> > + *
> > + * Error handling for IOMMU hardware not responding in reasonable time
> > + * will be added as separate patch series along with other RAS features.
> > + * For now, only report hardware failure and continue.
> > + */
> > +static void riscv_iommu_queue_send(struct riscv_iommu_queue *queue,
> > + void *entry, size_t entry_size,
> > + unsigned int timeout_us)
> > +{
> > + unsigned int prod;
> > + unsigned int head;
> > + unsigned int tail;
> > + unsigned long flags;
> > +
> > + /* Do not preempt submission flow. */
> > + local_irq_save(flags);
> > +
> > + /* 1. Allocate some space in the queue */
> > + prod = atomic_inc_return(&queue->prod) - 1;
> > + head = atomic_read(&queue->head);
> > +
> > + /* 2. Wait for space availability. */
> > + if ((prod - head) > queue->mask) {
> > + if (readx_poll_timeout(atomic_read, &queue->head,
> > + head, (prod - head) < queue->mask,
> > + 0, RISCV_IOMMU_QUEUE_TIMEOUT))
> > + goto err_busy;
> > + } else if ((prod - head) == queue->mask) {
> > + const unsigned int last = Q_ITEM(queue, head);
> > +
> > + if (riscv_iommu_readl_timeout(queue->iommu, Q_HEAD(queue), head,
> > + !(head & ~queue->mask) && head != last,
> > + 0, RISCV_IOMMU_QUEUE_TIMEOUT))
> > + goto err_busy;
> > + atomic_add((head - last) & queue->mask, &queue->head);
> > + }
> > +
> > + /* 3. Store entry in the ring buffer. */
> > + memcpy(queue->base + Q_ITEM(queue, prod) * entry_size, entry, entry_size);
> > +
> > + /* 4. Wait for all previous entries to be ready */
> > + if (readx_poll_timeout(atomic_read, &queue->tail, tail, prod == tail,
> > + 0, RISCV_IOMMU_QUEUE_TIMEOUT))
> > + goto err_busy;
> > +
> > + /* 5. Complete submission and restore local interrupts */
> > + dma_wmb();
> > + riscv_iommu_writel(queue->iommu, Q_TAIL(queue), Q_ITEM(queue, prod + 1));
> > + atomic_inc(&queue->tail);
> > + local_irq_restore(flags);
> > +
> > + if (timeout_us && riscv_iommu_queue_wait(queue, prod, timeout_us))
> > + dev_err_once(queue->iommu->dev,
> > + "Hardware error: command execution timeout\n");
> > +
> > + return;
> > +
> > +err_busy:
> > + local_irq_restore(flags);
> > + dev_err_once(queue->iommu->dev, "Hardware error: command enqueue failed\n");
> > +}
> > +
> > +/*
> > + * IOMMU Command queue chapter 3.1
> > + */
> > +
> > +/* Command queue interrupt handler thread function */
> > +static irqreturn_t riscv_iommu_cmdq_process(int irq, void *data)
> > +{
> > + const struct riscv_iommu_queue *queue = (struct riscv_iommu_queue *)data;
> > + unsigned int ctrl;
> > +
> > + /* Clear MF/CQ errors, complete error recovery to be implemented. */
> > + ctrl = riscv_iommu_readl(queue->iommu, queue->qcr);
> > + if (ctrl & (RISCV_IOMMU_CQCSR_CQMF | RISCV_IOMMU_CQCSR_CMD_TO |
> > + RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_FENCE_W_IP)) {
> > + riscv_iommu_writel(queue->iommu, queue->qcr, ctrl);
>
> For RISCV_IOMMU_CQCSR_CMD_ILL and RISCV_IOMMU_CQCSR_CMD_TO, I think we
> need to adjust the head/tail pointer, otherwise, IOMMU might keep
> trying to execute the problematic command.
>
> > + dev_warn(queue->iommu->dev,
> > + "Queue #%u error; fault:%d timeout:%d illegal:%d fence_w_ip:%d\n",
>
> It might be a bit weird if we view fence_w_ip as queue error and print
> the message as error?

The driver does not set the WSI bit in the IOFENCE.C command structure,
so fence_w_ip is not expected to be set. This message only warns about
an unexpected command queue interrupt and lists the state of the
possible triggers.

>
> > + queue->qid,
> > + !!(ctrl & RISCV_IOMMU_CQCSR_CQMF),
> > + !!(ctrl & RISCV_IOMMU_CQCSR_CMD_TO),
> > + !!(ctrl & RISCV_IOMMU_CQCSR_CMD_ILL),
> > + !!(ctrl & RISCV_IOMMU_CQCSR_FENCE_W_IP));
> > + }
> > +
> > + /* Placeholder for command queue interrupt notifiers */
> > +
> > + /* Clear command interrupt pending. */
> > + riscv_iommu_writel(queue->iommu, RISCV_IOMMU_REG_IPSR, Q_IPSR(queue));
> > +
> > + return IRQ_HANDLED;
> > +}
> > +
> > +/* Send command to the IOMMU command queue */
> > +static void riscv_iommu_cmd_send(struct riscv_iommu_device *iommu,
> > + struct riscv_iommu_command *cmd,
> > + unsigned int timeout_us)
> > +{
> > + riscv_iommu_queue_send(&iommu->cmdq, cmd, sizeof(*cmd), timeout_us);
> > +}
> > +
> > +/*
> > + * IOMMU Fault/Event queue chapter 3.2
> > + */
> > +
> > +static void riscv_iommu_fault(struct riscv_iommu_device *iommu,
> > + struct riscv_iommu_fq_record *event)
> > +{
> > + unsigned int err = FIELD_GET(RISCV_IOMMU_FQ_HDR_CAUSE, event->hdr);
> > + unsigned int devid = FIELD_GET(RISCV_IOMMU_FQ_HDR_DID, event->hdr);
> > +
> > + /* Placeholder for future fault handling implementation, report only. */
> > + if (err)
> > + dev_warn_ratelimited(iommu->dev,
> > + "Fault %d devid: 0x%x iotval: %llx iotval2: %llx\n",
> > + err, devid, event->iotval, event->iotval2);
> > +}
> > +
> > +/* Fault queue interrupt handler thread function */
> > +static irqreturn_t riscv_iommu_fltq_process(int irq, void *data)
> > +{
> > + struct riscv_iommu_queue *queue = (struct riscv_iommu_queue *)data;
> > + struct riscv_iommu_device *iommu = queue->iommu;
> > + struct riscv_iommu_fq_record *events;
> > + unsigned int ctrl, idx;
> > + int cnt, len;
> > +
> > + events = (struct riscv_iommu_fq_record *)queue->base;
> > +
> > + /* Clear fault interrupt pending and process all received fault events. */
> > + riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, Q_IPSR(queue));
> > +
> > + do {
> > + cnt = riscv_iommu_queue_consume(queue, &idx);
> > + for (len = 0; len < cnt; idx++, len++)
> > + riscv_iommu_fault(iommu, &events[Q_ITEM(queue, idx)]);
> > + riscv_iommu_queue_release(queue, cnt);
> > + } while (cnt > 0);
> > +
> > + /* Clear MF/OF errors, complete error recovery to be implemented. */
> > + ctrl = riscv_iommu_readl(iommu, queue->qcr);
> > + if (ctrl & (RISCV_IOMMU_FQCSR_FQMF | RISCV_IOMMU_FQCSR_FQOF)) {
> > + riscv_iommu_writel(iommu, queue->qcr, ctrl);
> > + dev_warn(iommu->dev,
> > + "Queue #%u error; memory fault:%d overflow:%d\n",
> > + queue->qid,
> > + !!(ctrl & RISCV_IOMMU_FQCSR_FQMF),
> > + !!(ctrl & RISCV_IOMMU_FQCSR_FQOF));
> > + }
> > +
> > + return IRQ_HANDLED;
> > +}
> > +
> > /* Lookup and initialize device context info structure. */
> > static struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iommu,
> > unsigned int devid)
> > @@ -250,6 +683,7 @@ static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
> > struct device *dev = iommu->dev;
> > u64 ddtp, rq_ddtp;
> > unsigned int mode, rq_mode = ddtp_mode;
> > + struct riscv_iommu_command cmd;
> >
> > ddtp = riscv_iommu_read_ddtp(iommu);
> > if (ddtp & RISCV_IOMMU_DDTP_BUSY)
> > @@ -317,6 +751,18 @@ static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
> > if (mode != ddtp_mode)
> > dev_dbg(dev, "DDTP hw mode %u, requested %u\n", mode, ddtp_mode);
> >
> > + /* Invalidate device context cache */
> > + riscv_iommu_cmd_iodir_inval_ddt(&cmd);
> > + riscv_iommu_cmd_send(iommu, &cmd, 0);
> > +
> > + /* Invalidate address translation cache */
> > + riscv_iommu_cmd_inval_vma(&cmd);
> > + riscv_iommu_cmd_send(iommu, &cmd, 0);
> > +
> > + /* IOFENCE.C */
> > + riscv_iommu_cmd_iofence(&cmd);
> > + riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_IOTINVAL_TIMEOUT);
> > +
> > return 0;
> > }
> >
> > @@ -492,6 +938,26 @@ static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
> > return -EINVAL;
> > }
> >
> > + /*
> > + * Distribute interrupt vectors, always use first vector for CIV.
> > + * At least one interrupt is required. Read back and verify.
> > + */
> > + if (!iommu->irqs_count)
> > + return -EINVAL;
> > +
> > + iommu->ivec = 0;
> > + iommu->ivec |= FIELD_PREP(RISCV_IOMMU_IVEC_FIV, 1 % iommu->irqs_count);
> > + iommu->ivec |= FIELD_PREP(RISCV_IOMMU_IVEC_PIV, 2 % iommu->irqs_count);
> > + iommu->ivec |= FIELD_PREP(RISCV_IOMMU_IVEC_PMIV, 3 % iommu->irqs_count);
> > + riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_IVEC, iommu->ivec);
> > +
> > + iommu->ivec = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_IVEC);
> > + if (riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IVEC_CIV) >= RISCV_IOMMU_INTR_COUNT ||
> > + riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IVEC_FIV) >= RISCV_IOMMU_INTR_COUNT ||
> > + riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IVEC_PIV) >= RISCV_IOMMU_INTR_COUNT ||
> > + riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IVEC_PMIV) >= RISCV_IOMMU_INTR_COUNT)
> > + return -EINVAL;
> > +
> > return 0;
> > }
> >
> > @@ -500,12 +966,17 @@ void riscv_iommu_remove(struct riscv_iommu_device *iommu)
> > iommu_device_unregister(&iommu->iommu);
> > iommu_device_sysfs_remove(&iommu->iommu);
> > riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
> > + riscv_iommu_queue_disable(&iommu->cmdq);
> > + riscv_iommu_queue_disable(&iommu->fltq);
> > }
> >
> > int riscv_iommu_init(struct riscv_iommu_device *iommu)
> > {
> > int rc;
> >
> > + RISCV_IOMMU_QUEUE_INIT(&iommu->cmdq, CQ);
> > + RISCV_IOMMU_QUEUE_INIT(&iommu->fltq, FQ);
> > +
> > rc = riscv_iommu_init_check(iommu);
> > if (rc)
> > return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
> > @@ -514,10 +985,28 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
> > if (rc)
> > return rc;
> >
> > - rc = riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
> > + rc = riscv_iommu_queue_alloc(iommu, &iommu->cmdq,
> > + sizeof(struct riscv_iommu_command));
> > + if (rc)
> > + return rc;
> > +
> > + rc = riscv_iommu_queue_alloc(iommu, &iommu->fltq,
> > + sizeof(struct riscv_iommu_fq_record));
> > + if (rc)
> > + return rc;
> > +
> > + rc = riscv_iommu_queue_enable(iommu, &iommu->cmdq, riscv_iommu_cmdq_process);
> > if (rc)
> > return rc;
> >
> > + rc = riscv_iommu_queue_enable(iommu, &iommu->fltq, riscv_iommu_fltq_process);
> > + if (rc)
> > + goto err_queue_disable;
> > +
> > + rc = riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
> > + if (rc)
> > + goto err_queue_disable;
> > +
> > rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
> > dev_name(iommu->dev));
> > if (rc) {
> > @@ -537,5 +1026,8 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
> > iommu_device_sysfs_remove(&iommu->iommu);
> > err_iodir_off:
> > riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
> > +err_queue_disable:
> > + riscv_iommu_queue_disable(&iommu->fltq);
> > + riscv_iommu_queue_disable(&iommu->cmdq);
> > return rc;
> > }
> > diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
> > index f1696926582c..03e0c45bc7e1 100644
> > --- a/drivers/iommu/riscv/iommu.h
> > +++ b/drivers/iommu/riscv/iommu.h
> > @@ -17,6 +17,22 @@
> >
> > #include "iommu-bits.h"
> >
> > +struct riscv_iommu_device;
> > +
> > +struct riscv_iommu_queue {
> > + atomic_t prod; /* unbounded producer allocation index */
> > + atomic_t head; /* unbounded shadow ring buffer consumer index */
> > + atomic_t tail; /* unbounded shadow ring buffer producer index */
> > + unsigned int mask; /* index mask, queue length - 1 */
> > + unsigned int irq; /* allocated interrupt number */
> > + struct riscv_iommu_device *iommu; /* iommu device handling the queue when active */
> > + void *base; /* ring buffer kernel pointer */
> > + dma_addr_t phys; /* ring buffer physical address */
> > + u16 qbr; /* base register offset, head and tail reference */
> > + u16 qcr; /* control and status register offset */
> > + u8 qid; /* queue identifier, same as RISCV_IOMMU_INTR_XX */
> > +};
> > +
> > struct riscv_iommu_device {
> > /* iommu core interface */
> > struct iommu_device iommu;
> > @@ -34,6 +50,11 @@ struct riscv_iommu_device {
> > /* available interrupt numbers, MSI or WSI */
> > unsigned int irqs[RISCV_IOMMU_INTR_COUNT];
> > unsigned int irqs_count;
> > + unsigned int ivec;
> > +
> > + /* hardware queues */
> > + struct riscv_iommu_queue cmdq;
> > + struct riscv_iommu_queue fltq;
> >
> > /* device directory */
> > unsigned int ddt_mode;
> > --
> > 2.34.1
> >

Best,
- Tomasz

> >
> > _______________________________________________
> > linux-riscv mailing list
> > [email protected]
> > http://lists.infradead.org/mailman/listinfo/linux-riscv

2024-05-08 16:14:07

by Tomasz Jeznach

Subject: Re: [PATCH v4 7/7] iommu/riscv: Paging domain support

On Wed, May 8, 2024 at 8:57 AM Zong Li <[email protected]> wrote:
>
> On Sat, May 4, 2024 at 12:13 AM Tomasz Jeznach <tjeznach@rivosinc.com> wrote:
> >
> > Introduce first-stage address translation support.
> >
> > Page tables configured by the IOMMU driver use the highest mode
> > implemented by the hardware. If that mode is not known at domain
> > allocation time, the driver falls back to the CPU's MMU page mode.
> >
> > This change introduces IOTINVAL.VMA command, required to invalidate
> > any cached IOATC entries after mapping is updated and/or removed from
> > the paging domain. Invalidations for the non-leaf page entries use
> > IOTINVAL for all addresses assigned to the protection domain for
> > hardware not supporting more granular non-leaf page table cache
> > invalidations.
> >
> > Signed-off-by: Tomasz Jeznach <[email protected]>
> > ---
> > drivers/iommu/riscv/iommu.c | 587 +++++++++++++++++++++++++++++++++++-
> > 1 file changed, 585 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> > index 4349ac8a3990..ec701fde520f 100644
> > --- a/drivers/iommu/riscv/iommu.c
> > +++ b/drivers/iommu/riscv/iommu.c
> > @@ -41,6 +41,10 @@
> > #define dev_to_iommu(dev) \
> > iommu_get_iommu_dev(dev, struct riscv_iommu_device, iommu)
> >
> > +/* IOMMU PSCID allocation namespace. */
> > +static DEFINE_IDA(riscv_iommu_pscids);
> > +#define RISCV_IOMMU_MAX_PSCID (BIT(20) - 1)
> > +
> > /* Device resource-managed allocations */
> > struct riscv_iommu_devres {
> > void *addr;
> > @@ -766,6 +770,143 @@ static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
> > return 0;
> > }
> >
> > +/* This struct contains protection domain specific IOMMU driver data. */
> > +struct riscv_iommu_domain {
> > + struct iommu_domain domain;
> > + struct list_head bonds;
> > + spinlock_t lock; /* protect bonds list updates. */
> > + int pscid;
> > + int numa_node;
> > + int amo_enabled:1;
> > + unsigned int pgd_mode;
> > + unsigned long *pgd_root;
> > +};
> > +
> > +#define iommu_domain_to_riscv(iommu_domain) \
> > + container_of(iommu_domain, struct riscv_iommu_domain, domain)
> > +
> > +/* Private IOMMU data for managed devices, dev_iommu_priv_* */
> > +struct riscv_iommu_info {
> > + struct riscv_iommu_domain *domain;
> > +};
> > +
> > +/* Linkage between an iommu_domain and attached devices. */
> > +struct riscv_iommu_bond {
> > + struct list_head list;
> > + struct rcu_head rcu;
> > + struct device *dev;
> > +};
> > +
> > +static int riscv_iommu_bond_link(struct riscv_iommu_domain *domain,
> > + struct device *dev)
> > +{
> > + struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > + struct riscv_iommu_bond *bond;
> > + struct list_head *bonds;
> > +
> > + bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> > + if (!bond)
> > + return -ENOMEM;
> > + bond->dev = dev;
> > +
> > + /*
> > + * Linked device pointer and iommu data remain stable in bond struct
> > + * as long the device is attached to the managed IOMMU at _probe_device(),
> > + * up to completion of _release_device() call. Release of the bond will be
> > + * synchronized at the device release, to guarantee no stale pointer is
> > + * used inside rcu protected sections.
> > + *
> > + * List of devices attached to the domain is arranged based on
> > + * managed IOMMU device.
> > + */
> > +
> > + spin_lock(&domain->lock);
> > + list_for_each_rcu(bonds, &domain->bonds)
> > + if (dev_to_iommu(list_entry(bonds, struct riscv_iommu_bond, list)->dev) == iommu)
> > + break;
> > + list_add_rcu(&bond->list, bonds);
> > + spin_unlock(&domain->lock);
> > +
> > + return 0;
> > +}
> > +
> > +static void riscv_iommu_bond_unlink(struct riscv_iommu_domain *domain,
> > + struct device *dev)
> > +{
> > + struct riscv_iommu_bond *bond, *found = NULL;
> > +
> > + if (!domain)
> > + return;
> > +
> > + spin_lock(&domain->lock);
> > + list_for_each_entry_rcu(bond, &domain->bonds, list) {
> > + if (bond->dev == dev) {
> > + list_del_rcu(&bond->list);
> > + found = bond;
> > + }
> > + }
> > + spin_unlock(&domain->lock);
> > + kfree_rcu(found, rcu);
> > +}
> > +
> > +/*
> > + * Send IOTLB.INVAL for whole address space for ranges larger than 2MB.
> > + * This limit will be replaced with range invalidations, if supported by
> > + * the hardware, when RISC-V IOMMU architecture specification update for
> > + * range invalidations update will be available.
> > + */
> > +#define RISCV_IOMMU_IOTLB_INVAL_LIMIT (2 << 20)
> > +
> > +static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
> > + unsigned long start, unsigned long end)
> > +{
> > + struct riscv_iommu_bond *bond;
> > + struct riscv_iommu_device *iommu, *prev;
> > + struct riscv_iommu_command cmd;
> > + unsigned long len = end - start + 1;
> > + unsigned long iova;
> > +
> > + rcu_read_lock();
> > +
> > + prev = NULL;
> > + list_for_each_entry_rcu(bond, &domain->bonds, list) {
> > + iommu = dev_to_iommu(bond->dev);
> > +
> > + /*
> > + * IOTLB invalidation request can be safely omitted if already sent
> > + * to the IOMMU for the same PSCID, and with domain->bonds list
> > + * arranged based on the device's IOMMU, it's sufficient to check
> > + * last device the invalidation was sent to.
> > + */
> > + if (iommu == prev)
> > + continue;
> > +
> > + riscv_iommu_cmd_inval_vma(&cmd);
> > + riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
> > + if (len && len < RISCV_IOMMU_IOTLB_INVAL_LIMIT) {
> > + for (iova = start; iova < end; iova += PAGE_SIZE) {
> > + riscv_iommu_cmd_inval_set_addr(&cmd, iova);
> > + riscv_iommu_cmd_send(iommu, &cmd, 0);
> > + }
> > + } else {
> > + riscv_iommu_cmd_send(iommu, &cmd, 0);
> > + }
> > + prev = iommu;
> > + }
> > +
> > + prev = NULL;
> > + list_for_each_entry_rcu(bond, &domain->bonds, list) {
> > + iommu = dev_to_iommu(bond->dev);
> > + if (iommu == prev)
> > + continue;
> > +
> > + riscv_iommu_cmd_iofence(&cmd);
>
> I was wondering why we need many 'iofence' commands following
> invalidation commands. Could we just use one iofence to guarantees
> that all previous invalidation commands have been completed?
>

Unless I'm missing something, this code should send only one IOFENCE.C
command per IOMMU device. If there are multiple IOMMU devices
involved in the invalidation sequence, we have to wait for all of the
invalidation commands to complete, as there is no synchronization
mechanism between those independent IOMMUs.
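
To make the ordering concrete, here is a minimal user-space sketch of the
two-pass pattern, not the driver code itself. The fake_iommu and fake_bond
types and the send_inval()/send_fence_and_wait() helpers are hypothetical
stand-ins for the real structures and for riscv_iommu_cmd_send(). It assumes,
as the comment in the patch states, that the bonds list is grouped by IOMMU,
so comparing against the previous entry is enough to skip duplicates.

```c
#include <stddef.h>

/* Hypothetical stand-in for one IOMMU instance; counts issued commands. */
struct fake_iommu {
	int inval_sent;
	int fence_sent;
};

/* Hypothetical stand-in for a bond: one attached device and its IOMMU. */
struct fake_bond {
	struct fake_iommu *iommu;
};

/* Models queueing an invalidation without waiting for completion. */
static void send_inval(struct fake_iommu *iommu)
{
	iommu->inval_sent++;
}

/* Models queueing IOFENCE.C and waiting for it with a timeout. */
static void send_fence_and_wait(struct fake_iommu *iommu)
{
	iommu->fence_sent++;
}

/* bonds[] is assumed to be grouped by IOMMU, as in the domain bonds list. */
static void inval_domain(struct fake_bond *bonds, size_t n)
{
	struct fake_iommu *prev = NULL;
	size_t i;

	/* Pass 1: queue invalidations once per distinct IOMMU. */
	for (i = 0; i < n; i++) {
		if (bonds[i].iommu == prev)
			continue;
		send_inval(bonds[i].iommu);
		prev = bonds[i].iommu;
	}

	/* Pass 2: one fence per distinct IOMMU, each waited on separately. */
	prev = NULL;
	for (i = 0; i < n; i++) {
		if (bonds[i].iommu == prev)
			continue;
		send_fence_and_wait(bonds[i].iommu);
		prev = bonds[i].iommu;
	}
}
```

With five bonds spread over two IOMMUs, each IOMMU sees exactly one
invalidation and one fence. Splitting the work into two passes lets all
invalidations be queued before the first wait starts.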

> > + riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_QUEUE_TIMEOUT);
> > + prev = iommu;
> > + }
> > + rcu_read_unlock();
> > +}
> > +
> > #define RISCV_IOMMU_FSC_BARE 0
> >
> > /*
> > @@ -785,10 +926,29 @@ static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
> > {
> > struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> > struct riscv_iommu_dc *dc;
> > + struct riscv_iommu_command cmd;
> > + bool sync_required = false;
> > u64 tc;
> > int i;
> >
> > - /* Device context invalidation ignored for now. */
> > + for (i = 0; i < fwspec->num_ids; i++) {
> > + dc = riscv_iommu_get_dc(iommu, fwspec->ids[i]);
> > + tc = READ_ONCE(dc->tc);
> > + if (tc & RISCV_IOMMU_DC_TC_V) {
> > + WRITE_ONCE(dc->tc, tc & ~RISCV_IOMMU_DC_TC_V);
> > +
> > + /* Invalidate device context cached values */
> > + riscv_iommu_cmd_iodir_inval_ddt(&cmd);
> > + riscv_iommu_cmd_iodir_set_did(&cmd, fwspec->ids[i]);
> > + riscv_iommu_cmd_send(iommu, &cmd, 0);
> > + sync_required = true;
> > + }
> > + }
> > +
> > + if (sync_required) {
> > + riscv_iommu_cmd_iofence(&cmd);
> > + riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_IOTINVAL_TIMEOUT);
> > + }
> >
> > /*
> > * For device context with DC_TC_PDTV = 0, translation attributes valid bit
> > @@ -806,12 +966,415 @@ static void riscv_iommu_iodir_update(struct riscv_iommu_device *iommu,
> > }
> > }
> >
> > +/*
> > + * IOVA page translation tree management.
> > + */
> > +
> > +#define IOMMU_PAGE_SIZE_4K BIT_ULL(12)
> > +#define IOMMU_PAGE_SIZE_2M BIT_ULL(21)
> > +#define IOMMU_PAGE_SIZE_1G BIT_ULL(30)
> > +#define IOMMU_PAGE_SIZE_512G BIT_ULL(39)
> > +
> > +#define PT_SHIFT (PAGE_SHIFT - ilog2(sizeof(pte_t)))
> > +
> > +static void riscv_iommu_iotlb_flush_all(struct iommu_domain *iommu_domain)
> > +{
> > + struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> > +
> > + riscv_iommu_iotlb_inval(domain, 0, ULONG_MAX);
> > +}
> > +
> > +static void riscv_iommu_iotlb_sync(struct iommu_domain *iommu_domain,
> > + struct iommu_iotlb_gather *gather)
> > +{
> > + struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> > +
> > + riscv_iommu_iotlb_inval(domain, gather->start, gather->end);
> > +}
> > +
> > +static inline size_t get_page_size(size_t size)
> > +{
> > + if (size >= IOMMU_PAGE_SIZE_512G)
> > + return IOMMU_PAGE_SIZE_512G;
> > + if (size >= IOMMU_PAGE_SIZE_1G)
> > + return IOMMU_PAGE_SIZE_1G;
> > + if (size >= IOMMU_PAGE_SIZE_2M)
> > + return IOMMU_PAGE_SIZE_2M;
> > + return IOMMU_PAGE_SIZE_4K;
> > +}
> > +
> > +#define _io_pte_present(pte) ((pte) & (_PAGE_PRESENT | _PAGE_PROT_NONE))
> > +#define _io_pte_leaf(pte) ((pte) & _PAGE_LEAF)
> > +#define _io_pte_none(pte) ((pte) == 0)
> > +#define _io_pte_entry(pn, prot) ((_PAGE_PFN_MASK & ((pn) << _PAGE_PFN_SHIFT)) | (prot))
> > +
> > +static void riscv_iommu_pte_free(struct riscv_iommu_domain *domain,
> > + unsigned long pte, struct list_head *freelist)
> > +{
> > + unsigned long *ptr;
> > + int i;
> > +
> > + if (!_io_pte_present(pte) || _io_pte_leaf(pte))
> > + return;
> > +
> > + ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
> > +
> > + /* Recursively free all sub page table pages */
> > + for (i = 0; i < PTRS_PER_PTE; i++) {
> > + pte = READ_ONCE(ptr[i]);
> > + if (!_io_pte_none(pte) && cmpxchg_relaxed(ptr + i, pte, 0) == pte)
> > + riscv_iommu_pte_free(domain, pte, freelist);
> > + }
> > +
> > + if (freelist)
> > + list_add_tail(&virt_to_page(ptr)->lru, freelist);
> > + else
> > + iommu_free_page(ptr);
> > +}
> > +
> > +static unsigned long *riscv_iommu_pte_alloc(struct riscv_iommu_domain *domain,
> > + unsigned long iova, size_t pgsize,
> > + gfp_t gfp)
> > +{
> > + unsigned long *ptr = domain->pgd_root;
> > + unsigned long pte, old;
> > + int level = domain->pgd_mode - RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39 + 2;
> > + void *addr;
> > +
> > + do {
> > + const int shift = PAGE_SHIFT + PT_SHIFT * level;
> > +
> > + ptr += ((iova >> shift) & (PTRS_PER_PTE - 1));
> > + /*
> > + * Note: returned entry might be a non-leaf if there was
> > + * existing mapping with smaller granularity. Up to the caller
> > + * to replace and invalidate.
> > + */
> > + if (((size_t)1 << shift) == pgsize)
> > + return ptr;
> > +pte_retry:
> > + pte = READ_ONCE(*ptr);
> > + /*
> > + * This is very likely incorrect as we should not be adding
> > + * new mapping with smaller granularity on top
> > + * of existing 2M/1G mapping. Fail.
> > + */
> > + if (_io_pte_present(pte) && _io_pte_leaf(pte))
> > + return NULL;
> > + /*
> > + * Non-leaf entry is missing, allocate and try to add to the
> > + * page table. This might race with other mappings, retry.
> > + */
> > + if (_io_pte_none(pte)) {
> > + addr = iommu_alloc_page_node(domain->numa_node, gfp);
> > + if (!addr)
> > + return NULL;
> > + old = pte;
> > + pte = _io_pte_entry(virt_to_pfn(addr), _PAGE_TABLE);
> > + if (cmpxchg_relaxed(ptr, old, pte) != old) {
> > + iommu_free_page(addr);
> > + goto pte_retry;
> > + }
> > + }
> > + ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
> > + } while (level-- > 0);
> > +
> > + return NULL;
> > +}
> > +
> > +static unsigned long *riscv_iommu_pte_fetch(struct riscv_iommu_domain *domain,
> > + unsigned long iova, size_t *pte_pgsize)
> > +{
> > + unsigned long *ptr = domain->pgd_root;
> > + unsigned long pte;
> > + int level = domain->pgd_mode - RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39 + 2;
> > +
> > + do {
> > + const int shift = PAGE_SHIFT + PT_SHIFT * level;
> > +
> > + ptr += ((iova >> shift) & (PTRS_PER_PTE - 1));
> > + pte = READ_ONCE(*ptr);
> > + if (_io_pte_present(pte) && _io_pte_leaf(pte)) {
> > + *pte_pgsize = (size_t)1 << shift;
> > + return ptr;
> > + }
> > + if (_io_pte_none(pte))
> > + return NULL;
> > + ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
> > + } while (level-- > 0);
> > +
> > + return NULL;
> > +}
> > +
> > +static int riscv_iommu_map_pages(struct iommu_domain *iommu_domain,
> > + unsigned long iova, phys_addr_t phys,
> > + size_t pgsize, size_t pgcount, int prot,
> > + gfp_t gfp, size_t *mapped)
> > +{
> > + struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> > + size_t size = 0;
> > + size_t page_size = get_page_size(pgsize);
> > + unsigned long *ptr;
> > + unsigned long pte, old, pte_prot;
> > + int rc = 0;
> > + LIST_HEAD(freelist);
> > +
> > + if (!(prot & IOMMU_WRITE))
> > + pte_prot = _PAGE_BASE | _PAGE_READ;
> > + else if (domain->amo_enabled)
> > + pte_prot = _PAGE_BASE | _PAGE_READ | _PAGE_WRITE;
> > + else
> > + pte_prot = _PAGE_BASE | _PAGE_READ | _PAGE_WRITE | _PAGE_DIRTY;
> > +
> > + while (pgcount) {
> > + ptr = riscv_iommu_pte_alloc(domain, iova, page_size, gfp);
> > + if (!ptr) {
> > + rc = -ENOMEM;
> > + break;
> > + }
> > +
> > + old = READ_ONCE(*ptr);
> > + pte = _io_pte_entry(phys_to_pfn(phys), pte_prot);
> > + if (cmpxchg_relaxed(ptr, old, pte) != old)
> > + continue;
> > +
> > + riscv_iommu_pte_free(domain, old, &freelist);
> > +
> > + size += page_size;
> > + iova += page_size;
> > + phys += page_size;
> > + --pgcount;
> > + }
> > +
> > + *mapped = size;
> > +
> > + if (!list_empty(&freelist)) {
> > + /*
> > + * In 1.0 spec version, the smallest scope we can use to
> > + * invalidate all levels of page table (i.e. leaf and non-leaf)
> > + * is an invalidate-all-PSCID IOTINVAL.VMA with AV=0.
> > + * This will be updated with hardware support for
> > + * capability.NL (non-leaf) IOTINVAL command.
> > + */
> > + riscv_iommu_iotlb_inval(domain, 0, ULONG_MAX);
> > + iommu_put_pages_list(&freelist);
> > + }
> > +
> > + return rc;
> > +}
> > +
> > +static size_t riscv_iommu_unmap_pages(struct iommu_domain *iommu_domain,
> > + unsigned long iova, size_t pgsize,
> > + size_t pgcount,
> > + struct iommu_iotlb_gather *gather)
> > +{
> > + struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> > + size_t size = pgcount << __ffs(pgsize);
> > + unsigned long *ptr, old;
> > + size_t unmapped = 0;
> > + size_t pte_size;
> > +
> > + while (unmapped < size) {
> > + ptr = riscv_iommu_pte_fetch(domain, iova, &pte_size);
> > + if (!ptr)
> > + return unmapped;
> > +
> > + /* partial unmap is not allowed, fail. */
> > + if (iova & (pte_size - 1))
> > + return unmapped;
> > +
> > + old = READ_ONCE(*ptr);
> > + if (cmpxchg_relaxed(ptr, old, 0) != old)
> > + continue;
> > +
> > + iommu_iotlb_gather_add_page(&domain->domain, gather, iova,
> > + pte_size);
> > +
> > + iova += pte_size;
> > + unmapped += pte_size;
> > + }
> > +
> > + return unmapped;
> > +}
> > +
> > +static phys_addr_t riscv_iommu_iova_to_phys(struct iommu_domain *iommu_domain,
> > + dma_addr_t iova)
> > +{
> > + struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> > + unsigned long pte_size;
> > + unsigned long *ptr;
> > +
> > + ptr = riscv_iommu_pte_fetch(domain, iova, &pte_size);
> > + if (_io_pte_none(*ptr) || !_io_pte_present(*ptr))
> > + return 0;
> > +
> > + return pfn_to_phys(__page_val_to_pfn(*ptr)) | (iova & (pte_size - 1));
> > +}
> > +
> > +static void riscv_iommu_free_paging_domain(struct iommu_domain *iommu_domain)
> > +{
> > + struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> > + const unsigned long pfn = virt_to_pfn(domain->pgd_root);
> > +
> > + WARN_ON(!list_empty(&domain->bonds));
> > +
> > + if ((int)domain->pscid > 0)
> > + ida_free(&riscv_iommu_pscids, domain->pscid);
> > +
> > + riscv_iommu_pte_free(domain, _io_pte_entry(pfn, _PAGE_TABLE), NULL);
> > + kfree(domain);
> > +}
> > +
> > +static bool riscv_iommu_pt_supported(struct riscv_iommu_device *iommu, int pgd_mode)
> > +{
> > + switch (pgd_mode) {
> > + case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39:
> > + return iommu->caps & RISCV_IOMMU_CAP_S_SV39;
> > +
> > + case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48:
> > + return iommu->caps & RISCV_IOMMU_CAP_S_SV48;
> > +
> > + case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57:
> > + return iommu->caps & RISCV_IOMMU_CAP_S_SV57;
> > + }
> > + return false;
> > +}
> > +
> > +static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
> > + struct device *dev)
> > +{
> > + struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> > + struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > + struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
> > + u64 fsc, ta;
> > +
> > + if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
> > + return -ENODEV;
> > +
> > + if (domain->numa_node == NUMA_NO_NODE)
> > + domain->numa_node = dev_to_node(iommu->dev);
> > +
> > + if (riscv_iommu_bond_link(domain, dev))
> > + return -ENOMEM;
> > +
> > + /*
> > + * Invalidate PSCID.
> > + * This invalidation might be redundant if IOATC has been already cleared,
> > + * however we are not keeping track for domains not linked to a device.
> > + */
> > + riscv_iommu_iotlb_inval(domain, 0, ULONG_MAX);
> > +
> > + fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, domain->pgd_mode) |
> > + FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, virt_to_pfn(domain->pgd_root));
> > + ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid) |
> > + RISCV_IOMMU_PC_TA_V;
> > +
> > + riscv_iommu_iodir_update(iommu, dev, fsc, ta);
> > + riscv_iommu_bond_unlink(info->domain, dev);
>
> Should we unlink bond before link it?
>

No. Unlinking first could cause mapping invalidations to be missed for
the previously attached domain, which is still valid during
riscv_iommu_iodir_update().
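
As a rough illustration of why the order matters, the sketch below models
the attach sequence in user space. Everything in it is hypothetical
(fake_domain, fake_dev, covered(), attach()) and it ignores locking, RCU and
re-attaching to the same domain. It only tracks whether the domain the
hardware is currently using still has the device on its bonds list, i.e.
whether an invalidation for that domain would reach the device.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEVS 4

/* Hypothetical model of a domain: which device ids are currently bonded. */
struct fake_domain {
	bool bonded[MAX_DEVS];
};

struct fake_dev {
	int id;
	struct fake_domain *hw_domain;   /* domain programmed in the device context */
	struct fake_domain *info_domain; /* domain remembered in per-device info */
};

/* True if an invalidation on the domain the hardware uses would reach dev. */
static bool covered(const struct fake_dev *dev)
{
	return !dev->hw_domain || dev->hw_domain->bonded[dev->id];
}

/*
 * Attach dev to new_domain (assumed different from the current one).
 * Returns true if the device stayed covered after every step.
 */
static bool attach(struct fake_domain *new_domain, struct fake_dev *dev,
		   bool unlink_first)
{
	struct fake_domain *old = dev->info_domain;
	bool ok = true;

	if (unlink_first && old) {
		old->bonded[dev->id] = false;	/* hardware still uses old */
		ok = ok && covered(dev);
	}

	new_domain->bonded[dev->id] = true;	/* link to the new domain */
	ok = ok && covered(dev);

	dev->hw_domain = new_domain;		/* device context update */
	ok = ok && covered(dev);

	if (!unlink_first && old)
		old->bonded[dev->id] = false;	/* unlink after the switch */
	ok = ok && covered(dev);

	dev->info_domain = new_domain;
	return ok;
}
```

In this model attach(..., false) keeps the device covered at every step,
while attach(..., true) opens a window in which the old domain is still
programmed in the device context but no longer lists the device.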

> > + info->domain = domain;
> > +
> > + return 0;
> > +}
> > +
> > +static const struct iommu_domain_ops riscv_iommu_paging_domain_ops = {
> > + .attach_dev = riscv_iommu_attach_paging_domain,
> > + .free = riscv_iommu_free_paging_domain,
> > + .map_pages = riscv_iommu_map_pages,
> > + .unmap_pages = riscv_iommu_unmap_pages,
> > + .iova_to_phys = riscv_iommu_iova_to_phys,
> > + .iotlb_sync = riscv_iommu_iotlb_sync,
> > + .flush_iotlb_all = riscv_iommu_iotlb_flush_all,
> > +};
> > +
> > +static struct iommu_domain *riscv_iommu_alloc_paging_domain(struct device *dev)
> > +{
> > + struct riscv_iommu_domain *domain;
> > + struct riscv_iommu_device *iommu;
> > + unsigned int pgd_mode;
> > + int va_bits;
> > +
> > + iommu = dev ? dev_to_iommu(dev) : NULL;
> > +
> > + /*
> > + * In unlikely case when dev or iommu is not known, use system
> > + * SATP mode to configure paging domain radix tree depth.
> > + * Use highest available if actual IOMMU hardware capabilities
> > + * are known here.
> > + */
> > + if (!iommu) {
> > + pgd_mode = satp_mode >> SATP_MODE_SHIFT;
> > + va_bits = VA_BITS;
> > + } else if (iommu->caps & RISCV_IOMMU_CAP_S_SV57) {
> > + pgd_mode = RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57;
> > + va_bits = 57;
> > + } else if (iommu->caps & RISCV_IOMMU_CAP_S_SV48) {
> > + pgd_mode = RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48;
> > + va_bits = 48;
> > + } else if (iommu->caps & RISCV_IOMMU_CAP_S_SV39) {
> > + pgd_mode = RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39;
> > + va_bits = 39;
> > + } else {
> > + dev_err(dev, "cannot find supported page table mode\n");
> > + return ERR_PTR(-ENODEV);
> > + }
> > +
> > + domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> > + if (!domain)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + INIT_LIST_HEAD_RCU(&domain->bonds);
> > + spin_lock_init(&domain->lock);
> > + domain->numa_node = NUMA_NO_NODE;
> > +
> > + if (iommu) {
> > + domain->numa_node = dev_to_node(iommu->dev);
> > + domain->amo_enabled = !!(iommu->caps & RISCV_IOMMU_CAP_AMO_HWAD);
> > + }
> > +
> > + domain->pgd_mode = pgd_mode;
> > + domain->pgd_root = iommu_alloc_page_node(domain->numa_node,
> > + GFP_KERNEL_ACCOUNT);
> > + if (!domain->pgd_root) {
> > + kfree(domain);
> > + return ERR_PTR(-ENOMEM);
> > + }
> > +
> > + domain->pscid = ida_alloc_range(&riscv_iommu_pscids, 1,
> > + RISCV_IOMMU_MAX_PSCID, GFP_KERNEL);
> > + if (domain->pscid < 0) {
> > + iommu_free_page(domain->pgd_root);
> > + kfree(domain);
> > + return ERR_PTR(-ENOMEM);
> > + }
> > +
> > + /*
> > + * Note: RISC-V Privilege spec mandates that virtual addresses
> > + * need to be sign-extended, so if (VA_BITS - 1) is set, all
> > + * bits >= VA_BITS need to also be set or else we'll get a
> > + * page fault. However the code that creates the mappings
> > + * above us (e.g. iommu_dma_alloc_iova()) won't do that for us
> > + * for now, so we'll end up with invalid virtual addresses
> > + * to map. As a workaround until we get this sorted out
> > + * limit the available virtual addresses to VA_BITS - 1.
> > + */
> > + domain->domain.geometry.aperture_start = 0;
> > + domain->domain.geometry.aperture_end = DMA_BIT_MASK(va_bits - 1);
> > + domain->domain.geometry.force_aperture = true;
> > +
> > + domain->domain.ops = &riscv_iommu_paging_domain_ops;
> > +
> > + return &domain->domain;
> > +}
> > +
> > static int riscv_iommu_attach_blocking_domain(struct iommu_domain *iommu_domain,
> > struct device *dev)
> > {
> > struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > + struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
> >
> > riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, 0);
> > + riscv_iommu_bond_unlink(info->domain, dev);
> > + info->domain = NULL;
> >
> > return 0;
> > }
> > @@ -827,8 +1390,11 @@ static int riscv_iommu_attach_identity_domain(struct iommu_domain *iommu_domain,
> > struct device *dev)
> > {
> > struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > + struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
> >
> > riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, RISCV_IOMMU_PC_TA_V);
> > + riscv_iommu_bond_unlink(info->domain, dev);
> > + info->domain = NULL;
> >
> > return 0;
> > }
> > @@ -842,7 +1408,7 @@ static struct iommu_domain riscv_iommu_identity_domain = {
> >
> > static int riscv_iommu_device_domain_type(struct device *dev)
> > {
> > - return IOMMU_DOMAIN_IDENTITY;
> > + return 0;
> > }
> >
> > static struct iommu_group *riscv_iommu_device_group(struct device *dev)
> > @@ -861,6 +1427,7 @@ static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
> > {
> > struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> > struct riscv_iommu_device *iommu;
> > + struct riscv_iommu_info *info;
> > struct riscv_iommu_dc *dc;
> > u64 tc;
> > int i;
> > @@ -895,17 +1462,33 @@ static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
> > WRITE_ONCE(dc->tc, tc);
> > }
> >
> > + info = kzalloc(sizeof(*info), GFP_KERNEL);
> > + if (!info)
> > + return ERR_PTR(-ENOMEM);
> > + dev_iommu_priv_set(dev, info);
> > +
> > return &iommu->iommu;
> > }
> >
> > +static void riscv_iommu_release_device(struct device *dev)
> > +{
> > + struct riscv_iommu_info *info = dev_iommu_priv_get(dev);
> > +
> > + synchronize_rcu();
> > + kfree(info);
> > +}
> > +
> > static const struct iommu_ops riscv_iommu_ops = {
> > + .pgsize_bitmap = SZ_4K,
> > .of_xlate = riscv_iommu_of_xlate,
> > .identity_domain = &riscv_iommu_identity_domain,
> > .blocked_domain = &riscv_iommu_blocking_domain,
> > .release_domain = &riscv_iommu_blocking_domain,
> > + .domain_alloc_paging = riscv_iommu_alloc_paging_domain,
> > .def_domain_type = riscv_iommu_device_domain_type,
> > .device_group = riscv_iommu_device_group,
> > .probe_device = riscv_iommu_probe_device,
> > + .release_device = riscv_iommu_release_device,
> > };
> >
> > static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
> > --
> > 2.34.1
> >

Best regards,
- Tomasz



2024-05-09 01:58:28

by Zong Li

Subject: Re: [PATCH v4 6/7] iommu/riscv: Command and fault queue support

On Thu, May 9, 2024 at 12:03 AM Tomasz Jeznach <[email protected]> wrote:
>
> On Wed, May 8, 2024 at 8:39 AM Zong Li <[email protected]> wrote:
> >
> > On Sat, May 4, 2024 at 12:13 AM Tomasz Jeznach <[email protected]> wrote:
> > >
> > > Introduce device command submission and fault reporting queues,
> > > as described in Chapter 3.1 and 3.2 of the RISC-V IOMMU Architecture
> > > Specification.
> > >
> > > > Command and fault queues are instantiated in contiguous system memory
> > > > local to the IOMMU device domain, or mapped from fixed I/O space
> > > > provided by the hardware implementation. Detection of the location and
> > > > maximum allowed size of the queue utilizes the WARL properties of the
> > > > queue base control register. The driver will try to allocate up to 128KB
> > > > of system memory, while respecting the hardware-supported maximum queue size.
> > >
> > > > Interrupt allocation is based on interrupt vector availability, with
> > > > vectors distributed to all queues in simple round-robin fashion. For
> > > > hardware implementations with a fixed event type to interrupt vector
> > > > assignment, the IVEC WARL property is used to discover such mappings.
> > >
> > > Address translation, command and queue fault handling in this change
> > > is limited to simple fault reporting without taking any action.
> > >
> > > Reviewed-by: Lu Baolu <[email protected]>
> > > Signed-off-by: Tomasz Jeznach <[email protected]>
> > > ---
> > > drivers/iommu/riscv/iommu-bits.h | 75 +++++
> > > drivers/iommu/riscv/iommu.c | 496 ++++++++++++++++++++++++++++++-
> > > drivers/iommu/riscv/iommu.h | 21 ++
> > > 3 files changed, 590 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/iommu/riscv/iommu-bits.h b/drivers/iommu/riscv/iommu-bits.h
> > > index ba093c29de9f..40c379222821 100644
> > > --- a/drivers/iommu/riscv/iommu-bits.h
> > > +++ b/drivers/iommu/riscv/iommu-bits.h
> > > @@ -704,4 +704,79 @@ struct riscv_iommu_msi_pte {
> > > #define RISCV_IOMMU_MSI_MRIF_NPPN RISCV_IOMMU_PPN_FIELD
> > > #define RISCV_IOMMU_MSI_MRIF_NID_MSB BIT_ULL(60)
> > >
> > > +/* Helper functions: command structure builders. */
> > > +
> > > +static inline void riscv_iommu_cmd_inval_vma(struct riscv_iommu_command *cmd)
> > > +{
> > > + cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOTINVAL_OPCODE) |
> > > + FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA);
> > > + cmd->dword1 = 0;
> > > +}
> > > +
> > > +static inline void riscv_iommu_cmd_inval_set_addr(struct riscv_iommu_command *cmd,
> > > + u64 addr)
> > > +{
> > > + cmd->dword1 = FIELD_PREP(RISCV_IOMMU_CMD_IOTINVAL_ADDR, phys_to_pfn(addr));
> > > + cmd->dword0 |= RISCV_IOMMU_CMD_IOTINVAL_AV;
> > > +}
> > > +
> > > +static inline void riscv_iommu_cmd_inval_set_pscid(struct riscv_iommu_command *cmd,
> > > + int pscid)
> > > +{
> > > + cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IOTINVAL_PSCID, pscid) |
> > > + RISCV_IOMMU_CMD_IOTINVAL_PSCV;
> > > +}
> > > +
> > > +static inline void riscv_iommu_cmd_inval_set_gscid(struct riscv_iommu_command *cmd,
> > > + int gscid)
> > > +{
> > > + cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IOTINVAL_GSCID, gscid) |
> > > + RISCV_IOMMU_CMD_IOTINVAL_GV;
> > > +}
> > > +
> > > +static inline void riscv_iommu_cmd_iofence(struct riscv_iommu_command *cmd)
> > > +{
> > > + cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOFENCE_OPCODE) |
> > > + FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOFENCE_FUNC_C) |
> > > + RISCV_IOMMU_CMD_IOFENCE_PR | RISCV_IOMMU_CMD_IOFENCE_PW;
> > > + cmd->dword1 = 0;
> > > +}
> > > +
> > > +static inline void riscv_iommu_cmd_iofence_set_av(struct riscv_iommu_command *cmd,
> > > + u64 addr, u32 data)
> > > +{
> > > + cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOFENCE_OPCODE) |
> > > + FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOFENCE_FUNC_C) |
> > > + FIELD_PREP(RISCV_IOMMU_CMD_IOFENCE_DATA, data) |
> > > + RISCV_IOMMU_CMD_IOFENCE_AV;
> > > + cmd->dword1 = addr >> 2;
> > > +}
> > > +
> > > +static inline void riscv_iommu_cmd_iodir_inval_ddt(struct riscv_iommu_command *cmd)
> > > +{
> > > + cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IODIR_OPCODE) |
> > > + FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT);
> > > + cmd->dword1 = 0;
> > > +}
> > > +
> > > +static inline void riscv_iommu_cmd_iodir_inval_pdt(struct riscv_iommu_command *cmd)
> > > +{
> > > + cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IODIR_OPCODE) |
> > > + FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT);
> > > + cmd->dword1 = 0;
> > > +}
> > > +
> > > +static inline void riscv_iommu_cmd_iodir_set_did(struct riscv_iommu_command *cmd,
> > > + unsigned int devid)
> > > +{
> > > + cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IODIR_DID, devid) |
> > > + RISCV_IOMMU_CMD_IODIR_DV;
> > > +}
> > > +
> > > +static inline void riscv_iommu_cmd_iodir_set_pid(struct riscv_iommu_command *cmd,
> > > + unsigned int pasid)
> > > +{
> > > + cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IODIR_PID, pasid);
> > > +}
> > > +
> > > #endif /* _RISCV_IOMMU_BITS_H_ */
> > > diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> > > index 71b7903d83d4..4349ac8a3990 100644
> > > --- a/drivers/iommu/riscv/iommu.c
> > > +++ b/drivers/iommu/riscv/iommu.c
> > > @@ -25,7 +25,14 @@
> > > #include "iommu.h"
> > >
> > > /* Timeouts in [us] */
> > > -#define RISCV_IOMMU_DDTP_TIMEOUT 50000
> > > +#define RISCV_IOMMU_QCSR_TIMEOUT 150000
> > > +#define RISCV_IOMMU_QUEUE_TIMEOUT 150000
> > > +#define RISCV_IOMMU_DDTP_TIMEOUT 10000000
> > > +#define RISCV_IOMMU_IOTINVAL_TIMEOUT 90000000
> > > +
> > > +/* Number of entries per CMD/FLT queue, should be <= INT_MAX */
> > > +#define RISCV_IOMMU_DEF_CQ_COUNT 8192
> > > +#define RISCV_IOMMU_DEF_FQ_COUNT 4096
> > >
> > > /* RISC-V IOMMU PPN <> PHYS address conversions, PHYS <=> PPN[53:10] */
> > > #define phys_to_ppn(va) (((va) >> 2) & (((1ULL << 44) - 1) << 10))
> > > @@ -89,6 +96,432 @@ static void riscv_iommu_free_pages(struct riscv_iommu_device *iommu, void *addr)
> > > riscv_iommu_devres_pages_match, &devres);
> > > }
> > >
> > > +/*
> > > + * Hardware queue allocation and management.
> > > + */
> > > +
> > > +/* Setup queue base, control registers and default queue length */
> > > +#define RISCV_IOMMU_QUEUE_INIT(q, name) do { \
> > > + struct riscv_iommu_queue *_q = q; \
> > > + _q->qid = RISCV_IOMMU_INTR_ ## name; \
> > > + _q->qbr = RISCV_IOMMU_REG_ ## name ## B; \
> > > + _q->qcr = RISCV_IOMMU_REG_ ## name ## CSR; \
> > > + _q->mask = _q->mask ?: (RISCV_IOMMU_DEF_ ## name ## _COUNT) - 1;\
> > > +} while (0)
> > > +
> > > +/* Note: offsets are the same for all queues */
> > > +#define Q_HEAD(q) ((q)->qbr + (RISCV_IOMMU_REG_CQH - RISCV_IOMMU_REG_CQB))
> > > +#define Q_TAIL(q) ((q)->qbr + (RISCV_IOMMU_REG_CQT - RISCV_IOMMU_REG_CQB))
> > > +#define Q_ITEM(q, index) ((q)->mask & (index))
> > > +#define Q_IPSR(q) BIT((q)->qid)
> > > +
> > > +/*
> > > + * Discover queue ring buffer hardware configuration, allocate in-memory
> > > + * ring buffer or use fixed I/O memory location, configure queue base register.
> > > + * Must be called before hardware queue is enabled.
> > > + *
> > > + * @queue - data structure, configured with RISCV_IOMMU_QUEUE_INIT()
> > > + * @entry_size - queue single element size in bytes.
> > > + */
> > > +static int riscv_iommu_queue_alloc(struct riscv_iommu_device *iommu,
> > > + struct riscv_iommu_queue *queue,
> > > + size_t entry_size)
> > > +{
> > > + unsigned int logsz;
> > > + u64 qb, rb;
> > > +
> > > + /*
> > > + * Use WARL base register property to discover maximum allowed
> > > + * number of entries and optional fixed IO address for queue location.
> > > + */
> > > + riscv_iommu_writeq(iommu, queue->qbr, RISCV_IOMMU_QUEUE_LOGSZ_FIELD);
> > > + qb = riscv_iommu_readq(iommu, queue->qbr);
> > > +
> > > + /*
> > > + * Calculate and verify hardware supported queue length, as reported
> > > + * by the field LOGSZ, where max queue length is equal to 2^(LOGSZ + 1).
> > > + * Update queue size based on hardware supported value.
> > > + */
> > > + logsz = ilog2(queue->mask);
> > > + if (logsz > FIELD_GET(RISCV_IOMMU_QUEUE_LOGSZ_FIELD, qb))
> > > + logsz = FIELD_GET(RISCV_IOMMU_QUEUE_LOGSZ_FIELD, qb);
> > > +
> > > + /*
> > > + * Use WARL base register property to discover an optional fixed IO
> > > + * address for queue ring buffer location. Otherwise allocate contiguous
> > > + * system memory.
> > > + */
> > > + if (FIELD_GET(RISCV_IOMMU_PPN_FIELD, qb)) {
> > > + const size_t queue_size = entry_size << (logsz + 1);
> > > +
> > > + queue->phys = ppn_to_phys(FIELD_GET(RISCV_IOMMU_PPN_FIELD, qb));
> > > + queue->base = devm_ioremap(iommu->dev, queue->phys, queue_size);
> > > + } else {
> > > + do {
> > > + const size_t queue_size = entry_size << (logsz + 1);
> > > + const int order = get_order(queue_size);
> > > +
> > > + queue->base = riscv_iommu_get_pages(iommu, order);
> > > + queue->phys = __pa(queue->base);
> > > + } while (!queue->base && logsz-- > 0);
> > > + }
> > > +
> > > + if (!queue->base)
> > > + return -ENOMEM;
> > > +
> > > + qb = phys_to_ppn(queue->phys) |
> > > + FIELD_PREP(RISCV_IOMMU_QUEUE_LOGSZ_FIELD, logsz);
> > > +
> > > + /* Update base register and read back to verify hw accepted our write */
> > > + riscv_iommu_writeq(iommu, queue->qbr, qb);
> > > + rb = riscv_iommu_readq(iommu, queue->qbr);
> > > + if (rb != qb) {
> > > + dev_err(iommu->dev, "queue #%u allocation failed\n", queue->qid);
> > > + return -ENODEV;
> > > + }
> > > +
> > > + /* Update actual queue mask */
> > > + queue->mask = (2U << logsz) - 1;
> > > +
> > > + dev_dbg(iommu->dev, "queue #%u allocated 2^%u entries",
> > > + queue->qid, logsz + 1);
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/* Check interrupt queue status, IPSR */
> > > +static irqreturn_t riscv_iommu_queue_ipsr(int irq, void *data)
> > > +{
> > > + struct riscv_iommu_queue *queue = (struct riscv_iommu_queue *)data;
> > > +
> > > + if (riscv_iommu_readl(queue->iommu, RISCV_IOMMU_REG_IPSR) & Q_IPSR(queue))
> > > + return IRQ_WAKE_THREAD;
> > > +
> > > + return IRQ_NONE;
> > > +}
> > > +
> > > +static int riscv_iommu_queue_vec(struct riscv_iommu_device *iommu, int n)
> > > +{
> > > + /* Reuse IVEC.CIV mask for all interrupt vectors mapping. */
> > > + return (iommu->ivec >> (n * 4)) & RISCV_IOMMU_IVEC_CIV;
> > > +}
> > > +
> > > +/*
> > > + * Enable queue processing in the hardware, register interrupt handler.
> > > + *
> > > + * @queue - data structure, already allocated with riscv_iommu_queue_alloc()
> > > + * @irq_handler - threaded interrupt handler.
> > > + */
> > > +static int riscv_iommu_queue_enable(struct riscv_iommu_device *iommu,
> > > + struct riscv_iommu_queue *queue,
> > > + irq_handler_t irq_handler)
> > > +{
> > > + const unsigned int irq = iommu->irqs[riscv_iommu_queue_vec(iommu, queue->qid)];
> > > + u32 csr;
> > > + int rc;
> > > +
> > > + if (queue->iommu)
> > > + return -EBUSY;
> > > +
> > > + /* Polling not implemented */
> > > + if (!irq)
> > > + return -ENODEV;
> > > +
> > > + queue->iommu = iommu;
> > > + rc = request_threaded_irq(irq, riscv_iommu_queue_ipsr, irq_handler,
> > > + IRQF_ONESHOT | IRQF_SHARED,
> > > + dev_name(iommu->dev), queue);
> > > + if (rc) {
> > > + queue->iommu = NULL;
> > > + return rc;
> > > + }
> > > +
> > > + /*
> > > + * Enable queue with interrupts, clear any memory fault if any.
> > > + * Wait for the hardware to acknowledge request and activate queue
> > > + * processing.
> > > + * Note: All CSR bitfields are in the same offsets for all queues.
> > > + */
> > > + riscv_iommu_writel(iommu, queue->qcr,
> > > + RISCV_IOMMU_QUEUE_ENABLE |
> > > + RISCV_IOMMU_QUEUE_INTR_ENABLE |
> > > + RISCV_IOMMU_QUEUE_MEM_FAULT);
> > > +
> > > + riscv_iommu_readl_timeout(iommu, queue->qcr,
> > > + csr, !(csr & RISCV_IOMMU_QUEUE_BUSY),
> > > + 10, RISCV_IOMMU_QCSR_TIMEOUT);
> > > +
> > > + if (RISCV_IOMMU_QUEUE_ACTIVE != (csr & (RISCV_IOMMU_QUEUE_ACTIVE |
> > > + RISCV_IOMMU_QUEUE_BUSY |
> > > + RISCV_IOMMU_QUEUE_MEM_FAULT))) {
> > > + /* Best effort to stop and disable failing hardware queue. */
> > > + riscv_iommu_writel(iommu, queue->qcr, 0);
> > > + free_irq(irq, queue);
> > > + queue->iommu = NULL;
> > > + dev_err(iommu->dev, "queue #%u failed to start\n", queue->qid);
> > > + return -EBUSY;
> > > + }
> > > +
> > > + /* Clear any pending interrupt flag. */
> > > + riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, Q_IPSR(queue));
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/*
> > > + * Disable queue. Wait for the hardware to acknowledge request and
> > > + * stop processing enqueued requests. Report errors but continue.
> > > + */
> > > +static void riscv_iommu_queue_disable(struct riscv_iommu_queue *queue)
> > > +{
> > > + struct riscv_iommu_device *iommu = queue->iommu;
> > > + u32 csr;
> > > +
> > > + if (!iommu)
> > > + return;
> > > +
> > > + free_irq(iommu->irqs[riscv_iommu_queue_vec(iommu, queue->qid)], queue);
> > > + riscv_iommu_writel(iommu, queue->qcr, 0);
> > > + riscv_iommu_readl_timeout(iommu, queue->qcr,
> > > + csr, !(csr & RISCV_IOMMU_QUEUE_BUSY),
> > > + 10, RISCV_IOMMU_QCSR_TIMEOUT);
> > > +
> > > + if (csr & (RISCV_IOMMU_QUEUE_ACTIVE | RISCV_IOMMU_QUEUE_BUSY))
> > > + dev_err(iommu->dev, "fail to disable hardware queue #%u, csr 0x%x\n",
> > > + queue->qid, csr);
> > > +
> > > + queue->iommu = NULL;
> > > +}
> > > +
> > > +/*
> > > + * Returns number of available valid queue entries and the first item index.
> > > + * Update shadow producer index if necessary.
> > > + */
> > > +static int riscv_iommu_queue_consume(struct riscv_iommu_queue *queue,
> > > + unsigned int *index)
> > > +{
> > > + unsigned int head = atomic_read(&queue->head);
> > > + unsigned int tail = atomic_read(&queue->tail);
> > > + unsigned int last = Q_ITEM(queue, tail);
> > > + int available = (int)(tail - head);
> > > +
> > > + *index = head;
> > > +
> > > + if (available > 0)
> > > + return available;
> > > +
> > > + /* read hardware producer index, check reserved register bits are not set. */
> > > + if (riscv_iommu_readl_timeout(queue->iommu, Q_TAIL(queue),
> > > + tail, (tail & ~queue->mask) == 0,
> > > + 0, RISCV_IOMMU_QUEUE_TIMEOUT)) {
> > > + dev_err_once(queue->iommu->dev,
> > > + "Hardware error: queue access timeout\n");
> > > + return 0;
> > > + }
> > > +
> > > + if (tail == last)
> > > + return 0;
> > > +
> > > + /* update shadow producer index */
> > > + return (int)(atomic_add_return((tail - last) & queue->mask, &queue->tail) - head);
> > > +}
> > > +
> > > +/*
> > > + * Release processed queue entries, should match riscv_iommu_queue_consume() calls.
> > > + */
> > > +static void riscv_iommu_queue_release(struct riscv_iommu_queue *queue, int count)
> > > +{
> > > + const unsigned int head = atomic_add_return(count, &queue->head);
> > > +
> > > + riscv_iommu_writel(queue->iommu, Q_HEAD(queue), Q_ITEM(queue, head));
> > > +}
> > > +
> > > +/* Return actual consumer index based on hardware reported queue head index. */
> > > +static unsigned int riscv_iommu_queue_cons(struct riscv_iommu_queue *queue)
> > > +{
> > > + const unsigned int cons = atomic_read(&queue->head);
> > > + const unsigned int last = Q_ITEM(queue, cons);
> > > + unsigned int head;
> > > +
> > > + if (riscv_iommu_readl_timeout(queue->iommu, Q_HEAD(queue), head,
> > > + !(head & ~queue->mask),
> > > + 0, RISCV_IOMMU_QUEUE_TIMEOUT))
> > > + return cons;
> > > +
> > > + return cons + ((head - last) & queue->mask);
> > > +}
> > > +
> > > +/* Wait for submitted item to be processed. */
> > > +static int riscv_iommu_queue_wait(struct riscv_iommu_queue *queue,
> > > + unsigned int index,
> > > + unsigned int timeout_us)
> > > +{
> > > + unsigned int cons = atomic_read(&queue->head);
> > > +
> > > + /* Already processed by the consumer */
> > > + if ((int)(cons - index) > 0)
> > > + return 0;
> > > +
> > > + /* Monitor consumer index */
> > > + return readx_poll_timeout(riscv_iommu_queue_cons, queue, cons,
> > > + (int)(cons - index) > 0, 0, timeout_us);
> > > +}
> >
> > Apart from iofence command, it seems to me that we might not be able
> > to use the head pointer to determine whether the command is executed,
> > because spec mentioned that the IOMMU advancing cqh is not a guarantee
> > that the commands fetched by the IOMMU have been executed or
> > committed. For iofence completion, perhaps using ADD and AV to write
> > memory would be better as well.
> >
>
> I'll update the comment to the function that this is *only* allowed to
> track IOFENCE.C completion. Call to riscv_iommu_queue_wait() is used
> only for this purpose (unless specification will be extended to track
> other commands with head/tail updates).

I'm not sure whether checking the opcode there is a good idea, but
adding a comment is OK with me. Thanks.
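
For anyone following the head-tracking discussion, here is a minimal
user-space sketch of the wrap-safe index arithmetic that
riscv_iommu_queue_wait() and riscv_iommu_queue_cons() rely on. The
function names entry_consumed() and advance_shadow() are hypothetical
and exist only for this illustration; this is a model of the
comparison, not the driver code, and it says nothing about whether the
hardware has committed a command (per the thread, only IOFENCE.C is
tracked this way).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: 'cons' and 'index' are unbounded 32-bit counters.
 * The signed difference stays meaningful across unsigned wrap-around,
 * as long as the two counters are less than 2^31 apart.
 */
static bool entry_consumed(unsigned int cons, unsigned int index)
{
	return (int)(cons - index) > 0;
}

/*
 * Hypothetical model: fold a masked hardware index 'hw' into the
 * unbounded shadow counter. 'mask' must be (queue length - 1), with
 * the queue length a power of two.
 */
static unsigned int advance_shadow(unsigned int shadow, unsigned int hw,
				   unsigned int mask)
{
	assert((mask & (mask + 1)) == 0);
	return shadow + ((hw - (shadow & mask)) & mask);
}
```

With an 8-entry queue (mask 7), a shadow counter of 6 and a hardware
head of 1 advance the shadow counter by three slots, to 9, and the
entry submitted at index 8 then counts as consumed.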

>
> Previous version of the command tracking used address/AV notifier, but
> after several iterations and testing we've considered tracking
> head/tail as best approach.

Thanks for bringing the information back here again.

>
> > > +
> > > +/* Enqueue an entry and wait to be processed if timeout_us > 0
> > > + *
> > > + * Error handling for IOMMU hardware not responding in reasonable time
> > > + * will be added as separate patch series along with other RAS features.
> > > + * For now, only report hardware failure and continue.
> > > + */
> > > +static void riscv_iommu_queue_send(struct riscv_iommu_queue *queue,
> > > + void *entry, size_t entry_size,
> > > + unsigned int timeout_us)
> > > +{
> > > + unsigned int prod;
> > > + unsigned int head;
> > > + unsigned int tail;
> > > + unsigned long flags;
> > > +
> > > + /* Do not preempt submission flow. */
> > > + local_irq_save(flags);
> > > +
> > > + /* 1. Allocate some space in the queue */
> > > + prod = atomic_inc_return(&queue->prod) - 1;
> > > + head = atomic_read(&queue->head);
> > > +
> > > + /* 2. Wait for space availability. */
> > > + if ((prod - head) > queue->mask) {
> > > + if (readx_poll_timeout(atomic_read, &queue->head,
> > > + head, (prod - head) < queue->mask,
> > > + 0, RISCV_IOMMU_QUEUE_TIMEOUT))
> > > + goto err_busy;
> > > + } else if ((prod - head) == queue->mask) {
> > > + const unsigned int last = Q_ITEM(queue, head);
> > > +
> > > + if (riscv_iommu_readl_timeout(queue->iommu, Q_HEAD(queue), head,
> > > + !(head & ~queue->mask) && head != last,
> > > + 0, RISCV_IOMMU_QUEUE_TIMEOUT))
> > > + goto err_busy;
> > > + atomic_add((head - last) & queue->mask, &queue->head);
> > > + }
> > > +
> > > + /* 3. Store entry in the ring buffer. */
> > > + memcpy(queue->base + Q_ITEM(queue, prod) * entry_size, entry, entry_size);
> > > +
> > > + /* 4. Wait for all previous entries to be ready */
> > > + if (readx_poll_timeout(atomic_read, &queue->tail, tail, prod == tail,
> > > + 0, RISCV_IOMMU_QUEUE_TIMEOUT))
> > > + goto err_busy;
> > > +
> > > + /* 5. Complete submission and restore local interrupts */
> > > + dma_wmb();
> > > + riscv_iommu_writel(queue->iommu, Q_TAIL(queue), Q_ITEM(queue, prod + 1));
> > > + atomic_inc(&queue->tail);
> > > + local_irq_restore(flags);
> > > +
> > > + if (timeout_us && riscv_iommu_queue_wait(queue, prod, timeout_us))
> > > + dev_err_once(queue->iommu->dev,
> > > + "Hardware error: command execution timeout\n");
> > > +
> > > + return;
> > > +
> > > +err_busy:
> > > + local_irq_restore(flags);
> > > + dev_err_once(queue->iommu->dev, "Hardware error: command enqueue failed\n");
> > > +}
> > > +
> > > +/*
> > > + * IOMMU Command queue chapter 3.1
> > > + */
> > > +
> > > +/* Command queue interrupt handler thread function */
> > > +static irqreturn_t riscv_iommu_cmdq_process(int irq, void *data)
> > > +{
> > > + const struct riscv_iommu_queue *queue = (struct riscv_iommu_queue *)data;
> > > + unsigned int ctrl;
> > > +
> > > + /* Clear MF/CQ errors, complete error recovery to be implemented. */
> > > + ctrl = riscv_iommu_readl(queue->iommu, queue->qcr);
> > > + if (ctrl & (RISCV_IOMMU_CQCSR_CQMF | RISCV_IOMMU_CQCSR_CMD_TO |
> > > + RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_FENCE_W_IP)) {
> > > + riscv_iommu_writel(queue->iommu, queue->qcr, ctrl);
> >
> > For RISCV_IOMMU_CQCSR_CMD_ILL and RISCV_IOMMU_CQCSR_CMD_TO, I think we
> > need to adjust the head/tail pointer, otherwise, IOMMU might keep
> > trying to execute the problematic command.
> >
> > > + dev_warn(queue->iommu->dev,
> > > + "Queue #%u error; fault:%d timeout:%d illegal:%d fence_w_ip:%d\n",
> >
> > It might be a bit weird if we view fence_w_ip as queue error and print
> > the message as error?
>
> Driver does not set WSI bit in IOFENCE.C command structure, so it's
> unexpected to see fence_w_ip set. This message is warning only about
> unexpected command queue interrupt, listing state of possible
> triggers.

Okay, perhaps we can adjust it in the future when we actually enable
the fence_w_ip. Thanks
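
If fence_w_ip does get enabled later, one possible shape for that
adjustment is to classify the status bits before reporting, so that a
fence notification is not logged as a queue error. The sketch below is
only an illustration under assumptions: the CQCSR_* bit positions are
written out here for the example and should be taken from iommu-bits.h
in real code, and struct cq_status and cq_classify() are hypothetical
names that do not exist in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only; see iommu-bits.h. */
#define CQCSR_CQMF		(1u << 8)	/* memory fault */
#define CQCSR_CMD_TO		(1u << 9)	/* command timeout */
#define CQCSR_CMD_ILL		(1u << 10)	/* illegal command */
#define CQCSR_FENCE_W_IP	(1u << 11)	/* IOFENCE.C WSI notification */

#define CQCSR_ERRORS	(CQCSR_CQMF | CQCSR_CMD_TO | CQCSR_CMD_ILL)

/* Hypothetical result type: what the handler would act on. */
struct cq_status {
	bool error;		/* at least one real error bit is set */
	bool fence_done;	/* fence notification, not an error */
};

/* Split the raw control/status value into errors and notifications. */
static struct cq_status cq_classify(uint32_t ctrl)
{
	struct cq_status st = {
		.error = (ctrl & CQCSR_ERRORS) != 0,
		.fence_done = (ctrl & CQCSR_FENCE_W_IP) != 0,
	};

	return st;
}
```

The handler could then warn only when st.error is set and treat
st.fence_done as a normal completion event.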

>
> >
> > > + queue->qid,
> > > + !!(ctrl & RISCV_IOMMU_CQCSR_CQMF),
> > > + !!(ctrl & RISCV_IOMMU_CQCSR_CMD_TO),
> > > + !!(ctrl & RISCV_IOMMU_CQCSR_CMD_ILL),
> > > + !!(ctrl & RISCV_IOMMU_CQCSR_FENCE_W_IP));
> > > + }
> > > +
> > > + /* Placeholder for command queue interrupt notifiers */
> > > +
> > > + /* Clear command interrupt pending. */
> > > + riscv_iommu_writel(queue->iommu, RISCV_IOMMU_REG_IPSR, Q_IPSR(queue));
> > > +
> > > + return IRQ_HANDLED;
> > > +}
> > > +
> > > +/* Send command to the IOMMU command queue */
> > > +static void riscv_iommu_cmd_send(struct riscv_iommu_device *iommu,
> > > + struct riscv_iommu_command *cmd,
> > > + unsigned int timeout_us)
> > > +{
> > > + riscv_iommu_queue_send(&iommu->cmdq, cmd, sizeof(*cmd), timeout_us);
> > > +}
> > > +
> > > +/*
> > > + * IOMMU Fault/Event queue chapter 3.2
> > > + */
> > > +
> > > +static void riscv_iommu_fault(struct riscv_iommu_device *iommu,
> > > + struct riscv_iommu_fq_record *event)
> > > +{
> > > + unsigned int err = FIELD_GET(RISCV_IOMMU_FQ_HDR_CAUSE, event->hdr);
> > > + unsigned int devid = FIELD_GET(RISCV_IOMMU_FQ_HDR_DID, event->hdr);
> > > +
> > > + /* Placeholder for future fault handling implementation, report only. */
> > > + if (err)
> > > + dev_warn_ratelimited(iommu->dev,
> > > + "Fault %d devid: 0x%x iotval: %llx iotval2: %llx\n",
> > > + err, devid, event->iotval, event->iotval2);
> > > +}
> > > +
> > > +/* Fault queue interrupt handler thread function */
> > > +static irqreturn_t riscv_iommu_fltq_process(int irq, void *data)
> > > +{
> > > + struct riscv_iommu_queue *queue = (struct riscv_iommu_queue *)data;
> > > + struct riscv_iommu_device *iommu = queue->iommu;
> > > + struct riscv_iommu_fq_record *events;
> > > + unsigned int ctrl, idx;
> > > + int cnt, len;
> > > +
> > > + events = (struct riscv_iommu_fq_record *)queue->base;
> > > +
> > > + /* Clear fault interrupt pending and process all received fault events. */
> > > + riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, Q_IPSR(queue));
> > > +
> > > + do {
> > > + cnt = riscv_iommu_queue_consume(queue, &idx);
> > > + for (len = 0; len < cnt; idx++, len++)
> > > + riscv_iommu_fault(iommu, &events[Q_ITEM(queue, idx)]);
> > > + riscv_iommu_queue_release(queue, cnt);
> > > + } while (cnt > 0);
> > > +
> > > + /* Clear MF/OF errors, complete error recovery to be implemented. */
> > > + ctrl = riscv_iommu_readl(iommu, queue->qcr);
> > > + if (ctrl & (RISCV_IOMMU_FQCSR_FQMF | RISCV_IOMMU_FQCSR_FQOF)) {
> > > + riscv_iommu_writel(iommu, queue->qcr, ctrl);
> > > + dev_warn(iommu->dev,
> > > + "Queue #%u error; memory fault:%d overflow:%d\n",
> > > + queue->qid,
> > > + !!(ctrl & RISCV_IOMMU_FQCSR_FQMF),
> > > + !!(ctrl & RISCV_IOMMU_FQCSR_FQOF));
> > > + }
> > > +
> > > + return IRQ_HANDLED;
> > > +}
> > > +
> > > /* Lookup and initialize device context info structure. */
> > > static struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iommu,
> > > unsigned int devid)
> > > @@ -250,6 +683,7 @@ static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
> > > struct device *dev = iommu->dev;
> > > u64 ddtp, rq_ddtp;
> > > unsigned int mode, rq_mode = ddtp_mode;
> > > + struct riscv_iommu_command cmd;
> > >
> > > ddtp = riscv_iommu_read_ddtp(iommu);
> > > if (ddtp & RISCV_IOMMU_DDTP_BUSY)
> > > @@ -317,6 +751,18 @@ static int riscv_iommu_iodir_set_mode(struct riscv_iommu_device *iommu,
> > > if (mode != ddtp_mode)
> > > dev_dbg(dev, "DDTP hw mode %u, requested %u\n", mode, ddtp_mode);
> > >
> > > + /* Invalidate device context cache */
> > > + riscv_iommu_cmd_iodir_inval_ddt(&cmd);
> > > + riscv_iommu_cmd_send(iommu, &cmd, 0);
> > > +
> > > + /* Invalidate address translation cache */
> > > + riscv_iommu_cmd_inval_vma(&cmd);
> > > + riscv_iommu_cmd_send(iommu, &cmd, 0);
> > > +
> > > + /* IOFENCE.C */
> > > + riscv_iommu_cmd_iofence(&cmd);
> > > + riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_IOTINVAL_TIMEOUT);
> > > +
> > > return 0;
> > > }
> > >
> > > @@ -492,6 +938,26 @@ static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
> > > return -EINVAL;
> > > }
> > >
> > > + /*
> > > + * Distribute interrupt vectors, always use first vector for CIV.
> > > + * At least one interrupt is required. Read back and verify.
> > > + */
> > > + if (!iommu->irqs_count)
> > > + return -EINVAL;
> > > +
> > > + iommu->ivec = 0;
> > > + iommu->ivec |= FIELD_PREP(RISCV_IOMMU_IVEC_FIV, 1 % iommu->irqs_count);
> > > + iommu->ivec |= FIELD_PREP(RISCV_IOMMU_IVEC_PIV, 2 % iommu->irqs_count);
> > > + iommu->ivec |= FIELD_PREP(RISCV_IOMMU_IVEC_PMIV, 3 % iommu->irqs_count);
> > > + riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_IVEC, iommu->ivec);
> > > +
> > > + iommu->ivec = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_IVEC);
> > > + if (riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IVEC_CIV) >= RISCV_IOMMU_INTR_COUNT ||
> > > + riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IVEC_FIV) >= RISCV_IOMMU_INTR_COUNT ||
> > > + riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IVEC_PIV) >= RISCV_IOMMU_INTR_COUNT ||
> > > + riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IVEC_PMIV) >= RISCV_IOMMU_INTR_COUNT)
> > > + return -EINVAL;
> > > +
> > > return 0;
> > > }
> > >
> > > @@ -500,12 +966,17 @@ void riscv_iommu_remove(struct riscv_iommu_device *iommu)
> > > iommu_device_unregister(&iommu->iommu);
> > > iommu_device_sysfs_remove(&iommu->iommu);
> > > riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
> > > + riscv_iommu_queue_disable(&iommu->cmdq);
> > > + riscv_iommu_queue_disable(&iommu->fltq);
> > > }
> > >
> > > int riscv_iommu_init(struct riscv_iommu_device *iommu)
> > > {
> > > int rc;
> > >
> > > + RISCV_IOMMU_QUEUE_INIT(&iommu->cmdq, CQ);
> > > + RISCV_IOMMU_QUEUE_INIT(&iommu->fltq, FQ);
> > > +
> > > rc = riscv_iommu_init_check(iommu);
> > > if (rc)
> > > return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
> > > @@ -514,10 +985,28 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
> > > if (rc)
> > > return rc;
> > >
> > > - rc = riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
> > > + rc = riscv_iommu_queue_alloc(iommu, &iommu->cmdq,
> > > + sizeof(struct riscv_iommu_command));
> > > + if (rc)
> > > + return rc;
> > > +
> > > + rc = riscv_iommu_queue_alloc(iommu, &iommu->fltq,
> > > + sizeof(struct riscv_iommu_fq_record));
> > > + if (rc)
> > > + return rc;
> > > +
> > > + rc = riscv_iommu_queue_enable(iommu, &iommu->cmdq, riscv_iommu_cmdq_process);
> > > if (rc)
> > > return rc;
> > >
> > > + rc = riscv_iommu_queue_enable(iommu, &iommu->fltq, riscv_iommu_fltq_process);
> > > + if (rc)
> > > + goto err_queue_disable;
> > > +
> > > + rc = riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
> > > + if (rc)
> > > + goto err_queue_disable;
> > > +
> > > rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
> > > dev_name(iommu->dev));
> > > if (rc) {
> > > @@ -537,5 +1026,8 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
> > > iommu_device_sysfs_remove(&iommu->iommu);
> > > err_iodir_off:
> > > riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
> > > +err_queue_disable:
> > > + riscv_iommu_queue_disable(&iommu->fltq);
> > > + riscv_iommu_queue_disable(&iommu->cmdq);
> > > return rc;
> > > }
> > > diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
> > > index f1696926582c..03e0c45bc7e1 100644
> > > --- a/drivers/iommu/riscv/iommu.h
> > > +++ b/drivers/iommu/riscv/iommu.h
> > > @@ -17,6 +17,22 @@
> > >
> > > #include "iommu-bits.h"
> > >
> > > +struct riscv_iommu_device;
> > > +
> > > +struct riscv_iommu_queue {
> > > + atomic_t prod; /* unbounded producer allocation index */
> > > + atomic_t head; /* unbounded shadow ring buffer consumer index */
> > > + atomic_t tail; /* unbounded shadow ring buffer producer index */
> > > + unsigned int mask; /* index mask, queue length - 1 */
> > > + unsigned int irq; /* allocated interrupt number */
> > > + struct riscv_iommu_device *iommu; /* iommu device handling the queue when active */
> > > + void *base; /* ring buffer kernel pointer */
> > > + dma_addr_t phys; /* ring buffer physical address */
> > > + u16 qbr; /* base register offset, head and tail reference */
> > > + u16 qcr; /* control and status register offset */
> > > + u8 qid; /* queue identifier, same as RISCV_IOMMU_INTR_XX */
> > > +};
> > > +
> > > struct riscv_iommu_device {
> > > /* iommu core interface */
> > > struct iommu_device iommu;
> > > @@ -34,6 +50,11 @@ struct riscv_iommu_device {
> > > /* available interrupt numbers, MSI or WSI */
> > > unsigned int irqs[RISCV_IOMMU_INTR_COUNT];
> > > unsigned int irqs_count;
> > > + unsigned int ivec;
> > > +
> > > + /* hardware queues */
> > > + struct riscv_iommu_queue cmdq;
> > > + struct riscv_iommu_queue fltq;
> > >
> > > /* device directory */
> > > unsigned int ddt_mode;
> > > --
> > > 2.34.1
> > >
>
> Best,
> - Tomasz
>