2021-06-26 11:03:18

by Zhen Lei

Subject: [PATCH RFC 0/8] iommu/arm-smmu-v3: add support for ECMDQ register mode

SMMU v3.3 added a new feature: an Enhanced Command queue interface for reducing
contention when submitting commands to the SMMU. In this patch set, ECMDQ is
the abbreviation of Enhanced Command Queue.

When the hardware supports ECMDQ and each core can exclusively use one ECMDQ,
a core does not need to compete with other cores when using its own ECMDQ.
This means that each core can insert commands in parallel. If the ECMDQs could
also execute commands in parallel, the overall performance might be even better.
However, our hardware currently does not support executing commands from
multiple ECMDQs in parallel.

In order to reuse existing code, I originally still call arm_smmu_cmdq_issue_cmdlist()
to insert commands. Even so, there was a performance improvement of nearly 12%
in strict mode.

The test environment is the EMU, which simulates the connection of a 200 Gbit/s NIC.

Number of queues:  passthrough  lazy  strict(ECMDQ)  strict(CMDQ)
               6           188   180            162           145  --> 11.7% improvement
               8           188   188            184           183  -->  0.55% improvement

In recent days, I implemented a new, contention-free insertion function to
replace arm_smmu_cmdq_issue_cmdlist() for the case where a core owns an ECMDQ
exclusively. I expect it to give better performance results, but because the
EMU is very slow, it will take a while before the relevant data is available.


Zhen Lei (8):
iommu/arm-smmu-v3: Use command queue batching helpers to improve
performance
iommu/arm-smmu-v3: Add and use static helper function
arm_smmu_cmdq_issue_cmd_with_sync()
iommu/arm-smmu-v3: Add and use static helper function
arm_smmu_get_cmdq()
iommu/arm-smmu-v3: Extract reusable function
__arm_smmu_cmdq_skip_err()
iommu/arm-smmu-v3: Add support for ECMDQ register mode
iommu/arm-smmu-v3: Ensure that a set of associated commands are
inserted in the same ECMDQ
iommu/arm-smmu-v3: Add arm_smmu_ecmdq_issue_cmdlist() for non-shared
ECMDQ
iommu/arm-smmu-v3: Add support for less than one ECMDQ per core

drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 483 ++++++++++++++++++--
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 37 ++
2 files changed, 489 insertions(+), 31 deletions(-)

--
2.26.0.106.g9fadedd



2021-06-26 11:03:56

by Zhen Lei

Subject: [PATCH RFC 2/8] iommu/arm-smmu-v3: Add and use static helper function arm_smmu_cmdq_issue_cmd_with_sync()

The obvious key to the performance optimization of commit 587e6c10a7ce
("iommu/arm-smmu-v3: Reduce contention during command-queue insertion") is
to allow multiple cores to insert commands in parallel after a brief mutex
contention.

Obviously, inserting as many commands as possible in one call reduces the
number of times the mutex has to be contended, thereby improving overall
performance. At the very least, it reduces the number of calls to
arm_smmu_cmdq_issue_cmdlist().

Therefore, the function arm_smmu_cmdq_issue_cmd_with_sync() is added to insert
the 'cmd+sync' pair of commands in a single call.

Signed-off-by: Zhen Lei <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 33 +++++++++++++--------
1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 2433d3c29b49ff2..a5361153ca1d6a4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -858,11 +858,25 @@ static int arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, false);
}

-static int arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
+static int __maybe_unused arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
{
return arm_smmu_cmdq_issue_cmdlist(smmu, NULL, 0, true);
}

+static int arm_smmu_cmdq_issue_cmd_with_sync(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq_ent *ent)
+{
+ u64 cmd[CMDQ_ENT_DWORDS];
+
+ if (arm_smmu_cmdq_build_cmd(cmd, ent)) {
+ dev_warn(smmu->dev, "ignoring unknown CMDQ opcode 0x%x\n",
+ ent->opcode);
+ return -EINVAL;
+ }
+
+ return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, true);
+}
+
static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq_batch *cmds,
struct arm_smmu_cmdq_ent *cmd)
@@ -928,8 +942,7 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
.tlbi.asid = asid,
};

- arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- arm_smmu_cmdq_issue_sync(smmu);
+ arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
}

static void arm_smmu_sync_cd(struct arm_smmu_domain *smmu_domain,
@@ -1210,8 +1223,7 @@ static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
},
};

- arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- arm_smmu_cmdq_issue_sync(smmu);
+ arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
}

static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
@@ -1823,8 +1835,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
} else {
cmd.opcode = CMDQ_OP_TLBI_S12_VMALL;
cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
- arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- arm_smmu_cmdq_issue_sync(smmu);
+ arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
}
arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
}
@@ -3338,18 +3349,16 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)

/* Invalidate any cached configuration */
cmd.opcode = CMDQ_OP_CFGI_ALL;
- arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- arm_smmu_cmdq_issue_sync(smmu);
+ arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);

/* Invalidate any stale TLB entries */
if (smmu->features & ARM_SMMU_FEAT_HYP) {
cmd.opcode = CMDQ_OP_TLBI_EL2_ALL;
- arm_smmu_cmdq_issue_cmd(smmu, &cmd);
+ arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
}

cmd.opcode = CMDQ_OP_TLBI_NSNH_ALL;
- arm_smmu_cmdq_issue_cmd(smmu, &cmd);
- arm_smmu_cmdq_issue_sync(smmu);
+ arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);

/* Event queue */
writeq_relaxed(smmu->evtq.q.q_base, smmu->base + ARM_SMMU_EVTQ_BASE);
--
2.26.0.106.g9fadedd


2021-06-26 11:04:10

by Zhen Lei

Subject: [PATCH RFC 6/8] iommu/arm-smmu-v3: Ensure that a set of associated commands are inserted in the same ECMDQ

A SYNC command only ensures that the commands that precede it in the same
ECMDQ have been executed; it cannot synchronize commands in other ECMDQs. If
an unmap involves multiple commands and some of them are inserted on one core
while the others are inserted on another core, then completion of the SYNC
does not guarantee that all of the preceding commands have been executed.

Preventing the task that inserts a set of associated commands from being
migrated to another core ensures that all of those commands are inserted into
the same ECMDQ.
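
Since arm_smmu_get_cmdq() selects the ECMDQ with this_cpu_ptr(), keeping the
task on one CPU for the whole insertion window is sufficient. A minimal sketch
of the pattern this patch applies at each call site (the batch variables are
illustrative; arm_smmu_preempt_disable()/enable() are the helpers added below):

	arm_smmu_preempt_disable(smmu);		/* no-op when ECMDQ is disabled */

	/* build all associated commands on this CPU's ECMDQ */
	arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
	/* ... more arm_smmu_cmdq_batch_add() calls ... */
	arm_smmu_cmdq_batch_submit(smmu, &cmds);	/* appends the CMD_SYNC */

	arm_smmu_preempt_enable(smmu);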

Signed-off-by: Zhen Lei <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 40 +++++++++++++++++----
1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d7b590e911a879d..d5205030710bd1a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -233,6 +233,18 @@ static int queue_remove_raw(struct arm_smmu_queue *q, u64 *ent)
return 0;
}

+static void arm_smmu_preempt_disable(struct arm_smmu_device *smmu)
+{
+ if (smmu->ecmdq_enabled)
+ preempt_disable();
+}
+
+static void arm_smmu_preempt_enable(struct arm_smmu_device *smmu)
+{
+ if (smmu->ecmdq_enabled)
+ preempt_enable();
+}
+
/* High-level queue accessors */
static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
{
@@ -1016,6 +1028,7 @@ static void arm_smmu_sync_cd(struct arm_smmu_domain *smmu_domain,
},
};

+ arm_smmu_preempt_disable(smmu);
spin_lock_irqsave(&smmu_domain->devices_lock, flags);
list_for_each_entry(master, &smmu_domain->devices, domain_head) {
for (i = 0; i < master->num_streams; i++) {
@@ -1026,6 +1039,7 @@ static void arm_smmu_sync_cd(struct arm_smmu_domain *smmu_domain,
spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);

arm_smmu_cmdq_batch_submit(smmu, &cmds);
+ arm_smmu_preempt_enable(smmu);
}

static int arm_smmu_alloc_cd_leaf_table(struct arm_smmu_device *smmu,
@@ -1814,30 +1828,36 @@ arm_smmu_atc_inv_to_cmd(int ssid, unsigned long iova, size_t size,

static int arm_smmu_atc_inv_master(struct arm_smmu_master *master)
{
- int i;
+ int i, ret;
struct arm_smmu_cmdq_ent cmd;
struct arm_smmu_cmdq_batch cmds = {};
+ struct arm_smmu_device *smmu = master->smmu;

arm_smmu_atc_inv_to_cmd(0, 0, 0, &cmd);

+ arm_smmu_preempt_disable(smmu);
for (i = 0; i < master->num_streams; i++) {
cmd.atc.sid = master->streams[i].id;
- arm_smmu_cmdq_batch_add(master->smmu, &cmds, &cmd);
+ arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
}

- return arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
+ ret = arm_smmu_cmdq_batch_submit(smmu, &cmds);
+ arm_smmu_preempt_enable(smmu);
+
+ return ret;
}

int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid,
unsigned long iova, size_t size)
{
- int i;
+ int i, ret;
unsigned long flags;
struct arm_smmu_cmdq_ent cmd;
struct arm_smmu_master *master;
struct arm_smmu_cmdq_batch cmds = {};
+ struct arm_smmu_device *smmu = smmu_domain->smmu;

- if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS))
+ if (!(smmu->features & ARM_SMMU_FEAT_ATS))
return 0;

/*
@@ -1859,6 +1879,7 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid,

arm_smmu_atc_inv_to_cmd(ssid, iova, size, &cmd);

+ arm_smmu_preempt_disable(smmu);
spin_lock_irqsave(&smmu_domain->devices_lock, flags);
list_for_each_entry(master, &smmu_domain->devices, domain_head) {
if (!master->ats_enabled)
@@ -1866,12 +1887,15 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid,

for (i = 0; i < master->num_streams; i++) {
cmd.atc.sid = master->streams[i].id;
- arm_smmu_cmdq_batch_add(smmu_domain->smmu, &cmds, &cmd);
+ arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
}
}
spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);

- return arm_smmu_cmdq_batch_submit(smmu_domain->smmu, &cmds);
+ ret = arm_smmu_cmdq_batch_submit(smmu, &cmds);
+ arm_smmu_preempt_enable(smmu);
+
+ return ret;
}

/* IO_PGTABLE API */
@@ -1924,6 +1948,7 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
num_pages = size >> tg;
}

+ arm_smmu_preempt_disable(smmu);
while (iova < end) {
if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
/*
@@ -1955,6 +1980,7 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
iova += inv_range;
}
arm_smmu_cmdq_batch_submit(smmu, &cmds);
+ arm_smmu_preempt_enable(smmu);
}

static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
--
2.26.0.106.g9fadedd


2021-06-26 11:04:16

by Zhen Lei

Subject: [PATCH RFC 8/8] iommu/arm-smmu-v3: Add support for less than one ECMDQ per core

Due to limited hardware resources, the number of ECMDQs may be less than the
number of cores. If the number of ECMDQs is greater than the number of NUMA
nodes, ensure that each node has at least one ECMDQ. This is because ECMDQ
queue memory is allocated from the NUMA node on which it is used, which may
give better command filling and insertion performance.

The current ECMDQ implementation reuses the command insertion function
arm_smmu_cmdq_issue_cmdlist() of the normal CMDQ, which already supports
concurrent insertion of commands by multiple cores.
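
As a worked example of the layout algorithm below (the numbers are
hypothetical): with 96 possible CPUs spread evenly over 2 NUMA nodes and 32
ECMDQs, one ECMDQ is first reserved per node, leaving 30. Each node then
receives 48 * 30 / 96 = 15 of the remainder and, with the reserved one added
back, ends up with 16 ECMDQs serving its 48 CPUs, so every ECMDQ is shared by
3 CPUs and is marked 'shared'.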

Signed-off-by: Zhen Lei <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 101 ++++++++++++++++++--
1 file changed, 92 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index a088f2479fc6223..55f651ce42e7a51 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3636,14 +3636,15 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)

static int arm_smmu_ecmdq_layout(struct arm_smmu_device *smmu)
{
- int cpu;
- struct arm_smmu_ecmdq *ecmdq;
+ int cpu, node, nr_remain, nr_nodes = 0;
+ int *nr_ecmdqs;
+ struct arm_smmu_ecmdq *ecmdq, **ecmdqs;

- if (num_possible_cpus() <= smmu->nr_ecmdq) {
- ecmdq = devm_alloc_percpu(smmu->dev, *ecmdq);
- if (!ecmdq)
- return -ENOMEM;
+ ecmdq = devm_alloc_percpu(smmu->dev, *ecmdq);
+ if (!ecmdq)
+ return -ENOMEM;

+ if (num_possible_cpus() <= smmu->nr_ecmdq) {
for_each_possible_cpu(cpu)
*per_cpu_ptr(smmu->ecmdq, cpu) = per_cpu_ptr(ecmdq, cpu);

@@ -3653,7 +3654,79 @@ static int arm_smmu_ecmdq_layout(struct arm_smmu_device *smmu)
return 0;
}

- return -ENOSPC;
+ for_each_node(node)
+ if (nr_cpus_node(node))
+ nr_nodes++;
+
+ if (nr_nodes >= smmu->nr_ecmdq) {
+ dev_err(smmu->dev, "%d ECMDQs is less than %d nodes\n", smmu->nr_ecmdq, nr_nodes);
+ return -ENOSPC;
+ }
+
+ nr_ecmdqs = kcalloc(MAX_NUMNODES, sizeof(int), GFP_KERNEL);
+ if (!nr_ecmdqs)
+ return -ENOMEM;
+
+ ecmdqs = kcalloc(smmu->nr_ecmdq, sizeof(*ecmdqs), GFP_KERNEL);
+ if (!ecmdqs) {
+ kfree(nr_ecmdqs);
+ return -ENOMEM;
+ }
+
+ /* [1] Ensure that each node has at least one ECMDQ */
+ nr_remain = smmu->nr_ecmdq - nr_nodes;
+ for_each_node(node) {
+ /*
+ * Calculate the number of ECMDQs to be allocated to this node.
+ * NR_ECMDQS_PER_CPU = nr_remain / num_possible_cpus();
+ * When nr_cpus_node(node) is not zero, less than one ECMDQ
+ * may be left due to truncation rounding.
+ */
+ nr_ecmdqs[node] = nr_cpus_node(node) * nr_remain / num_possible_cpus();
+ nr_remain -= nr_ecmdqs[node];
+ }
+
+ /* Divide the remaining ECMDQs */
+ while (nr_remain) {
+ for_each_node(node) {
+ if (!nr_remain)
+ break;
+
+ if (nr_ecmdqs[node] >= nr_cpus_node(node))
+ continue;
+
+ nr_ecmdqs[node]++;
+ nr_remain--;
+ }
+ }
+
+ for_each_node(node) {
+ int i, round, shared = 0;
+
+ if (!nr_cpus_node(node))
+ continue;
+
+ /* An ECMDQ has been reserved for each node at above [1] */
+ nr_ecmdqs[node]++;
+
+ if (nr_ecmdqs[node] < nr_cpus_node(node))
+ shared = 1;
+
+ i = 0;
+ for_each_cpu(cpu, cpumask_of_node(node)) {
+ round = i % nr_ecmdqs[node];
+ if (i++ < nr_ecmdqs[node]) {
+ ecmdqs[round] = per_cpu_ptr(ecmdq, cpu);
+ ecmdqs[round]->cmdq.shared = shared;
+ }
+ *per_cpu_ptr(smmu->ecmdq, cpu) = ecmdqs[round];
+ }
+ }
+
+ kfree(nr_ecmdqs);
+ kfree(ecmdqs);
+
+ return 0;
}

static int arm_smmu_ecmdq_probe(struct arm_smmu_device *smmu)
@@ -3718,10 +3791,20 @@ static int arm_smmu_ecmdq_probe(struct arm_smmu_device *smmu)
struct arm_smmu_queue *q;

ecmdq = *per_cpu_ptr(smmu->ecmdq, cpu);
- ecmdq->base = cp_base + addr;
-
q = &ecmdq->cmdq.q;

+ /*
+ * The boot option "maxcpus=" can limit the number of online
+ * CPUs. The CPUs that are not selected are not showed in
+ * cpumask_of_node(node), their 'ecmdq' may be NULL.
+ *
+ * (q->ecmdq_prod & ECMDQ_PROD_EN) indicates that the ECMDQ is
+ * shared by multiple cores and has been initialized.
+ */
+ if (!ecmdq || (q->ecmdq_prod & ECMDQ_PROD_EN))
+ continue;
+ ecmdq->base = cp_base + addr;
+
q->llq.max_n_shift = ECMDQ_MAX_SZ_SHIFT + shift_increment;
ret = arm_smmu_init_one_queue(smmu, q, ecmdq->base, ARM_SMMU_ECMDQ_PROD,
ARM_SMMU_ECMDQ_CONS, CMDQ_ENT_DWORDS, "ecmdq");
--
2.26.0.106.g9fadedd


2021-06-26 11:04:20

by Zhen Lei

Subject: [PATCH RFC 7/8] iommu/arm-smmu-v3: Add arm_smmu_ecmdq_issue_cmdlist() for non-shared ECMDQ

When a core exclusively owns an ECMDQ, contention with other cores does not
need to be considered during command insertion. Therefore, we can drop the
parts of arm_smmu_cmdq_issue_cmdlist() that deal with multi-core contention
and provide a more efficient ECMDQ-specific function,
arm_smmu_ecmdq_issue_cmdlist().

Signed-off-by: Zhen Lei <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 85 +++++++++++++++++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
2 files changed, 86 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d5205030710bd1a..a088f2479fc6223 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -769,6 +769,87 @@ static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
}
}

+/*
+ * The function is used when the current core exclusively occupies an ECMDQ.
+ * This is a reduced version of arm_smmu_cmdq_issue_cmdlist(), which eliminates
+ * a lot of unnecessary inter-core competition considerations.
+ */
+static int arm_smmu_ecmdq_issue_cmdlist(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq *cmdq,
+ u64 *cmds, int n, bool sync)
+{
+ u32 prod;
+ unsigned long flags;
+ struct arm_smmu_ll_queue llq = {
+ .max_n_shift = cmdq->q.llq.max_n_shift,
+ }, head;
+ int ret = 0;
+
+ /* 1. Allocate some space in the queue */
+ local_irq_save(flags);
+ llq.val = READ_ONCE(cmdq->q.llq.val);
+ do {
+ u64 old;
+
+ while (!queue_has_space(&llq, n + sync)) {
+ local_irq_restore(flags);
+ if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
+ dev_err_ratelimited(smmu->dev, "ECMDQ timeout\n");
+ local_irq_save(flags);
+ }
+
+ head.cons = llq.cons;
+ head.prod = queue_inc_prod_n(&llq, n + sync);
+
+ old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
+ if (old == llq.val)
+ break;
+
+ llq.val = old;
+ } while (1);
+
+ /* 2. Write our commands into the queue */
+ arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
+ if (sync) {
+ u64 cmd_sync[CMDQ_ENT_DWORDS];
+
+ prod = queue_inc_prod_n(&llq, n);
+ arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, &cmdq->q, prod);
+ queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);
+ }
+
+ /* 3. Ensuring commands are visible first */
+ dma_wmb();
+
+ /* 4. Advance the hardware prod pointer */
+ read_lock(&cmdq->q.ecmdq_lock);
+ writel_relaxed(head.prod | cmdq->q.ecmdq_prod, cmdq->q.prod_reg);
+ read_unlock(&cmdq->q.ecmdq_lock);
+
+ /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
+ if (sync) {
+ llq.prod = queue_inc_prod_n(&llq, n);
+ ret = arm_smmu_cmdq_poll_until_sync(smmu, &llq);
+ if (ret) {
+ dev_err_ratelimited(smmu->dev,
+ "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
+ llq.prod,
+ readl_relaxed(cmdq->q.prod_reg),
+ readl_relaxed(cmdq->q.cons_reg));
+ }
+
+ /*
+ * Update cmdq->q.llq.cons, to improve the success rate of
+ * queue_has_space() when some new commands are inserted next
+ * time.
+ */
+ WRITE_ONCE(cmdq->q.llq.cons, llq.cons);
+ }
+
+ local_irq_restore(flags);
+ return ret;
+}
+
/*
* This is the actual insertion function, and provides the following
* ordering guarantees to callers:
@@ -798,6 +879,9 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
}, head = llq;
int ret = 0;

+ if (!cmdq->shared)
+ return arm_smmu_ecmdq_issue_cmdlist(smmu, cmdq, cmds, n, sync);
+
/* 1. Allocate some space in the queue */
local_irq_save(flags);
llq.val = READ_ONCE(cmdq->q.llq.val);
@@ -3001,6 +3085,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
unsigned int nents = 1 << cmdq->q.llq.max_n_shift;
atomic_long_t *bitmap;

+ cmdq->shared = 1;
atomic_set(&cmdq->owner_prod, 0);
atomic_set(&cmdq->lock, 0);

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 3f3a867a4626fcd..c6efbea3c0a1cda 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -569,6 +569,7 @@ struct arm_smmu_cmdq {
atomic_long_t *valid_map;
atomic_t owner_prod;
atomic_t lock;
+ int shared;
};

struct arm_smmu_ecmdq {
--
2.26.0.106.g9fadedd


2021-06-26 11:04:39

by Zhen Lei

Subject: [PATCH RFC 3/8] iommu/arm-smmu-v3: Add and use static helper function arm_smmu_get_cmdq()

One SMMU has only one normal CMDQ. Therefore, this CMDQ is used regardless of
the core on which a command is inserted, and it can be referenced directly
through "smmu->cmdq". However, one SMMU can have multiple ECMDQs, and the
ECMDQ to be used depends on the core on which the command insertion is
executed. So the helper function arm_smmu_get_cmdq() is added; it returns the
CMDQ/ECMDQ that the current core should use. Since ECMDQ support has not been
added yet, it simply returns "&smmu->cmdq" for now.

Many subfunctions of arm_smmu_cmdq_issue_cmdlist() use "&smmu->cmdq" or
"&smmu->cmdq.q" directly. To support ECMDQ, they need to call the newly
added function arm_smmu_get_cmdq() instead.

Note that normal CMDQ is still required until ECMDQ is available.

Signed-off-by: Zhen Lei <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 22 ++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index a5361153ca1d6a4..e4af13b1e7fc015 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -335,10 +335,14 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
return 0;
}

+static struct arm_smmu_cmdq *arm_smmu_get_cmdq(struct arm_smmu_device *smmu)
+{
+ return &smmu->cmdq;
+}
+
static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
- u32 prod)
+ struct arm_smmu_queue *q, u32 prod)
{
- struct arm_smmu_queue *q = &smmu->cmdq.q;
struct arm_smmu_cmdq_ent ent = {
.opcode = CMDQ_OP_CMD_SYNC,
};
@@ -578,7 +582,7 @@ static int arm_smmu_cmdq_poll_until_not_full(struct arm_smmu_device *smmu,
{
unsigned long flags;
struct arm_smmu_queue_poll qp;
- struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
+ struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu);
int ret = 0;

/*
@@ -594,7 +598,7 @@ static int arm_smmu_cmdq_poll_until_not_full(struct arm_smmu_device *smmu,

queue_poll_init(smmu, &qp);
do {
- llq->val = READ_ONCE(smmu->cmdq.q.llq.val);
+ llq->val = READ_ONCE(cmdq->q.llq.val);
if (!queue_full(llq))
break;

@@ -613,7 +617,7 @@ static int __arm_smmu_cmdq_poll_until_msi(struct arm_smmu_device *smmu,
{
int ret = 0;
struct arm_smmu_queue_poll qp;
- struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
+ struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu);
u32 *cmd = (u32 *)(Q_ENT(&cmdq->q, llq->prod));

queue_poll_init(smmu, &qp);
@@ -636,12 +640,12 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
struct arm_smmu_ll_queue *llq)
{
struct arm_smmu_queue_poll qp;
- struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
+ struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu);
u32 prod = llq->prod;
int ret = 0;

queue_poll_init(smmu, &qp);
- llq->val = READ_ONCE(smmu->cmdq.q.llq.val);
+ llq->val = READ_ONCE(cmdq->q.llq.val);
do {
if (queue_consumed(llq, prod))
break;
@@ -731,7 +735,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
u32 prod;
unsigned long flags;
bool owner;
- struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
+ struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu);
struct arm_smmu_ll_queue llq = {
.max_n_shift = cmdq->q.llq.max_n_shift,
}, head = llq;
@@ -771,7 +775,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
arm_smmu_cmdq_write_entries(cmdq, cmds, llq.prod, n);
if (sync) {
prod = queue_inc_prod_n(&llq, n);
- arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, prod);
+ arm_smmu_cmdq_build_sync_cmd(cmd_sync, smmu, &cmdq->q, prod);
queue_write(Q_ENT(&cmdq->q, prod), cmd_sync, CMDQ_ENT_DWORDS);

/*
--
2.26.0.106.g9fadedd


2021-06-26 11:05:12

by Zhen Lei

Subject: [PATCH RFC 5/8] iommu/arm-smmu-v3: Add support for ECMDQ register mode

Ensure that each core exclusively occupies an ECMDQ and that all ECMDQs are
enabled during initialization. Any error during this initialization results in
a fallback to the normal CMDQ.

When GERROR is triggered by ECMDQ, all ECMDQs need to be traversed: the
ECMDQs with errors will be processed and the ECMDQs without errors will
be skipped directly.

Compared with register SMMU_CMDQ_PROD, register SMMU_ECMDQ_PROD has an extra
'EN' bit and an extra 'ERRACK' bit. Therefore, an extra member 'ecmdq_prod' is
added to record the values of these two bits, and it is ORed into the value
written each time register SMMU_ECMDQ_PROD is updated. After the error
indicated by SMMU_GERROR.CMDQP_ERR is fixed, the 'ERRACK' bit needs to be
toggled to resume the corresponding ECMDQ. An rwlock is therefore used to
protect the write to the 'ERRACK' bit during error handling against the read
of it during command insertion.
Signed-off-by: Zhen Lei <[email protected]>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 210 +++++++++++++++++++-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 36 ++++
2 files changed, 245 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 62b2742daab3257..d7b590e911a879d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -337,6 +337,14 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)

static struct arm_smmu_cmdq *arm_smmu_get_cmdq(struct arm_smmu_device *smmu)
{
+ if (smmu->ecmdq_enabled) {
+ struct arm_smmu_ecmdq *ecmdq;
+
+ ecmdq = *this_cpu_ptr(smmu->ecmdq);
+
+ return &ecmdq->cmdq;
+ }
+
return &smmu->cmdq;
}

@@ -421,6 +429,38 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
__arm_smmu_cmdq_skip_err(smmu, &smmu->cmdq.q);
}

+static void arm_smmu_ecmdq_skip_err(struct arm_smmu_device *smmu)
+{
+ int i;
+ u32 prod, cons;
+ struct arm_smmu_queue *q;
+ struct arm_smmu_ecmdq *ecmdq;
+
+ for (i = 0; i < smmu->nr_ecmdq; i++) {
+ unsigned long flags;
+
+ ecmdq = *per_cpu_ptr(smmu->ecmdq, i);
+ q = &ecmdq->cmdq.q;
+
+ prod = readl_relaxed(q->prod_reg);
+ cons = readl_relaxed(q->cons_reg);
+ if (((prod ^ cons) & ECMDQ_CONS_ERR) == 0)
+ continue;
+
+ __arm_smmu_cmdq_skip_err(smmu, q);
+
+ write_lock_irqsave(&q->ecmdq_lock, flags);
+ q->ecmdq_prod &= ~ECMDQ_PROD_ERRACK;
+ q->ecmdq_prod |= cons & ECMDQ_CONS_ERR;
+
+ prod = readl_relaxed(q->prod_reg);
+ prod &= ~ECMDQ_PROD_ERRACK;
+ prod |= cons & ECMDQ_CONS_ERR;
+ writel(prod, q->prod_reg);
+ write_unlock_irqrestore(&q->ecmdq_lock, flags);
+ }
+}
+
/*
* Command queue locking.
* This is a form of bastardised rwlock with the following major changes:
@@ -817,7 +857,13 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
* d. Advance the hardware prod pointer
* Control dependency ordering from the entries becoming valid.
*/
- writel_relaxed(prod, cmdq->q.prod_reg);
+ if (smmu->ecmdq_enabled) {
+ read_lock(&cmdq->q.ecmdq_lock);
+ writel_relaxed(prod | cmdq->q.ecmdq_prod, cmdq->q.prod_reg);
+ read_unlock(&cmdq->q.ecmdq_lock);
+ } else {
+ writel_relaxed(prod, cmdq->q.prod_reg);
+ }

/*
* e. Tell the next owner we're done
@@ -1675,6 +1721,9 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
if (active & GERROR_CMDQ_ERR)
arm_smmu_cmdq_skip_err(smmu);

+ if (active & GERROR_CMDQP_ERR)
+ arm_smmu_ecmdq_skip_err(smmu);
+
writel(gerror, smmu->base + ARM_SMMU_GERRORN);
return IRQ_HANDLED;
}
@@ -2941,6 +2990,20 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
return ret;
}

+static int arm_smmu_ecmdq_init(struct arm_smmu_cmdq *cmdq)
+{
+ unsigned int nents = 1 << cmdq->q.llq.max_n_shift;
+
+ atomic_set(&cmdq->owner_prod, 0);
+ atomic_set(&cmdq->lock, 0);
+
+ cmdq->valid_map = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
+ if (!cmdq->valid_map)
+ return -ENOMEM;
+
+ return 0;
+}
+
static int arm_smmu_init_queues(struct arm_smmu_device *smmu)
{
int ret;
@@ -3304,6 +3367,7 @@ static int arm_smmu_device_disable(struct arm_smmu_device *smmu)

static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
{
+ int i;
int ret;
u32 reg, enables;
struct arm_smmu_cmdq_ent cmd;
@@ -3348,6 +3412,28 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
writel_relaxed(smmu->cmdq.q.llq.prod, smmu->base + ARM_SMMU_CMDQ_PROD);
writel_relaxed(smmu->cmdq.q.llq.cons, smmu->base + ARM_SMMU_CMDQ_CONS);

+ for (i = 0; i < smmu->nr_ecmdq; i++) {
+ struct arm_smmu_ecmdq *ecmdq;
+ struct arm_smmu_queue *q;
+
+ ecmdq = *per_cpu_ptr(smmu->ecmdq, i);
+ q = &ecmdq->cmdq.q;
+
+ writeq_relaxed(q->q_base, ecmdq->base + ARM_SMMU_ECMDQ_BASE);
+ writel_relaxed(q->llq.prod, ecmdq->base + ARM_SMMU_ECMDQ_PROD);
+ writel_relaxed(q->llq.cons, ecmdq->base + ARM_SMMU_ECMDQ_CONS);
+
+ /* enable ecmdq */
+ writel(ECMDQ_PROD_EN, q->prod_reg);
+ ret = readl_relaxed_poll_timeout(q->cons_reg, reg, reg & ECMDQ_CONS_ENACK,
+ 1, ARM_SMMU_POLL_TIMEOUT_US);
+ if (ret) {
+ dev_err(smmu->dev, "ecmdq[%d] enable failed\n", i);
+ smmu->ecmdq_enabled = 0;
+ break;
+ }
+ }
+
enables = CR0_CMDQEN;
ret = arm_smmu_write_reg_sync(smmu, enables, ARM_SMMU_CR0,
ARM_SMMU_CR0ACK);
@@ -3437,6 +3523,115 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
return 0;
}

+static int arm_smmu_ecmdq_layout(struct arm_smmu_device *smmu)
+{
+ int cpu;
+ struct arm_smmu_ecmdq *ecmdq;
+
+ if (num_possible_cpus() <= smmu->nr_ecmdq) {
+ ecmdq = devm_alloc_percpu(smmu->dev, *ecmdq);
+ if (!ecmdq)
+ return -ENOMEM;
+
+ for_each_possible_cpu(cpu)
+ *per_cpu_ptr(smmu->ecmdq, cpu) = per_cpu_ptr(ecmdq, cpu);
+
+ /* A core requires at most one ECMDQ */
+ smmu->nr_ecmdq = num_possible_cpus();
+
+ return 0;
+ }
+
+ return -ENOSPC;
+}
+
+static int arm_smmu_ecmdq_probe(struct arm_smmu_device *smmu)
+{
+ int ret, cpu;
+ u32 i, nump, numq, gap;
+ u32 reg, shift_increment;
+ u64 addr, smmu_dma_base;
+ void __iomem *cp_regs, *cp_base;
+
+ /* IDR6 */
+ reg = readl_relaxed(smmu->base + ARM_SMMU_IDR6);
+ nump = 1 << FIELD_GET(IDR6_LOG2NUMP, reg);
+ numq = 1 << FIELD_GET(IDR6_LOG2NUMQ, reg);
+ smmu->nr_ecmdq = nump * numq;
+ gap = ECMDQ_CP_RRESET_SIZE >> FIELD_GET(IDR6_LOG2NUMQ, reg);
+
+ smmu_dma_base = (vmalloc_to_pfn(smmu->base) << PAGE_SHIFT);
+ cp_regs = ioremap(smmu_dma_base + ARM_SMMU_ECMDQ_CP_BASE, PAGE_SIZE);
+ if (!cp_regs)
+ return -ENOMEM;
+
+ for (i = 0; i < nump; i++) {
+ u64 val, pre_addr;
+
+ val = readq_relaxed(cp_regs + 32 * i);
+ if (!(val & ECMDQ_CP_PRESET)) {
+ iounmap(cp_regs);
+ dev_err(smmu->dev, "ecmdq control page %u is memory mode\n", i);
+ return -EFAULT;
+ }
+
+ if (i && ((val & ECMDQ_CP_ADDR) != (pre_addr + ECMDQ_CP_RRESET_SIZE))) {
+ iounmap(cp_regs);
+ dev_err(smmu->dev, "ecmdq_cp memory region is not contiguous\n");
+ return -EFAULT;
+ }
+
+ pre_addr = val & ECMDQ_CP_ADDR;
+ }
+
+ addr = readl_relaxed(cp_regs) & ECMDQ_CP_ADDR;
+ iounmap(cp_regs);
+
+ cp_base = devm_ioremap(smmu->dev, smmu_dma_base + addr, ECMDQ_CP_RRESET_SIZE * nump);
+ if (!cp_base)
+ return -ENOMEM;
+
+ smmu->ecmdq = devm_alloc_percpu(smmu->dev, struct arm_smmu_ecmdq *);
+ if (!smmu->ecmdq)
+ return -ENOMEM;
+
+ ret = arm_smmu_ecmdq_layout(smmu);
+ if (ret)
+ return ret;
+
+ shift_increment = order_base_2(num_possible_cpus() / smmu->nr_ecmdq);
+
+ addr = 0;
+ for_each_possible_cpu(cpu) {
+ struct arm_smmu_ecmdq *ecmdq;
+ struct arm_smmu_queue *q;
+
+ ecmdq = *per_cpu_ptr(smmu->ecmdq, cpu);
+ ecmdq->base = cp_base + addr;
+
+ q = &ecmdq->cmdq.q;
+
+ q->llq.max_n_shift = ECMDQ_MAX_SZ_SHIFT + shift_increment;
+ ret = arm_smmu_init_one_queue(smmu, q, ecmdq->base, ARM_SMMU_ECMDQ_PROD,
+ ARM_SMMU_ECMDQ_CONS, CMDQ_ENT_DWORDS, "ecmdq");
+ if (ret)
+ return ret;
+
+ q->ecmdq_prod = ECMDQ_PROD_EN;
+ rwlock_init(&q->ecmdq_lock);
+
+ ret = arm_smmu_ecmdq_init(&ecmdq->cmdq);
+ if (ret) {
+ dev_err(smmu->dev, "ecmdq[%d] init failed\n", cpu);
+ return ret;
+ }
+
+ addr += gap;
+ }
+
+ return 0;
+}
+
static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
{
u32 reg;
@@ -3547,6 +3742,9 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
return -ENXIO;
}

+ if (reg & IDR1_ECMDQ)
+ smmu->features |= ARM_SMMU_FEAT_ECMDQ;
+
/* Queue sizes, capped to ensure natural alignment */
smmu->cmdq.q.llq.max_n_shift = min_t(u32, CMDQ_MAX_SZ_SHIFT,
FIELD_GET(IDR1_CMDQS, reg));
@@ -3647,6 +3845,16 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)

dev_info(smmu->dev, "ias %lu-bit, oas %lu-bit (features 0x%08x)\n",
smmu->ias, smmu->oas, smmu->features);
+
+ if (smmu->features & ARM_SMMU_FEAT_ECMDQ) {
+ int err;
+
+ err = arm_smmu_ecmdq_probe(smmu);
+ if (err) {
+ dev_err(smmu->dev, "suppress ecmdq feature, errno=%d\n", err);
+ smmu->ecmdq_enabled = 0;
+ }
+ }
return 0;
}

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 4cb136f07914e83..3f3a867a4626fcd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -41,6 +41,7 @@
#define IDR0_S2P (1 << 0)

#define ARM_SMMU_IDR1 0x4
+#define IDR1_ECMDQ (1 << 31)
#define IDR1_TABLES_PRESET (1 << 30)
#define IDR1_QUEUES_PRESET (1 << 29)
#define IDR1_REL (1 << 28)
@@ -107,6 +108,7 @@
#define ARM_SMMU_IRQ_CTRLACK 0x54

#define ARM_SMMU_GERROR 0x60
+#define GERROR_CMDQP_ERR (1 << 9)
#define GERROR_SFM_ERR (1 << 8)
#define GERROR_MSI_GERROR_ABT_ERR (1 << 7)
#define GERROR_MSI_PRIQ_ABT_ERR (1 << 6)
@@ -152,6 +154,26 @@
#define ARM_SMMU_PRIQ_IRQ_CFG1 0xd8
#define ARM_SMMU_PRIQ_IRQ_CFG2 0xdc

+#define ARM_SMMU_IDR6 0x190
+#define IDR6_LOG2NUMP GENMASK(27, 24)
+#define IDR6_LOG2NUMQ GENMASK(19, 16)
+#define IDR6_BA_DOORBELLS GENMASK(9, 0)
+
+#define ARM_SMMU_ECMDQ_BASE 0x00
+#define ARM_SMMU_ECMDQ_PROD 0x08
+#define ARM_SMMU_ECMDQ_CONS 0x0c
+#define ECMDQ_MAX_SZ_SHIFT 8
+#define ECMDQ_PROD_EN (1 << 31)
+#define ECMDQ_CONS_ENACK (1 << 31)
+#define ECMDQ_CONS_ERR (1 << 23)
+#define ECMDQ_PROD_ERRACK (1 << 23)
+
+#define ARM_SMMU_ECMDQ_CP_BASE 0x4000
+#define ECMDQ_CP_ADDR GENMASK_ULL(51, 12)
+#define ECMDQ_CP_CMDQGS GENMASK_ULL(2, 1)
+#define ECMDQ_CP_PRESET (1UL << 0)
+#define ECMDQ_CP_RRESET_SIZE 0x10000
+
#define ARM_SMMU_REG_SZ 0xe00

/* Common MSI config fields */
@@ -522,6 +544,8 @@ struct arm_smmu_ll_queue {
struct arm_smmu_queue {
struct arm_smmu_ll_queue llq;
int irq; /* Wired interrupt */
+ u32 ecmdq_prod;
+ rwlock_t ecmdq_lock;

__le64 *base;
dma_addr_t base_dma;
@@ -547,6 +571,11 @@ struct arm_smmu_cmdq {
atomic_t lock;
};

+struct arm_smmu_ecmdq {
+ struct arm_smmu_cmdq cmdq;
+ void __iomem *base;
+};
+
struct arm_smmu_cmdq_batch {
u64 cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
int num;
@@ -640,6 +669,7 @@ struct arm_smmu_device {
#define ARM_SMMU_FEAT_BTM (1 << 16)
#define ARM_SMMU_FEAT_SVA (1 << 17)
#define ARM_SMMU_FEAT_E2H (1 << 18)
+#define ARM_SMMU_FEAT_ECMDQ (1 << 19)
u32 features;

#define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0)
@@ -647,6 +677,12 @@ struct arm_smmu_device {
#define ARM_SMMU_OPT_MSIPOLL (1 << 2)
u32 options;

+ union {
+ u32 nr_ecmdq;
+ u32 ecmdq_enabled;
+ };
+ struct arm_smmu_ecmdq *__percpu *ecmdq;
+
struct arm_smmu_cmdq cmdq;
struct arm_smmu_evtq evtq;
struct arm_smmu_priq priq;
--
2.26.0.106.g9fadedd


2021-08-10 18:26:37

by Will Deacon

Subject: Re: [PATCH RFC 2/8] iommu/arm-smmu-v3: Add and use static helper function arm_smmu_cmdq_issue_cmd_with_sync()

On Sat, Jun 26, 2021 at 07:01:24PM +0800, Zhen Lei wrote:
> The obvious key to the performance optimization of commit 587e6c10a7ce
> ("iommu/arm-smmu-v3: Reduce contention during command-queue insertion") is
> to allow multiple cores to insert commands in parallel after a brief mutex
> contention.
>
> Obviously, inserting as many commands at a time as possible can reduce the
> number of times the mutex contention participates, thereby improving the
> overall performance. At least it reduces the number of calls to function
> arm_smmu_cmdq_issue_cmdlist().
>
> Therefore, function arm_smmu_cmdq_issue_cmd_with_sync() is added to insert
> the 'cmd+sync' commands at a time.
>
> Signed-off-by: Zhen Lei <[email protected]>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 33 +++++++++++++--------
> 1 file changed, 21 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 2433d3c29b49ff2..a5361153ca1d6a4 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -858,11 +858,25 @@ static int arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
> return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, false);
> }
>
> -static int arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
> +static int __maybe_unused arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
> {
> return arm_smmu_cmdq_issue_cmdlist(smmu, NULL, 0, true);
> }
>
> +static int arm_smmu_cmdq_issue_cmd_with_sync(struct arm_smmu_device *smmu,
> + struct arm_smmu_cmdq_ent *ent)
> +{
> + u64 cmd[CMDQ_ENT_DWORDS];
> +
> + if (arm_smmu_cmdq_build_cmd(cmd, ent)) {
> + dev_warn(smmu->dev, "ignoring unknown CMDQ opcode 0x%x\n",
> + ent->opcode);
> + return -EINVAL;
> + }
> +
> + return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, true);
> +}

This function is almost identical to arm_smmu_cmdq_issue_cmd(). How about
moving the guts out into a helper:

static int __arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq_ent *ent,
bool sync);

and then having arm_smmu_cmdq_issue_cmd_with_sync() and
arm_smmu_cmdq_issue_cmd() wrap that?

Will
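
For illustration, a minimal sketch of that refactoring (assuming the existing
arm_smmu_cmdq_build_cmd() and arm_smmu_cmdq_issue_cmdlist() signatures; not
the final patch) could look like:

static int __arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
				     struct arm_smmu_cmdq_ent *ent,
				     bool sync)
{
	u64 cmd[CMDQ_ENT_DWORDS];

	if (arm_smmu_cmdq_build_cmd(cmd, ent)) {
		dev_warn(smmu->dev, "ignoring unknown CMDQ opcode 0x%x\n",
			 ent->opcode);
		return -EINVAL;
	}

	return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, sync);
}

static int arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
				   struct arm_smmu_cmdq_ent *ent)
{
	return __arm_smmu_cmdq_issue_cmd(smmu, ent, false);
}

static int arm_smmu_cmdq_issue_cmd_with_sync(struct arm_smmu_device *smmu,
					     struct arm_smmu_cmdq_ent *ent)
{
	return __arm_smmu_cmdq_issue_cmd(smmu, ent, true);
}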

2021-08-10 18:36:35

by Will Deacon

Subject: Re: [PATCH RFC 0/8] iommu/arm-smmu-v3: add support for ECMDQ register mode

On Sat, Jun 26, 2021 at 07:01:22PM +0800, Zhen Lei wrote:
> SMMU v3.3 added a new feature, which is Enhanced Command queue interface
> for reducing contention when submitting Commands to the SMMU, in this
> patch set, ECMDQ is the abbreviation of Enhanced Command Queue.
>
> When the hardware supports ECMDQ and each core can exclusively use one ECMDQ,
> each core does not need to compete with other cores when using its own ECMDQ.
> This means that each core can insert commands in parallel. If each ECMDQ can
> execute commands in parallel, the overall performance may be better. However,
> our hardware currently does not support multiple ECMDQ execute commands in
> parallel.
>
> In order to reuse existing code, I originally still call arm_smmu_cmdq_issue_cmdlist()
> to insert commands. Even so, however, there was a performance improvement of nearly 12%
> in strict mode.
>
> The test environment is the EMU, which simulates the connection of the 200 Gbit/s NIC.
> Number of queues: passthrough lazy strict(ECMDQ) strict(CMDQ)
> 6 188 180 162 145 --> 11.7% improvement
> 8 188 188 184 183 --> 0.55% improvement

Sorry, I don't quite follow the numbers here. Why does the number of queues
affect the classic "CMDQ" mode? We only have one queue there, right?

> In recent days, I implemented a new function without competition with other
> cores to replace arm_smmu_cmdq_issue_cmdlist() when a core can have an ECMDQ.
> I'm guessing it might get better performance results. Because the EMU is too
> slow, it will take a while before the relevant data is available.

I'd certainly prefer to wait until we have something we know is
representative. However, I can take the first four prep patches now if you
respin the second one. At least that's then less for you to carry.

I'd also like review from the Arm side on this (and thank you for adopting
the architecture unlike others seem to have done judging by the patches
floating around).

Will

2021-08-11 02:10:42

by Zhen Lei

Subject: Re: [PATCH RFC 0/8] iommu/arm-smmu-v3: add support for ECMDQ register mode



On 2021/8/11 2:35, Will Deacon wrote:
> On Sat, Jun 26, 2021 at 07:01:22PM +0800, Zhen Lei wrote:
>> SMMU v3.3 added a new feature, which is Enhanced Command queue interface
>> for reducing contention when submitting Commands to the SMMU, in this
>> patch set, ECMDQ is the abbreviation of Enhanced Command Queue.
>>
>> When the hardware supports ECMDQ and each core can exclusively use one ECMDQ,
>> each core does not need to compete with other cores when using its own ECMDQ.
>> This means that each core can insert commands in parallel. If each ECMDQ can
>> execute commands in parallel, the overall performance may be better. However,
>> our hardware currently does not support multiple ECMDQ execute commands in
>> parallel.
>>
>> In order to reuse existing code, I originally still call arm_smmu_cmdq_issue_cmdlist()
>> to insert commands. Even so, however, there was a performance improvement of nearly 12%
>> in strict mode.
>>
>> The test environment is the EMU, which simulates the connection of the 200 Gbit/s NIC.
>> Number of queues: passthrough lazy strict(ECMDQ) strict(CMDQ)
>> 6 188 180 162 145 --> 11.7% improvement
>> 8 188 188 184 183 --> 0.55% improvement
>
> Sorry, I don't quite follow the numbers here. Why does the number of queues
> affect the classic "CMDQ" mode? We only have one queue there, right?

These queues indicate the network concurrency; maybe I should have said channels or threads.
6 means six threads are deployed on different cores, each using its own channel to send
and receive network packets.

>
>> In recent days, I implemented a new function without competition with other
>> cores to replace arm_smmu_cmdq_issue_cmdlist() when a core can have an ECMDQ.
>> I'm guessing it might get better performance results. Because the EMU is too
>> slow, it will take a while before the relevant data is available.
>
> I'd certainly prefer to wait until we have something we know is
> representative.

Yes, it would be better to have an actual set of performance data. Now the EMU is
used to analyze hardware problems. This test has not been numbered yet.

> However, I can take the first four prep patches now if you
> respin the second one. At least that's then less for you to carry.

Great. Thank you. I will respin the second one.

>
> I'd also like review from the Arm side on this (and thank you for adopting
> the architecture unlike others seem to have done judging by the patches
> floating around).
>
> Will
> .
>

2021-08-11 02:17:36

by Zhen Lei

Subject: Re: [PATCH RFC 2/8] iommu/arm-smmu-v3: Add and use static helper function arm_smmu_cmdq_issue_cmd_with_sync()



On 2021/8/11 2:24, Will Deacon wrote:
> On Sat, Jun 26, 2021 at 07:01:24PM +0800, Zhen Lei wrote:
>> The obvious key to the performance optimization of commit 587e6c10a7ce
>> ("iommu/arm-smmu-v3: Reduce contention during command-queue insertion") is
>> to allow multiple cores to insert commands in parallel after a brief mutex
>> contention.
>>
>> Obviously, inserting as many commands at a time as possible can reduce the
>> number of times the mutex contention participates, thereby improving the
>> overall performance. At least it reduces the number of calls to function
>> arm_smmu_cmdq_issue_cmdlist().
>>
>> Therefore, function arm_smmu_cmdq_issue_cmd_with_sync() is added to insert
>> the 'cmd+sync' commands at a time.
>>
>> Signed-off-by: Zhen Lei <[email protected]>
>> ---
>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 33 +++++++++++++--------
>> 1 file changed, 21 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 2433d3c29b49ff2..a5361153ca1d6a4 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -858,11 +858,25 @@ static int arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
>> return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, false);
>> }
>>
>> -static int arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
>> +static int __maybe_unused arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
>> {
>> return arm_smmu_cmdq_issue_cmdlist(smmu, NULL, 0, true);
>> }
>>
>> +static int arm_smmu_cmdq_issue_cmd_with_sync(struct arm_smmu_device *smmu,
>> + struct arm_smmu_cmdq_ent *ent)
>> +{
>> + u64 cmd[CMDQ_ENT_DWORDS];
>> +
>> + if (arm_smmu_cmdq_build_cmd(cmd, ent)) {
>> + dev_warn(smmu->dev, "ignoring unknown CMDQ opcode 0x%x\n",
>> + ent->opcode);
>> + return -EINVAL;
>> + }
>> +
>> + return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, true);
>> +}
>
> This function is almost identical to arm_smmu_cmdq_issue_cmd(). How about
> moving the guts out into a helper:
>
> static int __arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
> struct arm_smmu_cmdq_ent *ent,
> bool sync);
>
> and then having arm_smmu_cmdq_issue_cmd_with_sync() and
> arm_smmu_cmdq_issue_cmd() wrap that?

OK, I will do it.

How about remove arm_smmu_cmdq_issue_sync()? It's unused now.

>
> Will
> .
>

2021-08-11 10:11:11

by Will Deacon

Subject: Re: [PATCH RFC 2/8] iommu/arm-smmu-v3: Add and use static helper function arm_smmu_cmdq_issue_cmd_with_sync()

On Wed, Aug 11, 2021 at 10:16:39AM +0800, Leizhen (ThunderTown) wrote:
>
>
> On 2021/8/11 2:24, Will Deacon wrote:
> > On Sat, Jun 26, 2021 at 07:01:24PM +0800, Zhen Lei wrote:
> >> The obvious key to the performance optimization of commit 587e6c10a7ce
> >> ("iommu/arm-smmu-v3: Reduce contention during command-queue insertion") is
> >> to allow multiple cores to insert commands in parallel after a brief mutex
> >> contention.
> >>
> >> Obviously, inserting as many commands at a time as possible can reduce the
> >> number of times the mutex contention participates, thereby improving the
> >> overall performance. At least it reduces the number of calls to function
> >> arm_smmu_cmdq_issue_cmdlist().
> >>
> >> Therefore, function arm_smmu_cmdq_issue_cmd_with_sync() is added to insert
> >> the 'cmd+sync' commands at a time.
> >>
> >> Signed-off-by: Zhen Lei <[email protected]>
> >> ---
> >> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 33 +++++++++++++--------
> >> 1 file changed, 21 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> >> index 2433d3c29b49ff2..a5361153ca1d6a4 100644
> >> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> >> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> >> @@ -858,11 +858,25 @@ static int arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
> >> return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, false);
> >> }
> >>
> >> -static int arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
> >> +static int __maybe_unused arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
> >> {
> >> return arm_smmu_cmdq_issue_cmdlist(smmu, NULL, 0, true);
> >> }
> >>
> >> +static int arm_smmu_cmdq_issue_cmd_with_sync(struct arm_smmu_device *smmu,
> >> + struct arm_smmu_cmdq_ent *ent)
> >> +{
> >> + u64 cmd[CMDQ_ENT_DWORDS];
> >> +
> >> + if (arm_smmu_cmdq_build_cmd(cmd, ent)) {
> >> + dev_warn(smmu->dev, "ignoring unknown CMDQ opcode 0x%x\n",
> >> + ent->opcode);
> >> + return -EINVAL;
> >> + }
> >> +
> >> + return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, true);
> >> +}
> >
> > This function is almost identical to arm_smmu_cmdq_issue_cmd(). How about
> > moving the guts out into a helper:
> >
> > static int __arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
> > struct arm_smmu_cmdq_ent *ent,
> > bool sync);
> >
> > and then having arm_smmu_cmdq_issue_cmd_with_sync() and
> > arm_smmu_cmdq_issue_cmd() wrap that?
>
> OK, I will do it.
>
> How about remove arm_smmu_cmdq_issue_sync()? It's unused now.

Sure.

Will

2021-08-11 10:34:19

by John Garry

Subject: Re: [PATCH RFC 2/8] iommu/arm-smmu-v3: Add and use static helper function arm_smmu_cmdq_issue_cmd_with_sync()

>>>> Obviously, inserting as many commands at a time as possible can reduce the
>>>> number of times the mutex contention participates, thereby improving the
>>>> overall performance. At least it reduces the number of calls to function
>>>> arm_smmu_cmdq_issue_cmdlist().
>>>>
>>>> Therefore, function arm_smmu_cmdq_issue_cmd_with_sync() is added to insert
>>>> the 'cmd+sync' commands at a time.
>>>>
>>>> Signed-off-by: Zhen Lei <[email protected]>
>>>> ---
>>>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 33 +++++++++++++--------
>>>> 1 file changed, 21 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> index 2433d3c29b49ff2..a5361153ca1d6a4 100644
>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> @@ -858,11 +858,25 @@ static int arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
>>>> return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, false);
>>>> }
>>>>
>>>> -static int arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
>>>> +static int __maybe_unused arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
>>>> {
>>>> return arm_smmu_cmdq_issue_cmdlist(smmu, NULL, 0, true);
>>>> }
>>>>
>>>> +static int arm_smmu_cmdq_issue_cmd_with_sync(struct arm_smmu_device *smmu,
>>>> + struct arm_smmu_cmdq_ent *ent)
>>>> +{
>>>> + u64 cmd[CMDQ_ENT_DWORDS];
>>>> +
>>>> + if (arm_smmu_cmdq_build_cmd(cmd, ent)) {
>>>> + dev_warn(smmu->dev, "ignoring unknown CMDQ opcode 0x%x\n",
>>>> + ent->opcode);
>>>> + return -EINVAL;

Are any of the errors returned from the "issue command" functions
actually consumed? I couldn't see it on mainline code from a brief browse.

>>>> + }
>>>> +
>>>> + return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, true);

Thanks,
John

2021-08-11 10:34:49

by Will Deacon

Subject: Re: [PATCH RFC 2/8] iommu/arm-smmu-v3: Add and use static helper function arm_smmu_cmdq_issue_cmd_with_sync()

On Wed, Aug 11, 2021 at 11:31:08AM +0100, John Garry wrote:
> > > > > Obviously, inserting as many commands at a time as possible can reduce the
> > > > > number of times the mutex contention participates, thereby improving the
> > > > > overall performance. At least it reduces the number of calls to function
> > > > > arm_smmu_cmdq_issue_cmdlist().
> > > > >
> > > > > Therefore, function arm_smmu_cmdq_issue_cmd_with_sync() is added to insert
> > > > > the 'cmd+sync' commands at a time.
> > > > >
> > > > > Signed-off-by: Zhen Lei <[email protected]>
> > > > > ---
> > > > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 33 +++++++++++++--------
> > > > > 1 file changed, 21 insertions(+), 12 deletions(-)
> > > > >
> > > > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > > > index 2433d3c29b49ff2..a5361153ca1d6a4 100644
> > > > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > > > @@ -858,11 +858,25 @@ static int arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
> > > > > return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, false);
> > > > > }
> > > > > -static int arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
> > > > > +static int __maybe_unused arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
> > > > > {
> > > > > return arm_smmu_cmdq_issue_cmdlist(smmu, NULL, 0, true);
> > > > > }
> > > > > +static int arm_smmu_cmdq_issue_cmd_with_sync(struct arm_smmu_device *smmu,
> > > > > + struct arm_smmu_cmdq_ent *ent)
> > > > > +{
> > > > > + u64 cmd[CMDQ_ENT_DWORDS];
> > > > > +
> > > > > + if (arm_smmu_cmdq_build_cmd(cmd, ent)) {
> > > > > + dev_warn(smmu->dev, "ignoring unknown CMDQ opcode 0x%x\n",
> > > > > + ent->opcode);
> > > > > + return -EINVAL;
>
> Are any of the errors returned from the "issue command" functions actually
> consumed? I couldn't see it on mainline code from a brief browse.

I don't think so. Can we actually propagate them?

Will

2021-08-11 11:17:35

by John Garry

Subject: Re: [PATCH RFC 2/8] iommu/arm-smmu-v3: Add and use static helper function arm_smmu_cmdq_issue_cmd_with_sync()

>>>>>>
>>>>>> Therefore, function arm_smmu_cmdq_issue_cmd_with_sync() is added to insert
>>>>>> the 'cmd+sync' commands at a time.
>>>>>>
>>>>>> Signed-off-by: Zhen Lei<[email protected]>
>>>>>> ---
>>>>>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 33 +++++++++++++--------
>>>>>> 1 file changed, 21 insertions(+), 12 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>>> index 2433d3c29b49ff2..a5361153ca1d6a4 100644
>>>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>>> @@ -858,11 +858,25 @@ static int arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
>>>>>> return arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, false);
>>>>>> }
>>>>>> -static int arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
>>>>>> +static int __maybe_unused arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
>>>>>> {
>>>>>> return arm_smmu_cmdq_issue_cmdlist(smmu, NULL, 0, true);
>>>>>> }
>>>>>> +static int arm_smmu_cmdq_issue_cmd_with_sync(struct arm_smmu_device *smmu,
>>>>>> + struct arm_smmu_cmdq_ent *ent)
>>>>>> +{
>>>>>> + u64 cmd[CMDQ_ENT_DWORDS];
>>>>>> +
>>>>>> + if (arm_smmu_cmdq_build_cmd(cmd, ent)) {
>>>>>> + dev_warn(smmu->dev, "ignoring unknown CMDQ opcode 0x%x\n",
>>>>>> + ent->opcode);
>>>>>> + return -EINVAL;
>> Are any of the errors returned from the "issue command" functions actually
>> consumed? I couldn't see it on mainline code from a brief browse.
> I don't think so.

I don't think so either.

> Can we actually propagate them?

There do appear to be some places; here's one I found:

arm_smmu_page_response() -> arm_smmu_cmdq_issue_cmd(), and
arm_smmu_page_response is set to arm_smmu_ops.page_response, which
returns an int

Thanks,
John