2017-04-20 19:35:07

by Haiying Wang

[permalink] [raw]
Subject: [PATCH 0/3] bus: fsl-mc: dpio: update QMan CENA region

This patchset allows NXP's DPAA2 QMan software portal
CENA region to be cacheable, as designed, to meet the
performance goal. In addition, the QMan's write-allocate
stash feature requires the non-shareable attribute for this
cache-enabled memory.
This patchset therefore first extends arm64 ioremap with a
cacheable, non-shareable mapping, then enables CENA portal
memory access in the QBMan driver, and finally switches the
ioremap call in the dpio driver, where the software portal
CENA memory is mapped, to the new variant.

Haiying Wang (3):
arm64: extend ioremap for cacheable non-shareable memory
bus: fsl-mc: dpio: enable qbman CENA portal memory access
bus: fsl-mc: dpio: change CENA regs to be cacheable

arch/arm64/include/asm/io.h | 1 +
arch/arm64/include/asm/pgtable-prot.h | 1 +
drivers/staging/fsl-mc/bus/dpio/dpio-driver.c | 6 +++---
drivers/staging/fsl-mc/bus/dpio/qbman-portal.c | 23 +++++++++++++++++------
4 files changed, 22 insertions(+), 9 deletions(-)

--
2.7.4


2017-04-20 19:35:17

by Haiying Wang

[permalink] [raw]
Subject: [PATCH 1/3] arm64: extend ioremap for cacheable non-shareable memory

NXP's arm64-based SoCs need to allocate cacheable and
non-shareable memory for the software portals of the
Queue Manager, so we extend the arm64 ioremap support
with this memory attribute.

Signed-off-by: Haiying Wang <[email protected]>
---
arch/arm64/include/asm/io.h | 1 +
arch/arm64/include/asm/pgtable-prot.h | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 0c00c87..b6f03e7 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -170,6 +170,7 @@ extern void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size);
#define ioremap_nocache(addr, size) __ioremap((addr), (size), __pgprot(PROT_DEVICE_nGnRE))
#define ioremap_wc(addr, size) __ioremap((addr), (size), __pgprot(PROT_NORMAL_NC))
#define ioremap_wt(addr, size) __ioremap((addr), (size), __pgprot(PROT_DEVICE_nGnRE))
+#define ioremap_cache_ns(addr, size) __ioremap((addr), (size), __pgprot(PROT_NORMAL_NS))
#define iounmap __iounmap

/*
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 2142c77..7fc7910 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -42,6 +42,7 @@
#define PROT_NORMAL_NC (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL_NC))
#define PROT_NORMAL_WT (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL_WT))
#define PROT_NORMAL (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL))
+#define PROT_NORMAL_NS (PTE_TYPE_PAGE | PTE_AF | PTE_PXN | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL))

#define PROT_SECT_DEVICE_nGnRE (PROT_SECT_DEFAULT | PMD_SECT_PXN | PMD_SECT_UXN | PMD_ATTRINDX(MT_DEVICE_nGnRE))
#define PROT_SECT_NORMAL (PROT_SECT_DEFAULT | PMD_SECT_PXN | PMD_SECT_UXN | PMD_ATTRINDX(MT_NORMAL))
--
2.7.4

2017-04-20 19:35:25

by Haiying Wang

[permalink] [raw]
Subject: [PATCH 3/3] bus: fsl-mc: dpio: change CENA regs to be cacheable

plus non-shareable to meet the performance requirement.
QMan's CENA region contains registers and structures that
are 64 bytes in size and are intended to be accessed using a
single 64-byte bus transaction, therefore this portal
memory should be configured as cache-enabled. Also, because
the write-allocate stash transactions of QBMan should be
issued as cacheable and non-coherent (non-shareable), we
need to configure this region to be non-shareable.

Signed-off-by: Haiying Wang <[email protected]>
---
drivers/staging/fsl-mc/bus/dpio/dpio-driver.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/fsl-mc/bus/dpio/dpio-driver.c b/drivers/staging/fsl-mc/bus/dpio/dpio-driver.c
index e36da20..97f909c 100644
--- a/drivers/staging/fsl-mc/bus/dpio/dpio-driver.c
+++ b/drivers/staging/fsl-mc/bus/dpio/dpio-driver.c
@@ -168,10 +168,10 @@ static int dpaa2_dpio_probe(struct fsl_mc_device *dpio_dev)
desc.cpu = next_cpu;

/*
- * Set the CENA regs to be the cache inhibited area of the portal to
- * avoid coherency issues if a user migrates to another core.
+ * Set the CENA regs to be the cache enabled area of the portal to
+ * achieve the best performance.
*/
- desc.regs_cena = ioremap_wc(dpio_dev->regions[1].start,
+ desc.regs_cena = ioremap_cache_ns(dpio_dev->regions[1].start,
resource_size(&dpio_dev->regions[1]));
desc.regs_cinh = ioremap(dpio_dev->regions[1].start,
resource_size(&dpio_dev->regions[1]));
--
2.7.4

2017-04-20 19:35:56

by Haiying Wang

[permalink] [raw]
Subject: [PATCH 2/3] bus: fsl-mc: dpio: enable qbman CENA portal memory access

Once we enable cacheable portal memory, we need to flush
the cache for enqueue, VDQ, buffer release, and management
commands, as well as invalidate and prefetch the valid bit
of the management command response and the next index of the DQRR.

Signed-off-by: Haiying Wang <[email protected]>
---
drivers/staging/fsl-mc/bus/dpio/qbman-portal.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/fsl-mc/bus/dpio/qbman-portal.c b/drivers/staging/fsl-mc/bus/dpio/qbman-portal.c
index 2a3ea29..e16121c 100644
--- a/drivers/staging/fsl-mc/bus/dpio/qbman-portal.c
+++ b/drivers/staging/fsl-mc/bus/dpio/qbman-portal.c
@@ -99,6 +99,14 @@ enum qbman_sdqcr_fc {
qbman_sdqcr_fc_up_to_3 = 1
};

+#define dccvac(p) { asm volatile("dc cvac, %0;" : : "r" (p) : "memory"); }
+#define dcivac(p) { asm volatile("dc ivac, %0" : : "r"(p) : "memory"); }
+static inline void qbman_inval_prefetch(struct qbman_swp *p, uint32_t offset)
+{
+ dcivac(p->addr_cena + offset);
+ prefetch(p->addr_cena + offset);
+}
+
/* Portal Access */

static inline u32 qbman_read_register(struct qbman_swp *p, u32 offset)
@@ -189,7 +197,7 @@ struct qbman_swp *qbman_swp_init(const struct qbman_swp_desc *d)
p->addr_cinh = d->cinh_bar;

reg = qbman_set_swp_cfg(p->dqrr.dqrr_size,
- 1, /* Writes Non-cacheable */
+ 0, /* Writes cacheable */
0, /* EQCR_CI stashing threshold */
3, /* RPM: Valid bit mode, RCR in array mode */
2, /* DCM: Discrete consumption ack mode */
@@ -315,6 +323,7 @@ void qbman_swp_mc_submit(struct qbman_swp *p, void *cmd, u8 cmd_verb)

dma_wmb();
*v = cmd_verb | p->mc.valid_bit;
+ dccvac(cmd);
}

/*
@@ -325,6 +334,7 @@ void *qbman_swp_mc_result(struct qbman_swp *p)
{
u32 *ret, verb;

+ qbman_inval_prefetch(p, QBMAN_CENA_SWP_RR(p->mc.valid_bit));
ret = qbman_get_cmd(p, QBMAN_CENA_SWP_RR(p->mc.valid_bit));

/* Remove the valid-bit - command completed if the rest is non-zero */
@@ -435,6 +445,7 @@ int qbman_swp_enqueue(struct qbman_swp *s, const struct qbman_eq_desc *d,
/* Set the verb byte, have to substitute in the valid-bit */
dma_wmb();
p->verb = d->verb | EQAR_VB(eqar);
+ dccvac(p);

return 0;
}
@@ -627,6 +638,7 @@ int qbman_swp_pull(struct qbman_swp *s, struct qbman_pull_desc *d)
/* Set the verb byte, have to substitute in the valid-bit */
p->verb = d->verb | s->vdq.valid_bit;
s->vdq.valid_bit ^= QB_VALID_BIT;
+ dccvac(p);

return 0;
}
@@ -680,8 +692,7 @@ const struct dpaa2_dq *qbman_swp_dqrr_next(struct qbman_swp *s)
s->dqrr.next_idx, pi);
s->dqrr.reset_bug = 0;
}
- prefetch(qbman_get_cmd(s,
- QBMAN_CENA_SWP_DQRR(s->dqrr.next_idx)));
+ qbman_inval_prefetch(s, QBMAN_CENA_SWP_DQRR(s->dqrr.next_idx));
}

p = qbman_get_cmd(s, QBMAN_CENA_SWP_DQRR(s->dqrr.next_idx));
@@ -696,8 +707,7 @@ const struct dpaa2_dq *qbman_swp_dqrr_next(struct qbman_swp *s)
* knew from reading PI.
*/
if ((verb & QB_VALID_BIT) != s->dqrr.valid_bit) {
- prefetch(qbman_get_cmd(s,
- QBMAN_CENA_SWP_DQRR(s->dqrr.next_idx)));
+ qbman_inval_prefetch(s, QBMAN_CENA_SWP_DQRR(s->dqrr.next_idx));
return NULL;
}
/*
@@ -720,7 +730,7 @@ const struct dpaa2_dq *qbman_swp_dqrr_next(struct qbman_swp *s)
(flags & DPAA2_DQ_STAT_EXPIRED))
atomic_inc(&s->vdq.available);

- prefetch(qbman_get_cmd(s, QBMAN_CENA_SWP_DQRR(s->dqrr.next_idx)));
+ qbman_inval_prefetch(s, QBMAN_CENA_SWP_DQRR(s->dqrr.next_idx));

return p;
}
@@ -848,6 +858,7 @@ int qbman_swp_release(struct qbman_swp *s, const struct qbman_release_desc *d,
*/
dma_wmb();
p->verb = d->verb | RAR_VB(rar) | num_buffers;
+ dccvac(p);

return 0;
}
--
2.7.4

2017-04-21 09:12:35

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 1/3] arm64: extend ioremap for cacheable non-shareable memory

Hi,

I notice you missed Catalin and Will from Cc. In future, please ensure
that you Cc them when altering arm64 arch code.

On Thu, Apr 20, 2017 at 03:34:16PM -0400, Haiying Wang wrote:
> NXP arm64 based SoC needs to allocate cacheable and
> non-shareable memory for the software portals of
> Queue manager, so we extend the arm64 ioremap support
> for this memory attribute.

NAK to this patch.

It is not possible to safely use Non-Shareable attributes in Linux page
tables, given that these page tables are shared by all PEs (i.e. CPUs).

My understanding is that if several PEs map a region as Non-Shareable,
the usual background behaviour of the PEs (e.g. speculation,
prefetching, natural eviction) mean that uniprocessor semantics are not
guaranteed (i.e. a read following a write may see stale data).

For example, in a system like:

+------+ +------+
| PE-a | | PE-b |
+------+ +------+
| L1-a | | L1-b |
+------+ +------+
|| ||
+----------------+
| Shared cache |
+----------------+
||
+----------------+
| Memory |
+----------------+

... you could have a sequence like:

1) PE-a allocates a line into L1-a for address X in preparation for a
store.

2) PE-b allocates a line into L1-b for the same address X as a result of
speculation.

3) PE-a makes a store to the line in L1-a. Since address X is mapped as
Non-shareable, no snoops are performed to keep other copies of the
line in sync.

4) As a result of explicit maintenance or as a natural eviction, L1-a
evicts its line into shared cache. The shared cache subsequently
evicts this to memory.

5) L1-b evicts its line to shared cache as a natural eviction.

6) L1-a fetches the line from shared cache in response to a load by
PE-a, returning stale data (i.e. the store is lost).

No amount of cache maintenance can avoid this. In general, Non-Shareable
mappings are a bad idea.

Thanks,
Mark.

> Signed-off-by: Haiying Wang <[email protected]>
> ---
> arch/arm64/include/asm/io.h | 1 +
> arch/arm64/include/asm/pgtable-prot.h | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
> index 0c00c87..b6f03e7 100644
> --- a/arch/arm64/include/asm/io.h
> +++ b/arch/arm64/include/asm/io.h
> @@ -170,6 +170,7 @@ extern void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size);
> #define ioremap_nocache(addr, size) __ioremap((addr), (size), __pgprot(PROT_DEVICE_nGnRE))
> #define ioremap_wc(addr, size) __ioremap((addr), (size), __pgprot(PROT_NORMAL_NC))
> #define ioremap_wt(addr, size) __ioremap((addr), (size), __pgprot(PROT_DEVICE_nGnRE))
> +#define ioremap_cache_ns(addr, size) __ioremap((addr), (size), __pgprot(PROT_NORMAL_NS))
> #define iounmap __iounmap
>
> /*
> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> index 2142c77..7fc7910 100644
> --- a/arch/arm64/include/asm/pgtable-prot.h
> +++ b/arch/arm64/include/asm/pgtable-prot.h
> @@ -42,6 +42,7 @@
> #define PROT_NORMAL_NC (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL_NC))
> #define PROT_NORMAL_WT (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL_WT))
> #define PROT_NORMAL (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL))
> +#define PROT_NORMAL_NS (PTE_TYPE_PAGE | PTE_AF | PTE_PXN | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL))
>
> #define PROT_SECT_DEVICE_nGnRE (PROT_SECT_DEFAULT | PMD_SECT_PXN | PMD_SECT_UXN | PMD_ATTRINDX(MT_DEVICE_nGnRE))
> #define PROT_SECT_NORMAL (PROT_SECT_DEFAULT | PMD_SECT_PXN | PMD_SECT_UXN | PMD_ATTRINDX(MT_NORMAL))
> --
> 2.7.4
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

2017-04-21 09:25:56

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 2/3] bus: fsl-mc: dpio: enable qbman CENA portal memory access

Hi,

On Thu, Apr 20, 2017 at 03:34:17PM -0400, Haiying Wang wrote:
> Once we enable the cacheable portal memory, we need to do
> cache flush for enqueue, vdq, buffer release, and management
> commands, as well as invalidate and prefetch for the valid bit
> of management command response and next index of dqrr.
>
> Signed-off-by: Haiying Wang <[email protected]>
> ---
> drivers/staging/fsl-mc/bus/dpio/qbman-portal.c | 23 +++++++++++++++++------
> 1 file changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/staging/fsl-mc/bus/dpio/qbman-portal.c b/drivers/staging/fsl-mc/bus/dpio/qbman-portal.c
> index 2a3ea29..e16121c 100644
> --- a/drivers/staging/fsl-mc/bus/dpio/qbman-portal.c
> +++ b/drivers/staging/fsl-mc/bus/dpio/qbman-portal.c
> @@ -99,6 +99,14 @@ enum qbman_sdqcr_fc {
> qbman_sdqcr_fc_up_to_3 = 1
> };
>
> +#define dccvac(p) { asm volatile("dc cvac, %0;" : : "r" (p) : "memory"); }
> +#define dcivac(p) { asm volatile("dc ivac, %0" : : "r"(p) : "memory"); }

Please don't place arm64-specific assembly in drivers, and use the APIs
the architecture provides. We have a suite of cache maintenance APIs,
and abstractions atop of those that make them easier to deal with.

This patch pushes the code further from being acceptable outside of
staging.


> +static inline void qbman_inval_prefetch(struct qbman_swp *p, uint32_t offset)
> +{
> + dcivac(p->addr_cena + offset);
> + prefetch(p->addr_cena + offset);
> +}

It doesn't make sense to me to clean+invalidate an arbitrary amount of
data, then prefetch an arbitrary amount. What exactly is this code
trying to do?

If this is intended to make data visible to a non-coherent observer, a
clean would be sufficient.

If this is intended to make data from a non-coherent master visible to
the CPUs, then an invalidate would be sufficient.

In either case, we *must* take the cache line size, and the size of the
data into account. If the same cache line is being modified by multiple
observers, this will corrupt the data in some cases.

Architecturally cache maintenance requires particular memory barriers,
which are missing here and/or in callers. At best, this works by chance.

The existing cache maintenance APIs handle this, and all you need to do
here is to ensure suitable alignment and padding. Please don't write
your own cache maintenance like this.

Thanks,
Mark.

> +
> /* Portal Access */
>
> static inline u32 qbman_read_register(struct qbman_swp *p, u32 offset)
> @@ -189,7 +197,7 @@ struct qbman_swp *qbman_swp_init(const struct qbman_swp_desc *d)
> p->addr_cinh = d->cinh_bar;
>
> reg = qbman_set_swp_cfg(p->dqrr.dqrr_size,
> - 1, /* Writes Non-cacheable */
> + 0, /* Writes cacheable */
> 0, /* EQCR_CI stashing threshold */
> 3, /* RPM: Valid bit mode, RCR in array mode */
> 2, /* DCM: Discrete consumption ack mode */
> @@ -315,6 +323,7 @@ void qbman_swp_mc_submit(struct qbman_swp *p, void *cmd, u8 cmd_verb)
>
> dma_wmb();
> *v = cmd_verb | p->mc.valid_bit;
> + dccvac(cmd);
> }
>
> /*
> @@ -325,6 +334,7 @@ void *qbman_swp_mc_result(struct qbman_swp *p)
> {
> u32 *ret, verb;
>
> + qbman_inval_prefetch(p, QBMAN_CENA_SWP_RR(p->mc.valid_bit));
> ret = qbman_get_cmd(p, QBMAN_CENA_SWP_RR(p->mc.valid_bit));
>
> /* Remove the valid-bit - command completed if the rest is non-zero */
> @@ -435,6 +445,7 @@ int qbman_swp_enqueue(struct qbman_swp *s, const struct qbman_eq_desc *d,
> /* Set the verb byte, have to substitute in the valid-bit */
> dma_wmb();
> p->verb = d->verb | EQAR_VB(eqar);
> + dccvac(p);
>
> return 0;
> }
> @@ -627,6 +638,7 @@ int qbman_swp_pull(struct qbman_swp *s, struct qbman_pull_desc *d)
> /* Set the verb byte, have to substitute in the valid-bit */
> p->verb = d->verb | s->vdq.valid_bit;
> s->vdq.valid_bit ^= QB_VALID_BIT;
> + dccvac(p);
>
> return 0;
> }
> @@ -680,8 +692,7 @@ const struct dpaa2_dq *qbman_swp_dqrr_next(struct qbman_swp *s)
> s->dqrr.next_idx, pi);
> s->dqrr.reset_bug = 0;
> }
> - prefetch(qbman_get_cmd(s,
> - QBMAN_CENA_SWP_DQRR(s->dqrr.next_idx)));
> + qbman_inval_prefetch(s, QBMAN_CENA_SWP_DQRR(s->dqrr.next_idx));
> }
>
> p = qbman_get_cmd(s, QBMAN_CENA_SWP_DQRR(s->dqrr.next_idx));
> @@ -696,8 +707,7 @@ const struct dpaa2_dq *qbman_swp_dqrr_next(struct qbman_swp *s)
> * knew from reading PI.
> */
> if ((verb & QB_VALID_BIT) != s->dqrr.valid_bit) {
> - prefetch(qbman_get_cmd(s,
> - QBMAN_CENA_SWP_DQRR(s->dqrr.next_idx)));
> + qbman_inval_prefetch(s, QBMAN_CENA_SWP_DQRR(s->dqrr.next_idx));
> return NULL;
> }
> /*
> @@ -720,7 +730,7 @@ const struct dpaa2_dq *qbman_swp_dqrr_next(struct qbman_swp *s)
> (flags & DPAA2_DQ_STAT_EXPIRED))
> atomic_inc(&s->vdq.available);
>
> - prefetch(qbman_get_cmd(s, QBMAN_CENA_SWP_DQRR(s->dqrr.next_idx)));
> + qbman_inval_prefetch(s, QBMAN_CENA_SWP_DQRR(s->dqrr.next_idx));
>
> return p;
> }
> @@ -848,6 +858,7 @@ int qbman_swp_release(struct qbman_swp *s, const struct qbman_release_desc *d,
> */
> dma_wmb();
> p->verb = d->verb | RAR_VB(rar) | num_buffers;
> + dccvac(p);
>
> return 0;
> }
> --
> 2.7.4
>
>

2017-04-21 09:27:35

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 3/3] bus: fsl-mc: dpio: change CENA regs to be cacheable

On Thu, Apr 20, 2017 at 03:34:18PM -0400, Haiying Wang wrote:
> plus non-shareable to meet the performance requirement.
> QMan's CENA region contains registers and structures that
> are 64 bytes in size and are intended to be accessed using a
> single 64-byte bus transaction, therefore this portal
> memory should be configured as cache-enabled. Also, because
> the write-allocate stash transactions of QBMan should be
> issued as cacheable and non-coherent (non-shareable), we
> need to configure this region to be non-shareable.
>
> Signed-off-by: Haiying Wang <[email protected]>
> ---
> drivers/staging/fsl-mc/bus/dpio/dpio-driver.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/staging/fsl-mc/bus/dpio/dpio-driver.c b/drivers/staging/fsl-mc/bus/dpio/dpio-driver.c
> index e36da20..97f909c 100644
> --- a/drivers/staging/fsl-mc/bus/dpio/dpio-driver.c
> +++ b/drivers/staging/fsl-mc/bus/dpio/dpio-driver.c
> @@ -168,10 +168,10 @@ static int dpaa2_dpio_probe(struct fsl_mc_device *dpio_dev)
> desc.cpu = next_cpu;
>
> /*
> - * Set the CENA regs to be the cache inhibited area of the portal to
> - * avoid coherency issues if a user migrates to another core.
> + * Set the CENA regs to be the cache enabled area of the portal to
> + * achieve the best performance.

Is migrating to another core no longer a concern?

As with my prior comments regarding the Non-Shareable mapping, I do not
think this makes sense.

Thanks,
Mark.

> */
> - desc.regs_cena = ioremap_wc(dpio_dev->regions[1].start,
> + desc.regs_cena = ioremap_cache_ns(dpio_dev->regions[1].start,
> resource_size(&dpio_dev->regions[1]));
> desc.regs_cinh = ioremap(dpio_dev->regions[1].start,
> resource_size(&dpio_dev->regions[1]));
> --
> 2.7.4
>
>

2017-04-21 14:30:40

by Roy Pledge

[permalink] [raw]
Subject: Re: [PATCH 1/3] arm64: extend ioremap for cacheable non-shareable memory

These transactions are done in HW via an ACP port which, if I remember correctly, only supports non-coherent transactions. I will need to go back and check through email conversations I had with Catalin last year when debugging an issue using this mechanism (cacheable/non-shareable mapping), but it was deemed to be an architecturally valid ARM setup for this type of device.

Just for some background: the page the QBMan device presents to a core is only accessed by a single core (i.e. SW portals are core-affine). In this model each page is always mapped as non-shareable and another core will never access it. The important factor is that it is not DDR memory being mapped non-shareable, but a non-coherent master on the bus in our SoC. I agree regular RAM shouldn't be mapped this way, but we cannot map this device as cacheable/shareable (coherent) on CCN-504 devices without getting exceptions from the CCN-504. Treating it as non-cacheable is functionally OK, but performance suffers in that case.

Your help will be appreciated as we want to get support for these devices with good performance in upstream kernels.

Roy

On 2017-04-21, 5:11 AM, "Mark Rutland" <[email protected]> wrote:

Hi,

I notice you missed Catalin and Will from Cc. In future, please ensure
that you Cc them when altering arm64 arch code.

On Thu, Apr 20, 2017 at 03:34:16PM -0400, Haiying Wang wrote:
> NXP arm64 based SoC needs to allocate cacheable and
> non-shareable memory for the software portals of
> Queue manager, so we extend the arm64 ioremap support
> for this memory attribute.

NAK to this patch.

It is not possible to safely use Non-Shareable attributes in Linux page
tables, given that these page tables are shared by all PEs (i.e. CPUs).

My understanding is that if several PEs map a region as Non-Shareable,
the usual background behaviour of the PEs (e.g. speculation,
prefetching, natural eviction) mean that uniprocessor semantics are not
guaranteed (i.e. a read following a write may see stale data).

For example, in a system like:

+------+ +------+
| PE-a | | PE-b |
+------+ +------+
| L1-a | | L1-b |
+------+ +------+
|| ||
+----------------+
| Shared cache |
+----------------+
||
+----------------+
| Memory |
+----------------+

... you could have a sequence like:

1) PE-a allocates a line into L1-a for address X in preparation for a
store.

2) PE-b allocates a line into L1-b for the same address X as a result of
speculation.

3) PE-a makes a store to the line in L1-a. Since address X is mapped as
Non-shareable, no snoops are performed to keep other copies of the
line in sync.

4) As a result of explicit maintenance or as a natural eviction, L1-a
evicts its line into shared cache. The shared cache subsequently
evicts this to memory.

5) L1-b evicts its line to shared cache as a natural eviction.

6) L1-a fetches the line from shared cache in response to a load by
PE-a, returning stale data (i.e. the store is lost).

No amount of cache maintenance can avoid this. In general, Non-Shareable
mappings are a bad idea.

Thanks,
Mark.

> Signed-off-by: Haiying Wang <[email protected]>
> ---
> arch/arm64/include/asm/io.h | 1 +
> arch/arm64/include/asm/pgtable-prot.h | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
> index 0c00c87..b6f03e7 100644
> --- a/arch/arm64/include/asm/io.h
> +++ b/arch/arm64/include/asm/io.h
> @@ -170,6 +170,7 @@ extern void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size);
> #define ioremap_nocache(addr, size) __ioremap((addr), (size), __pgprot(PROT_DEVICE_nGnRE))
> #define ioremap_wc(addr, size) __ioremap((addr), (size), __pgprot(PROT_NORMAL_NC))
> #define ioremap_wt(addr, size) __ioremap((addr), (size), __pgprot(PROT_DEVICE_nGnRE))
> +#define ioremap_cache_ns(addr, size) __ioremap((addr), (size), __pgprot(PROT_NORMAL_NS))
> #define iounmap __iounmap
>
> /*
> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> index 2142c77..7fc7910 100644
> --- a/arch/arm64/include/asm/pgtable-prot.h
> +++ b/arch/arm64/include/asm/pgtable-prot.h
> @@ -42,6 +42,7 @@
> #define PROT_NORMAL_NC (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL_NC))
> #define PROT_NORMAL_WT (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL_WT))
> #define PROT_NORMAL (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL))
> +#define PROT_NORMAL_NS (PTE_TYPE_PAGE | PTE_AF | PTE_PXN | PTE_UXN | PTE_DIRTY | PTE_WRITE | PTE_ATTRINDX(MT_NORMAL))
>
> #define PROT_SECT_DEVICE_nGnRE (PROT_SECT_DEFAULT | PMD_SECT_PXN | PMD_SECT_UXN | PMD_ATTRINDX(MT_DEVICE_nGnRE))
> #define PROT_SECT_NORMAL (PROT_SECT_DEFAULT | PMD_SECT_PXN | PMD_SECT_UXN | PMD_ATTRINDX(MT_NORMAL))
> --
> 2.7.4
>
>



2017-04-21 14:37:32

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH 2/3] bus: fsl-mc: dpio: enable qbman CENA portal memory access

On Thu, Apr 20, 2017 at 03:34:17PM -0400, Haiying Wang wrote:
> Once we enable the cacheable portal memory, we need to do
> cache flush for enqueue, vdq, buffer release, and management
> commands, as well as invalidate and prefetch for the valid bit
> of management command response and next index of dqrr.
>
> Signed-off-by: Haiying Wang <[email protected]>

For the record, NAK on the whole series. It may appear to improve the
performance a bit but this is not architecturally compliant and relies
heavily on specific CPU and SoC implementation details which may
fail in very subtle way.

> --- a/drivers/staging/fsl-mc/bus/dpio/qbman-portal.c
> +++ b/drivers/staging/fsl-mc/bus/dpio/qbman-portal.c
> @@ -99,6 +99,14 @@ enum qbman_sdqcr_fc {
> qbman_sdqcr_fc_up_to_3 = 1
> };
>
> +#define dccvac(p) { asm volatile("dc cvac, %0;" : : "r" (p) : "memory"); }
> +#define dcivac(p) { asm volatile("dc ivac, %0" : : "r"(p) : "memory"); }
> +static inline void qbman_inval_prefetch(struct qbman_swp *p, uint32_t offset)
> +{
> + dcivac(p->addr_cena + offset);
> + prefetch(p->addr_cena + offset);
> +}
[...]
> @@ -435,6 +445,7 @@ int qbman_swp_enqueue(struct qbman_swp *s, const struct qbman_eq_desc *d,
> /* Set the verb byte, have to substitute in the valid-bit */
> dma_wmb();
> p->verb = d->verb | EQAR_VB(eqar);
> + dccvac(p);

The command buffer is on a device and currently mapped as write-combined
memory (that is Normal Non-cacheable on ARM). To enqueue a command, the
current code populates the struct qbman_eq_desc fields, followed by a
dma_wmb() (that is a DSB SY on ARM) and the subsequent update of the
"verb" byte to mark the descriptor as valid. These are all pretty much
*slave* (vs master) accesses since the device does not read such memory
through the system's point of serialisation. The DSB may happen to
ensure the ordering of the verb update on your specific SoC, though
architecturally this is not guaranteed.

With your cacheable mapping patch, things get worse. The above dma_wmb()
does not have any effect at all with respect to the QBMan device since
all the code does is populate qbman_eq_desc *within* the CPU's L1 cache
which the device does *not* see. Once you call the dccvac() macro, you
would find that the "verb" is evicted first (since it's the beginning of
the qbman_eq_desc structure which is not what you want). The
interconnect may also split the cache eviction into multiple
transactions, so what the device would see is not guaranteed to be
atomic.

Similarly for the dequeue ring, the CPU is allowed to generate
transactions to fill its cache lines in any order (usually it can start
on any of the 4 quad words of the cache line), so it could as well
preload stale descriptors but with a valid verb.

--
Catalin

2017-04-25 13:43:07

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 1/3] arm64: extend ioremap for cacheable non-shareable memory

On Fri, Apr 21, 2017 at 02:30:32PM +0000, Roy Pledge wrote:
> These transactions are done in HW via an ACP port which if I remember
> correctly only supports non coherent transactions. I will need to go
> back and check through email conversations I had with Catalin last
> year when debugging an issue using this mechanism
> (cacheable/nonshareable mapping) but it was deemed to be valid ARM
> setup architecturally for this type of device.
>
> Just for some background the page the QBMan device presented to a core
> is only accessed by a single core (i.e. SW portals are core affine).
> In this model each page is always mapped as non shareable and another
> core will never access it.

You cannot guarantee this given the page tables are used by multiple
CPUs.

The problem is not explicit memory accesses performed by code. As you
suggest, you can enforce that instructions accessing memory are only
architecturally executed on certain CPUs.

The problem is that even without any explicit access, a CPU can
implicitly access any region of Normal memory, at any time, for any
reason the microarchitecture sees fit to do so.

For example, the core may speculate some arbitrary instruction sequence,
which (perhaps by chance) generates an address falling in the
Non-shareable region, and attempts to load from it. The results might be
thrown away (i.e. the sequence wasn't architecturally executed), but the
speculated accesses will affect the memory system, and can result in
problems such as what I described previously.

Further, a cache maintenance op on a VA is only guaranteed to affect
caches scoped to the shareability domain of that VA. So no amount of
cache maintenance can provide the guarantees you require.

Practically speaking, because of such issues, it does not make sense for
Linux to use Non-shareable mappings.

> The important factor is that it is not DDR memory being mapped non
> sharable, but a non-coherent master on the bus in our SoC. I agree
> regular RAM shouldn’t be mapped this way but we cannot map this device
> as cacheable/shareable (coherent) on CCN-504 devices without getting
> exceptions from the CCN-504.

The problem is that multiple CPUs have a Non-shareable mapping for the
same physical address. What in particular is being mapped is immaterial.

> Treating it as non cacheable is functionally OK but performance
> suffers in that case.

Given that mapping these regions as Non-shareable is not functionally
OK, and given that you are unable to use coherent transactions, the only
option is to use a Non-cacheable mapping.

Thanks,
Mark.