From: Arnd Bergmann <[email protected]>
After a long discussion about adding SoC specific semantics for when
to flush caches in drivers/soc/ drivers, an approach that we determined
to be fundamentally flawed[1], I volunteered to try to move that logic
into architecture-independent code and make all existing architectures
do the same thing.
As we had determined earlier, the behavior is wildly different across
architectures, but most of the differences come down to either bugs
(when required flushes are missing) or extra flushes that are harmless
but might hurt performance.
I finally found the time to come up with an implementation of this, which
starts by replacing every outlier with one of the three common options:
1. architectures without speculative prefetching (hexagon, m68k,
openrisc, sh, sparc, and certain armv4 and xtensa implementations)
only flush their caches before a DMA, by cleaning write-back caches
(if any) before a DMA to the device, and by invalidating the caches
before a DMA from a device
2. arc, microblaze, mips, nios2, sh and later xtensa now follow the
normal 32-bit arm model and invalidate their writeback caches
again after a DMA from the device, to remove stale cache lines
that got prefetched during the DMA. arc, csky and mips used to
invalidate buffers also before the bidirectional DMA, but this
is now skipped whenever we know it gets invalidated again
after the DMA.
3. parisc, powerpc and riscv already flushed buffers before
a DMA_FROM_DEVICE, and these get moved to the arm64 behavior
that does the writeback before and invalidate after both
DMA_FROM_DEVICE and DMA_BIDIRECTIONAL in order to avoid the
problem of accidentally leaking stale data if the DMA does
not actually happen[2].
The last patch in the series replaces the architecture specific code
with a shared version that implements all three based on architecture
specific parameters that are almost always determined at compile time.
The difference between cases 1. and 2. is hardware specific, while the
choice between 2. and 3. is a question of which semantics we want; I
explicitly avoid that question in this series and leave it to be
decided later.
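To make the three cases concrete, here is a rough sketch of what the
shared implementation boils down to. This is not the actual code from
the last patch; the two compile-time parameters (ARCH_DMA_CPU_PREFETCHES
and ARCH_DMA_CLEAN_BEFORE_FROM_DEVICE) and the arch_dma_cache_*()
helpers are made-up names for illustration only:

	void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
				      enum dma_data_direction dir)
	{
		switch (dir) {
		case DMA_TO_DEVICE:
			/* all three cases: write out dirty lines */
			arch_dma_cache_wback(paddr, size);
			break;
		case DMA_FROM_DEVICE:
			if (ARCH_DMA_CLEAN_BEFORE_FROM_DEVICE)
				arch_dma_cache_wback(paddr, size);	/* case 3 */
			else
				arch_dma_cache_inv(paddr, size);	/* cases 1, 2 */
			break;
		case DMA_BIDIRECTIONAL:
			if (ARCH_DMA_CPU_PREFETCHES)
				arch_dma_cache_wback(paddr, size);	/* cases 2, 3 */
			else
				arch_dma_cache_wback_inv(paddr, size);	/* case 1 */
			break;
		default:
			break;
		}
	}

	void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
				   enum dma_data_direction dir)
	{
		/* cases 2, 3: drop lines that were prefetched during the DMA */
		if (ARCH_DMA_CPU_PREFETCHES && dir != DMA_TO_DEVICE)
			arch_dma_cache_inv(paddr, size);
	}

The real patch expresses the same three-way split through architecture
specific parameters as described above.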
Another difference that I do not address here is what cache invalidation
does for partial cache lines. On arm32, arm64 and powerpc, a partial
cache line always gets written back before invalidation in order to
ensure that data before or after the buffer is not discarded. On all
other architectures, the assumption is that cache lines are never shared
between a DMA buffer and data that is accessed by the CPU. If we end up
always writing back dirty cache lines before a DMA (option 3 above),
then this point becomes moot; otherwise we should probably address this
in a follow-up series to document one behavior or the other and implement
it consistently.
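For illustration, the aligned/unaligned distinction as it appears in
the powerpc code quoted further down in this series looks like this: a
buffer whose start or end is not cache line aligned gets a full flush
(writeback+invalidate) instead of a plain invalidate, so that unrelated
data sharing the first or last cache line is not discarded:

	/*
	 * invalidate only when cache-line aligned, otherwise there is
	 * the potential for discarding uncommitted data from the cache
	 */
	if ((start | end) & (L1_CACHE_BYTES - 1))
		flush_dcache_range(start, end);	/* writeback + invalidate */
	else
		invalidate_dcache_range(start, end);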
Please review!
Arnd
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/
Arnd Bergmann (21):
openrisc: dma-mapping: flush bidirectional mappings
xtensa: dma-mapping: use normal cache invalidation rules
sparc32: flush caches in dma_sync_*for_device
microblaze: dma-mapping: skip extra DMA flushes
powerpc: dma-mapping: split out cache operation logic
powerpc: dma-mapping: minimize for_cpu flushing
powerpc: dma-mapping: always clean cache in _for_device() op
riscv: dma-mapping: only invalidate after DMA, not flush
riscv: dma-mapping: skip invalidation before bidirectional DMA
csky: dma-mapping: skip invalidating before DMA from device
mips: dma-mapping: skip invalidating before bidirectional DMA
mips: dma-mapping: split out cache operation logic
arc: dma-mapping: skip invalidating before bidirectional DMA
parisc: dma-mapping: use regular flush/invalidate ops
ARM: dma-mapping: always invalidate WT caches before DMA
ARM: dma-mapping: bring back dmac_{clean,inv}_range
ARM: dma-mapping: use arch_sync_dma_for_{device,cpu}() internally
ARM: drop SMP support for ARM11MPCore
ARM: dma-mapping: use generic form of arch_sync_dma_* helpers
ARM: dma-mapping: split out arch_dma_mark_clean() helper
dma-mapping: replace custom code with generic implementation
arch/arc/mm/dma.c | 66 ++------
arch/arm/Kconfig | 4 +
arch/arm/include/asm/cacheflush.h | 21 +++
arch/arm/include/asm/glue-cache.h | 4 +
arch/arm/mach-oxnas/Kconfig | 4 -
arch/arm/mach-oxnas/Makefile | 1 -
arch/arm/mach-oxnas/headsmp.S | 23 ---
arch/arm/mach-oxnas/platsmp.c | 96 -----------
arch/arm/mach-versatile/platsmp-realview.c | 4 -
arch/arm/mm/Kconfig | 19 ---
arch/arm/mm/cache-fa.S | 4 +-
arch/arm/mm/cache-nop.S | 6 +
arch/arm/mm/cache-v4.S | 13 +-
arch/arm/mm/cache-v4wb.S | 4 +-
arch/arm/mm/cache-v4wt.S | 22 ++-
arch/arm/mm/cache-v6.S | 35 +---
arch/arm/mm/cache-v7.S | 6 +-
arch/arm/mm/cache-v7m.S | 4 +-
arch/arm/mm/dma-mapping-nommu.c | 36 ++--
arch/arm/mm/dma-mapping.c | 181 ++++++++++-----------
arch/arm/mm/proc-arm1020.S | 4 +-
arch/arm/mm/proc-arm1020e.S | 4 +-
arch/arm/mm/proc-arm1022.S | 4 +-
arch/arm/mm/proc-arm1026.S | 4 +-
arch/arm/mm/proc-arm920.S | 4 +-
arch/arm/mm/proc-arm922.S | 4 +-
arch/arm/mm/proc-arm925.S | 4 +-
arch/arm/mm/proc-arm926.S | 4 +-
arch/arm/mm/proc-arm940.S | 4 +-
arch/arm/mm/proc-arm946.S | 4 +-
arch/arm/mm/proc-feroceon.S | 8 +-
arch/arm/mm/proc-macros.S | 2 +
arch/arm/mm/proc-mohawk.S | 4 +-
arch/arm/mm/proc-xsc3.S | 4 +-
arch/arm/mm/proc-xscale.S | 6 +-
arch/arm64/mm/dma-mapping.c | 28 ++--
arch/csky/mm/dma-mapping.c | 46 +++---
arch/hexagon/kernel/dma.c | 44 ++---
arch/m68k/kernel/dma.c | 43 +++--
arch/microblaze/kernel/dma.c | 38 ++---
arch/mips/mm/dma-noncoherent.c | 75 +++------
arch/nios2/mm/dma-mapping.c | 57 +++----
arch/openrisc/kernel/dma.c | 62 ++++---
arch/parisc/include/asm/cacheflush.h | 6 +-
arch/parisc/kernel/pci-dma.c | 33 +++-
arch/powerpc/mm/dma-noncoherent.c | 76 +++++----
arch/riscv/mm/dma-noncoherent.c | 51 +++---
arch/sh/kernel/dma-coherent.c | 43 +++--
arch/sparc/Kconfig | 2 +-
arch/sparc/kernel/ioport.c | 38 +++--
arch/xtensa/Kconfig | 1 -
arch/xtensa/include/asm/cacheflush.h | 6 +-
arch/xtensa/kernel/pci-dma.c | 47 +++---
include/linux/dma-sync.h | 107 ++++++++++++
54 files changed, 721 insertions(+), 699 deletions(-)
delete mode 100644 arch/arm/mach-oxnas/headsmp.S
delete mode 100644 arch/arm/mach-oxnas/platsmp.c
create mode 100644 include/linux/dma-sync.h
--
2.39.2
Cc: Vineet Gupta <[email protected]>
Cc: Russell King <[email protected]>
Cc: Neil Armstrong <[email protected]>
Cc: Linus Walleij <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Brian Cain <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Lad Prabhakar <[email protected]>
Cc: Conor Dooley <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
From: Arnd Bergmann <[email protected]>
xtensa is one of the platforms that has both write-back and write-through
caches, and needs to account for both in its DMA mapping operations.
It does this through a set of operations that is different from any
other architecture. This is not a problem by itself, but it makes it
rather hard to figure out whether this is correct or not, and to unify
this implementation with the others.
Change the semantics to the usual ones for non-speculating CPUs:
- On DMA_TO_DEVICE, call __flush_dcache_range() to perform the
writeback even on writethrough caches, where this is a nop.
- On DMA_FROM_DEVICE, invalidate the mapping before the DMA rather
than afterwards.
- On DMA_BIDIRECTIONAL, combine the pre-writeback with the
post-invalidate into a call to __flush_invalidate_dcache_range()
that turns into a simple invalidate on writethrough caches.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/xtensa/Kconfig | 1 -
arch/xtensa/include/asm/cacheflush.h | 6 +++---
arch/xtensa/kernel/pci-dma.c | 29 +++++-----------------------
3 files changed, 8 insertions(+), 28 deletions(-)
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index bcb0c5d2abc2..b938bacbb9af 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -8,7 +8,6 @@ config XTENSA
select ARCH_HAS_DMA_PREP_COHERENT if MMU
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOV
- select ARCH_HAS_SYNC_DMA_FOR_CPU if MMU
select ARCH_HAS_SYNC_DMA_FOR_DEVICE if MMU
select ARCH_HAS_DMA_SET_UNCACHED if MMU
select ARCH_HAS_STRNCPY_FROM_USER if !KASAN
diff --git a/arch/xtensa/include/asm/cacheflush.h b/arch/xtensa/include/asm/cacheflush.h
index 7b4359312c25..2f645d25565a 100644
--- a/arch/xtensa/include/asm/cacheflush.h
+++ b/arch/xtensa/include/asm/cacheflush.h
@@ -61,9 +61,9 @@ static inline void __flush_dcache_page(unsigned long va)
static inline void __flush_dcache_range(unsigned long va, unsigned long sz)
{
}
-# define __flush_invalidate_dcache_all() __invalidate_dcache_all()
-# define __flush_invalidate_dcache_page(p) __invalidate_dcache_page(p)
-# define __flush_invalidate_dcache_range(p,s) __invalidate_dcache_range(p,s)
+# define __flush_invalidate_dcache_all __invalidate_dcache_all
+# define __flush_invalidate_dcache_page __invalidate_dcache_page
+# define __flush_invalidate_dcache_range __invalidate_dcache_range
#endif
#if defined(CONFIG_MMU) && (DCACHE_WAY_SIZE > PAGE_SIZE)
diff --git a/arch/xtensa/kernel/pci-dma.c b/arch/xtensa/kernel/pci-dma.c
index 94955caa4488..ff3bf015eca4 100644
--- a/arch/xtensa/kernel/pci-dma.c
+++ b/arch/xtensa/kernel/pci-dma.c
@@ -43,38 +43,19 @@ static void do_cache_op(phys_addr_t paddr, size_t size,
}
}
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
+void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
switch (dir) {
- case DMA_BIDIRECTIONAL:
+ case DMA_TO_DEVICE:
+ do_cache_op(paddr, size, __flush_dcache_range);
+ break;
case DMA_FROM_DEVICE:
do_cache_op(paddr, size, __invalidate_dcache_range);
break;
-
- case DMA_NONE:
- BUG();
- break;
-
- default:
- break;
- }
-}
-
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
-{
- switch (dir) {
case DMA_BIDIRECTIONAL:
- case DMA_TO_DEVICE:
- if (XCHAL_DCACHE_IS_WRITEBACK)
- do_cache_op(paddr, size, __flush_dcache_range);
+ do_cache_op(paddr, size, __flush_invalidate_dcache_range);
break;
-
- case DMA_NONE:
- BUG();
- break;
-
default:
break;
}
--
2.39.2
From: Arnd Bergmann <[email protected]>
LEON has a very minimalistic cache that has no range operations
and requires being flushed entirely to deal with noncoherent
DMA. Most in-order architectures do their cache management in
the dma_sync_*_for_device() operations rather than dma_sync_*_for_cpu().
Since the cache is write-through only, both should have the same
effect, so change it for consistency with the other architectures.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/sparc/Kconfig | 2 +-
arch/sparc/kernel/ioport.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 84437a4c6545..637da50e236c 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -51,7 +51,7 @@ config SPARC
config SPARC32
def_bool !64BIT
select ARCH_32BIT_OFF_T
- select ARCH_HAS_SYNC_DMA_FOR_CPU
+ select ARCH_HAS_SYNC_DMA_FOR_DEVICE
select CLZ_TAB
select DMA_DIRECT_REMAP
select GENERIC_ATOMIC64
diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c
index 4e4f3d3263e4..4f3d26066ec2 100644
--- a/arch/sparc/kernel/ioport.c
+++ b/arch/sparc/kernel/ioport.c
@@ -306,7 +306,7 @@ arch_initcall(sparc_register_ioport);
* On LEON systems without cache snooping, the entire D-CACHE must be flushed to
* make DMA to cacheable memory coherent.
*/
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
+void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
if (dir != DMA_TO_DEVICE &&
--
2.39.2
From: Arnd Bergmann <[email protected]>
The powerpc implementation of arch_sync_dma_for_device() is unique in that
it sometimes performs a full flush for the arch_sync_dma_for_device(paddr,
size, DMA_FROM_DEVICE) operation when the address is unaligned, but
otherwise invalidates the caches.
Since the _for_cpu() counterpart has to invalidate the cache already
in order to avoid stale data from prefetching, this operation only really
needs to ensure that there are no dirty cache lines, which can be done
using either invalidation or cleaning the cache, but not necessarily
both.
Most architectures traditionally go for invalidation here, but as
Will Deacon points out, this can leak old data to user space if
a DMA is started but the device ends up not actually filling the
entire buffer, see the link below.
The same argument applies to DMA_BIDIRECTIONAL transfers. Using
a cache-clean operation is the safe choice here, followed by
invalidating the cache after the DMA to get rid of stale data
that was prefetched before the completion of the DMA.
Link: https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/powerpc/mm/dma-noncoherent.c | 21 +--------------------
1 file changed, 1 insertion(+), 20 deletions(-)
diff --git a/arch/powerpc/mm/dma-noncoherent.c b/arch/powerpc/mm/dma-noncoherent.c
index e108cacf877f..00e59a4faa2b 100644
--- a/arch/powerpc/mm/dma-noncoherent.c
+++ b/arch/powerpc/mm/dma-noncoherent.c
@@ -104,26 +104,7 @@ static void __dma_phys_op(phys_addr_t paddr, size_t size, enum dma_cache_op op)
void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
- switch (dir) {
- case DMA_NONE:
- BUG();
- case DMA_FROM_DEVICE:
- /*
- * invalidate only when cache-line aligned otherwise there is
- * the potential for discarding uncommitted data from the cache
- */
- if ((paddr | size) & (L1_CACHE_BYTES - 1))
- __dma_phys_op(paddr, size, DMA_CACHE_FLUSH);
- else
- __dma_phys_op(paddr, size, DMA_CACHE_INVAL);
- break;
- case DMA_TO_DEVICE: /* writeback only */
- __dma_phys_op(paddr, size, DMA_CACHE_CLEAN);
- break;
- case DMA_BIDIRECTIONAL: /* writeback and invalidate */
- __dma_phys_op(paddr, size, DMA_CACHE_FLUSH);
- break;
- }
+ __dma_phys_op(paddr, size, DMA_CACHE_CLEAN);
}
void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
--
2.39.2
From: Arnd Bergmann <[email protected]>
No other architecture intentionally writes back dirty cache lines into
a buffer that a device has just finished writing into. If the cache is
clean, this has no effect at all, but if a cacheline in the buffer has
actually been written by the CPU, there is a driver bug that is likely
made worse by overwriting that buffer.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/riscv/mm/dma-noncoherent.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
index d919efab6eba..640f4c496d26 100644
--- a/arch/riscv/mm/dma-noncoherent.c
+++ b/arch/riscv/mm/dma-noncoherent.c
@@ -42,7 +42,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
break;
case DMA_FROM_DEVICE:
case DMA_BIDIRECTIONAL:
- ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
+ ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
break;
default:
break;
--
2.39.2
From: Arnd Bergmann <[email protected]>
The microblaze dma_sync_* implementation uses the same function
for both _for_cpu() and _for_device(), which is inconsistent
with other architectures and slightly more expensive.
Split it up into separate functions and skip the parts that
are not needed:
- on dma_sync_*_for_cpu(..., DMA_TO_DEVICE), skip the second
writeback, which does nothing.
- on dma_sync_*_for_cpu(..., DMA_BIDIRECTIONAL), only invalidate
the cache to clear out cache lines that got loaded speculatively,
but skip the extraneous writeback.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/microblaze/kernel/dma.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
index 04d091ade417..b4c4e45fd45e 100644
--- a/arch/microblaze/kernel/dma.c
+++ b/arch/microblaze/kernel/dma.c
@@ -14,8 +14,8 @@
#include <linux/bug.h>
#include <asm/cacheflush.h>
-static void __dma_sync(phys_addr_t paddr, size_t size,
- enum dma_data_direction direction)
+void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir)
{
- switch (direction) {
+ switch (dir) {
case DMA_TO_DEVICE:
@@ -30,14 +30,16 @@ static void __dma_sync(phys_addr_t paddr, size_t size,
}
}
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
-{
- __dma_sync(paddr, size, dir);
-}
-
void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
- __dma_sync(paddr, size, dir);
-}
+ switch (dir) {
+ case DMA_TO_DEVICE:
+ break;
+ case DMA_BIDIRECTIONAL:
+ case DMA_FROM_DEVICE:
+ invalidate_dcache_range(paddr, paddr + size);
+ break;
+ default:
+ BUG();
+ }
+}
--
2.39.2
From: Arnd Bergmann <[email protected]>
For a DMA_BIDIRECTIONAL transfer, the caches have to be cleaned
first to let the device see data written by the CPU, and invalidated
after the transfer to let the CPU see data written by the device.
riscv also invalidates the caches before the transfer, which does
not appear to serve any purpose.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/riscv/mm/dma-noncoherent.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
index 640f4c496d26..69c80b2155a1 100644
--- a/arch/riscv/mm/dma-noncoherent.c
+++ b/arch/riscv/mm/dma-noncoherent.c
@@ -25,7 +25,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
break;
case DMA_BIDIRECTIONAL:
- ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
+ ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
break;
default:
break;
--
2.39.2
From: Arnd Bergmann <[email protected]>
Some architectures that need to invalidate buffers after bidirectional
DMA because of speculative prefetching only do a simpler writeback
before that DMA, while architectures that don't need to do the second
invalidate tend to have a combined writeback+invalidate before the
DMA.
The behavior on mips is slightly inconsistent, as it always
does the invalidation before bidirectional DMA and conditionally
does it a second time.
In order to make the behavior the same as the rest, change it
so that there is exactly one invalidation here, either before
or after the DMA.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/mips/mm/dma-noncoherent.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/mips/mm/dma-noncoherent.c b/arch/mips/mm/dma-noncoherent.c
index 3c4fc97b9f39..b4350faf4f1e 100644
--- a/arch/mips/mm/dma-noncoherent.c
+++ b/arch/mips/mm/dma-noncoherent.c
@@ -65,7 +65,11 @@ static inline void dma_sync_virt_for_device(void *addr, size_t size,
dma_cache_inv((unsigned long)addr, size);
break;
case DMA_BIDIRECTIONAL:
- dma_cache_wback_inv((unsigned long)addr, size);
+ if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
+ cpu_needs_post_dma_flush())
+ dma_cache_wback((unsigned long)addr, size);
+ else
+ dma_cache_wback_inv((unsigned long)addr, size);
break;
default:
BUG();
--
2.39.2
From: Arnd Bergmann <[email protected]>
The powerpc arch_sync_dma_for_device()/arch_sync_dma_for_cpu() functions
behave differently from all other architectures, at least for some of
the operations.
As a preparation for making the behavior more consistent, reorder the
logic in which they decide whether to flush, invalidate or clean the
buffer. No change in behavior is intended.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/powerpc/mm/dma-noncoherent.c | 91 +++++++++++++++++++++----------
1 file changed, 63 insertions(+), 28 deletions(-)
diff --git a/arch/powerpc/mm/dma-noncoherent.c b/arch/powerpc/mm/dma-noncoherent.c
index 30260b5d146d..f10869d27de5 100644
--- a/arch/powerpc/mm/dma-noncoherent.c
+++ b/arch/powerpc/mm/dma-noncoherent.c
@@ -16,31 +16,28 @@
#include <asm/tlbflush.h>
#include <asm/dma.h>
+enum dma_cache_op {
+ DMA_CACHE_CLEAN,
+ DMA_CACHE_INVAL,
+ DMA_CACHE_FLUSH,
+};
+
/*
* make an area consistent.
*/
-static void __dma_sync(void *vaddr, size_t size, int direction)
+static void __dma_op(void *vaddr, size_t size, enum dma_cache_op op)
{
unsigned long start = (unsigned long)vaddr;
unsigned long end = start + size;
- switch (direction) {
- case DMA_NONE:
- BUG();
- case DMA_FROM_DEVICE:
- /*
- * invalidate only when cache-line aligned otherwise there is
- * the potential for discarding uncommitted data from the cache
- */
- if ((start | end) & (L1_CACHE_BYTES - 1))
- flush_dcache_range(start, end);
- else
- invalidate_dcache_range(start, end);
- break;
- case DMA_TO_DEVICE: /* writeback only */
+ switch (op) {
+ case DMA_CACHE_CLEAN:
clean_dcache_range(start, end);
break;
- case DMA_BIDIRECTIONAL: /* writeback and invalidate */
+ case DMA_CACHE_INVAL:
+ invalidate_dcache_range(start, end);
+ break;
+ case DMA_CACHE_FLUSH:
flush_dcache_range(start, end);
break;
}
@@ -48,16 +45,16 @@ static void __dma_sync(void *vaddr, size_t size, int direction)
#ifdef CONFIG_HIGHMEM
/*
- * __dma_sync_page() implementation for systems using highmem.
+ * __dma_highmem_op() implementation for systems using highmem.
* In this case, each page of a buffer must be kmapped/kunmapped
- * in order to have a virtual address for __dma_sync(). This must
+ * in order to have a virtual address for __dma_op(). This must
* not sleep so kmap_atomic()/kunmap_atomic() are used.
*
* Note: yes, it is possible and correct to have a buffer extend
* beyond the first page.
*/
-static inline void __dma_sync_page_highmem(struct page *page,
- unsigned long offset, size_t size, int direction)
+static inline void __dma_highmem_op(struct page *page,
+ unsigned long offset, size_t size, enum dma_cache_op op)
{
size_t seg_size = min((size_t)(PAGE_SIZE - offset), size);
size_t cur_size = seg_size;
@@ -71,7 +68,7 @@ static inline void __dma_sync_page_highmem(struct page *page,
start = (unsigned long)kmap_atomic(page + seg_nr) + seg_offset;
/* Sync this buffer segment */
- __dma_sync((void *)start, seg_size, direction);
+ __dma_op((void *)start, seg_size, op);
kunmap_atomic((void *)start);
seg_nr++;
@@ -88,32 +85,70 @@ static inline void __dma_sync_page_highmem(struct page *page,
#endif /* CONFIG_HIGHMEM */
/*
- * __dma_sync_page makes memory consistent. identical to __dma_sync, but
- * takes a struct page instead of a virtual address
+ * __dma_phys_op makes memory consistent. identical to __dma_op, but
+ * takes a phys_addr_t instead of a virtual address
*/
-static void __dma_sync_page(phys_addr_t paddr, size_t size, int dir)
+static void __dma_phys_op(phys_addr_t paddr, size_t size, enum dma_cache_op op)
{
struct page *page = pfn_to_page(paddr >> PAGE_SHIFT);
unsigned offset = paddr & ~PAGE_MASK;
#ifdef CONFIG_HIGHMEM
- __dma_sync_page_highmem(page, offset, size, dir);
+ __dma_highmem_op(page, offset, size, op);
#else
unsigned long start = (unsigned long)page_address(page) + offset;
- __dma_sync((void *)start, size, dir);
+ __dma_op((void *)start, size, op);
#endif
}
void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
- __dma_sync_page(paddr, size, dir);
+ switch (dir) {
+ case DMA_NONE:
+ BUG();
+ case DMA_FROM_DEVICE:
+ /*
+ * invalidate only when cache-line aligned otherwise there is
+ * the potential for discarding uncommitted data from the cache
+ */
+ if ((paddr | size) & (L1_CACHE_BYTES - 1))
+ __dma_phys_op(paddr, size, DMA_CACHE_FLUSH);
+ else
+ __dma_phys_op(paddr, size, DMA_CACHE_INVAL);
+ break;
+ case DMA_TO_DEVICE: /* writeback only */
+ __dma_phys_op(paddr, size, DMA_CACHE_CLEAN);
+ break;
+ case DMA_BIDIRECTIONAL: /* writeback and invalidate */
+ __dma_phys_op(paddr, size, DMA_CACHE_FLUSH);
+ break;
+ }
}
void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
- __dma_sync_page(paddr, size, dir);
+ switch (dir) {
+ case DMA_NONE:
+ BUG();
+ case DMA_FROM_DEVICE:
+ /*
+ * invalidate only when cache-line aligned otherwise there is
+ * the potential for discarding uncommitted data from the cache
+ */
+ if ((paddr | size) & (L1_CACHE_BYTES - 1))
+ __dma_phys_op(paddr, size, DMA_CACHE_FLUSH);
+ else
+ __dma_phys_op(paddr, size, DMA_CACHE_INVAL);
+ break;
+ case DMA_TO_DEVICE: /* writeback only */
+ __dma_phys_op(paddr, size, DMA_CACHE_CLEAN);
+ break;
+ case DMA_BIDIRECTIONAL: /* writeback and invalidate */
+ __dma_phys_op(paddr, size, DMA_CACHE_FLUSH);
+ break;
+ }
}
void arch_dma_prep_coherent(struct page *page, size_t size)
--
2.39.2
From: Arnd Bergmann <[email protected]>
csky is the only architecture that does a full flush for the
dma_sync_*_for_device(..., DMA_FROM_DEVICE) operation. The requirement
is only to make sure there are no dirty cache lines for the buffer,
which can be done either through an invalidate operation (as on most
architectures including arm32, mips and arc), or a writeback (as on
arm64 and riscv). The cache also has to be invalidated eventually but
csky already does that after the transfer.
Use a 'clean' operation here for consistency with arm64 and riscv.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/csky/mm/dma-mapping.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/arch/csky/mm/dma-mapping.c b/arch/csky/mm/dma-mapping.c
index 82447029feb4..c90f912e2822 100644
--- a/arch/csky/mm/dma-mapping.c
+++ b/arch/csky/mm/dma-mapping.c
@@ -60,11 +60,9 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
{
switch (dir) {
case DMA_TO_DEVICE:
- cache_op(paddr, size, dma_wb_range);
- break;
case DMA_FROM_DEVICE:
case DMA_BIDIRECTIONAL:
- cache_op(paddr, size, dma_wbinv_range);
+ cache_op(paddr, size, dma_wb_range);
break;
default:
BUG();
--
2.39.2
From: Arnd Bergmann <[email protected]>
The powerpc dma_sync_*_for_cpu() variants do more flushes than on other
architectures. Reduce it to what everyone else does:
- No flush is needed after data has been sent to a device
- When data has been received from a device, the cache only needs to
be invalidated to clear out cache lines that were speculatively
prefetched.
In particular, the second flushing of partial cache lines of bidirectional
buffers is actively harmful -- if a single cache line is written by both
the CPU and the device, flushing it again does not maintain coherency
but instead overwrites the data that was just received from the device.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/powerpc/mm/dma-noncoherent.c | 18 ++++--------------
1 file changed, 4 insertions(+), 14 deletions(-)
diff --git a/arch/powerpc/mm/dma-noncoherent.c b/arch/powerpc/mm/dma-noncoherent.c
index f10869d27de5..e108cacf877f 100644
--- a/arch/powerpc/mm/dma-noncoherent.c
+++ b/arch/powerpc/mm/dma-noncoherent.c
@@ -132,21 +132,11 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
 switch (dir) {
case DMA_NONE:
BUG();
- case DMA_FROM_DEVICE:
- /*
- * invalidate only when cache-line aligned otherwise there is
- * the potential for discarding uncommitted data from the cache
- */
- if ((paddr | size) & (L1_CACHE_BYTES - 1))
- __dma_phys_op(paddr, size, DMA_CACHE_FLUSH);
- else
- __dma_phys_op(paddr, size, DMA_CACHE_INVAL);
- break;
- case DMA_TO_DEVICE: /* writeback only */
- __dma_phys_op(paddr, size, DMA_CACHE_CLEAN);
+ case DMA_TO_DEVICE:
 break;
- case DMA_BIDIRECTIONAL: /* writeback and invalidate */
- __dma_phys_op(paddr, size, DMA_CACHE_FLUSH);
+ case DMA_FROM_DEVICE:
+ case DMA_BIDIRECTIONAL:
+ __dma_phys_op(paddr, size, DMA_CACHE_INVAL);
break;
}
}
--
2.39.2
From: Arnd Bergmann <[email protected]>
Some architectures that need to invalidate buffers after bidirectional
DMA because of speculative prefetching only do a simpler writeback
before that DMA, while architectures that don't need to do the second
invalidate tend to have a combined writeback+invalidate before the
DMA.
arc is one of the architectures that does both, which seems unnecessary.
Change it to behave like arm/arm64/xtensa instead, and use just a
writeback before the DMA when we do the invalidate afterwards.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/arc/mm/dma.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c
index 2a7fbbb83b70..ddb96786f765 100644
--- a/arch/arc/mm/dma.c
+++ b/arch/arc/mm/dma.c
@@ -40,7 +40,7 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
* |----------------------------------------------------------------
* TO_DEV | writeback writeback | none none
* FROM_DEV | invalidate invalidate | invalidate* invalidate*
- * BIDIR | writeback+inv writeback+inv | invalidate invalidate
+ * BIDIR | writeback writeback | invalidate invalidate
*
* [*] needed for CPU speculative prefetches
*
@@ -61,7 +61,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
break;
case DMA_BIDIRECTIONAL:
- dma_cache_wback_inv(paddr, size);
+ dma_cache_wback(paddr, size);
break;
default:
--
2.39.2
From: Arnd Bergmann <[email protected]>
The mips arch_sync_dma_for_device()/arch_sync_dma_for_cpu() functions
behave the same way as on other architectures, but in order to unify
the implementations, the code needs to be rearranged to pick the type
of cache operation in the outermost function.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/mips/mm/dma-noncoherent.c | 75 ++++++++++++++--------------------
1 file changed, 30 insertions(+), 45 deletions(-)
diff --git a/arch/mips/mm/dma-noncoherent.c b/arch/mips/mm/dma-noncoherent.c
index b4350faf4f1e..b9d68bcc5d53 100644
--- a/arch/mips/mm/dma-noncoherent.c
+++ b/arch/mips/mm/dma-noncoherent.c
@@ -54,50 +54,13 @@ void *arch_dma_set_uncached(void *addr, size_t size)
return (void *)(__pa(addr) + UNCAC_BASE);
}
-static inline void dma_sync_virt_for_device(void *addr, size_t size,
- enum dma_data_direction dir)
-{
- switch (dir) {
- case DMA_TO_DEVICE:
- dma_cache_wback((unsigned long)addr, size);
- break;
- case DMA_FROM_DEVICE:
- dma_cache_inv((unsigned long)addr, size);
- break;
- case DMA_BIDIRECTIONAL:
- if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
- cpu_needs_post_dma_flush())
- dma_cache_wback((unsigned long)addr, size);
- else
- dma_cache_wback_inv((unsigned long)addr, size);
- break;
- default:
- BUG();
- }
-}
-
-static inline void dma_sync_virt_for_cpu(void *addr, size_t size,
- enum dma_data_direction dir)
-{
- switch (dir) {
- case DMA_TO_DEVICE:
- break;
- case DMA_FROM_DEVICE:
- case DMA_BIDIRECTIONAL:
- dma_cache_inv((unsigned long)addr, size);
- break;
- default:
- BUG();
- }
-}
-
/*
* A single sg entry may refer to multiple physically contiguous pages. But
* we still need to process highmem pages individually. If highmem is not
* configured then the bulk of this loop gets optimized out.
*/
static inline void dma_sync_phys(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir, bool for_device)
+ void(*cache_op)(unsigned long start, unsigned long size))
{
struct page *page = pfn_to_page(paddr >> PAGE_SHIFT);
unsigned long offset = paddr & ~PAGE_MASK;
@@ -113,10 +76,7 @@ static inline void dma_sync_phys(phys_addr_t paddr, size_t size,
}
addr = kmap_atomic(page);
- if (for_device)
- dma_sync_virt_for_device(addr + offset, len, dir);
- else
- dma_sync_virt_for_cpu(addr + offset, len, dir);
+ cache_op((unsigned long)addr + offset, len);
kunmap_atomic(addr);
offset = 0;
@@ -128,15 +88,40 @@ static inline void dma_sync_phys(phys_addr_t paddr, size_t size,
void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
- dma_sync_phys(paddr, size, dir, true);
+ switch (dir) {
+ case DMA_TO_DEVICE:
+ dma_sync_phys(paddr, size, _dma_cache_wback);
+ break;
+ case DMA_FROM_DEVICE:
+ dma_sync_phys(paddr, size, _dma_cache_inv);
+ break;
+ case DMA_BIDIRECTIONAL:
+ if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
+ cpu_needs_post_dma_flush())
+ dma_sync_phys(paddr, size, _dma_cache_wback);
+ else
+ dma_sync_phys(paddr, size, _dma_cache_wback_inv);
+ break;
+ default:
+ break;
+ }
}
#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU
void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
- if (cpu_needs_post_dma_flush())
- dma_sync_phys(paddr, size, dir, false);
+ switch (dir) {
+ case DMA_TO_DEVICE:
+ break;
+ case DMA_FROM_DEVICE:
+ case DMA_BIDIRECTIONAL:
+ if (cpu_needs_post_dma_flush())
+ dma_sync_phys(paddr, size, _dma_cache_inv);
+ break;
+ default:
+ break;
+ }
}
#endif
--
2.39.2
From: Arnd Bergmann <[email protected]>
Non-coherent devices on parisc traditionally use a full flush+invalidate
before and after each DMA, which is more expensive than what we do on
other architectures.
Before transfers to a device, the cache only has to be written back,
but as there is apparently no dedicated clean operation on parisc, the
existing flush is reused under a clean_kernel_dcache_range() alias.
There is no need to flush the cache again after the transfer though.
After transfers from a device, the second writeback can be skipped
because the CPU was not allowed to write to the buffer anyway; instead
a purge (invalidate without writeback) can be used.
DMA_FROM_DEVICE is handled differently across architectures:
most use only an invalidate (purge) operation, but some have moved
to a flush in order to preserve dirty data when the device does not
write to the buffer, see the link below. As parisc already did the
full flush here, keep that behavior.
Link: https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/
Signed-off-by: Arnd Bergmann <[email protected]>
---
I'm not really sure I understand the semantics of the 'flush'
and 'purge' operations on parisc correctly, please double-check that
this makes sense in the context of this architecture.
---
arch/parisc/include/asm/cacheflush.h | 6 +++++-
arch/parisc/kernel/pci-dma.c | 25 +++++++++++++++++++++++--
2 files changed, 28 insertions(+), 3 deletions(-)
diff --git a/arch/parisc/include/asm/cacheflush.h b/arch/parisc/include/asm/cacheflush.h
index 0bdee6724132..a4c5042f1821 100644
--- a/arch/parisc/include/asm/cacheflush.h
+++ b/arch/parisc/include/asm/cacheflush.h
@@ -33,8 +33,12 @@ void flush_cache_mm(struct mm_struct *mm);
void flush_kernel_dcache_page_addr(const void *addr);
+#define clean_kernel_dcache_range(start,size) \
+ flush_kernel_dcache_range((start), (size))
#define flush_kernel_dcache_range(start,size) \
- flush_kernel_dcache_range_asm((start), (start)+(size));
+ flush_kernel_dcache_range_asm((start), (start)+(size))
+#define purge_kernel_dcache_range(start,size) \
+ purge_kernel_dcache_range_asm((start), (start)+(size))
#define ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 1
void flush_kernel_vmap_range(void *vaddr, int size);
diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
index ba87f791323b..6d3d3cffb316 100644
--- a/arch/parisc/kernel/pci-dma.c
+++ b/arch/parisc/kernel/pci-dma.c
@@ -446,11 +446,32 @@ void arch_dma_free(struct device *dev, size_t size, void *vaddr,
void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
- flush_kernel_dcache_range((unsigned long)phys_to_virt(paddr), size);
+ unsigned long virt = (unsigned long)phys_to_virt(paddr);
+
+ switch (dir) {
+ case DMA_TO_DEVICE:
+ clean_kernel_dcache_range(virt, size);
+ break;
+ case DMA_FROM_DEVICE:
+ clean_kernel_dcache_range(virt, size);
+ break;
+ case DMA_BIDIRECTIONAL:
+ flush_kernel_dcache_range(virt, size);
+ break;
+ }
}
void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
- flush_kernel_dcache_range((unsigned long)phys_to_virt(paddr), size);
+ unsigned long virt = (unsigned long)phys_to_virt(paddr);
+
+ switch (dir) {
+ case DMA_TO_DEVICE:
+ break;
+ case DMA_FROM_DEVICE:
+ case DMA_BIDIRECTIONAL:
+ purge_kernel_dcache_range(virt, size);
+ break;
+ }
}
--
2.39.2
From: Arnd Bergmann <[email protected]>
Most ARM CPUs can have write-back caches, which requires
cache management to be done in the dma_sync_*_for_device()
operation. This is typically done in both writeback and
writethrough mode.
The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
(arm920t, arm940t) implementations are the exception here,
and only do the cache management after the DMA is complete,
in the dma_sync_*_for_cpu() operation.
Change this for consistency with the other platforms. This
should have no user visible effect.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/arm/mm/cache-v4.S | 8 ++++----
arch/arm/mm/cache-v4wt.S | 8 ++++----
2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/arch/arm/mm/cache-v4.S b/arch/arm/mm/cache-v4.S
index 7787057e4990..e2b104876340 100644
--- a/arch/arm/mm/cache-v4.S
+++ b/arch/arm/mm/cache-v4.S
@@ -117,23 +117,23 @@ ENTRY(v4_dma_flush_range)
ret lr
/*
- * dma_unmap_area(start, size, dir)
+ * dma_map_area(start, size, dir)
* - start - kernel virtual start address
* - size - size of region
* - dir - DMA direction
*/
-ENTRY(v4_dma_unmap_area)
+ENTRY(v4_dma_map_area)
teq r2, #DMA_TO_DEVICE
bne v4_dma_flush_range
/* FALLTHROUGH */
/*
- * dma_map_area(start, size, dir)
+ * dma_unmap_area(start, size, dir)
* - start - kernel virtual start address
* - size - size of region
* - dir - DMA direction
*/
-ENTRY(v4_dma_map_area)
+ENTRY(v4_dma_unmap_area)
ret lr
ENDPROC(v4_dma_unmap_area)
ENDPROC(v4_dma_map_area)
diff --git a/arch/arm/mm/cache-v4wt.S b/arch/arm/mm/cache-v4wt.S
index 0b290c25a99d..652218752f88 100644
--- a/arch/arm/mm/cache-v4wt.S
+++ b/arch/arm/mm/cache-v4wt.S
@@ -172,24 +172,24 @@ v4wt_dma_inv_range:
.equ v4wt_dma_flush_range, v4wt_dma_inv_range
/*
- * dma_unmap_area(start, size, dir)
+ * dma_map_area(start, size, dir)
* - start - kernel virtual start address
* - size - size of region
* - dir - DMA direction
*/
-ENTRY(v4wt_dma_unmap_area)
+ENTRY(v4wt_dma_map_area)
add r1, r1, r0
teq r2, #DMA_TO_DEVICE
bne v4wt_dma_inv_range
/* FALLTHROUGH */
/*
- * dma_map_area(start, size, dir)
+ * dma_unmap_area(start, size, dir)
* - start - kernel virtual start address
* - size - size of region
* - dir - DMA direction
*/
-ENTRY(v4wt_dma_map_area)
+ENTRY(v4wt_dma_unmap_area)
ret lr
ENDPROC(v4wt_dma_unmap_area)
ENDPROC(v4wt_dma_map_area)
--
2.39.2
From: Arnd Bergmann <[email protected]>
The arm specific iommu code in dma-mapping.c uses the page+offset based
__dma_page_cpu_to_dev()/__dma_page_dev_to_cpu() helpers in place of the
phys_addr_t based arch_sync_dma_for_device()/arch_sync_dma_for_cpu()
wrappers around them.
In order to be able to move the latter set of functions into
common code, change the iommu implementation to use them directly
and remove the internal ones as a separate interface.
As page+offset and phys_addr_t are equivalent, but are used in
different parts of the code here, this allows removing some of
the conversions but adds them elsewhere.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/arm/mm/dma-mapping.c | 93 ++++++++++++++-------------------------
1 file changed, 33 insertions(+), 60 deletions(-)
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 8bc01071474a..ce4b74f34a58 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -622,16 +622,14 @@ static void __arm_dma_free(struct device *dev, size_t size, void *cpu_addr,
kfree(buf);
}
-static void dma_cache_maint_page(struct page *page, unsigned long offset,
+static void dma_cache_maint(phys_addr_t paddr,
size_t size, enum dma_data_direction dir,
void (*op)(const void *, size_t, int))
{
- unsigned long pfn;
+ unsigned long pfn = PFN_DOWN(paddr);
+ unsigned long offset = paddr % PAGE_SIZE;
size_t left = size;
- pfn = page_to_pfn(page) + offset / PAGE_SIZE;
- offset %= PAGE_SIZE;
-
/*
* A single sg entry may refer to multiple physically contiguous
* pages. But we still need to process highmem pages individually.
@@ -641,8 +639,7 @@ static void dma_cache_maint_page(struct page *page, unsigned long offset,
do {
size_t len = left;
void *vaddr;
-
- page = pfn_to_page(pfn);
+ struct page *page = pfn_to_page(pfn);
if (PageHighMem(page)) {
if (len + offset > PAGE_SIZE)
@@ -674,14 +671,11 @@ static void dma_cache_maint_page(struct page *page, unsigned long offset,
* Note: Drivers should NOT use this function directly.
* Use the driver DMA support - see dma-mapping.h (dma_sync_*)
*/
-static void __dma_page_cpu_to_dev(struct page *page, unsigned long off,
- size_t size, enum dma_data_direction dir)
+void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir)
{
- phys_addr_t paddr;
+ dma_cache_maint(paddr, size, dir, dmac_map_area);
- dma_cache_maint_page(page, off, size, dir, dmac_map_area);
-
- paddr = page_to_phys(page) + off;
if (dir == DMA_FROM_DEVICE) {
outer_inv_range(paddr, paddr + size);
} else {
@@ -690,34 +684,30 @@ static void __dma_page_cpu_to_dev(struct page *page, unsigned long off,
/* FIXME: non-speculating: flush on bidirectional mappings? */
}
-static void __dma_page_dev_to_cpu(struct page *page, unsigned long off,
- size_t size, enum dma_data_direction dir)
+void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir)
{
- phys_addr_t paddr = page_to_phys(page) + off;
-
/* FIXME: non-speculating: not required */
/* in any case, don't bother invalidating if DMA to device */
if (dir != DMA_TO_DEVICE) {
outer_inv_range(paddr, paddr + size);
- dma_cache_maint_page(page, off, size, dir, dmac_unmap_area);
+ dma_cache_maint(paddr, size, dir, dmac_unmap_area);
}
/*
* Mark the D-cache clean for these pages to avoid extra flushing.
*/
if (dir != DMA_TO_DEVICE && size >= PAGE_SIZE) {
- unsigned long pfn;
+ unsigned long pfn = PFN_UP(paddr);
+ unsigned long off = paddr & (PAGE_SIZE - 1);
size_t left = size;
- pfn = page_to_pfn(page) + off / PAGE_SIZE;
- off %= PAGE_SIZE;
- if (off) {
- pfn++;
+ if (off)
left -= PAGE_SIZE - off;
- }
+
while (left >= PAGE_SIZE) {
- page = pfn_to_page(pfn++);
+ struct page *page = pfn_to_page(pfn++);
set_bit(PG_dcache_clean, &page->flags);
left -= PAGE_SIZE;
}
@@ -1204,7 +1194,7 @@ static int __map_sg_chunk(struct device *dev, struct scatterlist *sg,
unsigned int len = PAGE_ALIGN(s->offset + s->length);
if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
- __dma_page_cpu_to_dev(sg_page(s), s->offset, s->length, dir);
+ arch_sync_dma_for_device(phys + s->offset, s->length, dir);
prot = __dma_info_to_prot(dir, attrs);
@@ -1306,8 +1296,7 @@ static void arm_iommu_unmap_sg(struct device *dev,
__iommu_remove_mapping(dev, sg_dma_address(s),
sg_dma_len(s));
if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
- __dma_page_dev_to_cpu(sg_page(s), s->offset,
- s->length, dir);
+ arch_sync_dma_for_cpu(sg_phys(s), s->length, dir);
}
}
@@ -1329,7 +1318,7 @@ static void arm_iommu_sync_sg_for_cpu(struct device *dev,
return;
for_each_sg(sg, s, nents, i)
- __dma_page_dev_to_cpu(sg_page(s), s->offset, s->length, dir);
+ arch_sync_dma_for_cpu(sg_phys(s), s->length, dir);
}
@@ -1351,7 +1340,8 @@ static void arm_iommu_sync_sg_for_device(struct device *dev,
return;
for_each_sg(sg, s, nents, i)
- __dma_page_cpu_to_dev(sg_page(s), s->offset, s->length, dir);
+ arch_sync_dma_for_device(page_to_phys(sg_page(s)) + s->offset,
+ s->length, dir);
}
/**
@@ -1373,7 +1363,8 @@ static dma_addr_t arm_iommu_map_page(struct device *dev, struct page *page,
int ret, prot, len = PAGE_ALIGN(size + offset);
if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
- __dma_page_cpu_to_dev(page, offset, size, dir);
+ arch_sync_dma_for_device(page_to_phys(page) + offset,
+ size, dir);
dma_addr = __alloc_iova(mapping, len);
if (dma_addr == DMA_MAPPING_ERROR)
@@ -1406,7 +1397,7 @@ static void arm_iommu_unmap_page(struct device *dev, dma_addr_t handle,
{
struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
dma_addr_t iova = handle & PAGE_MASK;
- struct page *page;
+ phys_addr_t phys;
int offset = handle & ~PAGE_MASK;
int len = PAGE_ALIGN(size + offset);
@@ -1414,8 +1405,8 @@ static void arm_iommu_unmap_page(struct device *dev, dma_addr_t handle,
return;
if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
- page = phys_to_page(iommu_iova_to_phys(mapping->domain, iova));
- __dma_page_dev_to_cpu(page, offset, size, dir);
+ phys = iommu_iova_to_phys(mapping->domain, handle);
+ arch_sync_dma_for_cpu(phys, size, dir);
}
iommu_unmap(mapping->domain, iova, len);
@@ -1483,30 +1474,26 @@ static void arm_iommu_sync_single_for_cpu(struct device *dev,
dma_addr_t handle, size_t size, enum dma_data_direction dir)
{
struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
- dma_addr_t iova = handle & PAGE_MASK;
- struct page *page;
- unsigned int offset = handle & ~PAGE_MASK;
+ phys_addr_t phys;
- if (dev->dma_coherent || !iova)
+ if (dev->dma_coherent || !(handle & PAGE_MASK))
return;
- page = phys_to_page(iommu_iova_to_phys(mapping->domain, iova));
- __dma_page_dev_to_cpu(page, offset, size, dir);
+ phys = iommu_iova_to_phys(mapping->domain, handle);
+ arch_sync_dma_for_cpu(phys, size, dir);
}
static void arm_iommu_sync_single_for_device(struct device *dev,
dma_addr_t handle, size_t size, enum dma_data_direction dir)
{
struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
- dma_addr_t iova = handle & PAGE_MASK;
- struct page *page;
- unsigned int offset = handle & ~PAGE_MASK;
+ phys_addr_t phys;
- if (dev->dma_coherent || !iova)
+ if (dev->dma_coherent || !(handle & PAGE_MASK))
return;
- page = phys_to_page(iommu_iova_to_phys(mapping->domain, iova));
- __dma_page_cpu_to_dev(page, offset, size, dir);
+ phys = iommu_iova_to_phys(mapping->domain, handle);
+ arch_sync_dma_for_device(phys, size, dir);
}
static const struct dma_map_ops iommu_ops = {
@@ -1789,20 +1776,6 @@ void arch_teardown_dma_ops(struct device *dev)
set_dma_ops(dev, NULL);
}
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
-{
- __dma_page_cpu_to_dev(phys_to_page(paddr), paddr & (PAGE_SIZE - 1),
- size, dir);
-}
-
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
-{
- __dma_page_dev_to_cpu(phys_to_page(paddr), paddr & (PAGE_SIZE - 1),
- size, dir);
-}
-
void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
gfp_t gfp, unsigned long attrs)
{
--
2.39.2
From: Arnd Bergmann <[email protected]>
These were remove ages ago in commit 702b94bff3c5 ("ARM: dma-mapping:
remove dmac_clean_range and dmac_inv_range") in an effort to sanitize
the dma-mapping API.
Now this logic is getting moved into the generic dma-mapping
implementation in order to give architectures less control over
it, which requires reverting that earlier work.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/arm/include/asm/cacheflush.h | 21 +++++++++++++++++++++
arch/arm/include/asm/glue-cache.h | 4 ++++
arch/arm/mm/cache-fa.S | 4 ++--
arch/arm/mm/cache-nop.S | 6 ++++++
arch/arm/mm/cache-v4.S | 5 +++++
arch/arm/mm/cache-v4wb.S | 4 ++--
arch/arm/mm/cache-v4wt.S | 14 +++++++++++++-
arch/arm/mm/cache-v6.S | 4 ++--
arch/arm/mm/cache-v7.S | 6 ++++--
arch/arm/mm/cache-v7m.S | 4 ++--
arch/arm/mm/proc-arm1020.S | 4 ++--
arch/arm/mm/proc-arm1020e.S | 4 ++--
arch/arm/mm/proc-arm1022.S | 4 ++--
arch/arm/mm/proc-arm1026.S | 4 ++--
arch/arm/mm/proc-arm920.S | 4 ++--
arch/arm/mm/proc-arm922.S | 4 ++--
arch/arm/mm/proc-arm925.S | 4 ++--
arch/arm/mm/proc-arm926.S | 4 ++--
arch/arm/mm/proc-arm940.S | 4 ++--
arch/arm/mm/proc-arm946.S | 4 ++--
arch/arm/mm/proc-feroceon.S | 8 ++++----
arch/arm/mm/proc-macros.S | 2 ++
arch/arm/mm/proc-mohawk.S | 4 ++--
arch/arm/mm/proc-xsc3.S | 4 ++--
arch/arm/mm/proc-xscale.S | 6 ++++--
25 files changed, 95 insertions(+), 41 deletions(-)
diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h
index a094f964c869..04462bfe9130 100644
--- a/arch/arm/include/asm/cacheflush.h
+++ b/arch/arm/include/asm/cacheflush.h
@@ -91,6 +91,21 @@
* DMA Cache Coherency
* ===================
*
+ * dma_inv_range(start, end)
+ *
+ * Invalidate (discard) the specified virtual address range.
+ * May not write back any entries. If 'start' or 'end'
+ * are not cache line aligned, those lines must be written
+ * back.
+ * - start - virtual start address
+ * - end - virtual end address
+ *
+ * dma_clean_range(start, end)
+ *
+ * Clean (write back) the specified virtual address range.
+ * - start - virtual start address
+ * - end - virtual end address
+ *
* dma_flush_range(start, end)
*
* Clean and invalidate the specified virtual address range.
@@ -112,6 +127,8 @@ struct cpu_cache_fns {
void (*dma_map_area)(const void *, size_t, int);
void (*dma_unmap_area)(const void *, size_t, int);
+ void (*dma_clean_range)(const void *, const void *);
+ void (*dma_inv_range)(const void *, const void *);
void (*dma_flush_range)(const void *, const void *);
} __no_randomize_layout;
@@ -137,6 +154,8 @@ extern struct cpu_cache_fns cpu_cache;
* is visible to DMA, or data written by DMA to system memory is
* visible to the CPU.
*/
+#define dmac_clean_range cpu_cache.dma_clean_range
+#define dmac_inv_range cpu_cache.dma_inv_range
#define dmac_flush_range cpu_cache.dma_flush_range
#else
@@ -156,6 +175,8 @@ extern void __cpuc_flush_dcache_area(void *, size_t);
* is visible to DMA, or data written by DMA to system memory is
* visible to the CPU.
*/
+extern void dmac_clean_range(const void *, const void *);
+extern void dmac_inv_range(const void *, const void *);
extern void dmac_flush_range(const void *, const void *);
#endif
diff --git a/arch/arm/include/asm/glue-cache.h b/arch/arm/include/asm/glue-cache.h
index 724f8dac1e5b..d8c93b483adf 100644
--- a/arch/arm/include/asm/glue-cache.h
+++ b/arch/arm/include/asm/glue-cache.h
@@ -139,6 +139,8 @@ static inline int nop_coherent_user_range(unsigned long a,
unsigned long b) { return 0; }
static inline void nop_flush_kern_dcache_area(void *a, size_t s) { }
+static inline void nop_dma_clean_range(const void *a, const void *b) { }
+static inline void nop_dma_inv_range(const void *a, const void *b) { }
static inline void nop_dma_flush_range(const void *a, const void *b) { }
static inline void nop_dma_map_area(const void *s, size_t l, int f) { }
@@ -155,6 +157,8 @@ static inline void nop_dma_unmap_area(const void *s, size_t l, int f) { }
#define __cpuc_coherent_user_range __glue(_CACHE,_coherent_user_range)
#define __cpuc_flush_dcache_area __glue(_CACHE,_flush_kern_dcache_area)
+#define dmac_clean_range __glue(_CACHE,_dma_clean_range)
+#define dmac_inv_range __glue(_CACHE,_dma_inv_range)
#define dmac_flush_range __glue(_CACHE,_dma_flush_range)
#endif
diff --git a/arch/arm/mm/cache-fa.S b/arch/arm/mm/cache-fa.S
index 3a464d1649b4..abc3d58948dd 100644
--- a/arch/arm/mm/cache-fa.S
+++ b/arch/arm/mm/cache-fa.S
@@ -166,7 +166,7 @@ ENTRY(fa_flush_kern_dcache_area)
* - start - virtual start address
* - end - virtual end address
*/
-fa_dma_inv_range:
+ENTRY(fa_dma_inv_range)
tst r0, #CACHE_DLINESIZE - 1
bic r0, r0, #CACHE_DLINESIZE - 1
mcrne p15, 0, r0, c7, c14, 1 @ clean & invalidate D entry
@@ -189,7 +189,7 @@ fa_dma_inv_range:
* - start - virtual start address
* - end - virtual end address
*/
-fa_dma_clean_range:
+ENTRY(fa_dma_clean_range)
bic r0, r0, #CACHE_DLINESIZE - 1
1: mcr p15, 0, r0, c7, c10, 1 @ clean D entry
add r0, r0, #CACHE_DLINESIZE
diff --git a/arch/arm/mm/cache-nop.S b/arch/arm/mm/cache-nop.S
index 72d939ef8798..a058544d6c2b 100644
--- a/arch/arm/mm/cache-nop.S
+++ b/arch/arm/mm/cache-nop.S
@@ -32,6 +32,12 @@ ENDPROC(nop_coherent_user_range)
.globl nop_flush_kern_dcache_area
.equ nop_flush_kern_dcache_area, nop_flush_icache_all
+ .globl nop_dma_clean_range
+ .equ nop_dma_clean_range, nop_flush_icache_all
+
+ .globl nop_dma_inv_range
+ .equ nop_dma_inv_range, nop_flush_icache_all
+
.globl nop_dma_flush_range
.equ nop_dma_flush_range, nop_flush_icache_all
diff --git a/arch/arm/mm/cache-v4.S b/arch/arm/mm/cache-v4.S
index e2b104876340..b747e591109c 100644
--- a/arch/arm/mm/cache-v4.S
+++ b/arch/arm/mm/cache-v4.S
@@ -103,17 +103,22 @@ ENTRY(v4_flush_kern_dcache_area)
/*
* dma_flush_range(start, end)
+ * dma_inv_range(start, end)
*
* Clean and invalidate the specified virtual address range.
+ * As only write-through caches are supported here, this is the
+ * same as invalidate, while the clean operation does nothing.
*
* - start - virtual start address
* - end - virtual end address
*/
+ENTRY(v4_dma_inv_range)
ENTRY(v4_dma_flush_range)
#ifdef CONFIG_CPU_CP15
mov r0, #0
mcr p15, 0, r0, c7, c7, 0 @ flush ID cache
#endif
+ENTRY(v4_dma_clean_range)
ret lr
/*
diff --git a/arch/arm/mm/cache-v4wb.S b/arch/arm/mm/cache-v4wb.S
index 905ac2fa2b1e..55f609eae38d 100644
--- a/arch/arm/mm/cache-v4wb.S
+++ b/arch/arm/mm/cache-v4wb.S
@@ -183,7 +183,7 @@ ENTRY(v4wb_coherent_user_range)
* - start - virtual start address
* - end - virtual end address
*/
-v4wb_dma_inv_range:
+ENTRY(v4wb_dma_inv_range)
tst r0, #CACHE_DLINESIZE - 1
bic r0, r0, #CACHE_DLINESIZE - 1
mcrne p15, 0, r0, c7, c10, 1 @ clean D entry
@@ -204,7 +204,7 @@ v4wb_dma_inv_range:
* - start - virtual start address
* - end - virtual end address
*/
-v4wb_dma_clean_range:
+ENTRY(v4wb_dma_clean_range)
bic r0, r0, #CACHE_DLINESIZE - 1
1: mcr p15, 0, r0, c7, c10, 1 @ clean D entry
add r0, r0, #CACHE_DLINESIZE
diff --git a/arch/arm/mm/cache-v4wt.S b/arch/arm/mm/cache-v4wt.S
index 652218752f88..1a88627ec09b 100644
--- a/arch/arm/mm/cache-v4wt.S
+++ b/arch/arm/mm/cache-v4wt.S
@@ -152,7 +152,7 @@ ENTRY(v4wt_flush_kern_dcache_area)
* - start - virtual start address
* - end - virtual end address
*/
-v4wt_dma_inv_range:
+ENTRY(v4wt_dma_inv_range)
bic r0, r0, #CACHE_DLINESIZE - 1
1: mcr p15, 0, r0, c7, c6, 1 @ invalidate D entry
add r0, r0, #CACHE_DLINESIZE
@@ -171,6 +171,18 @@ v4wt_dma_inv_range:
.globl v4wt_dma_flush_range
.equ v4wt_dma_flush_range, v4wt_dma_inv_range
+/*
+ * dma_clean_range(start, end)
+ *
+ * Clean the specified virtual address range.
+ * Empty implementation for writethrough caches.
+ *
+ * - start - virtual start address
+ * - end - virtual end address
+ */
+ .globl v4wt_dma_clean_range
+ .equ v4wt_dma_clean_range, v4wt_dma_unmap_area
+
/*
* dma_map_area(start, size, dir)
* - start - kernel virtual start address
diff --git a/arch/arm/mm/cache-v6.S b/arch/arm/mm/cache-v6.S
index 250c83bf7158..abae7ff5defc 100644
--- a/arch/arm/mm/cache-v6.S
+++ b/arch/arm/mm/cache-v6.S
@@ -200,7 +200,7 @@ ENTRY(v6_flush_kern_dcache_area)
* - start - virtual start address of region
* - end - virtual end address of region
*/
-v6_dma_inv_range:
+ENTRY(v6_dma_inv_range)
#ifdef CONFIG_DMA_CACHE_RWFO
ldrb r2, [r0] @ read for ownership
strb r2, [r0] @ write for ownership
@@ -245,7 +245,7 @@ v6_dma_inv_range:
* - start - virtual start address of region
* - end - virtual end address of region
*/
-v6_dma_clean_range:
+ENTRY(v6_dma_clean_range)
bic r0, r0, #D_CACHE_LINE_SIZE - 1
1:
#ifdef CONFIG_DMA_CACHE_RWFO
diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
index 127afe2096ba..b16a0d2a7cce 100644
--- a/arch/arm/mm/cache-v7.S
+++ b/arch/arm/mm/cache-v7.S
@@ -361,7 +361,7 @@ ENDPROC(v7_flush_kern_dcache_area)
* - start - virtual start address of region
* - end - virtual end address of region
*/
-v7_dma_inv_range:
+ENTRY(v7_dma_inv_range)
dcache_line_size r2, r3
sub r3, r2, #1
tst r0, r3
@@ -391,7 +391,7 @@ ENDPROC(v7_dma_inv_range)
* - start - virtual start address of region
* - end - virtual end address of region
*/
-v7_dma_clean_range:
+ENTRY(v7_dma_clean_range)
dcache_line_size r2, r3
sub r3, r2, #1
bic r0, r0, r3
@@ -477,6 +477,8 @@ ENDPROC(v7_dma_unmap_area)
globl_equ b15_dma_map_area, v7_dma_map_area
globl_equ b15_dma_unmap_area, v7_dma_unmap_area
+ globl_equ b15_dma_clean_range, v7_dma_clean_range
+ globl_equ b15_dma_inv_range, v7_dma_inv_range
globl_equ b15_dma_flush_range, v7_dma_flush_range
define_cache_functions b15
diff --git a/arch/arm/mm/cache-v7m.S b/arch/arm/mm/cache-v7m.S
index eb60b5e5e2ad..4fc6e0028e40 100644
--- a/arch/arm/mm/cache-v7m.S
+++ b/arch/arm/mm/cache-v7m.S
@@ -364,7 +364,7 @@ ENDPROC(v7m_flush_kern_dcache_area)
* - start - virtual start address of region
* - end - virtual end address of region
*/
-v7m_dma_inv_range:
+ENTRY(v7m_dma_inv_range)
dcache_line_size r2, r3
sub r3, r2, #1
tst r0, r3
@@ -390,7 +390,7 @@ ENDPROC(v7m_dma_inv_range)
* - start - virtual start address of region
* - end - virtual end address of region
*/
-v7m_dma_clean_range:
+ENTRY(v7m_dma_clean_range)
dcache_line_size r2, r3
sub r3, r2, #1
bic r0, r0, r3
diff --git a/arch/arm/mm/proc-arm1020.S b/arch/arm/mm/proc-arm1020.S
index 6837cf7a4812..0089e366f4e8 100644
--- a/arch/arm/mm/proc-arm1020.S
+++ b/arch/arm/mm/proc-arm1020.S
@@ -263,7 +263,7 @@ ENTRY(arm1020_flush_kern_dcache_area)
*
* (same as v4wb)
*/
-arm1020_dma_inv_range:
+ENTRY(arm1020_dma_inv_range)
mov ip, #0
#ifndef CONFIG_CPU_DCACHE_DISABLE
tst r0, #CACHE_DLINESIZE - 1
@@ -293,7 +293,7 @@ arm1020_dma_inv_range:
*
* (same as v4wb)
*/
-arm1020_dma_clean_range:
+ENTRY(arm1020_dma_clean_range)
mov ip, #0
#ifndef CONFIG_CPU_DCACHE_DISABLE
bic r0, r0, #CACHE_DLINESIZE - 1
diff --git a/arch/arm/mm/proc-arm1020e.S b/arch/arm/mm/proc-arm1020e.S
index df49b10250b8..c662e55a76fa 100644
--- a/arch/arm/mm/proc-arm1020e.S
+++ b/arch/arm/mm/proc-arm1020e.S
@@ -256,7 +256,7 @@ ENTRY(arm1020e_flush_kern_dcache_area)
*
* (same as v4wb)
*/
-arm1020e_dma_inv_range:
+ENTRY(arm1020e_dma_inv_range)
mov ip, #0
#ifndef CONFIG_CPU_DCACHE_DISABLE
tst r0, #CACHE_DLINESIZE - 1
@@ -282,7 +282,7 @@ arm1020e_dma_inv_range:
*
* (same as v4wb)
*/
-arm1020e_dma_clean_range:
+ENTRY(arm1020e_dma_clean_range)
mov ip, #0
#ifndef CONFIG_CPU_DCACHE_DISABLE
bic r0, r0, #CACHE_DLINESIZE - 1
diff --git a/arch/arm/mm/proc-arm1022.S b/arch/arm/mm/proc-arm1022.S
index e89ce467f672..e77328906bc5 100644
--- a/arch/arm/mm/proc-arm1022.S
+++ b/arch/arm/mm/proc-arm1022.S
@@ -256,7 +256,7 @@ ENTRY(arm1022_flush_kern_dcache_area)
*
* (same as v4wb)
*/
-arm1022_dma_inv_range:
+ENTRY(arm1022_dma_inv_range)
mov ip, #0
#ifndef CONFIG_CPU_DCACHE_DISABLE
tst r0, #CACHE_DLINESIZE - 1
@@ -282,7 +282,7 @@ arm1022_dma_inv_range:
*
* (same as v4wb)
*/
-arm1022_dma_clean_range:
+ENTRY(arm1022_dma_clean_range)
mov ip, #0
#ifndef CONFIG_CPU_DCACHE_DISABLE
bic r0, r0, #CACHE_DLINESIZE - 1
diff --git a/arch/arm/mm/proc-arm1026.S b/arch/arm/mm/proc-arm1026.S
index 7fdd1a205e8e..a23f9fa28d07 100644
--- a/arch/arm/mm/proc-arm1026.S
+++ b/arch/arm/mm/proc-arm1026.S
@@ -250,7 +250,7 @@ ENTRY(arm1026_flush_kern_dcache_area)
*
* (same as v4wb)
*/
-arm1026_dma_inv_range:
+ENTRY(arm1026_dma_inv_range)
mov ip, #0
#ifndef CONFIG_CPU_DCACHE_DISABLE
tst r0, #CACHE_DLINESIZE - 1
@@ -276,7 +276,7 @@ arm1026_dma_inv_range:
*
* (same as v4wb)
*/
-arm1026_dma_clean_range:
+ENTRY(arm1026_dma_clean_range)
mov ip, #0
#ifndef CONFIG_CPU_DCACHE_DISABLE
bic r0, r0, #CACHE_DLINESIZE - 1
diff --git a/arch/arm/mm/proc-arm920.S b/arch/arm/mm/proc-arm920.S
index a234cd8ba5e6..4c918ab106f3 100644
--- a/arch/arm/mm/proc-arm920.S
+++ b/arch/arm/mm/proc-arm920.S
@@ -232,7 +232,7 @@ ENTRY(arm920_flush_kern_dcache_area)
*
* (same as v4wb)
*/
-arm920_dma_inv_range:
+ENTRY(arm920_dma_inv_range)
tst r0, #CACHE_DLINESIZE - 1
bic r0, r0, #CACHE_DLINESIZE - 1
mcrne p15, 0, r0, c7, c10, 1 @ clean D entry
@@ -255,7 +255,7 @@ arm920_dma_inv_range:
*
* (same as v4wb)
*/
-arm920_dma_clean_range:
+ENTRY(arm920_dma_clean_range)
bic r0, r0, #CACHE_DLINESIZE - 1
1: mcr p15, 0, r0, c7, c10, 1 @ clean D entry
add r0, r0, #CACHE_DLINESIZE
diff --git a/arch/arm/mm/proc-arm922.S b/arch/arm/mm/proc-arm922.S
index 53c029dcfd83..6ac7bb7d94a4 100644
--- a/arch/arm/mm/proc-arm922.S
+++ b/arch/arm/mm/proc-arm922.S
@@ -234,7 +234,7 @@ ENTRY(arm922_flush_kern_dcache_area)
*
* (same as v4wb)
*/
-arm922_dma_inv_range:
+ENTRY(arm922_dma_inv_range)
tst r0, #CACHE_DLINESIZE - 1
bic r0, r0, #CACHE_DLINESIZE - 1
mcrne p15, 0, r0, c7, c10, 1 @ clean D entry
@@ -257,7 +257,7 @@ arm922_dma_inv_range:
*
* (same as v4wb)
*/
-arm922_dma_clean_range:
+ENTRY(arm922_dma_clean_range)
bic r0, r0, #CACHE_DLINESIZE - 1
1: mcr p15, 0, r0, c7, c10, 1 @ clean D entry
add r0, r0, #CACHE_DLINESIZE
diff --git a/arch/arm/mm/proc-arm925.S b/arch/arm/mm/proc-arm925.S
index 0bfad62ea858..860f0074ff81 100644
--- a/arch/arm/mm/proc-arm925.S
+++ b/arch/arm/mm/proc-arm925.S
@@ -280,7 +280,7 @@ ENTRY(arm925_flush_kern_dcache_area)
*
* (same as v4wb)
*/
-arm925_dma_inv_range:
+ENTRY(arm925_dma_inv_range)
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
tst r0, #CACHE_DLINESIZE - 1
mcrne p15, 0, r0, c7, c10, 1 @ clean D entry
@@ -305,7 +305,7 @@ arm925_dma_inv_range:
*
* (same as v4wb)
*/
-arm925_dma_clean_range:
+ENTRY(arm925_dma_clean_range)
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
bic r0, r0, #CACHE_DLINESIZE - 1
1: mcr p15, 0, r0, c7, c10, 1 @ clean D entry
diff --git a/arch/arm/mm/proc-arm926.S b/arch/arm/mm/proc-arm926.S
index 0487a2c3439b..519f62e023c5 100644
--- a/arch/arm/mm/proc-arm926.S
+++ b/arch/arm/mm/proc-arm926.S
@@ -243,7 +243,7 @@ ENTRY(arm926_flush_kern_dcache_area)
*
* (same as v4wb)
*/
-arm926_dma_inv_range:
+ENTRY(arm926_dma_inv_range)
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
tst r0, #CACHE_DLINESIZE - 1
mcrne p15, 0, r0, c7, c10, 1 @ clean D entry
@@ -268,7 +268,7 @@ arm926_dma_inv_range:
*
* (same as v4wb)
*/
-arm926_dma_clean_range:
+ENTRY(arm926_dma_clean_range)
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
bic r0, r0, #CACHE_DLINESIZE - 1
1: mcr p15, 0, r0, c7, c10, 1 @ clean D entry
diff --git a/arch/arm/mm/proc-arm940.S b/arch/arm/mm/proc-arm940.S
index cf9bfcc825ca..14dda5c5ee4a 100644
--- a/arch/arm/mm/proc-arm940.S
+++ b/arch/arm/mm/proc-arm940.S
@@ -177,7 +177,7 @@ ENTRY(arm940_flush_kern_dcache_area)
* - start - virtual start address
* - end - virtual end address
*/
-arm940_dma_inv_range:
+ENTRY(arm940_dma_inv_range)
mov ip, #0
mov r1, #(CACHE_DSEGMENTS - 1) << 4 @ 4 segments
1: orr r3, r1, #(CACHE_DENTRIES - 1) << 26 @ 64 entries
@@ -198,7 +198,7 @@ arm940_dma_inv_range:
* - start - virtual start address
* - end - virtual end address
*/
-arm940_dma_clean_range:
+ENTRY(arm940_dma_clean_range)
ENTRY(cpu_arm940_dcache_clean_area)
mov ip, #0
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
diff --git a/arch/arm/mm/proc-arm946.S b/arch/arm/mm/proc-arm946.S
index 6fb3898ad1cd..91f62a7d334b 100644
--- a/arch/arm/mm/proc-arm946.S
+++ b/arch/arm/mm/proc-arm946.S
@@ -222,7 +222,7 @@ ENTRY(arm946_flush_kern_dcache_area)
* - end - virtual end address
* (same as arm926)
*/
-arm946_dma_inv_range:
+ENTRY(arm946_dma_inv_range)
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
tst r0, #CACHE_DLINESIZE - 1
mcrne p15, 0, r0, c7, c10, 1 @ clean D entry
@@ -247,7 +247,7 @@ arm946_dma_inv_range:
*
* (same as arm926)
*/
-arm946_dma_clean_range:
+ENTRY(arm946_dma_clean_range)
#ifndef CONFIG_CPU_DCACHE_WRITETHROUGH
bic r0, r0, #CACHE_DLINESIZE - 1
1: mcr p15, 0, r0, c7, c10, 1 @ clean D entry
diff --git a/arch/arm/mm/proc-feroceon.S b/arch/arm/mm/proc-feroceon.S
index 61ce82aca6f0..86122bad6d9b 100644
--- a/arch/arm/mm/proc-feroceon.S
+++ b/arch/arm/mm/proc-feroceon.S
@@ -271,7 +271,7 @@ ENTRY(feroceon_range_flush_kern_dcache_area)
* (same as v4wb)
*/
.align 5
-feroceon_dma_inv_range:
+ENTRY(feroceon_dma_inv_range)
tst r0, #CACHE_DLINESIZE - 1
bic r0, r0, #CACHE_DLINESIZE - 1
mcrne p15, 0, r0, c7, c10, 1 @ clean D entry
@@ -285,7 +285,7 @@ feroceon_dma_inv_range:
ret lr
.align 5
-feroceon_range_dma_inv_range:
+ENTRY(feroceon_range_dma_inv_range)
mrs r2, cpsr
tst r0, #CACHE_DLINESIZE - 1
mcrne p15, 0, r0, c7, c10, 1 @ clean D entry
@@ -311,7 +311,7 @@ feroceon_range_dma_inv_range:
* (same as v4wb)
*/
.align 5
-feroceon_dma_clean_range:
+ENTRY(feroceon_dma_clean_range)
bic r0, r0, #CACHE_DLINESIZE - 1
1: mcr p15, 0, r0, c7, c10, 1 @ clean D entry
add r0, r0, #CACHE_DLINESIZE
@@ -321,7 +321,7 @@ feroceon_dma_clean_range:
ret lr
.align 5
-feroceon_range_dma_clean_range:
+ENTRY(feroceon_range_dma_clean_range)
mrs r2, cpsr
cmp r1, r0
subne r1, r1, #1 @ top address is inclusive
diff --git a/arch/arm/mm/proc-macros.S b/arch/arm/mm/proc-macros.S
index e43f6d716b4b..c1328955fd2a 100644
--- a/arch/arm/mm/proc-macros.S
+++ b/arch/arm/mm/proc-macros.S
@@ -334,6 +334,8 @@ ENTRY(\name\()_cache_fns)
.long \name\()_flush_kern_dcache_area
.long \name\()_dma_map_area
.long \name\()_dma_unmap_area
+ .long \name\()_dma_clean_range
+ .long \name\()_dma_inv_range
.long \name\()_dma_flush_range
.size \name\()_cache_fns, . - \name\()_cache_fns
.endm
diff --git a/arch/arm/mm/proc-mohawk.S b/arch/arm/mm/proc-mohawk.S
index 1645ccaffe96..db3a2f00372a 100644
--- a/arch/arm/mm/proc-mohawk.S
+++ b/arch/arm/mm/proc-mohawk.S
@@ -216,7 +216,7 @@ ENTRY(mohawk_flush_kern_dcache_area)
*
* (same as v4wb)
*/
-mohawk_dma_inv_range:
+ENTRY(mohawk_dma_inv_range)
tst r0, #CACHE_DLINESIZE - 1
mcrne p15, 0, r0, c7, c10, 1 @ clean D entry
tst r1, #CACHE_DLINESIZE - 1
@@ -239,7 +239,7 @@ mohawk_dma_inv_range:
*
* (same as v4wb)
*/
-mohawk_dma_clean_range:
+ENTRY(mohawk_dma_clean_range)
bic r0, r0, #CACHE_DLINESIZE - 1
1: mcr p15, 0, r0, c7, c10, 1 @ clean D entry
add r0, r0, #CACHE_DLINESIZE
diff --git a/arch/arm/mm/proc-xsc3.S b/arch/arm/mm/proc-xsc3.S
index a17afe7e195a..6db611a945f3 100644
--- a/arch/arm/mm/proc-xsc3.S
+++ b/arch/arm/mm/proc-xsc3.S
@@ -263,7 +263,7 @@ ENTRY(xsc3_flush_kern_dcache_area)
* - start - virtual start address
* - end - virtual end address
*/
-xsc3_dma_inv_range:
+ENTRY(xsc3_dma_inv_range)
tst r0, #CACHELINESIZE - 1
bic r0, r0, #CACHELINESIZE - 1
mcrne p15, 0, r0, c7, c10, 1 @ clean L1 D line
@@ -284,7 +284,7 @@ xsc3_dma_inv_range:
* - start - virtual start address
* - end - virtual end address
*/
-xsc3_dma_clean_range:
+ENTRY(xsc3_dma_clean_range)
bic r0, r0, #CACHELINESIZE - 1
1: mcr p15, 0, r0, c7, c10, 1 @ clean L1 D line
add r0, r0, #CACHELINESIZE
diff --git a/arch/arm/mm/proc-xscale.S b/arch/arm/mm/proc-xscale.S
index d82590aa71c0..291dec830714 100644
--- a/arch/arm/mm/proc-xscale.S
+++ b/arch/arm/mm/proc-xscale.S
@@ -323,7 +323,7 @@ ENTRY(xscale_flush_kern_dcache_area)
* - start - virtual start address
* - end - virtual end address
*/
-xscale_dma_inv_range:
+ENTRY(xscale_dma_inv_range)
tst r0, #CACHELINESIZE - 1
bic r0, r0, #CACHELINESIZE - 1
mcrne p15, 0, r0, c7, c10, 1 @ clean D entry
@@ -344,7 +344,7 @@ xscale_dma_inv_range:
* - start - virtual start address
* - end - virtual end address
*/
-xscale_dma_clean_range:
+ENTRY(xscale_dma_clean_range)
bic r0, r0, #CACHELINESIZE - 1
1: mcr p15, 0, r0, c7, c10, 1 @ clean D entry
add r0, r0, #CACHELINESIZE
@@ -445,6 +445,8 @@ ENDPROC(xscale_dma_unmap_area)
a0_alias coherent_kern_range
a0_alias coherent_user_range
a0_alias flush_kern_dcache_area
+ a0_alias dma_clean_range
+ a0_alias dma_inv_range
a0_alias dma_flush_range
a0_alias dma_unmap_area
--
2.39.2
From: Arnd Bergmann <[email protected]>
The cache management operations for noncoherent DMA on ARMv6 work
in two different ways:
* When CONFIG_DMA_CACHE_RWFO is set, speculative prefetches on in-flight
DMA buffers lead to data corruption when the prefetched data is written
back on top of data from the device.
* When CONFIG_DMA_CACHE_RWFO is disabled, a cache flush on one CPU
is not seen by the other core(s), leading to inconsistent contents
across the system.
As a consequence, neither configuration is actually safe to use in a
general-purpose kernel that is used on both MPCore systems and ARM1176
with prefetching enabled.
We could add further workarounds to make the behavior more dynamic based
on the system, but realistically, there are close to zero remaining
users on any ARM11MPCore anyway, and nobody seems too interested in it,
compared to the more popular ARM1176 used in BCM2835 and AST2500.
The Oxnas platform has some minimal support in OpenWrt, but most of the
drivers and dts files never made it into the mainline kernel, while the
Arm Versatile/Realview platform mainly serves as a reference system and
does not need to be kept working once all other ARM11MPCore support is gone.
Take the easy way out here and drop support for multiprocessing on
ARMv6, along with the CONFIG_DMA_CACHE_RWFO option and the cache
management implementation for it. This also helps with other ARMv6
issues, but for the moment leaves the ability to build a kernel that
can run on both ARMv7 SMP and single-processor ARMv6. We probably want
to stop supporting that combination as well, but not as part of this series.
Cc: Neil Armstrong <[email protected]>
Cc: Daniel Golle <[email protected]>
Cc: Linus Walleij <[email protected]>
Cc: [email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
---
I could use some help clarifying the above changelog text to describe
the exact problem, and how the CONFIG_DMA_CACHE_RWFO actually works on
MPCore. The TRMs for both 1176 and 11MPCore only describe prefetching
into the instruction cache, not the data cache, but this can end up in
the outercache as a result. The 1176 has some extra control bits to
control prefetching, but I found no reference that explains why an
MPCore does not run into the problem.
---
arch/arm/mach-oxnas/Kconfig | 4 -
arch/arm/mach-oxnas/Makefile | 1 -
arch/arm/mach-oxnas/headsmp.S | 23 ------
arch/arm/mach-oxnas/platsmp.c | 96 ----------------------
arch/arm/mach-versatile/platsmp-realview.c | 4 -
arch/arm/mm/Kconfig | 19 -----
arch/arm/mm/cache-v6.S | 31 -------
7 files changed, 178 deletions(-)
delete mode 100644 arch/arm/mach-oxnas/headsmp.S
delete mode 100644 arch/arm/mach-oxnas/platsmp.c
diff --git a/arch/arm/mach-oxnas/Kconfig b/arch/arm/mach-oxnas/Kconfig
index a9ded7079268..a054235c3d6c 100644
--- a/arch/arm/mach-oxnas/Kconfig
+++ b/arch/arm/mach-oxnas/Kconfig
@@ -28,10 +28,6 @@ config MACH_OX820
bool "Support OX820 Based Products"
depends on ARCH_MULTI_V6
select ARM_GIC
- select DMA_CACHE_RWFO if SMP
- select HAVE_SMP
- select HAVE_ARM_SCU if SMP
- select HAVE_ARM_TWD if SMP
help
Include Support for the Oxford Semiconductor OX820 SoC Based Products.
diff --git a/arch/arm/mach-oxnas/Makefile b/arch/arm/mach-oxnas/Makefile
index 0e78ecfe6c49..a4e40e534e6a 100644
--- a/arch/arm/mach-oxnas/Makefile
+++ b/arch/arm/mach-oxnas/Makefile
@@ -1,2 +1 @@
# SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_SMP) += platsmp.o headsmp.o
diff --git a/arch/arm/mach-oxnas/headsmp.S b/arch/arm/mach-oxnas/headsmp.S
deleted file mode 100644
index 9c0f1479f33a..000000000000
--- a/arch/arm/mach-oxnas/headsmp.S
+++ /dev/null
@@ -1,23 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Ma Haijun <[email protected]>
- * Copyright (c) 2003 ARM Limited
- * All Rights Reserved
- */
-#include <linux/linkage.h>
-#include <linux/init.h>
-
- __INIT
-
-/*
- * OX820 specific entry point for secondary CPUs.
- */
-ENTRY(ox820_secondary_startup)
- mov r4, #0
- /* invalidate both caches and branch target cache */
- mcr p15, 0, r4, c7, c7, 0
- /*
- * we've been released from the holding pen: secondary_stack
- * should now contain the SVC stack for this core
- */
- b secondary_startup
diff --git a/arch/arm/mach-oxnas/platsmp.c b/arch/arm/mach-oxnas/platsmp.c
deleted file mode 100644
index f0a50b9e61df..000000000000
--- a/arch/arm/mach-oxnas/platsmp.c
+++ /dev/null
@@ -1,96 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Copyright (C) 2016 Neil Armstrong <[email protected]>
- * Copyright (C) 2013 Ma Haijun <[email protected]>
- * Copyright (C) 2002 ARM Ltd.
- * All Rights Reserved
- */
-#include <linux/io.h>
-#include <linux/delay.h>
-#include <linux/of.h>
-#include <linux/of_address.h>
-
-#include <asm/cacheflush.h>
-#include <asm/cp15.h>
-#include <asm/smp_plat.h>
-#include <asm/smp_scu.h>
-
-extern void ox820_secondary_startup(void);
-
-static void __iomem *cpu_ctrl;
-static void __iomem *gic_cpu_ctrl;
-
-#define HOLDINGPEN_CPU_OFFSET 0xc8
-#define HOLDINGPEN_LOCATION_OFFSET 0xc4
-
-#define GIC_NCPU_OFFSET(cpu) (0x100 + (cpu)*0x100)
-#define GIC_CPU_CTRL 0x00
-#define GIC_CPU_CTRL_ENABLE 1
-
-static int __init ox820_boot_secondary(unsigned int cpu,
- struct task_struct *idle)
-{
- /*
- * Write the address of secondary startup into the
- * system-wide flags register. The BootMonitor waits
- * until it receives a soft interrupt, and then the
- * secondary CPU branches to this address.
- */
- writel(virt_to_phys(ox820_secondary_startup),
- cpu_ctrl + HOLDINGPEN_LOCATION_OFFSET);
-
- writel(cpu, cpu_ctrl + HOLDINGPEN_CPU_OFFSET);
-
- /*
- * Enable GIC cpu interface in CPU Interface Control Register
- */
- writel(GIC_CPU_CTRL_ENABLE,
- gic_cpu_ctrl + GIC_NCPU_OFFSET(cpu) + GIC_CPU_CTRL);
-
- /*
- * Send the secondary CPU a soft interrupt, thereby causing
- * the boot monitor to read the system wide flags register,
- * and branch to the address found there.
- */
- arch_send_wakeup_ipi_mask(cpumask_of(cpu));
-
- return 0;
-}
-
-static void __init ox820_smp_prepare_cpus(unsigned int max_cpus)
-{
- struct device_node *np;
- void __iomem *scu_base;
-
- np = of_find_compatible_node(NULL, NULL, "arm,arm11mp-scu");
- scu_base = of_iomap(np, 0);
- of_node_put(np);
- if (!scu_base)
- return;
-
- /* Remap CPU Interrupt Interface Registers */
- np = of_find_compatible_node(NULL, NULL, "arm,arm11mp-gic");
- gic_cpu_ctrl = of_iomap(np, 1);
- of_node_put(np);
- if (!gic_cpu_ctrl)
- goto unmap_scu;
-
- np = of_find_compatible_node(NULL, NULL, "oxsemi,ox820-sys-ctrl");
- cpu_ctrl = of_iomap(np, 0);
- of_node_put(np);
- if (!cpu_ctrl)
- goto unmap_scu;
-
- scu_enable(scu_base);
- flush_cache_all();
-
-unmap_scu:
- iounmap(scu_base);
-}
-
-static const struct smp_operations ox820_smp_ops __initconst = {
- .smp_prepare_cpus = ox820_smp_prepare_cpus,
- .smp_boot_secondary = ox820_boot_secondary,
-};
-
-CPU_METHOD_OF_DECLARE(ox820_smp, "oxsemi,ox820-smp", &ox820_smp_ops);
diff --git a/arch/arm/mach-versatile/platsmp-realview.c b/arch/arm/mach-versatile/platsmp-realview.c
index 5d363385c801..fa31fd2d211d 100644
--- a/arch/arm/mach-versatile/platsmp-realview.c
+++ b/arch/arm/mach-versatile/platsmp-realview.c
@@ -18,16 +18,12 @@
#define REALVIEW_SYS_FLAGSSET_OFFSET 0x30
static const struct of_device_id realview_scu_match[] = {
- { .compatible = "arm,arm11mp-scu", },
{ .compatible = "arm,cortex-a9-scu", },
{ .compatible = "arm,cortex-a5-scu", },
{ }
};
static const struct of_device_id realview_syscon_match[] = {
- { .compatible = "arm,core-module-integrator", },
- { .compatible = "arm,realview-eb-syscon", },
- { .compatible = "arm,realview-pb11mp-syscon", },
{ .compatible = "arm,realview-pbx-syscon", },
{ },
};
diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig
index c5bbae86f725..16b62bc0a970 100644
--- a/arch/arm/mm/Kconfig
+++ b/arch/arm/mm/Kconfig
@@ -937,25 +937,6 @@ config VDSO
You must have glibc 2.22 or later for programs to seamlessly
take advantage of this.
-config DMA_CACHE_RWFO
- bool "Enable read/write for ownership DMA cache maintenance"
- depends on CPU_V6K && SMP
- default y
- help
- The Snoop Control Unit on ARM11MPCore does not detect the
- cache maintenance operations and the dma_{map,unmap}_area()
- functions may leave stale cache entries on other CPUs. By
- enabling this option, Read or Write For Ownership in the ARMv6
- DMA cache maintenance functions is performed. These LDR/STR
- instructions change the cache line state to shared or modified
- so that the cache operation has the desired effect.
-
- Note that the workaround is only valid on processors that do
- not perform speculative loads into the D-cache. For such
- processors, if cache maintenance operations are not broadcast
- in hardware, other workarounds are needed (e.g. cache
- maintenance broadcasting in software via FIQ).
-
config OUTER_CACHE
bool
diff --git a/arch/arm/mm/cache-v6.S b/arch/arm/mm/cache-v6.S
index abae7ff5defc..f6ee53c1de20 100644
--- a/arch/arm/mm/cache-v6.S
+++ b/arch/arm/mm/cache-v6.S
@@ -201,10 +201,6 @@ ENTRY(v6_flush_kern_dcache_area)
* - end - virtual end address of region
*/
ENTRY(v6_dma_inv_range)
-#ifdef CONFIG_DMA_CACHE_RWFO
- ldrb r2, [r0] @ read for ownership
- strb r2, [r0] @ write for ownership
-#endif
tst r0, #D_CACHE_LINE_SIZE - 1
bic r0, r0, #D_CACHE_LINE_SIZE - 1
#ifdef HARVARD_CACHE
@@ -213,10 +209,6 @@ ENTRY(v6_dma_inv_range)
mcrne p15, 0, r0, c7, c11, 1 @ clean unified line
#endif
tst r1, #D_CACHE_LINE_SIZE - 1
-#ifdef CONFIG_DMA_CACHE_RWFO
- ldrbne r2, [r1, #-1] @ read for ownership
- strbne r2, [r1, #-1] @ write for ownership
-#endif
bic r1, r1, #D_CACHE_LINE_SIZE - 1
#ifdef HARVARD_CACHE
mcrne p15, 0, r1, c7, c14, 1 @ clean & invalidate D line
@@ -231,10 +223,6 @@ ENTRY(v6_dma_inv_range)
#endif
add r0, r0, #D_CACHE_LINE_SIZE
cmp r0, r1
-#ifdef CONFIG_DMA_CACHE_RWFO
- ldrlo r2, [r0] @ read for ownership
- strlo r2, [r0] @ write for ownership
-#endif
blo 1b
mov r0, #0
mcr p15, 0, r0, c7, c10, 4 @ drain write buffer
@@ -248,9 +236,6 @@ ENTRY(v6_dma_inv_range)
ENTRY(v6_dma_clean_range)
bic r0, r0, #D_CACHE_LINE_SIZE - 1
1:
-#ifdef CONFIG_DMA_CACHE_RWFO
- ldr r2, [r0] @ read for ownership
-#endif
#ifdef HARVARD_CACHE
mcr p15, 0, r0, c7, c10, 1 @ clean D line
#else
@@ -269,10 +254,6 @@ ENTRY(v6_dma_clean_range)
* - end - virtual end address of region
*/
ENTRY(v6_dma_flush_range)
-#ifdef CONFIG_DMA_CACHE_RWFO
- ldrb r2, [r0] @ read for ownership
- strb r2, [r0] @ write for ownership
-#endif
bic r0, r0, #D_CACHE_LINE_SIZE - 1
1:
#ifdef HARVARD_CACHE
@@ -282,10 +263,6 @@ ENTRY(v6_dma_flush_range)
#endif
add r0, r0, #D_CACHE_LINE_SIZE
cmp r0, r1
-#ifdef CONFIG_DMA_CACHE_RWFO
- ldrblo r2, [r0] @ read for ownership
- strblo r2, [r0] @ write for ownership
-#endif
blo 1b
mov r0, #0
mcr p15, 0, r0, c7, c10, 4 @ drain write buffer
@@ -301,13 +278,7 @@ ENTRY(v6_dma_map_area)
add r1, r1, r0
teq r2, #DMA_FROM_DEVICE
beq v6_dma_inv_range
-#ifndef CONFIG_DMA_CACHE_RWFO
b v6_dma_clean_range
-#else
- teq r2, #DMA_TO_DEVICE
- beq v6_dma_clean_range
- b v6_dma_flush_range
-#endif
ENDPROC(v6_dma_map_area)
/*
@@ -317,11 +288,9 @@ ENDPROC(v6_dma_map_area)
* - dir - DMA direction
*/
ENTRY(v6_dma_unmap_area)
-#ifndef CONFIG_DMA_CACHE_RWFO
add r1, r1, r0
teq r2, #DMA_TO_DEVICE
bne v6_dma_inv_range
-#endif
ret lr
ENDPROC(v6_dma_unmap_area)
--
2.39.2
From: Arnd Bergmann <[email protected]>
As the final step of the conversion to generic arch_sync_dma_*
helpers, change the Arm implementation to look the same as the
new generic version, by calling the dmac_{clean,inv,flush}_range
low-level functions instead of the abstracted dmac_{map,unmap}_area
interface.
On ARMv6/v7, this invalidates the caches after a DMA transfer from
a device because of speculative prefetching, while on earlier versions
it only needs to do this before the transfer.
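In table form, the behavior after this patch is roughly (in the style
of the comment that gets removed from arch/arc/mm/dma.c later in the
series):

            |  map == for_device  |  unmap == for_cpu
  ----------+---------------------+--------------------------
   TO_DEV   |  clean              |  none
   FROM_DEV |  invalidate         |  invalidate (v6/v7 only)
   BIDIR    |  clean or flush [*] |  invalidate (v6/v7 only)

  [*] clean is sufficient on ARMv6/v7 where the buffer gets
      invalidated again in the for_cpu op; earlier CPUs need
      the full flush here.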
This should not change any of the current behavior.
FIXME: address CONFIG_DMA_CACHE_RWFO properly.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/arm/mm/dma-mapping-nommu.c | 11 +++----
arch/arm/mm/dma-mapping.c | 53 +++++++++++++++++++++++----------
2 files changed, 43 insertions(+), 21 deletions(-)
diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
index cfd9c933d2f0..12b5c6ae93fc 100644
--- a/arch/arm/mm/dma-mapping-nommu.c
+++ b/arch/arm/mm/dma-mapping-nommu.c
@@ -16,12 +16,13 @@
void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
- dmac_map_area(__va(paddr), size, dir);
-
- if (dir == DMA_FROM_DEVICE)
+ if (dir == DMA_FROM_DEVICE) {
+ dmac_inv_range(__va(paddr), __va(paddr + size));
outer_inv_range(paddr, paddr + size);
- else
+ } else {
+ dmac_clean_range(__va(paddr), __va(paddr + size));
outer_clean_range(paddr, paddr + size);
+ }
}
void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
@@ -29,7 +30,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
{
if (dir != DMA_TO_DEVICE) {
outer_inv_range(paddr, paddr + size);
- dmac_unmap_area(__va(paddr), size, dir);
+ dmac_inv_range(__va(paddr), __va(paddr + size));
}
}
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index ce4b74f34a58..cc702cb27ae7 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -623,8 +623,7 @@ static void __arm_dma_free(struct device *dev, size_t size, void *cpu_addr,
}
static void dma_cache_maint(phys_addr_t paddr,
- size_t size, enum dma_data_direction dir,
- void (*op)(const void *, size_t, int))
+ size_t size, void (*op)(const void *, const void *))
{
unsigned long pfn = PFN_DOWN(paddr);
unsigned long offset = paddr % PAGE_SIZE;
@@ -647,18 +646,18 @@ static void dma_cache_maint(phys_addr_t paddr,
if (cache_is_vipt_nonaliasing()) {
vaddr = kmap_atomic(page);
- op(vaddr + offset, len, dir);
+ op(vaddr + offset, vaddr + offset + len);
kunmap_atomic(vaddr);
} else {
vaddr = kmap_high_get(page);
if (vaddr) {
- op(vaddr + offset, len, dir);
+ op(vaddr + offset, vaddr + offset + len);
kunmap_high(page);
}
}
} else {
vaddr = page_address(page) + offset;
- op(vaddr, len, dir);
+ op(vaddr, vaddr + len);
}
offset = 0;
pfn++;
@@ -666,6 +665,18 @@ static void dma_cache_maint(phys_addr_t paddr,
} while (left);
}
+static bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ if (IS_ENABLED(CONFIG_CPU_V6) ||
+ IS_ENABLED(CONFIG_CPU_V6K) ||
+ IS_ENABLED(CONFIG_CPU_V7) ||
+ IS_ENABLED(CONFIG_CPU_V7M))
+ return true;
+
+ /* FIXME: runtime detection */
+ return false;
+}
+
/*
* Make an area consistent for devices.
* Note: Drivers should NOT use this function directly.
@@ -674,25 +685,35 @@ static void dma_cache_maint(phys_addr_t paddr,
void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
- dma_cache_maint(paddr, size, dir, dmac_map_area);
-
- if (dir == DMA_FROM_DEVICE) {
- outer_inv_range(paddr, paddr + size);
- } else {
+ switch (dir) {
+ case DMA_TO_DEVICE:
+ dma_cache_maint(paddr, size, dmac_clean_range);
outer_clean_range(paddr, paddr + size);
+ break;
+ case DMA_FROM_DEVICE:
+ dma_cache_maint(paddr, size, dmac_inv_range);
+ outer_inv_range(paddr, paddr + size);
+ break;
+ case DMA_BIDIRECTIONAL:
+ if (arch_sync_dma_cpu_needs_post_dma_flush()) {
+ dma_cache_maint(paddr, size, dmac_clean_range);
+ outer_clean_range(paddr, paddr + size);
+ } else {
+ dma_cache_maint(paddr, size, dmac_flush_range);
+ outer_flush_range(paddr, paddr + size);
+ }
+ break;
+ default:
+ break;
}
- /* FIXME: non-speculating: flush on bidirectional mappings? */
}
void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
- /* FIXME: non-speculating: not required */
- /* in any case, don't bother invalidating if DMA to device */
- if (dir != DMA_TO_DEVICE) {
+ if (dir != DMA_TO_DEVICE && arch_sync_dma_cpu_needs_post_dma_flush()) {
outer_inv_range(paddr, paddr + size);
-
- dma_cache_maint(paddr, size, dir, dmac_unmap_area);
+ dma_cache_maint(paddr, size, dmac_inv_range);
}
/*
--
2.39.2
From: Arnd Bergmann <[email protected]>
The arm version of the arch_sync_dma_for_cpu() function annotates pages as
PG_dcache_clean after a DMA, but no other architecture does this here. On
ia64, the same thing is done in the arch_dma_mark_clean() hook instead, so
it makes sense to use that hook in order to have identical
arch_sync_dma_for_cpu() semantics as all other architectures.
Splitting this out has multiple effects:
- for dma-direct, this now gets called after arch_sync_dma_for_cpu()
for DMA_FROM_DEVICE mappings, but not for DMA_BIDIRECTIONAL. While
it would not be harmful to keep doing it for bidirectional mappings,
those are apparently not used in any callers that care about the flag.
- Since arm has its own dma-iommu abstraction, this now also needs to
call the same function, so the calls are added there to mirror the
dma-direct version.
- Like dma-direct, the dma-iommu version now marks the dcache clean
for both coherent and noncoherent devices after a DMA, but it only
does this for DMA_FROM_DEVICE, not DMA_BIDIRECTIONAL.
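For background on the first point, dma-direct calls the hook roughly
like this (simplified from kernel/dma/direct.h, with the swiotlb
handling omitted):

static inline void dma_direct_sync_single_for_cpu(struct device *dev,
		dma_addr_t addr, size_t size, enum dma_data_direction dir)
{
	phys_addr_t paddr = dma_to_phys(dev, addr);

	if (!dev_is_dma_coherent(dev))
		arch_sync_dma_for_cpu(paddr, size, dir);

	/* only DMA_FROM_DEVICE marks the pages clean */
	if (dir == DMA_FROM_DEVICE)
		arch_dma_mark_clean(paddr, size);
}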
[ HELP NEEDED: can anyone confirm that it is a correct assumption
on arm that a cache-coherent device writing to a page always results
in it being in a PG_dcache_clean state like on ia64, or can a device
write directly into the dcache?]
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/arm/Kconfig | 1 +
arch/arm/mm/dma-mapping.c | 71 +++++++++++++++++++++++----------------
2 files changed, 43 insertions(+), 29 deletions(-)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e24a9820e12f..125d58c54ab1 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -7,6 +7,7 @@ config ARM
select ARCH_HAS_BINFMT_FLAT
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_HAS_DEBUG_VIRTUAL if MMU
+ select ARCH_HAS_DMA_MARK_CLEAN if MMU
select ARCH_HAS_DMA_WRITE_COMBINE if !ARM_DMA_MEM_BUFFERABLE
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index cc702cb27ae7..b703cb83d27e 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -665,6 +665,28 @@ static void dma_cache_maint(phys_addr_t paddr,
} while (left);
}
+/*
+ * Mark the D-cache clean for these pages to avoid extra flushing.
+ */
+void arch_dma_mark_clean(phys_addr_t paddr, size_t size)
+{
+ unsigned long pfn = PFN_UP(paddr);
+ unsigned long off = paddr & (PAGE_SIZE - 1);
+ size_t left = size;
+
+ if (size < PAGE_SIZE)
+ return;
+
+ if (off)
+ left -= PAGE_SIZE - off;
+
+ while (left >= PAGE_SIZE) {
+ struct page *page = pfn_to_page(pfn++);
+ set_bit(PG_dcache_clean, &page->flags);
+ left -= PAGE_SIZE;
+ }
+}
+
static bool arch_sync_dma_cpu_needs_post_dma_flush(void)
{
if (IS_ENABLED(CONFIG_CPU_V6) ||
@@ -715,24 +737,6 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
outer_inv_range(paddr, paddr + size);
dma_cache_maint(paddr, size, dmac_inv_range);
}
-
- /*
- * Mark the D-cache clean for these pages to avoid extra flushing.
- */
- if (dir != DMA_TO_DEVICE && size >= PAGE_SIZE) {
- unsigned long pfn = PFN_UP(paddr);
- unsigned long off = paddr & (PAGE_SIZE - 1);
- size_t left = size;
-
- if (off)
- left -= PAGE_SIZE - off;
-
- while (left >= PAGE_SIZE) {
- struct page *page = pfn_to_page(pfn++);
- set_bit(PG_dcache_clean, &page->flags);
- left -= PAGE_SIZE;
- }
- }
}
#ifdef CONFIG_ARM_DMA_USE_IOMMU
@@ -1294,6 +1298,17 @@ static int arm_iommu_map_sg(struct device *dev, struct scatterlist *sg,
return -EINVAL;
}
+static void arm_iommu_sync_dma_for_cpu(phys_addr_t phys, size_t len,
+ enum dma_data_direction dir,
+ bool dma_coherent)
+{
+ if (!dma_coherent)
+ arch_sync_dma_for_cpu(phys, len, dir);
+
+ if (dir == DMA_FROM_DEVICE)
+ arch_dma_mark_clean(phys, len);
+}
+
/**
* arm_iommu_unmap_sg - unmap a set of SG buffers mapped by dma_map_sg
* @dev: valid struct device pointer
@@ -1316,8 +1331,9 @@ static void arm_iommu_unmap_sg(struct device *dev,
if (sg_dma_len(s))
__iommu_remove_mapping(dev, sg_dma_address(s),
sg_dma_len(s));
- if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
- arch_sync_dma_for_cpu(sg_phys(s), s->length, dir);
+ if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+ arm_iommu_sync_dma_for_cpu(sg_phys(s), s->length, dir,
+ dev->dma_coherent);
}
}
@@ -1335,12 +1351,9 @@ static void arm_iommu_sync_sg_for_cpu(struct device *dev,
struct scatterlist *s;
int i;
- if (dev->dma_coherent)
- return;
-
for_each_sg(sg, s, nents, i)
- arch_sync_dma_for_cpu(sg_phys(s), s->length, dir);
-
+ arm_iommu_sync_dma_for_cpu(sg_phys(s), s->length, dir,
+ dev->dma_coherent);
}
/**
@@ -1425,9 +1438,9 @@ static void arm_iommu_unmap_page(struct device *dev, dma_addr_t handle,
if (!iova)
return;
- if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
+ if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
phys = iommu_iova_to_phys(mapping->domain, handle);
- arch_sync_dma_for_cpu(phys, size, dir);
+ arm_iommu_sync_dma_for_cpu(phys, size, dir, dev->dma_coherent);
}
iommu_unmap(mapping->domain, iova, len);
@@ -1497,11 +1510,11 @@ static void arm_iommu_sync_single_for_cpu(struct device *dev,
struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
phys_addr_t phys;
- if (dev->dma_coherent || !(handle & PAGE_MASK))
+ if (!(handle & PAGE_MASK))
return;
phys = iommu_iova_to_phys(mapping->domain, handle);
- arch_sync_dma_for_cpu(phys, size, dir);
+ arm_iommu_sync_dma_for_cpu(phys, size, dir, dev->dma_coherent);
}
static void arm_iommu_sync_single_for_device(struct device *dev,
--
2.39.2
From: Arnd Bergmann <[email protected]>
Now that all of these have consistent behavior, replace them with
a single shared implementation of arch_sync_dma_for_device() and
arch_sync_dma_for_cpu() and three parameters to pick how they should
operate:
- If the CPU has speculative prefetching, then the cache
has to be invalidated after a transfer from the device.
On the rarer CPUs without prefetching, this can be skipped,
with all cache management happening before the transfer.
This flag can be detected at runtime, but is usually fixed
per architecture.
- Some architectures currently clean the caches before DMA
from a device, while others invalidate them. There has not
been a conclusion regarding whether we should change all
architectures to use clean instead, so this adds an
architecture specific flag that we can change later on.
- On 32-bit Arm, the arch_sync_dma_for_cpu() function keeps
track of pages that are marked clean in the page cache, to
avoid flushing them again. The implementation for this is
generic enough to work on all architectures that use the
PG_dcache_clean page flag, but a Kconfig symbol is used
to only enable it on Arm to preserve the existing behavior.
For the function naming, I picked 'wback' over 'clean', and 'wback_inv'
over 'flush', to avoid any ambiguity of what the helper functions are
supposed to do.
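Ignoring the CONFIG_* plumbing, the shared implementation in the new
header boils down to this sketch:

void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
		enum dma_data_direction dir)
{
	switch (dir) {
	case DMA_TO_DEVICE:
		arch_dma_cache_wback(paddr, size);
		break;
	case DMA_FROM_DEVICE:
		if (!arch_sync_dma_clean_before_fromdevice()) {
			arch_dma_cache_inv(paddr, size);
			break;
		}
		fallthrough;
	case DMA_BIDIRECTIONAL:
		/* skip the invalidate here if it's done after the DMA */
		if (arch_sync_dma_cpu_needs_post_dma_flush())
			arch_dma_cache_wback(paddr, size);
		else
			arch_dma_cache_wback_inv(paddr, size);
		break;
	default:
		break;
	}
}

void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
		enum dma_data_direction dir)
{
	if (dir != DMA_TO_DEVICE &&
	    arch_sync_dma_cpu_needs_post_dma_flush())
		arch_dma_cache_inv(paddr, size);
}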
Moving the global functions into a header file is usually a bad idea
as it prevents the header from being included more than once, but it
helps keep the behavior as close as possible to the previous state,
including the possibility of inlining most of it into these functions
where that was done before. This also helps keep the global namespace
clean, by hiding the new arch_dma_cache{_wback,_inv,_wback_inv} from
device drivers that might use them incorrectly.
It would be possible to do this one architecture at a time, but
as the change is the same everywhere, a combined patch explains
it better in one place.
Signed-off-by: Arnd Bergmann <[email protected]>
---
arch/arc/mm/dma.c | 66 +++++-------------
arch/arm/Kconfig | 3 +
arch/arm/mm/dma-mapping-nommu.c | 39 ++++++-----
arch/arm/mm/dma-mapping.c | 64 +++++++-----------
arch/arm64/mm/dma-mapping.c | 28 +++++---
arch/csky/mm/dma-mapping.c | 44 ++++++------
arch/hexagon/kernel/dma.c | 44 ++++++------
arch/m68k/kernel/dma.c | 43 +++++++-----
arch/microblaze/kernel/dma.c | 48 +++++++-------
arch/mips/mm/dma-noncoherent.c | 60 +++++++----------
arch/nios2/mm/dma-mapping.c | 57 +++++++---------
arch/openrisc/kernel/dma.c | 63 +++++++++++-------
arch/parisc/kernel/pci-dma.c | 46 ++++++-------
arch/powerpc/mm/dma-noncoherent.c | 34 ++++++----
arch/riscv/mm/dma-noncoherent.c | 51 +++++++-------
arch/sh/kernel/dma-coherent.c | 43 +++++++-----
arch/sparc/kernel/ioport.c | 38 ++++++++---
arch/xtensa/kernel/pci-dma.c | 40 ++++++-----
include/linux/dma-sync.h | 107 ++++++++++++++++++++++++++++++
19 files changed, 527 insertions(+), 391 deletions(-)
create mode 100644 include/linux/dma-sync.h
diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c
index ddb96786f765..61cd01646222 100644
--- a/arch/arc/mm/dma.c
+++ b/arch/arc/mm/dma.c
@@ -30,63 +30,33 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
dma_cache_wback_inv(page_to_phys(page), size);
}
-/*
- * Cache operations depending on function and direction argument, inspired by
- * https://lore.kernel.org/lkml/[email protected]
- * "dma_sync_*_for_cpu and direction=TO_DEVICE (was Re: [PATCH 02/20]
- * dma-mapping: provide a generic dma-noncoherent implementation)"
- *
- * | map == for_device | unmap == for_cpu
- * |----------------------------------------------------------------
- * TO_DEV | writeback writeback | none none
- * FROM_DEV | invalidate invalidate | invalidate* invalidate*
- * BIDIR | writeback writeback | invalidate invalidate
- *
- * [*] needed for CPU speculative prefetches
- *
- * NOTE: we don't check the validity of direction argument as it is done in
- * upper layer functions (in include/linux/dma-mapping.h)
- */
-
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
- switch (dir) {
- case DMA_TO_DEVICE:
- dma_cache_wback(paddr, size);
- break;
-
- case DMA_FROM_DEVICE:
- dma_cache_inv(paddr, size);
- break;
-
- case DMA_BIDIRECTIONAL:
- dma_cache_wback(paddr, size);
- break;
+ dma_cache_wback(paddr, size);
+}
- default:
- break;
- }
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
+{
+ dma_cache_inv(paddr, size);
}
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
{
- switch (dir) {
- case DMA_TO_DEVICE:
- break;
+ dma_cache_wback_inv(paddr, size);
+}
- /* FROM_DEVICE invalidate needed if speculative CPU prefetch only */
- case DMA_FROM_DEVICE:
- case DMA_BIDIRECTIONAL:
- dma_cache_inv(paddr, size);
- break;
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return false;
+}
- default:
- break;
- }
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return true;
}
+#include <linux/dma-sync.h>
+
/*
* Plug in direct dma map ops.
*/
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 125d58c54ab1..0de84e861027 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -212,6 +212,9 @@ config LOCKDEP_SUPPORT
bool
default y
+config ARCH_DMA_MARK_DCACHE_CLEAN
+ def_bool y
+
config ARCH_HAS_ILOG2_U32
bool
diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
index 12b5c6ae93fc..0817274aed15 100644
--- a/arch/arm/mm/dma-mapping-nommu.c
+++ b/arch/arm/mm/dma-mapping-nommu.c
@@ -13,27 +13,36 @@
#include "dma.h"
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
- if (dir == DMA_FROM_DEVICE) {
- dmac_inv_range(__va(paddr), __va(paddr + size));
- outer_inv_range(paddr, paddr + size);
- } else {
- dmac_clean_range(__va(paddr), __va(paddr + size));
- outer_clean_range(paddr, paddr + size);
- }
+ dmac_clean_range(__va(paddr), __va(paddr + size));
+ outer_clean_range(paddr, paddr + size);
}
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
{
- if (dir != DMA_TO_DEVICE) {
- outer_inv_range(paddr, paddr + size);
- dmac_inv_range(__va(paddr), __va(paddr + size));
- }
+ dmac_inv_range(__va(paddr), __va(paddr + size));
+ outer_inv_range(paddr, paddr + size);
}
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
+{
+ dmac_flush_range(__va(paddr), __va(paddr + size));
+ outer_flush_range(paddr, paddr + size);
+}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return false;
+}
+
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return true;
+}
+
+#include <linux/dma-sync.h>
+
void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
const struct iommu_ops *iommu, bool coherent)
{
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index b703cb83d27e..aa6ee820a0ab 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -687,6 +687,30 @@ void arch_dma_mark_clean(phys_addr_t paddr, size_t size)
}
}
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
+{
+ dma_cache_maint(paddr, size, dmac_clean_range);
+ outer_clean_range(paddr, paddr + size);
+}
+
+
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
+{
+ dma_cache_maint(paddr, size, dmac_inv_range);
+ outer_inv_range(paddr, paddr + size);
+}
+
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
+{
+ dma_cache_maint(paddr, size, dmac_flush_range);
+ outer_flush_range(paddr, paddr + size);
+}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return false;
+}
+
static bool arch_sync_dma_cpu_needs_post_dma_flush(void)
{
if (IS_ENABLED(CONFIG_CPU_V6) ||
@@ -699,45 +723,7 @@ static bool arch_sync_dma_cpu_needs_post_dma_flush(void)
return false;
}
-/*
- * Make an area consistent for devices.
- * Note: Drivers should NOT use this function directly.
- * Use the driver DMA support - see dma-mapping.h (dma_sync_*)
- */
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
-{
- switch (dir) {
- case DMA_TO_DEVICE:
- dma_cache_maint(paddr, size, dmac_clean_range);
- outer_clean_range(paddr, paddr + size);
- break;
- case DMA_FROM_DEVICE:
- dma_cache_maint(paddr, size, dmac_inv_range);
- outer_inv_range(paddr, paddr + size);
- break;
- case DMA_BIDIRECTIONAL:
- if (arch_sync_dma_cpu_needs_post_dma_flush()) {
- dma_cache_maint(paddr, size, dmac_clean_range);
- outer_clean_range(paddr, paddr + size);
- } else {
- dma_cache_maint(paddr, size, dmac_flush_range);
- outer_flush_range(paddr, paddr + size);
- }
- break;
- default:
- break;
- }
-}
-
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
-{
- if (dir != DMA_TO_DEVICE && arch_sync_dma_cpu_needs_post_dma_flush()) {
- outer_inv_range(paddr, paddr + size);
- dma_cache_maint(paddr, size, dmac_inv_range);
- }
-}
+#include <linux/dma-sync.h>
#ifdef CONFIG_ARM_DMA_USE_IOMMU
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 5240f6acad64..bae741aa65e9 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -13,25 +13,33 @@
#include <asm/cacheflush.h>
#include <asm/xen/xen-ops.h>
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
- unsigned long start = (unsigned long)phys_to_virt(paddr);
+ dcache_clean_poc(paddr, paddr + size);
+}
- dcache_clean_poc(start, start + size);
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
+{
+ dcache_inval_poc(paddr, paddr + size);
}
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
{
- unsigned long start = (unsigned long)phys_to_virt(paddr);
+ dcache_clean_inval_poc(paddr, paddr + size);
+}
- if (dir == DMA_TO_DEVICE)
- return;
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return true;
+}
- dcache_inval_poc(start, start + size);
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return true;
}
+#include <linux/dma-sync.h>
+
void arch_dma_prep_coherent(struct page *page, size_t size)
{
unsigned long start = (unsigned long)page_address(page);
diff --git a/arch/csky/mm/dma-mapping.c b/arch/csky/mm/dma-mapping.c
index c90f912e2822..9402e101b363 100644
--- a/arch/csky/mm/dma-mapping.c
+++ b/arch/csky/mm/dma-mapping.c
@@ -55,31 +55,29 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
cache_op(page_to_phys(page), size, dma_wbinv_set_zero_range);
}
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
- switch (dir) {
- case DMA_TO_DEVICE:
- case DMA_FROM_DEVICE:
- case DMA_BIDIRECTIONAL:
- cache_op(paddr, size, dma_wb_range);
- break;
- default:
- BUG();
- }
+ cache_op(paddr, size, dma_wb_range);
}
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
{
- switch (dir) {
- case DMA_TO_DEVICE:
- return;
- case DMA_FROM_DEVICE:
- case DMA_BIDIRECTIONAL:
- cache_op(paddr, size, dma_inv_range);
- break;
- default:
- BUG();
- }
+ cache_op(paddr, size, dma_inv_range);
}
+
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
+{
+ cache_op(paddr, size, dma_wbinv_range);
+}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return true;
+}
+
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return true;
+}
+
+#include <linux/dma-sync.h>
diff --git a/arch/hexagon/kernel/dma.c b/arch/hexagon/kernel/dma.c
index 882680e81a30..e6538128a75b 100644
--- a/arch/hexagon/kernel/dma.c
+++ b/arch/hexagon/kernel/dma.c
@@ -9,29 +9,33 @@
#include <linux/memblock.h>
#include <asm/page.h>
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
- void *addr = phys_to_virt(paddr);
-
- switch (dir) {
- case DMA_TO_DEVICE:
- hexagon_clean_dcache_range((unsigned long) addr,
- (unsigned long) addr + size);
- break;
- case DMA_FROM_DEVICE:
- hexagon_inv_dcache_range((unsigned long) addr,
- (unsigned long) addr + size);
- break;
- case DMA_BIDIRECTIONAL:
- flush_dcache_range((unsigned long) addr,
- (unsigned long) addr + size);
- break;
- default:
- BUG();
- }
+ hexagon_clean_dcache_range(paddr, paddr + size);
}
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
+{
+ hexagon_inv_dcache_range(paddr, paddr + size);
+}
+
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
+{
+ flush_dcache_range(paddr, paddr + size);
+}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return false;
+}
+
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return false;
+}
+
+#include <linux/dma-sync.h>
+
/*
* Our max_low_pfn should have been backed off by 16MB in mm/init.c to create
* DMA coherent space. Use that for the pool.
diff --git a/arch/m68k/kernel/dma.c b/arch/m68k/kernel/dma.c
index 2e192a5df949..aa9b434e6df8 100644
--- a/arch/m68k/kernel/dma.c
+++ b/arch/m68k/kernel/dma.c
@@ -58,20 +58,33 @@ void arch_dma_free(struct device *dev, size_t size, void *vaddr,
#endif /* CONFIG_MMU && !CONFIG_COLDFIRE */
-void arch_sync_dma_for_device(phys_addr_t handle, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
- switch (dir) {
- case DMA_BIDIRECTIONAL:
- case DMA_TO_DEVICE:
- cache_push(handle, size);
- break;
- case DMA_FROM_DEVICE:
- cache_clear(handle, size);
- break;
- default:
- pr_err_ratelimited("dma_sync_single_for_device: unsupported dir %u\n",
- dir);
- break;
- }
+ /*
+ * cache_push() always invalidates in addition to cleaning
+ * write-back caches.
+ */
+ cache_push(paddr, size);
+}
+
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
+{
+ cache_clear(paddr, size);
+}
+
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
+{
+ cache_push(paddr, size);
}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return false;
+}
+
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return false;
+}
+
+#include <linux/dma-sync.h>
diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
index b4c4e45fd45e..01110d4aa5b0 100644
--- a/arch/microblaze/kernel/dma.c
+++ b/arch/microblaze/kernel/dma.c
@@ -14,32 +14,30 @@
#include <linux/bug.h>
#include <asm/cacheflush.h>
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
- switch (direction) {
- case DMA_TO_DEVICE:
- case DMA_BIDIRECTIONAL:
- flush_dcache_range(paddr, paddr + size);
- break;
- case DMA_FROM_DEVICE:
- invalidate_dcache_range(paddr, paddr + size);
- break;
- default:
- BUG();
- }
+ /* writeback plus invalidate, could be a nop on WT caches */
+ flush_dcache_range(paddr, paddr + size);
}
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
{
- switch (direction) {
- case DMA_TO_DEVICE:
- break;
- case DMA_BIDIRECTIONAL:
- case DMA_FROM_DEVICE:
- invalidate_dcache_range(paddr, paddr + size);
- break;
- default:
- BUG();
- }}
+ invalidate_dcache_range(paddr, paddr + size);
+}
+
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
+{
+ flush_dcache_range(paddr, paddr + size);
+}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return false;
+}
+
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return true;
+}
+
+#include <linux/dma-sync.h>
diff --git a/arch/mips/mm/dma-noncoherent.c b/arch/mips/mm/dma-noncoherent.c
index b9d68bcc5d53..902d4b7c1f85 100644
--- a/arch/mips/mm/dma-noncoherent.c
+++ b/arch/mips/mm/dma-noncoherent.c
@@ -85,50 +85,38 @@ static inline void dma_sync_phys(phys_addr_t paddr, size_t size,
} while (left);
}
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
- switch (dir) {
- case DMA_TO_DEVICE:
- dma_sync_phys(paddr, size, _dma_cache_wback);
- break;
- case DMA_FROM_DEVICE:
- dma_sync_phys(paddr, size, _dma_cache_inv);
- break;
- case DMA_BIDIRECTIONAL:
- if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
- cpu_needs_post_dma_flush())
- dma_sync_phys(paddr, size, _dma_cache_wback);
- else
- dma_sync_phys(paddr, size, _dma_cache_wback_inv);
- break;
- default:
- break;
- }
+ dma_sync_phys(paddr, size, _dma_cache_wback);
}
-#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
{
- switch (dir) {
- case DMA_TO_DEVICE:
- break;
- case DMA_FROM_DEVICE:
- case DMA_BIDIRECTIONAL:
- if (cpu_needs_post_dma_flush())
- dma_sync_phys(paddr, size, _dma_cache_inv);
- break;
- default:
- break;
- }
+ dma_sync_phys(paddr, size, _dma_cache_inv);
}
-#endif
+
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
+{
+ dma_sync_phys(paddr, size, _dma_cache_wback_inv);
+}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return false;
+}
+
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
+ cpu_needs_post_dma_flush();
+}
+
+#include <linux/dma-sync.h>
#ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS
void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
- const struct iommu_ops *iommu, bool coherent)
+ const struct iommu_ops *iommu, bool coherent)
{
- dev->dma_coherent = coherent;
+ dev->dma_coherent = coherent;
}
#endif
diff --git a/arch/nios2/mm/dma-mapping.c b/arch/nios2/mm/dma-mapping.c
index fd887d5f3f9a..29978970955e 100644
--- a/arch/nios2/mm/dma-mapping.c
+++ b/arch/nios2/mm/dma-mapping.c
@@ -13,53 +13,46 @@
#include <linux/types.h>
#include <linux/mm.h>
#include <linux/string.h>
+#include <linux/dma-map-ops.h>
#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/cache.h>
#include <asm/cacheflush.h>
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
+ /*
+ * We just need to write back the caches here, but Nios2 flush
+ * instruction will do both writeback and invalidate.
+ */
void *vaddr = phys_to_virt(paddr);
+ flush_dcache_range((unsigned long)vaddr, (unsigned long)(vaddr + size));
+}
- switch (dir) {
- case DMA_FROM_DEVICE:
- invalidate_dcache_range((unsigned long)vaddr,
- (unsigned long)(vaddr + size));
- break;
- case DMA_TO_DEVICE:
- /*
- * We just need to flush the caches here , but Nios2 flush
- * instruction will do both writeback and invalidate.
- */
- case DMA_BIDIRECTIONAL: /* flush and invalidate */
- flush_dcache_range((unsigned long)vaddr,
- (unsigned long)(vaddr + size));
- break;
- default:
- BUG();
- }
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
+{
+ unsigned long vaddr = (unsigned long)phys_to_virt(paddr);
+ invalidate_dcache_range(vaddr, (unsigned long)(vaddr + size));
}
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
{
void *vaddr = phys_to_virt(paddr);
+ flush_dcache_range((unsigned long)vaddr, (unsigned long)(vaddr + size));
+}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return false;
+}
- switch (dir) {
- case DMA_BIDIRECTIONAL:
- case DMA_FROM_DEVICE:
- invalidate_dcache_range((unsigned long)vaddr,
- (unsigned long)(vaddr + size));
- break;
- case DMA_TO_DEVICE:
- break;
- default:
- BUG();
- }
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return true;
}
+#include <linux/dma-sync.h>
+
void arch_dma_prep_coherent(struct page *page, size_t size)
{
unsigned long start = (unsigned long)page_address(page);
diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
index 91a00d09ffad..aba2258e62eb 100644
--- a/arch/openrisc/kernel/dma.c
+++ b/arch/openrisc/kernel/dma.c
@@ -95,32 +95,47 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
mmap_write_unlock(&init_mm);
}
-void arch_sync_dma_for_device(phys_addr_t addr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
unsigned long cl;
struct cpuinfo_or1k *cpuinfo = &cpuinfo_or1k[smp_processor_id()];
- switch (dir) {
- case DMA_TO_DEVICE:
- /* Write back the dcache for the requested range */
- for (cl = addr; cl < addr + size;
- cl += cpuinfo->dcache_block_size)
- mtspr(SPR_DCBWR, cl);
- break;
- case DMA_FROM_DEVICE:
- /* Invalidate the dcache for the requested range */
- for (cl = addr; cl < addr + size;
- cl += cpuinfo->dcache_block_size)
- mtspr(SPR_DCBIR, cl);
- break;
- case DMA_BIDIRECTIONAL:
- /* Flush the dcache for the requested range */
- for (cl = addr; cl < addr + size;
- cl += cpuinfo->dcache_block_size)
- mtspr(SPR_DCBFR, cl);
- break;
- default:
- break;
- }
+ /* Write back the dcache for the requested range */
+ for (cl = paddr; cl < paddr + size;
+ cl += cpuinfo->dcache_block_size)
+ mtspr(SPR_DCBWR, cl);
}
+
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
+{
+ unsigned long cl;
+ struct cpuinfo_or1k *cpuinfo = &cpuinfo_or1k[smp_processor_id()];
+
+ /* Invalidate the dcache for the requested range */
+ for (cl = paddr; cl < paddr + size;
+ cl += cpuinfo->dcache_block_size)
+ mtspr(SPR_DCBIR, cl);
+}
+
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
+{
+ unsigned long cl;
+ struct cpuinfo_or1k *cpuinfo = &cpuinfo_or1k[smp_processor_id()];
+
+ /* Flush the dcache for the requested range */
+ for (cl = paddr; cl < paddr + size;
+ cl += cpuinfo->dcache_block_size)
+ mtspr(SPR_DCBFR, cl);
+}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return false;
+}
+
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return false;
+}
+
+#include <linux/dma-sync.h>
diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
index 6d3d3cffb316..a7955aab8ce2 100644
--- a/arch/parisc/kernel/pci-dma.c
+++ b/arch/parisc/kernel/pci-dma.c
@@ -443,35 +443,35 @@ void arch_dma_free(struct device *dev, size_t size, void *vaddr,
free_pages((unsigned long)__va(dma_handle), order);
}
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
unsigned long virt = (unsigned long)phys_to_virt(paddr);
- switch (dir) {
- case DMA_TO_DEVICE:
- clean_kernel_dcache_range(virt, size);
- break;
- case DMA_FROM_DEVICE:
- clean_kernel_dcache_range(virt, size);
- break;
- case DMA_BIDIRECTIONAL:
- flush_kernel_dcache_range(virt, size);
- break;
- }
+ clean_kernel_dcache_range(virt, size);
}
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
{
unsigned long virt = (unsigned long)phys_to_virt(paddr);
- switch (dir) {
- case DMA_TO_DEVICE:
- break;
- case DMA_FROM_DEVICE:
- case DMA_BIDIRECTIONAL:
- purge_kernel_dcache_range(virt, size);
- break;
- }
+ purge_kernel_dcache_range(virt, size);
+}
+
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
+{
+ unsigned long virt = (unsigned long)phys_to_virt(paddr);
+
+ flush_kernel_dcache_range(virt, size);
}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return true;
+}
+
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return true;
+}
+
+#include <linux/dma-sync.h>
diff --git a/arch/powerpc/mm/dma-noncoherent.c b/arch/powerpc/mm/dma-noncoherent.c
index 00e59a4faa2b..268510c71156 100644
--- a/arch/powerpc/mm/dma-noncoherent.c
+++ b/arch/powerpc/mm/dma-noncoherent.c
@@ -101,27 +101,33 @@ static void __dma_phys_op(phys_addr_t paddr, size_t size, enum dma_cache_op op)
#endif
}
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
__dma_phys_op(paddr, size, DMA_CACHE_CLEAN);
}
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
{
- switch (direction) {
- case DMA_NONE:
- BUG();
- case DMA_TO_DEVICE:
- break;
- case DMA_FROM_DEVICE:
- case DMA_BIDIRECTIONAL:
- __dma_phys_op(start, end, DMA_CACHE_INVAL);
- break;
- }
+ __dma_phys_op(paddr, size, DMA_CACHE_INVAL);
}
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
+{
+ __dma_phys_op(paddr, size, DMA_CACHE_FLUSH);
+}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return true;
+}
+
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return true;
+}
+
+#include <linux/dma-sync.h>
+
void arch_dma_prep_coherent(struct page *page, size_t size)
{
unsigned long kaddr = (unsigned long)page_address(page);
diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
index 69c80b2155a1..b9a9f57e02be 100644
--- a/arch/riscv/mm/dma-noncoherent.c
+++ b/arch/riscv/mm/dma-noncoherent.c
@@ -12,43 +12,40 @@
static bool noncoherent_supported;
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
void *vaddr = phys_to_virt(paddr);
- switch (dir) {
- case DMA_TO_DEVICE:
- ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
- break;
- case DMA_FROM_DEVICE:
- ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
- break;
- case DMA_BIDIRECTIONAL:
- ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
- break;
- default:
- break;
- }
+ ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
}
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
{
void *vaddr = phys_to_virt(paddr);
- switch (dir) {
- case DMA_TO_DEVICE:
- break;
- case DMA_FROM_DEVICE:
- case DMA_BIDIRECTIONAL:
- ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
- break;
- default:
- break;
- }
+ ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
}
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
+{
+ void *vaddr = phys_to_virt(paddr);
+
+ ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
+}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return true;
+}
+
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return true;
+}
+
+#include <linux/dma-sync.h>
+
void arch_dma_prep_coherent(struct page *page, size_t size)
{
void *flush_addr = page_address(page);
diff --git a/arch/sh/kernel/dma-coherent.c b/arch/sh/kernel/dma-coherent.c
index 6a44c0e7ba40..41f031ae7609 100644
--- a/arch/sh/kernel/dma-coherent.c
+++ b/arch/sh/kernel/dma-coherent.c
@@ -12,22 +12,35 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
__flush_purge_region(page_address(page), size);
}
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
void *addr = sh_cacheop_vaddr(phys_to_virt(paddr));
- switch (dir) {
- case DMA_FROM_DEVICE: /* invalidate only */
- __flush_invalidate_region(addr, size);
- break;
- case DMA_TO_DEVICE: /* writeback only */
- __flush_wback_region(addr, size);
- break;
- case DMA_BIDIRECTIONAL: /* writeback and invalidate */
- __flush_purge_region(addr, size);
- break;
- default:
- BUG();
- }
+ __flush_wback_region(addr, size);
}
+
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
+{
+ void *addr = sh_cacheop_vaddr(phys_to_virt(paddr));
+
+ __flush_invalidate_region(addr, size);
+}
+
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
+{
+ void *addr = sh_cacheop_vaddr(phys_to_virt(paddr));
+
+ __flush_purge_region(addr, size);
+}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return false;
+}
+
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return false;
+}
+
+#include <linux/dma-sync.h>
diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c
index 4f3d26066ec2..6926ead2f208 100644
--- a/arch/sparc/kernel/ioport.c
+++ b/arch/sparc/kernel/ioport.c
@@ -300,21 +300,39 @@ arch_initcall(sparc_register_ioport);
#endif /* CONFIG_SBUS */
-/*
- * IIep is write-through, not flushing on cpu to device transfer.
- *
- * On LEON systems without cache snooping, the entire D-CACHE must be flushed to
- * make DMA to cacheable memory coherent.
- */
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
- if (dir != DMA_TO_DEVICE &&
- sparc_cpu_model == sparc_leon &&
+ /* IIep is write-through, not flushing on cpu to device transfer. */
+}
+
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
+{
+ /*
+ * On LEON systems without cache snooping, the entire D-CACHE must be
+ * flushed to make DMA to cacheable memory coherent.
+ */
+ if (sparc_cpu_model == sparc_leon &&
!sparc_leon3_snooping_enabled())
leon_flush_dcache_all();
}
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
+{
+ arch_dma_cache_inv(paddr, size);
+}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return true;
+}
+
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return false;
+}
+
+#include <linux/dma-sync.h>
+
#ifdef CONFIG_PROC_FS
static int sparc_io_proc_show(struct seq_file *m, void *v)
diff --git a/arch/xtensa/kernel/pci-dma.c b/arch/xtensa/kernel/pci-dma.c
index ff3bf015eca4..d4ff96585545 100644
--- a/arch/xtensa/kernel/pci-dma.c
+++ b/arch/xtensa/kernel/pci-dma.c
@@ -43,24 +43,34 @@ static void do_cache_op(phys_addr_t paddr, size_t size,
}
}
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
- enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
- switch (dir) {
- case DMA_TO_DEVICE:
- do_cache_op(paddr, size, __flush_dcache_range);
- break;
- case DMA_FROM_DEVICE:
- do_cache_op(paddr, size, __invalidate_dcache_range);
- break;
- case DMA_BIDIRECTIONAL:
- do_cache_op(paddr, size, __flush_invalidate_dcache_range);
- break;
- default:
- break;
- }
+ do_cache_op(paddr, size, __flush_dcache_range);
}
+static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
+{
+ do_cache_op(paddr, size, __invalidate_dcache_range);
+}
+
+static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
+{
+ do_cache_op(paddr, size, __flush_invalidate_dcache_range);
+}
+
+static inline bool arch_sync_dma_clean_before_fromdevice(void)
+{
+ return false;
+}
+
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return false;
+}
+
+#include <linux/dma-sync.h>
+
void arch_dma_prep_coherent(struct page *page, size_t size)
{
__invalidate_dcache_range((unsigned long)page_address(page), size);
diff --git a/include/linux/dma-sync.h b/include/linux/dma-sync.h
new file mode 100644
index 000000000000..18e33d5e8eaf
--- /dev/null
+++ b/include/linux/dma-sync.h
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Cache operations depending on function and direction argument, inspired by
+ * https://lore.kernel.org/lkml/[email protected]
+ * "dma_sync_*_for_cpu and direction=TO_DEVICE (was Re: [PATCH 02/20]
+ * dma-mapping: provide a generic dma-noncoherent implementation)"
+ *
+ * | map == for_device | unmap == for_cpu
+ * |----------------------------------------------------------------
+ * TO_DEV | writeback writeback | none none
+ * FROM_DEV | invalidate invalidate | invalidate* invalidate*
+ * BIDIR | writeback writeback | invalidate invalidate
+ *
+ * [*] needed for CPU speculative prefetches
+ *
+ * NOTE: we don't check the validity of direction argument as it is done in
+ * upper layer functions (in include/linux/dma-mapping.h)
+ *
+ * This file can be included by arch/.../kernel/dma-noncoherent.c to provide
+ * the respective high-level operations without having to expose the
+ * cache management ops to drivers.
+ */
+
+void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir)
+{
+ switch (dir) {
+ case DMA_TO_DEVICE:
+ /*
+ * This may be an empty function on write-through caches,
+ * and it might invalidate the cache if an architecture has
+ * a write-back cache but no way to write it back without
+ * invalidating
+ */
+ arch_dma_cache_wback(paddr, size);
+ break;
+
+ case DMA_FROM_DEVICE:
+ /*
+ * FIXME: this should be handled the same across all
+ * architectures, see
+ * https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/
+ */
+ if (!arch_sync_dma_clean_before_fromdevice()) {
+ arch_dma_cache_inv(paddr, size);
+ break;
+ }
+ fallthrough;
+
+ case DMA_BIDIRECTIONAL:
+ /* Skip the invalidate here if it's done later */
+ if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
+ arch_sync_dma_cpu_needs_post_dma_flush())
+ arch_dma_cache_wback(paddr, size);
+ else
+ arch_dma_cache_wback_inv(paddr, size);
+ break;
+
+ default:
+ break;
+ }
+}
+
+#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU
+/*
+ * Mark the D-cache clean for these pages to avoid extra flushing.
+ */
+static void arch_dma_mark_dcache_clean(phys_addr_t paddr, size_t size)
+{
+#ifdef CONFIG_ARCH_DMA_MARK_DCACHE_CLEAN
+ unsigned long pfn = PFN_UP(paddr);
+ unsigned long off = paddr & (PAGE_SIZE - 1);
+ size_t left = size;
+
+ if (off)
+ left -= PAGE_SIZE - off;
+
+ while (left >= PAGE_SIZE) {
+ struct page *page = pfn_to_page(pfn++);
+ set_bit(PG_dcache_clean, &page->flags);
+ left -= PAGE_SIZE;
+ }
+#endif
+}
+
+void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir)
+{
+ switch (dir) {
+ case DMA_TO_DEVICE:
+ break;
+
+ case DMA_FROM_DEVICE:
+ case DMA_BIDIRECTIONAL:
+ /* FROM_DEVICE invalidate needed if speculative CPU prefetch only */
+ if (arch_sync_dma_cpu_needs_post_dma_flush())
+ arch_dma_cache_inv(paddr, size);
+
+ if (size >= PAGE_SIZE)
+ arch_dma_mark_dcache_clean(paddr, size);
+ break;
+
+ default:
+ break;
+ }
+}
+#endif
--
2.39.2
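As a quick reference for how an architecture plugs into the new header,
here is a minimal sketch for a hypothetical "myarch"; the
my_cache_*_range() primitives are placeholders for whatever low-level
cache operations the architecture actually provides:

	/* arch/myarch/kernel/dma.c - hypothetical sketch, not part of the series */
	#include <linux/dma-map-ops.h>

	static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
	{
		my_cache_clean_range(paddr, paddr + size);	/* placeholder */
	}

	static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
	{
		my_cache_inv_range(paddr, paddr + size);	/* placeholder */
	}

	static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
	{
		my_cache_flush_range(paddr, paddr + size);	/* placeholder */
	}

	/* compile-time policy bits read by linux/dma-sync.h */
	static inline bool arch_sync_dma_clean_before_fromdevice(void)
	{
		return false;	/* invalidate, rather than clean, before DMA_FROM_DEVICE */
	}

	static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
	{
		return true;	/* CPU prefetches speculatively, invalidate again after DMA */
	}

	/* instantiates arch_sync_dma_for_device() and arch_sync_dma_for_cpu() */
	#include <linux/dma-sync.h>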
On 2023-03-27 13:13, Arnd Bergmann wrote:
> From: Arnd Bergmann <[email protected]>
>
> The arm version of the arch_sync_dma_for_cpu() function annotates pages as
> PG_dcache_clean after a DMA, but no other architecture does this here. On
> ia64, the same thing is done in arch_dma_mark_clean(), so it makes sense
> to use the same hook in order to have identical arch_sync_dma_for_cpu()
> semantics as all other architectures.
>
> Splitting this out has multiple effects:
>
> - for dma-direct, this now gets called after arch_sync_dma_for_cpu()
> for DMA_FROM_DEVICE mappings, but not for DMA_BIDIRECTIONAL. While
> it would not be harmful to keep doing it for bidirectional mappings,
> those are apparently not used in any callers that care about the flag.
>
> - Since arm has its own dma-iommu abstraction, this now also needs to
> call the same function, so the calls are added there to mirror the
> dma-direct version.
>
> - Like dma-direct, the dma-iommu version now marks the dcache clean
> for both coherent and noncoherent devices after a DMA, but it only
> does this for DMA_FROM_DEVICE, not DMA_BIDIRECTIONAL.
>
> [ HELP NEEDED: can anyone confirm that it is a correct assumption
> on arm that a cache-coherent device writing to a page always results
> in it being in a PG_dcache_clean state like on ia64, or can a device
> write directly into the dcache?]
In AMBA at least, if a snooping write hits in a cache then the data is
most likely going to get routed directly into that cache. If it has
write-back write-allocate attributes it could also land in any cache
along its normal path to RAM; it wouldn't have to go all the way.
Hence all the fun we have where treating a coherent device as
non-coherent can still be almost as broken as the other way round :)
Cheers,
Robin.
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> arch/arm/Kconfig | 1 +
> arch/arm/mm/dma-mapping.c | 71 +++++++++++++++++++++++----------------
> 2 files changed, 43 insertions(+), 29 deletions(-)
>
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index e24a9820e12f..125d58c54ab1 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -7,6 +7,7 @@ config ARM
> select ARCH_HAS_BINFMT_FLAT
> select ARCH_HAS_CURRENT_STACK_POINTER
> select ARCH_HAS_DEBUG_VIRTUAL if MMU
> + select ARCH_HAS_DMA_MARK_CLEAN if MMU
> select ARCH_HAS_DMA_WRITE_COMBINE if !ARM_DMA_MEM_BUFFERABLE
> select ARCH_HAS_ELF_RANDOMIZE
> select ARCH_HAS_FORTIFY_SOURCE
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index cc702cb27ae7..b703cb83d27e 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -665,6 +665,28 @@ static void dma_cache_maint(phys_addr_t paddr,
> } while (left);
> }
>
> +/*
> + * Mark the D-cache clean for these pages to avoid extra flushing.
> + */
> +void arch_dma_mark_clean(phys_addr_t paddr, size_t size)
> +{
> + unsigned long pfn = PFN_UP(paddr);
> + unsigned long off = paddr & (PAGE_SIZE - 1);
> + size_t left = size;
> +
> + if (size < PAGE_SIZE)
> + return;
> +
> + if (off)
> + left -= PAGE_SIZE - off;
> +
> + while (left >= PAGE_SIZE) {
> + struct page *page = pfn_to_page(pfn++);
> + set_bit(PG_dcache_clean, &page->flags);
> + left -= PAGE_SIZE;
> + }
> +}
> +
> static bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> {
> if (IS_ENABLED(CONFIG_CPU_V6) ||
> @@ -715,24 +737,6 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> outer_inv_range(paddr, paddr + size);
> dma_cache_maint(paddr, size, dmac_inv_range);
> }
> -
> - /*
> - * Mark the D-cache clean for these pages to avoid extra flushing.
> - */
> - if (dir != DMA_TO_DEVICE && size >= PAGE_SIZE) {
> - unsigned long pfn = PFN_UP(paddr);
> - unsigned long off = paddr & (PAGE_SIZE - 1);
> - size_t left = size;
> -
> - if (off)
> - left -= PAGE_SIZE - off;
> -
> - while (left >= PAGE_SIZE) {
> - struct page *page = pfn_to_page(pfn++);
> - set_bit(PG_dcache_clean, &page->flags);
> - left -= PAGE_SIZE;
> - }
> - }
> }
>
> #ifdef CONFIG_ARM_DMA_USE_IOMMU
> @@ -1294,6 +1298,17 @@ static int arm_iommu_map_sg(struct device *dev, struct scatterlist *sg,
> return -EINVAL;
> }
>
> +static void arm_iommu_sync_dma_for_cpu(phys_addr_t phys, size_t len,
> + enum dma_data_direction dir,
> + bool dma_coherent)
> +{
> + if (!dma_coherent)
> + arch_sync_dma_for_cpu(phys, len, dir);
> +
> + if (dir == DMA_FROM_DEVICE)
> + arch_dma_mark_clean(phys, len);
> +}
> +
> /**
> * arm_iommu_unmap_sg - unmap a set of SG buffers mapped by dma_map_sg
> * @dev: valid struct device pointer
> @@ -1316,8 +1331,9 @@ static void arm_iommu_unmap_sg(struct device *dev,
> if (sg_dma_len(s))
> __iommu_remove_mapping(dev, sg_dma_address(s),
> sg_dma_len(s));
> - if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> - arch_sync_dma_for_cpu(sg_phys(s), s->length, dir);
> + if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> + arm_iommu_sync_dma_for_cpu(sg_phys(s), s->length, dir,
> + dev->dma_coherent);
> }
> }
>
> @@ -1335,12 +1351,9 @@ static void arm_iommu_sync_sg_for_cpu(struct device *dev,
> struct scatterlist *s;
> int i;
>
> - if (dev->dma_coherent)
> - return;
> -
> for_each_sg(sg, s, nents, i)
> - arch_sync_dma_for_cpu(sg_phys(s), s->length, dir);
> -
> + arm_iommu_sync_dma_for_cpu(sg_phys(s), s->length, dir,
> + dev->dma_coherent);
> }
>
> /**
> @@ -1425,9 +1438,9 @@ static void arm_iommu_unmap_page(struct device *dev, dma_addr_t handle,
> if (!iova)
> return;
>
> - if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
> + if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
> phys = iommu_iova_to_phys(mapping->domain, handle);
> - arch_sync_dma_for_cpu(phys, size, dir);
> + arm_iommu_sync_dma_for_cpu(phys, size, dir, dev->dma_coherent);
> }
>
> iommu_unmap(mapping->domain, iova, len);
> @@ -1497,11 +1510,11 @@ static void arm_iommu_sync_single_for_cpu(struct device *dev,
> struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
> phys_addr_t phys;
>
> - if (dev->dma_coherent || !(handle & PAGE_MASK))
> + if (!(handle & PAGE_MASK))
> return;
>
> phys = iommu_iova_to_phys(mapping->domain, handle);
> - arch_sync_dma_for_cpu(phys, size, dir);
> + arm_iommu_sync_dma_for_cpu(phys, size, dir, dev->dma_coherent);
> }
>
> static void arm_iommu_sync_single_for_device(struct device *dev,
On 27/03/2023 at 14:13, Arnd Bergmann wrote:
> From: Arnd Bergmann <[email protected]>
>
> The powerpc dma_sync_*_for_cpu() variants do more flushes than on other
> architectures. Reduce it to what everyone else does:
>
> - No flush is needed after data has been sent to a device
>
> - When data has been received from a device, the cache only needs to
> be invalidated to clear out cache lines that were speculatively
> prefetched.
>
> In particular, the second flushing of partial cache lines of bidirectional
> buffers is actively harmful -- if a single cache line is written by both
> the CPU and the device, flushing it again does not maintain coherency
but instead overwrites the data that was just received from the device.
Hum..... Who is right?
That behaviour was introduced by commit 03d70617b8a7 ("powerpc: Prevent
memory corruption due to cache invalidation of unaligned DMA buffer")
I think your commit log should explain why that commit was wrong, and
maybe say that your patch is a revert of that commit?
Christophe
>
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> arch/powerpc/mm/dma-noncoherent.c | 18 ++++--------------
> 1 file changed, 4 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/mm/dma-noncoherent.c b/arch/powerpc/mm/dma-noncoherent.c
> index f10869d27de5..e108cacf877f 100644
> --- a/arch/powerpc/mm/dma-noncoherent.c
> +++ b/arch/powerpc/mm/dma-noncoherent.c
> @@ -132,21 +132,11 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> switch (direction) {
> case DMA_NONE:
> BUG();
> - case DMA_FROM_DEVICE:
> - /*
> - * invalidate only when cache-line aligned otherwise there is
> - * the potential for discarding uncommitted data from the cache
> - */
> - if ((start | end) & (L1_CACHE_BYTES - 1))
> - __dma_phys_op(start, end, DMA_CACHE_FLUSH);
> - else
> - __dma_phys_op(start, end, DMA_CACHE_INVAL);
> - break;
> - case DMA_TO_DEVICE: /* writeback only */
> - __dma_phys_op(start, end, DMA_CACHE_CLEAN);
> + case DMA_TO_DEVICE:
> break;
> - case DMA_BIDIRECTIONAL: /* writeback and invalidate */
> - __dma_phys_op(start, end, DMA_CACHE_FLUSH);
> + case DMA_FROM_DEVICE:
> + case DMA_BIDIRECTIONAL:
> + __dma_phys_op(start, end, DMA_CACHE_INVAL);
> break;
> }
> }
On Mon, Mar 27, 2023, at 14:56, Christophe Leroy wrote:
> On 27/03/2023 at 14:13, Arnd Bergmann wrote:
>> From: Arnd Bergmann <[email protected]>
>>
>> The powerpc dma_sync_*_for_cpu() variants do more flushes than on other
>> architectures. Reduce it to what everyone else does:
>>
>> - No flush is needed after data has been sent to a device
>>
>> - When data has been received from a device, the cache only needs to
>> be invalidated to clear out cache lines that were speculatively
>> prefetched.
>>
>> In particular, the second flushing of partial cache lines of bidirectional
>> buffers is actively harmful -- if a single cache line is written by both
>> the CPU and the device, flushing it again does not maintain coherency
>> but instead overwrites the data that was just received from the device.
>
> Hum..... Who is right?
>
> That behaviour was introduced by commit 03d70617b8a7 ("powerpc: Prevent
> memory corruption due to cache invalidation of unaligned DMA buffer")
>
> I think your commit log should explain why that commit was wrong, and
> maybe say that your patch is a revert of that commit?
Ok, I'll try to explain this better. To clarify here: the __dma_sync()
function in commit 03d70617b8a7 is used both before and after a DMA,
but my patch 05/21 splits this in two, and patch 06/21 only changes
the part that gets called after the DMA-from-device but leaves the
part before DMA-from-device unchanged, which Andrew's patch
addressed.
As I mentioned in the cover letter, it is still unclear whether
we want to consider this the expected behavior as the documentation
seems unclear, but my series does not attempt to answer that
question.
Arnd
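To make the hazard behind commit 03d70617b8a7 concrete, consider a
hypothetical layout (not from any real driver) in which a DMA buffer
shares a cache line with data the CPU keeps using:

	/* hypothetical example: buf does not start on a cache line boundary */
	struct rx_desc {
		u32 cpu_flags;	/* written by the CPU while DMA is in flight */
		u8  buf[60];	/* written by the device */
	};

	/*
	 * Invalidating the first cache line of buf discards a concurrent CPU
	 * store to cpu_flags; writing that line back instead can clobber the
	 * bytes of buf that the device just filled in. Neither cache
	 * operation is safe for the shared line, which is why DMA buffers
	 * are expected to be cache-line aligned (kmalloc() guarantees
	 * ARCH_DMA_MINALIGN) and why the partial-line semantics are left as
	 * an open question in this series.
	 */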
On Mon, Mar 27, 2023 at 02:13:12PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann <[email protected]>
>
> These were remove ages ago in commit 702b94bff3c5 ("ARM: dma-mapping:
> remove dmac_clean_range and dmac_inv_range") in an effort to sanitize
> the dma-mapping API.
Really no, please no. Let's not go back to this, let's keep the
buffer ownership model that came at around that time.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
On Mon, Mar 27, 2023 at 8:15 PM Arnd Bergmann <[email protected]> wrote:
>
> From: Arnd Bergmann <[email protected]>
>
> csky is the only architecture that does a full flush for the
> dma_sync_*_for_device(..., DMA_FROM_DEVICE) operation. The requirement
> is only make sure there are no dirty cache lines for the buffer,
> which can be either done through an invalidate operation (as on most
> architectures including arm32, mips and arc), or a writeback (as on
> arm64 and riscv). The cache also has to be invalidated eventually but
> csky already does that after the transfer.
>
> Use a 'clean' operation here for consistency with arm64 and riscv.
>
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> arch/csky/mm/dma-mapping.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/arch/csky/mm/dma-mapping.c b/arch/csky/mm/dma-mapping.c
> index 82447029feb4..c90f912e2822 100644
> --- a/arch/csky/mm/dma-mapping.c
> +++ b/arch/csky/mm/dma-mapping.c
> @@ -60,11 +60,9 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> {
> switch (dir) {
> case DMA_TO_DEVICE:
> - cache_op(paddr, size, dma_wb_range);
> - break;
> case DMA_FROM_DEVICE:
> case DMA_BIDIRECTIONAL:
> - cache_op(paddr, size, dma_wbinv_range);
> + cache_op(paddr, size, dma_wb_range);
Reviewed-by: Guo Ren <[email protected]>
> break;
> default:
> BUG();
> --
> 2.39.2
>
--
Best Regards
Guo Ren
On Mon, Mar 27, 2023 at 02:13:16PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann <[email protected]>
>
> The arm version of the arch_sync_dma_for_cpu() function annotates pages as
> PG_dcache_clean after a DMA, but no other architecture does this here.
... because this is an arm32 specific feature. Generically, it's
PG_arch_1, which is a page flag free for architecture use. On arm32
we decided to use this to mark whether we can skip dcache writebacks
when establishing a PTE - and thus it was decided to call it
PG_dcache_clean to reflect how arm32 decided to use that bit.
This isn't just a DMA thing, there are other places that we update
the bit, such as flush_dcache_page() and copy_user_highpage().
So thinking that the arm32 PG_dcache_clean is something for DMA is
actually wrong.
Other architectures are free to do their own other optimisations
using that bit, and their implementations may be DMA-centric.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
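For readers unfamiliar with the optimisation Russell describes, the
arm32 pattern when establishing a PTE looks roughly like this (a
simplified sketch of the logic in arch/arm/mm/flush.c, not a verbatim
copy):

	void __sync_icache_dcache(pte_t pteval)
	{
		struct page *page = pfn_to_page(pte_pfn(pteval));

		/* only write back the dcache if nobody marked the page clean yet */
		if (!test_and_set_bit(PG_dcache_clean, &page->flags))
			__flush_dcache_page(page_mapping_file(page), page);
	}

Anything that dirties the page again is expected to clear the bit, which
is what makes the flag usable beyond DMA.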
On Mon, Mar 27, 2023 at 5:14 AM Arnd Bergmann <[email protected]> wrote:
>
> From: Arnd Bergmann <[email protected]>
>
> xtensa is one of the platforms that has both write-back and write-through
> caches, and needs to account for both in its DMA mapping operations.
>
> It does this through a set of operations that is different from any
> architecture. This is not a problem by itself, but it makes it rather
> hard to figure out whether this is correct or not, and to unify this
> implementation with the others.
>
> Change the semantics to the usual ones for non-speculating CPUs:
>
> - On DMA_TO_DEVICE, call __flush_dcache_range() to perform the
> writeback even on writethrough caches, where this is a nop.
>
> - On DMA_FROM_DEVICE, invalidate the mapping before the DMA rather
> than afterwards.
>
> - On DMA_BIDIRECTIONAL, combine the pre-writeback with the
> post-invalidate into a call to __flush_invalidate_dcache_range()
>   that turns into a simple invalidate on writethrough caches.
>
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> arch/xtensa/Kconfig | 1 -
> arch/xtensa/include/asm/cacheflush.h | 6 +++---
> arch/xtensa/kernel/pci-dma.c | 29 +++++-----------------------
> 3 files changed, 8 insertions(+), 28 deletions(-)
Reviewed-by: Max Filippov <[email protected]>
--
Thanks.
-- Max
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> + dma_cache_wback(paddr, size);
> +}
>
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + dma_cache_inv(paddr, size);
> }
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> {
> + dma_cache_wback_inv(paddr, size);
> +}
These are the only calls to the three functions in each of the
involved files. So I'd rather rename the low-level symbols
(and drop the pointless exports for two of them) than add
these wrappers.
The same is probably true for many other architectures.
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
>
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> }
Is there a way to cut down on this boilerplate code by just having
sane defaults, and Kconfig options to override them if they are not
runtime decisions?
> +#include <linux/dma-sync.h>
I can't really say I like the #include version here despite your
rationale in the commit log. I can probably live with it if you
think it is absolutely worth it, but I'm really not in favor of it.
> +config ARCH_DMA_MARK_DCACHE_CLEAN
> + def_bool y
What do we need this symbol for? Unless I'm missing something it is
always enabled for arm32, and only used in arm32 code.
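One hypothetical shape for the Kconfig-based defaults suggested above
(illustration only; the CONFIG_ symbols are made-up names, nothing like
this exists in the series):

	/* in linux/dma-sync.h, replacing the per-architecture boilerplate */
	#ifndef arch_sync_dma_clean_before_fromdevice
	static inline bool arch_sync_dma_clean_before_fromdevice(void)
	{
		return IS_ENABLED(CONFIG_ARCH_DMA_CLEAN_BEFORE_FROMDEVICE);
	}
	#endif

	#ifndef arch_sync_dma_cpu_needs_post_dma_flush
	static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
	{
		return IS_ENABLED(CONFIG_ARCH_DMA_POST_DMA_FLUSH);
	}
	#endif

Architectures with a fixed answer would then just select (or not
select) the two symbols, and only the ones with a genuine runtime
decision, like 32-bit arm, would provide their own functions ahead of
the include.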
On Mon, Mar 27, 2023 at 02:13:05PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann <[email protected]>
>
> For a DMA_BIDIRECTIONAL transfer, the caches have to be cleaned
> first to let the device see data written by the CPU, and invalidated
> after the transfer to let the CPU see data written by the device.
>
> riscv also invalidates the caches before the transfer, which does
> not appear to serve any purpose.
Rationale makes sense to me.
Reviewed-by: Conor Dooley <[email protected]>
Thanks for working on all of this Arnd!
On Mon, Mar 27, 2023 at 02:13:04PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann <[email protected]>
>
> No other architecture intentionally writes back dirty cache lines into
> a buffer that a device has just finished writing into. If the cache is
> clean, this has no effect at all, but
> if a cacheline in the buffer has
> actually been written by the CPU, there is a drive bug that is likely
> made worse by overwriting that buffer.
So does this need a
Fixes: 1631ba1259d6 ("riscv: Add support for non-coherent devices using zicbom extension")
then, even if the cacheline really should not have been touched by the
CPU?
Also, minor typo, s/drive/driver/.
In the thread we had that sparked this, I went digging for the source of
the flushes, and it came from a review comment:
https://lore.kernel.org/linux-riscv/[email protected]/
But *surely* if no other arch needs to do that, then we are safe to also
not do it... Your logic seems right by me at least, especially given the
lack of flushes elsewhere.
Reviewed-by: Conor Dooley <[email protected]>
Cheers,
Conor.
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> arch/riscv/mm/dma-noncoherent.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> index d919efab6eba..640f4c496d26 100644
> --- a/arch/riscv/mm/dma-noncoherent.c
> +++ b/arch/riscv/mm/dma-noncoherent.c
> @@ -42,7 +42,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> break;
> case DMA_FROM_DEVICE:
> case DMA_BIDIRECTIONAL:
> - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> + ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
> break;
> default:
> break;
> --
> 2.39.2
>
On 27 Mar 2023, at 13:13, Arnd Bergmann <[email protected]> wrote:
>
> From: Arnd Bergmann <[email protected]>
>
> No other architecture intentionally writes back dirty cache lines into
> a buffer that a device has just finished writing into. If the cache is
> clean, this has no effect at all, but if a cacheline in the buffer has
> actually been written by the CPU, there is a drive bug that is likely
> made worse by overwriting that buffer.
FYI [1] proposed this same change a while ago but its justification was
flawed (which was my objection at the time, not the diff itself).
Jess
[1] https://lore.kernel.org/all/[email protected]
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> arch/riscv/mm/dma-noncoherent.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> index d919efab6eba..640f4c496d26 100644
> --- a/arch/riscv/mm/dma-noncoherent.c
> +++ b/arch/riscv/mm/dma-noncoherent.c
> @@ -42,7 +42,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> break;
> case DMA_FROM_DEVICE:
> case DMA_BIDIRECTIONAL:
> - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> + ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
> break;
> default:
> break;
> --
> 2.39.2
>
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv
On Wed, Mar 29, 2023, at 22:48, Conor Dooley wrote:
> On Mon, Mar 27, 2023 at 02:13:04PM +0200, Arnd Bergmann wrote:
>> From: Arnd Bergmann <[email protected]>
>>
>> No other architecture intentionally writes back dirty cache lines into
>> a buffer that a device has just finished writing into. If the cache is
>> clean, this has no effect at all, but
>
>> if a cacheline in the buffer has
>> actually been written by the CPU, there is a drive bug that is likely
>> made worse by overwriting that buffer.
>
> So does this need a
> Fixes: 1631ba1259d6 ("riscv: Add support for non-coherent devices using
> zicbom extension")
> then, even if the cacheline really should not have been touched by the
> CPU?
> Also, minor typo, s/drive/driver/.
done
> In the thread we had that sparked this, I went digging for the source of
> the flushes, and it came from a review comment:
> https://lore.kernel.org/linux-riscv/[email protected]/
Ah, so the comment that led to it was
"For arch_sync_dma_for_cpu(DMA_BIDIRECTIONAL), we expect the CPU to have
written to the buffer, so this should flush, not invalidate."
which sounds like Samuel just misunderstood what "bidirectional"
means: the comment implies that both the cpu and the device access
the buffer before arch_sync_dma_for_cpu(DMA_BIDIRECTIONAL), but
this is not allowed. Instead, the point is that the device may both
read and write the buffer, requiring that we must do a writeback
at arch_sync_dma_for_device(DMA_BIDIRECTIONAL) and an invalidate
at arch_sync_dma_for_cpu(DMA_BIDIRECTIONAL).
The comment about arch_sync_dma_for_device(DMA_FROM_DEVICE) (in the
same email) seems equally confused. It's of course easy to
misunderstand these, and many others have gotten confused in
similar ways before.
> But *surely* if no other arch needs to do that, then we are safe to also
> not do it... Your logic seems right by me at least, especially given the
> lack of flushes elsewhere.
Right, I removed the extra writeback from powerpc, parisc and microblaze
for the same reason. Those appear to only be there because they used the
same function for _for_device() as for _for_cpu(), not because someone
thought they were required.
> Reviewed-by: Conor Dooley <[email protected]>
Thanks!
Arnd
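Spelled out as driver code, the only sequence the API allows for a
bidirectional buffer is the one below (hypothetical example;
start_device_dma() and wait_for_device_irq() are placeholders for
whatever the driver actually does):

	dma_addr_t dma = dma_map_single(dev, buf, size, DMA_BIDIRECTIONAL);
	/* for_device: write back so the device sees data written by the CPU */

	start_device_dma(dev, dma, size);	/* device may read and write buf */
	wait_for_device_irq(dev);		/* CPU must not touch buf meanwhile */

	dma_unmap_single(dev, dma, size, DMA_BIDIRECTIONAL);
	/* for_cpu: invalidate so speculatively fetched lines are dropped;
	 * only now may the CPU read what the device wrote */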
On 27/03/2023 14:13, Arnd Bergmann wrote:
> From: Arnd Bergmann <[email protected]>
>
> The cache management operations for noncoherent DMA on ARMv6 work
> in two different ways:
>
> * When CONFIG_DMA_CACHE_RWFO is set, speculative prefetches on in-flight
> DMA buffers lead to data corruption when the prefetched data is written
> back on top of data from the device.
>
> * When CONFIG_DMA_CACHE_RWFO is disabled, a cache flush on one CPU
> is not seen by the other core(s), leading to inconsistent contents
> across the system.
>
> As a consequence, neither configuration is actually safe to use in a
> general-purpose kernel that is used on both MPCore systems and ARM1176
> with prefetching enabled.
>
> We could add further workarounds to make the behavior more dynamic based
> on the system, but realistically, there are close to zero remaining
> users on any ARM11MPCore anyway, and nobody seems too interested in it,
> compared to the more popular ARM1176 used in BCM2835 and AST2500.
>
> The Oxnas platform has some minimal support in OpenWRT, but most of the
> drivers and dts files never made it into the mainline kernel, while the
> Arm Versatile/Realview platform mainly serves as a reference system but
> does not need to be kept working once all other ARM11MPCore systems are gone.
Acked-by: Neil Armstrong <[email protected]>
It's sad but it's the reality, there's no chance full OXNAS support will
ever come upstream and no real work has been done for years.
I think OXNAS support can be scheduled for removal in the next release;
it would need significant work to rework the current support to make it
acceptable before trying to upstream the missing bits anyway.
Thanks,
Neil
>
> Take the easy way out here and drop support for multiprocessing on
> ARMv6, along with the CONFIG_DMA_CACHE_RWFO option and the cache
> management implementation for it. This also helps with other ARMv6
> issues, but for the moment leaves the ability to build a kernel that
> can run on both ARMv7 SMP and single-processor ARMv6, which we probably
> want to stop supporting as well, but not as part of this series.
>
> Cc: Neil Armstrong <[email protected]>
> Cc: Daniel Golle <[email protected]>
> Cc: Linus Walleij <[email protected]>
> Cc: [email protected]
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> I could use some help clarifying the above changelog text to describe
> the exact problem, and how the CONFIG_DMA_CACHE_RWFO actually works on
> MPCore. The TRMs for both 1176 and 11MPCore only describe prefetching
> into the instruction cache, not the data cache, but this can end up in
> the outercache as a result. The 1176 has some extra control bits to
> control prefetching, but I found no reference that explains why an
> MPCore does not run into the problem.
> ---
> arch/arm/mach-oxnas/Kconfig | 4 -
> arch/arm/mach-oxnas/Makefile | 1 -
> arch/arm/mach-oxnas/headsmp.S | 23 ------
> arch/arm/mach-oxnas/platsmp.c | 96 ----------------------
> arch/arm/mach-versatile/platsmp-realview.c | 4 -
> arch/arm/mm/Kconfig | 19 -----
> arch/arm/mm/cache-v6.S | 31 -------
> 7 files changed, 178 deletions(-)
> delete mode 100644 arch/arm/mach-oxnas/headsmp.S
> delete mode 100644 arch/arm/mach-oxnas/platsmp.c
>
> diff --git a/arch/arm/mach-oxnas/Kconfig b/arch/arm/mach-oxnas/Kconfig
> index a9ded7079268..a054235c3d6c 100644
> --- a/arch/arm/mach-oxnas/Kconfig
> +++ b/arch/arm/mach-oxnas/Kconfig
> @@ -28,10 +28,6 @@ config MACH_OX820
> bool "Support OX820 Based Products"
> depends on ARCH_MULTI_V6
> select ARM_GIC
> - select DMA_CACHE_RWFO if SMP
> - select HAVE_SMP
> - select HAVE_ARM_SCU if SMP
> - select HAVE_ARM_TWD if SMP
> help
> Include Support for the Oxford Semiconductor OX820 SoC Based Products.
>
> diff --git a/arch/arm/mach-oxnas/Makefile b/arch/arm/mach-oxnas/Makefile
> index 0e78ecfe6c49..a4e40e534e6a 100644
> --- a/arch/arm/mach-oxnas/Makefile
> +++ b/arch/arm/mach-oxnas/Makefile
> @@ -1,2 +1 @@
> # SPDX-License-Identifier: GPL-2.0-only
> -obj-$(CONFIG_SMP) += platsmp.o headsmp.o
> diff --git a/arch/arm/mach-oxnas/headsmp.S b/arch/arm/mach-oxnas/headsmp.S
> deleted file mode 100644
> index 9c0f1479f33a..000000000000
> --- a/arch/arm/mach-oxnas/headsmp.S
> +++ /dev/null
> @@ -1,23 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-only */
> -/*
> - * Copyright (C) 2013 Ma Haijun <[email protected]>
> - * Copyright (c) 2003 ARM Limited
> - * All Rights Reserved
> - */
> -#include <linux/linkage.h>
> -#include <linux/init.h>
> -
> - __INIT
> -
> -/*
> - * OX820 specific entry point for secondary CPUs.
> - */
> -ENTRY(ox820_secondary_startup)
> - mov r4, #0
> - /* invalidate both caches and branch target cache */
> - mcr p15, 0, r4, c7, c7, 0
> - /*
> - * we've been released from the holding pen: secondary_stack
> - * should now contain the SVC stack for this core
> - */
> - b secondary_startup
> diff --git a/arch/arm/mach-oxnas/platsmp.c b/arch/arm/mach-oxnas/platsmp.c
> deleted file mode 100644
> index f0a50b9e61df..000000000000
> --- a/arch/arm/mach-oxnas/platsmp.c
> +++ /dev/null
> @@ -1,96 +0,0 @@
> -// SPDX-License-Identifier: GPL-2.0-only
> -/*
> - * Copyright (C) 2016 Neil Armstrong <[email protected]>
> - * Copyright (C) 2013 Ma Haijun <[email protected]>
> - * Copyright (C) 2002 ARM Ltd.
> - * All Rights Reserved
> - */
> -#include <linux/io.h>
> -#include <linux/delay.h>
> -#include <linux/of.h>
> -#include <linux/of_address.h>
> -
> -#include <asm/cacheflush.h>
> -#include <asm/cp15.h>
> -#include <asm/smp_plat.h>
> -#include <asm/smp_scu.h>
> -
> -extern void ox820_secondary_startup(void);
> -
> -static void __iomem *cpu_ctrl;
> -static void __iomem *gic_cpu_ctrl;
> -
> -#define HOLDINGPEN_CPU_OFFSET 0xc8
> -#define HOLDINGPEN_LOCATION_OFFSET 0xc4
> -
> -#define GIC_NCPU_OFFSET(cpu) (0x100 + (cpu)*0x100)
> -#define GIC_CPU_CTRL 0x00
> -#define GIC_CPU_CTRL_ENABLE 1
> -
> -static int __init ox820_boot_secondary(unsigned int cpu,
> - struct task_struct *idle)
> -{
> - /*
> - * Write the address of secondary startup into the
> - * system-wide flags register. The BootMonitor waits
> - * until it receives a soft interrupt, and then the
> - * secondary CPU branches to this address.
> - */
> - writel(virt_to_phys(ox820_secondary_startup),
> - cpu_ctrl + HOLDINGPEN_LOCATION_OFFSET);
> -
> - writel(cpu, cpu_ctrl + HOLDINGPEN_CPU_OFFSET);
> -
> - /*
> - * Enable GIC cpu interface in CPU Interface Control Register
> - */
> - writel(GIC_CPU_CTRL_ENABLE,
> - gic_cpu_ctrl + GIC_NCPU_OFFSET(cpu) + GIC_CPU_CTRL);
> -
> - /*
> - * Send the secondary CPU a soft interrupt, thereby causing
> - * the boot monitor to read the system wide flags register,
> - * and branch to the address found there.
> - */
> - arch_send_wakeup_ipi_mask(cpumask_of(cpu));
> -
> - return 0;
> -}
> -
> -static void __init ox820_smp_prepare_cpus(unsigned int max_cpus)
> -{
> - struct device_node *np;
> - void __iomem *scu_base;
> -
> - np = of_find_compatible_node(NULL, NULL, "arm,arm11mp-scu");
> - scu_base = of_iomap(np, 0);
> - of_node_put(np);
> - if (!scu_base)
> - return;
> -
> - /* Remap CPU Interrupt Interface Registers */
> - np = of_find_compatible_node(NULL, NULL, "arm,arm11mp-gic");
> - gic_cpu_ctrl = of_iomap(np, 1);
> - of_node_put(np);
> - if (!gic_cpu_ctrl)
> - goto unmap_scu;
> -
> - np = of_find_compatible_node(NULL, NULL, "oxsemi,ox820-sys-ctrl");
> - cpu_ctrl = of_iomap(np, 0);
> - of_node_put(np);
> - if (!cpu_ctrl)
> - goto unmap_scu;
> -
> - scu_enable(scu_base);
> - flush_cache_all();
> -
> -unmap_scu:
> - iounmap(scu_base);
> -}
> -
> -static const struct smp_operations ox820_smp_ops __initconst = {
> - .smp_prepare_cpus = ox820_smp_prepare_cpus,
> - .smp_boot_secondary = ox820_boot_secondary,
> -};
> -
> -CPU_METHOD_OF_DECLARE(ox820_smp, "oxsemi,ox820-smp", &ox820_smp_ops);
> diff --git a/arch/arm/mach-versatile/platsmp-realview.c b/arch/arm/mach-versatile/platsmp-realview.c
> index 5d363385c801..fa31fd2d211d 100644
> --- a/arch/arm/mach-versatile/platsmp-realview.c
> +++ b/arch/arm/mach-versatile/platsmp-realview.c
> @@ -18,16 +18,12 @@
> #define REALVIEW_SYS_FLAGSSET_OFFSET 0x30
>
> static const struct of_device_id realview_scu_match[] = {
> - { .compatible = "arm,arm11mp-scu", },
> { .compatible = "arm,cortex-a9-scu", },
> { .compatible = "arm,cortex-a5-scu", },
> { }
> };
>
> static const struct of_device_id realview_syscon_match[] = {
> - { .compatible = "arm,core-module-integrator", },
> - { .compatible = "arm,realview-eb-syscon", },
> - { .compatible = "arm,realview-pb11mp-syscon", },
> { .compatible = "arm,realview-pbx-syscon", },
> { },
> };
> diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig
> index c5bbae86f725..16b62bc0a970 100644
> --- a/arch/arm/mm/Kconfig
> +++ b/arch/arm/mm/Kconfig
> @@ -937,25 +937,6 @@ config VDSO
> You must have glibc 2.22 or later for programs to seamlessly
> take advantage of this.
>
> -config DMA_CACHE_RWFO
> - bool "Enable read/write for ownership DMA cache maintenance"
> - depends on CPU_V6K && SMP
> - default y
> - help
> - The Snoop Control Unit on ARM11MPCore does not detect the
> - cache maintenance operations and the dma_{map,unmap}_area()
> - functions may leave stale cache entries on other CPUs. By
> - enabling this option, Read or Write For Ownership in the ARMv6
> - DMA cache maintenance functions is performed. These LDR/STR
> - instructions change the cache line state to shared or modified
> - so that the cache operation has the desired effect.
> -
> - Note that the workaround is only valid on processors that do
> - not perform speculative loads into the D-cache. For such
> - processors, if cache maintenance operations are not broadcast
> - in hardware, other workarounds are needed (e.g. cache
> - maintenance broadcasting in software via FIQ).
> -
> config OUTER_CACHE
> bool
>
> diff --git a/arch/arm/mm/cache-v6.S b/arch/arm/mm/cache-v6.S
> index abae7ff5defc..f6ee53c1de20 100644
> --- a/arch/arm/mm/cache-v6.S
> +++ b/arch/arm/mm/cache-v6.S
> @@ -201,10 +201,6 @@ ENTRY(v6_flush_kern_dcache_area)
> * - end - virtual end address of region
> */
> ENTRY(v6_dma_inv_range)
> -#ifdef CONFIG_DMA_CACHE_RWFO
> - ldrb r2, [r0] @ read for ownership
> - strb r2, [r0] @ write for ownership
> -#endif
> tst r0, #D_CACHE_LINE_SIZE - 1
> bic r0, r0, #D_CACHE_LINE_SIZE - 1
> #ifdef HARVARD_CACHE
> @@ -213,10 +209,6 @@ ENTRY(v6_dma_inv_range)
> mcrne p15, 0, r0, c7, c11, 1 @ clean unified line
> #endif
> tst r1, #D_CACHE_LINE_SIZE - 1
> -#ifdef CONFIG_DMA_CACHE_RWFO
> - ldrbne r2, [r1, #-1] @ read for ownership
> - strbne r2, [r1, #-1] @ write for ownership
> -#endif
> bic r1, r1, #D_CACHE_LINE_SIZE - 1
> #ifdef HARVARD_CACHE
> mcrne p15, 0, r1, c7, c14, 1 @ clean & invalidate D line
> @@ -231,10 +223,6 @@ ENTRY(v6_dma_inv_range)
> #endif
> add r0, r0, #D_CACHE_LINE_SIZE
> cmp r0, r1
> -#ifdef CONFIG_DMA_CACHE_RWFO
> - ldrlo r2, [r0] @ read for ownership
> - strlo r2, [r0] @ write for ownership
> -#endif
> blo 1b
> mov r0, #0
> mcr p15, 0, r0, c7, c10, 4 @ drain write buffer
> @@ -248,9 +236,6 @@ ENTRY(v6_dma_inv_range)
> ENTRY(v6_dma_clean_range)
> bic r0, r0, #D_CACHE_LINE_SIZE - 1
> 1:
> -#ifdef CONFIG_DMA_CACHE_RWFO
> - ldr r2, [r0] @ read for ownership
> -#endif
> #ifdef HARVARD_CACHE
> mcr p15, 0, r0, c7, c10, 1 @ clean D line
> #else
> @@ -269,10 +254,6 @@ ENTRY(v6_dma_clean_range)
> * - end - virtual end address of region
> */
> ENTRY(v6_dma_flush_range)
> -#ifdef CONFIG_DMA_CACHE_RWFO
> - ldrb r2, [r0] @ read for ownership
> - strb r2, [r0] @ write for ownership
> -#endif
> bic r0, r0, #D_CACHE_LINE_SIZE - 1
> 1:
> #ifdef HARVARD_CACHE
> @@ -282,10 +263,6 @@ ENTRY(v6_dma_flush_range)
> #endif
> add r0, r0, #D_CACHE_LINE_SIZE
> cmp r0, r1
> -#ifdef CONFIG_DMA_CACHE_RWFO
> - ldrblo r2, [r0] @ read for ownership
> - strblo r2, [r0] @ write for ownership
> -#endif
> blo 1b
> mov r0, #0
> mcr p15, 0, r0, c7, c10, 4 @ drain write buffer
> @@ -301,13 +278,7 @@ ENTRY(v6_dma_map_area)
> add r1, r1, r0
> teq r2, #DMA_FROM_DEVICE
> beq v6_dma_inv_range
> -#ifndef CONFIG_DMA_CACHE_RWFO
> b v6_dma_clean_range
> -#else
> - teq r2, #DMA_TO_DEVICE
> - beq v6_dma_clean_range
> - b v6_dma_flush_range
> -#endif
> ENDPROC(v6_dma_map_area)
>
> /*
> @@ -317,11 +288,9 @@ ENDPROC(v6_dma_map_area)
> * - dir - DMA direction
> */
> ENTRY(v6_dma_unmap_area)
> -#ifndef CONFIG_DMA_CACHE_RWFO
> add r1, r1, r0
> teq r2, #DMA_TO_DEVICE
> bne v6_dma_inv_range
> -#endif
> ret lr
> ENDPROC(v6_dma_unmap_area)
>
On Mon, Mar 27, 2023 at 2:16 PM Arnd Bergmann <[email protected]> wrote:
> From: Arnd Bergmann <[email protected]>
>
> The cache management operations for noncoherent DMA on ARMv6 work
> in two different ways:
>
> * When CONFIG_DMA_CACHE_RWFO is set, speculative prefetches on in-flight
> DMA buffers lead to data corruption when the prefetched data is written
> back on top of data from the device.
>
> * When CONFIG_DMA_CACHE_RWFO is disabled, a cache flush on one CPU
> is not seen by the other core(s), leading to inconsistent contents
> across the system.
>
> As a consequence, neither configuration is actually safe to use in a
> general-purpose kernel that is used on both MPCore systems and ARM1176
> with prefetching enabled.
>
> We could add further workarounds to make the behavior more dynamic based
> on the system, but realistically, there are close to zero remaining
> users on any ARM11MPCore anyway, and nobody seems too interested in it,
> compared to the more popular ARM1176 used in BCM2835 and AST2500.
>
> The Oxnas platform has some minimal support in OpenWRT, but most of the
> drivers and dts files never made it into the mainline kernel, while the
> Arm Versatile/Realview platform mainly serves as a reference system but
> does not need to be kept working once all other ARM11MPCore systems are gone.
>
> Take the easy way out here and drop support for multiprocessing on
> ARMv6, along with the CONFIG_DMA_CACHE_RWFO option and the cache
> management implementation for it. This also helps with other ARMv6
> issues, but for the moment leaves the ability to build a kernel that
> can run on both ARMv7 SMP and single-processor ARMv6, which we probably
> want to stop supporting as well, but not as part of this series.
>
> Cc: Neil Armstrong <[email protected]>
> Cc: Daniel Golle <[email protected]>
> Cc: Linus Walleij <[email protected]>
> Cc: [email protected]
> Signed-off-by: Arnd Bergmann <[email protected]>
Yeah, we discussed this earlier, let's just drop it. Not worth the effort.
Acked-by: Linus Walleij <[email protected]>
Yours,
Linus Walleij
On Thu, Mar 30, 2023, at 09:48, Neil Armstrong wrote:
> On 27/03/2023 14:13, Arnd Bergmann wrote:
>> From: Arnd Bergmann <[email protected]>
>>
>> The cache management operations for noncoherent DMA on ARMv6 work
>> in two different ways:
>>
>> * When CONFIG_DMA_CACHE_RWFO is set, speculative prefetches on in-flight
>> DMA buffers lead to data corruption when the prefetched data is written
>> back on top of data from the device.
>>
>> * When CONFIG_DMA_CACHE_RWFO is disabled, a cache flush on one CPU
>> is not seen by the other core(s), leading to inconsistent contents
>> across the system.
>>
>> As a consequence, neither configuration is actually safe to use in a
>> general-purpose kernel that is used on both MPCore systems and ARM1176
>> with prefetching enabled.
>>
>> We could add further workarounds to make the behavior more dynamic based
>> on the system, but realistically, there are close to zero remaining
>> users on any ARM11MPCore anyway, and nobody seems too interested in it,
>> compared to the more popular ARM1176 used in BCM2835 and AST2500.
>>
>> The Oxnas platform has some minimal support in OpenWRT, but most of the
>> drivers and dts files never made it into the mainline kernel, while the
>> Arm Versatile/Realview platform mainly serves as a reference system but
>> does not need to be kept working once all other ARM11MPCore systems are gone.
>
> Acked-by: Neil Armstrong <[email protected]>
>
> It's sad but it's the reality, there's no chance full OXNAS support will
> ever come upstream and no real work has been done for years.
>
> I think OXNAS support can be scheduled for removal in the next release;
> it would need significant work to rework the current support to make it
> acceptable before trying to upstream the missing bits anyway.
Ok, thanks for your reply!
To clarify, do you think we should plan for removal after the next
stable release (6.3, removed in 6.4), or after the next LTS
release (probably 6.6, removed in 6.7)? As far as I understand,
the next OpenWRT release (23.x) will be based on linux-5.15,
and the one after that (24.x) would likely still use 6.1, unless
they skip an LTS kernel.
Arnd
On Mon, 27 Mar 2023 at 14:18, Arnd Bergmann <[email protected]> wrote:
>
> From: Arnd Bergmann <[email protected]>
>
> The cache management operations for noncoherent DMA on ARMv6 work
> in two different ways:
>
> * When CONFIG_DMA_CACHE_RWFO is set, speculative prefetches on in-flight
> DMA buffers lead to data corruption when the prefetched data is written
> back on top of data from the device.
>
> * When CONFIG_DMA_CACHE_RWFO is disabled, a cache flush on one CPU
> is not seen by the other core(s), leading to inconsistent contents
> across the system.
>
> As a consequence, neither configuration is actually safe to use in a
> general-purpose kernel that is used on both MPCore systems and ARM1176
> with prefetching enabled.
>
> We could add further workarounds to make the behavior more dynamic based
> on the system, but realistically, there are close to zero remaining
> users on any ARM11MPCore anyway, and nobody seems too interested in it,
> compared to the more popular ARM1176 used in BCM2835 and AST2500.
>
> The Oxnas platform has some minimal support in OpenWRT, but most of the
> drivers and dts files never made it into the mainline kernel, while the
> Arm Versatile/Realview platform mainly serves as a reference system but
> does not need to be kept working once all other ARM11MPCore systems are gone.
>
> Take the easy way out here and drop support for multiprocessing on
> ARMv6, along with the CONFIG_DMA_CACHE_RWFO option and the cache
> management implementation for it. This also helps with other ARMv6
> issues, but for the moment leaves the ability to build a kernel that
> can run on both ARMv7 SMP and single-processor ARMv6, which we probably
> want to stop supporting as well, but not as part of this series.
>
> Cc: Neil Armstrong <[email protected]>
> Cc: Daniel Golle <[email protected]>
> Cc: Linus Walleij <[email protected]>
> Cc: [email protected]
> Signed-off-by: Arnd Bergmann <[email protected]>
Acked-by: Ard Biesheuvel <[email protected]>
On Mon, Mar 27, 2023 at 1:16 PM Arnd Bergmann <[email protected]> wrote:
>
> From: Arnd Bergmann <[email protected]>
>
> No other architecture intentionally writes back dirty cache lines into
> a buffer that a device has just finished writing into. If the cache is
> clean, this has no effect at all, but if a cacheline in the buffer has
> actually been written by the CPU, there is a drive bug that is likely
> made worse by overwriting that buffer.
>
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> arch/riscv/mm/dma-noncoherent.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
Reviewed-by: Lad Prabhakar <[email protected]>
Cheers,
Prabhakar
> diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> index d919efab6eba..640f4c496d26 100644
> --- a/arch/riscv/mm/dma-noncoherent.c
> +++ b/arch/riscv/mm/dma-noncoherent.c
> @@ -42,7 +42,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> break;
> case DMA_FROM_DEVICE:
> case DMA_BIDIRECTIONAL:
> - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> + ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
> break;
> default:
> break;
> --
> 2.39.2
>
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv
On Mon, Mar 27, 2023 at 1:16 PM Arnd Bergmann <[email protected]> wrote:
>
> From: Arnd Bergmann <[email protected]>
>
> For a DMA_BIDIRECTIONAL transfer, the caches have to be cleaned
> first to let the device see data written by the CPU, and invalidated
> after the transfer to let the CPU see data written by the device.
>
> riscv also invalidates the caches before the transfer, which does
> not appear to serve any purpose.
>
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> arch/riscv/mm/dma-noncoherent.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
Reviewed-by: Lad Prabhakar <[email protected]>
Cheers,
Prabhakar
> diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> index 640f4c496d26..69c80b2155a1 100644
> --- a/arch/riscv/mm/dma-noncoherent.c
> +++ b/arch/riscv/mm/dma-noncoherent.c
> @@ -25,7 +25,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> break;
> case DMA_BIDIRECTIONAL:
> - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> + ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> break;
> default:
> break;
> --
> 2.39.2
>
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv
On Mon, Mar 27, 2023 at 1:20 PM Arnd Bergmann <[email protected]> wrote:
>
> From: Arnd Bergmann <[email protected]>
>
> Now that all of these have consistent behavior, replace them with
> a single shared implementation of arch_sync_dma_for_device() and
> arch_sync_dma_for_cpu() and three parameters to pick how they should
> operate:
>
> - If the CPU has speculative prefetching, then the cache
> has to be invalidated after a transfer from the device.
> On the rarer CPUs without prefetching, this can be skipped,
> with all cache management happening before the transfer.
> This flag can be runtime detected, but is usually fixed
> per architecture.
>
> - Some architectures currently clean the caches before DMA
> from a device, while others invalidate it. There has not
> been a conclusion regarding whether we should change all
> architectures to use clean instead, so this adds an
> architecture specific flag that we can change later on.
>
> - On 32-bit Arm, the arch_sync_dma_for_cpu() function keeps
> track of pages that are marked clean in the page cache, to
> avoid flushing them again. The implementation for this is
> generic enough to work on all architectures that use the
> PG_dcache_clean page flag, but a Kconfig symbol is used
> to only enable it on Arm to preserve the existing behavior.
>
> For the function naming, I picked 'wback' over 'clean', and 'wback_inv'
> over 'flush', to avoid any ambiguity about what the helper functions are
> supposed to do.
>
> Moving the global functions into a header file is usually a bad idea
> as it prevents the header from being included more than once, but it
> helps keep the behavior as close as possible to the previous state,
> including the possibility of inlining most of it into these functions
> where that was done before. This also helps keep the global namespace
> clean, by hiding the new arch_dma_cache{_wback,_inv,_wback_inv} from
> device drivers that might use them incorrectly.
>
> It would be possible to do this one architecture at a time, but
> as the change is the same everywhere, the combined patch helps
> explain it better in one place.
>
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> arch/arc/mm/dma.c | 66 +++++-------------
> arch/arm/Kconfig | 3 +
> arch/arm/mm/dma-mapping-nommu.c | 39 ++++++-----
> arch/arm/mm/dma-mapping.c | 64 +++++++-----------
> arch/arm64/mm/dma-mapping.c | 28 +++++---
> arch/csky/mm/dma-mapping.c | 44 ++++++------
> arch/hexagon/kernel/dma.c | 44 ++++++------
> arch/m68k/kernel/dma.c | 43 +++++++-----
> arch/microblaze/kernel/dma.c | 48 +++++++-------
> arch/mips/mm/dma-noncoherent.c | 60 +++++++----------
> arch/nios2/mm/dma-mapping.c | 57 +++++++---------
> arch/openrisc/kernel/dma.c | 63 +++++++++++-------
> arch/parisc/kernel/pci-dma.c | 46 ++++++-------
> arch/powerpc/mm/dma-noncoherent.c | 34 ++++++----
> arch/riscv/mm/dma-noncoherent.c | 51 +++++++-------
> arch/sh/kernel/dma-coherent.c | 43 +++++++-----
> arch/sparc/kernel/ioport.c | 38 ++++++++---
> arch/xtensa/kernel/pci-dma.c | 40 ++++++-----
> include/linux/dma-sync.h | 107 ++++++++++++++++++++++++++++++
> 19 files changed, 527 insertions(+), 391 deletions(-)
> create mode 100644 include/linux/dma-sync.h
>
I tested this on RZ/Five (with my v6 [0] + additional changes) so for RISC-V,
Reviewed-by: Lad Prabhakar <[email protected]>
Tested-by: Lad Prabhakar <[email protected]>
[0] https://patchwork.kernel.org/project/linux-renesas-soc/cover/[email protected]/
Cheers,
Prabhakar
> diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c
> index ddb96786f765..61cd01646222 100644
> --- a/arch/arc/mm/dma.c
> +++ b/arch/arc/mm/dma.c
> @@ -30,63 +30,33 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
> dma_cache_wback_inv(page_to_phys(page), size);
> }
>
> -/*
> - * Cache operations depending on function and direction argument, inspired by
> - * https://lore.kernel.org/lkml/[email protected]
> - * "dma_sync_*_for_cpu and direction=TO_DEVICE (was Re: [PATCH 02/20]
> - * dma-mapping: provide a generic dma-noncoherent implementation)"
> - *
> - * | map == for_device | unmap == for_cpu
> - * |----------------------------------------------------------------
> - * TO_DEV | writeback writeback | none none
> - * FROM_DEV | invalidate invalidate | invalidate* invalidate*
> - * BIDIR | writeback writeback | invalidate invalidate
> - *
> - * [*] needed for CPU speculative prefetches
> - *
> - * NOTE: we don't check the validity of direction argument as it is done in
> - * upper layer functions (in include/linux/dma-mapping.h)
> - */
> -
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - dma_cache_wback(paddr, size);
> - break;
> -
> - case DMA_FROM_DEVICE:
> - dma_cache_inv(paddr, size);
> - break;
> -
> - case DMA_BIDIRECTIONAL:
> - dma_cache_wback(paddr, size);
> - break;
> + dma_cache_wback(paddr, size);
> +}
>
> - default:
> - break;
> - }
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + dma_cache_inv(paddr, size);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - break;
> + dma_cache_wback_inv(paddr, size);
> +}
>
> - /* FROM_DEVICE invalidate needed if speculative CPU prefetch only */
> - case DMA_FROM_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - dma_cache_inv(paddr, size);
> - break;
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
>
> - default:
> - break;
> - }
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> }
>
> +#include <linux/dma-sync.h>
> +
> /*
> * Plug in direct dma map ops.
> */
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 125d58c54ab1..0de84e861027 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -212,6 +212,9 @@ config LOCKDEP_SUPPORT
> bool
> default y
>
> +config ARCH_DMA_MARK_DCACHE_CLEAN
> + def_bool y
> +
> config ARCH_HAS_ILOG2_U32
> bool
>
> diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
> index 12b5c6ae93fc..0817274aed15 100644
> --- a/arch/arm/mm/dma-mapping-nommu.c
> +++ b/arch/arm/mm/dma-mapping-nommu.c
> @@ -13,27 +13,36 @@
>
> #include "dma.h"
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - if (dir == DMA_FROM_DEVICE) {
> - dmac_inv_range(__va(paddr), __va(paddr + size));
> - outer_inv_range(paddr, paddr + size);
> - } else {
> - dmac_clean_range(__va(paddr), __va(paddr + size));
> - outer_clean_range(paddr, paddr + size);
> - }
> + dmac_clean_range(__va(paddr), __va(paddr + size));
> + outer_clean_range(paddr, paddr + size);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> {
> - if (dir != DMA_TO_DEVICE) {
> - outer_inv_range(paddr, paddr + size);
> - dmac_inv_range(__va(paddr), __va(paddr + size));
> - }
> + dmac_inv_range(__va(paddr), __va(paddr + size));
> + outer_inv_range(paddr, paddr + size);
> }
>
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + dmac_flush_range(__va(paddr), __va(paddr + size));
> + outer_flush_range(paddr, paddr + size);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> +}
> +
> +#include <linux/dma-sync.h>
> +
> void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> const struct iommu_ops *iommu, bool coherent)
> {
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index b703cb83d27e..aa6ee820a0ab 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -687,6 +687,30 @@ void arch_dma_mark_clean(phys_addr_t paddr, size_t size)
> }
> }
>
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> +{
> + dma_cache_maint(paddr, size, dmac_clean_range);
> + outer_clean_range(paddr, paddr + size);
> +}
> +
> +
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + dma_cache_maint(paddr, size, dmac_inv_range);
> + outer_inv_range(paddr, paddr + size);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + dma_cache_maint(paddr, size, dmac_flush_range);
> + outer_flush_range(paddr, paddr + size);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> static bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> {
> if (IS_ENABLED(CONFIG_CPU_V6) ||
> @@ -699,45 +723,7 @@ static bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> return false;
> }
>
> -/*
> - * Make an area consistent for devices.
> - * Note: Drivers should NOT use this function directly.
> - * Use the driver DMA support - see dma-mapping.h (dma_sync_*)
> - */
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> -{
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - dma_cache_maint(paddr, size, dmac_clean_range);
> - outer_clean_range(paddr, paddr + size);
> - break;
> - case DMA_FROM_DEVICE:
> - dma_cache_maint(paddr, size, dmac_inv_range);
> - outer_inv_range(paddr, paddr + size);
> - break;
> - case DMA_BIDIRECTIONAL:
> - if (arch_sync_dma_cpu_needs_post_dma_flush()) {
> - dma_cache_maint(paddr, size, dmac_clean_range);
> - outer_clean_range(paddr, paddr + size);
> - } else {
> - dma_cache_maint(paddr, size, dmac_flush_range);
> - outer_flush_range(paddr, paddr + size);
> - }
> - break;
> - default:
> - break;
> - }
> -}
> -
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> -{
> - if (dir != DMA_TO_DEVICE && arch_sync_dma_cpu_needs_post_dma_flush()) {
> - outer_inv_range(paddr, paddr + size);
> - dma_cache_maint(paddr, size, dmac_inv_range);
> - }
> -}
> +#include <linux/dma-sync.h>
>
> #ifdef CONFIG_ARM_DMA_USE_IOMMU
>
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index 5240f6acad64..bae741aa65e9 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -13,25 +13,33 @@
> #include <asm/cacheflush.h>
> #include <asm/xen/xen-ops.h>
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - unsigned long start = (unsigned long)phys_to_virt(paddr);
> + unsigned long start = (unsigned long)phys_to_virt(paddr);
> +
> + dcache_clean_poc(start, start + size);
> +}
>
> - dcache_clean_poc(start, start + size);
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + unsigned long start = (unsigned long)phys_to_virt(paddr);
> +
> + dcache_inval_poc(start, start + size);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> {
> - unsigned long start = (unsigned long)phys_to_virt(paddr);
> + unsigned long start = (unsigned long)phys_to_virt(paddr);
> +
> + dcache_clean_inval_poc(start, start + size);
> +}
>
> - if (dir == DMA_TO_DEVICE)
> - return;
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return true;
> +}
>
> - dcache_inval_poc(start, start + size);
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> }
>
> +#include <linux/dma-sync.h>
> +
> void arch_dma_prep_coherent(struct page *page, size_t size)
> {
> unsigned long start = (unsigned long)page_address(page);
> diff --git a/arch/csky/mm/dma-mapping.c b/arch/csky/mm/dma-mapping.c
> index c90f912e2822..9402e101b363 100644
> --- a/arch/csky/mm/dma-mapping.c
> +++ b/arch/csky/mm/dma-mapping.c
> @@ -55,31 +55,29 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
> cache_op(page_to_phys(page), size, dma_wbinv_set_zero_range);
> }
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - case DMA_FROM_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - cache_op(paddr, size, dma_wb_range);
> - break;
> - default:
> - BUG();
> - }
> + cache_op(paddr, size, dma_wb_range);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - return;
> - case DMA_FROM_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - cache_op(paddr, size, dma_inv_range);
> - break;
> - default:
> - BUG();
> - }
> + cache_op(paddr, size, dma_inv_range);
> }
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + cache_op(paddr, size, dma_wbinv_range);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return true;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> +}
> +
> +#include <linux/dma-sync.h>
> diff --git a/arch/hexagon/kernel/dma.c b/arch/hexagon/kernel/dma.c
> index 882680e81a30..e6538128a75b 100644
> --- a/arch/hexagon/kernel/dma.c
> +++ b/arch/hexagon/kernel/dma.c
> @@ -9,29 +9,33 @@
> #include <linux/memblock.h>
> #include <asm/page.h>
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - void *addr = phys_to_virt(paddr);
> -
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - hexagon_clean_dcache_range((unsigned long) addr,
> - (unsigned long) addr + size);
> - break;
> - case DMA_FROM_DEVICE:
> - hexagon_inv_dcache_range((unsigned long) addr,
> - (unsigned long) addr + size);
> - break;
> - case DMA_BIDIRECTIONAL:
> - flush_dcache_range((unsigned long) addr,
> - (unsigned long) addr + size);
> - break;
> - default:
> - BUG();
> - }
> + hexagon_clean_dcache_range(paddr, paddr + size);
> }
>
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + hexagon_inv_dcache_range(paddr, paddr + size);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + hexagon_flush_dcache_range(paddr, paddr + size);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return false;
> +}
> +
> +#include <linux/dma-sync.h>
> +
> /*
> * Our max_low_pfn should have been backed off by 16MB in mm/init.c to create
> * DMA coherent space. Use that for the pool.
> diff --git a/arch/m68k/kernel/dma.c b/arch/m68k/kernel/dma.c
> index 2e192a5df949..aa9b434e6df8 100644
> --- a/arch/m68k/kernel/dma.c
> +++ b/arch/m68k/kernel/dma.c
> @@ -58,20 +58,33 @@ void arch_dma_free(struct device *dev, size_t size, void *vaddr,
>
> #endif /* CONFIG_MMU && !CONFIG_COLDFIRE */
>
> -void arch_sync_dma_for_device(phys_addr_t handle, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_BIDIRECTIONAL:
> - case DMA_TO_DEVICE:
> - cache_push(handle, size);
> - break;
> - case DMA_FROM_DEVICE:
> - cache_clear(handle, size);
> - break;
> - default:
> - pr_err_ratelimited("dma_sync_single_for_device: unsupported dir %u\n",
> - dir);
> - break;
> - }
> + /*
> + * cache_push() always invalidates in addition to cleaning
> + * write-back caches.
> + */
> + cache_push(paddr, size);
> +}
> +
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + cache_clear(paddr, size);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + cache_push(paddr, size);
> }
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return false;
> +}
> +
> +#include <linux/dma-sync.h>
> diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
> index b4c4e45fd45e..01110d4aa5b0 100644
> --- a/arch/microblaze/kernel/dma.c
> +++ b/arch/microblaze/kernel/dma.c
> @@ -14,32 +14,30 @@
> #include <linux/bug.h>
> #include <asm/cacheflush.h>
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - flush_dcache_range(paddr, paddr + size);
> - break;
> - case DMA_FROM_DEVICE:
> - invalidate_dcache_range(paddr, paddr + size);
> - break;
> - default:
> - BUG();
> - }
> + /* writeback plus invalidate, could be a nop on WT caches */
> + flush_dcache_range(paddr, paddr + size);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - break;
> - case DMA_BIDIRECTIONAL:
> - case DMA_FROM_DEVICE:
> - invalidate_dcache_range(paddr, paddr + size);
> - break;
> - default:
> - BUG();
> - }
> -}
> + invalidate_dcache_range(paddr, paddr + size);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + flush_dcache_range(paddr, paddr + size);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> +}
> +
> +#include <linux/dma-sync.h>
> diff --git a/arch/mips/mm/dma-noncoherent.c b/arch/mips/mm/dma-noncoherent.c
> index b9d68bcc5d53..902d4b7c1f85 100644
> --- a/arch/mips/mm/dma-noncoherent.c
> +++ b/arch/mips/mm/dma-noncoherent.c
> @@ -85,50 +85,38 @@ static inline void dma_sync_phys(phys_addr_t paddr, size_t size,
> } while (left);
> }
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - dma_sync_phys(paddr, size, _dma_cache_wback);
> - break;
> - case DMA_FROM_DEVICE:
> - dma_sync_phys(paddr, size, _dma_cache_inv);
> - break;
> - case DMA_BIDIRECTIONAL:
> - if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
> - cpu_needs_post_dma_flush())
> - dma_sync_phys(paddr, size, _dma_cache_wback);
> - else
> - dma_sync_phys(paddr, size, _dma_cache_wback_inv);
> - break;
> - default:
> - break;
> - }
> + dma_sync_phys(paddr, size, _dma_cache_wback);
> }
>
> -#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - break;
> - case DMA_FROM_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - if (cpu_needs_post_dma_flush())
> - dma_sync_phys(paddr, size, _dma_cache_inv);
> - break;
> - default:
> - break;
> - }
> + dma_sync_phys(paddr, size, _dma_cache_inv);
> }
> -#endif
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + dma_sync_phys(paddr, size, _dma_cache_wback_inv);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
> + cpu_needs_post_dma_flush();
> +}
> +
> +#include <linux/dma-sync.h>
>
> #ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS
> void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> - const struct iommu_ops *iommu, bool coherent)
> + const struct iommu_ops *iommu, bool coherent)
> {
> - dev->dma_coherent = coherent;
> + dev->dma_coherent = coherent;
> }
> #endif
> diff --git a/arch/nios2/mm/dma-mapping.c b/arch/nios2/mm/dma-mapping.c
> index fd887d5f3f9a..29978970955e 100644
> --- a/arch/nios2/mm/dma-mapping.c
> +++ b/arch/nios2/mm/dma-mapping.c
> @@ -13,53 +13,46 @@
> #include <linux/types.h>
> #include <linux/mm.h>
> #include <linux/string.h>
> +#include <linux/dma-map-ops.h>
> #include <linux/dma-mapping.h>
> #include <linux/io.h>
> #include <linux/cache.h>
> #include <asm/cacheflush.h>
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> + /*
> + * We just need to write back the caches here, but Nios2 flush
> + * instruction will do both writeback and invalidate.
> + */
> void *vaddr = phys_to_virt(paddr);
> + flush_dcache_range((unsigned long)vaddr, (unsigned long)(vaddr + size));
> +}
>
> - switch (dir) {
> - case DMA_FROM_DEVICE:
> - invalidate_dcache_range((unsigned long)vaddr,
> - (unsigned long)(vaddr + size));
> - break;
> - case DMA_TO_DEVICE:
> - /*
> - * We just need to flush the caches here , but Nios2 flush
> - * instruction will do both writeback and invalidate.
> - */
> - case DMA_BIDIRECTIONAL: /* flush and invalidate */
> - flush_dcache_range((unsigned long)vaddr,
> - (unsigned long)(vaddr + size));
> - break;
> - default:
> - BUG();
> - }
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + unsigned long vaddr = (unsigned long)phys_to_virt(paddr);
> + invalidate_dcache_range(vaddr, (unsigned long)(vaddr + size));
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> {
> void *vaddr = phys_to_virt(paddr);
> + flush_dcache_range((unsigned long)vaddr, (unsigned long)(vaddr + size));
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
>
> - switch (dir) {
> - case DMA_BIDIRECTIONAL:
> - case DMA_FROM_DEVICE:
> - invalidate_dcache_range((unsigned long)vaddr,
> - (unsigned long)(vaddr + size));
> - break;
> - case DMA_TO_DEVICE:
> - break;
> - default:
> - BUG();
> - }
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> }
>
> +#include <linux/dma-sync.h>
> +
> void arch_dma_prep_coherent(struct page *page, size_t size)
> {
> unsigned long start = (unsigned long)page_address(page);
> diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
> index 91a00d09ffad..aba2258e62eb 100644
> --- a/arch/openrisc/kernel/dma.c
> +++ b/arch/openrisc/kernel/dma.c
> @@ -95,32 +95,47 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
> mmap_write_unlock(&init_mm);
> }
>
> -void arch_sync_dma_for_device(phys_addr_t addr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> unsigned long cl;
> struct cpuinfo_or1k *cpuinfo = &cpuinfo_or1k[smp_processor_id()];
>
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - /* Write back the dcache for the requested range */
> - for (cl = addr; cl < addr + size;
> - cl += cpuinfo->dcache_block_size)
> - mtspr(SPR_DCBWR, cl);
> - break;
> - case DMA_FROM_DEVICE:
> - /* Invalidate the dcache for the requested range */
> - for (cl = addr; cl < addr + size;
> - cl += cpuinfo->dcache_block_size)
> - mtspr(SPR_DCBIR, cl);
> - break;
> - case DMA_BIDIRECTIONAL:
> - /* Flush the dcache for the requested range */
> - for (cl = addr; cl < addr + size;
> - cl += cpuinfo->dcache_block_size)
> - mtspr(SPR_DCBFR, cl);
> - break;
> - default:
> - break;
> - }
> + /* Write back the dcache for the requested range */
> + for (cl = paddr; cl < paddr + size;
> + cl += cpuinfo->dcache_block_size)
> + mtspr(SPR_DCBWR, cl);
> }
> +
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + unsigned long cl;
> + struct cpuinfo_or1k *cpuinfo = &cpuinfo_or1k[smp_processor_id()];
> +
> + /* Invalidate the dcache for the requested range */
> + for (cl = paddr; cl < paddr + size;
> + cl += cpuinfo->dcache_block_size)
> + mtspr(SPR_DCBIR, cl);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + unsigned long cl;
> + struct cpuinfo_or1k *cpuinfo = &cpuinfo_or1k[smp_processor_id()];
> +
> + /* Flush the dcache for the requested range */
> + for (cl = paddr; cl < paddr + size;
> + cl += cpuinfo->dcache_block_size)
> + mtspr(SPR_DCBFR, cl);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return false;
> +}
> +
> +#include <linux/dma-sync.h>
> diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
> index 6d3d3cffb316..a7955aab8ce2 100644
> --- a/arch/parisc/kernel/pci-dma.c
> +++ b/arch/parisc/kernel/pci-dma.c
> @@ -443,35 +443,35 @@ void arch_dma_free(struct device *dev, size_t size, void *vaddr,
> free_pages((unsigned long)__va(dma_handle), order);
> }
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> unsigned long virt = (unsigned long)phys_to_virt(paddr);
>
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - clean_kernel_dcache_range(virt, size);
> - break;
> - case DMA_FROM_DEVICE:
> - clean_kernel_dcache_range(virt, size);
> - break;
> - case DMA_BIDIRECTIONAL:
> - flush_kernel_dcache_range(virt, size);
> - break;
> - }
> + clean_kernel_dcache_range(virt, size);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> {
> unsigned long virt = (unsigned long)phys_to_virt(paddr);
>
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - break;
> - case DMA_FROM_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - purge_kernel_dcache_range(virt, size);
> - break;
> - }
> + purge_kernel_dcache_range(virt, size);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + unsigned long virt = (unsigned long)phys_to_virt(paddr);
> +
> + flush_kernel_dcache_range(virt, size);
> }
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return true;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> +}
> +
> +#include <linux/dma-sync.h>
> diff --git a/arch/powerpc/mm/dma-noncoherent.c b/arch/powerpc/mm/dma-noncoherent.c
> index 00e59a4faa2b..268510c71156 100644
> --- a/arch/powerpc/mm/dma-noncoherent.c
> +++ b/arch/powerpc/mm/dma-noncoherent.c
> @@ -101,27 +101,33 @@ static void __dma_phys_op(phys_addr_t paddr, size_t size, enum dma_cache_op op)
> #endif
> }
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> __dma_phys_op(paddr, size, DMA_CACHE_CLEAN);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_NONE:
> - BUG();
> - case DMA_TO_DEVICE:
> - break;
> - case DMA_FROM_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - __dma_phys_op(paddr, size, DMA_CACHE_INVAL);
> - break;
> - }
> + __dma_phys_op(paddr, size, DMA_CACHE_INVAL);
> }
>
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + __dma_phys_op(paddr, size, DMA_CACHE_FLUSH);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return true;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> +}
> +
> +#include <linux/dma-sync.h>
> +
> void arch_dma_prep_coherent(struct page *page, size_t size)
> {
> unsigned long kaddr = (unsigned long)page_address(page);
> diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> index 69c80b2155a1..b9a9f57e02be 100644
> --- a/arch/riscv/mm/dma-noncoherent.c
> +++ b/arch/riscv/mm/dma-noncoherent.c
> @@ -12,43 +12,40 @@
>
> static bool noncoherent_supported;
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> void *vaddr = phys_to_virt(paddr);
>
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> - break;
> - case DMA_FROM_DEVICE:
> - ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> - break;
> - case DMA_BIDIRECTIONAL:
> - ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> - break;
> - default:
> - break;
> - }
> + ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> {
> void *vaddr = phys_to_virt(paddr);
>
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - break;
> - case DMA_FROM_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
> - break;
> - default:
> - break;
> - }
> + ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
> }
>
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + void *vaddr = phys_to_virt(paddr);
> +
> + ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return true;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> +}
> +
> +#include <linux/dma-sync.h>
> +
> +
> void arch_dma_prep_coherent(struct page *page, size_t size)
> {
> void *flush_addr = page_address(page);
> diff --git a/arch/sh/kernel/dma-coherent.c b/arch/sh/kernel/dma-coherent.c
> index 6a44c0e7ba40..41f031ae7609 100644
> --- a/arch/sh/kernel/dma-coherent.c
> +++ b/arch/sh/kernel/dma-coherent.c
> @@ -12,22 +12,35 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
> __flush_purge_region(page_address(page), size);
> }
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> void *addr = sh_cacheop_vaddr(phys_to_virt(paddr));
>
> - switch (dir) {
> - case DMA_FROM_DEVICE: /* invalidate only */
> - __flush_invalidate_region(addr, size);
> - break;
> - case DMA_TO_DEVICE: /* writeback only */
> - __flush_wback_region(addr, size);
> - break;
> - case DMA_BIDIRECTIONAL: /* writeback and invalidate */
> - __flush_purge_region(addr, size);
> - break;
> - default:
> - BUG();
> - }
> + __flush_wback_region(addr, size);
> }
> +
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + void *addr = sh_cacheop_vaddr(phys_to_virt(paddr));
> +
> + __flush_invalidate_region(addr, size);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + void *addr = sh_cacheop_vaddr(phys_to_virt(paddr));
> +
> + __flush_purge_region(addr, size);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return false;
> +}
> +
> +#include <linux/dma-sync.h>
> diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c
> index 4f3d26066ec2..6926ead2f208 100644
> --- a/arch/sparc/kernel/ioport.c
> +++ b/arch/sparc/kernel/ioport.c
> @@ -300,21 +300,39 @@ arch_initcall(sparc_register_ioport);
>
> #endif /* CONFIG_SBUS */
>
> -/*
> - * IIep is write-through, not flushing on cpu to device transfer.
> - *
> - * On LEON systems without cache snooping, the entire D-CACHE must be flushed to
> - * make DMA to cacheable memory coherent.
> - */
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - if (dir != DMA_TO_DEVICE &&
> - sparc_cpu_model == sparc_leon &&
> + /* IIep is write-through, not flushing on cpu to device transfer. */
> +}
> +
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + /*
> + * On LEON systems without cache snooping, the entire D-CACHE must be
> + * flushed to make DMA to cacheable memory coherent.
> + */
> + if (sparc_cpu_model == sparc_leon &&
> !sparc_leon3_snooping_enabled())
> leon_flush_dcache_all();
> }
>
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + arch_dma_cache_inv(paddr, size);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return true;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return false;
> +}
> +
> +#include <linux/dma-sync.h>
> +
> #ifdef CONFIG_PROC_FS
>
> static int sparc_io_proc_show(struct seq_file *m, void *v)
> diff --git a/arch/xtensa/kernel/pci-dma.c b/arch/xtensa/kernel/pci-dma.c
> index ff3bf015eca4..d4ff96585545 100644
> --- a/arch/xtensa/kernel/pci-dma.c
> +++ b/arch/xtensa/kernel/pci-dma.c
> @@ -43,24 +43,34 @@ static void do_cache_op(phys_addr_t paddr, size_t size,
> }
> }
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - do_cache_op(paddr, size, __flush_dcache_range);
> - break;
> - case DMA_FROM_DEVICE:
> - do_cache_op(paddr, size, __invalidate_dcache_range);
> - break;
> - case DMA_BIDIRECTIONAL:
> - do_cache_op(paddr, size, __flush_invalidate_dcache_range);
> - break;
> - default:
> - break;
> - }
> + do_cache_op(paddr, size, __flush_dcache_range);
> }
>
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + do_cache_op(paddr, size, __invalidate_dcache_range);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + do_cache_op(paddr, size, __flush_invalidate_dcache_range);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return false;
> +}
> +
> +#include <linux/dma-sync.h>
> +
> +
> void arch_dma_prep_coherent(struct page *page, size_t size)
> {
> __invalidate_dcache_range((unsigned long)page_address(page), size);
> diff --git a/include/linux/dma-sync.h b/include/linux/dma-sync.h
> new file mode 100644
> index 000000000000..18e33d5e8eaf
> --- /dev/null
> +++ b/include/linux/dma-sync.h
> @@ -0,0 +1,107 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Cache operations depending on function and direction argument, inspired by
> + * https://lore.kernel.org/lkml/[email protected]
> + * "dma_sync_*_for_cpu and direction=TO_DEVICE (was Re: [PATCH 02/20]
> + * dma-mapping: provide a generic dma-noncoherent implementation)"
> + *
> + * | map == for_device | unmap == for_cpu
> + * |----------------------------------------------------------------
> + * TO_DEV | writeback writeback | none none
> + * FROM_DEV | invalidate invalidate | invalidate* invalidate*
> + * BIDIR | writeback writeback | invalidate invalidate
> + *
> + * [*] needed for CPU speculative prefetches
> + *
> + * NOTE: we don't check the validity of direction argument as it is done in
> + * upper layer functions (in include/linux/dma-mapping.h)
> + *
> + * This file can be included by arch/.../kernel/dma-noncoherent.c to provide
> + * the respective high-level operations without having to expose the
> + * cache management ops to drivers.
> + */
> +
> +void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> + enum dma_data_direction dir)
> +{
> + switch (dir) {
> + case DMA_TO_DEVICE:
> + /*
> + * This may be an empty function on write-through caches,
> + * and it might invalidate the cache if an architecture has
> + * a write-back cache but no way to write it back without
> + * invalidating
> + */
> + arch_dma_cache_wback(paddr, size);
> + break;
> +
> + case DMA_FROM_DEVICE:
> + /*
> + * FIXME: this should be handled the same across all
> + * architectures, see
> + * https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/
> + */
> + if (!arch_sync_dma_clean_before_fromdevice()) {
> + arch_dma_cache_inv(paddr, size);
> + break;
> + }
> + fallthrough;
> +
> + case DMA_BIDIRECTIONAL:
> + /* Skip the invalidate here if it's done later */
> + if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
> + arch_sync_dma_cpu_needs_post_dma_flush())
> + arch_dma_cache_wback(paddr, size);
> + else
> + arch_dma_cache_wback_inv(paddr, size);
> + break;
> +
> + default:
> + break;
> + }
> +}
> +
> +#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU
> +/*
> + * Mark the D-cache clean for these pages to avoid extra flushing.
> + */
> +static void arch_dma_mark_dcache_clean(phys_addr_t paddr, size_t size)
> +{
> +#ifdef CONFIG_ARCH_DMA_MARK_DCACHE_CLEAN
> + unsigned long pfn = PFN_UP(paddr);
> + unsigned long off = paddr & (PAGE_SIZE - 1);
> + size_t left = size;
> +
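> +	/* only pages that lie fully inside the buffer can be marked clean */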
> + if (off)
> + left -= PAGE_SIZE - off;
> +
> + while (left >= PAGE_SIZE) {
> + struct page *page = pfn_to_page(pfn++);
> + set_bit(PG_dcache_clean, &page->flags);
> + left -= PAGE_SIZE;
> + }
> +#endif
> +}
> +
> +void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> + enum dma_data_direction dir)
> +{
> + switch (dir) {
> + case DMA_TO_DEVICE:
> + break;
> +
> + case DMA_FROM_DEVICE:
> + case DMA_BIDIRECTIONAL:
> + /* FROM_DEVICE invalidate needed if speculative CPU prefetch only */
> + if (arch_sync_dma_cpu_needs_post_dma_flush())
> + arch_dma_cache_inv(paddr, size);
> +
> + if (size > PAGE_SIZE)
> + arch_dma_mark_dcache_clean(paddr, size);
> + break;
> +
> + default:
> + break;
> + }
> +}
> +#endif
> --
> 2.39.2
>
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv
On 30/03/2023 12:03, Arnd Bergmann wrote:
> On Thu, Mar 30, 2023, at 09:48, Neil Armstrong wrote:
>> On 27/03/2023 14:13, Arnd Bergmann wrote:
>>> From: Arnd Bergmann <[email protected]>
>>>
>>> The cache management operations for noncoherent DMA on ARMv6 work
>>> in two different ways:
>>>
>>> * When CONFIG_DMA_CACHE_RWFO is set, speculative prefetches on in-flight
>>> DMA buffers lead to data corruption when the prefetched data is written
>>> back on top of data from the device.
>>>
>>> * When CONFIG_DMA_CACHE_RWFO is disabled, a cache flush on one CPU
>>> is not seen by the other core(s), leading to inconsistent contents
>>> across the system.
>>>
>>> As a consequence, neither configuration is actually safe to use in a
>>> general-purpose kernel that is used on both MPCore systems and ARM1176
>>> with prefetching enabled.
>>>
>>> We could add further workarounds to make the behavior more dynamic based
>>> on the system, but realistically, there are close to zero remaining
>>> users on any ARM11MPCore anyway, and nobody seems too interested in it,
>>> compared to the more popular ARM1176 used in BCM2835 and AST2500.
>>>
>>> The Oxnas platform has some minimal support in OpenWRT, but most of the
>>> drivers and dts files never made it into the mainline kernel, while the
>>> Arm Versatile/Realview platform mainly serves as a reference system but
>>> is not necessary to be kept working once all other ARM11MPCore are gone.
>>
>> Acked-by: Neil Armstrong <[email protected]>
>>
>> It's sad but it's the reality, there's no chance full OXNAS support will
>> ever come upstream and no real work has been done for years.
>>
I think OXNAS support can be scheduled for removal for the next release,
>> it would need significant work to rework current support to make it acceptable
>> before trying to upstream missing bits anyway.
>
> Ok, thanks for your reply!
>
> To clarify, do you think we should plan for removal after the next
> stable release (6.3, removed in 6.4), or after the next LTS
> release (probably 6.6, removed in 6.7)? As far as I understand,
> the next OpenWRT release (23.x) will be based on linux-5.15,
> and the one after that (24.x) would likely still use 6.1, unless
> they skip an LTS kernel.
I think it's ok to remove it ASAP, or at least before the next LTS;
not having SMP makes the platform barely usable, so the earliest is the best.
Neil
>
> Arnd
On Mon, Mar 27, 2023 at 2:16 PM Arnd Bergmann <[email protected]> wrote:
> From: Arnd Bergmann <[email protected]>
>
> Most ARM CPUs can have write-back caches, which require
> cache management to be done in the dma_sync_*_for_device()
> operation. This is typically done in both writeback and
> writethrough mode.
>
> The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
> (arm920t, arm940t) implementations are the exception here,
> and only do the cache management after the DMA is complete,
> in the dma_sync_*_for_cpu() operation.
>
> Change this for consistency with the other platforms. This
> should have no user visible effect.
>
> Signed-off-by: Arnd Bergmann <[email protected]>
Looks good to me.
Reviewed-by: Linus Walleij <[email protected]>
Yours,
Linus Walleij
On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann <[email protected]>
>
> Most ARM CPUs can have write-back caches, which require
> cache management to be done in the dma_sync_*_for_device()
> operation. This is typically done in both writeback and
> writethrough mode.
>
> The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
> (arm920t, arm940t) implementations are the exception here,
> and only do the cache management after the DMA is complete,
> in the dma_sync_*_for_cpu() operation.
>
> Change this for consistency with the other platforms. This
> should have no user visible effect.
NAK...
The reason we do cache management _after_ is to ensure that there
is no stale data. The kernel _has_ (at the very least in the past)
performed DMA to data structures that are embedded within other
data structures, resulting in cache lines being shared. If one of
those cache lines is touched while DMA is progressing, then we
must do cache management _after_ the DMA operation has completed.
Doing it before is no good.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
On Mon, Mar 27, 2023 at 2:16 PM Arnd Bergmann <[email protected]> wrote:
> From: Arnd Bergmann <[email protected]>
>
> The arm specific iommu code in dma-mapping.c uses the page+offset based
> __dma_page_cpu_to_dev()/__dma_page_dev_to_cpu() helpers in place of the
> phys_addr_t based arch_sync_dma_for_device()/arch_sync_dma_for_cpu()
> wrappers around the.
Broken sentence?
> In order to be able to move the latter part set of functions into
> common code, change the iommu implementation to use them directly
> and remove the internal ones as a separate interface.
>
> As page+offset and phys_address are equivalent, but are used in
> different parts of the code here, this allows removing some of
> the conversion but adds them elsewhere.
>
> Signed-off-by: Arnd Bergmann <[email protected]>
Looks good to me, took me some time to verify and understand
the open-coded version of PFN_UP() and this refactoring alone
makes the patch highly valuable.
Reviewed-by: Linus Walleij <[email protected]>
Yours,
Linus Walleij
On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote:
> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
> > From: Arnd Bergmann <[email protected]>
> >
> > Most ARM CPUs can have write-back caches, which require
> > cache management to be done in the dma_sync_*_for_device()
> > operation. This is typically done in both writeback and
> > writethrough mode.
> >
> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
> > (arm920t, arm940t) implementations are the exception here,
> > and only do the cache management after the DMA is complete,
> > in the dma_sync_*_for_cpu() operation.
> >
> > Change this for consistency with the other platforms. This
> > should have no user visible effect.
>
> NAK...
>
> The reason we do cache management _after_ is to ensure that there
> is no stale data. The kernel _has_ (at the very least in the past)
> performed DMA to data structures that are embedded within other
> data structures, resulting in cache lines being shared. If one of
> those cache lines is touched while DMA is progressing, then we
> must do cache management _after_ the DMA operation has completed.
> Doing it before is no good.
It looks like the main offender of "touching cache lines shared
with DMA" has now been resolved - that was the SCSI sense buffer,
and was fixed some time ago:
commit de25deb18016f66dcdede165d07654559bb332bc
Author: FUJITA Tomonori <[email protected]>
Date: Wed Jan 16 13:32:17 2008 +0900
/if/ that is the one and only case, then we're probably fine, but
having been through an era where this kind of thing was the norm
and requests to fix it did not get great responses from subsystem
maintainers, I just don't trust the kernel not to want to DMA to
overlapping cache lines.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
On Fri, Mar 31, 2023, at 11:35, Russell King (Oracle) wrote:
> On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote:
>> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
>> > From: Arnd Bergmann <[email protected]>
>> >
>> > Most ARM CPUs can have write-back caches, which require
>> > cache management to be done in the dma_sync_*_for_device()
>> > operation. This is typically done in both writeback and
>> > writethrough mode.
>> >
>> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
>> > (arm920t, arm940t) implementations are the exception here,
>> > and only do the cache management after the DMA is complete,
>> > in the dma_sync_*_for_cpu() operation.
>> >
>> > Change this for consistency with the other platforms. This
>> > should have no user visible effect.
>>
>> NAK...
>>
>> The reason we do cache management _after_ is to ensure that there
>> is no stale data. The kernel _has_ (at the very least in the past)
>> performed DMA to data structures that are embedded within other
>> data structures, resulting in cache lines being shared. If one of
>> those cache lines is touched while DMA is progressing, then we
>> must do cache management _after_ the DMA operation has completed.
>> Doing it before is no good.
What I'm trying to address here is the inconsistency between
implementations. If we decide that we always want to invalidate
after FROM_DEVICE, I can do that as part of the series, but then
I have to change most of the other arm implementations.
Right now, the only WT cache implementations that do the
invalidation after the DMA are cache-v4.S (arm720 integrator and
clps711x), cache-v4wt.S (arm920/arm922 at91rm9200, clps711x,
ep93xx, omap15xx, imx1 and integrator), some sparc32 leon3 and
early xtensa.
Most architectures that have write-through caches (m68k,
microblaze) or write-back caches but no speculation (all other
armv4/armv5, hexagon, openrisc, sh, most mips, later xtensa)
only invalidate before DMA but not after.
OTOH, most machines that are actually in use today (armv6+,
powerpc, later mips, microblaze, riscv, nios2) also have to
deal with speculative accesses, so they end up having to
invalidate or flush both before and after a DMA_FROM_DEVICE
and DMA_BIDIRECTIONAL.
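To make the comparison concrete, here is my shorthand of the three
patterns as plain C, with wback()/inv() standing in for the
per-architecture primitives (a summary sketch, not code from the series):

/* stand-ins for the per-architecture cache primitives */
extern void wback(phys_addr_t paddr, size_t size);
extern void inv(phys_addr_t paddr, size_t size);

/* 1: no speculation: everything before the DMA, nothing after */
static void class1_for_device(bool from_dev, phys_addr_t paddr, size_t size)
{
	if (from_dev)
		inv(paddr, size);
	else
		wback(paddr, size);
}
static void class1_for_cpu(bool from_dev, phys_addr_t paddr, size_t size)
{
	/* nothing: no prefetch can have pulled in stale lines */
}

/* 2: speculative prefetch: like 1 before the DMA, but invalidate
 * again afterwards to drop lines prefetched during the transfer
 */
static void class2_for_cpu(bool from_dev, phys_addr_t paddr, size_t size)
{
	if (from_dev)
		inv(paddr, size);
}

/* 3: arm64 model: clean rather than invalidate before FROM_DEVICE
 * (so stale data cannot leak if the DMA never happens), then
 * invalidate after the DMA as in 2
 */
static void class3_for_device(bool from_dev, phys_addr_t paddr, size_t size)
{
	wback(paddr, size);
}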
> It looks like the main offender of "touching cache lines shared
> with DMA" has now been resolved - that was the SCSI sense buffer,
> and was fixed some time ago:
>
> commit de25deb18016f66dcdede165d07654559bb332bc
> Author: FUJITA Tomonori <[email protected]>
> Date: Wed Jan 16 13:32:17 2008 +0900
>
> /if/ that is the one and only case, then we're probably fine, but
> having been through an era where this kind of thing was the norm
> and requests to fix it did not get great responses from subsystem
> maintainers, I just don't trust the kernel not to want to DMA to
> overlapping cache lines.
Thanks for digging that out, that is very useful. It looks like this
was around the same time as 03d70617b8a7 ("powerpc: Prevent memory
corruption due to cache invalidation of unaligned DMA buffer"), so
it may well have been related. I know we also had more recent
problems with USB drivers trying to DMA to stack, which would
also cause problems on non-coherent machines, but some of these were
only found after we introduced VMAP_STACK.
It would be nice to use KASAN to prevent reads on cache lines that
have in-flight DMA.
Arnd
On Fri, Mar 31, 2023 at 12:38:45PM +0200, Arnd Bergmann wrote:
> On Fri, Mar 31, 2023, at 11:35, Russell King (Oracle) wrote:
> > On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote:
> >> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
> >> > From: Arnd Bergmann <[email protected]>
> >> >
> >> > Most ARM CPUs can have write-back caches, which require
> >> > cache management to be done in the dma_sync_*_for_device()
> >> > operation. This is typically done in both writeback and
> >> > writethrough mode.
> >> >
> >> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
> >> > (arm920t, arm940t) implementations are the exception here,
> >> > and only do the cache management after the DMA is complete,
> >> > in the dma_sync_*_for_cpu() operation.
> >> >
> >> > Change this for consistency with the other platforms. This
> >> > should have no user visible effect.
> >>
> >> NAK...
> >>
> >> The reason we do cache management _after_ is to ensure that there
> >> is no stale data. The kernel _has_ (at the very least in the past)
> >> performed DMA to data structures that are embedded within other
> >> data structures, resulting in cache lines being shared. If one of
> >> those cache lines is touched while DMA is progressing, then we
> >> must do cache management _after_ the DMA operation has completed.
> >> Doing it before is no good.
>
> What I'm trying to address here is the inconsistency between
> implementations. If we decide that we always want to invalidate
> after FROM_DEVICE, I can do that as part of the series, but then
> I have to change most of the other arm implementations.
Why?
First thing to say is that DMA to buffers where the cache lines are
shared with data the CPU may be accessing need to be outlawed - they
are a recipe for data corruption - always have been. Sadly, some folk
don't see it that way because of a past "x86 just works and we demand
that all architectures behave like x86!" attitude. The SCSI sense
buffer has historically been a big culprit for that.
For WT, FROM_DEVICE, invalidating after DMA is the right thing to do,
because we want to ensure that the DMA'd data is properly readable upon
completion of the DMA. If overlapping cache lines have been touched
while DMA is progressing, and we invalidate before DMA, then the cache
will contain stale data that will remain in the cache after DMA has
completed. Invalidating a WT cache does not destroy any data, so is
safe to do. So the safest approach is to invalidate after DMA has
completed in this instance.
For WB, FROM_DEVICE, we have the problem of dirty cache lines which
we have to get rid of. For the overlapping cache lines, we have to
clean those before DMA begins to ensure that data written to the
non-DMA-buffer part is preserved. All other cache lines need to be
invalidated before DMA begins to ensure that writebacks do not
corrupt data from the device. Hence why it's different.
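Spelled out as C for a 32-byte line size (a sketch with made-up
clean_line()/inv_line() helpers, not the actual cache-*.S code):

#define LINE 32UL

extern void clean_line(unsigned long addr);
extern void inv_line(unsigned long addr);

static void dma_inv_range_wb(unsigned long start, unsigned long end)
{
	/* write back the partial lines at each end first, so CPU data
	 * sharing those lines is not lost
	 */
	if (start & (LINE - 1))
		clean_line(start & ~(LINE - 1));
	if (end & (LINE - 1))
		clean_line(end & ~(LINE - 1));

	/* then discard the whole range so that no later writeback can
	 * land on top of the DMA'd data
	 */
	for (start &= ~(LINE - 1); start < end; start += LINE)
		inv_line(start);
}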
And hence why the ARM implementation is based around buffer ownership.
And hence why they're called dma_map_area()/dma_unmap_area() rather
than the cache operations themselves. This is an intentional change,
one that was done when ARMv6 came along.
> OTOH, most machines that are actually in use today (armv6+,
> powerpc, later mips, microblaze, riscv, nios2) also have to
> deal with speculative accesses, so they end up having to
> invalidate or flush both before and after a DMA_FROM_DEVICE
> and DMA_BIDIRECTIONAL.
Again, these are implementation details of the cache, and this is
precisely why having the map/unmap interface is so much better than
having generic code explicitly call "clean" and "invalidate"
interfaces into arch code.
If we treat everything as a speculative cache, then we're doing
needless extra work for those caches that aren't speculative. So,
ARM would have to step through every cache line for every DMA
buffer at 32-byte intervals performing cache maintenance whether
the cache is speculative or not. That is expensive, and hurts
performance.
I put a lot of thought into this when I updated the ARM DMA
implementation when we started seeing these different cache types
particularly when ARMv6 came along. I really don't want that work
wrecked.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
On Fri, Mar 31, 2023, at 13:08, Russell King (Oracle) wrote:
> On Fri, Mar 31, 2023 at 12:38:45PM +0200, Arnd Bergmann wrote:
>> On Fri, Mar 31, 2023, at 11:35, Russell King (Oracle) wrote:
>> > On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote:
>> >> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
>> >> > From: Arnd Bergmann <[email protected]>
>> >> >
>> >> > Most ARM CPUs can have write-back caches, which require
>> >> > cache management to be done in the dma_sync_*_for_device()
>> >> > operation. This is typically done in both writeback and
>> >> > writethrough mode.
>> >> >
>> >> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
>> >> > (arm920t, arm940t) implementations are the exception here,
>> >> > and only do the cache management after the DMA is complete,
>> >> > in the dma_sync_*_for_cpu() operation.
>> >> >
>> >> > Change this for consistency with the other platforms. This
>> >> > should have no user visible effect.
>> >>
>> >> NAK...
>> >>
>> >> The reason we do cache management _after_ is to ensure that there
>> >> is no stale data. The kernel _has_ (at the very least in the past)
>> >> performed DMA to data structures that are embedded within other
>> >> data structures, resulting in cache lines being shared. If one of
>> >> those cache lines is touched while DMA is progressing, then we
>> >> must do cache management _after_ the DMA operation has completed.
>> >> Doing it before is no good.
>>
>> What I'm trying to address here is the inconsistency between
>> implementations. If we decide that we always want to invalidate
>> after FROM_DEVICE, I can do that as part of the series, but then
>> I have to change most of the other arm implementations.
>
> Why?
>
> First thing to say is that DMA to buffers where the cache lines are
> shared with data the CPU may be accessing need to be outlawed - they
> are a recipe for data corruption - always have been. Sadly, some folk
> don't see it that way because of a past "x86 just works and we demand
> that all architectures behave like x86!" attitude. The SCSI sense
> buffer has historically been a big culprit for that.
I think that part is pretty much agreed by everyone; the difference
between architectures is to what extent they try to work around
drivers that get it wrong.
> For WT, FROM_DEVICE, invalidating after DMA is the right thing to do,
> because we want to ensure that the DMA'd data is properly readable upon
> completion of the DMA. If overlapping cache lines have been touched
> while DMA is progressing, and we invalidate before DMA, then the cache
> will contain stale data that will remain in the cache after DMA has
> completed. Invalidating a WT cache does not destroy any data, so is
> safe to do. So the safest approach is to invalidate after DMA has
> completed in this instance.
> For WB, FROM_DEVICE, we have the problem of dirty cache lines which
> we have to get rid of. For the overlapping cache lines, we have to
> clean those before DMA begins to ensure that data written to the
> non-DMA-buffer part is preserved. All other cache lines need to be
> invalidated before DMA begins to ensure that writebacks do not
> corrupt data from the device. Hence why it's different.
I don't see how WB and WT caches being different implies that we
should give extra guarantees to (broken) drivers that WT caches on
other architectures do not give. Always doing it first in the absence of
prefetching avoids a special case in the generic implementation
and makes the driver interface on Arm/sparc32/xtensa WT caches
no different from what everything else provides.
The writeback before DMA_FROM_DEVICE is another issue that we
have to address at some point, as there are clearly incompatible
expectations here. It makes no sense that a device driver can
rely on the entire buffer being written back on a 64-bit arm kernel
but not on a 32-bit kernel.
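To spell out the stale-data hazard with invalidate-only mappings
(a schematic sketch of the scenario, not code from the thread):

	buf = kmalloc(SZ_4K, GFP_KERNEL); /* dirty lines hold kernel data */
	dma = dma_map_single(dev, buf, SZ_4K, DMA_FROM_DEVICE);
	/* invalidate-only map: the dirty lines are silently discarded */
	/* ... the device fails and never writes to the buffer ... */
	dma_unmap_single(dev, dma, SZ_4K, DMA_FROM_DEVICE);
	/* the CPU now reads whatever happened to be in RAM, which may
	 * be stale data from an earlier user of that memory
	 */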
> And hence why the ARM implementation is based around buffer ownership.
> And hence why they're called dma_map_area()/dma_unmap_area() rather
> than the cache operations themselves. This is an intentional change,
> one that was done when ARMv6 came along.
The bit that has changed in the meantime though is that the buffer
ownership interface has moved up in the stack and is now handled
mostly in the common kernel/dma/*.c that multiplexes between the
direct/iommu/swiotlb dma_map_ops, except for the bit about
noncoherent devices. Right now, we have 37 implementations that
are mostly identical, and all the differences are either bugs
or disagreements about the API guarantees but not related to
architecture specific requirements.
>> OTOH, most machines that are actually in use today (armv6+,
>> powerpc, later mips, microblaze, riscv, nios2) also have to
>> deal with speculative accesses, so they end up having to
>> invalidate or flush both before and after a DMA_FROM_DEVICE
>> and DMA_BIDIRECTIONAL.
>
> Again, these are implementation details of the cache, and this is
> precisely why having the map/unmap interface is so much better than
> having generic code explicitly call "clean" and "invalidate"
> interfaces into arch code.
>
> If we treat everything as a speculative cache, then we're doing
> needless extra work for those caches that aren't speculative. So,
> ARM would have to step through every cache line for every DMA
> buffer at 32-byte intervals performing cache maintenance whether
> the cache is speculative or not. That is expensive, and hurts
> performance.
Does that mean that you agree with patch 15 then after all?
If you think we don't need an invalidation after DMA_FROM_DEVICE
on non-speculating CPUs, it should be fine to make the WT case
consistent with the rest.
Arnd
On Fri, Mar 31, 2023, at 11:10, Linus Walleij wrote:
> On Mon, Mar 27, 2023 at 2:16 PM Arnd Bergmann <[email protected]> wrote:
>
>> From: Arnd Bergmann <[email protected]>
>>
>> The arm specific iommu code in dma-mapping.c uses the page+offset based
>> __dma_page_cpu_to_dev()/__dma_page_dev_to_cpu() helpers in place of the
>> phys_addr_t based arch_sync_dma_for_device()/arch_sync_dma_for_cpu()
>> wrappers around the.
>
> Broken sentence?
I've changed s/the/them/ now, at least I think that's what I meant to
write in the first place.
>> In order to be able to move the latter part set of functions into
>> common code, change the iommu implementation to use them directly
>> and remove the internal ones as a separate interface.
>>
>> As page+offset and phys_address are equivalent, but are used in
>> different parts of the code here, this allows removing some of
>> the conversion but adds them elsewhere.
>>
>> Signed-off-by: Arnd Bergmann <[email protected]>
>
> Looks good to me, took me some time to verify and understand
> the open-coded version of PFN_UP() and this refactoring alone
> makes the patch highly valuable.
> Reviewed-by: Linus Walleij <[email protected]>
Thanks!
Arnd
On Tue, Mar 28, 2023, at 00:25, Christoph Hellwig wrote:
>> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
>> {
>> + dma_cache_wback(paddr, size);
>> +}
>>
>> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
>> +{
>> + dma_cache_inv(paddr, size);
>> }
>
>> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
>> {
>> + dma_cache_wback_inv(paddr, size);
>> +}
>
> These are the only calls to the three functions for each of the
> involved architectures. So I'd rather rename the low-level symbols
> (and drop the pointless exports for two of them) than add
> these wrappers.
>
> The same is probably true for many other architectures.
Ok, done that now.
>> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
>> +{
>> +	return false;
>> +}
>>
>> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
>> +{
>> +	return true;
>> +}
>
> Is there a way to cut down on this boilerplate code by just having
> sane default, and Kconfig options to override them if they are not
> runtime decisions?
I've changed arch_sync_dma_clean_before_fromdevice() to a
Kconfig symbol now, as this is never a runtime decision.
For arch_sync_dma_cpu_needs_post_dma_flush(), I have this
version now in common code, which lets mips and arm have
their own logic and has the same effect elsewhere:
+#ifndef arch_sync_dma_cpu_needs_post_dma_flush
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+ return IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU);
+}
+#endif
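For reference, this is roughly what the Kconfig side looks like; the
symbol name below is illustrative, not necessarily what the next
version will use:

config ARCH_DMA_CLEAN_BEFORE_FROMDEVICE
	bool	# select'ed by architectures that clean instead of invalidating

static inline bool arch_sync_dma_clean_before_fromdevice(void)
{
	return IS_ENABLED(CONFIG_ARCH_DMA_CLEAN_BEFORE_FROMDEVICE);
}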
>> +#include <linux/dma-sync.h>
>
> I can't really say I like the #include version here despite your
> rationale in the commit log. I can probably live with it if you
> think it is absolutely worth it, but I'm really not in favor of it.
>
>> +config ARCH_DMA_MARK_DCACHE_CLEAN
>> + def_bool y
>
> What do we need this symbol for? Unless I'm missing something it is
> always enable for arm32, and only used in arm32 code.
This was left over from an earlier draft and accidentally duplicates
the thing that I have in the Arm version for the existing
ARCH_HAS_DMA_MARK_CLEAN. I dropped this one and the
generic copy of the arch_dma_mark_dcache_clean() function
now, but still need to revisit the arm version, as it sounds
like it has slightly different semantics from the ia64 version.
Arnd
On Mon, Mar 27, 2023, at 17:01, Russell King (Oracle) wrote:
> On Mon, Mar 27, 2023 at 02:13:16PM +0200, Arnd Bergmann wrote:
>> From: Arnd Bergmann <[email protected]>
>>
>> The arm version of the arch_sync_dma_for_cpu() function annotates pages as
>> PG_dcache_clean after a DMA, but no other architecture does this here.
>
> ... because this is an arm32 specific feature. Generically, it's
> PG_arch_1, which is a page flag free for architecture use. On arm32
> we decided to use this to mark whether we can skip dcache writebacks
> when establishing a PTE - and thus it was decided to call it
> PG_dcache_clean to reflect how arm32 decided to use that bit.
>
> This isn't just a DMA thing, there are other places that we update
> the bit, such as flush_dcache_page() and copy_user_highpage().
>
> So thinking that the arm32 PG_dcache_clean is something for DMA is
> actually wrong.
>
> Other architectures are free to do their own other optimisations
> using that bit, and their implementations may be DMA-centric.
The flag is used the same way on most architectures, though some
use the opposite polarity and call it PG_dcache_dirty. The only
other architecture that uses it for DMA is ia64, with the difference
being that this also marks the page as clean even for coherent
DMA, not just when doing a flush as part of noncoherent DMA.
Based on Robin's reply it sounds like this is not a valid assumption
on Arm, if a coherent DMA can target a dirty dcache line without
cleaning it.
Arnd
On Mon, Mar 27, 2023, at 14:48, Robin Murphy wrote:
> On 2023-03-27 13:13, Arnd Bergmann wrote:
>>
>> [ HELP NEEDED: can anyone confirm that it is a correct assumption
>> on arm that a cache-coherent device writing to a page always results
>> in it being in a PG_dcache_clean state like on ia64, or can a device
>> write directly into the dcache?]
>
> In AMBA at least, if a snooping write hits in a cache then the data is
> most likely going to get routed directly into that cache. If it has
> write-back write-allocate attributes it could also land in any cache
> along its normal path to RAM; it wouldn't have to go all the way.
>
> Hence all the fun we have where treating a coherent device as
> non-coherent can still be almost as broken as the other way round :)
Ok, thanks for the information. I'm still not sure whether this can
result in the situation where PG_dcache_clean is wrong though.
Specifically, the question is whether a DMA to a coherent buffer
can end up in a dirty L1 dcache of one core and require to write
back the dcache before invalidating the icache for that page.
On ia64, this is not the case, the optimization here is to
only flush the icache after a coherent DMA into an executable
user page, while Arm only does this for noncoherent DMA but not
coherent DMA.
From your explanation it sounds like this might happen,
even though that would mean that "coherent" DMA is slightly
less coherent than it is elsewhere.
To be on the safe side, I'd have to pass a flag into
arch_dma_mark_clean() about coherency, to let the arm
implementation still require the extra dcache flush
for coherent DMA, while ia64 can ignore that flag.
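Something along these lines, as a sketch only (the extra argument
does not exist today and the names are illustrative):

/* called from the dma-mapping code after a transfer from the device */
static inline void arch_dma_mark_clean(phys_addr_t paddr, size_t size,
				       bool dev_coherent)
{
	/* arm: even coherent DMA can leave dirty lines above the PoU */
	if (IS_ENABLED(CONFIG_ARM) && dev_coherent)
		return;

	/* ia64 style: mark each fully written page as clean */
	for (; size >= PAGE_SIZE; paddr += PAGE_SIZE, size -= PAGE_SIZE)
		set_bit(PG_arch_1, &pfn_to_page(PHYS_PFN(paddr))->flags);
}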
Arnd
On 31/03/2023 3:00 pm, Arnd Bergmann wrote:
> On Mon, Mar 27, 2023, at 14:48, Robin Murphy wrote:
>> On 2023-03-27 13:13, Arnd Bergmann wrote:
>>>
>>> [ HELP NEEDED: can anyone confirm that it is a correct assumption
>>> on arm that a cache-coherent device writing to a page always results
>>> in it being in a PG_dcache_clean state like on ia64, or can a device
>>> write directly into the dcache?]
>>
>> In AMBA at least, if a snooping write hits in a cache then the data is
>> most likely going to get routed directly into that cache. If it has
>> write-back write-allocate attributes it could also land in any cache
>> along its normal path to RAM; it wouldn't have to go all the way.
>>
>> Hence all the fun we have where treating a coherent device as
>> non-coherent can still be almost as broken as the other way round :)
>
> Ok, thanks for the information. I'm still not sure whether this can
> result in the situation where PG_dcache_clean is wrong though.
>
> Specifically, the question is whether a DMA to a coherent buffer
> can end up in a dirty L1 dcache of one core and require to write
> back the dcache before invalidating the icache for that page.
>
> On ia64, this is not the case, the optimization here is to
> only flush the icache after a coherent DMA into an executable
> user page, while Arm only does this for noncoherent DMA but not
> coherent DMA.
>
> From your explanation it sounds like this might happen,
> even though that would mean that "coherent" DMA is slightly
> less coherent than it is elsewhere.
>
> To be on the safe side, I'd have to pass a flag into
> arch_dma_mark_clean() about coherency, to let the arm
> implementation still require the extra dcache flush
> for coherent DMA, while ia64 can ignore that flag.
Coherent DMA on Arm is assumed to be inner-shareable, so a coherent DMA
write should be pretty much equivalent to a coherent write by another
CPU (or indeed the local CPU itself) - nothing says that it *couldn't*
dirty a line in a data cache above the level of unification, so in
general the assumption must be that, yes, if coherent DMA is writing
data intended to be executable, then it's going to want a Dcache clean
to PoU and an Icache invalidate to PoU before trying to execute it. By
comparison, a non-coherent DMA transfer will inherently have to
invalidate the Dcache all the way to PoC in its dma_unmap, thus cannot
leave dirty data above the PoU, so only the Icache maintenance is
required in the executable case.
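In other words, the rule for making DMA'd code executable comes down
to roughly this (pseudocode, borrowing arm64-style helper names
loosely):

	if (dev_coherent) {
		dcache_clean_pou(start, end);	/* dirty lines may sit above the PoU */
		icache_inval_pou(start, end);
	} else {
		/* dma_unmap already invalidated the Dcache to the PoC,
		 * so no dirty lines remain and only the Icache is left */
		icache_inval_pou(start, end);
	}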
(FWIW I believe the Armv8 IDC/DIC features can safely be considered
irrelevant to 32-bit kernels)
I don't know a great deal about IA-64, but it appears to be using its
PG_arch_1 flag in a subtly different manner to Arm, namely to optimise
out the *Icache* maintenance. So if anything, it seems IA-64 is the
weirdo here (who'd have guessed?) where DMA manages to be *more*
coherent than the CPUs themselves :)
This is all now making me think we need some careful consideration of
whether the benefits of consolidating code outweigh the confusion of
conflating multiple different meanings of "clean" together...
Thanks,
Robin.
On Fri, Mar 31, 2023 at 04:06:37PM +0200, Arnd Bergmann wrote:
> On Mon, Mar 27, 2023, at 17:01, Russell King (Oracle) wrote:
> > On Mon, Mar 27, 2023 at 02:13:16PM +0200, Arnd Bergmann wrote:
> >> From: Arnd Bergmann <[email protected]>
> >>
> >> The arm version of the arch_sync_dma_for_cpu() function annotates pages as
> >> PG_dcache_clean after a DMA, but no other architecture does this here.
> >
> > ... because this is an arm32 specific feature. Generically, it's
> > PG_arch_1, which is a page flag free for architecture use. On arm32
> > we decided to use this to mark whether we can skip dcache writebacks
> > when establishing a PTE - and thus it was decided to call it
> > PG_dcache_clean to reflect how arm32 decided to use that bit.
> >
> > This isn't just a DMA thing, there are other places that we update
> > the bit, such as flush_dcache_page() and copy_user_highpage().
> >
> > So thinking that the arm32 PG_dcache_clean is something for DMA is
> > actually wrong.
> >
> > Other architectures are free to do their own other optimisations
> > using that bit, and their implementations may be DMA-centric.
>
> The flag is used the same way on most architectures, though some
> use the opposite polarity and call it PG_dcache_dirty. The only
> other architecture that uses it for DMA is ia64, with the difference
> being that this also marks the page as clean even for coherent
> DMA, not just when doing a flush as part of noncoherent DMA.
>
> Based on Robin's reply it sounds like this is not a valid assumption
> on Arm, if a coherent DMA can target a dirty dcache line without
> cleaning it.
The other thing to note here is that PG_dcache_clean doesn't have
much meaning on modern CPUs with PIPT caches. For these,
cache_is_vipt_nonaliasing() will be true, and
cache_ops_need_broadcast() will be false.
Firstly, if we're using coherent DMA, then PG_dcache_clean is
intentionally not touched, because the data cache isn't cleaned
in any way by DMA operations.
flush_dcache_page() turns into a no-op apart from clearing
PG_dcache_clean if it was set.
__sync_icache_dcache() will do nothing for non-executable pages,
but will write-back a page that isn't marked PG_dcache_clean to
ensure that it is visible to the instruction stream. This is only
used to ensure that the instructions are visible to a newly
established executable mapping when e.g. the page has been DMA'd
in. The default state of PG_dcache_clean is zero on any new
allocation, so this has the effect of causing any executable page
to be flushed such that the instruction stream can see the
instructions, but only for the first establishment of the mapping.
That means that e.g. libc text pages don't keep getting flushed on
the start of every program.
update_mmu_cache() isn't compiled, so its use of PG_dcache_clean
is irrelevant.
v6_copy_user_highpage_aliasing() won't be called because we're not
using an aliasing cache.
So, for modern ARM systems with DMA-coherent PG_dcache_clean only
serves for the __sync_icache_dcache() optimisation.
ARM's use of this remains valid in this circumstance.
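Condensed, the __sync_icache_dcache() logic described above is
roughly this (a paraphrase, not the exact source):

void __sync_icache_dcache(pte_t pteval)
{
	struct page *page = pte_page(pteval);

	if (!pte_exec(pteval))
		return;		/* only executable mappings matter here */

	/* write back once, on the first executable mapping of the page */
	if (!test_and_set_bit(PG_dcache_clean, &page->flags))
		__flush_dcache_page(page_mapping(page), page);
	__flush_icache_all();
}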
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
On Mon, Mar 27, 2023 at 02:12:56PM +0200, Arnd Bergmann wrote:
> Another difference that I do not address here is what cache invalidation
> does for partical cache lines. On arm32, arm64 and powerpc, a partial
> cache line always gets written back before invalidation in order to
> ensure that data before or after the buffer is not discarded. On all
> other architectures, the assumption is cache lines are never shared
> between DMA buffer and data that is accessed by the CPU.
I don't think sharing the DMA buffer with other data is safe even with
this clean+invalidate on the unaligned cache lines. Mapping the DMA
buffer as FROM_DEVICE or BIDIRECTIONAL can cause the shared cache line
to be evicted and overwrite the device-written data. This sharing only
works if the CPU guarantees not to dirty the corresponding cache line.
I'm fine with removing this partial cache line hack from arm64 as it's
not safe anyway. We'll see if any driver stops working. If there's some
benign sharing (I wouldn't trust it), the cache cleaning prior to
mapping and invalidate on unmap would not lose any data.
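As an illustration of the kind of layout this rules out (hypothetical
structure, assuming 64-byte cache lines):

struct rx_ring {
	u32  head;	/* written by the CPU                        */
	char buf[60];	/* mapped DMA_FROM_DEVICE, same cache line! */
};

If the CPU dirties 'head' while the device writes into 'buf', an
eviction of the shared line puts stale CPU data on top of the
device-written bytes.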
--
Catalin
On Mon, Mar 27, 2023 at 02:13:14PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann <[email protected]>
>
> The cache management operations for noncoherent DMA on ARMv6 work
> in two different ways:
>
> * When CONFIG_DMA_CACHE_RWFO is set, speculative prefetches on in-flight
> DMA buffers lead to data corruption when the prefetched data is written
> back on top of data from the device.
>
> * When CONFIG_DMA_CACHE_RWFO is disabled, a cache flush on one CPU
> is not seen by the other core(s), leading to inconsistent contents
> across the system.
>
> As a consequence, neither configuration is actually safe to use in a
> general-purpose kernel that is used on both MPCore systems and ARM1176
> with prefetching enabled.
As the author of this terrible hack (created under duress ;))
Acked-by: Catalin Marinas <[email protected]>
IIRC, RWFO is working in combination with the cache operations. Because
the cache maintenance broadcast did not happen, we forced the cache
lines to migrate to a CPU via a write (for ownership) and doing the
cache maintenance on that CPU (that was the FROM_DEVICE case). For the
TO_DEVICE case, reading on a CPU would cause dirty lines on another CPU
to be evicted (or migrated as dirty to the current CPU IIRC) then the
cache maintenance to clean them to PoC on the local CPU.
But there's always a small window between read/write for ownership and
the actual cache maintenance which can cause a cache line to migrate to
other CPUs if they do speculative prefetches. At the time ARM11MPCore
was deemed safe-ish but I haven't followed what later implementations
actually did (luckily we fixed the architecture in ARMv7).
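Roughly, the FROM_DEVICE side of the trick looked like this
(illustrative C, the per-line helper is made up; the real code is
assembly in cache-v6.S):

static void v6_dma_inv_range_rwfo(unsigned long start, unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += L1_CACHE_BYTES) {
		/* write for ownership: migrate the line to this CPU */
		WRITE_ONCE(*(volatile char *)addr,
			   READ_ONCE(*(volatile char *)addr));
		/* the race window is right here */
		dmac_inv_line(addr);	/* hypothetical per-line maintenance */
	}
}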
--
Catalin
On Fri, Mar 31, 2023, at 17:12, Robin Murphy wrote:
> On 31/03/2023 3:00 pm, Arnd Bergmann wrote:
>> On Mon, Mar 27, 2023, at 14:48, Robin Murphy wrote:
>>
>> To be on the safe side, I'd have to pass a flag into
>> arch_dma_mark_clean() about coherency, to let the arm
>> implementation still require the extra dcache flush
>> for coherent DMA, while ia64 can ignore that flag.
>
> Coherent DMA on Arm is assumed to be inner-shareable, so a coherent DMA
> write should be pretty much equivalent to a coherent write by another
> CPU (or indeed the local CPU itself) - nothing says that it *couldn't*
> dirty a line in a data cache above the level of unification, so in
> general the assumption must be that, yes, if coherent DMA is writing
> data intended to be executable, then it's going to want a Dcache clean
> to PoU and an Icache invalidate to PoU before trying to execute it. By
> comparison, a non-coherent DMA transfer will inherently have to
> invalidate the Dcache all the way to PoC in its dma_unmap, thus cannot
> leave dirty data above the PoU, so only the Icache maintenance is
> required in the executable case.
Ok, makes sense. I've already started reworking my patch for it.
> (FWIW I believe the Armv8 IDC/DIC features can safely be considered
> irrelevant to 32-bit kernels)
>
> I don't know a great deal about IA-64, but it appears to be using its
> PG_arch_1 flag in a subtly different manner to Arm, namely to optimise
> out the *Icache* maintenance. So if anything, it seems IA-64 is the
> weirdo here (who'd have guessed?) where DMA manages to be *more*
> coherent than the CPUs themselves :)
I checked this in the ia64 manual, and as far as I can tell, it originally
only had one cacheflush instruction that flushes the dcache and invalidates
the icache at the same time. So flush_icache_range() actually does
both, and flush_dcache_page() instead just marks the page as dirty to
ensure flush_icache_range() does not get skipped after writing a
page from the kernel.
On later Itaniums, there is apparently a separate icache flush
instruction that gets used in flush_icache_range(), but that
still works for the DMA case that is allowed to skip the flush.
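For reference, the ia64 side of flush_dcache_page() is essentially
just this, if I read the header right:

/* defer the real work to the eventual flush_icache_range() */
#define flush_dcache_page(page)	clear_bit(PG_arch_1, &(page)->flags)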
> This is all now making me think we need some careful consideration of
> whether the benefits of consolidating code outweigh the confusion of
> conflating multiple different meanings of "clean" together...
The difference in usage of PG_dcache_clean/PG_dcache_dirty/PG_arch_1
across architectures is certainly big enough that we can't just
define a common arch_dma_mark_clean() across architectures, but
I think the idea of having a common entry point for
arch_dma_mark_clean() to be called from the dma-mapping code
to do something architecture specific after a DMA is clean still
makes sense.
Arnd
On Fri, Mar 31, 2023, at 18:53, Catalin Marinas wrote:
> On Mon, Mar 27, 2023 at 02:12:56PM +0200, Arnd Bergmann wrote:
>> Another difference that I do not address here is what cache invalidation
>> does for partical cache lines. On arm32, arm64 and powerpc, a partial
>> cache line always gets written back before invalidation in order to
>> ensure that data before or after the buffer is not discarded. On all
>> other architectures, the assumption is cache lines are never shared
>> between DMA buffer and data that is accessed by the CPU.
>
> I don't think sharing the DMA buffer with other data is safe even with
> this clean+invalidate on the unaligned cache. Mapping the DMA buffer as
> FROM_DEVICE or BIDIRECTIONAL can cause the shared cache line to be
> evicted and override the device written data. This sharing only works if
> the CPU guarantees not to dirty the corresponding cache line.
>
> I'm fine with removing this partial cache line hack from arm64 as it's
> not safe anyway. We'll see if any driver stops working. If there's some
> benign sharing (I wouldn't trust it), the cache cleaning prior to
> mapping and invalidate on unmap would not lose any data.
Ok, I'll add a patch to remove that bit from dcache_inval_poc
then. Do you know if any of the other callers of this function
rely on the writeback behavior, or is it safe to remove it for
all of them?
Note that before c50f11c6196f ("arm64: mm: Don't invalidate
FROM_DEVICE buffers at start of DMA transfer"), it made some
sense to write back partial cache lines before a DMA_FROM_DEVICE,
in order to allow sharing read-only data in them the same way as
on arm32 and powerpc. Doing the writeback in the sync_for_cpu
bit is of course always pointless.
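For context, the edge handling in question does roughly this
(paraphrased from the arm64 assembly, helper names made up):

/* dcache_inval_poc(start, end), current behavior */
if (start & (dcache_line_size - 1))
	clean_and_invalidate_line(start);	/* keep data before the buffer */
if (end & (dcache_line_size - 1))
	clean_and_invalidate_line(end);		/* keep data after the buffer */
/* all fully covered lines are invalidated without writeback */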
Arnd
CC Shahab
On 3/27/23 17:43, Arnd Bergmann wrote:
> From: Arnd Bergmann<[email protected]>
>
> Some architectures that need to invalidate buffers after bidirectional
> DMA because of speculative prefetching only do a simpler writeback
> before that DMA, while architectures that don't need to do the second
> invalidate tend to have a combined writeback+invalidate before the
> DMA.
>
> arc is one of the architectures that does both, which seems unnecessary.
>
> Change it to behave like arm/arm64/xtensa instead, and use just a
> writeback before the DMA when we do the invalidate afterwards.
>
> Signed-off-by: Arnd Bergmann<[email protected]>
Reviewed-by: Vineet Gupta <[email protected]>
Shahab can you give this a spin on hsdk - run glibc testsuite over ssh
and make sure nothing strange happens.
Thx,
-Vineet
On 4/2/23 08:52, Vineet Gupta wrote:
> CC Shahab
>
> On 3/27/23 17:43, Arnd Bergmann wrote:
>> From: Arnd Bergmann<[email protected]>
>>
>> Some architectures that need to invalidate buffers after bidirectional
>> DMA because of speculative prefetching only do a simpler writeback
>> before that DMA, while architectures that don't need to do the second
>> invalidate tend to have a combined writeback+invalidate before the
>> DMA.
>>
>> arc is one of the architectures that does both, which seems unnecessary.
>>
>> Change it to behave like arm/arm64/xtensa instead, and use just a
>> writeback before the DMA when we do the invalidate afterwards.
>>
>> Signed-off-by: Arnd Bergmann<[email protected]>
>
> Reviewed-by: Vineet Gupta <[email protected]>
>
> Shahab can you give this a spin on hsdk - run glibc testsuite over ssh and make sure nothing strange happens.
>
> Thx,
> -Vineet
On it.
--
Shahab
On 4/2/23 08:52, Vineet Gupta wrote:
> CC Shahab
>
> On 3/27/23 17:43, Arnd Bergmann wrote:
>> From: Arnd Bergmann<[email protected]>
>>
>> Some architectures that need to invalidate buffers after bidirectional
>> DMA because of speculative prefetching only do a simpler writeback
>> before that DMA, while architectures that don't need to do the second
>> invalidate tend to have a combined writeback+invalidate before the
>> DMA.
>>
>> arc is one of the architectures that does both, which seems unnecessary.
>>
>> Change it to behave like arm/arm64/xtensa instead, and use just a
>> writeback before the DMA when we do the invalidate afterwards.
>>
>> Signed-off-by: Arnd Bergmann<[email protected]>
>
> Reviewed-by: Vineet Gupta <[email protected]>
>
> Shahab can you give this a spin on hsdk - run glibc testsuite over ssh
> and make sure nothing strange happens.
>
> Thx,
> -Vineet
Tested-by: Shahab Vahedi <[email protected]>
No regressions were observed for the ARC target, comparing runs from
before and after applying these 21 patches. The test environment and
a summary follow.
board: ARC HSDK
base: repo: linux-next
tag: next-20230403
commit: 31bd35b66249 Add linux-next specific files for 20230403
hotfix: net: stmmac: check fwnode for phy device before scanning for phy [1]
glibc: 2.37
Summary of test results:
20 FAIL
4227 PASS
38 UNSUPPORTED
16 XFAIL
2 XPASS
[1]
https://lore.kernel.org/lkml/[email protected]/#r
--
Shahab
Hi all,
FYI, this patch breaks booting on the RZ/G2L SMARC EVK board; Arnd will send a v2 fixing this issue.
[ 3.384408] Unable to handle kernel paging request at virtual address 000000004afb0080
[ 3.392755] Mem abort info:
[ 3.395883] ESR = 0x0000000096000144
[ 3.399957] EC = 0x25: DABT (current EL), IL = 32 bits
[ 3.405674] SET = 0, FnV = 0
[ 3.408978] EA = 0, S1PTW = 0
[ 3.412442] FSC = 0x04: level 0 translation fault
[ 3.417825] Data abort info:
[ 3.420959] ISV = 0, ISS = 0x00000144
[ 3.425115] CM = 1, WnR = 1
[ 3.428521] [000000004afb0080] user address but active_mm is swapper
[ 3.435135] Internal error: Oops: 0000000096000144 [#1] PREEMPT SMP
[ 3.441501] Modules linked in:
[ 3.444644] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc6-next-20230412-g2936e9299572 #712
[ 3.453537] Hardware name: Renesas SMARC EVK based on r9a07g054l2 (DT)
[ 3.460130] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3.467184] pc : dcache_clean_poc+0x20/0x38
[ 3.471488] lr : arch_sync_dma_for_device+0x1c/0x2c
[ 3.476463] sp : ffff80000a70b970
[ 3.479834] x29: ffff80000a70b970 x28: 0000000000000000 x27: ffff00000aef7c10
[ 3.487118] x26: ffff00000afb0080 x25: ffff00000b710000 x24: ffff00000b710a40
[ 3.494397] x23: 0000000000002000 x22: 0000000000000000 x21: 0000000000000002
[ 3.501670] x20: ffff00000aef7c10 x19: 000000004afb0080 x18: 0000000000000000
[ 3.508943] x17: 0000000000000100 x16: fffffc0001efc008 x15: 0000000000000000
[ 3.516216] x14: 0000000000000100 x13: 0000000000000068 x12: ffff00007fc0aa50
[ 3.523488] x11: ffff00007fc0a9c0 x10: 0000000000000000 x9 : ffff00000aef7f08
[ 3.530761] x8 : 0000000000000000 x7 : fffffc00002bec00 x6 : 0000000000000000
[ 3.538028] x5 : 0000000000000000 x4 : 0000000000000002 x3 : 000000000000003f
[ 3.545297] x2 : 0000000000000040 x1 : 000000004afb2080 x0 : 000000004afb0080
[ 3.552569] Call trace:
[ 3.555074] dcache_clean_poc+0x20/0x38
[ 3.559014] dma_map_page_attrs+0x1b4/0x248
[ 3.563289] ravb_rx_ring_format_gbeth+0xd8/0x198
[ 3.568095] ravb_ring_format+0x5c/0x108
[ 3.572108] ravb_dmac_init_gbeth+0x30/0xe4
[ 3.576382] ravb_dmac_init+0x80/0x104
[ 3.580222] ravb_open+0x84/0x78c
[ 3.583626] __dev_open+0xec/0x1d8
[ 3.587138] __dev_change_flags+0x190/0x208
[ 3.591406] dev_change_flags+0x24/0x6c
[ 3.595324] ip_auto_config+0x248/0x10ac
[ 3.599345] do_one_initcall+0x6c/0x1b0
[ 3.603268] kernel_init_freeable+0x1c0/0x294
Cheers,
Biju
> -----Original Message-----
> From: linux-arm-kernel <[email protected]> On
> Behalf Of Arnd Bergmann
> Sent: Monday, March 27, 2023 1:13 PM
> To: [email protected]
> Subject: [PATCH 21/21] dma-mapping: replace custom code with generic
> implementation
>
> From: Arnd Bergmann <[email protected]>
>
> Now that all of these have consistent behavior, replace them with a single
> shared implementation of arch_sync_dma_for_device() and
> arch_sync_dma_for_cpu() and three parameters to pick how they should
> operate:
>
> - If the CPU has speculative prefetching, then the cache
> has to be invalidated after a transfer from the device.
> On the rarer CPUs without prefetching, this can be skipped,
> with all cache management happening before the transfer.
> This flag can be runtime detected, but is usually fixed
> per architecture.
>
> - Some architectures currently clean the caches before DMA
> from a device, while others invalidate it. There has not
> been a conclusion regarding whether we should change all
> architectures to use clean instead, so this adds an
> architecture specific flag that we can change later on.
>
> - On 32-bit Arm, the arch_sync_dma_for_cpu() function keeps
> track pages that are marked clean in the page cache, to
> avoid flushing them again. The implementation for this is
> generic enough to work on all architectures that use the
> PG_dcache_clean page flag, but a Kconfig symbol is used
> to only enable it on Arm to preserve the existing behavior.
>
> For the function naming, I picked 'wback' over 'clean', and 'wback_inv'
> over 'flush', to avoid any ambiguity of what the helper functions are
> supposed to do.
>
> Moving the global functions into a header file is usually a bad idea as it
> prevents the header from being included more than once, but it helps keep
> the behavior as close as possible to the previous state, including the
> possibility of inlining most of it into these functions where that was done
> before. This also helps keep the global namespace clean, by hiding the new
> arch_dma_cache{_wback,_inv,_wback_inv} from device drivers that might use
> them incorrectly.
>
> It would be possible to do this one architecture at a time, but as the
> change is the same everywhere, the combined patch helps explain it better
> once.
>
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> arch/arc/mm/dma.c | 66 +++++-------------
> arch/arm/Kconfig | 3 +
> arch/arm/mm/dma-mapping-nommu.c | 39 ++++++-----
> arch/arm/mm/dma-mapping.c | 64 +++++++-----------
> arch/arm64/mm/dma-mapping.c | 28 +++++---
> arch/csky/mm/dma-mapping.c | 44 ++++++------
> arch/hexagon/kernel/dma.c | 44 ++++++------
> arch/m68k/kernel/dma.c | 43 +++++++-----
> arch/microblaze/kernel/dma.c | 48 +++++++-------
> arch/mips/mm/dma-noncoherent.c | 60 +++++++----------
> arch/nios2/mm/dma-mapping.c | 57 +++++++---------
> arch/openrisc/kernel/dma.c | 63 +++++++++++-------
> arch/parisc/kernel/pci-dma.c | 46 ++++++-------
> arch/powerpc/mm/dma-noncoherent.c | 34 ++++++----
> arch/riscv/mm/dma-noncoherent.c | 51 +++++++-------
> arch/sh/kernel/dma-coherent.c | 43 +++++++-----
> arch/sparc/kernel/ioport.c | 38 ++++++++---
> arch/xtensa/kernel/pci-dma.c | 40 ++++++-----
> include/linux/dma-sync.h | 107 ++++++++++++++++++++++++++++++
> 19 files changed, 527 insertions(+), 391 deletions(-) create mode 100644
> include/linux/dma-sync.h
>
> diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c index
> ddb96786f765..61cd01646222 100644
> --- a/arch/arc/mm/dma.c
> +++ b/arch/arc/mm/dma.c
> @@ -30,63 +30,33 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
>  	dma_cache_wback_inv(page_to_phys(page), size);
>  }
>
> -/*
> - * Cache operations depending on function and direction argument, inspired by
> - * https://lore.kernel.org/lkml/20180518175004.GF17671@n2100.armlinux.org.uk
> - * "dma_sync_*_for_cpu and direction=TO_DEVICE (was Re: [PATCH 02/20]
> - * dma-mapping: provide a generic dma-noncoherent implementation)"
> - *
> - * | map == for_device | unmap == for_cpu
> - * |--------------------------------------------------------------------
> - * TO_DEV   | writeback        writeback    | none          none
> - * FROM_DEV | invalidate       invalidate   | invalidate*   invalidate*
> - * BIDIR    | writeback        writeback    | invalidate    invalidate
> - *
> - * [*] needed for CPU speculative prefetches
> - *
> - * NOTE: we don't check the validity of direction argument as it is done in
> - * upper layer functions (in include/linux/dma-mapping.h)
> - */
> -
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - dma_cache_wback(paddr, size);
> - break;
> -
> - case DMA_FROM_DEVICE:
> - dma_cache_inv(paddr, size);
> - break;
> -
> - case DMA_BIDIRECTIONAL:
> - dma_cache_wback(paddr, size);
> - break;
> + dma_cache_wback(paddr, size);
> +}
>
> - default:
> - break;
> - }
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + dma_cache_inv(paddr, size);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - break;
> + dma_cache_wback_inv(paddr, size);
> +}
>
> - /* FROM_DEVICE invalidate needed if speculative CPU prefetch only */
> - case DMA_FROM_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - dma_cache_inv(paddr, size);
> - break;
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
>
> - default:
> - break;
> - }
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> }
>
> +#include <linux/dma-sync.h>
> +
> /*
> * Plug in direct dma map ops.
> */
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index
> 125d58c54ab1..0de84e861027 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -212,6 +212,9 @@ config LOCKDEP_SUPPORT
> bool
> default y
>
> +config ARCH_DMA_MARK_DCACHE_CLEAN
> + def_bool y
> +
> config ARCH_HAS_ILOG2_U32
> bool
>
> diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-
> nommu.c index 12b5c6ae93fc..0817274aed15 100644
> --- a/arch/arm/mm/dma-mapping-nommu.c
> +++ b/arch/arm/mm/dma-mapping-nommu.c
> @@ -13,27 +13,36 @@
>
> #include "dma.h"
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - if (dir == DMA_FROM_DEVICE) {
> - dmac_inv_range(__va(paddr), __va(paddr + size));
> - outer_inv_range(paddr, paddr + size);
> - } else {
> - dmac_clean_range(__va(paddr), __va(paddr + size));
> - outer_clean_range(paddr, paddr + size);
> - }
> + dmac_clean_range(__va(paddr), __va(paddr + size));
> + outer_clean_range(paddr, paddr + size);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> {
> - if (dir != DMA_TO_DEVICE) {
> - outer_inv_range(paddr, paddr + size);
> - dmac_inv_range(__va(paddr), __va(paddr));
> - }
> + dmac_inv_range(__va(paddr), __va(paddr + size));
> + outer_inv_range(paddr, paddr + size);
> }
>
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> +	dmac_flush_range(__va(paddr), __va(paddr + size));
> +	outer_flush_range(paddr, paddr + size);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> +}
> +
> +#include <linux/dma-sync.h>
> +
> void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> const struct iommu_ops *iommu, bool coherent) { diff --
> git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index
> b703cb83d27e..aa6ee820a0ab 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -687,6 +687,30 @@ void arch_dma_mark_clean(phys_addr_t paddr, size_t
> size)
> }
> }
>
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> +{
> + dma_cache_maint(paddr, size, dmac_clean_range);
> +	outer_clean_range(paddr, paddr + size);
> +}
> +
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> +	dma_cache_maint(paddr, size, dmac_inv_range);
> +	outer_inv_range(paddr, paddr + size);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> +	dma_cache_maint(paddr, size, dmac_flush_range);
> +	outer_flush_range(paddr, paddr + size);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> static bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> {
> if (IS_ENABLED(CONFIG_CPU_V6) ||
> @@ -699,45 +723,7 @@ static bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> return false;
> }
>
> -/*
> - * Make an area consistent for devices.
> - * Note: Drivers should NOT use this function directly.
> - * Use the driver DMA support - see dma-mapping.h (dma_sync_*)
> - */
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> -{
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - dma_cache_maint(paddr, size, dmac_clean_range);
> - outer_clean_range(paddr, paddr + size);
> - break;
> - case DMA_FROM_DEVICE:
> - dma_cache_maint(paddr, size, dmac_inv_range);
> - outer_inv_range(paddr, paddr + size);
> - break;
> - case DMA_BIDIRECTIONAL:
> - if (arch_sync_dma_cpu_needs_post_dma_flush()) {
> - dma_cache_maint(paddr, size, dmac_clean_range);
> - outer_clean_range(paddr, paddr + size);
> - } else {
> - dma_cache_maint(paddr, size, dmac_flush_range);
> - outer_flush_range(paddr, paddr + size);
> - }
> - break;
> - default:
> - break;
> - }
> -}
> -
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> -{
> - if (dir != DMA_TO_DEVICE && arch_sync_dma_cpu_needs_post_dma_flush())
> {
> - outer_inv_range(paddr, paddr + size);
> - dma_cache_maint(paddr, size, dmac_inv_range);
> - }
> -}
> +#include <linux/dma-sync.h>
>
> #ifdef CONFIG_ARM_DMA_USE_IOMMU
>
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c index
> 5240f6acad64..bae741aa65e9 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -13,25 +13,33 @@
> #include <asm/cacheflush.h>
> #include <asm/xen/xen-ops.h>
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - unsigned long start = (unsigned long)phys_to_virt(paddr);
> +	dcache_clean_poc(paddr, paddr + size);
>  }
>
> - dcache_clean_poc(start, start + size);
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
>  {
> + dcache_inval_poc(paddr, paddr + size);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> {
> - unsigned long start = (unsigned long)phys_to_virt(paddr);
> +	dcache_clean_inval_poc(paddr, paddr + size);
>  }
>
> - if (dir == DMA_TO_DEVICE)
> - return;
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return true;
> +}
>
> - dcache_inval_poc(start, start + size);
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> }
>
> +#include <linux/dma-sync.h>
> +
> void arch_dma_prep_coherent(struct page *page, size_t size) {
> unsigned long start = (unsigned long)page_address(page); diff --git
> a/arch/csky/mm/dma-mapping.c b/arch/csky/mm/dma-mapping.c index
> c90f912e2822..9402e101b363 100644
> --- a/arch/csky/mm/dma-mapping.c
> +++ b/arch/csky/mm/dma-mapping.c
> @@ -55,31 +55,29 @@ void arch_dma_prep_coherent(struct page *page, size_t
> size)
> cache_op(page_to_phys(page), size, dma_wbinv_set_zero_range); }
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - case DMA_FROM_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - cache_op(paddr, size, dma_wb_range);
> - break;
> - default:
> - BUG();
> - }
> + cache_op(paddr, size, dma_wb_range);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - return;
> - case DMA_FROM_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - cache_op(paddr, size, dma_inv_range);
> - break;
> - default:
> - BUG();
> - }
> + cache_op(paddr, size, dma_inv_range);
> }
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> +	cache_op(paddr, size, dma_wbinv_range);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return true;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> +}
> +
> +#include <linux/dma-sync.h>
> diff --git a/arch/hexagon/kernel/dma.c b/arch/hexagon/kernel/dma.c index
> 882680e81a30..e6538128a75b 100644
> --- a/arch/hexagon/kernel/dma.c
> +++ b/arch/hexagon/kernel/dma.c
> @@ -9,29 +9,33 @@
> #include <linux/memblock.h>
> #include <asm/page.h>
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - void *addr = phys_to_virt(paddr);
> -
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - hexagon_clean_dcache_range((unsigned long) addr,
> - (unsigned long) addr + size);
> - break;
> - case DMA_FROM_DEVICE:
> - hexagon_inv_dcache_range((unsigned long) addr,
> - (unsigned long) addr + size);
> - break;
> - case DMA_BIDIRECTIONAL:
> - flush_dcache_range((unsigned long) addr,
> - (unsigned long) addr + size);
> - break;
> - default:
> - BUG();
> - }
> + hexagon_clean_dcache_range(paddr, paddr + size);
> }
>
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> +	hexagon_inv_dcache_range(paddr, paddr + size);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> +	hexagon_flush_dcache_range(paddr, paddr + size);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return false;
> +}
> +
> +#include <linux/dma-sync.h>
> +
> /*
> * Our max_low_pfn should have been backed off by 16MB in mm/init.c to
> create
> * DMA coherent space. Use that for the pool.
> diff --git a/arch/m68k/kernel/dma.c b/arch/m68k/kernel/dma.c index
> 2e192a5df949..aa9b434e6df8 100644
> --- a/arch/m68k/kernel/dma.c
> +++ b/arch/m68k/kernel/dma.c
> @@ -58,20 +58,33 @@ void arch_dma_free(struct device *dev, size_t size, void
> *vaddr,
>
> #endif /* CONFIG_MMU && !CONFIG_COLDFIRE */
>
> -void arch_sync_dma_for_device(phys_addr_t handle, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_BIDIRECTIONAL:
> - case DMA_TO_DEVICE:
> - cache_push(handle, size);
> - break;
> - case DMA_FROM_DEVICE:
> - cache_clear(handle, size);
> - break;
> - default:
> - pr_err_ratelimited("dma_sync_single_for_device: unsupported dir
> %u\n",
> - dir);
> - break;
> - }
> + /*
> + * cache_push() always invalidates in addition to cleaning
> + * write-back caches.
> + */
> + cache_push(paddr, size);
> +}
> +
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + cache_clear(paddr, size);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + cache_push(paddr, size);
> }
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return false;
> +}
> +
> +#include <linux/dma-sync.h>
> diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
> index b4c4e45fd45e..01110d4aa5b0 100644
> --- a/arch/microblaze/kernel/dma.c
> +++ b/arch/microblaze/kernel/dma.c
> @@ -14,32 +14,30 @@
> #include <linux/bug.h>
> #include <asm/cacheflush.h>
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - switch (direction) {
> - case DMA_TO_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - flush_dcache_range(paddr, paddr + size);
> - break;
> - case DMA_FROM_DEVICE:
> - invalidate_dcache_range(paddr, paddr + size);
> - break;
> - default:
> - BUG();
> - }
> + /* writeback plus invalidate, could be a nop on WT caches */
> + flush_dcache_range(paddr, paddr + size);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> {
> - switch (direction) {
> - case DMA_TO_DEVICE:
> - break;
> - case DMA_BIDIRECTIONAL:
> - case DMA_FROM_DEVICE:
> - invalidate_dcache_range(paddr, paddr + size);
> - break;
> - default:
> - BUG();
> - }}
> +	invalidate_dcache_range(paddr, paddr + size);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> +	flush_dcache_range(paddr, paddr + size);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> +}
> +
> +#include <linux/dma-sync.h>
> diff --git a/arch/mips/mm/dma-noncoherent.c b/arch/mips/mm/dma-noncoherent.c
> index b9d68bcc5d53..902d4b7c1f85 100644
> --- a/arch/mips/mm/dma-noncoherent.c
> +++ b/arch/mips/mm/dma-noncoherent.c
> @@ -85,50 +85,38 @@ static inline void dma_sync_phys(phys_addr_t paddr,
> size_t size,
> } while (left);
> }
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - dma_sync_phys(paddr, size, _dma_cache_wback);
> - break;
> - case DMA_FROM_DEVICE:
> - dma_sync_phys(paddr, size, _dma_cache_inv);
> - break;
> - case DMA_BIDIRECTIONAL:
> - if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
> - cpu_needs_post_dma_flush())
> - dma_sync_phys(paddr, size, _dma_cache_wback);
> - else
> - dma_sync_phys(paddr, size, _dma_cache_wback_inv);
> - break;
> - default:
> - break;
> - }
> + dma_sync_phys(paddr, size, _dma_cache_wback);
> }
>
> -#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - break;
> - case DMA_FROM_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - if (cpu_needs_post_dma_flush())
> - dma_sync_phys(paddr, size, _dma_cache_inv);
> - break;
> - default:
> - break;
> - }
> + dma_sync_phys(paddr, size, _dma_cache_inv);
> }
> -#endif
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> +	dma_sync_phys(paddr, size, _dma_cache_wback_inv);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
> +		cpu_needs_post_dma_flush();
> +}
> +
> +#include <linux/dma-sync.h>
>
> #ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS
> void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> - const struct iommu_ops *iommu, bool coherent)
> + const struct iommu_ops *iommu, bool coherent)
> {
> - dev->dma_coherent = coherent;
> + dev->dma_coherent = coherent;
> }
> #endif
> diff --git a/arch/nios2/mm/dma-mapping.c b/arch/nios2/mm/dma-mapping.c index
> fd887d5f3f9a..29978970955e 100644
> --- a/arch/nios2/mm/dma-mapping.c
> +++ b/arch/nios2/mm/dma-mapping.c
> @@ -13,53 +13,46 @@
> #include <linux/types.h>
> #include <linux/mm.h>
> #include <linux/string.h>
> +#include <linux/dma-map-ops.h>
> #include <linux/dma-mapping.h>
> #include <linux/io.h>
> #include <linux/cache.h>
> #include <asm/cacheflush.h>
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> + /*
> + * We just need to write back the caches here, but Nios2 flush
> + * instruction will do both writeback and invalidate.
> + */
> void *vaddr = phys_to_virt(paddr);
> +	flush_dcache_range((unsigned long)vaddr, (unsigned long)(vaddr + size));
> +}
>
> - switch (dir) {
> - case DMA_FROM_DEVICE:
> - invalidate_dcache_range((unsigned long)vaddr,
> - (unsigned long)(vaddr + size));
> - break;
> - case DMA_TO_DEVICE:
> - /*
> - * We just need to flush the caches here , but Nios2 flush
> - * instruction will do both writeback and invalidate.
> - */
> - case DMA_BIDIRECTIONAL: /* flush and invalidate */
> - flush_dcache_range((unsigned long)vaddr,
> - (unsigned long)(vaddr + size));
> - break;
> - default:
> - BUG();
> - }
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + unsigned long vaddr = (unsigned long)phys_to_virt(paddr);
> + invalidate_dcache_range(vaddr, (unsigned long)(vaddr + size));
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> {
> void *vaddr = phys_to_virt(paddr);
> +	flush_dcache_range((unsigned long)vaddr, (unsigned long)(vaddr + size));
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
>
> - switch (dir) {
> - case DMA_BIDIRECTIONAL:
> - case DMA_FROM_DEVICE:
> - invalidate_dcache_range((unsigned long)vaddr,
> - (unsigned long)(vaddr + size));
> - break;
> - case DMA_TO_DEVICE:
> - break;
> - default:
> - BUG();
> - }
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> }
>
> +#include <linux/dma-sync.h>
> +
> void arch_dma_prep_coherent(struct page *page, size_t size) {
> unsigned long start = (unsigned long)page_address(page); diff --git
> a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c index
> 91a00d09ffad..aba2258e62eb 100644
> --- a/arch/openrisc/kernel/dma.c
> +++ b/arch/openrisc/kernel/dma.c
> @@ -95,32 +95,47 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t
> size)
> mmap_write_unlock(&init_mm);
> }
>
> -void arch_sync_dma_for_device(phys_addr_t addr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> unsigned long cl;
> struct cpuinfo_or1k *cpuinfo = &cpuinfo_or1k[smp_processor_id()];
>
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - /* Write back the dcache for the requested range */
> - for (cl = addr; cl < addr + size;
> - cl += cpuinfo->dcache_block_size)
> - mtspr(SPR_DCBWR, cl);
> - break;
> - case DMA_FROM_DEVICE:
> - /* Invalidate the dcache for the requested range */
> - for (cl = addr; cl < addr + size;
> - cl += cpuinfo->dcache_block_size)
> - mtspr(SPR_DCBIR, cl);
> - break;
> - case DMA_BIDIRECTIONAL:
> - /* Flush the dcache for the requested range */
> - for (cl = addr; cl < addr + size;
> - cl += cpuinfo->dcache_block_size)
> - mtspr(SPR_DCBFR, cl);
> - break;
> - default:
> - break;
> - }
> + /* Write back the dcache for the requested range */
> + for (cl = paddr; cl < paddr + size;
> + cl += cpuinfo->dcache_block_size)
> + mtspr(SPR_DCBWR, cl);
> }
> +
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + unsigned long cl;
> + struct cpuinfo_or1k *cpuinfo = &cpuinfo_or1k[smp_processor_id()];
> +
> + /* Invalidate the dcache for the requested range */
> + for (cl = paddr; cl < paddr + size;
> + cl += cpuinfo->dcache_block_size)
> + mtspr(SPR_DCBIR, cl);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + unsigned long cl;
> + struct cpuinfo_or1k *cpuinfo = &cpuinfo_or1k[smp_processor_id()];
> +
> + /* Flush the dcache for the requested range */
> + for (cl = paddr; cl < paddr + size;
> + cl += cpuinfo->dcache_block_size)
> + mtspr(SPR_DCBFR, cl);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return false;
> +}
> +
> +#include <linux/dma-sync.h>
> diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
> index 6d3d3cffb316..a7955aab8ce2 100644
> --- a/arch/parisc/kernel/pci-dma.c
> +++ b/arch/parisc/kernel/pci-dma.c
> @@ -443,35 +443,35 @@ void arch_dma_free(struct device *dev, size_t size,
> void *vaddr,
> free_pages((unsigned long)__va(dma_handle), order); }
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> unsigned long virt = (unsigned long)phys_to_virt(paddr);
>
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - clean_kernel_dcache_range(virt, size);
> - break;
> - case DMA_FROM_DEVICE:
> - clean_kernel_dcache_range(virt, size);
> - break;
> - case DMA_BIDIRECTIONAL:
> - flush_kernel_dcache_range(virt, size);
> - break;
> - }
> + clean_kernel_dcache_range(virt, size);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> {
> unsigned long virt = (unsigned long)phys_to_virt(paddr);
>
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - break;
> - case DMA_FROM_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - purge_kernel_dcache_range(virt, size);
> - break;
> - }
> +	purge_kernel_dcache_range(virt, size);
>  }
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + unsigned long virt = (unsigned long)phys_to_virt(paddr);
> +
> + flush_kernel_dcache_range(virt, size);
> }
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return true;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> +}
> +
> +#include <linux/dma-sync.h>
> diff --git a/arch/powerpc/mm/dma-noncoherent.c b/arch/powerpc/mm/dma-
> noncoherent.c
> index 00e59a4faa2b..268510c71156 100644
> --- a/arch/powerpc/mm/dma-noncoherent.c
> +++ b/arch/powerpc/mm/dma-noncoherent.c
> @@ -101,27 +101,33 @@ static void __dma_phys_op(phys_addr_t paddr, size_t
> size, enum dma_cache_op op) #endif }
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> +	__dma_phys_op(paddr, size, DMA_CACHE_CLEAN);
>  }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> {
> - switch (direction) {
> - case DMA_NONE:
> - BUG();
> - case DMA_TO_DEVICE:
> - break;
> - case DMA_FROM_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - __dma_phys_op(start, end, DMA_CACHE_INVAL);
> - break;
> - }
> +	__dma_phys_op(paddr, size, DMA_CACHE_INVAL);
> }
>
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> +	__dma_phys_op(paddr, size, DMA_CACHE_FLUSH);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return true;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> +}
> +
> +#include <linux/dma-sync.h>
> +
> void arch_dma_prep_coherent(struct page *page, size_t size) {
> unsigned long kaddr = (unsigned long)page_address(page); diff --git
> a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c index
> 69c80b2155a1..b9a9f57e02be 100644
> --- a/arch/riscv/mm/dma-noncoherent.c
> +++ b/arch/riscv/mm/dma-noncoherent.c
> @@ -12,43 +12,40 @@
>
> static bool noncoherent_supported;
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> void *vaddr = phys_to_virt(paddr);
>
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> - break;
> - case DMA_FROM_DEVICE:
> - ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> - break;
> - case DMA_BIDIRECTIONAL:
> - ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> - break;
> - default:
> - break;
> - }
> + ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> }
>
> -void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> {
> void *vaddr = phys_to_virt(paddr);
>
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - break;
> - case DMA_FROM_DEVICE:
> - case DMA_BIDIRECTIONAL:
> - ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
> - break;
> - default:
> - break;
> - }
> + ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
> }
>
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> +	void *vaddr = phys_to_virt(paddr);
> +
> +	ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return true;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
> +}
> +
> +#include <linux/dma-sync.h>
> +
> +
> void arch_dma_prep_coherent(struct page *page, size_t size) {
> void *flush_addr = page_address(page); diff --git
> a/arch/sh/kernel/dma-coherent.c b/arch/sh/kernel/dma-coherent.c index
> 6a44c0e7ba40..41f031ae7609 100644
> --- a/arch/sh/kernel/dma-coherent.c
> +++ b/arch/sh/kernel/dma-coherent.c
> @@ -12,22 +12,35 @@ void arch_dma_prep_coherent(struct page *page, size_t
> size)
> __flush_purge_region(page_address(page), size); }
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> void *addr = sh_cacheop_vaddr(phys_to_virt(paddr));
>
> - switch (dir) {
> - case DMA_FROM_DEVICE: /* invalidate only */
> - __flush_invalidate_region(addr, size);
> - break;
> - case DMA_TO_DEVICE: /* writeback only */
> - __flush_wback_region(addr, size);
> - break;
> - case DMA_BIDIRECTIONAL: /* writeback and invalidate */
> - __flush_purge_region(addr, size);
> - break;
> - default:
> - BUG();
> - }
> + __flush_wback_region(addr, size);
> }
> +
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> +	void *addr = sh_cacheop_vaddr(phys_to_virt(paddr));
> +
> +	__flush_invalidate_region(addr, size);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + void *addr = sh_cacheop_vaddr(phys_to_virt(paddr));
> +
> + __flush_purge_region(addr, size);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return false;
> +}
> +
> +#include <linux/dma-sync.h>
> diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c
> index 4f3d26066ec2..6926ead2f208 100644
> --- a/arch/sparc/kernel/ioport.c
> +++ b/arch/sparc/kernel/ioport.c
> @@ -300,21 +300,39 @@ arch_initcall(sparc_register_ioport);
>
> #endif /* CONFIG_SBUS */
>
> -/*
> - * IIep is write-through, not flushing on cpu to device transfer.
> - *
> - * On LEON systems without cache snooping, the entire D-CACHE must be flushed to
> - * make DMA to cacheable memory coherent.
> - */
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - if (dir != DMA_TO_DEVICE &&
> - sparc_cpu_model == sparc_leon &&
> +	/* IIep is write-through, not flushing on cpu to device transfer. */
> +}
> +
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + /*
> + * On LEON systems without cache snooping, the entire D-CACHE must be
> + * flushed to make DMA to cacheable memory coherent.
> + */
> + if (sparc_cpu_model == sparc_leon &&
> !sparc_leon3_snooping_enabled())
> leon_flush_dcache_all();
> }
>
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> + arch_dma_cache_inv(paddr, size);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return true;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return false;
> +}
> +
> +#include <linux/dma-sync.h>
> +
> #ifdef CONFIG_PROC_FS
>
>  static int sparc_io_proc_show(struct seq_file *m, void *v)
> diff --git a/arch/xtensa/kernel/pci-dma.c b/arch/xtensa/kernel/pci-dma.c
> index ff3bf015eca4..d4ff96585545 100644
> --- a/arch/xtensa/kernel/pci-dma.c
> +++ b/arch/xtensa/kernel/pci-dma.c
> @@ -43,24 +43,34 @@ static void do_cache_op(phys_addr_t paddr, size_t size,
> }
> }
>
> -void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> - enum dma_data_direction dir)
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
> {
> - switch (dir) {
> - case DMA_TO_DEVICE:
> - do_cache_op(paddr, size, __flush_dcache_range);
> - break;
> - case DMA_FROM_DEVICE:
> - do_cache_op(paddr, size, __invalidate_dcache_range);
> - break;
> - case DMA_BIDIRECTIONAL:
> - do_cache_op(paddr, size, __flush_invalidate_dcache_range);
> - break;
> - default:
> - break;
> - }
> + do_cache_op(paddr, size, __flush_dcache_range);
> }
>
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> +	do_cache_op(paddr, size, __invalidate_dcache_range);
> +}
> +
> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
> +{
> +	do_cache_op(paddr, size, __flush_invalidate_dcache_range);
> +}
> +
> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
> +
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return false;
> +}
> +
> +#include <linux/dma-sync.h>
> +
> +
>  void arch_dma_prep_coherent(struct page *page, size_t size)
>  {
> 	__invalidate_dcache_range((unsigned long)page_address(page), size);
> diff --git a/include/linux/dma-sync.h b/include/linux/dma-sync.h
> new file mode 100644
> index 000000000000..18e33d5e8eaf
> --- /dev/null
> +++ b/include/linux/dma-sync.h
> @@ -0,0 +1,107 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Cache operations depending on function and direction argument,
> +inspired by
> + *
> +https://lore/.
> +kernel.org%2Flkml%2F20180518175004.GF17671%40n2100.armlinux.org.uk&data
> +=05%7C01%7Cbiju.das.jz%40bp.renesas.com%7C3db9a66f29fa416d938108db2ebe1
> +b0c%7C53d82571da1947e49cb4625a166a4a2a%7C0%7C0%7C638155166250449286%7CU
> +nknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haW
> +wiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=04qDpyhP%2FT1wdPjg%2Bi0EzLz815rk
> +8AJmZFv8tq7tolM%3D&reserved=0
> + * "dma_sync_*_for_cpu and direction=TO_DEVICE (was Re: [PATCH 02/20]
> + * dma-mapping: provide a generic dma-noncoherent implementation)"
> + *
> + *          |   map          ==  for_device     |   unmap     ==  for_cpu
> + *          |----------------------------------------------------------------
> + * TO_DEV   |   writeback        writeback      |   none          none
> + * FROM_DEV |   invalidate       invalidate     |   invalidate*   invalidate*
> + * BIDIR    |   writeback        writeback      |   invalidate    invalidate
> + *
> + * [*] needed for CPU speculative prefetches
> + *
> + * NOTE: we don't check the validity of direction argument as it is done in
> + * upper layer functions (in include/linux/dma-mapping.h)
> + *
> + * This file can be included by arch/.../kernel/dma-noncoherent.c to provide
> + * the respective high-level operations without having to expose the
> + * cache management ops to drivers.
> + */
> +
> +void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> + enum dma_data_direction dir)
> +{
> + switch (dir) {
> + case DMA_TO_DEVICE:
> + /*
> + * This may be an empty function on write-through caches,
> + * and it might invalidate the cache if an architecture has
> + * a write-back cache but no way to write it back without
> + * invalidating
> + */
> + arch_dma_cache_wback(paddr, size);
> + break;
> +
> + case DMA_FROM_DEVICE:
> + /*
> + * FIXME: this should be handled the same across all
> + * architectures, see
> +		 * https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/
> + */
> + if (!arch_sync_dma_clean_before_fromdevice()) {
> + arch_dma_cache_inv(paddr, size);
> + break;
> + }
> + fallthrough;
> +
> + case DMA_BIDIRECTIONAL:
> + /* Skip the invalidate here if it's done later */
> + if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
> + arch_sync_dma_cpu_needs_post_dma_flush())
> + arch_dma_cache_wback(paddr, size);
> + else
> + arch_dma_cache_wback_inv(paddr, size);
> + break;
> +
> + default:
> + break;
> + }
> +}
> +
> +#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU
> +/*
> + * Mark the D-cache clean for these pages to avoid extra flushing.
> + */
> +static void arch_dma_mark_dcache_clean(phys_addr_t paddr, size_t size)
> +{
> +#ifdef CONFIG_ARCH_DMA_MARK_DCACHE_CLEAN
> + unsigned long pfn = PFN_UP(paddr);
> + unsigned long off = paddr & (PAGE_SIZE - 1);
> + size_t left = size;
> +
> + if (off)
> + left -= PAGE_SIZE - off;
> +
> + while (left >= PAGE_SIZE) {
> + struct page *page = pfn_to_page(pfn++);
> + set_bit(PG_dcache_clean, &page->flags);
> + left -= PAGE_SIZE;
> + }
> +#endif
> +}
> +
> +void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> + enum dma_data_direction dir)
> +{
> + switch (dir) {
> + case DMA_TO_DEVICE:
> + break;
> +
> + case DMA_FROM_DEVICE:
> + case DMA_BIDIRECTIONAL:
> +		/* FROM_DEVICE invalidate needed if speculative CPU prefetch only */
> + if (arch_sync_dma_cpu_needs_post_dma_flush())
> + arch_dma_cache_inv(paddr, size);
> +
> + if (size > PAGE_SIZE)
> + arch_dma_mark_dcache_clean(paddr, size);
> + break;
> +
> + default:
> + break;
> + }
> +}
> +#endif
> --
> 2.39.2
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
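Condensed from the diffs above, the opt-in contract for the new header is
small: an architecture provides three cache primitives plus two behavior
flags and then includes <linux/dma-sync.h>. A minimal skeleton with the
expected semantics as comments (the bodies here are placeholders, not real
cache ops):

/* arch/.../dma-noncoherent.c, condensed sketch */

static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
	/* write back (clean) the range, keeping the lines valid */
}

static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
{
	/* discard the range without writing anything back */
}

static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
{
	/* write back, then discard (clean + invalidate) */
}

/* true if for_device cleans instead of invalidating for DMA_FROM_DEVICE */
static inline bool arch_sync_dma_clean_before_fromdevice(void)
{
	return false;
}

/* true if the CPU may speculatively load lines while the DMA runs */
static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
{
	return false;
}

#include <linux/dma-sync.h>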
On Thu, Apr 13, 2023, at 14:13, Biju Das wrote:
> Hi all,
>
> FYI, this patch breaks on RZ/G2L SMARC EVK board and Arnd will send V2
> for fixing this issue.
>
> [10:53] <biju> [ 3.384408] Unable to handle kernel paging request at
> virtual address 000000004afb0080
Right, sorry about this, I accidentally removed the 'phys_to_virt()'
conversion on arm64.
Arnd
On Mon, 27 Mar 2023 05:13:04 PDT (-0700), [email protected] wrote:
> From: Arnd Bergmann <[email protected]>
>
> No other architecture intentionally writes back dirty cache lines into
> a buffer that a device has just finished writing into. If the cache is
> clean, this has no effect at all, but if a cacheline in the buffer has
> actually been written by the CPU, there is a driver bug that is likely
> made worse by overwriting that buffer.
>
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> arch/riscv/mm/dma-noncoherent.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> index d919efab6eba..640f4c496d26 100644
> --- a/arch/riscv/mm/dma-noncoherent.c
> +++ b/arch/riscv/mm/dma-noncoherent.c
> @@ -42,7 +42,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> break;
> case DMA_FROM_DEVICE:
> case DMA_BIDIRECTIONAL:
> - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> + ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
> break;
> default:
> break;
Acked-by: Palmer Dabbelt <[email protected]>
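For readers unfamiliar with the riscv macros: the three ALT_CMO_OP()
variants correspond to the Zicbom cache-block management instructions on
Zicbom-capable cores (a summary, not part of the patch):

/*
 * clean - cbo.clean: write dirty lines back to memory, keep them valid
 * inval - cbo.inval: discard lines without writing them back
 * flush - cbo.flush: write back and then discard (clean + invalidate)
 */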
On Mon, 27 Mar 2023 05:13:05 PDT (-0700), [email protected] wrote:
> From: Arnd Bergmann <[email protected]>
>
> For a DMA_BIDIRECTIONAL transfer, the caches have to be cleaned
> first to let the device see data written by the CPU, and invalidated
> after the transfer to let the CPU see data written by the device.
>
> riscv also invalidates the caches before the transfer, which does
> not appear to serve any purpose.
>
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> arch/riscv/mm/dma-noncoherent.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> index 640f4c496d26..69c80b2155a1 100644
> --- a/arch/riscv/mm/dma-noncoherent.c
> +++ b/arch/riscv/mm/dma-noncoherent.c
> @@ -25,7 +25,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> break;
> case DMA_BIDIRECTIONAL:
> - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> + ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> break;
> default:
> break;
Acked-by: Palmer Dabbelt <[email protected]>
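To make the ordering concrete, here is a minimal sketch of a bidirectional
streaming mapping in a hypothetical driver; on a noncoherent machine,
dma_map_single() ends up in arch_sync_dma_for_device() (clean before the
transfer) and dma_unmap_single() in arch_sync_dma_for_cpu() (invalidate
after it):

#include <linux/dma-mapping.h>

static int example_xfer(struct device *dev, void *buf, size_t len)
{
	dma_addr_t handle;

	/* for_device: clean the buffer so the device sees the CPU's data */
	handle = dma_map_single(dev, buf, len, DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dev, handle))
		return -ENOMEM;

	/* ... program the device and wait for the DMA to complete ... */

	/* for_cpu: invalidate so the CPU sees the device's data */
	dma_unmap_single(dev, handle, len, DMA_BIDIRECTIONAL);
	return 0;
}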
On Mon, Mar 27, 2023 at 8:15 PM Arnd Bergmann <[email protected]> wrote:
>
> From: Arnd Bergmann <[email protected]>
>
> For a DMA_BIDIRECTIONAL transfer, the caches have to be cleaned
> first to let the device see data written by the CPU, and invalidated
> after the transfer to let the CPU see data written by the device.
>
> riscv also invalidates the caches before the transfer, which does
> not appear to serve any purpose.
Yes, we can't guarantee that the CPU won't randomly pre-load cache
lines while the DMA is in progress.
But I have two reasons to keep the invalidate before the DMA transfer:
- We clearly tell the CPU these cache lines are invalid. The caching
algorithm would use these invalid slots first instead of replacing
valid ones.
- Invalidating is very cheap. Actually, flush and clean have the same
performance on our machine.
So, how about:
diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
index d919efab6eba..2c52fbc15064 100644
--- a/arch/riscv/mm/dma-noncoherent.c
+++ b/arch/riscv/mm/dma-noncoherent.c
@@ -22,8 +22,6 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
break;
case DMA_FROM_DEVICE:
- ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
- break;
case DMA_BIDIRECTIONAL:
ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
break;
@@ -42,7 +40,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
break;
case DMA_FROM_DEVICE:
case DMA_BIDIRECTIONAL:
/* I'm not sure all drivers have guaranteed cacheline alignment. If not, this inval would cause problems */
- ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
+ ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
break;
default:
break;
>
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> arch/riscv/mm/dma-noncoherent.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> index 640f4c496d26..69c80b2155a1 100644
> --- a/arch/riscv/mm/dma-noncoherent.c
> +++ b/arch/riscv/mm/dma-noncoherent.c
> @@ -25,7 +25,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> break;
> case DMA_BIDIRECTIONAL:
> - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> + ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> break;
> default:
> break;
> --
> 2.39.2
>
--
Best Regards
Guo Ren
On Fri, May 5, 2023, at 07:47, Guo Ren wrote:
> On Mon, Mar 27, 2023 at 8:15 PM Arnd Bergmann <[email protected]> wrote:
>>
>> riscv also invalidates the caches before the transfer, which does
>> not appear to serve any purpose.
> Yes, we can't guarantee that the CPU won't randomly pre-load cache
> lines while the DMA is in progress.
>
> But I have two reasons to keep the invalidate before the DMA transfer:
> - We clearly tell the CPU these cache lines are invalid. The caching
> algorithm would use these invalid slots first instead of replacing
> valid ones.
> - Invalidating is very cheap. Actually, flush and clean have the same
> performance on our machine.
The main purpose of the series was to get consistent behavior on
all machines, so I really don't want a custom optimization on
one architecture. You make a good point about cacheline reuse
after invalidation, but if we do that, I'd suggest doing this
across all architectures.
> So, how about:
>
> diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> index d919efab6eba..2c52fbc15064 100644
> --- a/arch/riscv/mm/dma-noncoherent.c
> +++ b/arch/riscv/mm/dma-noncoherent.c
> @@ -22,8 +22,6 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> break;
> case DMA_FROM_DEVICE:
> - ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> - break;
> case DMA_BIDIRECTIONAL:
> ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> break;
This is something we can consider. Unfortunately, this is something
that no architecture (except pa-risc, which has other problems)
does at the moment, so we'd probably need to have a proper debate
about this.
We already have two conflicting ways to handle DMA_FROM_DEVICE,
either invalidate/invalidate, or clean/invalidate. I can see
that flush/invalidate may be a sensible option as well, but I'd
want to have that discussion after the series is complete, so
we can come to a generic solution that has the same documented
behavior across all architectures.
In particular, if we end up moving arm64 and riscv back to the
traditional invalidate/invalidate for DMA_FROM_DEVICE and
document that driver must not rely on buffers getting cleaned
before a partial DMA_FROM_DEVICE, the question between clean
or flush becomes moot as well.
> @@ -42,7 +40,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> break;
> case DMA_FROM_DEVICE:
> case DMA_BIDIRECTIONAL:
> /* I'm not sure all drivers have guaranteed cacheline alignment. If not, this inval would cause problems */
> - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> + ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
> break;
This is my original patch, and I would not mix it with the other
change. The problem with non-aligned DMA_BIDIRECTIONAL buffers
is that both flush and inval would be wrong if you get simultaneous
writes from device and cpu to the same cache line, so there is
no way to win this. Using inval instead of flush would at least
work if the CPU data in the cacheline is read-only from the CPU,
so that seems better than something that is always wrong.
The documented API is that sharing the cache line is not allowed
at all, so anything that would observe a difference between the
two is also a bug. One idea that we have considered already is
that we could overwrite the unused bits of the cacheline with
poison values and/or mark them as invalid using KASAN for debugging
purposes, to find drivers that already violate this.
Arnd
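A rough sketch of what such a debugging aid could look like; every name
here is hypothetical and nothing like this exists in the tree:

/*
 * Before mapping a buffer, poison the unused bytes of its first and
 * last cache line. A driver that shares a cache line between a DMA
 * buffer and CPU data then sees the poison as corruption.
 */
static void dma_debug_poison_partial_lines(phys_addr_t paddr, size_t size)
{
	unsigned long mask = ARCH_DMA_MINALIGN - 1;
	unsigned long head = paddr & mask;
	unsigned long tail = (paddr + size) & mask;
	void *vaddr = phys_to_virt(paddr);

	if (head)	/* bytes before the buffer in its first line */
		memset(vaddr - head, 0xa5, head);
	if (tail)	/* bytes after the buffer in its last line */
		memset(vaddr + size, 0xa5, ARCH_DMA_MINALIGN - tail);
}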
On Fri, May 5, 2023 at 9:19 PM Arnd Bergmann <[email protected]> wrote:
>
> On Fri, May 5, 2023, at 07:47, Guo Ren wrote:
> > On Mon, Mar 27, 2023 at 8:15 PM Arnd Bergmann <[email protected]> wrote:
>
> >>
> >> riscv also invalidates the caches before the transfer, which does
> >> not appear to serve any purpose.
> > Yes, we can't guarantee that the CPU won't randomly pre-load cache
> > lines while the DMA is in progress.
> >
> > But I have two reasons to keep the invalidate before the DMA transfer:
> > - We clearly tell the CPU these cache lines are invalid. The caching
> > algorithm would use these invalid slots first instead of replacing
> > valid ones.
> > - Invalidating is very cheap. Actually, flush and clean have the same
> > performance on our machine.
>
> The main purpose of the series was to get consistent behavior on
> all machines, so I really don't want a custom optimization on
> one architecture. You make a good point about cacheline reuse
> after invalidation, but if we do that, I'd suggest doing this
> across all architectures.
Yes, invalidation of DMA_FROM_DEVICE-for_device is a proposal for all
architectures.
>
> > So, how about:
> >
> > diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> > index d919efab6eba..2c52fbc15064 100644
> > --- a/arch/riscv/mm/dma-noncoherent.c
> > +++ b/arch/riscv/mm/dma-noncoherent.c
> > @@ -22,8 +22,6 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> > ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> > break;
> > case DMA_FROM_DEVICE:
> > - ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> > - break;
> > case DMA_BIDIRECTIONAL:
> > ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> > break;
>
> This is something we can consider. Unfortunately, this is something
> that no architecture (except pa-risc, which has other problems)
> does at the moment, so we'd probably need to have a proper debate
> about this.
>
> We already have two conflicting ways to handle DMA_FROM_DEVICE,
> either invalidate/invalidate, or clean/invalidate. I can see
I vote for invalidate/invalidate.
My key point is that for_device should invalidate for DMA_FROM_DEVICE,
and DMA_BIDIRECTIONAL contains DMA_FROM_DEVICE.
So I also agree:
@@ -22,8 +22,6 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
break;
case DMA_FROM_DEVICE:
- ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
+	ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
break;
case DMA_BIDIRECTIONAL:
ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
break;
> that flush/invalidate may be a sensible option as well, but I'd
> want to have that discussion after the series is complete, so
> we can come to a generic solution that has the same documented
> behavior across all architectures.
Yes, I agree to unify them into a generic solution first. My proposal
could be another topic in the future.
For that purpose, I give
Acked-by: Guo Ren <[email protected]>
>
> In particular, if we end up moving arm64 and riscv back to the
> traditional invalidate/invalidate for DMA_FROM_DEVICE and
> document that driver must not rely on buffers getting cleaned
After invalidation, the cache lines are also cleaned, right? So why do
we need to document it additionally?
> before a partial DMA_FROM_DEVICE, the question between clean
> or flush becomes moot as well.
>
> > @@ -42,7 +40,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> > break;
> > case DMA_FROM_DEVICE:
> > case DMA_BIDIRECTIONAL:
> > /* I'm not sure all drivers have guaranteed cacheline alignment. If not, this inval would cause problems */
> > - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> > + ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
> > break;
>
> This is my original patch, and I would not mix it with the other
> change. The problem with non-aligned DMA_BIDIRECTIONAL buffers
> is that both flush and inval would be wrong if you get simultaneous
> writes from device and cpu to the same cache line, so there is
> no way to win this. Using inval instead of flush would at least
> work if the CPU data in the cacheline is read-only from the CPU,
> so that seems better than something that is always wrong.
If the CPU data in the cacheline is read-only, the cacheline would never
be dirty. Yes, it's always safe.
Okay, I agree we must keep buffers cache-line aligned. I commented here
only because I worry that some sloppy drivers couldn't work with the
invalidate approach: the CPU data in the cacheline gets corrupted, and
the device data in the cacheline is useless.
>
> The documented API is that sharing the cache line is not allowed
> at all, so anything that would observe a difference between the
> two is also a bug. One idea that we have considered already is
> that we could overwrite the unused bits of the cacheline with
> poison values and/or mark them as invalid using KASAN for debugging
> purposes, to find drivers that already violate this.
>
> Arnd
--
Best Regards
Guo Ren
On Sat, May 6, 2023, at 09:25, Guo Ren wrote:
> On Fri, May 5, 2023 at 9:19 PM Arnd Bergmann <[email protected]> wrote:
>>
>> This is something we can consider. Unfortunately, this is something
>> that no architecture (except pa-risc, which has other problems)
>> does at the moment, so we'd probably need to have a proper debate
>> about this.
>>
>> We already have two conflicting ways to handle DMA_FROM_DEVICE,
>> either invalidate/invalidate, or clean/invalidate. I can see
> I vote to invalidate/invalidate.
>
...
>
>> that flush/invalidate may be a sensible option as well, but I'd
>> want to have that discussion after the series is complete, so
>> we can come to a generic solution that has the same documented
>> behavior across all architectures.
> Yes, I agree to unify them into a generic solution first. My proposal
> could be another topic in the future.
Right, I was explicitly trying to exclude that question from my
series, and left it as an architecture specific Kconfig option
based on the current behavior.
>> In particular, if we end up moving arm64 and riscv back to the
>> traditional invalidate/invalidate for DMA_FROM_DEVICE and
>> document that driver must not rely on buffers getting cleaned
> After invalidation, the cache lines are also cleaned, right? So why do
> we need to document it additionally?
I mentioned the debate in the cover letter, the full explanation
is archived at
https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/
In short, the problem that is addressed here is leaking sensitive
kernel data to user space or a device as in this sequence:
1. A DMA buffer is allocated in the kernel and contains stale data
that is no longer needed but must not be exposed to untrusted
userspace, i.e. encryption keys or user file pages
2. allocator uses memset() to clear out the buffer
3. buffer gets mapped into a device for DMA_FROM_DEVICE
4. writeback cache gets invalidated, uncovering the sensitive
data by discarding the zeros
5. device returns less data than expected
6. buffer is unmapped
7. whole buffer is mapped or copied to user space
Will added his patch for arm64 to prevent this scenario by using
'clean' instead of 'invalidate' in step 4, and the same behavior
got copied to riscv but not most of the other architectures.
The dma-mapping documentation does not say anything about this
case, and an alternative approach would be to document that
device drivers must watch out for short reads in step 5, or that
kzalloc() should clean the cache in step 2. Both of these come
at a cost as well.
Arnd
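In driver terms the sequence looks roughly like this; the driver is
hypothetical and the step numbers refer to Arnd's list above:

static int example_read(struct device *dev, void __user *ubuf, size_t len)
{
	dma_addr_t handle;
	void *buf;
	int ret;

	buf = kzalloc(len, GFP_KERNEL);		/* steps 1+2: zeroed buffer */
	if (!buf)
		return -ENOMEM;

	/*
	 * steps 3+4: with the traditional invalidate-before-DMA behavior,
	 * the zeros may still sit in the dcache and get discarded here,
	 * exposing the stale data underneath
	 */
	handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, handle)) {
		kfree(buf);
		return -ENOMEM;
	}

	/* step 5: the device writes fewer than 'len' bytes, or nothing */

	dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);	/* step 6 */

	/* step 7: the stale bytes, not the zeros, reach user space */
	ret = copy_to_user(ubuf, buf, len) ? -EFAULT : 0;
	kfree(buf);
	return ret;
}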
Hi Arnd,
On Mon, Mar 27, 2023 at 1:14 PM Arnd Bergmann <[email protected]> wrote:
>
> From: Arnd Bergmann <[email protected]>
>
> After a long discussion about adding SoC specific semantics for when
> to flush caches in drivers/soc/ drivers that we determined to be
> fundamentally flawed[1], I volunteered to try to move that logic into
> architecture-independent code and make all existing architectures do
> the same thing.
>
> As we had determined earlier, the behavior is wildly different across
> architectures, but most of the differences come down to either bugs
> (when required flushes are missing) or extra flushes that are harmless
> but might hurt performance.
>
> I finally found the time to come up with an implementation of this, which
> starts by replacing every outlier with one of the three common options:
>
> 1. architectures without speculative prefetching (hexagon, m68k,
> openrisc, sh, sparc, and certain armv4 and xtensa implementations)
> only flush their caches before a DMA, by cleaning write-back caches
> (if any) before a DMA to the device, and by invalidating the caches
> before a DMA from a device
>
> 2. arc, microblaze, mips, nios2, sh and later xtensa now follow the
> normal 32-bit arm model and invalidate their writeback caches
> again after a DMA from the device, to remove stale cache lines
> that got prefetched during the DMA. arc, csky and mips used to
> invalidate buffers also before the bidirectional DMA, but this
> is now skipped whenever we know it gets invalidated again
> after the DMA.
>
> 3. parisc, powerpc and riscv already flushed buffers before
> a DMA_FROM_DEVICE, and these get moved to the arm64 behavior
> that does the writeback before and invalidate after both
> DMA_FROM_DEVICE and DMA_BIDIRECTIONAL in order to avoid the
> problem of accidentally leaking stale data if the DMA does
> not actually happen[2].
>
> The last patch in the series replaces the architecture specific code
> with a shared version that implements all three based on architecture
> specific parameters that are almost always determined at compile time.
>
> The difference between cases 1. and 2. is hardware specific, while between
> 2. and 3. we need to decide which semantics we want, but I explicitly
> avoid this question in my series and leave it to be decided later.
>
> Another difference that I do not address here is what cache invalidation
> does for partial cache lines. On arm32, arm64 and powerpc, a partial
> cache line always gets written back before invalidation in order to
> ensure that data before or after the buffer is not discarded. On all
> other architectures, the assumption is cache lines are never shared
> between DMA buffer and data that is accessed by the CPU. If we end up
> always writing back dirty cache lines before a DMA (option 3 above),
> then this point becomes moot, otherwise we should probably address this
> in a follow-up series to document one behavior or the other and implement
> it consistently.
>
> Please review!
>
> Arnd
>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/
>
> Arnd Bergmann (21):
> openrisc: dma-mapping: flush bidirectional mappings
> xtensa: dma-mapping: use normal cache invalidation rules
> sparc32: flush caches in dma_sync_*for_device
> microblaze: dma-mapping: skip extra DMA flushes
> powerpc: dma-mapping: split out cache operation logic
> powerpc: dma-mapping: minimize for_cpu flushing
> powerpc: dma-mapping: always clean cache in _for_device() op
> riscv: dma-mapping: only invalidate after DMA, not flush
> riscv: dma-mapping: skip invalidation before bidirectional DMA
> csky: dma-mapping: skip invalidating before DMA from device
> mips: dma-mapping: skip invalidating before bidirectional DMA
> mips: dma-mapping: split out cache operation logic
> arc: dma-mapping: skip invalidating before bidirectional DMA
> parisc: dma-mapping: use regular flush/invalidate ops
> ARM: dma-mapping: always invalidate WT caches before DMA
> ARM: dma-mapping: bring back dmac_{clean,inv}_range
> ARM: dma-mapping: use arch_sync_dma_for_{device,cpu}() internally
> ARM: drop SMP support for ARM11MPCore
> ARM: dma-mapping: use generic form of arch_sync_dma_* helpers
> ARM: dma-mapping: split out arch_dma_mark_clean() helper
> dma-mapping: replace custom code with generic implementation
>
Do you plan to send v2 for this series?
Cheers,
Prabhakar
> arch/arc/mm/dma.c | 66 ++------
> arch/arm/Kconfig | 4 +
> arch/arm/include/asm/cacheflush.h | 21 +++
> arch/arm/include/asm/glue-cache.h | 4 +
> arch/arm/mach-oxnas/Kconfig | 4 -
> arch/arm/mach-oxnas/Makefile | 1 -
> arch/arm/mach-oxnas/headsmp.S | 23 ---
> arch/arm/mach-oxnas/platsmp.c | 96 -----------
> arch/arm/mach-versatile/platsmp-realview.c | 4 -
> arch/arm/mm/Kconfig | 19 ---
> arch/arm/mm/cache-fa.S | 4 +-
> arch/arm/mm/cache-nop.S | 6 +
> arch/arm/mm/cache-v4.S | 13 +-
> arch/arm/mm/cache-v4wb.S | 4 +-
> arch/arm/mm/cache-v4wt.S | 22 ++-
> arch/arm/mm/cache-v6.S | 35 +---
> arch/arm/mm/cache-v7.S | 6 +-
> arch/arm/mm/cache-v7m.S | 4 +-
> arch/arm/mm/dma-mapping-nommu.c | 36 ++--
> arch/arm/mm/dma-mapping.c | 181 ++++++++++-----------
> arch/arm/mm/proc-arm1020.S | 4 +-
> arch/arm/mm/proc-arm1020e.S | 4 +-
> arch/arm/mm/proc-arm1022.S | 4 +-
> arch/arm/mm/proc-arm1026.S | 4 +-
> arch/arm/mm/proc-arm920.S | 4 +-
> arch/arm/mm/proc-arm922.S | 4 +-
> arch/arm/mm/proc-arm925.S | 4 +-
> arch/arm/mm/proc-arm926.S | 4 +-
> arch/arm/mm/proc-arm940.S | 4 +-
> arch/arm/mm/proc-arm946.S | 4 +-
> arch/arm/mm/proc-feroceon.S | 8 +-
> arch/arm/mm/proc-macros.S | 2 +
> arch/arm/mm/proc-mohawk.S | 4 +-
> arch/arm/mm/proc-xsc3.S | 4 +-
> arch/arm/mm/proc-xscale.S | 6 +-
> arch/arm64/mm/dma-mapping.c | 28 ++--
> arch/csky/mm/dma-mapping.c | 46 +++---
> arch/hexagon/kernel/dma.c | 44 ++---
> arch/m68k/kernel/dma.c | 43 +++--
> arch/microblaze/kernel/dma.c | 38 ++---
> arch/mips/mm/dma-noncoherent.c | 75 +++------
> arch/nios2/mm/dma-mapping.c | 57 +++----
> arch/openrisc/kernel/dma.c | 62 ++++---
> arch/parisc/include/asm/cacheflush.h | 6 +-
> arch/parisc/kernel/pci-dma.c | 33 +++-
> arch/powerpc/mm/dma-noncoherent.c | 76 +++++----
> arch/riscv/mm/dma-noncoherent.c | 51 +++---
> arch/sh/kernel/dma-coherent.c | 43 +++--
> arch/sparc/Kconfig | 2 +-
> arch/sparc/kernel/ioport.c | 38 +++--
> arch/xtensa/Kconfig | 1 -
> arch/xtensa/include/asm/cacheflush.h | 6 +-
> arch/xtensa/kernel/pci-dma.c | 47 +++---
> include/linux/dma-sync.h | 107 ++++++++++++
> 54 files changed, 721 insertions(+), 699 deletions(-)
> delete mode 100644 arch/arm/mach-oxnas/headsmp.S
> delete mode 100644 arch/arm/mach-oxnas/platsmp.c
> create mode 100644 include/linux/dma-sync.h
>
> --
> 2.39.2
>
> Cc: Vineet Gupta <[email protected]>
> Cc: Russell King <[email protected]>
> Cc: Neil Armstrong <[email protected]>
> Cc: Linus Walleij <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Guo Ren <[email protected]>
> Cc: Brian Cain <[email protected]>
> Cc: Geert Uytterhoeven <[email protected]>
> Cc: Michal Simek <[email protected]>
> Cc: Thomas Bogendoerfer <[email protected]>
> Cc: Dinh Nguyen <[email protected]>
> Cc: Stafford Horne <[email protected]>
> Cc: Helge Deller <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Christophe Leroy <[email protected]>
> Cc: Paul Walmsley <[email protected]>
> Cc: Palmer Dabbelt <[email protected]>
> Cc: Rich Felker <[email protected]>
> Cc: John Paul Adrian Glaubitz <[email protected]>
> Cc: "David S. Miller" <[email protected]>
> Cc: Max Filippov <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Robin Murphy <[email protected]>
> Cc: Lad Prabhakar <[email protected]>
> Cc: Conor Dooley <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv
On Thu, Apr 13, 2023 at 2:52 PM Arnd Bergmann <[email protected]> wrote:
> On Thu, Apr 13, 2023, at 14:13, Biju Das wrote:
> > FYI, this patch breaks on RZ/G2L SMARC EVK board and Arnd will send V2
> > for fixing this issue.
> >
> > [10:53] <biju> [ 3.384408] Unable to handle kernel paging request at
> > virtual address 000000004afb0080
>
> Right, sorry about this, I accidentally removed the 'phys_to_virt()'
> conversion on arm64.
Meh, I missed that, so I ended up bisecting this same failure...
This patch is now commit 801f1883c4bb70cc ("dma-mapping: replace
custom code with generic implementation") in esmil/jh7100-dmapool,
and broke booting on R-Car Gen3.
The following gmail-whitespace-damaged patch fixes that:
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 97b7cea5eb23aedd..77e0b68b43e5849a 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -15,17 +15,23 @@
static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
{
- dcache_clean_poc(paddr, paddr + size);
+ unsigned long start = (unsigned long)phys_to_virt(paddr);
+
+ dcache_clean_poc(start, start + size);
}

static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
{
- dcache_inval_poc(paddr, paddr + size);
+ unsigned long start = (unsigned long)phys_to_virt(paddr);
+
+ dcache_inval_poc(start, start + size);
}

static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
{
- dcache_clean_inval_poc(paddr, paddr + size);
+ unsigned long start = (unsigned long)phys_to_virt(paddr);
+
+ dcache_clean_inval_poc(start, start + size);
}

static inline bool arch_sync_dma_clean_before_fromdevice(void)
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
Hi Arnd,
On Mon, Mar 27, 2023 at 2:16 PM Arnd Bergmann <[email protected]> wrote:
> From: Arnd Bergmann <[email protected]>
>
> The arm version of the arch_sync_dma_for_cpu() function annotates pages as
> PG_dcache_clean after a DMA, but no other architecture does this here. On
> ia64, the same thing is done in arch_sync_dma_for_cpu(), so it makes sense
> to use the same hook in order to have identical arch_sync_dma_for_cpu()
> semantics as all other architectures.
>
> Splitting this out has multiple effects:
>
> - for dma-direct, this now gets called after arch_sync_dma_for_cpu()
> for DMA_FROM_DEVICE mappings, but not for DMA_BIDIRECTIONAL. While
> it would not be harmful to keep doing it for bidirectional mappings,
> those are apparently not used in any callers that care about the flag.
>
> - Since arm has its own dma-iommu abstraction, this now also needs to
> call the same function, so the calls are added there to mirror the
> dma-direct version.
>
> - Like dma-direct, the dma-iommu version now marks the dcache clean
> for both coherent and noncoherent devices after a DMA, but it only
> does this for DMA_FROM_DEVICE, not DMA_BIDIRECTIONAL.
>
> [ HELP NEEDED: can anyone confirm that it is a correct assumption
> on arm that a cache-coherent device writing to a page always results
> in it being in a PG_dcache_clean state like on ia64, or can a device
> write directly into the dcache?]
>
> Signed-off-by: Arnd Bergmann <[email protected]>
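For context, the dma-direct side described above ends up looking roughly
like this (a paraphrased sketch of kernel/dma/direct.h, trimmed to the
relevant calls; the swiotlb handling is omitted):

static inline void dma_direct_sync_single_for_cpu(struct device *dev,
		dma_addr_t addr, size_t size, enum dma_data_direction dir)
{
	phys_addr_t paddr = dma_to_phys(dev, addr);

	if (!dev_is_dma_coherent(dev))
		arch_sync_dma_for_cpu(paddr, size, dir);

	/* now called for DMA_FROM_DEVICE only, not DMA_BIDIRECTIONAL */
	if (dir == DMA_FROM_DEVICE)
		arch_dma_mark_clean(paddr, size);
}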
Thanks for your patch, which is now commit 322dbe898f82fd8a
("ARM: dma-mapping: split out arch_dma_mark_clean() helper") in
esmil/jh7100-dmapool.
If CONFIG_ARM_DMA_USE_IOMMU=y, the build fails.
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -1294,6 +1298,17 @@ static int arm_iommu_map_sg(struct device *dev, struct scatterlist *sg,
> return -EINVAL;
> }
>
> +static void arm_iommu_sync_dma_for_cpu(phys_addr_t phys, size_t len,
> + enum dma_data_direction dir,
> + bool dma_coherent)
> +{
> + if (!dma_coherent)
> + arch_sync_dma_for_cpu(phys, s->length, dir);
s/s->length/len/
> +
> + if (dir == DMA_FROM_DEVICE)
> + arch_dma_mark_clean(phys, s->length);
Likewise.
> +}
> +
> /**
> * arm_iommu_unmap_sg - unmap a set of SG buffers mapped by dma_map_sg
> * @dev: valid struct device pointer
> @@ -1425,9 +1438,9 @@ static void arm_iommu_unmap_page(struct device *dev, dma_addr_t handle,
> if (!iova)
> return;
>
> - if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
> + if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
Missing opening curly brace.
> phys = iommu_iova_to_phys(mapping->domain, handle);
> - arch_sync_dma_for_cpu(phys, size, dir);
> + arm_iommu_sync_dma_for_cpu(phys, size, dir, dev->dma_coherent);
> }
>
> iommu_unmap(mapping->domain, iova, len);
With the above fixed, it builds and boots fine (on R-Car M2-W).
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
> Thanks for your patch, which is now commit 322dbe898f82fd8a
> ("ARM: dma-mapping: split out arch_dma_mark_clean() helper") in
> esmil/jh7100-dmapool.
Well, something is wrong with that branch then, and this series still
needs more work, and should eventually be merged through the dma-mapping
tree.