2024-01-31 16:26:11

by Mathieu Desnoyers

[permalink] [raw]
Subject: [RFC PATCH v3 0/4] Introduce cpu_dcache_is_aliasing() to fix DAX regression

This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM,
even on ARMv7 which does not have virtually aliased data caches:

commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches")

It used to work fine before: I have customers using DAX over pmem on
ARMv7, but this regression will likely prevent them from upgrading their
kernel.

The root of the issue here is the fact that DAX was never designed to
handle virtually aliasing data caches (VIVT and VIPT with aliasing data
cache). It touches the pages through their linear mapping, which is not
consistent with the userspace mappings with virtually aliasing data
caches.

This patch series introduces cpu_dcache_is_aliasing() with the new
Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all
architectures. The implementation of cpu_dcache_is_aliasing() is either
evaluated to a constant at compile-time or a runtime check, which is
what is needed on ARM.

With this we can basically narrow down the list of architectures which
are unsupported by DAX to those which are really affected.

Feedback is welcome,

Thanks,

Mathieu

Changes since v2:
- Move DAX supported runtime check to alloc_dax(),
- Modify DM to handle alloc_dax() error as non-fatal,
- Remove all filesystem modifications, since the check is now done by
alloc_dax(),
- rename "dcache" and "cache" to "cpu dcache" and "cpu cache" to
eliminate confusion with VFS terminology.

Changes since v1:
- The order of the series was completely changed based on the
feedback received on v1,
- cache_is_aliasing() is renamed to dcache_is_aliasing(),
- ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone,
- dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is
simplified,
- the dax_is_supported() check was moved to its rightful place in all
filesystems.

Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Mathieu Desnoyers (4):
dm: Treat alloc_dax failure as non-fatal
dax: Check for data cache aliasing at runtime
Introduce cpu_dcache_is_aliasing() across all architectures
dax: Fix incorrect list of data cache aliasing architectures

arch/arc/Kconfig | 1 +
arch/arc/include/asm/cachetype.h | 9 +++++++++
arch/arm/Kconfig | 1 +
arch/arm/include/asm/cachetype.h | 2 ++
arch/csky/Kconfig | 1 +
arch/csky/include/asm/cachetype.h | 9 +++++++++
arch/m68k/Kconfig | 1 +
arch/m68k/include/asm/cachetype.h | 9 +++++++++
arch/mips/Kconfig | 1 +
arch/mips/include/asm/cachetype.h | 9 +++++++++
arch/nios2/Kconfig | 1 +
arch/nios2/include/asm/cachetype.h | 10 ++++++++++
arch/parisc/Kconfig | 1 +
arch/parisc/include/asm/cachetype.h | 9 +++++++++
arch/sh/Kconfig | 1 +
arch/sh/include/asm/cachetype.h | 9 +++++++++
arch/sparc/Kconfig | 1 +
arch/sparc/include/asm/cachetype.h | 14 ++++++++++++++
arch/xtensa/Kconfig | 1 +
arch/xtensa/include/asm/cachetype.h | 10 ++++++++++
drivers/dax/super.c | 5 +++++
drivers/md/dm.c | 10 +++++-----
fs/Kconfig | 1 -
include/linux/cacheinfo.h | 6 ++++++
mm/Kconfig | 6 ++++++
25 files changed, 122 insertions(+), 6 deletions(-)
create mode 100644 arch/arc/include/asm/cachetype.h
create mode 100644 arch/csky/include/asm/cachetype.h
create mode 100644 arch/m68k/include/asm/cachetype.h
create mode 100644 arch/mips/include/asm/cachetype.h
create mode 100644 arch/nios2/include/asm/cachetype.h
create mode 100644 arch/parisc/include/asm/cachetype.h
create mode 100644 arch/sh/include/asm/cachetype.h
create mode 100644 arch/sparc/include/asm/cachetype.h
create mode 100644 arch/xtensa/include/asm/cachetype.h

--
2.39.2



2024-01-31 16:27:11

by Mathieu Desnoyers

[permalink] [raw]
Subject: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

Replace the following fs/Kconfig:FS_DAX dependency:

depends on !(ARM || MIPS || SPARC)

By a runtime check within alloc_dax().

This is done in preparation for its use by each filesystem supporting
the "dax" mount option to validate whether DAX is indeed supported.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
drivers/dax/super.c | 6 ++++++
fs/Kconfig | 1 -
2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..e9f397b8a5a3 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -445,6 +445,12 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
dev_t devt;
int minor;

+ /* Unavailable on architectures with virtually aliased data caches. */
+ if (IS_ENABLED(CONFIG_ARM) ||
+ IS_ENABLED(CONFIG_MIPS) ||
+ IS_ENABLED(CONFIG_SPARC))
+ return NULL;
+
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
return ERR_PTR(-EINVAL);

diff --git a/fs/Kconfig b/fs/Kconfig
index 42837617a55b..e5efdb3b276b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -56,7 +56,6 @@ endif # BLOCK
config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
- depends on !(ARM || MIPS || SPARC)
depends on ZONE_DEVICE || FS_DAX_LIMITED
select FS_IOMAP
select DAX
--
2.39.2


2024-01-31 16:27:14

by Mathieu Desnoyers

[permalink] [raw]
Subject: [RFC PATCH v3 3/4] Introduce cpu_dcache_is_aliasing() across all architectures

Introduce a generic way to query whether the data cache is virtually
aliased on all architectures. Its purpose is to ensure that subsystems
which are incompatible with virtually aliased data caches (e.g. FS_DAX)
can reliably query this.

For data cache aliasing, there are three scenarios dependending on the
architecture. Here is a breakdown based on my understanding:

A) The data cache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.)
* sh
* parisc

B) The data cache aliasing is statically known or depends on querying CPU
state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The data cache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since
commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
implement "cpu_dcache_is_aliasing()".

Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
cpu_dcache_is_aliasing() simply evaluates to "false".

Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
work. This would be useful to gate features like XIP on architectures
which have aliasing CPU dcache-icache but not CPU dcache-dcache.

Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
to clarify that we really mean "CPU data cache" and "CPU cache" to
eliminate any possible confusion with VFS "dentry cache" and "page
cache".

Link: https://lore.kernel.org/lkml/[email protected]/
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
arch/arc/Kconfig | 1 +
arch/arc/include/asm/cachetype.h | 9 +++++++++
arch/arm/Kconfig | 1 +
arch/arm/include/asm/cachetype.h | 2 ++
arch/csky/Kconfig | 1 +
arch/csky/include/asm/cachetype.h | 9 +++++++++
arch/m68k/Kconfig | 1 +
arch/m68k/include/asm/cachetype.h | 9 +++++++++
arch/mips/Kconfig | 1 +
arch/mips/include/asm/cachetype.h | 9 +++++++++
arch/nios2/Kconfig | 1 +
arch/nios2/include/asm/cachetype.h | 10 ++++++++++
arch/parisc/Kconfig | 1 +
arch/parisc/include/asm/cachetype.h | 9 +++++++++
arch/sh/Kconfig | 1 +
arch/sh/include/asm/cachetype.h | 9 +++++++++
arch/sparc/Kconfig | 1 +
arch/sparc/include/asm/cachetype.h | 14 ++++++++++++++
arch/xtensa/Kconfig | 1 +
arch/xtensa/include/asm/cachetype.h | 10 ++++++++++
include/linux/cacheinfo.h | 6 ++++++
mm/Kconfig | 6 ++++++
22 files changed, 112 insertions(+)
create mode 100644 arch/arc/include/asm/cachetype.h
create mode 100644 arch/csky/include/asm/cachetype.h
create mode 100644 arch/m68k/include/asm/cachetype.h
create mode 100644 arch/mips/include/asm/cachetype.h
create mode 100644 arch/nios2/include/asm/cachetype.h
create mode 100644 arch/parisc/include/asm/cachetype.h
create mode 100644 arch/sh/include/asm/cachetype.h
create mode 100644 arch/sparc/include/asm/cachetype.h
create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..7d294a3242a4 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@
config ARC
def_bool y
select ARC_TIMERS
+ select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_CACHE_LINE_SIZE
select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
new file mode 100644
index 000000000000..05fc7ed59712
--- /dev/null
+++ b/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing() true
+
+#endif
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f8567e95f98b..cd13b1788973 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ config ARM
select ARCH_32BIT_OFF_T
select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && FRAME_POINTER && !ARM_UNWIND
select ARCH_HAS_BINFMT_FLAT
+ select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_CPU_FINALIZE_INIT if MMU
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_HAS_DEBUG_VIRTUAL if MMU
diff --git a/arch/arm/include/asm/cachetype.h b/arch/arm/include/asm/cachetype.h
index e8c30430be33..b9dbe1d4c8fe 100644
--- a/arch/arm/include/asm/cachetype.h
+++ b/arch/arm/include/asm/cachetype.h
@@ -20,6 +20,8 @@ extern unsigned int cacheid;
#define icache_is_vipt_aliasing() cacheid_is(CACHEID_VIPT_I_ALIASING)
#define icache_is_pipt() cacheid_is(CACHEID_PIPT)

+#define cpu_dcache_is_aliasing() (cache_is_vivt() || cache_is_vipt_aliasing())
+
/*
* __LINUX_ARM_ARCH__ is the minimum supported CPU architecture
* Mask out support which will never be present on newer CPUs.
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index cf2a6fd7dff8..8a91eccf76dc 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -2,6 +2,7 @@
config CSKY
def_bool y
select ARCH_32BIT_OFF_T
+ select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_DMA_PREP_COHERENT
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_SYNC_DMA_FOR_CPU
diff --git a/arch/csky/include/asm/cachetype.h b/arch/csky/include/asm/cachetype.h
new file mode 100644
index 000000000000..98cbe3af662f
--- /dev/null
+++ b/arch/csky/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_CSKY_CACHETYPE_H
+#define __ASM_CSKY_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing() true
+
+#endif
diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index 4b3e93cac723..a9c3e3de0c6d 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -3,6 +3,7 @@ config M68K
bool
default y
select ARCH_32BIT_OFF_T
+ select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_BINFMT_FLAT
select ARCH_HAS_CPU_FINALIZE_INIT if MMU
select ARCH_HAS_CURRENT_STACK_POINTER
diff --git a/arch/m68k/include/asm/cachetype.h b/arch/m68k/include/asm/cachetype.h
new file mode 100644
index 000000000000..7fad5d9ab8fe
--- /dev/null
+++ b/arch/m68k/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_M68K_CACHETYPE_H
+#define __ASM_M68K_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing() true
+
+#endif
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 797ae590ebdb..ab1c8bd96666 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -4,6 +4,7 @@ config MIPS
default y
select ARCH_32BIT_OFF_T if !64BIT
select ARCH_BINFMT_ELF_STATE if MIPS_FP_SUPPORT
+ select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_CPU_FINALIZE_INIT
select ARCH_HAS_CURRENT_STACK_POINTER if !CC_IS_CLANG || CLANG_VERSION >= 140000
select ARCH_HAS_DEBUG_VIRTUAL if !64BIT
diff --git a/arch/mips/include/asm/cachetype.h b/arch/mips/include/asm/cachetype.h
new file mode 100644
index 000000000000..9f4ba2fe1155
--- /dev/null
+++ b/arch/mips/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_MIPS_CACHETYPE_H
+#define __ASM_MIPS_CACHETYPE_H
+
+#include <asm/cpu-features.h>
+
+#define cpu_dcache_is_aliasing() cpu_has_dc_aliases
+
+#endif
diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig
index d54464021a61..760fb541ecd2 100644
--- a/arch/nios2/Kconfig
+++ b/arch/nios2/Kconfig
@@ -2,6 +2,7 @@
config NIOS2
def_bool y
select ARCH_32BIT_OFF_T
+ select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_DMA_PREP_COHERENT
select ARCH_HAS_SYNC_DMA_FOR_CPU
select ARCH_HAS_SYNC_DMA_FOR_DEVICE
diff --git a/arch/nios2/include/asm/cachetype.h b/arch/nios2/include/asm/cachetype.h
new file mode 100644
index 000000000000..eb9c416b8a1c
--- /dev/null
+++ b/arch/nios2/include/asm/cachetype.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_NIOS2_CACHETYPE_H
+#define __ASM_NIOS2_CACHETYPE_H
+
+#include <asm/page.h>
+#include <asm/cache.h>
+
+#define cpu_dcache_is_aliasing() (NIOS2_DCACHE_SIZE > PAGE_SIZE)
+
+#endif
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index d14ccc948a29..0f25c227f74b 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -8,6 +8,7 @@ config PARISC
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_SYSCALL_TRACEPOINTS
select ARCH_WANT_FRAME_POINTERS
+ select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_DMA_ALLOC if PA11
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/parisc/include/asm/cachetype.h b/arch/parisc/include/asm/cachetype.h
new file mode 100644
index 000000000000..e0868a1d3c47
--- /dev/null
+++ b/arch/parisc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_PARISC_CACHETYPE_H
+#define __ASM_PARISC_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing() true
+
+#endif
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 7500521b2b98..2ad3e29f0ebe 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -2,6 +2,7 @@
config SUPERH
def_bool y
select ARCH_32BIT_OFF_T
+ select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && MMU
select ARCH_ENABLE_MEMORY_HOTREMOVE if SPARSEMEM && MMU
select ARCH_HAVE_NMI_SAFE_CMPXCHG if (GUSA_RB || CPU_SH4A)
diff --git a/arch/sh/include/asm/cachetype.h b/arch/sh/include/asm/cachetype.h
new file mode 100644
index 000000000000..a5fffe536068
--- /dev/null
+++ b/arch/sh/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_SH_CACHETYPE_H
+#define __ASM_SH_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing() true
+
+#endif
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 49849790e66d..5ba627da15d7 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -13,6 +13,7 @@ config 64BIT
config SPARC
bool
default y
+ select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_MIGHT_HAVE_PC_PARPORT if SPARC64 && PCI
select ARCH_MIGHT_HAVE_PC_SERIO
select DMA_OPS
diff --git a/arch/sparc/include/asm/cachetype.h b/arch/sparc/include/asm/cachetype.h
new file mode 100644
index 000000000000..caf1c0045892
--- /dev/null
+++ b/arch/sparc/include/asm/cachetype.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_SPARC_CACHETYPE_H
+#define __ASM_SPARC_CACHETYPE_H
+
+#include <asm/page.h>
+
+#ifdef CONFIG_SPARC32
+extern int vac_cache_size;
+#define cpu_dcache_is_aliasing() (vac_cache_size > PAGE_SIZE)
+#else
+#define cpu_dcache_is_aliasing() (L1DCACHE_SIZE > PAGE_SIZE)
+#endif
+
+#endif
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index 7d792077e5fd..2dfde54d1a84 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -2,6 +2,7 @@
config XTENSA
def_bool y
select ARCH_32BIT_OFF_T
+ select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_BINFMT_FLAT if !MMU
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_HAS_DEBUG_VM_PGTABLE
diff --git a/arch/xtensa/include/asm/cachetype.h b/arch/xtensa/include/asm/cachetype.h
new file mode 100644
index 000000000000..51bd49e2a1c5
--- /dev/null
+++ b/arch/xtensa/include/asm/cachetype.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_XTENSA_CACHETYPE_H
+#define __ASM_XTENSA_CACHETYPE_H
+
+#include <asm/cache.h>
+#include <asm/page.h>
+
+#define cpu_dcache_is_aliasing() (DCACHE_WAY_SIZE > PAGE_SIZE)
+
+#endif
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index d504eb4b49ab..2cb15fe4fe12 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -138,4 +138,10 @@ static inline int get_cpu_cacheinfo_id(int cpu, int level)
#define use_arch_cache_info() (false)
#endif

+#ifndef CONFIG_ARCH_HAS_CPU_CACHE_ALIASING
+#define cpu_dcache_is_aliasing() false
+#else
+#include <asm/cachetype.h>
+#endif
+
#endif /* _LINUX_CACHEINFO_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 57cd378c73d6..db09c9ad15c9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1016,6 +1016,12 @@ config IDLE_PAGE_TRACKING
See Documentation/admin-guide/mm/idle_page_tracking.rst for
more details.

+# Architectures which implement cpu_dcache_is_aliasing() to query
+# whether the data caches are aliased (VIVT or VIPT with dcache
+# aliasing) need to select this.
+config ARCH_HAS_CPU_CACHE_ALIASING
+ bool
+
config ARCH_HAS_CACHE_LINE_SIZE
bool

--
2.39.2


2024-01-31 16:28:43

by Mathieu Desnoyers

[permalink] [raw]
Subject: [RFC PATCH v3 1/4] dm: Treat alloc_dax failure as non-fatal

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dm alloc_dev()
to treat alloc_dax() failure as non-fatal.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Suggested-by: Dan Williams <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: [email protected]
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
drivers/md/dm.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 23c32cd1f1d8..f90743a94da9 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2126,12 +2126,12 @@ static struct mapped_device *alloc_dev(int minor)
md->dax_dev = alloc_dax(md, &dm_dax_ops);
if (IS_ERR(md->dax_dev)) {
md->dax_dev = NULL;
- goto bad;
+ } else {
+ set_dax_nocache(md->dax_dev);
+ set_dax_nomc(md->dax_dev);
+ if (dax_add_host(md->dax_dev, md->disk))
+ goto bad;
}
- set_dax_nocache(md->dax_dev);
- set_dax_nomc(md->dax_dev);
- if (dax_add_host(md->dax_dev, md->disk))
- goto bad;
}

format_dev_t(md->name, MKDEV(_major, minor));
--
2.39.2


2024-01-31 16:28:51

by Mathieu Desnoyers

[permalink] [raw]
Subject: [RFC PATCH v3 4/4] dax: Fix incorrect list of data cache aliasing architectures

commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
prevents DAX from building on architectures with virtually aliased
dcache with:

depends on !(ARM || MIPS || SPARC)

This check is too broad (e.g. recent ARMv7 don't have virtually aliased
dcaches), and also misses many other architectures with virtually
aliased data cache.

This is a regression introduced in the v5.13 Linux kernel where the
dax mount option is removed for 32-bit ARMv7 boards which have no data
cache aliasing, and therefore should work fine with FS_DAX.

This was turned into the following check in alloc_dax() by a preparatory
change:

if (IS_ENABLED(CONFIG_ARM) ||
IS_ENABLED(CONFIG_MIPS) ||
IS_ENABLED(CONFIG_SPARC))
return NULL;

Use cpu_dcache_is_aliasing() instead to figure out whether the environment
has aliasing data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
drivers/dax/super.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e9f397b8a5a3..8c3a6e8e6334 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -13,6 +13,7 @@
#include <linux/uio.h>
#include <linux/dax.h>
#include <linux/fs.h>
+#include <linux/cacheinfo.h>
#include "dax-private.h"

/**
@@ -446,9 +447,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
int minor;

/* Unavailable on architectures with virtually aliased data caches. */
- if (IS_ENABLED(CONFIG_ARM) ||
- IS_ENABLED(CONFIG_MIPS) ||
- IS_ENABLED(CONFIG_SPARC))
+ if (cpu_dcache_is_aliasing())
return NULL;

if (WARN_ON_ONCE(ops && !ops->zero_page_range))
--
2.39.2


2024-01-31 17:59:59

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/4] Introduce cpu_dcache_is_aliasing() across all architectures

On 2024-01-31 12:17, Christoph Hellwig wrote:
> So this is the third iteration and you still keep only sending patch
> 3 to the list. How is anyone supposed to review it if you don't send
> them all the pieces?

My bad. I was aiming for not spamming mailing lists on unrelated patches. I did
not CC [email protected] on the other patches. But the missing
context is just confusing. And I forgot to CC [email protected]
on patch 3 as well.

You can find the entire series on lore:

https://lore.kernel.org/lkml/[email protected]/T/#t

I'll make sure to copy all lists for all patches in the next
round, namely:

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-01-31 20:43:31

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/4] Introduce cpu_dcache_is_aliasing() across all architectures

Mathieu Desnoyers wrote:
> On 2024-01-31 12:17, Christoph Hellwig wrote:
> > So this is the third iteration and you still keep only sending patch
> > 3 to the list. How is anyone supposed to review it if you don't send
> > them all the pieces?
>
> My bad. I was aiming for not spamming mailing lists on unrelated patches. I did
> not CC [email protected] on the other patches. But the missing
> context is just confusing. And I forgot to CC [email protected]
> on patch 3 as well.
>
> You can find the entire series on lore:
>
> https://lore.kernel.org/lkml/[email protected]/T/#t
>
> I'll make sure to copy all lists for all patches in the next
> round, namely:
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
>
> Thanks,
>
> Mathieu

FWIW there are some developers and lists that want all patches and there
are some developers only want to see the one patch they are directly
copied on. I have a locally maintained list as I have discovered these
different preferences, but maybe MAINTAINERS could carry a flag to
indicate this to save the next person some time?

2024-01-31 21:03:00

by Dan Williams

[permalink] [raw]
Subject: RE: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

Mathieu Desnoyers wrote:
> Replace the following fs/Kconfig:FS_DAX dependency:
>
> depends on !(ARM || MIPS || SPARC)
>
> By a runtime check within alloc_dax().
>
> This is done in preparation for its use by each filesystem supporting
> the "dax" mount option to validate whether DAX is indeed supported.
>
> This is done in preparation for using cpu_dcache_is_aliasing() in a
> following change which will properly support architectures which detect
> data cache aliasing at runtime.
>
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
> Signed-off-by: Mathieu Desnoyers <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Dan Williams <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Cc: Russell King <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
> drivers/dax/super.c | 6 ++++++
> fs/Kconfig | 1 -
> 2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 0da9232ea175..e9f397b8a5a3 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -445,6 +445,12 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
> dev_t devt;
> int minor;
>
> + /* Unavailable on architectures with virtually aliased data caches. */
> + if (IS_ENABLED(CONFIG_ARM) ||
> + IS_ENABLED(CONFIG_MIPS) ||
> + IS_ENABLED(CONFIG_SPARC))
> + return NULL;

This function returns ERR_PTR(), not NULL on failure.

..and I notice this mistake is also made in include/linux/dax.h in the
CONFIG_DAX=n case. That function also mentions:

static inline struct dax_device *alloc_dax(void *private,
const struct dax_operations *ops)
{
/*
* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
* NULL is an error or expected.
*/
return NULL;
}

..and none of the callers validate the result, but now runtime
validation is necessary. I.e. it is not enough to check
IS_ENABLED(CONFIG_DAX) it also needs to check cpu_dcache_is_aliasing().

With that, there are a few more fixup places needed, pmem_attach_disk(),
dcssblk_add_store(), and virtio_fs_setup_dax().

2024-01-31 21:33:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/4] Introduce cpu_dcache_is_aliasing() across all architectures

So this is the third iteration and you still keep only sending patch
3 to the list. How is anyone supposed to review it if you don't send
them all the pieces?


2024-01-31 21:46:05

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

On 2024-01-31 16:02, Dan Williams wrote:
> Mathieu Desnoyers wrote:
>> Replace the following fs/Kconfig:FS_DAX dependency:
>>
>> depends on !(ARM || MIPS || SPARC)
>>
>> By a runtime check within alloc_dax().
>>
>> This is done in preparation for its use by each filesystem supporting
>> the "dax" mount option to validate whether DAX is indeed supported.
>>
>> This is done in preparation for using cpu_dcache_is_aliasing() in a
>> following change which will properly support architectures which detect
>> data cache aliasing at runtime.
>>
>> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
>> Signed-off-by: Mathieu Desnoyers <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Linus Torvalds <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: Dan Williams <[email protected]>
>> Cc: Vishal Verma <[email protected]>
>> Cc: Dave Jiang <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Arnd Bergmann <[email protected]>
>> Cc: Russell King <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> ---
>> drivers/dax/super.c | 6 ++++++
>> fs/Kconfig | 1 -
>> 2 files changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index 0da9232ea175..e9f397b8a5a3 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -445,6 +445,12 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
>> dev_t devt;
>> int minor;
>>
>> + /* Unavailable on architectures with virtually aliased data caches. */
>> + if (IS_ENABLED(CONFIG_ARM) ||
>> + IS_ENABLED(CONFIG_MIPS) ||
>> + IS_ENABLED(CONFIG_SPARC))
>> + return NULL;
>
> This function returns ERR_PTR(), not NULL on failure.

Except that it returns NULL in the CONFIG_DAX=n case as you
noticed below.

>
> ...and I notice this mistake is also made in include/linux/dax.h in the
> CONFIG_DAX=n case. That function also mentions:
>
> static inline struct dax_device *alloc_dax(void *private,
> const struct dax_operations *ops)
> {
> /*
> * Callers should check IS_ENABLED(CONFIG_DAX) to know if this
> * NULL is an error or expected.
> */
> return NULL;
> }
>
> ...and none of the callers validate the result, but now runtime
> validation is necessary. I.e. it is not enough to check
> IS_ENABLED(CONFIG_DAX) it also needs to check cpu_dcache_is_aliasing().

If the callers select DAX in their Kconfig, then they don't have to
explicitly check for IS_ENABLED(CONFIG_DAX). Things change for the
introduced runtime check though.

>
> With that, there are a few more fixup places needed, pmem_attach_disk(),
> dcssblk_add_store(), and virtio_fs_setup_dax().

Which approach should we take then ? Should we:

A) Keep returning NULL from alloc_dax() for both
cpu_dcache_is_aliasing() and CONFIG_DAX=n, and use IS_ERR_OR_NULL()
in the caller. If we do this, then the callers need to somehow
translate this NULL into a negative error value, or

B) Replace this NULL return value in both cases by a ERR_PTR() (which
error value should we return ?).

I would favor approach B) which appears more robust and introduces
fewer changes. If we go for that approach do we still need to change
the callers ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-01-31 22:35:45

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

Mathieu Desnoyers wrote:
> On 2024-01-31 16:02, Dan Williams wrote:
> > Mathieu Desnoyers wrote:
> >> Replace the following fs/Kconfig:FS_DAX dependency:
> >>
> >> depends on !(ARM || MIPS || SPARC)
> >>
> >> By a runtime check within alloc_dax().
> >>
> >> This is done in preparation for its use by each filesystem supporting
> >> the "dax" mount option to validate whether DAX is indeed supported.
> >>
> >> This is done in preparation for using cpu_dcache_is_aliasing() in a
> >> following change which will properly support architectures which detect
> >> data cache aliasing at runtime.
> >>
> >> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
> >> Signed-off-by: Mathieu Desnoyers <[email protected]>
> >> Cc: Andrew Morton <[email protected]>
> >> Cc: Linus Torvalds <[email protected]>
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: Dan Williams <[email protected]>
> >> Cc: Vishal Verma <[email protected]>
> >> Cc: Dave Jiang <[email protected]>
> >> Cc: Matthew Wilcox <[email protected]>
> >> Cc: Arnd Bergmann <[email protected]>
> >> Cc: Russell King <[email protected]>
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> ---
> >> drivers/dax/super.c | 6 ++++++
> >> fs/Kconfig | 1 -
> >> 2 files changed, 6 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> >> index 0da9232ea175..e9f397b8a5a3 100644
> >> --- a/drivers/dax/super.c
> >> +++ b/drivers/dax/super.c
> >> @@ -445,6 +445,12 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
> >> dev_t devt;
> >> int minor;
> >>
> >> + /* Unavailable on architectures with virtually aliased data caches. */
> >> + if (IS_ENABLED(CONFIG_ARM) ||
> >> + IS_ENABLED(CONFIG_MIPS) ||
> >> + IS_ENABLED(CONFIG_SPARC))
> >> + return NULL;
> >
> > This function returns ERR_PTR(), not NULL on failure.
>
> Except that it returns NULL in the CONFIG_DAX=n case as you
> noticed below.
>
> >
> > ...and I notice this mistake is also made in include/linux/dax.h in the
> > CONFIG_DAX=n case. That function also mentions:
> >
> > static inline struct dax_device *alloc_dax(void *private,
> > const struct dax_operations *ops)
> > {
> > /*
> > * Callers should check IS_ENABLED(CONFIG_DAX) to know if this
> > * NULL is an error or expected.
> > */
> > return NULL;
> > }
> >
> > ...and none of the callers validate the result, but now runtime
> > validation is necessary. I.e. it is not enough to check
> > IS_ENABLED(CONFIG_DAX) it also needs to check cpu_dcache_is_aliasing().
>
> If the callers select DAX in their Kconfig, then they don't have to
> explicitly check for IS_ENABLED(CONFIG_DAX). Things change for the
> introduced runtime check though.
>
> >
> > With that, there are a few more fixup places needed, pmem_attach_disk(),
> > dcssblk_add_store(), and virtio_fs_setup_dax().
>
> Which approach should we take then ? Should we:
>
> A) Keep returning NULL from alloc_dax() for both
> cpu_dcache_is_aliasing() and CONFIG_DAX=n, and use IS_ERR_OR_NULL()
> in the caller. If we do this, then the callers need to somehow
> translate this NULL into a negative error value, or
>
> B) Replace this NULL return value in both cases by a ERR_PTR() (which
> error value should we return ?).
>
> I would favor approach B) which appears more robust and introduces
> fewer changes. If we go for that approach do we still need to change
> the callers ?

I agree approach B is the way to go, but that still requires these
fixups, feel free to steal these hunks and split them into patches:

Co-developed-by: Dan Williams <[email protected]>
Signed-off-by: Dan Williams <[email protected]>

..but note they are compile-tested only. They assume that alloc_dax()
returns ERR_PTR(-EOPNOTSUPP) when the arch support is missing, and I
wrote them quickly so I might have missed something.

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index f4b635526345..254d3b1e420e 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -322,7 +322,7 @@ EXPORT_SYMBOL_GPL(dax_alive);
*/
void kill_dax(struct dax_device *dax_dev)
{
- if (!dax_dev)
+ if (IS_ERR_OR_NULL(dax_dev))
return;

if (dax_dev->holder_data != NULL)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..b69c9e442cf4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev,
dax_dev = alloc_dax(pmem, &pmem_dax_ops);
if (IS_ERR(dax_dev)) {
rc = PTR_ERR(dax_dev);
- goto out;
+ if (rc != -EOPNOTSUPP)
+ goto out;
+ } else {
+ set_dax_nocache(dax_dev);
+ set_dax_nomc(dax_dev);
+ if (is_nvdimm_sync(nd_region))
+ set_dax_synchronous(dax_dev);
+ rc = dax_add_host(dax_dev, disk);
+ if (rc)
+ goto out_cleanup_dax;
+ dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
+ pmem->dax_dev = dax_dev;
}
- set_dax_nocache(dax_dev);
- set_dax_nomc(dax_dev);
- if (is_nvdimm_sync(nd_region))
- set_dax_synchronous(dax_dev);
- rc = dax_add_host(dax_dev, disk);
- if (rc)
- goto out_cleanup_dax;
- dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
- pmem->dax_dev = dax_dev;

rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 4b7ecd4fd431..f911e58a24dd 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -681,12 +681,14 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
if (IS_ERR(dev_info->dax_dev)) {
rc = PTR_ERR(dev_info->dax_dev);
dev_info->dax_dev = NULL;
- goto put_dev;
+ if (rc != -EOPNOTSUPP)
+ goto put_dev;
+ } else {
+ set_dax_synchronous(dev_info->dax_dev);
+ rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
+ if (rc)
+ goto out_dax;
}
- set_dax_synchronous(dev_info->dax_dev);
- rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
- if (rc)
- goto out_dax;

get_device(&dev_info->dev);
rc = device_add_disk(&dev_info->dev, dev_info->gd, NULL);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1da92ce..11053a70f5ab 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -16,6 +16,7 @@
#include <linux/fs_context.h>
#include <linux/fs_parser.h>
#include <linux/highmem.h>
+#include <linux/cleanup.h>
#include <linux/uio.h>
#include "fuse_i.h"

@@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data)
put_dax(dax_dev);
}

+DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T))
+
static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
{
+ struct dax_device *dax_dev __free(cleanup_dax) = NULL;
struct virtio_shm_region cache_reg;
struct dev_pagemap *pgmap;
bool have_cache;
@@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
if (!IS_ENABLED(CONFIG_FUSE_DAX))
return 0;

+ dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
+ if (IS_ERR(dax_dev)) {
+ int rc = PTR_ERR(dax_dev);
+
+ if (rc == -EOPNOTSUPP)
+ return 0;
+ return rc;
+ }
+
/* Get cache region */
have_cache = virtio_get_shm_region(vdev, &cache_reg,
(u8)VIRTIO_FS_SHMCAP_ID_CACHE);
@@ -849,10 +862,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 0x%llx\n",
__func__, fs->window_kaddr, cache_reg.addr, cache_reg.len);

- fs->dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
- if (IS_ERR(fs->dax_dev))
- return PTR_ERR(fs->dax_dev);
-
+ fs->dax_dev = no_free_ptr(dax_dev);
return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax,
fs->dax_dev);
}

2024-02-01 16:15:01

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

On 2024-01-31 17:18, Dan Williams wrote:
> Mathieu Desnoyers wrote:
>> On 2024-01-31 16:02, Dan Williams wrote:
>>> Mathieu Desnoyers wrote:
>>>> Replace the following fs/Kconfig:FS_DAX dependency:
>>>>
>>>> depends on !(ARM || MIPS || SPARC)
>>>>
>>>> By a runtime check within alloc_dax().
>>>>
>>>> This is done in preparation for its use by each filesystem supporting
>>>> the "dax" mount option to validate whether DAX is indeed supported.
>>>>
>>>> This is done in preparation for using cpu_dcache_is_aliasing() in a
>>>> following change which will properly support architectures which detect
>>>> data cache aliasing at runtime.
>>>>
>>>> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
>>>> Signed-off-by: Mathieu Desnoyers <[email protected]>
>>>> Cc: Andrew Morton <[email protected]>
>>>> Cc: Linus Torvalds <[email protected]>
>>>> Cc: [email protected]
>>>> Cc: [email protected]
>>>> Cc: Dan Williams <[email protected]>
>>>> Cc: Vishal Verma <[email protected]>
>>>> Cc: Dave Jiang <[email protected]>
>>>> Cc: Matthew Wilcox <[email protected]>
>>>> Cc: Arnd Bergmann <[email protected]>
>>>> Cc: Russell King <[email protected]>
>>>> Cc: [email protected]
>>>> Cc: [email protected]
>>>> Cc: [email protected]
>>>> Cc: [email protected]
>>>> ---
>>>> drivers/dax/super.c | 6 ++++++
>>>> fs/Kconfig | 1 -
>>>> 2 files changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>>>> index 0da9232ea175..e9f397b8a5a3 100644
>>>> --- a/drivers/dax/super.c
>>>> +++ b/drivers/dax/super.c
>>>> @@ -445,6 +445,12 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
>>>> dev_t devt;
>>>> int minor;
>>>>
>>>> + /* Unavailable on architectures with virtually aliased data caches. */
>>>> + if (IS_ENABLED(CONFIG_ARM) ||
>>>> + IS_ENABLED(CONFIG_MIPS) ||
>>>> + IS_ENABLED(CONFIG_SPARC))
>>>> + return NULL;
>>>
>>> This function returns ERR_PTR(), not NULL on failure.
>>
>> Except that it returns NULL in the CONFIG_DAX=n case as you
>> noticed below.
>>
>>>
>>> ...and I notice this mistake is also made in include/linux/dax.h in the
>>> CONFIG_DAX=n case. That function also mentions:
>>>
>>> static inline struct dax_device *alloc_dax(void *private,
>>> const struct dax_operations *ops)
>>> {
>>> /*
>>> * Callers should check IS_ENABLED(CONFIG_DAX) to know if this
>>> * NULL is an error or expected.
>>> */
>>> return NULL;
>>> }
>>>
>>> ...and none of the callers validate the result, but now runtime
>>> validation is necessary. I.e. it is not enough to check
>>> IS_ENABLED(CONFIG_DAX) it also needs to check cpu_dcache_is_aliasing().
>>
>> If the callers select DAX in their Kconfig, then they don't have to
>> explicitly check for IS_ENABLED(CONFIG_DAX). Things change for the
>> introduced runtime check though.
>>
>>>
>>> With that, there are a few more fixup places needed, pmem_attach_disk(),
>>> dcssblk_add_store(), and virtio_fs_setup_dax().
>>
>> Which approach should we take then ? Should we:
>>
>> A) Keep returning NULL from alloc_dax() for both
>> cpu_dcache_is_aliasing() and CONFIG_DAX=n, and use IS_ERR_OR_NULL()
>> in the caller. If we do this, then the callers need to somehow
>> translate this NULL into a negative error value, or
>>
>> B) Replace this NULL return value in both cases by a ERR_PTR() (which
>> error value should we return ?).
>>
>> I would favor approach B) which appears more robust and introduces
>> fewer changes. If we go for that approach do we still need to change
>> the callers ?
>
> I agree approach B is the way to go, but that still requires these
> fixups, feel free to steal these hunks and split them into patches:
>
> Co-developed-by: Dan Williams <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
>
> ...but note they are compile-tested only. They assume that alloc_dax()
> returns ERR_PTR(-EOPNOTSUPP) when the arch support is missing, and I
> wrote them quickly so I might have missed something.

The change for -EOPNOTSUPP in header and implementation would look like
this (more questions below):

diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..df2d52b8a245 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev)
static inline struct dax_device *alloc_dax(void *private,
const struct dax_operations *ops)
{
- /*
- * Callers should check IS_ENABLED(CONFIG_DAX) to know if this
- * NULL is an error or expected.
- */
- return NULL;
+ return ERR_PTR(-EOPNOTSUPP);
}
static inline void put_dax(struct dax_device *dax_dev)
{
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 8c3a6e8e6334..c1cf6f0bbe12 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -448,7 +448,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)

/* Unavailable on architectures with virtually aliased data caches. */
if (cpu_dcache_is_aliasing())
- return NULL;
+ return ERR_PTR(-EOPNOTSUPP);

if (WARN_ON_ONCE(ops && !ops->zero_page_range))
return ERR_PTR(-EINVAL);

>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index f4b635526345..254d3b1e420e 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -322,7 +322,7 @@ EXPORT_SYMBOL_GPL(dax_alive);
> */
> void kill_dax(struct dax_device *dax_dev)
> {
> - if (!dax_dev)
> + if (IS_ERR_OR_NULL(dax_dev))

I am tempted to just leave the "if (!dax_dev)" check here, because
many other functions of this API are only protected by a NULL
pointer check. I would hate to forget to convert one check in this
change, and I don't think it simplifies anything.

The alternative route I intend to take is to audit all callers
of alloc_dax() and make sure they all save the alloc_dax() return
value in a struct dax_device * local variable first for the sake
of checking for IS_ERR(). This will leave the xyz->dax_dev pointer
initialized to NULL in the error case and simplify the rest of
error checking.


> return;
>
> if (dax_dev->holder_data != NULL)
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 4e8fdcb3f1c8..b69c9e442cf4 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev,
> dax_dev = alloc_dax(pmem, &pmem_dax_ops);
> if (IS_ERR(dax_dev)) {
> rc = PTR_ERR(dax_dev);
> - goto out;
> + if (rc != -EOPNOTSUPP)
> + goto out;

If I compare the before / after this change, if previously
pmem_attach_disk() was called in a configuration with FS_DAX=n, it would
result in a NULL pointer dereference.

This would be an error handling fix all by itself. Do we really want
to return successfully if dax is unsupported, or should we return
an error here ?


> + } else {
> + set_dax_nocache(dax_dev);
> + set_dax_nomc(dax_dev);
> + if (is_nvdimm_sync(nd_region))
> + set_dax_synchronous(dax_dev);
> + rc = dax_add_host(dax_dev, disk);
> + if (rc)
> + goto out_cleanup_dax;
> + dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
> + pmem->dax_dev = dax_dev;
> }
> - set_dax_nocache(dax_dev);
> - set_dax_nomc(dax_dev);
> - if (is_nvdimm_sync(nd_region))
> - set_dax_synchronous(dax_dev);
> - rc = dax_add_host(dax_dev, disk);
> - if (rc)
> - goto out_cleanup_dax;
> - dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
> - pmem->dax_dev = dax_dev;
>
> rc = device_add_disk(dev, disk, pmem_attribute_groups);
> if (rc)
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index 4b7ecd4fd431..f911e58a24dd 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -681,12 +681,14 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
> if (IS_ERR(dev_info->dax_dev)) {
> rc = PTR_ERR(dev_info->dax_dev);
> dev_info->dax_dev = NULL;
> - goto put_dev;
> + if (rc != -EOPNOTSUPP)
> + goto put_dev;

config DCSSBLK selects FS_DAX_LIMITED and DAX.

I'm not sure what selecting DAX is trying to achieve here, because the
Kconfig option is "FS_DAX".

So depending on the real motivation behind this select, we may want to
consider failure rather than success in the -EOPNOTSUPP case.

> + } else {
> + set_dax_synchronous(dev_info->dax_dev);
> + rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
> + if (rc)
> + goto out_dax;
> }
> - set_dax_synchronous(dev_info->dax_dev);
> - rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
> - if (rc)
> - goto out_dax;
>
> get_device(&dev_info->dev);
> rc = device_add_disk(&dev_info->dev, dev_info->gd, NULL);

My own changes, if we want failure on -EOPNOTSUPP:

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 4b7ecd4fd431..f363c1d51d9a 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
+ struct dax_device *dax_dev;
char *local_buf;
unsigned long seg_byte_size;

@@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
if (rc)
goto put_dev;

- dev_info->dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
- if (IS_ERR(dev_info->dax_dev)) {
- rc = PTR_ERR(dev_info->dax_dev);
- dev_info->dax_dev = NULL;
+ dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
+ if (IS_ERR(dax_dev)) {
+ rc = PTR_ERR(dax_dev);
goto put_dev;
}
- set_dax_synchronous(dev_info->dax_dev);
+ set_dax_synchronous(dax_dev);
+ dev_info->dax_dev = dax_dev;
rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
if (rc)
goto out_dax;


> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 5f1be1da92ce..11053a70f5ab 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -16,6 +16,7 @@
> #include <linux/fs_context.h>
> #include <linux/fs_parser.h>
> #include <linux/highmem.h>
> +#include <linux/cleanup.h>
> #include <linux/uio.h>
> #include "fuse_i.h"
>
> @@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data)
> put_dax(dax_dev);
> }
>
> +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T))
> +
> static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)

So either I'm completely missing how ownership works in this function, or
we should be really concerned about the fact that it does no actual
cleanup of anything on any error.

I would be tempted to first refactor this function without using cleanup.h
so those fixes can be easily backported to older kernels (?)

Here what I'm seeing so far:

- devm_release_mem_region() is never called after devm_request_mem_region(). Not
on error, neither on teardown,
- pgmap is never freed on error after devm_kzalloc.

> {
> + struct dax_device *dax_dev __free(cleanup_dax) = NULL;
> struct virtio_shm_region cache_reg;
> struct dev_pagemap *pgmap;
> bool have_cache;
> @@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> if (!IS_ENABLED(CONFIG_FUSE_DAX))
> return 0;
>
> + dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
> + if (IS_ERR(dax_dev)) {
> + int rc = PTR_ERR(dax_dev);
> +
> + if (rc == -EOPNOTSUPP)
> + return 0;
> + return rc;
> + }

What is gained by moving this allocation here ?

> +
> /* Get cache region */
> have_cache = virtio_get_shm_region(vdev, &cache_reg,
> (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
> @@ -849,10 +862,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 0x%llx\n",
> __func__, fs->window_kaddr, cache_reg.addr, cache_reg.len);
>
> - fs->dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
> - if (IS_ERR(fs->dax_dev))
> - return PTR_ERR(fs->dax_dev);
> -
> + fs->dax_dev = no_free_ptr(dax_dev);
> return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax,
> fs->dax_dev);
> }

In addition I have:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f90743a94da9..86de91b35f4d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device *md)
static struct mapped_device *alloc_dev(int minor)
{
int r, numa_node_id = dm_get_numa_node();
+ struct dax_device *dax_dev;
struct mapped_device *md;
void *old_md;

@@ -2122,16 +2123,13 @@ static struct mapped_device *alloc_dev(int minor)
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);

- if (IS_ENABLED(CONFIG_FS_DAX)) {
- md->dax_dev = alloc_dax(md, &dm_dax_ops);
- if (IS_ERR(md->dax_dev)) {
- md->dax_dev = NULL;
- } else {
- set_dax_nocache(md->dax_dev);
- set_dax_nomc(md->dax_dev);
- if (dax_add_host(md->dax_dev, md->disk))
- goto bad;
- }
+ dax_dev = alloc_dax(md, &dm_dax_ops);
+ if (!IS_ERR(dax_dev)) {
+ set_dax_nocache(dax_dev);
+ set_dax_nomc(dax_dev);
+ md->dax_dev = dax_dev;
+ if (dax_add_host(dax_dev, md->disk))
+ goto bad;
}

format_dev_t(md->name, MKDEV(_major, minor));

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-02-02 14:57:16

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

On 2024-02-01 10:44, Mathieu Desnoyers wrote:
[...]
>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>> index 4e8fdcb3f1c8..b69c9e442cf4 100644
>> --- a/drivers/nvdimm/pmem.c
>> +++ b/drivers/nvdimm/pmem.c
>> @@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev,
>>       dax_dev = alloc_dax(pmem, &pmem_dax_ops);
>>       if (IS_ERR(dax_dev)) {
>>           rc = PTR_ERR(dax_dev);
>> -        goto out;
>> +        if (rc != -EOPNOTSUPP)
>> +            goto out;
>
> If I compare the before / after this change, if previously
> pmem_attach_disk() was called in a configuration with FS_DAX=n, it would
> result in a NULL pointer dereference.

I was wrong. drivers/nvdimm/Kconfig has:

config BLK_DEV_PMEM
select DAX

and

drivers/nvdimm/Makefile has:

obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
nd_pmem-y := pmem.o

which means that anything in pmem.c can assume that alloc_dax() is
implemented.

[...]
>> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
>> index 4b7ecd4fd431..f911e58a24dd 100644
>> --- a/drivers/s390/block/dcssblk.c
>> +++ b/drivers/s390/block/dcssblk.c
>> @@ -681,12 +681,14 @@ dcssblk_add_store(struct device *dev, struct
>> device_attribute *attr, const char
>>       if (IS_ERR(dev_info->dax_dev)) {
>>           rc = PTR_ERR(dev_info->dax_dev);
>>           dev_info->dax_dev = NULL;
>> -        goto put_dev;
>> +        if (rc != -EOPNOTSUPP)
>> +            goto put_dev;
>
> config DCSSBLK selects FS_DAX_LIMITED and DAX.
>
> I'm not sure what selecting DAX is trying to achieve here, because the
> Kconfig option is "FS_DAX".
>
> So depending on the real motivation behind this select, we may want to
> consider failure rather than success in the -EOPNOTSUPP case.
>

I missed that alloc_dax() is implemented as not supported based on
CONFIG_DAX (not CONFIG_FS_DAX).

Therefore DCSSBLK Kconfig does the right thing and always selects DAX,
and thus an implemented version of alloc_dax().

This takes care of two of my open questions at least. :)

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-02-02 16:32:20

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

On 2024-02-01 10:44, Mathieu Desnoyers wrote:
> On 2024-01-31 17:18, Dan Williams wrote:

[...]


>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
>> index 5f1be1da92ce..11053a70f5ab 100644
>> --- a/fs/fuse/virtio_fs.c
>> +++ b/fs/fuse/virtio_fs.c
>> @@ -16,6 +16,7 @@
>>   #include <linux/fs_context.h>
>>   #include <linux/fs_parser.h>
>>   #include <linux/highmem.h>
>> +#include <linux/cleanup.h>
>>   #include <linux/uio.h>
>>   #include "fuse_i.h"
>> @@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data)
>>       put_dax(dax_dev);
>>   }
>> +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T))
>> virtio_fs_cleanup_dax(_T))
>> +
>>   static int virtio_fs_setup_dax(struct virtio_device *vdev, struct
>> virtio_fs *fs)
>
> So either I'm completely missing how ownership works in this function, or
> we should be really concerned about the fact that it does no actual
> cleanup of anything on any error.
[...]
>
> Here what I'm seeing so far:
>
> - devm_release_mem_region() is never called after
> devm_request_mem_region(). Not
>   on error, neither on teardown,
> - pgmap is never freed on error after devm_kzalloc.

I was indeed missing something: the devm_ family of functions
keeps ownership at the device level, so we would not need explicit
teardown.

>
>>   {
>> +    struct dax_device *dax_dev __free(cleanup_dax) = NULL;
>>       struct virtio_shm_region cache_reg;
>>       struct dev_pagemap *pgmap;
>>       bool have_cache;
>> @@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct
>> virtio_device *vdev, struct virtio_fs *fs)
>>       if (!IS_ENABLED(CONFIG_FUSE_DAX))
>>           return 0;
>> +    dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
>> +    if (IS_ERR(dax_dev)) {
>> +        int rc = PTR_ERR(dax_dev);
>> +
>> +        if (rc == -EOPNOTSUPP)
>> +            return 0;
>> +        return rc;
>> +    }
>
> What is gained by moving this allocation here ?

I'm still concerned about moving the call to alloc_dax() before
the setup of the memory region it will use. Are those completely
independent ?

>
>> +
>>       /* Get cache region */
>>       have_cache = virtio_get_shm_region(vdev, &cache_reg,
>>                          (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
>> @@ -849,10 +862,7 @@ static int virtio_fs_setup_dax(struct
>> virtio_device *vdev, struct virtio_fs *fs)
>>       dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len
>> 0x%llx\n",
>>           __func__, fs->window_kaddr, cache_reg.addr, cache_reg.len);
>> -    fs->dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
>> -    if (IS_ERR(fs->dax_dev))
>> -        return PTR_ERR(fs->dax_dev);
>> -
>> +    fs->dax_dev = no_free_ptr(dax_dev);
>>       return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax,
>>                       fs->dax_dev);
>>   }
>

[...]

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-02-02 17:38:16

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

Mathieu Desnoyers wrote:
[..]
> The change for -EOPNOTSUPP in header and implementation would look like
> this (more questions below):
>
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b463502b16e1..df2d52b8a245 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev)
> static inline struct dax_device *alloc_dax(void *private,
> const struct dax_operations *ops)
> {
> - /*
> - * Callers should check IS_ENABLED(CONFIG_DAX) to know if this
> - * NULL is an error or expected.
> - */
> - return NULL;
> + return ERR_PTR(-EOPNOTSUPP);
> }
> static inline void put_dax(struct dax_device *dax_dev)
> {
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 8c3a6e8e6334..c1cf6f0bbe12 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -448,7 +448,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
>
> /* Unavailable on architectures with virtually aliased data caches. */
> if (cpu_dcache_is_aliasing())
> - return NULL;
> + return ERR_PTR(-EOPNOTSUPP);
>
> if (WARN_ON_ONCE(ops && !ops->zero_page_range))
> return ERR_PTR(-EINVAL);
>
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index f4b635526345..254d3b1e420e 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -322,7 +322,7 @@ EXPORT_SYMBOL_GPL(dax_alive);
> > */
> > void kill_dax(struct dax_device *dax_dev)
> > {
> > - if (!dax_dev)
> > + if (IS_ERR_OR_NULL(dax_dev))
>
> I am tempted to just leave the "if (!dax_dev)" check here, because
> many other functions of this API are only protected by a NULL
> pointer check. I would hate to forget to convert one check in this
> change, and I don't think it simplifies anything.

It's not that it simplifies anything, but mistakes fail safely as a
memory leak rather than a crash.

> The alternative route I intend to take is to audit all callers
> of alloc_dax() and make sure they all save the alloc_dax() return
> value in a struct dax_device * local variable first for the sake
> of checking for IS_ERR(). This will leave the xyz->dax_dev pointer
> initialized to NULL in the error case and simplify the rest of
> error checking.

I could maybe get on board with that, but it needs a comment somewhere
about the asymmetric subtlety.

>
>
> > return;
> >
> > if (dax_dev->holder_data != NULL)
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index 4e8fdcb3f1c8..b69c9e442cf4 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev,
> > dax_dev = alloc_dax(pmem, &pmem_dax_ops);
> > if (IS_ERR(dax_dev)) {
> > rc = PTR_ERR(dax_dev);
> > - goto out;
> > + if (rc != -EOPNOTSUPP)
> > + goto out;
>
> If I compare the before / after this change, if previously
> pmem_attach_disk() was called in a configuration with FS_DAX=n, it would
> result in a NULL pointer dereference.

No, alloc_dax() only returns NULL CONFIG_DAX=n case, not the
CONFIG_FS_DAX=n case. So that means that pmem devices on ARM have been
possible without FS_DAX. So, in order for alloc_dax() returning
ERR_PTR(-EOPNOTSUPP) to not regress pmem device availability this error
path needs to be changed.

> This would be an error handling fix all by itself. Do we really want
> to return successfully if dax is unsupported, or should we return
> an error here ?

Per above, there is no error handling fix, and pmem block device
available should not depend on alloc_dax() succeeding.

The real question is what to do about device-dax. I *think* it is not
affected by cpu_dcache aliasing because it never accesses user mappings
through a kernel alias. I doubt device-dax is in use on these platforms,
but we might need another fixup for that if someone screams about the
alloc_dax() behavior change making them lose device-dax access.

> > + } else {
> > + set_dax_nocache(dax_dev);
> > + set_dax_nomc(dax_dev);
> > + if (is_nvdimm_sync(nd_region))
> > + set_dax_synchronous(dax_dev);
> > + rc = dax_add_host(dax_dev, disk);
> > + if (rc)
> > + goto out_cleanup_dax;
> > + dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
> > + pmem->dax_dev = dax_dev;
> > }
> > - set_dax_nocache(dax_dev);
> > - set_dax_nomc(dax_dev);
> > - if (is_nvdimm_sync(nd_region))
> > - set_dax_synchronous(dax_dev);
> > - rc = dax_add_host(dax_dev, disk);
> > - if (rc)
> > - goto out_cleanup_dax;
> > - dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
> > - pmem->dax_dev = dax_dev;
> >
> > rc = device_add_disk(dev, disk, pmem_attribute_groups);
> > if (rc)
> > diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> > index 4b7ecd4fd431..f911e58a24dd 100644
> > --- a/drivers/s390/block/dcssblk.c
> > +++ b/drivers/s390/block/dcssblk.c
> > @@ -681,12 +681,14 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
> > if (IS_ERR(dev_info->dax_dev)) {
> > rc = PTR_ERR(dev_info->dax_dev);
> > dev_info->dax_dev = NULL;
> > - goto put_dev;
> > + if (rc != -EOPNOTSUPP)
> > + goto put_dev;
>
> config DCSSBLK selects FS_DAX_LIMITED and DAX.
>
> I'm not sure what selecting DAX is trying to achieve here, because the
> Kconfig option is "FS_DAX".
>
> So depending on the real motivation behind this select, we may want to
> consider failure rather than success in the -EOPNOTSUPP case.

Ah, true, s390 is a !cpu_dcache_is_aliasing() arch, so it is ok to fail
driver load on alloc_dax() failure as it knows that ERR_PTR(-EOPNOTSUPP)
will never be returned.

>
> > + } else {
> > + set_dax_synchronous(dev_info->dax_dev);
> > + rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
> > + if (rc)
> > + goto out_dax;
> > }
> > - set_dax_synchronous(dev_info->dax_dev);
> > - rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
> > - if (rc)
> > - goto out_dax;
> >
> > get_device(&dev_info->dev);
> > rc = device_add_disk(&dev_info->dev, dev_info->gd, NULL);
>
> My own changes, if we want failure on -EOPNOTSUPP:
>
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index 4b7ecd4fd431..f363c1d51d9a 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
> int rc, i, j, num_of_segments;
> struct dcssblk_dev_info *dev_info;
> struct segment_info *seg_info, *temp;
> + struct dax_device *dax_dev;
> char *local_buf;
> unsigned long seg_byte_size;
>
> @@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
> if (rc)
> goto put_dev;
>
> - dev_info->dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
> - if (IS_ERR(dev_info->dax_dev)) {
> - rc = PTR_ERR(dev_info->dax_dev);
> - dev_info->dax_dev = NULL;
> + dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
> + if (IS_ERR(dax_dev)) {
> + rc = PTR_ERR(dax_dev);
> goto put_dev;
> }
> - set_dax_synchronous(dev_info->dax_dev);
> + set_dax_synchronous(dax_dev);
> + dev_info->dax_dev = dax_dev;

Looks good to me.

> rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
> if (rc)
> goto out_dax;
>
>
> > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > index 5f1be1da92ce..11053a70f5ab 100644
> > --- a/fs/fuse/virtio_fs.c
> > +++ b/fs/fuse/virtio_fs.c
> > @@ -16,6 +16,7 @@
> > #include <linux/fs_context.h>
> > #include <linux/fs_parser.h>
> > #include <linux/highmem.h>
> > +#include <linux/cleanup.h>
> > #include <linux/uio.h>
> > #include "fuse_i.h"
> >
> > @@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data)
> > put_dax(dax_dev);
> > }
> >
> > +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T))
> > +
> > static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
>
> So either I'm completely missing how ownership works in this function, or
> we should be really concerned about the fact that it does no actual
> cleanup of anything on any error.
>
> I would be tempted to first refactor this function without using cleanup.h
> so those fixes can be easily backported to older kernels (?)
>
> Here what I'm seeing so far:
>
> - devm_release_mem_region() is never called after devm_request_mem_region(). Not
> on error, neither on teardown,

devm_release_mem_region() is called from virtio_fs_probe() context. That
means that when virtio_fs_probe() returns an error the driver core will
automatically call devm_request_mem_region().

> - pgmap is never freed on error after devm_kzalloc.

That is what the "devm_" in devm_kzalloc() does, free the memory on
driver-probe failure, or after the driver remove callback is invoked.

>
> > {
> > + struct dax_device *dax_dev __free(cleanup_dax) = NULL;
> > struct virtio_shm_region cache_reg;
> > struct dev_pagemap *pgmap;
> > bool have_cache;
> > @@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > if (!IS_ENABLED(CONFIG_FUSE_DAX))
> > return 0;
> >
> > + dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
> > + if (IS_ERR(dax_dev)) {
> > + int rc = PTR_ERR(dax_dev);
> > +
> > + if (rc == -EOPNOTSUPP)
> > + return 0;
> > + return rc;
> > + }
>
> What is gained by moving this allocation here ?

The gain is to fail early in virtio_fs_setup_dax() since the fundamental
dependency of alloc_dax() success is not met. For example why let the
setup progress to devm_memremap_pages() when alloc_dax() is going to
return ERR_PTR(-EOPNOTSUPP).

>
> > +
> > /* Get cache region */
> > have_cache = virtio_get_shm_region(vdev, &cache_reg,
> > (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
> > @@ -849,10 +862,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 0x%llx\n",
> > __func__, fs->window_kaddr, cache_reg.addr, cache_reg.len);
> >
> > - fs->dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
> > - if (IS_ERR(fs->dax_dev))
> > - return PTR_ERR(fs->dax_dev);
> > -
> > + fs->dax_dev = no_free_ptr(dax_dev);
> > return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax,
> > fs->dax_dev);
> > }
>
> In addition I have:
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index f90743a94da9..86de91b35f4d 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device *md)
> static struct mapped_device *alloc_dev(int minor)
> {
> int r, numa_node_id = dm_get_numa_node();
> + struct dax_device *dax_dev;
> struct mapped_device *md;
> void *old_md;
>
> @@ -2122,16 +2123,13 @@ static struct mapped_device *alloc_dev(int minor)
> md->disk->private_data = md;
> sprintf(md->disk->disk_name, "dm-%d", minor);
>
> - if (IS_ENABLED(CONFIG_FS_DAX)) {
> - md->dax_dev = alloc_dax(md, &dm_dax_ops);
> - if (IS_ERR(md->dax_dev)) {
> - md->dax_dev = NULL;
> - } else {
> - set_dax_nocache(md->dax_dev);
> - set_dax_nomc(md->dax_dev);
> - if (dax_add_host(md->dax_dev, md->disk))
> - goto bad;
> - }
> + dax_dev = alloc_dax(md, &dm_dax_ops);
> + if (!IS_ERR(dax_dev)) {
> + set_dax_nocache(dax_dev);
> + set_dax_nomc(dax_dev);
> + md->dax_dev = dax_dev;
> + if (dax_add_host(dax_dev, md->disk))
> + goto bad;

Looks good.

2024-02-02 19:29:24

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

On 2024-02-02 12:37, Dan Williams wrote:
> Mathieu Desnoyers wrote:
[...]
>>
>
>> The alternative route I intend to take is to audit all callers
>> of alloc_dax() and make sure they all save the alloc_dax() return
>> value in a struct dax_device * local variable first for the sake
>> of checking for IS_ERR(). This will leave the xyz->dax_dev pointer
>> initialized to NULL in the error case and simplify the rest of
>> error checking.
>
> I could maybe get on board with that, but it needs a comment somewhere
> about the asymmetric subtlety.

Is this "somewhere" at every alloc_dax() call site, or do you have
something else in mind ?

>
>>
>>
>>> return;
>>>
>>> if (dax_dev->holder_data != NULL)
>>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>>> index 4e8fdcb3f1c8..b69c9e442cf4 100644
>>> --- a/drivers/nvdimm/pmem.c
>>> +++ b/drivers/nvdimm/pmem.c
>>> @@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev,
>>> dax_dev = alloc_dax(pmem, &pmem_dax_ops);
>>> if (IS_ERR(dax_dev)) {
>>> rc = PTR_ERR(dax_dev);
>>> - goto out;
>>> + if (rc != -EOPNOTSUPP)
>>> + goto out;
>>
>> If I compare the before / after this change, if previously
>> pmem_attach_disk() was called in a configuration with FS_DAX=n, it would
>> result in a NULL pointer dereference.
>
> No, alloc_dax() only returns NULL CONFIG_DAX=n case, not the
> CONFIG_FS_DAX=n case.

Indeed, I was wrong there.

> So that means that pmem devices on ARM have been
> possible without FS_DAX. So, in order for alloc_dax() returning
> ERR_PTR(-EOPNOTSUPP) to not regress pmem device availability this error
> path needs to be changed.
Good point. We're moving the depends on !(ARM || MIPS |PARC) from FS_DAX
Kconfig to a runtime check in alloc_dax(), which is used whenever DAX=y,
which includes configurations that had FS_DAX=n previously.

I'll change the error path in pmem_attack_disk to treat -EOPNOTSUPP
alloc_dax() return value as non-fatal.

>
>> This would be an error handling fix all by itself. Do we really want
>> to return successfully if dax is unsupported, or should we return
>> an error here ?
>
> Per above, there is no error handling fix, and pmem block device
> available should not depend on alloc_dax() succeeding.

I agree on treating alloc_dax() failure as non-fatal. There is
however one error handling fix to nvdimm/pmem which I plan to
introduce as an initial patch before this change:

nvdimm/pmem: Fix leak on dax_add_host() failure

Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
before setting pmem->dax_dev, which therefore issues the two following
calls on NULL pointers:

out_cleanup_dax:
kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);

Signed-off-by: Mathieu Desnoyers <[email protected]>

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..9fe358090720 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev,
set_dax_nomc(dax_dev);
if (is_nvdimm_sync(nd_region))
set_dax_synchronous(dax_dev);
+ pmem->dax_dev = dax_dev;
rc = dax_add_host(dax_dev, disk);
if (rc)
goto out_cleanup_dax;
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
- pmem->dax_dev = dax_dev;
-
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;

>
> The real question is what to do about device-dax. I *think* it is not
> affected by cpu_dcache aliasing because it never accesses user mappings
> through a kernel alias. I doubt device-dax is in use on these platforms,
> but we might need another fixup for that if someone screams about the
> alloc_dax() behavior change making them lose device-dax access.

By "device-dax", I understand you mean drivers/dax/Kconfig:DEV_DAX.

Based on your analysis, is alloc_dax() still the right spot where
to place this runtime check ? Which call sites are responsible
for invoking alloc_dax() for device-dax ?

If we know which call sites do not intend to use the kernel linear
mapping, we could introduce a flag (or a new variant of the alloc_dax()
API) that would either enforce or skip the check.

[...]

>>
>> Here what I'm seeing so far:
>>
>> - devm_release_mem_region() is never called after devm_request_mem_region(). Not
>> on error, neither on teardown,
>
> devm_release_mem_region() is called from virtio_fs_probe() context. That

I guess you mean "devm_request_mem_region()" here.

> means that when virtio_fs_probe() returns an error the driver core will
> automatically call devm_request_mem_region().

And "devm_release_mem_region()" here.

>
>> - pgmap is never freed on error after devm_kzalloc.
>
> That is what the "devm_" in devm_kzalloc() does, free the memory on
> driver-probe failure, or after the driver remove callback is invoked.

Got it.

>
>>
>>> {
>>> + struct dax_device *dax_dev __free(cleanup_dax) = NULL;
>>> struct virtio_shm_region cache_reg;
>>> struct dev_pagemap *pgmap;
>>> bool have_cache;
>>> @@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
>>> if (!IS_ENABLED(CONFIG_FUSE_DAX))
>>> return 0;
>>>
>>> + dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
>>> + if (IS_ERR(dax_dev)) {
>>> + int rc = PTR_ERR(dax_dev);
>>> +
>>> + if (rc == -EOPNOTSUPP)
>>> + return 0;
>>> + return rc;
>>> + }
>>
>> What is gained by moving this allocation here ?
>
> The gain is to fail early in virtio_fs_setup_dax() since the fundamental
> dependency of alloc_dax() success is not met. For example why let the
> setup progress to devm_memremap_pages() when alloc_dax() is going to
> return ERR_PTR(-EOPNOTSUPP).

What I don't know is whether there is a dependency requiring to do
devm_request_mem_region(), devm_kzalloc(), devm_memremap_pages()
before calling alloc_dax() ?

Those 3 calls are used to populate:

fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
fs->window_len = (phys_addr_t) cache_reg.len;

and then alloc_dax() takes "fs" as private data parameter. So it's
unclear to me whether we can swap the invocation order. I suspect
that it is not an issue because it is only used to populate
dax_dev->private, but I prefer to confirm this with you just to be
on the safe side.

[...]

Thanks,

Mathieu



--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-02-02 19:42:07

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

Mathieu Desnoyers wrote:
> On 2024-02-02 12:37, Dan Williams wrote:
> > Mathieu Desnoyers wrote:
> [...]
> >>
> >
> >> The alternative route I intend to take is to audit all callers
> >> of alloc_dax() and make sure they all save the alloc_dax() return
> >> value in a struct dax_device * local variable first for the sake
> >> of checking for IS_ERR(). This will leave the xyz->dax_dev pointer
> >> initialized to NULL in the error case and simplify the rest of
> >> error checking.
> >
> > I could maybe get on board with that, but it needs a comment somewhere
> > about the asymmetric subtlety.
>
> Is this "somewhere" at every alloc_dax() call site, or do you have
> something else in mind ?

At least kill_dax() should mention the asymmetry I think.

>
> >
> >>
> >>
> >>> return;
> >>>
> >>> if (dax_dev->holder_data != NULL)
> >>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> >>> index 4e8fdcb3f1c8..b69c9e442cf4 100644
> >>> --- a/drivers/nvdimm/pmem.c
> >>> +++ b/drivers/nvdimm/pmem.c
> >>> @@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev,
> >>> dax_dev = alloc_dax(pmem, &pmem_dax_ops);
> >>> if (IS_ERR(dax_dev)) {
> >>> rc = PTR_ERR(dax_dev);
> >>> - goto out;
> >>> + if (rc != -EOPNOTSUPP)
> >>> + goto out;
> >>
> >> If I compare the before / after this change, if previously
> >> pmem_attach_disk() was called in a configuration with FS_DAX=n, it would
> >> result in a NULL pointer dereference.
> >
> > No, alloc_dax() only returns NULL CONFIG_DAX=n case, not the
> > CONFIG_FS_DAX=n case.
>
> Indeed, I was wrong there.
>
> > So that means that pmem devices on ARM have been
> > possible without FS_DAX. So, in order for alloc_dax() returning
> > ERR_PTR(-EOPNOTSUPP) to not regress pmem device availability this error
> > path needs to be changed.
> Good point. We're moving the depends on !(ARM || MIPS |PARC) from FS_DAX
> Kconfig to a runtime check in alloc_dax(), which is used whenever DAX=y,
> which includes configurations that had FS_DAX=n previously.
>
> I'll change the error path in pmem_attack_disk to treat -EOPNOTSUPP
> alloc_dax() return value as non-fatal.
>
> >
> >> This would be an error handling fix all by itself. Do we really want
> >> to return successfully if dax is unsupported, or should we return
> >> an error here ?
> >
> > Per above, there is no error handling fix, and pmem block device
> > available should not depend on alloc_dax() succeeding.
>
> I agree on treating alloc_dax() failure as non-fatal. There is
> however one error handling fix to nvdimm/pmem which I plan to
> introduce as an initial patch before this change:
>
> nvdimm/pmem: Fix leak on dax_add_host() failure
>
> Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
> before setting pmem->dax_dev, which therefore issues the two following
> calls on NULL pointers:
>
> out_cleanup_dax:
> kill_dax(pmem->dax_dev);
> put_dax(pmem->dax_dev);
>
> Signed-off-by: Mathieu Desnoyers <[email protected]>
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 4e8fdcb3f1c8..9fe358090720 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev,
> set_dax_nomc(dax_dev);
> if (is_nvdimm_sync(nd_region))
> set_dax_synchronous(dax_dev);
> + pmem->dax_dev = dax_dev;
> rc = dax_add_host(dax_dev, disk);
> if (rc)
> goto out_cleanup_dax;
> dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
> - pmem->dax_dev = dax_dev;
> -
> rc = device_add_disk(dev, disk, pmem_attribute_groups);
> if (rc)
> goto out_remove_host;

Yup, looks good.

> > The real question is what to do about device-dax. I *think* it is not
> > affected by cpu_dcache aliasing because it never accesses user mappings
> > through a kernel alias. I doubt device-dax is in use on these platforms,
> > but we might need another fixup for that if someone screams about the
> > alloc_dax() behavior change making them lose device-dax access.
>
> By "device-dax", I understand you mean drivers/dax/Kconfig:DEV_DAX.
>
> Based on your analysis, is alloc_dax() still the right spot where
> to place this runtime check ? Which call sites are responsible
> for invoking alloc_dax() for device-dax ?

That is in devm_create_dev_dax().

> If we know which call sites do not intend to use the kernel linear
> mapping, we could introduce a flag (or a new variant of the alloc_dax()
> API) that would either enforce or skip the check.

Hmmm, it looks like there is already a natural flag for that. If
alloc_dax() is passed a NULL operations pointer it means there are no
kernel usages of the aliased mapping. That actually fits rather nicely.

[..]
> >>> @@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> >>> if (!IS_ENABLED(CONFIG_FUSE_DAX))
> >>> return 0;
> >>>
> >>> + dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
> >>> + if (IS_ERR(dax_dev)) {
> >>> + int rc = PTR_ERR(dax_dev);
> >>> +
> >>> + if (rc == -EOPNOTSUPP)
> >>> + return 0;
> >>> + return rc;
> >>> + }
> >>
> >> What is gained by moving this allocation here ?
> >
> > The gain is to fail early in virtio_fs_setup_dax() since the fundamental
> > dependency of alloc_dax() success is not met. For example why let the
> > setup progress to devm_memremap_pages() when alloc_dax() is going to
> > return ERR_PTR(-EOPNOTSUPP).
>
> What I don't know is whether there is a dependency requiring to do
> devm_request_mem_region(), devm_kzalloc(), devm_memremap_pages()
> before calling alloc_dax() ?
>
> Those 3 calls are used to populate:
>
> fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
> fs->window_len = (phys_addr_t) cache_reg.len;
>
> and then alloc_dax() takes "fs" as private data parameter. So it's
> unclear to me whether we can swap the invocation order. I suspect
> that it is not an issue because it is only used to populate
> dax_dev->private, but I prefer to confirm this with you just to be
> on the safe side.

Thanks for that. All of those need to be done before the fs goes live
later in virtio_device_ready(), but before that point nothing should be
calling into virtio_fs_dax_ops, so as far as I can see it is safe to
change the order.

2024-02-02 20:04:07

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

On 2024-02-02 14:41, Dan Williams wrote:
> Mathieu Desnoyers wrote:
>> On 2024-02-02 12:37, Dan Williams wrote:
>>> Mathieu Desnoyers wrote:
>> [...]
>>>>
>>>
>>>> The alternative route I intend to take is to audit all callers
>>>> of alloc_dax() and make sure they all save the alloc_dax() return
>>>> value in a struct dax_device * local variable first for the sake
>>>> of checking for IS_ERR(). This will leave the xyz->dax_dev pointer
>>>> initialized to NULL in the error case and simplify the rest of
>>>> error checking.
>>>
>>> I could maybe get on board with that, but it needs a comment somewhere
>>> about the asymmetric subtlety.
>>
>> Is this "somewhere" at every alloc_dax() call site, or do you have
>> something else in mind ?
>
> At least kill_dax() should mention the asymmetry I think.

Here is what I intend to add:

* Note, because alloc_dax() returns an ERR_PTR() on error, callers
* typically store its result into a local variable in order to check
* the result. Therefore, care must be taken to populate the struct
* device dax_dev field make sure the dax_dev is not leaked.

>
>>> The real question is what to do about device-dax. I *think* it is not
>>> affected by cpu_dcache aliasing because it never accesses user mappings
>>> through a kernel alias. I doubt device-dax is in use on these platforms,
>>> but we might need another fixup for that if someone screams about the
>>> alloc_dax() behavior change making them lose device-dax access.
>>
>> By "device-dax", I understand you mean drivers/dax/Kconfig:DEV_DAX.
>>
>> Based on your analysis, is alloc_dax() still the right spot where
>> to place this runtime check ? Which call sites are responsible
>> for invoking alloc_dax() for device-dax ?
>
> That is in devm_create_dev_dax().
>
>> If we know which call sites do not intend to use the kernel linear
>> mapping, we could introduce a flag (or a new variant of the alloc_dax()
>> API) that would either enforce or skip the check.
>
> Hmmm, it looks like there is already a natural flag for that. If
> alloc_dax() is passed a NULL operations pointer it means there are no
> kernel usages of the aliased mapping. That actually fits rather nicely.

Good, I was reaching the same conclusion when I received your reply.
I'll do that. It ends up being:

/*
* Unavailable on architectures with virtually aliased data caches,
* except for device-dax (NULL operations pointer), which does
* not use aliased mappings from the kernel.
*/
if (ops && cpu_dcache_is_aliasing())
return ERR_PTR(-EOPNOTSUPP);

>
> [..]
>>>>> @@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
>>>>> if (!IS_ENABLED(CONFIG_FUSE_DAX))
>>>>> return 0;
>>>>>
>>>>> + dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
>>>>> + if (IS_ERR(dax_dev)) {
>>>>> + int rc = PTR_ERR(dax_dev);
>>>>> +
>>>>> + if (rc == -EOPNOTSUPP)
>>>>> + return 0;
>>>>> + return rc;
>>>>> + }
>>>>
>>>> What is gained by moving this allocation here ?
>>>
>>> The gain is to fail early in virtio_fs_setup_dax() since the fundamental
>>> dependency of alloc_dax() success is not met. For example why let the
>>> setup progress to devm_memremap_pages() when alloc_dax() is going to
>>> return ERR_PTR(-EOPNOTSUPP).
>>
>> What I don't know is whether there is a dependency requiring to do
>> devm_request_mem_region(), devm_kzalloc(), devm_memremap_pages()
>> before calling alloc_dax() ?
>>
>> Those 3 calls are used to populate:
>>
>> fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
>> fs->window_len = (phys_addr_t) cache_reg.len;
>>
>> and then alloc_dax() takes "fs" as private data parameter. So it's
>> unclear to me whether we can swap the invocation order. I suspect
>> that it is not an issue because it is only used to populate
>> dax_dev->private, but I prefer to confirm this with you just to be
>> on the safe side.
>
> Thanks for that. All of those need to be done before the fs goes live
> later in virtio_device_ready(), but before that point nothing should be
> calling into virtio_fs_dax_ops, so as far as I can see it is safe to
> change the order.

Sounds good, I'll do that.

I will soon be ready to send out a RFC v4, which is still only
compiled-tested. Do you happen to have some kind of test suite
you can use to automate some of the runtime testing ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-02-02 20:14:46

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

Mathieu Desnoyers wrote:
[..]
> > Thanks for that. All of those need to be done before the fs goes live
> > later in virtio_device_ready(), but before that point nothing should be
> > calling into virtio_fs_dax_ops, so as far as I can see it is safe to
> > change the order.
>
> Sounds good, I'll do that.
>
> I will soon be ready to send out a RFC v4, which is still only
> compiled-tested. Do you happen to have some kind of test suite
> you can use to automate some of the runtime testing ?

There is a test suite for the pmem, dm, and dax changes
(https://github.com/pmem/ndctl?tab=readme-ov-file#unit-tests), but not
automated unfortunately. The NVDIMM maintainer team will run that before
pushing patches out to the fixes branch if you just want to lean on
that. For the rest I think we will need to depend on tested-by's from
s390 + virtio_fs folks, and / or sufficient soak time in linux-next.

2024-02-02 20:18:52

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

On 2024-02-02 15:14, Dan Williams wrote:
> Mathieu Desnoyers wrote:
> [..]
>>> Thanks for that. All of those need to be done before the fs goes live
>>> later in virtio_device_ready(), but before that point nothing should be
>>> calling into virtio_fs_dax_ops, so as far as I can see it is safe to
>>> change the order.
>>
>> Sounds good, I'll do that.
>>
>> I will soon be ready to send out a RFC v4, which is still only
>> compiled-tested. Do you happen to have some kind of test suite
>> you can use to automate some of the runtime testing ?
>
> There is a test suite for the pmem, dm, and dax changes
> (https://github.com/pmem/ndctl?tab=readme-ov-file#unit-tests), but not
> automated unfortunately. The NVDIMM maintainer team will run that before
> pushing patches out to the fixes branch if you just want to lean on
> that. For the rest I think we will need to depend on tested-by's from
> s390 + virtio_fs folks, and / or sufficient soak time in linux-next.

I suspect this will be necessary. There are just so many combinations
of architectures, drivers and filesystems involved here that I don't
think it is realistic to try to do all this testing manually on my own.

I prefer to voice this up front, so there are no misplaced expectations
about testing.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com