2024-06-10 20:49:11

by Linus Torvalds

Subject: [PATCH 0/7] arm64 / x86-64: low-level code generation issues

So this is the result of me doing some profiling on my 128-core Altra
box. I've sent out versions of this before, but they've all been fairly
ugly partial series.

This is the full cleaned-up series, with the patches split up to be
logical and with fixes based on some of the commentary on the previous
versions.

The first four patches are for the 'runtime constant' code, where I did
the initial implementation on x86-64 just because I was more comfy with
that, and the arm64 version of it came once I had the x86-64 side
working.

The horror that is __d_lookup_rcu() shows up a lot more on my Altra box
because of the relatively pitiful caches, but it's something that I've
wanted on x86-64 before. The arm64 numbers just made me bite the
bullet on the whole runtime constant thing.
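
To show what this is all about: after patches 2-4, the dcache hash
lookup in fs/dcache.c boils down to this (the actual code from patch 2):

static inline struct hlist_bl_head *d_hash(unsigned long hashlen)
{
	/* The hash table pointer and the shift count get patched into
	 * the instruction stream once at boot, instead of being loaded
	 * from memory on every path component lookup. */
	return runtime_const_ptr(dentry_hashtable) +
		runtime_const_shift_right_32(hashlen, d_hash_shift);
}

with the generic fallback just expanding to the plain variable accesses.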

The last three patches are purely arm64-specific, and just fix up some
nasty code generation in the user access functions. I just noticed that
I will need to implement 'user_access_save()' for KCSAN now that I do
the unsafe user access functions.

Anyway, that 'user_access_save/restore()' issue only shows up with
KCSAN. And it would be a no-op thanks to arm64 doing SMAP the right way
(pet peeve: arm64 did what I told the x86 designers to do originally,
but they claimed it was too hard, so we ended up with that CLAC/STAC
instead)...

Sadly, that "no-op for KCSAN" would be the case except for the horrid
CONFIG_ARM64_SW_TTBR0_PAN case, which is why I'm not touching it. I'm
hoping some hapless^Whelpful arm64 person is willing to tackle this (or
maybe make KCSAN and ARM64_SW_TTBR0_PAN incompatible in the Kconfig).

Note: the final access_ok() change in 7/7 is an API relaxation and
cleanup, and as such is much more worrisome than the other patches. It's
_simpler_ than the other patches, but the others aren't intended to
really change behavior. That one does.

Linus Torvalds (7):
vfs: dcache: move hashlen_hash() from callers into d_hash()
add default dummy 'runtime constant' infrastructure
x86: add 'runtime constant' support
arm64: add 'runtime constant' support
arm64: start using 'asm goto' for get_user() when available
arm64: start using 'asm goto' for put_user() when available
arm64: access_ok() optimization

arch/arm64/include/asm/runtime-const.h | 75 ++++++++++
arch/arm64/include/asm/uaccess.h | 191 +++++++++++++++++--------
arch/arm64/kernel/mte.c | 12 +-
arch/arm64/kernel/vmlinux.lds.S | 3 +
arch/x86/include/asm/runtime-const.h | 61 ++++++++
arch/x86/kernel/vmlinux.lds.S | 3 +
fs/dcache.c | 17 ++-
include/asm-generic/Kbuild | 1 +
include/asm-generic/runtime-const.h | 15 ++
include/asm-generic/vmlinux.lds.h | 8 ++
10 files changed, 319 insertions(+), 67 deletions(-)
create mode 100644 arch/arm64/include/asm/runtime-const.h
create mode 100644 arch/x86/include/asm/runtime-const.h
create mode 100644 include/asm-generic/runtime-const.h

--
2.45.1.209.gc6f12300df



2024-06-10 20:49:29

by Linus Torvalds

Subject: [PATCH 1/7] vfs: dcache: move hashlen_hash() from callers into d_hash()

Both __d_lookup_rcu() and __d_lookup_rcu_op_compare() have the full
'name_hash' value of the qstr that they want to look up, and mask it off
to just the low 32-bit hash before calling down to d_hash().

Other callers just load the 32-bit hash and pass it as the argument.

If we move the masking into d_hash() itself, it simplifies the two
callers that currently do the masking, and is a no-op for the other
cases. It doesn't actually change the generated code since the compiler
will inline d_hash() and see that the end result is the same.

[ Technically, since the parse tree changes, the code generation may not
be 100% the same, and for me on x86-64, this does result in gcc
switching the operands around for one 'cmpl' instruction. So not
necessarily the exact same code generation, but equivalent ]

However, this does encapsulate the 'd_hash()' operation more, and makes
the shift operation in particular be a "32-bit shift right that returns
a full word", which matches the instruction semantics on both x86-64 and
arm64 better, since a 32-bit shift will clear the upper bits.

That makes the next step of introducing a "shift by runtime constant"
more obvious and generates the shift with no extraneous type masking.
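
For reference, 'hash_len' packs the name length in the upper 32 bits and
the hash in the low 32 bits, so the helper being moved is literally just
a cast (this is a recap of <linux/stringhash.h>, not part of this patch):

/* hash_len = (u64)len << 32 | hash */
#define hashlen_hash(hashlen) ((u32)(hashlen))
#define hashlen_len(hashlen)  ((u32)((hashlen) >> 32))

which is why passing the full 64-bit value down and doing the 32-bit
shift inside d_hash() is equivalent.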

Signed-off-by: Linus Torvalds <[email protected]>
---
fs/dcache.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 407095188f83..8b4382f5c99d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -100,9 +100,9 @@ static unsigned int d_hash_shift __ro_after_init;

static struct hlist_bl_head *dentry_hashtable __ro_after_init;

-static inline struct hlist_bl_head *d_hash(unsigned int hash)
+static inline struct hlist_bl_head *d_hash(unsigned long hashlen)
{
- return dentry_hashtable + (hash >> d_hash_shift);
+ return dentry_hashtable + ((u32)hashlen >> d_hash_shift);
}

#define IN_LOOKUP_SHIFT 10
@@ -2104,7 +2104,7 @@ static noinline struct dentry *__d_lookup_rcu_op_compare(
unsigned *seqp)
{
u64 hashlen = name->hash_len;
- struct hlist_bl_head *b = d_hash(hashlen_hash(hashlen));
+ struct hlist_bl_head *b = d_hash(hashlen);
struct hlist_bl_node *node;
struct dentry *dentry;

@@ -2171,7 +2171,7 @@ struct dentry *__d_lookup_rcu(const struct dentry *parent,
{
u64 hashlen = name->hash_len;
const unsigned char *str = name->name;
- struct hlist_bl_head *b = d_hash(hashlen_hash(hashlen));
+ struct hlist_bl_head *b = d_hash(hashlen);
struct hlist_bl_node *node;
struct dentry *dentry;

--
2.45.1.209.gc6f12300df


2024-06-10 20:49:48

by Linus Torvalds

Subject: [PATCH 3/7] x86: add 'runtime constant' support

This implements the runtime constant infrastructure for x86, allowing
the dcache d_hash() function to be generated using the hash table
address as a runtime constant, followed by a shift of the hash index by
a runtime constant.

Signed-off-by: Linus Torvalds <[email protected]>
---
arch/x86/include/asm/runtime-const.h | 61 ++++++++++++++++++++++++++++
arch/x86/kernel/vmlinux.lds.S | 3 ++
2 files changed, 64 insertions(+)
create mode 100644 arch/x86/include/asm/runtime-const.h

diff --git a/arch/x86/include/asm/runtime-const.h b/arch/x86/include/asm/runtime-const.h
new file mode 100644
index 000000000000..24e3a53ca255
--- /dev/null
+++ b/arch/x86/include/asm/runtime-const.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_RUNTIME_CONST_H
+#define _ASM_RUNTIME_CONST_H
+
+#define runtime_const_ptr(sym) ({ \
+ typeof(sym) __ret; \
+ asm_inline("mov %1,%0\n1:\n" \
+ ".pushsection runtime_ptr_" #sym ",\"a\"\n\t" \
+ ".long 1b - %c2 - .\n\t" \
+ ".popsection" \
+ :"=r" (__ret) \
+ :"i" ((unsigned long)0x0123456789abcdefull), \
+ "i" (sizeof(long))); \
+ __ret; })
+
+// The 'typeof' will create at _least_ a 32-bit type, but
+// will happily also take a bigger type and the 'shrl' will
+// clear the upper bits
+#define runtime_const_shift_right_32(val, sym) ({ \
+ typeof(0u+(val)) __ret = (val); \
+ asm_inline("shrl $12,%k0\n1:\n" \
+ ".pushsection runtime_shift_" #sym ",\"a\"\n\t" \
+ ".long 1b - 1 - .\n\t" \
+ ".popsection" \
+ :"+r" (__ret)); \
+ __ret; })
+
+#define runtime_const_init(type, sym) do { \
+ extern s32 __start_runtime_##type##_##sym[]; \
+ extern s32 __stop_runtime_##type##_##sym[]; \
+ runtime_const_fixup(__runtime_fixup_##type, \
+ (unsigned long)(sym), \
+ __start_runtime_##type##_##sym, \
+ __stop_runtime_##type##_##sym); \
+} while (0)
+
+/*
+ * The text patching is trivial - you can only do this at init time,
+ * when the text section hasn't been marked RO, and before the text
+ * has ever been executed.
+ */
+static inline void __runtime_fixup_ptr(void *where, unsigned long val)
+{
+ *(unsigned long *)where = val;
+}
+
+static inline void __runtime_fixup_shift(void *where, unsigned long val)
+{
+ *(unsigned char *)where = val;
+}
+
+static inline void runtime_const_fixup(void (*fn)(void *, unsigned long),
+ unsigned long val, s32 *start, s32 *end)
+{
+ while (start < end) {
+ fn(*start + (void *)start, val);
+ start++;
+ }
+}
+
+#endif
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 3509afc6a672..6e73403e874f 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -357,6 +357,9 @@ SECTIONS
PERCPU_SECTION(INTERNODE_CACHE_BYTES)
#endif

+ RUNTIME_CONST(shift, d_hash_shift)
+ RUNTIME_CONST(ptr, dentry_hashtable)
+
. = ALIGN(PAGE_SIZE);

/* freed after init ends here */
--
2.45.1.209.gc6f12300df


2024-06-10 20:50:05

by Linus Torvalds

Subject: [PATCH 2/7] add default dummy 'runtime constant' infrastructure

This adds the initial dummy support for 'runtime constants' for when
an architecture doesn't actually support an implementation of fixing up
said runtime constants.

This ends up being the fallback to just using the variables as regular
__ro_after_init variables, and changes the dcache d_hash() function to
use this model.
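
The contract for an architecture that wants the real thing (see patches
3 and 4) is then simply: provide its own <asm/runtime-const.h>
implementing these three macros, and add the matching section entries to
its vmlinux.lds.S:

	RUNTIME_CONST(shift, d_hash_shift)
	RUNTIME_CONST(ptr, dentry_hashtable)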

Signed-off-by: Linus Torvalds <[email protected]>
---
fs/dcache.c | 11 ++++++++++-
include/asm-generic/Kbuild | 1 +
include/asm-generic/runtime-const.h | 15 +++++++++++++++
include/asm-generic/vmlinux.lds.h | 8 ++++++++
4 files changed, 34 insertions(+), 1 deletion(-)
create mode 100644 include/asm-generic/runtime-const.h

diff --git a/fs/dcache.c b/fs/dcache.c
index 8b4382f5c99d..5d3a5b315692 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -35,6 +35,8 @@
#include "internal.h"
#include "mount.h"

+#include <asm/runtime-const.h>
+
/*
* Usage:
* dcache->d_inode->i_lock protects:
@@ -102,7 +104,8 @@ static struct hlist_bl_head *dentry_hashtable __ro_after_init;

static inline struct hlist_bl_head *d_hash(unsigned long hashlen)
{
- return dentry_hashtable + ((u32)hashlen >> d_hash_shift);
+ return runtime_const_ptr(dentry_hashtable) +
+ runtime_const_shift_right_32(hashlen, d_hash_shift);
}

#define IN_LOOKUP_SHIFT 10
@@ -3129,6 +3132,9 @@ static void __init dcache_init_early(void)
0,
0);
d_hash_shift = 32 - d_hash_shift;
+
+ runtime_const_init(shift, d_hash_shift);
+ runtime_const_init(ptr, dentry_hashtable);
}

static void __init dcache_init(void)
@@ -3157,6 +3163,9 @@ static void __init dcache_init(void)
0,
0);
d_hash_shift = 32 - d_hash_shift;
+
+ runtime_const_init(shift, d_hash_shift);
+ runtime_const_init(ptr, dentry_hashtable);
}

/* SLAB cache for __getname() consumers */
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index b20fa25a7e8d..052e5c98c105 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -46,6 +46,7 @@ mandatory-y += pci.h
mandatory-y += percpu.h
mandatory-y += pgalloc.h
mandatory-y += preempt.h
+mandatory-y += runtime-const.h
mandatory-y += rwonce.h
mandatory-y += sections.h
mandatory-y += serial.h
diff --git a/include/asm-generic/runtime-const.h b/include/asm-generic/runtime-const.h
new file mode 100644
index 000000000000..670499459514
--- /dev/null
+++ b/include/asm-generic/runtime-const.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_RUNTIME_CONST_H
+#define _ASM_RUNTIME_CONST_H
+
+/*
+ * This is the fallback for when the architecture doesn't
+ * support the runtime const operations.
+ *
+ * We just use the actual symbols as-is.
+ */
+#define runtime_const_ptr(sym) (sym)
+#define runtime_const_shift_right_32(val, sym) ((u32)(val)>>(sym))
+#define runtime_const_init(type,sym) do { } while (0)
+
+#endif
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 5703526d6ebf..389a78415b9b 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -944,6 +944,14 @@
#define CON_INITCALL \
BOUNDED_SECTION_POST_LABEL(.con_initcall.init, __con_initcall, _start, _end)

+#define RUNTIME_NAME(t,x) runtime_##t##_##x
+
+#define RUNTIME_CONST(t,x) \
+ . = ALIGN(8); \
+ RUNTIME_NAME(t,x) : AT(ADDR(RUNTIME_NAME(t,x)) - LOAD_OFFSET) { \
+ *(RUNTIME_NAME(t,x)); \
+ }
+
/* Alignment must be consistent with (kunit_suite *) in include/kunit/test.h */
#define KUNIT_TABLE() \
. = ALIGN(8); \
--
2.45.1.209.gc6f12300df


2024-06-10 20:50:18

by Linus Torvalds

Subject: [PATCH 6/7] arm64: start using 'asm goto' for put_user() when available

This generates noticeably better code with compilers that support it,
since we don't need to test the error register etc, the exception just
jumps to the error handling directly.
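
At the use sites the difference is basically this (a sketch - the names
are just for illustration):

	/* before: pass an error variable and test it afterwards */
	int err = 0;
	__raw_put_mem("sttr", val, ptr, err, U);
	if (err)
		goto fail;

	/* after: the exception table entry points straight at the label */
	__raw_put_mem("sttr", val, ptr, fail, U);

so the success path has no test and no conditional branch on an error
register at all.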

Signed-off-by: Linus Torvalds <[email protected]>
---
arch/arm64/include/asm/uaccess.h | 77 +++++++++++++++++++-------------
1 file changed, 46 insertions(+), 31 deletions(-)

diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 23c2edf517ed..4ab3938290ab 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -294,29 +294,41 @@ do { \
} while (0); \
} while (0)

-#define __put_mem_asm(store, reg, x, addr, err, type) \
+#ifdef CONFIG_CC_HAS_ASM_GOTO
+#define __put_mem_asm(store, reg, x, addr, label, type) \
+ asm goto( \
+ "1: " store " " reg "0, [%1]\n" \
+ "2:\n" \
+ _ASM_EXTABLE_##type##ACCESS_ZERO(1b, %l2) \
+ : : "rZ" (x), "r" (addr) : : label)
+#else
+#define __put_mem_asm(store, reg, x, addr, label, type) do { \
+ int __pma_err = 0; \
asm volatile( \
"1: " store " " reg "1, [%2]\n" \
"2:\n" \
_ASM_EXTABLE_##type##ACCESS_ERR(1b, 2b, %w0) \
- : "+r" (err) \
- : "rZ" (x), "r" (addr))
+ : "+r" (__pma_err) \
+ : "rZ" (x), "r" (addr)); \
+ if (__pma_err) goto label; \
+} while (0)
+#endif

-#define __raw_put_mem(str, x, ptr, err, type) \
+#define __raw_put_mem(str, x, ptr, label, type) \
do { \
__typeof__(*(ptr)) __pu_val = (x); \
switch (sizeof(*(ptr))) { \
case 1: \
- __put_mem_asm(str "b", "%w", __pu_val, (ptr), (err), type); \
+ __put_mem_asm(str "b", "%w", __pu_val, (ptr), label, type); \
break; \
case 2: \
- __put_mem_asm(str "h", "%w", __pu_val, (ptr), (err), type); \
+ __put_mem_asm(str "h", "%w", __pu_val, (ptr), label, type); \
break; \
case 4: \
- __put_mem_asm(str, "%w", __pu_val, (ptr), (err), type); \
+ __put_mem_asm(str, "%w", __pu_val, (ptr), label, type); \
break; \
case 8: \
- __put_mem_asm(str, "%x", __pu_val, (ptr), (err), type); \
+ __put_mem_asm(str, "%x", __pu_val, (ptr), label, type); \
break; \
default: \
BUILD_BUG(); \
@@ -328,25 +340,34 @@ do { \
* uaccess_ttbr0_disable(). As `x` and `ptr` could contain blocking functions,
* we must evaluate these outside of the critical section.
*/
-#define __raw_put_user(x, ptr, err) \
+#define __raw_put_user(x, ptr, label) \
do { \
+ __label__ __rpu_failed; \
__typeof__(*(ptr)) __user *__rpu_ptr = (ptr); \
__typeof__(*(ptr)) __rpu_val = (x); \
__chk_user_ptr(__rpu_ptr); \
\
- uaccess_ttbr0_enable(); \
- __raw_put_mem("sttr", __rpu_val, __rpu_ptr, err, U); \
- uaccess_ttbr0_disable(); \
+ do { \
+ uaccess_ttbr0_enable(); \
+ __raw_put_mem("sttr", __rpu_val, __rpu_ptr, __rpu_failed, U); \
+ uaccess_ttbr0_disable(); \
+ break; \
+ __rpu_failed: \
+ uaccess_ttbr0_disable(); \
+ goto label; \
+ } while (0); \
} while (0)

#define __put_user_error(x, ptr, err) \
do { \
+ __label__ __pu_failed; \
__typeof__(*(ptr)) __user *__p = (ptr); \
might_fault(); \
if (access_ok(__p, sizeof(*__p))) { \
__p = uaccess_mask_ptr(__p); \
- __raw_put_user((x), __p, (err)); \
+ __raw_put_user((x), __p, __pu_failed); \
} else { \
+ __pu_failed: \
(err) = -EFAULT; \
} \
} while (0)
@@ -369,15 +390,18 @@ do { \
do { \
__typeof__(dst) __pkn_dst = (dst); \
__typeof__(src) __pkn_src = (src); \
- int __pkn_err = 0; \
\
- __mte_enable_tco_async(); \
- __raw_put_mem("str", *((type *)(__pkn_src)), \
- (__force type *)(__pkn_dst), __pkn_err, K); \
- __mte_disable_tco_async(); \
- \
- if (unlikely(__pkn_err)) \
+ do { \
+ __label__ __pkn_err; \
+ __mte_enable_tco_async(); \
+ __raw_put_mem("str", *((type *)(__pkn_src)), \
+ (__force type *)(__pkn_dst), __pkn_err, K); \
+ __mte_disable_tco_async(); \
+ break; \
+ __pkn_err: \
+ __mte_disable_tco_async(); \
goto err_label; \
+ } while (0); \
} while(0)

extern unsigned long __must_check __arch_copy_from_user(void *to, const void __user *from, unsigned long n);
@@ -411,17 +435,8 @@ static __must_check __always_inline bool user_access_begin(const void __user *pt
}
#define user_access_begin(a,b) user_access_begin(a,b)
#define user_access_end() uaccess_ttbr0_disable()
-
-/*
- * The arm64 inline asms should learn about asm goto, and we should
- * teach user_access_begin() about address masking.
- */
-#define unsafe_put_user(x, ptr, label) do { \
- int __upu_err = 0; \
- __raw_put_mem("sttr", x, uaccess_mask_ptr(ptr), __upu_err, U); \
- if (__upu_err) goto label; \
-} while (0)
-
+#define unsafe_put_user(x, ptr, label) \
+ __raw_put_mem("sttr", x, uaccess_mask_ptr(ptr), label, U)
#define unsafe_get_user(x, ptr, label) \
__raw_get_mem("ldtr", x, uaccess_mask_ptr(ptr), label, U)

--
2.45.1.209.gc6f12300df


2024-06-10 20:50:29

by Linus Torvalds

Subject: [PATCH 7/7] arm64: access_ok() optimization

The TBI setup on arm64 is very strange: HW is set up to always do TBI,
but the kernel enforcement for system calls is purely a software
contract, and user space is supposed to mask off the top bits before the
system call.

Except all the actual brk/mmap/etc() system calls then mask it in kernel
space anyway, and accept any TBI address.

This basically unifies things and makes access_ok() also ignore it.

This is an ABI change, but the current situation is very odd, and this
change avoids the current mess and makes the kernel more permissive, and
as such is unlikely to break anything.

The way forward - for some possible future situation when people want to
use more bits - is probably to introduce a new "I actually want the full
64-bit address space" prctl. But we should make sure that the software
and hardware rules actually match at that point.
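
To spell out the logic of the new check (a restatement of the diff
below, with the 'access_ok_sketch' name used only for illustration):
with TBI the hardware ignores the tag bits 56-63 anyway, and bit 55 is
what selects the kernel half of the address space, so it is enough that
neither the start nor the end of the range has bit 55 set, and that the
size is not so large that the addition could have stepped past the tag
bits:

static inline int access_ok_sketch(const void __user *p, unsigned long size)
{
	unsigned long addr = (unsigned long)p;

	/* bit 55 set if the start or the end lands in the kernel half */
	addr |= addr + size;
	return !(((addr >> 55) & 1) | (size >> 55));
}

Anything that passes this but isn't actually mapped will simply fault
and get handled by the exception tables.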

Signed-off-by: Linus Torvalds <[email protected]>
---
arch/arm64/include/asm/uaccess.h | 23 ++++++++++-------------
1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 4ab3938290ab..a435eff4ee93 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -30,23 +30,20 @@ static inline int __access_ok(const void __user *ptr, unsigned long size);

/*
* Test whether a block of memory is a valid user space address.
- * Returns 1 if the range is valid, 0 otherwise.
*
- * This is equivalent to the following test:
- * (u65)addr + (u65)size <= (u65)TASK_SIZE_MAX
+ * We only care that the address cannot reach the kernel mapping, and
+ * that an invalid address will fault.
*/
-static inline int access_ok(const void __user *addr, unsigned long size)
+static inline int access_ok(const void __user *p, unsigned long size)
{
- /*
- * Asynchronous I/O running in a kernel thread does not have the
- * TIF_TAGGED_ADDR flag of the process owning the mm, so always untag
- * the user address before checking.
- */
- if (IS_ENABLED(CONFIG_ARM64_TAGGED_ADDR_ABI) &&
- (current->flags & PF_KTHREAD || test_thread_flag(TIF_TAGGED_ADDR)))
- addr = untagged_addr(addr);
+ unsigned long addr = (unsigned long)p;

- return likely(__access_ok(addr, size));
+ /* Only bit 55 of the address matters */
+ addr |= addr+size;
+ addr = (addr >> 55) & 1;
+ size >>= 55;
+
+ return !(addr | size);
}
#define access_ok access_ok

--
2.45.1.209.gc6f12300df


2024-06-10 20:50:28

by Linus Torvalds

Subject: [PATCH 4/7] arm64: add 'runtime constant' support

This implements the runtime constant infrastructure for arm64, allowing
the dcache d_hash() function to be generated using the hash table
address as a runtime constant, followed by a shift of the hash index by
a runtime constant.

Signed-off-by: Linus Torvalds <[email protected]>
---
arch/arm64/include/asm/runtime-const.h | 75 ++++++++++++++++++++++++++
arch/arm64/kernel/vmlinux.lds.S | 3 ++
2 files changed, 78 insertions(+)
create mode 100644 arch/arm64/include/asm/runtime-const.h

diff --git a/arch/arm64/include/asm/runtime-const.h b/arch/arm64/include/asm/runtime-const.h
new file mode 100644
index 000000000000..02462b2cb6f9
--- /dev/null
+++ b/arch/arm64/include/asm/runtime-const.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_RUNTIME_CONST_H
+#define _ASM_RUNTIME_CONST_H
+
+#define runtime_const_ptr(sym) ({ \
+ typeof(sym) __ret; \
+ asm_inline("1:\t" \
+ "movz %0, #0xcdef\n\t" \
+ "movk %0, #0x89ab, lsl #16\n\t" \
+ "movk %0, #0x4567, lsl #32\n\t" \
+ "movk %0, #0x0123, lsl #48\n\t" \
+ ".pushsection runtime_ptr_" #sym ",\"a\"\n\t" \
+ ".long 1b - .\n\t" \
+ ".popsection" \
+ :"=r" (__ret)); \
+ __ret; })
+
+#define runtime_const_shift_right_32(val, sym) ({ \
+ unsigned long __ret; \
+ asm_inline("1:\t" \
+ "lsr %w0,%w1,#12\n\t" \
+ ".pushsection runtime_shift_" #sym ",\"a\"\n\t" \
+ ".long 1b - .\n\t" \
+ ".popsection" \
+ :"=r" (__ret) \
+ :"r" (0u+(val))); \
+ __ret; })
+
+#define runtime_const_init(type, sym) do { \
+ extern s32 __start_runtime_##type##_##sym[]; \
+ extern s32 __stop_runtime_##type##_##sym[]; \
+ runtime_const_fixup(__runtime_fixup_##type, \
+ (unsigned long)(sym), \
+ __start_runtime_##type##_##sym, \
+ __stop_runtime_##type##_##sym); \
+} while (0)
+
+// 16-bit immediate for wide move (movz and movk) in bits 5..20
+static inline void __runtime_fixup_16(unsigned int *p, unsigned int val)
+{
+ unsigned int insn = *p;
+ insn &= 0xffe0001f;
+ insn |= (val & 0xffff) << 5;
+ *p = insn;
+}
+
+static inline void __runtime_fixup_ptr(void *where, unsigned long val)
+{
+ unsigned int *p = lm_alias(where);
+ __runtime_fixup_16(p, val);
+ __runtime_fixup_16(p+1, val >> 16);
+ __runtime_fixup_16(p+2, val >> 32);
+ __runtime_fixup_16(p+3, val >> 48);
+}
+
+// Immediate value is 5 bits starting at bit #16
+static inline void __runtime_fixup_shift(void *where, unsigned long val)
+{
+ unsigned int *p = lm_alias(where);
+ unsigned int insn = *p;
+ insn &= 0xffc0ffff;
+ insn |= (val & 63) << 16;
+ *p = insn;
+}
+
+static inline void runtime_const_fixup(void (*fn)(void *, unsigned long),
+ unsigned long val, s32 *start, s32 *end)
+{
+ while (start < end) {
+ fn(*start + (void *)start, val);
+ start++;
+ }
+}
+
+#endif
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 755a22d4f840..55a8e310ea12 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -264,6 +264,9 @@ SECTIONS
EXIT_DATA
}

+ RUNTIME_CONST(shift, d_hash_shift)
+ RUNTIME_CONST(ptr, dentry_hashtable)
+
PERCPU_SECTION(L1_CACHE_BYTES)
HYPERVISOR_PERCPU_SECTION

--
2.45.1.209.gc6f12300df


2024-06-10 20:50:41

by Linus Torvalds

Subject: [PATCH 5/7] arm64: start using 'asm goto' for get_user() when available

This generates noticeably better code with compilers that support it,
since we don't need to test the error register etc, the exception just
jumps to the error handling directly.

Signed-off-by: Linus Torvalds <[email protected]>
---
arch/arm64/include/asm/uaccess.h | 113 ++++++++++++++++++++++++-------
arch/arm64/kernel/mte.c | 12 ++--
2 files changed, 95 insertions(+), 30 deletions(-)

diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 14be5000c5a0..23c2edf517ed 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -184,29 +184,40 @@ static inline void __user *__uaccess_mask_ptr(const void __user *ptr)
* The "__xxx_error" versions set the third argument to -EFAULT if an error
* occurs, and leave it unchanged on success.
*/
-#define __get_mem_asm(load, reg, x, addr, err, type) \
+#ifdef CONFIG_CC_HAS_ASM_GOTO_OUTPUT
+#define __get_mem_asm(load, reg, x, addr, label, type) \
+ asm_goto_output( \
+ "1: " load " " reg "0, [%1]\n" \
+ _ASM_EXTABLE_##type##ACCESS_ERR(1b, %l2, %w0) \
+ : "=r" (x) \
+ : "r" (addr) : : label)
+#else
+#define __get_mem_asm(load, reg, x, addr, label, type) do { \
+ int __gma_err = 0; \
asm volatile( \
"1: " load " " reg "1, [%2]\n" \
"2:\n" \
_ASM_EXTABLE_##type##ACCESS_ERR_ZERO(1b, 2b, %w0, %w1) \
- : "+r" (err), "=r" (x) \
- : "r" (addr))
+ : "+r" (__gma_err), "=r" (x) \
+ : "r" (addr)); \
+ if (__gma_err) goto label; } while (0)
+#endif

-#define __raw_get_mem(ldr, x, ptr, err, type) \
+#define __raw_get_mem(ldr, x, ptr, label, type) \
do { \
unsigned long __gu_val; \
switch (sizeof(*(ptr))) { \
case 1: \
- __get_mem_asm(ldr "b", "%w", __gu_val, (ptr), (err), type); \
+ __get_mem_asm(ldr "b", "%w", __gu_val, (ptr), label, type); \
break; \
case 2: \
- __get_mem_asm(ldr "h", "%w", __gu_val, (ptr), (err), type); \
+ __get_mem_asm(ldr "h", "%w", __gu_val, (ptr), label, type); \
break; \
case 4: \
- __get_mem_asm(ldr, "%w", __gu_val, (ptr), (err), type); \
+ __get_mem_asm(ldr, "%w", __gu_val, (ptr), label, type); \
break; \
case 8: \
- __get_mem_asm(ldr, "%x", __gu_val, (ptr), (err), type); \
+ __get_mem_asm(ldr, "%x", __gu_val, (ptr), label, type); \
break; \
default: \
BUILD_BUG(); \
@@ -219,27 +230,34 @@ do { \
* uaccess_ttbr0_disable(). As `x` and `ptr` could contain blocking functions,
* we must evaluate these outside of the critical section.
*/
-#define __raw_get_user(x, ptr, err) \
+#define __raw_get_user(x, ptr, label) \
do { \
__typeof__(*(ptr)) __user *__rgu_ptr = (ptr); \
__typeof__(x) __rgu_val; \
__chk_user_ptr(ptr); \
- \
- uaccess_ttbr0_enable(); \
- __raw_get_mem("ldtr", __rgu_val, __rgu_ptr, err, U); \
- uaccess_ttbr0_disable(); \
- \
- (x) = __rgu_val; \
+ do { \
+ __label__ __rgu_failed; \
+ uaccess_ttbr0_enable(); \
+ __raw_get_mem("ldtr", __rgu_val, __rgu_ptr, __rgu_failed, U); \
+ uaccess_ttbr0_disable(); \
+ (x) = __rgu_val; \
+ break; \
+ __rgu_failed: \
+ uaccess_ttbr0_disable(); \
+ goto label; \
+ } while (0); \
} while (0)

#define __get_user_error(x, ptr, err) \
do { \
+ __label__ __gu_failed; \
__typeof__(*(ptr)) __user *__p = (ptr); \
might_fault(); \
if (access_ok(__p, sizeof(*__p))) { \
__p = uaccess_mask_ptr(__p); \
- __raw_get_user((x), __p, (err)); \
+ __raw_get_user((x), __p, __gu_failed); \
} else { \
+ __gu_failed: \
(x) = (__force __typeof__(x))0; (err) = -EFAULT; \
} \
} while (0)
@@ -262,15 +280,18 @@ do { \
do { \
__typeof__(dst) __gkn_dst = (dst); \
__typeof__(src) __gkn_src = (src); \
- int __gkn_err = 0; \
+ do { \
+ __label__ __gkn_label; \
\
- __mte_enable_tco_async(); \
- __raw_get_mem("ldr", *((type *)(__gkn_dst)), \
- (__force type *)(__gkn_src), __gkn_err, K); \
- __mte_disable_tco_async(); \
- \
- if (unlikely(__gkn_err)) \
+ __mte_enable_tco_async(); \
+ __raw_get_mem("ldr", *((type *)(__gkn_dst)), \
+ (__force type *)(__gkn_src), __gkn_label, K); \
+ __mte_disable_tco_async(); \
+ break; \
+ __gkn_label: \
+ __mte_disable_tco_async(); \
goto err_label; \
+ } while (0); \
} while (0)

#define __put_mem_asm(store, reg, x, addr, err, type) \
@@ -381,6 +402,52 @@ extern unsigned long __must_check __arch_copy_to_user(void __user *to, const voi
__actu_ret; \
})

+static __must_check __always_inline bool user_access_begin(const void __user *ptr, size_t len)
+{
+ if (unlikely(!access_ok(ptr,len)))
+ return 0;
+ uaccess_ttbr0_enable();
+ return 1;
+}
+#define user_access_begin(a,b) user_access_begin(a,b)
+#define user_access_end() uaccess_ttbr0_disable()
+
+/*
+ * The arm64 inline asms should learn about asm goto, and we should
+ * teach user_access_begin() about address masking.
+ */
+#define unsafe_put_user(x, ptr, label) do { \
+ int __upu_err = 0; \
+ __raw_put_mem("sttr", x, uaccess_mask_ptr(ptr), __upu_err, U); \
+ if (__upu_err) goto label; \
+} while (0)
+
+#define unsafe_get_user(x, ptr, label) \
+ __raw_get_mem("ldtr", x, uaccess_mask_ptr(ptr), label, U)
+
+/*
+ * We want the unsafe accessors to always be inlined and use
+ * the error labels - thus the macro games.
+ */
+#define unsafe_copy_loop(dst, src, len, type, label) \
+ while (len >= sizeof(type)) { \
+ unsafe_put_user(*(type *)(src),(type __user *)(dst),label); \
+ dst += sizeof(type); \
+ src += sizeof(type); \
+ len -= sizeof(type); \
+ }
+
+#define unsafe_copy_to_user(_dst,_src,_len,label) \
+do { \
+ char __user *__ucu_dst = (_dst); \
+ const char *__ucu_src = (_src); \
+ size_t __ucu_len = (_len); \
+ unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u64, label); \
+ unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u32, label); \
+ unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u16, label); \
+ unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u8, label); \
+} while (0)
+
#define INLINE_COPY_TO_USER
#define INLINE_COPY_FROM_USER

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index dcdcccd40891..6174671be7c1 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -582,12 +582,9 @@ subsys_initcall(register_mte_tcf_preferred_sysctl);
size_t mte_probe_user_range(const char __user *uaddr, size_t size)
{
const char __user *end = uaddr + size;
- int err = 0;
char val;

- __raw_get_user(val, uaddr, err);
- if (err)
- return size;
+ __raw_get_user(val, uaddr, efault);

uaddr = PTR_ALIGN(uaddr, MTE_GRANULE_SIZE);
while (uaddr < end) {
@@ -595,12 +592,13 @@ size_t mte_probe_user_range(const char __user *uaddr, size_t size)
* A read is sufficient for mte, the caller should have probed
* for the pte write permission if required.
*/
- __raw_get_user(val, uaddr, err);
- if (err)
- return end - uaddr;
+ __raw_get_user(val, uaddr, efault);
uaddr += MTE_GRANULE_SIZE;
}
(void)val;

return 0;
+
+efault:
+ return end - uaddr;
}
--
2.45.1.209.gc6f12300df


2024-06-11 14:29:29

by Mark Rutland

Subject: Re: [PATCH 4/7] arm64: add 'runtime constant' support

Hi Linus,

On Mon, Jun 10, 2024 at 01:48:18PM -0700, Linus Torvalds wrote:
> This implements the runtime constant infrastructure for arm64, allowing
> the dcache d_hash() function to be generated using the hash table
> address as a runtime constant, followed by a shift of the hash index by
> a runtime constant.
>
> Signed-off-by: Linus Torvalds <[email protected]>
> ---
> arch/arm64/include/asm/runtime-const.h | 75 ++++++++++++++++++++++++++
> arch/arm64/kernel/vmlinux.lds.S | 3 ++
> 2 files changed, 78 insertions(+)
> create mode 100644 arch/arm64/include/asm/runtime-const.h

From a quick scan I spotted a couple of issues (explained with fixes
below). I haven't had the chance to test/review the whole series yet.

Do we expect to use this more widely? If this only really matters for
d_hash() it might be better to handle this via the alternatives
framework with callbacks and avoid the need for new infrastructure.

> diff --git a/arch/arm64/include/asm/runtime-const.h b/arch/arm64/include/asm/runtime-const.h
> new file mode 100644
> index 000000000000..02462b2cb6f9
> --- /dev/null
> +++ b/arch/arm64/include/asm/runtime-const.h
> @@ -0,0 +1,75 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_RUNTIME_CONST_H
> +#define _ASM_RUNTIME_CONST_H
> +
> +#define runtime_const_ptr(sym) ({ \
> + typeof(sym) __ret; \
> + asm_inline("1:\t" \
> + "movz %0, #0xcdef\n\t" \
> + "movk %0, #0x89ab, lsl #16\n\t" \
> + "movk %0, #0x4567, lsl #32\n\t" \
> + "movk %0, #0x0123, lsl #48\n\t" \
> + ".pushsection runtime_ptr_" #sym ",\"a\"\n\t" \
> + ".long 1b - .\n\t" \
> + ".popsection" \
> + :"=r" (__ret)); \
> + __ret; })
> +
> +#define runtime_const_shift_right_32(val, sym) ({ \
> + unsigned long __ret; \
> + asm_inline("1:\t" \
> + "lsr %w0,%w1,#12\n\t" \
> + ".pushsection runtime_shift_" #sym ",\"a\"\n\t" \
> + ".long 1b - .\n\t" \
> + ".popsection" \
> + :"=r" (__ret) \
> + :"r" (0u+(val))); \
> + __ret; })
> +
> +#define runtime_const_init(type, sym) do { \
> + extern s32 __start_runtime_##type##_##sym[]; \
> + extern s32 __stop_runtime_##type##_##sym[]; \
> + runtime_const_fixup(__runtime_fixup_##type, \
> + (unsigned long)(sym), \
> + __start_runtime_##type##_##sym, \
> + __stop_runtime_##type##_##sym); \
> +} while (0)
> +
> +// 16-bit immediate for wide move (movz and movk) in bits 5..20
> +static inline void __runtime_fixup_16(unsigned int *p, unsigned int val)
> +{
> + unsigned int insn = *p;
> + insn &= 0xffe0001f;
> + insn |= (val & 0xffff) << 5;
> + *p = insn;
> +}

As-is this will break BE kernels as instructions are always encoded
little-endian regardless of data endianness. We usually handle that by
using __le32 instruction pointers and using le32_to_cpu()/cpu_to_le32()
when reading/writing, e.g.

#include <asm/byteorder.h>

static inline void __runtime_fixup_16(__le32 *p, unsigned int val)
{
	u32 insn = le32_to_cpu(*p);
	insn &= 0xffe0001f;
	insn |= (val & 0xffff) << 5;
	*p = cpu_to_le32(insn);
}

We have some helpers for instruction manipulation, and we can use
aarch64_insn_encode_immediate() here, e.g.

#include <asm/insn.h>

static inline void __runtime_fixup_16(__le32 *p, unsigned int val)
{
	u32 insn = le32_to_cpu(*p);
	insn = aarch64_insn_encode_immediate(AARCH64_INSN_IMM_16, insn, val);
	*p = cpu_to_le32(insn);
}

> +static inline void __runtime_fixup_ptr(void *where, unsigned long val)
> +{
> + unsigned int *p = lm_alias(where);
> + __runtime_fixup_16(p, val);
> + __runtime_fixup_16(p+1, val >> 16);
> + __runtime_fixup_16(p+2, val >> 32);
> + __runtime_fixup_16(p+3, val >> 48);
> +}

This is missing the necessary cache maintenance and context
synchronization event.

After the new values are written, we need cache maintenance (a D$ clean
to PoU, then an I$ invalidate to PoU) followed by a context
synchronization event (e.g. an ISB) before CPUs are guaranteed to use
the new instructions rather than the old ones.

Depending on how your system has been integrated, you might get away
without that. If you see:

Data cache clean to the PoU not required for I/D coherence

... in your dmesg, that means you only need the I$ invalidate and
context synchronization event, and you might happen to get those by
virtue of alternative patching running between dcache_init_early() and
the use of the patched instructions.

However, in general, we do need all of that. As long as this runs before
secondaries are brought up, we can handle that with
caches_clean_inval_pou(). Assuming the __le32 changes above, I'd expect
this to be:

static inline void __runtime_fixup_ptr(void *where, unsigned long val)
{
	__le32 *p = lm_alias(where);
	__runtime_fixup_16(p, val);
	__runtime_fixup_16(p + 1, val >> 16);
	__runtime_fixup_16(p + 2, val >> 32);
	__runtime_fixup_16(p + 3, val >> 48);

	caches_clean_inval_pou((unsigned long)p, (unsigned long)(p + 4));
}

> +// Immediate value is 5 bits starting at bit #16
> +static inline void __runtime_fixup_shift(void *where, unsigned long val)
> +{
> + unsigned int *p = lm_alias(where);
> + unsigned int insn = *p;
> + insn &= 0xffc0ffff;
> + insn |= (val & 63) << 16;
> + *p = insn;
> +}

As with the other bits above, I'd expect this to be:

static inline void __runtime_fixup_shift(void *where, unsigned long val)
{
	__le32 *p = lm_alias(where);
	u32 insn = le32_to_cpu(*p);
	insn = aarch64_insn_encode_immediate(AARCH64_INSN_IMM_R, insn, val);
	*p = cpu_to_le32(insn);

	caches_clean_inval_pou((unsigned long)p, (unsigned long)(p + 1));
}

Mark.

> +
> +static inline void runtime_const_fixup(void (*fn)(void *, unsigned long),
> + unsigned long val, s32 *start, s32 *end)
> +{
> + while (start < end) {
> + fn(*start + (void *)start, val);
> + start++;
> + }
> +}
> +
> +#endif
> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> index 755a22d4f840..55a8e310ea12 100644
> --- a/arch/arm64/kernel/vmlinux.lds.S
> +++ b/arch/arm64/kernel/vmlinux.lds.S
> @@ -264,6 +264,9 @@ SECTIONS
> EXIT_DATA
> }
>
> + RUNTIME_CONST(shift, d_hash_shift)
> + RUNTIME_CONST(ptr, dentry_hashtable)
> +
> PERCPU_SECTION(L1_CACHE_BYTES)
> HYPERVISOR_PERCPU_SECTION
>
> --
> 2.45.1.209.gc6f12300df
>
>

2024-06-11 16:59:41

by Linus Torvalds

Subject: Re: [PATCH 4/7] arm64: add 'runtime constant' support

On Tue, 11 Jun 2024 at 07:29, Mark Rutland <[email protected]> wrote:
>
> Do we expect to use this more widely? If this only really matters for
> d_hash() it might be better to handle this via the alternatives
> framework with callbacks and avoid the need for new infrastructure.

Hmm. The notion of a callback for alternatives is intriguing and would
be very generic, but we don't have anything like that right now.

Is anybody willing to implement something like that? Because while I
like the idea, it sounds like a much bigger change.

> As-is this will break BE kernels [...]

I had forgotten about that horror. BE in this day and age is insane,
but it's easy enough to fix as per your comments. Will do.

> We have some helpers for instruction manipulation, and we can use
> aarch64_insn_encode_immediate() here, e.g.
>
> #include <asm/insn.h>
>
> static inline void __runtime_fixup_16(__le32 *p, unsigned int val)
> {
> u32 insn = le32_to_cpu(*p);
> insn = aarch64_insn_encode_immediate(AARCH64_INSN_IMM_16, insn, val);
> *p = cpu_to_le32(insn);
> }

Ugh. I did that, and then noticed that it makes the generated code
about ten times bigger.

That interface looks positively broken.

There is absolutely nobody who actually wants a dynamic argument, so
it would have made both the callers and the implementation *much*
simpler had the "AARCH64_INSN_IMM_16" been encoded in the function
name the way I did it for my instruction rewriting.

It would have made the use of it simpler, it would have avoided all
the "switch (type)" garbage, and it would have made it all generate
much better code.

So I did that change you suggested, and then undid it again.

Because that whole aarch64_insn_encode_immediate() thing is an
abomination, and should be burned at the stake. It's misdesigned in
the *worst* possible way.

And no, this code isn't performance-critical, but I have some taste,
and the code I write will not be using that garbage.

> This is missing the necessary cache maintenance and context
> synchronization event.

Will do.

Linus

2024-06-11 17:20:42

by Linus Torvalds

Subject: [PATCH 4/7 v2] arm64: add 'runtime constant' support

This implements the runtime constant infrastructure for arm64, allowing
the dcache d_hash() function to be generated using the hash table
address as a runtime constant, followed by a shift of the hash index by
a runtime constant.

Signed-off-by: Linus Torvalds <[email protected]>
---
v2: updates as per Mark Rutland

arch/arm64/include/asm/runtime-const.h | 83 ++++++++++++++++++++++++++++++++++
arch/arm64/kernel/vmlinux.lds.S | 3 ++
2 files changed, 86 insertions(+)
create mode 100644 arch/arm64/include/asm/runtime-const.h

diff --git a/arch/arm64/include/asm/runtime-const.h b/arch/arm64/include/asm/runtime-const.h
new file mode 100644
index 000000000000..8dc83d48a202
--- /dev/null
+++ b/arch/arm64/include/asm/runtime-const.h
@@ -0,0 +1,83 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_RUNTIME_CONST_H
+#define _ASM_RUNTIME_CONST_H
+
+#include <asm/cacheflush.h>
+
+/* Sigh. You can still run arm64 in BE mode */
+#include <asm/byteorder.h>
+
+#define runtime_const_ptr(sym) ({ \
+ typeof(sym) __ret; \
+ asm_inline("1:\t" \
+ "movz %0, #0xcdef\n\t" \
+ "movk %0, #0x89ab, lsl #16\n\t" \
+ "movk %0, #0x4567, lsl #32\n\t" \
+ "movk %0, #0x0123, lsl #48\n\t" \
+ ".pushsection runtime_ptr_" #sym ",\"a\"\n\t" \
+ ".long 1b - .\n\t" \
+ ".popsection" \
+ :"=r" (__ret)); \
+ __ret; })
+
+#define runtime_const_shift_right_32(val, sym) ({ \
+ unsigned long __ret; \
+ asm_inline("1:\t" \
+ "lsr %w0,%w1,#12\n\t" \
+ ".pushsection runtime_shift_" #sym ",\"a\"\n\t" \
+ ".long 1b - .\n\t" \
+ ".popsection" \
+ :"=r" (__ret) \
+ :"r" (0u+(val))); \
+ __ret; })
+
+#define runtime_const_init(type, sym) do { \
+ extern s32 __start_runtime_##type##_##sym[]; \
+ extern s32 __stop_runtime_##type##_##sym[]; \
+ runtime_const_fixup(__runtime_fixup_##type, \
+ (unsigned long)(sym), \
+ __start_runtime_##type##_##sym, \
+ __stop_runtime_##type##_##sym); \
+} while (0)
+
+/* 16-bit immediate for wide move (movz and movk) in bits 5..20 */
+static inline void __runtime_fixup_16(__le32 *p, unsigned int val)
+{
+ u32 insn = le32_to_cpu(*p);
+ insn &= 0xffe0001f;
+ insn |= (val & 0xffff) << 5;
+ *p = cpu_to_le32(insn);
+}
+
+static inline void __runtime_fixup_ptr(void *where, unsigned long val)
+{
+ __le32 *p = lm_alias(where);
+ __runtime_fixup_16(p, val);
+ __runtime_fixup_16(p+1, val >> 16);
+ __runtime_fixup_16(p+2, val >> 32);
+ __runtime_fixup_16(p+3, val >> 48);
+ caches_clean_inval_pou((unsigned long)p, (unsigned long)(p + 4));
+}
+
+/* Immediate value is 6 bits starting at bit #16 */
+static inline void __runtime_fixup_shift(void *where, unsigned long val)
+{
+ __le32 *p = lm_alias(where);
+ u32 insn = le32_to_cpu(*p);
+ insn &= 0xffc0ffff;
+ insn |= (val & 63) << 16;
+ *p = cpu_to_le32(insn);
+ caches_clean_inval_pou((unsigned long)p, (unsigned long)(p + 1));
+}
+
+static inline void runtime_const_fixup(void (*fn)(void *, unsigned long),
+ unsigned long val, s32 *start, s32 *end)
+{
+ while (start < end) {
+ fn(*start + (void *)start, val);
+ start++;
+ }
+}
+
+#endif
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 755a22d4f840..55a8e310ea12 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -264,6 +264,9 @@ SECTIONS
EXIT_DATA
}

+ RUNTIME_CONST(shift, d_hash_shift)
+ RUNTIME_CONST(ptr, dentry_hashtable)
+
PERCPU_SECTION(L1_CACHE_BYTES)
HYPERVISOR_PERCPU_SECTION


2024-06-11 17:49:16

by Mark Rutland

Subject: Re: [PATCH 4/7] arm64: add 'runtime constant' support

On Tue, Jun 11, 2024 at 09:56:17AM -0700, Linus Torvalds wrote:
> On Tue, 11 Jun 2024 at 07:29, Mark Rutland <[email protected]> wrote:
> >
> > Do we expect to use this more widely? If this only really matters for
> > d_hash() it might be better to handle this via the alternatives
> > framework with callbacks and avoid the need for new infrastructure.
>
> Hmm. The notion of a callback for alternatives is intriguing and would
> be very generic, but we don't have anything like that right now.
>
> Is anybody willing to implement something like that? Because while I
> like the idea, it sounds like a much bigger change.

Fair enough if that's a pain on x86, but we already have them on arm64, and
hence using them is a smaller change there. We already have a couple of cases
which uses MOVZ;MOVK;MOVK;MOVK sequence, e.g.

// in __invalidate_icache_max_range()
asm volatile(ALTERNATIVE_CB("movz %0, #0\n"
			    "movk %0, #0, lsl #16\n"
			    "movk %0, #0, lsl #32\n"
			    "movk %0, #0, lsl #48\n",
			    ARM64_ALWAYS_SYSTEM,
			    kvm_compute_final_ctr_el0)
	     : "=r" (ctr));

... which is patched via the callback:

void kvm_compute_final_ctr_el0(struct alt_instr *alt,
			       __le32 *origptr, __le32 *updptr, int nr_inst)
{
	generate_mov_q(read_sanitised_ftr_reg(SYS_CTR_EL0),
		       origptr, updptr, nr_inst);
}

... where the generate_mov_q() helper does the actual instruction generation.

So if we only care about a few specific constants, we could give them their own
callbacks, like kvm_compute_final_ctr_el0() above.

[...]

> > We have some helpers for instruction manipulation, and we can use
> > aarch64_insn_encode_immediate() here, e.g.
> >
> > #include <asm/insn.h>
> >
> > static inline void __runtime_fixup_16(__le32 *p, unsigned int val)
> > {
> > u32 insn = le32_to_cpu(*p);
> > insn = aarch64_insn_encode_immediate(AARCH64_INSN_IMM_16, insn, val);
> > *p = cpu_to_le32(insn);
> > }
>
> Ugh. I did that, and then noticed that it makes the generated code
> about ten times bigger.
>
> That interface looks positively broken.
>
> There is absolutely nobody who actually wants a dynamic argument, so
> it would have made both the callers and the implementation *much*
> simpler had the "AARCH64_INSN_IMM_16" been encoded in the function
> name the way I did it for my instruction rewriting.
>
> It would have made the use of it simpler, it would have avoided all
> the "switch (type)" garbage, and it would have made it all generate
> much better code.

Oh, completely agreed. FWIW, I have better versions sat in my
arm64/insn/rework branch, but I haven't had the time to get all the rest
of the insn framework cleanup sorted:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/insn/rework&id=9cf0ec088c9d5324c60933bf3924176fea0a4d0b

I can go prioritise getting that bit out if it'd help, or I can clean
this up later.

Those allow the compiler to do much better, including compile-time (or
runtime) checks that immediates fit. For example:

void encode_imm16(__le32 *p, u16 imm)
{
	u32 insn = le32_to_cpu(*p);

	// Would warn if 'imm' were u32.
	// As u16 always fits, no warning
	BUILD_BUG_ON(!aarch64_insn_try_encode_unsigned_imm16(&insn, imm));

	*p = cpu_to_le32(insn);
}

... compiles to:

<encode_imm16>:
	ldr	w2, [x0]
	bfi	w2, w1, #5, #16
	str	w2, [x0]
	ret

... which I think is what you want?

> So I did that change you suggested, and then undid it again.
>
> Because that whole aarch64_insn_encode_immediate() thing is an
> abomination, and should be burned at the stake. It's misdesigned in
> the *worst* possible way.
>
> And no, this code isn't performance-critical, but I have some taste,
> and the code I write will not be using that garbage.

Fair enough.

Mark.

2024-06-11 18:04:21

by Linus Torvalds

Subject: Re: [PATCH 4/7] arm64: add 'runtime constant' support

On Tue, 11 Jun 2024 at 10:48, Mark Rutland <[email protected]> wrote:
>
> Fair enough if that's a pain on x86, but we already have them on arm64, and
> hence using them is a smaller change there. We already have a couple of cases
> which uses MOVZ;MOVK;MOVK;MOVK sequence, e.g.
>
> // in __invalidate_icache_max_range()
> asm volatile(ALTERNATIVE_CB("movz %0, #0\n"
> "movk %0, #0, lsl #16\n"
> "movk %0, #0, lsl #32\n"
> "movk %0, #0, lsl #48\n",
> ARM64_ALWAYS_SYSTEM,
> kvm_compute_final_ctr_el0)
> : "=r" (ctr));
>
> ... which is patched via the callback:
>
> void kvm_compute_final_ctr_el0(struct alt_instr *alt,
> __le32 *origptr, __le32 *updptr, int nr_inst)
> {
> generate_mov_q(read_sanitised_ftr_reg(SYS_CTR_EL0),
> origptr, updptr, nr_inst);
> }
>
> ... where the generate_mov_q() helper does the actual instruction generation.
>
> So if we only care about a few specific constants, we could give them their own
> callbacks, like kvm_compute_final_ctr_el0() above.

I'll probably only have another day until my mailbox starts getting
more pull requests (Mon-Tue outside the merge window is typically my
quiet time, when I can go through old emails and work on private
projects).

So I'll look at doing this for x86 and see how it works.

I do suspect that even then it's possibly more code with a
site-specific callback for each case, but maybe it would be worth it
just for the flexibility.

Linus

2024-06-11 18:59:51

by Linus Torvalds

Subject: Re: [PATCH 4/7] arm64: add 'runtime constant' support

On Tue, 11 Jun 2024 at 10:59, Linus Torvalds
<[email protected]> wrote:
>
> So I'll look at doing this for x86 and see how it works.

Oh - and when I started looking at it, I immediately remembered why I
didn't want to use alternatives originally.

The alternatives are finalized much too early for this. By the time
the dcache code works, the alternatives have already been applied.

I guess all the arm64 alternative callbacks are basically finalized
very early, basically when the CPU models etc have been setup.

We could do a "late alternatives", I guess, but now it's even more
infrastructure just for the constants.

Linus

2024-06-11 20:22:45

by Mark Rutland

Subject: Re: [PATCH 4/7] arm64: add 'runtime constant' support

On Tue, Jun 11, 2024 at 11:59:21AM -0700, Linus Torvalds wrote:
> On Tue, 11 Jun 2024 at 10:59, Linus Torvalds
> <[email protected]> wrote:
> >
> > So I'll look at doing this for x86 and see how it works.
>
> Oh - and when I started looking at it, I immediately remembered why I
> didn't want to use alternatives originally.
>
> The alternatives are finalized much too early for this. By the time
> the dcache code works, the alternatives have already been applied.
>
> I guess all the arm64 alternative callbacks are basically finalized
> very early, basically when the CPU models etc have been setup.

On arm64 we have early ("boot") and late ("system-wide") alternatives.
We apply the system-wide alternatives in apply_alternatives_all(), a few
callees deep under smp_cpus_done(), after secondary CPUs are brought up,
since that has to handle mismatched features in big.LITTLE systems.

I had assumed that we could use late/system-wide alternatives here, since
those get applied after vfs_caches_init_early(), but maybe that's too
late?

> We could do a "late alternatives", I guess, but now it's even more
> infrastructure just for the constants.

Fair enough; thanks for taking a look.

Mark.

2024-06-11 21:09:02

by Linus Torvalds

Subject: Re: [PATCH 4/7] arm64: add 'runtime constant' support

On Tue, 11 Jun 2024 at 13:22, Mark Rutland <[email protected]> wrote:
>
> On arm64 we have early ("boot") and late ("system-wide") alternatives.
> We apply the system-wide alternatives in apply_alternatives_all(), a few
> callees deep under smp_cpus_done(), after secondary CPUs are brought up,
> since that has to handle mismatched features in big.LITTLE systems.

Annoyingly, we don't have any generic model for this. Maybe that would
be a good thing regardless, but your point that you have big.LITTLE
issues does kind of reinforce the thing that different architectures
have different requirements for the alternatives patching.

On arm64, the late alternatives seem to be in

  kernel_init() ->
    kernel_init_freeable() ->
      smp_init() ->
        smp_cpus_done() ->
          setup_system_features() ->
            setup_system_capabilities() ->
              apply_alternatives_all()

which is nice and late - that's when the system is fully initialized,
and kernel_init() is already running as the first real thread.

On x86, the alternatives are finalized much earlier in

  start_kernel() ->
    arch_cpu_finalize_init() ->
      alternative_instructions()

which is quite early, much closer to the early arm64 case.

Now, even that early x86 timing is good enough for vfs_caches_init_early(),
which is also done from start_kernel() fairly early on - and before
the arch_cpu_finalize_init() code is run.

But ...

> I had assumed that we could use late/system-wide alternatives here, since
> those get applied after vfs_caches_init_early(), but maybe that's too
> late?

So vfs_caches_init_early() is *one* case for the dcache init, but for
the NUMA case, we delay the dcache init until after the MM setup has
been completed, and do it relatively later in the init sequence at
vfs_caches_init().

See that horribly named 'hashdist' variable ('dist' is not 'distance',
it's 'distribute'). It's not dcache-specific, btw. There's a couple of
other hashes that do that whole "NUMA distribution or not" thing..

Annoying, yes. I'm not sure that the dual init makes any actual sense
- I think it's entirely a historical oddity.

But that "done conditionally in two different places" may be ugly, but
even if we fixed it, we'd fix it by doing it in just once, and it
would be that later "NUMA has been initialized" vfs_caches_init()
case.

Which is too late for the x86 alternatives.

The arm64 late case would seem to work fine. It's late enough to be
after all "core kernel init", but still early enough to be before the
"generic" initcalls that will start initializing filesystems etc (that
then need the vfs code to have been initialized).

So that "smp_init()" placement that arm64 has is actually a very good
place for at least the dcache case. It's just not what x86 does.

Note that my "just replace the constants" model avoids all the
ordering issues because it just does the constant initialization
synchronously when the constant is initialized.
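
Concretely, the patching happens right where the value is set up - this
is straight from patch 2 (both dcache_init_early() and dcache_init() do
the same thing):

	d_hash_shift = 32 - d_hash_shift;

	runtime_const_init(shift, d_hash_shift);
	runtime_const_init(ptr, dentry_hashtable);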

So it doesn't depend on any other ordering at all, and there is no
worry about subtle differences in when alternatives are applied, or
when the uses happen.

(It obviously does have the same ordering requirement that the
variable initialization itself has: the dcache init itself has to
happen before any dcache use, but that's neither surprising nor a new
ordering imposed by the runtime constant case).

There's an advantage to just being self-sufficient and not tying into
random other subsystems that have random other constraints.

Linus

2024-06-11 21:56:12

by Nathan Chancellor

Subject: Re: [PATCH 6/7] arm64: start using 'asm goto' for put_user() when available

Hi Linus,

On Mon, Jun 10, 2024 at 01:48:20PM -0700, Linus Torvalds wrote:
> This generates noticeably better code with compilers that support it,
> since we don't need to test the error register etc, the exception just
> jumps to the error handling directly.
>
> Signed-off-by: Linus Torvalds <[email protected]>
> ---
> arch/arm64/include/asm/uaccess.h | 77 +++++++++++++++++++-------------
> 1 file changed, 46 insertions(+), 31 deletions(-)
>
> diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
> index 23c2edf517ed..4ab3938290ab 100644
> --- a/arch/arm64/include/asm/uaccess.h
> +++ b/arch/arm64/include/asm/uaccess.h
> @@ -294,29 +294,41 @@ do { \
> } while (0); \
> } while (0)
>
> -#define __put_mem_asm(store, reg, x, addr, err, type) \
> +#ifdef CONFIG_CC_HAS_ASM_GOTO

This symbol was eliminated two years ago with commit a0a12c3ed057 ("asm
goto: eradicate CC_HAS_ASM_GOTO") since all supported compilers have
support for it.

> +#define __put_mem_asm(store, reg, x, addr, label, type) \
> + asm goto( \
> + "1: " store " " reg "0, [%1]\n" \
> + "2:\n" \
> + _ASM_EXTABLE_##type##ACCESS_ZERO(1b, %l2) \
> + : : "rZ" (x), "r" (addr) : : label)
> +#else
> +#define __put_mem_asm(store, reg, x, addr, label, type) do { \
> + int __pma_err = 0; \
> asm volatile( \
> "1: " store " " reg "1, [%2]\n" \
> "2:\n" \
> _ASM_EXTABLE_##type##ACCESS_ERR(1b, 2b, %w0) \
> - : "+r" (err) \
> - : "rZ" (x), "r" (addr))
> + : "+r" (__pma_err) \
> + : "rZ" (x), "r" (addr)); \
> + if (__pma_err) goto label; \
> +} while (0)
> +#endif

Cheers,
Nathan

2024-06-11 23:34:53

by Linus Torvalds

Subject: Re: [PATCH 6/7] arm64: start using 'asm goto' for put_user() when available

On Tue, 11 Jun 2024 at 14:56, Nathan Chancellor <[email protected]> wrote:
>
> > +#ifdef CONFIG_CC_HAS_ASM_GOTO
>
> This symbol was eliminated two years ago with commit a0a12c3ed057 ("asm
> goto: eradicate CC_HAS_ASM_GOTO") since all supported compilers have
> support for it.

Hah. And I was trying to be a good boy and keep the old setup working.

Instead - because the HAS_ASM_GOTO config variable no longer exists -
I didn't actually test the new case at all, and it only worked because
the old case did in fact work.

Because fixing the broken #ifdef also showed that the

+ _ASM_EXTABLE_##type##ACCESS_ZERO(1b, %l2)

line was wrong and was a copy-and-paste error from the get_user case
(that zeroes the result register on error).

It should be just

+ _ASM_EXTABLE_##type##ACCESS(1b, %l2)

and to make the nasty copy_to_kernel_nofault_loop() build I also need
to add a proper _ASM_EXTABLE_KACCESS macro without the zeroing, which
didn't exist before.

Oops.

It would be nice to get rid of the CC_HAS_ASM_GOTO_OUTPUT thing too,
but that's probably a decade away ;(

But at least this made me now go and actually test the _actual_ old
compiler case (no asm goto output). Perhaps ironically, I did get
*that* one right. That's the case where I had actually checked the new
code for get_user().

Linus

2024-06-11 23:40:47

by Linus Torvalds

Subject: [PATCH 6/7 v2] arm64: start using 'asm goto' for put_user() when available

This generates noticeably better code with compilers that support it,
since we don't need to test the error register etc, the exception just
jumps to the error handling directly.

Unlike get_user(), there's no need to worry about old compilers. All
supported compilers support the regular non-output 'asm goto', as
pointed out by Nathan Chancellor.

Signed-off-by: Linus Torvalds <[email protected]>
---

This is the fixed version that actually uses "asm goto" for put_user()
because it doesn't accidentally disable it by using the old CONFIG
option that no longer exists.

arch/arm64/include/asm/asm-extable.h | 3 ++
arch/arm64/include/asm/uaccess.h | 70 ++++++++++++++--------------
2 files changed, 39 insertions(+), 34 deletions(-)

diff --git a/arch/arm64/include/asm/asm-extable.h b/arch/arm64/include/asm/asm-extable.h
index 980d1dd8e1a3..b8a5861dc7b7 100644
--- a/arch/arm64/include/asm/asm-extable.h
+++ b/arch/arm64/include/asm/asm-extable.h
@@ -112,6 +112,9 @@
#define _ASM_EXTABLE_KACCESS_ERR(insn, fixup, err) \
_ASM_EXTABLE_KACCESS_ERR_ZERO(insn, fixup, err, wzr)

+#define _ASM_EXTABLE_KACCESS(insn, fixup) \
+ _ASM_EXTABLE_KACCESS_ERR_ZERO(insn, fixup, wzr, wzr)
+
#define _ASM_EXTABLE_LOAD_UNALIGNED_ZEROPAD(insn, fixup, data, addr) \
__DEFINE_ASM_GPR_NUMS \
__ASM_EXTABLE_RAW(#insn, #fixup, \
diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 23c2edf517ed..6d4b16acc880 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -294,29 +294,28 @@ do { \
} while (0); \
} while (0)

-#define __put_mem_asm(store, reg, x, addr, err, type) \
- asm volatile( \
- "1: " store " " reg "1, [%2]\n" \
+#define __put_mem_asm(store, reg, x, addr, label, type) \
+ asm goto( \
+ "1: " store " " reg "0, [%1]\n" \
"2:\n" \
- _ASM_EXTABLE_##type##ACCESS_ERR(1b, 2b, %w0) \
- : "+r" (err) \
- : "rZ" (x), "r" (addr))
+ _ASM_EXTABLE_##type##ACCESS(1b, %l2) \
+ : : "rZ" (x), "r" (addr) : : label)

-#define __raw_put_mem(str, x, ptr, err, type) \
+#define __raw_put_mem(str, x, ptr, label, type) \
do { \
__typeof__(*(ptr)) __pu_val = (x); \
switch (sizeof(*(ptr))) { \
case 1: \
- __put_mem_asm(str "b", "%w", __pu_val, (ptr), (err), type); \
+ __put_mem_asm(str "b", "%w", __pu_val, (ptr), label, type); \
break; \
case 2: \
- __put_mem_asm(str "h", "%w", __pu_val, (ptr), (err), type); \
+ __put_mem_asm(str "h", "%w", __pu_val, (ptr), label, type); \
break; \
case 4: \
- __put_mem_asm(str, "%w", __pu_val, (ptr), (err), type); \
+ __put_mem_asm(str, "%w", __pu_val, (ptr), label, type); \
break; \
case 8: \
- __put_mem_asm(str, "%x", __pu_val, (ptr), (err), type); \
+ __put_mem_asm(str, "%x", __pu_val, (ptr), label, type); \
break; \
default: \
BUILD_BUG(); \
@@ -328,25 +327,34 @@ do { \
* uaccess_ttbr0_disable(). As `x` and `ptr` could contain blocking functions,
* we must evaluate these outside of the critical section.
*/
-#define __raw_put_user(x, ptr, err) \
+#define __raw_put_user(x, ptr, label) \
do { \
+ __label__ __rpu_failed; \
__typeof__(*(ptr)) __user *__rpu_ptr = (ptr); \
__typeof__(*(ptr)) __rpu_val = (x); \
__chk_user_ptr(__rpu_ptr); \
\
- uaccess_ttbr0_enable(); \
- __raw_put_mem("sttr", __rpu_val, __rpu_ptr, err, U); \
- uaccess_ttbr0_disable(); \
+ do { \
+ uaccess_ttbr0_enable(); \
+ __raw_put_mem("sttr", __rpu_val, __rpu_ptr, __rpu_failed, U); \
+ uaccess_ttbr0_disable(); \
+ break; \
+ __rpu_failed: \
+ uaccess_ttbr0_disable(); \
+ goto label; \
+ } while (0); \
} while (0)

#define __put_user_error(x, ptr, err) \
do { \
+ __label__ __pu_failed; \
__typeof__(*(ptr)) __user *__p = (ptr); \
might_fault(); \
if (access_ok(__p, sizeof(*__p))) { \
__p = uaccess_mask_ptr(__p); \
- __raw_put_user((x), __p, (err)); \
+ __raw_put_user((x), __p, __pu_failed); \
} else { \
+ __pu_failed: \
(err) = -EFAULT; \
} \
} while (0)
@@ -369,15 +377,18 @@ do { \
do { \
__typeof__(dst) __pkn_dst = (dst); \
__typeof__(src) __pkn_src = (src); \
- int __pkn_err = 0; \
\
- __mte_enable_tco_async(); \
- __raw_put_mem("str", *((type *)(__pkn_src)), \
- (__force type *)(__pkn_dst), __pkn_err, K); \
- __mte_disable_tco_async(); \
- \
- if (unlikely(__pkn_err)) \
+ do { \
+ __label__ __pkn_err; \
+ __mte_enable_tco_async(); \
+ __raw_put_mem("str", *((type *)(__pkn_src)), \
+ (__force type *)(__pkn_dst), __pkn_err, K); \
+ __mte_disable_tco_async(); \
+ break; \
+ __pkn_err: \
+ __mte_disable_tco_async(); \
goto err_label; \
+ } while (0); \
} while(0)

extern unsigned long __must_check __arch_copy_from_user(void *to, const void __user *from, unsigned long n);
@@ -411,17 +422,8 @@ static __must_check __always_inline bool user_access_begin(const void __user *pt
}
#define user_access_begin(a,b) user_access_begin(a,b)
#define user_access_end() uaccess_ttbr0_disable()
-
-/*
- * The arm64 inline asms should learn abut asm goto, and we should
- * teach user_access_begin() about address masking.
- */
-#define unsafe_put_user(x, ptr, label) do { \
- int __upu_err = 0; \
- __raw_put_mem("sttr", x, uaccess_mask_ptr(ptr), __upu_err, U); \
- if (__upu_err) goto label; \
-} while (0)
-
+#define unsafe_put_user(x, ptr, label) \
+ __raw_put_mem("sttr", x, uaccess_mask_ptr(ptr), label, U)
#define unsafe_get_user(x, ptr, label) \
__raw_get_mem("ldtr", x, uaccess_mask_ptr(ptr), label, U)

--
2.45.1.209.gc6f12300df


2024-06-12 10:05:06

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/7] add default dummy 'runtime constant' infrastructure

On Mon, Jun 10, 2024 at 01:48:16PM -0700, Linus Torvalds wrote:
> Subject: Re: [PATCH 2/7] add default dummy 'runtime constant' infrastructure

Probably need a subject prefix:

"runtime-const: Add default..."

> This adds the initial dummy support for 'runtime constants' for when
> an architecture doesn't actually support an implementation of fixing up
> said runtime constants.

...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-12 19:03:02

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 4/7 v2] arm64: add 'runtime constant' support

On Tue, Jun 11, 2024 at 10:20:10AM -0700, Linus Torvalds wrote:
> This implements the runtime constant infrastructure for arm64, allowing
> the dcache d_hash() function to be generated using the hash table
> address as a runtime constant, followed by a shift by a constant for
> the hash index.
>
> Signed-off-by: Linus Torvalds <[email protected]>
> ---
> v2: updates as per Mark Rutland

Sorry, I just realised I got the cache maintenance slightly wrong below.

> +static inline void __runtime_fixup_ptr(void *where, unsigned long val)
> +{
> + __le32 *p = lm_alias(where);
> + __runtime_fixup_16(p, val);
> + __runtime_fixup_16(p+1, val >> 16);
> + __runtime_fixup_16(p+2, val >> 32);
> + __runtime_fixup_16(p+3, val >> 48);
> + caches_clean_inval_pou((unsigned long)p, (unsigned long)(p + 4));
> +}

We need to do the I$ maintenance on the VA that'll be executed (to
handle systems with a VIPT I$), so we'll need to use 'where' rather than
'p', e.g.

caches_clean_inval_pou((unsigned long)where,
(unsigned long)where + 4 * AARCH64_INSN_SIZE);

Note: the D$ and I$ maintenance instructions (DC CVAU and IC IVAU) only
require read permissions, so those can be used on the kernel's
executable alias even though that's mapped without write permissions.
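
(Folding both points in, a corrected __runtime_fixup_ptr() would look
roughly like the sketch below - same helpers as in the quoted patch,
with only the cache maintenance range changed to the executable VA:)

static inline void __runtime_fixup_ptr(void *where, unsigned long val)
{
	/* Patch the four 16-bit immediates via the writable linear-map alias */
	__le32 *p = lm_alias(where);

	__runtime_fixup_16(p, val);
	__runtime_fixup_16(p + 1, val >> 16);
	__runtime_fixup_16(p + 2, val >> 32);
	__runtime_fixup_16(p + 3, val >> 48);

	/* ...but do the D$/I$ maintenance on the VA that will be executed */
	caches_clean_inval_pou((unsigned long)where,
			       (unsigned long)where + 4 * AARCH64_INSN_SIZE);
}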

> +/* Immediate value is 6 bits starting at bit #16 */
> +static inline void __runtime_fixup_shift(void *where, unsigned long val)
> +{
> + __le32 *p = lm_alias(where);
> + u32 insn = le32_to_cpu(*p);
> + insn &= 0xffc0ffff;
> + insn |= (val & 63) << 16;
> + *p = cpu_to_le32(insn);
> + caches_clean_inval_pou((unsigned long)p, (unsigned long)(p + 1));
> +}

Likewise:

caches_clean_inval_pou((unsigned long)where,
(unsigned long)where + AARCH64_INSN_SIZE);
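
(And the same change applied to the shift fixup, as a sketch:)

static inline void __runtime_fixup_shift(void *where, unsigned long val)
{
	__le32 *p = lm_alias(where);
	u32 insn = le32_to_cpu(*p);

	/* Replace the 6-bit immediate field starting at bit 16 */
	insn &= 0xffc0ffff;
	insn |= (val & 63) << 16;
	*p = cpu_to_le32(insn);

	/* Again, maintain the caches on the executable VA, not the alias */
	caches_clean_inval_pou((unsigned long)where,
			       (unsigned long)where + AARCH64_INSN_SIZE);
}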

Mark.

2024-06-12 19:04:05

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 0/7] arm64 / x86-64: low-level code generation issues

On Mon, Jun 10, 2024 at 01:48:14PM -0700, Linus Torvalds wrote:
> The last three patches are purely arm64-specific, and just fix up some
> nasty code generation in the user access functions. I just noticed that
> I will need to implement 'user_access_save()' for KCSAN now that I do
> the unsafe user access functions.
>
> Anyway, that 'user_access_save/restore()' issue only shows up with
> KCSAN. And it would be a no-op thanks to arm64 doing SMAP the right way
> (pet peeve: arm64 did what I told the x86 designers to do originally,
> but they claimed was too hard, so we ended up with that CLAC/STAC
> instead)...
>
> Sadly that "no-op for KCSAN" would is except for the horrid
> CONFIG_ARM64_SW_TTBR0_PAN case, which is why I'm not touching it. I'm
> hoping some hapless^Whelpful arm64 person is willing to tackle this (or
> maybe make KCSAN and ARM64_SW_TTBR0_PAN incompatible in the Kconfig).

Given how badly things go when we get this wrong (e.g. TLB corruption), I'd
like to say "just mark it incompatible", applying to all instrumentation, not
just KCSAN.

Any new microarchitecture since ~2014 has real HW PAN, and IIUC it's really
only Cortex-{A53,A57,A72} that don't have it, so I think it'd be fair to say
don't use sanitizers with SW PAN on those CPUs.

Otherwise, I came up with the below (untested). It's a bit horrid because we
could have instrumentation in the middle of __uaccess_ttbr0_{enable,disable}(),
and so we always need the ISB just in case, and TBH I'm not sure that we use
user_access_{save,restore}() in all the places we should.

Mark.

diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 30e2f8fa87a4..83301400ec4c 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -92,6 +92,38 @@ static inline void __uaccess_ttbr0_enable(void)
local_irq_restore(flags);
}

+static inline unsigned long user_access_ttbr0_save(void)
+{
+ if (!system_uses_ttbr0_pan())
+ return 0;
+
+ /*
+ * If TTBR1 has an ASID other than the reserved ASID, then we have an
+ * active user TTBR0 or are part-way through enabling/disabling TTBR0
+ * access.
+ */
+ if (read_sysreg(ttbr1_el1) & TTBR_ASID_MASK) {
+ __uaccess_ttbr0_disable();
+ return 1;
+ }
+
+ /*
+ * Instrumentation could fire during __uaccess_ttbr0_disable() between
+ * the final write to TTBR1 and before the trailing ISB. We need an ISB
+ * to ensure that we don't continue to use the old ASID.
+ */
+ isb();
+ return 0;
+}
+
+static inline void user_access_ttbr0_restore(unsigned long enabled)
+{
+ if (!system_uses_ttbr0_pan() || !enabled)
+ return;
+
+ __uaccess_ttbr0_enable();
+}
+
static inline bool uaccess_ttbr0_disable(void)
{
if (!system_uses_ttbr0_pan())
@@ -117,8 +149,20 @@ static inline bool uaccess_ttbr0_enable(void)
{
return false;
}
+
+static inline unsigned long user_access_ttbr0_save(void)
+{
+ return 0;
+}
+
+static inline void user_access_ttbr0_restore(unsigned long)
+{
+}
#endif

+#define user_access_save user_access_ttbr0_save
+#define user_access_restore user_access_ttbr0_restore
+
static inline void __uaccess_disable_hw_pan(void)
{
asm(ALTERNATIVE("nop", SET_PSTATE_PAN(0), ARM64_HAS_PAN,
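
(The intended calling pattern for the save/restore pair above is the
usual instrumentation-side idiom - a sketch, with a made-up hook name:)

static void example_instrumentation_hook(void)
{
	/* Force user access off (and remember whether it was on)... */
	unsigned long ua_flags = user_access_save();

	/* ...do the checking/reporting work with TTBR0 access disabled... */

	/* ...and put things back the way the interrupted code had them. */
	user_access_restore(ua_flags);
}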

2024-06-12 20:04:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/7] arm64 / x86-64: low-level code generation issues

On Wed, 12 Jun 2024 at 11:41, Mark Rutland <[email protected]> wrote:
>
> Given how badly things go when we get this wrong (e.g. TLB corruption), I'd
> like to say "just mark it incompatible", applying to all instrumentation, not
> just KCSAN.

Ack. I'll start out with just KCSAN (since that's the actual technical
issue right now). But since the SW PAN support is hopefully not
something that we should worry about going forward, I wouldn't mind it
being de-emphasized.

It's not like PAN is something that should necessarily be everywhere.
The real advantage of SMAP on x86 (and then PAN on arm) is that it
catches wild kernel pointers. As long as the HW support is common
enough, people will find bugs on those platforms.

So I think the advantage of SW PAN was "it will find issues early
before HW PAN is widely available". It might be time to lay SW PAN
entirely to rest now.

I'll send out a new version of the arm64 patches with the KCSAN build
failure fixed (with the simple no-op save/restore functions by making
KCSAN and SW PAN mutually incompatible), and with the virtual address
fix you pointed out in the other email.

Linus

2024-06-12 22:26:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/7] arm64 / x86-64: low-level code generation issues

On Wed, 12 Jun 2024 at 13:02, Linus Torvalds
<[email protected]> wrote:
>
> I'll send out a new version of the arm64 patches with the KCSAN build
> failure fixed (with the simple no-op save/restore functions by making
> KCSAN and SW PAN mutually incompatible), and with the virtual address
> fix you pointed out in the other email.

Actually, unless somebody really wants to see the whole series again,
here's just the diff between the old and the new end result of the series.

The actual changes are done in the relevant commits (ie the "asm goto
for get_user()" one for the KCSAN issue, and the "arm64 runtime
constant" commit for the I$ fixup).

Holler if you want to see the full series again.

It might be worth noting that I initially made the arm64 KCSAN support
have a "depend on !ARM64_SW_TTBR0_PAN" condition, but decided that
it's actually better to do it the other way around, and make
ARM64_SW_TTBR0_PAN depend on KCSAN not being enabled.

That way we're basically more eagerly disabling the thing that should
go away, and we're also having the KCSAN code be checked for
allmodconfig builds.

But hey, it's a judgement call.
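
(For concreteness, that direction corresponds to something like the
Kconfig sketch below - illustrative only, not the exact hunk that went
in:)

config ARM64_SW_TTBR0_PAN
	bool "Emulate Privileged Access Never using TTBR0_EL1 switching"
	# sketch: make SW PAN mutually exclusive with KCSAN
	depends on !KCSAN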

Linus


Attachments:
patch.diff (2.51 kB)