Needs more comments and testing, but it works, and has a generic
fallback for architectures that don't support it.
Signed-off-by: Linus Torvalds <[email protected]>
---
Notes from the first hack: I renamed the infrastructure from "static
const" to "runtime const". We end up having a number of uses of "static
const" that are related to the C language notion of "static const"
variables or functions, and "runtime constant" is a bit more descriptive
anyway.
And this now is properly abstracted out, so that any architecture can
choose to implement their own version, but it all falls back on "just
use the variable".
Josh - sorry for wasting your time on the objtool patch, I ended up
using the linker functionality that Rasmus pointed out as existing
instead.
Rasmus - I've cleaned up my patch a lot, and it now compiles fine on
other architectures too, although obviously with the fallback of "no
constant fixup". As a result, my patch is actually smaller and much
cleaner, and I ended up liking my approach more than your RAI thing
after all.
Ingo / Peter / Borislav - I enabled this for 32-bit x86 too, because it
was literally trivial (had to remove a "q" from "movq"). I did a
test-build and it looks fine, but I didn't actually try to boot it.
The x86-64 code is actually tested. It's not like it has a _lot_ of
testing, but the patch ends up being pretty small in the end. Yes, the
"shift u32 value right by a constant" is a pretty special case, but the
__d_lookup_rcu() function really is pretty hot.
Or rather it *was* pretty hot. It's actually looking very good with
this, imho.
Build tested with allmodconfig and on arm64, but I'm not claiming that I
have necessarily found all special case corners. That said, it's small
and pretty straightforward.
Comments?
arch/x86/include/asm/runtime-const.h | 61 ++++++++++++++++++++++++++++
arch/x86/kernel/vmlinux.lds.S | 3 ++
fs/dcache.c | 24 +++++++----
include/asm-generic/Kbuild | 1 +
include/asm-generic/runtime-const.h | 15 +++++++
include/asm-generic/vmlinux.lds.h | 8 ++++
6 files changed, 104 insertions(+), 8 deletions(-)
create mode 100644 arch/x86/include/asm/runtime-const.h
create mode 100644 include/asm-generic/runtime-const.h
diff --git a/arch/x86/include/asm/runtime-const.h b/arch/x86/include/asm/runtime-const.h
new file mode 100644
index 000000000000..b4f7efc0a554
--- /dev/null
+++ b/arch/x86/include/asm/runtime-const.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_RUNTIME_CONST_H
+#define _ASM_RUNTIME_CONST_H
+
+#define runtime_const_ptr(sym) ({ \
+ typeof(sym) __ret; \
+ asm("mov %1,%0\n1:\n" \
+ ".pushsection runtime_ptr_" #sym ",\"a\"\n\t" \
+ ".long 1b - %c2 - .\n\t" \
+ ".popsection" \
+ :"=r" (__ret) \
+ :"i" ((unsigned long)0x0123456789abcdefull), \
+ "i" (sizeof(long))); \
+ __ret; })
+
+// The 'typeof' will create at _least_ a 32-bit type, but
+// will happily also take a bigger type and the 'shrl' will
+// clear the upper bits
+#define runtime_const_shift_right_32(val, sym) ({ \
+ typeof(0u+(val)) __ret = (val); \
+ asm("shrl $12,%k0\n1:\n" \
+ ".pushsection runtime_shift_" #sym ",\"a\"\n\t" \
+ ".long 1b - 1 - .\n\t" \
+ ".popsection" \
+ :"+r" (__ret)); \
+ __ret; })
+
+#define runtime_const_init(type, sym, value) do { \
+ extern s32 __start_runtime_##type##_##sym[]; \
+ extern s32 __stop_runtime_##type##_##sym[]; \
+ runtime_const_fixup(__runtime_fixup_##type, \
+ (unsigned long)(value), \
+ __start_runtime_##type##_##sym, \
+ __stop_runtime_##type##_##sym); \
+} while (0)
+
+/*
+ * The text patching is trivial - you can only do this at init time,
+ * when the text section hasn't been marked RO, and before the text
+ * has ever been executed.
+ */
+static inline void __runtime_fixup_ptr(void *where, unsigned long val)
+{
+ *(unsigned long *)where = val;
+}
+
+static inline void __runtime_fixup_shift(void *where, unsigned long val)
+{
+ *(unsigned char *)where = val;
+}
+
+static inline void runtime_const_fixup(void (*fn)(void *, unsigned long),
+ unsigned long val, s32 *start, s32 *end)
+{
+ while (start < end) {
+ fn(*start + (void *)start, val);
+ start++;
+ }
+}
+
+#endif
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 3509afc6a672..6e73403e874f 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -357,6 +357,9 @@ SECTIONS
PERCPU_SECTION(INTERNODE_CACHE_BYTES)
#endif
+ RUNTIME_CONST(shift, d_hash_shift)
+ RUNTIME_CONST(ptr, dentry_hashtable)
+
. = ALIGN(PAGE_SIZE);
/* freed after init ends here */
diff --git a/fs/dcache.c b/fs/dcache.c
index 407095188f83..4511e557bf84 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -97,12 +97,14 @@ EXPORT_SYMBOL(dotdot_name);
*/
static unsigned int d_hash_shift __ro_after_init;
-
static struct hlist_bl_head *dentry_hashtable __ro_after_init;
-static inline struct hlist_bl_head *d_hash(unsigned int hash)
+#include <asm/runtime-const.h>
+
+static inline struct hlist_bl_head *d_hash(unsigned long hashlen)
{
- return dentry_hashtable + (hash >> d_hash_shift);
+ return runtime_const_ptr(dentry_hashtable) +
+ runtime_const_shift_right_32(hashlen, d_hash_shift);
}
#define IN_LOOKUP_SHIFT 10
@@ -495,7 +497,7 @@ static void ___d_drop(struct dentry *dentry)
if (unlikely(IS_ROOT(dentry)))
b = &dentry->d_sb->s_roots;
else
- b = d_hash(dentry->d_name.hash);
+ b = d_hash(dentry->d_name.hash_len);
hlist_bl_lock(b);
__hlist_bl_del(&dentry->d_hash);
@@ -2104,7 +2106,7 @@ static noinline struct dentry *__d_lookup_rcu_op_compare(
unsigned *seqp)
{
u64 hashlen = name->hash_len;
- struct hlist_bl_head *b = d_hash(hashlen_hash(hashlen));
+ struct hlist_bl_head *b = d_hash(hashlen);
struct hlist_bl_node *node;
struct dentry *dentry;
@@ -2171,7 +2173,7 @@ struct dentry *__d_lookup_rcu(const struct dentry *parent,
{
u64 hashlen = name->hash_len;
const unsigned char *str = name->name;
- struct hlist_bl_head *b = d_hash(hashlen_hash(hashlen));
+ struct hlist_bl_head *b = d_hash(hashlen);
struct hlist_bl_node *node;
struct dentry *dentry;
@@ -2277,7 +2279,7 @@ EXPORT_SYMBOL(d_lookup);
struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
{
unsigned int hash = name->hash;
- struct hlist_bl_head *b = d_hash(hash);
+ struct hlist_bl_head *b = d_hash(name->hash_len);
struct hlist_bl_node *node;
struct dentry *found = NULL;
struct dentry *dentry;
@@ -2397,7 +2399,7 @@ EXPORT_SYMBOL(d_delete);
static void __d_rehash(struct dentry *entry)
{
- struct hlist_bl_head *b = d_hash(entry->d_name.hash);
+ struct hlist_bl_head *b = d_hash(entry->d_name.hash_len);
hlist_bl_lock(b);
hlist_bl_add_head_rcu(&entry->d_hash, b);
@@ -3129,6 +3131,9 @@ static void __init dcache_init_early(void)
0,
0);
d_hash_shift = 32 - d_hash_shift;
+
+ runtime_const_init(shift, d_hash_shift, d_hash_shift);
+ runtime_const_init(ptr, dentry_hashtable, dentry_hashtable);
}
static void __init dcache_init(void)
@@ -3157,6 +3162,9 @@ static void __init dcache_init(void)
0,
0);
d_hash_shift = 32 - d_hash_shift;
+
+ runtime_const_init(shift, d_hash_shift, d_hash_shift);
+ runtime_const_init(ptr, dentry_hashtable, dentry_hashtable);
}
/* SLAB cache for __getname() consumers */
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index b20fa25a7e8d..052e5c98c105 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -46,6 +46,7 @@ mandatory-y += pci.h
mandatory-y += percpu.h
mandatory-y += pgalloc.h
mandatory-y += preempt.h
+mandatory-y += runtime-const.h
mandatory-y += rwonce.h
mandatory-y += sections.h
mandatory-y += serial.h
diff --git a/include/asm-generic/runtime-const.h b/include/asm-generic/runtime-const.h
new file mode 100644
index 000000000000..b54824bd616e
--- /dev/null
+++ b/include/asm-generic/runtime-const.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_RUNTIME_CONST_H
+#define _ASM_RUNTIME_CONST_H
+
+/*
+ * This is the fallback for when the architecture doesn't
+ * support the runtime const operations.
+ *
+ * We just use the actual symbols as-is.
+ */
+#define runtime_const_ptr(sym) (sym)
+#define runtime_const_shift_right_32(val, sym) ((u32)(val)>>(sym))
+#define runtime_const_init(type,sym,value) do { } while (0)
+
+#endif
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 5703526d6ebf..389a78415b9b 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -944,6 +944,14 @@
#define CON_INITCALL \
BOUNDED_SECTION_POST_LABEL(.con_initcall.init, __con_initcall, _start, _end)
+#define RUNTIME_NAME(t,x) runtime_##t##_##x
+
+#define RUNTIME_CONST(t,x) \
+ . = ALIGN(8); \
+ RUNTIME_NAME(t,x) : AT(ADDR(RUNTIME_NAME(t,x)) - LOAD_OFFSET) { \
+ *(RUNTIME_NAME(t,x)); \
+ }
+
/* Alignment must be consistent with (kunit_suite *) in include/kunit/test.h */
#define KUNIT_TABLE() \
. = ALIGN(8); \
--
2.45.1.209.gc6f12300df
On Sat, 8 Jun 2024 at 12:43, Linus Torvalds
<[email protected]> wrote:
>
> Needs more comments and testing, but it works, and has a generic
> fallback for architectures that don't support it.
.. and here is the TOTALLY UNTESTED patch to implement the arm64
version of runtime constants.
It almost certainly does not work. I'm too scared to test it. I may
start to recognize arm64 instructions, but rewriting them on the fly
is another thing entirely, and I'm fairly sure this needs an I$ sync
and probably modifying the instructions using another address even
during early boot.
So this is a "throw it over the fence to the actually competent arm64
people" patch.
Catalin, Will? This depends on the infrastructure that I added in
https://lore.kernel.org/all/[email protected]/
which is actually tested on the x86-64 side.
I did test that the code generation looks superficially sane, and this generates
mov x1, #0xcdef
movk x1, #0x89ab, lsl #16
movk x1, #0x4567, lsl #32
movk x1, #0x123, lsl #48
...
lsr w0, w25, #12
ldr x0, [x1, x0, lsl #3]
for the dcache hash lookup (those constants are obviously the ones
that get rewritten after the hash table has been allocated and the
size becomes fixed).
And honestly, I may have gotten even the simple part of instruction
rewriting wrong (ie maybe I'm filling in the wrong bit locations - I'm
reading the architecture manual, not actually *testing* anything).
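As a rough sketch of the bit-insertion being described - assuming the movz/movk
imm16 field sits in bits [20:5] of the instruction, with the helper name being
hypothetical rather than taken from the patch:
static inline void __runtime_fixup_16(__le32 *p, unsigned int val)
{
	u32 insn = le32_to_cpu(*p);

	/* movz/movk: imm16 occupies bits [20:5], Rd is bits [4:0] */
	insn &= 0xffe0001f;
	insn |= (val & 0xffff) << 5;
	*p = cpu_to_le32(insn);
}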
Think of this patch mostly as a "look, adding another architecture
isn't *that* hard - even if the constant value is spread out in the
instructions".
Linus
On Sat, 8 Jun 2024 at 13:55, Linus Torvalds
<[email protected]> wrote:
>
> Think of this patch mostly as a "look, adding another architecture
> isn't *that* hard - even if the constant value is spread out in the
> instructions".
.. and here's a version that actually works. It wasn't that bad.
Or rather, it wouldn't have been that bad had I not spent *ages*
debugging a stupid cut-and-paste error where, instead of writing
words 0..3 of the 64-bit large constant generation, I wrote words 0..2
and then overwrote word 2 (again) with the data that should have gone
into word 3. Causing the top 32 bits to be all wonky. Oops. Literally.
That stupid typo caused like two hours of wasted time.
But anyway, this patch actually works for me. It still doesn't do any
I$/D$ flushing, because it's not needed in practice, but it *should*
probably do that.
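A rough sketch of what the full pointer patching plus that cache maintenance
might look like, reusing the __runtime_fixup_16() helper sketched earlier and
assuming arm64's caches_clean_inval_pou() interface - none of this is taken
verbatim from the posted patch:
static inline void __runtime_fixup_ptr(void *where, unsigned long val)
{
	__le32 *p = where;

	/* one movz plus three movk: words 0..3 of the 64-bit constant */
	__runtime_fixup_16(p, val);
	__runtime_fixup_16(p + 1, val >> 16);
	__runtime_fixup_16(p + 2, val >> 32);
	__runtime_fixup_16(p + 3, val >> 48);

	/* the I$/D$ maintenance mentioned above */
	caches_clean_inval_pou((unsigned long)p, (unsigned long)(p + 4));
}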
Linus
On Sat, Jun 08, 2024 at 12:35:05PM -0700, Linus Torvalds wrote:
> Ingo / Peter / Borislav - I enabled this for 32-bit x86 too, because it
> was literally trivial (had to remove a "q" from "movq"). I did a
> test-build and it looks fine, but I didn't actually try to boot it.
Will do once you have your final version. I still have an Atom, 32-bit
only laptop lying around here.
> +#define runtime_const_ptr(sym) ({ \
> + typeof(sym) __ret; \
> + asm("mov %1,%0\n1:\n" \
> + ".pushsection runtime_ptr_" #sym ",\"a\"\n\t" \
> + ".long 1b - %c2 - .\n\t" \
> + ".popsection" \
> + :"=r" (__ret) \
> + :"i" ((unsigned long)0x0123456789abcdefull), \
> + "i" (sizeof(long))); \
> + __ret; })
You might wanna use asm symbolic names for the operands so that it is
more readable:
#define runtime_const_ptr(sym) ({ \
typeof(sym) __ret; \
asm("mov %[constant] ,%[__ret]\n1:\n" \
".pushsection runtime_ptr_" #sym ",\"a\"\n\t" \
".long 1b - %c[sizeoflong] - .\n\t" \
".popsection" \
: [__ret] "=r" (__ret) \
: [constant] "i" ((unsigned long)0x0123456789abcdefull), \
[sizeoflong] "i" (sizeof(long))); \
__ret; })
For example.
> +// The 'typeof' will create at _least_ a 32-bit type, but
> +// will happily also take a bigger type and the 'shrl' will
> +// clear the upper bits
Can we pls use the multiline comments, like you do below in the same
file.
Otherwise, it looks ok to me and it boots in a guest.
I'll take the final version for a spin on real hw in a couple of days,
once the review dust settles.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On June 9, 2024 4:22:40 AM PDT, Borislav Petkov <[email protected]> wrote:
>On Sat, Jun 08, 2024 at 12:35:05PM -0700, Linus Torvalds wrote:
>> Ingo / Peter / Borislav - I enabled this for 32-bit x86 too, because it
>> was literally trivial (had to remove a "q" from "movq"). I did a
>> test-build and it looks fine, but I didn't actually try to boot it.
>
>Will do once you have your final version. I still have an Atom, 32-bit
>only laptop lying around here.
>
>> +#define runtime_const_ptr(sym) ({ \
>> + typeof(sym) __ret; \
>> + asm("mov %1,%0\n1:\n" \
>> + ".pushsection runtime_ptr_" #sym ",\"a\"\n\t" \
>> + ".long 1b - %c2 - .\n\t" \
>> + ".popsection" \
>> + :"=r" (__ret) \
>> + :"i" ((unsigned long)0x0123456789abcdefull), \
>> + "i" (sizeof(long))); \
>> + __ret; })
>
>You might wanna use asm symbolic names for the operands so that it is
>more readable:
>
>#define runtime_const_ptr(sym) ({ \
> typeof(sym) __ret; \
> asm("mov %[constant] ,%[__ret]\n1:\n" \
> ".pushsection runtime_ptr_" #sym ",\"a\"\n\t" \
> ".long 1b - %c[sizeoflong] - .\n\t" \
> ".popsection" \
> : [__ret] "=r" (__ret) \
> : [constant] "i" ((unsigned long)0x0123456789abcdefull), \
> [sizeoflong] "i" (sizeof(long))); \
> __ret; })
>
>For example.
>
>> +// The 'typeof' will create at _least_ a 32-bit type, but
>> +// will happily also take a bigger type and the 'shrl' will
>> +// clear the upper bits
>
>Can we pls use the multiline comments, like you do below in the same
>file.
>
>Otherwise, it looks ok to me and it boots in a guest.
>
>I'll take the final version for a spin on real hw in a couple of days,
>once the review dust settles.
>
>Thx.
>
So the biggest difference versus what I had in progress was that I had the idea of basically doing "ro_after_init" behavior by doing memory references until alternatives are run.
I don't know if that was overthinking the problem...
On Sun, 9 Jun 2024 at 04:48, H. Peter Anvin <[email protected]> wrote:
>
> So the biggest difference versus what I had in progress was that I had
> the idea of basically doing "ro_after_init" behavior by doing memory
> references until alternatives are run.
Yeah, that would make it a lot more complicated, because now you need
to replace the whole asm, and you need to write it in a way that it
can be replaced with alternate code.
It's a bit akin to what Rasmus' RAI code did, but while I liked
Rasmus' patch, I honestly don't think there is any advantage to
having that "load from memory" alternate version.
At least for something like the dcache lookup, the "before things have
been initialized" state is that the runtime constant pointer is NULL,
so the pre-init code would cause an oops anyway, even if it had that
"load from memory" fallback.
Now, there may be other uses for these kinds of runtime constants, but
I do think the main one - and the one I wrote this for - is simply a
boot-time allocation. So I think that kind of "before it's
initialized, it won't work whether there's a load from memory or not"
is fairly fundamental.
There are other constants this could be used for - particularly things
like "size of virtual memory". That
#define task_size_max() \
((_AC(1,UL) << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
thing really used to be a big pain back when we used it in every
get_user/put_user call. We replaced it with just the sign bit test,
though, and now it's gone.
(I still see it in profiles in strncpy_from_user(), but it's not
really noticeable, and it's fixable using a similar address masking
thing).
I'd expect that there are other cases than the dcache lookup that
want this, but none that really show up on any of the profiles _I_
tend to run (but I don't benchmark networking, for example).
There are the security layer blob_sizes "constants" that could possibly
use this, but part of the cost there is that since they aren't
compile-time constants, the compiler can't just use a common base
pointer.
So even if those were converted to runtime constants, it wouldn't make
the code generation all _that_ much better.
Linus
On 08/06/2024 21.35, Linus Torvalds wrote:
> Needs more comments and testing, but it works, and has a generic
> fallback for architectures that don't support it.
>
> Signed-off-by: Linus Torvalds <[email protected]>
> ---
>
> Notes from the first hack: I renamed the infrastructure from "static
> const" to "runtime const". We end up having a number of uses of "static
> const" that are related to the C language notion of "static const"
> variables or functions, and "runtime constant" is a bit more descriptive
> anyway.
>
> And this now is properly abstracted out, so that any architecture can
> choose to implement their own version, but it all falls back on "just
> use the variable".
Well, I think it lacks a little. I made it so that an arch didn't need
to implement support for all runtime_const_*() helpers (I'll use that
name because yes, much better than rai_), but could implement just
runtime_const_load() and not the _shift_right thing. Then the base
pointer would be loaded as an immediate, the shift count would also be
loaded as an immediate (but into a register), and the actual shift
would use the shift amount from that register, so no D$ access.
So I think it should be something like
#ifndef runtime_const_load
#define runtime_const_load(x) (x)
#endif
#ifndef runtime_const_shift_right_32
#define runtime_const_shift_right_32(val, sym) \
	((u32)(val) >> runtime_const_load(sym))
#endif
etc. (This assumes we add some type-generic runtime_const_load that can
be used both for pointers and ints; supporting sizeof() < 4 is probably
not relevant, but could also be done).
Otherwise, if we want to add some runtime_const_mask32() thing to
support hash tables where we use that model, one would have to hook up
support on all architectures at once. Similarly, architectures that are
late to the party won't need to do the full monty at once, just
implementing runtime_const_load() would be enough to get some of the win.
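A minimal sketch of that layering, using the runtime_const_load() naming
proposed above and a hypothetical runtime_const_mask32() helper (not part of
the posted patch):
#ifndef runtime_const_mask32
/* generic fallback: only runtime_const_load() has to exist per arch */
#define runtime_const_mask32(val, sym) \
	((u32)(val) & runtime_const_load(sym))
#endif
An architecture that later grows a patched 'and' variant just defines its own
runtime_const_mask32() and the fallback disappears.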
> Rasmus - I've cleaned up my patch a lot, and it now compiles fine on
> other architectures too, although obviously with the fallback of "no
> constant fixup". As a result, my patch is actually smaller and much
> cleaner, and I ended up liking my approach more than your RAI thing
> after all.
Yeah, it's nice and small. I was probably over-ambitious, as I also
wanted to, eventually, use runtime_const_load() for lots of the *_cachep
variables and similar. With my approach I didn't have to modify the
linker script every time a new use is introduced, and also didn't have
to modify the place that does the initialization - but it did come at
the cost of only patching everything at once at the end of init, thus
requiring the trampoline to do the loads from memory in the meantime.
But that was partly also in service of another overly ambitious goal: eventually
allowing this to be used for __read_mostly and not just __ro_after_init
stuff (various run-time limits etc.).
>
> arch/x86/include/asm/runtime-const.h | 61 ++++++++++++++++++++++++++++
> arch/x86/kernel/vmlinux.lds.S | 3 ++
> fs/dcache.c | 24 +++++++----
> include/asm-generic/Kbuild | 1 +
> include/asm-generic/runtime-const.h | 15 +++++++
> include/asm-generic/vmlinux.lds.h | 8 ++++
> 6 files changed, 104 insertions(+), 8 deletions(-)
> create mode 100644 arch/x86/include/asm/runtime-const.h
> create mode 100644 include/asm-generic/runtime-const.h
>
> diff --git a/arch/x86/include/asm/runtime-const.h b/arch/x86/include/asm/runtime-const.h
> new file mode 100644
> index 000000000000..b4f7efc0a554
> --- /dev/null
> +++ b/arch/x86/include/asm/runtime-const.h
> @@ -0,0 +1,61 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_RUNTIME_CONST_H
> +#define _ASM_RUNTIME_CONST_H
> +
> +#define runtime_const_ptr(sym) ({ \
> + typeof(sym) __ret; \
> + asm("mov %1,%0\n1:\n" \
> + ".pushsection runtime_ptr_" #sym ",\"a\"\n\t" \
> + ".long 1b - %c2 - .\n\t" \
> + ".popsection" \
> + :"=r" (__ret) \
> + :"i" ((unsigned long)0x0123456789abcdefull), \
> + "i" (sizeof(long))); \
> + __ret; })
> +
> +// The 'typeof' will create at _least_ a 32-bit type, but
> +// will happily also take a bigger type and the 'shrl' will
> +// clear the upper bits
> +#define runtime_const_shift_right_32(val, sym) ({ \
> + typeof(0u+(val)) __ret = (val); \
> + asm("shrl $12,%k0\n1:\n" \
> + ".pushsection runtime_shift_" #sym ",\"a\"\n\t" \
> + ".long 1b - 1 - .\n\t" \
> + ".popsection" \
> + :"+r" (__ret)); \
> + __ret; })
Don't you need a cc clobber here? And for both, I think asm_inline is
appropriate, as you really want to tell gcc that this just consists of a
single instruction, the .pushsection games notwithstanding.
I know it's a bit more typing, but the section names should be
"runtime_const_shift" etc., not merely "runtime_shift".
> +
> +#define runtime_const_init(type, sym, value) do { \
> + extern s32 __start_runtime_##type##_##sym[]; \
> + extern s32 __stop_runtime_##type##_##sym[]; \
> + runtime_const_fixup(__runtime_fixup_##type, \
> + (unsigned long)(value), \
> + __start_runtime_##type##_##sym, \
> + __stop_runtime_##type##_##sym); \
> +} while (0)
> +
> +/*
> + * The text patching is trivial - you can only do this at init time,
> + * when the text section hasn't been marked RO, and before the text
> + * has ever been executed.
> + */
> +static inline void __runtime_fixup_ptr(void *where, unsigned long val)
> +{
> + *(unsigned long *)where = val;
> +}
> +
> +static inline void __runtime_fixup_shift(void *where, unsigned long val)
> +{
> + *(unsigned char *)where = val;
> +}
> +
> +static inline void runtime_const_fixup(void (*fn)(void *, unsigned long),
> + unsigned long val, s32 *start, s32 *end)
> +{
> + while (start < end) {
> + fn(*start + (void *)start, val);
> + start++;
> + }
> +}
> +
> +#endif
> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> index 3509afc6a672..6e73403e874f 100644
> --- a/arch/x86/kernel/vmlinux.lds.S
> +++ b/arch/x86/kernel/vmlinux.lds.S
> @@ -357,6 +357,9 @@ SECTIONS
> PERCPU_SECTION(INTERNODE_CACHE_BYTES)
> #endif
>
> + RUNTIME_CONST(shift, d_hash_shift)
> + RUNTIME_CONST(ptr, dentry_hashtable)
> +
Hm, doesn't this belong in the common linker script? I mean, if arm64
were to implement support for this, they'd also have to add this
boilerplate to their vmlinux.lds.S? And then whenever another
RUNTIME_CONST(ptr, ...) use case is added, it has to go into the
vmlinux.lds.S of every architecture supporting it at that point in
time. Doesn't seem to scale.
> . = ALIGN(PAGE_SIZE);
>
> /* freed after init ends here */
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 407095188f83..4511e557bf84 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -97,12 +97,14 @@ EXPORT_SYMBOL(dotdot_name);
> */
>
> static unsigned int d_hash_shift __ro_after_init;
> -
> static struct hlist_bl_head *dentry_hashtable __ro_after_init;
>
> -static inline struct hlist_bl_head *d_hash(unsigned int hash)
> +#include <asm/runtime-const.h>
> +
Please keep the #includes together at the top of the file.
> +static inline struct hlist_bl_head *d_hash(unsigned long hashlen)
> {
> - return dentry_hashtable + (hash >> d_hash_shift);
> + return runtime_const_ptr(dentry_hashtable) +
> + runtime_const_shift_right_32(hashlen, d_hash_shift);
> }
Could you spend some words on this signature change? Why should this now
(on 64BIT) take the full hash_len as argument, only to let the compiler
(with the (u32) cast) or cpu (in the x86_64 case, via the %k modifier if
I'm reading things right) only use the lower 32 bits?
The patch would be even smaller without this, and perhaps it could be
done as a separate patch if it does lead to (even) better code generation.
>
> static void __d_rehash(struct dentry *entry)
> {
> - struct hlist_bl_head *b = d_hash(entry->d_name.hash);
> + struct hlist_bl_head *b = d_hash(entry->d_name.hash_len);
>
> hlist_bl_lock(b);
> hlist_bl_add_head_rcu(&entry->d_hash, b);
> @@ -3129,6 +3131,9 @@ static void __init dcache_init_early(void)
> 0,
> 0);
> d_hash_shift = 32 - d_hash_shift;
> +
> + runtime_const_init(shift, d_hash_shift, d_hash_shift);
> + runtime_const_init(ptr, dentry_hashtable, dentry_hashtable);
> }
>
> static void __init dcache_init(void)
> @@ -3157,6 +3162,9 @@ static void __init dcache_init(void)
> 0,
> 0);
> d_hash_shift = 32 - d_hash_shift;
> +
> + runtime_const_init(shift, d_hash_shift, d_hash_shift);
> + runtime_const_init(ptr, dentry_hashtable, dentry_hashtable);
> }
>
Aside: That duplication makes me want to make dcache_init() grow an 'int
early' arg, change the body accordingly and change the call sites to
dcache_init(1) / dcache_init(0). But since the functions probably only
change once per decade or something like that, probably not worth it.
> +/*
> + * This is the fallback for when the architecture doesn't
> + * support the runtime const operations.
> + *
> + * We just use the actual symbols as-is.
> + */
> +#define runtime_const_ptr(sym) (sym)
> +#define runtime_const_shift_right_32(val, sym) ((u32)(val)>>(sym))
> +#define runtime_const_init(type,sym,value) do { } while (0)
Hm, I wonder if there's ever a case where you want to patch using any
other value than what you've just assigned to sym. IOW, why doesn't
runtime_const_init just take type,sym args?
It can't be to support some 'struct global { void *this; } g' where you
then want to use the identifier g_this for the tables and g.this for the
value, 'cause that wouldn't work with the fallbacks vs the x86
implementation - one would need runtime_const_ptr(g.this) and the other
runtime_const_ptr(g_this) at the use sites.
Rasmus
On Sat, Jun 08, 2024 at 12:35:05PM -0700, Linus Torvalds wrote:
> Needs more comments and testing, but it works, and has a generic
> fallback for architectures that don't support it.
>
> Signed-off-by: Linus Torvalds <[email protected]>
> ---
>
> Notes from the first hack: I renamed the infrastructure from "static
> const" to "runtime const". We end up having a number of uses of "static
> const" that are related to the C language notion of "static const"
> variables or functions, and "runtime constant" is a bit more descriptive
> anyway.
>
> And this now is properly abstracted out, so that any architecture can
> choose to implement their own version, but it all falls back on "just
> use the variable".
>
> Josh - sorry for wasting your time on the objtool patch, I ended up
> using the linker functionality that Rasmus pointed out as existing
> instead.
>
> Rasmus - I've cleaned up my patch a lot, and it now compiles fine on
> other architectures too, although obviously with the fallback of "no
> constant fixup". As a result, my patch is actually smaller and much
> cleaner, and I ended up liking my approach more than your RAI thing
> after all.
>
> Ingo / Peter / Borislav - I enabled this for 32-bit x86 too, because it
> was literally trivial (had to remove a "q" from "movq"). I did a
> test-build and it looks fine, but I didn't actually try to boot it.
>
> The x86-64 code is actually tested. It's not like it has a _lot_ of
> testing, but the patch ends up being pretty small in the end. Yes, the
> "shift u32 value right by a constant" is a pretty special case, but the
> __d_lookup_rcu() function really is pretty hot.
>
> Or rather it *was* pretty hot. It's actually looking very good with
> this, imho.
>
> Build tested with allmodconfig and on arm64, but I'm not claiming that I
> have necessarily found all special case corners. That said, it's small
> and pretty straightforward.
>
> Comments?
It obviously has the constraint of never running the code before the
corresponding runtime_const_init() has been done, otherwise things will
go sideways in a hurry, but this also makes the whole thing a *lot*
simpler.
The only thing I'm not sure about is it having a section per symbol;
given how jump_label and static_call took off, this might not be
scalable.
Yes, the approach is super simple and straight forward, but imagine
there being like a 100 symbols soon :/
The below hackery -- which very much needs cleanup, is only compiled on
x86_64, and does not support modules -- boots for me.
---
Index: linux-2.6/arch/Kconfig
===================================================================
--- linux-2.6.orig/arch/Kconfig
+++ linux-2.6/arch/Kconfig
@@ -1492,6 +1492,9 @@ config HAVE_SPARSE_SYSCALL_NR
config ARCH_HAS_VDSO_DATA
bool
+config HAVE_RUNTIME_CONST
+ bool
+
config HAVE_STATIC_CALL
bool
Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig
+++ linux-2.6/arch/x86/Kconfig
@@ -279,6 +279,7 @@ config X86
select HAVE_STACK_VALIDATION if HAVE_OBJTOOL
select HAVE_STATIC_CALL
select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL
+ select HAVE_RUNTIME_CONST
select HAVE_PREEMPT_DYNAMIC_CALL
select HAVE_RSEQ
select HAVE_RUST if X86_64
Index: linux-2.6/arch/x86/include/asm/runtime_const.h
===================================================================
--- /dev/null
+++ linux-2.6/arch/x86/include/asm/runtime_const.h
@@ -0,0 +1,62 @@
+#ifndef _ASM_RUNTIME_CONST_H
+#define _ASM_RUNTIME_CONST_H
+
+#include <asm/asm.h>
+
+#define __runtime_const(sym, op, type) \
+({ \
+ typeof(sym) __ret; \
+ asm(op " %1, %0\n1:\n" \
+ ".pushsection __runtime_const, \"aw\"\n\t" \
+ ".long %c3 - . # sym \n\t" \
+ ".long %c2 # size \n\t" \
+ ".long 1b - %c2 - . # addr \n\t" \
+ ".popsection\n\t" \
+ : "=r" (__ret) \
+ : "i" ((type)0x0123456789abcdefull), \
+ "i" (sizeof(type)), \
+ "i" ((void *)&sym)); \
+ __ret; \
+})
+
+#define runtime_const(sym) \
+({ \
+ typeof(sym) __ret; \
+ switch(sizeof(sym)) { \
+ case 1: __ret = __runtime_const(sym, "movb", u8); break; \
+ case 2: __ret = __runtime_const(sym, "movs", u16); break; \
+ case 4: __ret = __runtime_const(sym, "movl", u32); break; \
+ case 8: __ret = __runtime_const(sym, "movq", u64); break; \
+ default: BUG(); \
+ } \
+ __ret; \
+})
+
+#define runtime_const_shr32(val, sym) \
+({ \
+ typeof(0u+val) __ret = (val); \
+ asm("shrl $12, %k0\n1:\n" \
+ ".pushsection __runtime_const, \"aw\"\n\t" \
+ ".long %c3 - . # sym \n\t" \
+ ".long %c2 # size \n\t" \
+ ".long 1b - %c2 - . # addr \n\t" \
+ ".popsection\n\t" \
+ : "+r" (__ret) \
+ : "i" ((u8)0x12), \
+ "i" (sizeof(u8)), \
+ "i" ((void *)&sym)); \
+ __ret; \
+})
+
+struct runtime_const_entry {
+ s32 sym;
+ u32 size;
+ s32 addr;
+};
+
+extern struct runtime_const_entry __start___runtime_const[];
+extern struct runtime_const_entry __stop___runtime_const[];
+
+extern void arch_runtime_const_set(struct runtime_const_entry *rc, u64 val);
+
+#endif /* _ASM_RUNTIME_CONST_H */
Index: linux-2.6/include/asm-generic/Kbuild
===================================================================
--- linux-2.6.orig/include/asm-generic/Kbuild
+++ linux-2.6/include/asm-generic/Kbuild
@@ -53,6 +53,7 @@ mandatory-y += shmparam.h
mandatory-y += simd.h
mandatory-y += softirq_stack.h
mandatory-y += switch_to.h
+mandatory-y += runtime_const.h
mandatory-y += timex.h
mandatory-y += tlbflush.h
mandatory-y += topology.h
Index: linux-2.6/include/asm-generic/runtime_const.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-generic/runtime_const.h
@@ -0,0 +1,7 @@
+#ifndef _ASM_RUNTIME_CONST_H
+#define _ASM_RUNTIME_CONST_H
+
+#define runtime_const(sym) (sym)
+#define runtime_const_shr32(val, sym) ((u32)(val) >> (sym))
+
+#endif /* _ASM_RUNTIME_CONST_H */
Index: linux-2.6/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6.orig/include/asm-generic/vmlinux.lds.h
+++ linux-2.6/include/asm-generic/vmlinux.lds.h
@@ -423,6 +423,14 @@
#define STATIC_CALL_DATA
#endif
+#ifdef CONFIG_HAVE_RUNTIME_CONST
+#define RUNTIME_CONST_DATA \
+ . = ALIGN(8); \
+ BOUNDED_SECTION_BY(__runtime_const, ___runtime_const)
+#else
+#define RUNTIME_CONST_DATA
+#endif
+
/*
* Allow architectures to handle ro_after_init data on their
* own by defining an empty RO_AFTER_INIT_DATA.
@@ -434,6 +442,7 @@
*(.data..ro_after_init) \
JUMP_TABLE_DATA \
STATIC_CALL_DATA \
+ RUNTIME_CONST_DATA \
__end_ro_after_init = .;
#endif
Index: linux-2.6/include/linux/runtime_const.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/runtime_const.h
@@ -0,0 +1,23 @@
+#ifndef _LINUX_RUNTIME_CONST_H
+#define _LINUX_RUNTIME_CONST_H
+
+#include "asm/runtime_const.h"
+
+#ifdef CONFIG_HAVE_RUNTIME_CONST
+
+extern void __runtime_const_set(void *sym, int size, u64 val);
+
+#define runtime_const_set(sym, val) \
+ __runtime_const_set(&(sym), sizeof(val), val)
+
+extern void runtime_const_init(void);
+
+#else
+
+#define runtime_const_set(sym, val) ((sym) = (val))
+
+static inline void runtime_const_init(void) { }
+
+#endif
+
+#endif /* _LINUX_RUNTIME_CONST_H */
Index: linux-2.6/kernel/Makefile
===================================================================
--- linux-2.6.orig/kernel/Makefile
+++ linux-2.6/kernel/Makefile
@@ -115,6 +115,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
obj-$(CONFIG_HAVE_STATIC_CALL) += static_call.o
obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call_inline.o
+obj-$(CONFIG_HAVE_RUNTIME_CONST) += runtime_const.o
obj-$(CONFIG_CFI_CLANG) += cfi.o
obj-$(CONFIG_NUMA) += numa.o
Index: linux-2.6/kernel/runtime_const.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/runtime_const.c
@@ -0,0 +1,119 @@
+
+#include <linux/sort.h>
+#include <linux/mutex.h>
+#include <linux/bsearch.h>
+#include <linux/bug.h>
+#include <linux/runtime_const.h>
+#include <asm/sections.h>
+#include <asm/text-patching.h>
+
+static bool runtime_const_initialized;
+
+static unsigned long runtime_const_sym(const struct runtime_const_entry *rc)
+{
+ return (unsigned long)&rc->sym + rc->sym;
+}
+
+static unsigned long runtime_const_addr(const struct runtime_const_entry *rc)
+{
+ return (unsigned long)&rc->addr + rc->addr;
+}
+
+static int runtime_const_cmp(const void *a, const void *b)
+{
+ const struct runtime_const_entry *rca = a;
+ const struct runtime_const_entry *rcb = b;
+
+ if (runtime_const_sym(rca) < runtime_const_sym(rcb))
+ return -1;
+
+ if (runtime_const_sym(rca) > runtime_const_sym(rcb))
+ return 1;
+
+ return 0;
+}
+
+static void runtime_const_swap(void *a, void *b, int size)
+{
+ long delta = (unsigned long)a - (unsigned long)b;
+ struct runtime_const_entry *rca = a;
+ struct runtime_const_entry *rcb = b;
+ struct runtime_const_entry tmp = *rca;
+
+ rca->sym = rcb->sym - delta;
+ rca->size = rcb->size;
+ rca->addr = rcb->addr - delta;
+
+ rcb->sym = tmp.sym + delta;
+ rcb->size = tmp.size;
+ rcb->addr = tmp.addr + delta;
+}
+
+static DEFINE_MUTEX(runtime_const_lock);
+
+void __init runtime_const_init(void)
+{
+ struct runtime_const_entry *start = __start___runtime_const;
+ struct runtime_const_entry *stop = __stop___runtime_const;
+ unsigned long num = ((unsigned long)stop - (unsigned long)start) / sizeof(*start);
+
+ if (runtime_const_initialized)
+ return;
+
+ mutex_lock(&runtime_const_lock);
+ sort(start, num, sizeof(*start), runtime_const_cmp, runtime_const_swap);
+ mutex_unlock(&runtime_const_lock);
+
+ runtime_const_initialized = true;
+}
+
+static struct runtime_const_entry *rc_find_sym(void *sym)
+{
+ struct runtime_const_entry *start = __start___runtime_const;
+ struct runtime_const_entry *stop = __stop___runtime_const;
+ unsigned long num = ((unsigned long)stop - (unsigned long)start) / sizeof(*start);
+ struct runtime_const_entry k = {
+ .sym = (unsigned long)sym - (unsigned long)&k.sym,
+ };
+ struct runtime_const_entry *i =
+ bsearch(&k, start, num, sizeof(*start), runtime_const_cmp);
+ if (!i)
+ return i;
+ early_printk("XXX i: %ld s: %d\n", (i - start), i->size);
+ while (i > start) {
+ if (runtime_const_sym(i-1) == (unsigned long)sym)
+ i--;
+ else
+ break;
+ }
+ early_printk("XXY i: %ld\n", (i - start));
+ return i;
+}
+
+__weak void arch_runtime_const_set(struct runtime_const_entry *rc, u64 val)
+{
+ void *addr = (void *)runtime_const_addr(rc);
+ switch (rc->size) {
+ case 1: { *(u8 *)addr = val; break; }
+ case 2: { *(u16 *)addr = val; break; }
+ case 4: { *(u32 *)addr = val; break; }
+ case 8: { *(u64 *)addr = val; break; }
+ default: BUG();
+ }
+}
+
+void __runtime_const_set(void *sym, int size, u64 val)
+{
+ struct runtime_const_entry *i, *stop = __stop___runtime_const;
+
+ mutex_lock(&runtime_const_lock);
+
+ for (i = rc_find_sym(sym);
+ i && i < stop && runtime_const_sym(i) == (unsigned long)sym;
+ i++) {
+ WARN_ONCE(i->size != size, "%d != %d\n", i->size, size);
+ arch_runtime_const_set(i, val);
+ }
+
+ mutex_unlock(&runtime_const_lock);
+}
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -25,6 +25,7 @@
#include <linux/static_call.h>
#include <linux/swiotlb.h>
#include <linux/random.h>
+#include <linux/runtime_const.h>
#include <uapi/linux/mount.h>
@@ -782,6 +783,7 @@ void __init setup_arch(char **cmdline_p)
early_cpu_init();
jump_label_init();
static_call_init();
+// runtime_const_init();
early_ioremap_init();
setup_olpc_ofw_pgd();
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -97,12 +97,14 @@ EXPORT_SYMBOL(dotdot_name);
*/
static unsigned int d_hash_shift __ro_after_init;
-
static struct hlist_bl_head *dentry_hashtable __ro_after_init;
+#include <linux/runtime_const.h>
+
static inline struct hlist_bl_head *d_hash(unsigned int hash)
{
- return dentry_hashtable + (hash >> d_hash_shift);
+ return runtime_const(dentry_hashtable) +
+ runtime_const_shr32(hash, d_hash_shift);
}
#define IN_LOOKUP_SHIFT 10
@@ -3129,6 +3131,11 @@ static void __init dcache_init_early(voi
0,
0);
d_hash_shift = 32 - d_hash_shift;
+
+ runtime_const_init();
+
+ runtime_const_set(dentry_hashtable, (unsigned long)dentry_hashtable);
+ runtime_const_set(d_hash_shift, (u8)d_hash_shift);
}
static void __init dcache_init(void)
@@ -3157,6 +3164,9 @@ static void __init dcache_init(void)
0,
0);
d_hash_shift = 32 - d_hash_shift;
+
+ runtime_const_set(dentry_hashtable, (unsigned long)dentry_hashtable);
+ runtime_const_set(d_hash_shift, (u8)d_hash_shift);
}
/* SLAB cache for __getname() consumers */
On Mon, Jun 10, 2024 at 12:43:52PM +0200, Peter Zijlstra wrote:
> Yes, the approach is super simple and straight forward, but imagine
> there being like a 100 symbols soon :/
I think we should accept patches using this only when there really is
a good, perf reason for doing so. Not "I wanna use this fancy shite in
my new driver just because...".
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Sun, 9 Jun 2024 at 05:11, Linus Torvalds
<[email protected]> wrote:
>
> On Sat, 8 Jun 2024 at 13:55, Linus Torvalds
> <[email protected]> wrote:
> >
> > Think of this patch mostly as a "look, adding another architecture
> > isn't *that* hard - even if the constant value is spread out in the
> > instructions".
>
> .. and here's a version that actually works. It wasn't that bad.
>
> Or rather, it wouldn't have been that bad had I not spent *ages*
> debugging a stupid cut-and-paste error where I instead of writing
> words 0..3 of the 64-bit large constant generation, wrote words 0..2
> and then overwrote word 2 (again) with the data that should have gone
> into word 3. Causing the top 32 bits to be all wonky. Oops. Literally.
>
> That stupid typo caused like two hours of wasted time.
>
> But anyway, this patch actually works for me. It still doesn't do any
> I$/D$ flushing, because it's not needed in practice, but it *should*
> probably do that.
>
arm64 already has so-called 'callback' alternatives that allow the
patching logic for a particular alternative sequence to be implemented
by the user of the API.
A callback implementation to patch a movz/movk sequence already
exists, in arch/arm64/kvm/va_layout.c, used by
kvm_get_kimage_voffset() and kvm_compute_final_ctr_el0().
From inline asm, it gets called like this
static inline size_t __invalidate_icache_max_range(void)
{
u8 iminline;
u64 ctr;
asm(ALTERNATIVE_CB("movz %0, #0\n"
"movk %0, #0, lsl #16\n"
"movk %0, #0, lsl #32\n"
"movk %0, #0, lsl #48\n",
ARM64_ALWAYS_SYSTEM,
kvm_compute_final_ctr_el0)
: "=r" (ctr));
iminline = SYS_FIELD_GET(CTR_EL0, IminLine, ctr) + 2;
return MAX_DVM_OPS << iminline;
}
This one gets patched after SMP bringup, but they can be updated
earlier if needed.
Doing the same for the immediate field in a single ALU instruction
should be straight-forward, although this particular example doesn't
even bother, and just masks and shifts the result of the movz/movk
sequence.
On 10/06/2024 12.43, Peter Zijlstra wrote:
> On Sat, Jun 08, 2024 at 12:35:05PM -0700, Linus Torvalds wrote:
>> Comments?
>
> It obviously has the constraint of never running the code before the
> corresponding runtime_const_init() has been done, otherwise things will
> go sideways in a hurry, but this also makes the whole thing a *lot*
> simpler.
>
> The only thing I'm not sure about is it having a section per symbol,
> given how jump_label and static_call took off, this might not be
> scalable.
>
> Yes, the approach is super simple and straight forward, but imagine
> there being like a 100 symbols soon :/
>
> The below hackery -- it very much needs cleanup and is only compiled on
> x86_64 and does not support modules, boots for me.
As can be seen in my other reply, yes, I'm also worried about the
scalability and would like to see this applied to more stuff.
But if we do this, surely that's what scripts/sorttable is for, right?
Alternatively, if we just keep emitting to per-symbol
__runtime_const_##sym sections but collect them in one __runtime_const,
just using __runtime_const { *(SORT_BY_NAME(__runtime_const_*)) } in the
linker script should already be enough to allow that binary search to
work (with whatever : AT(ADDR() ... ) magic is also required), with no
post-processing at build or runtime required.
Rasmus
On Mon, 10 Jun 2024 at 02:50, Rasmus Villemoes <[email protected]> wrote:
>
> Well, I think it lacks a little. I made it so that an arch didn't need
> to implement support for all runtime_const_*() helpers
Honestly, normally the number of helpers would max out at probably
three or four.
The only helpers you need are for the different constant sizes and the
placement in the instructions. For example, on x86-64, you almost
certainly end up with just three fixup functions (byte, 32-bit and
64-bit) and then *maybe* a couple of "operation" helpers.
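As a rough illustration of that handful, a sketch of the byte/32-bit/64-bit
fixups, mirroring the __runtime_fixup_ptr()/__runtime_fixup_shift() helpers in
the posted patch (these exact variants are not in it):
static inline void __runtime_fixup_8(void *where, unsigned long val)
{
	*(u8 *)where = val;
}

static inline void __runtime_fixup_32(void *where, unsigned long val)
{
	*(u32 *)where = val;
}

static inline void __runtime_fixup_64(void *where, unsigned long val)
{
	*(u64 *)where = val;
}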
Honestly, the only op helpers that are likely to exist are
add/mask/shift. But I didn't add macros for creating these ops,
because I don't actually believe very many exist in the first place.
And yes, x86-64 ends up being fairly simple, because it doesn't have
multiple different ways of encoding immediates in different places of
the instruction. But even that isn't going to make for very many
operations.
In fact, one of the whole design points here was KISS - Keep It Simple
Stupid. Very much *not* trying to handle all operations. This is not
something that will ever be like the atomic ops, where we have a
plethora of them.
To use a runtime constant, the code has to be really *really*
critical. Honestly, I have only ever seen one such function.
__d_lookup_rcu() really has shown up for over a decade as one of the
hottest kernel functions on real and relevant loads.
For example, in your RAI patches six years ago, you did the dentry
cache, and you did the inode_cachep thing.
And honestly, I've never seen inode_cachep in a profile. If
you ever look up the value of inode_cachep, it's because you're going
to allocate or free an inode, and the memory load is the *least* of
your problems.
> name because yes, much better than rai_), but could implement just
> runtime_const_load() and not the _shift_right thing,
Not a single relevant architecture would find that AT ALL useful.
If you have a constant shift, that constant will be encoded in the
shift instruction. Only completely broken and pointless architectures
would ever have to use a special "load constant into register, then
shift by register".
I'm sure such horrors exist - hardware designers sometimes have some
really odd hang-ups - but none of the relevant architectures do that.
> Otherwise, if we want to add some runtime_const_mask32() thing to
> support hash tables where we use that model, one would have to hook up
> support on all architectures at once.
See above: when we're talking about a couple of operations, I think
trying to abstract it is actually *wrong*. It makes the code less
readable, not more.
> Don't you need a cc clobber here?
"cc" clobbers aren't actually needed on x86. All inline asms are
assumed to clobber cc.
It's not wrong to add them, but we generally don't.
> And for both, I think asm_inline is appropriate
Yup. Will do.
> I know it's a bit more typing, but the section names should be
> "runtime_const_shift" etc., not merely "runtime_shift".
I actually had that initially. It wasn't the "more typing". It was
"harder to read" because everything became so long and cumbersome.
The noise in the names basically ended up just overwhelming all the
RealCode(tm).
> > + RUNTIME_CONST(shift, d_hash_shift)
> > + RUNTIME_CONST(ptr, dentry_hashtable)
> > +
>
> Hm, doesn't this belong in the common linker script?
I actually went back and forth on this and had it both ways. I ended
up worrying that different architectures would want to have these
things in particular locations, closer to the text section etc.
IOW, I ended up putting it in the architecture-specific part because
that's where the altinstructions sections were, and it fit that
pattern.
It could possibly go into the
INIT_DATA_SECTION
thing in the generic macros, but it has to go *outside* the section
that is called ".init.data" because the section name has to match.
See why I didn't do that?
> I mean, if arm64 were to implement support for this, they'd also have to add this
> boilerplate to their vmlinux.lds.S?
See
https://lore.kernel.org/all/CAHk-=wiPSnaCczHp3Jy=kFjfqJa7MTQg6jht_FwZCxOnpsi4Vw@mail.gmail.com/
doing arm64 was always the plan - it was doing perf profiles on arm64
that showed this nasty pattern to me once again (and honestly, showed
it much more: I have 14% of all time in __d_lookup_rcu() because while
this machine has lots of cores, it doesn't have lots of cache, so it's
all very nasty).
Notice how small that patch is. Yes, it adds two lines to the arm64
vmlinux.lds.S file. But everything else is literally just the required
"this is how you modify the instructions".
There is no extra fat there.
> Please keep the #includes together at the top of the file.
Yes, my original plan was to do
#ifdef CONFIG_RUNTIME_CONST
#include <asm/runtime-const.h>
.. do the const version of d_hash() ..
#else
.. do the non-const one ..
#endif
but the asm-generic approach meant that I didn't even need to make it
a CONFIG variable.
> > +static inline struct hlist_bl_head *d_hash(unsigned long hashlen)
>
> Could you spend some words on this signature change? Why should this now
> (on 64BIT) take the full hash_len as argument, only to let the compiler
> (with the (u32) cast) or cpu (in the x86_64 case, via the %k modifier if
> I'm reading things right) only use the lower 32 bits?
I'll separate it out, but it boils down to both x86 and arm64 doing
32-bit shifts that clear the upper 32 bits.
So you do *not* want a separate instruction just to turn a "unsigned
long" into a "u32".
And while arm64 can take a 32-bit input register and output a 64-bit
output, on x86-64 the input and output registers are the same, and you
cannot tell the compiler that the upper 32 bits don't matter on input
to the asm.
End result: you actually want to keep the input to the constant shift
as the native register size.
But I'll separate out the d_hash() interface change from the actual
runtime constant part.
Linus
On Mon, 10 Jun 2024 at 05:02, Borislav Petkov <[email protected]> wrote:
>
> I think we should accept patches using this only when there really is
> a good, perf reason for doing so. Not "I wanna use this fancy shite in
> my new driver just because...".
Absolutely.
So for example, if the code could possibly be a module, it's never
going to be able to use runtime constants.
If the code doesn't show up as "noticeable percentage of kernel time
on real loads", it will not be a valid use for runtime constants.
The reason I did __d_lookup_rcu() is that I have optimized that
function to hell and back before, and it *still* showed up at 14% of
kernel time on my "empty kernel build" benchmark. And the constant
load was a noticeable - but not dominant - part of that.
And yes, it shows up that high because it's all D$ misses, and the
machine I tested on has more CPU cores than cache, so it's all kinds
of broken. But the point ends up being that __d_lookup_rcu() is just
very very hot on loads that just do a lot of 'stat()' calls (and such
loads exist and aren't just microbenchmarks).
I have other functions I see in the 5%+ range of kernel overhead on
real machines, but they tend to be things like clear_page(), which is
another kind of issue entirely.
And yes, the benchmarks I run are odd ("why would anybody care about
an empty kernel build?") but somewhat real to me (since I do builds
between every pull even when they just change a couple of files).
And yes, to actually even see anything else than the CPU security
issues on x86, you need to build without debug support, and without
retpolines etc. So my profiles are "fake" in that sense, because they
are the best case profiles without a lot of the horror that people
enable.
Others will have other real benchmarks, which is why I do think we'd
end up with more uses of this. But I would expect a handful, not
"hundreds".
I could imagine some runtime constant in the core networking socket
code, for example. Or in some scheduler thing. Or kernel entry code.
But not ever in a driver or a filesystem, for example. Once you've
gotten that far off the core code path, the "load a variable" overhead
just isn't relevant any more.
Linus
On Mon, 10 Jun 2024 at 03:43, Peter Zijlstra <[email protected]> wrote:
>
> --- linux-2.6.orig/arch/Kconfig
> +++ linux-2.6/arch/Kconfig
> @@ -1492,6 +1492,9 @@ config HAVE_SPARSE_SYSCALL_NR
> config ARCH_HAS_VDSO_DATA
> bool
>
> +config HAVE_RUNTIME_CONST
> + bool
No. We're not adding a stupid config variable, when nothing actually wants it.
> +#define __runtime_const(sym, op, type) \
> +({ \
> + typeof(sym) __ret; \
> + asm(op " %1, %0\n1:\n" \
> + ".pushsection __runtime_const, \"aw\"\n\t" \
> + ".long %c3 - . # sym \n\t" \
> + ".long %c2 # size \n\t" \
> + ".long 1b - %c2 - . # addr \n\t" \
> + ".popsection\n\t" \
> + : "=r" (__ret) \
> + : "i" ((type)0x0123456789abcdefull), \
> + "i" (sizeof(type)), \
> + "i" ((void *)&sym)); \
> + __ret; \
> +})
> +
> +#define runtime_const(sym) \
> +({ \
> + typeof(sym) __ret; \
> + switch(sizeof(sym)) { \
> + case 1: __ret = __runtime_const(sym, "movb", u8); break; \
> + case 2: __ret = __runtime_const(sym, "movs", u16); break; \
> + case 4: __ret = __runtime_const(sym, "movl", u32); break; \
> + case 8: __ret = __runtime_const(sym, "movq", u64); break; \
> + default: BUG(); \
> + } \
> + __ret; \
> +})
And no. We're not adding magic "generic" helpers that have zero use
and just make the code harder to read, and don't even work on 32-bit
x86 anyway.
Because I didn't test, but I am pretty sure that clang will not
compile the above on x86-32, because clang verifies the inline asm,
and "movq" isn't a valid instruction.
We had exactly that for the uaccess macros, and needed to special-case
the 64-bit case for that reason.
And we don't *need* to. All of the above is garbage waiting for a use
that shouldn't exist.
> +++ linux-2.6/kernel/runtime_const.c
> @@ -0,0 +1,119 @@
And here you basically tripled the size of the patch in order to have
just one section, when I had per-symbol sections.
So no.
I envision a *couple* of runtime constants. The absolutely only reason
to use a runtime constant is that it is *so* hot that the difference
between "load a variable" and "write code with a constant" is
noticeable.
I can point to exactly one such case in the kernel right now.
I'm sure there are others, but I'd expect those "others" to be mainly a handful.
This needs to be simple, obvious, and literally designed for very targeted use.
This is a *very* special case for a *very* special code. Not for generic use.
I do not ever expect to see this used by modules, for example. There
is no way in hell I will expose the instruction rewriting to a module.
The *use* of a constant is a _maybe_, but it's questionable too. By
definition, module code cannot be all that core.
Linus
On June 10, 2024 11:20:21 AM PDT, Linus Torvalds <[email protected]> wrote:
>On Mon, 10 Jun 2024 at 05:02, Borislav Petkov <[email protected]> wrote:
>>
>> I think we should accept patches using this only when there really is
>> a good, perf reason for doing so. Not "I wanna use this fancy shite in
>> my new driver just because...".
>
>Absolutely.
>
>So for example, if the code could possibly be a module, it's never
>going to be able to use runtime constants.
>
>If the code doesn't show up as "noticeable percentage of kernel time
>on real loads", it will not be a valid use for runtime constants.
>
>The reason I did __d_lookup_rcu() is that I have optimized that
>function to hell and back before, and it *still* showed up at 14% of
>kernel time on my "empty kernel build" benchmark. And the constant
>load was a noticeable - but not dominant - part of that.
>
>And yes, it shows up that high because it's all D$ misses, and the
>machine I tested on has more CPU cores than cache, so it's all kinds
>of broken. But the point ends up being that __d_lookup_rcu() is just
>very very hot on loads that just do a lot of 'stat()' calls (and such
>loads exist and aren't just microbenchmarks).
>
>I have other functions I see in the 5%+ range of kernel overhead on
>real machines, but they tend to be things like clear_page(), which is
>another kind of issue entirely.
>
>And yes, the benchmarks I run are odd ("why would anybody care about
>an empty kernel build?") but somewhat real to me (since I do builds
>between every pull even when they just change a couple of files).
>
>And yes, to actually even see anything else than the CPU security
>issues on x86, you need to build without debug support, and without
>retpolines etc. So my profiles are "fake" in that sense, because they
>are the best case profiles without a lot of the horror that people
>enable.
>
>Others will have other real benchmarks, which is why I do think we'd
>end up with more uses of this. But I would expect a handful, not
>"hundreds".
>
>I could imagine some runtime constant in the core networking socket
>code, for example. Or in some scheduler thing. Or kernel entry code.
>
>But not ever in a driver or a filesystem, for example. Once you've
>gotten that far off the core code path, the "load a variable" overhead
>just isn't relevant any more.
>
> Linus
So I would also strongly suggest that we make the code fault if it is executed unpatched when there is no fallback.
My original motivation for making the code as complex as I did was to handle both the init and noninit cases, and to handle things other than mov (for the mov case, it is quite trivial to allow for either a memory reference or an immediate, of course, but I was trying to optimize all kinds of operations, of which shifts were by far the worst. Almost certainly a mistake on my part.)
For that case it also works just fine in modules, since alternatives does the patching there already.
The way I did it was to introduce a new alternative patching type (a concept that didn't exist yet, and again I made the mistake of not simply doing that part separately first) using the address of the variable in question (in the ro_after_init segment) as the alternatives source buffer.
Much has changed since I hacked on this, a lot of which makes this
whole concept *much* easier to implement. We could even make it
completely transparent at the programmer level by using objtool or a
linker plugin to detect loads from a certain segment and tag them for
alternatives patching (which would, however, require adding padding to
some kinds of instructions, notably those with arbitrary 64-bit
immediates.)
The one case that gets really nasty for automatic conversion is the case of an arbitrary 64-bit value used in a computation, like:
__patchable long b;
long a;
/* ... */
a += b;
The compiler can generate:
addq b(%rip),%rax
... which would require a temporary register to convert to an immediate. On the other hand:
__patchable int b;
long a;
/* ... */
a += b;
... would generate ...
movslq b(%rip),%rbx
addq %rbx,%rax
... which can be compacted down to a single instruction:
addq $bimm,%rax
... so *really* optimizing this could be rather tricky without some
kind of compiler support (native or via plug-in). The most likely
option would be to generate object code as if immediate instructions
were used, but with relocations that indicate "value of" rather than
"address of" the variable being referenced. Postprocessing, either at
the assembly level or at the object-code level (i.e. objtool), could
then substitute the pre-patch sequence and the alternatives metadata.
(If it is a custom plug-in then postprocessing is presumably not necessary, of course.)
On Mon, 10 Jun 2024 at 16:35, H. Peter Anvin <[email protected]> wrote:
>
> ... which can be compacted down to a single instruction:
>
> addq $bimm,%rax
We'll burn that bridge when we get to it. I'm not actually seeing any
obvious use case for 32-bit immediates, except as part of some actual
operation sequence (ie you might have a similar hash lookup to the
d_hash() one, except using a mask rather than a shift).
When would you ever add a constant, except when that constant is an
address? And those kinds of runtime constant addresses would always
be the full 64-bit - I don't think 32-bit is interesting enough to
spend any effort on.
(Yes, you can get 32-bit rip-address constants if you start doing
things like conditional link addresses, but that's what things like
static_call() is for - not this runtime constant code)
Linus
On Mon, 10 Jun 2024 at 16:35, H. Peter Anvin <[email protected]> wrote:
>
> So I would also strongly suggest that we make the code fault if it is executed unpatched if there is no fallback.
It effectively does that already, just because the address won't be a
valid address before patching.
Doing it in general is actually very very painful. Feel free to try -
but I can almost guarantee that you will throw out the "Keep It Simple
Stupid" approach and your patch will be twice the size if you do some
"rewrite the whole instruction" stuff.
I really think there's a fundamental advantage to keeping things simple.
Linus
On Mon, 10 Jun 2024 at 18:09, Linus Torvalds
<[email protected]> wrote:
>
> Doing it in general is actually very very painful. Feel free to try -
> but I can almost guarantee that you will throw out the "Keep It Simple
> Stupid" approach and your patch will be twice the size if you do some
> "rewrite the whole instruction" stuff.
>
> I really think there's a fundamental advantage to keeping things simple.
I guess the KISS approach would be to have a debug mode that just adds
an 'int3' instruction *after* the constant. And then the constant
rewriting rewrites the constant and just changes the 'int3' into the
standard single-byte 'nop' instruction.
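For illustration, a minimal sketch of what the patching side of such a
debug mode could look like (the helper name, the fixed 8-byte
immediate, and the plain memcpy-at-init-time approach are all
assumptions for this sketch, not part of the posted patch):

#include <linux/init.h>
#include <linux/string.h>
#include <linux/types.h>

/*
 * 'where' points just past the movabs immediate, i.e. at the debug
 * int3 byte.  Write the real constant, then downgrade the trap to a
 * single-byte nop so the patched code runs straight through.
 */
static void __init runtime_const_debug_fixup(void *where, unsigned long val)
{
	memcpy(where - sizeof(val), &val, sizeof(val));	/* patch the immediate */
	*(u8 *)where = 0x90;				/* int3 (0xcc) -> nop */
}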
That wouldn't be complicated, and the cost would be minimal. But I
don't see it being worth it, at least not for the current use where
the unrewritten constant will just cause an oops on use.
Linus
/me saves a pointer to that mail to show to eager submitters who will
want to jump on this new thing.
On Mon, Jun 10, 2024 at 11:20:21AM -0700, Linus Torvalds wrote:
> So for example, if the code could possibly be a module, it's never
> going to be able to use runtime constants.
Perfectly fine with that.
> If the code doesn't show up as "noticeable percentage of kernel time
> on real loads", it will not be a valid use for runtime constants.
>
> The reason I did __d_lookup_rcu() is that I have optimized that
> function to hell and back before, and it *still* showed up at 14% of
> kernel time on my "empty kernel build" benchmark. And the constant
> load was a noticeable - but not dominant - part of that.
I have seen mails from you, off and on, about __d_lookup_rcu() :-P
> And yes, it shows up that high because it's all D$ misses, and the
> machine I tested on has more CPU cores than cache, so it's all kinds
> of broken. But the point ends up being that __d_lookup_rcu() is just
> very very hot on loads that just do a lot of 'stat()' calls (and such
> loads exist and aren't just microbenchmarks).
Yap.
> And yes, the benchmarks I run are odd ("why would anybody care about
> an empty kernel build?") but somewhat real to me (since I do builds
> between every pull even when they just change a couple of files).
You're not the only one - I'm building every patch too, so yeah, I got
a nice Threadripper for exactly that purpose.
:-)
> And yes, to actually even see anything else than the CPU security
> issues on x86, you need to build without debug support, and without
> retpolines etc. So my profiles are "fake" in that sense, because they
> are the best case profiles without a lot of the horror that people
> enable.
Right, I run with the default settings because, well, we must eat our
own dogfood, but yeah, that whole zoo of config options to disable the
mitigations wasn't added just for fun, so others will run in similar
configurations too, if they don't do that already.
> Others will have other real benchmarks, which is why I do think we'd
> end up with more uses of this. But I would expect a handful, not
> "hundreds".
That's a good rule. In talking to Peter, I also didn't like the idea of
having a single ELF section per variable, but if we're only doing a
handful, meh, ok, I guess.
> I could imagine some runtime constant in the core networking socket
> code, for example. Or in some scheduler thing. Or kernel entry code.
>
> But not ever in a driver or a filesystem, for example. Once you've
> gotten that far off the core code path, the "load a variable" overhead
> just isn't relevant any more.
Yeah, that makes sense.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 6/10/24 18:24, Linus Torvalds wrote:
> On Mon, 10 Jun 2024 at 18:09, Linus Torvalds
> <[email protected]> wrote:
>>
>> Doing it in general is actually very very painful. Feel free to try -
>> but I can almost guarantee that you will throw out the "Keep It Simple
>> Stupid" approach and your patch will be twice the size if you do some
>> "rewrite the whole instruction" stuff.
>>
>> I really think there's a fundamental advantage to keeping things simple.
>
> I guess the KISS approach would be to have a debug mode that just adds
> an 'int3' instruction *after* the constant. And then the constant
> rewriting rewrites the constant and just changes the 'int3' into the
> standard single-byte 'nop' instruction.
>
> That wouldn't be complicated, and the cost would be minimal. But I
> don't see it being worth it, at least not for the current use where
> the unrewritten constant will just cause an oops on use.
>
A general enough way to do it would be to emit an int $3 and, when
patching, replace it with a dummy ds: prefix.
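Spelled out at the byte level (purely illustrative; the placement of
the trap byte in front of the load and the helper below are
assumptions, not existing code):

#include <linux/init.h>
#include <linux/types.h>

/*
 * Unpatched: 0xcc (int3) sits in front of the load, so any premature
 * execution traps immediately.  Patched: the byte becomes 0x3e, a ds:
 * segment-override prefix that is architecturally harmless here.
 */
static void __init runtime_const_disarm(u8 *trap_byte)
{
	*trap_byte = 0x3e;	/* int3 -> dummy ds: prefix */
}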
The issue there is "current use". I'm really, really worried about
someone in the future putting this somewhere it won't get properly
patched, at which point all hell will break loose.
Perhaps a better idea would be to provide the initial value as part of
the declaration, so that the value is never "uninitialized" (much like a
static variable can be initialized at compile time)?
In other words:
runtime_const_ptr(sym,init)
Unfortunately gas doesn't seem to properly implement the {nooptimize}
prefix that is documented, so this does require some gentle assembly
hacking:
- Loading a pointer/long requires using the "movabs" mnemonic on x86-64
  (but not on x86-32, as there are no compacted forms of mov immediate;
  on x86-32 it is also legit to allow "=rm" rather than "=r", but for an
  8-bit immediate "=qm" has to be used.)
A size/type-generic version (one nice thing here is that the init value
also ends up specifying the type):
#define _RUNTIME_CONST(where, sym, size) \
"\n\t" \
".pushsection \"runtime_const_" #sym "\",\"a\"\n\t" \
".long (" #where ") - (" #size ") - .\n\t" \
".popsection"
extern void __noreturn __runtime_const_bad_size(void);
#define runtime_const(_sym, _init) ({ \
typeof(_init) __ret; \
const size_t __sz = sizeof(__ret); \
switch (__sz) { \
case 1: \
asm_inline("mov %[init],%[ret]\n1:" \
_RUNTIME_CONST(1b, _sym, 1) \
: [ret] "=qm" (__ret) \
: [init] "i" (_init)); \
break; \
case 8: \
asm_inline("movabs %[init],%[ret]\n1:" \
_RUNTIME_CONST(1b, _sym, 8) \
: [ret] "=r" (__ret) \
: [init] "i" (_init)); \
break; \
default: \
asm_inline("mov %[init],%[ret]\n1:" \
_RUNTIME_CONST(1b, _sym, %c[sz]) \
: [ret] "=rm" (__ret) \
: [init] "i" (_init), [sz] "n" (__sz))); \
break; \
} \
__ret; })
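A hypothetical use site, just to show the shape (the symbol names and
init values below are illustrative assumptions, not proposed users):

	/* 64-bit case: becomes a movabs $imm64 that gets patched at init */
	unsigned long vmemmap = runtime_const(vmemmap_base, 0xffffea0000000000UL);

	/* 32-bit case: handled by the default arm of the switch */
	unsigned int hash_shift = runtime_const(d_hash_shift_val, 0u);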
- For a shift count, it is unfortunately necessary to explicitly stitch
the instruction together using .insn, to keep the assembler from
emitting the short, immediate-less encoding when the shift count is 1
(which would leave no immediate byte to patch).
Size- and operand-generic version:
#define _runtime_const_shift(_val, _sym, _init, _op2) ({ \
typeof(_val) __ret = (_val); \
switch (sizeof(__ret)) { \
case 1: \
asm_inline(".insn 0xc0/%c[op2],%[ret],%[init]{:u8}\n1:" \
_RUNTIME_CONST(1b, _sym, 1) \
: [ret] "+qm" (__ret) \
: [init] "i" ((u8)(_init)), \
[op2] "n" (_op2)); \
break; \
default: \
asm_inline(".insn 0xc1/%c[op2],%[ret],%[init]{:u8}\n1:" \
_RUNTIME_CONST(1b, _sym, 1) \
: [ret] "+rm" (__ret) \
: [init] "i" ((u8)(_init)), \
[op2] "n" (_op2)); \
break; \
} \
 __ret; })
#define runtime_const_rol(v,s,i) _runtime_const_shift(v,s,i,0)
#define runtime_const_ror(v,s,i) _runtime_const_shift(v,s,i,1)
#define runtime_const_shl(v,s,i) _runtime_const_shift(v,s,i,4)
#define runtime_const_shr(v,s,i) _runtime_const_shift(v,s,i,5)
#define runtime_const_sar(v,s,i) _runtime_const_shift(v,s,i,7)
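For concreteness, a hypothetical d_hash()-style user of the shr
wrapper (the symbol name and the zero init are assumptions; zero just
means "shift by nothing" until the real value is patched in):

static inline unsigned int d_hash_bucket(unsigned int hashlen)
{
	/* shift count is patched to the real d_hash_shift at init time */
	return runtime_const_shr(hashlen, d_hash_shift, 0);
}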
I am not sure if I'm missing something, but there really isn't a reason
to use different section names for the shift counts specifically, is there?
NOTE: if we are *not* making these type-generic there is no reason
whatsoever to not make these inlines...
***
Also: one thing I would *really* like to see this being used for is
cr4_pinned_bits, in which case one can, indeed, safely use a zero value
at init time as long as cr4_pinned_mask is patched at the same time,
which very much goes along with the above.
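A rough sketch of how that could look in the cr4 write path (the
runtime_const() form with an init value is the proposal above; the
surrounding check is simplified, and the WARN is just a placeholder for
the real enforcement):

#include <linux/bug.h>

static inline void check_cr4_pinning(unsigned long val)
{
	/* both constants read as zero until patched, so the check is inert */
	unsigned long mask = runtime_const(cr4_pinned_mask, 0UL);
	unsigned long bits = runtime_const(cr4_pinned_bits, 0UL);

	WARN_ONCE((val & mask) != bits, "cr4 pinned bits modified\n");
}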
Again, this is a slightly less minimal solution, versus what I had,
which was a maximal one.
My approach would pretty much have targeted doing this for nearly all
instances, which I eventually felt called for compiler support (or
C++!), since adding a bunch of static_imm_*() macros all over the
kernel really felt unpleasant.
-hpa
On 6/10/24 06:38, Rasmus Villemoes wrote:
> On 10/06/2024 12.43, Peter Zijlstra wrote:
>> On Sat, Jun 08, 2024 at 12:35:05PM -0700, Linus Torvalds wrote:
>
>>> Comments?
>>
>> It obviously has the constraint of never running the code before the
>> corresponding runtime_const_init() has been done, otherwise things will
>> go sideways in a hurry, but this also makes the whole thing a *lot*
>> simpler.
>>
>> The only thing I'm not sure about is it having a section per symbol,
>> given how jump_label and static_call took off, this might not be
>> scalable.
>>
>> Yes, the approach is super simple and straight forward, but imagine
>> there being like a 100 symbols soon :/
>>
>> The below hackery -- it very much needs cleanup and is only compiled on
>> x86_64 and does not support modules, boots for me.
>
> As can be seen in my other reply, yes, I'm also worried about the
> scalability and would like to see this applied to more stuff.
>
> But if we do this, surely that's what scripts/sorttable is for, right?
>
> Alternatively, if we just keep emitting to per-symbol
> __runtime_const_##sym sections but collect them in one __runtime_const,
> just using __runtime_const { *(SORT_BY_NAME(__runtime_const_*)) } in the
> linker script should already be enough to allow that binary search to
> work (with whatever : AT(ADDR() ... ) magic is also required), with no
> post-processing at build or runtime required.
>
As far as one section per symbol, this is *exactly* what the linker
table infrastructure was intended to make clean and scalable.
I think rejecting it was a big mistake. It is really a very useful
general piece of infrastructure, and IMNSHO the whole notion of "oh, we
won't ever need that many such tables" is just plain wrong (as evidenced
here.)
Either way, the problem isn't that hard; you end up doing something like:
struct runtime_const {
unsigned int size;
reladdr_t entries[0];
};
#define DECLARE_RUNTIME_CONST(sym,type) \
extern struct runtime_const sym; \
asm(".pushsection \"runtime_const_" #sym ".Start\",\"a\"\n\t" \
    ".globl " #sym "\n" \
    #sym ": .int 2f - 1f\n\t" \
    "1:\n" \
    ".popsection\n\t" \
    ".pushsection \"runtime_const_" #sym "._end\",\"a\"\n\t" \
    "2:\n" \
    ".popsection\n\t");
... and add a common suffix, say, ".entry", for the entry section names.
Then SORT_BY_NAME() will handle the rest.
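A minimal sketch of how the patch loop could consume such a table
(assuming reladdr_t is a 32-bit self-relative offset, that 'size' is
the byte length of the entry array, and that a hypothetical
patch_imm64() helper does the actual poke):

#include <linux/init.h>
#include <linux/types.h>

typedef s32 reladdr_t;	/* assumed: self-relative offset to the patch site */

static void __init runtime_const_apply(struct runtime_const *rc, u64 val)
{
	unsigned int i, n = rc->size / sizeof(rc->entries[0]);

	for (i = 0; i < n; i++) {
		/* each entry is relative to its own address, so add it back */
		void *site = (void *)&rc->entries[i] + rc->entries[i];

		patch_imm64(site, val);	/* hypothetical text-poke helper */
	}
}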
-hpa
On June 11, 2024 12:43:02 PM PDT, "H. Peter Anvin" <[email protected]> wrote:
>On 6/10/24 06:38, Rasmus Villemoes wrote:
>> On 10/06/2024 12.43, Peter Zijlstra wrote:
>>> On Sat, Jun 08, 2024 at 12:35:05PM -0700, Linus Torvalds wrote:
>>
>>>> Comments?
>>>
>>> It obviously has the constraint of never running the code before the
>>> corresponding runtime_const_init() has been done, otherwise things will
>>> go sideways in a hurry, but this also makes the whole thing a *lot*
>>> simpler.
>>>
>>> The only thing I'm not sure about is it having a section per symbol,
>>> given how jump_label and static_call took off, this might not be
>>> scalable.
>>>
>>> Yes, the approach is super simple and straight forward, but imagine
>>> there being like a 100 symbols soon :/
>>>
>>> The below hackery -- it very much needs cleanup and is only compiled on
>>> x86_64 and does not support modules, boots for me.
>>
>> As can be seen in my other reply, yes, I'm also worried about the
>> scalability and would like to see this applied to more stuff.
>>
>> But if we do this, surely that's what scripts/sorttable is for, right?
>>
>> Alternatively, if we just keep emitting to per-symbol
>> __runtime_const_##sym sections but collect them in one __runtime_const,
>> just using __runtime_const { *(SORT_BY_NAME(__runtime_const_*)) } in the
>> linker script should already be enough to allow that binary search to
>> work (with whatever : AT(ADDR() ... ) magic is also required), with no
>> post-processing at build or runtime required.
>>
>
>As far as one section per symbol, this is *exactly* what the linker table infrastructure was intended to make clean and scalable.
>
>I think rejecting it was a big mistake. It is really a very useful general piece of infrastructure, and IMNSHO the whole notion of "oh, we won't ever need that many such tables" is just plain wrong (as evidenced here.)
>
>Either way, the problem isn't that hard; you end up doing something like:
>
>struct runtime_const {
> unsigned int size;
> reladdr_t entries[0];
>};
>
>#define DECLARE_RUNTIME_CONST(sym,type) \
>extern struct runtime_const sym;\
>asm(".pushsection \"runtime_const_" #sym ".Start\",\"a\"\n\t"
> ".globl " #sym "\n"
> #sym ": .int 2f - 1f\n\t"
> "1:\n"
> ".popsection\n\t"
> ".pushsection \"runtime_const_" #sym "._end\",\"a\"\n\t"
> "2:\n"
> ".popsection\n\t");
>
>... and add a common suffix, say, ".entry", for the entry section names. Then SORT_BY_NAME() will handle the rest.
>
> -hpa
>
Ok, the section naming is obviously bogus, but...
I just had an idea how to clearly make this type-safe as a benefit. I'm at a school event right now but I'll hack up a demo as soon as I get home.
On Tue, 11 Jun 2024 at 13:16, H. Peter Anvin <[email protected]> wrote:
>
> I just had an idea how to clearly make this type-safe as a benefit.
You mean exactly like my patch already is, because I use the section name?
Christ people. I am throwing down the gauntlet: if you can't make a
patch that actually *improves* on what I already posted, don't even
bother.
The whole "it doesn't scale" is crazy talk. We don't want hundreds -
much less thousands - of these things. Using one named section for
each individual constant is a *good* thing.
So really: take a good hard look at that final
[PATCH 3/7] x86: add 'runtime constant' support
at
https://lore.kernel.org/lkml/[email protected]/
and respond only if you can *improve* on it by making it smaller or
somehow actually better.
Seriously. There's one line in that patch that really says it all:
2 files changed, 64 insertions(+)
and that's with basically optimal code generation. Improve on *THAT*.
No more of this pointless bikeshedding with arguments that make no
actual technical sense.
Linus