From: "Mike Rapoport (IBM)" <[email protected]>
Hi,
Since v3 I looked into making execmem more of a utility toolbox, as we
discussed at LPC with Mark Rutland, but it was getting hairier than
having a struct describing architecture constraints and a type identifying
the consumer of execmem.
And I do think that having the description of architecture constraints for
allocations of executable memory in a single place is better than having it
spread all over the place.
The patches available via git:
https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v4
v4 changes:
* rebase on v6.9-rc2
* rename execmem_params to execmem_info and execmem_arch_params() to
execmem_arch_setup()
* use single execmem_alloc() API instead of execmem_{text,data}_alloc() (Song)
* avoid extra copy of execmem parameters (Rick)
* run execmem_init() as core_initcall() except for the architectures that
may allocate text really early (currently only x86) (Will)
* add acks for some of arm64 and riscv changes, thanks Will and Alexandre
* new commits:
- drop call to kasan_alloc_module_shadow() on arm64 because it's not
needed anymore
- rename MODULE_START to MODULES_VADDR on MIPS
- use CONFIG_EXECMEM instead of CONFIG_MODULES on powerpc as per Christophe:
https://lore.kernel.org/all/[email protected]/
v3: https://lore.kernel.org/all/[email protected]
* add type parameter to execmem allocation APIs
* remove BPF dependency on modules
v2: https://lore.kernel.org/all/[email protected]
* Separate "module" and "others" allocations with execmem_text_alloc()
and jit_text_alloc()
* Drop ROX entailment on x86
* Add ack for nios2 changes, thanks Dinh Nguyen
v1: https://lore.kernel.org/all/[email protected]
= Cover letter from v1 (slightly updated) =
module_alloc() is used everywhere as a means to allocate memory for code.
Besides being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF, to modules and
puts the burden of code allocation on the modules code.
Several architectures override module_alloc() because of various
constraints on where the executable memory can be located, and this causes
additional obstacles for improvements of code allocation.
A centralized infrastructure for code allocation allows allocations of
executable memory as ROX, and future optimizations such as caching large
pages for better iTLB performance and providing sub-page allocations for
users that only need small jit code snippets.
Rick Edgecombe proposed a perm_alloc extension to vmalloc [1] and Song Liu
proposed execmem_alloc [2], but both these approaches were targeting BPF
allocations and lacked the groundwork to abstract executable allocations
and split them from the modules core.
Thomas Gleixner suggested expressing module allocation restrictions and
requirements as struct mod_alloc_type_params [3] that would define ranges,
protections and other parameters for different types of allocations used by
modules and following that suggestion Song separated allocations of
different types in modules (commit ac3b43283923 ("module: replace
module_layout with module_memory")) and posted "Type aware module
allocator" set [4].
I liked the idea of parametrising code allocation requirements as a
structure, but I believe the original proposal and Song's module allocator
were too module-centric, so I came up with these patches.
This set splits code allocation from modules by introducing execmem_alloc()
and execmem_free() APIs, replaces call sites of module_alloc() and
module_memfree() with the new APIs and implements core text and related
allocations in a central place.
Instead of architecture specific overrides for module_alloc(), the
architectures that require non-default behaviour for text allocation must
fill an execmem_info structure and implement execmem_arch_setup(), which returns
a pointer to that structure. If an architecture does not implement
execmem_arch_setup(), the defaults compatible with the current
modules::module_alloc() are used.
Since architectures define different restrictions on placement,
permissions, alignment and other parameters for memory that can be used by
different subsystems that allocate executable memory, execmem APIs
take a type argument, that will be used to identify the calling subsystem
and to allow architectures to define parameters for ranges suitable for that
subsystem.
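To illustrate the per-type lookup described above, here is a minimal
userspace sketch. It is not the code from this series: the enum and the
range structure are reduced copies of the definitions in
include/linux/execmem.h (pgprot omitted), while model_pick_range() and the
convention that start == 0 means "not explicitly defined" are assumptions
made only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced copy of the enum introduced by this series. */
enum execmem_type {
	EXECMEM_DEFAULT,
	EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
	EXECMEM_KPROBES,
	EXECMEM_FTRACE,
	EXECMEM_BPF,
	EXECMEM_TYPE_MAX,
};

/* Reduced model of struct execmem_range; pgprot is left out. */
struct execmem_range {
	unsigned long start;
	unsigned long end;
	unsigned int alignment;
};

struct execmem_info {
	struct execmem_range ranges[EXECMEM_TYPE_MAX];
};

/*
 * Hypothetical helper: return the range for @type, or the default range
 * when the architecture left that type unset. Treating start == 0 as
 * "unset" is an assumption of this sketch.
 */
static const struct execmem_range *
model_pick_range(const struct execmem_info *info, enum execmem_type type)
{
	assert(type < EXECMEM_TYPE_MAX);

	if (info->ranges[type].start == 0)
		return &info->ranges[EXECMEM_DEFAULT];

	return &info->ranges[type];
}
```

With such a table an architecture only has to describe the types that
differ from the default, and every other subsystem inherits the default
range.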
The new infrastructure allows decoupling of BPF, kprobes and ftrace from
modules, and most importantly it paves the way for ROX allocations for
executable memory.
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/87v8mndy3y.ffs@tglx/
[4] https://lore.kernel.org/all/[email protected]
Mike Rapoport (IBM) (15):
arm64: module: remove uneeded call to kasan_alloc_module_shadow()
mips: module: rename MODULE_START to MODULES_VADDR
nios2: define virtual address space for modules
module: make module_memory_{alloc,free} more self-contained
mm: introduce execmem_alloc() and execmem_free()
mm/execmem, arch: convert simple overrides of module_alloc to execmem
mm/execmem, arch: convert remaining overrides of module_alloc to
execmem
arm64: extend execmem_info for generated code allocations
riscv: extend execmem_params for generated code allocations
powerpc: extend execmem_params for kprobes allocations
arch: make execmem setup available regardless of CONFIG_MODULES
x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropiate
kprobes: remove dependency on CONFIG_MODULES
bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of
arch/Kconfig | 8 +-
arch/arm/kernel/module.c | 34 -------
arch/arm/mm/init.c | 40 ++++++++
arch/arm64/kernel/module.c | 126 ------------------------
arch/arm64/kernel/probes/kprobes.c | 7 --
arch/arm64/mm/init.c | 136 ++++++++++++++++++++++++++
arch/arm64/net/bpf_jit_comp.c | 11 ---
arch/loongarch/kernel/module.c | 6 --
arch/loongarch/mm/init.c | 20 ++++
arch/mips/include/asm/pgtable-64.h | 4 +-
arch/mips/kernel/module.c | 10 --
arch/mips/mm/fault.c | 4 +-
arch/mips/mm/init.c | 22 +++++
arch/nios2/include/asm/pgtable.h | 5 +-
arch/nios2/kernel/module.c | 20 ----
arch/nios2/mm/init.c | 19 ++++
arch/parisc/kernel/module.c | 12 ---
arch/parisc/mm/init.c | 22 ++++-
arch/powerpc/Kconfig | 2 +-
arch/powerpc/include/asm/kasan.h | 2 +-
arch/powerpc/kernel/head_8xx.S | 4 +-
arch/powerpc/kernel/head_book3s_32.S | 6 +-
arch/powerpc/kernel/kprobes.c | 22 +----
arch/powerpc/kernel/module.c | 38 --------
arch/powerpc/lib/code-patching.c | 2 +-
arch/powerpc/mm/book3s32/mmu.c | 2 +-
arch/powerpc/mm/mem.c | 64 ++++++++++++
arch/riscv/kernel/module.c | 12 ---
arch/riscv/kernel/probes/kprobes.c | 10 --
arch/riscv/mm/init.c | 41 ++++++++
arch/riscv/net/bpf_jit_core.c | 13 ---
arch/s390/kernel/ftrace.c | 4 +-
arch/s390/kernel/kprobes.c | 4 +-
arch/s390/kernel/module.c | 42 +-------
arch/s390/mm/init.c | 28 ++++++
arch/sparc/kernel/module.c | 30 ------
arch/sparc/mm/Makefile | 2 +
arch/sparc/mm/execmem.c | 25 +++++
arch/sparc/net/bpf_jit_comp_32.c | 8 +-
arch/x86/Kconfig | 2 +
arch/x86/kernel/ftrace.c | 16 +--
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/module.c | 51 ----------
arch/x86/mm/init.c | 27 ++++++
include/linux/execmem.h | 132 +++++++++++++++++++++++++
include/linux/moduleloader.h | 15 ---
kernel/bpf/Kconfig | 2 +-
kernel/bpf/core.c | 6 +-
kernel/kprobes.c | 51 +++++-----
kernel/module/Kconfig | 1 +
kernel/module/main.c | 105 +++++++++-----------
kernel/trace/trace_kprobe.c | 11 +++
mm/Kconfig | 3 +
mm/Makefile | 1 +
mm/execmem.c | 139 +++++++++++++++++++++++++++
mm/mm_init.c | 2 +
56 files changed, 858 insertions(+), 577 deletions(-)
create mode 100644 arch/sparc/mm/execmem.c
create mode 100644 include/linux/execmem.h
create mode 100644 mm/execmem.c
base-commit: 39cd87c4eb2b893354f3b850f916353f2658ae6f
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
Since commit f6f37d9320a1 ("arm64: select KASAN_VMALLOC for SW/HW_TAGS
modes") KASAN_VMALLOC is always enabled when KASAN is on. This means
that allocations in module_alloc() will be tracked by KASAN protection
for vmalloc() and that kasan_alloc_module_shadow() will always be an
empty inline, so there is no point in calling it.
Drop the meaningless call to kasan_alloc_module_shadow() from
module_alloc().
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
arch/arm64/kernel/module.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index 47e0be610bb6..e92da4da1b2a 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -141,11 +141,6 @@ void *module_alloc(unsigned long size)
__func__);
}
- if (p && (kasan_alloc_module_shadow(p, size, GFP_KERNEL) < 0)) {
- vfree(p);
- return NULL;
- }
-
/* Memory is intended to be executable, reset the pointer tag. */
return kasan_reset_tag(p);
}
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
module_alloc() is used everywhere as a means to allocate memory for code.
Besides being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF, to modules and
puts the burden of code allocation on the modules code.
Several architectures override module_alloc() because of various
constraints on where the executable memory can be located, and this causes
additional obstacles for improvements of code allocation.
Start splitting code allocation from modules by introducing execmem_alloc()
and execmem_free() APIs.
Initially, execmem_alloc() is a wrapper for module_alloc() and
execmem_free() is a replacement of module_memfree() to allow updating all
call sites to use the new APIs.
Since architectures define different restrictions on placement,
permissions, alignment and other parameters for memory that can be used by
different subsystems that allocate executable memory, execmem_alloc() takes
a type argument, that will be used to identify the calling subsystem and to
allow architectures to define parameters for ranges suitable for that
subsystem.
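The shape of the conversion at this stage can be modelled in userspace
roughly as follows. This is only a sketch: malloc() and free() stand in
for module_alloc() and vfree(), MODEL_PAGE_SIZE is an assumed constant,
and all model_* names are hypothetical. The point is that the type is
accepted by the new API but not yet acted upon.

```c
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: stand-in for PAGE_SIZE */

enum execmem_type {
	EXECMEM_DEFAULT,
	EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
	EXECMEM_KPROBES,
	EXECMEM_FTRACE,
	EXECMEM_BPF,
	EXECMEM_TYPE_MAX,
};

/* Userspace stand-in for module_alloc(); the kernel uses vmalloc space. */
static void *model_module_alloc(unsigned long size)
{
	return malloc(size);
}

/* At this point in the series the type is accepted but ignored. */
static void *model_execmem_alloc(enum execmem_type type, size_t size)
{
	(void)type;
	return model_module_alloc(size);
}

/* Stand-in for execmem_free(), which calls vfree() in the kernel. */
static void model_execmem_free(void *ptr)
{
	free(ptr);
}

/* A converted call site, shaped like alloc_insn_page() in kprobes. */
static void *model_alloc_insn_page(void)
{
	return model_execmem_alloc(EXECMEM_KPROBES, MODEL_PAGE_SIZE);
}
```

Because callers already pass their type, later patches can change what
the allocator does per subsystem without touching the call sites again.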
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
arch/powerpc/kernel/kprobes.c | 6 ++--
arch/s390/kernel/ftrace.c | 4 +--
arch/s390/kernel/kprobes.c | 4 +--
arch/s390/kernel/module.c | 5 +--
arch/sparc/net/bpf_jit_comp_32.c | 8 ++---
arch/x86/kernel/ftrace.c | 6 ++--
arch/x86/kernel/kprobes/core.c | 4 +--
include/linux/execmem.h | 57 ++++++++++++++++++++++++++++++++
include/linux/moduleloader.h | 3 --
kernel/bpf/core.c | 6 ++--
kernel/kprobes.c | 8 ++---
kernel/module/Kconfig | 1 +
kernel/module/main.c | 25 +++++---------
mm/Kconfig | 3 ++
mm/Makefile | 1 +
mm/execmem.c | 26 +++++++++++++++
16 files changed, 122 insertions(+), 45 deletions(-)
create mode 100644 include/linux/execmem.h
create mode 100644 mm/execmem.c
diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index bbca90a5e2ec..9fcd01bb2ce6 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -19,8 +19,8 @@
#include <linux/extable.h>
#include <linux/kdebug.h>
#include <linux/slab.h>
-#include <linux/moduleloader.h>
#include <linux/set_memory.h>
+#include <linux/execmem.h>
#include <asm/code-patching.h>
#include <asm/cacheflush.h>
#include <asm/sstep.h>
@@ -130,7 +130,7 @@ void *alloc_insn_page(void)
{
void *page;
- page = module_alloc(PAGE_SIZE);
+ page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
if (!page)
return NULL;
@@ -142,7 +142,7 @@ void *alloc_insn_page(void)
}
return page;
error:
- module_memfree(page);
+ execmem_free(page);
return NULL;
}
diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index c46381ea04ec..798249ef5646 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -7,13 +7,13 @@
* Author(s): Martin Schwidefsky <[email protected]>
*/
-#include <linux/moduleloader.h>
#include <linux/hardirq.h>
#include <linux/uaccess.h>
#include <linux/ftrace.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/kprobes.h>
+#include <linux/execmem.h>
#include <trace/syscall.h>
#include <asm/asm-offsets.h>
#include <asm/text-patching.h>
@@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void)
{
const char *start, *end;
- ftrace_plt = module_alloc(PAGE_SIZE);
+ ftrace_plt = execmem_alloc(EXECMEM_FTRACE, PAGE_SIZE);
if (!ftrace_plt)
panic("cannot allocate ftrace plt\n");
diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
index f0cf20d4b3c5..3c1b1be744de 100644
--- a/arch/s390/kernel/kprobes.c
+++ b/arch/s390/kernel/kprobes.c
@@ -9,7 +9,6 @@
#define pr_fmt(fmt) "kprobes: " fmt
-#include <linux/moduleloader.h>
#include <linux/kprobes.h>
#include <linux/ptrace.h>
#include <linux/preempt.h>
@@ -21,6 +20,7 @@
#include <linux/slab.h>
#include <linux/hardirq.h>
#include <linux/ftrace.h>
+#include <linux/execmem.h>
#include <asm/set_memory.h>
#include <asm/sections.h>
#include <asm/dis.h>
@@ -38,7 +38,7 @@ void *alloc_insn_page(void)
{
void *page;
- page = module_alloc(PAGE_SIZE);
+ page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
if (!page)
return NULL;
set_memory_rox((unsigned long)page, 1);
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index 42215f9404af..ac97a905e8cd 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -21,6 +21,7 @@
#include <linux/moduleloader.h>
#include <linux/bug.h>
#include <linux/memory.h>
+#include <linux/execmem.h>
#include <asm/alternative.h>
#include <asm/nospec-branch.h>
#include <asm/facility.h>
@@ -76,7 +77,7 @@ void *module_alloc(unsigned long size)
#ifdef CONFIG_FUNCTION_TRACER
void module_arch_cleanup(struct module *mod)
{
- module_memfree(mod->arch.trampolines_start);
+ execmem_free(mod->arch.trampolines_start);
}
#endif
@@ -510,7 +511,7 @@ static int module_alloc_ftrace_hotpatch_trampolines(struct module *me,
size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size);
numpages = DIV_ROUND_UP(size, PAGE_SIZE);
- start = module_alloc(numpages * PAGE_SIZE);
+ start = execmem_alloc(EXECMEM_FTRACE, numpages * PAGE_SIZE);
if (!start)
return -ENOMEM;
set_memory_rox((unsigned long)start, numpages);
diff --git a/arch/sparc/net/bpf_jit_comp_32.c b/arch/sparc/net/bpf_jit_comp_32.c
index da2df1e84ed4..bda2dbd3f4c5 100644
--- a/arch/sparc/net/bpf_jit_comp_32.c
+++ b/arch/sparc/net/bpf_jit_comp_32.c
@@ -1,10 +1,10 @@
// SPDX-License-Identifier: GPL-2.0
-#include <linux/moduleloader.h>
#include <linux/workqueue.h>
#include <linux/netdevice.h>
#include <linux/filter.h>
#include <linux/cache.h>
#include <linux/if_vlan.h>
+#include <linux/execmem.h>
#include <asm/cacheflush.h>
#include <asm/ptrace.h>
@@ -713,7 +713,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf];
if (unlikely(proglen + ilen > oldproglen)) {
pr_err("bpb_jit_compile fatal error\n");
kfree(addrs);
- module_memfree(image);
+ execmem_free(image);
return;
}
memcpy(image + proglen, temp, ilen);
@@ -736,7 +736,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf];
break;
}
if (proglen == oldproglen) {
- image = module_alloc(proglen);
+ image = execmem_alloc(EXECMEM_BPF, proglen);
if (!image)
goto out;
}
@@ -758,7 +758,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf];
void bpf_jit_free(struct bpf_prog *fp)
{
if (fp->jited)
- module_memfree(fp->bpf_func);
+ execmem_free(fp->bpf_func);
bpf_prog_unlock_free(fp);
}
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 70139d9d2e01..c8ddb7abda7c 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -25,6 +25,7 @@
#include <linux/memory.h>
#include <linux/vmalloc.h>
#include <linux/set_memory.h>
+#include <linux/execmem.h>
#include <trace/syscall.h>
@@ -261,15 +262,14 @@ void arch_ftrace_update_code(int command)
#ifdef CONFIG_X86_64
#ifdef CONFIG_MODULES
-#include <linux/moduleloader.h>
/* Module allocation simplifies allocating memory for code */
static inline void *alloc_tramp(unsigned long size)
{
- return module_alloc(size);
+ return execmem_alloc(EXECMEM_FTRACE, size);
}
static inline void tramp_free(void *tramp)
{
- module_memfree(tramp);
+ execmem_free(tramp);
}
#else
/* Trampolines can only be created if modules are supported */
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index d0e49bd7c6f3..72e6a45e7ec2 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -40,12 +40,12 @@
#include <linux/kgdb.h>
#include <linux/ftrace.h>
#include <linux/kasan.h>
-#include <linux/moduleloader.h>
#include <linux/objtool.h>
#include <linux/vmalloc.h>
#include <linux/pgtable.h>
#include <linux/set_memory.h>
#include <linux/cfi.h>
+#include <linux/execmem.h>
#include <asm/text-patching.h>
#include <asm/cacheflush.h>
@@ -495,7 +495,7 @@ void *alloc_insn_page(void)
{
void *page;
- page = module_alloc(PAGE_SIZE);
+ page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
if (!page)
return NULL;
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
new file mode 100644
index 000000000000..43e7995593a1
--- /dev/null
+++ b/include/linux/execmem.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_EXECMEM_ALLOC_H
+#define _LINUX_EXECMEM_ALLOC_H
+
+#include <linux/types.h>
+#include <linux/moduleloader.h>
+
+/**
+ * enum execmem_type - types of executable memory ranges
+ *
+ * There are several subsystems that allocate executable memory.
+ * Architectures define different restrictions on placement,
+ * permissions, alignment and other parameters for memory that can be used
+ * by these subsystems.
+ * Types in this enum identify subsystems that allocate executable memory
+ * and let architectures define parameters for ranges suitable for
+ * allocations by each subsystem.
+ *
+ * @EXECMEM_DEFAULT: default parameters that would be used for types that
+ * are not explicitly defined.
+ * @EXECMEM_MODULE_TEXT: parameters for module text sections
+ * @EXECMEM_KPROBES: parameters for kprobes
+ * @EXECMEM_FTRACE: parameters for ftrace
+ * @EXECMEM_BPF: parameters for BPF
+ * @EXECMEM_TYPE_MAX:
+ */
+enum execmem_type {
+ EXECMEM_DEFAULT,
+ EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
+ EXECMEM_KPROBES,
+ EXECMEM_FTRACE,
+ EXECMEM_BPF,
+ EXECMEM_TYPE_MAX,
+};
+
+/**
+ * execmem_alloc - allocate executable memory
+ * @type: type of the allocation
+ * @size: how many bytes of memory are required
+ *
+ * Allocates memory that will contain executable code, either generated or
+ * loaded from kernel modules.
+ *
+ * The memory will have protections defined by architecture for executable
+ * region of the @type.
+ *
+ * Return: a pointer to the allocated memory or %NULL
+ */
+void *execmem_alloc(enum execmem_type type, size_t size);
+
+/**
+ * execmem_free - free executable memory
+ * @ptr: pointer to the memory that should be freed
+ */
+void execmem_free(void *ptr);
+
+#endif /* _LINUX_EXECMEM_ALLOC_H */
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index 89b1e0ed9811..a3b8caee9405 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -29,9 +29,6 @@ unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
sections. Returns NULL on failure. */
void *module_alloc(unsigned long size);
-/* Free memory returned from module_alloc. */
-void module_memfree(void *module_region);
-
/* Determines if the section name is an init section (that is only used during
* module loading).
*/
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 696bc55de8e8..75a54024e2f4 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -22,7 +22,6 @@
#include <linux/skbuff.h>
#include <linux/vmalloc.h>
#include <linux/random.h>
-#include <linux/moduleloader.h>
#include <linux/bpf.h>
#include <linux/btf.h>
#include <linux/objtool.h>
@@ -37,6 +36,7 @@
#include <linux/nospec.h>
#include <linux/bpf_mem_alloc.h>
#include <linux/memcontrol.h>
+#include <linux/execmem.h>
#include <asm/barrier.h>
#include <asm/unaligned.h>
@@ -1050,12 +1050,12 @@ void bpf_jit_uncharge_modmem(u32 size)
void *__weak bpf_jit_alloc_exec(unsigned long size)
{
- return module_alloc(size);
+ return execmem_alloc(EXECMEM_BPF, size);
}
void __weak bpf_jit_free_exec(void *addr)
{
- module_memfree(addr);
+ execmem_free(addr);
}
struct bpf_binary_header *
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..047ca629ce49 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -26,7 +26,6 @@
#include <linux/slab.h>
#include <linux/stddef.h>
#include <linux/export.h>
-#include <linux/moduleloader.h>
#include <linux/kallsyms.h>
#include <linux/freezer.h>
#include <linux/seq_file.h>
@@ -39,6 +38,7 @@
#include <linux/jump_label.h>
#include <linux/static_call.h>
#include <linux/perf_event.h>
+#include <linux/execmem.h>
#include <asm/sections.h>
#include <asm/cacheflush.h>
@@ -113,17 +113,17 @@ enum kprobe_slot_state {
void __weak *alloc_insn_page(void)
{
/*
- * Use module_alloc() so this page is within +/- 2GB of where the
+ * Use execmem_alloc() so this page is within +/- 2GB of where the
* kernel image and loaded module images reside. This is required
* for most of the architectures.
* (e.g. x86-64 needs this to handle the %rip-relative fixups.)
*/
- return module_alloc(PAGE_SIZE);
+ return execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
}
static void free_insn_page(void *page)
{
- module_memfree(page);
+ execmem_free(page);
}
struct kprobe_insn_cache kprobe_insn_slots = {
diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
index f3e0329337f6..744383c1eed1 100644
--- a/kernel/module/Kconfig
+++ b/kernel/module/Kconfig
@@ -2,6 +2,7 @@
menuconfig MODULES
bool "Enable loadable module support"
modules
+ select EXECMEM
help
Kernel modules are small pieces of compiled code which can
be inserted in the running kernel, rather than being
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 5b82b069e0d3..d56b7df0cbb6 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -57,6 +57,7 @@
#include <linux/audit.h>
#include <linux/cfi.h>
#include <linux/debugfs.h>
+#include <linux/execmem.h>
#include <uapi/linux/module.h>
#include "internal.h"
@@ -1179,16 +1180,6 @@ resolve_symbol_wait(struct module *mod,
return ksym;
}
-void __weak module_memfree(void *module_region)
-{
- /*
- * This memory may be RO, and freeing RO memory in an interrupt is not
- * supported by vmalloc.
- */
- WARN_ON(in_interrupt());
- vfree(module_region);
-}
-
void __weak module_arch_cleanup(struct module *mod)
{
}
@@ -1213,7 +1204,7 @@ static int module_memory_alloc(struct module *mod, enum mod_mem_type type)
if (mod_mem_use_vmalloc(type))
ptr = vmalloc(size);
else
- ptr = module_alloc(size);
+ ptr = execmem_alloc(EXECMEM_MODULE_TEXT, size);
if (!ptr)
return -ENOMEM;
@@ -1244,7 +1235,7 @@ static void module_memory_free(struct module *mod, enum mod_mem_type type)
if (mod_mem_use_vmalloc(type))
vfree(ptr);
else
- module_memfree(ptr);
+ execmem_free(ptr);
}
static void free_mod_mem(struct module *mod)
@@ -2496,9 +2487,9 @@ static void do_free_init(struct work_struct *w)
llist_for_each_safe(pos, n, list) {
initfree = container_of(pos, struct mod_initfree, node);
- module_memfree(initfree->init_text);
- module_memfree(initfree->init_data);
- module_memfree(initfree->init_rodata);
+ execmem_free(initfree->init_text);
+ execmem_free(initfree->init_data);
+ execmem_free(initfree->init_rodata);
kfree(initfree);
}
}
@@ -2608,10 +2599,10 @@ static noinline int do_init_module(struct module *mod)
* We want to free module_init, but be aware that kallsyms may be
* walking this with preempt disabled. In all the failure paths, we
* call synchronize_rcu(), but we don't want to slow down the success
- * path. module_memfree() cannot be called in an interrupt, so do the
+ * path. execmem_free() cannot be called in an interrupt, so do the
* work and call synchronize_rcu() in a work queue.
*
- * Note that module_alloc() on most architectures creates W+X page
+ * Note that execmem_alloc() on most architectures creates W+X page
* mappings which won't be cleaned up until do_free_init() runs. Any
* code such as mark_rodata_ro() which depends on those mappings to
* be cleaned up needs to sync with the queued work by invoking
diff --git a/mm/Kconfig b/mm/Kconfig
index b1448aa81e15..f08a216d4793 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1241,6 +1241,9 @@ config LOCK_MM_AND_FIND_VMA
config IOMMU_MM_DATA
bool
+config EXECMEM
+ bool
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 4abb40b911ec..001336c91864 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -133,3 +133,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_EXECMEM) += execmem.o
diff --git a/mm/execmem.c b/mm/execmem.c
new file mode 100644
index 000000000000..ed2ea41a2543
--- /dev/null
+++ b/mm/execmem.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/execmem.h>
+#include <linux/moduleloader.h>
+
+static void *__execmem_alloc(size_t size)
+{
+ return module_alloc(size);
+}
+
+void *execmem_alloc(enum execmem_type type, size_t size)
+{
+ return __execmem_alloc(size);
+}
+
+void execmem_free(void *ptr)
+{
+ /*
+ * This memory may be RO, and freeing RO memory in an interrupt is not
+ * supported by vmalloc.
+ */
+ WARN_ON(in_interrupt());
+ vfree(ptr);
+}
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
Several architectures override module_alloc() only to define an address
range for code allocations that differs from the VMALLOC address space.
Provide a generic implementation in execmem that uses the parameters for
address space ranges, required alignment and page protections provided
by architectures.
The architectures must fill execmem_info structure and implement
execmem_arch_setup() that returns a pointer to that structure. This way the
execmem initialization won't be called from every architecture, but rather
from a central place, namely a core_initcall() in execmem.
The execmem provides execmem_alloc() API that wraps __vmalloc_node_range()
with the parameters defined by the architectures. If an architecture does
not implement execmem_arch_setup(), execmem_alloc() will fall back to
module_alloc().
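The control flow described above can be modelled in userspace as in the
sketch below. It is not the implementation in mm/execmem.c: a bump
allocator stands in for __vmalloc_node_range(), a return value of 0
stands in for the point where the real code falls back to
module_alloc(), the range end is treated as exclusive, and every model_*
and demo_* name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced model of the structures added by this patch (pgprot omitted). */
struct execmem_range {
	uintptr_t start;
	uintptr_t end;		/* exclusive in this model: an assumption */
	unsigned int alignment;
};

struct execmem_info {
	struct execmem_range ranges[1];	/* only a default range here */
};

/* Set once at init; NULL models "no execmem_arch_setup() provided". */
static struct execmem_info *model_info;
static uintptr_t model_next_free;	/* hypothetical bump cursor */

/* Hypothetical init: the arch hook is passed in, or NULL if absent. */
static void model_execmem_init(struct execmem_info *(*arch_setup)(void))
{
	model_info = arch_setup ? arch_setup() : NULL;
	model_next_free = model_info ? model_info->ranges[0].start : 0;
}

/*
 * Return an address inside the default range, or 0 when the range is
 * exhausted or when no architecture parameters exist (the case in which
 * the real code would fall back to module_alloc()).
 */
static uintptr_t model_execmem_alloc(size_t size)
{
	const struct execmem_range *r;
	uintptr_t align, addr;

	if (!model_info)
		return 0;

	r = &model_info->ranges[0];
	align = r->alignment ? r->alignment : 1;
	addr = (model_next_free + align - 1) / align * align;
	if (addr >= r->end || size > r->end - addr)
		return 0;

	model_next_free = addr + size;
	return addr;
}

/* Example architecture hook, mirroring the pattern used in this patch. */
static struct execmem_info demo_info = {
	.ranges = { [0] = { .start = 0x1000, .end = 0x3000, .alignment = 16 } },
};

static struct execmem_info *demo_arch_setup(void)
{
	return &demo_info;
}
```

The architecture only supplies data through demo_arch_setup(); the
allocation policy lives in one place, which is what makes later changes
to that policy possible without touching every architecture.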
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
arch/loongarch/kernel/module.c | 18 +++++++--
arch/mips/kernel/module.c | 19 +++++++--
arch/nios2/kernel/module.c | 19 ++++++---
arch/parisc/kernel/module.c | 23 +++++++----
arch/riscv/kernel/module.c | 21 +++++++---
arch/sparc/kernel/module.c | 41 ++++++++-----------
include/linux/execmem.h | 41 +++++++++++++++++++
mm/execmem.c | 73 ++++++++++++++++++++++++++++++++--
8 files changed, 202 insertions(+), 53 deletions(-)
diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c
index c7d0338d12c1..78c6a68f6c3c 100644
--- a/arch/loongarch/kernel/module.c
+++ b/arch/loongarch/kernel/module.c
@@ -18,6 +18,7 @@
#include <linux/ftrace.h>
#include <linux/string.h>
#include <linux/kernel.h>
+#include <linux/execmem.h>
#include <asm/alternative.h>
#include <asm/inst.h>
#include <asm/unwind.h>
@@ -490,10 +491,21 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
return 0;
}
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .pgprot = PAGE_KERNEL,
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
{
- return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
- GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0));
+ execmem_info.ranges[EXECMEM_DEFAULT].start = MODULES_VADDR;
+ execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
+
+ return &execmem_info;
}
static void module_init_ftrace_plt(const Elf_Ehdr *hdr,
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 9a6c96014904..50505e910763 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -20,6 +20,7 @@
#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <linux/jump_label.h>
+#include <linux/execmem.h>
#include <asm/jump_label.h>
struct mips_hi16 {
@@ -32,11 +33,21 @@ static LIST_HEAD(dbe_list);
static DEFINE_SPINLOCK(dbe_lock);
#ifdef MODULES_VADDR
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .start = MODULES_VADDR,
+ .end = MODULES_END,
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
{
- return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
- GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
+ execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
+
+ return &execmem_info;
}
#endif
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 9c97b7513853..2b68ef8aad42 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -18,15 +18,24 @@
#include <linux/fs.h>
#include <linux/string.h>
#include <linux/kernel.h>
+#include <linux/execmem.h>
#include <asm/cacheflush.h>
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .start = MODULES_VADDR,
+ .end = MODULES_END,
+ .pgprot = PAGE_KERNEL_EXEC,
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
{
- return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
- GFP_KERNEL, PAGE_KERNEL_EXEC,
- VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
- __builtin_return_address(0));
+ return &execmem_info;
}
int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
index d214bbe3c2af..721324c42b7d 100644
--- a/arch/parisc/kernel/module.c
+++ b/arch/parisc/kernel/module.c
@@ -49,6 +49,7 @@
#include <linux/bug.h>
#include <linux/mm.h>
#include <linux/slab.h>
+#include <linux/execmem.h>
#include <asm/unwind.h>
#include <asm/sections.h>
@@ -173,15 +174,21 @@ static inline int reassemble_22(int as22)
((as22 & 0x0003ff) << 3));
}
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .pgprot = PAGE_KERNEL_RWX,
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
{
- /* using RWX means less protection for modules, but it's
- * easier than trying to map the text, data, init_text and
- * init_data correctly */
- return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
- GFP_KERNEL,
- PAGE_KERNEL_RWX, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
+ execmem_info.ranges[EXECMEM_DEFAULT].start = VMALLOC_START;
+ execmem_info.ranges[EXECMEM_DEFAULT].end = VMALLOC_END;
+
+ return &execmem_info;
}
#ifndef CONFIG_64BIT
diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index 5e5a82644451..ad32e2a8621a 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -14,6 +14,7 @@
#include <linux/vmalloc.h>
#include <linux/sizes.h>
#include <linux/pgtable.h>
+#include <linux/execmem.h>
#include <asm/alternative.h>
#include <asm/sections.h>
@@ -906,13 +907,21 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
}
#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .pgprot = PAGE_KERNEL,
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
{
- return __vmalloc_node_range(size, 1, MODULES_VADDR,
- MODULES_END, GFP_KERNEL,
- PAGE_KERNEL, VM_FLUSH_RESET_PERMS,
- NUMA_NO_NODE,
- __builtin_return_address(0));
+ execmem_info.ranges[EXECMEM_DEFAULT].start = MODULES_VADDR;
+ execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
+
+ return &execmem_info;
}
#endif
diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c
index 66c45a2764bc..b70047f944cc 100644
--- a/arch/sparc/kernel/module.c
+++ b/arch/sparc/kernel/module.c
@@ -14,6 +14,7 @@
#include <linux/string.h>
#include <linux/ctype.h>
#include <linux/mm.h>
+#include <linux/execmem.h>
#include <asm/processor.h>
#include <asm/spitfire.h>
@@ -21,34 +22,26 @@
#include "entry.h"
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
#ifdef CONFIG_SPARC64
-
-#include <linux/jump_label.h>
-
-static void *module_map(unsigned long size)
-{
- if (PAGE_ALIGN(size) > MODULES_LEN)
- return NULL;
- return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
- GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
-}
+ .start = MODULES_VADDR,
+ .end = MODULES_END,
#else
-static void *module_map(unsigned long size)
+ .start = VMALLOC_START,
+ .end = VMALLOC_END,
+#endif
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
{
- return vmalloc(size);
-}
-#endif /* CONFIG_SPARC64 */
-
-void *module_alloc(unsigned long size)
-{
- void *ret;
-
- ret = module_map(size);
- if (ret)
- memset(ret, 0, size);
+ execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
- return ret;
+ return &execmem_info;
}
/* Make generic code ignore STT_REGISTER dummy undefined symbols. */
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 43e7995593a1..89173be320cf 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -33,6 +33,47 @@ enum execmem_type {
EXECMEM_TYPE_MAX,
};
+/**
+ * struct execmem_range - definition of an address space suitable for code and
+ * related data allocations
+ * @start: address space start
+ * @end: address space end (inclusive)
+ * @pgprot: permissions for memory in this address space
+ * @alignment: alignment required for text allocations
+ */
+struct execmem_range {
+ unsigned long start;
+ unsigned long end;
+ pgprot_t pgprot;
+ unsigned int alignment;
+};
+
+/**
+ * struct execmem_info - architecture parameters for code allocations
+ * @ranges: array of parameter sets defining architecture specific
+ * parameters for executable memory allocations. The ranges that are not
+ * explicitly initialized by an architecture use parameters defined for
+ * @EXECMEM_DEFAULT.
+ */
+struct execmem_info {
+ struct execmem_range ranges[EXECMEM_TYPE_MAX];
+};
+
+/**
+ * execmem_arch_setup - define parameters for allocations of executable memory
+ *
+ * A hook for architectures to define parameters for allocations of
+ * executable memory. These parameters should be filled into the
+ * @execmem_info structure.
+ *
+ * For architectures that do not implement this method a default set of
+ * parameters will be used
+ *
+ * Return: a structure defining architecture parameters and restrictions
+ * for allocations of executable memory
+ */
+struct execmem_info *execmem_arch_setup(void);
+
/**
* execmem_alloc - allocate executable memory
* @type: type of the allocation
diff --git a/mm/execmem.c b/mm/execmem.c
index ed2ea41a2543..d9fb20bc7354 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -5,14 +5,30 @@
#include <linux/execmem.h>
#include <linux/moduleloader.h>
-static void *__execmem_alloc(size_t size)
+static struct execmem_info *execmem_info __ro_after_init;
+
+static void *__execmem_alloc(struct execmem_range *range, size_t size)
{
- return module_alloc(size);
+ unsigned long start = range->start;
+ unsigned long end = range->end;
+ unsigned int align = range->alignment;
+ pgprot_t pgprot = range->pgprot;
+
+ return __vmalloc_node_range(size, align, start, end,
+ GFP_KERNEL, pgprot, VM_FLUSH_RESET_PERMS,
+ NUMA_NO_NODE, __builtin_return_address(0));
}
void *execmem_alloc(enum execmem_type type, size_t size)
{
- return __execmem_alloc(size);
+ struct execmem_range *range;
+
+ if (!execmem_info)
+ return module_alloc(size);
+
+ range = &execmem_info->ranges[type];
+
+ return __execmem_alloc(range, size);
}
void execmem_free(void *ptr)
@@ -24,3 +40,54 @@ void execmem_free(void *ptr)
WARN_ON(in_interrupt());
vfree(ptr);
}
+
+static bool execmem_validate(struct execmem_info *info)
+{
+ struct execmem_range *r = &info->ranges[EXECMEM_DEFAULT];
+
+ if (!r->alignment || !r->start || !r->end || !pgprot_val(r->pgprot)) {
+ pr_crit("Invalid parameters for execmem allocator, module loading will fail\n");
+ return false;
+ }
+
+ return true;
+}
+
+static void execmem_init_missing(struct execmem_info *info)
+{
+ struct execmem_range *default_range = &info->ranges[EXECMEM_DEFAULT];
+
+ for (int i = EXECMEM_DEFAULT + 1; i < EXECMEM_TYPE_MAX; i++) {
+ struct execmem_range *r = &info->ranges[i];
+
+ if (!r->start) {
+ r->pgprot = default_range->pgprot;
+ r->alignment = default_range->alignment;
+ r->start = default_range->start;
+ r->end = default_range->end;
+ }
+ }
+}
+
+struct execmem_info * __weak execmem_arch_setup(void)
+{
+ return NULL;
+}
+
+static int __init execmem_init(void)
+{
+ struct execmem_info *info = execmem_arch_setup();
+
+ if (!info)
+ return 0;
+
+ if (!execmem_validate(info))
+ return -EINVAL;
+
+ execmem_init_missing(info);
+
+ execmem_info = info;
+
+ return 0;
+}
+core_initcall(execmem_init);
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
The memory allocations for kprobes and BPF on RISC-V are not placed in
the modules area and these custom allocations are implemented with
overrides of alloc_insn_page() and bpf_jit_alloc_exec().
Slightly reorder execmem_info initialization to support both 32-bit and
64-bit variants, define EXECMEM_KPROBES and EXECMEM_BPF ranges in
riscv::execmem_info and drop overrides of alloc_insn_page() and
bpf_jit_alloc_exec().
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
---
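A note for reviewers, not part of the commit: the per-type lookup this
patch relies on can be sketched as a small userspace model. Everything
below is hypothetical. The model_* names, the MODEL_PROT_* constants and
all addresses are made up for illustration. Only the idea of copying the
default range into ranges that were left without a start address is taken
from execmem_init_missing().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for pgprot values; not real kernel constants. */
#define MODEL_PROT_RW	1u
#define MODEL_PROT_ROX	2u

enum model_type {
	MODEL_DEFAULT = 0,
	MODEL_KPROBES,
	MODEL_FTRACE,
	MODEL_BPF,
	MODEL_TYPE_MAX,
};

struct model_range {
	unsigned long start;
	unsigned long end;
	unsigned int prot;	/* stand-in for pgprot_t */
	unsigned int alignment;
};

struct model_info {
	struct model_range ranges[MODEL_TYPE_MAX];
};

/* Made-up layout: kprobes and BPF get their own ranges, ftrace does not. */
struct model_info model_example_setup(void)
{
	struct model_info info = {
		.ranges = {
			[MODEL_DEFAULT] = { 0x10000000UL, 0x20000000UL, MODEL_PROT_RW, 1 },
			[MODEL_KPROBES] = { 0x40000000UL, 0x80000000UL, MODEL_PROT_ROX, 1 },
			[MODEL_BPF]     = { 0x30000000UL, 0x38000000UL, MODEL_PROT_RW, 4096 },
		},
	};

	return info;
}

/* Copy the default range into every range whose start is still zero. */
void model_init_missing(struct model_info *info)
{
	const struct model_range *def = &info->ranges[MODEL_DEFAULT];

	for (int i = MODEL_DEFAULT + 1; i < MODEL_TYPE_MAX; i++) {
		struct model_range *r = &info->ranges[i];

		if (!r->start)
			*r = *def;
	}
}

/* An allocation of a given type simply indexes the table. */
const struct model_range *model_lookup(const struct model_info *info,
				       enum model_type type)
{
	assert(type < MODEL_TYPE_MAX);
	return &info->ranges[type];
}
```

In this model a MODEL_FTRACE request ends up in the default range, while
MODEL_KPROBES and MODEL_BPF keep the protection and alignment they were
given, which mirrors what the dropped overrides used to hard-code.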
arch/riscv/kernel/module.c | 21 ++++++++++++++++++++-
arch/riscv/kernel/probes/kprobes.c | 10 ----------
arch/riscv/net/bpf_jit_core.c | 13 -------------
3 files changed, 20 insertions(+), 24 deletions(-)
diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index ad32e2a8621a..aad158bb2022 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -906,20 +906,39 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
return 0;
}
-#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
+#ifdef CONFIG_MMU
static struct execmem_info execmem_info __ro_after_init = {
.ranges = {
[EXECMEM_DEFAULT] = {
.pgprot = PAGE_KERNEL,
.alignment = 1,
},
+ [EXECMEM_KPROBES] = {
+ .pgprot = PAGE_KERNEL_READ_EXEC,
+ .alignment = 1,
+ },
+ [EXECMEM_BPF] = {
+ .pgprot = PAGE_KERNEL,
+ .alignment = PAGE_SIZE,
+ },
},
};
struct execmem_info __init *execmem_arch_setup(void)
{
+#ifdef CONFIG_64BIT
execmem_info.ranges[EXECMEM_DEFAULT].start = MODULES_VADDR;
execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
+#else
+ execmem_info.ranges[EXECMEM_DEFAULT].start = VMALLOC_START;
+ execmem_info.ranges[EXECMEM_DEFAULT].end = VMALLOC_END;
+#endif
+
+ execmem_info.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
+ execmem_info.ranges[EXECMEM_KPROBES].end = VMALLOC_END;
+
+ execmem_info.ranges[EXECMEM_BPF].start = BPF_JIT_REGION_START;
+ execmem_info.ranges[EXECMEM_BPF].end = BPF_JIT_REGION_END;
return &execmem_info;
}
diff --git a/arch/riscv/kernel/probes/kprobes.c b/arch/riscv/kernel/probes/kprobes.c
index 2f08c14a933d..e64f2f3064eb 100644
--- a/arch/riscv/kernel/probes/kprobes.c
+++ b/arch/riscv/kernel/probes/kprobes.c
@@ -104,16 +104,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
return 0;
}
-#ifdef CONFIG_MMU
-void *alloc_insn_page(void)
-{
- return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
- GFP_KERNEL, PAGE_KERNEL_READ_EXEC,
- VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
- __builtin_return_address(0));
-}
-#endif
-
/* install breakpoint in text */
void __kprobes arch_arm_kprobe(struct kprobe *p)
{
diff --git a/arch/riscv/net/bpf_jit_core.c b/arch/riscv/net/bpf_jit_core.c
index 6b3acac30c06..e238fdbd5dbc 100644
--- a/arch/riscv/net/bpf_jit_core.c
+++ b/arch/riscv/net/bpf_jit_core.c
@@ -219,19 +219,6 @@ u64 bpf_jit_alloc_exec_limit(void)
return BPF_JIT_REGION_SIZE;
}
-void *bpf_jit_alloc_exec(unsigned long size)
-{
- return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
- BPF_JIT_REGION_END, GFP_KERNEL,
- PAGE_KERNEL, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
-}
-
-void bpf_jit_free_exec(void *addr)
-{
- return vfree(addr);
-}
-
void *bpf_arch_text_copy(void *dst, void *src, size_t len)
{
int ret;
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
The memory allocations for kprobes and BPF on arm64 can be placed
anywhere in vmalloc address space and currently this is implemented with
overrides of alloc_insn_page() and bpf_jit_alloc_exec() in arm64.
Define EXECMEM_KPROBES and EXECMEM_BPF ranges in arm64::execmem_info and
drop overrides of alloc_insn_page() and bpf_jit_alloc_exec().
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
Acked-by: Will Deacon <[email protected]>
---
arch/arm64/kernel/module.c | 14 ++++++++++++++
arch/arm64/kernel/probes/kprobes.c | 7 -------
arch/arm64/net/bpf_jit_comp.c | 11 -----------
3 files changed, 14 insertions(+), 18 deletions(-)
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index a377a3217cf2..aa9e2b3d7459 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -115,6 +115,12 @@ static struct execmem_info execmem_info __ro_after_init = {
[EXECMEM_DEFAULT] = {
.alignment = MODULE_ALIGN,
},
+ [EXECMEM_KPROBES] = {
+ .alignment = 1,
+ },
+ [EXECMEM_BPF] = {
+ .alignment = 1,
+ },
},
};
@@ -143,6 +149,14 @@ struct execmem_info __init *execmem_arch_setup(void)
r->end = module_plt_base + SZ_2G;
}
+ execmem_info.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_ROX;
+ execmem_info.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
+ execmem_info.ranges[EXECMEM_KPROBES].end = VMALLOC_END;
+
+ execmem_info.ranges[EXECMEM_BPF].pgprot = PAGE_KERNEL;
+ execmem_info.ranges[EXECMEM_BPF].start = VMALLOC_START;
+ execmem_info.ranges[EXECMEM_BPF].end = VMALLOC_END;
+
return &execmem_info;
}
diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
index 327855a11df2..4268678d0e86 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -129,13 +129,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
return 0;
}
-void *alloc_insn_page(void)
-{
- return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
- GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
- NUMA_NO_NODE, __builtin_return_address(0));
-}
-
/* arm kprobe: install breakpoint in text */
void __kprobes arch_arm_kprobe(struct kprobe *p)
{
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 122021f9bdfc..456f5af239fc 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -1793,17 +1793,6 @@ u64 bpf_jit_alloc_exec_limit(void)
return VMALLOC_END - VMALLOC_START;
}
-void *bpf_jit_alloc_exec(unsigned long size)
-{
- /* Memory is intended to be executable, reset the pointer tag. */
- return kasan_reset_tag(vmalloc(size));
-}
-
-void bpf_jit_free_exec(void *addr)
-{
- return vfree(addr);
-}
-
/* Indicate the JIT backend supports mixing bpf2bpf and tailcalls. */
bool bpf_jit_supports_subprog_tailcalls(void)
{
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
Dynamic ftrace must allocate memory for code and this was impossible
without CONFIG_MODULES.
With execmem separated from the modules code, execmem_alloc() is
available regardless of CONFIG_MODULES.
Remove dependency of dynamic ftrace on CONFIG_MODULES and make x86_64
select CONFIG_EXECMEM when CONFIG_DYNAMIC_FTRACE is enabled.
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/kernel/ftrace.c | 10 ----------
2 files changed, 1 insertion(+), 10 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e87ddbdaaeb2..5100a769ffda 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -34,6 +34,7 @@ config X86_64
select SWIOTLB
select ARCH_HAS_ELFCORE_COMPAT
select ZONE_DMA32
+ select EXECMEM if DYNAMIC_FTRACE
config FORCE_DYNAMIC_FTRACE
def_bool y
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index c8ddb7abda7c..8da0e66ca22d 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -261,8 +261,6 @@ void arch_ftrace_update_code(int command)
/* Currently only x86_64 supports dynamic trampolines */
#ifdef CONFIG_X86_64
-#ifdef CONFIG_MODULES
-/* Module allocation simplifies allocating memory for code */
static inline void *alloc_tramp(unsigned long size)
{
return execmem_alloc(EXECMEM_FTRACE, size);
@@ -271,14 +269,6 @@ static inline void tramp_free(void *tramp)
{
execmem_free(tramp);
}
-#else
-/* Trampolines can only be created if modules are supported */
-static inline void *alloc_tramp(unsigned long size)
-{
- return NULL;
-}
-static inline void tramp_free(void *tramp) { }
-#endif
/* Defined as markers to the end of the ftrace default trampolines */
extern void ftrace_regs_caller_end(void);
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
There are places where CONFIG_MODULES guards the code that depends on
memory allocation being done with module_alloc().
Replace CONFIG_MODULES with CONFIG_EXECMEM in such places.
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
arch/powerpc/Kconfig | 2 +-
arch/powerpc/include/asm/kasan.h | 2 +-
arch/powerpc/kernel/head_8xx.S | 4 ++--
arch/powerpc/kernel/head_book3s_32.S | 6 +++---
arch/powerpc/lib/code-patching.c | 2 +-
arch/powerpc/mm/book3s32/mmu.c | 2 +-
6 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1c4be3373686..2e586733a464 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -285,7 +285,7 @@ config PPC
select IOMMU_HELPER if PPC64
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
- select KASAN_VMALLOC if KASAN && MODULES
+ select KASAN_VMALLOC if KASAN && EXECMEM
select LOCK_MM_AND_FIND_VMA
select MMU_GATHER_PAGE_SIZE
select MMU_GATHER_RCU_TABLE_FREE
diff --git a/arch/powerpc/include/asm/kasan.h b/arch/powerpc/include/asm/kasan.h
index 365d2720097c..b5bbb94c51f6 100644
--- a/arch/powerpc/include/asm/kasan.h
+++ b/arch/powerpc/include/asm/kasan.h
@@ -19,7 +19,7 @@
#define KASAN_SHADOW_SCALE_SHIFT 3
-#if defined(CONFIG_MODULES) && defined(CONFIG_PPC32)
+#if defined(CONFIG_EXECMEM) && defined(CONFIG_PPC32)
#define KASAN_KERN_START ALIGN_DOWN(PAGE_OFFSET - SZ_256M, SZ_256M)
#else
#define KASAN_KERN_START PAGE_OFFSET
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 647b0b445e89..edc479a7c2bc 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -199,12 +199,12 @@ instruction_counter:
mfspr r10, SPRN_SRR0 /* Get effective address of fault */
INVALIDATE_ADJACENT_PAGES_CPU15(r10, r11)
mtspr SPRN_MD_EPN, r10
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
mfcr r11
compare_to_kernel_boundary r10, r10
#endif
mfspr r10, SPRN_M_TWB /* Get level 1 table */
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
blt+ 3f
rlwinm r10, r10, 0, 20, 31
oris r10, r10, (swapper_pg_dir - PAGE_OFFSET)@ha
diff --git a/arch/powerpc/kernel/head_book3s_32.S b/arch/powerpc/kernel/head_book3s_32.S
index c1d89764dd22..57196883a00e 100644
--- a/arch/powerpc/kernel/head_book3s_32.S
+++ b/arch/powerpc/kernel/head_book3s_32.S
@@ -419,14 +419,14 @@ InstructionTLBMiss:
*/
/* Get PTE (linux-style) and check access */
mfspr r3,SPRN_IMISS
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
lis r1, TASK_SIZE@h /* check if kernel address */
cmplw 0,r1,r3
#endif
mfspr r2, SPRN_SDR1
li r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC
rlwinm r2, r2, 28, 0xfffff000
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
li r0, 3
bgt- 112f
lis r2, (swapper_pg_dir - PAGE_OFFSET)@ha /* if kernel address, use */
@@ -442,7 +442,7 @@ InstructionTLBMiss:
andc. r1,r1,r2 /* check access & ~permission */
bne- InstructionAddressInvalid /* return if access not permitted */
/* Convert linux-style PTE to low word of PPC-style PTE */
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
rlwimi r2, r0, 0, 31, 31 /* userspace ? -> PP lsb */
#endif
ori r1, r1, 0xe06 /* clear out reserved bits */
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index c6ab46156cda..7af791446ddf 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -225,7 +225,7 @@ void __init poking_init(void)
static unsigned long get_patch_pfn(void *addr)
{
- if (IS_ENABLED(CONFIG_MODULES) && is_vmalloc_or_module_addr(addr))
+ if (IS_ENABLED(CONFIG_EXECMEM) && is_vmalloc_or_module_addr(addr))
return vmalloc_to_pfn(addr);
else
return __pa_symbol(addr) >> PAGE_SHIFT;
diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c
index 100f999871bc..625fe7d08e06 100644
--- a/arch/powerpc/mm/book3s32/mmu.c
+++ b/arch/powerpc/mm/book3s32/mmu.c
@@ -184,7 +184,7 @@ unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top)
static bool is_module_segment(unsigned long addr)
{
- if (!IS_ENABLED(CONFIG_MODULES))
+ if (!IS_ENABLED(CONFIG_EXECMEM))
return false;
if (addr < ALIGN_DOWN(MODULES_VADDR, SZ_256M))
return false;
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
kprobes depended on CONFIG_MODULES because it has to allocate memory for
code.
Since code allocations are now implemented with execmem, kprobes can be
enabled in non-modular kernels.
Add #ifdef CONFIG_MODULES guards for the code dealing with kprobes inside
modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
dependency of CONFIG_KPROBES on CONFIG_MODULES.
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
arch/Kconfig | 2 +-
kernel/kprobes.c | 43 +++++++++++++++++++++----------------
kernel/trace/trace_kprobe.c | 11 ++++++++++
3 files changed, 37 insertions(+), 19 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index bc9e8e5dccd5..68177adf61a0 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,9 +52,9 @@ config GENERIC_ENTRY
config KPROBES
bool "Kprobes"
- depends on MODULES
depends on HAVE_KPROBES
select KALLSYMS
+ select EXECMEM
select TASKS_RCU if PREEMPTION
help
Kprobes allows you to trap at almost any kernel address and
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 047ca629ce49..90c056853e6f 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1580,6 +1580,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
+#ifdef CONFIG_MODULES
/* Check if 'p' is probing a module. */
*probed_mod = __module_text_address((unsigned long) p->addr);
if (*probed_mod) {
@@ -1603,6 +1604,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
ret = -ENOENT;
}
}
+#endif
+
out:
preempt_enable();
jump_label_unlock();
@@ -2482,24 +2485,6 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end)
return 0;
}
-/* Remove all symbols in given area from kprobe blacklist */
-static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
-{
- struct kprobe_blacklist_entry *ent, *n;
-
- list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
- if (ent->start_addr < start || ent->start_addr >= end)
- continue;
- list_del(&ent->list);
- kfree(ent);
- }
-}
-
-static void kprobe_remove_ksym_blacklist(unsigned long entry)
-{
- kprobe_remove_area_blacklist(entry, entry + 1);
-}
-
int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
char *type, char *sym)
{
@@ -2564,6 +2549,25 @@ static int __init populate_kprobe_blacklist(unsigned long *start,
return ret ? : arch_populate_kprobe_blacklist();
}
+#ifdef CONFIG_MODULES
+/* Remove all symbols in given area from kprobe blacklist */
+static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
+{
+ struct kprobe_blacklist_entry *ent, *n;
+
+ list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
+ if (ent->start_addr < start || ent->start_addr >= end)
+ continue;
+ list_del(&ent->list);
+ kfree(ent);
+ }
+}
+
+static void kprobe_remove_ksym_blacklist(unsigned long entry)
+{
+ kprobe_remove_area_blacklist(entry, entry + 1);
+}
+
static void add_module_kprobe_blacklist(struct module *mod)
{
unsigned long start, end;
@@ -2665,6 +2669,7 @@ static struct notifier_block kprobe_module_nb = {
.notifier_call = kprobes_module_callback,
.priority = 0
};
+#endif
void kprobe_free_init_mem(void)
{
@@ -2724,8 +2729,10 @@ static int __init init_kprobes(void)
err = arch_init_kprobes();
if (!err)
err = register_die_notifier(&kprobe_exceptions_nb);
+#ifdef CONFIG_MODULES
if (!err)
err = register_module_notifier(&kprobe_module_nb);
+#endif
kprobes_initialized = (err == 0);
kprobe_sysctls_init();
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 14099cc17fc9..f0610137d6a3 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -111,6 +111,7 @@ static nokprobe_inline bool trace_kprobe_within_module(struct trace_kprobe *tk,
return strncmp(module_name(mod), name, len) == 0 && name[len] == ':';
}
+#ifdef CONFIG_MODULES
static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
{
char *p;
@@ -129,6 +130,12 @@ static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
return ret;
}
+#else
+static inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
+{
+ return false;
+}
+#endif
static bool trace_kprobe_is_busy(struct dyn_event *ev)
{
@@ -670,6 +677,7 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
return ret;
}
+#ifdef CONFIG_MODULES
/* Module notifier call back, checking event on the module */
static int trace_kprobe_module_callback(struct notifier_block *nb,
unsigned long val, void *data)
@@ -704,6 +712,7 @@ static struct notifier_block trace_kprobe_module_nb = {
.notifier_call = trace_kprobe_module_callback,
.priority = 1 /* Invoked after kprobe module callback */
};
+#endif
static int count_symbols(void *data, unsigned long unused)
{
@@ -1933,8 +1942,10 @@ static __init int init_kprobe_trace_early(void)
if (ret)
return ret;
+#ifdef CONFIG_MODULES
if (register_module_notifier(&trace_kprobe_module_nb))
return -EINVAL;
+#endif
return 0;
}
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
The BPF just-in-time compiler depended on CONFIG_MODULES because it used
module_alloc() to allocate memory for the generated code.
Since code allocations are now implemented with execmem, drop the
dependency of CONFIG_BPF_JIT on CONFIG_MODULES and make it select
CONFIG_EXECMEM.
Suggested-by: Björn Töpel <[email protected]>
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
kernel/bpf/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index bc25f5098a25..f999e4e0b344 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -43,7 +43,7 @@ config BPF_JIT
bool "Enable BPF Just In Time compiler"
depends on BPF
depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
- depends on MODULES
+ select EXECMEM
help
BPF programs are normally handled by a BPF interpreter. This option
allows the kernel to generate native code when a program is loaded
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
Move the logic related to the memory allocation and freeing into
module_memory_alloc() and module_memory_free().
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
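A note for reviewers, not part of the commit: the shape of the refactoring
can be sketched in userspace as follows. This is only a model under stated
assumptions: the sketch_* names are hypothetical, malloc()/free() stand in
for vmalloc()/module_alloc() and their free counterparts, and the page size
is fixed at 4096. The helpers take the module and the memory type, update
size and base themselves, and the caller only checks the return value and
unwinds on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE	4096UL
#define SKETCH_PAGE_ALIGN(x) \
	(((x) + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1))

enum sketch_mem_type {
	SKETCH_TEXT,
	SKETCH_DATA,
	SKETCH_RODATA,
	SKETCH_MEM_NUM_TYPES,
};

struct sketch_mem {
	void *base;
	unsigned int size;
};

struct sketch_module {
	struct sketch_mem mem[SKETCH_MEM_NUM_TYPES];
};

/* Page-align the size, allocate, zero the memory and record the base. */
int sketch_memory_alloc(struct sketch_module *mod, enum sketch_mem_type type)
{
	unsigned int size;
	void *ptr;

	assert(type < SKETCH_MEM_NUM_TYPES);
	size = SKETCH_PAGE_ALIGN(mod->mem[type].size);
	mod->mem[type].size = size;

	ptr = malloc(size);	/* stand-in for vmalloc()/module_alloc() */
	if (!ptr)
		return -ENOMEM;

	memset(ptr, 0, size);
	mod->mem[type].base = ptr;

	return 0;
}

void sketch_memory_free(struct sketch_module *mod, enum sketch_mem_type type)
{
	free(mod->mem[type].base);
	mod->mem[type].base = NULL;
}

/* Caller side: allocate every non-empty type, unwind on failure. */
int sketch_move_module(struct sketch_module *mod)
{
	int t;

	for (t = 0; t < SKETCH_MEM_NUM_TYPES; t++) {
		if (!mod->mem[t].size) {
			mod->mem[t].base = NULL;
			continue;
		}
		if (sketch_memory_alloc(mod, (enum sketch_mem_type)t))
			goto out_enomem;
	}

	return 0;

out_enomem:
	for (t--; t >= 0; t--)
		sketch_memory_free(mod, (enum sketch_mem_type)t);
	return -ENOMEM;
}
```

Keeping the size rounding and zeroing inside the allocation helper means
the caller no longer needs a local pointer, which is what lets the real
patch drop ptr from move_module().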
kernel/module/main.c | 64 +++++++++++++++++++++++++++-----------------
1 file changed, 39 insertions(+), 25 deletions(-)
diff --git a/kernel/module/main.c b/kernel/module/main.c
index e1e8a7a9d6c1..5b82b069e0d3 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1203,15 +1203,44 @@ static bool mod_mem_use_vmalloc(enum mod_mem_type type)
mod_mem_type_is_core_data(type);
}
-static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
+static int module_memory_alloc(struct module *mod, enum mod_mem_type type)
{
+ unsigned int size = PAGE_ALIGN(mod->mem[type].size);
+ void *ptr;
+
+ mod->mem[type].size = size;
+
if (mod_mem_use_vmalloc(type))
- return vzalloc(size);
- return module_alloc(size);
+ ptr = vmalloc(size);
+ else
+ ptr = module_alloc(size);
+
+ if (!ptr)
+ return -ENOMEM;
+
+ /*
+ * The pointer to these blocks of memory are stored on the module
+ * structure and we keep that around so long as the module is
+ * around. We only free that memory when we unload the module.
+ * Just mark them as not being a leak then. The .init* ELF
+ * sections *do* get freed after boot so we *could* treat them
+ * slightly differently with kmemleak_ignore() and only grey
+ * them out as they work as typical memory allocations which
+ * *do* eventually get freed, but let's just keep things simple
+ * and avoid *any* false positives.
+ */
+ kmemleak_not_leak(ptr);
+
+ memset(ptr, 0, size);
+ mod->mem[type].base = ptr;
+
+ return 0;
}
-static void module_memory_free(void *ptr, enum mod_mem_type type)
+static void module_memory_free(struct module *mod, enum mod_mem_type type)
{
+ void *ptr = mod->mem[type].base;
+
if (mod_mem_use_vmalloc(type))
vfree(ptr);
else
@@ -1229,12 +1258,12 @@ static void free_mod_mem(struct module *mod)
/* Free lock-classes; relies on the preceding sync_rcu(). */
lockdep_free_key_range(mod_mem->base, mod_mem->size);
if (mod_mem->size)
- module_memory_free(mod_mem->base, type);
+ module_memory_free(mod, type);
}
/* MOD_DATA hosts mod, so free it at last */
lockdep_free_key_range(mod->mem[MOD_DATA].base, mod->mem[MOD_DATA].size);
- module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA);
+ module_memory_free(mod, MOD_DATA);
}
/* Free a module, remove from lists, etc. */
@@ -2225,7 +2254,6 @@ static int find_module_sections(struct module *mod, struct load_info *info)
static int move_module(struct module *mod, struct load_info *info)
{
int i;
- void *ptr;
enum mod_mem_type t = 0;
int ret = -ENOMEM;
@@ -2234,26 +2262,12 @@ static int move_module(struct module *mod, struct load_info *info)
mod->mem[type].base = NULL;
continue;
}
- mod->mem[type].size = PAGE_ALIGN(mod->mem[type].size);
- ptr = module_memory_alloc(mod->mem[type].size, type);
- /*
- * The pointer to these blocks of memory are stored on the module
- * structure and we keep that around so long as the module is
- * around. We only free that memory when we unload the module.
- * Just mark them as not being a leak then. The .init* ELF
- * sections *do* get freed after boot so we *could* treat them
- * slightly differently with kmemleak_ignore() and only grey
- * them out as they work as typical memory allocations which
- * *do* eventually get freed, but let's just keep things simple
- * and avoid *any* false positives.
- */
- kmemleak_not_leak(ptr);
- if (!ptr) {
+
+ ret = module_memory_alloc(mod, type);
+ if (ret) {
t = type;
goto out_enomem;
}
- memset(ptr, 0, mod->mem[type].size);
- mod->mem[type].base = ptr;
}
/* Transfer each section which specifies SHF_ALLOC */
@@ -2296,7 +2310,7 @@ static int move_module(struct module *mod, struct load_info *info)
return 0;
out_enomem:
for (t--; t >= 0; t--)
- module_memory_free(mod->mem[t].base, t);
+ module_memory_free(mod, t);
return ret;
}
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
Extend execmem parameters to accommodate more complex overrides of
module_alloc() by architectures.
This includes specification of a fallback range required by arm, arm64
and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for
allocation of KASAN shadow required by s390 and x86 and support for
early initialization of execmem required by x86.
The core implementation of execmem_alloc() takes care of suppressing
warnings when the initial allocation fails but there is a fallback range
defined.
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
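A note for reviewers, not part of the commit: the fallback behaviour
described above can be sketched with a userspace model. All fb_* names are
hypothetical, a bump pointer stands in for the vmalloc bookkeeping, the end
of a window is treated as exclusive, and a zero fallback_start is assumed to
mean "no fallback". The point of the sketch is only the ordering: the first
attempt is silent when a fallback exists, and a failure is reported only
when the last attempt fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a range with an optional fallback window. */
struct fb_range {
	unsigned long start, end;			/* primary window */
	unsigned long fallback_start, fallback_end;	/* zero: no fallback */
};

/* Bump pointers standing in for the real allocator state. */
struct fb_state {
	unsigned long next;
	unsigned long fallback_next;
	unsigned int warnings;	/* failures that would have been reported */
};

struct fb_state fb_state_init(const struct fb_range *r)
{
	struct fb_state st = {
		.next = r->start,
		.fallback_next = r->fallback_start,
		.warnings = 0,
	};

	return st;
}

/* Carve size bytes out of [*next, end); returns 0 on failure. */
static unsigned long fb_bump(unsigned long *next, unsigned long end,
			     size_t size, bool nowarn, unsigned int *warnings)
{
	unsigned long addr = *next;

	assert(addr <= end);
	if (size > end - addr) {
		if (!nowarn)
			(*warnings)++;
		return 0;
	}

	*next = addr + size;
	return addr;
}

unsigned long fb_alloc(const struct fb_range *r, struct fb_state *st,
		       size_t size)
{
	bool has_fallback = r->fallback_start != 0;
	unsigned long addr;

	/* Stay quiet on the first attempt only if a second one will follow. */
	addr = fb_bump(&st->next, r->end, size, has_fallback, &st->warnings);
	if (!addr && has_fallback)
		addr = fb_bump(&st->fallback_next, r->fallback_end, size,
			       false, &st->warnings);

	return addr;
}
```

With a small primary window and a larger fallback window, requests spill
over into the fallback without any warning being counted. A warning is
counted only once a request fits in neither window.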
arch/Kconfig | 6 ++++
arch/arm/kernel/module.c | 38 +++++++++++---------
arch/arm64/kernel/module.c | 49 ++++++++++++-------------
arch/powerpc/kernel/module.c | 58 ++++++++++++++++++------------
arch/s390/kernel/module.c | 52 +++++++++++----------------
arch/x86/Kconfig | 1 +
arch/x86/kernel/module.c | 62 ++++++++++----------------------
include/linux/execmem.h | 34 ++++++++++++++++++
include/linux/moduleloader.h | 12 -------
kernel/module/main.c | 26 ++++----------
mm/execmem.c | 70 +++++++++++++++++++++++++++++-------
mm/mm_init.c | 2 ++
12 files changed, 228 insertions(+), 182 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 9f066785bb71..bc9e8e5dccd5 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -960,6 +960,12 @@ config ARCH_WANTS_MODULES_DATA_IN_VMALLOC
For architectures like powerpc/32 which have constraints on module
allocation and need to allocate module data outside of module area.
+config ARCH_WANTS_EXECMEM_EARLY
+ bool
+ help
+ For architectures that might allocate executable memory early
+ during boot, for instance ftrace on x86.
+
config HAVE_IRQ_EXIT_ON_IRQ_STACK
bool
help
diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index e74d84f58b77..32974758c73b 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -16,6 +16,7 @@
#include <linux/fs.h>
#include <linux/string.h>
#include <linux/gfp.h>
+#include <linux/execmem.h>
#include <asm/sections.h>
#include <asm/smp_plat.h>
@@ -34,23 +35,28 @@
#endif
#ifdef CONFIG_MMU
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .start = MODULES_VADDR,
+ .end = MODULES_END,
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
{
- gfp_t gfp_mask = GFP_KERNEL;
- void *p;
-
- /* Silence the initial allocation */
- if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS))
- gfp_mask |= __GFP_NOWARN;
-
- p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
- gfp_mask, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
- if (!IS_ENABLED(CONFIG_ARM_MODULE_PLTS) || p)
- return p;
- return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
- GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
+ struct execmem_range *r = &execmem_info.ranges[EXECMEM_DEFAULT];
+
+ r->pgprot = PAGE_KERNEL_EXEC;
+
+ if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+ r->fallback_start = VMALLOC_START;
+ r->fallback_end = VMALLOC_END;
+ }
+
+ return &execmem_info;
}
#endif
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index e92da4da1b2a..a377a3217cf2 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -20,6 +20,7 @@
#include <linux/random.h>
#include <linux/scs.h>
#include <linux/vmalloc.h>
+#include <linux/execmem.h>
#include <asm/alternative.h>
#include <asm/insn.h>
@@ -108,41 +109,41 @@ static int __init module_init_limits(void)
return 0;
}
-subsys_initcall(module_init_limits);
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .alignment = MODULE_ALIGN,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
{
- void *p = NULL;
+ struct execmem_range *r = &execmem_info.ranges[EXECMEM_DEFAULT];
+
+ module_init_limits();
+
+ r->pgprot = PAGE_KERNEL;
/*
* Where possible, prefer to allocate within direct branch range of the
* kernel such that no PLTs are necessary.
*/
if (module_direct_base) {
- p = __vmalloc_node_range(size, MODULE_ALIGN,
- module_direct_base,
- module_direct_base + SZ_128M,
- GFP_KERNEL | __GFP_NOWARN,
- PAGE_KERNEL, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
- }
-
- if (!p && module_plt_base) {
- p = __vmalloc_node_range(size, MODULE_ALIGN,
- module_plt_base,
- module_plt_base + SZ_2G,
- GFP_KERNEL | __GFP_NOWARN,
- PAGE_KERNEL, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
- }
+ r->start = module_direct_base;
+ r->end = module_direct_base + SZ_128M;
- if (!p) {
- pr_warn_ratelimited("%s: unable to allocate memory\n",
- __func__);
+ if (module_plt_base) {
+ r->fallback_start = module_plt_base;
+ r->fallback_end = module_plt_base + SZ_2G;
+ }
+ } else if (module_plt_base) {
+ r->start = module_plt_base;
+ r->end = module_plt_base + SZ_2G;
}
- /* Memory is intended to be executable, reset the pointer tag. */
- return kasan_reset_tag(p);
+ return &execmem_info;
}
enum aarch64_reloc_op {
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index f6d6ae0a1692..5a1d0490c831 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -10,6 +10,7 @@
#include <linux/vmalloc.h>
#include <linux/mm.h>
#include <linux/bug.h>
+#include <linux/execmem.h>
#include <asm/module.h>
#include <linux/uaccess.h>
#include <asm/firmware.h>
@@ -89,39 +90,52 @@ int module_finalize(const Elf_Ehdr *hdr,
return 0;
}
-static __always_inline void *
-__module_alloc(unsigned long size, unsigned long start, unsigned long end, bool nowarn)
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .alignment = 1,
+ },
+ [EXECMEM_MODULE_DATA] = {
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
{
pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
- gfp_t gfp = GFP_KERNEL | (nowarn ? __GFP_NOWARN : 0);
+ struct execmem_range *text = &execmem_info.ranges[EXECMEM_DEFAULT];
/*
- * Don't do huge page allocations for modules yet until more testing
- * is done. STRICT_MODULE_RWX may require extra work to support this
- * too.
+ * BOOK3S_32 and 8xx define MODULES_VADDR for text allocations and
+ * allow allocating data in the entire vmalloc space
*/
- return __vmalloc_node_range(size, 1, start, end, gfp, prot,
- VM_FLUSH_RESET_PERMS,
- NUMA_NO_NODE, __builtin_return_address(0));
-}
-
-void *module_alloc(unsigned long size)
-{
#ifdef MODULES_VADDR
+ struct execmem_range *data = &execmem_info.ranges[EXECMEM_MODULE_DATA];
unsigned long limit = (unsigned long)_etext - SZ_32M;
- void *ptr = NULL;
BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);
/* First try within 32M limit from _etext to avoid branch trampolines */
- if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit)
- ptr = __module_alloc(size, limit, MODULES_END, true);
-
- if (!ptr)
- ptr = __module_alloc(size, MODULES_VADDR, MODULES_END, false);
-
- return ptr;
+ if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit) {
+ text->start = limit;
+ text->end = MODULES_END;
+ text->fallback_start = MODULES_VADDR;
+ text->fallback_end = MODULES_END;
+ } else {
+ text->start = MODULES_VADDR;
+ text->end = MODULES_END;
+ }
+ data->start = VMALLOC_START;
+ data->end = VMALLOC_END;
+ data->pgprot = PAGE_KERNEL;
+ data->alignment = 1;
#else
- return __module_alloc(size, VMALLOC_START, VMALLOC_END, false);
+ text->start = VMALLOC_START;
+ text->end = VMALLOC_END;
#endif
+
+ text->pgprot = prot;
+
+ return &execmem_info;
}
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index ac97a905e8cd..7d38218bfd27 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -37,41 +37,29 @@
#define PLT_ENTRY_SIZE 22
-static unsigned long get_module_load_offset(void)
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .flags = EXECMEM_KASAN_SHADOW,
+ .alignment = MODULE_ALIGN,
+ .pgprot = PAGE_KERNEL,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
{
- static DEFINE_MUTEX(module_kaslr_mutex);
- static unsigned long module_load_offset;
-
- if (!kaslr_enabled())
- return 0;
- /*
- * Calculate the module_load_offset the first time this code
- * is called. Once calculated it stays the same until reboot.
- */
- mutex_lock(&module_kaslr_mutex);
- if (!module_load_offset)
+ unsigned long module_load_offset = 0;
+ unsigned long start;
+
+ if (kaslr_enabled())
module_load_offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
- mutex_unlock(&module_kaslr_mutex);
- return module_load_offset;
-}
-void *module_alloc(unsigned long size)
-{
- gfp_t gfp_mask = GFP_KERNEL;
- void *p;
-
- if (PAGE_ALIGN(size) > MODULES_LEN)
- return NULL;
- p = __vmalloc_node_range(size, MODULE_ALIGN,
- MODULES_VADDR + get_module_load_offset(),
- MODULES_END, gfp_mask, PAGE_KERNEL,
- VM_FLUSH_RESET_PERMS | VM_DEFER_KMEMLEAK,
- NUMA_NO_NODE, __builtin_return_address(0));
- if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
- vfree(p);
- return NULL;
- }
- return p;
+ start = MODULES_VADDR + module_load_offset;
+ execmem_info.ranges[EXECMEM_DEFAULT].start = start;
+ execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
+
+ return &execmem_info;
}
#ifdef CONFIG_FUNCTION_TRACER
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4fff6ed46e90..e87ddbdaaeb2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -135,6 +135,7 @@ config X86
select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP if X86_64
select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
select ARCH_WANTS_THP_SWAP if X86_64
+ select ARCH_WANTS_EXECMEM_EARLY if EXECMEM
select ARCH_HAS_PARANOID_L1D_FLUSH
select BUILDTIME_TABLE_SORT
select CLKEVT_I8253
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index e18914c0e38a..8f526f056847 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -19,6 +19,7 @@
#include <linux/jump_label.h>
#include <linux/random.h>
#include <linux/memory.h>
+#include <linux/execmem.h>
#include <asm/text-patching.h>
#include <asm/page.h>
@@ -36,55 +37,28 @@ do { \
} while (0)
#endif
-#ifdef CONFIG_RANDOMIZE_BASE
-static unsigned long module_load_offset;
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .flags = EXECMEM_KASAN_SHADOW,
+ .alignment = MODULE_ALIGN,
+ },
+ },
+};
-/* Mutex protects the module_load_offset. */
-static DEFINE_MUTEX(module_kaslr_mutex);
-
-static unsigned long int get_module_load_offset(void)
-{
- if (kaslr_enabled()) {
- mutex_lock(&module_kaslr_mutex);
- /*
- * Calculate the module_load_offset the first time this
- * code is called. Once calculated it stays the same until
- * reboot.
- */
- if (module_load_offset == 0)
- module_load_offset =
- get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
- mutex_unlock(&module_kaslr_mutex);
- }
- return module_load_offset;
-}
-#else
-static unsigned long int get_module_load_offset(void)
-{
- return 0;
-}
-#endif
-
-void *module_alloc(unsigned long size)
+struct execmem_info __init *execmem_arch_setup(void)
{
- gfp_t gfp_mask = GFP_KERNEL;
- void *p;
-
- if (PAGE_ALIGN(size) > MODULES_LEN)
- return NULL;
+ unsigned long start, offset = 0;
- p = __vmalloc_node_range(size, MODULE_ALIGN,
- MODULES_VADDR + get_module_load_offset(),
- MODULES_END, gfp_mask, PAGE_KERNEL,
- VM_FLUSH_RESET_PERMS | VM_DEFER_KMEMLEAK,
- NUMA_NO_NODE, __builtin_return_address(0));
+ if (kaslr_enabled())
+ offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
- if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
- vfree(p);
- return NULL;
- }
+ start = MODULES_VADDR + offset;
+ execmem_info.ranges[EXECMEM_DEFAULT].start = start;
+ execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
+ execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
- return p;
+ return &execmem_info;
}
#ifdef CONFIG_X86_32
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 89173be320cf..ffd0d12feef5 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -5,6 +5,14 @@
#include <linux/types.h>
#include <linux/moduleloader.h>
+#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
+ !defined(CONFIG_KASAN_VMALLOC)
+#include <linux/kasan.h>
+#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
+#else
+#define MODULE_ALIGN PAGE_SIZE
+#endif
+
/**
* enum execmem_type - types of executable memory ranges
*
@@ -22,6 +30,7 @@
* @EXECMEM_KPROBES: parameters for kprobes
* @EXECMEM_FTRACE: parameters for ftrace
* @EXECMEM_BPF: parameters for BPF
+ * @EXECMEM_MODULE_DATA: parameters for module data sections
* @EXECMEM_TYPE_MAX:
*/
enum execmem_type {
@@ -30,22 +39,38 @@ enum execmem_type {
EXECMEM_KPROBES,
EXECMEM_FTRACE,
EXECMEM_BPF,
+ EXECMEM_MODULE_DATA,
EXECMEM_TYPE_MAX,
};
+/**
+ * enum execmem_range_flags - options for executable memory allocations
+ * @EXECMEM_KASAN_SHADOW: allocate kasan shadow
+ */
+enum execmem_range_flags {
+ EXECMEM_KASAN_SHADOW = (1 << 0),
+};
+
/**
* struct execmem_range - definition of an address space suitable for code and
* related data allocations
* @start: address space start
* @end: address space end (inclusive)
+ * @fallback_start: start of the secondary address space range for fallback
+ * allocations on architectures that require it
+ * @fallback_end: end of the secondary address space range (inclusive)
* @pgprot: permissions for memory in this address space
* @alignment: alignment required for text allocations
+ * @flags: options for memory allocations for this range
*/
struct execmem_range {
unsigned long start;
unsigned long end;
+ unsigned long fallback_start;
+ unsigned long fallback_end;
pgprot_t pgprot;
unsigned int alignment;
+ enum execmem_range_flags flags;
};
/**
@@ -82,6 +107,9 @@ struct execmem_info *execmem_arch_setup(void);
* Allocates memory that will contain executable code, either generated or
* loaded from kernel modules.
*
+ * Allocates memory that will contain data coupled with executable code,
+ * like data sections in kernel modules.
+ *
* The memory will have protections defined by architecture for executable
* region of the @type.
*
@@ -95,4 +123,10 @@ void *execmem_alloc(enum execmem_type type, size_t size);
*/
void execmem_free(void *ptr);
+#ifdef CONFIG_ARCH_WANTS_EXECMEM_EARLY
+void execmem_early_init(void);
+#else
+static inline void execmem_early_init(void) {}
+#endif
+
#endif /* _LINUX_EXECMEM_ALLOC_H */
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index a3b8caee9405..e395461d59e5 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -25,10 +25,6 @@ int module_frob_arch_sections(Elf_Ehdr *hdr,
/* Additional bytes needed by arch in front of individual sections */
unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
-/* Allocator used for allocating struct module, core sections and init
- sections. Returns NULL on failure. */
-void *module_alloc(unsigned long size);
-
/* Determines if the section name is an init section (that is only used during
* module loading).
*/
@@ -126,12 +122,4 @@ void module_arch_cleanup(struct module *mod);
/* Any cleanup before freeing mod->module_init */
void module_arch_freeing_init(struct module *mod);
-#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
- !defined(CONFIG_KASAN_VMALLOC)
-#include <linux/kasan.h>
-#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
-#else
-#define MODULE_ALIGN PAGE_SIZE
-#endif
-
#endif
diff --git a/kernel/module/main.c b/kernel/module/main.c
index d56b7df0cbb6..91e185607d4b 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1188,24 +1188,20 @@ void __weak module_arch_freeing_init(struct module *mod)
{
}
-static bool mod_mem_use_vmalloc(enum mod_mem_type type)
-{
- return IS_ENABLED(CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC) &&
- mod_mem_type_is_core_data(type);
-}
-
static int module_memory_alloc(struct module *mod, enum mod_mem_type type)
{
unsigned int size = PAGE_ALIGN(mod->mem[type].size);
+ enum execmem_type execmem_type;
void *ptr;
mod->mem[type].size = size;
- if (mod_mem_use_vmalloc(type))
- ptr = vmalloc(size);
+ if (mod_mem_type_is_data(type))
+ execmem_type = EXECMEM_MODULE_DATA;
else
- ptr = execmem_alloc(EXECMEM_MODULE_TEXT, size);
+ execmem_type = EXECMEM_MODULE_TEXT;
+ ptr = execmem_alloc(execmem_type, size);
if (!ptr)
return -ENOMEM;
@@ -1232,10 +1228,7 @@ static void module_memory_free(struct module *mod, enum mod_mem_type type)
{
void *ptr = mod->mem[type].base;
- if (mod_mem_use_vmalloc(type))
- vfree(ptr);
- else
- execmem_free(ptr);
+ execmem_free(ptr);
}
static void free_mod_mem(struct module *mod)
@@ -1630,13 +1623,6 @@ static void free_modinfo(struct module *mod)
}
}
-void * __weak module_alloc(unsigned long size)
-{
- return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
- GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
- NUMA_NO_NODE, __builtin_return_address(0));
-}
-
bool __weak module_init_section(const char *name)
{
return strstarts(name, ".init");
diff --git a/mm/execmem.c b/mm/execmem.c
index d9fb20bc7354..aabc0afabdbc 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -6,27 +6,49 @@
#include <linux/moduleloader.h>
static struct execmem_info *execmem_info __ro_after_init;
+static struct execmem_info default_execmem_info __ro_after_init;
static void *__execmem_alloc(struct execmem_range *range, size_t size)
{
+ bool kasan = range->flags & EXECMEM_KASAN_SHADOW;
+ unsigned long vm_flags = VM_FLUSH_RESET_PERMS;
+ gfp_t gfp_flags = GFP_KERNEL | __GFP_NOWARN;
unsigned long start = range->start;
unsigned long end = range->end;
unsigned int align = range->alignment;
pgprot_t pgprot = range->pgprot;
+ void *p;
+
+ if (kasan)
+ vm_flags |= VM_DEFER_KMEMLEAK;
+
+ p = __vmalloc_node_range(size, align, start, end, gfp_flags,
+ pgprot, vm_flags, NUMA_NO_NODE,
+ __builtin_return_address(0));
+ if (!p && range->fallback_start) {
+ start = range->fallback_start;
+ end = range->fallback_end;
+ p = __vmalloc_node_range(size, align, start, end, gfp_flags,
+ pgprot, vm_flags, NUMA_NO_NODE,
+ __builtin_return_address(0));
+ }
+
+ if (!p) {
+ pr_warn_ratelimited("execmem: unable to allocate memory\n");
+ return NULL;
+ }
+
+ if (kasan && (kasan_alloc_module_shadow(p, size, GFP_KERNEL) < 0)) {
+ vfree(p);
+ return NULL;
+ }
- return __vmalloc_node_range(size, align, start, end,
- GFP_KERNEL, pgprot, VM_FLUSH_RESET_PERMS,
- NUMA_NO_NODE, __builtin_return_address(0));
+ return kasan_reset_tag(p);
}
void *execmem_alloc(enum execmem_type type, size_t size)
{
- struct execmem_range *range;
-
- if (!execmem_info)
- return module_alloc(size);
-
- range = &execmem_info->ranges[type];
+ struct execmem_range *range = &execmem_info->ranges[type];
return __execmem_alloc(range, size);
}
@@ -61,10 +83,16 @@ static void execmem_init_missing(struct execmem_info *info)
struct execmem_range *r = &info->ranges[i];
if (!r->start) {
- r->pgprot = default_range->pgprot;
+ if (i == EXECMEM_MODULE_DATA)
+ r->pgprot = PAGE_KERNEL;
+ else
+ r->pgprot = default_range->pgprot;
r->alignment = default_range->alignment;
r->start = default_range->start;
r->end = default_range->end;
+ r->flags = default_range->flags;
+ r->fallback_start = default_range->fallback_start;
+ r->fallback_end = default_range->fallback_end;
}
}
}
@@ -74,12 +102,18 @@ struct execmem_info * __weak execmem_arch_setup(void)
return NULL;
}
-static int __init execmem_init(void)
+static int __init __execmem_init(void)
{
struct execmem_info *info = execmem_arch_setup();
- if (!info)
+ if (!info) {
+ info = execmem_info = &default_execmem_info;
+ info->ranges[EXECMEM_DEFAULT].start = VMALLOC_START;
+ info->ranges[EXECMEM_DEFAULT].end = VMALLOC_END;
+ info->ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL_EXEC;
+ info->ranges[EXECMEM_DEFAULT].alignment = 1;
return 0;
+ }
if (!execmem_validate(info))
return -EINVAL;
@@ -90,4 +124,16 @@ static int __init execmem_init(void)
return 0;
}
+
+#ifndef CONFIG_ARCH_WANTS_EXECMEM_EARLY
+static int __init execmem_init(void)
+{
+ return __execmem_init();
+}
core_initcall(execmem_init);
+#else
+void __init execmem_early_init(void)
+{
+ (void)__execmem_init();
+}
+#endif
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 549e76af8f82..dae777234a31 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -27,6 +27,7 @@
#include <linux/swap.h>
#include <linux/cma.h>
#include <linux/crash_dump.h>
+#include <linux/execmem.h>
#include "internal.h"
#include "slab.h"
#include "shuffle.h"
@@ -2793,4 +2794,5 @@ void __init mm_core_init(void)
pti_init();
kmsan_init_runtime();
mm_cache_init();
+ execmem_early_init();
}
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
powerpc overrides kprobes::alloc_insn_page() to remove writable
permissions when STRICT_MODULE_RWX is on.
Add a definition of EXECMEM_KPROBES to execmem_info to allow using the
generic kprobes::alloc_insn_page() with the desired permissions.
As powerpc uses breakpoint instructions to inject kprobes, it does not
need to constrain kprobe allocations to the modules area and can use the
entire vmalloc address space.
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
arch/powerpc/kernel/kprobes.c | 20 --------------------
arch/powerpc/kernel/module.c | 11 +++++++++++
2 files changed, 11 insertions(+), 20 deletions(-)
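
Not part of the commit: a minimal userspace sketch of how an explicit
EXECMEM_KPROBES entry coexists with the defaults. It assumes that ranges
whose start is left at zero inherit the EXECMEM_DEFAULT range, as
execmem_init_missing() does, and that the inheritance loop starts after
the default entry. All model_* and MODEL_* names are hypothetical
stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel types. */
enum model_type {
	MODEL_DEFAULT,
	MODEL_KPROBES,
	MODEL_MODULE_DATA,
	MODEL_TYPE_MAX,
};

enum model_prot {
	MODEL_PROT_UNSET,
	MODEL_PROT_RW,	/* stands in for PAGE_KERNEL */
	MODEL_PROT_RWX,	/* stands in for PAGE_KERNEL_EXEC */
	MODEL_PROT_ROX,	/* stands in for PAGE_KERNEL_ROX */
};

struct model_range {
	unsigned long start;
	unsigned long end;
	enum model_prot prot;
	unsigned int alignment;
};

/*
 * Models the inheritance rule: a range with start == 0 copies the
 * default range, except that module data always gets non-executable
 * permissions. Ranges that were set explicitly are left alone.
 */
static void model_init_missing(struct model_range *ranges)
{
	const struct model_range *def;

	assert(ranges != NULL);
	def = &ranges[MODEL_DEFAULT];

	for (int i = MODEL_DEFAULT + 1; i < MODEL_TYPE_MAX; i++) {
		struct model_range *r = &ranges[i];

		if (r->start)
			continue;

		r->prot = (i == MODEL_MODULE_DATA) ? MODEL_PROT_RW : def->prot;
		r->alignment = def->alignment;
		r->start = def->start;
		r->end = def->end;
	}
}

/* Models the powerpc choice of kprobes permissions in this patch. */
static enum model_prot model_kprobes_prot(int strict_rwx)
{
	return strict_rwx ? MODEL_PROT_ROX : MODEL_PROT_RWX;
}
```

With a default range and an explicit kprobes range filled in, calling
model_init_missing() leaves the kprobes entry untouched and fills only
the module data entry from the default.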
diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 9fcd01bb2ce6..14c5ddec3056 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -126,26 +126,6 @@ kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offse
return (kprobe_opcode_t *)(addr + offset);
}
-void *alloc_insn_page(void)
-{
- void *page;
-
- page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
- if (!page)
- return NULL;
-
- if (strict_module_rwx_enabled()) {
- int err = set_memory_rox((unsigned long)page, 1);
-
- if (err)
- goto error;
- }
- return page;
-error:
- execmem_free(page);
- return NULL;
-}
-
int arch_prepare_kprobe(struct kprobe *p)
{
int ret = 0;
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index 5a1d0490c831..a1eaa74f2d41 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -95,6 +95,9 @@ static struct execmem_info execmem_info __ro_after_init = {
[EXECMEM_DEFAULT] = {
.alignment = 1,
},
+ [EXECMEM_KPROBES] = {
+ .alignment = 1,
+ },
[EXECMEM_MODULE_DATA] = {
.alignment = 1,
},
@@ -137,5 +140,13 @@ struct execmem_info __init *execmem_arch_setup(void)
text->pgprot = prot;
+ execmem_info.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
+ execmem_info.ranges[EXECMEM_KPROBES].end = VMALLOC_END;
+
+ if (strict_module_rwx_enabled())
+ execmem_info.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_ROX;
+ else
+ execmem_info.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_EXEC;
+
return &execmem_info;
}
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
nios2 uses kmalloc() to implement module_alloc() because CALL26/PCREL26
cannot reach all of the vmalloc address space.
Define module space as 32MiB below the kernel base and switch nios2 to
use vmalloc for module allocations.
Suggested-by: Thomas Gleixner <[email protected]>
Acked-by: Dinh Nguyen <[email protected]>
Acked-by: Song Liu <[email protected]>
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
arch/nios2/include/asm/pgtable.h | 5 ++++-
arch/nios2/kernel/module.c | 19 ++++---------------
2 files changed, 8 insertions(+), 16 deletions(-)
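
Not part of the commit: a small userspace sketch of the address
arithmetic behind the new layout. It assumes a kernel region base of
0xc0000000 purely for illustration (the real value comes from
CONFIG_NIOS2_KERNEL_REGION_BASE), and the model_* and MODEL_* names are
hypothetical.

```c
#include <assert.h>

#define MODEL_SZ_32M			0x02000000UL
/* Assumed for illustration only; the real base is a Kconfig value. */
#define MODEL_KERNEL_REGION_BASE	0xc0000000UL

/* The module area is the last 32MiB below the kernel region base. */
static unsigned long model_modules_vaddr(unsigned long base)
{
	return base - MODEL_SZ_32M;
}

static unsigned long model_modules_end(unsigned long base)
{
	return base - 1;
}

/* vmalloc space now ends right below the module area. */
static unsigned long model_vmalloc_end(unsigned long base)
{
	return base - MODEL_SZ_32M - 1;
}

/* Returns 1 if addr lies inside the module area, 0 otherwise. */
static int model_in_module_area(unsigned long base, unsigned long addr)
{
	assert(base >= MODEL_SZ_32M);

	return addr >= model_modules_vaddr(base) &&
	       addr <= model_modules_end(base);
}
```

Under the assumed base, the module area covers 0xbe000000 through
0xbfffffff and vmalloc space ends at 0xbdffffff, so the two regions are
adjacent and do not overlap.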
diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index d052dfcbe8d3..eab87c6beacb 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -25,7 +25,10 @@
#include <asm-generic/pgtable-nopmd.h>
#define VMALLOC_START CONFIG_NIOS2_KERNEL_MMU_REGION_BASE
-#define VMALLOC_END (CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
+#define VMALLOC_END (CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M - 1)
+
+#define MODULES_VADDR (CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M)
+#define MODULES_END (CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
struct mm_struct;
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 76e0a42d6e36..9c97b7513853 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -21,23 +21,12 @@
#include <asm/cacheflush.h>
-/*
- * Modules should NOT be allocated with kmalloc for (obvious) reasons.
- * But we do it for now to avoid relocation issues. CALL26/PCREL26 cannot reach
- * from 0x80000000 (vmalloc area) to 0xc00000000 (kernel) (kmalloc returns
- * addresses in 0xc0000000)
- */
void *module_alloc(unsigned long size)
{
- if (size == 0)
- return NULL;
- return kmalloc(size, GFP_KERNEL);
-}
-
-/* Free memory returned from module_alloc */
-void module_memfree(void *module_region)
-{
- kfree(module_region);
+ return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
+ GFP_KERNEL, PAGE_KERNEL_EXEC,
+ VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
+ __builtin_return_address(0));
}
int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
Rename MODULE_START to MODULES_VADDR and MODULE_END to MODULES_END to
match other architectures that define a custom address space for modules.
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
arch/mips/include/asm/pgtable-64.h | 4 ++--
arch/mips/kernel/module.c | 4 ++--
arch/mips/mm/fault.c | 4 ++--
3 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h
index 20ca48c1b606..c0109aff223b 100644
--- a/arch/mips/include/asm/pgtable-64.h
+++ b/arch/mips/include/asm/pgtable-64.h
@@ -147,8 +147,8 @@
#if defined(CONFIG_MODULES) && defined(KBUILD_64BIT_SYM32) && \
VMALLOC_START != CKSSEG
/* Load modules into 32bit-compatible segment. */
-#define MODULE_START CKSSEG
-#define MODULE_END (FIXADDR_START-2*PAGE_SIZE)
+#define MODULES_VADDR CKSSEG
+#define MODULES_END (FIXADDR_START-2*PAGE_SIZE)
#endif
#define pte_ERROR(e) \
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 7b2fbaa9cac5..9a6c96014904 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -31,10 +31,10 @@ struct mips_hi16 {
static LIST_HEAD(dbe_list);
static DEFINE_SPINLOCK(dbe_lock);
-#ifdef MODULE_START
+#ifdef MODULES_VADDR
void *module_alloc(unsigned long size)
{
- return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END,
+ return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
__builtin_return_address(0));
}
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index aaa9a242ebba..37fedeaca2e9 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -83,8 +83,8 @@ static void __do_page_fault(struct pt_regs *regs, unsigned long write,
if (unlikely(address >= VMALLOC_START && address <= VMALLOC_END))
goto VMALLOC_FAULT_TARGET;
-#ifdef MODULE_START
- if (unlikely(address >= MODULE_START && address < MODULE_END))
+#ifdef MODULES_VADDR
+ if (unlikely(address >= MODULES_VADDR && address < MODULES_END))
goto VMALLOC_FAULT_TARGET;
#endif
--
2.43.0
From: "Mike Rapoport (IBM)" <[email protected]>
execmem does not depend on modules; on the contrary, modules use
execmem.
To make execmem available when CONFIG_MODULES=n, for instance for
kprobes, split execmem_info initialization out from
arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y.
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
---
arch/arm/kernel/module.c | 40 ----------
arch/arm/mm/init.c | 40 ++++++++++
arch/arm64/kernel/module.c | 136 ---------------------------------
arch/arm64/mm/init.c | 136 +++++++++++++++++++++++++++++++++
arch/loongarch/kernel/module.c | 18 -----
arch/loongarch/mm/init.c | 20 +++++
arch/mips/kernel/module.c | 21 -----
arch/mips/mm/init.c | 22 ++++++
arch/nios2/kernel/module.c | 18 -----
arch/nios2/mm/init.c | 19 +++++
arch/parisc/kernel/module.c | 19 -----
arch/parisc/mm/init.c | 22 +++++-
arch/powerpc/kernel/module.c | 63 ---------------
arch/powerpc/mm/mem.c | 64 ++++++++++++++++
arch/riscv/kernel/module.c | 40 ----------
arch/riscv/mm/init.c | 41 ++++++++++
arch/s390/kernel/module.c | 25 ------
arch/s390/mm/init.c | 28 +++++++
arch/sparc/kernel/module.c | 23 ------
arch/sparc/mm/Makefile | 2 +
arch/sparc/mm/execmem.c | 25 ++++++
arch/x86/kernel/module.c | 25 ------
arch/x86/mm/init.c | 27 +++++++
23 files changed, 445 insertions(+), 429 deletions(-)
create mode 100644 arch/sparc/mm/execmem.c
diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index 32974758c73b..677f218f7e84 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -12,54 +12,14 @@
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/elf.h>
-#include <linux/vmalloc.h>
#include <linux/fs.h>
#include <linux/string.h>
-#include <linux/gfp.h>
-#include <linux/execmem.h>
#include <asm/sections.h>
#include <asm/smp_plat.h>
#include <asm/unwind.h>
#include <asm/opcodes.h>
-#ifdef CONFIG_XIP_KERNEL
-/*
- * The XIP kernel text is mapped in the module area for modules and
- * some other stuff to work without any indirect relocations.
- * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
- * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
- */
-#undef MODULES_VADDR
-#define MODULES_VADDR (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
-#endif
-
-#ifdef CONFIG_MMU
-static struct execmem_info execmem_info __ro_after_init = {
- .ranges = {
- [EXECMEM_DEFAULT] = {
- .start = MODULES_VADDR,
- .end = MODULES_END,
- .alignment = 1,
- },
- },
-};
-
-struct execmem_info __init *execmem_arch_setup(void)
-{
- struct execmem_range *r = &execmem_info.ranges[EXECMEM_DEFAULT];
-
- r->pgprot = PAGE_KERNEL_EXEC;
-
- if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
- r->fallback_start = VMALLOC_START;
- r->fallback_end = VMALLOC_END;
- }
-
- return &execmem_info;
-}
-#endif
-
bool module_init_section(const char *name)
{
return strstarts(name, ".init") ||
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index e8c6f4be0ce1..e54338825156 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -22,6 +22,7 @@
#include <linux/sizes.h>
#include <linux/stop_machine.h>
#include <linux/swiotlb.h>
+#include <linux/execmem.h>
#include <asm/cp15.h>
#include <asm/mach-types.h>
@@ -486,3 +487,42 @@ void free_initrd_mem(unsigned long start, unsigned long end)
free_reserved_area((void *)start, (void *)end, -1, "initrd");
}
#endif
+
+#ifdef CONFIG_EXECMEM
+#ifdef CONFIG_XIP_KERNEL
+/*
+ * The XIP kernel text is mapped in the module area for modules and
+ * some other stuff to work without any indirect relocations.
+ * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
+ * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
+ */
+#undef MODULES_VADDR
+#define MODULES_VADDR (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
+#endif
+
+#ifdef CONFIG_MMU
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .start = MODULES_VADDR,
+ .end = MODULES_END,
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
+{
+ struct execmem_range *r = &execmem_info.ranges[EXECMEM_DEFAULT];
+
+ r->pgprot = PAGE_KERNEL_EXEC;
+
+ if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+ r->fallback_start = VMALLOC_START;
+ r->fallback_end = VMALLOC_END;
+ }
+
+ return &execmem_info;
+}
+#endif
+#endif /* CONFIG_EXECMEM */
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index aa9e2b3d7459..36b25af56324 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -12,154 +12,18 @@
#include <linux/bitops.h>
#include <linux/elf.h>
#include <linux/ftrace.h>
-#include <linux/gfp.h>
#include <linux/kasan.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/moduleloader.h>
#include <linux/random.h>
#include <linux/scs.h>
-#include <linux/vmalloc.h>
-#include <linux/execmem.h>
#include <asm/alternative.h>
#include <asm/insn.h>
#include <asm/scs.h>
#include <asm/sections.h>
-static u64 module_direct_base __ro_after_init = 0;
-static u64 module_plt_base __ro_after_init = 0;
-
-/*
- * Choose a random page-aligned base address for a window of 'size' bytes which
- * entirely contains the interval [start, end - 1].
- */
-static u64 __init random_bounding_box(u64 size, u64 start, u64 end)
-{
- u64 max_pgoff, pgoff;
-
- if ((end - start) >= size)
- return 0;
-
- max_pgoff = (size - (end - start)) / PAGE_SIZE;
- pgoff = get_random_u32_inclusive(0, max_pgoff);
-
- return start - pgoff * PAGE_SIZE;
-}
-
-/*
- * Modules may directly reference data and text anywhere within the kernel
- * image and other modules. References using PREL32 relocations have a +/-2G
- * range, and so we need to ensure that the entire kernel image and all modules
- * fall within a 2G window such that these are always within range.
- *
- * Modules may directly branch to functions and code within the kernel text,
- * and to functions and code within other modules. These branches will use
- * CALL26/JUMP26 relocations with a +/-128M range. Without PLTs, we must ensure
- * that the entire kernel text and all module text falls within a 128M window
- * such that these are always within range. With PLTs, we can expand this to a
- * 2G window.
- *
- * We chose the 128M region to surround the entire kernel image (rather than
- * just the text) as using the same bounds for the 128M and 2G regions ensures
- * by construction that we never select a 128M region that is not a subset of
- * the 2G region. For very large and unusual kernel configurations this means
- * we may fall back to PLTs where they could have been avoided, but this keeps
- * the logic significantly simpler.
- */
-static int __init module_init_limits(void)
-{
- u64 kernel_end = (u64)_end;
- u64 kernel_start = (u64)_text;
- u64 kernel_size = kernel_end - kernel_start;
-
- /*
- * The default modules region is placed immediately below the kernel
- * image, and is large enough to use the full 2G relocation range.
- */
- BUILD_BUG_ON(KIMAGE_VADDR != MODULES_END);
- BUILD_BUG_ON(MODULES_VSIZE < SZ_2G);
-
- if (!kaslr_enabled()) {
- if (kernel_size < SZ_128M)
- module_direct_base = kernel_end - SZ_128M;
- if (kernel_size < SZ_2G)
- module_plt_base = kernel_end - SZ_2G;
- } else {
- u64 min = kernel_start;
- u64 max = kernel_end;
-
- if (IS_ENABLED(CONFIG_RANDOMIZE_MODULE_REGION_FULL)) {
- pr_info("2G module region forced by RANDOMIZE_MODULE_REGION_FULL\n");
- } else {
- module_direct_base = random_bounding_box(SZ_128M, min, max);
- if (module_direct_base) {
- min = module_direct_base;
- max = module_direct_base + SZ_128M;
- }
- }
-
- module_plt_base = random_bounding_box(SZ_2G, min, max);
- }
-
- pr_info("%llu pages in range for non-PLT usage",
- module_direct_base ? (SZ_128M - kernel_size) / PAGE_SIZE : 0);
- pr_info("%llu pages in range for PLT usage",
- module_plt_base ? (SZ_2G - kernel_size) / PAGE_SIZE : 0);
-
- return 0;
-}
-
-static struct execmem_info execmem_info __ro_after_init = {
- .ranges = {
- [EXECMEM_DEFAULT] = {
- .alignment = MODULE_ALIGN,
- },
- [EXECMEM_KPROBES] = {
- .alignment = 1,
- },
- [EXECMEM_BPF] = {
- .alignment = 1,
- },
- },
-};
-
-struct execmem_info __init *execmem_arch_setup(void)
-{
- struct execmem_range *r = &execmem_info.ranges[EXECMEM_DEFAULT];
-
- module_init_limits();
-
- r->pgprot = PAGE_KERNEL;
-
- /*
- * Where possible, prefer to allocate within direct branch range of the
- * kernel such that no PLTs are necessary.
- */
- if (module_direct_base) {
- r->start = module_direct_base;
- r->end = module_direct_base + SZ_128M;
-
- if (module_plt_base) {
- r->fallback_start = module_plt_base;
- r->fallback_end = module_plt_base + SZ_2G;
- }
- } else if (module_plt_base) {
- r->start = module_plt_base;
- r->end = module_plt_base + SZ_2G;
- }
-
- execmem_info.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_ROX;
- execmem_info.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
- execmem_info.ranges[EXECMEM_KPROBES].end = VMALLOC_END;
-
- execmem_info.ranges[EXECMEM_BPF].pgprot = PAGE_KERNEL;
- execmem_info.ranges[EXECMEM_BPF].start = VMALLOC_START;
- execmem_info.ranges[EXECMEM_BPF].end = VMALLOC_END;
-
- return &execmem_info;
-}
-
enum aarch64_reloc_op {
RELOC_OP_NONE,
RELOC_OP_ABS,
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 03efd86dce0a..4058447507ae 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -32,6 +32,7 @@
#include <linux/hugetlb.h>
#include <linux/acpi_iort.h>
#include <linux/kmemleak.h>
+#include <linux/execmem.h>
#include <asm/boot.h>
#include <asm/fixmap.h>
@@ -432,3 +433,138 @@ void dump_mem_limit(void)
pr_emerg("Memory Limit: none\n");
}
}
+
+#ifdef CONFIG_EXECMEM
+static u64 module_direct_base __ro_after_init = 0;
+static u64 module_plt_base __ro_after_init = 0;
+
+/*
+ * Choose a random page-aligned base address for a window of 'size' bytes which
+ * entirely contains the interval [start, end - 1].
+ */
+static u64 __init random_bounding_box(u64 size, u64 start, u64 end)
+{
+ u64 max_pgoff, pgoff;
+
+ if ((end - start) >= size)
+ return 0;
+
+ max_pgoff = (size - (end - start)) / PAGE_SIZE;
+ pgoff = get_random_u32_inclusive(0, max_pgoff);
+
+ return start - pgoff * PAGE_SIZE;
+}
+
+/*
+ * Modules may directly reference data and text anywhere within the kernel
+ * image and other modules. References using PREL32 relocations have a +/-2G
+ * range, and so we need to ensure that the entire kernel image and all modules
+ * fall within a 2G window such that these are always within range.
+ *
+ * Modules may directly branch to functions and code within the kernel text,
+ * and to functions and code within other modules. These branches will use
+ * CALL26/JUMP26 relocations with a +/-128M range. Without PLTs, we must ensure
+ * that the entire kernel text and all module text falls within a 128M window
+ * such that these are always within range. With PLTs, we can expand this to a
+ * 2G window.
+ *
+ * We chose the 128M region to surround the entire kernel image (rather than
+ * just the text) as using the same bounds for the 128M and 2G regions ensures
+ * by construction that we never select a 128M region that is not a subset of
+ * the 2G region. For very large and unusual kernel configurations this means
+ * we may fall back to PLTs where they could have been avoided, but this keeps
+ * the logic significantly simpler.
+ */
+static int __init module_init_limits(void)
+{
+ u64 kernel_end = (u64)_end;
+ u64 kernel_start = (u64)_text;
+ u64 kernel_size = kernel_end - kernel_start;
+
+ /*
+ * The default modules region is placed immediately below the kernel
+ * image, and is large enough to use the full 2G relocation range.
+ */
+ BUILD_BUG_ON(KIMAGE_VADDR != MODULES_END);
+ BUILD_BUG_ON(MODULES_VSIZE < SZ_2G);
+
+ if (!kaslr_enabled()) {
+ if (kernel_size < SZ_128M)
+ module_direct_base = kernel_end - SZ_128M;
+ if (kernel_size < SZ_2G)
+ module_plt_base = kernel_end - SZ_2G;
+ } else {
+ u64 min = kernel_start;
+ u64 max = kernel_end;
+
+ if (IS_ENABLED(CONFIG_RANDOMIZE_MODULE_REGION_FULL)) {
+ pr_info("2G module region forced by RANDOMIZE_MODULE_REGION_FULL\n");
+ } else {
+ module_direct_base = random_bounding_box(SZ_128M, min, max);
+ if (module_direct_base) {
+ min = module_direct_base;
+ max = module_direct_base + SZ_128M;
+ }
+ }
+
+ module_plt_base = random_bounding_box(SZ_2G, min, max);
+ }
+
+ pr_info("%llu pages in range for non-PLT usage",
+ module_direct_base ? (SZ_128M - kernel_size) / PAGE_SIZE : 0);
+ pr_info("%llu pages in range for PLT usage",
+ module_plt_base ? (SZ_2G - kernel_size) / PAGE_SIZE : 0);
+
+ return 0;
+}
+
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .alignment = MODULE_ALIGN,
+ },
+ [EXECMEM_KPROBES] = {
+ .alignment = 1,
+ },
+ [EXECMEM_BPF] = {
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
+{
+ struct execmem_range *r = &execmem_info.ranges[EXECMEM_DEFAULT];
+
+ module_init_limits();
+
+ r->pgprot = PAGE_KERNEL;
+
+ /*
+ * Where possible, prefer to allocate within direct branch range of the
+ * kernel such that no PLTs are necessary.
+ */
+ if (module_direct_base) {
+ r->start = module_direct_base;
+ r->end = module_direct_base + SZ_128M;
+
+ if (module_plt_base) {
+ r->fallback_start = module_plt_base;
+ r->fallback_end = module_plt_base + SZ_2G;
+ }
+ } else if (module_plt_base) {
+ r->start = module_plt_base;
+ r->end = module_plt_base + SZ_2G;
+ }
+
+ execmem_info.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_ROX;
+ execmem_info.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
+ execmem_info.ranges[EXECMEM_KPROBES].end = VMALLOC_END;
+
+ execmem_info.ranges[EXECMEM_BPF].pgprot = PAGE_KERNEL;
+ execmem_info.ranges[EXECMEM_BPF].start = VMALLOC_START;
+ execmem_info.ranges[EXECMEM_BPF].end = VMALLOC_END;
+
+ return &execmem_info;
+}
+#endif /* CONFIG_EXECMEM */
diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c
index 78c6a68f6c3c..36d6d9eeb7c7 100644
--- a/arch/loongarch/kernel/module.c
+++ b/arch/loongarch/kernel/module.c
@@ -18,7 +18,6 @@
#include <linux/ftrace.h>
#include <linux/string.h>
#include <linux/kernel.h>
-#include <linux/execmem.h>
#include <asm/alternative.h>
#include <asm/inst.h>
#include <asm/unwind.h>
@@ -491,23 +490,6 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
return 0;
}
-static struct execmem_info execmem_info __ro_after_init = {
- .ranges = {
- [EXECMEM_DEFAULT] = {
- .pgprot = PAGE_KERNEL,
- .alignment = 1,
- },
- },
-};
-
-struct execmem_info __init *execmem_arch_setup(void)
-{
- execmem_info.ranges[EXECMEM_DEFAULT].start = MODULES_VADDR;
- execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
-
- return &execmem_info;
-}
-
static void module_init_ftrace_plt(const Elf_Ehdr *hdr,
const Elf_Shdr *sechdrs, struct module *mod)
{
diff --git a/arch/loongarch/mm/init.c b/arch/loongarch/mm/init.c
index 4dd53427f657..5a65497c617e 100644
--- a/arch/loongarch/mm/init.c
+++ b/arch/loongarch/mm/init.c
@@ -24,6 +24,7 @@
#include <linux/gfp.h>
#include <linux/hugetlb.h>
#include <linux/mmzone.h>
+#include <linux/execmem.h>
#include <asm/asm-offsets.h>
#include <asm/bootinfo.h>
@@ -248,3 +249,22 @@ EXPORT_SYMBOL(invalid_pmd_table);
#endif
pte_t invalid_pte_table[PTRS_PER_PTE] __page_aligned_bss;
EXPORT_SYMBOL(invalid_pte_table);
+
+#ifdef CONFIG_EXECMEM
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .pgprot = PAGE_KERNEL,
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
+{
+ execmem_info.ranges[EXECMEM_DEFAULT].start = MODULES_VADDR;
+ execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
+
+ return &execmem_info;
+}
+#endif /* CONFIG_EXECMEM */
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 50505e910763..ba0f62d8eff5 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -13,14 +13,12 @@
#include <linux/elf.h>
#include <linux/mm.h>
#include <linux/numa.h>
-#include <linux/vmalloc.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/string.h>
#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <linux/jump_label.h>
-#include <linux/execmem.h>
#include <asm/jump_label.h>
struct mips_hi16 {
@@ -32,25 +30,6 @@ struct mips_hi16 {
static LIST_HEAD(dbe_list);
static DEFINE_SPINLOCK(dbe_lock);
-#ifdef MODULES_VADDR
-static struct execmem_info execmem_info __ro_after_init = {
- .ranges = {
- [EXECMEM_DEFAULT] = {
- .start = MODULES_VADDR,
- .end = MODULES_END,
- .alignment = 1,
- },
- },
-};
-
-struct execmem_info __init *execmem_arch_setup(void)
-{
- execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
-
- return &execmem_info;
-}
-#endif
-
static void apply_r_mips_32(u32 *location, u32 base, Elf_Addr v)
{
*location = base + v;
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index 39f129205b0c..6e01697c3eba 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -31,6 +31,7 @@
#include <linux/gfp.h>
#include <linux/kcore.h>
#include <linux/initrd.h>
+#include <linux/execmem.h>
#include <asm/bootinfo.h>
#include <asm/cachectl.h>
@@ -576,3 +577,24 @@ EXPORT_SYMBOL_GPL(invalid_pmd_table);
#endif
pte_t invalid_pte_table[PTRS_PER_PTE] __page_aligned_bss;
EXPORT_SYMBOL(invalid_pte_table);
+
+#ifdef CONFIG_EXECMEM
+#ifdef MODULES_VADDR
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .start = MODULES_VADDR,
+ .end = MODULES_END,
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
+{
+ execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
+
+ return &execmem_info;
+}
+#endif
+#endif /* CONFIG_EXECMEM */
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 2b68ef8aad42..f4483243578d 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -13,31 +13,13 @@
#include <linux/moduleloader.h>
#include <linux/elf.h>
#include <linux/mm.h>
-#include <linux/vmalloc.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/string.h>
#include <linux/kernel.h>
-#include <linux/execmem.h>
#include <asm/cacheflush.h>
-static struct execmem_info execmem_info __ro_after_init = {
- .ranges = {
- [EXECMEM_DEFAULT] = {
- .start = MODULES_VADDR,
- .end = MODULES_END,
- .pgprot = PAGE_KERNEL_EXEC,
- .alignment = 1,
- },
- },
-};
-
-struct execmem_info __init *execmem_arch_setup(void)
-{
- return &execmem_info;
-}
-
int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
unsigned int symindex, unsigned int relsec,
struct module *mod)
diff --git a/arch/nios2/mm/init.c b/arch/nios2/mm/init.c
index 7bc82ee889c9..82abb117f851 100644
--- a/arch/nios2/mm/init.c
+++ b/arch/nios2/mm/init.c
@@ -26,6 +26,7 @@
#include <linux/memblock.h>
#include <linux/slab.h>
#include <linux/binfmts.h>
+#include <linux/execmem.h>
#include <asm/setup.h>
#include <asm/page.h>
@@ -143,3 +144,21 @@ static const pgprot_t protection_map[16] = {
[VM_SHARED | VM_EXEC | VM_WRITE | VM_READ] = MKP(1, 1, 1)
};
DECLARE_VM_GET_PAGE_PROT
+
+#ifdef CONFIG_EXECMEM
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .start = MODULES_VADDR,
+ .end = MODULES_END,
+ .pgprot = PAGE_KERNEL_EXEC,
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
+{
+ return &execmem_info;
+}
+#endif /* CONFIG_EXECMEM */
diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
index 721324c42b7d..4e5d991b2b65 100644
--- a/arch/parisc/kernel/module.c
+++ b/arch/parisc/kernel/module.c
@@ -41,7 +41,6 @@
#include <linux/moduleloader.h>
#include <linux/elf.h>
-#include <linux/vmalloc.h>
#include <linux/fs.h>
#include <linux/ftrace.h>
#include <linux/string.h>
@@ -49,7 +48,6 @@
#include <linux/bug.h>
#include <linux/mm.h>
#include <linux/slab.h>
-#include <linux/execmem.h>
#include <asm/unwind.h>
#include <asm/sections.h>
@@ -174,23 +172,6 @@ static inline int reassemble_22(int as22)
((as22 & 0x0003ff) << 3));
}
-static struct execmem_info execmem_info __ro_after_init = {
- .ranges = {
- [EXECMEM_DEFAULT] = {
- .pgprot = PAGE_KERNEL_RWX,
- .alignment = 1,
- },
- },
-};
-
-struct execmem_info __init *execmem_arch_setup(void)
-{
- execmem_info.ranges[EXECMEM_DEFAULT].start = VMALLOC_START;
- execmem_info.ranges[EXECMEM_DEFAULT].end = VMALLOC_END;
-
- return &execmem_info;
-}
-
#ifndef CONFIG_64BIT
static inline unsigned long count_gots(const Elf_Rela *rela, unsigned long n)
{
diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
index f876af56e13f..22b4a71dc0e9 100644
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -24,6 +24,7 @@
#include <linux/nodemask.h> /* for node_online_map */
#include <linux/pagemap.h> /* for release_pages */
#include <linux/compat.h>
+#include <linux/execmem.h>
#include <asm/pgalloc.h>
#include <asm/tlb.h>
@@ -481,7 +482,7 @@ void free_initmem(void)
/* finally dump all the instructions which were cached, since the
* pages are no-longer executable */
flush_icache_range(init_begin, init_end);
-
+
free_initmem_default(POISON_FREE_INITMEM);
/* set up a new led state on systems shipped LED State panel */
@@ -992,3 +993,22 @@ static const pgprot_t protection_map[16] = {
[VM_SHARED | VM_EXEC | VM_WRITE | VM_READ] = PAGE_RWX
};
DECLARE_VM_GET_PAGE_PROT
+
+#ifdef CONFIG_EXECMEM
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .pgprot = PAGE_KERNEL_RWX,
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
+{
+ execmem_info.ranges[EXECMEM_DEFAULT].start = VMALLOC_START;
+ execmem_info.ranges[EXECMEM_DEFAULT].end = VMALLOC_END;
+
+ return &execmem_info;
+}
+#endif /* CONFIG_EXECMEM */
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index a1eaa74f2d41..77ea82e9dc5f 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -7,10 +7,8 @@
#include <linux/elf.h>
#include <linux/moduleloader.h>
#include <linux/err.h>
-#include <linux/vmalloc.h>
#include <linux/mm.h>
#include <linux/bug.h>
-#include <linux/execmem.h>
#include <asm/module.h>
#include <linux/uaccess.h>
#include <asm/firmware.h>
@@ -89,64 +87,3 @@ int module_finalize(const Elf_Ehdr *hdr,
return 0;
}
-
-static struct execmem_info execmem_info __ro_after_init = {
- .ranges = {
- [EXECMEM_DEFAULT] = {
- .alignment = 1,
- },
- [EXECMEM_KPROBES] = {
- .alignment = 1,
- },
- [EXECMEM_MODULE_DATA] = {
- .alignment = 1,
- },
- },
-};
-
-struct execmem_info __init *execmem_arch_setup(void)
-{
- pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
- struct execmem_range *text = &execmem_info.ranges[EXECMEM_DEFAULT];
-
- /*
- * BOOK3S_32 and 8xx define MODULES_VADDR for text allocations and
- * allow allocating data in the entire vmalloc space
- */
-#ifdef MODULES_VADDR
- struct execmem_range *data = &execmem_info.ranges[EXECMEM_MODULE_DATA];
- unsigned long limit = (unsigned long)_etext - SZ_32M;
-
- BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);
-
- /* First try within 32M limit from _etext to avoid branch trampolines */
- if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit) {
- text->start = limit;
- text->end = MODULES_END;
- text->fallback_start = MODULES_VADDR;
- text->fallback_end = MODULES_END;
- } else {
- text->start = MODULES_VADDR;
- text->end = MODULES_END;
- }
- data->start = VMALLOC_START;
- data->end = VMALLOC_END;
- data->pgprot = PAGE_KERNEL;
- data->alignment = 1;
-#else
- text->start = VMALLOC_START;
- text->end = VMALLOC_END;
-#endif
-
- text->pgprot = prot;
-
- execmem_info.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
- execmem_info.ranges[EXECMEM_KPROBES].start = VMALLOC_END;
-
- if (strict_module_rwx_enabled())
- execmem_info.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_ROX;
- else
- execmem_info.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_EXEC;
-
- return &execmem_info;
-}
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 3a440004b97d..82723dc966e4 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -16,6 +16,7 @@
#include <linux/highmem.h>
#include <linux/suspend.h>
#include <linux/dma-direct.h>
+#include <linux/execmem.h>
#include <asm/swiotlb.h>
#include <asm/machdep.h>
@@ -406,3 +407,66 @@ int devmem_is_allowed(unsigned long pfn)
* the EHEA driver. Drop this when drivers/net/ethernet/ibm/ehea is removed.
*/
EXPORT_SYMBOL_GPL(walk_system_ram_range);
+
+#ifdef CONFIG_EXECMEM
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .alignment = 1,
+ },
+ [EXECMEM_KPROBES] = {
+ .alignment = 1,
+ },
+ [EXECMEM_MODULE_DATA] = {
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
+{
+ pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
+ struct execmem_range *text = &execmem_info.ranges[EXECMEM_DEFAULT];
+
+ /*
+ * BOOK3S_32 and 8xx define MODULES_VADDR for text allocations and
+ * allow allocating data in the entire vmalloc space
+ */
+#ifdef MODULES_VADDR
+ struct execmem_range *data = &execmem_info.ranges[EXECMEM_MODULE_DATA];
+ unsigned long limit = (unsigned long)_etext - SZ_32M;
+
+ BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);
+
+ /* First try within 32M limit from _etext to avoid branch trampolines */
+ if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit) {
+ text->start = limit;
+ text->end = MODULES_END;
+ text->fallback_start = MODULES_VADDR;
+ text->fallback_end = MODULES_END;
+ } else {
+ text->start = MODULES_VADDR;
+ text->end = MODULES_END;
+ }
+ data->start = VMALLOC_START;
+ data->end = VMALLOC_END;
+ data->pgprot = PAGE_KERNEL;
+ data->alignment = 1;
+#else
+ text->start = VMALLOC_START;
+ text->end = VMALLOC_END;
+#endif
+
+ text->pgprot = prot;
+
+ execmem_info.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
+ execmem_info.ranges[EXECMEM_KPROBES].end = VMALLOC_END;
+
+ if (strict_module_rwx_enabled())
+ execmem_info.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_ROX;
+ else
+ execmem_info.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_EXEC;
+
+ return &execmem_info;
+}
+#endif /* CONFIG_EXECMEM */
diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index aad158bb2022..906f9a3a5d65 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -11,10 +11,8 @@
#include <linux/kernel.h>
#include <linux/log2.h>
#include <linux/moduleloader.h>
-#include <linux/vmalloc.h>
#include <linux/sizes.h>
#include <linux/pgtable.h>
-#include <linux/execmem.h>
#include <asm/alternative.h>
#include <asm/sections.h>
@@ -906,44 +904,6 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
return 0;
}
-#ifdef CONFIG_MMU
-static struct execmem_info execmem_info __ro_after_init = {
- .ranges = {
- [EXECMEM_DEFAULT] = {
- .pgprot = PAGE_KERNEL,
- .alignment = 1,
- },
- [EXECMEM_KPROBES] = {
- .pgprot = PAGE_KERNEL_READ_EXEC,
- .alignment = 1,
- },
- [EXECMEM_BPF] = {
- .pgprot = PAGE_KERNEL,
- .alignment = PAGE_SIZE,
- },
- },
-};
-
-struct execmem_info __init *execmem_arch_setup(void)
-{
-#ifdef CONFIG_64BIT
- execmem_info.ranges[EXECMEM_DEFAULT].start = MODULES_VADDR;
- execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
-#else
- execmem_info.ranges[EXECMEM_DEFAULT].start = VMALLOC_START;
- execmem_info.ranges[EXECMEM_DEFAULT].end = VMALLOC_END;
-#endif
-
- execmem_info.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
- execmem_info.ranges[EXECMEM_KPROBES].end = VMALLOC_END;
-
- execmem_info.ranges[EXECMEM_BPF].start = BPF_JIT_REGION_START;
- execmem_info.ranges[EXECMEM_BPF].end = BPF_JIT_REGION_END;
-
- return &execmem_info;
-}
-#endif
-
int module_finalize(const Elf_Ehdr *hdr,
const Elf_Shdr *sechdrs,
struct module *me)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index fe8e159394d8..25d35564be47 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -24,6 +24,7 @@
#include <linux/elf.h>
#endif
#include <linux/kfence.h>
+#include <linux/execmem.h>
#include <asm/fixmap.h>
#include <asm/io.h>
@@ -1481,3 +1482,43 @@ void __init pgtable_cache_init(void)
preallocate_pgd_pages_range(MODULES_VADDR, MODULES_END, "bpf/modules");
}
#endif
+
+#ifdef CONFIG_EXECMEM
+#ifdef CONFIG_MMU
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .pgprot = PAGE_KERNEL,
+ .alignment = 1,
+ },
+ [EXECMEM_KPROBES] = {
+ .pgprot = PAGE_KERNEL_READ_EXEC,
+ .alignment = 1,
+ },
+ [EXECMEM_BPF] = {
+ .pgprot = PAGE_KERNEL,
+ .alignment = PAGE_SIZE,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
+{
+#ifdef CONFIG_64BIT
+ execmem_info.ranges[EXECMEM_DEFAULT].start = MODULES_VADDR;
+ execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
+#else
+ execmem_info.ranges[EXECMEM_DEFAULT].start = VMALLOC_START;
+ execmem_info.ranges[EXECMEM_DEFAULT].end = VMALLOC_END;
+#endif
+
+ execmem_info.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
+ execmem_info.ranges[EXECMEM_KPROBES].end = VMALLOC_END;
+
+ execmem_info.ranges[EXECMEM_BPF].start = BPF_JIT_REGION_START;
+ execmem_info.ranges[EXECMEM_BPF].end = BPF_JIT_REGION_END;
+
+ return &execmem_info;
+}
+#endif /* CONFIG_MMU */
+#endif /* CONFIG_EXECMEM */
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index 7d38218bfd27..91e207b50394 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -37,31 +37,6 @@
#define PLT_ENTRY_SIZE 22
-static struct execmem_info execmem_info __ro_after_init = {
- .ranges = {
- [EXECMEM_DEFAULT] = {
- .flags = EXECMEM_KASAN_SHADOW,
- .alignment = MODULE_ALIGN,
- .pgprot = PAGE_KERNEL,
- },
- },
-};
-
-struct execmem_info __init *execmem_arch_setup(void)
-{
- unsigned long module_load_offset = 0;
- unsigned long start;
-
- if (kaslr_enabled())
- module_load_offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
-
- start = MODULES_VADDR + module_load_offset;
- execmem_info.ranges[EXECMEM_DEFAULT].start = start;
- execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
-
- return &execmem_info;
-}
-
#ifdef CONFIG_FUNCTION_TRACER
void module_arch_cleanup(struct module *mod)
{
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index f6391442c0c2..9e25f9bf1445 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -49,6 +49,7 @@
#include <asm/uv.h>
#include <linux/virtio_anchor.h>
#include <linux/virtio_config.h>
+#include <linux/execmem.h>
pgd_t swapper_pg_dir[PTRS_PER_PGD] __section(".bss..swapper_pg_dir");
pgd_t invalid_pg_dir[PTRS_PER_PGD] __section(".bss..invalid_pg_dir");
@@ -302,3 +303,30 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
vmem_remove_mapping(start, size);
}
#endif /* CONFIG_MEMORY_HOTPLUG */
+
+#ifdef CONFIG_EXECMEM
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .flags = EXECMEM_KASAN_SHADOW,
+ .alignment = MODULE_ALIGN,
+ .pgprot = PAGE_KERNEL,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
+{
+ unsigned long module_load_offset = 0;
+ unsigned long start;
+
+ if (kaslr_enabled())
+ module_load_offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
+
+ start = MODULES_VADDR + module_load_offset;
+ execmem_info.ranges[EXECMEM_DEFAULT].start = start;
+ execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
+
+ return &execmem_info;
+}
+#endif /* CONFIG_EXECMEM */
diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c
index b70047f944cc..b8c51cc23d96 100644
--- a/arch/sparc/kernel/module.c
+++ b/arch/sparc/kernel/module.c
@@ -14,7 +14,6 @@
#include <linux/string.h>
#include <linux/ctype.h>
#include <linux/mm.h>
-#include <linux/execmem.h>
#include <asm/processor.h>
#include <asm/spitfire.h>
@@ -22,28 +21,6 @@
#include "entry.h"
-static struct execmem_info execmem_info __ro_after_init = {
- .ranges = {
- [EXECMEM_DEFAULT] = {
-#ifdef CONFIG_SPARC64
- .start = MODULES_VADDR,
- .end = MODULES_END,
-#else
- .start = VMALLOC_START,
- .end = VMALLOC_END,
-#endif
- .alignment = 1,
- },
- },
-};
-
-struct execmem_info __init *execmem_arch_setup(void)
-{
- execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
-
- return &execmem_info;
-}
-
/* Make generic code ignore STT_REGISTER dummy undefined symbols. */
int module_frob_arch_sections(Elf_Ehdr *hdr,
Elf_Shdr *sechdrs,
diff --git a/arch/sparc/mm/Makefile b/arch/sparc/mm/Makefile
index 809d993f6d88..2d1752108d77 100644
--- a/arch/sparc/mm/Makefile
+++ b/arch/sparc/mm/Makefile
@@ -14,3 +14,5 @@ obj-$(CONFIG_SPARC32) += leon_mm.o
# Only used by sparc64
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
+
+obj-$(CONFIG_EXECMEM) += execmem.o
diff --git a/arch/sparc/mm/execmem.c b/arch/sparc/mm/execmem.c
new file mode 100644
index 000000000000..8904545f7814
--- /dev/null
+++ b/arch/sparc/mm/execmem.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/execmem.h>
+
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+#ifdef CONFIG_SPARC64
+ .start = MODULES_VADDR,
+ .end = MODULES_END,
+#else
+ .start = VMALLOC_START,
+ .end = VMALLOC_END,
+#endif
+ .alignment = 1,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
+{
+ execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
+
+ return &execmem_info;
+}
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 8f526f056847..837450b6e882 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -19,7 +19,6 @@
#include <linux/jump_label.h>
#include <linux/random.h>
#include <linux/memory.h>
-#include <linux/execmem.h>
#include <asm/text-patching.h>
#include <asm/page.h>
@@ -37,30 +36,6 @@ do { \
} while (0)
#endif
-static struct execmem_info execmem_info __ro_after_init = {
- .ranges = {
- [EXECMEM_DEFAULT] = {
- .flags = EXECMEM_KASAN_SHADOW,
- .alignment = MODULE_ALIGN,
- },
- },
-};
-
-struct execmem_info __init *execmem_arch_setup(void)
-{
- unsigned long start, offset = 0;
-
- if (kaslr_enabled())
- offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
-
- start = MODULES_VADDR + offset;
- execmem_info.ranges[EXECMEM_DEFAULT].start = start;
- execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
- execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
-
- return &execmem_info;
-}
-
#ifdef CONFIG_X86_32
int apply_relocate(Elf32_Shdr *sechdrs,
const char *strtab,
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 679893ea5e68..8e8cd0de3af6 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -7,6 +7,7 @@
#include <linux/swapops.h>
#include <linux/kmemleak.h>
#include <linux/sched/task.h>
+#include <linux/execmem.h>
#include <asm/set_memory.h>
#include <asm/cpu_device_id.h>
@@ -1099,3 +1100,29 @@ unsigned long arch_max_swapfile_size(void)
return pages;
}
#endif
+
+#ifdef CONFIG_EXECMEM
+static struct execmem_info execmem_info __ro_after_init = {
+ .ranges = {
+ [EXECMEM_DEFAULT] = {
+ .flags = EXECMEM_KASAN_SHADOW,
+ .alignment = MODULE_ALIGN,
+ },
+ },
+};
+
+struct execmem_info __init *execmem_arch_setup(void)
+{
+ unsigned long start, offset = 0;
+
+ if (kaslr_enabled())
+ offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
+
+ start = MODULES_VADDR + offset;
+ execmem_info.ranges[EXECMEM_DEFAULT].start = start;
+ execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
+ execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
+
+ return &execmem_info;
+}
+#endif /* CONFIG_EXECMEM */
--
2.43.0
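The invariant behind random_bounding_box() in the arm64 hunk above can be
checked in userspace. The following is only a sketch under stated
assumptions: PAGE_SIZE is fixed at 4K, rand() stands in for
get_random_u32_inclusive(), and the helper names bounding_box(),
window_contains() and check_invariant() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096ULL          /* assumed page size for this sketch */
#define SZ_128M   (128ULL << 20)

/*
 * Userspace model of random_bounding_box(): return the base of a
 * 'size'-byte window that contains [start, end), or 0 if the interval
 * does not fit. rand() replaces the kernel's random helper.
 */
static uint64_t bounding_box(uint64_t size, uint64_t start, uint64_t end)
{
	uint64_t max_pgoff, pgoff;

	assert(start <= end);

	if ((end - start) >= size)
		return 0;

	/* Number of whole pages the window may slide below 'start'. */
	max_pgoff = (size - (end - start)) / PAGE_SIZE;
	pgoff = (uint64_t)rand() % (max_pgoff + 1);

	return start - pgoff * PAGE_SIZE;
}

/* Does [base, base + size) cover [start, end)? */
static int window_contains(uint64_t base, uint64_t size,
			   uint64_t start, uint64_t end)
{
	return base <= start && end <= base + size;
}

/* Repeat the draw and confirm the invariant holds every time. */
static int check_invariant(uint64_t size, uint64_t start, uint64_t end,
			   int rounds)
{
	for (int i = 0; i < rounds; i++) {
		uint64_t base = bounding_box(size, start, end);

		if (!base || !window_contains(base, size, start, end))
			return 0;
	}
	return 1;
}
```

Because the offset never exceeds size - (end - start), the window always
reaches at least up to 'end', whatever offset is drawn. An image of 128M
or more yields 0, which corresponds to the case where arm64 falls back
to the 2G PLT window.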
On Thu, Apr 11, 2024 at 07:00:36PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <[email protected]>
>
> Hi,
>
> Since v3 I looked into making execmem more of a utility toolbox, as we
> discussed at LPC with Mark Rutland, but it was getting hairier than
> having a struct describing architecture constraints and a type identifying
> the consumer of execmem.
>
> And I do think that having the description of architecture constraints for
> allocations of executable memory in a single place is better than having it
> spread all over the place.
>
> The patches are available via git:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v4
>
> v4 changes:
> * rebase on v6.9-rc2
> * rename execmem_params to execmem_info and execmem_arch_params() to
> execmem_arch_setup()
> * use single execmem_alloc() API instead of execmem_{text,data}_alloc() (Song)
> * avoid extra copy of execmem parameters (Rick)
> * run execmem_init() as core_initcall() except for the architectures that
> may allocate text really early (currently only x86) (Will)
> * add acks for some of arm64 and riscv changes, thanks Will and Alexandre
> * new commits:
> - drop call to kasan_alloc_module_shadow() on arm64 because it's not
> needed anymore
> - rename MODULE_START to MODULES_VADDR on MIPS
> - use CONFIG_EXECMEM instead of CONFIG_MODULES on powerpc as per Christophe:
> https://lore.kernel.org/all/[email protected]/
>
> v3: https://lore.kernel.org/all/[email protected]
> * add type parameter to execmem allocation APIs
> * remove BPF dependency on modules
>
> v2: https://lore.kernel.org/all/[email protected]
> * Separate "module" and "others" allocations with execmem_text_alloc()
> and jit_text_alloc()
> * Drop ROX entailment on x86
> * Add ack for nios2 changes, thanks Dinh Nguyen
>
> v1: https://lore.kernel.org/all/[email protected]
>
> = Cover letter from v1 (slightly updated) =
>
> module_alloc() is used everywhere as a means to allocate memory for code.
>
> Besides being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF, to modules and
> puts the burden of code allocation on the modules code.
>
> Several architectures override module_alloc() because of various
> constraints on where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
>
> A centralized infrastructure for code allocation allows allocations of
> executable memory as ROX, and future optimizations such as caching large
> pages for better iTLB performance and providing sub-page allocations for
> users that only need small jit code snippets.
>
> Rick Edgecombe proposed a perm_alloc extension to vmalloc [1] and Song Liu
> proposed execmem_alloc [2], but both these approaches were targeting BPF
> allocations and lacked the groundwork to abstract executable allocations
> and split them from the modules core.
>
> Thomas Gleixner suggested to express module allocation restrictions and
> requirements as struct mod_alloc_type_params [3] that would define ranges,
> protections and other parameters for different types of allocations used by
> modules and following that suggestion Song separated allocations of
> different types in modules (commit ac3b43283923 ("module: replace
> module_layout with module_memory")) and posted "Type aware module
> allocator" set [4].
>
> I liked the idea of parametrising code allocation requirements as a
> structure, but I believe the original proposal and Song's module allocator
> was too module centric, so I came up with these patches.
>
> This set splits code allocation from modules by introducing execmem_alloc()
> and execmem_free() APIs, replaces call sites of module_alloc() and
> module_memfree() with the new APIs and implements core text and related
> allocations in a central place.
>
> Instead of architecture specific overrides for module_alloc(), the
> architectures that require non-default behaviour for text allocation must
> fill execmem_info structure and implement execmem_arch_setup() that returns
> a pointer to that structure. If an architecture does not implement
> execmem_arch_setup(), the defaults compatible with the current
> modules::module_alloc() are used.
>
> Since architectures define different restrictions on placement,
> permissions, alignment and other parameters for memory that can be used by
> different subsystems that allocate executable memory, execmem APIs
> take a type argument, that will be used to identify the calling subsystem
> and to allow architectures to define parameters for ranges suitable for that
> subsystem.
>
> The new infrastructure allows decoupling of BPF, kprobes and ftrace from
> modules, and most importantly it paves the way for ROX allocations for
> executable memory.
It looks like you're just doing API cleanup first, then improving the
implementation later?
Patch set looks nice and clean; previous versions did seem to leak too
many arch/module details (or perhaps we were just bikeshedding too much
;) - but the API first approach is nice.
Looking forward to seeing this merged.
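To make the role of the type argument from the cover letter concrete,
here is a small userspace model of a per-type range table. It is a
sketch, not the kernel implementation: struct range_model, struct
info_model, range_for() and the addresses in example_info are all
hypothetical, and it assumes that a type whose range an architecture
leaves zeroed simply reuses the EXECMEM_DEFAULT range.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel definitions; not the real layout. */
enum execmem_type {
	EXECMEM_DEFAULT,
	EXECMEM_KPROBES,
	EXECMEM_BPF,
	EXECMEM_TYPE_MAX,
};

struct range_model {
	uint64_t start;
	uint64_t end;
	unsigned int alignment;
};

struct info_model {
	struct range_model ranges[EXECMEM_TYPE_MAX];
};

/*
 * Pick the range for a caller type. Assumption of this sketch: a type
 * whose range was left zeroed reuses the EXECMEM_DEFAULT range.
 */
static const struct range_model *
range_for(const struct info_model *info, enum execmem_type type)
{
	const struct range_model *r;

	assert(type < EXECMEM_TYPE_MAX);

	r = &info->ranges[type];
	if (r->start == 0 && r->end == 0)
		return &info->ranges[EXECMEM_DEFAULT];

	return r;
}

/* Hypothetical layout: a default window plus a dedicated BPF window. */
static const struct info_model example_info = {
	.ranges = {
		[EXECMEM_DEFAULT] = {
			.start = 0x10000000,
			.end = 0x18000000,
			.alignment = 1,
		},
		[EXECMEM_BPF] = {
			.start = 0x20000000,
			.end = 0x28000000,
			.alignment = 4096,
		},
	},
};
```

With this table a BPF request resolves to its own window, while a
kprobes request, for which nothing was described, lands in the default
one. The caller only names who it is; where the memory may live stays
in the architecture's table.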
Hi Mike.
On Thu, Apr 11, 2024 at 07:00:42PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <[email protected]>
>
> Several architectures override module_alloc() only to define address
> range for code allocations different than VMALLOC address space.
>
> Provide a generic implementation in execmem that uses the parameters for
> address space ranges, required alignment and page protections provided
> by architectures.
>
> The architectures must fill execmem_info structure and implement
> execmem_arch_setup() that returns a pointer to that structure. This way the
> execmem initialization won't be called from every architecture, but rather
> from a central place, namely a core_initcall() in execmem.
>
> The execmem provides execmem_alloc() API that wraps __vmalloc_node_range()
> with the parameters defined by the architectures. If an architecture does
> not implement execmem_arch_setup(), execmem_alloc() will fall back to
> module_alloc().
>
> Signed-off-by: Mike Rapoport (IBM) <[email protected]>
> ---
This code snippet could be more readable ...
> diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c
> index 66c45a2764bc..b70047f944cc 100644
> --- a/arch/sparc/kernel/module.c
> +++ b/arch/sparc/kernel/module.c
> @@ -14,6 +14,7 @@
> #include <linux/string.h>
> #include <linux/ctype.h>
> #include <linux/mm.h>
> +#include <linux/execmem.h>
>
> #include <asm/processor.h>
> #include <asm/spitfire.h>
> @@ -21,34 +22,26 @@
>
> #include "entry.h"
>
> +static struct execmem_info execmem_info __ro_after_init = {
> + .ranges = {
> + [EXECMEM_DEFAULT] = {
> #ifdef CONFIG_SPARC64
> -
> -#include <linux/jump_label.h>
> -
> -static void *module_map(unsigned long size)
> -{
> - if (PAGE_ALIGN(size) > MODULES_LEN)
> - return NULL;
> - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> - GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> -}
> + .start = MODULES_VADDR,
> + .end = MODULES_END,
> #else
> -static void *module_map(unsigned long size)
> + .start = VMALLOC_START,
> + .end = VMALLOC_END,
> +#endif
> + .alignment = 1,
> + },
> + },
> +};
> +
> +struct execmem_info __init *execmem_arch_setup(void)
> {
> - return vmalloc(size);
> -}
> -#endif /* CONFIG_SPARC64 */
> -
> -void *module_alloc(unsigned long size)
> -{
> - void *ret;
> -
> - ret = module_map(size);
> - if (ret)
> - memset(ret, 0, size);
> + execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
>
> - return ret;
> + return &execmem_info;
> }
>
> /* Make generic code ignore STT_REGISTER dummy undefined symbols. */
... if the following was added:
diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
index 9e85d57ac3f2..62bcafe38b1f 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -432,6 +432,8 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,
#define VMALLOC_START _AC(0xfe600000,UL)
#define VMALLOC_END _AC(0xffc00000,UL)
+#define MODULES_VADDR VMALLOC_START
+#define MODULES_END VMALLOC_END
Then the #ifdef CONFIG_SPARC64 could be dropped and the code would be
the same for 32 and 64 bits.
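As a rough illustration of the suggestion, here is a userspace sketch of the
initializer once sparc32 also defines MODULES_VADDR and MODULES_END. The
struct layout is a simplified assumption (the real struct execmem_range has
more fields), execmem_default_size() is a hypothetical helper, and the
addresses are the sparc32 VMALLOC values from the hunk above.

```c
#include <assert.h>

/* sparc32 values from pgtable_32.h, as shown in the hunk above. */
#define VMALLOC_START	0xfe600000UL
#define VMALLOC_END	0xffc00000UL
/* The two proposed aliases. */
#define MODULES_VADDR	VMALLOC_START
#define MODULES_END	VMALLOC_END

enum execmem_type { EXECMEM_DEFAULT, EXECMEM_TYPE_MAX };

/* Simplified stand-in for struct execmem_range (assumption). */
struct execmem_range {
	unsigned long start;
	unsigned long end;
	unsigned int alignment;
};

struct execmem_info {
	struct execmem_range ranges[EXECMEM_TYPE_MAX];
};

/* No #ifdef CONFIG_SPARC64 needed: both variants use the same names. */
static struct execmem_info execmem_info = {
	.ranges = {
		[EXECMEM_DEFAULT] = {
			.start = MODULES_VADDR,
			.end = MODULES_END,
			.alignment = 1,
		},
	},
};

/* Hypothetical helper: size of the default range in bytes. */
static unsigned long execmem_default_size(void)
{
	const struct execmem_range *r = &execmem_info.ranges[EXECMEM_DEFAULT];

	assert(r->end > r->start);
	return r->end - r->start;
}
```

With the aliases in place the 32-bit build simply sees the vmalloc window
under the MODULES_* names, so the initializer no longer cares which variant
it is compiled for.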
Just a drive-by comment.
Sam
On Thu, Apr 11, 2024 at 07:00:36PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <[email protected]>
>
> Hi,
>
> Since v3 I looked into making execmem more of an utility toolbox, as we
> discussed at LPC with Mark Rutland, but it was getting more hairier than
> having a struct describing architecture constraints and a type identifying
> the consumer of execmem.
>
> And I do think that having the description of architecture constraints for
> allocations of executable memory in a single place is better that having it
> spread all over the place.
>
> The patches available via git:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v4
I've taken the first 5 patches through modules-next for now to get early
exposure to testing. Of those I just had minor nit feedback on the 5th,
but the rest look good.
Let's wait for review for the rest of the patches 6-15.
Luis
On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <[email protected]>
>
> module_alloc() is used everywhere as a mean to allocate memory for code.
>
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules and
> puts the burden of code allocation to the modules code.
>
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
>
> Start splitting code allocation from modules by introducing execmem_alloc()
> and execmem_free() APIs.
>
> Initially, execmem_alloc() is a wrapper for module_alloc() and
> execmem_free() is a replacement of module_memfree() to allow updating all
> call sites to use the new APIs.
>
> Since architectures define different restrictions on placement,
> permissions, alignment and other parameters for memory that can be used by
> different subsystems that allocate executable memory, execmem_alloc() takes
> a type argument, that will be used to identify the calling subsystem and to
> allow architectures define parameters for ranges suitable for that
> subsystem.
It would be good to state that this is a non-functional change.
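A minimal userspace sketch of why this step is non-functional, assuming the
initial wrapper simply ignores the type and forwards to the old allocator.
module_alloc_stub() is a hypothetical stand-in built on calloc(); only
execmem_alloc(), execmem_free() and the enum names come from the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Enum values as quoted from the patch. */
enum execmem_type {
	EXECMEM_DEFAULT,
	EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
	EXECMEM_KPROBES,
	EXECMEM_FTRACE,
	EXECMEM_BPF,
	EXECMEM_TYPE_MAX,
};

/* Hypothetical stand-in for module_alloc(): zeroed heap memory. */
static void *module_alloc_stub(size_t size)
{
	return calloc(1, size);
}

/* The type is sanity-checked but otherwise unused at this stage. */
static void *execmem_alloc(enum execmem_type type, size_t size)
{
	assert(type < EXECMEM_TYPE_MAX);
	return module_alloc_stub(size);
}

/* Stand-in for the module_memfree() replacement. */
static void execmem_free(void *ptr)
{
	free(ptr);
}
```

Every caller gets the same behaviour regardless of the type it passes, which
is what makes the conversion of call sites safe to review in isolation.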
> Signed-off-by: Mike Rapoport (IBM) <[email protected]>
> ---
> diff --git a/mm/execmem.c b/mm/execmem.c
> new file mode 100644
> index 000000000000..ed2ea41a2543
> --- /dev/null
> +++ b/mm/execmem.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0
And this just needs to copy over the copyright notices from the main.c file.
Luis
* Mike Rapoport <[email protected]> wrote:
> +/**
> + * enum execmem_type - types of executable memory ranges
> + *
> + * There are several subsystems that allocate executable memory.
> + * Architectures define different restrictions on placement,
> + * permissions, alignment and other parameters for memory that can be used
> + * by these subsystems.
> + * Types in this enum identify subsystems that allocate executable memory
> + * and let architectures define parameters for ranges suitable for
> + * allocations by each subsystem.
> + *
> + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> + * are not explcitly defined.
> + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> + * @EXECMEM_KPROBES: parameters for kprobes
> + * @EXECMEM_FTRACE: parameters for ftrace
> + * @EXECMEM_BPF: parameters for BPF
> + * @EXECMEM_TYPE_MAX:
> + */
> +enum execmem_type {
> + EXECMEM_DEFAULT,
> + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> + EXECMEM_KPROBES,
> + EXECMEM_FTRACE,
> + EXECMEM_BPF,
> + EXECMEM_TYPE_MAX,
> +};
s/explcitly
/explicitly
Thanks,
Ingo
On Thu, Apr 11, 2024 at 12:42:05PM -0700, Luis Chamberlain wrote:
> On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <[email protected]>
> >
> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules and
> > puts the burden of code allocation to the modules code.
> >
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> >
> > Start splitting code allocation from modules by introducing execmem_alloc()
> > and execmem_free() APIs.
> >
> > Initially, execmem_alloc() is a wrapper for module_alloc() and
> > execmem_free() is a replacement of module_memfree() to allow updating all
> > call sites to use the new APIs.
> >
> > Since architectures define different restrictions on placement,
> > permissions, alignment and other parameters for memory that can be used by
> > different subsystems that allocate executable memory, execmem_alloc() takes
> > a type argument, that will be used to identify the calling subsystem and to
> > allow architectures define parameters for ranges suitable for that
> > subsystem.
>
> > It would be good to state that this is a non-functional change.
Ok.
> > Signed-off-by: Mike Rapoport (IBM) <[email protected]>
> > ---
>
> > diff --git a/mm/execmem.c b/mm/execmem.c
> > new file mode 100644
> > index 000000000000..ed2ea41a2543
> > --- /dev/null
> > +++ b/mm/execmem.c
> > @@ -0,0 +1,26 @@
> > +// SPDX-License-Identifier: GPL-2.0
>
> And this just needs to copy over the copyright notices from the main.c file.
Will do.
> Luis
--
Sincerely yours,
Mike.
On Fri, Apr 12, 2024 at 11:16:10AM +0200, Ingo Molnar wrote:
>
> * Mike Rapoport <[email protected]> wrote:
>
> > +/**
> > + * enum execmem_type - types of executable memory ranges
> > + *
> > + * There are several subsystems that allocate executable memory.
> > + * Architectures define different restrictions on placement,
> > + * permissions, alignment and other parameters for memory that can be used
> > + * by these subsystems.
> > + * Types in this enum identify subsystems that allocate executable memory
> > + * and let architectures define parameters for ranges suitable for
> > + * allocations by each subsystem.
> > + *
> > + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> > + * are not explcitly defined.
> > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > + * @EXECMEM_KPROBES: parameters for kprobes
> > + * @EXECMEM_FTRACE: parameters for ftrace
> > + * @EXECMEM_BPF: parameters for BPF
> > + * @EXECMEM_TYPE_MAX:
> > + */
> > +enum execmem_type {
> > + EXECMEM_DEFAULT,
> > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > + EXECMEM_KPROBES,
> > + EXECMEM_FTRACE,
> > + EXECMEM_BPF,
> > + EXECMEM_TYPE_MAX,
> > +};
>
> s/explcitly
> /explicitly
Sure, thanks
> Thanks,
>
> Ingo
--
Sincerely yours,
Mike.
On Thu, Apr 11, 2024 at 10:53:46PM +0200, Sam Ravnborg wrote:
> Hi Mike.
>
> On Thu, Apr 11, 2024 at 07:00:42PM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <[email protected]>
> >
> > Several architectures override module_alloc() only to define address
> > range for code allocations different than VMALLOC address space.
> >
> > Provide a generic implementation in execmem that uses the parameters for
> > address space ranges, required alignment and page protections provided
> > by architectures.
> >
> > The architectures must fill execmem_info structure and implement
> > execmem_arch_setup() that returns a pointer to that structure. This way the
> > execmem initialization won't be called from every architecture, but rather
> > from a central place, namely a core_initcall() in execmem.
> >
> > The execmem provides execmem_alloc() API that wraps __vmalloc_node_range()
> > with the parameters defined by the architectures. If an architecture does
> > not implement execmem_arch_setup(), execmem_alloc() will fall back to
> > module_alloc().
> >
> > Signed-off-by: Mike Rapoport (IBM) <[email protected]>
> > ---
>
> This code snippet could be more readable ...
> > diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c
> > index 66c45a2764bc..b70047f944cc 100644
> > --- a/arch/sparc/kernel/module.c
> > +++ b/arch/sparc/kernel/module.c
> > @@ -14,6 +14,7 @@
> > #include <linux/string.h>
> > #include <linux/ctype.h>
> > #include <linux/mm.h>
> > +#include <linux/execmem.h>
> >
> > #include <asm/processor.h>
> > #include <asm/spitfire.h>
> > @@ -21,34 +22,26 @@
> >
> > #include "entry.h"
> >
> > +static struct execmem_info execmem_info __ro_after_init = {
> > + .ranges = {
> > + [EXECMEM_DEFAULT] = {
> > #ifdef CONFIG_SPARC64
> > -
> > -#include <linux/jump_label.h>
> > -
> > -static void *module_map(unsigned long size)
> > -{
> > - if (PAGE_ALIGN(size) > MODULES_LEN)
> > - return NULL;
> > - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> > - GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
> > - __builtin_return_address(0));
> > -}
> > + .start = MODULES_VADDR,
> > + .end = MODULES_END,
> > #else
> > -static void *module_map(unsigned long size)
> > + .start = VMALLOC_START,
> > + .end = VMALLOC_END,
> > +#endif
> > + .alignment = 1,
> > + },
> > + },
> > +};
> > +
> > +struct execmem_info __init *execmem_arch_setup(void)
> > {
> > - return vmalloc(size);
> > -}
> > -#endif /* CONFIG_SPARC64 */
> > -
> > -void *module_alloc(unsigned long size)
> > -{
> > - void *ret;
> > -
> > - ret = module_map(size);
> > - if (ret)
> > - memset(ret, 0, size);
> > + execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
> >
> > - return ret;
> > + return &execmem_info;
> > }
> >
> > /* Make generic code ignore STT_REGISTER dummy undefined symbols. */
>
> ... if the following was added:
>
> diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
> index 9e85d57ac3f2..62bcafe38b1f 100644
> --- a/arch/sparc/include/asm/pgtable_32.h
> +++ b/arch/sparc/include/asm/pgtable_32.h
> @@ -432,6 +432,8 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,
>
> #define VMALLOC_START _AC(0xfe600000,UL)
> #define VMALLOC_END _AC(0xffc00000,UL)
> +#define MODULES_VADDR VMALLOC_START
> +#define MODULES_END VMALLOC_END
>
>
> Then the #ifdef CONFIG_SPARC64 could be dropped and the code would be
> the same for 32 and 64 bits.
Yeah, the #ifdef there can be dropped even regardless of execmem.
I'll add a patch for that.
> Just a drive-by comment.
>
> Sam
>
--
Sincerely yours,
Mike.
On Thu, Apr 11, 2024 at 07:00:43PM +0300, Mike Rapoport wrote:
> +static struct execmem_info execmem_info __ro_after_init = {
> + .ranges = {
> + [EXECMEM_DEFAULT] = {
> + .flags = EXECMEM_KASAN_SHADOW,
> + .alignment = MODULE_ALIGN,
> + },
> + },
> +};
>
> +struct execmem_info __init *execmem_arch_setup(void)
> {
> + unsigned long start, offset = 0;
>
> + if (kaslr_enabled())
> + offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
>
> + start = MODULES_VADDR + offset;
> + execmem_info.ranges[EXECMEM_DEFAULT].start = start;
> + execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
> + execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
>
> + return &execmem_info;
> }
struct execmem_info __init *execmem_arch_setup(void)
{
	unsigned long offset = 0;

	if (kaslr_enabled())
		offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE;

	execmem_info = (struct execmem_info){
		.ranges = {
			[EXECMEM_DEFAULT] = {
				.start = MODULES_VADDR + offset,
				.end = MODULES_END,
				.pgprot = PAGE_KERNEL,
				.flags = EXECMEM_KASAN_SHADOW,
				.alignment = 1,
			},
		},
	};

	return &execmem_info;
}
On Thu, Apr 11, 2024 at 07:00:42PM +0300, Mike Rapoport wrote:
> +static struct execmem_info execmem_info __ro_after_init = {
> + .ranges = {
> + [EXECMEM_DEFAULT] = {
> + .start = MODULES_VADDR,
> + .end = MODULES_END,
> + .alignment = 1,
> + },
> + },
> +};
> +
> +struct execmem_info __init *execmem_arch_setup(void)
> {
> + execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
> +
> + return &execmem_info;
> }
> +static struct execmem_info execmem_info __ro_after_init = {
> + .ranges = {
> + [EXECMEM_DEFAULT] = {
> + .start = MODULES_VADDR,
> + .end = MODULES_END,
> + .pgprot = PAGE_KERNEL_EXEC,
> + .alignment = 1,
> + },
> + },
> +};
> +
> +struct execmem_info __init *execmem_arch_setup(void)
> {
> + return &execmem_info;
> }
> +static struct execmem_info execmem_info __ro_after_init = {
> + .ranges = {
> + [EXECMEM_DEFAULT] = {
> + .pgprot = PAGE_KERNEL_RWX,
> + .alignment = 1,
> + },
> + },
> +};
> +
> +struct execmem_info __init *execmem_arch_setup(void)
> {
> + execmem_info.ranges[EXECMEM_DEFAULT].start = VMALLOC_START;
> + execmem_info.ranges[EXECMEM_DEFAULT].end = VMALLOC_END;
> +
> + return &execmem_info;
> }
> +static struct execmem_info execmem_info __ro_after_init = {
> + .ranges = {
> + [EXECMEM_DEFAULT] = {
> + .pgprot = PAGE_KERNEL,
> + .alignment = 1,
> + },
> + },
> +};
> +
> +struct execmem_info __init *execmem_arch_setup(void)
> {
> + execmem_info.ranges[EXECMEM_DEFAULT].start = MODULES_VADDR;
> + execmem_info.ranges[EXECMEM_DEFAULT].end = MODULES_END;
> +
> + return &execmem_info;
> }
> +static struct execmem_info execmem_info __ro_after_init = {
> + .ranges = {
> + [EXECMEM_DEFAULT] = {
> #ifdef CONFIG_SPARC64
> + .start = MODULES_VADDR,
> + .end = MODULES_END,
> #else
> + .start = VMALLOC_START,
> + .end = VMALLOC_END,
> +#endif
> + .alignment = 1,
> + },
> + },
> +};
> +
> +struct execmem_info __init *execmem_arch_setup(void)
> {
> + execmem_info.ranges[EXECMEM_DEFAULT].pgprot = PAGE_KERNEL;
>
> + return &execmem_info;
> }
I'm amazed by the weird and inconsistent breakup of initializations.
What exactly is wrong with something like:
static struct execmem_info execmem_info __ro_after_init;

struct execmem_info __init *execmem_arch_setup(void)
{
	execmem_info = (struct execmem_info){
		.ranges = {
			[EXECMEM_DEFAULT] = {
				.start = MODULES_VADDR,
				.end = MODULES_END,
				.pgprot = PAGE_KERNEL,
				.alignment = 1,
			},
		},
	};

	return &execmem_info;
}
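One property of this form worth keeping in mind: assigning a compound literal
replaces the whole object, so any member not named in the initializer ends up
zero, and runtime values can be used directly. A userspace sketch, with a
simplified struct, a hypothetical execmem_setup_sketch() and made-up
addresses (all assumptions, not the kernel definitions):

```c
#include <assert.h>

enum execmem_type { EXECMEM_DEFAULT, EXECMEM_KPROBES, EXECMEM_TYPE_MAX };

/* Simplified stand-in for the kernel structures (assumption). */
struct execmem_range {
	unsigned long start;
	unsigned long end;
	unsigned int flags;
	unsigned int alignment;
};

struct execmem_info {
	struct execmem_range ranges[EXECMEM_TYPE_MAX];
};

static struct execmem_info execmem_info;

/* 'base' and 'offset' model values only known at boot, e.g. a KASLR offset. */
static struct execmem_info *execmem_setup_sketch(unsigned long base,
						 unsigned long offset)
{
	assert(offset < 0x1000000UL);

	/* Whole-object assignment: members not named here become zero. */
	execmem_info = (struct execmem_info){
		.ranges = {
			[EXECMEM_DEFAULT] = {
				.start = base + offset,
				.end = base + 0x1000000UL,
				.alignment = 1,
			},
		},
	};

	return &execmem_info;
}
```

Anything set on the object before the assignment is overwritten, so the
function body becomes the single place where the final values are visible.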
On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> +/**
> + * enum execmem_type - types of executable memory ranges
> + *
> + * There are several subsystems that allocate executable memory.
> + * Architectures define different restrictions on placement,
> + * permissions, alignment and other parameters for memory that can be used
> + * by these subsystems.
> + * Types in this enum identify subsystems that allocate executable memory
> + * and let architectures define parameters for ranges suitable for
> + * allocations by each subsystem.
> + *
> + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> + * are not explcitly defined.
> + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> + * @EXECMEM_KPROBES: parameters for kprobes
> + * @EXECMEM_FTRACE: parameters for ftrace
> + * @EXECMEM_BPF: parameters for BPF
> + * @EXECMEM_TYPE_MAX:
> + */
> +enum execmem_type {
> + EXECMEM_DEFAULT,
> + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> + EXECMEM_KPROBES,
> + EXECMEM_FTRACE,
> + EXECMEM_BPF,
> + EXECMEM_TYPE_MAX,
> +};
Can we please get a break-down of how all these types are actually
different from one another?
I'm thinking some platforms have a tiny immediate space (arm64 comes to
mind) and have less strict placement constraints for some of them?
On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > +/**
> > + * enum execmem_type - types of executable memory ranges
> > + *
> > + * There are several subsystems that allocate executable memory.
> > + * Architectures define different restrictions on placement,
> > + * permissions, alignment and other parameters for memory that can be used
> > + * by these subsystems.
> > + * Types in this enum identify subsystems that allocate executable memory
> > + * and let architectures define parameters for ranges suitable for
> > + * allocations by each subsystem.
> > + *
> > + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> > + * are not explcitly defined.
> > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > + * @EXECMEM_KPROBES: parameters for kprobes
> > + * @EXECMEM_FTRACE: parameters for ftrace
> > + * @EXECMEM_BPF: parameters for BPF
> > + * @EXECMEM_TYPE_MAX:
> > + */
> > +enum execmem_type {
> > + EXECMEM_DEFAULT,
> > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > + EXECMEM_KPROBES,
> > + EXECMEM_FTRACE,
> > + EXECMEM_BPF,
> > + EXECMEM_TYPE_MAX,
> > +};
>
> Can we please get a break-down of how all these types are actually
> different from one another?
>
> I'm thinking some platforms have a tiny immediate space (arm64 comes to
> mind) and have less strict placement constraints for some of them?
loongarch, mips, nios2 and sparc define modules address space different
from vmalloc and use that for modules, kprobes and bpf (where supported).
parisc uses the vmalloc range for everything, but it sets permissions to
PAGE_KERNEL_RWX because its PAGE_KERNEL_EXEC is read-only and it lacks
set_memory_* APIs.
arm has an address space for modules, but it falls back to the entire
vmalloc with CONFIG_ARM_MODULE_PLTS=y.
arm64 uses different ranges for modules and bpf/kprobes. For kprobes it
does vmalloc(PAGE_KERNEL_ROX) and for bpf just plain vmalloc().
For modules arm64 first tries to allocate from 128M below kernel_end and
if that fails it uses 2G below kernel_end as a fallback.
powerpc uses vmalloc space for everything for some configurations. For
book3s-32 and 8xx it defines two ranges that are used for module text,
kprobes and bpf and the module data can be allocated anywhere in vmalloc.
riscv has an address space for modules, a different address space for bpf
and uses vmalloc space for kprobes.
s390 and x86 have modules address space and use that space for all
executable allocations.
The EXECMEM_FTRACE type is only used on s390 and x86 and for now it's there
more for completeness than to denote special constraints or properties.
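To make the breakdown concrete, here is a userspace sketch of a per-type
table in which types without their own range fall back to EXECMEM_DEFAULT.
The layout loosely follows the riscv description above (modules and BPF get
their own ranges, kprobes use vmalloc); every address is invented for
illustration, and execmem_range_for() is a hypothetical helper, not a
function from the series.

```c
#include <assert.h>

enum execmem_type {
	EXECMEM_DEFAULT,
	EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
	EXECMEM_KPROBES,
	EXECMEM_FTRACE,
	EXECMEM_BPF,
	EXECMEM_TYPE_MAX,
};

/* Simplified range: only the address window (assumption). */
struct execmem_range {
	unsigned long long start;
	unsigned long long end;
};

/* Invented addresses; only the shape of the table matters. */
static const struct execmem_range ranges[EXECMEM_TYPE_MAX] = {
	[EXECMEM_DEFAULT] = { 0x80000000ULL, 0x88000000ULL }, /* modules */
	[EXECMEM_KPROBES] = { 0xc0000000ULL, 0xf0000000ULL }, /* vmalloc */
	[EXECMEM_BPF]     = { 0x90000000ULL, 0x98000000ULL }, /* BPF JIT */
	/* EXECMEM_FTRACE intentionally left empty */
};

/* Return the range for 'type', or the default if none was defined. */
static const struct execmem_range *execmem_range_for(enum execmem_type type)
{
	assert(type < EXECMEM_TYPE_MAX);

	if (ranges[type].start == 0 && ranges[type].end == 0)
		return &ranges[EXECMEM_DEFAULT];
	return &ranges[type];
}
```

An architecture that places everything in one window would fill in only the
EXECMEM_DEFAULT entry and let every other type resolve to it.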
--
Sincerely yours,
Mike.
On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > +/**
> > + * enum execmem_type - types of executable memory ranges
> > + *
> > + * There are several subsystems that allocate executable memory.
> > + * Architectures define different restrictions on placement,
> > + * permissions, alignment and other parameters for memory that can be used
> > + * by these subsystems.
> > + * Types in this enum identify subsystems that allocate executable memory
> > + * and let architectures define parameters for ranges suitable for
> > + * allocations by each subsystem.
> > + *
> > + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> > + * are not explcitly defined.
> > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > + * @EXECMEM_KPROBES: parameters for kprobes
> > + * @EXECMEM_FTRACE: parameters for ftrace
> > + * @EXECMEM_BPF: parameters for BPF
> > + * @EXECMEM_TYPE_MAX:
> > + */
> > +enum execmem_type {
> > + EXECMEM_DEFAULT,
> > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > + EXECMEM_KPROBES,
> > + EXECMEM_FTRACE,
> > + EXECMEM_BPF,
> > + EXECMEM_TYPE_MAX,
> > +};
>
> Can we please get a break-down of how all these types are actually
> different from one another?
>
> I'm thinking some platforms have a tiny immediate space (arm64 comes to
> mind) and have less strict placement constraints for some of them?
Yeah, and really I'd *much* rather deal with that in arch code, as I have said
several times.
For arm64 we have two basic restrictions:
1) Direct branches can go +/-128M
We can expand this range by having direct branches go to PLTs, at a
performance cost.
2) PREL32 relocations can go +/-2G
We cannot expand this further.
* We don't need to allocate memory for ftrace. We do not use trampolines.
* Kprobes XOL areas don't care about either of those; we don't place any
PC-relative instructions in those. Maybe we want to in future.
* Modules care about both; we'd *prefer* to place them within +/-128M of all
other kernel/module code, but if there's no space we can use PLTs and expand
that to +/-2G. Since modules can reference other modules, that ends up
actually being halved, and modules have to fit within some 2G window that
also covers the kernel.
* I'm not sure about BPF's requirements; it seems happy doing the same as
modules.
So if we *must* use a common execmem allocator, what we'd really want is our own
types, e.g.
EXECMEM_ANYWHERE
EXECMEM_NOPLT
EXECMEM_PREL32
... and then we use those in arch code to implement module_alloc() and friends.
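The two limits can be put into numbers with a small userspace sketch. It
assumes the usual encodings (a 26-bit word-scaled immediate for direct
branches, a signed 32-bit offset for PREL32); the helper names and the local
SZ_* macros are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_128M	(INT64_C(128) << 20)
#define SZ_2G	(INT64_C(2) << 30)

/* Direct branch: signed offset in [-128M, 128M). */
static bool branch_reaches(uint64_t from, uint64_t to)
{
	/* Relies on two's complement wraparound for the signed distance. */
	int64_t off = (int64_t)(to - from);

	return off >= -SZ_128M && off < SZ_128M;
}

/* PREL32: signed 32-bit offset, i.e. [-2G, 2G). */
static bool prel32_reaches(uint64_t from, uint64_t to)
{
	int64_t off = (int64_t)(to - from);

	return off >= -SZ_2G && off < SZ_2G;
}

/*
 * If every piece of code must reach every other piece, it is enough that
 * all code sits in one window [lo, hi) no larger than the one-sided reach.
 */
static bool window_ok(uint64_t lo, uint64_t hi, int64_t reach)
{
	assert(hi >= lo);
	return hi - lo <= (uint64_t)reach;
}
```

The window check is the "halving" in practice: a +/-2G relocation spans 4G
around one site, but mutual reachability between any two sites only holds
inside a single 2G window.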
Mark.
On Mon, Apr 15, 2024 at 06:36:39PM +0100, Mark Rutland wrote:
> On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > > +/**
> > > + * enum execmem_type - types of executable memory ranges
> > > + *
> > > + * There are several subsystems that allocate executable memory.
> > > + * Architectures define different restrictions on placement,
> > > + * permissions, alignment and other parameters for memory that can be used
> > > + * by these subsystems.
> > > + * Types in this enum identify subsystems that allocate executable memory
> > > + * and let architectures define parameters for ranges suitable for
> > > + * allocations by each subsystem.
> > > + *
> > > + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> > > + * are not explcitly defined.
> > > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > > + * @EXECMEM_KPROBES: parameters for kprobes
> > > + * @EXECMEM_FTRACE: parameters for ftrace
> > > + * @EXECMEM_BPF: parameters for BPF
> > > + * @EXECMEM_TYPE_MAX:
> > > + */
> > > +enum execmem_type {
> > > + EXECMEM_DEFAULT,
> > > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > > + EXECMEM_KPROBES,
> > > + EXECMEM_FTRACE,
> > > + EXECMEM_BPF,
> > > + EXECMEM_TYPE_MAX,
> > > +};
> >
> > Can we please get a break-down of how all these types are actually
> > different from one another?
> >
> > I'm thinking some platforms have a tiny immediate space (arm64 comes to
> > > mind) and have less strict placement constraints for some of them?
>
> Yeah, and really I'd *much* rather deal with that in arch code, as I have said
> several times.
>
> For arm64 we have two basic restrictions:
>
> 1) Direct branches can go +/-128M
> We can expand this range by having direct branches go to PLTs, at a
> performance cost.
>
> 2) PREL32 relocations can go +/-2G
> We cannot expand this further.
>
> * We don't need to allocate memory for ftrace. We do not use trampolines.
>
> * Kprobes XOL areas don't care about either of those; we don't place any
> PC-relative instructions in those. Maybe we want to in future.
>
> * Modules care about both; we'd *prefer* to place them within +/-128M of all
> other kernel/module code, but if there's no space we can use PLTs and expand
> that to +/-2G. Since modules can reference other modules, that ends up
> actually being halved, and modules have to fit within some 2G window that
> also covers the kernel.
>
> * I'm not sure about BPF's requirements; it seems happy doing the same as
> modules.
BPF is happy with vmalloc().
> So if we *must* use a common execmem allocator, what we'd really want is our own
> types, e.g.
>
> EXECMEM_ANYWHERE
> EXECMEM_NOPLT
> EXECMEM_PREL32
>
> ... and then we use those in arch code to implement module_alloc() and friends.
I'm looking at execmem_types more as a definition of the consumers; maybe I
should have named the enum execmem_consumer in the first place.
And the arch constraints defined in struct execmem_range describe how memory
should be allocated for each consumer.
These constraints are defined early at boot and remain static, so
initializing them once and letting a common allocator use them makes
perfect sense to me.
I agree that fallback_{start,end} are not ideal, but we have 3
architectures that have a preferred and a secondary range for modules. And arm
and powerpc use the same logic for kprobes as well, and I don't see why this
code should be duplicated.
And, for instance, if you decide to place PC-relative instructions in
kprobes XOL areas, you'd only need to update execmem_range for kprobes to
be more like the range for modules.
With a central allocator it's easier to deal with things like
VM_FLUSH_RESET_PERMS and caching of ROX memory, and I think it will be more
maintainable than module_alloc(), alloc_insn_page() and
bpf_jit_alloc_exec() spread all over the place.
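The preferred-plus-fallback logic is small when expressed once. A userspace
sketch, where range_alloc() is a toy bump allocator standing in for
__vmalloc_node_range(), struct range_cursor is invented bookkeeping, and the
field names other than fallback_{start,end} are simplified assumptions:

```c
#include <assert.h>

/* Simplified range with an optional secondary window (assumption). */
struct execmem_range {
	unsigned long start, end;
	unsigned long fallback_start, fallback_end;
};

/* Toy allocator state: next free address in each window. */
struct range_cursor {
	unsigned long next, fallback_next;
};

/* Bump-allocate 'size' bytes from [*next, end); return 0 when full. */
static unsigned long range_alloc(unsigned long *next, unsigned long end,
				 unsigned long size)
{
	unsigned long addr = *next;

	if (size == 0 || size > end - addr)
		return 0;
	*next = addr + size;
	return addr;
}

/* Try the preferred window first, then the fallback one if defined. */
static unsigned long execmem_alloc_sketch(const struct execmem_range *r,
					  struct range_cursor *c,
					  unsigned long size)
{
	unsigned long addr;

	assert(r->start < r->end);

	addr = range_alloc(&c->next, r->end, size);
	if (!addr && r->fallback_start)
		addr = range_alloc(&c->fallback_next, r->fallback_end, size);
	return addr;
}
```

A consumer whose range has no fallback just leaves fallback_start at zero,
so the same function serves both the single-range and the two-range case.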
> Mark.
--
Sincerely yours,
Mike.
On Thu, 11 Apr 2024 19:00:41 +0300
Mike Rapoport <[email protected]> wrote:
> From: "Mike Rapoport (IBM)" <[email protected]>
>
> module_alloc() is used everywhere as a mean to allocate memory for code.
>
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules and
> puts the burden of code allocation to the modules code.
>
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
>
> Start splitting code allocation from modules by introducing execmem_alloc()
> and execmem_free() APIs.
>
> Initially, execmem_alloc() is a wrapper for module_alloc() and
> execmem_free() is a replacement of module_memfree() to allow updating all
> call sites to use the new APIs.
>
> Since architectures define different restrictions on placement,
> permissions, alignment and other parameters for memory that can be used by
> different subsystems that allocate executable memory, execmem_alloc() takes
> a type argument, that will be used to identify the calling subsystem and to
> allow architectures define parameters for ranges suitable for that
> subsystem.
>
This looks good to me for the kprobe part.
Acked-by: Masami Hiramatsu (Google) <[email protected]>
Thank you,
> Signed-off-by: Mike Rapoport (IBM) <[email protected]>
> ---
> arch/powerpc/kernel/kprobes.c | 6 ++--
> arch/s390/kernel/ftrace.c | 4 +--
> arch/s390/kernel/kprobes.c | 4 +--
> arch/s390/kernel/module.c | 5 +--
> arch/sparc/net/bpf_jit_comp_32.c | 8 ++---
> arch/x86/kernel/ftrace.c | 6 ++--
> arch/x86/kernel/kprobes/core.c | 4 +--
> include/linux/execmem.h | 57 ++++++++++++++++++++++++++++++++
> include/linux/moduleloader.h | 3 --
> kernel/bpf/core.c | 6 ++--
> kernel/kprobes.c | 8 ++---
> kernel/module/Kconfig | 1 +
> kernel/module/main.c | 25 +++++---------
> mm/Kconfig | 3 ++
> mm/Makefile | 1 +
> mm/execmem.c | 26 +++++++++++++++
> 16 files changed, 122 insertions(+), 45 deletions(-)
> create mode 100644 include/linux/execmem.h
> create mode 100644 mm/execmem.c
>
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index bbca90a5e2ec..9fcd01bb2ce6 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -19,8 +19,8 @@
> #include <linux/extable.h>
> #include <linux/kdebug.h>
> #include <linux/slab.h>
> -#include <linux/moduleloader.h>
> #include <linux/set_memory.h>
> +#include <linux/execmem.h>
> #include <asm/code-patching.h>
> #include <asm/cacheflush.h>
> #include <asm/sstep.h>
> @@ -130,7 +130,7 @@ void *alloc_insn_page(void)
> {
> void *page;
>
> - page = module_alloc(PAGE_SIZE);
> + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
> if (!page)
> return NULL;
>
> @@ -142,7 +142,7 @@ void *alloc_insn_page(void)
> }
> return page;
> error:
> - module_memfree(page);
> + execmem_free(page);
> return NULL;
> }
>
> diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
> index c46381ea04ec..798249ef5646 100644
> --- a/arch/s390/kernel/ftrace.c
> +++ b/arch/s390/kernel/ftrace.c
> @@ -7,13 +7,13 @@
> * Author(s): Martin Schwidefsky <[email protected]>
> */
>
> -#include <linux/moduleloader.h>
> #include <linux/hardirq.h>
> #include <linux/uaccess.h>
> #include <linux/ftrace.h>
> #include <linux/kernel.h>
> #include <linux/types.h>
> #include <linux/kprobes.h>
> +#include <linux/execmem.h>
> #include <trace/syscall.h>
> #include <asm/asm-offsets.h>
> #include <asm/text-patching.h>
> @@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void)
> {
> const char *start, *end;
>
> - ftrace_plt = module_alloc(PAGE_SIZE);
> + ftrace_plt = execmem_alloc(EXECMEM_FTRACE, PAGE_SIZE);
> if (!ftrace_plt)
> panic("cannot allocate ftrace plt\n");
>
> diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
> index f0cf20d4b3c5..3c1b1be744de 100644
> --- a/arch/s390/kernel/kprobes.c
> +++ b/arch/s390/kernel/kprobes.c
> @@ -9,7 +9,6 @@
>
> #define pr_fmt(fmt) "kprobes: " fmt
>
> -#include <linux/moduleloader.h>
> #include <linux/kprobes.h>
> #include <linux/ptrace.h>
> #include <linux/preempt.h>
> @@ -21,6 +20,7 @@
> #include <linux/slab.h>
> #include <linux/hardirq.h>
> #include <linux/ftrace.h>
> +#include <linux/execmem.h>
> #include <asm/set_memory.h>
> #include <asm/sections.h>
> #include <asm/dis.h>
> @@ -38,7 +38,7 @@ void *alloc_insn_page(void)
> {
> void *page;
>
> - page = module_alloc(PAGE_SIZE);
> + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
> if (!page)
> return NULL;
> set_memory_rox((unsigned long)page, 1);
> diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
> index 42215f9404af..ac97a905e8cd 100644
> --- a/arch/s390/kernel/module.c
> +++ b/arch/s390/kernel/module.c
> @@ -21,6 +21,7 @@
> #include <linux/moduleloader.h>
> #include <linux/bug.h>
> #include <linux/memory.h>
> +#include <linux/execmem.h>
> #include <asm/alternative.h>
> #include <asm/nospec-branch.h>
> #include <asm/facility.h>
> @@ -76,7 +77,7 @@ void *module_alloc(unsigned long size)
> #ifdef CONFIG_FUNCTION_TRACER
> void module_arch_cleanup(struct module *mod)
> {
> - module_memfree(mod->arch.trampolines_start);
> + execmem_free(mod->arch.trampolines_start);
> }
> #endif
>
> @@ -510,7 +511,7 @@ static int module_alloc_ftrace_hotpatch_trampolines(struct module *me,
>
> size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size);
> numpages = DIV_ROUND_UP(size, PAGE_SIZE);
> - start = module_alloc(numpages * PAGE_SIZE);
> + start = execmem_alloc(EXECMEM_FTRACE, numpages * PAGE_SIZE);
> if (!start)
> return -ENOMEM;
> set_memory_rox((unsigned long)start, numpages);
> diff --git a/arch/sparc/net/bpf_jit_comp_32.c b/arch/sparc/net/bpf_jit_comp_32.c
> index da2df1e84ed4..bda2dbd3f4c5 100644
> --- a/arch/sparc/net/bpf_jit_comp_32.c
> +++ b/arch/sparc/net/bpf_jit_comp_32.c
> @@ -1,10 +1,10 @@
> // SPDX-License-Identifier: GPL-2.0
> -#include <linux/moduleloader.h>
> #include <linux/workqueue.h>
> #include <linux/netdevice.h>
> #include <linux/filter.h>
> #include <linux/cache.h>
> #include <linux/if_vlan.h>
> +#include <linux/execmem.h>
>
> #include <asm/cacheflush.h>
> #include <asm/ptrace.h>
> @@ -713,7 +713,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf];
> if (unlikely(proglen + ilen > oldproglen)) {
> pr_err("bpb_jit_compile fatal error\n");
> kfree(addrs);
> - module_memfree(image);
> + execmem_free(image);
> return;
> }
> memcpy(image + proglen, temp, ilen);
> @@ -736,7 +736,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf];
> break;
> }
> if (proglen == oldproglen) {
> - image = module_alloc(proglen);
> + image = execmem_alloc(EXECMEM_BPF, proglen);
> if (!image)
> goto out;
> }
> @@ -758,7 +758,7 @@ cond_branch: f_offset = addrs[i + filter[i].jf];
> void bpf_jit_free(struct bpf_prog *fp)
> {
> if (fp->jited)
> - module_memfree(fp->bpf_func);
> + execmem_free(fp->bpf_func);
>
> bpf_prog_unlock_free(fp);
> }
> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index 70139d9d2e01..c8ddb7abda7c 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -25,6 +25,7 @@
> #include <linux/memory.h>
> #include <linux/vmalloc.h>
> #include <linux/set_memory.h>
> +#include <linux/execmem.h>
>
> #include <trace/syscall.h>
>
> @@ -261,15 +262,14 @@ void arch_ftrace_update_code(int command)
> #ifdef CONFIG_X86_64
>
> #ifdef CONFIG_MODULES
> -#include <linux/moduleloader.h>
> /* Module allocation simplifies allocating memory for code */
> static inline void *alloc_tramp(unsigned long size)
> {
> - return module_alloc(size);
> + return execmem_alloc(EXECMEM_FTRACE, size);
> }
> static inline void tramp_free(void *tramp)
> {
> - module_memfree(tramp);
> + execmem_free(tramp);
> }
> #else
> /* Trampolines can only be created if modules are supported */
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index d0e49bd7c6f3..72e6a45e7ec2 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -40,12 +40,12 @@
> #include <linux/kgdb.h>
> #include <linux/ftrace.h>
> #include <linux/kasan.h>
> -#include <linux/moduleloader.h>
> #include <linux/objtool.h>
> #include <linux/vmalloc.h>
> #include <linux/pgtable.h>
> #include <linux/set_memory.h>
> #include <linux/cfi.h>
> +#include <linux/execmem.h>
>
> #include <asm/text-patching.h>
> #include <asm/cacheflush.h>
> @@ -495,7 +495,7 @@ void *alloc_insn_page(void)
> {
> void *page;
>
> - page = module_alloc(PAGE_SIZE);
> + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
> if (!page)
> return NULL;
>
> diff --git a/include/linux/execmem.h b/include/linux/execmem.h
> new file mode 100644
> index 000000000000..43e7995593a1
> --- /dev/null
> +++ b/include/linux/execmem.h
> @@ -0,0 +1,57 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_EXECMEM_ALLOC_H
> +#define _LINUX_EXECMEM_ALLOC_H
> +
> +#include <linux/types.h>
> +#include <linux/moduleloader.h>
> +
> +/**
> + * enum execmem_type - types of executable memory ranges
> + *
> + * There are several subsystems that allocate executable memory.
> + * Architectures define different restrictions on placement,
> + * permissions, alignment and other parameters for memory that can be used
> + * by these subsystems.
> + * Types in this enum identify subsystems that allocate executable memory
> + * and let architectures define parameters for ranges suitable for
> + * allocations by each subsystem.
> + *
> + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> + * are not explcitly defined.
> + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> + * @EXECMEM_KPROBES: parameters for kprobes
> + * @EXECMEM_FTRACE: parameters for ftrace
> + * @EXECMEM_BPF: parameters for BPF
> + * @EXECMEM_TYPE_MAX:
> + */
> +enum execmem_type {
> + EXECMEM_DEFAULT,
> + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> + EXECMEM_KPROBES,
> + EXECMEM_FTRACE,
> + EXECMEM_BPF,
> + EXECMEM_TYPE_MAX,
> +};
> +
> +/**
> + * execmem_alloc - allocate executable memory
> + * @type: type of the allocation
> + * @size: how many bytes of memory are required
> + *
> + * Allocates memory that will contain executable code, either generated or
> + * loaded from kernel modules.
> + *
> + * The memory will have protections defined by architecture for executable
> + * region of the @type.
> + *
> + * Return: a pointer to the allocated memory or %NULL
> + */
> +void *execmem_alloc(enum execmem_type type, size_t size);
> +
> +/**
> + * execmem_free - free executable memory
> + * @ptr: pointer to the memory that should be freed
> + */
> +void execmem_free(void *ptr);
> +
> +#endif /* _LINUX_EXECMEM_ALLOC_H */
> diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> index 89b1e0ed9811..a3b8caee9405 100644
> --- a/include/linux/moduleloader.h
> +++ b/include/linux/moduleloader.h
> @@ -29,9 +29,6 @@ unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
> sections. Returns NULL on failure. */
> void *module_alloc(unsigned long size);
>
> -/* Free memory returned from module_alloc. */
> -void module_memfree(void *module_region);
> -
> /* Determines if the section name is an init section (that is only used during
> * module loading).
> */
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 696bc55de8e8..75a54024e2f4 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -22,7 +22,6 @@
> #include <linux/skbuff.h>
> #include <linux/vmalloc.h>
> #include <linux/random.h>
> -#include <linux/moduleloader.h>
> #include <linux/bpf.h>
> #include <linux/btf.h>
> #include <linux/objtool.h>
> @@ -37,6 +36,7 @@
> #include <linux/nospec.h>
> #include <linux/bpf_mem_alloc.h>
> #include <linux/memcontrol.h>
> +#include <linux/execmem.h>
>
> #include <asm/barrier.h>
> #include <asm/unaligned.h>
> @@ -1050,12 +1050,12 @@ void bpf_jit_uncharge_modmem(u32 size)
>
> void *__weak bpf_jit_alloc_exec(unsigned long size)
> {
> - return module_alloc(size);
> + return execmem_alloc(EXECMEM_BPF, size);
> }
>
> void __weak bpf_jit_free_exec(void *addr)
> {
> - module_memfree(addr);
> + execmem_free(addr);
> }
>
> struct bpf_binary_header *
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 9d9095e81792..047ca629ce49 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -26,7 +26,6 @@
> #include <linux/slab.h>
> #include <linux/stddef.h>
> #include <linux/export.h>
> -#include <linux/moduleloader.h>
> #include <linux/kallsyms.h>
> #include <linux/freezer.h>
> #include <linux/seq_file.h>
> @@ -39,6 +38,7 @@
> #include <linux/jump_label.h>
> #include <linux/static_call.h>
> #include <linux/perf_event.h>
> +#include <linux/execmem.h>
>
> #include <asm/sections.h>
> #include <asm/cacheflush.h>
> @@ -113,17 +113,17 @@ enum kprobe_slot_state {
> void __weak *alloc_insn_page(void)
> {
> /*
> - * Use module_alloc() so this page is within +/- 2GB of where the
> + * Use execmem_alloc() so this page is within +/- 2GB of where the
> * kernel image and loaded module images reside. This is required
> * for most of the architectures.
> * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
> */
> - return module_alloc(PAGE_SIZE);
> + return execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
> }
>
> static void free_insn_page(void *page)
> {
> - module_memfree(page);
> + execmem_free(page);
> }
>
> struct kprobe_insn_cache kprobe_insn_slots = {
> diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
> index f3e0329337f6..744383c1eed1 100644
> --- a/kernel/module/Kconfig
> +++ b/kernel/module/Kconfig
> @@ -2,6 +2,7 @@
> menuconfig MODULES
> bool "Enable loadable module support"
> modules
> + select EXECMEM
> help
> Kernel modules are small pieces of compiled code which can
> be inserted in the running kernel, rather than being
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index 5b82b069e0d3..d56b7df0cbb6 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -57,6 +57,7 @@
> #include <linux/audit.h>
> #include <linux/cfi.h>
> #include <linux/debugfs.h>
> +#include <linux/execmem.h>
> #include <uapi/linux/module.h>
> #include "internal.h"
>
> @@ -1179,16 +1180,6 @@ resolve_symbol_wait(struct module *mod,
> return ksym;
> }
>
> -void __weak module_memfree(void *module_region)
> -{
> - /*
> - * This memory may be RO, and freeing RO memory in an interrupt is not
> - * supported by vmalloc.
> - */
> - WARN_ON(in_interrupt());
> - vfree(module_region);
> -}
> -
> void __weak module_arch_cleanup(struct module *mod)
> {
> }
> @@ -1213,7 +1204,7 @@ static int module_memory_alloc(struct module *mod, enum mod_mem_type type)
> if (mod_mem_use_vmalloc(type))
> ptr = vmalloc(size);
> else
> - ptr = module_alloc(size);
> + ptr = execmem_alloc(EXECMEM_MODULE_TEXT, size);
>
> if (!ptr)
> return -ENOMEM;
> @@ -1244,7 +1235,7 @@ static void module_memory_free(struct module *mod, enum mod_mem_type type)
> if (mod_mem_use_vmalloc(type))
> vfree(ptr);
> else
> - module_memfree(ptr);
> + execmem_free(ptr);
> }
>
> static void free_mod_mem(struct module *mod)
> @@ -2496,9 +2487,9 @@ static void do_free_init(struct work_struct *w)
>
> llist_for_each_safe(pos, n, list) {
> initfree = container_of(pos, struct mod_initfree, node);
> - module_memfree(initfree->init_text);
> - module_memfree(initfree->init_data);
> - module_memfree(initfree->init_rodata);
> + execmem_free(initfree->init_text);
> + execmem_free(initfree->init_data);
> + execmem_free(initfree->init_rodata);
> kfree(initfree);
> }
> }
> @@ -2608,10 +2599,10 @@ static noinline int do_init_module(struct module *mod)
> * We want to free module_init, but be aware that kallsyms may be
> * walking this with preempt disabled. In all the failure paths, we
> * call synchronize_rcu(), but we don't want to slow down the success
> - * path. module_memfree() cannot be called in an interrupt, so do the
> + * path. execmem_free() cannot be called in an interrupt, so do the
> * work and call synchronize_rcu() in a work queue.
> *
> - * Note that module_alloc() on most architectures creates W+X page
> + * Note that execmem_alloc() on most architectures creates W+X page
> * mappings which won't be cleaned up until do_free_init() runs. Any
> * code such as mark_rodata_ro() which depends on those mappings to
> * be cleaned up needs to sync with the queued work by invoking
> diff --git a/mm/Kconfig b/mm/Kconfig
> index b1448aa81e15..f08a216d4793 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1241,6 +1241,9 @@ config LOCK_MM_AND_FIND_VMA
> config IOMMU_MM_DATA
> bool
>
> +config EXECMEM
> + bool
> +
> source "mm/damon/Kconfig"
>
> endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index 4abb40b911ec..001336c91864 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -133,3 +133,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
> obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
> obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
> obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
> +obj-$(CONFIG_EXECMEM) += execmem.o
> diff --git a/mm/execmem.c b/mm/execmem.c
> new file mode 100644
> index 000000000000..ed2ea41a2543
> --- /dev/null
> +++ b/mm/execmem.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/mm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/execmem.h>
> +#include <linux/moduleloader.h>
> +
> +static void *__execmem_alloc(size_t size)
> +{
> + return module_alloc(size);
> +}
> +
> +void *execmem_alloc(enum execmem_type type, size_t size)
> +{
> + return __execmem_alloc(size);
> +}
> +
> +void execmem_free(void *ptr)
> +{
> + /*
> + * This memory may be RO, and freeing RO memory in an interrupt is not
> + * supported by vmalloc.
> + */
> + WARN_ON(in_interrupt());
> + vfree(ptr);
> +}
> --
> 2.43.0
>
>
--
Masami Hiramatsu (Google) <[email protected]>
Hi Mike,
On Thu, 11 Apr 2024 19:00:50 +0300
Mike Rapoport <[email protected]> wrote:
> From: "Mike Rapoport (IBM)" <[email protected]>
>
> kprobes depended on CONFIG_MODULES because it has to allocate memory for
> code.
>
> Since code allocations are now implemented with execmem, kprobes can be
> enabled in non-modular kernels.
>
> Add #ifdef CONFIG_MODULES guards for the code dealing with kprobes inside
> modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> dependency of CONFIG_KPROBES on CONFIG_MODULES.
Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in
function bodies? We have enough dummy functions for that, so it should
not be a problem.
Thank you,
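
For illustration, here is a minimal userspace sketch of the pattern suggested
above. It assumes a simplified stand-in for the kernel's IS_ENABLED() macro
(the real one also handles =m and undefined symbols), and the names
CONFIG_MODULES_SKETCH, register_module_notifier_stub() and init_sketch() are
hypothetical, not taken from the patch:

```c
/*
 * Simplified stand-in for the kernel's IS_ENABLED(); in this sketch the
 * option must always be defined to 0 or 1 (an assumption of the example).
 */
#define CONFIG_MODULES_SKETCH 0
#define IS_ENABLED_SKETCH(option) (option)

static int notifier_registered;

/* Hypothetical dummy standing in for a notifier registration helper. */
static int register_module_notifier_stub(void)
{
	notifier_registered = 1;
	return 0;
}

/*
 * The call stays visible to the compiler, so it is type-checked in every
 * configuration, and it is dropped as dead code when the option is 0.
 */
static int init_sketch(void)
{
	int err = 0;

	if (!err && IS_ENABLED_SKETCH(CONFIG_MODULES_SKETCH))
		err = register_module_notifier_stub();

	return err;
}
```

Compared with an #ifdef inside the function body, the condition reads as
ordinary C and the disabled branch cannot silently bit-rot, at the cost of
needing a declaration (or dummy) for everything it references.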
>
> Signed-off-by: Mike Rapoport (IBM) <[email protected]>
> ---
> arch/Kconfig | 2 +-
> kernel/kprobes.c | 43 +++++++++++++++++++++----------------
> kernel/trace/trace_kprobe.c | 11 ++++++++++
> 3 files changed, 37 insertions(+), 19 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index bc9e8e5dccd5..68177adf61a0 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -52,9 +52,9 @@ config GENERIC_ENTRY
>
> config KPROBES
> bool "Kprobes"
> - depends on MODULES
> depends on HAVE_KPROBES
> select KALLSYMS
> + select EXECMEM
> select TASKS_RCU if PREEMPTION
> help
> Kprobes allows you to trap at almost any kernel address and
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 047ca629ce49..90c056853e6f 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -1580,6 +1580,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
> goto out;
> }
>
> +#ifdef CONFIG_MODULES
> /* Check if 'p' is probing a module. */
> *probed_mod = __module_text_address((unsigned long) p->addr);
> if (*probed_mod) {
> @@ -1603,6 +1604,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
> ret = -ENOENT;
> }
> }
> +#endif
> +
> out:
> preempt_enable();
> jump_label_unlock();
> @@ -2482,24 +2485,6 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end)
> return 0;
> }
>
> -/* Remove all symbols in given area from kprobe blacklist */
> -static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
> -{
> - struct kprobe_blacklist_entry *ent, *n;
> -
> - list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
> - if (ent->start_addr < start || ent->start_addr >= end)
> - continue;
> - list_del(&ent->list);
> - kfree(ent);
> - }
> -}
> -
> -static void kprobe_remove_ksym_blacklist(unsigned long entry)
> -{
> - kprobe_remove_area_blacklist(entry, entry + 1);
> -}
> -
> int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
> char *type, char *sym)
> {
> @@ -2564,6 +2549,25 @@ static int __init populate_kprobe_blacklist(unsigned long *start,
> return ret ? : arch_populate_kprobe_blacklist();
> }
>
> +#ifdef CONFIG_MODULES
> +/* Remove all symbols in given area from kprobe blacklist */
> +static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
> +{
> + struct kprobe_blacklist_entry *ent, *n;
> +
> + list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
> + if (ent->start_addr < start || ent->start_addr >= end)
> + continue;
> + list_del(&ent->list);
> + kfree(ent);
> + }
> +}
> +
> +static void kprobe_remove_ksym_blacklist(unsigned long entry)
> +{
> + kprobe_remove_area_blacklist(entry, entry + 1);
> +}
> +
> static void add_module_kprobe_blacklist(struct module *mod)
> {
> unsigned long start, end;
> @@ -2665,6 +2669,7 @@ static struct notifier_block kprobe_module_nb = {
> .notifier_call = kprobes_module_callback,
> .priority = 0
> };
> +#endif
>
> void kprobe_free_init_mem(void)
> {
> @@ -2724,8 +2729,10 @@ static int __init init_kprobes(void)
> err = arch_init_kprobes();
> if (!err)
> err = register_die_notifier(&kprobe_exceptions_nb);
> +#ifdef CONFIG_MODULES
> if (!err)
> err = register_module_notifier(&kprobe_module_nb);
> +#endif
>
> kprobes_initialized = (err == 0);
> kprobe_sysctls_init();
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 14099cc17fc9..f0610137d6a3 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -111,6 +111,7 @@ static nokprobe_inline bool trace_kprobe_within_module(struct trace_kprobe *tk,
> return strncmp(module_name(mod), name, len) == 0 && name[len] == ':';
> }
>
> +#ifdef CONFIG_MODULES
> static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
> {
> char *p;
> @@ -129,6 +130,12 @@ static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
>
> return ret;
> }
> +#else
> +static inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
> +{
> + return false;
> +}
> +#endif
>
> static bool trace_kprobe_is_busy(struct dyn_event *ev)
> {
> @@ -670,6 +677,7 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
> return ret;
> }
>
> +#ifdef CONFIG_MODULES
> /* Module notifier call back, checking event on the module */
> static int trace_kprobe_module_callback(struct notifier_block *nb,
> unsigned long val, void *data)
> @@ -704,6 +712,7 @@ static struct notifier_block trace_kprobe_module_nb = {
> .notifier_call = trace_kprobe_module_callback,
> .priority = 1 /* Invoked after kprobe module callback */
> };
> +#endif
>
> static int count_symbols(void *data, unsigned long unused)
> {
> @@ -1933,8 +1942,10 @@ static __init int init_kprobe_trace_early(void)
> if (ret)
> return ret;
>
> +#ifdef CONFIG_MODULES
> if (register_module_notifier(&trace_kprobe_module_nb))
> return -EINVAL;
> +#endif
>
> return 0;
> }
> --
> 2.43.0
>
>
--
Masami Hiramatsu
On Tue, Apr 16, 2024 at 12:23 AM Mike Rapoport <[email protected]> wrote:
>
> On Mon, Apr 15, 2024 at 06:36:39PM +0100, Mark Rutland wrote:
> > On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > > > +/**
> > > > + * enum execmem_type - types of executable memory ranges
> > > > + *
> > > > + * There are several subsystems that allocate executable memory.
> > > > + * Architectures define different restrictions on placement,
> > > > + * permissions, alignment and other parameters for memory that can be used
> > > > + * by these subsystems.
> > > > + * Types in this enum identify subsystems that allocate executable memory
> > > > + * and let architectures define parameters for ranges suitable for
> > > > + * allocations by each subsystem.
> > > > + *
> > > > + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> > > > + * are not explcitly defined.
> > > > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > > > + * @EXECMEM_KPROBES: parameters for kprobes
> > > > + * @EXECMEM_FTRACE: parameters for ftrace
> > > > + * @EXECMEM_BPF: parameters for BPF
> > > > + * @EXECMEM_TYPE_MAX:
> > > > + */
> > > > +enum execmem_type {
> > > > + EXECMEM_DEFAULT,
> > > > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > > > + EXECMEM_KPROBES,
> > > > + EXECMEM_FTRACE,
> > > > + EXECMEM_BPF,
> > > > + EXECMEM_TYPE_MAX,
> > > > +};
> > >
> > > Can we please get a break-down of how all these types are actually
> > > different from one another?
> > >
> > > I'm thinking some platforms have a tiny immediate space (arm64 comes to
> > > mind) and has less strict placement constraints for some of them?
> >
> > Yeah, and really I'd *much* rather deal with that in arch code, as I have said
> > several times.
> >
> > For arm64 we have two basic restrictions:
> >
> > 1) Direct branches can go +/-128M
> > We can expand this range by having direct branches go to PLTs, at a
> > performance cost.
> >
> > 2) PREL32 relocations can go +/-2G
> > We cannot expand this further.
> >
> > * We don't need to allocate memory for ftrace. We do not use trampolines.
> >
> > * Kprobes XOL areas don't care about either of those; we don't place any
> > PC-relative instructions in those. Maybe we want to in future.
> >
> > * Modules care about both; we'd *prefer* to place them within +/-128M of all
> > other kernel/module code, but if there's no space we can use PLTs and expand
> > that to +/-2G. Since modules can reference other modules, that ends up
> > actually being halved, and modules have to fit within some 2G window that
> > also covers the kernel.
Is +/- 2G enough for all realistic use cases? If so, I guess we don't
really need EXECMEM_ANYWHERE below?
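
To make the two arm64 limits quoted above concrete, here is a small sketch
that checks whether a target is reachable from a given address. It assumes
the ranges are the half-open intervals [-128M, +128M) and [-2G, +2G); the
helper names are hypothetical and nothing here is arm64 kernel code:

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_128M (INT64_C(128) << 20)
#define SZ_2G   (INT64_C(2) << 30)

/* True if the signed displacement from 'from' to 'to' fits in +/- range. */
static bool within_range(uint64_t from, uint64_t to, int64_t range)
{
	int64_t disp = (int64_t)(to - from);

	return disp >= -range && disp < range;
}

/* Direct branch without a PLT: +/-128M. */
static bool direct_branch_reaches(uint64_t from, uint64_t to)
{
	return within_range(from, to, SZ_128M);
}

/* PREL32 relocation: +/-2G, cannot be extended. */
static bool prel32_reaches(uint64_t from, uint64_t to)
{
	return within_range(from, to, SZ_2G);
}
```

A target 256M away fails the first check but passes the second, which is the
case where a PLT would be needed.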
> >
> > * I'm not sure about BPF's requirements; it seems happy doing the same as
> > modules.
>
> BPF are happy with vmalloc().
>
> > So if we *must* use a common execmem allocator, what we'd really want is our own
> > types, e.g.
> >
> > EXECMEM_ANYWHERE
> > EXECMEM_NOPLT
> > EXECMEM_PREL32
> >
> > ... and then we use those in arch code to implement module_alloc() and friends.
>
> I'm looking at execmem_types more as definition of the consumers, maybe I
> should have named the enum execmem_consumer at the first place.
I think looking at execmem_type from consumers' point of view adds
unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
and bpf (and maybe also module text) all have the same requirements.
Did I miss something?
IOW, we have
enum execmem_type {
        EXECMEM_DEFAULT,
        EXECMEM_TEXT,
        EXECMEM_KPROBES = EXECMEM_TEXT,
        EXECMEM_FTRACE = EXECMEM_TEXT,
        EXECMEM_BPF = EXECMEM_TEXT,  /* we may end up without
                                        _KPROBE, _FTRACE, _BPF */
        EXECMEM_DATA,  /* rw */
        EXECMEM_RO_DATA,
        EXECMEM_RO_AFTER_INIT,
        EXECMEM_TYPE_MAX,
};
Does this make sense?
Thanks,
Song
On Wed, Apr 17, 2024 at 04:32:49PM -0700, Song Liu wrote:
> On Tue, Apr 16, 2024 at 12:23 AM Mike Rapoport <[email protected]> wrote:
> >
> > On Mon, Apr 15, 2024 at 06:36:39PM +0100, Mark Rutland wrote:
> > > On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> > > > On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > > > > +/**
> > > > > + * enum execmem_type - types of executable memory ranges
> > > > > + *
> > > > > + * There are several subsystems that allocate executable memory.
> > > > > + * Architectures define different restrictions on placement,
> > > > > + * permissions, alignment and other parameters for memory that can be used
> > > > > + * by these subsystems.
> > > > > + * Types in this enum identify subsystems that allocate executable memory
> > > > > + * and let architectures define parameters for ranges suitable for
> > > > > + * allocations by each subsystem.
> > > > > + *
> > > > > + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> > > > > + * are not explcitly defined.
> > > > > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > > > > + * @EXECMEM_KPROBES: parameters for kprobes
> > > > > + * @EXECMEM_FTRACE: parameters for ftrace
> > > > > + * @EXECMEM_BPF: parameters for BPF
> > > > > + * @EXECMEM_TYPE_MAX:
> > > > > + */
> > > > > +enum execmem_type {
> > > > > + EXECMEM_DEFAULT,
> > > > > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > > > > + EXECMEM_KPROBES,
> > > > > + EXECMEM_FTRACE,
> > > > > + EXECMEM_BPF,
> > > > > + EXECMEM_TYPE_MAX,
> > > > > +};
> > > >
> > > > Can we please get a break-down of how all these types are actually
> > > > different from one another?
> > > >
> > > > I'm thinking some platforms have a tiny immediate space (arm64 comes to
> > > > mind) and has less strict placement constraints for some of them?
> > >
> > > Yeah, and really I'd *much* rather deal with that in arch code, as I have said
> > > several times.
> > >
> > > For arm64 we have two basic restrictions:
> > >
> > > 1) Direct branches can go +/-128M
> > > We can expand this range by having direct branches go to PLTs, at a
> > > performance cost.
> > >
> > > 2) PREL32 relocations can go +/-2G
> > > We cannot expand this further.
> > >
> > > * We don't need to allocate memory for ftrace. We do not use trampolines.
> > >
> > > * Kprobes XOL areas don't care about either of those; we don't place any
> > > PC-relative instructions in those. Maybe we want to in future.
> > >
> > > * Modules care about both; we'd *prefer* to place them within +/-128M of all
> > > other kernel/module code, but if there's no space we can use PLTs and expand
> > > that to +/-2G. Since modules can reference other modules, that ends up
> > > actually being halved, and modules have to fit within some 2G window that
> > > also covers the kernel.
>
> Is +/- 2G enough for all realistic use cases? If so, I guess we don't
> really need
> EXECMEM_ANYWHERE below?
>
> > >
> > > * I'm not sure about BPF's requirements; it seems happy doing the same as
> > > modules.
> >
> > BPF are happy with vmalloc().
> >
> > > So if we *must* use a common execmem allocator, what we'd really want is our own
> > > types, e.g.
> > >
> > > EXECMEM_ANYWHERE
> > > EXECMEM_NOPLT
> > > EXECMEM_PREL32
> > >
> > > ... and then we use those in arch code to implement module_alloc() and friends.
> >
> > I'm looking at execmem_types more as definition of the consumers, maybe I
> > should have named the enum execmem_consumer at the first place.
>
> I think looking at execmem_type from consumers' point of view adds
> unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> and bpf (and maybe also module text) all have the same requirements.
> Did I miss something?
It's enough to have one architecture with different constraints for kprobes
and bpf to warrant a type for each.
Where do you see unnecessary complexity?
> IOW, we have
>
> enum execmem_type {
> EXECMEM_DEFAULT,
> EXECMEM_TEXT,
> EXECMEM_KPROBES = EXECMEM_TEXT,
> EXECMEM_FTRACE = EXECMEM_TEXT,
> EXECMEM_BPF = EXECMEM_TEXT, /* we may end up without
> _KPROBE, _FTRACE, _BPF */
> EXECMEM_DATA, /* rw */
> EXECMEM_RO_DATA,
> EXECMEM_RO_AFTER_INIT,
> EXECMEM_TYPE_MAX,
> };
>
> Does this make sense?
How do you suggest dealing with, e.g., riscv, which has separate address spaces
for modules, kprobes and bpf?
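
As a rough sketch of what per-consumer types buy an architecture, the table
below is indexed by the enum from the patch and falls back to the default
entry for types the architecture does not describe. struct range_sketch,
arch_ranges and range_for() are hypothetical simplifications, and the
addresses are illustrative only; they do not describe the riscv layout or
any other real one:

```c
enum execmem_type {
	EXECMEM_DEFAULT,
	EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
	EXECMEM_KPROBES,
	EXECMEM_FTRACE,
	EXECMEM_BPF,
	EXECMEM_TYPE_MAX,
};

/* Hypothetical, simplified range description; not the kernel's struct. */
struct range_sketch {
	unsigned long start;
	unsigned long end;	/* end == 0 means "not defined by the arch" */
};

/* Illustrative addresses only. */
static const struct range_sketch arch_ranges[EXECMEM_TYPE_MAX] = {
	[EXECMEM_DEFAULT] = { 0x10000000UL, 0x18000000UL },
	[EXECMEM_KPROBES] = { 0x40000000UL, 0x80000000UL },
	[EXECMEM_BPF]     = { 0x10000000UL, 0x14000000UL },
	/* EXECMEM_FTRACE is left undefined on purpose. */
};

/* Types the architecture did not describe fall back to the default. */
static const struct range_sketch *range_for(enum execmem_type type)
{
	if (type >= EXECMEM_TYPE_MAX || !arch_ranges[type].end)
		return &arch_ranges[EXECMEM_DEFAULT];
	return &arch_ranges[type];
}
```

Here kprobes get a range of their own, bpf gets a subset of the default
(module) range, and ftrace simply inherits the default, all without the
callers knowing anything about placement.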
> Thanks,
> Song
--
Sincerely yours,
Mike.
Hi Masami,
On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote:
> Hi Mike,
>
> On Thu, 11 Apr 2024 19:00:50 +0300
> Mike Rapoport <[email protected]> wrote:
>
> > From: "Mike Rapoport (IBM)" <[email protected]>
> >
> > kprobes depended on CONFIG_MODULES because it has to allocate memory for
> > code.
> >
> > Since code allocations are now implemented with execmem, kprobes can be
> > enabled in non-modular kernels.
> >
> > Add #ifdef CONFIG_MODULES guards for the code dealing with kprobes inside
> > modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> > dependency of CONFIG_KPROBES on CONFIG_MODULES.
>
> Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
> Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in
> function body? We have enough dummy functions for that, so it should
> not make a problem.
I'll rebase and will try to reduce ifdefery where possible.
> Thank you,
>
> >
> > Signed-off-by: Mike Rapoport (IBM) <[email protected]>
> > ---
> > arch/Kconfig | 2 +-
> > kernel/kprobes.c | 43 +++++++++++++++++++++----------------
> > kernel/trace/trace_kprobe.c | 11 ++++++++++
> > 3 files changed, 37 insertions(+), 19 deletions(-)
> >
>
> --
> Masami Hiramatsu
--
Sincerely yours,
Mike.
On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport <[email protected]> wrote:
>
[...]
> >
> > Is +/- 2G enough for all realistic use cases? If so, I guess we don't
> > really need
> > EXECMEM_ANYWHERE below?
> >
> > > >
> > > > * I'm not sure about BPF's requirements; it seems happy doing the same as
> > > > modules.
> > >
> > > BPF are happy with vmalloc().
> > >
> > > > So if we *must* use a common execmem allocator, what we'd really want is our own
> > > > types, e.g.
> > > >
> > > > EXECMEM_ANYWHERE
> > > > EXECMEM_NOPLT
> > > > EXECMEM_PREL32
> > > >
> > > > ... and then we use those in arch code to implement module_alloc() and friends.
> > >
> > > I'm looking at execmem_types more as definition of the consumers, maybe I
> > > should have named the enum execmem_consumer at the first place.
> >
> > I think looking at execmem_type from consumers' point of view adds
> > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> > and bpf (and maybe also module text) all have the same requirements.
> > Did I miss something?
>
> It's enough to have one architecture with different constraints for kprobes
> and bpf to warrant a type for each.
>
AFAICT, some of these constraints can be changed without too much work.
> Where do you see unnecessary complexity?
>
> > IOW, we have
> >
> > enum execmem_type {
> > EXECMEM_DEFAULT,
> > EXECMEM_TEXT,
> > EXECMEM_KPROBES = EXECMEM_TEXT,
> > EXECMEM_FTRACE = EXECMEM_TEXT,
> > EXECMEM_BPF = EXECMEM_TEXT, /* we may end up without
> > _KPROBE, _FTRACE, _BPF */
> > EXECMEM_DATA, /* rw */
> > EXECMEM_RO_DATA,
> > EXECMEM_RO_AFTER_INIT,
> > EXECMEM_TYPE_MAX,
> > };
> >
> > Does this make sense?
>
> How do you suggest to deal with e.g. riscv that has separate address spaces
> for modules, kprobes and bpf?
IIUC, modules and bpf use the same address space on riscv, while kprobes use
vmalloc addresses. I haven't tried this yet, but I think we can let kprobes
use the same space as modules and bpf, which is:
ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | modules, BPF
Did I get this right?
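
As a quick arithmetic check of the layout line quoted above, the sketch below
computes the size of the region and its distance from the top of the 64-bit
address space. Only the two addresses come from that line; the helper names
are hypothetical:

```c
#include <stdint.h>

#define GiB (UINT64_C(1) << 30)

static const uint64_t region_start = UINT64_C(0xffffffff00000000);
static const uint64_t region_end   = UINT64_C(0xffffffff7fffffff);

/* Inclusive size of the region in bytes. */
static uint64_t region_size(void)
{
	return region_end - region_start + 1;
}

/* How far the region start lies below the top of the address space. */
static uint64_t offset_from_top(void)
{
	return (uint64_t)0 - region_start;
}
```

The size works out to 2 GiB and the start sits 4 GiB below the top, matching
the "-4 GB" and "2 GB" columns.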
Thanks,
Song
On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport <[email protected]> wrote:
> > > >
> > > > I'm looking at execmem_types more as definition of the consumers, maybe I
> > > > should have named the enum execmem_consumer at the first place.
> > >
> > > I think looking at execmem_type from consumers' point of view adds
> > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> > > and bpf (and maybe also module text) all have the same requirements.
> > > Did I miss something?
> >
> > It's enough to have one architecture with different constraints for kprobes
> > and bpf to warrant a type for each.
>
> AFAICT, some of these constraints can be changed without too much work.
But why?
I honestly don't understand what you are trying to optimize here. A few
lines of initialization in execmem_info?
What is the advantage in forcing architectures to have imposed limits on
kprobes or bpf allocations?
> > Where do you see unnecessary complexity?
> >
> > > IOW, we have
> > >
> > > enum execmem_type {
> > > EXECMEM_DEFAULT,
> > > EXECMEM_TEXT,
> > > EXECMEM_KPROBES = EXECMEM_TEXT,
> > > EXECMEM_FTRACE = EXECMEM_TEXT,
> > > EXECMEM_BPF = EXECMEM_TEXT, /* we may end up without
> > > _KPROBE, _FTRACE, _BPF */
> > > EXECMEM_DATA, /* rw */
> > > EXECMEM_RO_DATA,
> > > EXECMEM_RO_AFTER_INIT,
> > > EXECMEM_TYPE_MAX,
> > > };
> > >
> > > Does this make sense?
> >
> > How do you suggest to deal with e.g. riscv that has separate address spaces
> > for modules, kprobes and bpf?
>
> IIUC, modules and bpf use the same address space on riscv
Not exactly, bpf is a subset of modules on riscv.
> while kprobes use vmalloc address.
The whole point of using the entire vmalloc space for kprobes is to avoid
polluting the limited modules space.
> Thanks,
> Song
--
Sincerely yours,
Mike.
On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport <[email protected]> wrote:
>
> On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport <[email protected]> wrote:
> > > > >
> > > > > I'm looking at execmem_types more as definition of the consumers, maybe I
> > > > > should have named the enum execmem_consumer at the first place.
> > > >
> > > > I think looking at execmem_type from consumers' point of view adds
> > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> > > > and bpf (and maybe also module text) all have the same requirements.
> > > > Did I miss something?
> > >
> > > It's enough to have one architecture with different constraints for kprobes
> > > and bpf to warrant a type for each.
> >
> > AFAICT, some of these constraints can be changed without too much work.
>
> But why?
> I honestly don't understand what you are trying to optimize here. A few
> lines of initialization in execmem_info?
IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it
harder for bpf and kprobe to share the same ROX page. In many use cases,
a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and
module text. It is not efficient if we have to allocate separate pages for each
of these use cases. If this is not a problem, the current approach works.
Thanks,
Song
On Thu, Apr 18, 2024 at 02:01:22PM -0700, Song Liu wrote:
> On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport <[email protected]> wrote:
> >
> > On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> > > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport <[email protected]> wrote:
> > > > > >
> > > > > > I'm looking at execmem_types more as definition of the consumers, maybe I
> > > > > > should have named the enum execmem_consumer at the first place.
> > > > >
> > > > > I think looking at execmem_type from consumers' point of view adds
> > > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> > > > > and bpf (and maybe also module text) all have the same requirements.
> > > > > Did I miss something?
> > > >
> > > > It's enough to have one architecture with different constrains for kprobes
> > > > and bpf to warrant a type for each.
> > >
> > > AFAICT, some of these constraints can be changed without too much work.
> >
> > But why?
> > I honestly don't understand what are you trying to optimize here. A few
> > lines of initialization in execmem_info?
>
> IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it
> harder for bpf and kprobe to share the same ROX page. In many use cases,
> a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and
> module text. It is not efficient if we have to allocate separate pages for each
> of these use cases. If this is not a problem, the current approach works.
The caching of large ROX pages does not need to be per type.
In the POC I've posted for caching of large ROX pages on x86 [1], the cache is
global, and to make kprobes and bpf use it, it's enough to set a flag in
execmem_info.
[1] https://lore.kernel.org/all/[email protected]
> Thanks,
> Song
--
Sincerely yours,
Mike.
Hi Masami,
On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote:
> Hi Mike,
>
> On Thu, 11 Apr 2024 19:00:50 +0300
> Mike Rapoport <[email protected]> wrote:
>
> > From: "Mike Rapoport (IBM)" <[email protected]>
> >
> > kprobes depended on CONFIG_MODULES because it has to allocate memory for
> > code.
> >
> > Since code allocations are now implemented with execmem, kprobes can be
> > enabled in non-modular kernels.
> >
> > Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
> > modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> > dependency of CONFIG_KPROBES on CONFIG_MODULES.
>
> Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
> Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in
> function body? We have enough dummy functions for that, so it should
> not make a problem.
The code in check_kprobe_address_safe() that gets the module and checks for
__init functions does not compile with IS_ENABLED(CONFIG_MODULES).
I can pull it out to a helper or leave #ifdef in the function body,
whichever you prefer.
> --
> Masami Hiramatsu
--
Sincerely yours,
Mike.
On Thu, Apr 18, 2024 at 11:56 PM Mike Rapoport <[email protected]> wrote:
>
> On Thu, Apr 18, 2024 at 02:01:22PM -0700, Song Liu wrote:
> > On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport <[email protected]> wrote:
> > >
> > > On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> > > > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport <[email protected]> wrote:
> > > > > > >
> > > > > > > I'm looking at execmem_types more as definition of the consumers, maybe I
> > > > > > > should have named the enum execmem_consumer at the first place.
> > > > > >
> > > > > > I think looking at execmem_type from consumers' point of view adds
> > > > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> > > > > > and bpf (and maybe also module text) all have the same requirements.
> > > > > > Did I miss something?
> > > > >
> > > > > It's enough to have one architecture with different constrains for kprobes
> > > > > and bpf to warrant a type for each.
> > > >
> > > > AFAICT, some of these constraints can be changed without too much work.
> > >
> > > But why?
> > > I honestly don't understand what are you trying to optimize here. A few
> > > lines of initialization in execmem_info?
> >
> > IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it
> > harder for bpf and kprobe to share the same ROX page. In many use cases,
> > a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and
> > module text. It is not efficient if we have to allocate separate pages for each
> > of these use cases. If this is not a problem, the current approach works.
>
> The caching of large ROX pages does not need to be per type.
>
> In the POC I've posted for caching of large ROX pages on x86 [1], the cache is
> global and to make kprobes and bpf use it it's enough to set a flag in
> execmem_info.
>
> [1] https://lore.kernel.org/all/[email protected]
For the ROX to work, we need different users (module text, kprobe, etc.) to have
the same execmem_range. From [1]:
static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
{
	...
	p = __execmem_cache_alloc(size);
	if (p)
		return p;
	err = execmem_cache_populate(range, size);
	...
}
We are calling __execmem_cache_alloc() without range. For this to work,
we can only call execmem_cache_alloc() with one execmem_range.
Did I miss something?
Thanks,
Song
On 19/04/2024 at 17:49, Mike Rapoport wrote:
> Hi Masami,
>
> On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote:
>> Hi Mike,
>>
>> On Thu, 11 Apr 2024 19:00:50 +0300
>> Mike Rapoport <[email protected]> wrote:
>>
>>> From: "Mike Rapoport (IBM)" <[email protected]>
>>>
>>> kprobes depended on CONFIG_MODULES because it has to allocate memory for
>>> code.
>>>
>>> Since code allocations are now implemented with execmem, kprobes can be
>>> enabled in non-modular kernels.
>>>
>>> Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
>>> modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
>>> dependency of CONFIG_KPROBES on CONFIG_MODULES.
>>
>> Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
>> Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in
>> function body? We have enough dummy functions for that, so it should
>> not make a problem.
>
> The code in check_kprobe_address_safe() that gets the module and checks for
> __init functions does not compile with IS_ENABLED(CONFIG_MODULES).
> I can pull it out to a helper or leave #ifdef in the function body,
> whichever you prefer.
As far as I can see, the only problem is MODULE_STATE_COMING.
Can we move 'enum module_state' out of #ifdef CONFIG_MODULES in module.h?
>
>> --
>> Masami Hiramatsu
>
On Fri, Apr 19, 2024 at 08:54:40AM -0700, Song Liu wrote:
> On Thu, Apr 18, 2024 at 11:56 PM Mike Rapoport <[email protected]> wrote:
> >
> > On Thu, Apr 18, 2024 at 02:01:22PM -0700, Song Liu wrote:
> > > On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport <[email protected]> wrote:
> > > >
> > > > On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> > > > > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport <[email protected]> wrote:
> > > > > > > >
> > > > > > > > I'm looking at execmem_types more as definition of the consumers, maybe I
> > > > > > > > should have named the enum execmem_consumer at the first place.
> > > > > > >
> > > > > > > I think looking at execmem_type from consumers' point of view adds
> > > > > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> > > > > > > and bpf (and maybe also module text) all have the same requirements.
> > > > > > > Did I miss something?
> > > > > >
> > > > > > It's enough to have one architecture with different constrains for kprobes
> > > > > > and bpf to warrant a type for each.
> > > > >
> > > > > AFAICT, some of these constraints can be changed without too much work.
> > > >
> > > > But why?
> > > > I honestly don't understand what are you trying to optimize here. A few
> > > > lines of initialization in execmem_info?
> > >
> > > IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it
> > > harder for bpf and kprobe to share the same ROX page. In many use cases,
> > > a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and
> > > module text. It is not efficient if we have to allocate separate pages for each
> > > of these use cases. If this is not a problem, the current approach works.
> >
> > The caching of large ROX pages does not need to be per type.
> >
> > In the POC I've posted for caching of large ROX pages on x86 [1], the cache is
> > global and to make kprobes and bpf use it it's enough to set a flag in
> > execmem_info.
> >
> > [1] https://lore.kernel.org/all/[email protected]
>
> For the ROX to work, we need different users (module text, kprobe, etc.) to have
> the same execmem_range. From [1]:
>
> static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
> {
> ...
> p = __execmem_cache_alloc(size);
> if (p)
> return p;
> err = execmem_cache_populate(range, size);
> ...
> }
>
> We are calling __execmem_cache_alloc() without range. For this to work,
> we can only call execmem_cache_alloc() with one execmem_range.
Actually, on x86 this will "just work" because everything shares the same
address space :)
The 2M pages in the cache will be in the modules space, so
__execmem_cache_alloc() will always return memory from that address space.
For other architectures this indeed needs to be fixed by passing the
range to __execmem_cache_alloc() and limiting the search in the cache to
that range.
> Did I miss something?
>
> Thanks,
> Song
--
Sincerely yours,
Mike.
On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport <[email protected]> wrote:
[...]
> > >
> > > [1] https://lore.kernel.org/all/[email protected]
> >
> > For the ROX to work, we need different users (module text, kprobe, etc.) to have
> > the same execmem_range. From [1]:
> >
> > static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
> > {
> > ...
> > p = __execmem_cache_alloc(size);
> > if (p)
> > return p;
> > err = execmem_cache_populate(range, size);
> > ...
> > }
> >
> > We are calling __execmem_cache_alloc() without range. For this to work,
> > we can only call execmem_cache_alloc() with one execmem_range.
>
> Actually, on x86 this will "just work" because everything shares the same
> address space :)
>
> The 2M pages in the cache will be in the modules space, so
> __execmem_cache_alloc() will always return memory from that address space.
>
> For other architectures this indeed needs to be fixed with passing the
> range to __execmem_cache_alloc() and limiting search in the cache for that
> range.
I think we at least need the "map to" concept (initially proposed by Thomas)
to get this to work. For example, EXECMEM_BPF and EXECMEM_KPROBE
map to EXECMEM_MODULE_TEXT, so that all of these actually share
the same range.
Does this make sense?
Song
On Fri, Apr 19, 2024 at 10:32:39AM -0700, Song Liu wrote:
> On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport <[email protected]> wrote:
> [...]
> > > >
> > > > [1] https://lore.kernel.org/all/[email protected]
> > >
> > > For the ROX to work, we need different users (module text, kprobe, etc.) to have
> > > the same execmem_range. From [1]:
> > >
> > > static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
> > > {
> > > ...
> > > p = __execmem_cache_alloc(size);
> > > if (p)
> > > return p;
> > > err = execmem_cache_populate(range, size);
> > > ...
> > > }
> > >
> > > We are calling __execmem_cache_alloc() without range. For this to work,
> > > we can only call execmem_cache_alloc() with one execmem_range.
> >
> > Actually, on x86 this will "just work" because everything shares the same
> > address space :)
> >
> > The 2M pages in the cache will be in the modules space, so
> > __execmem_cache_alloc() will always return memory from that address space.
> >
> > For other architectures this indeed needs to be fixed with passing the
> > range to __execmem_cache_alloc() and limiting search in the cache for that
> > range.
>
> I think we at least need the "map to" concept (initially proposed by Thomas)
> to get this work. For example, EXECMEM_BPF and EXECMEM_KPROBE
> maps to EXECMEM_MODULE_TEXT, so that all these actually share
> the same range.
Why?
> Does this make sense?
>
> Song
--
Sincerely yours,
Mike.
On Fri, Apr 19, 2024 at 1:00 PM Mike Rapoport <[email protected]> wrote:
>
> On Fri, Apr 19, 2024 at 10:32:39AM -0700, Song Liu wrote:
> > On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport <[email protected]> wrote:
> > [...]
> > > > >
> > > > > [1] https://lore.kernel.org/all/[email protected]
> > > >
> > > > For the ROX to work, we need different users (module text, kprobe, etc.) to have
> > > > the same execmem_range. From [1]:
> > > >
> > > > static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
> > > > {
> > > > ...
> > > > p = __execmem_cache_alloc(size);
> > > > if (p)
> > > > return p;
> > > > err = execmem_cache_populate(range, size);
> > > > ...
> > > > }
> > > >
> > > > We are calling __execmem_cache_alloc() without range. For this to work,
> > > > we can only call execmem_cache_alloc() with one execmem_range.
> > >
> > > Actually, on x86 this will "just work" because everything shares the same
> > > address space :)
> > >
> > > The 2M pages in the cache will be in the modules space, so
> > > __execmem_cache_alloc() will always return memory from that address space.
> > >
> > > For other architectures this indeed needs to be fixed with passing the
> > > range to __execmem_cache_alloc() and limiting search in the cache for that
> > > range.
> >
> > I think we at least need the "map to" concept (initially proposed by Thomas)
> > to get this work. For example, EXECMEM_BPF and EXECMEM_KPROBE
> > maps to EXECMEM_MODULE_TEXT, so that all these actually share
> > the same range.
>
> Why?
IIUC, we need to update __execmem_cache_alloc() to take a range pointer as
input. Module text will use the "range" for EXECMEM_MODULE_TEXT, while kprobes
will use the "range" for EXECMEM_KPROBE. Without the "map to" concept or sharing
the "range" object, we will have to compare different range parameters to check
whether we can share cached pages between module text and kprobes, which is not
efficient. Did I miss something?
Thanks,
Song
On Fri, Apr 19, 2024 at 02:42:16PM -0700, Song Liu wrote:
> On Fri, Apr 19, 2024 at 1:00 PM Mike Rapoport <[email protected]> wrote:
> >
> > On Fri, Apr 19, 2024 at 10:32:39AM -0700, Song Liu wrote:
> > > On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport <[email protected]> wrote:
> > > [...]
> > > > > >
> > > > > > [1] https://lore.kernel.org/all/[email protected]
> > > > >
> > > > > For the ROX to work, we need different users (module text, kprobe, etc.) to have
> > > > > the same execmem_range. From [1]:
> > > > >
> > > > > static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
> > > > > {
> > > > > ...
> > > > > p = __execmem_cache_alloc(size);
> > > > > if (p)
> > > > > return p;
> > > > > err = execmem_cache_populate(range, size);
> > > > > ...
> > > > > }
> > > > >
> > > > > We are calling __execmem_cache_alloc() without range. For this to work,
> > > > > we can only call execmem_cache_alloc() with one execmem_range.
> > > >
> > > > Actually, on x86 this will "just work" because everything shares the same
> > > > address space :)
> > > >
> > > > The 2M pages in the cache will be in the modules space, so
> > > > __execmem_cache_alloc() will always return memory from that address space.
> > > >
> > > > For other architectures this indeed needs to be fixed with passing the
> > > > range to __execmem_cache_alloc() and limiting search in the cache for that
> > > > range.
> > >
> > > I think we at least need the "map to" concept (initially proposed by Thomas)
> > > to get this work. For example, EXECMEM_BPF and EXECMEM_KPROBE
> > > maps to EXECMEM_MODULE_TEXT, so that all these actually share
> > > the same range.
> >
> > Why?
>
> IIUC, we need to update __execmem_cache_alloc() to take a range pointer as
> input. module text will use "range" for EXECMEM_MODULE_TEXT, while kprobe
> will use "range" for EXECMEM_KPROBE. Without "map to" concept or sharing
> the "range" object, we will have to compare different range parameters to check
> we can share cached pages between module text and kprobe, which is not
> efficient. Did I miss something?
We can always share large ROX pages as long as they are within the correct
address space. The permissions for them are ROX, and the alignment
differences are due to KASAN, which is handled when a large page is
allocated to refill the cache. __execmem_cache_alloc() only needs to limit
the search to the address space of the range.
And regardless, the way we deal with sharing of the cache can be sorted
out later.
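
A minimal sketch of what "limit the search to the address space of the range" could look like, under loud assumptions: the cache is modelled as a plain array of free areas, and every name below is hypothetical rather than taken from the POC in [1].

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the kernel's definitions. */
struct range_sketch {
	uintptr_t start;
	uintptr_t end;	/* exclusive */
};

struct free_area_sketch {
	uintptr_t addr;
	size_t size;
};

/*
 * Return the first cached free area that can hold 'size' bytes and whose
 * allocation would lie inside 'range', or NULL if there is none. Consumers
 * with different ranges can then share one global cache.
 */
static const struct free_area_sketch *
cache_find_sketch(const struct free_area_sketch *areas, size_t nr,
		  const struct range_sketch *range, size_t size)
{
	for (size_t i = 0; i < nr; i++) {
		const struct free_area_sketch *a = &areas[i];

		if (a->size < size)
			continue;	/* too small */
		if (a->addr < range->start || a->addr + size > range->end)
			continue;	/* outside the caller's address space */
		return a;
	}
	return NULL;
}
```

On a NULL result the caller would fall back to populating the cache for its own range, as execmem_cache_alloc() does in the quoted snippet.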
> Thanks,
> Song
--
Sincerely yours,
Mike.
On Fri, Apr 19, 2024 at 03:59:40PM +0000, Christophe Leroy wrote:
>
>
> On 19/04/2024 at 17:49, Mike Rapoport wrote:
> > Hi Masami,
> >
> > On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote:
> >> Hi Mike,
> >>
> >> On Thu, 11 Apr 2024 19:00:50 +0300
> >> Mike Rapoport <[email protected]> wrote:
> >>
> >>> From: "Mike Rapoport (IBM)" <[email protected]>
> >>>
> >>> kprobes depended on CONFIG_MODULES because it has to allocate memory for
> >>> code.
> >>>
> >>> Since code allocations are now implemented with execmem, kprobes can be
> >>> enabled in non-modular kernels.
> >>>
> >>> Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
> >>> modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> >>> dependency of CONFIG_KPROBES on CONFIG_MODULES.
> >>
> >> Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
> >> Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in
> >> function body? We have enough dummy functions for that, so it should
> >> not make a problem.
> >
> > The code in check_kprobe_address_safe() that gets the module and checks for
> > __init functions does not compile with IS_ENABLED(CONFIG_MODULES).
> > I can pull it out to a helper or leave #ifdef in the function body,
> > whichever you prefer.
>
> As far as I can see, the only problem is MODULE_STATE_COMING.
> Can we move 'enum module_state' out of #ifdef CONFIG_MODULES in module.h ?
There's a dereference of 'struct module' there:
		(*probed_mod)->state != MODULE_STATE_COMING) {
			...
		}
so moving out 'enum module_state' won't be enough.
> >
> >> --
> >> Masami Hiramatsu
> >
--
Sincerely yours,
Mike.
On Sat, 20 Apr 2024 07:22:50 +0300
Mike Rapoport <[email protected]> wrote:
> On Fri, Apr 19, 2024 at 02:42:16PM -0700, Song Liu wrote:
> > On Fri, Apr 19, 2024 at 1:00 PM Mike Rapoport <[email protected]> wrote:
> > >
> > > On Fri, Apr 19, 2024 at 10:32:39AM -0700, Song Liu wrote:
> > > > On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport <[email protected]> wrote:
> > > > [...]
> > > > > > >
> > > > > > > [1] https://lore.kernel.org/all/[email protected]
> > > > > >
> > > > > > For the ROX to work, we need different users (module text, kprobe, etc.) to have
> > > > > > the same execmem_range. From [1]:
> > > > > >
> > > > > > static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
> > > > > > {
> > > > > > ...
> > > > > > p = __execmem_cache_alloc(size);
> > > > > > if (p)
> > > > > > return p;
> > > > > > err = execmem_cache_populate(range, size);
> > > > > > ...
> > > > > > }
> > > > > >
> > > > > > We are calling __execmem_cache_alloc() without range. For this to work,
> > > > > > we can only call execmem_cache_alloc() with one execmem_range.
> > > > >
> > > > > Actually, on x86 this will "just work" because everything shares the same
> > > > > address space :)
> > > > >
> > > > > The 2M pages in the cache will be in the modules space, so
> > > > > __execmem_cache_alloc() will always return memory from that address space.
> > > > >
> > > > > For other architectures this indeed needs to be fixed with passing the
> > > > > range to __execmem_cache_alloc() and limiting search in the cache for that
> > > > > range.
> > > >
> > > > I think we at least need the "map to" concept (initially proposed by Thomas)
> > > > to get this work. For example, EXECMEM_BPF and EXECMEM_KPROBE
> > > > maps to EXECMEM_MODULE_TEXT, so that all these actually share
> > > > the same range.
> > >
> > > Why?
> >
> > IIUC, we need to update __execmem_cache_alloc() to take a range pointer as
> > input. module text will use "range" for EXECMEM_MODULE_TEXT, while kprobe
> > will use "range" for EXECMEM_KPROBE. Without "map to" concept or sharing
> > the "range" object, we will have to compare different range parameters to check
> > we can share cached pages between module text and kprobe, which is not
> > efficient. Did I miss something?
Song, thanks for trying to explain. I think I need to explain why I used
module_alloc() originally.
This depends on how kprobe features are implemented on the architecture, and
how many features are supported on kprobes.
Kprobe jump optimization and kprobe jump-back optimization need to use a
jump instruction to jump into the trampoline and to jump back from the
trampoline directly. If the architecture's jmp instruction supports only a
+-2GB range, as on x86, the trampoline buffer must be allocated inside that
address space. This requirement is similar to that of modules (because a
module function needs to call other functions in the kernel etc.), so at
least kprobes on x86 used module_alloc().
However, if an architecture only supports breakpoint/trap based kprobes,
it does not need to consider where the execmem is allocated.
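
The +-2GB constraint can be illustrated with a small sketch. It assumes the 5-byte x86 "jmp rel32" encoding, where the 32-bit displacement is relative to the end of the instruction; the function name is hypothetical and this is not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: can a 5-byte "jmp rel32" placed at 'from' reach
 * 'to'? The displacement is measured from the end of the instruction
 * and must fit in a signed 32-bit value.
 */
static bool rel32_reachable(uint64_t from, uint64_t to)
{
	int64_t disp = (int64_t)(to - (from + 5));

	return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

A trampoline allocated anywhere in a large vmalloc area may fail this check for a given probe address, which is why the optimized case wants memory near kernel text.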
>
> We can always share large ROX pages as long as they are within the correct
> address space. The permissions for them are ROX and the alignment
> differences are due to KASAN and this is handled during allocation of the
> large page to refill the cache. __execmem_cache_alloc() only needs to limit
> the search for the address space of the range.
So I don't think EXECMEM_KPROBE is always the same as EXECMEM_MODULE_TEXT; it
should be configured for each arch. In particular, if it is only used as a
search parameter, it looks OK to me.
Thank you,
>
> And regardless, they way we deal with sharing of the cache can be sorted
> out later.
>
> > Thanks,
> > Song
>
> --
> Sincerely yours,
> Mike.
>
--
Masami Hiramatsu (Google) <[email protected]>
On Sat, 20 Apr 2024 10:33:38 +0300
Mike Rapoport <[email protected]> wrote:
> On Fri, Apr 19, 2024 at 03:59:40PM +0000, Christophe Leroy wrote:
> >
> >
> > On 19/04/2024 at 17:49, Mike Rapoport wrote:
> > > Hi Masami,
> > >
> > > On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote:
> > >> Hi Mike,
> > >>
> > >> On Thu, 11 Apr 2024 19:00:50 +0300
> > >> Mike Rapoport <[email protected]> wrote:
> > >>
> > >>> From: "Mike Rapoport (IBM)" <[email protected]>
> > >>>
> > >>> kprobes depended on CONFIG_MODULES because it has to allocate memory for
> > >>> code.
> > >>>
> > >>> Since code allocations are now implemented with execmem, kprobes can be
> > >>> enabled in non-modular kernels.
> > >>>
> > >>> Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
> > >>> modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> > >>> dependency of CONFIG_KPROBES on CONFIG_MODULES.
> > >>
> > >> Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
> > >> Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in
> > >> function body? We have enough dummy functions for that, so it should
> > >> not make a problem.
> > >
> > > The code in check_kprobe_address_safe() that gets the module and checks for
> > > __init functions does not compile with IS_ENABLED(CONFIG_MODULES).
> > > I can pull it out to a helper or leave #ifdef in the function body,
> > > whichever you prefer.
> >
> > As far as I can see, the only problem is MODULE_STATE_COMING.
> > Can we move 'enum module_state' out of #ifdef CONFIG_MODULES in module.h ?
>
> There's dereference of 'struct module' there:
>
> (*probed_mod)->state != MODULE_STATE_COMING) {
> ...
> }
>
> so moving out 'enum module_state' won't be enough.
Hmm, this part should be inline functions like:
#ifdef CONFIG_MODULES
static inline bool module_is_coming(struct module *mod)
{
	return mod->state == MODULE_STATE_COMING;
}
#else
#define module_is_coming(mod) (false)
#endif
Then we don't need the enum.
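
A self-contained sketch of how such a helper lets the caller avoid #ifdef in the function body. CONFIG_MODULES_SKETCH, the enum, the struct layout and probe_allowed_sketch() are hypothetical stand-ins; the real check lives in check_kprobe_address_safe() and is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

struct module;	/* opaque when modules are compiled out */

#ifdef CONFIG_MODULES_SKETCH
enum module_state_sketch { STATE_LIVE_SKETCH, STATE_COMING_SKETCH };

struct module {
	enum module_state_sketch state;
};

static inline bool module_is_coming(struct module *mod)
{
	return mod->state == STATE_COMING_SKETCH;
}
#else
/* Stub: nothing is dereferenced, so 'struct module' can stay incomplete. */
static inline bool module_is_coming(struct module *mod)
{
	(void)mod;
	return false;
}
#endif

/*
 * Caller without #ifdef: a probe on module init text is allowed only
 * while the module is still coming up.
 */
static bool probe_allowed_sketch(struct module *mod, bool in_init_text)
{
	return !(in_init_text && !module_is_coming(mod));
}
```

Because the stub never touches mod->state, the caller compiles in both configurations.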
Thank you,
>
> > >
> > >> --
> > >> Masami Hiramatsu
> > >
>
> --
> Sincerely yours,
> Mike.
>
--
Masami Hiramatsu (Google) <[email protected]>
On Sat, Apr 20, 2024 at 06:15:00PM +0900, Masami Hiramatsu wrote:
> On Sat, 20 Apr 2024 10:33:38 +0300
> Mike Rapoport <[email protected]> wrote:
>
> > On Fri, Apr 19, 2024 at 03:59:40PM +0000, Christophe Leroy wrote:
> > >
> > >
> > > On 19/04/2024 at 17:49, Mike Rapoport wrote:
> > > > Hi Masami,
> > > >
> > > > On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote:
> > > >> Hi Mike,
> > > >>
> > > >> On Thu, 11 Apr 2024 19:00:50 +0300
> > > >> Mike Rapoport <[email protected]> wrote:
> > > >>
> > > >>> From: "Mike Rapoport (IBM)" <[email protected]>
> > > >>>
> > > >>> kprobes depended on CONFIG_MODULES because it has to allocate memory for
> > > >>> code.
> > > >>>
> > > >>> Since code allocations are now implemented with execmem, kprobes can be
> > > >>> enabled in non-modular kernels.
> > > >>>
> > > >>> Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
> > > >>> modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> > > >>> dependency of CONFIG_KPROBES on CONFIG_MODULES.
> > > >>
> > > >> Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
> > > >> Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in
> > > >> function body? We have enough dummy functions for that, so it should
> > > >> not make a problem.
> > > >
> > > > The code in check_kprobe_address_safe() that gets the module and checks for
> > > > __init functions does not compile with IS_ENABLED(CONFIG_MODULES).
> > > > I can pull it out to a helper or leave #ifdef in the function body,
> > > > whichever you prefer.
> > >
> > > As far as I can see, the only problem is MODULE_STATE_COMING.
> > > Can we move 'enum module_state' out of #ifdef CONFIG_MODULES in module.h ?
> >
> > There's dereference of 'struct module' there:
> >
> > (*probed_mod)->state != MODULE_STATE_COMING) {
> > ...
> > }
> >
> > so moving out 'enum module_state' won't be enough.
>
> Hmm, this part should be inline functions like;
>
> #ifdef CONFIG_MODULES
> static inline bool module_is_coming(struct module *mod)
> {
> return mod->state == MODULE_STATE_COMING;
> }
> #else
> #define module_is_coming(mod) (false)
I'd prefer
static inline bool module_is_coming(struct module *mod)
{
	return false;
}
> #endif
>
> Then we don't need the enum.
> Thank you,
>
> --
> Masami Hiramatsu (Google) <[email protected]>
--
Sincerely yours,
Mike.
Hi Masami and Mike,
On Sat, Apr 20, 2024 at 2:11 AM Masami Hiramatsu <[email protected]> wrote:
[...]
> > >
> > > IIUC, we need to update __execmem_cache_alloc() to take a range pointer as
> > > input. module text will use "range" for EXECMEM_MODULE_TEXT, while kprobe
> > > will use "range" for EXECMEM_KPROBE. Without "map to" concept or sharing
> > > the "range" object, we will have to compare different range parameters to check
> > > we can share cached pages between module text and kprobe, which is not
> > > efficient. Did I miss something?
>
> Song, thanks for trying to eplain. I think I need to explain why I used
> module_alloc() originally.
>
> This depends on how kprobe features are implemented on the architecture, and
> how much features are supported on kprobes.
>
> Because kprobe jump optimization and kprobe jump-back optimization need to
> use a jump instruction to jump into the trampoline and jump back from the
> trampoline directly, if the architecuture jmp instruction supports +-2GB range
> like x86, it needs to allocate the trampoline buffer inside such address space.
> This requirement is similar to the modules (because module function needs to
> call other functions in the kernel etc.), at least kprobes on x86 used
> module_alloc().
>
> However, if an architecture only supports breakpoint/trap based kprobe,
> it does not need to consider whether the execmem is allocated.
>
> >
> > We can always share large ROX pages as long as they are within the correct
> > address space. The permissions for them are ROX and the alignment
> > differences are due to KASAN and this is handled during allocation of the
> > large page to refill the cache. __execmem_cache_alloc() only needs to limit
> > the search for the address space of the range.
>
> So I don't think EXECMEM_KPROBE always same as EXECMEM_MODULE_TEXT, it
> should be configured for each arch. Especially, if it is only used for
> searching parameter, it looks OK to me.
Thanks for the explanation!
I was thinking "we can have EXECMEM_KPROBE share the same parameters as
EXECMEM_MODULE_TEXT for all architectures". But this thought is built on top
of assumptions about future changes/improvements within multiple subsystems.
At this moment, I have no objections to moving forward with the current execmem APIs.
Thanks,
Song