2018-11-20 23:36:33

by Edgecombe, Rick P

Subject: [PATCH v9 RESEND 0/4] KASLR feature to randomize each loadable module

Resending this because I missed Jessica in the "to" list. Also removing the part
of this cover letter that talked about KPTI helping with some local kernel text
de-randomizing methods, because I'm not sure I fully understand it.

------------------------------------------------------------

This is V9 of the "KASLR feature to randomize each loadable module" patchset.
The purpose is to increase the randomization for the module space from 10 to 17
bits, and also to randomize the modules in relation to each other instead of
just the address where the allocations begin, so that if one module's location
leaks, the locations of the others can't be inferred.

Why it's useful
===============
Randomizing the location of executable code is a defense against control flow
attacks, where the kernel is tricked into jumping, or speculatively executing
code other than what is intended. By randomizing the location of the code, the
attacker doesn't know where to redirect the control flow.

Today the RANDOMIZE_BASE feature randomizes the base address where the module
allocations begin, with 10 bits of entropy. From there, a highly deterministic
algorithm allocates space for the modules as they are loaded and unloaded. If
an attacker can predict the order and identities of the modules that will be
loaded (either by the system, or controlled by the user with request_module or
BPF), then a single text address leak can give the attacker the locations of
the other modules. In this case the new algorithm takes the entropy of the
other modules from ~0 bits to 17 bits, making it much more robust.

Another problem today is that the low 10 bits of entropy make brute force
attacks feasible, especially in the case of speculative execution, where a
wrong guess won't necessarily cause a crash. Here, increasing the
randomization forces attacks to take longer, and so increases the time during
which an attacker may be detected on a system.

There are multiple efforts to apply more randomization to the core kernel text
as well, and so this module space piece can be a first step to increasing
randomization for all kernel space executable code.

Userspace ASLR can get 28 bits of entropy or more, so at least increasing this
to 17 for now improves what is currently a pretty low amount of randomization
for the higher privileged kernel space.

How it works
============
The algorithm is pretty simple. It breaks the module space in two: a random
area (2/3 of the module space) and a backup area (1/3 of the module space). It
first tries an allocation at up to 10000 randomly located starting pages inside
the random area. If this fails, it allocates in the backup area. The backup
area base is offset in the same way as the current algorithm offsets its base,
which has 10 bits of entropy.
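
Schematically, the two-phase allocation looks something like the following (a
hypothetical Python sketch, not the kernel code; try_alloc stands in for the
vmalloc attempt, and the fallback just returns the randomized backup base):

```python
import random

MODULE_SPACE = 1 << 30                  # 1 GB x86_64 module space
PAGE = 4096
# PAGE_ALIGN(2/3 of the space), mirroring MODULES_RAND_LEN in the patch
RAND_AREA = -(-(MODULE_SPACE // 3 * 2) // PAGE) * PAGE
TRIES = 10000

def alloc_module(size, try_alloc):
    """try_alloc(addr, size) -> bool is a stand-in for the vmalloc attempt."""
    # Phase 1: try up to 10000 random page-aligned starts in the random area.
    for _ in range(TRIES):
        addr = random.randrange(RAND_AREA // PAGE) * PAGE
        if try_alloc(addr, size):
            return addr
    # Phase 2: fall back to the backup area; its base gets one of 1024
    # page-aligned offsets (10 bits), like the existing base randomization.
    return RAND_AREA + (random.randrange(1024) + 1) * PAGE
```

The real code of course has to walk the backup area for free space; the sketch
only shows where the entropy comes from in each phase.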

The vmalloc allocator can be used to try an allocation at a specific address.
However, it is usually used to try an allocation over a large address range,
and so some behaviors that are non-issues in normal usage can be sub-optimal
when trying an allocation at each of 10000 small ranges. So this patchset also
includes a new vmalloc function, __vmalloc_node_try_addr, and some other
vmalloc tweaks that allow addresses to be tried more efficiently.

This algorithm targets maintaining high entropy for many thousands of module
allocations, because there are other users of the module space besides kernel
modules, such as the eBPF JIT, the classic BPF socket filter JIT and kprobes.

Performance
===========
Simulations were run using module sizes derived from the x86_64 modules to
measure the allocation performance at various levels of fragmentation and
whether the backup area was used.

Capacity
--------
There is a slight reduction in capacity, as simulated with x86_64 module
sizes, for allocation counts below 1000. Note this is a worst case, since in
practice module allocations numbering in the thousands will consist mostly of
smaller BPF JIT allocations or kprobes, which fit better in the random area.

Allocation time
---------------
Below are three sets of measurements, in ns, of the allocation time as
measured by the included kselftests. The first two columns show this new
algorithm with and without the vmalloc optimizations for trying random
addresses quickly; they are included for consideration of whether the changes
are worth it. The last column is the performance of the original algorithm.

Modules   Vmalloc optimization   No Vmalloc Optimization   Existing Module KASLR
1000      1433                   1993                      3821
2000      2295                   3681                      7830
3000      4424                   7450                      13012
4000      7746                   13824                     18106
5000      12721                  21852                     22572
6000      19724                  33926                     26443
7000      27638                  47427                     30473
8000      37745                  64443                     34200

These allocations do not take very long, but the cost may show up on systems
with very high usage of the module space (BPF JITs). If the trade-off of
touching vmalloc doesn't seem worth it to people, I can remove the
optimizations.

Randomness
----------
Unlike the existing algorithm, the amount of randomness provided depends on
the number of modules allocated and the sizes of the modules' text sections.
The entropy provided for the Nth allocation comes from three sources of
randomness: the range of addresses in the random area, the probability that
the section will be allocated in the backup area, and randomness from the
number of modules already allocated in the backup area. For computing a
lower-bound entropy in the following calculations, the randomness of the
modules already in the backup area, or overlapping from the random area, is
ignored, since it is usually small and would only increase the entropy. Below
is an attempt to compute a worst-case entropy value to compare against the
existing algorithm.

For p, the probability of the Nth allocation being in the backup area, a
lower-bound entropy estimate is calculated here as:

Random Area Slots = ((2/3)*1073741824)/4096 = 174762

Entropy = -( (1-p)*log2((1-p)/174762) + p*log2(p/1024) )

For >8000 modules the entropy remains above 17.3. For non-speculative control
flow attacks, a wrong guess might crash the system, so the probability of the
first guess being right can be more important than that of the Nth guess.
KASLR schemes usually give every possible position equal probability, but in
this scheme that is not the case. So a more conservative comparison to
existing schemes is the amount of information that would have to be guessed
correctly for the position with the highest probability of holding the Nth
module (as that would be the attacker's best guess):

Min Info = MIN(-log2(p/1024), -log2((1-p)/174762))

Allocations   Entropy
1000          17.4
2000          17.4
3000          17.4
4000          16.8
5000          15.8
6000          14.9
7000          14.8
8000          14.2
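
The formulas above are easy to sanity-check (a rough Python helper; the backup
probability p for each allocation count comes from the simulations and is
assumed here, not derived):

```python
from math import log2

RAND_SLOTS = ((2 / 3) * 1073741824) / 4096   # ~174762 random-area positions
BACKUP_SLOTS = 1024                          # 10 bits of backup base offsets

def entropy(p):
    """Lower-bound entropy estimate for backup-area probability p."""
    if p == 0:
        return log2(RAND_SLOTS)              # all mass in the random area
    return -((1 - p) * log2((1 - p) / RAND_SLOTS) + p * log2(p / BACKUP_SLOTS))

def min_info(p):
    """Bits needed to guess the single most likely position (conservative)."""
    if p == 0:
        return log2(RAND_SLOTS)
    return min(-log2(p / BACKUP_SLOTS), -log2((1 - p) / RAND_SLOTS))
```

With p near zero both measures sit at about log2(174762) ~= 17.4 bits,
matching the top of the table; as p grows, min_info becomes dominated by the
-log2(p/1024) backup term.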

If anyone is keeping track, these numbers differ from those reported in V2,
because they are generated using the more compact allocation-size heuristic
included in the kselftest rather than the real, much larger dataset. The
heuristic generates randomization benchmarks that are slightly lower than the
real dataset. The real dataset also isn't representative of the case of mostly
smaller BPF filters, so it represents a worst-case lower bound for entropy,
and in practice 17+ bits should be maintained to a much higher number of
modules.

PTE usage
---------
Since the allocations are spread out over a wider address space, there is
increased page table (PTE) usage, which should not exceed 1.3MB more than with
the old algorithm.
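
The 1.3MB figure is consistent with a back-of-the-envelope estimate (assuming,
as on x86_64, a 1GB module space where each 4KB page of PTEs maps 2MB): in the
worst case every 2MB region of the 2/3 random area needs its own PTE page:

```python
MODULE_SPACE = 1 << 30            # 1 GB x86_64 module space
RAND_AREA = (MODULE_SPACE // 3) * 2
PTE_PAGE_COVERAGE = 2 << 20       # one 4 KB page of PTEs maps 2 MB
PAGE = 4096

extra_pte_pages = -(-RAND_AREA // PTE_PAGE_COVERAGE)   # ceiling division
extra_mb = extra_pte_pages * PAGE / (1 << 20)
print(f"~{extra_mb:.1f} MB of extra page-table pages")  # ~1.3 MB
```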


Changes for V9:
- Better explanations in commit messages, instructions in kselftests (Andrew
Morton)

Changes for V8:
- Simplify code by removing logic for optimum handling of lazy free areas

Changes for V7:
- More 0-day build fixes, readability improvements (Kees Cook)

Changes for V6:
- 0-day build fixes by removing un-needed functional testing, more error
handling

Changes for V5:
- Add module_alloc test module

Changes for V4:
- Fix issue caused by KASAN, kmemleak being provided different allocation
lengths (padding).
- Avoid kmalloc until it's sure to be needed in __vmalloc_node_try_addr.
- Fixed issues reported by 0-day.

Changes for V3:
- Code cleanup based on internal feedback. (thanks to Dave Hansen and Andriy
Shevchenko)
- Slight refactor of existing algorithm to more cleanly live along side new
one.
- BPF synthetic benchmark

Changes for V2:
- New implementation of __vmalloc_node_try_addr based on the
__vmalloc_node_range implementation, that only flushes TLB when needed.
- Modified module loading algorithm to try to reduce the TLB flushes further.
- Increase "random area" tries in order to increase the number of modules that
can get high randomness.
- Increase "random area" size to 2/3 of module area in order to increase the
number of modules that can get high randomness.
- Fix for 0day failures on other architectures.
- Fix for wrong debugfs permissions. (thanks to Jann Horn)
- Spelling fix. (thanks to Jann Horn)
- Data on module_alloc performance and TLB flushes. (brought up by Kees Cook
and Jann Horn)
- Data on memory usage. (suggested by Jann)


Rick Edgecombe (4):
vmalloc: Add __vmalloc_node_try_addr function
x86/modules: Increase randomization for modules
vmalloc: Add debugfs modfraginfo
Kselftest for module text allocation benchmarking

arch/x86/Kconfig | 3 +
arch/x86/include/asm/kaslr_modules.h | 38 ++
arch/x86/include/asm/pgtable_64_types.h | 7 +
arch/x86/kernel/module.c | 111 ++++--
include/linux/vmalloc.h | 3 +
lib/Kconfig.debug | 9 +
lib/Makefile | 1 +
lib/test_mod_alloc.c | 375 ++++++++++++++++++
mm/vmalloc.c | 228 +++++++++--
tools/testing/selftests/bpf/test_mod_alloc.sh | 29 ++
10 files changed, 743 insertions(+), 61 deletions(-)
create mode 100644 arch/x86/include/asm/kaslr_modules.h
create mode 100644 lib/test_mod_alloc.c
create mode 100755 tools/testing/selftests/bpf/test_mod_alloc.sh

--
2.17.1



2018-11-20 23:23:03

by Edgecombe, Rick P

Subject: [PATCH v9 RESEND 2/4] x86/modules: Increase randomization for modules

This changes the behavior of the KASLR logic for allocating memory for the text
sections of loadable modules. It randomizes the location of each module text
section with about 17 bits of entropy in typical use. This is enabled on X86_64
only. For 32 bit, the behavior is unchanged.

It refactors the existing code around module randomization somewhat. There are
now three different behaviors for x86 module_alloc depending on config:
RANDOMIZE_BASE=n; RANDOMIZE_BASE=y with ARCH=x86_64; and RANDOMIZE_BASE=y with
ARCH=i386. The refactor of the existing code tries to clearly show what those
behaviors are, without having three separate versions or threading the
behaviors through a bunch of little spots. The reason it is not enabled on 32
bit yet is that the module space is much smaller there and simulations haven't
been run to see how it performs.

The new algorithm breaks the module space in two: a random area and a backup
area. It first tries to allocate at a number of randomly located starting
pages inside the random area. If this fails, it allocates in the backup area.
The backup area base is offset in the same way as the current algorithm
offsets the base area, with 1024 possible locations.

Because boot_params is defined with different types in different places,
placing the config helpers in modules.h or kaslr.h caused conflicts elsewhere,
so they are placed in a new file, kaslr_modules.h, instead.

Signed-off-by: Rick Edgecombe <[email protected]>
---
arch/x86/Kconfig | 3 +
arch/x86/include/asm/kaslr_modules.h | 38 ++++++++
arch/x86/include/asm/pgtable_64_types.h | 7 ++
arch/x86/kernel/module.c | 111 +++++++++++++++++++-----
4 files changed, 136 insertions(+), 23 deletions(-)
create mode 100644 arch/x86/include/asm/kaslr_modules.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ba7e3464ee92..db93cde0528a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2144,6 +2144,9 @@ config RANDOMIZE_BASE

If unsure, say Y.

+config RANDOMIZE_FINE_MODULE
+ def_bool y if RANDOMIZE_BASE && X86_64 && !CONFIG_UML
+
# Relocation on x86 needs some additional build support
config X86_NEED_RELOCS
def_bool y
diff --git a/arch/x86/include/asm/kaslr_modules.h b/arch/x86/include/asm/kaslr_modules.h
new file mode 100644
index 000000000000..1da6eced4b47
--- /dev/null
+++ b/arch/x86/include/asm/kaslr_modules.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_KASLR_MODULES_H_
+#define _ASM_KASLR_MODULES_H_
+
+#ifdef CONFIG_RANDOMIZE_BASE
+/* kaslr_enabled is not always defined */
+static inline int kaslr_mod_randomize_base(void)
+{
+ return kaslr_enabled();
+}
+#else
+static inline int kaslr_mod_randomize_base(void)
+{
+ return 0;
+}
+#endif /* CONFIG_RANDOMIZE_BASE */
+
+#ifdef CONFIG_RANDOMIZE_FINE_MODULE
+/* kaslr_enabled is not always defined */
+static inline int kaslr_mod_randomize_each_module(void)
+{
+ return kaslr_enabled();
+}
+
+static inline unsigned long get_modules_rand_len(void)
+{
+ return MODULES_RAND_LEN;
+}
+#else
+static inline int kaslr_mod_randomize_each_module(void)
+{
+ return 0;
+}
+
+unsigned long get_modules_rand_len(void);
+#endif /* CONFIG_RANDOMIZE_FINE_MODULE */
+
+#endif
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 04edd2d58211..5e26369ab86c 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -143,6 +143,13 @@ extern unsigned int ptrs_per_p4d;
#define MODULES_END _AC(0xffffffffff000000, UL)
#define MODULES_LEN (MODULES_END - MODULES_VADDR)

+/*
+ * Dedicate the first part of the module space to a randomized area when KASLR
+ * is in use. Leave the remaining part for a fallback if we are unable to
+ * allocate in the random area.
+ */
+#define MODULES_RAND_LEN PAGE_ALIGN((MODULES_LEN/3)*2)
+
#define ESPFIX_PGD_ENTRY _AC(-2, UL)
#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << P4D_SHIFT)

diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b052e883dd8c..35cb912ed1f8 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -36,6 +36,7 @@
#include <asm/pgtable.h>
#include <asm/setup.h>
#include <asm/unwind.h>
+#include <asm/kaslr_modules.h>

#if 0
#define DEBUGP(fmt, ...) \
@@ -48,34 +49,96 @@ do { \
} while (0)
#endif

-#ifdef CONFIG_RANDOMIZE_BASE
static unsigned long module_load_offset;
+static const unsigned long NO_TRY_RAND = 10000;

/* Mutex protects the module_load_offset. */
static DEFINE_MUTEX(module_kaslr_mutex);

static unsigned long int get_module_load_offset(void)
{
- if (kaslr_enabled()) {
- mutex_lock(&module_kaslr_mutex);
- /*
- * Calculate the module_load_offset the first time this
- * code is called. Once calculated it stays the same until
- * reboot.
- */
- if (module_load_offset == 0)
- module_load_offset =
- (get_random_int() % 1024 + 1) * PAGE_SIZE;
- mutex_unlock(&module_kaslr_mutex);
- }
+ mutex_lock(&module_kaslr_mutex);
+ /*
+ * Calculate the module_load_offset the first time this
+ * code is called. Once calculated it stays the same until
+ * reboot.
+ */
+ if (module_load_offset == 0)
+ module_load_offset = (get_random_int() % 1024 + 1) * PAGE_SIZE;
+ mutex_unlock(&module_kaslr_mutex);
+
return module_load_offset;
}
-#else
-static unsigned long int get_module_load_offset(void)
+
+static unsigned long get_module_vmalloc_start(void)
{
- return 0;
+ unsigned long addr = MODULES_VADDR;
+
+ if (kaslr_mod_randomize_base())
+ addr += get_module_load_offset();
+
+ if (kaslr_mod_randomize_each_module())
+ addr += get_modules_rand_len();
+
+ return addr;
+}
+
+static void *try_module_alloc(unsigned long addr, unsigned long size)
+{
+ const unsigned long vm_flags = 0;
+
+ return __vmalloc_node_try_addr(addr, size, GFP_KERNEL, PAGE_KERNEL_EXEC,
+ vm_flags, NUMA_NO_NODE,
+ __builtin_return_address(0));
+}
+
+/*
+ * Find a random address to try that won't obviously not fit. Random areas are
+ * allowed to overflow into the backup area
+ */
+static unsigned long get_rand_module_addr(unsigned long size)
+{
+ unsigned long nr_max_pos = (MODULES_LEN - size) / MODULE_ALIGN + 1;
+ unsigned long nr_rnd_pos = get_modules_rand_len() / MODULE_ALIGN;
+ unsigned long nr_pos = min(nr_max_pos, nr_rnd_pos);
+
+ unsigned long module_position_nr = get_random_long() % nr_pos;
+ unsigned long offset = module_position_nr * MODULE_ALIGN;
+
+ return MODULES_VADDR + offset;
+}
+
+/*
+ * Try to allocate in the random area at 10000 random addresses. If these
+ * fail, return NULL.
+ */
+static void *try_module_randomize_each(unsigned long size)
+{
+ void *p = NULL;
+ unsigned int i;
+
+ /* This will have a guard page */
+ unsigned long va_size = PAGE_ALIGN(size) + PAGE_SIZE;
+
+ if (!kaslr_mod_randomize_each_module())
+ return NULL;
+
+ /* Make sure there is at least one address that might fit. */
+ if (va_size < PAGE_ALIGN(size) || va_size > MODULES_LEN)
+ return NULL;
+
+ /* Try to find a spot that doesn't need a lazy purge */
+ for (i = 0; i < NO_TRY_RAND; i++) {
+ unsigned long addr = get_rand_module_addr(va_size);
+
+ p = try_module_alloc(addr, size);
+
+ if (p)
+ return p;
+ }
+
+ return NULL;
}
-#endif

void *module_alloc(unsigned long size)
{
@@ -84,16 +147,18 @@ void *module_alloc(unsigned long size)
if (PAGE_ALIGN(size) > MODULES_LEN)
return NULL;

- p = __vmalloc_node_range(size, MODULE_ALIGN,
- MODULES_VADDR + get_module_load_offset(),
- MODULES_END, GFP_KERNEL,
- PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
+ p = try_module_randomize_each(size);
+
+ if (!p)
+ p = __vmalloc_node_range(size, MODULE_ALIGN,
+ get_module_vmalloc_start(), MODULES_END,
+ GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
+ NUMA_NO_NODE, __builtin_return_address(0));
+
if (p && (kasan_module_alloc(p, size) < 0)) {
vfree(p);
return NULL;
}
-
return p;
}

--
2.17.1


2018-11-20 23:23:09

by Edgecombe, Rick P

Subject: [PATCH v9 RESEND 4/4] Kselftest for module text allocation benchmarking

This adds a test module in lib/, and a script in kselftest that does
benchmarking on the allocation of memory in the module space. Performance here
would have some small impact on kernel module insertions, BPF JIT insertions
and kprobes. In the case of KASLR features for the module space, this module
can be used to measure the allocation performance of different configurations.
This module needs to be compiled into the kernel because module_alloc is not
exported.

With some modification to the code, as explained in the comments, it can be
enabled to measure TLB flushes as well.

There are two tests in the module. One allocates until failure in order to
test module capacity and the other times allocating space in the module area.
They both use module sizes that roughly approximate the distribution of in-tree
X86_64 modules.

You can control the number of modules used in the tests like this:
echo m1000 > /sys/kernel/debug/mod_alloc_test

Run the test for module capacity like:
echo t1 > /sys/kernel/debug/mod_alloc_test

The other test will measure the allocation time and, for CONFIG_X86_64 and
CONFIG_RANDOMIZE_BASE, also give data on how often the "backup area" is used.

Run the test for allocation time and backup area usage like:
echo t2 > /sys/kernel/debug/mod_alloc_test
The output will be something like this:
num        all(ns)    last(ns)
1000       1083       1099
Last module in backup count = 0
Total modules in backup = 0
>1 module in backup count = 0

To run a suite of allocation time tests for a collection of module numbers you can run:
tools/testing/selftests/bpf/test_mod_alloc.sh

Signed-off-by: Rick Edgecombe <[email protected]>
---
lib/Kconfig.debug | 9 +
lib/Makefile | 1 +
lib/test_mod_alloc.c | 375 ++++++++++++++++++
tools/testing/selftests/bpf/test_mod_alloc.sh | 29 ++
4 files changed, 414 insertions(+)
create mode 100644 lib/test_mod_alloc.c
create mode 100755 tools/testing/selftests/bpf/test_mod_alloc.sh

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 1af29b8224fd..b590b2bb312f 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1886,6 +1886,15 @@ config TEST_BPF

If unsure, say N.

+config TEST_MOD_ALLOC
+ bool "Tests for module allocator/vmalloc"
+ help
+ This builds the "test_mod_alloc" module that performs performance
+ tests on the module text section allocator. The module uses X86_64
+ module text sizes for simulations.
+
+ If unsure, say N.
+
config FIND_BIT_BENCHMARK
tristate "Test find_bit functions"
help
diff --git a/lib/Makefile b/lib/Makefile
index db06d1237898..c447e07931b0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -60,6 +60,7 @@ UBSAN_SANITIZE_test_ubsan.o := y
obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
obj-$(CONFIG_TEST_LIST_SORT) += test_list_sort.o
obj-$(CONFIG_TEST_LKM) += test_module.o
+obj-$(CONFIG_TEST_MOD_ALLOC) += test_mod_alloc.o
obj-$(CONFIG_TEST_OVERFLOW) += test_overflow.o
obj-$(CONFIG_TEST_RHASHTABLE) += test_rhashtable.o
obj-$(CONFIG_TEST_SORT) += test_sort.o
diff --git a/lib/test_mod_alloc.c b/lib/test_mod_alloc.c
new file mode 100644
index 000000000000..3a6fb7999df4
--- /dev/null
+++ b/lib/test_mod_alloc.c
@@ -0,0 +1,375 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * This module can be used to test allocation time and randomness.
+ *
+ * To interact with this module, mount debugfs, for example:
+ * mount -t debugfs none /sys/kernel/debug/
+ *
+ * Then write to the file:
+ * /sys/kernel/debug/mod_alloc_test
+ *
+ * There are two tests:
+ * Test 1: Allocate until failure
+ * Test 2: Run 1000 iterations of a test that simulates loading modules with
+ * x86_64 module sizes.
+ *
+ * Configure the number (ex:1000) of modules to use per test in the tests:
+ * echo m1000 > /sys/kernel/debug/mod_alloc_test
+ *
+ * To run test (ex: Test 2):
+ * echo t2 > /sys/kernel/debug/mod_alloc_test
+ *
+ * For test 1 it will print the results of each test. For test 2 it will print
+ * out statistics for example:
+ * New module count: 1000
+ * Starting 10000 iterations of 1000 modules
+ * num all(ns) last(ns)
+ * 1000 1984 2112
+ * Last module in backup count = 0
+ * Total modules in backup = 188
+ * >1 module in backup count = 7
+ */
+
+#include <linux/debugfs.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/moduleloader.h>
+#include <linux/random.h>
+#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
+
+struct mod { int filesize; int coresize; int initsize; };
+
+/* ==== Begin optional logging ==== */
+/*
+ * Note: In order to get an accurate count for the tlb flushes triggered in
+ * vmalloc, create a counter in vmalloc.c with this method signature and export
+ * it. Then replace the below with:
+ *
+ * extern unsigned long get_tlb_flushes_vmalloc(void);
+ */
+static unsigned long get_tlb_flushes_vmalloc(void)
+{
+ return 0;
+}
+
+/* ==== End optional logging ==== */
+
+
+#define MAX_ALLOC_CNT 20000
+#define ITERS 1000
+
+struct vm_alloc {
+ void *core;
+ unsigned long core_size;
+ void *init;
+};
+
+static struct vm_alloc *allocs_vm;
+static long mod_cnt;
+static DEFINE_MUTEX(test_mod_alloc_mutex);
+
+const static int core_hist[10] = {1, 5, 21, 46, 141, 245, 597, 2224, 1875, 0};
+const static int init_hist[10] = {0, 0, 0, 0, 10, 19, 70, 914, 3906, 236};
+const static int file_hist[10] = {6, 20, 55, 86, 286, 551, 918, 2024, 1028,
+ 181};
+
+const static int bins[10] = {5000000, 2000000, 1000000, 500000, 200000, 100000,
+ 50000, 20000, 10000, 5000};
+/*
+ * Rough approximation of the X86_64 module size distribution.
+ */
+static int get_mod_rand_size(const int *hist)
+{
+ int area_under = get_random_int() % 5155;
+ int i;
+ int last_bin = bins[0] + 1;
+ int sum = 0;
+
+ for (i = 0; i <= 9; i++) {
+ sum += hist[i];
+ if (area_under <= sum)
+ return bins[i]
+ + (get_random_int() % (last_bin - bins[i]));
+ last_bin = bins[i];
+ }
+ return 4096;
+}
+
+static struct mod get_rand_module(void)
+{
+ struct mod ret;
+
+ ret.coresize = get_mod_rand_size(core_hist);
+ ret.initsize = get_mod_rand_size(init_hist);
+ ret.filesize = get_mod_rand_size(file_hist);
+ return ret;
+}
+
+static void do_test_alloc_fail(void)
+{
+ struct vm_alloc *cur_alloc;
+ struct mod cur_mod;
+ void *file;
+ int mod_n, free_mod_n;
+ unsigned long fail = 0;
+ int iter;
+
+ for (iter = 0; iter < ITERS; iter++) {
+ pr_info("Running iteration: %d\n", iter);
+ memset(allocs_vm, 0, mod_cnt * sizeof(struct vm_alloc));
+ vm_unmap_aliases();
+ for (mod_n = 0; mod_n < mod_cnt; mod_n++) {
+ cur_mod = get_rand_module();
+ cur_alloc = &allocs_vm[mod_n];
+
+ /* Allocate */
+ file = vmalloc(cur_mod.filesize);
+ cur_alloc->core = module_alloc(cur_mod.coresize);
+ cur_alloc->init = module_alloc(cur_mod.initsize);
+
+ /* Clean up everything except core */
+ if (!cur_alloc->core || !cur_alloc->init) {
+ fail++;
+ vfree(file);
+ if (cur_alloc->init) {
+ module_memfree(cur_alloc->init);
+ vm_unmap_aliases();
+ }
+ break;
+ }
+ module_memfree(cur_alloc->init);
+ vm_unmap_aliases();
+ vfree(file);
+ }
+
+ /* Clean up core sizes */
+ for (free_mod_n = 0; free_mod_n < mod_n; free_mod_n++) {
+ cur_alloc = &allocs_vm[free_mod_n];
+ if (cur_alloc->core)
+ module_memfree(cur_alloc->core);
+ }
+ }
+ pr_info("Failures(%ld modules): %lu\n", mod_cnt, fail);
+}
+
+#ifdef CONFIG_RANDOMIZE_FINE_MODULE
+static int is_in_backup(void *addr)
+{
+ return (unsigned long)addr >= MODULES_VADDR + MODULES_RAND_LEN;
+}
+#else
+static int is_in_backup(void *addr)
+{
+ return 0;
+}
+#endif
+
+static void do_test_last_perf(void)
+{
+ struct vm_alloc *cur_alloc;
+ struct mod cur_mod;
+ void *file;
+ int mod_n, mon_n_free;
+ unsigned long fail = 0;
+ int iter;
+ ktime_t start, diff;
+ ktime_t total_last = 0;
+ ktime_t total_all = 0;
+
+ /*
+ * The number of last core allocations for each iteration that were
+ * allocated in the backup area.
+ */
+ int last_in_bk = 0;
+
+ /*
+ * The total number of core allocations that were in the backup area for
+ * all iterations.
+ */
+ int total_in_bk = 0;
+
+ /* The number of iterations where the count was more than 1 */
+ int cnt_more_than_1 = 0;
+
+ /*
+ * The number of core allocations that were in the backup area for the
+ * current iteration.
+ */
+ int cur_in_bk = 0;
+
+ unsigned long before_tlbs;
+ unsigned long tlb_cnt_total;
+ unsigned long tlb_cur;
+ unsigned long total_tlbs = 0;
+
+ pr_info("Starting %d iterations of %ld modules\n", ITERS, mod_cnt);
+
+ for (iter = 0; iter < ITERS; iter++) {
+ vm_unmap_aliases();
+ before_tlbs = get_tlb_flushes_vmalloc();
+ memset(allocs_vm, 0, mod_cnt * sizeof(struct vm_alloc));
+ tlb_cnt_total = 0;
+ cur_in_bk = 0;
+ for (mod_n = 0; mod_n < mod_cnt; mod_n++) {
+ /* allocate how the module allocator allocates */
+
+ cur_mod = get_rand_module();
+ cur_alloc = &allocs_vm[mod_n];
+ file = vmalloc(cur_mod.filesize);
+
+ tlb_cur = get_tlb_flushes_vmalloc();
+
+ start = ktime_get();
+ cur_alloc->core = module_alloc(cur_mod.coresize);
+ diff = ktime_get() - start;
+
+ cur_alloc->init = module_alloc(cur_mod.initsize);
+
+ /* Collect metrics */
+ if (is_in_backup(cur_alloc->core)) {
+ cur_in_bk++;
+ if (mod_n == mod_cnt - 1)
+ last_in_bk++;
+ }
+ total_all += diff;
+
+ if (mod_n == mod_cnt - 1)
+ total_last += diff;
+
+ tlb_cnt_total += get_tlb_flushes_vmalloc() - tlb_cur;
+
+ /* If there is a failure, quit. init/core freed later */
+ if (!cur_alloc->core || !cur_alloc->init) {
+ fail++;
+ vfree(file);
+ break;
+ }
+ /* Init sections do not last long so free here */
+ module_memfree(cur_alloc->init);
+ vm_unmap_aliases();
+ cur_alloc->init = NULL;
+ vfree(file);
+ }
+
+ /* Collect per iteration metrics */
+ total_in_bk += cur_in_bk;
+ if (cur_in_bk > 1)
+ cnt_more_than_1++;
+ total_tlbs += get_tlb_flushes_vmalloc() - before_tlbs;
+
+ /* Collect per iteration metrics */
+ for (mon_n_free = 0; mon_n_free < mod_cnt; mon_n_free++) {
+ cur_alloc = &allocs_vm[mon_n_free];
+ module_memfree(cur_alloc->init);
+ module_memfree(cur_alloc->core);
+ }
+ }
+
+ if (fail)
+ pr_info("There was an alloc failure, results invalid!\n");
+
+ pr_info("num\t\tall(ns)\t\tlast(ns)");
+ pr_info("%ld\t\t%llu\t\t%llu\n", mod_cnt,
+ div64_s64(total_all, ITERS * mod_cnt),
+ div64_s64(total_last, ITERS));
+
+ if (IS_ENABLED(CONFIG_RANDOMIZE_FINE_MODULE)) {
+ pr_info("Last module in backup count = %d\n", last_in_bk);
+ pr_info("Total modules in backup = %d\n", total_in_bk);
+ pr_info(">1 module in backup count = %d\n", cnt_more_than_1);
+ }
+ /*
+ * This will usually hide info when the instrumentation is not in place.
+ */
+ if (tlb_cnt_total)
+ pr_info("TLB Flushes: %lu\n", tlb_cnt_total);
+}
+
+static void do_test(int test)
+{
+ switch (test) {
+ case 1:
+ do_test_alloc_fail();
+ break;
+ case 2:
+ do_test_last_perf();
+ break;
+ default:
+ pr_info("Unknown test\n");
+ }
+}
+
+static ssize_t device_file_write(struct file *filp, const char __user *user_buf,
+ size_t count, loff_t *offp)
+{
+ char buf[100];
+ long input_num;
+
+ if (count >= sizeof(buf) - 1) {
+ pr_info("Command too long\n");
+ return count;
+ }
+
+ if (!mutex_trylock(&test_mod_alloc_mutex)) {
+ pr_info("test_mod_alloc busy\n");
+ return count;
+ }
+
+ if (copy_from_user(buf, user_buf, count))
+ goto error;
+
+ buf[count] = 0;
+
+ if (kstrtol(buf+1, 10, &input_num))
+ goto error;
+
+ switch (buf[0]) {
+ case 'm':
+ if (input_num > 0 && input_num <= MAX_ALLOC_CNT) {
+ pr_info("New module count: %ld\n", input_num);
+ mod_cnt = input_num;
+ if (allocs_vm)
+ vfree(allocs_vm);
+ allocs_vm = vmalloc(sizeof(struct vm_alloc) * mod_cnt);
+ } else
+ pr_info("more than %d not supported\n", MAX_ALLOC_CNT);
+ break;
+ case 't':
+ if (!mod_cnt) {
+ pr_info("Set module count first\n");
+ break;
+ }
+
+ do_test(input_num);
+ break;
+ default:
+ pr_info("Unknown command\n");
+ }
+ goto done;
+error:
+ pr_info("Could not process input\n");
+done:
+ mutex_unlock(&test_mod_alloc_mutex);
+ return count;
+}
+
+static const char *dv_name = "mod_alloc_test";
+const static struct file_operations test_mod_alloc_fops = {
+ .owner = THIS_MODULE,
+ .write = device_file_write,
+};
+
+static int __init mod_alloc_test_init(void)
+{
+ debugfs_create_file(dv_name, 0400, NULL, NULL, &test_mod_alloc_fops);
+
+ return 0;
+}
+
+MODULE_LICENSE("GPL");
+
+module_init(mod_alloc_test_init);
diff --git a/tools/testing/selftests/bpf/test_mod_alloc.sh b/tools/testing/selftests/bpf/test_mod_alloc.sh
new file mode 100755
index 000000000000..e9aea570de78
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_mod_alloc.sh
@@ -0,0 +1,29 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+UNMOUNT_DEBUG_FS=0
+if ! mount | grep -q debugfs; then
+ if mount -t debugfs none /sys/kernel/debug/; then
+ UNMOUNT_DEBUG_FS=1
+ else
+ echo "Could not mount debug fs."
+ exit 1
+ fi
+fi
+
+if [ ! -e /sys/kernel/debug/mod_alloc_test ]; then
+ echo "Test module not found, did you build kernel with TEST_MOD_ALLOC?"
+ exit 1
+fi
+
+echo "Beginning module_alloc performance tests."
+
+for i in `seq 1000 1000 8000`; do
+ echo m$i>/sys/kernel/debug/mod_alloc_test
+ echo t2>/sys/kernel/debug/mod_alloc_test
+done
+
+echo "Module_alloc performance tests ended."
+
+if [ $UNMOUNT_DEBUG_FS -eq 1 ]; then
+ umount /sys/kernel/debug/
+fi
--
2.17.1


2018-11-20 23:23:15

by Edgecombe, Rick P

Subject: [PATCH v9 RESEND 1/4] vmalloc: Add __vmalloc_node_try_addr function

Create a __vmalloc_node_try_addr function that tries to allocate at a specific
address without triggering any lazy purging and retry. For the randomized
allocator that uses this function, failing to allocate at a specific address
is a lot more common, and skipping the purge and retry lets it fail faster
when an allocation won't fit. It is used for a case where lazy free areas are
unlikely, so the purge and retry would just be extra work done every time. For
the randomized module loader, the performance of an average allocation, in ns,
for different numbers of modules was:

Modules   Vmalloc optimization   No Vmalloc Optimization
1000      1433                   1993
2000      2295                   3681
3000      4424                   7450
4000      7746                   13824
5000      12721                  21852
6000      19724                  33926
7000      27638                  47427
8000      37745                  64443

In order to support this behavior, a try_addr argument was plumbed through
several of the static helpers.

This also reorders logic in __get_vm_area_node so that the vm_struct is
only allocated after the vmap area allocation succeeds, making the
no-space failure case faster. Such failures are much more common when
trying specific addresses.

Signed-off-by: Rick Edgecombe <[email protected]>
---
include/linux/vmalloc.h | 3 +
mm/vmalloc.c | 128 +++++++++++++++++++++++++++++-----------
2 files changed, 95 insertions(+), 36 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd61..6eaa89612372 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -82,6 +82,9 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
unsigned long start, unsigned long end, gfp_t gfp_mask,
pgprot_t prot, unsigned long vm_flags, int node,
const void *caller);
+extern void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
+ gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
+ int node, const void *caller);
#ifndef CONFIG_MMU
extern void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags);
static inline void *__vmalloc_node_flags_caller(unsigned long size, int node,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 97d4b25d0373..b8b34d319c85 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -326,6 +326,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
#define VM_LAZY_FREE 0x02
#define VM_VM_AREA 0x04

+#define VMAP_MAY_PURGE 0x2
+#define VMAP_NO_PURGE 0x1
+
static DEFINE_SPINLOCK(vmap_area_lock);
/* Export for kexec only */
LIST_HEAD(vmap_area_list);
@@ -402,12 +405,12 @@ static BLOCKING_NOTIFIER_HEAD(vmap_notify_list);
static struct vmap_area *alloc_vmap_area(unsigned long size,
unsigned long align,
unsigned long vstart, unsigned long vend,
- int node, gfp_t gfp_mask)
+ int node, gfp_t gfp_mask, int try_purge)
{
struct vmap_area *va;
struct rb_node *n;
unsigned long addr;
- int purged = 0;
+ int purged = try_purge & VMAP_NO_PURGE;
struct vmap_area *first;

BUG_ON(!size);
@@ -860,7 +863,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)

va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
VMALLOC_START, VMALLOC_END,
- node, gfp_mask);
+ node, gfp_mask, VMAP_MAY_PURGE);
if (IS_ERR(va)) {
kfree(vb);
return ERR_CAST(va);
@@ -1170,8 +1173,9 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t pro
addr = (unsigned long)mem;
} else {
struct vmap_area *va;
- va = alloc_vmap_area(size, PAGE_SIZE,
- VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
+ va = alloc_vmap_area(size, PAGE_SIZE, VMALLOC_START,
+ VMALLOC_END, node, GFP_KERNEL,
+ VMAP_MAY_PURGE);
if (IS_ERR(va))
return NULL;

@@ -1372,7 +1376,8 @@ static void clear_vm_uninitialized_flag(struct vm_struct *vm)

static struct vm_struct *__get_vm_area_node(unsigned long size,
unsigned long align, unsigned long flags, unsigned long start,
- unsigned long end, int node, gfp_t gfp_mask, const void *caller)
+ unsigned long end, int node, gfp_t gfp_mask, int try_purge,
+ const void *caller)
{
struct vmap_area *va;
struct vm_struct *area;
@@ -1386,16 +1391,17 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
align = 1ul << clamp_t(int, get_count_order_long(size),
PAGE_SHIFT, IOREMAP_MAX_ORDER);

- area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
- if (unlikely(!area))
- return NULL;
-
if (!(flags & VM_NO_GUARD))
size += PAGE_SIZE;

- va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
- if (IS_ERR(va)) {
- kfree(area);
+ va = alloc_vmap_area(size, align, start, end, node, gfp_mask,
+ try_purge);
+ if (IS_ERR(va))
+ return NULL;
+
+ area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
+ if (unlikely(!area)) {
+ free_vmap_area(va);
return NULL;
}

@@ -1408,7 +1414,8 @@ struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
unsigned long start, unsigned long end)
{
return __get_vm_area_node(size, 1, flags, start, end, NUMA_NO_NODE,
- GFP_KERNEL, __builtin_return_address(0));
+ GFP_KERNEL, VMAP_MAY_PURGE,
+ __builtin_return_address(0));
}
EXPORT_SYMBOL_GPL(__get_vm_area);

@@ -1417,7 +1424,7 @@ struct vm_struct *__get_vm_area_caller(unsigned long size, unsigned long flags,
const void *caller)
{
return __get_vm_area_node(size, 1, flags, start, end, NUMA_NO_NODE,
- GFP_KERNEL, caller);
+ GFP_KERNEL, VMAP_MAY_PURGE, caller);
}

/**
@@ -1432,7 +1439,7 @@ struct vm_struct *__get_vm_area_caller(unsigned long size, unsigned long flags,
struct vm_struct *get_vm_area(unsigned long size, unsigned long flags)
{
return __get_vm_area_node(size, 1, flags, VMALLOC_START, VMALLOC_END,
- NUMA_NO_NODE, GFP_KERNEL,
+ NUMA_NO_NODE, GFP_KERNEL, VMAP_MAY_PURGE,
__builtin_return_address(0));
}

@@ -1440,7 +1447,8 @@ struct vm_struct *get_vm_area_caller(unsigned long size, unsigned long flags,
const void *caller)
{
return __get_vm_area_node(size, 1, flags, VMALLOC_START, VMALLOC_END,
- NUMA_NO_NODE, GFP_KERNEL, caller);
+ NUMA_NO_NODE, GFP_KERNEL, VMAP_MAY_PURGE,
+ caller);
}

/**
@@ -1713,26 +1721,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
return NULL;
}

-/**
- * __vmalloc_node_range - allocate virtually contiguous memory
- * @size: allocation size
- * @align: desired alignment
- * @start: vm area range start
- * @end: vm area range end
- * @gfp_mask: flags for the page level allocator
- * @prot: protection mask for the allocated pages
- * @vm_flags: additional vm area flags (e.g. %VM_NO_GUARD)
- * @node: node to use for allocation or NUMA_NO_NODE
- * @caller: caller's return address
- *
- * Allocate enough pages to cover @size from the page level
- * allocator with @gfp_mask flags. Map them into contiguous
- * kernel virtual space, using a pagetable protection of @prot.
- */
-void *__vmalloc_node_range(unsigned long size, unsigned long align,
+static void *__vmalloc_node_range_opts(unsigned long size, unsigned long align,
unsigned long start, unsigned long end, gfp_t gfp_mask,
pgprot_t prot, unsigned long vm_flags, int node,
- const void *caller)
+ int try_purge, const void *caller)
{
struct vm_struct *area;
void *addr;
@@ -1743,7 +1735,8 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
goto fail;

area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED |
- vm_flags, start, end, node, gfp_mask, caller);
+ vm_flags, start, end, node, gfp_mask,
+ try_purge, caller);
if (!area)
goto fail;

@@ -1768,6 +1761,69 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
return NULL;
}

+/**
+ * __vmalloc_node_range - allocate virtually contiguous memory
+ * @size: allocation size
+ * @align: desired alignment
+ * @start: vm area range start
+ * @end: vm area range end
+ * @gfp_mask: flags for the page level allocator
+ * @prot: protection mask for the allocated pages
+ * @vm_flags: additional vm area flags (e.g. %VM_NO_GUARD)
+ * @node: node to use for allocation or NUMA_NO_NODE
+ * @caller: caller's return address
+ *
+ * Allocate enough pages to cover @size from the page level
+ * allocator with @gfp_mask flags. Map them into contiguous
+ * kernel virtual space, using a pagetable protection of @prot.
+ */
+void *__vmalloc_node_range(unsigned long size, unsigned long align,
+ unsigned long start, unsigned long end, gfp_t gfp_mask,
+ pgprot_t prot, unsigned long vm_flags, int node,
+ const void *caller)
+{
+ return __vmalloc_node_range_opts(size, align, start, end, gfp_mask,
+ prot, vm_flags, node, VMAP_MAY_PURGE,
+ caller);
+}
+
+/**
+ * __vmalloc_try_addr - try to alloc at a specific address
+ * @addr: address to try
+ * @size: size to try
+ * @gfp_mask: flags for the page level allocator
+ * @prot: protection mask for the allocated pages
+ * @vm_flags: additional vm area flags (e.g. %VM_NO_GUARD)
+ * @node: node to use for allocation or NUMA_NO_NODE
+ * @caller: caller's return address
+ *
+ * Try to allocate at the specific address. If it succeeds the address is
+ * returned. If it fails NULL is returned. It will not try to purge lazy
+ * free vmap areas in order to fit.
+ */
+void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
+ gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
+ int node, const void *caller)
+{
+ unsigned long addr_end;
+ unsigned long vsize = PAGE_ALIGN(size);
+
+ if (!vsize || (vsize >> PAGE_SHIFT) > totalram_pages)
+ return NULL;
+
+ if (!(vm_flags & VM_NO_GUARD))
+ vsize += PAGE_SIZE;
+
+ addr_end = addr + vsize;
+
+ if (addr > addr_end)
+ return NULL;
+
+ return __vmalloc_node_range_opts(size, 1, addr, addr_end,
+ gfp_mask | __GFP_NOWARN, prot, vm_flags, node,
+ VMAP_NO_PURGE, caller);
+}
+
/**
* __vmalloc_node - allocate virtually contiguous memory
* @size: allocation size
--
2.17.1


2018-11-20 23:23:55

by Edgecombe, Rick P

Subject: [PATCH v9 RESEND 3/4] vmalloc: Add debugfs modfraginfo

Add a debugfs file, "modfraginfo", that provides information on module space
fragmentation. It can be used to determine whether loadable module
randomization is causing problems in extreme module loading situations, such
as huge numbers of modules or extremely large modules.

Sample output when KASLR is enabled and X86_64 is configured:
Largest free space: 897912 kB
Total free space: 1025424 kB
Allocations in backup area: 0

Sample output when only X86_64 is configured:
Largest free space: 897912 kB
Total free space: 1025424 kB

Signed-off-by: Rick Edgecombe <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
mm/vmalloc.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 98 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b8b34d319c85..63894cb50873 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -18,6 +18,7 @@
#include <linux/interrupt.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
+#include <linux/debugfs.h>
#include <linux/debugobjects.h>
#include <linux/kallsyms.h>
#include <linux/list.h>
@@ -36,6 +37,12 @@
#include <asm/tlbflush.h>
#include <asm/shmparam.h>

+#ifdef CONFIG_X86
+#include <asm/page_types.h>
+#include <asm/setup.h>
+#include <asm/kaslr_modules.h>
+#endif
+
#include "internal.h"

struct vfree_deferred {
@@ -2415,7 +2422,6 @@ void free_vm_area(struct vm_struct *area)
}
EXPORT_SYMBOL_GPL(free_vm_area);

-#ifdef CONFIG_SMP
static struct vmap_area *node_to_va(struct rb_node *n)
{
return rb_entry_safe(n, struct vmap_area, rb_node);
@@ -2463,6 +2469,7 @@ static bool pvm_find_next_prev(unsigned long end,
return true;
}

+#ifdef CONFIG_SMP
/**
* pvm_determine_end - find the highest aligned address between two vmap_areas
* @pnext: in/out arg for the next vmap_area
@@ -2804,7 +2811,96 @@ static int __init proc_vmalloc_init(void)
proc_create_seq("vmallocinfo", 0400, NULL, &vmalloc_op);
return 0;
}
-module_init(proc_vmalloc_init);
+#elif defined(CONFIG_DEBUG_FS)
+static int __init proc_vmalloc_init(void)
+{
+ return 0;
+}
+#endif
+
+#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_RANDOMIZE_FINE_MODULE)
+static inline unsigned long is_in_backup(unsigned long addr)
+{
+ return addr >= MODULES_VADDR + get_modules_rand_len();
+}
+
+static int modulefraginfo_debug_show(struct seq_file *m, void *v)
+{
+ unsigned long last_end = MODULES_VADDR;
+ unsigned long total_free = 0;
+ unsigned long largest_free = 0;
+ unsigned long backup_cnt = 0;
+ unsigned long gap;
+ struct vmap_area *prev, *cur = NULL;
+
+ spin_lock(&vmap_area_lock);
+
+ if (!pvm_find_next_prev(MODULES_VADDR, &cur, &prev) || !cur)
+ goto done;
+
+ for (; cur->va_end <= MODULES_END; cur = list_next_entry(cur, list)) {
+ /* Don't count areas that are marked to be lazily freed */
+ if (!(cur->flags & VM_LAZY_FREE)) {
+ if (kaslr_mod_randomize_each_module())
+ backup_cnt += is_in_backup(cur->va_start);
+ gap = cur->va_start - last_end;
+ if (gap > largest_free)
+ largest_free = gap;
+ total_free += gap;
+ last_end = cur->va_end;
+ }
+
+ if (list_is_last(&cur->list, &vmap_area_list))
+ break;
+ }
+
+done:
+ gap = (MODULES_END - last_end);
+ if (gap > largest_free)
+ largest_free = gap;
+ total_free += gap;
+
+ spin_unlock(&vmap_area_lock);
+
+ seq_printf(m, "\tLargest free space:\t%lu kB\n", largest_free / 1024);
+ seq_printf(m, "\t Total free space:\t%lu kB\n", total_free / 1024);
+
+ if (kaslr_mod_randomize_each_module())
+ seq_printf(m, "Allocations in backup area:\t%lu\n", backup_cnt);
+
+ return 0;
+}
+
+static int proc_module_frag_debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, modulefraginfo_debug_show, NULL);
+}
+
+static const struct file_operations debug_module_frag_operations = {
+ .open = proc_module_frag_debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static void __init debug_modfrag_init(void)
+{
+ debugfs_create_file("modfraginfo", 0400, NULL, NULL,
+ &debug_module_frag_operations);
+}
+#elif defined(CONFIG_DEBUG_FS) || defined(CONFIG_PROC_FS)
+static void __init debug_modfrag_init(void)
+{
+}
#endif

+#if defined(CONFIG_DEBUG_FS) || defined(CONFIG_PROC_FS)
+static int __init info_vmalloc_init(void)
+{
+ proc_vmalloc_init();
+ debug_modfrag_init();
+ return 0;
+}
+
+module_init(info_vmalloc_init);
+#endif
--
2.17.1


2018-11-26 15:38:06

by Jessica Yu

Subject: Re: [PATCH v9 RESEND 0/4] KASLR feature to randomize each loadable module

+++ Rick Edgecombe [20/11/18 15:23 -0800]:
>Resending this because I missed Jessica in the "to" list. Also removing the part
>of this coverletter that talked about KPTI helping with some local kernel text
>de-randomizing methods, because I'm not sure I fully understand this.
>
>------------------------------------------------------------
>
>This is V9 of the "KASLR feature to randomize each loadable module" patchset.
>The purpose is to increase the randomization for the module space from 10 to 17
>bits, and also to make the modules randomized in relation to each other instead
>of just the address where the allocations begin, so that if one module leaks the
>location of the others can't be inferred.
>
>Why its useful
>==============
>Randomizing the location of executable code is a defense against control flow
>attacks, where the kernel is tricked into jumping, or speculatively executing
>code other than what is intended. By randomizing the location of the code, the
>attacker doesn't know where to redirect the control flow.
>
>Today the RANDOMIZE_BASE feature randomizes the base address where the module
>allocations begin with 10 bits of entropy for this purpose. From here, a highly
>deterministic algorithm allocates space for the modules as they are loaded and
>unloaded. If an attacker can predict the order and identities for modules that
>will be loaded (either by the system, or controlled by the user with
>request_module or BPF), then a single text address leak can give the attacker
>access to the locations of other modules. So in this case this new algorithm can
>take the entropy of the other modules from ~0 to 17, making it much more robust.
>
>Another problem today is that the low 10 bits of entropy makes brute force
>attacks feasible, especially in the case of speculative execution where a wrong
>guess won't necessarily cause a crash. In this case, increasing the
>randomization will force attacks to take longer, and so increase the time an
>attacker may be detected on a system.
>
>There are multiple efforts to apply more randomization to the core kernel text
>as well, and so this module space piece can be a first step to increasing
>randomization for all kernel space executable code.
>
>Userspace ASLR can get 28 bits of entropy or more, so at least increasing this
>to 17 for now improves what is currently a pretty low amount of randomization
>for the higher privileged kernel space.
>
>How it works
>============
>The algorithm is pretty simple. It just breaks the module space in two, a random
>area (2/3 of module space) and a backup area (1/3 of module space). It first
>tries to allocate up to 10000 randomly located starting pages inside the random
>section. If this fails, it will allocate in the backup area. The backup area
>base will be offset in the same way as current algorithm does for the base area,
>which has 10 bits of entropy.
>
>The vmalloc allocator can be used to try an allocation at a specific address,
>however it is usually used to try an allocation over a large address range, and
>so some behaviors which are non-issues in normal usage can be sub-optimal
>when trying an allocation at 10000 small ranges. So this patch also includes
>a new vmalloc function __vmalloc_node_try_addr and some other vmalloc tweaks
>that allow for more efficient trying of addresses.
>
>This algorithm targets maintaining high entropy for many 1000's of module
>allocations. This is because there are other users of the module space besides
>kernel modules, like eBPF JIT, classic BPF socket filter JIT and kprobes.

Hi Rick!

Sorry for the delay. I'd like to take a step back and ask some broader questions -

- Is the end goal of this patchset to randomize loading kernel modules, or most/all
executable kernel memory allocations, including bpf, kprobes, etc?

- It seems that a lot of complexity and heuristics are introduced just to
accommodate the potential fragmentation that can happen when the module vmalloc
space starts to get fragmented with bpf filters. I'm partial to the idea of
splitting or having bpf own its own vmalloc space, similar to what Ard is already
implementing for arm64.

So a question for the bpf and x86 folks, is having a dedicated vmalloc region
(as well as a separate bpf_alloc api) for bpf feasible or desirable on x86_64?

If bpf filters need to be within 2 GB of the core kernel, would it make sense
to carve out a portion of the current module region for bpf filters? According
to Documentation/x86/x86_64/mm.txt, the module region is ~1.5 GB. I am doubtful
that any real system will actually have 1.5 GB worth of kernel modules loaded.
Is there a specific reason why that much space is dedicated to kernel modules,
and would it be feasible to split that region cleanly with bpf?

- If bpf gets its own dedicated vmalloc space, and we stick to the single task
of randomizing *just* kernel modules, could the vmalloc optimizations and the
"backup" area be dropped? The benefits of the vmalloc optimizations seem to
only be noticeable when we get to thousands of module_alloc allocations -
again, a concern caused by bpf filters sharing the same space with kernel
modules.

So tldr, it seems to me that the concern of fragmentation, the vmalloc
optimizations, and the main purpose of the backup area - basically, the more
complex parts of this patchset - stems squarely from the fact that bpf filters
share the same space as modules on x86. If we were to focus on randomizing
*just* kernel modules, and if bpf and modules had their own dedicated regions,
then I *think* the concrete use cases for the backup area and the vmalloc
optimizations (if we're strictly considering just kernel modules) would
mostly disappear (please correct me if I'm in the wrong here). Then tackling the
randomization of bpf allocations could potentially be a separate task on its own.

Thanks!

Jessica

>Performance
>===========
>Simulations were run using module sizes derived from the x86_64 modules to
>measure the allocation performance at various levels of fragmentation and
>whether the backup area was used.
>
>Capacity
>--------
>There is a slight reduction in the capacity of modules as simulated by the
>x86_64 module sizes of <1000. Note this is a worst case, since in practice
>module allocations in the 1000's will consist of smaller BPF JIT allocations or
>kprobes which would fit better in the random area.
>
>Allocation time
>---------------
>Below are three sets of measurements in ns of the allocation time as measured by
>the included kselftests. The first two columns are this new algorithm with and
>without the vmalloc optimizations for trying random addresses quickly. They are
>included for consideration of whether the changes are worth it. The last column
>is the performance of the original algorithm.
>
>Modules Vmalloc optimization No Vmalloc Optimization Existing Module KASLR
>1000 1433 1993 3821
>2000 2295 3681 7830
>3000 4424 7450 13012
>4000 7746 13824 18106
>5000 12721 21852 22572
>6000 19724 33926 26443
>7000 27638 47427 30473
>8000 37745 64443 34200
>
>These allocations are not taking very long, but it may show up on systems with
>very high usage of the module space (BPF JITs). If the trade-off of touching
>vmalloc doesn't seem worth it to people, I can remove the optimizations.
>
>Randomness
>----------
>Unlike the existing algorithm, the amount of randomness provided has a
>dependency on the number of modules allocated and the sizes of the modules text
>sections. The entropy provided for the Nth allocation will come from three
>sources of randomness, the range of addresses for the random area, the
>probability the section will be allocated in the backup area and randomness from
>the number of modules already allocated in the backup area. For computing a
>lower bound entropy in the following calculations, the randomness of the modules
>already in the backup area, or overlapping from the random area, is ignored
>since it is usually small and will only increase the entropy. Below is an
>attempt to compute a worst case value for entropy to compare to the existing
>algorithm.
>
>For probability of the Nth allocation being in the backup area, p, a lower bound
>entropy estimate is calculated here as:
>
>Random Area Slots = ((2/3)*1073741824)/4096 = 174762
>
>Entropy = -( (1-p)*log2((1-p)/174762) + p*log2(p/1024) )
>
>For >8000 modules the entropy remains above 17.3. For non-speculative control
>flow attacks, an attack might crash the system. So the probability of the
>first guess being right can be more important than the Nth guess. KASLR schemes
>usually have equal probability for each possible position, but in this scheme
>that is not the case. So a more conservative comparison to existing schemes is
>the amount of information that would have to be guessed correctly for the
>position that has the highest probability for having the Nth module allocated
>(as that would be the attackers best guess):
>
>Min Info = MIN(-log2(p/1024), -log2((1-p)/174762))
>
>Allocations Entropy
>1000 17.4
>2000 17.4
>3000 17.4
>4000 16.8
>5000 15.8
>6000 14.9
>7000 14.8
>8000 14.2
>
>If anyone is keeping track, these numbers are different than as reported in V2,
>because they are generated using the more compact allocation size heuristic that
>is included in the kselftest rather than the real much larger dataset. The
>heuristic generates randomization benchmarks that are slightly slower than the
>real dataset. The real dataset also isn't representative of the case of mostly
>smaller BPF filters, so it represents a worst case lower bound for entropy and
>in practice 17+ bits should be maintained to much higher number of modules.
>
>PTE usage
>---------
>Since the allocations are spread out over a wider address space, there is
>increased PTE usage which should not exceed 1.3MB more than the old algorithm.
>
>
>Changes for V9:
> - Better explanations in commit messages, instructions in kselftests (Andrew
> Morton)
>
>Changes for V8:
> - Simplify code by removing logic for optimum handling of lazy free areas
>
>Changes for V7:
> - More 0-day build fixes, readability improvements (Kees Cook)
>
>Changes for V6:
> - 0-day build fixes by removing un-needed functional testing, more error
> handling
>
>Changes for V5:
> - Add module_alloc test module
>
>Changes for V4:
> - Fix issue caused by KASAN, kmemleak being provided different allocation
> lengths (padding).
> - Avoid kmalloc until sure its needed in __vmalloc_node_try_addr.
> - Fixed issues reported by 0-day.
>
>Changes for V3:
> - Code cleanup based on internal feedback. (thanks to Dave Hansen and Andriy
> Shevchenko)
> - Slight refactor of existing algorithm to more cleanly live along side new
> one.
> - BPF synthetic benchmark
>
>Changes for V2:
> - New implementation of __vmalloc_node_try_addr based on the
> __vmalloc_node_range implementation, that only flushes TLB when needed.
> - Modified module loading algorithm to try to reduce the TLB flushes further.
> - Increase "random area" tries in order to increase the number of modules that
> can get high randomness.
> - Increase "random area" size to 2/3 of module area in order to increase the
> number of modules that can get high randomness.
> - Fix for 0day failures on other architectures.
> - Fix for wrong debugfs permissions. (thanks to Jann Horn)
> - Spelling fix. (thanks to Jann Horn)
> - Data on module_alloc performance and TLB flushes. (brought up by Kees Cook
> and Jann Horn)
> - Data on memory usage. (suggested by Jann)
>
>
>Rick Edgecombe (4):
> vmalloc: Add __vmalloc_node_try_addr function
> x86/modules: Increase randomization for modules
> vmalloc: Add debugfs modfraginfo
> Kselftest for module text allocation benchmarking
>
> arch/x86/Kconfig | 3 +
> arch/x86/include/asm/kaslr_modules.h | 38 ++
> arch/x86/include/asm/pgtable_64_types.h | 7 +
> arch/x86/kernel/module.c | 111 ++++--
> include/linux/vmalloc.h | 3 +
> lib/Kconfig.debug | 9 +
> lib/Makefile | 1 +
> lib/test_mod_alloc.c | 375 ++++++++++++++++++
> mm/vmalloc.c | 228 +++++++++--
> tools/testing/selftests/bpf/test_mod_alloc.sh | 29 ++
> 10 files changed, 743 insertions(+), 61 deletions(-)
> create mode 100644 arch/x86/include/asm/kaslr_modules.h
> create mode 100644 lib/test_mod_alloc.c
> create mode 100755 tools/testing/selftests/bpf/test_mod_alloc.sh
>
>--
>2.17.1
>

2018-11-27 00:20:39

by Edgecombe, Rick P

Subject: Re: [PATCH v9 RESEND 0/4] KASLR feature to randomize each loadable module

On Mon, 2018-11-26 at 16:36 +0100, Jessica Yu wrote:
> +++ Rick Edgecombe [20/11/18 15:23 -0800]:
[snip]
> Hi Rick!
>
> Sorry for the delay. I'd like to take a step back and ask some broader
> questions -
>
> - Is the end goal of this patchset to randomize loading kernel modules, or
> most/all
> executable kernel memory allocations, including bpf, kprobes, etc?
Thanks for taking a look!

It started with the goal of just randomizing modules (hence the name), but I
think there may be value in randomizing the placement of all runtime-added
executable code. Beyond making executable code placement less deterministic
in general, today all of the usages have the property of starting with RW
permissions and then becoming RO and executable, so there is the benefit of
narrowing the chances a bug could successfully write to them during the RW
window.

> - It seems that a lot of complexity and heuristics are introduced just to
> accommodate the potential fragmentation that can happen when the module
> vmalloc
> space starts to get fragmented with bpf filters. I'm partial to the idea of
> splitting or having bpf own its own vmalloc space, similar to what Ard is
> already
> implementing for arm64.
>
> So a question for the bpf and x86 folks, is having a dedicated vmalloc
> region
> (as well as a seperate bpf_alloc api) for bpf feasible or desirable on
> x86_64?
I actually did some prototyping and testing on this. It seems there would be
some slowdown from the required changes to the JITed code to support calling
back from the vmalloc region into the kernel, and so module space would still be
the preferred region.

> If bpf filters need to be within 2 GB of the core kernel, would it make
> sense
> to carve out a portion of the current module region for bpf
> filters? According
> to Documentation/x86/x86_64/mm.txt, the module region is ~1.5 GB. I am
> doubtful
> that any real system will actually have 1.5 GB worth of kernel modules
> loaded.
> Is there a specific reason why that much space is dedicated to kernel
> modules,
> and would it be feasible to split that region cleanly with bpf?
Hopefully someone from the BPF side of things will chime in, but my
understanding was that they would like even more space than today if
possible, and so they may not like the reduced space.

Also, with KASLR on x86 it's actually only 1GB, so it would only be 500MB per
section (assuming kprobes, etc. would share the non-module region, so just two
sections).

> - If bpf gets its own dedicated vmalloc space, and we stick to the single task
> of randomizing *just* kernel modules, could the vmalloc optimizations and
> the
> "backup" area be dropped? The benefits of the vmalloc optimizations seem to
> only be noticeable when we get to thousands of module_alloc allocations -
> again, a concern caused by bpf filters sharing the same space with kernel
> modules.
I think the backup area may still be needed. For example, if you have 200
modules evenly spaced inside 500MB, there is only an average gap of ~2.5MB
between them. So a late-added large module could still get blocked.

> So tldr, it seems to me that the concern of fragmentation, the vmalloc
> optimizations, and the main purpose of the backup area - basically, the
> more
> complex parts of this patchset - stems squarely from the fact that bpf
> filters
> share the same space as modules on x86. If we were to focus on randomizing
> *just* kernel modules, and if bpf and modules had their own dedicated
> regions,
> then I *think* the concrete use cases for the backup area and the vmalloc
> optimizations (if we're strictly considering just kernel modules) would
> mostly disappear (please correct me if I'm in the wrong here). Then
> tackling the
> randomization of bpf allocations could potentially be a separate task on
> its own.
Yes, it seems the vmalloc optimizations could then be dropped, but I don't
think the backup area could be. Also, the entropy would go down since there
would be fewer possible positions, and we would reduce the space available to
BPF. So there are some downsides to just removing the vmalloc piece.

Is your concern that the vmalloc optimizations might regress something else?
There is a middle-ground vmalloc optimization where only the try_purge flag is
plumbed through. That flag accounted for most of the performance gain, and
with just that piece there should be no behavior change for the non-module
flows. Would that be more acceptable?

> Thanks!
>
> Jessica
>
[snip]

2018-11-27 10:22:42

by Daniel Borkmann

Subject: Re: [PATCH v9 RESEND 0/4] KASLR feature to randomize each loadable module

On 11/27/2018 01:19 AM, Edgecombe, Rick P wrote:
> On Mon, 2018-11-26 at 16:36 +0100, Jessica Yu wrote:
>> +++ Rick Edgecombe [20/11/18 15:23 -0800]:
> [snip]
>> Hi Rick!
>>
>> Sorry for the delay. I'd like to take a step back and ask some broader
>> questions -
>>
>> - Is the end goal of this patchset to randomize loading kernel modules, or
>> most/all
>> executable kernel memory allocations, including bpf, kprobes, etc?
> Thanks for taking a look!
>
> It started with the goal of just randomizing modules (hence the name), but I
> think there is maybe value in randomizing the placement of all runtime added
> executable code. Beyond just trying to make executable code placement less
> deterministic in general, today all of the usages have the property of starting
> with RW permissions and then becoming RO executable, so there is the benefit of
> narrowing the chances a bug could successfully write to it during the RW window.
>
>> - It seems that a lot of complexity and heuristics are introduced just to
>> accommodate the potential fragmentation that can happen when the module
>> vmalloc
>> space starts to get fragmented with bpf filters. I'm partial to the idea of
>> splitting or having bpf own its own vmalloc space, similar to what Ard is
>> already
>> implementing for arm64.
>>
>> So a question for the bpf and x86 folks, is having a dedicated vmalloc
>> region
>> (as well as a seperate bpf_alloc api) for bpf feasible or desirable on
>> x86_64?
> I actually did some prototyping and testing on this. It seems there would be
> some slowdown from the required changes to the JITed code to support calling
> back from the vmalloc region into the kernel, and so module space would still be
> the preferred region.

Yes, any runtime slow-down would be a no-go, as BPF sits in the middle of the
critical networking fast path, e.g. at the XDP or tc layer, and is used in
load-balancing, firewalling, and DDoS protection scenarios; some recent
examples are in [0-3].

[0] http://vger.kernel.org/lpc-networking2018.html#session-10
[1] http://vger.kernel.org/lpc-networking2018.html#session-15
[2] https://blog.cloudflare.com/how-to-drop-10-million-packets/
[3] http://vger.kernel.org/lpc-bpf2018.html#session-1

>> If bpf filters need to be within 2 GB of the core kernel, would it make
>> sense
>> to carve out a portion of the current module region for bpf
>> filters? According
>> to Documentation/x86/x86_64/mm.txt, the module region is ~1.5 GB. I am
>> doubtful
>> that any real system will actually have 1.5 GB worth of kernel modules
>> loaded.
>> Is there a specific reason why that much space is dedicated to kernel
>> modules,
>> and would it be feasible to split that region cleanly with bpf?
> Hopefully someone from BPF side of things will chime in, but my understanding
> was that they would like even more space than today if possible and so they may
> not like the reduced space.

I wouldn't mind if the region is split as Jessica suggests, but in a way where
there would be _no_ runtime regressions for BPF. This might also allow more
flexibility in sizing the area dedicated to BPF in the future, and could
potentially be done in a similar way to what Ard was proposing recently [4].

[4] https://patchwork.ozlabs.org/project/netdev/list/?series=77779

> Also with KASLR on x86 it's actually only 1 GB, so it would only be 500 MB per
> section (assuming kprobes, etc. would share the non-module region, so just two
> sections).
>
>> - If bpf gets its own dedicated vmalloc space, and we stick to the single task
>> of randomizing *just* kernel modules, could the vmalloc optimizations and
>> the
>> "backup" area be dropped? The benefits of the vmalloc optimizations seem to
>> only be noticeable when we get to thousands of module_alloc allocations -
>> again, a concern caused by bpf filters sharing the same space with kernel
>> modules.
> I think the backup area may still be needed, for example if you have 200 modules
> evenly spaced inside 500MB there is only average ~2.5MB gap between them. So a
> late added large module could still get blocked.
>
>> So tldr, it seems to me that the concern of fragmentation, the vmalloc
>> optimizations, and the main purpose of the backup area - basically, the
>> more
>> complex parts of this patchset - stems squarely from the fact that bpf
>> filters
>> share the same space as modules on x86. If we were to focus on randomizing
>> *just* kernel modules, and if bpf and modules had their own dedicated
>> regions,
>> then I *think* the concrete use cases for the backup area and the vmalloc
>> optimizations (if we're strictly considering just kernel modules) would
>> mostly disappear (please correct me if I'm in the wrong here). Then
>> tackling the
>> randomization of bpf allocations could potentially be a separate task on
>> its own.
> Yes, it seems the vmalloc optimizations could be dropped then, but I don't
> think the backup area could be. Also the entropy would go down, since there
> would be fewer possible positions, and we would reduce the space available to
> BPF. So there are some downsides just to removing the vmalloc piece.
>
> Is your concern that vmalloc optimizations might regress something else? There
> is a middle ground vmalloc optimization where only the try_purge flag is plumbed
> through. The flag was most of the performance gained and with just that piece it
> should not change any behavior for the non-modules flows. Would that be more
> acceptable?
>
>> Thanks!
>>
>> Jessica
>>
> [snip]
>


2018-11-28 01:41:46

by Edgecombe, Rick P

Subject: Re: [PATCH v9 RESEND 0/4] KASLR feature to randomize each loadable module

On Tue, 2018-11-27 at 11:21 +0100, Daniel Borkmann wrote:
> On 11/27/2018 01:19 AM, Edgecombe, Rick P wrote:
> > On Mon, 2018-11-26 at 16:36 +0100, Jessica Yu wrote:
> > > +++ Rick Edgecombe [20/11/18 15:23 -0800]:
> >
> > [snip]
> > > Hi Rick!
> > >
> > > Sorry for the delay. I'd like to take a step back and ask some broader
> > > questions -
> > >
> > > - Is the end goal of this patchset to randomize loading kernel modules, or
> > > most/all
> > > executable kernel memory allocations, including bpf, kprobes, etc?
> >
> > Thanks for taking a look!
> >
> > It started with the goal of just randomizing modules (hence the name), but I
> > think there is maybe value in randomizing the placement of all runtime added
> > executable code. Beyond just trying to make executable code placement less
> > deterministic in general, today all of the usages have the property of
> > starting
> > with RW permissions and then becoming RO executable, so there is the benefit
> > of
> > narrowing the chances a bug could successfully write to it during the RW
> > window.
> >
> > > - It seems that a lot of complexity and heuristics are introduced just to
> > > accommodate the potential fragmentation that can happen when the module
> > > vmalloc
> > > space starts to get fragmented with bpf filters. I'm partial to the
> > > idea of
> > > splitting or having bpf own its own vmalloc space, similar to what Ard
> > > is
> > > already
> > > implementing for arm64.
> > >
> > > So a question for the bpf and x86 folks, is having a dedicated vmalloc
> > > region
> > > (as well as a seperate bpf_alloc api) for bpf feasible or desirable on
> > > x86_64?
> >
> > I actually did some prototyping and testing on this. It seems there would be
> > some slowdown from the required changes to the JITed code to support calling
> > back from the vmalloc region into the kernel, and so module space would
> > still be
> > the preferred region.
>
> Yes, any runtime slow-down would be no-go as BPF sits in the middle of
> critical
> networking fast-path and e.g. on XDP or tc layer and is used in load-
> balancing,
> firewalling, DDoS protection scenarios, some recent examples in [0-3].
>
> [0] http://vger.kernel.org/lpc-networking2018.html#session-10
> [1] http://vger.kernel.org/lpc-networking2018.html#session-15
> [2] https://blog.cloudflare.com/how-to-drop-10-million-packets/
> [3] http://vger.kernel.org/lpc-bpf2018.html#session-1
>
> > > If bpf filters need to be within 2 GB of the core kernel, would it make
> > > sense
> > > to carve out a portion of the current module region for bpf
> > > filters? According
> > > to Documentation/x86/x86_64/mm.txt, the module region is ~1.5 GB. I am
> > > doubtful
> > > that any real system will actually have 1.5 GB worth of kernel modules
> > > loaded.
> > > Is there a specific reason why that much space is dedicated to kernel
> > > modules,
> > > and would it be feasible to split that region cleanly with bpf?
> >
> > Hopefully someone from BPF side of things will chime in, but my
> > understanding
> > was that they would like even more space than today if possible and so they
> > may
> > not like the reduced space.
>
> I wouldn't mind of the region is split as Jessica suggests but in a way where
> there would be _no_ runtime regressions for BPF. This might also allow to have
> more flexibility in sizing the area dedicated for BPF in future, and could
> potentially be done in similar way as Ard was proposing recently [4].
>
> [4] https://patchwork.ozlabs.org/project/netdev/list/?series=77779

CCing Ard.

The benefit of sharing the space, for randomization at least, is that you can
spread the allocations over a larger area.

I think there are also other benefits to unifying how this memory is managed,
though, rather than spreading it further. Today various patterns and techniques
are used, like calling different combinations of set_memory_* before freeing,
zeroing in modules, or writing invalid instructions like BPF does, etc. There
is also special care to be taken when vfree-ing executable memory. This way
things would only have to be done right once, and there would be less
duplication.

I'm not saying there shouldn't be __weak alloc and free methods in BPF for
arch-specific behavior, just that there are quite a few other concerns that it
could be good to centralize even more than today.

What if there was a unified executable alloc API with support for things like:
- Concepts of two regions for Ard's usage, near (modules) and far (vmalloc)
  from kernel text. This won't apply to every arch, but maybe to enough that
  some logic could be unified
- Limits for each of the usages (modules, bpf, kprobes, ftrace)
- Centralized logic for moving between RW and RO+X
- Options for exclusive regions or all shared
- Randomizing base, randomizing independently or none
- Some cgroups hooks?

Would there be any interest in that for the future?

As a next step, if BPF doesn't want to use this by default, could BPF just call
vmalloc_node_range directly from Ard's new __weak functions on x86? Then modules
could randomize across the whole space, and BPF could fill the gaps linearly
from the beginning. Is that acceptable? The vmalloc optimizations could then be
dropped for the time being, since the BPF allocations would not be fragmented,
and the separate regions could come as part of future work.

Thanks,

Rick

> > Also with KASLR on x86 its actually only 1GB, so it would only be 500MB per
> > section (assuming kprobes, etc would share the non-module region, so just
> > two
> > sections).
> >
> > > - If bpf gets its own dedicated vmalloc space, and we stick to the single
> > > task
> > > of randomizing *just* kernel modules, could the vmalloc optimizations
> > > and
> > > the
> > > "backup" area be dropped? The benefits of the vmalloc optimizations
> > > seem to
> > > only be noticeable when we get to thousands of module_alloc allocations
> > > -
> > > again, a concern caused by bpf filters sharing the same space with
> > > kernel
> > > modules.
> >
> > I think the backup area may still be needed, for example if you have 200
> > modules
> > evenly spaced inside 500MB there is only average ~2.5MB gap between them. So
> > a
> > late added large module could still get blocked.
> >
> > > So tldr, it seems to me that the concern of fragmentation, the vmalloc
> > > optimizations, and the main purpose of the backup area - basically, the
> > > more
> > > complex parts of this patchset - stems squarely from the fact that bpf
> > > filters
> > > share the same space as modules on x86. If we were to focus on
> > > randomizing
> > > *just* kernel modules, and if bpf and modules had their own dedicated
> > > regions,
> > > then I *think* the concrete use cases for the backup area and the
> > > vmalloc
> > > optimizations (if we're strictly considering just kernel modules) would
> > > mostly disappear (please correct me if I'm in the wrong here). Then
> > > tackling the
> > > randomization of bpf allocations could potentially be a separate task
> > > on
> > > its own.
> >
> > Yes it seems then the vmalloc optimizations could be dropped then, but I
> > don't
> > think the backup area could be. Also the entropy would go down since there
> > would
> > be less possible positions and we would reduce the space available to BPF.
> > So
> > there are some downsides just to remove the vmalloc piece.
> >
> > Is your concern that vmalloc optimizations might regress something else?
> > There
> > is a middle ground vmalloc optimization where only the try_purge flag is
> > plumbed
> > through. The flag was most of the performance gained and with just that
> > piece it
> > should not change any behavior for the non-modules flows. Would that be more
> > acceptable?
> >
> > > Thanks!
> > >
> > > Jessica
> > >
> >
> > [snip]
> >
>
>

2018-12-12 23:06:45

by Edgecombe, Rick P

Subject: Re: [PATCH v9 RESEND 0/4] KASLR feature to randomize each loadable module

On Wed, 2018-11-28 at 01:40 +0000, Edgecombe, Rick P wrote:
> On Tue, 2018-11-27 at 11:21 +0100, Daniel Borkmann wrote:
> > On 11/27/2018 01:19 AM, Edgecombe, Rick P wrote:
> > > On Mon, 2018-11-26 at 16:36 +0100, Jessica Yu wrote:
> > > > +++ Rick Edgecombe [20/11/18 15:23 -0800]:
> > >
> > > [snip]
> > > > Hi Rick!
> > > >
> > > > Sorry for the delay. I'd like to take a step back and ask some broader
> > > > questions -
> > > >
> > > > - Is the end goal of this patchset to randomize loading kernel modules,
> > > > or
> > > > most/all
> > > > executable kernel memory allocations, including bpf, kprobes, etc?
> > >
> > > Thanks for taking a look!
> > >
> > > It started with the goal of just randomizing modules (hence the name), but
> > > I
> > > think there is maybe value in randomizing the placement of all runtime
> > > added
> > > executable code. Beyond just trying to make executable code placement less
> > > deterministic in general, today all of the usages have the property of
> > > starting
> > > with RW permissions and then becoming RO executable, so there is the
> > > benefit
> > > of
> > > narrowing the chances a bug could successfully write to it during the RW
> > > window.
> > >
> > > > - It seems that a lot of complexity and heuristics are introduced just
> > > > to
> > > > accommodate the potential fragmentation that can happen when the
> > > > module
> > > > vmalloc
> > > > space starts to get fragmented with bpf filters. I'm partial to the
> > > > idea of
> > > > splitting or having bpf own its own vmalloc space, similar to what
> > > > Ard
> > > > is
> > > > already
> > > > implementing for arm64.
> > > >
> > > > So a question for the bpf and x86 folks, is having a dedicated
> > > > vmalloc
> > > > region
> > > > (as well as a seperate bpf_alloc api) for bpf feasible or desirable
> > > > on
> > > > x86_64?
> > >
> > > I actually did some prototyping and testing on this. It seems there would
> > > be
> > > some slowdown from the required changes to the JITed code to support
> > > calling
> > > back from the vmalloc region into the kernel, and so module space would
> > > still be
> > > the preferred region.
> >
> > Yes, any runtime slow-down would be no-go as BPF sits in the middle of
> > critical
> > networking fast-path and e.g. on XDP or tc layer and is used in load-
> > balancing,
> > firewalling, DDoS protection scenarios, some recent examples in [0-3].
> >
> > [0] http://vger.kernel.org/lpc-networking2018.html#session-10
> > [1] http://vger.kernel.org/lpc-networking2018.html#session-15
> > [2] https://blog.cloudflare.com/how-to-drop-10-million-packets/
> > [3] http://vger.kernel.org/lpc-bpf2018.html#session-1
> >
> > > > If bpf filters need to be within 2 GB of the core kernel, would it
> > > > make
> > > > sense
> > > > to carve out a portion of the current module region for bpf
> > > > filters? According
> > > > to Documentation/x86/x86_64/mm.txt, the module region is ~1.5 GB. I
> > > > am
> > > > doubtful
> > > > that any real system will actually have 1.5 GB worth of kernel
> > > > modules
> > > > loaded.
> > > > Is there a specific reason why that much space is dedicated to kernel
> > > > modules,
> > > > and would it be feasible to split that region cleanly with bpf?
> > >
> > > Hopefully someone from BPF side of things will chime in, but my
> > > understanding
> > > was that they would like even more space than today if possible and so
> > > they
> > > may
> > > not like the reduced space.
> >
> > I wouldn't mind of the region is split as Jessica suggests but in a way
> > where
> > there would be _no_ runtime regressions for BPF. This might also allow to
> > have
> > more flexibility in sizing the area dedicated for BPF in future, and could
> > potentially be done in similar way as Ard was proposing recently [4].
> >
> > [4] https://patchwork.ozlabs.org/project/netdev/list/?series=77779
>
> CCing Ard.
>
> The benefit of sharing the space, for randomization at least, is that you can
> spread the allocations over a larger area.
>
> I think there are also other benefits to unifying how this memory is managed
> though, rather than spreading it further. Today there are various patterns and
> techniques used like calling different combinations of set_memory_* before
> freeing, zeroing in modules or setting invalid instructions like BPF does,
> etc.
> There is also special care to be taken on vfree-ing executable memory. So this
> way things only have to be done right once and there is less duplication.
>
> Not saying there shouldn't be __weak alloc and free method in BPF for arch
> specific behavior, just that there is quite a few other concerns that could be
> good to centralize even more than today.
>
> What if there was a unified executable alloc API with support for things like:
> - Concepts of two regions for Ard's usage, near(modules) and far(vmalloc)
> from
> kernel text. Won't apply for every arch, but maybe enough that some logic
> could be unified
> - Limits for each of the usages (modules, bpf, kprobes, ftrace)
> - Centralized logic for moving between RW and RO+X
> - Options for exclusive regions or all shared
> - Randomizing base, randomizing independently or none
> - Some cgroups hooks?
>
> Would there be any interest in that for the future?
>
> As a next step, if BPF doesn't want to use this by default, could BPF just
> call
> vmalloc_node_range directly from Ard's new __weak functions on x86? Then
> modules
> can randomize across the whole space and BPF can fill the gaps linearly from
> the
> beginning. Is that acceptable? Then the vmalloc optimizations could be dropped
> for the time being since the BPFs would not be fragmented, but the separate
> regions could come as part of future work.

Jessica, Daniel,

Any advice for me on how we could move this forward?

Thanks,
Rick



> Thanks,
>
> Rick
>
> > > Also with KASLR on x86 its actually only 1GB, so it would only be 500MB
> > > per
> > > section (assuming kprobes, etc would share the non-module region, so just
> > > two
> > > sections).
> > >
> > > > - If bpf gets its own dedicated vmalloc space, and we stick to the
> > > > single
> > > > task
> > > > of randomizing *just* kernel modules, could the vmalloc optimizations
> > > > and
> > > > the
> > > > "backup" area be dropped? The benefits of the vmalloc optimizations
> > > > seem to
> > > > only be noticeable when we get to thousands of module_alloc
> > > > allocations
> > > > -
> > > > again, a concern caused by bpf filters sharing the same space with
> > > > kernel
> > > > modules.
> > >
> > > I think the backup area may still be needed, for example if you have 200
> > > modules
> > > evenly spaced inside 500MB there is only average ~2.5MB gap between them.
> > > So
> > > a
> > > late added large module could still get blocked.
> > >
> > > > So tldr, it seems to me that the concern of fragmentation, the
> > > > vmalloc
> > > > optimizations, and the main purpose of the backup area - basically,
> > > > the
> > > > more
> > > > complex parts of this patchset - stems squarely from the fact that
> > > > bpf
> > > > filters
> > > > share the same space as modules on x86. If we were to focus on
> > > > randomizing
> > > > *just* kernel modules, and if bpf and modules had their own dedicated
> > > > regions,
> > > > then I *think* the concrete use cases for the backup area and the
> > > > vmalloc
> > > > optimizations (if we're strictly considering just kernel modules)
> > > > would
> > > > mostly disappear (please correct me if I'm in the wrong here). Then
> > > > tackling the
> > > > randomization of bpf allocations could potentially be a separate task
> > > > on
> > > > its own.
> > >
> > > Yes it seems then the vmalloc optimizations could be dropped then, but I
> > > don't
> > > think the backup area could be. Also the entropy would go down since there
> > > would
> > > be less possible positions and we would reduce the space available to BPF.
> > > So
> > > there are some downsides just to remove the vmalloc piece.
> > >
> > > Is your concern that vmalloc optimizations might regress something else?
> > > There
> > > is a middle ground vmalloc optimization where only the try_purge flag is
> > > plumbed
> > > through. The flag was most of the performance gained and with just that
> > > piece it
> > > should not change any behavior for the non-modules flows. Would that be
> > > more
> > > acceptable?
> > >
> > > > Thanks!
> > > >
> > > > Jessica
> > > >
> > >
> > > [snip]
> > >
> >
> >

2018-12-17 05:03:48

by Jessica Yu

Subject: Re: [PATCH v9 RESEND 0/4] KASLR feature to randomize each loadable module

+++ Edgecombe, Rick P [12/12/18 23:05 +0000]:
>On Wed, 2018-11-28 at 01:40 +0000, Edgecombe, Rick P wrote:
>> On Tue, 2018-11-27 at 11:21 +0100, Daniel Borkmann wrote:
>> > On 11/27/2018 01:19 AM, Edgecombe, Rick P wrote:
>> > > On Mon, 2018-11-26 at 16:36 +0100, Jessica Yu wrote:
>> > > > +++ Rick Edgecombe [20/11/18 15:23 -0800]:
>> > >
>> > > [snip]
>> > > > Hi Rick!
>> > > >
>> > > > Sorry for the delay. I'd like to take a step back and ask some broader
>> > > > questions -
>> > > >
>> > > > - Is the end goal of this patchset to randomize loading kernel modules,
>> > > > or
>> > > > most/all
>> > > > executable kernel memory allocations, including bpf, kprobes, etc?
>> > >
>> > > Thanks for taking a look!
>> > >
>> > > It started with the goal of just randomizing modules (hence the name), but
>> > > I
>> > > think there is maybe value in randomizing the placement of all runtime
>> > > added
>> > > executable code. Beyond just trying to make executable code placement less
>> > > deterministic in general, today all of the usages have the property of
>> > > starting
>> > > with RW permissions and then becoming RO executable, so there is the
>> > > benefit
>> > > of
>> > > narrowing the chances a bug could successfully write to it during the RW
>> > > window.
>> > >
>> > > > - It seems that a lot of complexity and heuristics are introduced just
>> > > > to
>> > > > accommodate the potential fragmentation that can happen when the
>> > > > module
>> > > > vmalloc
>> > > > space starts to get fragmented with bpf filters. I'm partial to the
>> > > > idea of
>> > > > splitting or having bpf own its own vmalloc space, similar to what
>> > > > Ard
>> > > > is
>> > > > already
>> > > > implementing for arm64.
>> > > >
>> > > > So a question for the bpf and x86 folks, is having a dedicated
>> > > > vmalloc
>> > > > region
>> > > > (as well as a seperate bpf_alloc api) for bpf feasible or desirable
>> > > > on
>> > > > x86_64?
>> > >
>> > > I actually did some prototyping and testing on this. It seems there would
>> > > be
>> > > some slowdown from the required changes to the JITed code to support
>> > > calling
>> > > back from the vmalloc region into the kernel, and so module space would
>> > > still be
>> > > the preferred region.
>> >
>> > Yes, any runtime slow-down would be no-go as BPF sits in the middle of
>> > critical
>> > networking fast-path and e.g. on XDP or tc layer and is used in load-
>> > balancing,
>> > firewalling, DDoS protection scenarios, some recent examples in [0-3].
>> >
>> > [0] http://vger.kernel.org/lpc-networking2018.html#session-10
>> > [1] http://vger.kernel.org/lpc-networking2018.html#session-15
>> > [2] https://blog.cloudflare.com/how-to-drop-10-million-packets/
>> > [3] http://vger.kernel.org/lpc-bpf2018.html#session-1
>> >
>> > > > If bpf filters need to be within 2 GB of the core kernel, would it
>> > > > make
>> > > > sense
>> > > > to carve out a portion of the current module region for bpf
>> > > > filters? According
>> > > > to Documentation/x86/x86_64/mm.txt, the module region is ~1.5 GB. I
>> > > > am
>> > > > doubtful
>> > > > that any real system will actually have 1.5 GB worth of kernel
>> > > > modules
>> > > > loaded.
>> > > > Is there a specific reason why that much space is dedicated to kernel
>> > > > modules,
>> > > > and would it be feasible to split that region cleanly with bpf?
>> > >
>> > > Hopefully someone from BPF side of things will chime in, but my
>> > > understanding
>> > > was that they would like even more space than today if possible and so
>> > > they
>> > > may
>> > > not like the reduced space.
>> >
>> > I wouldn't mind of the region is split as Jessica suggests but in a way
>> > where
>> > there would be _no_ runtime regressions for BPF. This might also allow to
>> > have
>> > more flexibility in sizing the area dedicated for BPF in future, and could
>> > potentially be done in similar way as Ard was proposing recently [4].
>> >
>> > [4] https://patchwork.ozlabs.org/project/netdev/list/?series=77779
>>
>> CCing Ard.
>>
>> The benefit of sharing the space, for randomization at least, is that you can
>> spread the allocations over a larger area.
>>
>> I think there are also other benefits to unifying how this memory is managed
>> though, rather than spreading it further. Today there are various patterns and
>> techniques used like calling different combinations of set_memory_* before
>> freeing, zeroing in modules or setting invalid instructions like BPF does,
>> etc.
>> There is also special care to be taken on vfree-ing executable memory. So this
>> way things only have to be done right once and there is less duplication.
>>
>> Not saying there shouldn't be __weak alloc and free method in BPF for arch
>> specific behavior, just that there is quite a few other concerns that could be
>> good to centralize even more than today.
>>
>> What if there was a unified executable alloc API with support for things like:
>> - Concepts of two regions for Ard's usage, near(modules) and far(vmalloc)
>> from
>> kernel text. Won't apply for every arch, but maybe enough that some logic
>> could be unified
>> - Limits for each of the usages (modules, bpf, kprobes, ftrace)
>> - Centralized logic for moving between RW and RO+X
>> - Options for exclusive regions or all shared
>> - Randomizing base, randomizing independently or none
>> - Some cgroups hooks?
>>
>> Would there be any interest in that for the future?
>>
>> As a next step, if BPF doesn't want to use this by default, could BPF just
>> call
>> vmalloc_node_range directly from Ard's new __weak functions on x86? Then
>> modules
>> can randomize across the whole space and BPF can fill the gaps linearly from
>> the
>> beginning. Is that acceptable? Then the vmalloc optimizations could be dropped
>> for the time being since the BPFs would not be fragmented, but the separate
>> regions could come as part of future work.
>Jessica, Daniel,
>
>Any advice for me on how we could move this forward?

Hi Rick,

It would be good for the x86 folks to chime in on whether they find the
x86-related module changes agreeable (in particular, the partitioning and
sizing of the module space into separate randomization and backup areas). Has
that happened already, or did I just miss it in the previous versions?

I'm neutral on the vmalloc optimizations, as I wouldn't consider module
loading performance-critical (for instance, you'd most likely just load a
driver once and be done with it; it's not like you'd very frequently be
loading/unloading modules. And note I mean loading a kernel module, not
module_alloc() allocations - the two concepts are starting to get conflated
:-/ ). So I'd leave the optimizations up to the BPF folks, if they consider
them beneficial for their module_alloc() allocations.

And it looks like there isn't really a strong push for or interest in having a
separate vmalloc area for bpf, so I suppose we can drop that idea for now (it
would be a separate patchset on its own anyway). I just suggested the idea
because I was curious whether it would have helped with the potential
fragmentation issues. In any case, it sounded like the potentially reduced
space (should the module space be split between bpf and modules) isn't
desirable.

Thanks,

Jessica

>
>> Thanks,
>>
>> Rick
>>
>> > > Also with KASLR on x86 its actually only 1GB, so it would only be 500MB
>> > > per
>> > > section (assuming kprobes, etc would share the non-module region, so just
>> > > two
>> > > sections).
>> > >
>> > > > - If bpf gets its own dedicated vmalloc space, and we stick to the
>> > > > single
>> > > > task
>> > > > of randomizing *just* kernel modules, could the vmalloc optimizations
>> > > > and
>> > > > the
>> > > > "backup" area be dropped? The benefits of the vmalloc optimizations
>> > > > seem to
>> > > > only be noticeable when we get to thousands of module_alloc
>> > > > allocations
>> > > > -
>> > > > again, a concern caused by bpf filters sharing the same space with
>> > > > kernel
>> > > > modules.
>> > >
>> > > I think the backup area may still be needed, for example if you have 200
>> > > modules
>> > > evenly spaced inside 500MB there is only average ~2.5MB gap between them.
>> > > So
>> > > a
>> > > late added large module could still get blocked.
>> > >
>> > > > So tldr, it seems to me that the concern of fragmentation, the
>> > > > vmalloc
>> > > > optimizations, and the main purpose of the backup area - basically,
>> > > > the
>> > > > more
>> > > > complex parts of this patchset - stems squarely from the fact that
>> > > > bpf
>> > > > filters
>> > > > share the same space as modules on x86. If we were to focus on
>> > > > randomizing
>> > > > *just* kernel modules, and if bpf and modules had their own dedicated
>> > > > regions,
>> > > > then I *think* the concrete use cases for the backup area and the
>> > > > vmalloc
>> > > > optimizations (if we're strictly considering just kernel modules)
>> > > > would
>> > > > mostly disappear (please correct me if I'm in the wrong here). Then
>> > > > tackling the
>> > > > randomization of bpf allocations could potentially be a separate task
>> > > > on
>> > > > its own.
>> > >
>> > > Yes it seems then the vmalloc optimizations could be dropped then, but I
>> > > don't
>> > > think the backup area could be. Also the entropy would go down since there
>> > > would
>> > > be less possible positions and we would reduce the space available to BPF.
>> > > So
>> > > there are some downsides just to remove the vmalloc piece.
>> > >
>> > > Is your concern that vmalloc optimizations might regress something else?
>> > > There
>> > > is a middle ground vmalloc optimization where only the try_purge flag is
>> > > plumbed
>> > > through. The flag was most of the performance gained and with just that
>> > > piece it
>> > > should not change any behavior for the non-modules flows. Would that be
>> > > more
>> > > acceptable?
>> > >
>> > > > Thanks!
>> > > >
>> > > > Jessica
>> > > >
>> > >
>> > > [snip]
>> > >
>> >
>> >

2018-12-18 01:28:29

by Edgecombe, Rick P

Subject: Re: [PATCH v9 RESEND 0/4] KASLR feature to randomize each loadable module

On Mon, 2018-12-17 at 05:41 +0100, Jessica Yu wrote:
> +++ Edgecombe, Rick P [12/12/18 23:05 +0000]:
> > On Wed, 2018-11-28 at 01:40 +0000, Edgecombe, Rick P wrote:
> > > On Tue, 2018-11-27 at 11:21 +0100, Daniel Borkmann wrote:
> > > > On 11/27/2018 01:19 AM, Edgecombe, Rick P wrote:
> > > > > On Mon, 2018-11-26 at 16:36 +0100, Jessica Yu wrote:
> > > > > > +++ Rick Edgecombe [20/11/18 15:23 -0800]:
> > > > >
> > > > > [snip]
> > > > > > Hi Rick!
> > > > > >
> > > > > > Sorry for the delay. I'd like to take a step back and ask some
> > > > > > broader
> > > > > > questions -
> > > > > >
> > > > > > - Is the end goal of this patchset to randomize loading kernel
> > > > > > modules,
> > > > > > or
> > > > > > most/all
> > > > > > executable kernel memory allocations, including bpf, kprobes,
> > > > > > etc?
> > > > >
> > > > > Thanks for taking a look!
> > > > >
> > > > > It started with the goal of just randomizing modules (hence the
> > > > > name), but I think there is value in randomizing the placement of
> > > > > all runtime-added executable code. Beyond making executable code
> > > > > placement less deterministic in general, all of today's usages share
> > > > > the property of starting with RW permissions and then becoming RO
> > > > > and executable, so randomization also narrows the chance that a bug
> > > > > could successfully write to the allocation during the RW window.
> > > > >
> > > > > > - It seems that a lot of complexity and heuristics are introduced
> > > > > >   just to accommodate the potential fragmentation that can happen
> > > > > >   when the module vmalloc space starts to get fragmented with bpf
> > > > > >   filters. I'm partial to the idea of splitting off, or having bpf
> > > > > >   own, its own vmalloc space, similar to what Ard is already
> > > > > >   implementing for arm64.
> > > > > >
> > > > > > So a question for the bpf and x86 folks: is having a dedicated
> > > > > > vmalloc region (as well as a separate bpf_alloc api) for bpf
> > > > > > feasible or desirable on x86_64?
> > > > >
> > > > > I actually did some prototyping and testing on this. There would be
> > > > > some slowdown from the changes required to let JITed code in the
> > > > > vmalloc region call back into the kernel, and so the module space
> > > > > would still be the preferred region.
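My reading is that the slowdown Rick measured comes from call-range limits:
on x86-64 a helper within +/-2 GB of the call site can be reached with a
5-byte `call rel32`, while a farther target needs a 12-byte
movabs-plus-indirect-call sequence that also burns a register. A sketch of
the two encodings (the buffer layout is illustrative, the opcodes are the
standard x86-64 ones):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Emit a near call: reaches targets within +/-2 GB in 5 bytes. */
static size_t emit_near_call(uint8_t *buf, int32_t rel)
{
	buf[0] = 0xE8;                      /* call rel32 */
	memcpy(buf + 1, &rel, sizeof(rel));
	return 5;
}

/* Emit a far call: any 64-bit target, but 12 bytes and clobbers rax. */
static size_t emit_far_call(uint8_t *buf, uint64_t target)
{
	buf[0] = 0x48; buf[1] = 0xB8;       /* movabs rax, imm64 */
	memcpy(buf + 2, &target, sizeof(target));
	buf[10] = 0xFF; buf[11] = 0xD0;     /* call rax */
	return 12;
}
```

If every helper call in a hot BPF program grows by 7 bytes and an extra
register move, the fast-path cost Daniel objects to below is easy to see.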
> > > >
> > > > Yes, any runtime slow-down would be a no-go, as BPF sits in the middle
> > > > of the critical networking fast path, e.g. at the XDP or tc layer, and
> > > > is used in load-balancing, firewalling and DDoS-protection scenarios;
> > > > some recent examples in [0-3].
> > > >
> > > > [0] http://vger.kernel.org/lpc-networking2018.html#session-10
> > > > [1] http://vger.kernel.org/lpc-networking2018.html#session-15
> > > > [2] https://blog.cloudflare.com/how-to-drop-10-million-packets/
> > > > [3] http://vger.kernel.org/lpc-bpf2018.html#session-1
> > > >
> > > > > > If bpf filters need to be within 2 GB of the core kernel, would it
> > > > > > make sense to carve out a portion of the current module region for
> > > > > > bpf filters? According to Documentation/x86/x86_64/mm.txt, the
> > > > > > module region is ~1.5 GB. I am doubtful that any real system will
> > > > > > actually have 1.5 GB worth of kernel modules loaded. Is there a
> > > > > > specific reason why that much space is dedicated to kernel modules,
> > > > > > and would it be feasible to split that region cleanly with bpf?
> > > > >
> > > > > Hopefully someone from the BPF side of things will chime in, but my
> > > > > understanding was that they would like even more space than they
> > > > > have today if possible, so they may not like the reduced space.
> > > >
> > > > I wouldn't mind if the region is split as Jessica suggests, but in a
> > > > way where there would be _no_ runtime regressions for BPF. This might
> > > > also allow more flexibility in sizing the area dedicated to BPF in the
> > > > future, and could potentially be done in a similar way to what Ard was
> > > > proposing recently [4].
> > > >
> > > > [4] https://patchwork.ozlabs.org/project/netdev/list/?series=77779
> > >
> > > CCing Ard.
> > >
> > > The benefit of sharing the space, for randomization at least, is that
> > > you can spread the allocations over a larger area.
> > >
> > > I think there are also other benefits to unifying how this memory is
> > > managed, though, rather than spreading it further. Today various
> > > patterns and techniques are used, like calling different combinations of
> > > set_memory_* before freeing, zeroing in modules, or setting invalid
> > > instructions like BPF does, etc. Special care also has to be taken when
> > > vfree-ing executable memory. This way things only have to be done right
> > > once and there is less duplication.
> > >
> > > Not saying there shouldn't be __weak alloc and free methods in BPF for
> > > arch-specific behavior, just that there are quite a few other concerns
> > > that it could be good to centralize even more than today.
> > >
> > > What if there was a unified executable alloc API with support for
> > > things like:
> > > - Concepts of two regions for Ard's usage: near (modules) and far
> > >   (vmalloc) from kernel text. This won't apply to every arch, but maybe
> > >   to enough that some logic could be unified
> > > - Limits for each of the usages (modules, bpf, kprobes, ftrace)
> > > - Centralized logic for moving between RW and RO+X
> > > - Options for exclusive regions or all shared
> > > - Randomizing the base, randomizing independently, or no randomization
> > > - Some cgroups hooks?
> > >
> > > Would there be any interest in that for the future?
> > >
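The feature list above could take a shape along these lines. Every
identifier here is invented purely for illustration; nothing below is a
proposed kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical parameter block for a unified executable allocator. */
enum xalloc_region {
	XALLOC_NEAR,	/* within rel32 range of kernel text (module area) */
	XALLOC_FAR,	/* anywhere in vmalloc space                       */
};

enum xalloc_rand {
	XALLOC_RAND_NONE,	/* linear, first fit                       */
	XALLOC_RAND_BASE,	/* randomize the base only (today's KASLR) */
	XALLOC_RAND_EACH,	/* randomize each allocation independently */
};

struct xalloc_params {
	enum xalloc_region region;
	enum xalloc_rand   rand;
	size_t             limit;	/* per-user cap (modules, bpf, ...) */
	bool               exclusive;	/* own sub-region vs shared pool    */
};

/* Users would then call something like:
 *   void *xalloc_exec(const struct xalloc_params *p, size_t size);
 *   void  xfree_exec(void *addr);   -- handles the NX/RW flip centrally
 */
```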
> > > As a next step, if BPF doesn't want to use this by default, could BPF
> > > just call vmalloc_node_range directly from Ard's new __weak functions on
> > > x86? Then modules can randomize across the whole space and BPF can fill
> > > the gaps linearly from the beginning. Is that acceptable? Then the
> > > vmalloc optimizations could be dropped for the time being, since the BPF
> > > allocations would not be fragmented, but the separate regions could come
> > > as part of future work.
> >
> > Jessica, Daniel,
> >
> > Any advice for me on how we could move this forward?
>
> Hi Rick,
>
> It would be good for the x86 folks to chime in if they find the
> x86-related module changes agreeable (in particular, the partitioning
> and sizing of the module space in separate randomization and backup
> areas). Has that happened already or did I just miss that in the
> previous versions?
Andrew Morton (on v8) and Kees Cook (way back on v1, IIRC) had asked whether
we need the backup area at all. The answer is yes: with heavy usage from the
other module_alloc users, or with late-added large modules, allocations have
a real-world chance of being blocked without it.
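The randomize-then-fall-back behavior can be modeled with a toy simulation,
along the lines of the ones used to size the areas. The slot counts and
retry limit here are invented for illustration, not the patchset's values:

```c
#include <assert.h>
#include <stdlib.h>

#define SLOTS        1024  /* page-sized slots in the randomized area   */
#define BACKUP_SLOTS 128   /* linear fallback (backup) area             */
#define MAX_TRIES    100   /* give up on random placement after this    */

static unsigned char rand_area[SLOTS], backup_area[BACKUP_SLOTS];

/* Try random slots a bounded number of times; if the randomized area is
 * too full or fragmented, fall back to the first free slot in the backup
 * area. Returns a slot index >= 0, a backup slot encoded as -(index + 2),
 * or -1 when both areas are exhausted. */
static int alloc_slot(void)
{
	for (int i = 0; i < MAX_TRIES; i++) {
		int s = rand() % SLOTS;
		if (!rand_area[s]) {
			rand_area[s] = 1;
			return s;
		}
	}
	for (int s = 0; s < BACKUP_SLOTS; s++) {
		if (!backup_area[s]) {
			backup_area[s] = 1;
			return -(s + 2);
		}
	}
	return -1;
}
```

The backup area is what keeps a late, large allocation from failing outright
once the randomized area gets crowded, which is the scenario Andrew and Kees
asked about.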

The sizes of the areas were chosen experimentally with the simulations, but I
didn't save the data.

Anyone in particular you would want to see comment on this?

> I'm impartial towards the vmalloc optimizations, as I wouldn't
> consider module loading performance-critical (For instance, you'd most
> likely just load a driver once and be done with it, and it's not like
> you'd very frequently be loading/unloading modules. And note I mean
> loading a kernel module, not module_alloc() allocations. These two
> concepts are starting to get conflated :-/ ). So, I'd leave the
> optimizations up to the BPF folks if they consider that beneficial for
> their module_alloc() allocations.
Daniel, Alexei,

Any thoughts how you would prefer this works with BPF JIT?

> And it looks like there isn't really a strong push or interest on
> having a separate vmalloc area for bpf, so I suppose we can drop that
> idea for now (that would be a separate patchset on its own anyway).
> I just suggested the idea because I was curious if that would have
> helped with the potential fragmentation issues. In any case it sounded
> like the potentially reduced space (should the module space be split
> between bpf and modules) isn't desirable.
[snip]