2018-06-20 22:12:38

by Edgecombe, Rick P

Subject: [PATCH 0/3] KASLR feature to randomize each loadable module

Hi,
This adds a KASLR feature that provides stronger randomization of the locations
of the text sections of dynamically loaded kernel modules.

Today the RANDOMIZE_BASE feature randomizes the base address where the module
allocations begin, with 10 bits of entropy. From there, a highly deterministic
algorithm allocates space for the modules as they are loaded and unloaded. If
an attacker can predict the order and identities of the modules that will be
loaded, then a single text address leak can give the attacker access to the
locations of all the modules.

This patch changes the module loading KASLR algorithm to randomize the position
of each module text section allocation with at least 18 bits of entropy in the
typical case. It is used on x86_64 only for now.

Allocation Algorithm
====================
The algorithm splits the module space evenly in two: a random area and a backup
area. For module text allocations, it first tries up to 10 randomly located
starting pages inside the random area. If these all fail, it allocates in the
backup area. The backup area base is offset in the same way the current
algorithm offsets the module area base, which has 10 bits of entropy.
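
Concretely, each random candidate start is picked the way patch 2 of this
series does it: a MODULE_ALIGN-aligned offset into the random half of the
module space:

	/* From try_module_randomize_each() in PATCH 2: nr_mod_positions
	 * is MODULES_RAND_LEN / MODULE_ALIGN, so addr is a random,
	 * MODULE_ALIGN-aligned start inside the random area.
	 */
	offset = (get_random_long() % nr_mod_positions) * MODULE_ALIGN;
	addr = (unsigned long)MODULES_VADDR + offset;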

Randomness and Fragmentation
============================
The advantages of this algorithm over the existing one are higher entropy and
that each module text section is randomized in relation to the others, so that
if one location leaks, the locations of the other sections cannot be inferred.

However, unlike the existing algorithm, the amount of randomness provided
depends on the number of modules allocated and the sizes of their text
sections.

The following estimates are based on simulations done with core section
allocation sizes recorded from all in-tree x86_64 modules, and with a module
space size of 1GB (the size when KASLR is enabled). The entropy provided for
the Nth allocation comes from three sources of randomness: the address picked
in the random area, the probability that the section will be allocated in the
backup area, and the number of modules already allocated in the backup area.
For the lower bound entropy computed below, the randomness from the modules
already in the backup area, or overlapping from the random area, is ignored,
since it is usually small for small numbers of modules and would only increase
the entropy.

For p, the probability of the Nth module being in the backup area, a lower
bound entropy estimate is calculated here as:
Entropy = -((1-p)*log2((1-p)/(1073741824/4096)) + p*log2(p/1024))

Nth Modules Probability Nth in Backup (p<0.01) Entropy (bits)
200 0.00015658918 18.0009525805
300 0.00061754750 18.0025340517
400 0.00092257674 18.0032512276
500 0.00143354729 18.0041398771
600 0.00199926260 18.0048133611
700 0.00303342527 18.0054763676
800 0.00375362443 18.0056209924
900 0.00449013182 18.0055609282
1000 0.00506372420 18.0053909502
2000 0.01655518527 17.9891937614
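
As a quick sanity check on these numbers (not part of the series), the first
row can be reproduced with a few lines of C:

	/* Reproduces the N=200 row above, using the constants from this
	 * cover letter: a 1GB module space in 4096-byte pages and 1024
	 * possible backup offsets. Build with: gcc check.c -lm
	 */
	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		double p = 0.00015658918;	/* P(200th module in backup) */
		double pages = 1073741824.0 / 4096.0;	/* 262144 positions */
		double entropy = -((1.0 - p) * log2((1.0 - p) / pages)
				   + p * log2(p / 1024.0));

		printf("%.10f\n", entropy);	/* prints ~18.0009525805 */
		return 0;
	}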

For the subclass of control flow attacks, a wrong guess can often crash the
process or even the system, so the probability of the first guess being right
can be more important than that of the Nth guess. KASLR schemes usually give
each possible position equal probability, but in this scheme that is not the
case. So a more conservative comparison to existing schemes is the amount of
information that would have to be guessed correctly for the position that has
the highest probability of holding the Nth module (as that would be the
attacker's best guess).

This next table shows the bits that would have to be guessed for the most
likely position of the Nth module, assuming no other address has leaked:

Min Info = MIN(-log2(p/1024), -log2((1-p)/(1073741824/4096)))

Nth Module   Min Info          Random Area       Backup Area
200          18.00022592813    18.00022592813    22.64072780584
300          18.00089120792    18.00089120792    20.66116227856
400          18.00133161125    18.00133161125    20.08204345143
500          18.00206965540    18.00206965540    19.44619478537
600          18.00288721335    18.00288721335    18.96631630463
700          18.00438295865    18.00438295865    18.36483651470
800          18.00542552443    18.00542552443    18.05749997547
900          17.79902648177    18.00649247790    17.79902648177
1000         17.62558545623    18.00732396876    17.62558545623
2000         15.91657303366    18.02408399587    15.91657303366
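
Plugging the 200-module row into the formula as a worked example: with
p = 0.00015658918, -log2(p/1024) = 22.6407 bits for the backup area and
-log2((1-p)/262144) = 18.0002 bits for the random area, so the attacker's
best guess is the random area at roughly 18 bits, matching the table.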

So the defensive strength of this algorithm in typical usage (<800 modules) for
x86_64 should be at least 18 bits, even if an address from the random area
leaks.

If an address from a section in the backup area leaks, however, the remaining
information that would have to be guessed is reduced. To get a lower bound,
the following assumes the leaked address belongs to the first module in the
backup area and ignores the probability of guessing the identity (the Info
column below works out to -log2(p)):

Nth Module   P of At Least 2 in Backup (p<0.01)   Info (bits)
200          0.00005298177                        14.20414443057
300          0.00005298177                        14.20414443057
400          0.00034665456                        11.49421363374
500          0.00310895422                        8.32935491164
600          0.01299838019                        6.26552433915
700          0.04042051772                        4.62876838940
800          0.09812051823                        3.34930133623
900          0.19325547277                        2.37141882470
1000         0.32712329132                        1.61209361130

So in typical usage, the entropy will still be decent even if an address in
the backup area leaks as well.

As for fragmentation, this algorithm reduces the average number of modules that
can be loaded without an allocation failure by about 6% (~17000 to ~16000)
(p<0.05). It can also reduce the largest module executable section that can be
loaded by half to ~500MB in the worst case.

Implementation
==============
This patchset adds a new vmalloc function (__vmalloc_node_try_addr) that tries
to allocate at a specific address. In the x86 module loader, this new vmalloc
function is used to implement the algorithm described above.

The new __vmalloc_node_try_addr function uses the existing function
__vmalloc_node_range, in order to introduce this algorithm with the least
invasive change. The side effect is that each time there is a collision when
trying to allocate in the random area a TLB flush will be triggered. There is
a more complex, more efficient implementation that can be used instead if
there is interest in improving performance.
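
For reference, the module loader's call into the new primitive is a thin
wrapper; this mirrors try_module_alloc() from patch 2 of this series:

	/* Try one candidate address for a module text allocation.
	 * Returns the mapped address, or NULL if addr is unavailable.
	 */
	static void *try_module_alloc(unsigned long addr, unsigned long size)
	{
		return __vmalloc_node_try_addr(addr, size, GFP_KERNEL,
					PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
					__builtin_return_address(0));
	}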


Rick Edgecombe (3):
vmalloc: Add __vmalloc_node_try_addr function
x86/modules: Increase randomization for modules
vmalloc: Add debugfs modfraginfo

arch/x86/include/asm/pgtable_64_types.h | 1 +
arch/x86/kernel/module.c | 80 +++++++++++++++--
include/linux/vmalloc.h | 3 +
mm/vmalloc.c | 151 +++++++++++++++++++++++++++++++-
4 files changed, 227 insertions(+), 8 deletions(-)

--
2.7.4



2018-06-20 22:10:26

by Edgecombe, Rick P

Subject: [PATCH 1/3] vmalloc: Add __vmalloc_node_try_addr function

Create a __vmalloc_node_try_addr function that tries to allocate at a specific
address. The implementation relies on __vmalloc_node_range for the bulk of the
work. To keep this function from spamming the logs when an allocation fails,
__vmalloc_node_range is changed to only warn when __GFP_NOWARN is not set.
This behavior is consistent with this flag's interpretation in
alloc_vmap_area.

Signed-off-by: Rick Edgecombe <[email protected]>
---
include/linux/vmalloc.h | 3 +++
mm/vmalloc.c | 41 +++++++++++++++++++++++++++++++++++++++--
2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c9..6eaa896 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -82,6 +82,9 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
unsigned long start, unsigned long end, gfp_t gfp_mask,
pgprot_t prot, unsigned long vm_flags, int node,
const void *caller);
+extern void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
+ gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
+ int node, const void *caller);
#ifndef CONFIG_MMU
extern void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags);
static inline void *__vmalloc_node_flags_caller(unsigned long size, int node,
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index cfea25b..9e0820c9 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1710,6 +1710,42 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
}

/**
+ * __vmalloc_try_addr - try to alloc at a specific address
+ * @addr: address to try
+ * @size: size to try
+ * @gfp_mask: flags for the page level allocator
+ * @prot: protection mask for the allocated pages
+ * @vm_flags: additional vm area flags (e.g. %VM_NO_GUARD)
+ * @node: node to use for allocation or NUMA_NO_NODE
+ * @caller: caller's return address
+ *
+ * Try to allocate at the specific address. If it succeeds the address is
+ * returned. If it fails NULL is returned. It may trigger TLB flushes.
+ */
+void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
+ gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
+ int node, const void *caller)
+{
+ unsigned long addr_end;
+ unsigned long vsize = PAGE_ALIGN(size);
+
+ if (!vsize || (vsize >> PAGE_SHIFT) > totalram_pages)
+ return NULL;
+
+ if (!(vm_flags & VM_NO_GUARD))
+ vsize += PAGE_SIZE;
+
+ addr_end = addr + vsize;
+
+ if (addr > addr_end)
+ return NULL;
+
+ return __vmalloc_node_range(size, 1, addr, addr_end,
+ gfp_mask | __GFP_NOWARN, prot, vm_flags, node,
+ caller);
+}
+
+/**
* __vmalloc_node_range - allocate virtually contiguous memory
* @size: allocation size
* @align: desired alignment
@@ -1759,8 +1795,9 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
return addr;

fail:
- warn_alloc(gfp_mask, NULL,
- "vmalloc: allocation failure: %lu bytes", real_size);
+ if (!(gfp_mask & __GFP_NOWARN))
+ warn_alloc(gfp_mask, NULL,
+ "vmalloc: allocation failure: %lu bytes", real_size);
return NULL;
}

--
2.7.4


2018-06-20 22:10:38

by Edgecombe, Rick P

Subject: [PATCH 3/3] vmalloc: Add debugfs modfraginfo

Add debugfs file "modfraginfo" for providing info on module space
fragmentation. This can be used for determining if loadable module
randomization is causing any problems for extreme module loading situations,
like huge numbers of modules or extremely large modules.

Sample output when RANDOMIZE_BASE and X86_64 are configured:
Largest free space: 847253504
External Memory Fragementation: 20%
Allocations in backup area: 0

Sample output otherwise:
Largest free space: 847253504
External Memory Fragementation: 20%

Signed-off-by: Rick Edgecombe <[email protected]>
---
mm/vmalloc.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 109 insertions(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9e0820c9..afb8fe9 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -18,6 +18,7 @@
#include <linux/interrupt.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
+#include <linux/debugfs.h>
#include <linux/debugobjects.h>
#include <linux/kallsyms.h>
#include <linux/list.h>
@@ -33,6 +34,7 @@
#include <linux/bitops.h>

#include <linux/uaccess.h>
+#include <asm/setup.h>
#include <asm/tlbflush.h>
#include <asm/shmparam.h>

@@ -2785,7 +2787,113 @@ static int __init proc_vmalloc_init(void)
proc_create_seq("vmallocinfo", 0400, NULL, &vmalloc_op);
return 0;
}
-module_init(proc_vmalloc_init);
+#else
+static int proc_vmalloc_init(void)
+{
+ return 0;
+}
+#endif
+
+#ifdef CONFIG_DEBUG_FS
+#if defined(CONFIG_RANDOMIZE_BASE) && defined(CONFIG_X86_64)
+static void print_backup_area(struct seq_file *m, unsigned long backup_cnt)
+{
+ if (kaslr_enabled())
+ seq_printf(m, "Allocations in backup area:\t%lu\n", backup_cnt);
+}
+static unsigned long get_backup_start(void)
+{
+ return MODULES_VADDR + MODULES_RAND_LEN;
+}
+#else
+static void print_backup_area(struct seq_file *m, unsigned long backup_cnt)
+{
+}
+static unsigned long get_backup_start(void)
+{
+ return 0;
+}
+#endif
+
+static int modulefraginfo_debug_show(struct seq_file *m, void *v)
+{
+ struct list_head *i;
+ unsigned long last_end = MODULES_VADDR;
+ unsigned long total_free = 0;
+ unsigned long largest_free = 0;
+ unsigned long backup_cnt = 0;
+ unsigned long gap;
+
+ spin_lock(&vmap_area_lock);
+
+ list_for_each(i, &vmap_area_list) {
+ struct vmap_area *obj = list_entry(i, struct vmap_area, list);
+
+ if (!(obj->flags & VM_LAZY_FREE)
+ && obj->va_start >= MODULES_VADDR
+ && obj->va_end <= MODULES_END) {
+
+ if (obj->va_start >= get_backup_start())
+ backup_cnt++;
+
+ gap = (obj->va_start - last_end);
+ if (gap > largest_free)
+ largest_free = gap;
+ total_free += gap;
+
+ last_end = obj->va_end;
+ }
+ }
+
+ gap = (MODULES_END - last_end);
+ if (gap > largest_free)
+ largest_free = gap;
+ total_free += gap;
+
+ spin_unlock(&vmap_area_lock);
+
+ seq_printf(m, "Largest free space:\t\t%lu\n", largest_free);
+ if (total_free)
+ seq_printf(m, "External Memory Fragementation:\t%lu%%\n",
+ 100-(100*largest_free/total_free));
+ else
+ seq_puts(m, "External Memory Fragementation:\t0%%\n");
+
+ print_backup_area(m, backup_cnt);
+
+ return 0;
+}
+
+static int proc_module_frag_debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, modulefraginfo_debug_show, NULL);
+}
+
+static const struct file_operations debug_module_frag_operations = {
+ .open = proc_module_frag_debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};

+static void debug_modfrag_init(void)
+{
+ debugfs_create_file("modfraginfo", 0x0400, NULL, NULL,
+ &debug_module_frag_operations);
+}
+#else
+static void debug_modfrag_init(void)
+{
+}
#endif

+#if defined(CONFIG_DEBUG_FS) || defined(CONFIG_PROC_FS)
+static int __init info_vmalloc_init(void)
+{
+ proc_vmalloc_init();
+ debug_modfrag_init();
+ return 0;
+}
+
+module_init(info_vmalloc_init);
+#endif
--
2.7.4


2018-06-20 22:11:22

by Edgecombe, Rick P

Subject: [PATCH 2/3] x86/modules: Increase randomization for modules

This changes the behavior of the KASLR logic for allocating memory for the text
sections of loadable modules. It randomizes the location of each module text
section with about 18 bits of entropy in typical use. This is enabled on X86_64
only. For 32-bit, the behavior is unchanged.

The algorithm splits the module space evenly in two: a random area and a backup
area. For module text allocations, it first tries up to 10 randomly located
starting pages inside the random area. If these all fail, it allocates in the
backup area. The backup area base is offset in the same way the current
algorithm offsets the module area base, with 1024 possible locations.

Signed-off-by: Rick Edgecombe <[email protected]>
---
arch/x86/include/asm/pgtable_64_types.h | 1 +
arch/x86/kernel/module.c | 80 ++++++++++++++++++++++++++++++---
2 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 054765a..a98708a 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -141,6 +141,7 @@ extern unsigned int ptrs_per_p4d;
/* The module sections ends with the start of the fixmap */
#define MODULES_END _AC(0xffffffffff000000, UL)
#define MODULES_LEN (MODULES_END - MODULES_VADDR)
+#define MODULES_RAND_LEN (MODULES_LEN/2)

#define ESPFIX_PGD_ENTRY _AC(-2, UL)
#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << P4D_SHIFT)
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index f58336a..833ea81 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -77,6 +77,71 @@ static unsigned long int get_module_load_offset(void)
}
#endif

+static unsigned long get_module_area_base(void)
+{
+ return MODULES_VADDR + get_module_load_offset();
+}
+
+#if defined(CONFIG_X86_64) && defined(CONFIG_RANDOMIZE_BASE)
+static unsigned long get_module_vmalloc_start(void)
+{
+ if (kaslr_enabled())
+ return MODULES_VADDR + MODULES_RAND_LEN
+ + get_module_load_offset();
+ else
+ return get_module_area_base();
+}
+
+static void *try_module_alloc(unsigned long addr, unsigned long size)
+{
+ return __vmalloc_node_try_addr(addr, size, GFP_KERNEL,
+ PAGE_KERNEL_EXEC, 0,
+ NUMA_NO_NODE,
+ __builtin_return_address(0));
+}
+
+/*
+ * Try to allocate in 10 random positions starting in the random part of the
+ * module space. If these fail, return NULL.
+ */
+static void *try_module_randomize_each(unsigned long size)
+{
+ void *p = NULL;
+ unsigned int i;
+ unsigned long offset;
+ unsigned long addr;
+ unsigned long end;
+ const unsigned long nr_mod_positions = MODULES_RAND_LEN / MODULE_ALIGN;
+
+ if (!kaslr_enabled())
+ return NULL;
+
+ for (i = 0; i < 10; i++) {
+ offset = (get_random_long() % nr_mod_positions) * MODULE_ALIGN;
+ addr = (unsigned long)MODULES_VADDR + offset;
+ end = addr + size;
+
+ if (end > addr && end < MODULES_END) {
+ p = try_module_alloc(addr, size);
+
+ if (p)
+ return p;
+ }
+ }
+ return NULL;
+}
+#else
+static unsigned long get_module_vmalloc_start(void)
+{
+ return get_module_area_base();
+}
+
+static void *try_module_randomize_each(unsigned long size)
+{
+ return NULL;
+}
+#endif
+
void *module_alloc(unsigned long size)
{
void *p;
@@ -84,11 +149,16 @@ void *module_alloc(unsigned long size)
if (PAGE_ALIGN(size) > MODULES_LEN)
return NULL;

- p = __vmalloc_node_range(size, MODULE_ALIGN,
- MODULES_VADDR + get_module_load_offset(),
- MODULES_END, GFP_KERNEL,
- PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
+ p = try_module_randomize_each(size);
+
+ if (!p)
+ p = __vmalloc_node_range(size, MODULE_ALIGN,
+ get_module_vmalloc_start(),
+ MODULES_END, GFP_KERNEL,
+ PAGE_KERNEL_EXEC, 0,
+ NUMA_NO_NODE,
+ __builtin_return_address(0));
+
if (p && (kasan_module_alloc(p, size) < 0)) {
vfree(p);
return NULL;
--
2.7.4


2018-06-20 22:18:19

by Randy Dunlap

Subject: Re: [PATCH 1/3] vmalloc: Add __vmalloc_node_try_addr function

On 06/20/2018 03:09 PM, Rick Edgecombe wrote:
> Create a __vmalloc_node_try_addr function that tries to allocate at a specific
> address. The implementation relies on __vmalloc_node_range for the bulk of the
> work. To keep this function from spamming the logs when an allocation fails,
> __vmalloc_node_range is changed to only warn when __GFP_NOWARN is not set.
> This behavior is consistent with this flag's interpretation in
> alloc_vmap_area.
>
> Signed-off-by: Rick Edgecombe <[email protected]>
> ---
> include/linux/vmalloc.h | 3 +++
> mm/vmalloc.c | 41 +++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 398e9c9..6eaa896 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -82,6 +82,9 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
> unsigned long start, unsigned long end, gfp_t gfp_mask,
> pgprot_t prot, unsigned long vm_flags, int node,
> const void *caller);
> +extern void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
> + gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
> + int node, const void *caller);
> #ifndef CONFIG_MMU
> extern void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags);
> static inline void *__vmalloc_node_flags_caller(unsigned long size, int node,

> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index cfea25b..9e0820c9 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1710,6 +1710,42 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> }
>
> /**
> + * __vmalloc_try_addr - try to alloc at a specific address

* __vmalloc_node_try_addr - try to allocate at a specific address

> + * @addr: address to try
> + * @size: size to try
> + * @gfp_mask: flags for the page level allocator
> + * @prot: protection mask for the allocated pages
> + * @vm_flags: additional vm area flags (e.g. %VM_NO_GUARD)
> + * @node: node to use for allocation or NUMA_NO_NODE
> + * @caller: caller's return address
> + *
> + * Try to allocate at the specific address. If it succeeds the address is
> + * returned. If it fails NULL is returned. It may trigger TLB flushes.
> + */
> +void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
> + gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
> + int node, const void *caller)
> +{

so this isn't optional, eh? You are going to force it on people because?

thanks,
--
~Randy

2018-06-20 22:27:55

by Matthew Wilcox

Subject: Re: [PATCH 1/3] vmalloc: Add __vmalloc_node_try_addr function

On Wed, Jun 20, 2018 at 03:09:28PM -0700, Rick Edgecombe wrote:
>
> /**
> + * __vmalloc_try_addr - try to alloc at a specific address
> + * @addr: address to try
> + * @size: size to try
> + * @gfp_mask: flags for the page level allocator
> + * @prot: protection mask for the allocated pages
> + * @vm_flags: additional vm area flags (e.g. %VM_NO_GUARD)
> + * @node: node to use for allocation or NUMA_NO_NODE
> + * @caller: caller's return address
> + *
> + * Try to allocate at the specific address. If it succeeds the address is
> + * returned. If it fails NULL is returned. It may trigger TLB flushes.

* Try to allocate memory at a specific address. May trigger TLB flushes.
*
* Context: Process context.
* Return: The allocated address if it succeeds. NULL if it fails.

> @@ -1759,8 +1795,9 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> return addr;
>
> fail:
> - warn_alloc(gfp_mask, NULL,
> - "vmalloc: allocation failure: %lu bytes", real_size);
> + if (!(gfp_mask & __GFP_NOWARN))
> + warn_alloc(gfp_mask, NULL,
> + "vmalloc: allocation failure: %lu bytes", real_size);
> return NULL;

Not needed:

void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
{
...
if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
return;


2018-06-20 22:34:42

by Kees Cook

Subject: Re: [PATCH 0/3] KASLR feature to randomize each loadable module

On Wed, Jun 20, 2018 at 3:09 PM, Rick Edgecombe
<[email protected]> wrote:
> This patch changes the module loading KASLR algorithm to randomize the position
> of each module text section allocation with at least 18 bits of entropy in the
> typical case. It is used on x86_64 only for now.

Very cool! Thanks for sending the series. :)

> Today the RANDOMIZE_BASE feature randomizes the base address where the module
> allocations begin, with 10 bits of entropy. From there, a highly deterministic
> algorithm allocates space for the modules as they are loaded and unloaded. If
> an attacker can predict the order and identities of the modules that will be
> loaded, then a single text address leak can give the attacker access to the

nit: "text address" -> "module text address"

> So the defensive strength of this algorithm in typical usage (<800 modules) for
> x86_64 should be at least 18 bits, even if an address from the random area
> leaks.

And most systems have <200 modules, really. I have 113 on a desktop
right now, 63 on a server. So this looks like a trivial win.

> As for fragmentation, this algorithm reduces the average number of modules that
> can be loaded without an allocation failure by about 6% (~17000 to ~16000)
> (p<0.05). It can also reduce the largest module executable section that can be
> loaded by half to ~500MB in the worst case.

Given that we only have 8312 tristate Kconfig items, I think 16000
will remain just fine. And even large modules (i915) are under 2MB...

> The new __vmalloc_node_try_addr function uses the existing function
> __vmalloc_node_range, in order to introduce this algorithm with the least
> invasive change. The side effect is that each time there is a collision when
> trying to allocate in the random area a TLB flush will be triggered. There is
> a more complex, more efficient implementation that can be used instead if
> there is interest in improving performance.

The only time when module loading speed is noticeable, I would think,
would be boot time. Have you done any boot time delta analysis? I
wouldn't expect it to change hardly at all, but it's probably a good
idea to actually test it. :)

Also: can this be generalized for use on other KASLRed architectures?
For example, I know the arm64 module randomization is pretty similar
to x86.

-Kees

--
Kees Cook
Pixel Security

2018-06-20 22:37:47

by Kees Cook

Subject: Re: [PATCH 1/3] vmalloc: Add __vmalloc_node_try_addr function

On Wed, Jun 20, 2018 at 3:16 PM, Randy Dunlap <[email protected]> wrote:
> On 06/20/2018 03:09 PM, Rick Edgecombe wrote:
>> +void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
>> + gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
>> + int node, const void *caller)
>> +{
>
> so this isn't optional, eh? You are going to force it on people because?

RANDOMIZE_BASE isn't optional either. :) This improves the module
address entropy with (what seems to be) no down-side, so yeah, I think
it should be non-optional. :)

-Kees

--
Kees Cook
Pixel Security

2018-06-20 22:47:06

by Randy Dunlap

Subject: Re: [PATCH 1/3] vmalloc: Add __vmalloc_node_try_addr function

On 06/20/2018 03:35 PM, Kees Cook wrote:
> On Wed, Jun 20, 2018 at 3:16 PM, Randy Dunlap <[email protected]> wrote:
>> On 06/20/2018 03:09 PM, Rick Edgecombe wrote:
>>> +void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
>>> + gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
>>> + int node, const void *caller)
>>> +{
>>
>> so this isn't optional, eh? You are going to force it on people because?
>
> RANDOMIZE_BASE isn't optional either. :) This improves the module
> address entropy with (what seems to be) no down-side, so yeah, I think
> it should be non-optional. :)

In what kernel tree is RANDOMIZE_BASE not optional?

x86:
config RANDOMIZE_BASE
bool "Randomize the address of the kernel image (KASLR)"
depends on RELOCATABLE
default y

mips:
config RANDOMIZE_BASE
bool "Randomize the address of the kernel image"
depends on RELOCATABLE

arm64:
config RANDOMIZE_BASE
bool "Randomize the address of the kernel image"
select ARM64_MODULE_PLTS if MODULES
select RELOCATABLE


thanks,
--
~Randy

2018-06-20 23:06:53

by Kees Cook

Subject: Re: [PATCH 1/3] vmalloc: Add __vmalloc_node_try_addr function

On Wed, Jun 20, 2018 at 3:44 PM, Randy Dunlap <[email protected]> wrote:
> On 06/20/2018 03:35 PM, Kees Cook wrote:
>> On Wed, Jun 20, 2018 at 3:16 PM, Randy Dunlap <[email protected]> wrote:
>>> On 06/20/2018 03:09 PM, Rick Edgecombe wrote:
>>>> +void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
>>>> + gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
>>>> + int node, const void *caller)
>>>> +{
>>>
>>> so this isn't optional, eh? You are going to force it on people because?
>>
>> RANDOMIZE_BASE isn't optional either. :) This improves the module
>> address entropy with (what seems to be) no down-side, so yeah, I think
>> it should be non-optional. :)
>
> In what kernel tree is RANDOMIZE_BASE not optional?

Oh, sorry, I misspoke: on by default. It _is_ possible to turn it off.

But patch #2 does check for RANDOMIZE_BASE, so it should work as expected, yes?

Or did you want even this helper function to be compiled out without it?

-Kees

--
Kees Cook
Pixel Security

2018-06-20 23:18:41

by Randy Dunlap

Subject: Re: [PATCH 1/3] vmalloc: Add __vmalloc_node_try_addr function

On 06/20/2018 04:05 PM, Kees Cook wrote:
> On Wed, Jun 20, 2018 at 3:44 PM, Randy Dunlap <[email protected]> wrote:
>> On 06/20/2018 03:35 PM, Kees Cook wrote:
>>> On Wed, Jun 20, 2018 at 3:16 PM, Randy Dunlap <[email protected]> wrote:
>>>> On 06/20/2018 03:09 PM, Rick Edgecombe wrote:
>>>>> +void *__vmalloc_node_try_addr(unsigned long addr, unsigned long size,
>>>>> + gfp_t gfp_mask, pgprot_t prot, unsigned long vm_flags,
>>>>> + int node, const void *caller)
>>>>> +{
>>>>
>>>> so this isn't optional, eh? You are going to force it on people because?
>>>
>>> RANDOMIZE_BASE isn't optional either. :) This improves the module
>>> address entropy with (what seems to be) no down-side, so yeah, I think
>>> it should be non-optional. :)
>>
>> In what kernel tree is RANDOMIZE_BASE not optional?
>
> Oh, sorry, I misspoke: on by default. It _is_ possible to turn it off.
>
> But patch #2 does check for RANDOMIZE_BASE, so it should work as expected, yes?
>
> Or did you want even this helper function to be compiled out without it?

Thanks, I missed it. :(

Looks fine.

--
~Randy

2018-06-21 00:54:55

by kernel test robot

Subject: Re: [PATCH 3/3] vmalloc: Add debugfs modfraginfo

Hi Rick,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.18-rc1 next-20180620]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Rick-Edgecombe/KASLR-feature-to-randomize-each-loadable-module/20180621-061051
base: git://git.cmpxchg.org/linux-mmotm.git master
config: mips-fuloong2e_defconfig (attached as .config)
compiler: mips64el-linux-gnuabi64-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=mips

All errors (new ones prefixed by >>):

mm/vmalloc.c: In function 'modulefraginfo_debug_show':
>> mm/vmalloc.c:2821:27: error: 'MODULES_VADDR' undeclared (first use in this function); did you mean 'MODULE_END'?
unsigned long last_end = MODULES_VADDR;
^~~~~~~~~~~~~
MODULE_END
mm/vmalloc.c:2821:27: note: each undeclared identifier is reported only once for each function it appears in
>> mm/vmalloc.c:2834:22: error: 'MODULES_END' undeclared (first use in this function); did you mean 'MODULE_END'?
&& obj->va_end <= MODULES_END) {
^~~~~~~~~~~
MODULE_END

vim +2821 mm/vmalloc.c

2817
2818 static int modulefraginfo_debug_show(struct seq_file *m, void *v)
2819 {
2820 struct list_head *i;
> 2821 unsigned long last_end = MODULES_VADDR;
2822 unsigned long total_free = 0;
2823 unsigned long largest_free = 0;
2824 unsigned long backup_cnt = 0;
2825 unsigned long gap;
2826
2827 spin_lock(&vmap_area_lock);
2828
2829 list_for_each(i, &vmap_area_list) {
2830 struct vmap_area *obj = list_entry(i, struct vmap_area, list);
2831
2832 if (!(obj->flags & VM_LAZY_FREE)
2833 && obj->va_start >= MODULES_VADDR
> 2834 && obj->va_end <= MODULES_END) {
2835
2836 if (obj->va_start >= get_backup_start())
2837 backup_cnt++;
2838
2839 gap = (obj->va_start - last_end);
2840 if (gap > largest_free)
2841 largest_free = gap;
2842 total_free += gap;
2843
2844 last_end = obj->va_end;
2845 }
2846 }
2847
2848 gap = (MODULES_END - last_end);
2849 if (gap > largest_free)
2850 largest_free = gap;
2851 total_free += gap;
2852
2853 spin_unlock(&vmap_area_lock);
2854
2855 seq_printf(m, "Largest free space:\t\t%lu\n", largest_free);
2856 if (total_free)
2857 seq_printf(m, "External Memory Fragementation:\t%lu%%\n",
2858 100-(100*largest_free/total_free));
2859 else
2860 seq_puts(m, "External Memory Fragementation:\t0%%\n");
2861
2862 print_backup_area(m, backup_cnt);
2863
2864 return 0;
2865 }
2866

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2018-06-21 01:20:13

by kernel test robot

Subject: Re: [PATCH 3/3] vmalloc: Add debugfs modfraginfo

Hi Rick,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.18-rc1 next-20180620]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Rick-Edgecombe/KASLR-feature-to-randomize-each-loadable-module/20180621-061051
base: git://git.cmpxchg.org/linux-mmotm.git master
config: parisc-c3000_defconfig (attached as .config)
compiler: hppa-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=parisc

All errors (new ones prefixed by >>):

mm/vmalloc.c: In function 'modulefraginfo_debug_show':
>> mm/vmalloc.c:2821:27: error: 'MODULES_VADDR' undeclared (first use in this function); did you mean 'MODULE_AUTHOR'?
unsigned long last_end = MODULES_VADDR;
^~~~~~~~~~~~~
MODULE_AUTHOR
mm/vmalloc.c:2821:27: note: each undeclared identifier is reported only once for each function it appears in
mm/vmalloc.c:2834:22: error: 'MODULES_END' undeclared (first use in this function); did you mean 'MODULES_VADDR'?
&& obj->va_end <= MODULES_END) {
^~~~~~~~~~~
MODULES_VADDR


---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2018-06-21 12:34:48

by Jann Horn

Subject: Re: [PATCH 3/3] vmalloc: Add debugfs modfraginfo

On Thu, Jun 21, 2018 at 12:12 AM Rick Edgecombe
<[email protected]> wrote:
> Add debugfs file "modfraginfo" for providing info on module space
> fragmentation. This can be used for determining if loadable module
> randomization is causing any problems for extreme module loading situations,
> like huge numbers of modules or extremely large modules.
>
> Sample output when RANDOMIZE_BASE and X86_64 are configured:
> Largest free space: 847253504
> External Memory Fragementation: 20%
> Allocations in backup area: 0
>
> Sample output otherwise:
> Largest free space: 847253504
> External Memory Fragementation: 20%
[...]
> + seq_printf(m, "Largest free space:\t\t%lu\n", largest_free);
> + if (total_free)
> + seq_printf(m, "External Memory Fragementation:\t%lu%%\n",

"Fragmentation"

> + 100-(100*largest_free/total_free));
> + else
> + seq_puts(m, "External Memory Fragementation:\t0%%\n");

"Fragmentation"

[...]
> +static const struct file_operations debug_module_frag_operations = {
> + .open = proc_module_frag_debug_open,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = single_release,
> +};
>
> +static void debug_modfrag_init(void)
> +{
> + debugfs_create_file("modfraginfo", 0x0400, NULL, NULL,
> + &debug_module_frag_operations);

0x0400 is 02000, which is the setgid bit. I think you meant to type 0400?

2018-06-21 13:39:14

by Jann Horn

Subject: Re: [PATCH 0/3] KASLR feature to randomize each loadable module

On Thu, Jun 21, 2018 at 12:34 AM Kees Cook <[email protected]> wrote:
>
> On Wed, Jun 20, 2018 at 3:09 PM, Rick Edgecombe
> <[email protected]> wrote:
> > This patch changes the module loading KASLR algorithm to randomize the position
> > of each module text section allocation with at least 18 bits of entropy in the
> > typical case. It is used on x86_64 only for now.
>
> Very cool! Thanks for sending the series. :)
>
> > Today the RANDOMIZE_BASE feature randomizes the base address where the module
> > allocations begin, with 10 bits of entropy. From there, a highly deterministic
> > algorithm allocates space for the modules as they are loaded and unloaded. If
> > an attacker can predict the order and identities of the modules that will be
> > loaded, then a single text address leak can give the attacker access to the
>
> nit: "text address" -> "module text address"
>
> > So the defensive strength of this algorithm in typical usage (<800 modules) for
> > x86_64 should be at least 18 bits, even if an address from the random area
> > leaks.
>
> And most systems have <200 modules, really. I have 113 on a desktop
> right now, 63 on a server. So this looks like a trivial win.

But note that the eBPF JIT also uses module_alloc(). Every time a BPF
program (this includes seccomp filters!) is JIT-compiled by the
kernel, another module_alloc() allocation is made. For example, on my
desktop machine, I have a bunch of seccomp-sandboxed processes thanks
to Chrome. If I enable the net.core.bpf_jit_enable sysctl and open a
few Chrome tabs, BPF JIT allocations start showing up between modules:

# grep -C1 bpf_jit_binary_alloc /proc/vmallocinfo | cut -d' ' -f 2-
20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
12288 bpf_jit_binary_alloc+0x32/0x90 pages=2 vmalloc N0=2
20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
--
20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
12288 bpf_jit_binary_alloc+0x32/0x90 pages=2 vmalloc N0=2
36864 load_module+0x1326/0x2ab0 pages=8 vmalloc N0=8
--
20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
12288 bpf_jit_binary_alloc+0x32/0x90 pages=2 vmalloc N0=2
40960 load_module+0x1326/0x2ab0 pages=9 vmalloc N0=9
--
20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
12288 bpf_jit_binary_alloc+0x32/0x90 pages=2 vmalloc N0=2
253952 load_module+0x1326/0x2ab0 pages=61 vmalloc N0=61

If you use Chrome with Site Isolation, you have a few dozen open tabs,
and the BPF JIT is enabled, reaching a few hundred allocations might
not be that hard.

Also: What's the impact on memory usage? Is this going to increase the
number of pagetables that need to be allocated by the kernel per
module_alloc() by 4K or 8K or so?

> > As for fragmentation, this algorithm reduces the average number of modules that
> > can be loaded without an allocation failure by about 6% (~17000 to ~16000)
> > (p<0.05). It can also reduce the largest module executable section that can be
> > loaded by half to ~500MB in the worst case.
>
> Given that we only have 8312 tristate Kconfig items, I think 16000
> will remain just fine. And even large modules (i915) are under 2MB...
>
> > The new __vmalloc_node_try_addr function uses the existing function
> > __vmalloc_node_range, in order to introduce this algorithm with the least
> > invasive change. The side effect is that each time there is a collision when
> > trying to allocate in the random area a TLB flush will be triggered. There is
> > a more complex, more efficient implementation that can be used instead if
> > there is interest in improving performance.
>
> The only time when module loading speed is noticeable, I would think,
> would be boot time. Have you done any boot time delta analysis? I
> wouldn't expect it to change hardly at all, but it's probably a good
> idea to actually test it. :)

If you have a forking server that applies seccomp filters on each
fork, or something like that, you might care about those TLB flushes.

> Also: can this be generalized for use on other KASLRed architectures?
> For example, I know the arm64 module randomization is pretty similar
> to x86.
>
> -Kees
>
> --
> Kees Cook
> Pixel Security

2018-06-21 13:40:18

by Jann Horn

Subject: Re: [PATCH 0/3] KASLR feature to randomize each loadable module

On Thu, Jun 21, 2018 at 3:37 PM Jann Horn <[email protected]> wrote:
>
> On Thu, Jun 21, 2018 at 12:34 AM Kees Cook <[email protected]> wrote:
> >
> > On Wed, Jun 20, 2018 at 3:09 PM, Rick Edgecombe
> > <[email protected]> wrote:
> > > This patch changes the module loading KASLR algorithm to randomize the position
> > > of each module text section allocation with at least 18 bits of entropy in the
> > > typical case. It is used on x86_64 only for now.
> >
> > Very cool! Thanks for sending the series. :)
> >
> > > Today the RANDOMIZE_BASE feature randomizes the base address where the module
> > > allocations begin, with 10 bits of entropy. From there, a highly deterministic
> > > algorithm allocates space for the modules as they are loaded and unloaded. If
> > > an attacker can predict the order and identities of the modules that will be
> > > loaded, then a single text address leak can give the attacker access to the
> >
> > nit: "text address" -> "module text address"
> >
> > > So the defensive strength of this algorithm in typical usage (<800 modules) for
> > > x86_64 should be at least 18 bits, even if an address from the random area
> > > leaks.
> >
> > And most systems have <200 modules, really. I have 113 on a desktop
> > right now, 63 on a server. So this looks like a trivial win.
[...]
> Also: What's the impact on memory usage? Is this going to increase the
> number of pagetables that need to be allocated by the kernel per
> module_alloc() by 4K or 8K or so?

Sorry, I meant increase the amount of memory used by pagetables by 4K
or 8K, not the number of pagetables.

2018-06-21 18:57:13

by Edgecombe, Rick P

Subject: Re: [PATCH 0/3] KASLR feature to randomize each loadable module

On Wed, 2018-06-20 at 15:33 -0700, Kees Cook wrote:
> > The new __vmalloc_node_try_addr function uses the existing function
> > __vmalloc_node_range, in order to introduce this algorithm with the
> > least
> > invasive change. The side effect is that each time there is a
> > collision when
> > trying to allocate in the random area a TLB flush will be
> > triggered. There is
> > a more complex, more efficient implementation that can be used
> > instead if
> > there is interest in improving performance.
> The only time when module loading speed is noticeable, I would think,
> would be boot time. Have you done any boot time delta analysis? I
> wouldn't expect it to change hardly at all, but it's probably a good
> idea to actually test it. :)

Thanks, I'll do some tests.

> Also: can this be generalized for use on other KASLRed architectures?
> For example, I know the arm64 module randomization is pretty similar
> to x86.

I started with arch/x86/kernel/module.c because that is where the
existing implementation was, but I don't know of any reason why
it could not apply to other architectures in general.

The randomness estimates would differ if the module size probability
distribution, module space size, or module alignment were different.

2018-06-21 18:57:23

by Edgecombe, Rick P

Subject: Re: [PATCH 3/3] vmalloc: Add debugfs modfraginfo

On Thu, 2018-06-21 at 14:32 +0200, Jann Horn wrote:
> On Thu, Jun 21, 2018 at 12:12 AM Rick Edgecombe
> <[email protected]> wrote:
> >
> > Add debugfs file "modfraginfo" for providing info on module space
> > fragmentation.  This can be used for determining if loadable module
> > randomization is causing any problems for extreme module loading
> > situations,
> > like huge numbers of modules or extremely large modules.
> >
> > Sample output when RANDOMIZE_BASE and X86_64 are configured:
> > Largest free space:             847253504
> > External Memory Fragementation: 20%
> > Allocations in backup area:     0
> >
> > Sample output otherwise:
> > Largest free space:             847253504
> > External Memory Fragementation: 20%
> [...]
> >
> > +       seq_printf(m, "Largest free space:\t\t%lu\n",
> > largest_free);
> > +       if (total_free)
> > +               seq_printf(m, "External Memory
> > Fragementation:\t%lu%%\n",
> "Fragmentation"
>
> >
> > +                       100-(100*largest_free/total_free));
> > +       else
> > +               seq_puts(m, "External Memory
> > Fragementation:\t0%%\n");
> "Fragmentation"

Oops! Thanks.

> [...]
> >
> > +static const struct file_operations debug_module_frag_operations =
> > {
> > +       .open       = proc_module_frag_debug_open,
> > +       .read       = seq_read,
> > +       .llseek     = seq_lseek,
> > +       .release    = single_release,
> > +};
> >
> > +static void debug_modfrag_init(void)
> > +{
> > +       debugfs_create_file("modfraginfo", 0x0400, NULL, NULL,
> > +                       &debug_module_frag_operations);
> 0x0400 is 02000, which is the setgid bit. I think you meant to type
> 0400?

Yes, thanks.

2018-06-21 19:00:57

by Edgecombe, Rick P

Subject: Re: [PATCH 0/3] KASLR feature to randomize each loadable module

On Thu, 2018-06-21 at 15:37 +0200, Jann Horn wrote:
> On Thu, Jun 21, 2018 at 12:34 AM Kees Cook <[email protected]>
> wrote:
> > And most systems have <200 modules, really. I have 113 on a desktop
> > right now, 63 on a server. So this looks like a trivial win.
> But note that the eBPF JIT also uses module_alloc(). Every time a BPF
> program (this includes seccomp filters!) is JIT-compiled by the
> kernel, another module_alloc() allocation is made. For example, on my
> desktop machine, I have a bunch of seccomp-sandboxed processes thanks
> to Chrome. If I enable the net.core.bpf_jit_enable sysctl and open a
> few Chrome tabs, BPF JIT allocations start showing up between
> modules:
>
> # grep -C1 bpf_jit_binary_alloc /proc/vmallocinfo | cut -d' ' -f 2-
>   20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
>   12288 bpf_jit_binary_alloc+0x32/0x90 pages=2 vmalloc N0=2
>   20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
> --
>   20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
>   12288 bpf_jit_binary_alloc+0x32/0x90 pages=2 vmalloc N0=2
>   36864 load_module+0x1326/0x2ab0 pages=8 vmalloc N0=8
> --
>   20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
>   12288 bpf_jit_binary_alloc+0x32/0x90 pages=2 vmalloc N0=2
>   40960 load_module+0x1326/0x2ab0 pages=9 vmalloc N0=9
> --
>   20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
>   12288 bpf_jit_binary_alloc+0x32/0x90 pages=2 vmalloc N0=2
>  253952 load_module+0x1326/0x2ab0 pages=61 vmalloc N0=61
>
> If you use Chrome with Site Isolation, you have a few dozen open
> tabs,
> and the BPF JIT is enabled, reaching a few hundred allocations might
> not be that hard.
>
> Also: What's the impact on memory usage? Is this going to increase
> the
> number of pagetables that need to be allocated by the kernel per
> module_alloc() by 4K or 8K or so?
Thanks, it seems it might require some extra memory.  I'll look into it
to find out exactly how much.

I didn't include eBPF modules in the randomization estimates, but it
looks like they are usually smaller than a page.  So with the slight
leap that the estimate based on the larger normal modules is the worst
case, you should still get ~800 modules at 18 bits. After that it will
start to go down toward 10 bits, so in either case it at least won't
regress the randomness of the existing algorithm.

> >
> > >
> > > As for fragmentation, this algorithm reduces the average number
> > > of modules that
> > > can be loaded without an allocation failure by about 6% (~17000
> > > to ~16000)
> > > (p<0.05). It can also reduce the largest module executable
> > > section that can be
> > > loaded by half to ~500MB in the worst case.
> > Given that we only have 8312 tristate Kconfig items, I think 16000
> > will remain just fine. And even large modules (i915) are under
> > 2MB...
> >
> > >
> > > The new __vmalloc_node_try_addr function uses the existing
> > > function
> > > __vmalloc_node_range, in order to introduce this algorithm with
> > > the least
> > > invasive change. The side effect is that each time there is a
> > > collision when
> > > trying to allocate in the random area a TLB flush will be
> > > triggered. There is
> > > a more complex, more efficient implementation that can be used
> > > instead if
> > > there is interest in improving performance.
> > The only time when module loading speed is noticeable, I would
> > think,
> > would be boot time. Have you done any boot time delta analysis? I
> > wouldn't expect it to change hardly at all, but it's probably a
> > good
> > idea to actually test it. :)
> If you have a forking server that applies seccomp filters on each
> fork, or something like that, you might care about those TLB flushes.
>

I can test this as well.

2018-06-21 21:24:14

by Daniel Borkmann

Subject: Re: [PATCH 0/3] KASLR feature to randomize each loadable module

On 06/21/2018 08:59 PM, Edgecombe, Rick P wrote:
> On Thu, 2018-06-21 at 15:37 +0200, Jann Horn wrote:
>> On Thu, Jun 21, 2018 at 12:34 AM Kees Cook <[email protected]>
>> wrote:
>>> And most systems have <200 modules, really. I have 113 on a desktop
>>> right now, 63 on a server. So this looks like a trivial win.
>> But note that the eBPF JIT also uses module_alloc(). Every time a BPF
>> program (this includes seccomp filters!) is JIT-compiled by the
>> kernel, another module_alloc() allocation is made. For example, on my
>> desktop machine, I have a bunch of seccomp-sandboxed processes thanks
>> to Chrome. If I enable the net.core.bpf_jit_enable sysctl and open a
>> few Chrome tabs, BPF JIT allocations start showing up between
>> modules:
>>
>> # grep -C1 bpf_jit_binary_alloc /proc/vmallocinfo | cut -d' ' -f 2-
>>   20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
>>   12288 bpf_jit_binary_alloc+0x32/0x90 pages=2 vmalloc N0=2
>>   20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
>> --
>>   20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
>>   12288 bpf_jit_binary_alloc+0x32/0x90 pages=2 vmalloc N0=2
>>   36864 load_module+0x1326/0x2ab0 pages=8 vmalloc N0=8
>> --
>>   20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
>>   12288 bpf_jit_binary_alloc+0x32/0x90 pages=2 vmalloc N0=2
>>   40960 load_module+0x1326/0x2ab0 pages=9 vmalloc N0=9
>> --
>>   20480 load_module+0x1326/0x2ab0 pages=4 vmalloc N0=4
>>   12288 bpf_jit_binary_alloc+0x32/0x90 pages=2 vmalloc N0=2
>>  253952 load_module+0x1326/0x2ab0 pages=61 vmalloc N0=61
>>
>> If you use Chrome with Site Isolation, you have a few dozen open
>> tabs,
>> and the BPF JIT is enabled, reaching a few hundred allocations might
>> not be that hard.
>>
>> Also: What's the impact on memory usage? Is this going to increase
>> the
>> number of pagetables that need to be allocated by the kernel per
>> module_alloc() by 4K or 8K or so?
> Thanks, it seems it might require some extra memory.  I'll look into it
> to find out exactly how much.
>
> I didn't include eBFP modules in the randomization estimates, but it
> looks like they are usually smaller than a page.  So with the slight
> leap that the larger normal modules based estimate is the worst case,
> you should still get ~800 modules at 18 bits. After that it will start
> to go down to 10 bits and so in either case it at least won't regress
> the randomness of the existing algorithm.

Assume typically complex (real) programs at around 2.5k BPF insns today.
In our case it's at most a handful per net device, thus approx per netns
(veth), which can be a few hundred. Worst case is the 4k insns that BPF
allows and then JITs. There's a BPF kselftest suite you could also run to
check on worst case upper bounds.

2018-06-21 22:03:46

by Edgecombe, Rick P

Subject: Re: [PATCH 1/3] vmalloc: Add __vmalloc_node_try_addr function

On Wed, 2018-06-20 at 15:26 -0700, Matthew Wilcox wrote:
> Not needed:
>
> void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char
> *fmt, ...)
> {
> ...
>         if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
>                 return;
>
Yes, thanks!