2020-12-01 21:48:52

by Topi Miettinen

Subject: [PATCH] mm/vmalloc: randomize vmalloc() allocations

Memory mappings allocated inside the kernel with vmalloc() are in a
predictable order and packed tightly toward the low addresses. With the
new kernel boot parameter 'randomize_vmalloc=1', the entire area is
used randomly to make the allocations less predictable and harder for
attackers to guess.

Without randomize_vmalloc=1:
$ cat /proc/vmallocinfo
0xffffc90000000000-0xffffc90000002000 8192 acpi_os_map_iomem+0x29e/0x2c0 phys=0x000000003ffe1000 ioremap
0xffffc90000002000-0xffffc90000005000 12288 acpi_os_map_iomem+0x29e/0x2c0 phys=0x000000003ffe0000 ioremap
0xffffc90000005000-0xffffc90000007000 8192 hpet_enable+0x36/0x4a9 phys=0x00000000fed00000 ioremap
0xffffc90000007000-0xffffc90000009000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
0xffffc90000009000-0xffffc9000000b000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
0xffffc9000000b000-0xffffc9000000d000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
0xffffc9000000d000-0xffffc9000000f000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
0xffffc90000011000-0xffffc90000015000 16384 n_tty_open+0x16/0xe0 pages=3 vmalloc
0xffffc900003de000-0xffffc900003e0000 8192 acpi_os_map_iomem+0x29e/0x2c0 phys=0x00000000fed00000 ioremap
0xffffc900003e0000-0xffffc900003e2000 8192 memremap+0x1a1/0x280 phys=0x00000000000f5000 ioremap
0xffffc900003e2000-0xffffc900003f3000 69632 pcpu_create_chunk+0x80/0x2c0 pages=16 vmalloc
0xffffc900003f3000-0xffffc90000405000 73728 pcpu_create_chunk+0xb7/0x2c0 pages=17 vmalloc
0xffffc90000405000-0xffffc9000040a000 20480 pcpu_create_chunk+0xed/0x2c0 pages=4 vmalloc
0xffffe8ffffc00000-0xffffe8ffffe00000 2097152 pcpu_get_vm_areas+0x0/0x1a40 vmalloc

With randomize_vmalloc=1, the allocations are randomized:
$ cat /proc/vmallocinfo
0xffffca3a36442000-0xffffca3a36447000 20480 pcpu_create_chunk+0xed/0x2c0 pages=4 vmalloc
0xffffca63034d6000-0xffffca63034d9000 12288 acpi_os_map_iomem+0x29e/0x2c0 phys=0x000000003ffe0000 ioremap
0xffffcce23d32e000-0xffffcce23d330000 8192 memremap+0x1a1/0x280 phys=0x00000000000f5000 ioremap
0xffffcfb9f0e22000-0xffffcfb9f0e24000 8192 hpet_enable+0x36/0x4a9 phys=0x00000000fed00000 ioremap
0xffffd1df23e9e000-0xffffd1df23eb0000 73728 pcpu_create_chunk+0xb7/0x2c0 pages=17 vmalloc
0xffffd690c2990000-0xffffd690c2992000 8192 acpi_os_map_iomem+0x29e/0x2c0 phys=0x000000003ffe1000 ioremap
0xffffd8460c718000-0xffffd8460c71c000 16384 n_tty_open+0x16/0xe0 pages=3 vmalloc
0xffffd89aba709000-0xffffd89aba70b000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
0xffffe0ca3f2ed000-0xffffe0ca3f2ef000 8192 acpi_os_map_iomem+0x29e/0x2c0 phys=0x00000000fed00000 ioremap
0xffffe3ba44802000-0xffffe3ba44804000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
0xffffe4524b2a2000-0xffffe4524b2a4000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
0xffffe61372b2e000-0xffffe61372b30000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
0xffffe704d2f7c000-0xffffe704d2f8d000 69632 pcpu_create_chunk+0x80/0x2c0 pages=16 vmalloc
0xffffe8ffffc00000-0xffffe8ffffe00000 2097152 pcpu_get_vm_areas+0x0/0x1a40 vmalloc
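For reference, the offset computation in the mm/vmalloc.c hunk below can be modelled in plain userspace C. The names here (pick_voffset(), roundup_pow_of_two_ul()) and the stubbed macros are illustrative only; 'rnd' stands in for get_random_long():

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/* Userspace stand-in for the kernel's roundup_pow_of_two(). */
static unsigned long roundup_pow_of_two_ul(unsigned long v)
{
	unsigned long r = 1;

	while (r < v)
		r <<= 1;
	return r;
}

/*
 * Model of the patch's logic: mask the random value down to the size of
 * the [vstart, vend) window, page-align it (rounding up, as PAGE_ALIGN
 * does), and clamp so that 'size' bytes still fit below vend.
 */
static unsigned long pick_voffset(unsigned long vstart, unsigned long vend,
				  unsigned long size, unsigned long rnd)
{
	unsigned long voffset;

	voffset = rnd & (roundup_pow_of_two_ul(vend - vstart) - 1);
	voffset = PAGE_ALIGN(voffset);
	if (voffset + size > vend - vstart)
		voffset = vend - vstart - size;
	return voffset;
}
```

Note that because the mask uses a rounded-up power of two, random values beyond the window get clamped to the very top, which slightly biases allocations toward vend when the window size is not a power of two.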

CC: Andrew Morton <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Jann Horn <[email protected]>
CC: Kees Cook <[email protected]>
CC: Linux API <[email protected]>
CC: Matthew Wilcox <[email protected]>
CC: Mike Rapoport <[email protected]>
Signed-off-by: Topi Miettinen <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 2 ++
mm/vmalloc.c | 25 +++++++++++++++++--
2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 44fde25bb221..a0242e31d2d8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4017,6 +4017,8 @@

ramdisk_start= [RAM] RAM disk image start address

+ randomize_vmalloc= [KNL] Randomize vmalloc() allocations.
+
random.trust_cpu={on,off}
[KNL] Enable or disable trusting the use of the
CPU's random number generator (if available) to
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6ae491a8b210..a5f7bb46ddf2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -34,6 +34,7 @@
#include <linux/bitops.h>
#include <linux/rbtree_augmented.h>
#include <linux/overflow.h>
+#include <linux/random.h>

#include <linux/uaccess.h>
#include <asm/tlbflush.h>
@@ -1079,6 +1080,17 @@ adjust_va_to_fit_type(struct vmap_area *va,
return 0;
}

+static int randomize_vmalloc = 0;
+
+static int __init set_randomize_vmalloc(char *str)
+{
+ if (!str)
+ return 0;
+ randomize_vmalloc = simple_strtoul(str, &str, 0);
+ return 1;
+}
+__setup("randomize_vmalloc=", set_randomize_vmalloc);
+
/*
* Returns a start address of the newly allocated area, if success.
* Otherwise a vend is returned that indicates failure.
@@ -1152,7 +1164,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
int node, gfp_t gfp_mask)
{
struct vmap_area *va, *pva;
- unsigned long addr;
+ unsigned long addr, voffset;
int purged = 0;
int ret;

@@ -1207,11 +1219,20 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
if (pva && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva))
kmem_cache_free(vmap_area_cachep, pva);

+ /* Randomize allocation */
+ if (randomize_vmalloc) {
+ voffset = get_random_long() & (roundup_pow_of_two(vend - vstart) - 1);
+ voffset = PAGE_ALIGN(voffset);
+ if (voffset + size > vend - vstart)
+ voffset = vend - vstart - size;
+ } else
+ voffset = 0;
+
/*
* If an allocation fails, the "vend" address is
* returned. Therefore trigger the overflow path.
*/
- addr = __alloc_vmap_area(size, align, vstart, vend);
+ addr = __alloc_vmap_area(size, align, vstart + voffset, vend);
spin_unlock(&free_vmap_area_lock);

if (unlikely(addr == vend))
--
2.29.2


2020-12-02 18:51:40

by Topi Miettinen

Subject: Re: [PATCH] mm/vmalloc: randomize vmalloc() allocations

On 1.12.2020 23.45, Topi Miettinen wrote:
> Memory mappings inside kernel allocated with vmalloc() are in
> predictable order and packed tightly toward the low addresses. With
> new kernel boot parameter 'randomize_vmalloc=1', the entire area is
> used randomly to make the allocations less predictable and harder to
> guess for attackers.
>
> Without randomize_vmalloc=1:
> $ cat /proc/vmallocinfo
> 0xffffc90000000000-0xffffc90000002000 8192 acpi_os_map_iomem+0x29e/0x2c0 phys=0x000000003ffe1000 ioremap
> 0xffffc90000002000-0xffffc90000005000 12288 acpi_os_map_iomem+0x29e/0x2c0 phys=0x000000003ffe0000 ioremap
> 0xffffc90000005000-0xffffc90000007000 8192 hpet_enable+0x36/0x4a9 phys=0x00000000fed00000 ioremap
> 0xffffc90000007000-0xffffc90000009000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffc90000009000-0xffffc9000000b000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffc9000000b000-0xffffc9000000d000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffc9000000d000-0xffffc9000000f000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffc90000011000-0xffffc90000015000 16384 n_tty_open+0x16/0xe0 pages=3 vmalloc
> 0xffffc900003de000-0xffffc900003e0000 8192 acpi_os_map_iomem+0x29e/0x2c0 phys=0x00000000fed00000 ioremap
> 0xffffc900003e0000-0xffffc900003e2000 8192 memremap+0x1a1/0x280 phys=0x00000000000f5000 ioremap
> 0xffffc900003e2000-0xffffc900003f3000 69632 pcpu_create_chunk+0x80/0x2c0 pages=16 vmalloc
> 0xffffc900003f3000-0xffffc90000405000 73728 pcpu_create_chunk+0xb7/0x2c0 pages=17 vmalloc
> 0xffffc90000405000-0xffffc9000040a000 20480 pcpu_create_chunk+0xed/0x2c0 pages=4 vmalloc
> 0xffffe8ffffc00000-0xffffe8ffffe00000 2097152 pcpu_get_vm_areas+0x0/0x1a40 vmalloc
>
> With randomize_vmalloc=1, the allocations are randomized:
> $ cat /proc/vmallocinfo
> 0xffffca3a36442000-0xffffca3a36447000 20480 pcpu_create_chunk+0xed/0x2c0 pages=4 vmalloc
> 0xffffca63034d6000-0xffffca63034d9000 12288 acpi_os_map_iomem+0x29e/0x2c0 phys=0x000000003ffe0000 ioremap
> 0xffffcce23d32e000-0xffffcce23d330000 8192 memremap+0x1a1/0x280 phys=0x00000000000f5000 ioremap
> 0xffffcfb9f0e22000-0xffffcfb9f0e24000 8192 hpet_enable+0x36/0x4a9 phys=0x00000000fed00000 ioremap
> 0xffffd1df23e9e000-0xffffd1df23eb0000 73728 pcpu_create_chunk+0xb7/0x2c0 pages=17 vmalloc
> 0xffffd690c2990000-0xffffd690c2992000 8192 acpi_os_map_iomem+0x29e/0x2c0 phys=0x000000003ffe1000 ioremap
> 0xffffd8460c718000-0xffffd8460c71c000 16384 n_tty_open+0x16/0xe0 pages=3 vmalloc
> 0xffffd89aba709000-0xffffd89aba70b000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffe0ca3f2ed000-0xffffe0ca3f2ef000 8192 acpi_os_map_iomem+0x29e/0x2c0 phys=0x00000000fed00000 ioremap
> 0xffffe3ba44802000-0xffffe3ba44804000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffe4524b2a2000-0xffffe4524b2a4000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffe61372b2e000-0xffffe61372b30000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffe704d2f7c000-0xffffe704d2f8d000 69632 pcpu_create_chunk+0x80/0x2c0 pages=16 vmalloc
> 0xffffe8ffffc00000-0xffffe8ffffe00000 2097152 pcpu_get_vm_areas+0x0/0x1a40 vmalloc
>
> CC: Andrew Morton <[email protected]>
> CC: Andy Lutomirski <[email protected]>
> CC: Jann Horn <[email protected]>
> CC: Kees Cook <[email protected]>
> CC: Linux API <[email protected]>
> CC: Matthew Wilcox <[email protected]>
> CC: Mike Rapoport <[email protected]>
> Signed-off-by: Topi Miettinen <[email protected]>
> ---
> .../admin-guide/kernel-parameters.txt | 2 ++
> mm/vmalloc.c | 25 +++++++++++++++++--
> 2 files changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 44fde25bb221..a0242e31d2d8 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4017,6 +4017,8 @@
>
> ramdisk_start= [RAM] RAM disk image start address
>
> + randomize_vmalloc= [KNL] Randomize vmalloc() allocations.
> +
> random.trust_cpu={on,off}
> [KNL] Enable or disable trusting the use of the
> CPU's random number generator (if available) to
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 6ae491a8b210..a5f7bb46ddf2 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -34,6 +34,7 @@
> #include <linux/bitops.h>
> #include <linux/rbtree_augmented.h>
> #include <linux/overflow.h>
> +#include <linux/random.h>
>
> #include <linux/uaccess.h>
> #include <asm/tlbflush.h>
> @@ -1079,6 +1080,17 @@ adjust_va_to_fit_type(struct vmap_area *va,
> return 0;
> }
>
> +static int randomize_vmalloc = 0;
> +
> +static int __init set_randomize_vmalloc(char *str)
> +{
> + if (!str)
> + return 0;
> + randomize_vmalloc = simple_strtoul(str, &str, 0);
> + return 1;
> +}
> +__setup("randomize_vmalloc=", set_randomize_vmalloc);
> +
> /*
> * Returns a start address of the newly allocated area, if success.
> * Otherwise a vend is returned that indicates failure.
> @@ -1152,7 +1164,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> int node, gfp_t gfp_mask)
> {
> struct vmap_area *va, *pva;
> - unsigned long addr;
> + unsigned long addr, voffset;
> int purged = 0;
> int ret;
>
> @@ -1207,11 +1219,20 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> if (pva && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva))
> kmem_cache_free(vmap_area_cachep, pva);
>
> + /* Randomize allocation */
> + if (randomize_vmalloc) {
> + voffset = get_random_long() & (roundup_pow_of_two(vend - vstart) - 1);
> + voffset = PAGE_ALIGN(voffset);
> + if (voffset + size > vend - vstart)
> + voffset = vend - vstart - size;
> + } else
> + voffset = 0;
> +
> /*
> * If an allocation fails, the "vend" address is
> * returned. Therefore trigger the overflow path.
> */
> - addr = __alloc_vmap_area(size, align, vstart, vend);
> + addr = __alloc_vmap_area(size, align, vstart + voffset, vend);

Does not work so well after all:

Dec 02 18:25:01 kernel: systemd-udevd: vmalloc: allocation failure:
10526720 bytes, mode:0xcc0(GFP_KERNEL),
nodemask=(null),cpuset=/,mems_allowed=0
Dec 02 18:25:01 kernel: CPU: 12 PID: 716 Comm: systemd-udevd Tainted: G
E 5.10.0-rc5+ #25
Dec 02 18:25:01 kernel: Hardware name: <redacted>
Dec 02 18:25:01 kernel: Call Trace:
Dec 02 18:25:01 kernel: dump_stack+0x7d/0xa3
Dec 02 18:25:01 kernel: warn_alloc.cold+0x83/0x126
Dec 02 18:25:01 kernel: ? zone_watermark_ok_safe+0x140/0x140
Dec 02 18:25:01 kernel: ? __kasan_slab_free+0x122/0x150
Dec 02 18:25:01 kernel: ? slab_free_freelist_hook+0x66/0x110
Dec 02 18:25:01 kernel: ? kfree+0xba/0x3e0
Dec 02 18:25:01 kernel: __vmalloc_node_range+0xd7/0xf0
Dec 02 18:25:01 kernel: ? load_module+0x29e0/0x3f40
Dec 02 18:25:01 kernel: module_alloc+0x9f/0x110
Dec 02 18:25:01 kernel: ? load_module+0x29e0/0x3f40
Dec 02 18:25:01 kernel: load_module+0x29e0/0x3f40
Dec 02 18:25:01 kernel: ? ima_post_read_file+0x140/0x150
Dec 02 18:25:01 kernel: ? module_frob_arch_sections+0x20/0x20
Dec 02 18:25:01 kernel: ? kernel_read_file+0x1d2/0x3e0
Dec 02 18:25:01 kernel: ? __x64_sys_fsopen+0x1f0/0x1f0
Dec 02 18:25:01 kernel: ? up_write+0x92/0x140
Dec 02 18:25:01 kernel: ? downgrade_write+0x160/0x160
Dec 02 18:25:01 kernel: ? kernel_read_file_from_fd+0x4b/0x90
Dec 02 18:25:01 kernel: __do_sys_finit_module+0x110/0x1a0
Dec 02 18:25:01 kernel: ? __x64_sys_init_module+0x50/0x50
Dec 02 18:25:01 kernel: ? get_nth_filter.part.0+0x160/0x160
Dec 02 18:25:01 kernel: ? randomize_stack_top+0x70/0x70
Dec 02 18:25:01 kernel: ? __x64_sys_fstat+0x30/0x30
Dec 02 18:25:01 kernel: ? __audit_syscall_entry+0x16a/0x1d0
Dec 02 18:25:01 kernel: ? ktime_get_coarse_real_ts64+0x4a/0x70
Dec 02 18:25:01 kernel: do_syscall_64+0x33/0x40
Dec 02 18:25:01 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 02 18:25:01 kernel: RIP: 0033:0xdd0fd2fb989
Dec 02 18:25:01 kernel: Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c
24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d7 54 0c 00 f7 d8 64
89 01 48
Dec 02 18:25:01 kernel: RSP: 002b:00000ceb4f03f028 EFLAGS: 00000246
ORIG_RAX: 0000000000000139
Dec 02 18:25:01 kernel: RAX: ffffffffffffffda RBX: 00000ef04a12fa90 RCX:
00000dd0fd2fb989
Dec 02 18:25:01 kernel: RDX: 0000000000000000 RSI: 000003b119220e4d RDI:
0000000000000017
Dec 02 18:25:01 kernel: RBP: 0000000000020000 R08: 0000000000000000 R09:
00000ef04a11b018
Dec 02 18:25:01 kernel: R10: 0000000000000017 R11: 0000000000000246 R12:
000003b119220e4d
Dec 02 18:25:01 kernel: R13: 0000000000000000 R14: 00000ef04a124a10 R15:
00000ef04a12fa90
Dec 02 18:25:01 kernel: Mem-Info:
Dec 02 18:25:01 kernel: active_anon:96 inactive_anon:17667 isolated_anon:0
active_file:15598 inactive_file:35563
isolated_file:0
unevictable:0 dirty:0 writeback:0
slab_reclaimable:8064 slab_unreclaimable:159447
mapped:10434 shmem:229 pagetables:5844 bounce:0
free:3176890 free_pcp:2892 free_cma:0
Dec 02 18:25:01 kernel: Node 0 active_anon:384kB inactive_anon:70668kB
active_file:62392kB inactive_file:142252kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:41736kB dirty:0kB
writeback:0kB shmem:916kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp:
0kB wri>
Dec 02 18:25:01 kernel: DMA free:13860kB min:76kB low:92kB high:108kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB
present:15996kB managed:15908kB mlocked:0kB pagetables:0kB bounce:0kB
free_pc>
Dec 02 18:25:01 kernel: lowmem_reserve[]: 0 2650 13377 13377 13377
Dec 02 18:25:01 kernel: DMA32 free:2790432kB min:13372kB low:16712kB
high:20052kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB
present:2796348kB managed:2796008kB mlocked:0kB pagetables:0kB bo>
Dec 02 18:25:01 kernel: lowmem_reserve[]: 0 0 10726 10726 10726
Dec 02 18:25:01 kernel: Normal free:9903268kB min:54128kB low:67660kB
high:81192kB reserved_highatomic:0KB active_anon:384kB
inactive_anon:70668kB active_file:62392kB inactive_file:142252kB
unevictable:0kB writepending:0kB present:13356288kB managed:10991672kB
mlocked:0kB>
Dec 02 18:25:01 kernel: lowmem_reserve[]: 0 0 0 0 0
Dec 02 18:25:01 kernel: DMA: 3*4kB (U) 1*8kB (U) 1*16kB (U) 0*32kB
2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 2*2048kB (UM)
2*4096kB (M) = 13860kB
Dec 02 18:25:01 kernel: DMA32: 12*4kB (UM) 10*8kB (UM) 8*16kB (M) 9*32kB
(M) 8*64kB (UM) 6*128kB (UM) 7*256kB (UM) 9*512kB (UM) 5*1024kB (UM)
6*2048kB (M) 675*4096kB (M) = 2790432kB
Dec 02 18:25:01 kernel: Normal: 82*4kB (UE) 1*8kB (E) 3*16kB (UME)
16*32kB (UM) 1*64kB (U) 1*128kB (U) 1*256kB (M) 6*512kB (UM) 8*1024kB
(UME) 1*2048kB (E) 2414*4096kB (M) = 9902400kB

I suppose the random address happened to be too near 'vend' and no
suitable block was found. Perhaps the search in __alloc_vmap_area()
should then continue at 'vstart' instead (so __alloc_vmap_area() would
be passed all three of vstart, voffset, vend instead of just
vstart+voffset, vend).

This also seems to randomize module addresses. I was going to check that
next, so nice surprise!

-Topi

> spin_unlock(&free_vmap_area_lock);
>
> if (unlikely(addr == vend))
>

2020-12-02 18:58:00

by Matthew Wilcox

Subject: Re: [PATCH] mm/vmalloc: randomize vmalloc() allocations

On Tue, Dec 01, 2020 at 11:45:47PM +0200, Topi Miettinen wrote:
> + /* Randomize allocation */
> + if (randomize_vmalloc) {
> + voffset = get_random_long() & (roundup_pow_of_two(vend - vstart) - 1);
> + voffset = PAGE_ALIGN(voffset);
> + if (voffset + size > vend - vstart)
> + voffset = vend - vstart - size;
> + } else
> + voffset = 0;
> +
> /*
> * If an allocation fails, the "vend" address is
> * returned. Therefore trigger the overflow path.
> */
> - addr = __alloc_vmap_area(size, align, vstart, vend);
> + addr = __alloc_vmap_area(size, align, vstart + voffset, vend);
> spin_unlock(&free_vmap_area_lock);

What if there isn't any free address space between vstart+voffset and
vend, but there is free address space between vstart and voffset?
Seems like we should add:

addr = __alloc_vmap_area(size, align, vstart + voffset, vend);
+ if (!addr)
+ addr = __alloc_vmap_area(size, align, vstart, vend);
spin_unlock(&free_vmap_area_lock);

2020-12-02 21:31:31

by Topi Miettinen

Subject: Re: [PATCH] mm/vmalloc: randomize vmalloc() allocations

On 2.12.2020 20.53, Matthew Wilcox wrote:
> On Tue, Dec 01, 2020 at 11:45:47PM +0200, Topi Miettinen wrote:
>> + /* Randomize allocation */
>> + if (randomize_vmalloc) {
>> + voffset = get_random_long() & (roundup_pow_of_two(vend - vstart) - 1);
>> + voffset = PAGE_ALIGN(voffset);
>> + if (voffset + size > vend - vstart)
>> + voffset = vend - vstart - size;
>> + } else
>> + voffset = 0;
>> +
>> /*
>> * If an allocation fails, the "vend" address is
>> * returned. Therefore trigger the overflow path.
>> */
>> - addr = __alloc_vmap_area(size, align, vstart, vend);
>> + addr = __alloc_vmap_area(size, align, vstart + voffset, vend);
>> spin_unlock(&free_vmap_area_lock);
>
> What if there isn't any free address space between vstart+voffset and
> vend, but there is free address space between vstart and voffset?
> Seems like we should add:
>
> addr = __alloc_vmap_area(size, align, vstart + voffset, vend);
> + if (!addr)
> + addr = __alloc_vmap_area(size, align, vstart, vend);
> spin_unlock(&free_vmap_area_lock);
>

How about:

addr = __alloc_vmap_area(size, align, vstart + voffset, vend);
+ if (!addr)
+ addr = __alloc_vmap_area(size, align, vstart, vstart + voffset + size);
spin_unlock(&free_vmap_area_lock);

That way the search would not be redone for the area that was already
checked and rejected.
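In userspace terms the combined approach looks roughly like this. toy_alloc() is an illustrative stand-in for __alloc_vmap_area() that models a single free block and returns 0 on failure (the real function signals failure by returning vend instead):

```c
#include <assert.h>

/*
 * Toy stand-in for __alloc_vmap_area(): the only free block is
 * [free_lo, free_hi); return the first fit inside [lo, hi), 0 on failure.
 */
static unsigned long toy_alloc(unsigned long size, unsigned long lo,
			       unsigned long hi,
			       unsigned long free_lo, unsigned long free_hi)
{
	unsigned long start = lo > free_lo ? lo : free_lo;
	unsigned long end   = hi < free_hi ? hi : free_hi;

	return (end > start && end - start >= size) ? start : 0;
}

/*
 * The two-pass strategy from the thread: try [vstart+voffset, vend)
 * first, then fall back to the not-yet-searched prefix
 * [vstart, vstart+voffset+size), so no range is scanned twice.
 */
static unsigned long alloc_randomized(unsigned long size, unsigned long vstart,
				      unsigned long voffset, unsigned long vend,
				      unsigned long free_lo, unsigned long free_hi)
{
	unsigned long addr;

	addr = toy_alloc(size, vstart + voffset, vend, free_lo, free_hi);
	if (!addr)
		addr = toy_alloc(size, vstart, vstart + voffset + size,
				 free_lo, free_hi);
	return addr;
}
```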

Perhaps my previous patch for mmap() etc. randomization could also
search towards higher addresses instead of trying random addresses five
times in case of clashes.

-Topi

2020-12-03 07:01:21

by Mike Rapoport

Subject: Re: [PATCH] mm/vmalloc: randomize vmalloc() allocations

On Wed, Dec 02, 2020 at 08:49:06PM +0200, Topi Miettinen wrote:
> On 1.12.2020 23.45, Topi Miettinen wrote:
> > Memory mappings inside kernel allocated with vmalloc() are in
> > predictable order and packed tightly toward the low addresses. With
> > new kernel boot parameter 'randomize_vmalloc=1', the entire area is
> > used randomly to make the allocations less predictable and harder to
> > guess for attackers.
> >
>
> This also seems to randomize module addresses. I was going to check that
> next, so nice surprise!

Heh, that's because module_alloc() uses vmalloc() in one way or another :)

> -Topi
>
> > spin_unlock(&free_vmap_area_lock);
> > if (unlikely(addr == vend))
> >
>

--
Sincerely yours,
Mike.

2020-12-03 23:20:20

by David Laight

Subject: RE: [PATCH] mm/vmalloc: randomize vmalloc() allocations

From: Mike Rapoport
> Sent: 03 December 2020 06:58
>
> On Wed, Dec 02, 2020 at 08:49:06PM +0200, Topi Miettinen wrote:
> > On 1.12.2020 23.45, Topi Miettinen wrote:
> > > Memory mappings inside kernel allocated with vmalloc() are in
> > > predictable order and packed tightly toward the low addresses. With
> > > new kernel boot parameter 'randomize_vmalloc=1', the entire area is
> > > used randomly to make the allocations less predictable and harder to
> > > guess for attackers.

Isn't that going to horribly fragment the available address space
and make even moderate-sized allocation requests fail (or sleep)?

I'm not even sure that you need to use 'best fit' rather than
'first fit'.
'best fit' is certainly a lot better for a simple linked list
user space malloc.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-12-04 11:01:03

by Topi Miettinen

Subject: Re: [PATCH] mm/vmalloc: randomize vmalloc() allocations

On 4.12.2020 1.15, David Laight wrote:
> From: Mike Rapoport
>> Sent: 03 December 2020 06:58
>>
>> On Wed, Dec 02, 2020 at 08:49:06PM +0200, Topi Miettinen wrote:
>>> On 1.12.2020 23.45, Topi Miettinen wrote:
>>>> Memory mappings inside kernel allocated with vmalloc() are in
>>>> predictable order and packed tightly toward the low addresses. With
>>>> new kernel boot parameter 'randomize_vmalloc=1', the entire area is
>>>> used randomly to make the allocations less predictable and harder to
>>>> guess for attackers.
>
> Isn't that going to horribly fragment the available address space
> and make even moderate sized allocation requests fail (or sleep).

For 32-bit architectures this is a real issue, but I don't think it will
be a problem for 64 bits. You can't fragment the virtual memory space
with small allocations because the resulting page tables would not fit in
RAM for existing or near-future systems.
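A back-of-the-envelope sketch of that argument, assuming x86-64 4-level paging, where one 4kB PTE page maps a 2MB region (the numbers and helper name are illustrative):

```c
#include <assert.h>

#define TB		(1ULL << 40)
#define PTE_COVERAGE	(2ULL << 20)	/* one 4 kB PTE page maps 2 MB */

/*
 * Fully fragmenting the 32 TB vmalloc area with scattered 4 kB
 * allocations needs on the order of one PTE page per 2 MB region
 * touched.
 */
static unsigned long long pte_pages_needed(void)
{
	return (32 * TB) / PTE_COVERAGE;	/* 16M PTE pages */
}
```

16M PTE pages come to 64GB of page tables alone, before counting the PMD/PUD levels, so fully fragmenting the area with 4kB allocations is out of reach for systems with ordinary amounts of RAM.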

For large allocations (directly mapping the entire contents of TB-sized
NVMe drives, or a special application which needs 1GB huge pages) this
could be a risk. Maybe this could be solved by reserving some space for
them, or perhaps in those cases you shouldn't use randomize_vmalloc=1.

The method for reserving the large areas could be something like below.

First, consider a simple arrangement of reserving high addresses for
large allocations and low addresses for smaller allocations. The
allocator would start searching downwards from high addresses for a free
large block and upwards from low addresses for small blocks. Also the
address space would be semi-rigidly divided to priority areas: area 0
with priority for small allocations, area 1 with equal priority for both
small and large, and area 2 where small allocations would be placed only
as a last resort (which probably would never be the case).

The linear way of dividing the allocations would of course be very much
non-random, so this could be improved with a pseudo-random scrambling
function to distribute the addresses in memory. A simple example would
be to randomly choose a value for one bit in the address for large
allocations (not necessarily the most significant available but also
large enough to align 1GB/TB sized allocations if needed), or a bit
pattern across several address bits for non-even distribution.

The addresses would be also fully randomized inside each priority area.

The division would mean some loss of randomization. A simple rigid
division of 50%/50% for small vs. large allocations would mean a loss of
one bit but the above methods could help this. Dividing the address
space less evenly would improve one side at the expense of the other.
Cracking the scrambling function would reveal the bit(s) used for the
division.

It would be nice to remove the current rigid division of the kernel
address space (Documentation/x86/x86_64/mm.rst) and let the allocations
be placed more randomly in the entire 47 bit address space. Would the
above priority scheme (perhaps with a rigid priority for certain users)
be good enough to allow this?

Even better would be to remove the use of highest bit for selecting
kernel/user addresses but I suppose it would be a lot of work for
gaining just one extra bit of randomness. There could be other effects
though (good or bad).
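As a toy sketch of the search order described above (the three-way split, the 1GB threshold and the names are illustrative only, not proposed kernel code):

```c
#include <assert.h>

/* Illustrative cut-off between "small" and "large" allocations. */
#define LARGE_THRESHOLD	(1UL << 30)	/* 1 GiB */

enum { AREA_SMALL, AREA_SHARED, AREA_LARGE };

/*
 * Priority order in which the three areas are searched: small
 * allocations prefer area 0 and fall back to 1 and then 2; large
 * allocations search in the opposite direction, so area 2 takes small
 * allocations only as a last resort.
 */
static void area_search_order(unsigned long size, int order[3])
{
	if (size < LARGE_THRESHOLD) {
		order[0] = AREA_SMALL;
		order[1] = AREA_SHARED;
		order[2] = AREA_LARGE;
	} else {
		order[0] = AREA_LARGE;
		order[1] = AREA_SHARED;
		order[2] = AREA_SMALL;
	}
}
```

Within whichever area is chosen, the start address would still be fully randomized as in the patch; the scrambling function discussed above would only decide which area an allocation lands in.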

-Topi

> I'm not even sure that you need to use 'best fit' rather than
> 'first fit'.
> 'best fit' is certainly a lot better for a simple linked list
> user space malloc.
>
> David
>
>

2020-12-04 13:38:38

by David Laight

Subject: RE: [PATCH] mm/vmalloc: randomize vmalloc() allocations

From: Topi Miettinen
> Sent: 04 December 2020 10:58
>
> On 4.12.2020 1.15, David Laight wrote:
> > From: Mike Rapoport
> >> Sent: 03 December 2020 06:58
> >>
> >> On Wed, Dec 02, 2020 at 08:49:06PM +0200, Topi Miettinen wrote:
> >>> On 1.12.2020 23.45, Topi Miettinen wrote:
> >>>> Memory mappings inside kernel allocated with vmalloc() are in
> >>>> predictable order and packed tightly toward the low addresses. With
> >>>> new kernel boot parameter 'randomize_vmalloc=1', the entire area is
> >>>> used randomly to make the allocations less predictable and harder to
> >>>> guess for attackers.
> >
> > Isn't that going to horribly fragment the available address space
> > and make even moderate sized allocation requests fail (or sleep).
>
> For 32 bit architecture this is a real issue, but I don't think for 64
> bits it will be a problem. You can't fragment the virtual memory space
> for small allocations because the resulting page tables will not fit in
> RAM for existing or near future systems.

Hmmm, truly random allocations are going to need 3 or 4 extra page tables
on 64-bit systems. A bit of overhead for 4k allocates.
While you won't run out of address space, you will run out of memory.

Randomising the allocated address within the area that already
has page tables allocated might make a bit of sense.
Then allocate similar(ish) sized items from the same 'large' pages.

I was wondering if a flag indicating whether an allocate was 'long term'
or 'short term' might help the placement.
Short term small items could be used to fill the space in 'large pages' left
by non-aligned length large items.

Trouble is you need a CBU (Crystal Ball Unit) to get it right.

David


2020-12-04 16:57:19

by Topi Miettinen

Subject: Re: [PATCH] mm/vmalloc: randomize vmalloc() allocations

On 4.12.2020 15.33, David Laight wrote:
> From: Topi Miettinen
>> Sent: 04 December 2020 10:58
>>
>> On 4.12.2020 1.15, David Laight wrote:
>>> From: Mike Rapoport
>>>> Sent: 03 December 2020 06:58
>>>>
>>>> On Wed, Dec 02, 2020 at 08:49:06PM +0200, Topi Miettinen wrote:
>>>>> On 1.12.2020 23.45, Topi Miettinen wrote:
>>>>>> Memory mappings inside kernel allocated with vmalloc() are in
>>>>>> predictable order and packed tightly toward the low addresses. With
>>>>>> new kernel boot parameter 'randomize_vmalloc=1', the entire area is
>>>>>> used randomly to make the allocations less predictable and harder to
>>>>>> guess for attackers.
>>>
>>> Isn't that going to horribly fragment the available address space
>>> and make even moderate sized allocation requests fail (or sleep).
>>
>> For 32 bit architecture this is a real issue, but I don't think for 64
>> bits it will be a problem. You can't fragment the virtual memory space
>> for small allocations because the resulting page tables will not fit in
>> RAM for existing or near future systems.
>
> Hmmm truly random allocations are going to need 3 or 4 extra page tables
> on 64bit systems. A bit overhead for 4k allocates.
> While you won't run out of address space, you will run out of memory.

There are 3500 entries in /proc/vmallocinfo on my system with lots of
BPF filters (which allocate 8kB blocks). The total memory used is 740MB.
Assuming that every entry needed an additional 4 pages, that would mean
55MB, or 7.4% extra. I don't think that's a problem, and even if it were
in some case, there's still the option of not using randomize_vmalloc.
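The arithmetic behind the 55MB / 7.4% figures can be checked directly (4 extra page-table pages per area is the worst-case assumption used above; the helper names are illustrative):

```c
#include <assert.h>

#define ENTRIES		3500UL	/* /proc/vmallocinfo entries observed */
#define EXTRA_PAGES	4UL	/* worst-case extra page-table pages/area */
#define PAGE_BYTES	4096UL

/* Total page-table overhead in bytes for all vmalloc areas. */
static unsigned long overhead_bytes(void)
{
	return ENTRIES * EXTRA_PAGES * PAGE_BYTES;	/* 57,344,000 */
}

/* Overhead as a percentage of the given total (in MiB). */
static double overhead_percent(double total_mb)
{
	return 100.0 * (double)overhead_bytes() / (1024.0 * 1024.0) / total_mb;
}
```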

-Topi

2020-12-09 19:12:10

by Topi Miettinen

Subject: Re: [PATCH] mm/vmalloc: randomize vmalloc() allocations

On 3.12.2020 8.58, Mike Rapoport wrote:
> On Wed, Dec 02, 2020 at 08:49:06PM +0200, Topi Miettinen wrote:
>> On 1.12.2020 23.45, Topi Miettinen wrote:
>>> Memory mappings inside kernel allocated with vmalloc() are in
>>> predictable order and packed tightly toward the low addresses. With
>>> new kernel boot parameter 'randomize_vmalloc=1', the entire area is
>>> used randomly to make the allocations less predictable and harder to
>>> guess for attackers.
>>>
>>
>> This also seems to randomize module addresses. I was going to check that
>> next, so nice surprise!
>
> Heh, that's because module_alloc() uses vmalloc() in one way or another :)

The modules are still allocated from their small (1.5GB) separate area
instead of the much larger (32TB/12.5PB) vmalloc area, which would
greatly improve ASLR for the modules. To fix that, I tried to #define
MODULES_VADDR to VMALLOC_START etc. like x86_32 does, but then the
kernel dies very early without even any output.

-Topi

2020-12-10 20:02:18

by Topi Miettinen

Subject: Re: [PATCH] mm/vmalloc: randomize vmalloc() allocations

On 3.12.2020 8.58, Mike Rapoport wrote:
> On Wed, Dec 02, 2020 at 08:49:06PM +0200, Topi Miettinen wrote:
>> On 1.12.2020 23.45, Topi Miettinen wrote:
>>> Memory mappings inside kernel allocated with vmalloc() are in
>>> predictable order and packed tightly toward the low addresses. With
>>> new kernel boot parameter 'randomize_vmalloc=1', the entire area is
>>> used randomly to make the allocations less predictable and harder to
>>> guess for attackers.
>>>
>>
>> This also seems to randomize module addresses. I was going to check that
>> next, so nice surprise!
>
> Heh, that's because module_alloc() uses vmalloc() in one way or another :)

I got a bit further with really using vmalloc with
[VMALLOC_START..VMALLOC_END] for modules, but then inserting a module
fails because of the relocations:
[ 9.202856] module: overflow in relocation type 11 val ffffe1950e27f080

Type 11 is R_X86_64_32S, which expects a 32-bit signed offset, so the
loader obviously can't fit a relocation from the highest 2GB to
somewhere 32TB lower.
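The failure mode is easy to model: R_X86_64_32S stores a 32-bit field that the CPU sign-extends, so a value is representable only if sign-extending its low 32 bits reproduces it (fits_32s() is an illustrative helper, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * An address is representable in an R_X86_64_32S slot only if it
 * survives a round-trip through a sign-extended 32-bit value, i.e. it
 * lies in [-2 GB, 2 GB) around zero: the top or bottom 2 GB of the
 * address space.
 */
static int fits_32s(uint64_t val)
{
	return (uint64_t)(int64_t)(int32_t)(uint32_t)val == val;
}
```

The kernel's -mcmodel=kernel places text in the top 2GB precisely so that such relocations fit; an address in the middle of the 32TB vmalloc area cannot.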

The problem seems to be that the modules aren't really built as
position-independent shared objects with -fPIE/-fPIC; instead there's
an explicit -fno-PIE. I guess the modules also shouldn't use
-mcmodel=kernel. Tweaking the flags shows that some combinations
aren't well supported ('-mindirect-branch=thunk-extern' and
'-mcmodel=large' are not compatible), and the handwritten assembly code
also assumes 32-bit offsets.

A different approach could be to make the entire kernel relocatable to
lower addresses and then the modules could stay close nearby. I guess
the asm files aren't written with position independence in mind either.

But it seems that I'm finding and breaking lots of assumptions built
into the system. What's the experts' opinion: is full module/kernel
randomization ever going to fly?

-Topi