2024-04-24 21:41:37

by Kees Cook

Subject: [PATCH v3 0/6] slab: Introduce dedicated bucket allocator

Hi,

Series change history:

v3:
- clarify rationale and purpose in commit log
- rebase to -next (CONFIG_CODE_TAGGING)
- simplify calling styles and split out bucket plumbing more cleanly
- consolidate kmem_buckets_*() family introduction patches
v2: https://lore.kernel.org/lkml/[email protected]/
v1: https://lore.kernel.org/lkml/[email protected]/

For the cover letter, I'm repeating the commit log for patch 4 here, which has
additional clarifications and rationale since v2:

Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.

This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in the same cache.

While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations that are directly user
controlled. Addressing these cases is limited in scope, so isolating these
kinds of interfaces will not become an unbounded game of whack-a-mole. For
example, many pass through memdup_user(), making isolation there very
effective.

In order to isolate user-controllable sized allocations from system
allocations, introduce kmem_buckets_create(), which behaves like
kmem_cache_create(). Introduce kmem_buckets_alloc(), which behaves like
kmem_cache_alloc(). Introduce kmem_buckets_alloc_track_caller() for
where caller tracking is needed. Introduce kmem_buckets_valloc() for
cases where the vmalloc fallback is needed.
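
As a minimal usage sketch (hypothetical "foo" subsystem and function names;
the real conversions are in patches 5 and 6), a caller creates its bucket
set once at init time and then allocates from it instead of the global
kmalloc buckets:

  static kmem_buckets *foo_buckets __ro_after_init;

  static int __init init_foo_buckets(void)
  {
          /* Same creation arguments as the memdup_user() conversion in patch 6. */
          foo_buckets = kmem_buckets_create("foo", 0, 0, 0, INT_MAX, NULL);
          return 0;
  }
  subsys_initcall(init_foo_buckets);

  /* Hypothetical handler with a userspace-controlled size: */
  static void *foo_copy_blob(const void __user *uptr, size_t len)
  {
          void *buf;

          /* Lands in the dedicated "foo-*" caches, not the shared kmalloc-*. */
          buf = kmem_buckets_alloc(foo_buckets, len, GFP_KERNEL);
          if (!buf)
                  return ERR_PTR(-ENOMEM);
          if (copy_from_user(buf, uptr, len)) {
                  kfree(buf);
                  return ERR_PTR(-EFAULT);
          }
          return buf;
  }

If kmem_buckets_create() were to fail, the resulting NULL bucket set simply
selects the default kmalloc buckets at allocation time (the patch 2
semantics), so callers don't need a separate error path.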

This allows for confining allocations to a dedicated set of sized caches
(which have the same layout as the kmalloc caches).

This can also be used in the future to extend codetag allocation
annotations to implement per-caller allocation cache isolation[1] even
for dynamic allocations.

Memory allocation pinning[2] is still needed to plug the Use-After-Free
cross-allocator weakness, but that is an existing and separate issue
which is complementary to this improvement. Development continues for
that feature via the SLAB_VIRTUAL[3] series (which could also provide
guard pages -- another complementary improvement).

Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
Link: https://lore.kernel.org/lkml/[email protected]/ [3]

After the core implementation, 2 patches cover the most heavily abused
"repeat offenders" used in exploits. Repeating those details here:

The msg subsystem is a common target for exploiting[1][2][3][4][5][6][7]
use-after-free type confusion flaws in the kernel for both read and
write primitives. Avoid having a user-controlled size cache share the
global kmalloc allocator by using a separate set of kmalloc buckets.

Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1]
Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2]
Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3]
Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4]
Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5]
Link: https://zplin.me/papers/ELOISE.pdf [6]
Link: https://syst3mfailure.io/wall-of-perdition/ [7]

Both memdup_user() and vmemdup_user() handle allocations that are
regularly used for exploiting use-after-free type confusion flaws in
the kernel (e.g. prctl() PR_SET_VMA_ANON_NAME[1] and setxattr[2][3][4]
respectively).

Since both are designed for contents coming from userspace, they allow
for userspace-controlled allocation sizes. Use a dedicated set of kmalloc
buckets so these allocations do not share caches with the global kmalloc
buckets.

Link: https://starlabs.sg/blog/2023/07-prctl-anon_vma_name-an-amusing-heap-spray/ [1]
Link: https://duasynt.com/blog/linux-kernel-heap-spray [2]
Link: https://etenal.me/archives/1336 [3]
Link: https://github.com/a13xp0p0v/kernel-hack-drill/blob/master/drill_exploit_uaf.c [4]

Thanks!

-Kees


Kees Cook (6):
mm/slab: Introduce kmem_buckets typedef
mm/slab: Plumb kmem_buckets into __do_kmalloc_node()
mm/slab: Introduce __kvmalloc_node() that can take kmem_buckets
argument
mm/slab: Introduce kmem_buckets_create() and family
ipc, msg: Use dedicated slab buckets for alloc_msg()
mm/util: Use dedicated slab buckets for memdup_user()

include/linux/slab.h | 44 ++++++++++++++++--------
ipc/msgutil.c | 13 +++++++-
lib/fortify_kunit.c | 2 +-
lib/rhashtable.c | 2 +-
mm/slab.h | 6 ++--
mm/slab_common.c | 79 +++++++++++++++++++++++++++++++++++++++++---
mm/slub.c | 14 ++++----
mm/util.c | 21 +++++++++---
8 files changed, 146 insertions(+), 35 deletions(-)

--
2.34.1



2024-04-24 21:41:52

by Kees Cook

Subject: [PATCH v3 2/6] mm/slab: Plumb kmem_buckets into __do_kmalloc_node()

To be able to choose which buckets to allocate from, make the buckets
available to the lower level kmalloc interfaces by adding them as the
first argument. Where the bucket is not available, pass NULL, which means
"use the default system kmalloc bucket set" (the prior existing behavior),
as implemented in kmalloc_slab().
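
For illustration (not part of the diff below): the kmem_buckets type comes
from patch 1 of this series, roughly one kmem_cache pointer per kmalloc
size class, so a bucket set can stand in for one row of kmalloc_caches[]:

  typedef struct kmem_cache * kmem_buckets[KMALLOC_SHIFT_HIGH + 1];

  /* Existing behavior, now spelled with an explicit NULL bucket set: */
  p = __kmalloc_node(NULL, size, GFP_KERNEL, NUMA_NO_NODE);

  /* Later patches pass a dedicated set instead ("my_buckets" is hypothetical): */
  p = __kmalloc_node(my_buckets, size, GFP_KERNEL, NUMA_NO_NODE);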

Signed-off-by: Kees Cook <[email protected]>
---
Cc: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/slab.h | 16 ++++++++--------
lib/fortify_kunit.c | 2 +-
mm/slab.h | 6 ++++--
mm/slab_common.c | 4 ++--
mm/slub.c | 14 +++++++-------
mm/util.c | 2 +-
6 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index c8164d5db420..07373b680894 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -569,8 +569,8 @@ static __always_inline void kfree_bulk(size_t size, void **p)
kmem_cache_free_bulk(NULL, size, p);
}

-void *__kmalloc_node_noprof(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment
- __alloc_size(1);
+void *__kmalloc_node_noprof(kmem_buckets *b, size_t size, gfp_t flags, int node)
+ __assume_kmalloc_alignment __alloc_size(2);
#define __kmalloc_node(...) alloc_hooks(__kmalloc_node_noprof(__VA_ARGS__))

void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t flags,
@@ -679,7 +679,7 @@ static __always_inline __alloc_size(1) void *kmalloc_node_noprof(size_t size, gf
kmalloc_caches[kmalloc_type(flags, _RET_IP_)][index],
flags, node, size);
}
- return __kmalloc_node_noprof(size, flags, node);
+ return __kmalloc_node_noprof(NULL, size, flags, node);
}
#define kmalloc_node(...) alloc_hooks(kmalloc_node_noprof(__VA_ARGS__))

@@ -730,10 +730,10 @@ static inline __realloc_size(2, 3) void * __must_check krealloc_array_noprof(voi
*/
#define kcalloc(n, size, flags) kmalloc_array(n, size, (flags) | __GFP_ZERO)

-void *kmalloc_node_track_caller_noprof(size_t size, gfp_t flags, int node,
- unsigned long caller) __alloc_size(1);
+void *kmalloc_node_track_caller_noprof(kmem_buckets *b, size_t size, gfp_t flags, int node,
+ unsigned long caller) __alloc_size(2);
#define kmalloc_node_track_caller(...) \
- alloc_hooks(kmalloc_node_track_caller_noprof(__VA_ARGS__, _RET_IP_))
+ alloc_hooks(kmalloc_node_track_caller_noprof(NULL, __VA_ARGS__, _RET_IP_))

/*
* kmalloc_track_caller is a special version of kmalloc that records the
@@ -746,7 +746,7 @@ void *kmalloc_node_track_caller_noprof(size_t size, gfp_t flags, int node,
#define kmalloc_track_caller(...) kmalloc_node_track_caller(__VA_ARGS__, NUMA_NO_NODE)

#define kmalloc_track_caller_noprof(...) \
- kmalloc_node_track_caller_noprof(__VA_ARGS__, NUMA_NO_NODE, _RET_IP_)
+ kmalloc_node_track_caller_noprof(NULL, __VA_ARGS__, NUMA_NO_NODE, _RET_IP_)

static inline __alloc_size(1, 2) void *kmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags,
int node)
@@ -757,7 +757,7 @@ static inline __alloc_size(1, 2) void *kmalloc_array_node_noprof(size_t n, size_
return NULL;
if (__builtin_constant_p(n) && __builtin_constant_p(size))
return kmalloc_node_noprof(bytes, flags, node);
- return __kmalloc_node_noprof(bytes, flags, node);
+ return __kmalloc_node_noprof(NULL, bytes, flags, node);
}
#define kmalloc_array_node(...) alloc_hooks(kmalloc_array_node_noprof(__VA_ARGS__))

diff --git a/lib/fortify_kunit.c b/lib/fortify_kunit.c
index 493ec02dd5b3..ff059d88d455 100644
--- a/lib/fortify_kunit.c
+++ b/lib/fortify_kunit.c
@@ -220,7 +220,7 @@ static void alloc_size_##allocator##_dynamic_test(struct kunit *test) \
checker(expected_size, __kmalloc(alloc_size, gfp), \
kfree(p)); \
checker(expected_size, \
- __kmalloc_node(alloc_size, gfp, NUMA_NO_NODE), \
+ __kmalloc_node(NULL, alloc_size, gfp, NUMA_NO_NODE), \
kfree(p)); \
\
orig = kmalloc(alloc_size, gfp); \
diff --git a/mm/slab.h b/mm/slab.h
index 5f8f47c5bee0..f459cd338852 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -403,16 +403,18 @@ static inline unsigned int size_index_elem(unsigned int bytes)
* KMALLOC_MAX_CACHE_SIZE and the caller must check that.
*/
static inline struct kmem_cache *
-kmalloc_slab(size_t size, gfp_t flags, unsigned long caller)
+kmalloc_slab(kmem_buckets *b, size_t size, gfp_t flags, unsigned long caller)
{
unsigned int index;

+ if (!b)
+ b = &kmalloc_caches[kmalloc_type(flags, caller)];
if (size <= 192)
index = kmalloc_size_index[size_index_elem(size)];
else
index = fls(size - 1);

- return kmalloc_caches[kmalloc_type(flags, caller)][index];
+ return (*b)[index];
}

gfp_t kmalloc_fix_flags(gfp_t flags);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index db9e1b15efd5..7cb4e8fd1275 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -702,7 +702,7 @@ size_t kmalloc_size_roundup(size_t size)
* The flags don't matter since size_index is common to all.
* Neither does the caller for just getting ->object_size.
*/
- return kmalloc_slab(size, GFP_KERNEL, 0)->object_size;
+ return kmalloc_slab(NULL, size, GFP_KERNEL, 0)->object_size;
}

/* Above the smaller buckets, size is a multiple of page size. */
@@ -1186,7 +1186,7 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags)
return (void *)p;
}

- ret = kmalloc_node_track_caller_noprof(new_size, flags, NUMA_NO_NODE, _RET_IP_);
+ ret = kmalloc_node_track_caller_noprof(NULL, new_size, flags, NUMA_NO_NODE, _RET_IP_);
if (ret && p) {
/* Disable KASAN checks as the object's redzone is accessed. */
kasan_disable_current();
diff --git a/mm/slub.c b/mm/slub.c
index 23bc0d236c26..a94a0507e19c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4093,7 +4093,7 @@ void *kmalloc_large_node_noprof(size_t size, gfp_t flags, int node)
EXPORT_SYMBOL(kmalloc_large_node_noprof);

static __always_inline
-void *__do_kmalloc_node(size_t size, gfp_t flags, int node,
+void *__do_kmalloc_node(kmem_buckets *b, size_t size, gfp_t flags, int node,
unsigned long caller)
{
struct kmem_cache *s;
@@ -4109,7 +4109,7 @@ void *__do_kmalloc_node(size_t size, gfp_t flags, int node,
if (unlikely(!size))
return ZERO_SIZE_PTR;

- s = kmalloc_slab(size, flags, caller);
+ s = kmalloc_slab(b, size, flags, caller);

ret = slab_alloc_node(s, NULL, flags, node, caller, size);
ret = kasan_kmalloc(s, ret, size, flags);
@@ -4117,22 +4117,22 @@ void *__do_kmalloc_node(size_t size, gfp_t flags, int node,
return ret;
}

-void *__kmalloc_node_noprof(size_t size, gfp_t flags, int node)
+void *__kmalloc_node_noprof(kmem_buckets *b, size_t size, gfp_t flags, int node)
{
- return __do_kmalloc_node(size, flags, node, _RET_IP_);
+ return __do_kmalloc_node(b, size, flags, node, _RET_IP_);
}
EXPORT_SYMBOL(__kmalloc_node_noprof);

void *__kmalloc_noprof(size_t size, gfp_t flags)
{
- return __do_kmalloc_node(size, flags, NUMA_NO_NODE, _RET_IP_);
+ return __do_kmalloc_node(NULL, size, flags, NUMA_NO_NODE, _RET_IP_);
}
EXPORT_SYMBOL(__kmalloc_noprof);

-void *kmalloc_node_track_caller_noprof(size_t size, gfp_t flags,
+void *kmalloc_node_track_caller_noprof(kmem_buckets *b, size_t size, gfp_t flags,
int node, unsigned long caller)
{
- return __do_kmalloc_node(size, flags, node, caller);
+ return __do_kmalloc_node(b, size, flags, node, caller);
}
EXPORT_SYMBOL(kmalloc_node_track_caller_noprof);

diff --git a/mm/util.c b/mm/util.c
index c9e519e6811f..80430e5ba981 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -128,7 +128,7 @@ void *kmemdup_noprof(const void *src, size_t len, gfp_t gfp)
{
void *p;

- p = kmalloc_node_track_caller_noprof(len, gfp, NUMA_NO_NODE, _RET_IP_);
+ p = kmalloc_node_track_caller_noprof(NULL, len, gfp, NUMA_NO_NODE, _RET_IP_);
if (p)
memcpy(p, src, len);
return p;
--
2.34.1


2024-04-24 21:42:00

by Kees Cook

Subject: [PATCH v3 4/6] mm/slab: Introduce kmem_buckets_create() and family

Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.

This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in the same cache.

While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations that are directly user
controlled. Addressing these cases is limited in scope, so isolating these
kinds of interfaces will not become an unbounded game of whack-a-mole. For
example, many pass through memdup_user(), making isolation there very
effective.

In order to isolate user-controllable sized allocations from system
allocations, introduce kmem_buckets_create(), which behaves like
kmem_cache_create(). Introduce kmem_buckets_alloc(), which behaves like
kmem_cache_alloc(). Introduce kmem_buckets_alloc_track_caller() for
where caller tracking is needed. Introduce kmem_buckets_valloc() for
cases where the vmalloc fallback is needed.

This allows for confining allocations to a dedicated set of sized caches
(which have the same layout as the kmalloc caches).

This can also be used in the future to extend codetag allocation
annotations to implement per-caller allocation cache isolation[1] even
for dynamic allocations.

Memory allocation pinning[2] is still needed to plug the Use-After-Free
cross-allocator weakness, but that is an existing and separate issue
which is complementary to this improvement. Development continues for
that feature via the SLAB_VIRTUAL[3] series (which could also provide
guard pages -- another complementary improvement).

Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
Link: https://lore.kernel.org/lkml/[email protected]/ [3]
Signed-off-by: Kees Cook <[email protected]>
---
Cc: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: [email protected]
---
include/linux/slab.h | 13 ++++++++
mm/slab_common.c | 72 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 85 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 23b13be0ac95..1f14a65741a6 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -552,6 +552,11 @@ void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,

void kmem_cache_free(struct kmem_cache *s, void *objp);

+kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
+ slab_flags_t flags,
+ unsigned int useroffset, unsigned int usersize,
+ void (*ctor)(void *));
+
/*
* Bulk allocation and freeing operations. These are accelerated in an
* allocator specific way to avoid taking locks repeatedly or building
@@ -666,6 +671,12 @@ static __always_inline __alloc_size(1) void *kmalloc_noprof(size_t size, gfp_t f
}
#define kmalloc(...) alloc_hooks(kmalloc_noprof(__VA_ARGS__))

+#define kmem_buckets_alloc(_b, _size, _flags) \
+ alloc_hooks(__kmalloc_node_noprof(_b, _size, _flags, NUMA_NO_NODE))
+
+#define kmem_buckets_alloc_track_caller(_b, _size, _flags) \
+ alloc_hooks(kmalloc_node_track_caller_noprof(_b, _size, _flags, NUMA_NO_NODE, _RET_IP_))
+
static __always_inline __alloc_size(1) void *kmalloc_node_noprof(size_t size, gfp_t flags, int node)
{
if (__builtin_constant_p(size) && size) {
@@ -792,6 +803,8 @@ extern void *kvmalloc_node_noprof(kmem_buckets *b, size_t size, gfp_t flags, int

#define kvzalloc_node(_size, _flags, _node) kvmalloc_node(_size, _flags|__GFP_ZERO, _node)

+#define kmem_buckets_valloc(_b, _size, _flags) __kvmalloc_node(_b, _size, _flags, NUMA_NO_NODE)
+
static inline __alloc_size(1, 2) void *kvmalloc_array_noprof(size_t n, size_t size, gfp_t flags)
{
size_t bytes;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 7cb4e8fd1275..e36360e63ebd 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -392,6 +392,74 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align,
}
EXPORT_SYMBOL(kmem_cache_create);

+static struct kmem_cache *kmem_buckets_cache __ro_after_init;
+
+kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
+ slab_flags_t flags,
+ unsigned int useroffset,
+ unsigned int usersize,
+ void (*ctor)(void *))
+{
+ kmem_buckets *b;
+ int idx;
+
+ if (WARN_ON(!kmem_buckets_cache))
+ return NULL;
+
+ b = kmem_cache_alloc(kmem_buckets_cache, GFP_KERNEL|__GFP_ZERO);
+ if (WARN_ON(!b))
+ return NULL;
+
+ flags |= SLAB_NO_MERGE;
+
+ for (idx = 0; idx < ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]); idx++) {
+ char *short_size, *cache_name;
+ unsigned int cache_useroffset, cache_usersize;
+ unsigned int size;
+
+ if (!kmalloc_caches[KMALLOC_NORMAL][idx])
+ continue;
+
+ size = kmalloc_caches[KMALLOC_NORMAL][idx]->object_size;
+ if (!size)
+ continue;
+
+ short_size = strchr(kmalloc_caches[KMALLOC_NORMAL][idx]->name, '-');
+ if (WARN_ON(!short_size))
+ goto fail;
+
+ cache_name = kasprintf(GFP_KERNEL, "%s-%s", name, short_size + 1);
+ if (WARN_ON(!cache_name))
+ goto fail;
+
+ if (useroffset >= size) {
+ cache_useroffset = 0;
+ cache_usersize = 0;
+ } else {
+ cache_useroffset = useroffset;
+ cache_usersize = min(size - cache_useroffset, usersize);
+ }
+ (*b)[idx] = kmem_cache_create_usercopy(cache_name, size,
+ align, flags, cache_useroffset,
+ cache_usersize, ctor);
+ kfree(cache_name);
+ if (WARN_ON(!(*b)[idx]))
+ goto fail;
+ }
+
+ return b;
+
+fail:
+ for (idx = 0; idx < ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]); idx++) {
+ if ((*b)[idx])
+ kmem_cache_destroy((*b)[idx]);
+ }
+ kfree(b);
+
+ return NULL;
+}
+EXPORT_SYMBOL(kmem_buckets_create);
+
#ifdef SLAB_SUPPORTS_SYSFS
/*
* For a given kmem_cache, kmem_cache_destroy() should only be called
@@ -938,6 +1006,10 @@ void __init create_kmalloc_caches(void)

/* Kmalloc array is now usable */
slab_state = UP;
+
+ kmem_buckets_cache = kmem_cache_create("kmalloc_buckets",
+ sizeof(kmem_buckets),
+ 0, 0, NULL);
}

/**
--
2.34.1


2024-04-24 21:42:25

by Kees Cook

Subject: [PATCH v3 5/6] ipc, msg: Use dedicated slab buckets for alloc_msg()

The msg subsystem is a common target for exploiting[1][2][3][4][5][6][7]
use-after-free type confusion flaws in the kernel for both read and
write primitives. Avoid having a user-controlled size cache share the
global kmalloc allocator by using a separate set of kmalloc buckets.

Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1]
Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2]
Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3]
Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4]
Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5]
Link: https://zplin.me/papers/ELOISE.pdf [6]
Link: https://syst3mfailure.io/wall-of-perdition/ [7]
Signed-off-by: Kees Cook <[email protected]>
---
Cc: "GONG, Ruiqi" <[email protected]>
Cc: Xiu Jianfeng <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Matteo Rizzo <[email protected]>
---
ipc/msgutil.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index d0a0e877cadd..f392f30a057a 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -42,6 +42,17 @@ struct msg_msgseg {
#define DATALEN_MSG ((size_t)PAGE_SIZE-sizeof(struct msg_msg))
#define DATALEN_SEG ((size_t)PAGE_SIZE-sizeof(struct msg_msgseg))

+static kmem_buckets *msg_buckets __ro_after_init;
+
+static int __init init_msg_buckets(void)
+{
+ msg_buckets = kmem_buckets_create("msg_msg", 0, SLAB_ACCOUNT,
+ sizeof(struct msg_msg),
+ DATALEN_MSG, NULL);
+
+ return 0;
+}
+subsys_initcall(init_msg_buckets);

static struct msg_msg *alloc_msg(size_t len)
{
@@ -50,7 +61,7 @@ static struct msg_msg *alloc_msg(size_t len)
size_t alen;

alen = min(len, DATALEN_MSG);
- msg = kmalloc(sizeof(*msg) + alen, GFP_KERNEL_ACCOUNT);
+ msg = kmem_buckets_alloc(msg_buckets, sizeof(*msg) + alen, GFP_KERNEL);
if (msg == NULL)
return NULL;

--
2.34.1


2024-04-24 21:42:37

by Kees Cook

Subject: [PATCH v3 6/6] mm/util: Use dedicated slab buckets for memdup_user()

Both memdup_user() and vmemdup_user() handle allocations that are
regularly used for exploiting use-after-free type confusion flaws in
the kernel (e.g. prctl() PR_SET_VMA_ANON_NAME[1] and setxattr[2][3][4]
respectively).

Since both are designed for contents coming from userspace, they allow
for userspace-controlled allocation sizes. Use a dedicated set of kmalloc
buckets so these allocations do not share caches with the global kmalloc
buckets.

After a fresh boot under Ubuntu 23.10, we can see the caches are already
in active use:

# grep ^memdup /proc/slabinfo
memdup_user-8k 4 4 8192 4 8 : ...
memdup_user-4k 8 8 4096 8 8 : ...
memdup_user-2k 16 16 2048 16 8 : ...
memdup_user-1k 0 0 1024 16 4 : ...
memdup_user-512 0 0 512 16 2 : ...
memdup_user-256 0 0 256 16 1 : ...
memdup_user-128 0 0 128 32 1 : ...
memdup_user-64 256 256 64 64 1 : ...
memdup_user-32 512 512 32 128 1 : ...
memdup_user-16 1024 1024 16 256 1 : ...
memdup_user-8 2048 2048 8 512 1 : ...
memdup_user-192 0 0 192 21 1 : ...
memdup_user-96 168 168 96 42 1 : ...

Link: https://starlabs.sg/blog/2023/07-prctl-anon_vma_name-an-amusing-heap-spray/ [1]
Link: https://duasynt.com/blog/linux-kernel-heap-spray [2]
Link: https://etenal.me/archives/1336 [3]
Link: https://github.com/a13xp0p0v/kernel-hack-drill/blob/master/drill_exploit_uaf.c [4]
Signed-off-by: Kees Cook <[email protected]>
---
Cc: Andrew Morton <[email protected]>
Cc: "GONG, Ruiqi" <[email protected]>
Cc: Xiu Jianfeng <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Matteo Rizzo <[email protected]>
Cc: [email protected]
---
mm/util.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index bdec4954680a..c448f80ed441 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -198,6 +198,16 @@ char *kmemdup_nul(const char *s, size_t len, gfp_t gfp)
}
EXPORT_SYMBOL(kmemdup_nul);

+static kmem_buckets *user_buckets __ro_after_init;
+
+static int __init init_user_buckets(void)
+{
+ user_buckets = kmem_buckets_create("memdup_user", 0, 0, 0, INT_MAX, NULL);
+
+ return 0;
+}
+subsys_initcall(init_user_buckets);
+
/**
* memdup_user - duplicate memory region from user space
*
@@ -211,7 +221,7 @@ void *memdup_user(const void __user *src, size_t len)
{
void *p;

- p = kmalloc_track_caller(len, GFP_USER | __GFP_NOWARN);
+ p = kmem_buckets_alloc_track_caller(user_buckets, len, GFP_USER | __GFP_NOWARN);
if (!p)
return ERR_PTR(-ENOMEM);

@@ -237,7 +247,7 @@ void *vmemdup_user(const void __user *src, size_t len)
{
void *p;

- p = kvmalloc(len, GFP_USER);
+ p = kmem_buckets_valloc(user_buckets, len, GFP_USER);
if (!p)
return ERR_PTR(-ENOMEM);

--
2.34.1


2024-04-28 11:03:05

by jvoisin

Subject: Re: [PATCH v3 0/6] slab: Introduce dedicated bucket allocator

On 4/24/24 23:40, Kees Cook wrote:
> Hi,
>
> Series change history:
>
> v3:
> - clarify rationale and purpose in commit log
> - rebase to -next (CONFIG_CODE_TAGGING)
> - simplify calling styles and split out bucket plumbing more cleanly
> - consolidate kmem_buckets_*() family introduction patches
> v2: https://lore.kernel.org/lkml/[email protected]/
> v1: https://lore.kernel.org/lkml/[email protected]/
>
> For the cover letter, I'm repeating commit log for patch 4 here, which has
> additional clarifications and rationale since v2:
>
> Dedicated caches are available for fixed size allocations via
> kmem_cache_alloc(), but for dynamically sized allocations there is only
> the global kmalloc API's set of buckets available. This means it isn't
> possible to separate specific sets of dynamically sized allocations into
> a separate collection of caches.
>
> This leads to a use-after-free exploitation weakness in the Linux
> kernel since many heap memory spraying/grooming attacks depend on using
> userspace-controllable dynamically sized allocations to collide with
> fixed size allocations that end up in same cache.
>
> While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
> against these kinds of "type confusion" attacks, including for fixed
> same-size heap objects, we can create a complementary deterministic
> defense for dynamically sized allocations that are directly user
> controlled. Addressing these cases is limited in scope, so isolation these
> kinds of interfaces will not become an unbounded game of whack-a-mole. For
> example, pass through memdup_user(), making isolation there very
> effective.

What does "Addressing these cases is limited in scope, so isolation
these kinds of interfaces will not become an unbounded game of
whack-a-mole." mean exactly?

>
> In order to isolate user-controllable sized allocations from system
> allocations, introduce kmem_buckets_create(), which behaves like
> kmem_cache_create(). Introduce kmem_buckets_alloc(), which behaves like
> kmem_cache_alloc(). Introduce kmem_buckets_alloc_track_caller() for
> where caller tracking is needed. Introduce kmem_buckets_valloc() for
> cases where vmalloc callback is needed.
>
> Allows for confining allocations to a dedicated set of sized caches
> (which have the same layout as the kmalloc caches).
>
> This can also be used in the future to extend codetag allocation
> annotations to implement per-caller allocation cache isolation[1] even
> for dynamic allocations.
Having per-caller allocation cache isolation looks like something that
has already been done in
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c6152940584290668b35fa0800026f6a1ae05fe
albeit in a randomized way. Why not piggy-back on the infra added by
this patch, instead of adding a new API?

> Memory allocation pinning[2] is still needed to plug the Use-After-Free
> cross-allocator weakness, but that is an existing and separate issue
> which is complementary to this improvement. Development continues for
> that feature via the SLAB_VIRTUAL[3] series (which could also provide
> guard pages -- another complementary improvement).
>
> Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
> Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
> Link: https://lore.kernel.org/lkml/[email protected]/ [3]

To be honest, I think this series is close to useless without allocation
pinning. And even with pinning, it's still routinely bypassed in the
KernelCTF
(https://github.com/google/security-research/tree/master/pocs/linux/kernelctf).

Do you have some particular exploits in mind that would be completely
mitigated by your series?

Moreover, I'm not aware of any ongoing development of the SLAB_VIRTUAL
series: the last sign of life on its thread is from 7 months ago.

>
> After the core implementation are 2 patches that cover the most heavily
> abused "repeat offenders" used in exploits. Repeating those details here:
>
> The msg subsystem is a common target for exploiting[1][2][3][4][5][6]
> use-after-free type confusion flaws in the kernel for both read and
> write primitives. Avoid having a user-controlled size cache share the
> global kmalloc allocator by using a separate set of kmalloc buckets.
>
> Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1]
> Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2]
> Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3]
> Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4]
> Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5]
> Link: https://zplin.me/papers/ELOISE.pdf [6]
> Link: https://syst3mfailure.io/wall-of-perdition/ [7]
>
> Both memdup_user() and vmemdup_user() handle allocations that are
> regularly used for exploiting use-after-free type confusion flaws in
> the kernel (e.g. prctl() PR_SET_VMA_ANON_NAME[1] and setxattr[2][3][4]
> respectively).
>
> Since both are designed for contents coming from userspace, it allows
> for userspace-controlled allocation sizes. Use a dedicated set of kmalloc
> buckets so these allocations do not share caches with the global kmalloc
> buckets.
>
> Link: https://starlabs.sg/blog/2023/07-prctl-anon_vma_name-an-amusing-heap-spray/ [1]
> Link: https://duasynt.com/blog/linux-kernel-heap-spray [2]
> Link: https://etenal.me/archives/1336 [3]
> Link: https://github.com/a13xp0p0v/kernel-hack-drill/blob/master/drill_exploit_uaf.c [4]

What's the performance impact of this series? Did you run some benchmarks?

>
> Thanks!
>
> -Kees
>
>
> Kees Cook (6):
> mm/slab: Introduce kmem_buckets typedef
> mm/slab: Plumb kmem_buckets into __do_kmalloc_node()
> mm/slab: Introduce __kvmalloc_node() that can take kmem_buckets
> argument
> mm/slab: Introduce kmem_buckets_create() and family
> ipc, msg: Use dedicated slab buckets for alloc_msg()
> mm/util: Use dedicated slab buckets for memdup_user()
>
> include/linux/slab.h | 44 ++++++++++++++++--------
> ipc/msgutil.c | 13 +++++++-
> lib/fortify_kunit.c | 2 +-
> lib/rhashtable.c | 2 +-
> mm/slab.h | 6 ++--
> mm/slab_common.c | 79 +++++++++++++++++++++++++++++++++++++++++---
> mm/slub.c | 14 ++++----
> mm/util.c | 21 +++++++++---
> 8 files changed, 146 insertions(+), 35 deletions(-)
>


2024-04-28 17:03:24

by Kees Cook

Subject: Re: [PATCH v3 0/6] slab: Introduce dedicated bucket allocator

On Sun, Apr 28, 2024 at 01:02:36PM +0200, jvoisin wrote:
> On 4/24/24 23:40, Kees Cook wrote:
> > Hi,
> >
> > Series change history:
> >
> > v3:
> > - clarify rationale and purpose in commit log
> > - rebase to -next (CONFIG_CODE_TAGGING)
> > - simplify calling styles and split out bucket plumbing more cleanly
> > - consolidate kmem_buckets_*() family introduction patches
> > v2: https://lore.kernel.org/lkml/[email protected]/
> > v1: https://lore.kernel.org/lkml/[email protected]/
> >
> > For the cover letter, I'm repeating commit log for patch 4 here, which has
> > additional clarifications and rationale since v2:
> >
> > Dedicated caches are available for fixed size allocations via
> > kmem_cache_alloc(), but for dynamically sized allocations there is only
> > the global kmalloc API's set of buckets available. This means it isn't
> > possible to separate specific sets of dynamically sized allocations into
> > a separate collection of caches.
> >
> > This leads to a use-after-free exploitation weakness in the Linux
> > kernel since many heap memory spraying/grooming attacks depend on using
> > userspace-controllable dynamically sized allocations to collide with
> > fixed size allocations that end up in same cache.
> >
> > While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
> > against these kinds of "type confusion" attacks, including for fixed
> > same-size heap objects, we can create a complementary deterministic
> > defense for dynamically sized allocations that are directly user
> > controlled. Addressing these cases is limited in scope, so isolation these
> > kinds of interfaces will not become an unbounded game of whack-a-mole. For
> > example, pass through memdup_user(), making isolation there very
> > effective.
>
> What does "Addressing these cases is limited in scope, so isolation
> these kinds of interfaces will not become an unbounded game of
> whack-a-mole." mean exactly?

The number of cases where there is a user/kernel API for size-controlled
allocations is limited. They don't get added very often, and most are
(correctly) using kmemdup_user() as the basis of their allocations. This
means we have a relatively well defined set of criteria for finding
places where this is needed, and most newly added interfaces will use
the existing (kmemdup_user()) infrastructure that will already be covered.
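
As a concrete (hypothetical) sketch of why such interfaces get covered for
free once memdup_user() uses its own buckets (patch 6), consider a new
UAPI handler written the recommended way:

  /* Hypothetical ioctl-style handler; nothing here needs to change for
   * its allocation to be isolated from the global kmalloc buckets. */
  static long foo_set_blob(void __user *uptr, size_t len)
  {
          void *buf = memdup_user(uptr, len);

          if (IS_ERR(buf))
                  return PTR_ERR(buf);
          /* ... parse or keep buf ... */
          kfree(buf);
          return 0;
  }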

> > In order to isolate user-controllable sized allocations from system
> > allocations, introduce kmem_buckets_create(), which behaves like
> > kmem_cache_create(). Introduce kmem_buckets_alloc(), which behaves like
> > kmem_cache_alloc(). Introduce kmem_buckets_alloc_track_caller() for
> > where caller tracking is needed. Introduce kmem_buckets_valloc() for
> > cases where vmalloc callback is needed.
> >
> > Allows for confining allocations to a dedicated set of sized caches
> > (which have the same layout as the kmalloc caches).
> >
> > This can also be used in the future to extend codetag allocation
> > annotations to implement per-caller allocation cache isolation[1] even
> > for dynamic allocations.
> Having per-caller allocation cache isolation looks like something that
> has already been done in
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c6152940584290668b35fa0800026f6a1ae05fe
> albeit in a randomized way. Why not piggy-back on the infra added by
> this patch, instead of adding a new API?

It's not sufficient because it is a static set of buckets. It cannot be
adjusted dynamically (which is not a problem kmem_buckets_create() has).
I had asked[1], in an earlier version of CONFIG_RANDOM_KMALLOC_CACHES, for
exactly the API that is provided in this series, because that would be
much more flexible.

And for systems that use allocation profiling, the next step
would be to provide per-call-site isolation (which would supersede
CONFIG_RANDOM_KMALLOC_CACHES, which we'd keep for the non-alloc-prof
cases).

> > Memory allocation pinning[2] is still needed to plug the Use-After-Free
> > cross-allocator weakness, but that is an existing and separate issue
> > which is complementary to this improvement. Development continues for
> > that feature via the SLAB_VIRTUAL[3] series (which could also provide
> > guard pages -- another complementary improvement).
> >
> > Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
> > Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
> > Link: https://lore.kernel.org/lkml/[email protected]/ [3]
>
> To be honest, I think this series is close to useless without allocation
> pinning. And even with pinning, it's still routinely bypassed in the
> KernelCTF
> (https://github.com/google/security-research/tree/master/pocs/linux/kernelctf).

Sure, I can understand why you might think that, but I disagree. This
adds the building blocks we need for better allocation isolation
control, and stops existing (and similar) attacks today.

But yes, given attackers with sufficient control over the entire system,
all mitigations get weaker. We can't fall into the trap of "perfect
security"; real-world experience shows that incremental improvements
like this can strongly impact the difficulty of mounting attacks. Not
all flaws are created equal; not everything is exploitable to the same
degree.

> Do you have some particular exploits in mind that would be completely
> mitigated by your series?

I link to like a dozen in the last two patches. :P

This series immediately closes 3 well used exploit methodologies.
Attackers exploiting new flaws that could have used the killed methods
must now choose methods that have greater complexity, and this drives
them towards cross-allocator attacks. Robust exploits there are more
costly to develop as we narrow the scope of methods.

Bad analogy: we're locking the doors of a house. Yes, some windows may
still be unlocked, but now they'll need a ladder. And it doesn't make
sense to lock the windows if we didn't lock the doors first. This is
what I mean by complementary defenses, and comes back to what I mentioned
earlier: "perfect security" is a myth, but incremental security works.

> Moreover, I'm not aware of any ongoing development of the SLAB_VIRTUAL
> series: the last sign of life on its thread is from 7 months ago.

Yeah, I know, but sometimes other things get in the way. Matteo assures
me it's still coming.

Since you're interested in seeing SLAB_VIRTUAL land, please join the
development efforts. Reach out to Matteo (you, he, and I all work for
the same company) and see where you can assist. Surely this can be
something you can contribute to while "on the clock"?

> > After the core implementation are 2 patches that cover the most heavily
> > abused "repeat offenders" used in exploits. Repeating those details here:
> >
> > The msg subsystem is a common target for exploiting[1][2][3][4][5][6]
> > use-after-free type confusion flaws in the kernel for both read and
> > write primitives. Avoid having a user-controlled size cache share the
> > global kmalloc allocator by using a separate set of kmalloc buckets.
> >
> > Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1]
> > Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2]
> > Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3]
> > Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4]
> > Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5]
> > Link: https://zplin.me/papers/ELOISE.pdf [6]
> > Link: https://syst3mfailure.io/wall-of-perdition/ [7]
> >
> > Both memdup_user() and vmemdup_user() handle allocations that are
> > regularly used for exploiting use-after-free type confusion flaws in
> > the kernel (e.g. prctl() PR_SET_VMA_ANON_NAME[1] and setxattr[2][3][4]
> > respectively).
> >
> > Since both are designed for contents coming from userspace, it allows
> > for userspace-controlled allocation sizes. Use a dedicated set of kmalloc
> > buckets so these allocations do not share caches with the global kmalloc
> > buckets.
> >
> > Link: https://starlabs.sg/blog/2023/07-prctl-anon_vma_name-an-amusing-heap-spray/ [1]
> > Link: https://duasynt.com/blog/linux-kernel-heap-spray [2]
> > Link: https://etenal.me/archives/1336 [3]
> > Link: https://github.com/a13xp0p0v/kernel-hack-drill/blob/master/drill_exploit_uaf.c [4]
>
> What's the performance impact of this series? Did you run some benchmarks?

I wasn't able to measure any performance impact at all. It does add a
small bit of memory overhead, but it's on the order of a dozen pages
used for the 2 extra sets of buckets. (E.g. it's well below the overhead
introduced by CONFIG_RANDOM_KMALLOC_CACHES, which adds 16 extra sets
of buckets.)

-Kees

[1] https://lore.kernel.org/lkml/202305161204.CB4A87C13@keescook/

--
Kees Cook

2024-05-03 13:40:02

by jvoisin

Subject: Re: [PATCH v3 0/6] slab: Introduce dedicated bucket allocator

On 4/28/24 19:02, Kees Cook wrote:
> On Sun, Apr 28, 2024 at 01:02:36PM +0200, jvoisin wrote:
>> On 4/24/24 23:40, Kees Cook wrote:
>>> Hi,
>>>
>>> Series change history:
>>>
>>> v3:
>>> - clarify rationale and purpose in commit log
>>> - rebase to -next (CONFIG_CODE_TAGGING)
>>> - simplify calling styles and split out bucket plumbing more cleanly
>>> - consolidate kmem_buckets_*() family introduction patches
>>> v2: https://lore.kernel.org/lkml/[email protected]/
>>> v1: https://lore.kernel.org/lkml/[email protected]/
>>>
>>> For the cover letter, I'm repeating commit log for patch 4 here, which has
>>> additional clarifications and rationale since v2:
>>>
>>> Dedicated caches are available for fixed size allocations via
>>> kmem_cache_alloc(), but for dynamically sized allocations there is only
>>> the global kmalloc API's set of buckets available. This means it isn't
>>> possible to separate specific sets of dynamically sized allocations into
>>> a separate collection of caches.
>>>
>>> This leads to a use-after-free exploitation weakness in the Linux
>>> kernel since many heap memory spraying/grooming attacks depend on using
>>> userspace-controllable dynamically sized allocations to collide with
>>> fixed size allocations that end up in same cache.
>>>
>>> While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
>>> against these kinds of "type confusion" attacks, including for fixed
>>> same-size heap objects, we can create a complementary deterministic
>>> defense for dynamically sized allocations that are directly user
>>> controlled. Addressing these cases is limited in scope, so isolation these
>>> kinds of interfaces will not become an unbounded game of whack-a-mole. For
>>> example, pass through memdup_user(), making isolation there very
>>> effective.
>>
>> What does "Addressing these cases is limited in scope, so isolation
>> these kinds of interfaces will not become an unbounded game of
>> whack-a-mole." mean exactly?
>
> The number of cases where there is a user/kernel API for size-controlled
> allocations is limited. They don't get added very often, and most are
> (correctly) using kmemdup_user() as the basis of their allocations. This
> means we have a relatively well defined set of criteria for finding
> places where this is needed, and most newly added interfaces will use
> the existing (kmemdup_user()) infrastructure that will already be covered.

A simple CodeQL query returns 266 of them:
https://lookerstudio.google.com/reporting/68b02863-4f5c-4d85-b3c1-992af89c855c/page/n92nD?params=%7B%22df3%22:%22include%25EE%2580%25803%25EE%2580%2580T%22%7D

Is this number realistic and coherent with your results/own analysis?

>
>>> In order to isolate user-controllable sized allocations from system
>>> allocations, introduce kmem_buckets_create(), which behaves like
>>> kmem_cache_create(). Introduce kmem_buckets_alloc(), which behaves like
>>> kmem_cache_alloc(). Introduce kmem_buckets_alloc_track_caller() for
>>> where caller tracking is needed. Introduce kmem_buckets_valloc() for
>>> cases where vmalloc callback is needed.
>>>
>>> Allows for confining allocations to a dedicated set of sized caches
>>> (which have the same layout as the kmalloc caches).
>>>
>>> This can also be used in the future to extend codetag allocation
>>> annotations to implement per-caller allocation cache isolation[1] even
>>> for dynamic allocations.
>> Having per-caller allocation cache isolation looks like something that
>> has already been done in
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c6152940584290668b35fa0800026f6a1ae05fe
>> albeit in a randomized way. Why not piggy-back on the infra added by
>> this patch, instead of adding a new API?
>
> It's not sufficient because it is a static set of buckets. It cannot be
> adjusted dynamically (which is not a problem kmem_buckets_create() has).
> I had asked[1], in an earlier version of CONFIG_RANDOM_KMALLOC_CACHES, for
> exactly the API that is provided in this series, because that would be
> much more flexible.
>
> And for systems that use allocation profiling, the next step
> would be to provide per-call-site isolation (which would supersede
> CONFIG_RANDOM_KMALLOC_CACHES, which we'd keep for the non-alloc-prof
> cases).
>
>>> Memory allocation pinning[2] is still needed to plug the Use-After-Free
>>> cross-allocator weakness, but that is an existing and separate issue
>>> which is complementary to this improvement. Development continues for
>>> that feature via the SLAB_VIRTUAL[3] series (which could also provide
>>> guard pages -- another complementary improvement).
>>>
>>> Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
>>> Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
>>> Link: https://lore.kernel.org/lkml/[email protected]/ [3]
>>
>> To be honest, I think this series is close to useless without allocation
>> pinning. And even with pinning, it's still routinely bypassed in the
>> KernelCTF
>> (https://github.com/google/security-research/tree/master/pocs/linux/kernelctf).
>
> Sure, I can understand why you might think that, but I disagree. This
> adds the building blocks we need for better allocation isolation
> control, and stops existing (and similar) attacks today.
>
> But yes, given attackers with sufficient control over the entire system,
> all mitigations get weaker. We can't fall into the trap of "perfect
> security"; real-world experience shows that incremental improvements
> like this can strongly impact the difficulty of mounting attacks. Not
> all flaws are created equal; not everything is exploitable to the same
> degree.

It's not about "perfect security", but about wisely spending the
complexity/review/performance/churn/… budgets in my opinion.

>> Do you have some particular exploits in mind that would be completely
>> mitigated by your series?
>
> I link to like a dozen in the last two patches. :P
>
> This series immediately closes 3 well used exploit methodologies.
> Attackers exploiting new flaws that could have used the killed methods
> must now choose methods that have greater complexity, and this drives
> them towards cross-allocator attacks. Robust exploits there are more
> costly to develop as we narrow the scope of methods.

You linked exploits that were making use of the two structures that you
isolated; making them use different structures would likely mean a
couple of hours.

I was more interested in exploits that are effectively killed; as I'm
still not convinced that elastic structures are rare, and that manually
isolating them one by one is attainable/sustainable/…

But if you have some proper analysis in this direction, then yes, I
completely agree that isolating all of them is a great idea.

>
> Bad analogy: we're locking the doors of a house. Yes, some windows may
> still be unlocked, but now they'll need a ladder. And it doesn't make
> sense to lock the windows if we didn't lock the doors first. This is
> what I mean by complementary defenses, and comes back to what I mentioned
> earlier: "perfect security" is a myth, but incremental security works.
>
>> Moreover, I'm not aware of any ongoing development of the SLAB_VIRTUAL
>> series: the last sign of life on its thread is from 7 months ago.
>
> Yeah, I know, but sometimes other things get in the way. Matteo assures
> me it's still coming.
>
> Since you're interested in seeing SLAB_VIRTUAL land, please join the
> development efforts. Reach out to Matteo (you, he, and I all work for
> the same company) and see where you can assist. Surely this can be
> something you can contribute to while "on the clock"?

I left Google a couple of weeks ago unfortunately, and I won't touch
anything with email-based development for less than a Google salary :D

>
>>> After the core implementation are 2 patches that cover the most heavily
>>> abused "repeat offenders" used in exploits. Repeating those details here:
>>>
>>> The msg subsystem is a common target for exploiting[1][2][3][4][5][6]
>>> use-after-free type confusion flaws in the kernel for both read and
>>> write primitives. Avoid having a user-controlled size cache share the
>>> global kmalloc allocator by using a separate set of kmalloc buckets.
>>>
>>> Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1]
>>> Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2]
>>> Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3]
>>> Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4]
>>> Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5]
>>> Link: https://zplin.me/papers/ELOISE.pdf [6]
>>> Link: https://syst3mfailure.io/wall-of-perdition/ [7]
>>>
>>> Both memdup_user() and vmemdup_user() handle allocations that are
>>> regularly used for exploiting use-after-free type confusion flaws in
>>> the kernel (e.g. prctl() PR_SET_VMA_ANON_NAME[1] and setxattr[2][3][4]
>>> respectively).
>>>
>>> Since both are designed for contents coming from userspace, it allows
>>> for userspace-controlled allocation sizes. Use a dedicated set of kmalloc
>>> buckets so these allocations do not share caches with the global kmalloc
>>> buckets.
>>>
>>> Link: https://starlabs.sg/blog/2023/07-prctl-anon_vma_name-an-amusing-heap-spray/ [1]
>>> Link: https://duasynt.com/blog/linux-kernel-heap-spray [2]
>>> Link: https://etenal.me/archives/1336 [3]
>>> Link: https://github.com/a13xp0p0v/kernel-hack-drill/blob/master/drill_exploit_uaf.c [4]
>>
>> What's the performance impact of this series? Did you run some benchmarks?
>
> I wasn't able to measure any performance impact at all. It does add a
> small bit of memory overhead, but it's on the order of a dozen pages
> used for the 2 extra sets of buckets. (E.g. it's well below the overhead
> introduced by CONFIG_RANDOM_KMALLOC_CACHES, which adds 16 extra sets
> of buckets.)

Nice!

2024-05-03 19:06:50

by Kees Cook

Subject: Re: [PATCH v3 0/6] slab: Introduce dedicated bucket allocator

On Fri, May 03, 2024 at 03:39:28PM +0200, jvoisin wrote:
> On 4/28/24 19:02, Kees Cook wrote:
> > On Sun, Apr 28, 2024 at 01:02:36PM +0200, jvoisin wrote:
> >> On 4/24/24 23:40, Kees Cook wrote:
> >>> [...]
> >>> While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
> >>> against these kinds of "type confusion" attacks, including for fixed
> >>> same-size heap objects, we can create a complementary deterministic
> >>> defense for dynamically sized allocations that are directly user
> >>> controlled. Addressing these cases is limited in scope, so isolation these
> >>> kinds of interfaces will not become an unbounded game of whack-a-mole. For
> >>> example, pass through memdup_user(), making isolation there very
> >>> effective.
> >>
> >> What does "Addressing these cases is limited in scope, so isolation
> >> these kinds of interfaces will not become an unbounded game of
> >> whack-a-mole." mean exactly?
> >
> > The number of cases where there is a user/kernel API for size-controlled
> > allocations is limited. They don't get added very often, and most are
> > (correctly) using kmemdup_user() as the basis of their allocations. This
> > means we have a relatively well defined set of criteria for finding
> > places where this is needed, and most newly added interfaces will use
> > the existing (kmemdup_user()) infrastructure that will already be covered.
>
> A simple CodeQL query returns 266 of them:
> https://lookerstudio.google.com/reporting/68b02863-4f5c-4d85-b3c1-992af89c855c/page/n92nD?params=%7B%22df3%22:%22include%25EE%2580%25803%25EE%2580%2580T%22%7D

These aren't filtered for being long-lived, nor filtered for
userspace reachability, nor filtered for userspace size and content
controllability. Take for example, cros_ec_get_panicinfo(): the size is
controlled by a device, the allocation doesn't last beyond the function,
and the function itself is part of device probing.

> Is this number realistic and coherent with your results/own analysis?

No, I think it's 1 possibly 2 orders of magnitude too high. Thank you for
the link, though: we can see what the absolute upper bounds is with it,
but that's not an accurate count of cases that would need to explicitly
use this bucket API. But even if it did, 300 instances is still small:
we converted more VLAs than that, more switch statement fallthroughs
than that, and fixed more array bounds cases than that.

And, again, while this series does close a bunch of methods today,
it's a _prerequisite_ for doing per-call-site allocation isolation,
which obviates the need for doing per-site analysis. (We can and will
still do such analysis, though, since there's a benefit to it for folks
that can't tolerate the expected per-site memory overhead.)

> [...]
> >>> Memory allocation pinning[2] is still needed to plug the Use-After-Free
> >>> cross-allocator weakness, but that is an existing and separate issue
> >>> which is complementary to this improvement. Development continues for
> >>> that feature via the SLAB_VIRTUAL[3] series (which could also provide
> >>> guard pages -- another complementary improvement).
> >>>
> >>> Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
> >>> Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
> >>> Link: https://lore.kernel.org/lkml/[email protected]/ [3]
> >>
> >> To be honest, I think this series is close to useless without allocation
> >> pinning. And even with pinning, it's still routinely bypassed in the
> >> KernelCTF
> >> (https://github.com/google/security-research/tree/master/pocs/linux/kernelctf).
> >
> > Sure, I can understand why you might think that, but I disagree. This
> > adds the building blocks we need for better allocation isolation
> > control, and stops existing (and similar) attacks today.
> >
> > But yes, given attackers with sufficient control over the entire system,
> > all mitigations get weaker. We can't fall into the trap of "perfect
> > security"; real-world experience shows that incremental improvements
> > like this can strongly impact the difficulty of mounting attacks. Not
> > all flaws are created equal; not everything is exploitable to the same
> > degree.
>
> It's not about "perfect security", but about wisely spending the
> complexity/review/performance/churn/… budgets in my opinion.

Sure, that's an appropriate analysis to make, and it's one I've done. I
think this series is well within those budgets: it abstracts the "bucket"
system into a distinct object that we've needed to have extracted for
other things, it's a pretty trivial review (since the abstraction makes
the other patches very straightforward), using the new API is a nearly
trivial drop-in replacement, and it immediately closes several glaring
exploit techniques, which has real-world impact. This is, IMO, a total
slam dunk of a series.

> >> Do you have some particular exploits in mind that would be completely
> >> mitigated by your series?
> >
> > I link to like a dozen in the last two patches. :P
> >
> > This series immediately closes 3 well used exploit methodologies.
> > Attackers exploiting new flaws that could have used the killed methods
> > must now choose methods that have greater complexity, and this drives
> > them towards cross-allocator attacks. Robust exploits there are more
> > costly to develop as we narrow the scope of methods.
>
> You linked exploits that were making use of the two structures that you
> isolated; making them use different structures would likely mean a
> couple of hours.

I think you underestimate what it would take to provide such a flexible
replacement. As I noted earlier, the techniques have several requirements:
- reachable from userspace
- long-lived allocation
- userspace controllable size
- userspace controllable contents
I'm not saying there aren't other interfaces that provide this, but it's
not common, and each (like these) will have their own quirks and
limitations. (e.g. the msg_msg exploit can't use the start of the
allocation since the contents aren't controllable, and has a minimum
size for the same reason.)

This series kills the 3 techniques with _2_ changes. 2 of the techniques
depend on the same internal helper (memdup_user()) that gets protected, which
implies that it will cover other things both now and in the future.

> I was more interested in exploits that are effectively killed; as I'm
> still not convinced that elastic structures are rare, and that manually
> isolating them one by one is attainable/sustainable/…

I don't agree with your rarity analysis, but it doesn't matter, because
we will be taking the next step and providing per-call-site isolation
using this abstraction.

> But if you have some proper analysis in this direction, then yes, I
> completely agree that isolating all of them is a great idea.

I don't need to perform a complete reachability analysis for all UAPI
because I can point to just memdup_user(): it is the recommended way
to get long-lived data from userspace. It has been and will be used by
interfaces that meet all 4 criteria for the exploit technique.

Converting other APIs to it or using the bucket allocation API can
happen over time as those are identified. This is standard operating
procedure for incremental improvements in Linux.
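
To make that concrete, the shape of interface being discussed looks
roughly like this (a sketch only: the device, field, and limit names are
hypothetical, but memdup_user() is the real helper):

  /*
   * Userspace controls both the size and the contents of a long-lived
   * allocation -- exactly the combination the bucket isolation targets --
   * and the allocation funnels through memdup_user().
   */
  static long example_set_blob(struct example_dev *dev,
                               const void __user *ubuf, size_t len)
  {
          void *blob;

          if (!len || len > EXAMPLE_BLOB_MAX)
                  return -EINVAL;

          blob = memdup_user(ubuf, len);  /* kmalloc + copy_from_user */
          if (IS_ERR(blob))
                  return PTR_ERR(blob);

          kfree(dev->blob);
          dev->blob = blob;               /* long-lived, user-sized, user-filled */
          dev->blob_len = len;
          return 0;
  }

Any interface shaped like this is covered automatically once
memdup_user() itself allocates from a dedicated bucket set.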

> > Bad analogy: we're locking the doors of a house. Yes, some windows may
> > still be unlocked, but now they'll need a ladder. And it doesn't make
> > sense to lock the windows if we didn't lock the doors first. This is
> > what I mean by complementary defenses, and comes back to what I mentioned
> > earlier: "perfect security" is a myth, but incremental security works.
> >
> >> Moreover, I'm not aware of any ongoing development of the SLAB_VIRTUAL
> >> series: the last sign of life on its thread is from 7 months ago.
> >
> > Yeah, I know, but sometimes other things get in the way. Matteo assures
> > me it's still coming.
> >
> > Since you're interested in seeing SLAB_VIRTUAL land, please join the
> > development efforts. Reach out to Matteo (you, he, and I all work for
> > the same company) and see where you can assist. Surely this can be
> > something you can contribute to while "on the clock"?
>
> I left Google a couple of weeks ago unfortunately,

Ah! Bummer; I didn't see that happen. :(

> and I won't touch
> anything with email-based development for less than a Google salary :D

LOL. Yes, I can understand that. :) I do want to say, though, that
objections carry a lot more weight when counter-proposal patches are
provided. "This is the way." :P

-Kees

--
Kees Cook

2024-05-24 13:39:19

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] mm/slab: Plumb kmem_buckets into __do_kmalloc_node()

On 4/24/24 11:40 PM, Kees Cook wrote:
> To be able to choose which buckets to allocate from, make the buckets
> available to the lower level kmalloc interfaces by adding them as the
> first argument. Where the bucket is not available, pass NULL, which means
> "use the default system kmalloc bucket set" (the prior existing behavior),
> as implemented in kmalloc_slab().
>
> Signed-off-by: Kees Cook <[email protected]>
> ---
> Cc: Vlastimil Babka <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Pekka Enberg <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Roman Gushchin <[email protected]>
> Cc: Hyeonggon Yoo <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> include/linux/slab.h | 16 ++++++++--------
> lib/fortify_kunit.c | 2 +-
> mm/slab.h | 6 ++++--
> mm/slab_common.c | 4 ++--
> mm/slub.c | 14 +++++++-------
> mm/util.c | 2 +-
> 6 files changed, 23 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index c8164d5db420..07373b680894 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -569,8 +569,8 @@ static __always_inline void kfree_bulk(size_t size, void **p)
> kmem_cache_free_bulk(NULL, size, p);
> }
>
> -void *__kmalloc_node_noprof(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment
> - __alloc_size(1);
> +void *__kmalloc_node_noprof(kmem_buckets *b, size_t size, gfp_t flags, int node)
> + __assume_kmalloc_alignment __alloc_size(2);
> #define __kmalloc_node(...) alloc_hooks(__kmalloc_node_noprof(__VA_ARGS__))
>
> void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t flags,
> @@ -679,7 +679,7 @@ static __always_inline __alloc_size(1) void *kmalloc_node_noprof(size_t size, gf
> kmalloc_caches[kmalloc_type(flags, _RET_IP_)][index],
> flags, node, size);
> }
> - return __kmalloc_node_noprof(size, flags, node);
> + return __kmalloc_node_noprof(NULL, size, flags, node);

This is not ideal, as every kmalloc_node() callsite will now have to add
the NULL parameter even if this is not enabled. Could the new parameter be
only added depending on the respective config?

> }
> #define kmalloc_node(...) alloc_hooks(kmalloc_node_noprof(__VA_ARGS__))


2024-05-24 13:44:00

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] mm/slab: Introduce kmem_buckets_create() and family

On 4/24/24 11:41 PM, Kees Cook wrote:
> Dedicated caches are available for fixed size allocations via
> kmem_cache_alloc(), but for dynamically sized allocations there is only
> the global kmalloc API's set of buckets available. This means it isn't
> possible to separate specific sets of dynamically sized allocations into
> a separate collection of caches.
>
> This leads to a use-after-free exploitation weakness in the Linux
> kernel since many heap memory spraying/grooming attacks depend on using
> userspace-controllable dynamically sized allocations to collide with
> fixed size allocations that end up in same cache.
>
> While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
> against these kinds of "type confusion" attacks, including for fixed
> same-size heap objects, we can create a complementary deterministic
> defense for dynamically sized allocations that are directly user
> controlled. Addressing these cases is limited in scope, so isolation these
> kinds of interfaces will not become an unbounded game of whack-a-mole. For
> example, pass through memdup_user(), making isolation there very
> effective.
>
> In order to isolate user-controllable sized allocations from system
> allocations, introduce kmem_buckets_create(), which behaves like
> kmem_cache_create(). Introduce kmem_buckets_alloc(), which behaves like
> kmem_cache_alloc(). Introduce kmem_buckets_alloc_track_caller() for
> where caller tracking is needed. Introduce kmem_buckets_valloc() for
> cases where vmalloc callback is needed.
>
> Allows for confining allocations to a dedicated set of sized caches
> (which have the same layout as the kmalloc caches).
>
> This can also be used in the future to extend codetag allocation
> annotations to implement per-caller allocation cache isolation[1] even
> for dynamic allocations.
>
> Memory allocation pinning[2] is still needed to plug the Use-After-Free
> cross-allocator weakness, but that is an existing and separate issue
> which is complementary to this improvement. Development continues for
> that feature via the SLAB_VIRTUAL[3] series (which could also provide
> guard pages -- another complementary improvement).
>
> Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
> Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
> Link: https://lore.kernel.org/lkml/[email protected]/ [3]
> Signed-off-by: Kees Cook <[email protected]>

So this seems like it's all unconditional and not dependent on a config
option? I'd rather see one, as has been done for all similar functionality
before, since not everyone will want this trade-off.

Also AFAIU every new user (patches 5, 6) will add a new bunch of lines to
/proc/slabinfo? And when you integrate alloc tagging, do you expect every
callsite will do that as well? Any idea how many there would be and what
kind of memory overhead it will have in the end?



2024-05-24 14:55:15

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] slab: Introduce dedicated bucket allocator

On Wed, Apr 24, 2024 at 02:40:57PM -0700, Kees Cook wrote:
> Hi,
>
> Series change history:
>
> v3:
> - clarify rationale and purpose in commit log
> - rebase to -next (CONFIG_CODE_TAGGING)
> - simplify calling styles and split out bucket plumbing more cleanly
> - consolidate kmem_buckets_*() family introduction patches
> v2: https://lore.kernel.org/lkml/[email protected]/
> v1: https://lore.kernel.org/lkml/[email protected]/
>
> For the cover letter, I'm repeating commit log for patch 4 here, which has
> additional clarifications and rationale since v2:
>
> Dedicated caches are available for fixed size allocations via
> kmem_cache_alloc(), but for dynamically sized allocations there is only
> the global kmalloc API's set of buckets available. This means it isn't
> possible to separate specific sets of dynamically sized allocations into
> a separate collection of caches.
>
> This leads to a use-after-free exploitation weakness in the Linux
> kernel since many heap memory spraying/grooming attacks depend on using
> userspace-controllable dynamically sized allocations to collide with
> fixed size allocations that end up in same cache.

This is going to increase internal fragmentation in the slab allocator,
so we're going to need better, more visible numbers on the amount of
memory stranded thusly, so users can easily see the effect this has.

Please also document this effect and point users in the documentation
where to check, so that we devs can get feedback on this.

2024-05-24 15:01:57

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] mm/slab: Plumb kmem_buckets into __do_kmalloc_node()

On Wed, Apr 24, 2024 at 02:40:59PM -0700, Kees Cook wrote:
> To be able to choose which buckets to allocate from, make the buckets
> available to the lower level kmalloc interfaces by adding them as the
> first argument. Where the bucket is not available, pass NULL, which means
> "use the default system kmalloc bucket set" (the prior existing behavior),
> as implemented in kmalloc_slab().

I thought the plan was to use codetags for this? That would obviate the
need for all this plumbing.

Add fields to the alloc tag for:
- allocation size (or 0 if it's not a compile time constant)
- union of kmem_cache, kmem_buckets, depending on whether the
allocation size is constant or not

Then this can all be internal to slub.c, and the kmem_cache or
kmem_buckets gets lazy initialized.

If memory allocation profiling isn't enabled, #ifdef the other fields of
the alloc tag out (including the codetag itself) so your fields will be
the only fields in alloc_tag.
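
Something along the lines of this sketch, presumably (a minimal sketch,
assuming the current two-field alloc_tag and the CONFIG_MEM_ALLOC_PROFILING
option; the new members are only the suggestion above, not posted code):

  struct alloc_tag {
  #ifdef CONFIG_MEM_ALLOC_PROFILING
          struct codetag                  ct;
          struct alloc_tag_counters __percpu *counters;
  #endif
          /* For per-site slab isolation: */
          size_t                          const_size;   /* 0 if not compile-time constant */
          union {
                  struct kmem_cache       *cache;       /* const_size != 0 */
                  kmem_buckets            *buckets;     /* const_size == 0 */
          };
  };

With that, alloc_hooks() callers never see the plumbing; slub.c looks at
the tag, lazily creates the cache or bucket set on first use, and picks
from it thereafter.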

>
> Signed-off-by: Kees Cook <[email protected]>
> ---
> Cc: Vlastimil Babka <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Pekka Enberg <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Joonsoo Kim <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Roman Gushchin <[email protected]>
> Cc: Hyeonggon Yoo <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> include/linux/slab.h | 16 ++++++++--------
> lib/fortify_kunit.c | 2 +-
> mm/slab.h | 6 ++++--
> mm/slab_common.c | 4 ++--
> mm/slub.c | 14 +++++++-------
> mm/util.c | 2 +-
> 6 files changed, 23 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index c8164d5db420..07373b680894 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -569,8 +569,8 @@ static __always_inline void kfree_bulk(size_t size, void **p)
> kmem_cache_free_bulk(NULL, size, p);
> }
>
> -void *__kmalloc_node_noprof(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment
> - __alloc_size(1);
> +void *__kmalloc_node_noprof(kmem_buckets *b, size_t size, gfp_t flags, int node)
> + __assume_kmalloc_alignment __alloc_size(2);
> #define __kmalloc_node(...) alloc_hooks(__kmalloc_node_noprof(__VA_ARGS__))
>
> void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t flags,
> @@ -679,7 +679,7 @@ static __always_inline __alloc_size(1) void *kmalloc_node_noprof(size_t size, gf
> kmalloc_caches[kmalloc_type(flags, _RET_IP_)][index],
> flags, node, size);
> }
> - return __kmalloc_node_noprof(size, flags, node);
> + return __kmalloc_node_noprof(NULL, size, flags, node);
> }
> #define kmalloc_node(...) alloc_hooks(kmalloc_node_noprof(__VA_ARGS__))
>
> @@ -730,10 +730,10 @@ static inline __realloc_size(2, 3) void * __must_check krealloc_array_noprof(voi
> */
> #define kcalloc(n, size, flags) kmalloc_array(n, size, (flags) | __GFP_ZERO)
>
> -void *kmalloc_node_track_caller_noprof(size_t size, gfp_t flags, int node,
> - unsigned long caller) __alloc_size(1);
> +void *kmalloc_node_track_caller_noprof(kmem_buckets *b, size_t size, gfp_t flags, int node,
> + unsigned long caller) __alloc_size(2);
> #define kmalloc_node_track_caller(...) \
> - alloc_hooks(kmalloc_node_track_caller_noprof(__VA_ARGS__, _RET_IP_))
> + alloc_hooks(kmalloc_node_track_caller_noprof(NULL, __VA_ARGS__, _RET_IP_))
>
> /*
> * kmalloc_track_caller is a special version of kmalloc that records the
> @@ -746,7 +746,7 @@ void *kmalloc_node_track_caller_noprof(size_t size, gfp_t flags, int node,
> #define kmalloc_track_caller(...) kmalloc_node_track_caller(__VA_ARGS__, NUMA_NO_NODE)
>
> #define kmalloc_track_caller_noprof(...) \
> - kmalloc_node_track_caller_noprof(__VA_ARGS__, NUMA_NO_NODE, _RET_IP_)
> + kmalloc_node_track_caller_noprof(NULL, __VA_ARGS__, NUMA_NO_NODE, _RET_IP_)
>
> static inline __alloc_size(1, 2) void *kmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags,
> int node)
> @@ -757,7 +757,7 @@ static inline __alloc_size(1, 2) void *kmalloc_array_node_noprof(size_t n, size_
> return NULL;
> if (__builtin_constant_p(n) && __builtin_constant_p(size))
> return kmalloc_node_noprof(bytes, flags, node);
> - return __kmalloc_node_noprof(bytes, flags, node);
> + return __kmalloc_node_noprof(NULL, bytes, flags, node);
> }
> #define kmalloc_array_node(...) alloc_hooks(kmalloc_array_node_noprof(__VA_ARGS__))
>
> diff --git a/lib/fortify_kunit.c b/lib/fortify_kunit.c
> index 493ec02dd5b3..ff059d88d455 100644
> --- a/lib/fortify_kunit.c
> +++ b/lib/fortify_kunit.c
> @@ -220,7 +220,7 @@ static void alloc_size_##allocator##_dynamic_test(struct kunit *test) \
> checker(expected_size, __kmalloc(alloc_size, gfp), \
> kfree(p)); \
> checker(expected_size, \
> - __kmalloc_node(alloc_size, gfp, NUMA_NO_NODE), \
> + __kmalloc_node(NULL, alloc_size, gfp, NUMA_NO_NODE), \
> kfree(p)); \
> \
> orig = kmalloc(alloc_size, gfp); \
> diff --git a/mm/slab.h b/mm/slab.h
> index 5f8f47c5bee0..f459cd338852 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -403,16 +403,18 @@ static inline unsigned int size_index_elem(unsigned int bytes)
> * KMALLOC_MAX_CACHE_SIZE and the caller must check that.
> */
> static inline struct kmem_cache *
> -kmalloc_slab(size_t size, gfp_t flags, unsigned long caller)
> +kmalloc_slab(kmem_buckets *b, size_t size, gfp_t flags, unsigned long caller)
> {
> unsigned int index;
>
> + if (!b)
> + b = &kmalloc_caches[kmalloc_type(flags, caller)];
> if (size <= 192)
> index = kmalloc_size_index[size_index_elem(size)];
> else
> index = fls(size - 1);
>
> - return kmalloc_caches[kmalloc_type(flags, caller)][index];
> + return (*b)[index];
> }
>
> gfp_t kmalloc_fix_flags(gfp_t flags);
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index db9e1b15efd5..7cb4e8fd1275 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -702,7 +702,7 @@ size_t kmalloc_size_roundup(size_t size)
> * The flags don't matter since size_index is common to all.
> * Neither does the caller for just getting ->object_size.
> */
> - return kmalloc_slab(size, GFP_KERNEL, 0)->object_size;
> + return kmalloc_slab(NULL, size, GFP_KERNEL, 0)->object_size;
> }
>
> /* Above the smaller buckets, size is a multiple of page size. */
> @@ -1186,7 +1186,7 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags)
> return (void *)p;
> }
>
> - ret = kmalloc_node_track_caller_noprof(new_size, flags, NUMA_NO_NODE, _RET_IP_);
> + ret = kmalloc_node_track_caller_noprof(NULL, new_size, flags, NUMA_NO_NODE, _RET_IP_);
> if (ret && p) {
> /* Disable KASAN checks as the object's redzone is accessed. */
> kasan_disable_current();
> diff --git a/mm/slub.c b/mm/slub.c
> index 23bc0d236c26..a94a0507e19c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4093,7 +4093,7 @@ void *kmalloc_large_node_noprof(size_t size, gfp_t flags, int node)
> EXPORT_SYMBOL(kmalloc_large_node_noprof);
>
> static __always_inline
> -void *__do_kmalloc_node(size_t size, gfp_t flags, int node,
> +void *__do_kmalloc_node(kmem_buckets *b, size_t size, gfp_t flags, int node,
> unsigned long caller)
> {
> struct kmem_cache *s;
> @@ -4109,7 +4109,7 @@ void *__do_kmalloc_node(size_t size, gfp_t flags, int node,
> if (unlikely(!size))
> return ZERO_SIZE_PTR;
>
> - s = kmalloc_slab(size, flags, caller);
> + s = kmalloc_slab(b, size, flags, caller);
>
> ret = slab_alloc_node(s, NULL, flags, node, caller, size);
> ret = kasan_kmalloc(s, ret, size, flags);
> @@ -4117,22 +4117,22 @@ void *__do_kmalloc_node(size_t size, gfp_t flags, int node,
> return ret;
> }
>
> -void *__kmalloc_node_noprof(size_t size, gfp_t flags, int node)
> +void *__kmalloc_node_noprof(kmem_buckets *b, size_t size, gfp_t flags, int node)
> {
> - return __do_kmalloc_node(size, flags, node, _RET_IP_);
> + return __do_kmalloc_node(b, size, flags, node, _RET_IP_);
> }
> EXPORT_SYMBOL(__kmalloc_node_noprof);
>
> void *__kmalloc_noprof(size_t size, gfp_t flags)
> {
> - return __do_kmalloc_node(size, flags, NUMA_NO_NODE, _RET_IP_);
> + return __do_kmalloc_node(NULL, size, flags, NUMA_NO_NODE, _RET_IP_);
> }
> EXPORT_SYMBOL(__kmalloc_noprof);
>
> -void *kmalloc_node_track_caller_noprof(size_t size, gfp_t flags,
> +void *kmalloc_node_track_caller_noprof(kmem_buckets *b, size_t size, gfp_t flags,
> int node, unsigned long caller)
> {
> - return __do_kmalloc_node(size, flags, node, caller);
> + return __do_kmalloc_node(b, size, flags, node, caller);
> }
> EXPORT_SYMBOL(kmalloc_node_track_caller_noprof);
>
> diff --git a/mm/util.c b/mm/util.c
> index c9e519e6811f..80430e5ba981 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -128,7 +128,7 @@ void *kmemdup_noprof(const void *src, size_t len, gfp_t gfp)
> {
> void *p;
>
> - p = kmalloc_node_track_caller_noprof(len, gfp, NUMA_NO_NODE, _RET_IP_);
> + p = kmalloc_node_track_caller_noprof(NULL, len, gfp, NUMA_NO_NODE, _RET_IP_);
> if (p)
> memcpy(p, src, len);
> return p;
> --
> 2.34.1
>

2024-05-31 16:38:14

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] mm/slab: Introduce kmem_buckets_create() and family

On Fri, May 24, 2024 at 03:43:33PM +0200, Vlastimil Babka wrote:
> On 4/24/24 11:41 PM, Kees Cook wrote:
> > Dedicated caches are available for fixed size allocations via
> > kmem_cache_alloc(), but for dynamically sized allocations there is only
> > the global kmalloc API's set of buckets available. This means it isn't
> > possible to separate specific sets of dynamically sized allocations into
> > a separate collection of caches.
> >
> > This leads to a use-after-free exploitation weakness in the Linux
> > kernel since many heap memory spraying/grooming attacks depend on using
> > userspace-controllable dynamically sized allocations to collide with
> > fixed size allocations that end up in same cache.
> >
> > While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
> > against these kinds of "type confusion" attacks, including for fixed
> > same-size heap objects, we can create a complementary deterministic
> > defense for dynamically sized allocations that are directly user
> > controlled. Addressing these cases is limited in scope, so isolation these
> > kinds of interfaces will not become an unbounded game of whack-a-mole. For
> > example, pass through memdup_user(), making isolation there very
> > effective.
> >
> > In order to isolate user-controllable sized allocations from system
> > allocations, introduce kmem_buckets_create(), which behaves like
> > kmem_cache_create(). Introduce kmem_buckets_alloc(), which behaves like
> > kmem_cache_alloc(). Introduce kmem_buckets_alloc_track_caller() for
> > where caller tracking is needed. Introduce kmem_buckets_valloc() for
> > cases where vmalloc callback is needed.
> >
> > Allows for confining allocations to a dedicated set of sized caches
> > (which have the same layout as the kmalloc caches).
> >
> > This can also be used in the future to extend codetag allocation
> > annotations to implement per-caller allocation cache isolation[1] even
> > for dynamic allocations.
> >
> > Memory allocation pinning[2] is still needed to plug the Use-After-Free
> > cross-allocator weakness, but that is an existing and separate issue
> > which is complementary to this improvement. Development continues for
> > that feature via the SLAB_VIRTUAL[3] series (which could also provide
> > guard pages -- another complementary improvement).
> >
> > Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
> > Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
> > Link: https://lore.kernel.org/lkml/[email protected]/ [3]
> > Signed-off-by: Kees Cook <[email protected]>
>
> So this seems like it's all unconditional and not dependent on a config
> option? I'd rather see one, as has been done for all similar functionality
> before, since not everyone will want this trade-off.

Okay, sure, I can do that. Since it was changing some of the core APIs
to pass in the target buckets (instead of using the global buckets), it
seemed less complex to me to just make the simple changes unconditional.

> Also AFAIU every new user (patches 5, 6) will add a new bunch of lines to
> /proc/slabinfo? And when you integrate alloc tagging, do you expect every

Yes, that's true, but that's the goal: they want to live in a separate
bucket collection. At present, it's only two users, and arguably almost
everything that the manual isolation provides should be using the
memdup_user() interface anyway.
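
For reference, a manual bucket user from the later patches looks roughly
like this (sketch; the exact kmem_buckets_create() argument list in v3
may differ slightly):

  static kmem_buckets *msg_buckets __ro_after_init;

  static int __init init_msg_buckets(void)
  {
          /* One extra set of kmalloc-style caches, visible in /proc/slabinfo. */
          msg_buckets = kmem_buckets_create("msg_msg", SLAB_ACCOUNT,
                                            sizeof(struct msg_msg),
                                            DATALEN_MSG, NULL);
          return 0;
  }
  subsys_initcall(init_msg_buckets);

  /* ...and the allocation site switches from kmalloc() to: */
  msg = kmem_buckets_alloc(msg_buckets, alloc_size, GFP_KERNEL);

So each manual conversion costs one bucket set (one block of slabinfo
lines), not one cache per callsite.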

> callsite will do that as well? Any idea how many there would be and what
> kind of memory overhead it will have in the end?

I haven't measured the per-codetag-site overhead, but I don't expect it
to be giant. If the buckets are created on demand, it should be small.

But one step at a time. This provides the infrastructure needed to
immediately solve the low-hanging exploitable fruit and to pave the way
for the per-codetag-site automation.

-Kees

--
Kees Cook

2024-05-31 16:43:41

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] mm/slab: Plumb kmem_buckets into __do_kmalloc_node()

On Fri, May 24, 2024 at 03:38:58PM +0200, Vlastimil Babka wrote:
> On 4/24/24 11:40 PM, Kees Cook wrote:
> > To be able to choose which buckets to allocate from, make the buckets
> > available to the lower level kmalloc interfaces by adding them as the
> > first argument. Where the bucket is not available, pass NULL, which means
> > "use the default system kmalloc bucket set" (the prior existing behavior),
> > as implemented in kmalloc_slab().
> >
> > Signed-off-by: Kees Cook <[email protected]>
> > ---
> > Cc: Vlastimil Babka <[email protected]>
> > Cc: Christoph Lameter <[email protected]>
> > Cc: Pekka Enberg <[email protected]>
> > Cc: David Rientjes <[email protected]>
> > Cc: Joonsoo Kim <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Roman Gushchin <[email protected]>
> > Cc: Hyeonggon Yoo <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > ---
> > include/linux/slab.h | 16 ++++++++--------
> > lib/fortify_kunit.c | 2 +-
> > mm/slab.h | 6 ++++--
> > mm/slab_common.c | 4 ++--
> > mm/slub.c | 14 +++++++-------
> > mm/util.c | 2 +-
> > 6 files changed, 23 insertions(+), 21 deletions(-)
> >
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index c8164d5db420..07373b680894 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -569,8 +569,8 @@ static __always_inline void kfree_bulk(size_t size, void **p)
> > kmem_cache_free_bulk(NULL, size, p);
> > }
> >
> > -void *__kmalloc_node_noprof(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment
> > - __alloc_size(1);
> > +void *__kmalloc_node_noprof(kmem_buckets *b, size_t size, gfp_t flags, int node)
> > + __assume_kmalloc_alignment __alloc_size(2);
> > #define __kmalloc_node(...) alloc_hooks(__kmalloc_node_noprof(__VA_ARGS__))
> >
> > void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t flags,
> > @@ -679,7 +679,7 @@ static __always_inline __alloc_size(1) void *kmalloc_node_noprof(size_t size, gf
> > kmalloc_caches[kmalloc_type(flags, _RET_IP_)][index],
> > flags, node, size);
> > }
> > - return __kmalloc_node_noprof(size, flags, node);
> > + return __kmalloc_node_noprof(NULL, size, flags, node);
>
> This is not ideal, as every kmalloc_node() callsite will now have to add
> the NULL parameter even if this is not enabled. Could the new parameter be
> only added depending on the respective config?

I felt like it was much simpler to add an argument to the existing call
path than to create a duplicate API that had 1 extra argument. However,
if you want this behind a Kconfig option, I can redefine the argument
list based on that?
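
One way that could look, as a sketch (the option name CONFIG_SLAB_BUCKETS
and the helper macros are illustrative, not something already in the
series as posted):

  #ifdef CONFIG_SLAB_BUCKETS
  # define DECL_BUCKET_PARAMS(_size, _b)  size_t (_size), kmem_buckets *(_b)
  # define PASS_BUCKET_PARAMS(_size, _b)  (_size), (_b)
  # define PASS_BUCKET_PARAM(_b)          (_b)
  #else
  # define DECL_BUCKET_PARAMS(_size, _b)  size_t (_size)
  # define PASS_BUCKET_PARAMS(_size, _b)  (_size)
  # define PASS_BUCKET_PARAM(_b)          NULL
  #endif

  void *__kmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
          __assume_kmalloc_alignment __alloc_size(1);

Callers that don't care would forward through PASS_BUCKET_PARAMS(size,
NULL), so with the option disabled the extra argument compiles away
entirely and kmalloc_slab() keeps falling back to the global buckets.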

--
Kees Cook

2024-05-31 16:49:51

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] mm/slab: Plumb kmem_buckets into __do_kmalloc_node()

On Fri, May 24, 2024 at 11:01:40AM -0400, Kent Overstreet wrote:
> On Wed, Apr 24, 2024 at 02:40:59PM -0700, Kees Cook wrote:
> > To be able to choose which buckets to allocate from, make the buckets
> > available to the lower level kmalloc interfaces by adding them as the
> > first argument. Where the bucket is not available, pass NULL, which means
> > "use the default system kmalloc bucket set" (the prior existing behavior),
> > as implemented in kmalloc_slab().
>
> I thought the plan was to use codetags for this? That would obviate the
> need for all this plumbing.
>
> Add fields to the alloc tag for:
> - allocation size (or 0 if it's not a compile time constant)
> - union of kmem_cache, kmem_buckets, depending on whether the
> allocation size is constant or not

I want to provide "simple" (low-hanging fruit) coverage that can live
separately from the codetags-based coverage. The memory overhead for
this patch series is negligible, but I suspect the codetags expansion,
while not giant, will be more than some deployments will want. I want
to avoid an all-or-nothing solution -- which is why I had intended this
to be available "by default".

But I will respin this with kmem_buckets under a Kconfig.

--
Kees Cook

2024-05-31 16:50:23

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] mm/slab: Plumb kmem_buckets into __do_kmalloc_node()

On Fri, May 31, 2024 at 09:48:49AM -0700, Kees Cook wrote:
> On Fri, May 24, 2024 at 11:01:40AM -0400, Kent Overstreet wrote:
> > On Wed, Apr 24, 2024 at 02:40:59PM -0700, Kees Cook wrote:
> > > To be able to choose which buckets to allocate from, make the buckets
> > > available to the lower level kmalloc interfaces by adding them as the
> > > first argument. Where the bucket is not available, pass NULL, which means
> > > "use the default system kmalloc bucket set" (the prior existing behavior),
> > > as implemented in kmalloc_slab().
> >
> > I thought the plan was to use codetags for this? That would obviate the
> > need for all this plumbing.
> >
> > Add fields to the alloc tag for:
> > - allocation size (or 0 if it's not a compile time constant)
> > - union of kmem_cache, kmem_buckets, depending on whether the
> > allocation size is constant or not
>
> I want to provide "simple" (low-hanging fruit) coverage that can live
> separately from the codetags-based coverage. The memory overhead for
> this patch series is negligible, but I suspect the codetags expansion,
> while not giant, will be more than some deployments will want. I want
> to avoid an all-or-nothing solution -- which is why I had intended this
> to be available "by default".

You don't need to include the full codetag + alloc tag counters - with
the appropriate #ifdefs the overhead will be negligible.

2024-05-31 16:54:43

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] mm/slab: Plumb kmem_buckets into __do_kmalloc_node()

On Fri, May 31, 2024 at 09:48:49AM -0700, Kees Cook wrote:
> On Fri, May 24, 2024 at 11:01:40AM -0400, Kent Overstreet wrote:
> > On Wed, Apr 24, 2024 at 02:40:59PM -0700, Kees Cook wrote:
> > > To be able to choose which buckets to allocate from, make the buckets
> > > available to the lower level kmalloc interfaces by adding them as the
> > > first argument. Where the bucket is not available, pass NULL, which means
> > > "use the default system kmalloc bucket set" (the prior existing behavior),
> > > as implemented in kmalloc_slab().
> >
> > I thought the plan was to use codetags for this? That would obviate the
> > need for all this plumbing.
> >
> > Add fields to the alloc tag for:
> > - allocation size (or 0 if it's not a compile time constant)
> > - union of kmem_cache, kmem_buckets, depending on whether the
> > allocation size is constant or not
>
> I want to provide "simple" (low-hanging fruit) coverage that can live
> separately from the codetags-based coverage. The memory overhead for
> this patch series is negligible, but I suspect the codetags expansion,
> while not giant, will be more than some deployments will want. I want
> to avoid an all-or-nothing solution -- which is why I had intended this
> to be available "by default".

Technically there's no reason for your thing to depend on
CONFIG_CODE_TAGGING at all; that's the infrastructure for finding
codetags for e.g. /proc/allocinfo. You'd just be using the alloc_hooks()
macro and struct alloc_tag as a place to stash the kmem_buckets pointer.

2024-05-31 17:11:55

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] slab: Introduce dedicated bucket allocator

On Fri, May 24, 2024 at 10:54:58AM -0400, Kent Overstreet wrote:
> On Wed, Apr 24, 2024 at 02:40:57PM -0700, Kees Cook wrote:
> > Hi,
> >
> > Series change history:
> >
> > v3:
> > - clarify rationale and purpose in commit log
> > - rebase to -next (CONFIG_CODE_TAGGING)
> > - simplify calling styles and split out bucket plumbing more cleanly
> > - consolidate kmem_buckets_*() family introduction patches
> > v2: https://lore.kernel.org/lkml/[email protected]/
> > v1: https://lore.kernel.org/lkml/[email protected]/
> >
> > For the cover letter, I'm repeating commit log for patch 4 here, which has
> > additional clarifications and rationale since v2:
> >
> > Dedicated caches are available for fixed size allocations via
> > kmem_cache_alloc(), but for dynamically sized allocations there is only
> > the global kmalloc API's set of buckets available. This means it isn't
> > possible to separate specific sets of dynamically sized allocations into
> > a separate collection of caches.
> >
> > This leads to a use-after-free exploitation weakness in the Linux
> > kernel since many heap memory spraying/grooming attacks depend on using
> > userspace-controllable dynamically sized allocations to collide with
> > fixed size allocations that end up in same cache.
>
> This is going to increase internal fragmentation in the slab allocator,
> so we're going to need better, more visible numbers on the amount of
> memory stranded thusly, so users can easily see the effect this has.

Yes, but not significantly. It's less than the 16-buckets randomized
kmalloc implementation. The numbers will be visible in /proc/slabinfo
just like any other.

> Please also document this effect and point users in the documentation
> where to check, so that we devs can get feedback on this.

Okay, sure. In the commit log, or did you have somewhere else in mind?

--
Kees Cook

2024-05-31 20:59:22

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] mm/slab: Plumb kmem_buckets into __do_kmalloc_node()

On Fri, May 31, 2024 at 12:51:29PM -0400, Kent Overstreet wrote:
> On Fri, May 31, 2024 at 09:48:49AM -0700, Kees Cook wrote:
> > On Fri, May 24, 2024 at 11:01:40AM -0400, Kent Overstreet wrote:
> > > On Wed, Apr 24, 2024 at 02:40:59PM -0700, Kees Cook wrote:
> > > > To be able to choose which buckets to allocate from, make the buckets
> > > > available to the lower level kmalloc interfaces by adding them as the
> > > > first argument. Where the bucket is not available, pass NULL, which means
> > > > "use the default system kmalloc bucket set" (the prior existing behavior),
> > > > as implemented in kmalloc_slab().
> > >
> > > I thought the plan was to use codetags for this? That would obviate the
> > > need for all this plumbing.
> > >
> > > Add fields to the alloc tag for:
> > > - allocation size (or 0 if it's not a compile time constant)
> > > - union of kmem_cache, kmem_buckets, depending on whether the
> > > allocation size is constant or not
> >
> > I want to provide "simple" (low-hanging fruit) coverage that can live
> > separately from the codetags-based coverage. The memory overhead for
> > this patch series is negligible, but I suspect the codetags expansion,
> > while not giant, will be more than some deployments will want. I want
> > to avoid an all-or-nothing solution -- which is why I had intended this
> > to be available "by default".
>
> Technically there's no reason for your thing to depend on
> CONFIG_CODE_TAGGING at all; that's the infrastructure for finding
> codetags for e.g. /proc/allocinfo. You'd just be using the alloc_hooks()
> macro and struct alloc_tag as a place to stash the kmem_buckets pointer.

It's the overhead of separate kmem_cache and kmem_buckets for every
allocation location that I meant. So I'd like the "simple" version for
gaining coverage over the currently-being-regularly-exploited cases, and
then allow for the "big hammer" solution too.

However, I do think I'll still need the codetag infra because of the
sections, etc. I think we'll need to pre-build the caches, but maybe
that could be avoided by adding some kind of per-site READ_ONCE/lock
thingy to create them on demand. We'll see! :)
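
A per-site on-demand path could be as small as this sketch (the names and
the kmem_buckets_create() arguments are placeholders, and it assumes a
sleepable GFP_KERNEL context for the creation step):

  static void *site_buckets_alloc(kmem_buckets **site, const char *name,
                                  size_t size, gfp_t gfp)
  {
          kmem_buckets *b = smp_load_acquire(site);  /* pairs with release below */

          if (unlikely(!b)) {
                  static DEFINE_MUTEX(create_lock);

                  mutex_lock(&create_lock);
                  b = *site;
                  if (!b) {
                          b = kmem_buckets_create(name, 0, 0, 0, NULL);
                          if (b)
                                  smp_store_release(site, b);
                  }
                  mutex_unlock(&create_lock);
                  if (!b)
                          return NULL;
          }
          return kmem_buckets_alloc(b, size, gfp);
  }

That would keep the memory cost proportional to the number of sites
actually hit at runtime rather than the number of sites compiled in.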

--
Kees Cook