2023-09-15 14:23:26

by Matteo Rizzo

Subject: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

The goal of this patch series is to deterministically prevent cross-cache
attacks in the SLUB allocator.

Use-after-free bugs are normally exploited by making the memory allocator
reuse the victim object's memory for an object of a different type. This
creates a type confusion, which is a very powerful attack primitive.

There are generally two ways to create such type confusions in the kernel:
one way is to make SLUB reuse the freed object's address for another object
of a different type which lives in the same slab cache. This only works in
slab caches that can contain objects of different types (i.e. the kmalloc
caches) and the attacker is limited to objects that belong to the same size
class as the victim object.

The other way is to use a "cross-cache attack": make SLUB return the page
containing the victim object to the page allocator and then make it use the
same page for a different slab cache or other objects that contain
attacker-controlled data. This gives attackers access to all objects rather
than just the ones in the same size class as the target and lets attackers
target objects allocated from dedicated caches such as struct file.

This patch series prevents cross-cache attacks by making sure that once a
virtual address is used for a slab cache it's never reused for anything
except other slabs in that same cache.
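
For illustration, this invariant can be modelled in a few lines of userspace
C (a toy sketch, not kernel code; the names and sizes are made up for the
example): a virtual range, once handed to a cache, can lose and regain its
physical backing, but it is never handed to a different cache.

/* Toy userspace model of the SLAB_VIRTUAL invariant (not kernel code). */
#include <stdio.h>
#include <string.h>

#define NRANGES 8

struct vrange {
        const char *owner;      /* cache this range belongs to, NULL if unassigned */
        int backed;             /* does it currently have "physical" memory? */
};

static struct vrange ranges[NRANGES];

/* Hand out a range for a cache: prefer an unbacked range it already owns. */
static int range_get(const char *cache)
{
        for (int i = 0; i < NRANGES; i++)
                if (ranges[i].owner && !strcmp(ranges[i].owner, cache) &&
                    !ranges[i].backed) {
                        ranges[i].backed = 1;
                        return i;
                }
        for (int i = 0; i < NRANGES; i++)
                if (!ranges[i].owner) {
                        ranges[i].owner = cache;        /* permanent assignment */
                        ranges[i].backed = 1;
                        return i;
                }
        return -1;                                      /* out of virtual space */
}

/* Release only the physical backing; the virtual range stays owned. */
static void range_put(int i)
{
        ranges[i].backed = 0;
}

int main(void)
{
        int a = range_get("kmalloc-128");

        range_put(a);                           /* "free the slab page"   */
        int b = range_get("struct-file");       /* must NOT reuse range a */
        int c = range_get("kmalloc-128");       /* may reuse range a      */
        printf("a=%d b=%d c=%d\n", a, b, c);    /* prints a=0 b=1 c=0     */
        return 0;
}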


Jann Horn (13):
mm/slub: add is_slab_addr/is_slab_page helpers
mm/slub: move kmem_cache_order_objects to slab.h
mm: use virt_to_slab instead of folio_slab
mm/slub: create folio_set/clear_slab helpers
mm/slub: pass additional args to alloc_slab_page
mm/slub: pass slab pointer to the freeptr decode helper
security: introduce CONFIG_SLAB_VIRTUAL
mm/slub: add the slab freelists to kmem_cache
x86: Create virtual memory region for SLUB
mm/slub: allocate slabs from virtual memory
mm/slub: introduce the deallocated_pages sysfs attribute
mm/slub: sanity-check freepointers
security: add documentation for SLAB_VIRTUAL

Matteo Rizzo (1):
mm/slub: don't try to dereference invalid freepointers

Documentation/arch/x86/x86_64/mm.rst | 4 +-
Documentation/security/self-protection.rst | 102 ++++
arch/x86/include/asm/page_64.h | 10 +
arch/x86/include/asm/pgtable_64_types.h | 21 +
arch/x86/mm/init_64.c | 19 +-
arch/x86/mm/kaslr.c | 9 +
arch/x86/mm/mm_internal.h | 4 +
arch/x86/mm/physaddr.c | 10 +
include/linux/slab.h | 8 +
include/linux/slub_def.h | 25 +-
init/main.c | 1 +
kernel/resource.c | 2 +-
lib/slub_kunit.c | 4 +
mm/memcontrol.c | 2 +-
mm/slab.h | 145 +++++
mm/slab_common.c | 21 +-
mm/slub.c | 641 +++++++++++++++++++--
mm/usercopy.c | 12 +-
security/Kconfig.hardening | 16 +
19 files changed, 977 insertions(+), 79 deletions(-)


base-commit: 46a9ea6681907a3be6b6b0d43776dccc62cad6cf
--
2.42.0.459.ge4e396fd5e-goog


2023-09-15 14:41:13

by Matteo Rizzo

Subject: [RFC PATCH 11/14] mm/slub: allocate slabs from virtual memory

From: Jann Horn <[email protected]>

This is the main implementation of SLAB_VIRTUAL. With SLAB_VIRTUAL
enabled, slab memory is not allocated from the linear map but from a
dedicated region of virtual memory. The code ensures that once a range
of virtual addresses is assigned to a slab cache, that virtual memory is
never reused again except for other slabs in that same cache. This lets
us mitigate some exploits for use-after-free vulnerabilities where the
attacker makes SLUB release a slab page to the page allocator and then
makes it reuse that same page for a different slab cache ("cross-cache
attacks").

With SLAB_VIRTUAL enabled, struct slab no longer overlaps struct page;
instead it is allocated from a dedicated region of virtual memory. This
makes it possible to keep references to slabs whose physical memory has
been freed.

SLAB_VIRTUAL has a small performance overhead, about 1-2% on kernel
compilation time. We are using 4 KiB pages to map slab pages and the slab
metadata area, instead of the 2 MiB pages that the kernel uses to map the
physmap. We experimented with a version of the patch that uses 2 MiB pages
and we did see some performance improvement, but the code also became much
more complicated and ugly because we would need to allocate and free
multiple slabs at once.

In addition to the increased TLB pressure from the 4 KiB mappings,
SLAB_VIRTUAL also adds new locks to the slow path of the allocator. Lock
contention also contributes to the performance penalty to some extent, and
it is more visible on machines with many CPUs.
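
To make the layout concrete: with a dedicated struct slab array, mapping
between an object address and its metadata is pure index arithmetic. The
userspace model below mirrors virt_to_slab_raw()/slab_to_virt() from this
patch; the base address and sizes are the compile-time constants implied by
the series (before the KASLR shift), and it is only an illustration, not
kernel code.

/* Userspace model of the virt_to_slab_raw()/slab_to_virt() arithmetic. */
#include <assert.h>
#include <stdio.h>

#define PAGE_SIZE        4096UL
#define STRUCT_SLAB_SIZE 256UL                  /* 32 * sizeof(void *)      */
#define SLAB_BASE_ADDR   0xfffffe8000000000UL   /* start of the slab region */
#define SLAB_META_SIZE   (1UL << 35)            /* 32 GiB of struct slab    */
#define SLAB_DATA_BASE   (SLAB_BASE_ADDR + SLAB_META_SIZE)

/* struct slab slot for the page containing addr (not the compound head). */
static unsigned long virt_to_slab_raw(unsigned long addr)
{
        return SLAB_BASE_ADDR +
               (addr - SLAB_BASE_ADDR) / PAGE_SIZE * STRUCT_SLAB_SIZE;
}

/* First data page described by the struct slab slot at meta. */
static unsigned long slab_to_virt(unsigned long meta)
{
        return SLAB_BASE_ADDR +
               (meta - SLAB_BASE_ADDR) / STRUCT_SLAB_SIZE * PAGE_SIZE;
}

int main(void)
{
        unsigned long obj  = SLAB_DATA_BASE + 5 * PAGE_SIZE + 0x80;
        unsigned long meta = virt_to_slab_raw(obj);

        printf("object %#lx -> struct slab %#lx\n", obj, meta);
        assert(slab_to_virt(meta) == SLAB_DATA_BASE + 5 * PAGE_SIZE);
        return 0;
}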

Signed-off-by: Jann Horn <[email protected]>
Co-developed-by: Matteo Rizzo <[email protected]>
Signed-off-by: Matteo Rizzo <[email protected]>
---
arch/x86/include/asm/page_64.h | 10 +
arch/x86/include/asm/pgtable_64_types.h | 5 +
arch/x86/mm/physaddr.c | 10 +
include/linux/slab.h | 7 +
init/main.c | 1 +
mm/slab.h | 106 ++++++
mm/slab_common.c | 4 +
mm/slub.c | 439 +++++++++++++++++++++++-
mm/usercopy.c | 12 +-
9 files changed, 587 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index cc6b8e087192..25fb734a2fe6 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -3,6 +3,7 @@
#define _ASM_X86_PAGE_64_H

#include <asm/page_64_types.h>
+#include <asm/pgtable_types.h>

#ifndef __ASSEMBLY__
#include <asm/cpufeatures.h>
@@ -18,10 +19,19 @@ extern unsigned long page_offset_base;
extern unsigned long vmalloc_base;
extern unsigned long vmemmap_base;

+#ifdef CONFIG_SLAB_VIRTUAL
+unsigned long slab_virt_to_phys(unsigned long x);
+#endif
+
static __always_inline unsigned long __phys_addr_nodebug(unsigned long x)
{
unsigned long y = x - __START_KERNEL_map;

+#ifdef CONFIG_SLAB_VIRTUAL
+ if (is_slab_addr(x))
+ return slab_virt_to_phys(x);
+#endif
+
/* use the carry flag to determine if x was < __START_KERNEL_map */
x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index e1a91eb084c4..4aae822a6a96 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -213,6 +213,11 @@ extern unsigned int ptrs_per_p4d;
#define SLAB_VPAGES ((SLAB_END_ADDR - SLAB_BASE_ADDR) / PAGE_SIZE)
#define SLAB_META_SIZE ALIGN(SLAB_VPAGES * STRUCT_SLAB_SIZE, PAGE_SIZE)
#define SLAB_DATA_BASE_ADDR (SLAB_BASE_ADDR + SLAB_META_SIZE)
+
+#define is_slab_addr(ptr) ((unsigned long)(ptr) >= SLAB_DATA_BASE_ADDR && \
+ (unsigned long)(ptr) < SLAB_END_ADDR)
+#define is_slab_meta(ptr) ((unsigned long)(ptr) >= SLAB_BASE_ADDR && \
+ (unsigned long)(ptr) < SLAB_DATA_BASE_ADDR)
#endif /* CONFIG_SLAB_VIRTUAL */

#define CPU_ENTRY_AREA_PGD _AC(-4, UL)
diff --git a/arch/x86/mm/physaddr.c b/arch/x86/mm/physaddr.c
index fc3f3d3e2ef2..7f1b81c75e4d 100644
--- a/arch/x86/mm/physaddr.c
+++ b/arch/x86/mm/physaddr.c
@@ -16,6 +16,11 @@ unsigned long __phys_addr(unsigned long x)
{
unsigned long y = x - __START_KERNEL_map;

+#ifdef CONFIG_SLAB_VIRTUAL
+ if (is_slab_addr(x))
+ return slab_virt_to_phys(x);
+#endif
+
/* use the carry flag to determine if x was < __START_KERNEL_map */
if (unlikely(x > y)) {
x = y + phys_base;
@@ -48,6 +53,11 @@ bool __virt_addr_valid(unsigned long x)
{
unsigned long y = x - __START_KERNEL_map;

+#ifdef CONFIG_SLAB_VIRTUAL
+ if (is_slab_addr(x))
+ return true;
+#endif
+
/* use the carry flag to determine if x was < __START_KERNEL_map */
if (unlikely(x > y)) {
x = y + phys_base;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index a2d82010d269..2180d5170995 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -793,5 +793,12 @@ int slab_dead_cpu(unsigned int cpu);
#define slab_dead_cpu NULL
#endif

+#ifdef CONFIG_SLAB_VIRTUAL
+void __init init_slub_page_reclaim(void);
+#else
#define is_slab_addr(addr) folio_test_slab(virt_to_folio(addr))
+static inline void init_slub_page_reclaim(void)
+{
+}
+#endif /* CONFIG_SLAB_VIRTUAL */
#endif /* _LINUX_SLAB_H */
diff --git a/init/main.c b/init/main.c
index ad920fac325c..72456964417e 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1532,6 +1532,7 @@ static noinline void __init kernel_init_freeable(void)
workqueue_init();

init_mm_internals();
+ init_slub_page_reclaim();

rcu_init_tasks_generic();
do_pre_smp_initcalls();
diff --git a/mm/slab.h b/mm/slab.h
index 3fe0d1e26e26..460c802924bd 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -1,6 +1,11 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef MM_SLAB_H
#define MM_SLAB_H
+
+#include <linux/build_bug.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+
/*
* Internal slab definitions
*/
@@ -49,7 +54,35 @@ struct kmem_cache_order_objects {

/* Reuses the bits in struct page */
struct slab {
+ /*
+ * With CONFIG_SLAB_VIRTUAL enabled instances of struct slab are not
+ * overlapped with struct page but instead they are allocated from
+ * a dedicated virtual memory area.
+ */
+#ifndef CONFIG_SLAB_VIRTUAL
unsigned long __page_flags;
+#else
+ /*
+ * Used by virt_to_slab to find the actual struct slab for a slab that
+ * spans multiple pages.
+ */
+ struct slab *compound_slab_head;
+
+ /*
+ * Pointer to the folio that the objects are allocated from, or NULL if
+ * the slab is currently unused and no physical memory is allocated to
+ * it. Protected by slub_kworker_lock.
+ */
+ struct folio *backing_folio;
+
+ struct kmem_cache_order_objects oo;
+
+ struct list_head flush_list_elem;
+
+ /* Replaces the page lock */
+ spinlock_t slab_lock;
+
+#endif

#if defined(CONFIG_SLAB)

@@ -104,12 +137,17 @@ struct slab {
#error "Unexpected slab allocator configured"
#endif

+ /* See comment for __page_flags above. */
+#ifndef CONFIG_SLAB_VIRTUAL
atomic_t __page_refcount;
+#endif
#ifdef CONFIG_MEMCG
unsigned long memcg_data;
#endif
};

+/* See comment for __page_flags above. */
+#ifndef CONFIG_SLAB_VIRTUAL
#define SLAB_MATCH(pg, sl) \
static_assert(offsetof(struct page, pg) == offsetof(struct slab, sl))
SLAB_MATCH(flags, __page_flags);
@@ -120,10 +158,15 @@ SLAB_MATCH(memcg_data, memcg_data);
#endif
#undef SLAB_MATCH
static_assert(sizeof(struct slab) <= sizeof(struct page));
+#else
+static_assert(sizeof(struct slab) <= STRUCT_SLAB_SIZE);
+#endif
+
#if defined(system_has_freelist_aba) && defined(CONFIG_SLUB)
static_assert(IS_ALIGNED(offsetof(struct slab, freelist), sizeof(freelist_aba_t)));
#endif

+#ifndef CONFIG_SLAB_VIRTUAL
/**
* folio_slab - Converts from folio to slab.
* @folio: The folio.
@@ -187,6 +230,14 @@ static_assert(IS_ALIGNED(offsetof(struct slab, freelist), sizeof(freelist_aba_t)
* Return: true if s points to a slab and false otherwise.
*/
#define is_slab_page(s) folio_test_slab(slab_folio(s))
+#else
+#define slab_folio(s) (s->backing_folio)
+#define is_slab_page(s) is_slab_meta(s)
+/* Needed for check_heap_object but never actually used */
+#define folio_slab(folio) NULL
+static void *slab_to_virt(const struct slab *s);
+#endif /* CONFIG_SLAB_VIRTUAL */
+
/*
* If network-based swap is enabled, sl*b must keep track of whether pages
* were allocated from pfmemalloc reserves.
@@ -213,7 +264,11 @@ static inline void __slab_clear_pfmemalloc(struct slab *slab)

static inline void *slab_address(const struct slab *slab)
{
+#ifdef CONFIG_SLAB_VIRTUAL
+ return slab_to_virt(slab);
+#else
return folio_address(slab_folio(slab));
+#endif
}

static inline int slab_nid(const struct slab *slab)
@@ -226,6 +281,52 @@ static inline pg_data_t *slab_pgdat(const struct slab *slab)
return folio_pgdat(slab_folio(slab));
}

+#ifdef CONFIG_SLAB_VIRTUAL
+/*
+ * Internal helper. Returns the address of the struct slab corresponding to
+ * the virtual memory page containing kaddr. This does a simple arithmetic
+ * mapping and does *not* return the struct slab of the head page!
+ */
+static unsigned long virt_to_slab_raw(unsigned long addr)
+{
+ VM_WARN_ON(!is_slab_addr(addr));
+ return SLAB_BASE_ADDR +
+ ((addr - SLAB_BASE_ADDR) / PAGE_SIZE * sizeof(struct slab));
+}
+
+static struct slab *virt_to_slab(const void *addr)
+{
+ struct slab *slab, *slab_head;
+
+ if (!is_slab_addr(addr))
+ return NULL;
+
+ slab = (struct slab *)virt_to_slab_raw((unsigned long)addr);
+ slab_head = slab->compound_slab_head;
+
+ if (CHECK_DATA_CORRUPTION(!is_slab_meta(slab_head),
+ "compound slab head out of meta range: %p", slab_head))
+ return NULL;
+
+ return slab_head;
+}
+
+static void *slab_to_virt(const struct slab *s)
+{
+ unsigned long slab_idx;
+ bool unaligned_slab =
+ ((unsigned long)s - SLAB_BASE_ADDR) % sizeof(*s) != 0;
+
+ if (CHECK_DATA_CORRUPTION(!is_slab_meta(s), "slab not in meta range") ||
+ CHECK_DATA_CORRUPTION(unaligned_slab, "unaligned slab pointer") ||
+ CHECK_DATA_CORRUPTION(s->compound_slab_head != s,
+ "%s called on non-head slab", __func__))
+ return NULL;
+
+ slab_idx = ((unsigned long)s - SLAB_BASE_ADDR) / sizeof(*s);
+ return (void *)(SLAB_BASE_ADDR + PAGE_SIZE * slab_idx);
+}
+#else
static inline struct slab *virt_to_slab(const void *addr)
{
struct folio *folio = virt_to_folio(addr);
@@ -235,6 +336,7 @@ static inline struct slab *virt_to_slab(const void *addr)

return folio_slab(folio);
}
+#endif /* CONFIG_SLAB_VIRTUAL */

#define OO_SHIFT 16
#define OO_MASK ((1 << OO_SHIFT) - 1)
@@ -251,7 +353,11 @@ static inline unsigned int oo_objects(struct kmem_cache_order_objects x)

static inline int slab_order(const struct slab *slab)
{
+#ifndef CONFIG_SLAB_VIRTUAL
return folio_order((struct folio *)slab_folio(slab));
+#else
+ return oo_order(slab->oo);
+#endif
}

static inline size_t slab_size(const struct slab *slab)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 42ceaf7e9f47..7754fdba07a0 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1064,6 +1064,10 @@ void kfree(const void *object)

if (unlikely(!is_slab_addr(object))) {
folio = virt_to_folio(object);
+ if (IS_ENABLED(CONFIG_SLAB_VIRTUAL) &&
+ CHECK_DATA_CORRUPTION(folio_test_slab(folio),
+ "unexpected slab page mapped outside slab range"))
+ return;
free_large_kmalloc(folio, (void *)object);
return;
}
diff --git a/mm/slub.c b/mm/slub.c
index a731fdc79bff..66ae60cdadaf 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -38,6 +38,10 @@
#include <linux/prefetch.h>
#include <linux/memcontrol.h>
#include <linux/random.h>
+#include <linux/kthread.h>
+#include <linux/io.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
#include <kunit/test.h>
#include <kunit/test-bug.h>
#include <linux/sort.h>
@@ -168,6 +172,8 @@

#ifdef CONFIG_SLAB_VIRTUAL
unsigned long slub_addr_base = SLAB_DATA_BASE_ADDR;
+/* Protects slub_addr_base */
+static DEFINE_SPINLOCK(slub_valloc_lock);
#endif /* CONFIG_SLAB_VIRTUAL */

/*
@@ -430,19 +436,18 @@ static void prefetch_freepointer(const struct kmem_cache *s, void *object)
* get_freepointer_safe() returns initialized memory.
*/
__no_kmsan_checks
-static inline void *get_freepointer_safe(struct kmem_cache *s, void *object,
+static inline freeptr_t get_freepointer_safe(struct kmem_cache *s, void *object,
struct slab *slab)
{
- unsigned long freepointer_addr;
+ unsigned long freepointer_addr = (unsigned long)object + s->offset;
freeptr_t p;

if (!debug_pagealloc_enabled_static())
- return get_freepointer(s, object, slab);
+ return *(freeptr_t *)freepointer_addr;

object = kasan_reset_tag(object);
- freepointer_addr = (unsigned long)object + s->offset;
copy_from_kernel_nofault(&p, (freeptr_t *)freepointer_addr, sizeof(p));
- return freelist_ptr_decode(s, p, freepointer_addr, slab);
+ return p;
}

static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
@@ -478,6 +483,17 @@ static inline struct kmem_cache_order_objects oo_make(unsigned int order,
return x;
}

+#ifdef CONFIG_SLAB_VIRTUAL
+unsigned long slab_virt_to_phys(unsigned long x)
+{
+ struct slab *slab = virt_to_slab((void *)x);
+ struct folio *folio = slab_folio(slab);
+
+ return page_to_phys(folio_page(folio, 0)) + offset_in_folio(folio, x);
+}
+EXPORT_SYMBOL(slab_virt_to_phys);
+#endif
+
#ifdef CONFIG_SLUB_CPU_PARTIAL
static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
{
@@ -506,18 +522,26 @@ slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
*/
static __always_inline void slab_lock(struct slab *slab)
{
+#ifdef CONFIG_SLAB_VIRTUAL
+ spin_lock(&slab->slab_lock);
+#else
struct page *page = slab_page(slab);

VM_BUG_ON_PAGE(PageTail(page), page);
bit_spin_lock(PG_locked, &page->flags);
+#endif
}

static __always_inline void slab_unlock(struct slab *slab)
{
+#ifdef CONFIG_SLAB_VIRTUAL
+ spin_unlock(&slab->slab_lock);
+#else
struct page *page = slab_page(slab);

VM_BUG_ON_PAGE(PageTail(page), page);
__bit_spin_unlock(PG_locked, &page->flags);
+#endif
}

static inline bool
@@ -1863,6 +1887,10 @@ static void folio_set_slab(struct folio *folio, struct slab *slab)
/* Make the flag visible before any changes to folio->mapping */
smp_wmb();

+#ifdef CONFIG_SLAB_VIRTUAL
+ slab->backing_folio = folio;
+#endif
+
if (folio_is_pfmemalloc(folio))
slab_set_pfmemalloc(slab);
}
@@ -1874,8 +1902,285 @@ static void folio_clear_slab(struct folio *folio, struct slab *slab)
/* Make the mapping reset visible before clearing the flag */
smp_wmb();
__folio_clear_slab(folio);
+#ifdef CONFIG_SLAB_VIRTUAL
+ slab->backing_folio = NULL;
+#endif
+}
+
+#ifdef CONFIG_SLAB_VIRTUAL
+/*
+ * Make sure we have the necessary page tables for the given address.
+ * Returns a pointer to the PTE, or NULL on allocation failure.
+ *
+ * We're using ugly low-level code here instead of the standard
+ * helpers because the normal code insists on using GFP_KERNEL.
+ *
+ * If may_alloc is false, throw an error if the PTE is not already mapped.
+ */
+static pte_t *slub_get_ptep(unsigned long address, gfp_t gfp_flags,
+ bool may_alloc)
+{
+ pgd_t *pgd = pgd_offset_k(address);
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ unsigned long flags;
+ struct page *spare_page = NULL;
+
+retry:
+ spin_lock_irqsave(&slub_valloc_lock, flags);
+ /*
+ * The top-level entry should already be present - see
+ * preallocate_top_level_entries().
+ */
+ BUG_ON(pgd_none(READ_ONCE(*pgd)));
+ p4d = p4d_offset(pgd, address);
+ if (p4d_none(READ_ONCE(*p4d))) {
+ if (!spare_page)
+ goto need_page;
+ p4d_populate(&init_mm, p4d, (pud_t *)page_to_virt(spare_page));
+ goto need_page;
+
+ }
+ pud = pud_offset(p4d, address);
+ if (pud_none(READ_ONCE(*pud))) {
+ if (!spare_page)
+ goto need_page;
+ pud_populate(&init_mm, pud, (pmd_t *)page_to_virt(spare_page));
+ goto need_page;
+ }
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(READ_ONCE(*pmd))) {
+ if (!spare_page)
+ goto need_page;
+ pmd_populate_kernel(&init_mm, pmd,
+ (pte_t *)page_to_virt(spare_page));
+ spare_page = NULL;
+ }
+ spin_unlock_irqrestore(&slub_valloc_lock, flags);
+ if (spare_page)
+ __free_page(spare_page);
+ return pte_offset_kernel(pmd, address);
+
+need_page:
+ spin_unlock_irqrestore(&slub_valloc_lock, flags);
+ VM_WARN_ON(!may_alloc);
+ spare_page = alloc_page(gfp_flags);
+ if (unlikely(!spare_page))
+ return NULL;
+ /* ensure ordering between page zeroing and PTE write */
+ smp_wmb();
+ goto retry;
+}
+
+/*
+ * Reserve a range of virtual address space, ensure that we have page tables for
+ * it, and allocate a corresponding struct slab.
+ * This is cold code, we don't really have to worry about performance here.
+ */
+static struct slab *alloc_slab_meta(unsigned int order, gfp_t gfp_flags)
+{
+ unsigned long alloc_size = PAGE_SIZE << order;
+ unsigned long flags;
+ unsigned long old_base;
+ unsigned long data_range_start, data_range_end;
+ unsigned long meta_range_start, meta_range_end;
+ unsigned long addr;
+ struct slab *slab, *sp;
+ bool valid_start, valid_end;
+
+ gfp_flags &= (__GFP_HIGH | __GFP_RECLAIM | __GFP_IO |
+ __GFP_FS | __GFP_NOWARN | __GFP_RETRY_MAYFAIL |
+ __GFP_NOFAIL | __GFP_NORETRY | __GFP_MEMALLOC |
+ __GFP_NOMEMALLOC);
+ /* New page tables and metadata pages should be zeroed */
+ gfp_flags |= __GFP_ZERO;
+
+ spin_lock_irqsave(&slub_valloc_lock, flags);
+retry_locked:
+ old_base = slub_addr_base;
+
+ /*
+ * We drop the lock. The following code might sleep during
+ * page table allocation. Any mutations we make before rechecking
+ * slub_addr_base are idempotent, so that's fine.
+ */
+ spin_unlock_irqrestore(&slub_valloc_lock, flags);
+
+ /*
+ * [data_range_start, data_range_end) is the virtual address range where
+ * this slab's objects will be mapped.
+ * We want alignment appropriate for the order. Note that this could be
+ * relaxed based on the alignment requirements of the objects being
+ * allocated, but for now, we behave like the page allocator would.
+ */
+ data_range_start = ALIGN(old_base, alloc_size);
+ data_range_end = data_range_start + alloc_size;
+
+ valid_start = data_range_start >= SLAB_BASE_ADDR &&
+ IS_ALIGNED(data_range_start, PAGE_SIZE);
+ valid_end = data_range_end >= SLAB_BASE_ADDR &&
+ IS_ALIGNED(data_range_end, PAGE_SIZE);
+ if (CHECK_DATA_CORRUPTION(!valid_start,
+ "invalid slab data range start") ||
+ CHECK_DATA_CORRUPTION(!valid_end,
+ "invalid slab data range end"))
+ return NULL;
+
+ /* We ran out of virtual memory for slabs */
+ if (WARN_ON_ONCE(data_range_start >= SLAB_END_ADDR ||
+ data_range_end >= SLAB_END_ADDR))
+ return NULL;
+
+ /*
+ * [meta_range_start, meta_range_end) is the range where the struct
+ * slabs for the current data range are mapped. The first struct slab,
+ * located at meta_range_start is the head slab that contains the actual
+ * data, all other struct slabs in the range point to the head slab.
+ */
+ meta_range_start = virt_to_slab_raw(data_range_start);
+ meta_range_end = virt_to_slab_raw(data_range_end);
+
+ /* Ensure the meta range is mapped. */
+ for (addr = ALIGN_DOWN(meta_range_start, PAGE_SIZE);
+ addr < meta_range_end; addr += PAGE_SIZE) {
+ pte_t *ptep = slub_get_ptep(addr, gfp_flags, true);
+
+ if (ptep == NULL)
+ return NULL;
+
+ spin_lock_irqsave(&slub_valloc_lock, flags);
+ if (pte_none(READ_ONCE(*ptep))) {
+ struct page *meta_page;
+
+ spin_unlock_irqrestore(&slub_valloc_lock, flags);
+ meta_page = alloc_page(gfp_flags);
+ if (meta_page == NULL)
+ return NULL;
+ spin_lock_irqsave(&slub_valloc_lock, flags);
+
+ /* Make sure that no one else has already mapped that page */
+ if (pte_none(READ_ONCE(*ptep)))
+ set_pte_safe(ptep,
+ mk_pte(meta_page, PAGE_KERNEL));
+ else
+ __free_page(meta_page);
+ }
+ spin_unlock_irqrestore(&slub_valloc_lock, flags);
+ }
+
+ /* Ensure we have page tables for the data range. */
+ for (addr = data_range_start; addr < data_range_end;
+ addr += PAGE_SIZE) {
+ pte_t *ptep = slub_get_ptep(addr, gfp_flags, true);
+
+ if (ptep == NULL)
+ return NULL;
+ }
+
+ /* Did we race with someone else who made forward progress? */
+ spin_lock_irqsave(&slub_valloc_lock, flags);
+ if (old_base != slub_addr_base)
+ goto retry_locked;
+
+ /* Success! Grab the range for ourselves. */
+ slub_addr_base = data_range_end;
+ spin_unlock_irqrestore(&slub_valloc_lock, flags);
+
+ slab = (struct slab *)meta_range_start;
+ spin_lock_init(&slab->slab_lock);
+
+ /* Initialize basic slub metadata for virt_to_slab() */
+ for (sp = slab; (unsigned long)sp < meta_range_end; sp++)
+ sp->compound_slab_head = slab;
+
+ return slab;
+}
+
+/* Get an unused slab, or allocate a new one */
+static struct slab *get_free_slab(struct kmem_cache *s,
+ struct kmem_cache_order_objects oo, gfp_t meta_gfp_flags,
+ struct list_head *freed_slabs)
+{
+ unsigned long flags;
+ struct slab *slab;
+
+ spin_lock_irqsave(&s->virtual.freed_slabs_lock, flags);
+ slab = list_first_entry_or_null(freed_slabs, struct slab, slab_list);
+
+ if (likely(slab)) {
+ list_del(&slab->slab_list);
+
+ spin_unlock_irqrestore(&s->virtual.freed_slabs_lock, flags);
+ return slab;
+ }
+
+ spin_unlock_irqrestore(&s->virtual.freed_slabs_lock, flags);
+ slab = alloc_slab_meta(oo_order(oo), meta_gfp_flags);
+ if (slab == NULL)
+ return NULL;
+
+ return slab;
}

+static struct slab *alloc_slab_page(struct kmem_cache *s,
+ gfp_t meta_gfp_flags, gfp_t gfp_flags, int node,
+ struct kmem_cache_order_objects oo)
+{
+ struct folio *folio;
+ struct slab *slab;
+ unsigned int order = oo_order(oo);
+ unsigned long flags;
+ void *virt_mapping;
+ pte_t *ptep;
+ struct list_head *freed_slabs;
+
+ if (order == oo_order(s->min))
+ freed_slabs = &s->virtual.freed_slabs_min;
+ else
+ freed_slabs = &s->virtual.freed_slabs;
+
+ slab = get_free_slab(s, oo, meta_gfp_flags, freed_slabs);
+
+ /*
+ * Avoid making UAF reads easily exploitable by repopulating
+ * with pages containing attacker-controlled data - always zero
+ * pages.
+ */
+ gfp_flags |= __GFP_ZERO;
+ if (node == NUMA_NO_NODE)
+ folio = (struct folio *)alloc_pages(gfp_flags, order);
+ else
+ folio = (struct folio *)__alloc_pages_node(node, gfp_flags,
+ order);
+
+ if (!folio) {
+ /* Rollback: put the struct slab back. */
+ spin_lock_irqsave(&s->virtual.freed_slabs_lock, flags);
+ list_add(&slab->slab_list, freed_slabs);
+ spin_unlock_irqrestore(&s->virtual.freed_slabs_lock, flags);
+
+ return NULL;
+ }
+ folio_set_slab(folio, slab);
+
+ slab->oo = oo;
+
+ virt_mapping = slab_to_virt(slab);
+
+ /* Wire up physical folio */
+ for (unsigned long i = 0; i < (1UL << oo_order(oo)); i++) {
+ ptep = slub_get_ptep(
+ (unsigned long)virt_mapping + i * PAGE_SIZE, 0, false);
+ if (CHECK_DATA_CORRUPTION(pte_present(*ptep),
+ "slab PTE already present"))
+ return NULL;
+ set_pte_safe(ptep, mk_pte(folio_page(folio, i), PAGE_KERNEL));
+ }
+
+ return slab;
+}
+#else
static inline struct slab *alloc_slab_page(struct kmem_cache *s,
gfp_t meta_flags, gfp_t flags, int node,
struct kmem_cache_order_objects oo)
@@ -1897,6 +2202,7 @@ static inline struct slab *alloc_slab_page(struct kmem_cache *s,

return slab;
}
+#endif /* CONFIG_SLAB_VIRTUAL */

#ifdef CONFIG_SLAB_FREELIST_RANDOM
/* Pre-initialize the random sequence cache */
@@ -2085,6 +2391,94 @@ static struct slab *new_slab(struct kmem_cache *s, gfp_t flags, int node)
flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
}

+#ifdef CONFIG_SLAB_VIRTUAL
+static DEFINE_SPINLOCK(slub_kworker_lock);
+static struct kthread_worker *slub_kworker;
+static LIST_HEAD(slub_tlbflush_queue);
+
+static void slub_tlbflush_worker(struct kthread_work *work)
+{
+ unsigned long irq_flags;
+ LIST_HEAD(local_queue);
+ struct slab *slab, *tmp;
+ unsigned long addr_start = ULONG_MAX;
+ unsigned long addr_end = 0;
+
+ spin_lock_irqsave(&slub_kworker_lock, irq_flags);
+ list_splice_init(&slub_tlbflush_queue, &local_queue);
+ list_for_each_entry(slab, &local_queue, flush_list_elem) {
+ unsigned long start = (unsigned long)slab_to_virt(slab);
+ unsigned long end = start + PAGE_SIZE *
+ (1UL << oo_order(slab->oo));
+
+ if (start < addr_start)
+ addr_start = start;
+ if (end > addr_end)
+ addr_end = end;
+ }
+ spin_unlock_irqrestore(&slub_kworker_lock, irq_flags);
+
+ if (addr_start < addr_end)
+ flush_tlb_kernel_range(addr_start, addr_end);
+
+ spin_lock_irqsave(&slub_kworker_lock, irq_flags);
+ list_for_each_entry_safe(slab, tmp, &local_queue, flush_list_elem) {
+ struct folio *folio = slab_folio(slab);
+ struct kmem_cache *s = slab->slab_cache;
+
+ list_del(&slab->flush_list_elem);
+ folio_clear_slab(folio, slab);
+ __free_pages(folio_page(folio, 0), oo_order(slab->oo));
+
+ /* IRQs are already off */
+ spin_lock(&s->virtual.freed_slabs_lock);
+ if (oo_order(slab->oo) == oo_order(s->oo)) {
+ list_add(&slab->slab_list, &s->virtual.freed_slabs);
+ } else {
+ WARN_ON(oo_order(slab->oo) != oo_order(s->min));
+ list_add(&slab->slab_list, &s->virtual.freed_slabs_min);
+ }
+ spin_unlock(&s->virtual.freed_slabs_lock);
+ }
+ spin_unlock_irqrestore(&slub_kworker_lock, irq_flags);
+}
+static DEFINE_KTHREAD_WORK(slub_tlbflush_work, slub_tlbflush_worker);
+
+static void __free_slab(struct kmem_cache *s, struct slab *slab)
+{
+ int order = oo_order(slab->oo);
+ unsigned long pages = 1UL << order;
+ unsigned long slab_base = (unsigned long)slab_address(slab);
+ unsigned long irq_flags;
+
+ /* Clear the PTEs for the slab we're freeing */
+ for (unsigned long i = 0; i < pages; i++) {
+ unsigned long addr = slab_base + i * PAGE_SIZE;
+ pte_t *ptep = slub_get_ptep(addr, 0, false);
+
+ if (CHECK_DATA_CORRUPTION(!pte_present(*ptep),
+ "slab PTE already clear"))
+ return;
+
+ ptep_clear(&init_mm, addr, ptep);
+ }
+
+ mm_account_reclaimed_pages(pages);
+ unaccount_slab(slab, order, s);
+
+ /*
+ * We might not be able to do a TLB flush here (e.g. hardware interrupt
+ * handlers) so instead we give the slab to the TLB flusher thread
+ * which will flush the TLB for us and only then free the physical
+ * memory.
+ */
+ spin_lock_irqsave(&slub_kworker_lock, irq_flags);
+ list_add(&slab->flush_list_elem, &slub_tlbflush_queue);
+ spin_unlock_irqrestore(&slub_kworker_lock, irq_flags);
+ if (READ_ONCE(slub_kworker) != NULL)
+ kthread_queue_work(slub_kworker, &slub_tlbflush_work);
+}
+#else
static void __free_slab(struct kmem_cache *s, struct slab *slab)
{
struct folio *folio = slab_folio(slab);
@@ -2096,6 +2490,7 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
unaccount_slab(slab, order, s);
__free_pages(&folio->page, order);
}
+#endif /* CONFIG_SLAB_VIRTUAL */

static void rcu_free_slab(struct rcu_head *h)
{
@@ -3384,7 +3779,15 @@ static __always_inline void *__slab_alloc_node(struct kmem_cache *s,
unlikely(!object || !slab || !node_match(slab, node))) {
object = __slab_alloc(s, gfpflags, node, addr, c, orig_size);
} else {
- void *next_object = get_freepointer_safe(s, object, slab);
+ void *next_object;
+ freeptr_t next_encoded = get_freepointer_safe(s, object, slab);
+
+ if (unlikely(READ_ONCE(c->tid) != tid))
+ goto redo;
+
+ next_object = freelist_ptr_decode(s, next_encoded,
+ (unsigned long)kasan_reset_tag(object) + s->offset,
+ slab);

/*
* The cmpxchg will only match if there was no additional
@@ -5050,6 +5453,30 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
return s;
}

+#ifdef CONFIG_SLAB_VIRTUAL
+/*
+ * Late initialization of reclaim kthread.
+ * This has to happen way later than kmem_cache_init() because it depends on
+ * having all the kthread infrastructure ready.
+ */
+void __init init_slub_page_reclaim(void)
+{
+ struct kthread_worker *w;
+
+ w = kthread_create_worker(0, "slub-physmem-reclaim");
+ if (IS_ERR(w))
+ panic("unable to create slub-physmem-reclaim worker");
+
+ /*
+ * Make sure that the kworker is properly initialized before making
+ * the store visible to other CPUs. The free path will check that
+ * slub_kworker is not NULL before attempting to give the TLB flusher
+ * pages to free.
+ */
+ smp_store_release(&slub_kworker, w);
+}
+#endif /* CONFIG_SLAB_VIRTUAL */
+
void __init kmem_cache_init(void)
{
static __initdata struct kmem_cache boot_kmem_cache,
diff --git a/mm/usercopy.c b/mm/usercopy.c
index 83c164aba6e0..8b30906ca7f9 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -189,9 +189,19 @@ static inline void check_heap_object(const void *ptr, unsigned long n,
if (!virt_addr_valid(ptr))
return;

+ /*
+ * We need to check this first because when CONFIG_SLAB_VIRTUAL is
+ * enabled a slab address might not be backed by a folio.
+ */
+ if (IS_ENABLED(CONFIG_SLAB_VIRTUAL) && is_slab_addr(ptr)) {
+ /* Check slab allocator for flags and size. */
+ __check_heap_object(ptr, n, virt_to_slab(ptr), to_user);
+ return;
+ }
+
folio = virt_to_folio(ptr);

- if (folio_test_slab(folio)) {
+ if (!IS_ENABLED(CONFIG_SLAB_VIRTUAL) && folio_test_slab(folio)) {
/* Check slab allocator for flags and size. */
__check_heap_object(ptr, n, folio_slab(folio), to_user);
} else if (folio_test_large(folio)) {
--
2.42.0.459.ge4e396fd5e-goog

2023-09-15 15:25:38

by Matteo Rizzo

Subject: [RFC PATCH 09/14] mm/slub: add the slab freelists to kmem_cache

From: Jann Horn <[email protected]>

With SLAB_VIRTUAL enabled, unused slabs which still have virtual memory
allocated to them but no physical memory are kept in a per-cache list so
that they can be reused later if the cache needs to grow again.
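
A minimal userspace sketch of how these lists are meant to be used (the
structures here are simplified stand-ins for the freed_slabs and
freed_slabs_min lists added below; in the series, the allocation side that
pops from them arrives with the main SLAB_VIRTUAL patch):

/* Toy model of the per-cache freed-slab lists (userspace, not kernel code). */
#include <stdio.h>

struct vslab { int order; struct vslab *next; };

struct cache {
        int oo_order, min_order;        /* normal and fallback slab orders   */
        struct vslab *freed_slabs;      /* unbacked slabs of order oo_order  */
        struct vslab *freed_slabs_min;  /* unbacked slabs of order min_order */
};

/* Return a previously used virtual slab of the right order, if any. */
static struct vslab *get_free_slab(struct cache *c, int order)
{
        struct vslab **list = (order == c->min_order) ?
                              &c->freed_slabs_min : &c->freed_slabs;
        struct vslab *s = *list;

        if (s)
                *list = s->next;        /* reuse: no new virtual memory needed  */
        return s;                       /* NULL: a fresh range must be reserved */
}

/* A slab lost its physical backing; keep its virtual range for later. */
static void put_free_slab(struct cache *c, struct vslab *s)
{
        struct vslab **list = (s->order == c->min_order) ?
                              &c->freed_slabs_min : &c->freed_slabs;
        s->next = *list;
        *list = s;
}

int main(void)
{
        struct cache c = { .oo_order = 1, .min_order = 0 };
        struct vslab a = { .order = 1 };

        put_free_slab(&c, &a);
        printf("reused: %p\n", (void *)get_free_slab(&c, 1));  /* &a    */
        printf("empty:  %p\n", (void *)get_free_slab(&c, 1));  /* (nil) */
        return 0;
}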

Signed-off-by: Jann Horn <[email protected]>
Co-developed-by: Matteo Rizzo <[email protected]>
Signed-off-by: Matteo Rizzo <[email protected]>
---
include/linux/slub_def.h | 16 ++++++++++++++++
mm/slub.c | 23 +++++++++++++++++++++++
2 files changed, 39 insertions(+)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 0adf5ba8241b..693e9bb34edc 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -86,6 +86,20 @@ struct kmem_cache_cpu {
/*
* Slab cache management.
*/
+struct kmem_cache_virtual {
+#ifdef CONFIG_SLAB_VIRTUAL
+ /* Protects freed_slabs and freed_slabs_min */
+ spinlock_t freed_slabs_lock;
+ /*
+ * Slabs on this list have virtual memory of size oo allocated to them
+ * but no physical memory
+ */
+ struct list_head freed_slabs;
+ /* Same as freed_slabs but with memory of size min */
+ struct list_head freed_slabs_min;
+#endif
+};
+
struct kmem_cache {
#ifndef CONFIG_SLUB_TINY
struct kmem_cache_cpu __percpu *cpu_slab;
@@ -107,6 +121,8 @@ struct kmem_cache {

/* Allocation and freeing of slabs */
struct kmem_cache_order_objects min;
+ struct kmem_cache_virtual virtual;
+
gfp_t allocflags; /* gfp flags to use on each alloc */
int refcount; /* Refcount for slab cache destroy */
void (*ctor)(void *);
diff --git a/mm/slub.c b/mm/slub.c
index 42e7cc0b4452..4f77e5d4fe6c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4510,8 +4510,20 @@ static int calculate_sizes(struct kmem_cache *s)
return !!oo_objects(s->oo);
}

+static inline void slab_virtual_open(struct kmem_cache *s)
+{
+#ifdef CONFIG_SLAB_VIRTUAL
+ /* WARNING: this stuff will be relocated in bootstrap()! */
+ spin_lock_init(&s->virtual.freed_slabs_lock);
+ INIT_LIST_HEAD(&s->virtual.freed_slabs);
+ INIT_LIST_HEAD(&s->virtual.freed_slabs_min);
+#endif
+}
+
static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
{
+ slab_virtual_open(s);
+
s->flags = kmem_cache_flags(s->size, flags, s->name);
#ifdef CONFIG_SLAB_FREELIST_HARDENED
s->random = get_random_long();
@@ -4994,6 +5006,16 @@ static int slab_memory_callback(struct notifier_block *self,
* that may be pointing to the wrong kmem_cache structure.
*/

+static inline void slab_virtual_bootstrap(struct kmem_cache *s, struct kmem_cache *static_cache)
+{
+ slab_virtual_open(s);
+
+#ifdef CONFIG_SLAB_VIRTUAL
+ list_splice(&static_cache->virtual.freed_slabs, &s->virtual.freed_slabs);
+ list_splice(&static_cache->virtual.freed_slabs_min, &s->virtual.freed_slabs_min);
+#endif
+}
+
static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
{
int node;
@@ -5001,6 +5023,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
struct kmem_cache_node *n;

memcpy(s, static_cache, kmem_cache->object_size);
+ slab_virtual_bootstrap(s, static_cache);

/*
* This runs very early, and only the boot processor is supposed to be
--
2.42.0.459.ge4e396fd5e-goog

2023-09-15 16:13:20

by Matteo Rizzo

Subject: [RFC PATCH 03/14] mm/slub: move kmem_cache_order_objects to slab.h

From: Jann Horn <[email protected]>

This is refactoring for SLAB_VIRTUAL. The implementation needs to know
the order of the virtual memory region allocated to each slab to know
how much physical memory to allocate when the slab is reused. We reuse
kmem_cache_order_objects for this, so we have to move it before struct
slab.
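
For reference, kmem_cache_order_objects packs the order and the object count
into a single word using the OO_SHIFT/OO_MASK values moved by this patch. A
small userspace demo (oo_make here takes the object count directly instead of
deriving it from the object size, purely to keep the example short):

/* Userspace demo of the order/objects packing in kmem_cache_order_objects. */
#include <stdio.h>

#define OO_SHIFT 16
#define OO_MASK  ((1 << OO_SHIFT) - 1)

struct kmem_cache_order_objects { unsigned int x; };

/* Simplified: takes the object count directly instead of computing it. */
static struct kmem_cache_order_objects oo_make(unsigned int order,
                                               unsigned int objects)
{
        return (struct kmem_cache_order_objects){ (order << OO_SHIFT) + objects };
}

static unsigned int oo_order(struct kmem_cache_order_objects x)
{
        return x.x >> OO_SHIFT;
}

static unsigned int oo_objects(struct kmem_cache_order_objects x)
{
        return x.x & OO_MASK;
}

int main(void)
{
        /* e.g. an order-1 slab (two 4 KiB pages) holding 128 64-byte objects */
        struct kmem_cache_order_objects oo = oo_make(1, 128);

        printf("order=%u objects=%u raw=%#x\n",
               oo_order(oo), oo_objects(oo), oo.x);     /* 1, 128, 0x10080 */
        return 0;
}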

Signed-off-by: Jann Horn <[email protected]>
Co-developed-by: Matteo Rizzo <[email protected]>
Signed-off-by: Matteo Rizzo <[email protected]>
---
include/linux/slub_def.h | 9 ---------
mm/slab.h | 22 ++++++++++++++++++++++
mm/slub.c | 12 ------------
3 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index deb90cf4bffb..0adf5ba8241b 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -83,15 +83,6 @@ struct kmem_cache_cpu {
#define slub_percpu_partial_read_once(c) NULL
#endif // CONFIG_SLUB_CPU_PARTIAL

-/*
- * Word size structure that can be atomically updated or read and that
- * contains both the order and the number of objects that a slab of the
- * given order would contain.
- */
-struct kmem_cache_order_objects {
- unsigned int x;
-};
-
/*
* Slab cache management.
*/
diff --git a/mm/slab.h b/mm/slab.h
index 25e41dd6087e..3fe0d1e26e26 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -38,6 +38,15 @@ typedef union {
freelist_full_t full;
} freelist_aba_t;

+/*
+ * Word size structure that can be atomically updated or read and that
+ * contains both the order and the number of objects that a slab of the
+ * given order would contain.
+ */
+struct kmem_cache_order_objects {
+ unsigned int x;
+};
+
/* Reuses the bits in struct page */
struct slab {
unsigned long __page_flags;
@@ -227,6 +236,19 @@ static inline struct slab *virt_to_slab(const void *addr)
return folio_slab(folio);
}

+#define OO_SHIFT 16
+#define OO_MASK ((1 << OO_SHIFT) - 1)
+
+static inline unsigned int oo_order(struct kmem_cache_order_objects x)
+{
+ return x.x >> OO_SHIFT;
+}
+
+static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
+{
+ return x.x & OO_MASK;
+}
+
static inline int slab_order(const struct slab *slab)
{
return folio_order((struct folio *)slab_folio(slab));
diff --git a/mm/slub.c b/mm/slub.c
index b69916ab7aa8..df2529c03bd3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -284,8 +284,6 @@ static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)
*/
#define DEBUG_METADATA_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)

-#define OO_SHIFT 16
-#define OO_MASK ((1 << OO_SHIFT) - 1)
#define MAX_OBJS_PER_PAGE 32767 /* since slab.objects is u15 */

/* Internal SLUB flags */
@@ -473,16 +471,6 @@ static inline struct kmem_cache_order_objects oo_make(unsigned int order,
return x;
}

-static inline unsigned int oo_order(struct kmem_cache_order_objects x)
-{
- return x.x >> OO_SHIFT;
-}
-
-static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
-{
- return x.x & OO_MASK;
-}
-
#ifdef CONFIG_SLUB_CPU_PARTIAL
static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
{
--
2.42.0.459.ge4e396fd5e-goog

2023-09-15 16:17:29

by Matteo Rizzo

Subject: [RFC PATCH 04/14] mm: use virt_to_slab instead of folio_slab

From: Jann Horn <[email protected]>

This is refactoring in preparation for the introduction of SLAB_VIRTUAL
which does not implement folio_slab.

With SLAB_VIRTUAL there is no longer a 1:1 correspondence between slabs
and pages of physical memory used by the slab allocator. There is no way
to look up the slab which corresponds to a specific page of physical
memory without iterating over all slabs or over the page tables. Instead
of doing that, we look up the slab starting from its virtual address,
which is still cheap with SLAB_VIRTUAL both enabled and disabled.

Signed-off-by: Jann Horn <[email protected]>
Co-developed-by: Matteo Rizzo <[email protected]>
Signed-off-by: Matteo Rizzo <[email protected]>
---
mm/memcontrol.c | 2 +-
mm/slab_common.c | 12 +++++++-----
mm/slub.c | 14 ++++++--------
3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..0ab9f5323db7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2936,7 +2936,7 @@ struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p)
struct slab *slab;
unsigned int off;

- slab = folio_slab(folio);
+ slab = virt_to_slab(p);
objcgs = slab_objcgs(slab);
if (!objcgs)
return NULL;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 79102d24f099..42ceaf7e9f47 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1062,13 +1062,13 @@ void kfree(const void *object)
if (unlikely(ZERO_OR_NULL_PTR(object)))
return;

- folio = virt_to_folio(object);
if (unlikely(!is_slab_addr(object))) {
+ folio = virt_to_folio(object);
free_large_kmalloc(folio, (void *)object);
return;
}

- slab = folio_slab(folio);
+ slab = virt_to_slab(object);
s = slab->slab_cache;
__kmem_cache_free(s, (void *)object, _RET_IP_);
}
@@ -1089,12 +1089,13 @@ EXPORT_SYMBOL(kfree);
size_t __ksize(const void *object)
{
struct folio *folio;
+ struct kmem_cache *s;

if (unlikely(object == ZERO_SIZE_PTR))
return 0;

- folio = virt_to_folio(object);
if (unlikely(!is_slab_addr(object))) {
+ folio = virt_to_folio(object);
if (WARN_ON(folio_size(folio) <= KMALLOC_MAX_CACHE_SIZE))
return 0;
if (WARN_ON(object != folio_address(folio)))
@@ -1102,11 +1103,12 @@ size_t __ksize(const void *object)
return folio_size(folio);
}

+ s = virt_to_slab(object)->slab_cache;
#ifdef CONFIG_SLUB_DEBUG
- skip_orig_size_check(folio_slab(folio)->slab_cache, object);
+ skip_orig_size_check(s, object);
#endif

- return slab_ksize(folio_slab(folio)->slab_cache);
+ return slab_ksize(s);
}

void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
diff --git a/mm/slub.c b/mm/slub.c
index df2529c03bd3..ad33d9e1601d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3848,25 +3848,23 @@ int build_detached_freelist(struct kmem_cache *s, size_t size,
{
int lookahead = 3;
void *object;
- struct folio *folio;
+ struct slab *slab;
size_t same;

object = p[--size];
- folio = virt_to_folio(object);
+ slab = virt_to_slab(object);
if (!s) {
/* Handle kalloc'ed objects */
- if (unlikely(!folio_test_slab(folio))) {
- free_large_kmalloc(folio, object);
+ if (unlikely(slab == NULL)) {
+ free_large_kmalloc(virt_to_folio(object), object);
df->slab = NULL;
return size;
}
- /* Derive kmem_cache from object */
- df->slab = folio_slab(folio);
- df->s = df->slab->slab_cache;
+ df->s = slab->slab_cache;
} else {
- df->slab = folio_slab(folio);
df->s = cache_from_obj(s, object); /* Support for memcg */
}
+ df->slab = slab;

/* Start new detached freelist */
df->tail = object;
--
2.42.0.459.ge4e396fd5e-goog

2023-09-15 16:32:08

by Matteo Rizzo

Subject: [RFC PATCH 01/14] mm/slub: don't try to dereference invalid freepointers

slab_free_freelist_hook tries to read a freelist pointer from the
current object even when freeing a single object. This is invalid
because a single freed object doesn't actually contain a freelist
pointer; at that point its memory contains other data. This causes
problems for the freelist integrity checking in get_freepointer.

Signed-off-by: Matteo Rizzo <[email protected]>
---
mm/slub.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index f7940048138c..a7dae207c2d2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1820,7 +1820,9 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s,

do {
object = next;
- next = get_freepointer(s, object);
+ /* Single objects don't actually contain a freepointer */
+ if (object != old_tail)
+ next = get_freepointer(s, object);

/* If object's reuse doesn't have to be delayed */
if (!slab_free_hook(s, object, slab_want_init_on_free(s))) {
--
2.42.0.459.ge4e396fd5e-goog

2023-09-15 16:41:58

by Dave Hansen

Subject: Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

On 9/15/23 03:59, Matteo Rizzo wrote:
> The goal of this patch series is to deterministically prevent cross-cache
> attacks in the SLUB allocator.

What's the cost?

2023-09-15 16:50:28

by Matteo Rizzo

Subject: [RFC PATCH 13/14] mm/slub: sanity-check freepointers

From: Jann Horn <[email protected]>

Sanity-check that:
- non-NULL freepointers point into the slab
- freepointers look plausibly aligned
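
The following userspace model mirrors the align_mask computation and the
(ptr & mask) == slab_base comparison added by this patch; the slab address
and the kmalloc-192 parameters are example values only:

/* Userspace model of the freepointer range/alignment check (not kernel code). */
#include <assert.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Returns 1 if ptr is a plausible free-object address inside the slab. */
static int freeptr_plausible(unsigned long slab_base, unsigned int order,
                             unsigned int size, unsigned long ptr)
{
        /* Bits that must be equal to the start-of-slab address. */
        unsigned long mask = ~((PAGE_SIZE << order) - 1);

        /* Cheap alignment check on the lowest set bit of size, no division. */
        mask |= (1UL << (__builtin_ffs(size) - 1)) - 1;

        if (!ptr)               /* NULL terminates the freelist, always OK */
                return 1;
        return (ptr & mask) == slab_base;
}

int main(void)
{
        unsigned long slab = 0xfffffe8800000000UL;      /* assumed slab start */

        /* kmalloc-192-style cache in an order-0 slab */
        assert(freeptr_plausible(slab, 0, 192, slab + 2 * 192));  /* in slab   */
        assert(!freeptr_plausible(slab, 0, 192, slab + 100));     /* unaligned */
        assert(!freeptr_plausible(slab, 0, 192, slab + 8192));    /* outside   */
        printf("checks behave as expected\n");
        return 0;
}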

Signed-off-by: Jann Horn <[email protected]>
Co-developed-by: Matteo Rizzo <[email protected]>
Signed-off-by: Matteo Rizzo <[email protected]>
---
lib/slub_kunit.c | 4 ++++
mm/slab.h | 8 +++++++
mm/slub.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 69 insertions(+)

diff --git a/lib/slub_kunit.c b/lib/slub_kunit.c
index d4a3730b08fa..acf8600bd1fd 100644
--- a/lib/slub_kunit.c
+++ b/lib/slub_kunit.c
@@ -45,6 +45,10 @@ static void test_clobber_zone(struct kunit *test)
#ifndef CONFIG_KASAN
static void test_next_pointer(struct kunit *test)
{
+ if (IS_ENABLED(CONFIG_SLAB_VIRTUAL))
+ kunit_skip(test,
+ "incompatible with freepointer corruption detection in CONFIG_SLAB_VIRTUAL");
+
struct kmem_cache *s = test_kmem_cache_create("TestSlub_next_ptr_free",
64, SLAB_POISON);
u8 *p = kmem_cache_alloc(s, GFP_KERNEL);
diff --git a/mm/slab.h b/mm/slab.h
index 460c802924bd..8d10a011bdf0 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -79,6 +79,14 @@ struct slab {

struct list_head flush_list_elem;

+ /*
+ * Not in kmem_cache because it depends on whether the allocation is
+ * normal order or fallback order.
+ * An alternative might be to over-allocate virtual memory for
+ * fallback-order pages.
+ */
+ unsigned long align_mask;
+
/* Replaces the page lock */
spinlock_t slab_lock;

diff --git a/mm/slub.c b/mm/slub.c
index 0f7f5bf0b174..57474c8a6569 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -392,6 +392,44 @@ static inline freeptr_t freelist_ptr_encode(const struct kmem_cache *s,
return (freeptr_t){.v = encoded};
}

+/*
+ * Does some validation of freelist pointers. Without SLAB_VIRTUAL this is
+ * currently a no-op.
+ */
+static inline bool freelist_pointer_corrupted(struct slab *slab, freeptr_t ptr,
+ void *decoded)
+{
+#ifdef CONFIG_SLAB_VIRTUAL
+ /*
+ * If the freepointer decodes to 0, use 0 as the slab_base so that
+ * the check below always passes (0 & slab->align_mask == 0).
+ */
+ unsigned long slab_base = decoded ? (unsigned long)slab_to_virt(slab)
+ : 0;
+
+ /*
+ * This verifies that the SLUB freepointer does not point outside the
+ * slab. Since at that point we can basically do it for free, it also
+ * checks that the pointer alignment looks vaguely sane.
+ * However, we probably don't want the cost of a proper division here,
+ * so instead we just do a cheap check whether the bottom bits that are
+ * clear in the size are also clear in the pointer.
+ * So for kmalloc-32, it does a perfect alignment check, but for
+ * kmalloc-192, it just checks that the pointer is a multiple of 32.
+ * This should probably be reconsidered - is this a good tradeoff, or
+ * should that part be thrown out, or do we want a proper accurate
+ * alignment check (and can we make it work with acceptable performance
+ * cost compared to the security improvement - probably not)?
+ */
+ return CHECK_DATA_CORRUPTION(
+ ((unsigned long)decoded & slab->align_mask) != slab_base,
+ "bad freeptr (encoded %lx, ptr %p, base %lx, mask %lx",
+ ptr.v, decoded, slab_base, slab->align_mask);
+#else
+ return false;
+#endif
+}
+
static inline void *freelist_ptr_decode(const struct kmem_cache *s,
freeptr_t ptr, unsigned long ptr_addr,
struct slab *slab)
@@ -403,6 +441,10 @@ static inline void *freelist_ptr_decode(const struct kmem_cache *s,
#else
decoded = (void *)ptr.v;
#endif
+
+ if (unlikely(freelist_pointer_corrupted(slab, ptr, decoded)))
+ return NULL;
+
return decoded;
}

@@ -2122,6 +2164,21 @@ static struct slab *get_free_slab(struct kmem_cache *s,
if (slab == NULL)
return NULL;

+ /*
+ * Bits that must be equal to start-of-slab address for all
+ * objects inside the slab.
+ * For compatibility with pointer tagging (like in HWASAN), this would
+ * need to clear the pointer tag bits from the mask.
+ */
+ slab->align_mask = ~((PAGE_SIZE << oo_order(oo)) - 1);
+
+ /*
+ * Object alignment bits (must be zero, which is equal to the bits in
+ * the start-of-slab address)
+ */
+ if (s->red_left_pad == 0)
+ slab->align_mask |= (1 << (ffs(s->size) - 1)) - 1;
+
return slab;
}

--
2.42.0.459.ge4e396fd5e-goog

2023-09-15 17:17:51

by Matteo Rizzo

Subject: [RFC PATCH 14/14] security: add documentation for SLAB_VIRTUAL

From: Jann Horn <[email protected]>

Document what SLAB_VIRTUAL is trying to do, how it's implemented, and
why.

Signed-off-by: Jann Horn <[email protected]>
Co-developed-by: Matteo Rizzo <[email protected]>
Signed-off-by: Matteo Rizzo <[email protected]>
---
Documentation/security/self-protection.rst | 102 +++++++++++++++++++++
1 file changed, 102 insertions(+)

diff --git a/Documentation/security/self-protection.rst b/Documentation/security/self-protection.rst
index 910668e665cb..5a5e99e3f244 100644
--- a/Documentation/security/self-protection.rst
+++ b/Documentation/security/self-protection.rst
@@ -314,3 +314,105 @@ To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
it should automatically censor sensitive values.
+
+
+Memory Allocator Mitigations
+============================
+
+Protection against cross-cache attacks (SLAB_VIRTUAL)
+-----------------------------------------------------
+
+SLAB_VIRTUAL is a mitigation that deterministically prevents cross-cache
+attacks.
+
+Linux Kernel use-after-free vulnerabilities are commonly exploited by turning
+them into an object type confusion (having two active pointers of different
+types to the same memory location) using one of the following techniques:
+
+1. Direct object reuse: make the kernel give the victim object back to the slab
+ allocator, then allocate the object again from the same slab cache as a
+ different type. This is only possible if the victim object resides in a slab
+ cache which can contain objects of different types - for example one of the
+ kmalloc caches.
+2. "Cross-cache attack": make the kernel give the victim object back to the slab
+ allocator, then make the slab allocator give the page containing the object
+ back to the page allocator, then either allocate the page directly as some
+ other type of page or make the slab allocator allocate it again for a
+ different slab cache and allocate an object from there.
+
+In either case, the important part is that the same virtual address is reused
+for two objects of different types.
+
+The first case can be addressed by separating objects of different types
+into different slab caches. If a slab cache only contains objects of the
+same type then directly turning a use-after-free into a type confusion is
+impossible as long as the slab page that contains the victim object remains
+assigned to that slab cache. This type of mitigation is easily bypassable
+by cross-cache attacks: if the attacker can make the slab allocator return
+the page containing the victim object to the page allocator and then make
+it use the same page for a different slab cache, type confusion becomes
+possible again. Addressing the first case is therefore only worthwhile if
+cross-cache attacks are also addressed. AUTOSLAB uses a combination of
+probabilistic mitigations for this. SLAB_VIRTUAL addresses the second case
+deterministically by changing the way the slab allocator allocates memory.
+
+Preventing slab virtual address reuse
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In theory there is an easy fix against cross-cache attacks: modify the slab
+allocator so that it never gives memory back to the page allocator. In practice
+this would be problematic because physical memory remains permanently assigned
+to a slab cache even if it doesn't contain any active objects. A viable
+cross-cache mitigation must allow the system to reclaim unused physical memory.
+In the current design of the slab allocator there is no way
+to keep a region of virtual memory permanently assigned to a slab cache without
+also permanently reserving physical memory. That is because the virtual
+addresses that the slab allocator uses come from the linear map region, where
+there is a 1:1 correspondence between virtual and physical addresses.
+
+SLAB_VIRTUAL's solution is to create a dedicated virtual memory region that is
+only used for slab memory, and to enforce that once a range of virtual addresses
+is used for a slab cache, it is never reused for any other caches. Using a
+dedicated region of virtual memory lets us reserve ranges of virtual addresses
+to prevent cross-cache attacks and at the same time release physical memory back
+to the system when it's no longer needed. This is what Chromium's PartitionAlloc
+does in userspace
+(https://chromium.googlesource.com/chromium/src/+/354da2514b31df2aa14291199a567e10a7671621/base/allocator/partition_allocator/PartitionAlloc.md).
+
+Implementation
+~~~~~~~~~~~~~~
+
+SLAB_VIRTUAL reserves a region of virtual memory for the slab allocator. All
+pointers returned by the slab allocator point to this region. The region is
+statically partitioned in two sub-regions: the metadata region and the data
+region. The data region is where the actual objects are allocated from. The
+metadata region is an array of struct slab objects, one for each PAGE_SIZE bytes
+in the data region.
+Without SLAB_VIRTUAL, struct slab is overlaid on top of the struct page/struct
+folio that corresponds to the physical memory page backing the slab instead of
+using a dedicated memory region. This doesn't work for SLAB_VIRTUAL, which needs
+to store metadata for slabs even when no physical memory is allocated to them.
+Having an array of struct slab lets us implement virt_to_slab efficiently purely
+with arithmetic. In order to support high-order slabs, the struct slabs
+corresponding to tail pages contain a pointer to the head slab, which
+corresponds to the slab's head page.
+
+TLB flushing
+~~~~~~~~~~~~
+
+Before it can release a page of physical memory back to the page allocator, the
+slab allocator must flush the TLB entries for that page on all CPUs. This is not
+only necessary for the mitigation to work reliably but it's also required for
+correctness. Without a TLB flush some CPUs might continue using the old mapping
+if the virtual address range is reused for a new slab and cause memory
+corruption even in the absence of other bugs. The slab allocator can release
+pages in contexts where TLB flushes can't be performed (e.g. in hardware
+interrupt handlers). Pages to free are not freed directly; instead they are put
+on a queue and freed by a dedicated kthread worker, which also flushes the TLB.
+
+Performance
+~~~~~~~~~~~
+
+SLAB_VIRTUAL's performance impact depends on the workload. On kernel compilation
+(kernbench) the slowdown is about 1-2% depending on the machine type and is
+slightly worse on machines with more cores.
--
2.42.0.459.ge4e396fd5e-goog

2023-09-15 18:08:51

by Matteo Rizzo

Subject: [RFC PATCH 10/14] x86: Create virtual memory region for SLUB

From: Jann Horn <[email protected]>

SLAB_VIRTUAL reserves 512 GiB of virtual memory and uses it for both
struct slab metadata and the actual slab memory. The pointers returned by
kmem_cache_alloc will point to this range of memory.
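
For a sense of scale, the metadata/data split implied by the constants in
this patch works out as follows (a quick userspace calculation assuming
4 KiB pages and the 256-byte STRUCT_SLAB_SIZE defined here):

/* Back-of-the-envelope check of the SLUB virtual region layout. */
#include <stdio.h>

int main(void)
{
        unsigned long region    = 512UL << 30;          /* one P4D entry: 512 GiB */
        unsigned long page_size = 4096;
        unsigned long slab_size = 32 * sizeof(void *);  /* STRUCT_SLAB_SIZE = 256 */
        unsigned long vpages    = region / page_size;   /* SLAB_VPAGES            */
        unsigned long meta      = vpages * slab_size;   /* SLAB_META_SIZE         */

        printf("struct slab entries: %lu\n", vpages);                /* 134217728 */
        printf("metadata region:     %lu GiB\n", meta >> 30);        /* 32 GiB    */
        printf("data region:         %lu GiB\n", (region - meta) >> 30); /* 480   */
        return 0;
}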

Signed-off-by: Jann Horn <[email protected]>
Co-developed-by: Matteo Rizzo <[email protected]>
Signed-off-by: Matteo Rizzo <[email protected]>
---
Documentation/arch/x86/x86_64/mm.rst | 4 ++--
arch/x86/include/asm/pgtable_64_types.h | 16 ++++++++++++++++
arch/x86/mm/init_64.c | 19 +++++++++++++++----
arch/x86/mm/kaslr.c | 9 +++++++++
arch/x86/mm/mm_internal.h | 4 ++++
mm/slub.c | 4 ++++
security/Kconfig.hardening | 2 ++
7 files changed, 52 insertions(+), 6 deletions(-)

diff --git a/Documentation/arch/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst
index 35e5e18c83d0..121179537175 100644
--- a/Documentation/arch/x86/x86_64/mm.rst
+++ b/Documentation/arch/x86/x86_64/mm.rst
@@ -57,7 +57,7 @@ Complete virtual memory map with 4-level page tables
fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
| | | | vaddr_end for KASLR
fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
- fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
+ fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | SLUB virtual memory
ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
@@ -116,7 +116,7 @@ Complete virtual memory map with 5-level page tables
fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
| | | | vaddr_end for KASLR
fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
- fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
+ fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | SLUB virtual memory
ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 38b54b992f32..e1a91eb084c4 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -6,6 +6,7 @@

#ifndef __ASSEMBLY__
#include <linux/types.h>
+#include <linux/align.h>
#include <asm/kaslr.h>

/*
@@ -199,6 +200,21 @@ extern unsigned int ptrs_per_p4d;
#define ESPFIX_PGD_ENTRY _AC(-2, UL)
#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << P4D_SHIFT)

+#ifdef CONFIG_SLAB_VIRTUAL
+#define SLAB_PGD_ENTRY _AC(-3, UL)
+#define SLAB_BASE_ADDR (SLAB_PGD_ENTRY << P4D_SHIFT)
+#define SLAB_END_ADDR (SLAB_BASE_ADDR + P4D_SIZE)
+
+/*
+ * We need to define this here because we need it to compute SLAB_META_SIZE
+ * and including slab.h causes a dependency cycle.
+ */
+#define STRUCT_SLAB_SIZE (32 * sizeof(void *))
+#define SLAB_VPAGES ((SLAB_END_ADDR - SLAB_BASE_ADDR) / PAGE_SIZE)
+#define SLAB_META_SIZE ALIGN(SLAB_VPAGES * STRUCT_SLAB_SIZE, PAGE_SIZE)
+#define SLAB_DATA_BASE_ADDR (SLAB_BASE_ADDR + SLAB_META_SIZE)
+#endif /* CONFIG_SLAB_VIRTUAL */
+
#define CPU_ENTRY_AREA_PGD _AC(-4, UL)
#define CPU_ENTRY_AREA_BASE (CPU_ENTRY_AREA_PGD << P4D_SHIFT)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a190aae8ceaf..d716ddfd9880 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1279,16 +1279,19 @@ static void __init register_page_bootmem_info(void)
}

/*
- * Pre-allocates page-table pages for the vmalloc area in the kernel page-table.
+ * Pre-allocates page-table pages for the vmalloc and SLUB areas in the kernel
+ * page-table.
* Only the level which needs to be synchronized between all page-tables is
* allocated because the synchronization can be expensive.
*/
-static void __init preallocate_vmalloc_pages(void)
+static void __init preallocate_top_level_entries_range(unsigned long start,
+ unsigned long end)
{
unsigned long addr;
const char *lvl;

- for (addr = VMALLOC_START; addr <= VMEMORY_END; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
+
+ for (addr = start; addr <= end; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
pgd_t *pgd = pgd_offset_k(addr);
p4d_t *p4d;
pud_t *pud;
@@ -1328,6 +1331,14 @@ static void __init preallocate_vmalloc_pages(void)
panic("Failed to pre-allocate %s pages for vmalloc area\n", lvl);
}

+static void __init preallocate_top_level_entries(void)
+{
+ preallocate_top_level_entries_range(VMALLOC_START, VMEMORY_END);
+#ifdef CONFIG_SLAB_VIRTUAL
+ preallocate_top_level_entries_range(SLAB_BASE_ADDR, SLAB_END_ADDR - 1);
+#endif
+}
+
void __init mem_init(void)
{
pci_iommu_alloc();
@@ -1351,7 +1362,7 @@ void __init mem_init(void)
if (get_gate_vma(&init_mm))
kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR, PAGE_SIZE, KCORE_USER);

- preallocate_vmalloc_pages();
+ preallocate_top_level_entries();
}

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 37db264866b6..7b297d372a8c 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -136,6 +136,15 @@ void __init kernel_randomize_memory(void)
vaddr = round_up(vaddr + 1, PUD_SIZE);
remain_entropy -= entropy;
}
+
+#ifdef CONFIG_SLAB_VIRTUAL
+ /*
+ * slub_addr_base is initialized separately from the
+ * kaslr_memory_regions because it comes after CPU_ENTRY_AREA_BASE.
+ */
+ prandom_bytes_state(&rand_state, &rand, sizeof(rand));
+ slub_addr_base += (rand & ((1UL << 36) - PAGE_SIZE));
+#endif
}

void __meminit init_trampoline_kaslr(void)
diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
index 3f37b5c80bb3..fafb79b7e019 100644
--- a/arch/x86/mm/mm_internal.h
+++ b/arch/x86/mm/mm_internal.h
@@ -25,4 +25,8 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache);

extern unsigned long tlb_single_page_flush_ceiling;

+#ifdef CONFIG_SLAB_VIRTUAL
+extern unsigned long slub_addr_base;
+#endif
+
#endif /* __X86_MM_INTERNAL_H */
diff --git a/mm/slub.c b/mm/slub.c
index 4f77e5d4fe6c..a731fdc79bff 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -166,6 +166,10 @@
* the fast path and disables lockless freelists.
*/

+#ifdef CONFIG_SLAB_VIRTUAL
+unsigned long slub_addr_base = SLAB_DATA_BASE_ADDR;
+#endif /* CONFIG_SLAB_VIRTUAL */
+
/*
* We could simply use migrate_disable()/enable() but as long as it's a
* function call even on !PREEMPT_RT, use inline preempt_disable() there.
diff --git a/security/Kconfig.hardening b/security/Kconfig.hardening
index 9f4e6e38aa76..f4a0af424149 100644
--- a/security/Kconfig.hardening
+++ b/security/Kconfig.hardening
@@ -357,6 +357,8 @@ config GCC_PLUGIN_RANDSTRUCT

config SLAB_VIRTUAL
bool "Allocate slab objects from virtual memory"
+ # For virtual memory region allocation
+ depends on X86_64
depends on SLUB && !SLUB_TINY
# If KFENCE support is desired, it could be implemented on top of our
# virtual memory allocation facilities
--
2.42.0.459.ge4e396fd5e-goog

2023-09-15 19:56:31

by Matteo Rizzo

[permalink] [raw]
Subject: [RFC PATCH 05/14] mm/slub: create folio_set/clear_slab helpers

From: Jann Horn <[email protected]>

This is refactoring in preparation for SLAB_VIRTUAL. Extract this code
to separate functions so that it's not duplicated in the code that
allocates and frees pages with SLAB_VIRTUAL enabled.

Signed-off-by: Jann Horn <[email protected]>
Co-developed-by: Matteo Rizzo <[email protected]>
Signed-off-by: Matteo Rizzo <[email protected]>
---
mm/slub.c | 32 ++++++++++++++++++++++----------
1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index ad33d9e1601d..9b87afade125 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1849,6 +1849,26 @@ static void *setup_object(struct kmem_cache *s, void *object)
/*
* Slab allocation and freeing
*/
+
+static void folio_set_slab(struct folio *folio, struct slab *slab)
+{
+ __folio_set_slab(folio);
+ /* Make the flag visible before any changes to folio->mapping */
+ smp_wmb();
+
+ if (folio_is_pfmemalloc(folio))
+ slab_set_pfmemalloc(slab);
+}
+
+static void folio_clear_slab(struct folio *folio, struct slab *slab)
+{
+ __slab_clear_pfmemalloc(slab);
+ folio->mapping = NULL;
+ /* Make the mapping reset visible before clearing the flag */
+ smp_wmb();
+ __folio_clear_slab(folio);
+}
+
static inline struct slab *alloc_slab_page(gfp_t flags, int node,
struct kmem_cache_order_objects oo)
{
@@ -1865,11 +1885,7 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
return NULL;

slab = folio_slab(folio);
- __folio_set_slab(folio);
- /* Make the flag visible before any changes to folio->mapping */
- smp_wmb();
- if (folio_is_pfmemalloc(folio))
- slab_set_pfmemalloc(slab);
+ folio_set_slab(folio, slab);

return slab;
}
@@ -2067,11 +2083,7 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
int order = folio_order(folio);
int pages = 1 << order;

- __slab_clear_pfmemalloc(slab);
- folio->mapping = NULL;
- /* Make the mapping reset visible before clearing the flag */
- smp_wmb();
- __folio_clear_slab(folio);
+ folio_clear_slab(folio, slab);
mm_account_reclaimed_pages(pages);
unaccount_slab(slab, order, s);
__free_pages(&folio->page, order);
--
2.42.0.459.ge4e396fd5e-goog

2023-09-15 22:30:42

by Matteo Rizzo

[permalink] [raw]
Subject: [RFC PATCH 02/14] mm/slub: add is_slab_addr/is_slab_page helpers

From: Jann Horn <[email protected]>

This is refactoring in preparation for adding two different
implementations (for SLAB_VIRTUAL enabled and disabled).

virt_to_folio(x) expands to _compound_head(virt_to_page(x)) and
virt_to_head_page(x) also expands to _compound_head(virt_to_page(x))

so PageSlab(virt_to_head_page(res)) should be equivalent to
is_slab_addr(res).

Signed-off-by: Jann Horn <[email protected]>
Co-developed-by: Matteo Rizzo <[email protected]>
Signed-off-by: Matteo Rizzo <[email protected]>
---
include/linux/slab.h | 1 +
kernel/resource.c | 2 +-
mm/slab.h | 9 +++++++++
mm/slab_common.c | 5 ++---
mm/slub.c | 6 +++---
5 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 8228d1276a2f..a2d82010d269 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -793,4 +793,5 @@ int slab_dead_cpu(unsigned int cpu);
#define slab_dead_cpu NULL
#endif

+#define is_slab_addr(addr) folio_test_slab(virt_to_folio(addr))
#endif /* _LINUX_SLAB_H */
diff --git a/kernel/resource.c b/kernel/resource.c
index b1763b2fd7ef..c829e5f97292 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -158,7 +158,7 @@ static void free_resource(struct resource *res)
* buddy and trying to be smart and reusing them eventually in
* alloc_resource() overcomplicates resource handling.
*/
- if (res && PageSlab(virt_to_head_page(res)))
+ if (res && is_slab_addr(res))
kfree(res);
}

diff --git a/mm/slab.h b/mm/slab.h
index 799a315695c6..25e41dd6087e 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -169,6 +169,15 @@ static_assert(IS_ALIGNED(offsetof(struct slab, freelist), sizeof(freelist_aba_t)
*/
#define slab_page(s) folio_page(slab_folio(s), 0)

+/**
+ * is_slab_page - Checks if a page is really a slab page
+ * @s: The slab
+ *
+ * Checks if s points to a slab page.
+ *
+ * Return: true if s points to a slab and false otherwise.
+ */
+#define is_slab_page(s) folio_test_slab(slab_folio(s))
/*
* If network-based swap is enabled, sl*b must keep track of whether pages
* were allocated from pfmemalloc reserves.
diff --git a/mm/slab_common.c b/mm/slab_common.c
index e99e821065c3..79102d24f099 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1063,7 +1063,7 @@ void kfree(const void *object)
return;

folio = virt_to_folio(object);
- if (unlikely(!folio_test_slab(folio))) {
+ if (unlikely(!is_slab_addr(object))) {
free_large_kmalloc(folio, (void *)object);
return;
}
@@ -1094,8 +1094,7 @@ size_t __ksize(const void *object)
return 0;

folio = virt_to_folio(object);
-
- if (unlikely(!folio_test_slab(folio))) {
+ if (unlikely(!is_slab_addr(object))) {
if (WARN_ON(folio_size(folio) <= KMALLOC_MAX_CACHE_SIZE))
return 0;
if (WARN_ON(object != folio_address(folio)))
diff --git a/mm/slub.c b/mm/slub.c
index a7dae207c2d2..b69916ab7aa8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1259,7 +1259,7 @@ static int check_slab(struct kmem_cache *s, struct slab *slab)
{
int maxobj;

- if (!folio_test_slab(slab_folio(slab))) {
+ if (!is_slab_page(slab)) {
slab_err(s, slab, "Not a valid slab page");
return 0;
}
@@ -1454,7 +1454,7 @@ static noinline bool alloc_debug_processing(struct kmem_cache *s,
return true;

bad:
- if (folio_test_slab(slab_folio(slab))) {
+ if (is_slab_page(slab)) {
/*
* If this is a slab page then lets do the best we can
* to avoid issues in the future. Marking all objects
@@ -1484,7 +1484,7 @@ static inline int free_consistency_checks(struct kmem_cache *s,
return 0;

if (unlikely(s != slab->slab_cache)) {
- if (!folio_test_slab(slab_folio(slab))) {
+ if (!is_slab_page(slab)) {
slab_err(s, slab, "Attempt to free object(0x%p) outside of slab",
object);
} else if (!slab->slab_cache) {
--
2.42.0.459.ge4e396fd5e-goog

2023-09-15 22:43:03

by Lameter, Christopher

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

On Fri, 15 Sep 2023, Dave Hansen wrote:

> On 9/15/23 03:59, Matteo Rizzo wrote:
>> The goal of this patch series is to deterministically prevent cross-cache
>> attacks in the SLUB allocator.
>
> What's the cost?

The only thing that I see is 1-2% on kernel compilations (and "more on
machines with lots of cores")?

Having a virtualized slab subsystem could enable other things:

- The page order calculation could be simplified since vmalloc can stitch
arbitrary base pages together to form larger contiguous virtual segments.
So just use f.e. order 5 or so for all slabs to reduce contention?

- Maybe we could make slab pages movable (if we can ensure that slab
objects are not touched somehow. At least stop_machine run could be used
to move batches of slab memory)

- Maybe we can avoid allocating page structs somehow for slab memory?
Looks like this is taking a step in that direction. The metadata storage
of the slab allocator could be reworked and optimized better.

Problems:

- Overhead due to more TLB lookups

- Larger amounts of TLBs are used for the OS. Currently we are trying to
use the maximum mappable TLBs to reduce their numbers. This presumably
means using 4K TLBs for all slab access.

- Memory may not be physically contiguous which may be required by
some drivers doing DMA.



2023-09-15 22:56:21

by Matteo Rizzo

[permalink] [raw]
Subject: [RFC PATCH 12/14] mm/slub: introduce the deallocated_pages sysfs attribute

From: Jann Horn <[email protected]>

When SLAB_VIRTUAL is enabled this new sysfs attribute tracks the number
of slab pages whose physical memory has been reclaimed but whose virtual
memory is still allocated to a kmem_cache.

Signed-off-by: Jann Horn <[email protected]>
Co-developed-by: Matteo Rizzo <[email protected]>
Signed-off-by: Matteo Rizzo <[email protected]>
---
include/linux/slub_def.h | 4 +++-
mm/slub.c | 18 ++++++++++++++++++
2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 693e9bb34edc..eea402d849da 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -88,7 +88,7 @@ struct kmem_cache_cpu {
*/
struct kmem_cache_virtual {
#ifdef CONFIG_SLAB_VIRTUAL
- /* Protects freed_slabs and freed_slabs_min */
+	/* Protects freed_slabs, freed_slabs_min, and nr_freed_pages */
spinlock_t freed_slabs_lock;
/*
* Slabs on this list have virtual memory of size oo allocated to them
@@ -97,6 +97,8 @@ struct kmem_cache_virtual {
struct list_head freed_slabs;
/* Same as freed_slabs but with memory of size min */
struct list_head freed_slabs_min;
+ /* Number of slab pages which got freed */
+ unsigned long nr_freed_pages;
#endif
};

diff --git a/mm/slub.c b/mm/slub.c
index 66ae60cdadaf..0f7f5bf0b174 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2110,6 +2110,8 @@ static struct slab *get_free_slab(struct kmem_cache *s,

if (likely(slab)) {
list_del(&slab->slab_list);
+ WRITE_ONCE(s->virtual.nr_freed_pages,
+ s->virtual.nr_freed_pages - (1UL << slab_order(slab)));

spin_unlock_irqrestore(&s->virtual.freed_slabs_lock, flags);
return slab;
@@ -2158,6 +2160,8 @@ static struct slab *alloc_slab_page(struct kmem_cache *s,
/* Rollback: put the struct slab back. */
spin_lock_irqsave(&s->virtual.freed_slabs_lock, flags);
list_add(&slab->slab_list, freed_slabs);
+ WRITE_ONCE(s->virtual.nr_freed_pages,
+ s->virtual.nr_freed_pages + (1UL << slab_order(slab)));
spin_unlock_irqrestore(&s->virtual.freed_slabs_lock, flags);

return NULL;
@@ -2438,6 +2442,8 @@ static void slub_tlbflush_worker(struct kthread_work *work)
WARN_ON(oo_order(slab->oo) != oo_order(s->min));
list_add(&slab->slab_list, &s->virtual.freed_slabs_min);
}
+ WRITE_ONCE(s->virtual.nr_freed_pages, s->virtual.nr_freed_pages +
+ (1UL << slab_order(slab)));
spin_unlock(&s->virtual.freed_slabs_lock);
}
spin_unlock_irqrestore(&slub_kworker_lock, irq_flags);
@@ -4924,6 +4930,7 @@ static inline void slab_virtual_open(struct kmem_cache *s)
spin_lock_init(&s->virtual.freed_slabs_lock);
INIT_LIST_HEAD(&s->virtual.freed_slabs);
INIT_LIST_HEAD(&s->virtual.freed_slabs_min);
+ s->virtual.nr_freed_pages = 0;
#endif
}

@@ -6098,6 +6105,14 @@ static ssize_t objects_partial_show(struct kmem_cache *s, char *buf)
}
SLAB_ATTR_RO(objects_partial);

+#ifdef CONFIG_SLAB_VIRTUAL
+static ssize_t deallocated_pages_show(struct kmem_cache *s, char *buf)
+{
+ return sysfs_emit(buf, "%lu\n", READ_ONCE(s->virtual.nr_freed_pages));
+}
+SLAB_ATTR_RO(deallocated_pages);
+#endif /* CONFIG_SLAB_VIRTUAL */
+
static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
{
int objects = 0;
@@ -6424,6 +6439,9 @@ static struct attribute *slab_attrs[] = {
&min_partial_attr.attr,
&cpu_partial_attr.attr,
&objects_partial_attr.attr,
+#ifdef CONFIG_SLAB_VIRTUAL
+ &deallocated_pages_attr.attr,
+#endif
&partial_attr.attr,
&cpu_slabs_attr.attr,
&ctor_attr.attr,
--
2.42.0.459.ge4e396fd5e-goog

2023-09-15 23:39:04

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 10/14] x86: Create virtual memory region for SLUB

On 9/15/23 14:13, Kees Cook wrote:
> On Fri, Sep 15, 2023 at 10:59:29AM +0000, Matteo Rizzo wrote:
>> From: Jann Horn <[email protected]>
>>
>> SLAB_VIRTUAL reserves 512 GiB of virtual memory and uses them for both
>> struct slab and the actual slab memory. The pointers returned by
>> kmem_cache_alloc will point to this range of memory.
>
> I think the 512 GiB limit may be worth mentioning in the Kconfig help
> text.

Yes, please.

> And in the "640K is enough for everything" devil's advocacy, why is 512
> GiB enough here? Is there any greater risk of a pathological allocation
> pattern breaking a system any more (or less) than is currently possible?

I have the feeling folks just grabbed the first big-ish chunk they saw
free in the memory map and stole that one. Not a horrible approach,
mind you, but I have the feeling it didn't go through the most rigorous
sizing procedure. :)

My laptop memory is ~6% consumed by slab, 90% of which is reclaimable.
If a 64TB system had the same ratio, it would bump into this 512GB
limit. But it _should_ just reclaim things earlier rather than falling over.

That said, we still have gobs of actual vmalloc() space. It's ~30TiB in
size and I'm not aware of anyone consuming anywhere near that much. If
the 512GB fills up somehow, there are other places to steal the space.

One minor concern is that the virtual area is the same size on 4 and
5-level paging systems. It might be a good idea to pick one of the
holes that actually gets bigger on 5-level systems.

2023-09-16 00:58:53

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC PATCH 09/14] mm/slub: add the slab freelists to kmem_cache

On Fri, Sep 15, 2023 at 10:59:28AM +0000, Matteo Rizzo wrote:
> From: Jann Horn <[email protected]>
>
> With SLAB_VIRTUAL enabled, unused slabs which still have virtual memory
> allocated to them but no physical memory are kept in a per-cache list so
> that they can be reused later if the cache needs to grow again.
>
> Signed-off-by: Jann Horn <[email protected]>

Looks appropriately #ifdef'ed...

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2023-09-16 00:59:47

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC PATCH 01/14] mm/slub: don't try to dereference invalid freepointers

On Fri, Sep 15, 2023 at 10:59:20AM +0000, Matteo Rizzo wrote:
> slab_free_freelist_hook tries to read a freelist pointer from the
> current object even when freeing a single object. This is invalid
> because single objects don't actually contain a freelist pointer when
> they're freed and the memory contains other data. This causes problems
> for checking the integrity of the freelist in get_freepointer.
>
> Signed-off-by: Matteo Rizzo <[email protected]>
> ---
> mm/slub.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index f7940048138c..a7dae207c2d2 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1820,7 +1820,9 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s,
>
> do {
> object = next;
> - next = get_freepointer(s, object);
> + /* Single objects don't actually contain a freepointer */
> + if (object != old_tail)
> + next = get_freepointer(s, object);
>
> /* If object's reuse doesn't have to be delayed */
> if (!slab_free_hook(s, object, slab_want_init_on_free(s))) {

I find this loop's logic a bit weird, but, yes, this ends up being
correct and avoids needless work.

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2023-09-16 01:57:59

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC PATCH 04/14] mm: use virt_to_slab instead of folio_slab

On Fri, Sep 15, 2023 at 10:59:23AM +0000, Matteo Rizzo wrote:
> From: Jann Horn <[email protected]>
>
> This is refactoring in preparation for the introduction of SLAB_VIRTUAL
> which does not implement folio_slab.
>
> With SLAB_VIRTUAL there is no longer a 1:1 correspondence between slabs
> and pages of physical memory used by the slab allocator. There is no way
> to look up the slab which corresponds to a specific page of physical
> memory without iterating over all slabs or over the page tables. Instead
> of doing that, we can look up the slab starting from its virtual address
> which can still be performed cheaply with both SLAB_VIRTUAL enabled and
> disabled.
>
> Signed-off-by: Jann Horn <[email protected]>

Refactoring continues to track.

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2023-09-16 02:52:03

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC PATCH 13/14] mm/slub: sanity-check freepointers

On Fri, Sep 15, 2023 at 10:59:32AM +0000, Matteo Rizzo wrote:
> From: Jann Horn <[email protected]>
>
> Sanity-check that:
> - non-NULL freepointers point into the slab
> - freepointers look plausibly aligned
>
> Signed-off-by: Jann Horn <[email protected]>
> Co-developed-by: Matteo Rizzo <[email protected]>
> Signed-off-by: Matteo Rizzo <[email protected]>
> ---
> lib/slub_kunit.c | 4 ++++
> mm/slab.h | 8 +++++++
> mm/slub.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 69 insertions(+)
>
> diff --git a/lib/slub_kunit.c b/lib/slub_kunit.c
> index d4a3730b08fa..acf8600bd1fd 100644
> --- a/lib/slub_kunit.c
> +++ b/lib/slub_kunit.c
> @@ -45,6 +45,10 @@ static void test_clobber_zone(struct kunit *test)
> #ifndef CONFIG_KASAN
> static void test_next_pointer(struct kunit *test)
> {
> + if (IS_ENABLED(CONFIG_SLAB_VIRTUAL))
> + kunit_skip(test,
> + "incompatible with freepointer corruption detection in CONFIG_SLAB_VIRTUAL");
> +
> struct kmem_cache *s = test_kmem_cache_create("TestSlub_next_ptr_free",
> 64, SLAB_POISON);
> u8 *p = kmem_cache_alloc(s, GFP_KERNEL);
> diff --git a/mm/slab.h b/mm/slab.h
> index 460c802924bd..8d10a011bdf0 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -79,6 +79,14 @@ struct slab {
>
> struct list_head flush_list_elem;
>
> + /*
> + * Not in kmem_cache because it depends on whether the allocation is
> + * normal order or fallback order.
> + * an alternative might be to over-allocate virtual memory for
> + * fallback-order pages.
> + */
> + unsigned long align_mask;
> +
> /* Replaces the page lock */
> spinlock_t slab_lock;
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 0f7f5bf0b174..57474c8a6569 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -392,6 +392,44 @@ static inline freeptr_t freelist_ptr_encode(const struct kmem_cache *s,
> return (freeptr_t){.v = encoded};
> }
>
> +/*
> + * Does some validation of freelist pointers. Without SLAB_VIRTUAL this is
> + * currently a no-op.
> + */
> +static inline bool freelist_pointer_corrupted(struct slab *slab, freeptr_t ptr,
> + void *decoded)
> +{
> +#ifdef CONFIG_SLAB_VIRTUAL
> + /*
> + * If the freepointer decodes to 0, use 0 as the slab_base so that
> + * the check below always passes (0 & slab->align_mask == 0).
> + */
> + unsigned long slab_base = decoded ? (unsigned long)slab_to_virt(slab)
> + : 0;
> +
> + /*
> + * This verifies that the SLUB freepointer does not point outside the
> + * slab. Since at that point we can basically do it for free, it also
> + * checks that the pointer alignment looks vaguely sane.
> + * However, we probably don't want the cost of a proper division here,
> + * so instead we just do a cheap check whether the bottom bits that are
> + * clear in the size are also clear in the pointer.
> + * So for kmalloc-32, it does a perfect alignment check, but for
> + * kmalloc-192, it just checks that the pointer is a multiple of 32.
> + * This should probably be reconsidered - is this a good tradeoff, or
> + * should that part be thrown out, or do we want a proper accurate
> + * alignment check (and can we make it work with acceptable performance
> + * cost compared to the security improvement - probably not)?

Is it really that much more expensive to check the alignment exactly?

> + */
> + return CHECK_DATA_CORRUPTION(
> + ((unsigned long)decoded & slab->align_mask) != slab_base,
> + "bad freeptr (encoded %lx, ptr %p, base %lx, mask %lx",
> + ptr.v, decoded, slab_base, slab->align_mask);
> +#else
> + return false;
> +#endif
> +}
> +
> static inline void *freelist_ptr_decode(const struct kmem_cache *s,
> freeptr_t ptr, unsigned long ptr_addr,
> struct slab *slab)
> @@ -403,6 +441,10 @@ static inline void *freelist_ptr_decode(const struct kmem_cache *s,
> #else
> decoded = (void *)ptr.v;
> #endif
> +
> + if (unlikely(freelist_pointer_corrupted(slab, ptr, decoded)))
> + return NULL;
> +
> return decoded;
> }
>
> @@ -2122,6 +2164,21 @@ static struct slab *get_free_slab(struct kmem_cache *s,
> if (slab == NULL)
> return NULL;
>
> + /*
> + * Bits that must be equal to start-of-slab address for all
> + * objects inside the slab.
> + * For compatibility with pointer tagging (like in HWASAN), this would
> + * need to clear the pointer tag bits from the mask.
> + */
> + slab->align_mask = ~((PAGE_SIZE << oo_order(oo)) - 1);
> +
> + /*
> + * Object alignment bits (must be zero, which is equal to the bits in
> + * the start-of-slab address)
> + */
> + if (s->red_left_pad == 0)
> + slab->align_mask |= (1 << (ffs(s->size) - 1)) - 1;
> +
> return slab;
> }
>
> --
> 2.42.0.459.ge4e396fd5e-goog
>

We can improve the sanity checking in the future, so as-is, sure:

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2023-09-16 02:55:40

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC PATCH 14/14] security: add documentation for SLAB_VIRTUAL

On Fri, Sep 15, 2023 at 10:59:33AM +0000, Matteo Rizzo wrote:
> From: Jann Horn <[email protected]>
>
> Document what SLAB_VIRTUAL is trying to do, how it's implemented, and
> why.
>
> Signed-off-by: Jann Horn <[email protected]>
> Co-developed-by: Matteo Rizzo <[email protected]>
> Signed-off-by: Matteo Rizzo <[email protected]>
> ---
> Documentation/security/self-protection.rst | 102 +++++++++++++++++++++
> 1 file changed, 102 insertions(+)
>
> diff --git a/Documentation/security/self-protection.rst b/Documentation/security/self-protection.rst
> index 910668e665cb..5a5e99e3f244 100644
> --- a/Documentation/security/self-protection.rst
> +++ b/Documentation/security/self-protection.rst
> @@ -314,3 +314,105 @@ To help kill classes of bugs that result in kernel addresses being
> written to userspace, the destination of writes needs to be tracked. If
> the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
> it should automatically censor sensitive values.
> +
> +
> +Memory Allocator Mitigations
> +============================
> +
> +Protection against cross-cache attacks (SLAB_VIRTUAL)
> +-----------------------------------------------------
> +
> +SLAB_VIRTUAL is a mitigation that deterministically prevents cross-cache
> +attacks.
> +
> +Linux Kernel use-after-free vulnerabilities are commonly exploited by turning
> +them into an object type confusion (having two active pointers of different
> +types to the same memory location) using one of the following techniques:
> +
> +1. Direct object reuse: make the kernel give the victim object back to the slab
> + allocator, then allocate the object again from the same slab cache as a
> + different type. This is only possible if the victim object resides in a slab
> + cache which can contain objects of different types - for example one of the
> + kmalloc caches.
> +2. "Cross-cache attack": make the kernel give the victim object back to the slab
> + allocator, then make the slab allocator give the page containing the object
> + back to the page allocator, then either allocate the page directly as some
> + other type of page or make the slab allocator allocate it again for a
> + different slab cache and allocate an object from there.

I feel like adding a link to
https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html
would be nice here, as some folks reading this may not understand how
plausible the second attack can be. :)

> +
> +In either case, the important part is that the same virtual address is reused
> +for two objects of different types.
> +
> +The first case can be addressed by separating objects of different types
> +into different slab caches. If a slab cache only contains objects of the
> +same type then directly turning a use-after-free into a type confusion is
> +impossible as long as the slab page that contains the victim object remains
> +assigned to that slab cache. This type of mitigation is easily bypassable
> +by cross-cache attacks: if the attacker can make the slab allocator return
> +the page containing the victim object to the page allocator and then make
> +it use the same page for a different slab cache, type confusion becomes
> +possible again. Addressing the first case is therefore only worthwhile if
> +cross-cache attacks are also addressed. AUTOSLAB uses a combination of

I think you mean CONFIG_RANDOM_KMALLOC_CACHES, not AUTOSLAB which isn't
upstream.

> +probabilistic mitigations for this. SLAB_VIRTUAL addresses the second case
> +deterministically by changing the way the slab allocator allocates memory.
> +
> +Preventing slab virtual address reuse
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +In theory there is an easy fix against cross-cache attacks: modify the slab
> +allocator so that it never gives memory back to the page allocator. In practice
> +this would be problematic because physical memory remains permanently assigned
> +to a slab cache even if it doesn't contain any active objects. A viable
> +cross-cache mitigation must allow the system to reclaim unused physical memory.
> +In the current design of the slab allocator there is no way
> +to keep a region of virtual memory permanently assigned to a slab cache without
> +also permanently reserving physical memory. That is because the virtual
> +addresses that the slab allocator uses come from the linear map region, where
> +there is a 1:1 correspondence between virtual and physical addresses.
> +
> +SLAB_VIRTUAL's solution is to create a dedicated virtual memory region that is
> +only used for slab memory, and to enforce that once a range of virtual addresses
> +is used for a slab cache, it is never reused for any other caches. Using a
> +dedicated region of virtual memory lets us reserve ranges of virtual addresses
> +to prevent cross-cache attacks and at the same time release physical memory back
> +to the system when it's no longer needed. This is what Chromium's PartitionAlloc
> +does in userspace
> +(https://chromium.googlesource.com/chromium/src/+/354da2514b31df2aa14291199a567e10a7671621/base/allocator/partition_allocator/PartitionAlloc.md).
> +
> +Implementation
> +~~~~~~~~~~~~~~
> +
> +SLAB_VIRTUAL reserves a region of virtual memory for the slab allocator. All
> +pointers returned by the slab allocator point to this region. The region is
> +statically partitioned in two sub-regions: the metadata region and the data
> +region. The data region is where the actual objects are allocated from. The
> +metadata region is an array of struct slab objects, one for each PAGE_SIZE bytes
> +in the data region.
> +Without SLAB_VIRTUAL, struct slab is overlaid on top of the struct page/struct
> +folio that corresponds to the physical memory page backing the slab instead of
> +using a dedicated memory region. This doesn't work for SLAB_VIRTUAL, which needs
> +to store metadata for slabs even when no physical memory is allocated to them.
> +Having an array of struct slab lets us implement virt_to_slab efficiently purely
> +with arithmetic. In order to support high-order slabs, the struct slabs
> +corresponding to tail pages contain a pointer to the head slab, which
> +corresponds to the slab's head page.
> +
> +TLB flushing
> +~~~~~~~~~~~~
> +
> +Before it can release a page of physical memory back to the page allocator, the
> +slab allocator must flush the TLB entries for that page on all CPUs. This is not
> +only necessary for the mitigation to work reliably but it's also required for
> +correctness. Without a TLB flush some CPUs might continue using the old mapping
> +if the virtual address range is reused for a new slab and cause memory
> +corruption even in the absence of other bugs. The slab allocator can release
> +pages in contexts where TLB flushes can't be performed (e.g. in hardware
> +interrupt handlers). Pages to free are not freed directly, and instead they are
> +put on a queue and freed from a workqueue context which also flushes the TLB.
> +
> +Performance
> +~~~~~~~~~~~
> +
> +SLAB_VIRTUAL's performance impact depends on the workload. On kernel compilation
> +(kernbench) the slowdown is about 1-2% depending on the machine type and is
> +slightly worse on machines with more cores.

Is there anything that can be added to the docs about future work, areas
of improvement, etc?

-Kees

--
Kees Cook

2023-09-16 03:16:25

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC PATCH 10/14] x86: Create virtual memory region for SLUB

On Fri, Sep 15, 2023 at 10:59:29AM +0000, Matteo Rizzo wrote:
> From: Jann Horn <[email protected]>
>
> SLAB_VIRTUAL reserves 512 GiB of virtual memory and uses them for both
> struct slab and the actual slab memory. The pointers returned by
> kmem_cache_alloc will point to this range of memory.

I think the 512 GiB limit may be worth mentioning in the Kconfig help
text.

And in the "640K is enough for everything" devil's advocacy, why is 512
GiB enough here? Is there any greater risk of a pathological allocation
pattern breaking a system any more (or less) than is currently possible?
>
> Signed-off-by: Jann Horn <[email protected]>

But, yes, I'm still a fan, and I think it interacts well here with the
rest of the KASLR initialization:

Reviewed-by: Kees Cook <[email protected]>

Have you tried to make this work on arm64? I imagine it should be
roughly as easy?

--
Kees Cook

2023-09-16 04:24:42

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC PATCH 02/14] mm/slub: add is_slab_addr/is_slab_page helpers

On Fri, Sep 15, 2023 at 10:59:21AM +0000, Matteo Rizzo wrote:
> From: Jann Horn <[email protected]>
>
> This is refactoring in preparation for adding two different
> implementations (for SLAB_VIRTUAL enabled and disabled).
>
> virt_to_folio(x) expands to _compound_head(virt_to_page(x)) and
> virt_to_head_page(x) also expands to _compound_head(virt_to_page(x))
>
> so PageSlab(virt_to_head_page(res)) should be equivalent to
> is_slab_addr(res).

Perhaps add a note that redundant calls to virt_to_folio() will be
removed in following patches?

>
> Signed-off-by: Jann Horn <[email protected]>
> Co-developed-by: Matteo Rizzo <[email protected]>
> Signed-off-by: Matteo Rizzo <[email protected]>
> ---
> include/linux/slab.h | 1 +
> kernel/resource.c | 2 +-
> mm/slab.h | 9 +++++++++
> mm/slab_common.c | 5 ++---
> mm/slub.c | 6 +++---
> 5 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 8228d1276a2f..a2d82010d269 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -793,4 +793,5 @@ int slab_dead_cpu(unsigned int cpu);
> #define slab_dead_cpu NULL
> #endif
>
> +#define is_slab_addr(addr) folio_test_slab(virt_to_folio(addr))
> #endif /* _LINUX_SLAB_H */
> diff --git a/kernel/resource.c b/kernel/resource.c
> index b1763b2fd7ef..c829e5f97292 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -158,7 +158,7 @@ static void free_resource(struct resource *res)
> * buddy and trying to be smart and reusing them eventually in
> * alloc_resource() overcomplicates resource handling.
> */
> - if (res && PageSlab(virt_to_head_page(res)))
> + if (res && is_slab_addr(res))
> kfree(res);
> }
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 799a315695c6..25e41dd6087e 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -169,6 +169,15 @@ static_assert(IS_ALIGNED(offsetof(struct slab, freelist), sizeof(freelist_aba_t)
> */
> #define slab_page(s) folio_page(slab_folio(s), 0)
>
> +/**
> + * is_slab_page - Checks if a page is really a slab page
> + * @s: The slab
> + *
> + * Checks if s points to a slab page.
> + *
> + * Return: true if s points to a slab and false otherwise.
> + */
> +#define is_slab_page(s) folio_test_slab(slab_folio(s))
> /*
> * If network-based swap is enabled, sl*b must keep track of whether pages
> * were allocated from pfmemalloc reserves.
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index e99e821065c3..79102d24f099 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1063,7 +1063,7 @@ void kfree(const void *object)
> return;
>
> folio = virt_to_folio(object);
> - if (unlikely(!folio_test_slab(folio))) {
> + if (unlikely(!is_slab_addr(object))) {
> free_large_kmalloc(folio, (void *)object);
> return;
> }
> @@ -1094,8 +1094,7 @@ size_t __ksize(const void *object)
> return 0;
>
> folio = virt_to_folio(object);
> -
> - if (unlikely(!folio_test_slab(folio))) {
> + if (unlikely(!is_slab_addr(object))) {
> if (WARN_ON(folio_size(folio) <= KMALLOC_MAX_CACHE_SIZE))
> return 0;
> if (WARN_ON(object != folio_address(folio)))

In the above 2 hunks we're doing virt_to_folio() twice, but I see in
patch 4 that these go away.

> diff --git a/mm/slub.c b/mm/slub.c
> index a7dae207c2d2..b69916ab7aa8 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1259,7 +1259,7 @@ static int check_slab(struct kmem_cache *s, struct slab *slab)
> {
> int maxobj;
>
> - if (!folio_test_slab(slab_folio(slab))) {
> + if (!is_slab_page(slab)) {
> slab_err(s, slab, "Not a valid slab page");
> return 0;
> }
> @@ -1454,7 +1454,7 @@ static noinline bool alloc_debug_processing(struct kmem_cache *s,
> return true;
>
> bad:
> - if (folio_test_slab(slab_folio(slab))) {
> + if (is_slab_page(slab)) {
> /*
> * If this is a slab page then lets do the best we can
> * to avoid issues in the future. Marking all objects
> @@ -1484,7 +1484,7 @@ static inline int free_consistency_checks(struct kmem_cache *s,
> return 0;
>
> if (unlikely(s != slab->slab_cache)) {
> - if (!folio_test_slab(slab_folio(slab))) {
> + if (!is_slab_page(slab)) {
> slab_err(s, slab, "Attempt to free object(0x%p) outside of slab",
> object);
> } else if (!slab->slab_cache) {
> --
> 2.42.0.459.ge4e396fd5e-goog

This all looks nice and mechanical. :P

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2023-09-16 04:48:19

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC PATCH 11/14] mm/slub: allocate slabs from virtual memory

On Fri, Sep 15, 2023 at 10:59:30AM +0000, Matteo Rizzo wrote:
> From: Jann Horn <[email protected]>
>
> This is the main implementation of SLAB_VIRTUAL. With SLAB_VIRTUAL
> enabled, slab memory is not allocated from the linear map but from a
> dedicated region of virtual memory. The code ensures that once a range
> of virtual addresses is assigned to a slab cache, that virtual memory is
> never reused again except for other slabs in that same cache. This lets
> us mitigate some exploits for use-after-free vulnerabilities where the
> attacker makes SLUB release a slab page to the page allocator and then
> makes it reuse that same page for a different slab cache ("cross-cache
> attacks").
>
> With SLAB_VIRTUAL enabled struct slab no longer overlaps struct page but
> instead it is allocated from a dedicated region of virtual memory. This
> makes it possible to have references to slabs whose physical memory has
> been freed.
>
> SLAB_VIRTUAL has a small performance overhead, about 1-2% on kernel
> compilation time. We are using 4 KiB pages to map slab pages and slab
> metadata area, instead of the 2 MiB pages that the kernel uses to map
> the physmap. We experimented with a version of the patch that uses 2 MiB
> pages and we did see some performance improvement but the code also
> became much more complicated and ugly because we would need to allocate
> and free multiple slabs at once.

I think these hints about performance should be also noted in the
Kconfig help.

> In addition to the TLB contention, SLAB_VIRTUAL also adds new locks to
> the slow path of the allocator. Lock contention also contributes to the
> performance penalty to some extent, and this is more visible on machines
> with many CPUs.
>
> Signed-off-by: Jann Horn <[email protected]>
> Co-developed-by: Matteo Rizzo <[email protected]>
> Signed-off-by: Matteo Rizzo <[email protected]>
> ---
> arch/x86/include/asm/page_64.h | 10 +
> arch/x86/include/asm/pgtable_64_types.h | 5 +
> arch/x86/mm/physaddr.c | 10 +
> include/linux/slab.h | 7 +
> init/main.c | 1 +
> mm/slab.h | 106 ++++++
> mm/slab_common.c | 4 +
> mm/slub.c | 439 +++++++++++++++++++++++-
> mm/usercopy.c | 12 +-
> 9 files changed, 587 insertions(+), 7 deletions(-)

Much of this needs review by MM people -- I can't speak well to the
specifics of the implementation. On coding style, I wonder if we can get
away with reducing the amount of #ifdef code by either using "if
(IS_ENABLED(...)) { ... }" style code, or, in the case of the allocation
function, splitting it out into two separate files, one for the standard
page allocator and one for the new virt allocator. But, again, MM
preferences reign. :)
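
For anyone who hasn't seen the pattern, a rough sketch of what the
IS_ENABLED() style could look like here (purely illustrative;
alloc_slab_page_virtual() and alloc_slab_page_direct() are made-up names,
not helpers from this series):

	static struct slab *alloc_slab_page(struct kmem_cache *s, gfp_t flags,
					    int node, struct kmem_cache_order_objects oo)
	{
		/*
		 * IS_ENABLED() evaluates to a compile-time constant, so the
		 * compiler drops the dead branch entirely. Both helpers still
		 * have to at least compile in both configurations, which may
		 * need small stubs behind an #ifdef.
		 */
		if (IS_ENABLED(CONFIG_SLAB_VIRTUAL))
			return alloc_slab_page_virtual(s, flags, node, oo);

		return alloc_slab_page_direct(s, flags, node, oo);
	}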

--
Kees Cook

2023-09-16 05:52:07

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC PATCH 03/14] mm/slub: move kmem_cache_order_objects to slab.h

On Fri, Sep 15, 2023 at 10:59:22AM +0000, Matteo Rizzo wrote:
> From: Jann Horn <[email protected]>
>
> This is refactoring for SLAB_VIRTUAL. The implementation needs to know
> the order of the virtual memory region allocated to each slab to know
> how much physical memory to allocate when the slab is reused. We reuse
> kmem_cache_order_objects for this, so we have to move it before struct
> slab.
>
> Signed-off-by: Jann Horn <[email protected]>

Yay mechanical changes.

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2023-09-16 06:00:22

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC PATCH 05/14] mm/slub: create folio_set/clear_slab helpers

On Fri, Sep 15, 2023 at 10:59:24AM +0000, Matteo Rizzo wrote:
> From: Jann Horn <[email protected]>
>
> This is refactoring in preparation for SLAB_VIRTUAL. Extract this code
> to separate functions so that it's not duplicated in the code that
> allocates and frees pages with SLAB_VIRTUAL enabled.
>
> Signed-off-by: Jann Horn <[email protected]>
> Co-developed-by: Matteo Rizzo <[email protected]>
> Signed-off-by: Matteo Rizzo <[email protected]>
> ---
> mm/slub.c | 32 ++++++++++++++++++++++----------
> 1 file changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index ad33d9e1601d..9b87afade125 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1849,6 +1849,26 @@ static void *setup_object(struct kmem_cache *s, void *object)
> /*
> * Slab allocation and freeing
> */
> +
> +static void folio_set_slab(struct folio *folio, struct slab *slab)
> +{
> + __folio_set_slab(folio);
> + /* Make the flag visible before any changes to folio->mapping */
> + smp_wmb();
> +
> + if (folio_is_pfmemalloc(folio))
> + slab_set_pfmemalloc(slab);
> +}
> +
> +static void folio_clear_slab(struct folio *folio, struct slab *slab)
> +{
> + __slab_clear_pfmemalloc(slab);
> + folio->mapping = NULL;
> + /* Make the mapping reset visible before clearing the flag */
> + smp_wmb();
> + __folio_clear_slab(folio);
> +}

Perhaps these should be explicitly marked as inlines?

> +
> static inline struct slab *alloc_slab_page(gfp_t flags, int node,
> struct kmem_cache_order_objects oo)
> {
> @@ -1865,11 +1885,7 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
> return NULL;
>
> slab = folio_slab(folio);
> - __folio_set_slab(folio);
> - /* Make the flag visible before any changes to folio->mapping */
> - smp_wmb();
> - if (folio_is_pfmemalloc(folio))
> - slab_set_pfmemalloc(slab);
> + folio_set_slab(folio, slab);
>
> return slab;
> }
> @@ -2067,11 +2083,7 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
> int order = folio_order(folio);
> int pages = 1 << order;
>
> - __slab_clear_pfmemalloc(slab);
> - folio->mapping = NULL;
> - /* Make the mapping reset visible before clearing the flag */
> - smp_wmb();
> - __folio_clear_slab(folio);
> + folio_clear_slab(folio, slab);
> mm_account_reclaimed_pages(pages);
> unaccount_slab(slab, order, s);
> __free_pages(&folio->page, order);
> --
> 2.42.0.459.ge4e396fd5e-goog

Otherwise this is a straight function extraction.

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2023-09-16 13:14:23

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 11/14] mm/slub: allocate slabs from virtual memory

On 9/15/23 03:59, Matteo Rizzo wrote:
> + spin_lock_irqsave(&slub_kworker_lock, irq_flags);
> + list_splice_init(&slub_tlbflush_queue, &local_queue);
> + list_for_each_entry(slab, &local_queue, flush_list_elem) {
> + unsigned long start = (unsigned long)slab_to_virt(slab);
> + unsigned long end = start + PAGE_SIZE *
> + (1UL << oo_order(slab->oo));
> +
> + if (start < addr_start)
> + addr_start = start;
> + if (end > addr_end)
> + addr_end = end;
> + }
> + spin_unlock_irqrestore(&slub_kworker_lock, irq_flags);
> +
> + if (addr_start < addr_end)
> + flush_tlb_kernel_range(addr_start, addr_end);

I assume that the TLB flushes in the queue are going to be pretty sparse
on average.

At least on x86, flush_tlb_kernel_range() falls back pretty quickly from
individual address invalidation to just doing a full flush. It might
not even be worth tracking the address ranges; just do a full flush
every time.

I'd be really curious to see how often actual ranged flushes are
triggered from this code. I expect it would be very near zero.
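
For reference, the simpler variant would boil down to something like this
(a sketch only, reusing slub_kworker_lock and slub_tlbflush_queue from the
quoted hunk; freeing the spliced slabs would still follow as in the patch):

	spin_lock_irqsave(&slub_kworker_lock, irq_flags);
	list_splice_init(&slub_tlbflush_queue, &local_queue);
	spin_unlock_irqrestore(&slub_kworker_lock, irq_flags);

	/* No range tracking: one global flush covers every queued slab. */
	if (!list_empty(&local_queue))
		flush_tlb_all();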

2023-09-18 13:41:56

by Matteo Rizzo

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

On Fri, 15 Sept 2023 at 18:30, Lameter, Christopher
<[email protected]> wrote:
>
> On Fri, 15 Sep 2023, Dave Hansen wrote:
>
> > What's the cost?
>
> The only thing that I see is 1-2% on kernel compilations (and "more on
> machines with lots of cores")?

I used kernel compilation time (wall clock time) as a benchmark while
preparing the series. Lower is better.

Intel Skylake, 112 cores:

LABEL | COUNT | MIN | MAX | MEAN | MEDIAN | STDDEV
---------------+-------+---------+---------+---------+---------+--------
SLAB_VIRTUAL=n | 150 | 49.700s | 51.320s | 50.449s | 50.430s | 0.29959
SLAB_VIRTUAL=y | 150 | 50.020s | 51.660s | 50.880s | 50.880s | 0.30495
| | +0.64% | +0.66% | +0.85% | +0.89% | +1.79%

AMD Milan, 256 cores:

LABEL | COUNT | MIN | MAX | MEAN | MEDIAN | STDDEV
---------------+-------+---------+---------+---------+---------+--------
SLAB_VIRTUAL=n | 150 | 25.480s | 26.550s | 26.065s | 26.055s | 0.23495
SLAB_VIRTUAL=y | 150 | 25.820s | 27.080s | 26.531s | 26.540s | 0.25974
| | +1.33% | +2.00% | +1.79% | +1.86% | +10.55%

Are there any specific benchmarks that you would be interested in seeing or
that are usually used for SLUB?

> Problems:
>
> - Overhead due to more TLB lookups
>
> - Larger amounts of TLBs are used for the OS. Currently we are trying to
> use the maximum mappable TLBs to reduce their numbers. This presumably
> means using 4K TLBs for all slab access.

Yes, we are using 4K pages for the slab mappings, which is going to increase
TLB pressure. I also tried writing a version of the patch that uses 2M
pages, which had slightly better performance, but that had its own problems.
For example, most slabs are much smaller than 2M, so we would need to create
and map multiple slabs at once, and we wouldn't be able to release the
physical memory until all slabs in the 2M page are unused, which increases
fragmentation.

> - Memory may not be physically contiguous which may be required by some
> drivers doing DMA.

In the current implementation each slab is backed by physically contiguous
memory, but different slabs that are adjacent in virtual memory might not
be physically contiguous. Treating objects allocated from two different
slabs as one contiguous chunk of memory is probably wrong anyway, right?

--
Matteo

2023-09-18 15:12:52

by Matteo Rizzo

[permalink] [raw]
Subject: Re: [RFC PATCH 10/14] x86: Create virtual memory region for SLUB

On Fri, 15 Sept 2023 at 23:50, Dave Hansen <[email protected]> wrote:
>
> I have the feeling folks just grabbed the first big-ish chunk they saw
> free in the memory map and stole that one. Not a horrible approach,
> mind you, but I have the feeling it didn't go through the most rigorous
> sizing procedure. :)
>
> My laptop memory is ~6% consumed by slab, 90% of which is reclaimable.
> If a 64TB system had the same ratio, it would bump into this 512GB
> limit. But it _should_ just reclaim things earlier rather than falling over.
>
> That said, we still have gobs of actual vmalloc() space. It's ~30TiB in
> size and I'm not aware of anyone consuming anywhere near that much. If
> the 512GB fills up somehow, there are other places to steal the space.
>
> One minor concern is that the virtual area is the same size on 4 and
> 5-level paging systems. It might be a good idea to pick one of the
> holes that actually gets bigger on 5-level systems.

One of the other ideas that we had was to use the KASAN shadow memory instead of
a dedicated area. As far as I know the KASAN region is not used by anything else
when KASAN is disabled, and I don't think it makes sense to have both KASAN and
SLAB_VIRTUAL enabled at the same time (see the patch which introduces the
Kconfig option for why). The KASAN region is 16 TiB on 4-level systems and 8 PiB
on 5-level, in both cases 1/16th the size of the address space.

Could that work?

--
Matteo

2023-09-18 18:43:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator


* Matteo Rizzo <[email protected]> wrote:

> On Fri, 15 Sept 2023 at 18:30, Lameter, Christopher
> <[email protected]> wrote:
> >
> > On Fri, 15 Sep 2023, Dave Hansen wrote:
> >
> > > What's the cost?
> >
> > The only thing that I see is 1-2% on kernel compilations (and "more on
> > machines with lots of cores")?
>
> I used kernel compilation time (wall clock time) as a benchmark while
> preparing the series. Lower is better.
>
> Intel Skylake, 112 cores:
>
> LABEL | COUNT | MIN | MAX | MEAN | MEDIAN | STDDEV
> ---------------+-------+---------+---------+---------+---------+--------
> SLAB_VIRTUAL=n | 150 | 49.700s | 51.320s | 50.449s | 50.430s | 0.29959
> SLAB_VIRTUAL=y | 150 | 50.020s | 51.660s | 50.880s | 50.880s | 0.30495
> | | +0.64% | +0.66% | +0.85% | +0.89% | +1.79%
>
> AMD Milan, 256 cores:
>
> LABEL | COUNT | MIN | MAX | MEAN | MEDIAN | STDDEV
> ---------------+-------+---------+---------+---------+---------+--------
> SLAB_VIRTUAL=n | 150 | 25.480s | 26.550s | 26.065s | 26.055s | 0.23495
> SLAB_VIRTUAL=y | 150 | 25.820s | 27.080s | 26.531s | 26.540s | 0.25974
> | | +1.33% | +2.00% | +1.79% | +1.86% | +10.55%

That's sadly a rather substantial overhead for a compiler/linker workload
that is dominantly user-space: a kernel build is about 90% user-time and
10% system-time:

$ perf stat --null make -j64 vmlinux
...

Performance counter stats for 'make -j64 vmlinux':

59.840704481 seconds time elapsed

2000.774537000 seconds user
219.138280000 seconds sys

What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
user-space execution and kernel-space execution?

Thanks,

Ingo

2023-09-18 23:19:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

On Mon, 18 Sept 2023 at 10:39, Ingo Molnar <[email protected]> wrote:
>
> What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
> user-space execution and kernel-space execution?

... and equally importantly, what about DMA?

Or what about the fixed-size slabs (aka kmalloc?) What's the point of
"never re-use the same address for a different slab", when the *same*
slab will contain different kinds of allocations anyway?

I think the whole "make it one single compile-time option" model is
completely and fundamentally broken.

Linus

2023-09-19 14:44:25

by Matteo Rizzo

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

On Mon, 18 Sept 2023 at 19:39, Ingo Molnar <[email protected]> wrote:
>
> What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
> user-space execution and kernel-space execution?
>

Same benchmark as before (compiling a kernel on a system running the patched
kernel):

Intel Skylake:

LABEL | COUNT | MIN | MAX | MEAN | MEDIAN | STDDEV
---------------+-------+----------+----------+----------+----------+--------
wall clock | | | | | |
SLAB_VIRTUAL=n | 150 | 49.700 | 51.320 | 50.449 | 50.430 | 0.29959
SLAB_VIRTUAL=y | 150 | 50.020 | 51.660 | 50.880 | 50.880 | 0.30495
| | +0.64% | +0.66% | +0.85% | +0.89% | +1.79%
system time | | | | | |
SLAB_VIRTUAL=n | 150 | 358.560 | 362.900 | 360.922 | 360.985 | 0.91761
SLAB_VIRTUAL=y | 150 | 362.970 | 367.970 | 366.062 | 366.115 | 1.015
| | +1.23% | +1.40% | +1.42% | +1.42% | +10.60%
user time | | | | | |
SLAB_VIRTUAL=n | 150 | 3110.000 | 3124.520 | 3118.143 | 3118.120 | 2.466
SLAB_VIRTUAL=y | 150 | 3115.070 | 3127.070 | 3120.762 | 3120.925 | 2.654
| | +0.16% | +0.08% | +0.08% | +0.09% | +7.63%

AMD Milan:

LABEL | COUNT | MIN | MAX | MEAN | MEDIAN | STDDEV
---------------+-------+----------+----------+----------+----------+--------
wall clock | | | | | |
SLAB_VIRTUAL=n | 150 | 25.480 | 26.550 | 26.065 | 26.055 | 0.23495
SLAB_VIRTUAL=y | 150 | 25.820 | 27.080 | 26.531 | 26.540 | 0.25974
| | +1.33% | +2.00% | +1.79% | +1.86% | +10.55%
system time | | | | | |
SLAB_VIRTUAL=n | 150 | 478.530 | 540.420 | 520.803 | 521.485 | 9.166
SLAB_VIRTUAL=y | 150 | 530.520 | 572.460 | 552.825 | 552.985 | 7.161
| | +10.86% | +5.93% | +6.15% | +6.04% | -21.88%
user time | | | | | |
SLAB_VIRTUAL=n | 150 | 2373.540 | 2403.800 | 2386.343 | 2385.840 | 5.325
SLAB_VIRTUAL=y | 150 | 2388.690 | 2426.290 | 2408.325 | 2408.895 | 6.667
| | +0.64% | +0.94% | +0.92% | +0.97% | +25.20%


I'm not exactly sure why user time increases by almost 1% on Milan; it could be
TLB contention.

--
Matteo

2023-09-19 18:00:41

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

On September 19, 2023 9:02:07 AM PDT, Dave Hansen <[email protected]> wrote:
>On 9/19/23 08:48, Matteo Rizzo wrote:
>>> I think the whole "make it one single compile-time option" model is
>>> completely and fundamentally broken.
>> Wouldn't making this toggleable at boot time or runtime make performance
>> even worse?
>
>Maybe.
>
>But you can tolerate even more of a performance impact from a feature if
>the people that don't care can actually disable it.
>
>There are also plenty of ways to minimize the overhead of switching it
>on and off at runtime. Static branches are your best friend here.

Let's start with a boot time on/off toggle (no per-slab, no switch on
out-of-space, etc). That should be sufficient for initial ease of use for
testing, etc. But yes, using static_branch will nicely DTRT here.
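
A minimal sketch of what that could look like (all names here are invented
for illustration and not taken from the series as posted):

	/* Boot-time opt-in, e.g. "slab_virtual" on the kernel command line. */
	static bool slab_virtual_requested __initdata;
	DEFINE_STATIC_KEY_FALSE(slab_virtual_enabled);

	static int __init setup_slab_virtual(char *str)
	{
		slab_virtual_requested = true;
		return 1;
	}
	__setup("slab_virtual", setup_slab_virtual);

	/* Flip the key once during slab init, after jump labels are usable. */
	void __init slab_virtual_init(void)
	{
		if (slab_virtual_requested)
			static_branch_enable(&slab_virtual_enabled);
	}

	static inline bool slab_virtual_active(void)
	{
		return static_branch_unlikely(&slab_virtual_enabled);
	}

Deferring static_branch_enable() to a dedicated init hook sidesteps any
ordering questions around flipping a static key from a __setup handler.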



--
Kees Cook

2023-09-19 23:00:13

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

On 9/19/23 08:48, Matteo Rizzo wrote:
>> I think the whole "make it one single compile-time option" model is
>> completely and fundamentally broken.
> Wouldn't making this toggleable at boot time or runtime make performance
> even worse?

Maybe.

But you can tolerate even more of a performance impact from a feature if
the people that don't care can actually disable it.

There are also plenty of ways to minimize the overhead of switching it
on and off at runtime. Static branches are your best friend here.

2023-09-20 00:07:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

On Tue, 19 Sept 2023 at 08:48, Matteo Rizzo <[email protected]> wrote:
>
> On Mon, 18 Sept 2023 at 20:05, Linus Torvalds
> <[email protected]> wrote:
> >
> > ... and equally importantly, what about DMA?
>
> I'm not exactly sure what you mean by this; I don't think this should
> affect the performance of DMA.

I was more worried about just basic correctness.

We've traditionally had a lot of issues with using virtual addresses
for dma, simply because we've got random drivers, and I'm not entirely
convinced that your "virt_to_phys()" update will catch it all.

IOW, even on x86-64 - which is hopefully better than most
architectures because it already has that double mapping issue - we
have things like

unsigned long paddr = (unsigned long)vaddr - __PAGE_OFFSET;

in other places than just the __phys_addr() code.

The one place I grepped for looks to be just boot-time AMD memory
encryption, so it wouldn't involve any slab allocation, but ...

Linus

2023-09-20 02:31:45

by Matteo Rizzo

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

On Mon, 18 Sept 2023 at 20:05, Linus Torvalds
<[email protected]> wrote:
>
> ... and equally importantly, what about DMA?

I'm not exactly sure what you mean by this; I don't think this should
affect the performance of DMA.

> Or what about the fixed-size slabs (aka kmalloc?) What's the point of
> "never re-use the same address for a different slab", when the *same*
> slab will contain different kinds of allocations anyway?

There are a number of patches out there (for example the random_kmalloc
series which recently got merged into v6.6) which attempt to segregate
kmalloc'd objects into different caches to make exploitation harder.
Another thing that we would like to have in the future is to segregate
objects by type (like XNU's kalloc_type
https://security.apple.com/blog/towards-the-next-generation-of-xnu-memory-safety/)
which makes exploiting use-after-free by type confusion much harder or
impossible.

All of these mitigations can be bypassed very easily if the attacker can
mount a cross-cache attack, which is what this series attempts to prevent.
This is not only theoretical, we've seen attackers use this all the time in
kCTF/kernelCTF submissions (for example
https://ruia-ruia.github.io/2022/08/05/CVE-2022-29582-io-uring/).

> I think the whole "make it one single compile-time option" model is
> completely and fundamentally broken.

Wouldn't making this toggleable at boot time or runtime make performance
even worse?

--
Matteo

2023-09-20 08:52:47

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

On 9/18/23 14:08, Matteo Rizzo wrote:
> On Fri, 15 Sept 2023 at 18:30, Lameter, Christopher
>> Problems:
>>
>> - Overhead due to more TLB lookups
>>
>> - Larger amounts of TLBs are used for the OS. Currently we are trying to
>> use the maximum mappable TLBs to reduce their numbers. This presumably
>> means using 4K TLBs for all slab access.
>
> Yes, we are using 4K pages for the slab mappings which is going to increase
> TLB pressure. I also tried writing a version of the patch that uses 2M
> pages which had slightly better performance, but that had its own problems.
> For example most slabs are much smaller than 2M, so we would need to create
> and map multiple slabs at once and we wouldn't be able to release the
> physical memory until all slabs in the 2M page are unused which increases
> fragmentation.

At last LSF/MM [1] we basically discarded direct map fragmentation
avoidance as solving something that turns out to be insignificant, with the
exception of kernel code. As kernel code is unlikely to be allocated from
kmem caches due to W^X, we can hopefully assume it's also insignificant for
the virtual slab area.

[1] https://lwn.net/Articles/931406/

2023-09-20 09:08:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator


* Matteo Rizzo <[email protected]> wrote:

> On Mon, 18 Sept 2023 at 19:39, Ingo Molnar <[email protected]> wrote:
> >
> > What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
> > user-space execution and kernel-space execution?
> >
>
> Same benchmark as before (compiling a kernel on a system running the patched
> kernel):
>
> Intel Skylake:
>
> LABEL | COUNT | MIN | MAX | MEAN | MEDIAN | STDDEV
> ---------------+-------+----------+----------+----------+----------+--------
> wall clock | | | | | |
> SLAB_VIRTUAL=n | 150 | 49.700 | 51.320 | 50.449 | 50.430 | 0.29959
> SLAB_VIRTUAL=y | 150 | 50.020 | 51.660 | 50.880 | 50.880 | 0.30495
> | | +0.64% | +0.66% | +0.85% | +0.89% | +1.79%
> system time | | | | | |
> SLAB_VIRTUAL=n | 150 | 358.560 | 362.900 | 360.922 | 360.985 | 0.91761
> SLAB_VIRTUAL=y | 150 | 362.970 | 367.970 | 366.062 | 366.115 | 1.015
> | | +1.23% | +1.40% | +1.42% | +1.42% | +10.60%
> user time | | | | | |
> SLAB_VIRTUAL=n | 150 | 3110.000 | 3124.520 | 3118.143 | 3118.120 | 2.466
> SLAB_VIRTUAL=y | 150 | 3115.070 | 3127.070 | 3120.762 | 3120.925 | 2.654
> | | +0.16% | +0.08% | +0.08% | +0.09% | +7.63%

These Skylake figures are a bit counter-intuitive: how does an increase of
only +0.08% in user time (which dominates 89.5% of execution), combined with
a +1.42% increase in system time (which consumes only 10.5% of CPU capacity),
result in a +0.85% increase in wall-clock time?
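
(As a back-of-the-envelope check using the means above: weighting each
component's slowdown by its share of CPU time gives roughly

    0.895 * 0.08% + 0.105 * 1.42%  ~=  0.07% + 0.15%  ~=  0.22%

which is well short of the observed +0.85% wall-clock increase, so the extra
wall-clock time is not explained by the CPU-time deltas alone.)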

There might be hidden factors at work in the DMA space, as Linus suggested?

Or perhaps wall-clock time is dominated by the single-threaded final link
time of the kernel, which phase might be disproportionately hurt by these
changes?

(Stddev seems low enough for this not to be a measurement artifact.)

The AMD Milan figures are more intuitive:

> AMD Milan:
>
> LABEL | COUNT | MIN | MAX | MEAN | MEDIAN | STDDEV
> ---------------+-------+----------+----------+----------+----------+--------
> wall clock | | | | | |
> SLAB_VIRTUAL=n | 150 | 25.480 | 26.550 | 26.065 | 26.055 | 0.23495
> SLAB_VIRTUAL=y | 150 | 25.820 | 27.080 | 26.531 | 26.540 | 0.25974
> | | +1.33% | +2.00% | +1.79% | +1.86% | +10.55%
> system time | | | | | |
> SLAB_VIRTUAL=n | 150 | 478.530 | 540.420 | 520.803 | 521.485 | 9.166
> SLAB_VIRTUAL=y | 150 | 530.520 | 572.460 | 552.825 | 552.985 | 7.161
> | | +10.86% | +5.93% | +6.15% | +6.04% | -21.88%
> user time | | | | | |
> SLAB_VIRTUAL=n | 150 | 2373.540 | 2403.800 | 2386.343 | 2385.840 | 5.325
> SLAB_VIRTUAL=y | 150 | 2388.690 | 2426.290 | 2408.325 | 2408.895 | 6.667
> | | +0.64% | +0.94% | +0.92% | +0.97% | +25.20%
>
>
> I'm not exactly sure why user time increases by almost 1% on Milan; it
> could be TLB contention.

The other worrying aspect is the +6.15% increase in system time ... which is
roughly in line with what we'd expect from a +1.79% increase in wall-clock
time.

Thanks,

Ingo

2023-09-20 10:08:39

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [RFC PATCH 14/14] security: add documentation for SLAB_VIRTUAL

On 9/15/23 12:59, Matteo Rizzo wrote:
> From: Jann Horn <[email protected]>
>
> Document what SLAB_VIRTUAL is trying to do, how it's implemented, and
> why.
>
> Signed-off-by: Jann Horn <[email protected]>
> Co-developed-by: Matteo Rizzo <[email protected]>
> Signed-off-by: Matteo Rizzo <[email protected]>
> ---
> Documentation/security/self-protection.rst | 102 +++++++++++++++++++++
> 1 file changed, 102 insertions(+)
>
> diff --git a/Documentation/security/self-protection.rst b/Documentation/security/self-protection.rst
> index 910668e665cb..5a5e99e3f244 100644
> --- a/Documentation/security/self-protection.rst
> +++ b/Documentation/security/self-protection.rst
> @@ -314,3 +314,105 @@ To help kill classes of bugs that result in kernel addresses being
> written to userspace, the destination of writes needs to be tracked. If
> the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
> it should automatically censor sensitive values.
> +
> +
> +Memory Allocator Mitigations
> +============================
> +
> +Protection against cross-cache attacks (SLAB_VIRTUAL)
> +-----------------------------------------------------
> +
> +SLAB_VIRTUAL is a mitigation that deterministically prevents cross-cache
> +attacks.
> +
> +Linux Kernel use-after-free vulnerabilities are commonly exploited by turning
> +them into an object type confusion (having two active pointers of different
> +types to the same memory location) using one of the following techniques:
> +
> +1. Direct object reuse: make the kernel give the victim object back to the slab
> + allocator, then allocate the object again from the same slab cache as a
> + different type. This is only possible if the victim object resides in a slab
> + cache which can contain objects of different types - for example one of the
> + kmalloc caches.
> +2. "Cross-cache attack": make the kernel give the victim object back to the slab
> + allocator, then make the slab allocator give the page containing the object
> + back to the page allocator, then either allocate the page directly as some
> + other type of page or make the slab allocator allocate it again for a
> + different slab cache and allocate an object from there.
> +
> +In either case, the important part is that the same virtual address is reused
> +for two objects of different types.

Hmm, but in the second technique's "either allocate the page directly" case,
it's not just the virtual address that matters, right? So we should be saying that
SLAB_VIRTUAL won't help with that case, but hopefully it's also less
common/harder to exploit? I'm not sure though - kmalloc() in SLUB will pass
anything larger than order-1 (8kB) directly to the page allocator via
kmalloc_large() so it will bypass both CONFIG_RANDOM_KMALLOC_CACHES and
SLAB_VIRTUAL, AFAICS?
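
(A simplified sketch of that bypass; the real check in include/linux/slab.h
is more elaborate, but the size cutoff is the relevant part:)

#include <linux/slab.h>

/*
 * Simplified: requests above KMALLOC_MAX_CACHE_SIZE (2 * PAGE_SIZE with
 * SLUB, i.e. anything bigger than order-1) never touch a kmem cache and go
 * straight to the page allocator, so per-cache mitigations never see them.
 */
static inline void *kmalloc_sketch(size_t size, gfp_t flags)
{
        if (size > KMALLOC_MAX_CACHE_SIZE)
                return kmalloc_large(size, flags);

        return __kmalloc(size, flags);
}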

> +The first case can be addressed by separating objects of different types
> +into different slab caches. If a slab cache only contains objects of the
> +same type then directly turning an use-after-free into a type confusion is
> +impossible as long as the slab page that contains the victim object remains
> +assigned to that slab cache. This type of mitigation is easily bypassable
> +by cross-cache attacks: if the attacker can make the slab allocator return
> +the page containing the victim object to the page allocator and then make
> +it use the same page for a different slab cache, type confusion becomes
> +possible again. Addressing the first case is therefore only worthwhile if
> +cross-cache attacks are also addressed. AUTOSLAB uses a combination of
> +probabilistic mitigations for this. SLAB_VIRTUAL addresses the second case

As Kees mentioned - I don't think you're talking about
CONFIG_RANDOM_KMALLOC_CACHES here, but it should be mentioned. A comparison of
the combination of both against AUTOSLAB could also be interesting, as long as
it's clarified that AUTOSLAB is not mainline, so unaware readers are not confused.

> +deterministically by changing the way the slab allocator allocates memory.
> +
> +Preventing slab virtual address reuse
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +In theory there is an easy fix against cross-cache attacks: modify the slab
> +allocator so that it never gives memory back to the page allocator. In practice
> +this would be problematic because physical memory remains permanently assigned
> +to a slab cache even if it doesn't contain any active objects. A viable
> +cross-cache mitigation must allow the system to reclaim unused physical memory.
> +In the current design of the slab allocator there is no way
> +to keep a region of virtual memory permanently assigned to a slab cache without
> +also permanently reserving physical memory. That is because the virtual
> +addresses that the slab allocator uses come from the linear map region, where
> +there is a 1:1 correspondence between virtual and physical addresses.
> +
> +SLAB_VIRTUAL's solution is to create a dedicated virtual memory region that is
> +only used for slab memory, and to enforce that once a range of virtual addresses
> +is used for a slab cache, it is never reused for any other caches. Using a
> +dedicated region of virtual memory lets us reserve ranges of virtual addresses
> +to prevent cross-cache attacks and at the same time release physical memory back
> +to the system when it's no longer needed. This is what Chromium's PartitionAlloc
> +does in userspace
> +(https://chromium.googlesource.com/chromium/src/+/354da2514b31df2aa14291199a567e10a7671621/base/allocator/partition_allocator/PartitionAlloc.md).
> +
> +Implementation
> +~~~~~~~~~~~~~~
> +
> +SLAB_VIRTUAL reserves a region of virtual memory for the slab allocator. All
> +pointers returned by the slab allocator point to this region. The region is
> +statically partitioned in two sub-regions: the metadata region and the data
> +region. The data region is where the actual objects are allocated from. The
> +metadata region is an array of struct slab objects, one for each PAGE_SIZE bytes
> +in the data region.
> +Without SLAB_VIRTUAL, struct slab is overlaid on top of the struct page/struct
> +folio that corresponds to the physical memory page backing the slab instead of
> +using a dedicated memory region. This doesn't work for SLAB_VIRTUAL, which needs
> +to store metadata for slabs even when no physical memory is allocated to them.
> +Having an array of struct slab lets us implement virt_to_slab efficiently purely
> +with arithmetic. In order to support high-order slabs, the struct slabs
> +corresponding to tail pages contain a pointer to the head slab, which
> +corresponds to the slab's head page.

I think ideally we should be using the folio code to get from tail pages to
a folio and then a single struct slab - that would be more in line with how
Matthew envisions the future of the struct page array, AFAIK. Unless it's
significantly slower, which would be a shame :/
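
(For reference, a minimal sketch of the arithmetic-only lookup the quoted
text describes; slab_data_base and slab_meta are placeholder names, and the
high-order/tail handling is reduced to a comment:)

#include <linux/mm.h>

/* Placeholders for the data region base and the struct slab metadata array. */
extern unsigned long slab_data_base;
extern struct slab *slab_meta;

static inline struct slab *virt_to_slab_sketch(const void *addr)
{
        unsigned long idx = ((unsigned long)addr - slab_data_base) >> PAGE_SHIFT;

        /*
         * One struct slab per PAGE_SIZE of the data region; for high-order
         * slabs the entries for tail pages would point back to the head slab.
         */
        return &slab_meta[idx];
}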

> +
> +TLB flushing
> +~~~~~~~~~~~~
> +
> +Before it can release a page of physical memory back to the page allocator, the
> +slab allocator must flush the TLB entries for that page on all CPUs. This is not
> +only necessary for the mitigation to work reliably but it's also required for
> +correctness. Without a TLB flush some CPUs might continue using the old mapping
> +if the virtual address range is reused for a new slab and cause memory
> +corruption even in the absence of other bugs. The slab allocator can release
> +pages in contexts where TLB flushes can't be performed (e.g. in hardware
> +interrupt handlers). Pages to free are not freed directly, and instead they are
> +put on a queue and freed from a workqueue context which also flushes the TLB.
> +
> +Performance
> +~~~~~~~~~~~
> +
> +SLAB_VIRTUAL's performance impact depends on the workload. On kernel compilation
> +(kernbench) the slowdown is about 1-2% depending on the machine type and is
> +slightly worse on machines with more cores.
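
(To make the quoted "TLB flushing" section a bit more concrete, a rough
sketch of the deferred-free pattern it describes; the identifiers and the
way the flush range is tracked are illustrative, not the series' actual
code:)

#include <linux/list.h>
#include <linux/minmax.h>
#include <linux/mm.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>
#include <asm/tlbflush.h>

static LIST_HEAD(slub_free_queue);
static DEFINE_SPINLOCK(slub_free_lock);
static unsigned long flush_start, flush_end;    /* range covering queued slabs */

static void slub_tlbflush_worker(struct work_struct *work)
{
        LIST_HEAD(local);
        struct page *page, *next;
        unsigned long start, end;

        spin_lock_irq(&slub_free_lock);
        list_splice_init(&slub_free_queue, &local);
        start = flush_start;
        end = flush_end;
        flush_start = flush_end = 0;
        spin_unlock_irq(&slub_free_lock);

        /* No CPU may keep a stale mapping once the memory is reused. */
        if (start < end)
                flush_tlb_kernel_range(start, end);

        list_for_each_entry_safe(page, next, &local, lru) {
                list_del(&page->lru);
                __free_page(page);
        }
}
static DECLARE_WORK(slub_tlbflush_work, slub_tlbflush_worker);

/* Callable from contexts that cannot flush the TLB themselves. */
static void queue_slab_pages_for_free(struct page *page,
                                      unsigned long vstart, unsigned long vend)
{
        unsigned long flags;

        spin_lock_irqsave(&slub_free_lock, flags);
        list_add(&page->lru, &slub_free_queue);
        flush_start = flush_start ? min(flush_start, vstart) : vstart;
        flush_end = max(flush_end, vend);
        spin_unlock_irqrestore(&slub_free_lock, flags);

        schedule_work(&slub_tlbflush_work);
}

A single queue and worker like this also explains why lock contention and the
cost of the kernel-range flushes are the natural things to benchmark under
memory pressure.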

2023-09-20 20:08:25

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

On 9/19/23 06:42, Matteo Rizzo wrote:
> On Mon, 18 Sept 2023 at 19:39, Ingo Molnar <[email protected]> wrote:
>> What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
>> user-space execution and kernel-space execution?
>>
> Same benchmark as before (compiling a kernel on a system running the patched
> kernel):

Thanks for running those. One more situation that comes to mind is how
this will act under memory pressure. Will some memory pressure make
contention on 'slub_kworker_lock' visible or make the global TLB flushes
less bearable?

In any case, none of this looks _catastrophic_. It's surely a cost that
some folks will pay.

But I really do think it needs to be more dynamic. There are a _couple_
of reasons for this. If it's only a compile-time option, it's never
going to get turned on except for maybe ChromeOS and the datacenter
folks that are paranoid. I suspect the distros will never turn it on.

A lot of questions get easier if you can disable/enable it at runtime.
For instance, what do you do if the virtual area fills up? You _could_
just go back to handing out direct map addresses. Less secure? Yep.
But better than crashing (for some folks).

It also opens up the door to do this per-slab. That alone would be a
handy debugging option.

2023-09-30 14:06:27

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [RFC PATCH 01/14] mm/slub: don't try to dereference invalid freepointers

On Fri, Sep 15, 2023 at 7:59 PM Matteo Rizzo <[email protected]> wrote:
>
> slab_free_freelist_hook tries to read a freelist pointer from the
> current object even when freeing a single object. This is invalid
> because single objects don't actually contain a freelist pointer when
> they're freed and the memory contains other data. This causes problems
> for checking the integrity of freelist in get_freepointer.
>
> Signed-off-by: Matteo Rizzo <[email protected]>
> ---
> mm/slub.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index f7940048138c..a7dae207c2d2 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1820,7 +1820,9 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s,
>
> do {
> object = next;
> - next = get_freepointer(s, object);
> + /* Single objects don't actually contain a freepointer */
> + if (object != old_tail)
> + next = get_freepointer(s, object);
>
> /* If object's reuse doesn't have to be delayed */
> if (!slab_free_hook(s, object, slab_want_init_on_free(s))) {
> --
> 2.42.0.459.ge4e396fd5e-goog
>

Looks good to me,
Reviewed-by: Hyeonggon Yoo <[email protected]>

2023-10-11 09:18:03

by Matteo Rizzo

[permalink] [raw]
Subject: Re: [RFC PATCH 11/14] mm/slub: allocate slabs from virtual memory

On Fri, 15 Sept 2023 at 23:57, Dave Hansen <[email protected]> wrote:
>
> I assume that the TLB flushes in the queue are going to be pretty sparse
> on average.
>
> At least on x86, flush_tlb_kernel_range() falls back pretty quickly from
> individual address invalidation to just doing a full flush. It might
> not even be worth tracking the address ranges, and just do a full flush
> every time.
>
> I'd be really curious to see how often actual ranged flushes are
> triggered from this code. I expect it would be very near zero.

I did some quick testing with kernel compilation. On x86
flush_tlb_kernel_range does a full flush when end - start is more than 33
pages and a ranged flush otherwise. I counted how many of each we are
triggering from the TLB flush worker with some code like this:

if (addr_start < addr_end) {
        if ((addr_end - addr_start) <= (33 << PAGE_SHIFT))
                partial_flush_count++;
        else
                full_flush_count++;
}

Result after one run of kernbench:

# cat /proc/slab_tlbinfo
partial 88890 full 45223

So it seems that most flushes are ranged (at least for this workload).

-- Matteo