From: Andrey Konovalov <[email protected]>
Currently, the stack depot grows indefinitely until it reaches its
capacity. Once that happens, the stack depot stops saving new stack
traces.
This creates a problem for using the stack depot for in-field testing
and in production.
For such uses, an ideal stack trace storage should:
1. Allow saving fresh stack traces on systems with a large uptime while
limiting the amount of memory used to store the traces;
2. Have a low performance impact.
Implementing #1 in the stack depot is impossible with the current
keep-forever approach. This series targets to address that. Issue #2 is
left to be addressed in a future series.
This series changes the stack depot implementation to allow evicting
unneeded stack traces from the stack depot. The users of the stack depot
can do that via new stack_depot_save_flags(STACK_DEPOT_FLAG_GET) and
stack_depot_put APIs.
Internal changes to the stack depot code include:
1. Storing stack traces in fixed-frame-sized slots (vs precisely-sized
slots in the current implementation); the slot size is controlled via
CONFIG_STACKDEPOT_MAX_FRAMES (default: 64 frames);
2. Keeping available slots in a freelist (vs keeping an offset to the next
free slot);
3. Using a read/write lock for synchronization (vs a lock-free approach
combined with a spinlock).
This series also integrates the eviction functionality into KASAN:
the tag-based modes evict stack traces when the corresponding entry
leaves the stack ring, and Generic KASAN evicts stack traces for objects
once those leave the quarantine.
With KASAN, despite wasting some space on rounding up the size of each
stack record, the total memory consumed by stack depot gets saturated due
to the eviction of irrelevant stack traces from the stack depot.
With the tag-based KASAN modes, the average total amount of memory used
for stack traces becomes ~0.5 MB (with the current default stack ring size
of 32k entries and the default CONFIG_STACKDEPOT_MAX_FRAMES of 64). With
Generic KASAN, the stack traces take up ~1 MB per 1 GB of RAM (as the
quarantine's size depends on the amount of RAM).
However, with KMSAN, the stack depot ends up using ~4x more memory per a
stack trace than before. Thus, for KMSAN, the stack depot capacity is
increased accordingly. KMSAN uses a lot of RAM for shadow memory anyway,
so the increased stack depot memory usage will not make a significant
difference.
Other users of the stack depot do not save stack traces as often as KASAN
and KMSAN. Thus, the increased memory usage is taken as an acceptable
trade-off. In the future, these other users can take advantage of the
eviction API to limit the memory waste.
There is no measurable boot time performance impact of these changes for
KASAN on x86-64. I haven't done any tests for arm64 modes (the stack
depot without performance optimizations is not suitable for intended use
of those anyway), but I expect a similar result. Obtaining and copying
stack trace frames when saving them into stack depot is what takes the
most time.
This series does not yet provide a way to configure the maximum size of
the stack depot externally (e.g. via a command-line parameter). This will
be added in a separate series, possibly together with the performance
improvement changes.
---
Changes v3->v4:
- Rebase onto 6.7-rc2.
- Fix lockdep annotation in depot_fetch_stack.
- New patch: "kasan: use stack_depot_put for Generic mode" (was sent for
review separately but now merged into this series).
- New patch: "lib/stackdepot: print disabled message only if truly
disabled" (was sent for review separately but now merged into this
series).
- New patch: "lib/stackdepot: adjust DEPOT_POOLS_CAP for KMSAN".
Changes v2->v3:
- Fix null-ptr-deref by using the proper number of entries for
initializing the stack table when alloc_large_system_hash()
auto-calculates the number (see patch #12).
- Keep STACKDEPOT/STACKDEPOT_ALWAYS_INIT Kconfig options not configurable
by users.
- Use lockdep_assert_held_read annotation in depot_fetch_stack.
- WARN_ON invalid flags in stack_depot_save_flags.
- Moved "../slab.h" include in mm/kasan/report_tags.c in the right patch.
- Various comment fixes.
Changes v1->v2:
- Rework API to stack_depot_save_flags(STACK_DEPOT_FLAG_GET) +
stack_depot_put.
- Add CONFIG_STACKDEPOT_MAX_FRAMES Kconfig option.
- Switch stack depot to using list_head's.
- Assorted minor changes, see the commit message for each path.
Andrey Konovalov (22):
lib/stackdepot: print disabled message only if truly disabled
lib/stackdepot: check disabled flag when fetching
lib/stackdepot: simplify __stack_depot_save
lib/stackdepot: drop valid bit from handles
lib/stackdepot: add depot_fetch_stack helper
lib/stackdepot: use fixed-sized slots for stack records
lib/stackdepot: fix and clean-up atomic annotations
lib/stackdepot: rework helpers for depot_alloc_stack
lib/stackdepot: rename next_pool_required to new_pool_required
lib/stackdepot: store next pool pointer in new_pool
lib/stackdepot: store free stack records in a freelist
lib/stackdepot: use read/write lock
lib/stackdepot: use list_head for stack record links
kmsan: use stack_depot_save instead of __stack_depot_save
lib/stackdepot, kasan: add flags to __stack_depot_save and rename
lib/stackdepot: add refcount for records
lib/stackdepot: allow users to evict stack traces
kasan: remove atomic accesses to stack ring entries
kasan: check object_size in kasan_complete_mode_report_info
kasan: use stack_depot_put for tag-based modes
kasan: use stack_depot_put for Generic mode
lib/stackdepot: adjust DEPOT_POOLS_CAP for KMSAN
include/linux/stackdepot.h | 59 ++++-
lib/Kconfig | 10 +
lib/stackdepot.c | 452 ++++++++++++++++++++++++-------------
mm/kasan/common.c | 8 +-
mm/kasan/generic.c | 27 ++-
mm/kasan/kasan.h | 2 +-
mm/kasan/quarantine.c | 26 ++-
mm/kasan/report_tags.c | 27 +--
mm/kasan/tags.c | 24 +-
mm/kmsan/core.c | 7 +-
10 files changed, 427 insertions(+), 215 deletions(-)
--
2.25.1
From: Andrey Konovalov <[email protected]>
The retval local variable in __stack_depot_save has the union type
handle_parts, but the function never uses anything but the union's
handle field.
Define retval simply as depot_stack_handle_t to simplify the code.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
lib/stackdepot.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index f8a8033e1dc8..3e71c8f61c7d 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -366,7 +366,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
gfp_t alloc_flags, bool can_alloc)
{
struct stack_record *found = NULL, **bucket;
- union handle_parts retval = { .handle = 0 };
+ depot_stack_handle_t handle = 0;
struct page *page = NULL;
void *prealloc = NULL;
unsigned long flags;
@@ -383,7 +383,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
nr_entries = filter_irq_stacks(entries, nr_entries);
if (unlikely(nr_entries == 0) || stack_depot_disabled)
- goto fast_exit;
+ return 0;
hash = hash_stack(entries, nr_entries);
bucket = &stack_table[hash & stack_hash_mask];
@@ -449,9 +449,8 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
free_pages((unsigned long)prealloc, DEPOT_POOL_ORDER);
}
if (found)
- retval.handle = found->handle.handle;
-fast_exit:
- return retval.handle;
+ handle = found->handle.handle;
+ return handle;
}
EXPORT_SYMBOL_GPL(__stack_depot_save);
--
2.25.1
From: Andrey Konovalov <[email protected]>
Stack depot doesn't use the valid bit in handles in any way, so drop it.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
lib/stackdepot.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 3e71c8f61c7d..46a422d31c1f 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -32,13 +32,12 @@
#define DEPOT_HANDLE_BITS (sizeof(depot_stack_handle_t) * 8)
-#define DEPOT_VALID_BITS 1
#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages */
#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT + DEPOT_POOL_ORDER))
#define DEPOT_STACK_ALIGN 4
#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT - DEPOT_STACK_ALIGN)
-#define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_VALID_BITS - \
- DEPOT_OFFSET_BITS - STACK_DEPOT_EXTRA_BITS)
+#define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_OFFSET_BITS - \
+ STACK_DEPOT_EXTRA_BITS)
#define DEPOT_POOLS_CAP 8192
#define DEPOT_MAX_POOLS \
(((1LL << (DEPOT_POOL_INDEX_BITS)) < DEPOT_POOLS_CAP) ? \
@@ -50,7 +49,6 @@ union handle_parts {
struct {
u32 pool_index : DEPOT_POOL_INDEX_BITS;
u32 offset : DEPOT_OFFSET_BITS;
- u32 valid : DEPOT_VALID_BITS;
u32 extra : STACK_DEPOT_EXTRA_BITS;
};
};
@@ -309,7 +307,6 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
stack->size = size;
stack->handle.pool_index = pool_index;
stack->handle.offset = pool_offset >> DEPOT_STACK_ALIGN;
- stack->handle.valid = 1;
stack->handle.extra = 0;
memcpy(stack->entries, entries, flex_array_size(stack, entries, size));
pool_offset += required_size;
--
2.25.1
From: Andrey Konovalov <[email protected]>
Do not try fetching a stack trace from the stack depot if the
stack_depot_disabled flag is enabled.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
lib/stackdepot.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 0eeaef4f2523..f8a8033e1dc8 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -483,7 +483,7 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
*/
kmsan_unpoison_memory(entries, sizeof(*entries));
- if (!handle)
+ if (!handle || stack_depot_disabled)
return 0;
if (parts.pool_index > pool_index_cached) {
--
2.25.1
From: Andrey Konovalov <[email protected]>
Add a helper depot_fetch_stack function that fetches the pointer to
a stack record.
With this change, all static depot_* functions now operate on stack pools
and the exported stack_depot_* functions operate on the hash table.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
Changes v1->v2:
- Minor comment fix as suggested by Alexander.
---
lib/stackdepot.c | 45 ++++++++++++++++++++++++++++-----------------
1 file changed, 28 insertions(+), 17 deletions(-)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 46a422d31c1f..e41713983cac 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -310,6 +310,7 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
stack->handle.extra = 0;
memcpy(stack->entries, entries, flex_array_size(stack, entries, size));
pool_offset += required_size;
+
/*
* Let KMSAN know the stored stack record is initialized. This shall
* prevent false positive reports if instrumented code accesses it.
@@ -319,6 +320,32 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
return stack;
}
+static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
+{
+ union handle_parts parts = { .handle = handle };
+ /*
+ * READ_ONCE pairs with potential concurrent write in
+ * depot_alloc_stack().
+ */
+ int pool_index_cached = READ_ONCE(pool_index);
+ void *pool;
+ size_t offset = parts.offset << DEPOT_STACK_ALIGN;
+ struct stack_record *stack;
+
+ if (parts.pool_index > pool_index_cached) {
+ WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n",
+ parts.pool_index, pool_index_cached, handle);
+ return NULL;
+ }
+
+ pool = stack_pools[parts.pool_index];
+ if (!pool)
+ return NULL;
+
+ stack = pool + offset;
+ return stack;
+}
+
/* Calculates the hash for a stack. */
static inline u32 hash_stack(unsigned long *entries, unsigned int size)
{
@@ -462,14 +489,6 @@ EXPORT_SYMBOL_GPL(stack_depot_save);
unsigned int stack_depot_fetch(depot_stack_handle_t handle,
unsigned long **entries)
{
- union handle_parts parts = { .handle = handle };
- /*
- * READ_ONCE pairs with potential concurrent write in
- * depot_alloc_stack.
- */
- int pool_index_cached = READ_ONCE(pool_index);
- void *pool;
- size_t offset = parts.offset << DEPOT_STACK_ALIGN;
struct stack_record *stack;
*entries = NULL;
@@ -482,15 +501,7 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
if (!handle || stack_depot_disabled)
return 0;
- if (parts.pool_index > pool_index_cached) {
- WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n",
- parts.pool_index, pool_index_cached, handle);
- return 0;
- }
- pool = stack_pools[parts.pool_index];
- if (!pool)
- return 0;
- stack = pool + offset;
+ stack = depot_fetch_stack(handle);
*entries = stack->entries;
return stack->size;
--
2.25.1
From: Andrey Konovalov <[email protected]>
Drop smp_load_acquire from next_pool_required in depot_init_pool, as both
depot_init_pool and the all smp_store_release's to this variable are
executed under the stack depot lock.
Also simplify and clean up comments accompanying the use of atomic
accesses in the stack depot code.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
This patch is not strictly required, as the atomic accesses are fully
removed in one of the latter patches. However, I decided to keep the
patch just in case we end up needing these atomics in the following
iterations of this series.
Changes v2->v3:
- Keep parentheses when referring to functions in comments.
- Add comment that explains why depot_init_pool reads next_pool_required
non-atomically.
Changes v1->v2:
- Minor comment fix as suggested by Marco.
- Drop READ_ONCE marking for next_pool_required.
---
lib/stackdepot.c | 29 ++++++++++++++---------------
1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 682497dbe081..cfa3c6c7cc2e 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -231,10 +231,10 @@ static void depot_init_pool(void **prealloc)
/*
* If the next pool is already initialized or the maximum number of
* pools is reached, do not use the preallocated memory.
- * smp_load_acquire() here pairs with smp_store_release() below and
- * in depot_alloc_stack().
+ * Access next_pool_required non-atomically, as there are no concurrent
+ * write accesses to this variable.
*/
- if (!smp_load_acquire(&next_pool_required))
+ if (!next_pool_required)
return;
/* Check if the current pool is not yet allocated. */
@@ -255,8 +255,8 @@ static void depot_init_pool(void **prealloc)
* At this point, either the next pool is initialized or the
* maximum number of pools is reached. In either case, take
* note that initializing another pool is not required.
- * This smp_store_release pairs with smp_load_acquire() above
- * and in stack_depot_save().
+ * smp_store_release() pairs with smp_load_acquire() in
+ * stack_depot_save().
*/
smp_store_release(&next_pool_required, 0);
}
@@ -279,7 +279,7 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
/*
* Move on to the next pool.
- * WRITE_ONCE pairs with potential concurrent read in
+ * WRITE_ONCE() pairs with potential concurrent read in
* stack_depot_fetch().
*/
WRITE_ONCE(pool_index, pool_index + 1);
@@ -287,8 +287,8 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
/*
* If the maximum number of pools is not reached, take note
* that the next pool needs to initialized.
- * smp_store_release() here pairs with smp_load_acquire() in
- * stack_depot_save() and depot_init_pool().
+ * smp_store_release() pairs with smp_load_acquire() in
+ * stack_depot_save().
*/
if (pool_index + 1 < DEPOT_MAX_POOLS)
smp_store_release(&next_pool_required, 1);
@@ -329,7 +329,7 @@ static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
{
union handle_parts parts = { .handle = handle };
/*
- * READ_ONCE pairs with potential concurrent write in
+ * READ_ONCE() pairs with potential concurrent write in
* depot_alloc_stack().
*/
int pool_index_cached = READ_ONCE(pool_index);
@@ -419,8 +419,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
/*
* Fast path: look the stack trace up without locking.
- * The smp_load_acquire() here pairs with smp_store_release() to
- * |bucket| below.
+ * smp_load_acquire() pairs with smp_store_release() to |bucket| below.
*/
found = find_stack(smp_load_acquire(bucket), entries, nr_entries, hash);
if (found)
@@ -430,8 +429,8 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
* Check if another stack pool needs to be initialized. If so, allocate
* the memory now - we won't be able to do that under the lock.
*
- * The smp_load_acquire() here pairs with smp_store_release() to
- * |next_pool_inited| in depot_alloc_stack() and depot_init_pool().
+ * smp_load_acquire() pairs with smp_store_release() in
+ * depot_alloc_stack() and depot_init_pool().
*/
if (unlikely(can_alloc && smp_load_acquire(&next_pool_required))) {
/*
@@ -457,8 +456,8 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
if (new) {
new->next = *bucket;
/*
- * This smp_store_release() pairs with
- * smp_load_acquire() from |bucket| above.
+ * smp_store_release() pairs with smp_load_acquire()
+ * from |bucket| above.
*/
smp_store_release(bucket, new);
found = new;
--
2.25.1
From: Andrey Konovalov <[email protected]>
Rename next_pool_required to new_pool_required.
This a purely code readability change: the following patch will change
stack depot to store the pointer to the new pool in a separate variable,
and "new" seems like a more logical name.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
lib/stackdepot.c | 49 ++++++++++++++++++++++++------------------------
1 file changed, 24 insertions(+), 25 deletions(-)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index b3af868627f4..a38661beab97 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -93,12 +93,11 @@ static size_t pool_offset;
static DEFINE_RAW_SPINLOCK(pool_lock);
/*
* Stack depot tries to keep an extra pool allocated even before it runs out
- * of space in the currently used pool.
- * This flag marks that this next extra pool needs to be allocated and
- * initialized. It has the value 0 when either the next pool is not yet
- * initialized or the limit on the number of pools is reached.
+ * of space in the currently used pool. This flag marks whether this extra pool
+ * needs to be allocated. It has the value 0 when either an extra pool is not
+ * yet allocated or if the limit on the number of pools is reached.
*/
-static int next_pool_required = 1;
+static int new_pool_required = 1;
static int __init disable_stack_depot(char *str)
{
@@ -225,20 +224,20 @@ int stack_depot_init(void)
}
EXPORT_SYMBOL_GPL(stack_depot_init);
-/* Keeps the preallocated memory to be used for the next stack depot pool. */
-static void depot_keep_next_pool(void **prealloc)
+/* Keeps the preallocated memory to be used for a new stack depot pool. */
+static void depot_keep_new_pool(void **prealloc)
{
/*
- * If the next pool is already saved or the maximum number of
+ * If a new pool is already saved or the maximum number of
* pools is reached, do not use the preallocated memory.
- * Access next_pool_required non-atomically, as there are no concurrent
+ * Access new_pool_required non-atomically, as there are no concurrent
* write accesses to this variable.
*/
- if (!next_pool_required)
+ if (!new_pool_required)
return;
/*
- * Use the preallocated memory for the next pool
+ * Use the preallocated memory for the new pool
* as long as we do not exceed the maximum number of pools.
*/
if (pool_index + 1 < DEPOT_MAX_POOLS) {
@@ -247,13 +246,13 @@ static void depot_keep_next_pool(void **prealloc)
}
/*
- * At this point, either the next pool is kept or the maximum
+ * At this point, either a new pool is kept or the maximum
* number of pools is reached. In either case, take note that
* keeping another pool is not required.
* smp_store_release() pairs with smp_load_acquire() in
* stack_depot_save().
*/
- smp_store_release(&next_pool_required, 0);
+ smp_store_release(&new_pool_required, 0);
}
/* Updates references to the current and the next stack depot pools. */
@@ -268,7 +267,7 @@ static bool depot_update_pools(size_t required_size, void **prealloc)
}
/*
- * Move on to the next pool.
+ * Move on to the new pool.
* WRITE_ONCE() pairs with potential concurrent read in
* stack_depot_fetch().
*/
@@ -277,12 +276,12 @@ static bool depot_update_pools(size_t required_size, void **prealloc)
/*
* If the maximum number of pools is not reached, take note
- * that the next pool needs to be initialized.
+ * that yet another new pool needs to be allocated.
* smp_store_release() pairs with smp_load_acquire() in
* stack_depot_save().
*/
if (pool_index + 1 < DEPOT_MAX_POOLS)
- smp_store_release(&next_pool_required, 1);
+ smp_store_release(&new_pool_required, 1);
}
/* Check if the current pool is not yet allocated. */
@@ -293,9 +292,9 @@ static bool depot_update_pools(size_t required_size, void **prealloc)
return true;
}
- /* Otherwise, try using the preallocated memory for the next pool. */
+ /* Otherwise, try using the preallocated memory for a new pool. */
if (*prealloc)
- depot_keep_next_pool(prealloc);
+ depot_keep_new_pool(prealloc);
return true;
}
@@ -306,7 +305,7 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
struct stack_record *stack;
size_t required_size = DEPOT_STACK_RECORD_SIZE;
- /* Update current and next pools if required and possible. */
+ /* Update current and new pools if required and possible. */
if (!depot_update_pools(required_size, prealloc))
return NULL;
@@ -438,13 +437,13 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
goto exit;
/*
- * Check if another stack pool needs to be initialized. If so, allocate
- * the memory now - we won't be able to do that under the lock.
+ * Check if another stack pool needs to be allocated. If so, allocate
+ * the memory now: we won't be able to do that under the lock.
*
* smp_load_acquire() pairs with smp_store_release() in
- * depot_update_pools() and depot_keep_next_pool().
+ * depot_update_pools() and depot_keep_new_pool().
*/
- if (unlikely(can_alloc && smp_load_acquire(&next_pool_required))) {
+ if (unlikely(can_alloc && smp_load_acquire(&new_pool_required))) {
/*
* Zero out zone modifiers, as we don't have specific zone
* requirements. Keep the flags related to allocation in atomic
@@ -477,9 +476,9 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
} else if (prealloc) {
/*
* Stack depot already contains this stack trace, but let's
- * keep the preallocated memory for the future.
+ * keep the preallocated memory for future.
*/
- depot_keep_next_pool(&prealloc);
+ depot_keep_new_pool(&prealloc);
}
raw_spin_unlock_irqrestore(&pool_lock, flags);
--
2.25.1
From: Andrey Konovalov <[email protected]>
Change the bool can_alloc argument of __stack_depot_save to a
u32 argument that accepts a set of flags.
The following patch will add another flag to stack_depot_save_flags
besides the existing STACK_DEPOT_FLAG_CAN_ALLOC.
Also rename the function to stack_depot_save_flags, as __stack_depot_save
is a cryptic name,
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
Changes v2->v3:
- WARN_ON invalid flags in stack_depot_save_flags.
Changes v1->v2:
- This is a new patch.
---
include/linux/stackdepot.h | 36 +++++++++++++++++++++++++-----------
lib/stackdepot.c | 16 +++++++++++-----
mm/kasan/common.c | 7 ++++---
mm/kasan/generic.c | 9 +++++----
mm/kasan/kasan.h | 2 +-
mm/kasan/tags.c | 3 ++-
6 files changed, 48 insertions(+), 25 deletions(-)
diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
index e58306783d8e..0b262e14144e 100644
--- a/include/linux/stackdepot.h
+++ b/include/linux/stackdepot.h
@@ -32,6 +32,17 @@ typedef u32 depot_stack_handle_t;
*/
#define STACK_DEPOT_EXTRA_BITS 5
+typedef u32 depot_flags_t;
+
+/*
+ * Flags that can be passed to stack_depot_save_flags(); see the comment next
+ * to its declaration for more details.
+ */
+#define STACK_DEPOT_FLAG_CAN_ALLOC ((depot_flags_t)0x0001)
+
+#define STACK_DEPOT_FLAGS_NUM 1
+#define STACK_DEPOT_FLAGS_MASK ((depot_flags_t)((1 << STACK_DEPOT_FLAGS_NUM) - 1))
+
/*
* Using stack depot requires its initialization, which can be done in 3 ways:
*
@@ -69,31 +80,34 @@ static inline int stack_depot_early_init(void) { return 0; }
#endif
/**
- * __stack_depot_save - Save a stack trace to stack depot
+ * stack_depot_save_flags - Save a stack trace to stack depot
*
* @entries: Pointer to the stack trace
* @nr_entries: Number of frames in the stack
* @alloc_flags: Allocation GFP flags
- * @can_alloc: Allocate stack pools (increased chance of failure if false)
+ * @depot_flags: Stack depot flags
+ *
+ * Saves a stack trace from @entries array of size @nr_entries.
*
- * Saves a stack trace from @entries array of size @nr_entries. If @can_alloc is
- * %true, stack depot can replenish the stack pools in case no space is left
- * (allocates using GFP flags of @alloc_flags). If @can_alloc is %false, avoids
- * any allocations and fails if no space is left to store the stack trace.
+ * If STACK_DEPOT_FLAG_CAN_ALLOC is set in @depot_flags, stack depot can
+ * replenish the stack pools in case no space is left (allocates using GFP
+ * flags of @alloc_flags). Otherwise, stack depot avoids any allocations and
+ * fails if no space is left to store the stack trace.
*
* If the provided stack trace comes from the interrupt context, only the part
* up to the interrupt entry is saved.
*
- * Context: Any context, but setting @can_alloc to %false is required if
+ * Context: Any context, but setting STACK_DEPOT_FLAG_CAN_ALLOC is required if
* alloc_pages() cannot be used from the current context. Currently
* this is the case for contexts where neither %GFP_ATOMIC nor
* %GFP_NOWAIT can be used (NMI, raw_spin_lock).
*
* Return: Handle of the stack struct stored in depot, 0 on failure
*/
-depot_stack_handle_t __stack_depot_save(unsigned long *entries,
- unsigned int nr_entries,
- gfp_t gfp_flags, bool can_alloc);
+depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
+ unsigned int nr_entries,
+ gfp_t gfp_flags,
+ depot_flags_t depot_flags);
/**
* stack_depot_save - Save a stack trace to stack depot
@@ -103,7 +117,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
* @alloc_flags: Allocation GFP flags
*
* Context: Contexts where allocations via alloc_pages() are allowed.
- * See __stack_depot_save() for more details.
+ * See stack_depot_save_flags() for more details.
*
* Return: Handle of the stack trace stored in depot, 0 on failure
*/
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 4bb0af423f82..59d61d5c09a7 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -450,19 +450,24 @@ static inline struct stack_record *find_stack(struct list_head *bucket,
return NULL;
}
-depot_stack_handle_t __stack_depot_save(unsigned long *entries,
- unsigned int nr_entries,
- gfp_t alloc_flags, bool can_alloc)
+depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
+ unsigned int nr_entries,
+ gfp_t alloc_flags,
+ depot_flags_t depot_flags)
{
struct list_head *bucket;
struct stack_record *found = NULL;
depot_stack_handle_t handle = 0;
struct page *page = NULL;
void *prealloc = NULL;
+ bool can_alloc = depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC;
bool need_alloc = false;
unsigned long flags;
u32 hash;
+ if (WARN_ON(depot_flags & ~STACK_DEPOT_FLAGS_MASK))
+ return 0;
+
/*
* If this stack trace is from an interrupt, including anything before
* interrupt entry usually leads to unbounded stack depot growth.
@@ -541,13 +546,14 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
handle = found->handle.handle;
return handle;
}
-EXPORT_SYMBOL_GPL(__stack_depot_save);
+EXPORT_SYMBOL_GPL(stack_depot_save_flags);
depot_stack_handle_t stack_depot_save(unsigned long *entries,
unsigned int nr_entries,
gfp_t alloc_flags)
{
- return __stack_depot_save(entries, nr_entries, alloc_flags, true);
+ return stack_depot_save_flags(entries, nr_entries, alloc_flags,
+ STACK_DEPOT_FLAG_CAN_ALLOC);
}
EXPORT_SYMBOL_GPL(stack_depot_save);
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 256930da578a..825a0240ec02 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -22,6 +22,7 @@
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
#include <linux/slab.h>
+#include <linux/stackdepot.h>
#include <linux/stacktrace.h>
#include <linux/string.h>
#include <linux/types.h>
@@ -37,19 +38,19 @@ struct slab *kasan_addr_to_slab(const void *addr)
return NULL;
}
-depot_stack_handle_t kasan_save_stack(gfp_t flags, bool can_alloc)
+depot_stack_handle_t kasan_save_stack(gfp_t flags, depot_flags_t depot_flags)
{
unsigned long entries[KASAN_STACK_DEPTH];
unsigned int nr_entries;
nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
- return __stack_depot_save(entries, nr_entries, flags, can_alloc);
+ return stack_depot_save_flags(entries, nr_entries, flags, depot_flags);
}
void kasan_set_track(struct kasan_track *track, gfp_t flags)
{
track->pid = current->pid;
- track->stack = kasan_save_stack(flags, true);
+ track->stack = kasan_save_stack(flags, STACK_DEPOT_FLAG_CAN_ALLOC);
}
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c
index 4d837ab83f08..5d168c9afb32 100644
--- a/mm/kasan/generic.c
+++ b/mm/kasan/generic.c
@@ -25,6 +25,7 @@
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
#include <linux/slab.h>
+#include <linux/stackdepot.h>
#include <linux/stacktrace.h>
#include <linux/string.h>
#include <linux/types.h>
@@ -472,7 +473,7 @@ size_t kasan_metadata_size(struct kmem_cache *cache, bool in_object)
sizeof(struct kasan_free_meta) : 0);
}
-static void __kasan_record_aux_stack(void *addr, bool can_alloc)
+static void __kasan_record_aux_stack(void *addr, depot_flags_t depot_flags)
{
struct slab *slab = kasan_addr_to_slab(addr);
struct kmem_cache *cache;
@@ -489,17 +490,17 @@ static void __kasan_record_aux_stack(void *addr, bool can_alloc)
return;
alloc_meta->aux_stack[1] = alloc_meta->aux_stack[0];
- alloc_meta->aux_stack[0] = kasan_save_stack(0, can_alloc);
+ alloc_meta->aux_stack[0] = kasan_save_stack(0, depot_flags);
}
void kasan_record_aux_stack(void *addr)
{
- return __kasan_record_aux_stack(addr, true);
+ return __kasan_record_aux_stack(addr, STACK_DEPOT_FLAG_CAN_ALLOC);
}
void kasan_record_aux_stack_noalloc(void *addr)
{
- return __kasan_record_aux_stack(addr, false);
+ return __kasan_record_aux_stack(addr, 0);
}
void kasan_save_alloc_info(struct kmem_cache *cache, void *object, gfp_t flags)
diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
index 8b06bab5c406..b29d46b83d1f 100644
--- a/mm/kasan/kasan.h
+++ b/mm/kasan/kasan.h
@@ -368,7 +368,7 @@ static inline void kasan_init_cache_meta(struct kmem_cache *cache, unsigned int
static inline void kasan_init_object_meta(struct kmem_cache *cache, const void *object) { }
#endif
-depot_stack_handle_t kasan_save_stack(gfp_t flags, bool can_alloc);
+depot_stack_handle_t kasan_save_stack(gfp_t flags, depot_flags_t depot_flags);
void kasan_set_track(struct kasan_track *track, gfp_t flags);
void kasan_save_alloc_info(struct kmem_cache *cache, void *object, gfp_t flags);
void kasan_save_free_info(struct kmem_cache *cache, void *object);
diff --git a/mm/kasan/tags.c b/mm/kasan/tags.c
index 7dcfe341d48e..4fd32121b0fd 100644
--- a/mm/kasan/tags.c
+++ b/mm/kasan/tags.c
@@ -13,6 +13,7 @@
#include <linux/memblock.h>
#include <linux/memory.h>
#include <linux/mm.h>
+#include <linux/stackdepot.h>
#include <linux/static_key.h>
#include <linux/string.h>
#include <linux/types.h>
@@ -101,7 +102,7 @@ static void save_stack_info(struct kmem_cache *cache, void *object,
struct kasan_stack_ring_entry *entry;
void *old_ptr;
- stack = kasan_save_stack(gfp_flags, true);
+ stack = kasan_save_stack(gfp_flags, STACK_DEPOT_FLAG_CAN_ALLOC);
/*
* Prevent save_stack_info() from modifying stack ring
--
2.25.1
From: Andrey Konovalov <[email protected]>
Instead of storing stack records in stack depot pools one right after
another, use fixed-sized slots.
Add a new Kconfig option STACKDEPOT_MAX_FRAMES that allows to select
the size of the slot in frames. Use 64 as the default value, which is
the maximum stack trace size both KASAN and KMSAN use right now.
Also add descriptions for other stack depot Kconfig options.
This is preparatory patch for implementing the eviction of stack records
from the stack depot.
Signed-off-by: Andrey Konovalov <[email protected]>
---
Changes v2->v3:
- Keep previously existing Kconfig options not configurable by users.
Changes v1->v2:
- Add and use STACKDEPOT_MAX_FRAMES Kconfig option.
---
lib/Kconfig | 10 ++++++++++
lib/stackdepot.c | 13 +++++++++----
2 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/lib/Kconfig b/lib/Kconfig
index 3ea1c830efab..5ddda7c2ed9b 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -713,10 +713,20 @@ config ARCH_STACKWALK
config STACKDEPOT
bool
select STACKTRACE
+ help
+ Stack depot: stack trace storage that avoids duplication
config STACKDEPOT_ALWAYS_INIT
bool
select STACKDEPOT
+ help
+ Always initialize stack depot during early boot
+
+config STACKDEPOT_MAX_FRAMES
+ int "Maximum number of frames in trace saved in stack depot"
+ range 1 256
+ default 64
+ depends on STACKDEPOT
config REF_TRACKER
bool
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index e41713983cac..682497dbe081 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -58,9 +58,12 @@ struct stack_record {
u32 hash; /* Hash in the hash table */
u32 size; /* Number of stored frames */
union handle_parts handle;
- unsigned long entries[]; /* Variable-sized array of frames */
+ unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
};
+#define DEPOT_STACK_RECORD_SIZE \
+ ALIGN(sizeof(struct stack_record), 1 << DEPOT_STACK_ALIGN)
+
static bool stack_depot_disabled;
static bool __stack_depot_early_init_requested __initdata = IS_ENABLED(CONFIG_STACKDEPOT_ALWAYS_INIT);
static bool __stack_depot_early_init_passed __initdata;
@@ -264,9 +267,7 @@ static struct stack_record *
depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
{
struct stack_record *stack;
- size_t required_size = struct_size(stack, entries, size);
-
- required_size = ALIGN(required_size, 1 << DEPOT_STACK_ALIGN);
+ size_t required_size = DEPOT_STACK_RECORD_SIZE;
/* Check if there is not enough space in the current pool. */
if (unlikely(pool_offset + required_size > DEPOT_POOL_SIZE)) {
@@ -301,6 +302,10 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
if (stack_pools[pool_index] == NULL)
return NULL;
+ /* Limit number of saved frames to CONFIG_STACKDEPOT_MAX_FRAMES. */
+ if (size > CONFIG_STACKDEPOT_MAX_FRAMES)
+ size = CONFIG_STACKDEPOT_MAX_FRAMES;
+
/* Save the stack trace. */
stack = stack_pools[pool_index] + pool_offset;
stack->hash = hash;
--
2.25.1
From: Andrey Konovalov <[email protected]>
Instead of using the last pointer in stack_pools for storing the pointer
to a new pool (which does not yet store any stack records), use a new
new_pool variable.
This a purely code readability change: it seems more logical to store
the pointer to a pool with a special meaning in a dedicated variable.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
lib/stackdepot.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index a38661beab97..68c1ac9aa916 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -85,6 +85,8 @@ static unsigned int stack_hash_mask;
/* Array of memory regions that store stack traces. */
static void *stack_pools[DEPOT_MAX_POOLS];
+/* Newly allocated pool that is not yet added to stack_pools. */
+static void *new_pool;
/* Currently used pool in stack_pools. */
static int pool_index;
/* Offset to the unused space in the currently used pool. */
@@ -241,7 +243,7 @@ static void depot_keep_new_pool(void **prealloc)
* as long as we do not exceed the maximum number of pools.
*/
if (pool_index + 1 < DEPOT_MAX_POOLS) {
- stack_pools[pool_index + 1] = *prealloc;
+ new_pool = *prealloc;
*prealloc = NULL;
}
@@ -272,6 +274,8 @@ static bool depot_update_pools(size_t required_size, void **prealloc)
* stack_depot_fetch().
*/
WRITE_ONCE(pool_index, pool_index + 1);
+ stack_pools[pool_index] = new_pool;
+ new_pool = NULL;
pool_offset = 0;
/*
--
2.25.1
From: Andrey Konovalov <[email protected]>
Instead of using the global pool_offset variable to find a free slot
when storing a new stack record, mainlain a freelist of free slots
within the allocated stack pools.
A global next_stack variable is used as the head of the freelist, and
the next field in the stack_record struct is reused as freelist link
(when the record is not in the freelist, this field is used as a link
in the hash table).
This is preparatory patch for implementing the eviction of stack records
from the stack depot.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
Changes v2->v3:
- Add parentheses when referring to function calls in comments.
Changes v1->v2:
- Fix out-of-bounds when initializing a pool.
---
lib/stackdepot.c | 131 +++++++++++++++++++++++++++++------------------
1 file changed, 82 insertions(+), 49 deletions(-)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 68c1ac9aa916..a5eff165c0d5 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -54,8 +54,8 @@ union handle_parts {
};
struct stack_record {
- struct stack_record *next; /* Link in the hash table */
- u32 hash; /* Hash in the hash table */
+ struct stack_record *next; /* Link in hash table or freelist */
+ u32 hash; /* Hash in hash table */
u32 size; /* Number of stored frames */
union handle_parts handle;
unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
@@ -87,10 +87,10 @@ static unsigned int stack_hash_mask;
static void *stack_pools[DEPOT_MAX_POOLS];
/* Newly allocated pool that is not yet added to stack_pools. */
static void *new_pool;
-/* Currently used pool in stack_pools. */
-static int pool_index;
-/* Offset to the unused space in the currently used pool. */
-static size_t pool_offset;
+/* Number of pools in stack_pools. */
+static int pools_num;
+/* Next stack in the freelist of stack records within stack_pools. */
+static struct stack_record *next_stack;
/* Lock that protects the variables above. */
static DEFINE_RAW_SPINLOCK(pool_lock);
/*
@@ -226,6 +226,42 @@ int stack_depot_init(void)
}
EXPORT_SYMBOL_GPL(stack_depot_init);
+/* Initializes a stack depol pool. */
+static void depot_init_pool(void *pool)
+{
+ const int records_in_pool = DEPOT_POOL_SIZE / DEPOT_STACK_RECORD_SIZE;
+ int i, offset;
+
+ /* Initialize handles and link stack records to each other. */
+ for (i = 0, offset = 0;
+ offset <= DEPOT_POOL_SIZE - DEPOT_STACK_RECORD_SIZE;
+ i++, offset += DEPOT_STACK_RECORD_SIZE) {
+ struct stack_record *stack = pool + offset;
+
+ stack->handle.pool_index = pools_num;
+ stack->handle.offset = offset >> DEPOT_STACK_ALIGN;
+ stack->handle.extra = 0;
+
+ if (i < records_in_pool - 1)
+ stack->next = (void *)stack + DEPOT_STACK_RECORD_SIZE;
+ else
+ stack->next = NULL;
+ }
+
+ /* Link stack records into the freelist. */
+ WARN_ON(next_stack);
+ next_stack = pool;
+
+ /* Save reference to the pool to be used by depot_fetch_stack(). */
+ stack_pools[pools_num] = pool;
+
+ /*
+ * WRITE_ONCE() pairs with potential concurrent read in
+ * depot_fetch_stack().
+ */
+ WRITE_ONCE(pools_num, pools_num + 1);
+}
+
/* Keeps the preallocated memory to be used for a new stack depot pool. */
static void depot_keep_new_pool(void **prealloc)
{
@@ -242,7 +278,7 @@ static void depot_keep_new_pool(void **prealloc)
* Use the preallocated memory for the new pool
* as long as we do not exceed the maximum number of pools.
*/
- if (pool_index + 1 < DEPOT_MAX_POOLS) {
+ if (pools_num < DEPOT_MAX_POOLS) {
new_pool = *prealloc;
*prealloc = NULL;
}
@@ -258,45 +294,42 @@ static void depot_keep_new_pool(void **prealloc)
}
/* Updates references to the current and the next stack depot pools. */
-static bool depot_update_pools(size_t required_size, void **prealloc)
+static bool depot_update_pools(void **prealloc)
{
- /* Check if there is not enough space in the current pool. */
- if (unlikely(pool_offset + required_size > DEPOT_POOL_SIZE)) {
- /* Bail out if we reached the pool limit. */
- if (unlikely(pool_index + 1 >= DEPOT_MAX_POOLS)) {
- WARN_ONCE(1, "Stack depot reached limit capacity");
- return false;
- }
+ /* Check if we still have objects in the freelist. */
+ if (next_stack)
+ goto out_keep_prealloc;
- /*
- * Move on to the new pool.
- * WRITE_ONCE() pairs with potential concurrent read in
- * stack_depot_fetch().
- */
- WRITE_ONCE(pool_index, pool_index + 1);
- stack_pools[pool_index] = new_pool;
+ /* Check if we have a new pool saved and use it. */
+ if (new_pool) {
+ depot_init_pool(new_pool);
new_pool = NULL;
- pool_offset = 0;
- /*
- * If the maximum number of pools is not reached, take note
- * that yet another new pool needs to be allocated.
- * smp_store_release() pairs with smp_load_acquire() in
- * stack_depot_save().
- */
- if (pool_index + 1 < DEPOT_MAX_POOLS)
+ /* Take note that we might need a new new_pool. */
+ if (pools_num < DEPOT_MAX_POOLS)
smp_store_release(&new_pool_required, 1);
+
+ /* Try keeping the preallocated memory for new_pool. */
+ goto out_keep_prealloc;
+ }
+
+ /* Bail out if we reached the pool limit. */
+ if (unlikely(pools_num >= DEPOT_MAX_POOLS)) {
+ WARN_ONCE(1, "Stack depot reached limit capacity");
+ return false;
}
- /* Check if the current pool is not yet allocated. */
- if (*prealloc && stack_pools[pool_index] == NULL) {
- /* Use the preallocated memory for the current pool. */
- stack_pools[pool_index] = *prealloc;
+ /* Check if we have preallocated memory and use it. */
+ if (*prealloc) {
+ depot_init_pool(*prealloc);
*prealloc = NULL;
return true;
}
- /* Otherwise, try using the preallocated memory for a new pool. */
+ return false;
+
+out_keep_prealloc:
+ /* Keep the preallocated memory for a new pool if required. */
if (*prealloc)
depot_keep_new_pool(prealloc);
return true;
@@ -307,35 +340,35 @@ static struct stack_record *
depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
{
struct stack_record *stack;
- size_t required_size = DEPOT_STACK_RECORD_SIZE;
/* Update current and new pools if required and possible. */
- if (!depot_update_pools(required_size, prealloc))
+ if (!depot_update_pools(prealloc))
return NULL;
- /* Check if we have a pool to save the stack trace. */
- if (stack_pools[pool_index] == NULL)
+ /* Check if we have a stack record to save the stack trace. */
+ stack = next_stack;
+ if (!stack)
return NULL;
+ /* Advance the freelist. */
+ next_stack = stack->next;
+
/* Limit number of saved frames to CONFIG_STACKDEPOT_MAX_FRAMES. */
if (size > CONFIG_STACKDEPOT_MAX_FRAMES)
size = CONFIG_STACKDEPOT_MAX_FRAMES;
/* Save the stack trace. */
- stack = stack_pools[pool_index] + pool_offset;
+ stack->next = NULL;
stack->hash = hash;
stack->size = size;
- stack->handle.pool_index = pool_index;
- stack->handle.offset = pool_offset >> DEPOT_STACK_ALIGN;
- stack->handle.extra = 0;
+ /* stack->handle is already filled in by depot_init_pool(). */
memcpy(stack->entries, entries, flex_array_size(stack, entries, size));
- pool_offset += required_size;
/*
* Let KMSAN know the stored stack record is initialized. This shall
* prevent false positive reports if instrumented code accesses it.
*/
- kmsan_unpoison_memory(stack, required_size);
+ kmsan_unpoison_memory(stack, DEPOT_STACK_RECORD_SIZE);
return stack;
}
@@ -345,16 +378,16 @@ static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
union handle_parts parts = { .handle = handle };
/*
* READ_ONCE() pairs with potential concurrent write in
- * depot_update_pools().
+ * depot_init_pool().
*/
- int pool_index_cached = READ_ONCE(pool_index);
+ int pools_num_cached = READ_ONCE(pools_num);
void *pool;
size_t offset = parts.offset << DEPOT_STACK_ALIGN;
struct stack_record *stack;
- if (parts.pool_index > pool_index_cached) {
+ if (parts.pool_index > pools_num_cached) {
WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n",
- parts.pool_index, pool_index_cached, handle);
+ parts.pool_index, pools_num_cached, handle);
return NULL;
}
--
2.25.1
From: Andrey Konovalov <[email protected]>
Add stack_depot_put, a function that decrements the reference counter
on a stack record and removes it from the stack depot once the counter
reaches 0.
Internally, when removing a stack record, the function unlinks it from
the hash table bucket and returns to the freelist.
With this change, the users of stack depot can call stack_depot_put
when keeping a stack trace in the stack depot is not needed anymore.
This allows avoiding polluting the stack depot with irrelevant stack
traces and thus have more space to store the relevant ones before the
stack depot reaches its capacity.
Signed-off-by: Andrey Konovalov <[email protected]>
---
Changes v1->v2:
- Comments fixes as suggested by Marco.
- Add lockdep_assert annotation.
- Adapt to using list_head's.
- Rename stack_depot_evict to stack_depot_put.
---
include/linux/stackdepot.h | 14 ++++++++++++++
lib/stackdepot.c | 37 ++++++++++++++++++++++++++++++++++++-
2 files changed, 50 insertions(+), 1 deletion(-)
diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
index 611716702d73..a6796f178913 100644
--- a/include/linux/stackdepot.h
+++ b/include/linux/stackdepot.h
@@ -97,6 +97,8 @@ static inline int stack_depot_early_init(void) { return 0; }
*
* If STACK_DEPOT_FLAG_GET is set in @depot_flags, stack depot will increment
* the refcount on the saved stack trace if it already exists in stack depot.
+ * Users of this flag must also call stack_depot_put() when keeping the stack
+ * trace is no longer required to avoid overflowing the refcount.
*
* If the provided stack trace comes from the interrupt context, only the part
* up to the interrupt entry is saved.
@@ -162,6 +164,18 @@ void stack_depot_print(depot_stack_handle_t stack);
int stack_depot_snprint(depot_stack_handle_t handle, char *buf, size_t size,
int spaces);
+/**
+ * stack_depot_put - Drop a reference to a stack trace from stack depot
+ *
+ * @handle: Stack depot handle returned from stack_depot_save()
+ *
+ * The stack trace is evicted from stack depot once all references to it have
+ * been dropped (once the number of stack_depot_evict() calls matches the
+ * number of stack_depot_save_flags() calls with STACK_DEPOT_FLAG_GET set for
+ * this stack trace).
+ */
+void stack_depot_put(depot_stack_handle_t handle);
+
/**
* stack_depot_set_extra_bits - Set extra bits in a stack depot handle
*
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 911dee11bf39..c1b31160f4b4 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -394,7 +394,7 @@ static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
size_t offset = parts.offset << DEPOT_STACK_ALIGN;
struct stack_record *stack;
- lockdep_assert_held_read(&pool_rwlock);
+ lockdep_assert_held(&pool_rwlock);
if (parts.pool_index > pools_num) {
WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n",
@@ -410,6 +410,14 @@ static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
return stack;
}
+/* Links stack into the freelist. */
+static void depot_free_stack(struct stack_record *stack)
+{
+ lockdep_assert_held_write(&pool_rwlock);
+
+ list_add(&stack->list, &free_stacks);
+}
+
/* Calculates the hash for a stack. */
static inline u32 hash_stack(unsigned long *entries, unsigned int size)
{
@@ -592,6 +600,33 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
}
EXPORT_SYMBOL_GPL(stack_depot_fetch);
+void stack_depot_put(depot_stack_handle_t handle)
+{
+ struct stack_record *stack;
+ unsigned long flags;
+
+ if (!handle || stack_depot_disabled)
+ return;
+
+ write_lock_irqsave(&pool_rwlock, flags);
+
+ stack = depot_fetch_stack(handle);
+ if (WARN_ON(!stack))
+ goto out;
+
+ if (refcount_dec_and_test(&stack->count)) {
+ /* Unlink stack from the hash table. */
+ list_del(&stack->list);
+
+ /* Free stack. */
+ depot_free_stack(stack);
+ }
+
+out:
+ write_unlock_irqrestore(&pool_rwlock, flags);
+}
+EXPORT_SYMBOL_GPL(stack_depot_put);
+
void stack_depot_print(depot_stack_handle_t stack)
{
unsigned long *entries;
--
2.25.1
From: Andrey Konovalov <[email protected]>
Split code in depot_alloc_stack and depot_init_pool into 3 functions:
1. depot_keep_next_pool that keeps preallocated memory for the next pool
if required.
2. depot_update_pools that moves on to the next pool if there's no space
left in the current pool, uses preallocated memory for the new current
pool if required, and calls depot_keep_next_pool otherwise.
3. depot_alloc_stack that calls depot_update_pools and then allocates
a stack record as before.
This makes it somewhat easier to follow the logic of depot_alloc_stack
and also serves as a preparation for implementing the eviction of stack
records from the stack depot.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
Changes v2->v3:
- Add parentheses when referring to function calls in comments.
---
lib/stackdepot.c | 86 +++++++++++++++++++++++++++---------------------
1 file changed, 49 insertions(+), 37 deletions(-)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index cfa3c6c7cc2e..b3af868627f4 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -225,11 +225,11 @@ int stack_depot_init(void)
}
EXPORT_SYMBOL_GPL(stack_depot_init);
-/* Uses preallocated memory to initialize a new stack depot pool. */
-static void depot_init_pool(void **prealloc)
+/* Keeps the preallocated memory to be used for the next stack depot pool. */
+static void depot_keep_next_pool(void **prealloc)
{
/*
- * If the next pool is already initialized or the maximum number of
+ * If the next pool is already saved or the maximum number of
* pools is reached, do not use the preallocated memory.
* Access next_pool_required non-atomically, as there are no concurrent
* write accesses to this variable.
@@ -237,44 +237,34 @@ static void depot_init_pool(void **prealloc)
if (!next_pool_required)
return;
- /* Check if the current pool is not yet allocated. */
- if (stack_pools[pool_index] == NULL) {
- /* Use the preallocated memory for the current pool. */
- stack_pools[pool_index] = *prealloc;
+ /*
+ * Use the preallocated memory for the next pool
+ * as long as we do not exceed the maximum number of pools.
+ */
+ if (pool_index + 1 < DEPOT_MAX_POOLS) {
+ stack_pools[pool_index + 1] = *prealloc;
*prealloc = NULL;
- } else {
- /*
- * Otherwise, use the preallocated memory for the next pool
- * as long as we do not exceed the maximum number of pools.
- */
- if (pool_index + 1 < DEPOT_MAX_POOLS) {
- stack_pools[pool_index + 1] = *prealloc;
- *prealloc = NULL;
- }
- /*
- * At this point, either the next pool is initialized or the
- * maximum number of pools is reached. In either case, take
- * note that initializing another pool is not required.
- * smp_store_release() pairs with smp_load_acquire() in
- * stack_depot_save().
- */
- smp_store_release(&next_pool_required, 0);
}
+
+ /*
+ * At this point, either the next pool is kept or the maximum
+ * number of pools is reached. In either case, take note that
+ * keeping another pool is not required.
+ * smp_store_release() pairs with smp_load_acquire() in
+ * stack_depot_save().
+ */
+ smp_store_release(&next_pool_required, 0);
}
-/* Allocates a new stack in a stack depot pool. */
-static struct stack_record *
-depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
+/* Updates references to the current and the next stack depot pools. */
+static bool depot_update_pools(size_t required_size, void **prealloc)
{
- struct stack_record *stack;
- size_t required_size = DEPOT_STACK_RECORD_SIZE;
-
/* Check if there is not enough space in the current pool. */
if (unlikely(pool_offset + required_size > DEPOT_POOL_SIZE)) {
/* Bail out if we reached the pool limit. */
if (unlikely(pool_index + 1 >= DEPOT_MAX_POOLS)) {
WARN_ONCE(1, "Stack depot reached limit capacity");
- return NULL;
+ return false;
}
/*
@@ -284,9 +274,10 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
*/
WRITE_ONCE(pool_index, pool_index + 1);
pool_offset = 0;
+
/*
* If the maximum number of pools is not reached, take note
- * that the next pool needs to initialized.
+ * that the next pool needs to be initialized.
* smp_store_release() pairs with smp_load_acquire() in
* stack_depot_save().
*/
@@ -294,9 +285,30 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
smp_store_release(&next_pool_required, 1);
}
- /* Assign the preallocated memory to a pool if required. */
+ /* Check if the current pool is not yet allocated. */
+ if (*prealloc && stack_pools[pool_index] == NULL) {
+ /* Use the preallocated memory for the current pool. */
+ stack_pools[pool_index] = *prealloc;
+ *prealloc = NULL;
+ return true;
+ }
+
+ /* Otherwise, try using the preallocated memory for the next pool. */
if (*prealloc)
- depot_init_pool(prealloc);
+ depot_keep_next_pool(prealloc);
+ return true;
+}
+
+/* Allocates a new stack in a stack depot pool. */
+static struct stack_record *
+depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
+{
+ struct stack_record *stack;
+ size_t required_size = DEPOT_STACK_RECORD_SIZE;
+
+ /* Update current and next pools if required and possible. */
+ if (!depot_update_pools(required_size, prealloc))
+ return NULL;
/* Check if we have a pool to save the stack trace. */
if (stack_pools[pool_index] == NULL)
@@ -330,7 +342,7 @@ static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
union handle_parts parts = { .handle = handle };
/*
* READ_ONCE() pairs with potential concurrent write in
- * depot_alloc_stack().
+ * depot_update_pools().
*/
int pool_index_cached = READ_ONCE(pool_index);
void *pool;
@@ -430,7 +442,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
* the memory now - we won't be able to do that under the lock.
*
* smp_load_acquire() pairs with smp_store_release() in
- * depot_alloc_stack() and depot_init_pool().
+ * depot_update_pools() and depot_keep_next_pool().
*/
if (unlikely(can_alloc && smp_load_acquire(&next_pool_required))) {
/*
@@ -467,7 +479,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
* Stack depot already contains this stack trace, but let's
* keep the preallocated memory for the future.
*/
- depot_init_pool(&prealloc);
+ depot_keep_next_pool(&prealloc);
}
raw_spin_unlock_irqrestore(&pool_lock, flags);
--
2.25.1
From: Andrey Konovalov <[email protected]>
Make KMSAN use stack_depot_save instead of __stack_depot_save,
as it always passes true to __stack_depot_save as the last argument.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
Changes v1->v2:
- This is a new patch.
---
mm/kmsan/core.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/mm/kmsan/core.c b/mm/kmsan/core.c
index c19f47af0424..cf2d70e9c9a5 100644
--- a/mm/kmsan/core.c
+++ b/mm/kmsan/core.c
@@ -76,7 +76,7 @@ depot_stack_handle_t kmsan_save_stack_with_flags(gfp_t flags,
/* Don't sleep. */
flags &= ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);
- handle = __stack_depot_save(entries, nr_entries, flags, true);
+ handle = stack_depot_save(entries, nr_entries, flags);
return stack_depot_set_extra_bits(handle, extra);
}
@@ -185,11 +185,10 @@ depot_stack_handle_t kmsan_internal_chain_origin(depot_stack_handle_t id)
/*
* @entries is a local var in non-instrumented code, so KMSAN does not
* know it is initialized. Explicitly unpoison it to avoid false
- * positives when __stack_depot_save() passes it to instrumented code.
+ * positives when stack_depot_save() passes it to instrumented code.
*/
kmsan_internal_unpoison_memory(entries, sizeof(entries), false);
- handle = __stack_depot_save(entries, ARRAY_SIZE(entries), __GFP_HIGH,
- true);
+ handle = stack_depot_save(entries, ARRAY_SIZE(entries), __GFP_HIGH);
return stack_depot_set_extra_bits(handle, extra_bits);
}
--
2.25.1
From: Andrey Konovalov <[email protected]>
Currently, stack depot uses the following locking scheme:
1. Lock-free accesses when looking up a stack record, which allows to
have multiple users to look up records in parallel;
2. Spinlock for protecting the stack depot pools and the hash table
when adding a new record.
For implementing the eviction of stack traces from stack depot, the
lock-free approach is not going to work anymore, as we will need to be
able to also remove records from the hash table.
Convert the spinlock into a read/write lock, and drop the atomic accesses,
as they are no longer required.
Looking up stack traces is now protected by the read lock and adding new
records - by the write lock. One of the following patches will add a new
function for evicting stack records, which will be protected by the write
lock as well.
With this change, multiple users can still look up records in parallel.
This is preparatory patch for implementing the eviction of stack records
from the stack depot.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
Changed v2->v3:
- Use lockdep_assert_held_read annotation in depot_fetch_stack.
Changes v1->v2:
- Add lockdep_assert annotations.
---
lib/stackdepot.c | 87 +++++++++++++++++++++++++-----------------------
1 file changed, 46 insertions(+), 41 deletions(-)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index a5eff165c0d5..8378b32b5310 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -23,6 +23,7 @@
#include <linux/percpu.h>
#include <linux/printk.h>
#include <linux/slab.h>
+#include <linux/spinlock.h>
#include <linux/stacktrace.h>
#include <linux/stackdepot.h>
#include <linux/string.h>
@@ -91,15 +92,15 @@ static void *new_pool;
static int pools_num;
/* Next stack in the freelist of stack records within stack_pools. */
static struct stack_record *next_stack;
-/* Lock that protects the variables above. */
-static DEFINE_RAW_SPINLOCK(pool_lock);
/*
* Stack depot tries to keep an extra pool allocated even before it runs out
* of space in the currently used pool. This flag marks whether this extra pool
* needs to be allocated. It has the value 0 when either an extra pool is not
* yet allocated or if the limit on the number of pools is reached.
*/
-static int new_pool_required = 1;
+static bool new_pool_required = true;
+/* Lock that protects the variables above. */
+static DEFINE_RWLOCK(pool_rwlock);
static int __init disable_stack_depot(char *str)
{
@@ -232,6 +233,8 @@ static void depot_init_pool(void *pool)
const int records_in_pool = DEPOT_POOL_SIZE / DEPOT_STACK_RECORD_SIZE;
int i, offset;
+ lockdep_assert_held_write(&pool_rwlock);
+
/* Initialize handles and link stack records to each other. */
for (i = 0, offset = 0;
offset <= DEPOT_POOL_SIZE - DEPOT_STACK_RECORD_SIZE;
@@ -254,22 +257,17 @@ static void depot_init_pool(void *pool)
/* Save reference to the pool to be used by depot_fetch_stack(). */
stack_pools[pools_num] = pool;
-
- /*
- * WRITE_ONCE() pairs with potential concurrent read in
- * depot_fetch_stack().
- */
- WRITE_ONCE(pools_num, pools_num + 1);
+ pools_num++;
}
/* Keeps the preallocated memory to be used for a new stack depot pool. */
static void depot_keep_new_pool(void **prealloc)
{
+ lockdep_assert_held_write(&pool_rwlock);
+
/*
* If a new pool is already saved or the maximum number of
* pools is reached, do not use the preallocated memory.
- * Access new_pool_required non-atomically, as there are no concurrent
- * write accesses to this variable.
*/
if (!new_pool_required)
return;
@@ -287,15 +285,15 @@ static void depot_keep_new_pool(void **prealloc)
* At this point, either a new pool is kept or the maximum
* number of pools is reached. In either case, take note that
* keeping another pool is not required.
- * smp_store_release() pairs with smp_load_acquire() in
- * stack_depot_save().
*/
- smp_store_release(&new_pool_required, 0);
+ new_pool_required = false;
}
/* Updates references to the current and the next stack depot pools. */
static bool depot_update_pools(void **prealloc)
{
+ lockdep_assert_held_write(&pool_rwlock);
+
/* Check if we still have objects in the freelist. */
if (next_stack)
goto out_keep_prealloc;
@@ -307,7 +305,7 @@ static bool depot_update_pools(void **prealloc)
/* Take note that we might need a new new_pool. */
if (pools_num < DEPOT_MAX_POOLS)
- smp_store_release(&new_pool_required, 1);
+ new_pool_required = true;
/* Try keeping the preallocated memory for new_pool. */
goto out_keep_prealloc;
@@ -341,6 +339,8 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
{
struct stack_record *stack;
+ lockdep_assert_held_write(&pool_rwlock);
+
/* Update current and new pools if required and possible. */
if (!depot_update_pools(prealloc))
return NULL;
@@ -376,18 +376,15 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
{
union handle_parts parts = { .handle = handle };
- /*
- * READ_ONCE() pairs with potential concurrent write in
- * depot_init_pool().
- */
- int pools_num_cached = READ_ONCE(pools_num);
void *pool;
size_t offset = parts.offset << DEPOT_STACK_ALIGN;
struct stack_record *stack;
- if (parts.pool_index > pools_num_cached) {
+ lockdep_assert_held_read(&pool_rwlock);
+
+ if (parts.pool_index > pools_num) {
WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n",
- parts.pool_index, pools_num_cached, handle);
+ parts.pool_index, pools_num, handle);
return NULL;
}
@@ -429,6 +426,8 @@ static inline struct stack_record *find_stack(struct stack_record *bucket,
{
struct stack_record *found;
+ lockdep_assert_held(&pool_rwlock);
+
for (found = bucket; found; found = found->next) {
if (found->hash == hash &&
found->size == size &&
@@ -446,6 +445,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
depot_stack_handle_t handle = 0;
struct page *page = NULL;
void *prealloc = NULL;
+ bool need_alloc = false;
unsigned long flags;
u32 hash;
@@ -465,22 +465,26 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
hash = hash_stack(entries, nr_entries);
bucket = &stack_table[hash & stack_hash_mask];
- /*
- * Fast path: look the stack trace up without locking.
- * smp_load_acquire() pairs with smp_store_release() to |bucket| below.
- */
- found = find_stack(smp_load_acquire(bucket), entries, nr_entries, hash);
- if (found)
+ read_lock_irqsave(&pool_rwlock, flags);
+
+ /* Fast path: look the stack trace up without full locking. */
+ found = find_stack(*bucket, entries, nr_entries, hash);
+ if (found) {
+ read_unlock_irqrestore(&pool_rwlock, flags);
goto exit;
+ }
+
+ /* Take note if another stack pool needs to be allocated. */
+ if (new_pool_required)
+ need_alloc = true;
+
+ read_unlock_irqrestore(&pool_rwlock, flags);
/*
- * Check if another stack pool needs to be allocated. If so, allocate
- * the memory now: we won't be able to do that under the lock.
- *
- * smp_load_acquire() pairs with smp_store_release() in
- * depot_update_pools() and depot_keep_new_pool().
+ * Allocate memory for a new pool if required now:
+ * we won't be able to do that under the lock.
*/
- if (unlikely(can_alloc && smp_load_acquire(&new_pool_required))) {
+ if (unlikely(can_alloc && need_alloc)) {
/*
* Zero out zone modifiers, as we don't have specific zone
* requirements. Keep the flags related to allocation in atomic
@@ -494,7 +498,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
prealloc = page_address(page);
}
- raw_spin_lock_irqsave(&pool_lock, flags);
+ write_lock_irqsave(&pool_rwlock, flags);
found = find_stack(*bucket, entries, nr_entries, hash);
if (!found) {
@@ -503,11 +507,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
if (new) {
new->next = *bucket;
- /*
- * smp_store_release() pairs with smp_load_acquire()
- * from |bucket| above.
- */
- smp_store_release(bucket, new);
+ *bucket = new;
found = new;
}
} else if (prealloc) {
@@ -518,7 +518,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
depot_keep_new_pool(&prealloc);
}
- raw_spin_unlock_irqrestore(&pool_lock, flags);
+ write_unlock_irqrestore(&pool_rwlock, flags);
exit:
if (prealloc) {
/* Stack depot didn't use this memory, free it. */
@@ -542,6 +542,7 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
unsigned long **entries)
{
struct stack_record *stack;
+ unsigned long flags;
*entries = NULL;
/*
@@ -553,8 +554,12 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
if (!handle || stack_depot_disabled)
return 0;
+ read_lock_irqsave(&pool_rwlock, flags);
+
stack = depot_fetch_stack(handle);
+ read_unlock_irqrestore(&pool_rwlock, flags);
+
*entries = stack->entries;
return stack->size;
}
--
2.25.1
From: Andrey Konovalov <[email protected]>
Switch stack_record to use list_head for links in the hash table
and in the freelist.
This will allow removing entries from the hash table buckets.
This is preparatory patch for implementing the eviction of stack records
from the stack depot.
Signed-off-by: Andrey Konovalov <[email protected]>
---
Changes v2->v3:
- Use the proper number of entries for initializing the stack table when
alloc_large_system_hash() auto-calculates the number.
Changes v1->v2:
- Use list_head instead of open-coding backward links.
---
lib/stackdepot.c | 87 ++++++++++++++++++++++++++++--------------------
1 file changed, 50 insertions(+), 37 deletions(-)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 8378b32b5310..4bb0af423f82 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -18,6 +18,7 @@
#include <linux/jhash.h>
#include <linux/kernel.h>
#include <linux/kmsan.h>
+#include <linux/list.h>
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/percpu.h>
@@ -55,7 +56,7 @@ union handle_parts {
};
struct stack_record {
- struct stack_record *next; /* Link in hash table or freelist */
+ struct list_head list; /* Links in hash table or freelist */
u32 hash; /* Hash in hash table */
u32 size; /* Number of stored frames */
union handle_parts handle;
@@ -77,21 +78,21 @@ static bool __stack_depot_early_init_passed __initdata;
/* Initial seed for jhash2. */
#define STACK_HASH_SEED 0x9747b28c
-/* Hash table of pointers to stored stack traces. */
-static struct stack_record **stack_table;
+/* Hash table of stored stack records. */
+static struct list_head *stack_table;
/* Fixed order of the number of table buckets. Used when KASAN is enabled. */
static unsigned int stack_bucket_number_order;
/* Hash mask for indexing the table. */
static unsigned int stack_hash_mask;
-/* Array of memory regions that store stack traces. */
+/* Array of memory regions that store stack records. */
static void *stack_pools[DEPOT_MAX_POOLS];
/* Newly allocated pool that is not yet added to stack_pools. */
static void *new_pool;
/* Number of pools in stack_pools. */
static int pools_num;
-/* Next stack in the freelist of stack records within stack_pools. */
-static struct stack_record *next_stack;
+/* Freelist of stack records within stack_pools. */
+static LIST_HEAD(free_stacks);
/*
* Stack depot tries to keep an extra pool allocated even before it runs out
* of space in the currently used pool. This flag marks whether this extra pool
@@ -116,6 +117,15 @@ void __init stack_depot_request_early_init(void)
__stack_depot_early_init_requested = true;
}
+/* Initialize list_head's within the hash table. */
+static void init_stack_table(unsigned long entries)
+{
+ unsigned long i;
+
+ for (i = 0; i < entries; i++)
+ INIT_LIST_HEAD(&stack_table[i]);
+}
+
/* Allocates a hash table via memblock. Can only be used during early boot. */
int __init stack_depot_early_init(void)
{
@@ -152,16 +162,16 @@ int __init stack_depot_early_init(void)
/*
* If stack_bucket_number_order is not set, leave entries as 0 to rely
- * on the automatic calculations performed by alloc_large_system_hash.
+ * on the automatic calculations performed by alloc_large_system_hash().
*/
if (stack_bucket_number_order)
entries = 1UL << stack_bucket_number_order;
pr_info("allocating hash table via alloc_large_system_hash\n");
stack_table = alloc_large_system_hash("stackdepot",
- sizeof(struct stack_record *),
+ sizeof(struct list_head),
entries,
STACK_HASH_TABLE_SCALE,
- HASH_EARLY | HASH_ZERO,
+ HASH_EARLY,
NULL,
&stack_hash_mask,
1UL << STACK_BUCKET_NUMBER_ORDER_MIN,
@@ -171,6 +181,14 @@ int __init stack_depot_early_init(void)
stack_depot_disabled = true;
return -ENOMEM;
}
+ if (!entries) {
+ /*
+ * Obtain the number of entries that was calculated by
+ * alloc_large_system_hash().
+ */
+ entries = stack_hash_mask + 1;
+ }
+ init_stack_table(entries);
return 0;
}
@@ -211,7 +229,7 @@ int stack_depot_init(void)
entries = 1UL << STACK_BUCKET_NUMBER_ORDER_MAX;
pr_info("allocating hash table of %lu entries via kvcalloc\n", entries);
- stack_table = kvcalloc(entries, sizeof(struct stack_record *), GFP_KERNEL);
+ stack_table = kvcalloc(entries, sizeof(struct list_head), GFP_KERNEL);
if (!stack_table) {
pr_err("hash table allocation failed, disabling\n");
stack_depot_disabled = true;
@@ -219,6 +237,7 @@ int stack_depot_init(void)
goto out_unlock;
}
stack_hash_mask = entries - 1;
+ init_stack_table(entries);
out_unlock:
mutex_unlock(&stack_depot_init_mutex);
@@ -230,31 +249,24 @@ EXPORT_SYMBOL_GPL(stack_depot_init);
/* Initializes a stack depol pool. */
static void depot_init_pool(void *pool)
{
- const int records_in_pool = DEPOT_POOL_SIZE / DEPOT_STACK_RECORD_SIZE;
- int i, offset;
+ int offset;
lockdep_assert_held_write(&pool_rwlock);
- /* Initialize handles and link stack records to each other. */
- for (i = 0, offset = 0;
- offset <= DEPOT_POOL_SIZE - DEPOT_STACK_RECORD_SIZE;
- i++, offset += DEPOT_STACK_RECORD_SIZE) {
+ WARN_ON(!list_empty(&free_stacks));
+
+ /* Initialize handles and link stack records into the freelist. */
+ for (offset = 0; offset <= DEPOT_POOL_SIZE - DEPOT_STACK_RECORD_SIZE;
+ offset += DEPOT_STACK_RECORD_SIZE) {
struct stack_record *stack = pool + offset;
stack->handle.pool_index = pools_num;
stack->handle.offset = offset >> DEPOT_STACK_ALIGN;
stack->handle.extra = 0;
- if (i < records_in_pool - 1)
- stack->next = (void *)stack + DEPOT_STACK_RECORD_SIZE;
- else
- stack->next = NULL;
+ list_add(&stack->list, &free_stacks);
}
- /* Link stack records into the freelist. */
- WARN_ON(next_stack);
- next_stack = pool;
-
/* Save reference to the pool to be used by depot_fetch_stack(). */
stack_pools[pools_num] = pool;
pools_num++;
@@ -295,7 +307,7 @@ static bool depot_update_pools(void **prealloc)
lockdep_assert_held_write(&pool_rwlock);
/* Check if we still have objects in the freelist. */
- if (next_stack)
+ if (!list_empty(&free_stacks))
goto out_keep_prealloc;
/* Check if we have a new pool saved and use it. */
@@ -346,19 +358,18 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
return NULL;
/* Check if we have a stack record to save the stack trace. */
- stack = next_stack;
- if (!stack)
+ if (list_empty(&free_stacks))
return NULL;
- /* Advance the freelist. */
- next_stack = stack->next;
+ /* Get and unlink the first entry from the freelist. */
+ stack = list_first_entry(&free_stacks, struct stack_record, list);
+ list_del(&stack->list);
/* Limit number of saved frames to CONFIG_STACKDEPOT_MAX_FRAMES. */
if (size > CONFIG_STACKDEPOT_MAX_FRAMES)
size = CONFIG_STACKDEPOT_MAX_FRAMES;
/* Save the stack trace. */
- stack->next = NULL;
stack->hash = hash;
stack->size = size;
/* stack->handle is already filled in by depot_init_pool(). */
@@ -420,15 +431,17 @@ int stackdepot_memcmp(const unsigned long *u1, const unsigned long *u2,
}
/* Finds a stack in a bucket of the hash table. */
-static inline struct stack_record *find_stack(struct stack_record *bucket,
+static inline struct stack_record *find_stack(struct list_head *bucket,
unsigned long *entries, int size,
u32 hash)
{
+ struct list_head *pos;
struct stack_record *found;
lockdep_assert_held(&pool_rwlock);
- for (found = bucket; found; found = found->next) {
+ list_for_each(pos, bucket) {
+ found = list_entry(pos, struct stack_record, list);
if (found->hash == hash &&
found->size == size &&
!stackdepot_memcmp(entries, found->entries, size))
@@ -441,7 +454,8 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
unsigned int nr_entries,
gfp_t alloc_flags, bool can_alloc)
{
- struct stack_record *found = NULL, **bucket;
+ struct list_head *bucket;
+ struct stack_record *found = NULL;
depot_stack_handle_t handle = 0;
struct page *page = NULL;
void *prealloc = NULL;
@@ -468,7 +482,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
read_lock_irqsave(&pool_rwlock, flags);
/* Fast path: look the stack trace up without full locking. */
- found = find_stack(*bucket, entries, nr_entries, hash);
+ found = find_stack(bucket, entries, nr_entries, hash);
if (found) {
read_unlock_irqrestore(&pool_rwlock, flags);
goto exit;
@@ -500,14 +514,13 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
write_lock_irqsave(&pool_rwlock, flags);
- found = find_stack(*bucket, entries, nr_entries, hash);
+ found = find_stack(bucket, entries, nr_entries, hash);
if (!found) {
struct stack_record *new =
depot_alloc_stack(entries, nr_entries, hash, &prealloc);
if (new) {
- new->next = *bucket;
- *bucket = new;
+ list_add(&new->list, bucket);
found = new;
}
} else if (prealloc) {
--
2.25.1
From: Andrey Konovalov <[email protected]>
Add a reference counter for how many times a stack records has been added
to stack depot.
Add a new STACK_DEPOT_FLAG_GET flag to stack_depot_save_flags that
instructs the stack depot to increment the refcount.
Do not yet decrement the refcount; this is implemented in one of the
following patches.
Do not yet enable any users to use the flag to avoid overflowing the
refcount.
This is preparatory patch for implementing the eviction of stack records
from the stack depot.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
Changes v1->v2:
- Add forgotten refcount_inc() under write lock.
- Add STACK_DEPOT_FLAG_GET flag for stack_depot_save_flags.
---
include/linux/stackdepot.h | 13 ++++++++++---
lib/stackdepot.c | 12 ++++++++++--
2 files changed, 20 insertions(+), 5 deletions(-)
diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
index 0b262e14144e..611716702d73 100644
--- a/include/linux/stackdepot.h
+++ b/include/linux/stackdepot.h
@@ -39,8 +39,9 @@ typedef u32 depot_flags_t;
* to its declaration for more details.
*/
#define STACK_DEPOT_FLAG_CAN_ALLOC ((depot_flags_t)0x0001)
+#define STACK_DEPOT_FLAG_GET ((depot_flags_t)0x0002)
-#define STACK_DEPOT_FLAGS_NUM 1
+#define STACK_DEPOT_FLAGS_NUM 2
#define STACK_DEPOT_FLAGS_MASK ((depot_flags_t)((1 << STACK_DEPOT_FLAGS_NUM) - 1))
/*
@@ -94,6 +95,9 @@ static inline int stack_depot_early_init(void) { return 0; }
* flags of @alloc_flags). Otherwise, stack depot avoids any allocations and
* fails if no space is left to store the stack trace.
*
+ * If STACK_DEPOT_FLAG_GET is set in @depot_flags, stack depot will increment
+ * the refcount on the saved stack trace if it already exists in stack depot.
+ *
* If the provided stack trace comes from the interrupt context, only the part
* up to the interrupt entry is saved.
*
@@ -116,8 +120,11 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
* @nr_entries: Number of frames in the stack
* @alloc_flags: Allocation GFP flags
*
- * Context: Contexts where allocations via alloc_pages() are allowed.
- * See stack_depot_save_flags() for more details.
+ * Does not increment the refcount on the saved stack trace; see
+ * stack_depot_save_flags() for more details.
+ *
+ * Context: Contexts where allocations via alloc_pages() are allowed;
+ * see stack_depot_save_flags() for more details.
*
* Return: Handle of the stack trace stored in depot, 0 on failure
*/
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 59d61d5c09a7..911dee11bf39 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -23,6 +23,7 @@
#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/printk.h>
+#include <linux/refcount.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/stacktrace.h>
@@ -60,6 +61,7 @@ struct stack_record {
u32 hash; /* Hash in hash table */
u32 size; /* Number of stored frames */
union handle_parts handle;
+ refcount_t count;
unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
};
@@ -373,6 +375,7 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
stack->hash = hash;
stack->size = size;
/* stack->handle is already filled in by depot_init_pool(). */
+ refcount_set(&stack->count, 1);
memcpy(stack->entries, entries, flex_array_size(stack, entries, size));
/*
@@ -489,6 +492,8 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
/* Fast path: look the stack trace up without full locking. */
found = find_stack(bucket, entries, nr_entries, hash);
if (found) {
+ if (depot_flags & STACK_DEPOT_FLAG_GET)
+ refcount_inc(&found->count);
read_unlock_irqrestore(&pool_rwlock, flags);
goto exit;
}
@@ -528,12 +533,15 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
list_add(&new->list, bucket);
found = new;
}
- } else if (prealloc) {
+ } else {
+ if (depot_flags & STACK_DEPOT_FLAG_GET)
+ refcount_inc(&found->count);
/*
* Stack depot already contains this stack trace, but let's
* keep the preallocated memory for future.
*/
- depot_keep_new_pool(&prealloc);
+ if (prealloc)
+ depot_keep_new_pool(&prealloc);
}
write_unlock_irqrestore(&pool_rwlock, flags);
--
2.25.1
From: Andrey Konovalov <[email protected]>
Check the object size when looking up entries in the stack ring.
If the size of the object for which a report is being printed does not
match the size of the object for which a stack trace has been saved in
the stack ring, the saved stack trace is irrelevant.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
Changes v2->v3:
- Added missing "../slab.h" include for accessing a kmem_cache field.
Changes v1->v2:
- This is a new patch.
---
mm/kasan/report_tags.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/kasan/report_tags.c b/mm/kasan/report_tags.c
index 78abdcde5da9..55154743f915 100644
--- a/mm/kasan/report_tags.c
+++ b/mm/kasan/report_tags.c
@@ -7,6 +7,7 @@
#include <linux/atomic.h>
#include "kasan.h"
+#include "../slab.h"
extern struct kasan_stack_ring stack_ring;
@@ -58,7 +59,8 @@ void kasan_complete_mode_report_info(struct kasan_report_info *info)
entry = &stack_ring.entries[i % stack_ring.size];
if (kasan_reset_tag(entry->ptr) != info->object ||
- get_tag(entry->ptr) != get_tag(info->access_addr))
+ get_tag(entry->ptr) != get_tag(info->access_addr) ||
+ info->cache->object_size != entry->size)
continue;
if (entry->is_free) {
--
2.25.1
From: Andrey Konovalov <[email protected]>
KMSAN is frequently used in fuzzing scenarios and thus saves a lot of
stack traces. As KMSAN does not support evicting stack traces from the
stack depot, the stack depot capacity might be reached quickly with large
stack records.
Adjust the maximum number of stack depot pools for this case.
The average size of a stack trace saved into the stack depot is ~16
frames. Thus, adjust the maximum pools number accordingly to keep the
maximum number of stack traces that can be saved into the stack depot
similar to the one that was allowed before the stack trace eviction
changes.
Signed-off-by: Andrey Konovalov <[email protected]>
---
lib/stackdepot.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index c1b31160f4b4..870cce2f4cbd 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -41,7 +41,17 @@
#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT - DEPOT_STACK_ALIGN)
#define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_OFFSET_BITS - \
STACK_DEPOT_EXTRA_BITS)
+#if IS_ENABLED(CONFIG_KMSAN) && CONFIG_STACKDEPOT_MAX_FRAMES >= 32
+/*
+ * KMSAN is frequently used in fuzzing scenarios and thus saves a lot of stack
+ * traces. As KMSAN does not support evicting stack traces from the stack
+ * depot, the stack depot capacity might be reached quickly with large stack
+ * records. Adjust the maximum number of stack depot pools for this case.
+ */
+#define DEPOT_POOLS_CAP (8192 * (CONFIG_STACKDEPOT_MAX_FRAMES / 16))
+#else
#define DEPOT_POOLS_CAP 8192
+#endif
#define DEPOT_MAX_POOLS \
(((1LL << (DEPOT_POOL_INDEX_BITS)) < DEPOT_POOLS_CAP) ? \
(1LL << (DEPOT_POOL_INDEX_BITS)) : DEPOT_POOLS_CAP)
--
2.25.1
From: Andrey Konovalov <[email protected]>
Evict alloc/free stack traces from the stack depot for Generic KASAN
once they are evicted from the quaratine.
For auxiliary stack traces, evict the oldest stack trace once a new one
is saved (KASAN only keeps references to the last two).
Also evict all saved stack traces on krealloc.
To avoid double-evicting and mis-evicting stack traces (in case KASAN's
metadata was corrupted), reset KASAN's per-object metadata that stores
stack depot handles when the object is initialized and when it's evicted
from the quarantine.
Note that stack_depot_put is no-op if the handle is 0.
Reviewed-by: Marco Elver <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
mm/kasan/common.c | 3 ++-
mm/kasan/generic.c | 22 ++++++++++++++++++----
mm/kasan/quarantine.c | 26 ++++++++++++++++++++------
3 files changed, 40 insertions(+), 11 deletions(-)
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 825a0240ec02..b5d8bd26fced 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -50,7 +50,8 @@ depot_stack_handle_t kasan_save_stack(gfp_t flags, depot_flags_t depot_flags)
void kasan_set_track(struct kasan_track *track, gfp_t flags)
{
track->pid = current->pid;
- track->stack = kasan_save_stack(flags, STACK_DEPOT_FLAG_CAN_ALLOC);
+ track->stack = kasan_save_stack(flags,
+ STACK_DEPOT_FLAG_CAN_ALLOC | STACK_DEPOT_FLAG_GET);
}
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c
index 5d168c9afb32..50cc519e23f4 100644
--- a/mm/kasan/generic.c
+++ b/mm/kasan/generic.c
@@ -449,10 +449,14 @@ struct kasan_free_meta *kasan_get_free_meta(struct kmem_cache *cache,
void kasan_init_object_meta(struct kmem_cache *cache, const void *object)
{
struct kasan_alloc_meta *alloc_meta;
+ struct kasan_free_meta *free_meta;
alloc_meta = kasan_get_alloc_meta(cache, object);
if (alloc_meta)
__memset(alloc_meta, 0, sizeof(*alloc_meta));
+ free_meta = kasan_get_free_meta(cache, object);
+ if (free_meta)
+ __memset(free_meta, 0, sizeof(*free_meta));
}
size_t kasan_metadata_size(struct kmem_cache *cache, bool in_object)
@@ -489,18 +493,20 @@ static void __kasan_record_aux_stack(void *addr, depot_flags_t depot_flags)
if (!alloc_meta)
return;
+ stack_depot_put(alloc_meta->aux_stack[1]);
alloc_meta->aux_stack[1] = alloc_meta->aux_stack[0];
alloc_meta->aux_stack[0] = kasan_save_stack(0, depot_flags);
}
void kasan_record_aux_stack(void *addr)
{
- return __kasan_record_aux_stack(addr, STACK_DEPOT_FLAG_CAN_ALLOC);
+ return __kasan_record_aux_stack(addr,
+ STACK_DEPOT_FLAG_CAN_ALLOC | STACK_DEPOT_FLAG_GET);
}
void kasan_record_aux_stack_noalloc(void *addr)
{
- return __kasan_record_aux_stack(addr, 0);
+ return __kasan_record_aux_stack(addr, STACK_DEPOT_FLAG_GET);
}
void kasan_save_alloc_info(struct kmem_cache *cache, void *object, gfp_t flags)
@@ -508,8 +514,16 @@ void kasan_save_alloc_info(struct kmem_cache *cache, void *object, gfp_t flags)
struct kasan_alloc_meta *alloc_meta;
alloc_meta = kasan_get_alloc_meta(cache, object);
- if (alloc_meta)
- kasan_set_track(&alloc_meta->alloc_track, flags);
+ if (!alloc_meta)
+ return;
+
+ /* Evict previous stack traces (might exist for krealloc). */
+ stack_depot_put(alloc_meta->alloc_track.stack);
+ stack_depot_put(alloc_meta->aux_stack[0]);
+ stack_depot_put(alloc_meta->aux_stack[1]);
+ __memset(alloc_meta, 0, sizeof(*alloc_meta));
+
+ kasan_set_track(&alloc_meta->alloc_track, flags);
}
void kasan_save_free_info(struct kmem_cache *cache, void *object)
diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c
index ca4529156735..265ca2bbe2dd 100644
--- a/mm/kasan/quarantine.c
+++ b/mm/kasan/quarantine.c
@@ -143,11 +143,22 @@ static void *qlink_to_object(struct qlist_node *qlink, struct kmem_cache *cache)
static void qlink_free(struct qlist_node *qlink, struct kmem_cache *cache)
{
void *object = qlink_to_object(qlink, cache);
- struct kasan_free_meta *meta = kasan_get_free_meta(cache, object);
+ struct kasan_alloc_meta *alloc_meta = kasan_get_alloc_meta(cache, object);
+ struct kasan_free_meta *free_meta = kasan_get_free_meta(cache, object);
unsigned long flags;
- if (IS_ENABLED(CONFIG_SLAB))
- local_irq_save(flags);
+ if (alloc_meta) {
+ stack_depot_put(alloc_meta->alloc_track.stack);
+ stack_depot_put(alloc_meta->aux_stack[0]);
+ stack_depot_put(alloc_meta->aux_stack[1]);
+ __memset(alloc_meta, 0, sizeof(*alloc_meta));
+ }
+
+ if (free_meta &&
+ *(u8 *)kasan_mem_to_shadow(object) == KASAN_SLAB_FREETRACK) {
+ stack_depot_put(free_meta->free_track.stack);
+ free_meta->free_track.stack = 0;
+ }
/*
* If init_on_free is enabled and KASAN's free metadata is stored in
@@ -157,14 +168,17 @@ static void qlink_free(struct qlist_node *qlink, struct kmem_cache *cache)
*/
if (slab_want_init_on_free(cache) &&
cache->kasan_info.free_meta_offset == 0)
- memzero_explicit(meta, sizeof(*meta));
+ memzero_explicit(free_meta, sizeof(*free_meta));
/*
- * As the object now gets freed from the quarantine, assume that its
- * free track is no longer valid.
+ * As the object now gets freed from the quarantine,
+ * take note that its free track is no longer exists.
*/
*(u8 *)kasan_mem_to_shadow(object) = KASAN_SLAB_FREE;
+ if (IS_ENABLED(CONFIG_SLAB))
+ local_irq_save(flags);
+
___cache_free(cache, object, _THIS_IP_);
if (IS_ENABLED(CONFIG_SLAB))
--
2.25.1
From: Andrey Konovalov <[email protected]>
Make tag-based KASAN modes evict stack traces from the stack depot once
they are evicted from the stack ring.
Internally, pass STACK_DEPOT_FLAG_GET to stack_depot_save_flags (via
kasan_save_stack) to increment the refcount when saving a new entry
to stack ring and call stack_depot_put when removing an entry from
stack ring.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
Changes v1->v2:
- Adapt to the stack depot API change.
- Drop READ_ONCE when reading entry->stack.
---
mm/kasan/tags.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/mm/kasan/tags.c b/mm/kasan/tags.c
index b6c017e670d8..739ae997463d 100644
--- a/mm/kasan/tags.c
+++ b/mm/kasan/tags.c
@@ -97,12 +97,13 @@ static void save_stack_info(struct kmem_cache *cache, void *object,
gfp_t gfp_flags, bool is_free)
{
unsigned long flags;
- depot_stack_handle_t stack;
+ depot_stack_handle_t stack, old_stack;
u64 pos;
struct kasan_stack_ring_entry *entry;
void *old_ptr;
- stack = kasan_save_stack(gfp_flags, STACK_DEPOT_FLAG_CAN_ALLOC);
+ stack = kasan_save_stack(gfp_flags,
+ STACK_DEPOT_FLAG_CAN_ALLOC | STACK_DEPOT_FLAG_GET);
/*
* Prevent save_stack_info() from modifying stack ring
@@ -121,6 +122,8 @@ static void save_stack_info(struct kmem_cache *cache, void *object,
if (!try_cmpxchg(&entry->ptr, &old_ptr, STACK_RING_BUSY_PTR))
goto next; /* Busy slot. */
+ old_stack = entry->stack;
+
entry->size = cache->object_size;
entry->pid = current->pid;
entry->stack = stack;
@@ -129,6 +132,9 @@ static void save_stack_info(struct kmem_cache *cache, void *object,
entry->ptr = object;
read_unlock_irqrestore(&stack_ring.lock, flags);
+
+ if (old_stack)
+ stack_depot_put(old_stack);
}
void kasan_save_alloc_info(struct kmem_cache *cache, void *object, gfp_t flags)
--
2.25.1
From: Andrey Konovalov <[email protected]>
Remove the atomic accesses to entry fields in save_stack_info and
kasan_complete_mode_report_info for tag-based KASAN modes.
These atomics are not required, as the read/write lock prevents the
entries from being read (in kasan_complete_mode_report_info) while being
written (in save_stack_info) and the try_cmpxchg prevents the same entry
from being rewritten (in save_stack_info) in the unlikely case of wrapping
during writing.
Reviewed-by: Alexander Potapenko <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
Changes v1->v2:
- This is a new patch.
---
mm/kasan/report_tags.c | 25 +++++++------------------
mm/kasan/tags.c | 13 +++++--------
2 files changed, 12 insertions(+), 26 deletions(-)
diff --git a/mm/kasan/report_tags.c b/mm/kasan/report_tags.c
index 8b8bfdb3cfdb..78abdcde5da9 100644
--- a/mm/kasan/report_tags.c
+++ b/mm/kasan/report_tags.c
@@ -31,10 +31,6 @@ void kasan_complete_mode_report_info(struct kasan_report_info *info)
unsigned long flags;
u64 pos;
struct kasan_stack_ring_entry *entry;
- void *ptr;
- u32 pid;
- depot_stack_handle_t stack;
- bool is_free;
bool alloc_found = false, free_found = false;
if ((!info->cache || !info->object) && !info->bug_type) {
@@ -61,18 +57,11 @@ void kasan_complete_mode_report_info(struct kasan_report_info *info)
entry = &stack_ring.entries[i % stack_ring.size];
- /* Paired with smp_store_release() in save_stack_info(). */
- ptr = (void *)smp_load_acquire(&entry->ptr);
-
- if (kasan_reset_tag(ptr) != info->object ||
- get_tag(ptr) != get_tag(info->access_addr))
+ if (kasan_reset_tag(entry->ptr) != info->object ||
+ get_tag(entry->ptr) != get_tag(info->access_addr))
continue;
- pid = READ_ONCE(entry->pid);
- stack = READ_ONCE(entry->stack);
- is_free = READ_ONCE(entry->is_free);
-
- if (is_free) {
+ if (entry->is_free) {
/*
* Second free of the same object.
* Give up on trying to find the alloc entry.
@@ -80,8 +69,8 @@ void kasan_complete_mode_report_info(struct kasan_report_info *info)
if (free_found)
break;
- info->free_track.pid = pid;
- info->free_track.stack = stack;
+ info->free_track.pid = entry->pid;
+ info->free_track.stack = entry->stack;
free_found = true;
/*
@@ -95,8 +84,8 @@ void kasan_complete_mode_report_info(struct kasan_report_info *info)
if (alloc_found)
break;
- info->alloc_track.pid = pid;
- info->alloc_track.stack = stack;
+ info->alloc_track.pid = entry->pid;
+ info->alloc_track.stack = entry->stack;
alloc_found = true;
/*
diff --git a/mm/kasan/tags.c b/mm/kasan/tags.c
index 4fd32121b0fd..b6c017e670d8 100644
--- a/mm/kasan/tags.c
+++ b/mm/kasan/tags.c
@@ -121,15 +121,12 @@ static void save_stack_info(struct kmem_cache *cache, void *object,
if (!try_cmpxchg(&entry->ptr, &old_ptr, STACK_RING_BUSY_PTR))
goto next; /* Busy slot. */
- WRITE_ONCE(entry->size, cache->object_size);
- WRITE_ONCE(entry->pid, current->pid);
- WRITE_ONCE(entry->stack, stack);
- WRITE_ONCE(entry->is_free, is_free);
+ entry->size = cache->object_size;
+ entry->pid = current->pid;
+ entry->stack = stack;
+ entry->is_free = is_free;
- /*
- * Paired with smp_load_acquire() in kasan_complete_mode_report_info().
- */
- smp_store_release(&entry->ptr, (s64)object);
+ entry->ptr = object;
read_unlock_irqrestore(&stack_ring.lock, flags);
}
--
2.25.1
On Tue, Nov 21, 2023 at 1:08 PM <[email protected]> wrote:
>
> From: Andrey Konovalov <[email protected]>
>
> Evict alloc/free stack traces from the stack depot for Generic KASAN
> once they are evicted from the quaratine.
>
> For auxiliary stack traces, evict the oldest stack trace once a new one
> is saved (KASAN only keeps references to the last two).
>
> Also evict all saved stack traces on krealloc.
>
> To avoid double-evicting and mis-evicting stack traces (in case KASAN's
> metadata was corrupted), reset KASAN's per-object metadata that stores
> stack depot handles when the object is initialized and when it's evicted
> from the quarantine.
>
> Note that stack_depot_put is no-op if the handle is 0.
>
> Reviewed-by: Marco Elver <[email protected]>
> Signed-off-by: Andrey Konovalov <[email protected]>
I observed boot hangs on a few SLUB configurations.
Having other users of stackdepot might be the cause. After passing
'slub_debug=-' which disables SLUB debugging, it boots fine.
compiler version: gcc-11
config: https://download.kerneltesting.org/builds/2023-11-21-f121f2/.config
bisect log: https://download.kerneltesting.org/builds/2023-11-21-f121f2/bisect.log.txt
[dmesg]
(gdb) lx-dmesg
[ 0.000000] Linux version 6.7.0-rc1-00136-g0e8b630f3053
([email protected]) (gcc (GCC) 11.3.1 20221121 (R3[
0.000000] Command line: console=ttyS0 root=/dev/sda1 nokaslr
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
6.7.0-rc1-00136-g0e8b630f3053 #22
[ 0.000000] RIP: 0010:setup_arch+0x500/0x2250
[ 0.000000] Code: c6 09 08 00 48 89 c5 48 85 c0 0f 84 58 13 00 00
48 c1 e8 03 48 83 05 be 97 66 00 01 80 3c 18 00 0f3[ 0.000000] RSP:
0000:ffffffff86007e00 EFLAGS: 00010046 ORIG_RAX: 0000000000000009
[ 0.000000] RAX: 1fffffffffe40088 RBX: dffffc0000000000 RCX: 1ffffffff11ed630
[ 0.000000] RDX: 0000000000000000 RSI: feec4698e8103000 RDI: ffffffff88f6b180
[ 0.000000] RBP: ffffffffff200444 R08: 8000000000000163 R09: 1ffffffff11ed628
[ 0.000000] R10: ffffffff88f7a150 R11: 0000000000000000 R12: 0000000000000010
[ 0.000000] R13: ffffffffff200450 R14: feec4698e8102444 R15: feec4698e8102444
[ 0.000000] FS: 0000000000000000(0000) GS:ffffffff88d5b000(0000)
knlGS:0000000000000000
[ 0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.000000] CR2: ffffffffff200444 CR3: 0000000008f0e000 CR4: 00000000000000b0
[ 0.000000] Call Trace:
[ 0.000000] <TASK>
[ 0.000000] ? show_regs+0x87/0xa0
[ 0.000000] ? early_fixup_exception+0x130/0x310
[ 0.000000] ? do_early_exception+0x23/0x90
[ 0.000000] ? early_idt_handler_common+0x2f/0x40
[ 0.000000] ? setup_arch+0x500/0x2250
[ 0.000000] ? __pfx_setup_arch+0x10/0x10
[ 0.000000] ? vprintk_default+0x20/0x30
[ 0.000000] ? vprintk+0x4c/0x80
[ 0.000000] ? _printk+0xba/0xf0
[ 0.000000] ? __pfx__printk+0x10/0x10
[ 0.000000] ? init_cgroup_root+0x10f/0x2f0
--Type <RET> for more, q to quit, c to continue without paging--
[ 0.000000] ? cgroup_init_early+0x1e4/0x440
[ 0.000000] ? start_kernel+0xae/0x790
[ 0.000000] ? x86_64_start_reservations+0x28/0x50
[ 0.000000] ? x86_64_start_kernel+0x10e/0x130
[ 0.000000] ? secondary_startup_64_no_verify+0x178/0x17b
[ 0.000000] </TASK>
--
Hyeonggon
On Wed, Nov 22, 2023 at 12:17 PM Hyeonggon Yoo <[email protected]> wrote:
>
> On Tue, Nov 21, 2023 at 1:08 PM <[email protected]> wrote:
> >
> > From: Andrey Konovalov <[email protected]>
> >
> > Evict alloc/free stack traces from the stack depot for Generic KASAN
> > once they are evicted from the quaratine.
> >
> > For auxiliary stack traces, evict the oldest stack trace once a new one
> > is saved (KASAN only keeps references to the last two).
> >
> > Also evict all saved stack traces on krealloc.
> >
> > To avoid double-evicting and mis-evicting stack traces (in case KASAN's
> > metadata was corrupted), reset KASAN's per-object metadata that stores
> > stack depot handles when the object is initialized and when it's evicted
> > from the quarantine.
> >
> > Note that stack_depot_put is no-op if the handle is 0.
> >
> > Reviewed-by: Marco Elver <[email protected]>
> > Signed-off-by: Andrey Konovalov <[email protected]>
>
> I observed boot hangs on a few SLUB configurations.
>
> Having other users of stackdepot might be the cause. After passing
> 'slub_debug=-' which disables SLUB debugging, it boots fine.
Looks like I forgot to Cc regzbot.
If you need more information, please let me know.
#regzbot introduced: f0ff84b7c3a
Thanks,
Hyeonggon
> compiler version: gcc-11
> config: https://download.kerneltesting.org/builds/2023-11-21-f121f2/.config
> bisect log: https://download.kerneltesting.org/builds/2023-11-21-f121f2/bisect.log.txt
>
> [dmesg]
> (gdb) lx-dmesg
> [ 0.000000] Linux version 6.7.0-rc1-00136-g0e8b630f3053
> ([email protected]) (gcc (GCC) 11.3.1 20221121 (R3[
> 0.000000] Command line: console=ttyS0 root=/dev/sda1 nokaslr
> [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
> 6.7.0-rc1-00136-g0e8b630f3053 #22
> [ 0.000000] RIP: 0010:setup_arch+0x500/0x2250
> [ 0.000000] Code: c6 09 08 00 48 89 c5 48 85 c0 0f 84 58 13 00 00
> 48 c1 e8 03 48 83 05 be 97 66 00 01 80 3c 18 00 0f3[ 0.000000] RSP:
> 0000:ffffffff86007e00 EFLAGS: 00010046 ORIG_RAX: 0000000000000009
> [ 0.000000] RAX: 1fffffffffe40088 RBX: dffffc0000000000 RCX: 1ffffffff11ed630
> [ 0.000000] RDX: 0000000000000000 RSI: feec4698e8103000 RDI: ffffffff88f6b180
> [ 0.000000] RBP: ffffffffff200444 R08: 8000000000000163 R09: 1ffffffff11ed628
> [ 0.000000] R10: ffffffff88f7a150 R11: 0000000000000000 R12: 0000000000000010
> [ 0.000000] R13: ffffffffff200450 R14: feec4698e8102444 R15: feec4698e8102444
> [ 0.000000] FS: 0000000000000000(0000) GS:ffffffff88d5b000(0000)
> knlGS:0000000000000000
> [ 0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.000000] CR2: ffffffffff200444 CR3: 0000000008f0e000 CR4: 00000000000000b0
> [ 0.000000] Call Trace:
> [ 0.000000] <TASK>
> [ 0.000000] ? show_regs+0x87/0xa0
> [ 0.000000] ? early_fixup_exception+0x130/0x310
> [ 0.000000] ? do_early_exception+0x23/0x90
> [ 0.000000] ? early_idt_handler_common+0x2f/0x40
> [ 0.000000] ? setup_arch+0x500/0x2250
> [ 0.000000] ? __pfx_setup_arch+0x10/0x10
> [ 0.000000] ? vprintk_default+0x20/0x30
> [ 0.000000] ? vprintk+0x4c/0x80
> [ 0.000000] ? _printk+0xba/0xf0
> [ 0.000000] ? __pfx__printk+0x10/0x10
> [ 0.000000] ? init_cgroup_root+0x10f/0x2f0
> --Type <RET> for more, q to quit, c to continue without paging--
> [ 0.000000] ? cgroup_init_early+0x1e4/0x440
> [ 0.000000] ? start_kernel+0xae/0x790
> [ 0.000000] ? x86_64_start_reservations+0x28/0x50
> [ 0.000000] ? x86_64_start_kernel+0x10e/0x130
> [ 0.000000] ? secondary_startup_64_no_verify+0x178/0x17b
> [ 0.000000] </TASK>
On Wed, Nov 22, 2023 at 4:17 AM Hyeonggon Yoo <[email protected]> wrote:
>
> On Tue, Nov 21, 2023 at 1:08 PM <[email protected]> wrote:
> >
> > From: Andrey Konovalov <[email protected]>
> >
> > Evict alloc/free stack traces from the stack depot for Generic KASAN
> > once they are evicted from the quaratine.
> >
> > For auxiliary stack traces, evict the oldest stack trace once a new one
> > is saved (KASAN only keeps references to the last two).
> >
> > Also evict all saved stack traces on krealloc.
> >
> > To avoid double-evicting and mis-evicting stack traces (in case KASAN's
> > metadata was corrupted), reset KASAN's per-object metadata that stores
> > stack depot handles when the object is initialized and when it's evicted
> > from the quarantine.
> >
> > Note that stack_depot_put is no-op if the handle is 0.
> >
> > Reviewed-by: Marco Elver <[email protected]>
> > Signed-off-by: Andrey Konovalov <[email protected]>
>
> I observed boot hangs on a few SLUB configurations.
>
> Having other users of stackdepot might be the cause. After passing
> 'slub_debug=-' which disables SLUB debugging, it boots fine.
Hi Hyeonggon,
Just mailed a fix.
Thank you for the report!
On Mon, Nov 20, 2023 at 06:47:00PM +0100, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> Do not try fetching a stack trace from the stack depot if the
> stack_depot_disabled flag is enabled.
>
> Reviewed-by: Alexander Potapenko <[email protected]>
> Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE Labs
On Mon, Nov 20, 2023 at 06:47:01PM +0100, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> The retval local variable in __stack_depot_save has the union type
> handle_parts, but the function never uses anything but the union's
> handle field.
>
> Define retval simply as depot_stack_handle_t to simplify the code.
>
> Reviewed-by: Alexander Potapenko <[email protected]>
> Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE Labs
On Mon, Nov 20, 2023 at 06:47:02PM +0100, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> Stack depot doesn't use the valid bit in handles in any way, so drop it.
>
> Reviewed-by: Alexander Potapenko <[email protected]>
> Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE Labs
On Mon, Nov 20, 2023 at 06:47:03PM +0100, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> Add a helper depot_fetch_stack function that fetches the pointer to
> a stack record.
>
> With this change, all static depot_* functions now operate on stack pools
> and the exported stack_depot_* functions operate on the hash table.
>
> Reviewed-by: Alexander Potapenko <[email protected]>
> Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE Labs
On Mon, Nov 20, 2023 at 06:47:04PM +0100, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> Instead of storing stack records in stack depot pools one right after
> another, use fixed-sized slots.
>
> Add a new Kconfig option STACKDEPOT_MAX_FRAMES that allows to select
> the size of the slot in frames. Use 64 as the default value, which is
> the maximum stack trace size both KASAN and KMSAN use right now.
>
> Also add descriptions for other stack depot Kconfig options.
>
> This is preparatory patch for implementing the eviction of stack records
> from the stack depot.
>
> Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE Labs
On Mon, Nov 20, 2023 at 06:47:06PM +0100, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> Split code in depot_alloc_stack and depot_init_pool into 3 functions:
>
> 1. depot_keep_next_pool that keeps preallocated memory for the next pool
> if required.
>
> 2. depot_update_pools that moves on to the next pool if there's no space
> left in the current pool, uses preallocated memory for the new current
> pool if required, and calls depot_keep_next_pool otherwise.
>
> 3. depot_alloc_stack that calls depot_update_pools and then allocates
> a stack record as before.
>
> This makes it somewhat easier to follow the logic of depot_alloc_stack
> and also serves as a preparation for implementing the eviction of stack
> records from the stack depot.
>
> Reviewed-by: Alexander Potapenko <[email protected]>
> Signed-off-by: Andrey Konovalov <[email protected]>
I have to say this simplifies the reading quite a lot.
Reviewed-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE Labs
On Mon, Nov 20, 2023 at 06:47:07PM +0100, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> Rename next_pool_required to new_pool_required.
>
> This a purely code readability change: the following patch will change
> stack depot to store the pointer to the new pool in a separate variable,
> and "new" seems like a more logical name.
>
> Reviewed-by: Alexander Potapenko <[email protected]>
> Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE Labs
On Mon, Nov 20, 2023 at 06:47:08PM +0100, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> Instead of using the last pointer in stack_pools for storing the pointer
> to a new pool (which does not yet store any stack records), use a new
> new_pool variable.
>
> This a purely code readability change: it seems more logical to store
> the pointer to a pool with a special meaning in a dedicated variable.
>
> Reviewed-by: Alexander Potapenko <[email protected]>
> Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE Labs
On Mon, Nov 20, 2023 at 06:47:10PM +0100, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> Currently, stack depot uses the following locking scheme:
>
> 1. Lock-free accesses when looking up a stack record, which allows to
> have multiple users to look up records in parallel;
> 2. Spinlock for protecting the stack depot pools and the hash table
> when adding a new record.
>
> For implementing the eviction of stack traces from stack depot, the
> lock-free approach is not going to work anymore, as we will need to be
> able to also remove records from the hash table.
>
> Convert the spinlock into a read/write lock, and drop the atomic accesses,
> as they are no longer required.
>
> Looking up stack traces is now protected by the read lock and adding new
> records - by the write lock. One of the following patches will add a new
> function for evicting stack records, which will be protected by the write
> lock as well.
>
> With this change, multiple users can still look up records in parallel.
>
> This is preparatory patch for implementing the eviction of stack records
> from the stack depot.
>
> Reviewed-by: Alexander Potapenko <[email protected]>
> Signed-off-by: Andrey Konovalov <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
> ---
>
> Changed v2->v3:
> - Use lockdep_assert_held_read annotation in depot_fetch_stack.
>
> Changes v1->v2:
> - Add lockdep_assert annotations.
> ---
> lib/stackdepot.c | 87 +++++++++++++++++++++++++-----------------------
> 1 file changed, 46 insertions(+), 41 deletions(-)
>
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index a5eff165c0d5..8378b32b5310 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -23,6 +23,7 @@
> #include <linux/percpu.h>
> #include <linux/printk.h>
> #include <linux/slab.h>
> +#include <linux/spinlock.h>
> #include <linux/stacktrace.h>
> #include <linux/stackdepot.h>
> #include <linux/string.h>
> @@ -91,15 +92,15 @@ static void *new_pool;
> static int pools_num;
> /* Next stack in the freelist of stack records within stack_pools. */
> static struct stack_record *next_stack;
> -/* Lock that protects the variables above. */
> -static DEFINE_RAW_SPINLOCK(pool_lock);
> /*
> * Stack depot tries to keep an extra pool allocated even before it runs out
> * of space in the currently used pool. This flag marks whether this extra pool
> * needs to be allocated. It has the value 0 when either an extra pool is not
> * yet allocated or if the limit on the number of pools is reached.
> */
> -static int new_pool_required = 1;
> +static bool new_pool_required = true;
> +/* Lock that protects the variables above. */
> +static DEFINE_RWLOCK(pool_rwlock);
>
> static int __init disable_stack_depot(char *str)
> {
> @@ -232,6 +233,8 @@ static void depot_init_pool(void *pool)
> const int records_in_pool = DEPOT_POOL_SIZE / DEPOT_STACK_RECORD_SIZE;
> int i, offset;
>
> + lockdep_assert_held_write(&pool_rwlock);
> +
> /* Initialize handles and link stack records to each other. */
> for (i = 0, offset = 0;
> offset <= DEPOT_POOL_SIZE - DEPOT_STACK_RECORD_SIZE;
> @@ -254,22 +257,17 @@ static void depot_init_pool(void *pool)
>
> /* Save reference to the pool to be used by depot_fetch_stack(). */
> stack_pools[pools_num] = pool;
> -
> - /*
> - * WRITE_ONCE() pairs with potential concurrent read in
> - * depot_fetch_stack().
> - */
> - WRITE_ONCE(pools_num, pools_num + 1);
> + pools_num++;
> }
>
> /* Keeps the preallocated memory to be used for a new stack depot pool. */
> static void depot_keep_new_pool(void **prealloc)
> {
> + lockdep_assert_held_write(&pool_rwlock);
> +
> /*
> * If a new pool is already saved or the maximum number of
> * pools is reached, do not use the preallocated memory.
> - * Access new_pool_required non-atomically, as there are no concurrent
> - * write accesses to this variable.
> */
> if (!new_pool_required)
> return;
> @@ -287,15 +285,15 @@ static void depot_keep_new_pool(void **prealloc)
> * At this point, either a new pool is kept or the maximum
> * number of pools is reached. In either case, take note that
> * keeping another pool is not required.
> - * smp_store_release() pairs with smp_load_acquire() in
> - * stack_depot_save().
> */
> - smp_store_release(&new_pool_required, 0);
> + new_pool_required = false;
> }
>
> /* Updates references to the current and the next stack depot pools. */
> static bool depot_update_pools(void **prealloc)
> {
> + lockdep_assert_held_write(&pool_rwlock);
> +
> /* Check if we still have objects in the freelist. */
> if (next_stack)
> goto out_keep_prealloc;
> @@ -307,7 +305,7 @@ static bool depot_update_pools(void **prealloc)
>
> /* Take note that we might need a new new_pool. */
> if (pools_num < DEPOT_MAX_POOLS)
> - smp_store_release(&new_pool_required, 1);
> + new_pool_required = true;
>
> /* Try keeping the preallocated memory for new_pool. */
> goto out_keep_prealloc;
> @@ -341,6 +339,8 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
> {
> struct stack_record *stack;
>
> + lockdep_assert_held_write(&pool_rwlock);
> +
> /* Update current and new pools if required and possible. */
> if (!depot_update_pools(prealloc))
> return NULL;
> @@ -376,18 +376,15 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
> static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
> {
> union handle_parts parts = { .handle = handle };
> - /*
> - * READ_ONCE() pairs with potential concurrent write in
> - * depot_init_pool().
> - */
> - int pools_num_cached = READ_ONCE(pools_num);
> void *pool;
> size_t offset = parts.offset << DEPOT_STACK_ALIGN;
> struct stack_record *stack;
>
> - if (parts.pool_index > pools_num_cached) {
> + lockdep_assert_held_read(&pool_rwlock);
> +
> + if (parts.pool_index > pools_num) {
> WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n",
> - parts.pool_index, pools_num_cached, handle);
> + parts.pool_index, pools_num, handle);
> return NULL;
> }
>
> @@ -429,6 +426,8 @@ static inline struct stack_record *find_stack(struct stack_record *bucket,
> {
> struct stack_record *found;
>
> + lockdep_assert_held(&pool_rwlock);
> +
> for (found = bucket; found; found = found->next) {
> if (found->hash == hash &&
> found->size == size &&
> @@ -446,6 +445,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
> depot_stack_handle_t handle = 0;
> struct page *page = NULL;
> void *prealloc = NULL;
> + bool need_alloc = false;
> unsigned long flags;
> u32 hash;
>
> @@ -465,22 +465,26 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
> hash = hash_stack(entries, nr_entries);
> bucket = &stack_table[hash & stack_hash_mask];
>
> - /*
> - * Fast path: look the stack trace up without locking.
> - * smp_load_acquire() pairs with smp_store_release() to |bucket| below.
> - */
> - found = find_stack(smp_load_acquire(bucket), entries, nr_entries, hash);
> - if (found)
> + read_lock_irqsave(&pool_rwlock, flags);
> +
> + /* Fast path: look the stack trace up without full locking. */
> + found = find_stack(*bucket, entries, nr_entries, hash);
> + if (found) {
> + read_unlock_irqrestore(&pool_rwlock, flags);
> goto exit;
> + }
> +
> + /* Take note if another stack pool needs to be allocated. */
> + if (new_pool_required)
> + need_alloc = true;
> +
> + read_unlock_irqrestore(&pool_rwlock, flags);
>
> /*
> - * Check if another stack pool needs to be allocated. If so, allocate
> - * the memory now: we won't be able to do that under the lock.
> - *
> - * smp_load_acquire() pairs with smp_store_release() in
> - * depot_update_pools() and depot_keep_new_pool().
> + * Allocate memory for a new pool if required now:
> + * we won't be able to do that under the lock.
> */
> - if (unlikely(can_alloc && smp_load_acquire(&new_pool_required))) {
> + if (unlikely(can_alloc && need_alloc)) {
> /*
> * Zero out zone modifiers, as we don't have specific zone
> * requirements. Keep the flags related to allocation in atomic
> @@ -494,7 +498,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
> prealloc = page_address(page);
> }
>
> - raw_spin_lock_irqsave(&pool_lock, flags);
> + write_lock_irqsave(&pool_rwlock, flags);
>
> found = find_stack(*bucket, entries, nr_entries, hash);
> if (!found) {
> @@ -503,11 +507,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
>
> if (new) {
> new->next = *bucket;
> - /*
> - * smp_store_release() pairs with smp_load_acquire()
> - * from |bucket| above.
> - */
> - smp_store_release(bucket, new);
> + *bucket = new;
> found = new;
> }
> } else if (prealloc) {
> @@ -518,7 +518,7 @@ depot_stack_handle_t __stack_depot_save(unsigned long *entries,
> depot_keep_new_pool(&prealloc);
> }
>
> - raw_spin_unlock_irqrestore(&pool_lock, flags);
> + write_unlock_irqrestore(&pool_rwlock, flags);
> exit:
> if (prealloc) {
> /* Stack depot didn't use this memory, free it. */
> @@ -542,6 +542,7 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
> unsigned long **entries)
> {
> struct stack_record *stack;
> + unsigned long flags;
>
> *entries = NULL;
> /*
> @@ -553,8 +554,12 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
> if (!handle || stack_depot_disabled)
> return 0;
>
> + read_lock_irqsave(&pool_rwlock, flags);
> +
> stack = depot_fetch_stack(handle);
>
> + read_unlock_irqrestore(&pool_rwlock, flags);
> +
> *entries = stack->entries;
> return stack->size;
> }
> --
> 2.25.1
>
--
Oscar Salvador
SUSE Labs
On Mon, Nov 20, 2023 at 06:47:15PM +0100, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> Add stack_depot_put, a function that decrements the reference counter
> on a stack record and removes it from the stack depot once the counter
> reaches 0.
>
> Internally, when removing a stack record, the function unlinks it from
> the hash table bucket and returns to the freelist.
>
> With this change, the users of stack depot can call stack_depot_put
> when keeping a stack trace in the stack depot is not needed anymore.
> This allows avoiding polluting the stack depot with irrelevant stack
> traces and thus have more space to store the relevant ones before the
> stack depot reaches its capacity.
>
> Signed-off-by: Andrey Konovalov <[email protected]>
I yet have to review the final bits of this series, but I'd like to
comment on something below
> +void stack_depot_put(depot_stack_handle_t handle)
> +{
> + struct stack_record *stack;
> + unsigned long flags;
> +
> + if (!handle || stack_depot_disabled)
> + return;
> +
> + write_lock_irqsave(&pool_rwlock, flags);
> +
> + stack = depot_fetch_stack(handle);
> + if (WARN_ON(!stack))
> + goto out;
> +
> + if (refcount_dec_and_test(&stack->count)) {
> + /* Unlink stack from the hash table. */
> + list_del(&stack->list);
> +
> + /* Free stack. */
> + depot_free_stack(stack);
It would be great if stack_depot_put would also accept a boolean,
which would determine whether we want to erase the stack or not.
For the feature I'm working on page_ower [1], I need to keep track
of how many times we allocated/freed from a certain path, which may
expose a potential leak, and I was using the refcount to do that,
but I don't want the record to be erased, because this new
functionality won't be exclusive with the existing one.
e.g: you can check /sys/kernel/debug/page_owner AND
/sys/kernel/debug/page_owner_stacks
So, while the new functionaliy won't care if a record has been erased,
the old one will, so information will be lost.
[1] https://patchwork.kernel.org/project/linux-mm/cover/[email protected]/
--
Oscar Salvador
SUSE Labs
On Thu, 4 Jan 2024 at 09:52, Oscar Salvador <[email protected]> wrote:
>
> On Mon, Nov 20, 2023 at 06:47:15PM +0100, [email protected] wrote:
> > From: Andrey Konovalov <[email protected]>
> >
> > Add stack_depot_put, a function that decrements the reference counter
> > on a stack record and removes it from the stack depot once the counter
> > reaches 0.
> >
> > Internally, when removing a stack record, the function unlinks it from
> > the hash table bucket and returns to the freelist.
> >
> > With this change, the users of stack depot can call stack_depot_put
> > when keeping a stack trace in the stack depot is not needed anymore.
> > This allows avoiding polluting the stack depot with irrelevant stack
> > traces and thus have more space to store the relevant ones before the
> > stack depot reaches its capacity.
> >
> > Signed-off-by: Andrey Konovalov <[email protected]>
>
> I yet have to review the final bits of this series, but I'd like to
> comment on something below
>
>
> > +void stack_depot_put(depot_stack_handle_t handle)
> > +{
> > + struct stack_record *stack;
> > + unsigned long flags;
> > +
> > + if (!handle || stack_depot_disabled)
> > + return;
> > +
> > + write_lock_irqsave(&pool_rwlock, flags);
> > +
> > + stack = depot_fetch_stack(handle);
> > + if (WARN_ON(!stack))
> > + goto out;
> > +
> > + if (refcount_dec_and_test(&stack->count)) {
> > + /* Unlink stack from the hash table. */
> > + list_del(&stack->list);
> > +
> > + /* Free stack. */
> > + depot_free_stack(stack);
>
> It would be great if stack_depot_put would also accept a boolean,
> which would determine whether we want to erase the stack or not.
I think a boolean makes the interface more confusing for everyone
else. At that point stack_depot_put merely decrements the refcount and
becomes a wrapper around refcount_dec, right?
I think you want to expose the stack_record struct anyway for your
series, so why not simply avoid calling stack_depot_put and decrement
the refcount with your own helper (there needs to be a new stackdepot
function to return a stack_record under the pool_rwlock held as
reader).
Also, you need to ensure noone else calls stack_depot_put on the stack
traces you want to keep. If there is a risk someone else may call
stack_depot_put on them, it obviously won't work (I think the only
option then is to introduce a way to pin stacks).
> For the feature I'm working on page_ower [1], I need to keep track
> of how many times we allocated/freed from a certain path, which may
> expose a potential leak, and I was using the refcount to do that,
> but I don't want the record to be erased, because this new
> functionality won't be exclusive with the existing one.
>
> e.g: you can check /sys/kernel/debug/page_owner AND
> /sys/kernel/debug/page_owner_stacks
>
> So, while the new functionaliy won't care if a record has been erased,
> the old one will, so information will be lost.
>
> [1] https://patchwork.kernel.org/project/linux-mm/cover/[email protected]/
>
>
>
> --
> Oscar Salvador
> SUSE Labs
On Thu, Jan 04, 2024 at 10:25:40AM +0100, Marco Elver wrote:
> I think a boolean makes the interface more confusing for everyone
> else. At that point stack_depot_put merely decrements the refcount and
> becomes a wrapper around refcount_dec, right?
Thanks Marco for the feedback.
Fair enough.
> I think you want to expose the stack_record struct anyway for your
> series, so why not simply avoid calling stack_depot_put and decrement
> the refcount with your own helper (there needs to be a new stackdepot
> function to return a stack_record under the pool_rwlock held as
> reader).
Yeah, that was something I was experimenting with my last version.
See [0], I moved the "stack_record" struct into the header so page_owner
can make sense of it. I guess that's fine right?
If so, I'd do as you mentioned, just decrementing it with my own helper
so no calls to stack_depot_put will be needed.
Regarding the locking, I yet have to check the patch that implements
the read/write lock, but given that page_owner won't be evicting
anything, do I still have to fiddle with the locks?
> Also, you need to ensure noone else calls stack_depot_put on the stack
> traces you want to keep. If there is a risk someone else may call
> stack_depot_put on them, it obviously won't work (I think the only
> option then is to introduce a way to pin stacks).
Well, since page_owner won't call stack_depot_put, I don't see
how someone else would be able to interfere there, so I think
I am safe there.
[0] https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
--
Oscar Salvador
SUSE Labs
On Thu, 4 Jan 2024 at 11:18, Oscar Salvador <[email protected]> wrote:
>
> On Thu, Jan 04, 2024 at 10:25:40AM +0100, Marco Elver wrote:
> > I think a boolean makes the interface more confusing for everyone
> > else. At that point stack_depot_put merely decrements the refcount and
> > becomes a wrapper around refcount_dec, right?
>
> Thanks Marco for the feedback.
>
> Fair enough.
>
> > I think you want to expose the stack_record struct anyway for your
> > series, so why not simply avoid calling stack_depot_put and decrement
> > the refcount with your own helper (there needs to be a new stackdepot
> > function to return a stack_record under the pool_rwlock held as
> > reader).
>
> Yeah, that was something I was experimenting with my last version.
> See [0], I moved the "stack_record" struct into the header so page_owner
> can make sense of it. I guess that's fine right?
Not exposing the internals would be better, but at this point I think
your usecase looks like it's doing something that is somewhat out of
the bounds of the stackdepot API. I also don't see that it makes sense
to add more helpers to the stackdepot API to deal with such special
cases.
As such, I'd reason that it's ok to expose the struct for this special usecase.
> If so, I'd do as you mentioned, just decrementing it with my own helper
> so no calls to stack_depot_put will be needed.
>
> Regarding the locking, I yet have to check the patch that implements
> the read/write lock, but given that page_owner won't be evicting
> anything, do I still have to fiddle with the locks?
You need to grab the lock as a reader to fetch the stack_record and
return it. All that should be hidden behind whatever function you'll
introduce to return the stack_record (stack_depot_find()?).
> > Also, you need to ensure noone else calls stack_depot_put on the stack
> > traces you want to keep. If there is a risk someone else may call
> > stack_depot_put on them, it obviously won't work (I think the only
> > option then is to introduce a way to pin stacks).
>
> Well, since page_owner won't call stack_depot_put, I don't see
> how someone else would be able to interfere there, so I think
> I am safe there.
>
> [0] https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
>
> --
> Oscar Salvador
> SUSE Labs
Oscar Salvador <[email protected]> writes:
>>
>> With this change, multiple users can still look up records in parallel.
That's a severe misunderstanding -- rwlocks always bounce a cache line,
so the parallelism is significantly reduced.
Normally rwlocks are only worth it if your critical region is quite long.
>>
>> This is preparatory patch for implementing the eviction of stack records
>> from the stack depot.
>>
>> Reviewed-by: Alexander Potapenko <[email protected]>
>> Signed-off-by: Andrey Konovalov <[email protected]>
>
> Reviewed-by: Oscar Salvador <[email protected]>
Has anyone benchmarked this on a high core count machine? It sounds
pretty bad if every lock aquisition starts bouncing a single cache line.
Consider using RCU or similar.
-Andi
On Thu, 11 Jan 2024 at 00:01, Andi Kleen <[email protected]> wrote:
>
> Oscar Salvador <[email protected]> writes:
> >>
> >> With this change, multiple users can still look up records in parallel.
>
> That's a severe misunderstanding -- rwlocks always bounce a cache line,
> so the parallelism is significantly reduced.
>
> Normally rwlocks are only worth it if your critical region is quite long.
>
> >>
> >> This is preparatory patch for implementing the eviction of stack records
> >> from the stack depot.
> >>
> >> Reviewed-by: Alexander Potapenko <[email protected]>
> >> Signed-off-by: Andrey Konovalov <[email protected]>
> >
> > Reviewed-by: Oscar Salvador <[email protected]>
>
>
> Has anyone benchmarked this on a high core count machine? It sounds
> pretty bad if every lock aquisition starts bouncing a single cache line.
>
> Consider using RCU or similar.
stackdepot is severely limited in what kernel facilities it may use
due to being used by such low level facilities as the allocator
itself.
I've been suggesting percpu-rwsem here, but looking at it in more
detail that doesn't work because percpu-rwsem wants to sleep, but
stackdepot must work in non-sleepable contexts. :-/
> stackdepot is severely limited in what kernel facilities it may use
> due to being used by such low level facilities as the allocator
> itself.
RCU can be done quite low level too (e.g. there is NMI safe RCU)
>
> I've been suggesting percpu-rwsem here, but looking at it in more
> detail that doesn't work because percpu-rwsem wants to sleep, but
> stackdepot must work in non-sleepable contexts. :-/
Yes something per CPU would work too I suppose. We used to have
big reader spinlocks for this.
-Andi
On Thu, Jan 11, 2024 at 04:36AM -0800, Andi Kleen wrote:
> > stackdepot is severely limited in what kernel facilities it may use
> > due to being used by such low level facilities as the allocator
> > itself.
>
> RCU can be done quite low level too (e.g. there is NMI safe RCU)
How about the below? This should get us back the performance of the old
lock-less version. Although it's using rculist, we don't actually need
to synchronize via RCU.
Thanks,
-- Marco
------ >8 ------
From: Marco Elver <[email protected]>
Date: Tue, 9 Jan 2024 10:21:56 +0100
Subject: [PATCH] stackdepot: make fast paths lock-less again
stack_depot_put() unconditionally takes the pool_rwlock as a writer.
This is unnecessary if the stack record is not going to be freed.
Furthermore, reader-writer locks have inherent cache contention, which
does not scale well on machines with large CPU counts.
Instead, rework the synchronization story of stack depot to again avoid
taking any locks in the fast paths. This is done by relying on RCU
primitives to give us lock-less list traversal. See code comments for
more details.
Fixes: 108be8def46e ("lib/stackdepot: allow users to evict stack traces")
Signed-off-by: Marco Elver <[email protected]>
---
lib/stackdepot.c | 222 ++++++++++++++++++++++++++++-------------------
1 file changed, 133 insertions(+), 89 deletions(-)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index a0be5d05c7f0..9eaf46f8abc4 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -19,10 +19,13 @@
#include <linux/kernel.h>
#include <linux/kmsan.h>
#include <linux/list.h>
+#include <linux/llist.h>
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/printk.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
#include <linux/refcount.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
@@ -67,7 +70,8 @@ union handle_parts {
};
struct stack_record {
- struct list_head list; /* Links in hash table or freelist */
+ struct list_head hash_list; /* Links in the hash table */
+ struct llist_node free_list; /* Links in the freelist */
u32 hash; /* Hash in hash table */
u32 size; /* Number of stored frames */
union handle_parts handle;
@@ -104,7 +108,7 @@ static void *new_pool;
/* Number of pools in stack_pools. */
static int pools_num;
/* Freelist of stack records within stack_pools. */
-static LIST_HEAD(free_stacks);
+static LLIST_HEAD(free_stacks);
/*
* Stack depot tries to keep an extra pool allocated even before it runs out
* of space in the currently used pool. This flag marks whether this extra pool
@@ -112,8 +116,8 @@ static LIST_HEAD(free_stacks);
* yet allocated or if the limit on the number of pools is reached.
*/
static bool new_pool_required = true;
-/* Lock that protects the variables above. */
-static DEFINE_RWLOCK(pool_rwlock);
+/* The lock must be held when performing pool or free list modifications. */
+static DEFINE_RAW_SPINLOCK(pool_lock);
static int __init disable_stack_depot(char *str)
{
@@ -263,9 +267,7 @@ static void depot_init_pool(void *pool)
{
int offset;
- lockdep_assert_held_write(&pool_rwlock);
-
- WARN_ON(!list_empty(&free_stacks));
+ lockdep_assert_held(&pool_lock);
/* Initialize handles and link stack records into the freelist. */
for (offset = 0; offset <= DEPOT_POOL_SIZE - DEPOT_STACK_RECORD_SIZE;
@@ -276,18 +278,25 @@ static void depot_init_pool(void *pool)
stack->handle.offset = offset >> DEPOT_STACK_ALIGN;
stack->handle.extra = 0;
- list_add(&stack->list, &free_stacks);
+ llist_add(&stack->free_list, &free_stacks);
+ INIT_LIST_HEAD(&stack->hash_list);
}
/* Save reference to the pool to be used by depot_fetch_stack(). */
stack_pools[pools_num] = pool;
- pools_num++;
+
+ /*
+ * Release of pool pointer assignment above. Pairs with the
+ * smp_load_acquire() in depot_fetch_stack().
+ */
+ smp_store_release(&pools_num, pools_num + 1);
+ ASSERT_EXCLUSIVE_WRITER(pools_num);
}
/* Keeps the preallocated memory to be used for a new stack depot pool. */
static void depot_keep_new_pool(void **prealloc)
{
- lockdep_assert_held_write(&pool_rwlock);
+ lockdep_assert_held(&pool_lock);
/*
* If a new pool is already saved or the maximum number of
@@ -310,16 +319,16 @@ static void depot_keep_new_pool(void **prealloc)
* number of pools is reached. In either case, take note that
* keeping another pool is not required.
*/
- new_pool_required = false;
+ WRITE_ONCE(new_pool_required, false);
}
/* Updates references to the current and the next stack depot pools. */
static bool depot_update_pools(void **prealloc)
{
- lockdep_assert_held_write(&pool_rwlock);
+ lockdep_assert_held(&pool_lock);
/* Check if we still have objects in the freelist. */
- if (!list_empty(&free_stacks))
+ if (!llist_empty(&free_stacks))
goto out_keep_prealloc;
/* Check if we have a new pool saved and use it. */
@@ -329,7 +338,7 @@ static bool depot_update_pools(void **prealloc)
/* Take note that we might need a new new_pool. */
if (pools_num < DEPOT_MAX_POOLS)
- new_pool_required = true;
+ WRITE_ONCE(new_pool_required, true);
/* Try keeping the preallocated memory for new_pool. */
goto out_keep_prealloc;
@@ -362,20 +371,19 @@ static struct stack_record *
depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
{
struct stack_record *stack;
+ struct llist_node *free;
- lockdep_assert_held_write(&pool_rwlock);
+ lockdep_assert_held(&pool_lock);
/* Update current and new pools if required and possible. */
if (!depot_update_pools(prealloc))
return NULL;
/* Check if we have a stack record to save the stack trace. */
- if (list_empty(&free_stacks))
+ free = llist_del_first(&free_stacks);
+ if (!free)
return NULL;
-
- /* Get and unlink the first entry from the freelist. */
- stack = list_first_entry(&free_stacks, struct stack_record, list);
- list_del(&stack->list);
+ stack = llist_entry(free, struct stack_record, free_list);
/* Limit number of saved frames to CONFIG_STACKDEPOT_MAX_FRAMES. */
if (size > CONFIG_STACKDEPOT_MAX_FRAMES)
@@ -385,7 +393,6 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
stack->hash = hash;
stack->size = size;
/* stack->handle is already filled in by depot_init_pool(). */
- refcount_set(&stack->count, 1);
memcpy(stack->entries, entries, flex_array_size(stack, entries, size));
/*
@@ -394,21 +401,30 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
*/
kmsan_unpoison_memory(stack, DEPOT_STACK_RECORD_SIZE);
+ /*
+ * Release saving of the stack trace. Pairs with smp_mb() in
+ * depot_fetch_stack().
+ */
+ smp_mb__before_atomic();
+ refcount_set(&stack->count, 1);
+
return stack;
}
static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
{
+ /* Acquire the pool pointer written in depot_init_pool(). */
+ const int pools_num_cached = smp_load_acquire(&pools_num);
union handle_parts parts = { .handle = handle };
void *pool;
size_t offset = parts.offset << DEPOT_STACK_ALIGN;
struct stack_record *stack;
- lockdep_assert_held(&pool_rwlock);
+ lockdep_assert_not_held(&pool_lock);
- if (parts.pool_index > pools_num) {
+ if (parts.pool_index > pools_num_cached) {
WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n",
- parts.pool_index, pools_num, handle);
+ parts.pool_index, pools_num_cached, handle);
return NULL;
}
@@ -417,15 +433,35 @@ static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
return NULL;
stack = pool + offset;
+
+ /*
+ * Acquire the stack trace. Pairs with smp_mb() in depot_alloc_stack().
+ *
+ * This does not protect against a stack_depot_put() freeing the record
+ * and having it subsequently being reused. Callers are responsible to
+ * avoid using stack depot handles after passing to stack_depot_put().
+ */
+ if (!refcount_read(&stack->count))
+ return NULL;
+ smp_mb__after_atomic();
+
return stack;
}
/* Links stack into the freelist. */
static void depot_free_stack(struct stack_record *stack)
{
- lockdep_assert_held_write(&pool_rwlock);
+ unsigned long flags;
+
+ lockdep_assert_not_held(&pool_lock);
+
+ raw_spin_lock_irqsave(&pool_lock, flags);
+ printk_deferred_enter();
+ list_del_rcu(&stack->hash_list);
+ printk_deferred_exit();
+ raw_spin_unlock_irqrestore(&pool_lock, flags);
- list_add(&stack->list, &free_stacks);
+ llist_add(&stack->free_list, &free_stacks);
}
/* Calculates the hash for a stack. */
@@ -453,22 +489,55 @@ int stackdepot_memcmp(const unsigned long *u1, const unsigned long *u2,
/* Finds a stack in a bucket of the hash table. */
static inline struct stack_record *find_stack(struct list_head *bucket,
- unsigned long *entries, int size,
- u32 hash)
+ unsigned long *entries, int size,
+ u32 hash, depot_flags_t flags)
{
- struct list_head *pos;
- struct stack_record *found;
+ struct stack_record *stack, *ret = NULL;
- lockdep_assert_held(&pool_rwlock);
+ /*
+ * Due to being used from low-level code paths such as the allocators,
+ * NMI, or even RCU itself, stackdepot cannot rely on primitives that
+ * would sleep (such as synchronize_rcu()) or end up recursively call
+ * into stack depot again (such as call_rcu()).
+ *
+ * Instead, lock-less readers only rely on RCU primitives for correct
+ * memory ordering, but do not use RCU-based synchronization otherwise.
+ * Instead, we perform 3-pass validation below to ensure that the stack
+ * record we accessed is actually valid. If we fail to obtain a valid
+ * stack record here, the slow-path in stack_depot_save_flags() will
+ * retry to avoid inserting duplicates.
+ *
+ * If STACK_DEPOT_FLAG_GET is not used, it is undefined behaviour to
+ * call stack_depot_put() later - i.e. in the non-refcounted case, we do
+ * not have to worry that the entry will be recycled.
+ */
+
+ list_for_each_entry_rcu(stack, bucket, hash_list) {
+ /* 1. Check if this entry could potentially match. */
+ if (data_race(stack->hash != hash || stack->size != size))
+ continue;
+
+ /*
+ * 2. Increase refcount if not zero. If this is successful, we
+ * know that this stack record is valid and will not be freed by
+ * stack_depot_put().
+ */
+ if ((flags & STACK_DEPOT_FLAG_GET) && unlikely(!refcount_inc_not_zero(&stack->count)))
+ continue;
+
+ /* 3. Do full validation of the record. */
+ if (likely(stack->hash == hash && stack->size == size &&
+ !stackdepot_memcmp(entries, stack->entries, size))) {
+ ret = stack;
+ break;
+ }
- list_for_each(pos, bucket) {
- found = list_entry(pos, struct stack_record, list);
- if (found->hash == hash &&
- found->size == size &&
- !stackdepot_memcmp(entries, found->entries, size))
- return found;
+ /* Undo refcount - could have raced with stack_depot_put(). */
+ if ((flags & STACK_DEPOT_FLAG_GET) && unlikely(refcount_dec_and_test(&stack->count)))
+ depot_free_stack(stack);
}
- return NULL;
+
+ return ret;
}
depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
@@ -482,7 +551,6 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
struct page *page = NULL;
void *prealloc = NULL;
bool can_alloc = depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC;
- bool need_alloc = false;
unsigned long flags;
u32 hash;
@@ -505,31 +573,16 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
hash = hash_stack(entries, nr_entries);
bucket = &stack_table[hash & stack_hash_mask];
- read_lock_irqsave(&pool_rwlock, flags);
- printk_deferred_enter();
-
- /* Fast path: look the stack trace up without full locking. */
- found = find_stack(bucket, entries, nr_entries, hash);
- if (found) {
- if (depot_flags & STACK_DEPOT_FLAG_GET)
- refcount_inc(&found->count);
- printk_deferred_exit();
- read_unlock_irqrestore(&pool_rwlock, flags);
+ /* Fast path: look the stack trace up without locking. */
+ found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
+ if (found)
goto exit;
- }
-
- /* Take note if another stack pool needs to be allocated. */
- if (new_pool_required)
- need_alloc = true;
-
- printk_deferred_exit();
- read_unlock_irqrestore(&pool_rwlock, flags);
/*
* Allocate memory for a new pool if required now:
* we won't be able to do that under the lock.
*/
- if (unlikely(can_alloc && need_alloc)) {
+ if (unlikely(can_alloc && READ_ONCE(new_pool_required))) {
/*
* Zero out zone modifiers, as we don't have specific zone
* requirements. Keep the flags related to allocation in atomic
@@ -543,31 +596,33 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
prealloc = page_address(page);
}
- write_lock_irqsave(&pool_rwlock, flags);
+ raw_spin_lock_irqsave(&pool_lock, flags);
printk_deferred_enter();
- found = find_stack(bucket, entries, nr_entries, hash);
+ /* Try to find again, to avoid concurrently inserting duplicates. */
+ found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
if (!found) {
struct stack_record *new =
depot_alloc_stack(entries, nr_entries, hash, &prealloc);
if (new) {
- list_add(&new->list, bucket);
+ /*
+ * This releases the stack record into the bucket and
+ * makes it visible to readers in find_stack().
+ */
+ list_add_rcu(&new->hash_list, bucket);
found = new;
}
- } else {
- if (depot_flags & STACK_DEPOT_FLAG_GET)
- refcount_inc(&found->count);
+ } else if (prealloc) {
/*
* Stack depot already contains this stack trace, but let's
* keep the preallocated memory for future.
*/
- if (prealloc)
- depot_keep_new_pool(&prealloc);
+ depot_keep_new_pool(&prealloc);
}
printk_deferred_exit();
- write_unlock_irqrestore(&pool_rwlock, flags);
+ raw_spin_unlock_irqrestore(&pool_lock, flags);
exit:
if (prealloc) {
/* Stack depot didn't use this memory, free it. */
@@ -592,7 +647,6 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
unsigned long **entries)
{
struct stack_record *stack;
- unsigned long flags;
*entries = NULL;
/*
@@ -604,13 +658,12 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
if (!handle || stack_depot_disabled)
return 0;
- read_lock_irqsave(&pool_rwlock, flags);
- printk_deferred_enter();
-
stack = depot_fetch_stack(handle);
-
- printk_deferred_exit();
- read_unlock_irqrestore(&pool_rwlock, flags);
+ /*
+ * Should never be NULL, otherwise this is a use-after-put.
+ */
+ if (WARN_ON(!stack))
+ return 0;
*entries = stack->entries;
return stack->size;
@@ -620,29 +673,20 @@ EXPORT_SYMBOL_GPL(stack_depot_fetch);
void stack_depot_put(depot_stack_handle_t handle)
{
struct stack_record *stack;
- unsigned long flags;
if (!handle || stack_depot_disabled)
return;
- write_lock_irqsave(&pool_rwlock, flags);
- printk_deferred_enter();
-
stack = depot_fetch_stack(handle);
+ /*
+ * Should always be able to find the stack record, otherwise this is an
+ * unbalanced put attempt.
+ */
if (WARN_ON(!stack))
- goto out;
-
- if (refcount_dec_and_test(&stack->count)) {
- /* Unlink stack from the hash table. */
- list_del(&stack->list);
+ return;
- /* Free stack. */
+ if (refcount_dec_and_test(&stack->count))
depot_free_stack(stack);
- }
-
-out:
- printk_deferred_exit();
- write_unlock_irqrestore(&pool_rwlock, flags);
}
EXPORT_SYMBOL_GPL(stack_depot_put);
--
2.43.0.275.g3460e3d667-goog
On Thu, Jan 11, 2024 at 8:08 PM Marco Elver <[email protected]> wrote:
>
> On Thu, Jan 11, 2024 at 04:36AM -0800, Andi Kleen wrote:
> > > stackdepot is severely limited in what kernel facilities it may use
> > > due to being used by such low level facilities as the allocator
> > > itself.
> >
> > RCU can be done quite low level too (e.g. there is NMI safe RCU)
>
> How about the below? This should get us back the performance of the old
> lock-less version. Although it's using rculist, we don't actually need
> to synchronize via RCU.
>
> Thanks,
> -- Marco
>
> ------ >8 ------
>
> From: Marco Elver <[email protected]>
> Date: Tue, 9 Jan 2024 10:21:56 +0100
> Subject: [PATCH] stackdepot: make fast paths lock-less again
>
> stack_depot_put() unconditionally takes the pool_rwlock as a writer.
> This is unnecessary if the stack record is not going to be freed.
> Furthermore, reader-writer locks have inherent cache contention, which
> does not scale well on machines with large CPU counts.
>
> Instead, rework the synchronization story of stack depot to again avoid
> taking any locks in the fast paths. This is done by relying on RCU
> primitives to give us lock-less list traversal. See code comments for
> more details.
>
> Fixes: 108be8def46e ("lib/stackdepot: allow users to evict stack traces")
> Signed-off-by: Marco Elver <[email protected]>
> ---
> lib/stackdepot.c | 222 ++++++++++++++++++++++++++++-------------------
> 1 file changed, 133 insertions(+), 89 deletions(-)
>
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index a0be5d05c7f0..9eaf46f8abc4 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -19,10 +19,13 @@
> #include <linux/kernel.h>
> #include <linux/kmsan.h>
> #include <linux/list.h>
> +#include <linux/llist.h>
> #include <linux/mm.h>
> #include <linux/mutex.h>
> #include <linux/percpu.h>
> #include <linux/printk.h>
> +#include <linux/rculist.h>
> +#include <linux/rcupdate.h>
> #include <linux/refcount.h>
> #include <linux/slab.h>
> #include <linux/spinlock.h>
> @@ -67,7 +70,8 @@ union handle_parts {
> };
>
> struct stack_record {
> - struct list_head list; /* Links in hash table or freelist */
> + struct list_head hash_list; /* Links in the hash table */
> + struct llist_node free_list; /* Links in the freelist */
> u32 hash; /* Hash in hash table */
> u32 size; /* Number of stored frames */
> union handle_parts handle;
> @@ -104,7 +108,7 @@ static void *new_pool;
> /* Number of pools in stack_pools. */
> static int pools_num;
> /* Freelist of stack records within stack_pools. */
> -static LIST_HEAD(free_stacks);
> +static LLIST_HEAD(free_stacks);
> /*
> * Stack depot tries to keep an extra pool allocated even before it runs out
> * of space in the currently used pool. This flag marks whether this extra pool
> @@ -112,8 +116,8 @@ static LIST_HEAD(free_stacks);
> * yet allocated or if the limit on the number of pools is reached.
> */
> static bool new_pool_required = true;
> -/* Lock that protects the variables above. */
> -static DEFINE_RWLOCK(pool_rwlock);
> +/* The lock must be held when performing pool or free list modifications */
> +static DEFINE_RAW_SPINLOCK(pool_lock);
>
> static int __init disable_stack_depot(char *str)
> {
> @@ -263,9 +267,7 @@ static void depot_init_pool(void *pool)
> {
> int offset;
>
> - lockdep_assert_held_write(&pool_rwlock);
> -
> - WARN_ON(!list_empty(&free_stacks));
> + lockdep_assert_held(&pool_lock);
>
> /* Initialize handles and link stack records into the freelist. */
> for (offset = 0; offset <= DEPOT_POOL_SIZE - DEPOT_STACK_RECORD_SIZE;
> @@ -276,18 +278,25 @@ static void depot_init_pool(void *pool)
> stack->handle.offset = offset >> DEPOT_STACK_ALIGN;
> stack->handle.extra = 0;
>
> - list_add(&stack->list, &free_stacks);
> + llist_add(&stack->free_list, &free_stacks);
> + INIT_LIST_HEAD(&stack->hash_list);
> }
>
> /* Save reference to the pool to be used by depot_fetch_stack(). */
> stack_pools[pools_num] = pool;
> - pools_num++;
> +
> + /*
> + * Release of pool pointer assignment above. Pairs with the
> + * smp_load_acquire() in depot_fetch_stack().
> + */
> + smp_store_release(&pools_num, pools_num + 1);
> + ASSERT_EXCLUSIVE_WRITER(pools_num);
> }
>
> /* Keeps the preallocated memory to be used for a new stack depot pool. */
> static void depot_keep_new_pool(void **prealloc)
> {
> - lockdep_assert_held_write(&pool_rwlock);
> + lockdep_assert_held(&pool_lock);
>
> /*
> * If a new pool is already saved or the maximum number of
> @@ -310,16 +319,16 @@ static void depot_keep_new_pool(void **prealloc)
> * number of pools is reached. In either case, take note that
> * keeping another pool is not required.
> */
> - new_pool_required = false;
> + WRITE_ONCE(new_pool_required, false);
> }
>
> /* Updates references to the current and the next stack depot pools. */
> static bool depot_update_pools(void **prealloc)
> {
> - lockdep_assert_held_write(&pool_rwlock);
> + lockdep_assert_held(&pool_lock);
>
> /* Check if we still have objects in the freelist. */
> - if (!list_empty(&free_stacks))
> + if (!llist_empty(&free_stacks))
> goto out_keep_prealloc;
>
> /* Check if we have a new pool saved and use it. */
> @@ -329,7 +338,7 @@ static bool depot_update_pools(void **prealloc)
>
> /* Take note that we might need a new new_pool. */
> if (pools_num < DEPOT_MAX_POOLS)
> - new_pool_required = true;
> + WRITE_ONCE(new_pool_required, true);
>
> /* Try keeping the preallocated memory for new_pool. */
> goto out_keep_prealloc;
> @@ -362,20 +371,19 @@ static struct stack_record *
> depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
> {
> struct stack_record *stack;
> + struct llist_node *free;
>
> - lockdep_assert_held_write(&pool_rwlock);
> + lockdep_assert_held(&pool_lock);
>
> /* Update current and new pools if required and possible. */
> if (!depot_update_pools(prealloc))
> return NULL;
>
> /* Check if we have a stack record to save the stack trace. */
> - if (list_empty(&free_stacks))
> + free = llist_del_first(&free_stacks);
> + if (!free)
> return NULL;
> -
> - /* Get and unlink the first entry from the freelist. */
> - stack = list_first_entry(&free_stacks, struct stack_record, list);
> - list_del(&stack->list);
> + stack = llist_entry(free, struct stack_record, free_list);
>
> /* Limit number of saved frames to CONFIG_STACKDEPOT_MAX_FRAMES. */
> if (size > CONFIG_STACKDEPOT_MAX_FRAMES)
> @@ -385,7 +393,6 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
> stack->hash = hash;
> stack->size = size;
> /* stack->handle is already filled in by depot_init_pool(). */
> - refcount_set(&stack->count, 1);
> memcpy(stack->entries, entries, flex_array_size(stack, entries, size));
>
> /*
> @@ -394,21 +401,30 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
> */
> kmsan_unpoison_memory(stack, DEPOT_STACK_RECORD_SIZE);
>
> + /*
> + * Release saving of the stack trace. Pairs with smp_mb() in
> + * depot_fetch_stack().
> + */
> + smp_mb__before_atomic();
> + refcount_set(&stack->count, 1);
> +
> return stack;
> }
>
> static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
> {
> + /* Acquire the pool pointer written in depot_init_pool(). */
> + const int pools_num_cached = smp_load_acquire(&pools_num);
> union handle_parts parts = { .handle = handle };
> void *pool;
> size_t offset = parts.offset << DEPOT_STACK_ALIGN;
> struct stack_record *stack;
>
> - lockdep_assert_held(&pool_rwlock);
> + lockdep_assert_not_held(&pool_lock);
>
> - if (parts.pool_index > pools_num) {
> + if (parts.pool_index > pools_num_cached) {
> WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n",
> - parts.pool_index, pools_num, handle);
> + parts.pool_index, pools_num_cached, handle);
> return NULL;
> }
>
> @@ -417,15 +433,35 @@ static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
> return NULL;
>
> stack = pool + offset;
> +
> + /*
> + * Acquire the stack trace. Pairs with smp_mb() in depot_alloc_stack().
> + *
> + * This does not protect against a stack_depot_put() freeing the record
> + * and having it subsequently being reused. Callers are responsible to
> + * avoid using stack depot handles after passing to stack_depot_put().
> + */
> + if (!refcount_read(&stack->count))
> + return NULL;
Can this happen? It seems that depot_fetch_stack should only be called
for handles that were returned from stack_depot_save_flags before all
puts and thus the the refcount should > 0. Or is this a safeguard
against improper API usage?
> + smp_mb__after_atomic();
> +
> return stack;
> }
>
> /* Links stack into the freelist. */
> static void depot_free_stack(struct stack_record *stack)
> {
> - lockdep_assert_held_write(&pool_rwlock);
> + unsigned long flags;
> +
> + lockdep_assert_not_held(&pool_lock);
> +
> + raw_spin_lock_irqsave(&pool_lock, flags);
> + printk_deferred_enter();
> + list_del_rcu(&stack->hash_list);
> + printk_deferred_exit();
> + raw_spin_unlock_irqrestore(&pool_lock, flags);
>
> - list_add(&stack->list, &free_stacks);
> + llist_add(&stack->free_list, &free_stacks);
This llist_add is outside of the lock just because we can (i.e.
llist_add can run concurrently with the other free_stacks operations,
which are all under the lock), right? This slightly contradicts the
comment above the free_stacks definition.
If we put this under the lock and use normal list instead of llist, I
think we can then combine the hash_list with the free_list like before
to save up on some space for stack_record. Would that make sense?
> }
>
> /* Calculates the hash for a stack. */
> @@ -453,22 +489,55 @@ int stackdepot_memcmp(const unsigned long *u1, const unsigned long *u2,
>
> /* Finds a stack in a bucket of the hash table. */
> static inline struct stack_record *find_stack(struct list_head *bucket,
> - unsigned long *entries, int size,
> - u32 hash)
> + unsigned long *entries, int size,
> + u32 hash, depot_flags_t flags)
> {
> - struct list_head *pos;
> - struct stack_record *found;
> + struct stack_record *stack, *ret = NULL;
>
> - lockdep_assert_held(&pool_rwlock);
> + /*
> + * Due to being used from low-level code paths such as the allocators,
> + * NMI, or even RCU itself, stackdepot cannot rely on primitives that
> + * would sleep (such as synchronize_rcu()) or end up recursively call
> + * into stack depot again (such as call_rcu()).
> + *
> + * Instead, lock-less readers only rely on RCU primitives for correct
> + * memory ordering, but do not use RCU-based synchronization otherwise.
> + * Instead, we perform 3-pass validation below to ensure that the stack
> + * record we accessed is actually valid. If we fail to obtain a valid
> + * stack record here, the slow-path in stack_depot_save_flags() will
> + * retry to avoid inserting duplicates.
> + *
> + * If STACK_DEPOT_FLAG_GET is not used, it is undefined behaviour to
> + * call stack_depot_put() later - i.e. in the non-refcounted case, we do
> + * not have to worry that the entry will be recycled.
> + */
> +
> + list_for_each_entry_rcu(stack, bucket, hash_list) {
So we don't need rcu_read_lock here, because we don't rely on call_rcu
etc., right?
> + /* 1. Check if this entry could potentially match. */
> + if (data_race(stack->hash != hash || stack->size != size))
> + continue;
> +
> + /*
> + * 2. Increase refcount if not zero. If this is successful, we
> + * know that this stack record is valid and will not be freed by
> + * stack_depot_put().
> + */
> + if ((flags & STACK_DEPOT_FLAG_GET) && unlikely(!refcount_inc_not_zero(&stack->count)))
> + continue;
> +
> + /* 3. Do full validation of the record. */
> + if (likely(stack->hash == hash && stack->size == size &&
> + !stackdepot_memcmp(entries, stack->entries, size))) {
> + ret = stack;
> + break;
> + }
>
> - list_for_each(pos, bucket) {
> - found = list_entry(pos, struct stack_record, list);
> - if (found->hash == hash &&
> - found->size == size &&
> - !stackdepot_memcmp(entries, found->entries, size))
> - return found;
> + /* Undo refcount - could have raced with stack_depot_put(). */
> + if ((flags & STACK_DEPOT_FLAG_GET) && unlikely(refcount_dec_and_test(&stack->count)))
> + depot_free_stack(stack);
> }
> - return NULL;
> +
> + return ret;
> }
>
> depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
> @@ -482,7 +551,6 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
> struct page *page = NULL;
> void *prealloc = NULL;
> bool can_alloc = depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC;
> - bool need_alloc = false;
> unsigned long flags;
> u32 hash;
>
> @@ -505,31 +573,16 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
> hash = hash_stack(entries, nr_entries);
> bucket = &stack_table[hash & stack_hash_mask];
>
> - read_lock_irqsave(&pool_rwlock, flags);
> - printk_deferred_enter();
> -
> - /* Fast path: look the stack trace up without full locking. */
> - found = find_stack(bucket, entries, nr_entries, hash);
> - if (found) {
> - if (depot_flags & STACK_DEPOT_FLAG_GET)
> - refcount_inc(&found->count);
> - printk_deferred_exit();
> - read_unlock_irqrestore(&pool_rwlock, flags);
> + /* Fast path: look the stack trace up without locking. */
> + found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
> + if (found)
> goto exit;
> - }
> -
> - /* Take note if another stack pool needs to be allocated. */
> - if (new_pool_required)
> - need_alloc = true;
> -
> - printk_deferred_exit();
> - read_unlock_irqrestore(&pool_rwlock, flags);
>
> /*
> * Allocate memory for a new pool if required now:
> * we won't be able to do that under the lock.
> */
> - if (unlikely(can_alloc && need_alloc)) {
> + if (unlikely(can_alloc && READ_ONCE(new_pool_required))) {
> /*
> * Zero out zone modifiers, as we don't have specific zone
> * requirements. Keep the flags related to allocation in atomic
> @@ -543,31 +596,33 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
> prealloc = page_address(page);
> }
>
> - write_lock_irqsave(&pool_rwlock, flags);
> + raw_spin_lock_irqsave(&pool_lock, flags);
> printk_deferred_enter();
>
> - found = find_stack(bucket, entries, nr_entries, hash);
> + /* Try to find again, to avoid concurrently inserting duplicates. */
> + found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
> if (!found) {
> struct stack_record *new =
> depot_alloc_stack(entries, nr_entries, hash, &prealloc);
>
> if (new) {
> - list_add(&new->list, bucket);
> + /*
> + * This releases the stack record into the bucket and
> + * makes it visible to readers in find_stack().
> + */
> + list_add_rcu(&new->hash_list, bucket);
> found = new;
> }
> - } else {
> - if (depot_flags & STACK_DEPOT_FLAG_GET)
> - refcount_inc(&found->count);
> + } else if (prealloc) {
> /*
> * Stack depot already contains this stack trace, but let's
> * keep the preallocated memory for future.
> */
> - if (prealloc)
> - depot_keep_new_pool(&prealloc);
> + depot_keep_new_pool(&prealloc);
> }
>
> printk_deferred_exit();
> - write_unlock_irqrestore(&pool_rwlock, flags);
> + raw_spin_unlock_irqrestore(&pool_lock, flags);
> exit:
> if (prealloc) {
> /* Stack depot didn't use this memory, free it. */
> @@ -592,7 +647,6 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
> unsigned long **entries)
> {
> struct stack_record *stack;
> - unsigned long flags;
>
> *entries = NULL;
> /*
> @@ -604,13 +658,12 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
> if (!handle || stack_depot_disabled)
> return 0;
>
> - read_lock_irqsave(&pool_rwlock, flags);
> - printk_deferred_enter();
> -
> stack = depot_fetch_stack(handle);
> -
> - printk_deferred_exit();
> - read_unlock_irqrestore(&pool_rwlock, flags);
> + /*
> + * Should never be NULL, otherwise this is a use-after-put.
> + */
> + if (WARN_ON(!stack))
> + return 0;
>
> *entries = stack->entries;
> return stack->size;
> @@ -620,29 +673,20 @@ EXPORT_SYMBOL_GPL(stack_depot_fetch);
> void stack_depot_put(depot_stack_handle_t handle)
> {
> struct stack_record *stack;
> - unsigned long flags;
>
> if (!handle || stack_depot_disabled)
> return;
>
> - write_lock_irqsave(&pool_rwlock, flags);
> - printk_deferred_enter();
> -
> stack = depot_fetch_stack(handle);
> + /*
> + * Should always be able to find the stack record, otherwise this is an
> + * unbalanced put attempt.
> + */
> if (WARN_ON(!stack))
> - goto out;
> -
> - if (refcount_dec_and_test(&stack->count)) {
> - /* Unlink stack from the hash table. */
> - list_del(&stack->list);
> + return;
>
> - /* Free stack. */
> + if (refcount_dec_and_test(&stack->count))
> depot_free_stack(stack);
> - }
> -
> -out:
> - printk_deferred_exit();
> - write_unlock_irqrestore(&pool_rwlock, flags);
> }
> EXPORT_SYMBOL_GPL(stack_depot_put);
>
> --
> 2.43.0.275.g3460e3d667-goog
>
Looks good to me from the functional perspective (modulo the
clarification comments I left above), but it would be great to get a
review from someone with a better understanding of the low-level
synchronization primitives.
Thank you!
On Fri, Jan 12, 2024 at 09:24AM +0100, Marco Elver wrote:
> On Fri, 12 Jan 2024 at 03:38, Andrey Konovalov <[email protected]> wrote:
[...]
> > Looks good to me from the functional perspective (modulo the
> > clarification comments I left above), but it would be great to get a
> > review from someone with a better understanding of the low-level
> > synchronization primitives.
>
> Yes - and I'll have to rework this to use get_state_synchronize_rcu()
> after all. When it's ready for proper review I'll send an RFC patch.
The below should be what we want, this time without weird hacks.
NMI-safe RCU does work for this case.
I'll let the test robot beat on it then send the patch next week
(there's going to be a patch 1/2 to add stats counters as well).
Also note that the rwlock broke RT kernels, which is also fixed by the
below patch.
Thanks,
-- Marco
From 02b11afe1dcaf42e86b7030e32146a8b34d4bd09 Mon Sep 17 00:00:00 2001
From: Marco Elver <[email protected]>
Date: Tue, 9 Jan 2024 10:21:56 +0100
Subject: [PATCH] stackdepot: make fast paths lock-less again
With the introduction of the pool_rwlock (reader-writer lock), several
fast paths end up taking the pool_rwlock as readers. Furthermore,
stack_depot_put() unconditionally takes the pool_rwlock as a writer.
Despite allowing readers to make forward-progress concurrently,
reader-writer locks have inherent cache contention issues, which does
not scale well on systems with large CPU counts. For cases with short
critical sections, as is the case with stack depot, they can cause more
harm than good, and alternative designs should be preferred.
Rework the synchronization story of stack depot to again avoid taking
any locks in the fast paths. This is done by relying on RCU-protected
list traversal, and the NMI-safe subset of RCU to delay reuse of freed
stack records. See code comments for more details.
Along with the performance issues, this also fixes incorrect nesting of rwlock
within a raw_spinlock (pool->lock is the raw spinlock), given that stack depot
should still be usable from anywhere:
| [ BUG: Invalid wait context ]
| -----------------------------
| swapper/0/1 is trying to lock:
| ffffffff89869be8 (pool_rwlock){..--}-{3:3}, at: stack_depot_save_flags+0x147/0x660
| other info that might help us debug this:
| context-{5:5}
| 2 locks held by swapper/0/1:
| #0: ffffffff89632440 (rcu_read_lock){....}-{1:3}, at: __queue_work+0x153/0xd70
| #1: ffff888100092018 (&pool->lock){-.-.}-{2:2}, at: __queue_work+0x549/0xd70
Stack depot usage stats are similar to the previous version after a KASAN
kernel boot:
$ cat /sys/kernel/debug/stackdepot/stats
pools: 838
allocations: 29865
frees: 6604
in_use: 23261
freelist_size: 1879
As we can see, the number of pools is the same as previously. The freelist size
is minimally larger, but this may also be due to variance across system boots.
This shows that even though we do not eagerly wait for the next RCU grace
period (such as with synchronize_rcu() or call_rcu()) after freeing a stack
record - requiring depot_pop_free() to "poll" if an entry may be used - new
allocations are very likely to happen only in later RCU grace periods.
Fixes: 108be8def46e ("lib/stackdepot: allow users to evict stack traces")
Signed-off-by: Marco Elver <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
---
lib/stackdepot.c | 327 +++++++++++++++++++++++++++++++----------------
1 file changed, 215 insertions(+), 112 deletions(-)
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 80a8ca24ccc8..456396df1c0e 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -24,6 +24,8 @@
#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/printk.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
#include <linux/refcount.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
@@ -68,12 +70,28 @@ union handle_parts {
};
struct stack_record {
- struct list_head list; /* Links in hash table or freelist */
+ struct list_head hash_list; /* Links in the hash table */
u32 hash; /* Hash in hash table */
u32 size; /* Number of stored frames */
- union handle_parts handle;
+ union handle_parts handle; /* Constant after initialization */
refcount_t count;
- unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
+ union {
+ unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
+ struct {
+ /*
+ * An important invariant of the implementation is to
+ * only place a stack record onto the freelist iff its
+ * refcount is zero. Because stack records with a zero
+ * refcount are never considered as valid, it is safe to
+ * union @entries and freelist management state below.
+ * Conversely, as soon as an entry is off the freelist
+ * and its refcount becomes non-zero, the below must not
+ * be accessed until being placed back on the freelist.
+ */
+ struct list_head free_list; /* Links in the freelist */
+ unsigned long rcu_state; /* RCU cookie */
+ };
+ };
};
#define DEPOT_STACK_RECORD_SIZE \
@@ -113,8 +131,8 @@ static LIST_HEAD(free_stacks);
* yet allocated or if the limit on the number of pools is reached.
*/
static bool new_pool_required = true;
-/* Lock that protects the variables above. */
-static DEFINE_RWLOCK(pool_rwlock);
+/* The lock must be held when performing pool or free list modifications. */
+static DEFINE_RAW_SPINLOCK(pool_lock);
/* Statistics counters for debugfs. */
enum depot_counter_id {
@@ -276,14 +294,15 @@ int stack_depot_init(void)
}
EXPORT_SYMBOL_GPL(stack_depot_init);
-/* Initializes a stack depol pool. */
+/*
+ * Initializes new stack depot @pool, release all its entries to the freelist,
+ * and update the list of pools.
+ */
static void depot_init_pool(void *pool)
{
int offset;
- lockdep_assert_held_write(&pool_rwlock);
-
- WARN_ON(!list_empty(&free_stacks));
+ lockdep_assert_held(&pool_lock);
/* Initialize handles and link stack records into the freelist. */
for (offset = 0; offset <= DEPOT_POOL_SIZE - DEPOT_STACK_RECORD_SIZE;
@@ -294,19 +313,31 @@ static void depot_init_pool(void *pool)
stack->handle.offset = offset >> DEPOT_STACK_ALIGN;
stack->handle.extra = 0;
- list_add(&stack->list, &free_stacks);
+ /*
+ * Stack traces of size 0 are never saved, and we can simply use
+ * the size field as an indicator if this is a new unused stack
+ * record in the freelist.
+ */
+ stack->size = 0;
+
+ INIT_LIST_HEAD(&stack->hash_list);
+ /* Add to the freelist front to prioritize never-used entries. */
+ list_add(&stack->free_list, &free_stacks);
counters[DEPOT_COUNTER_FREELIST_SIZE]++;
}
/* Save reference to the pool to be used by depot_fetch_stack(). */
stack_pools[pools_num] = pool;
- pools_num++;
+
+ /* Pairs with concurrent READ_ONCE() in depot_fetch_stack(). */
+ WRITE_ONCE(pools_num, pools_num + 1);
+ ASSERT_EXCLUSIVE_WRITER(pools_num);
}
/* Keeps the preallocated memory to be used for a new stack depot pool. */
static void depot_keep_new_pool(void **prealloc)
{
- lockdep_assert_held_write(&pool_rwlock);
+ lockdep_assert_held(&pool_lock);
/*
* If a new pool is already saved or the maximum number of
@@ -329,17 +360,16 @@ static void depot_keep_new_pool(void **prealloc)
* number of pools is reached. In either case, take note that
* keeping another pool is not required.
*/
- new_pool_required = false;
+ WRITE_ONCE(new_pool_required, false);
}
-/* Updates references to the current and the next stack depot pools. */
-static bool depot_update_pools(void **prealloc)
+/*
+ * Try to initialize a new stack depot pool from either a previous or the
+ * current pre-allocation, and release all its entries to the freelist.
+ */
+static bool depot_try_init_pool(void **prealloc)
{
- lockdep_assert_held_write(&pool_rwlock);
-
- /* Check if we still have objects in the freelist. */
- if (!list_empty(&free_stacks))
- goto out_keep_prealloc;
+ lockdep_assert_held(&pool_lock);
/* Check if we have a new pool saved and use it. */
if (new_pool) {
@@ -348,10 +378,9 @@ static bool depot_update_pools(void **prealloc)
/* Take note that we might need a new new_pool. */
if (pools_num < DEPOT_MAX_POOLS)
- new_pool_required = true;
+ WRITE_ONCE(new_pool_required, true);
- /* Try keeping the preallocated memory for new_pool. */
- goto out_keep_prealloc;
+ return true;
}
/* Bail out if we reached the pool limit. */
@@ -368,12 +397,30 @@ static bool depot_update_pools(void **prealloc)
}
return false;
+}
+
+/* Try to find next free usable entry. */
+static struct stack_record *depot_pop_free(void)
+{
+ struct stack_record *stack;
+
+ if (list_empty(&free_stacks))
+ return NULL;
+
+ /*
+ * We maintain the invariant that the elements in front are least
+ * recently used, and are therefore more likely to be associated with an
+ * RCU grace period in the past. Consequently it is sufficient to only
+ * check the first entry.
+ */
+ stack = list_first_entry(&free_stacks, struct stack_record, free_list);
+ if (stack->size && !poll_state_synchronize_rcu(stack->rcu_state))
+ return NULL;
-out_keep_prealloc:
- /* Keep the preallocated memory for a new pool if required. */
- if (*prealloc)
- depot_keep_new_pool(prealloc);
- return true;
+ list_del(&stack->free_list);
+ counters[DEPOT_COUNTER_FREELIST_SIZE]--;
+
+ return stack;
}
/* Allocates a new stack in a stack depot pool. */
@@ -382,20 +429,18 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
{
struct stack_record *stack;
- lockdep_assert_held_write(&pool_rwlock);
-
- /* Update current and new pools if required and possible. */
- if (!depot_update_pools(prealloc))
- return NULL;
+ lockdep_assert_held(&pool_lock);
/* Check if we have a stack record to save the stack trace. */
- if (list_empty(&free_stacks))
- return NULL;
-
- /* Get and unlink the first entry from the freelist. */
- stack = list_first_entry(&free_stacks, struct stack_record, list);
- list_del(&stack->list);
- counters[DEPOT_COUNTER_FREELIST_SIZE]--;
+ stack = depot_pop_free();
+ if (!stack) {
+ /* No usable entries on the freelist - try to refill the freelist. */
+ if (!depot_try_init_pool(prealloc))
+ return NULL;
+ stack = depot_pop_free();
+ if (WARN_ON(!stack))
+ return NULL;
+ }
/* Limit number of saved frames to CONFIG_STACKDEPOT_MAX_FRAMES. */
if (size > CONFIG_STACKDEPOT_MAX_FRAMES)
@@ -421,37 +466,73 @@ depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
static struct stack_record *depot_fetch_stack(depot_stack_handle_t handle)
{
+ const int pools_num_cached = READ_ONCE(pools_num);
union handle_parts parts = { .handle = handle };
void *pool;
size_t offset = parts.offset << DEPOT_STACK_ALIGN;
struct stack_record *stack;
- lockdep_assert_held(&pool_rwlock);
+ lockdep_assert_not_held(&pool_lock);
- if (parts.pool_index > pools_num) {
+ if (parts.pool_index > pools_num_cached) {
WARN(1, "pool index %d out of bounds (%d) for stack id %08x\n",
- parts.pool_index, pools_num, handle);
+ parts.pool_index, pools_num_cached, handle);
return NULL;
}
pool = stack_pools[parts.pool_index];
- if (!pool)
+ if (WARN_ON(!pool))
return NULL;
stack = pool + offset;
+ if (WARN_ON(!refcount_read(&stack->count)))
+ return NULL;
+
return stack;
}
/* Links stack into the freelist. */
static void depot_free_stack(struct stack_record *stack)
{
- lockdep_assert_held_write(&pool_rwlock);
+ unsigned long flags;
+
+ lockdep_assert_not_held(&pool_lock);
+
+ raw_spin_lock_irqsave(&pool_lock, flags);
+ printk_deferred_enter();
+
+ /*
+ * Remove the entry from the hash list. Concurrent list traversal may
+ * still observe the entry, but since the refcount is zero, this entry
+ * will no longer be considered as valid.
+ */
+ list_del_rcu(&stack->hash_list);
+
+ /*
+ * Due to being used from constrained contexts such as the allocators,
+ * NMI, or even RCU itself, stack depot cannot rely on primitives that
+ * would sleep (such as synchronize_rcu()) or recursively call into
+ * stack depot again (such as call_rcu()).
+ *
+ * Instead, get an RCU cookie, so that we can ensure this entry isn't
+ * moved onto another list until the next grace period, and concurrent
+ * RCU list traversal remains safe.
+ */
+ stack->rcu_state = get_state_synchronize_rcu();
- list_add(&stack->list, &free_stacks);
+ /*
+ * Add the entry to the freelist tail, so that older entries are
+ * considered first - their RCU cookie is more likely to no longer be
+ * associated with the current grace period.
+ */
+ list_add_tail(&stack->free_list, &free_stacks);
counters[DEPOT_COUNTER_FREELIST_SIZE]++;
counters[DEPOT_COUNTER_FREES]++;
counters[DEPOT_COUNTER_INUSE]--;
+
+ printk_deferred_exit();
+ raw_spin_unlock_irqrestore(&pool_lock, flags);
}
/* Calculates the hash for a stack. */
@@ -479,22 +560,65 @@ int stackdepot_memcmp(const unsigned long *u1, const unsigned long *u2,
/* Finds a stack in a bucket of the hash table. */
static inline struct stack_record *find_stack(struct list_head *bucket,
- unsigned long *entries, int size,
- u32 hash)
+ unsigned long *entries, int size,
+ u32 hash, depot_flags_t flags)
{
- struct list_head *pos;
- struct stack_record *found;
+ struct stack_record *stack, *ret = NULL;
+
+ rcu_read_lock();
- lockdep_assert_held(&pool_rwlock);
+ list_for_each_entry_rcu(stack, bucket, hash_list) {
+ if (stack->hash != hash || stack->size != size)
+ continue;
- list_for_each(pos, bucket) {
- found = list_entry(pos, struct stack_record, list);
- if (found->hash == hash &&
- found->size == size &&
- !stackdepot_memcmp(entries, found->entries, size))
- return found;
+ /*
+ * This may race with depot_free_stack() accessing the freelist
+ * management state unioned with @entries. The refcount is zero
+ * in that case and the below refcount_inc_not_zero() will fail.
+ */
+ if (data_race(stackdepot_memcmp(entries, stack->entries, size)))
+ continue;
+
+ /*
+ * Try to increment refcount. If this succeeds, the stack record
+ * is valid and has not yet been freed.
+ *
+ * If STACK_DEPOT_FLAG_GET is not used, it is undefined behavior
+ * to then call stack_depot_put() later, and we can assume that
+ * a stack record is never placed back on the freelist.
+ */
+ if (flags & STACK_DEPOT_FLAG_GET) {
+ if (!refcount_inc_not_zero(&stack->count))
+ continue;
+ smp_mb__after_atomic();
+ } else {
+ /*
+ * Pairs with the release implied by list_add_rcu() to
+ * turn the list-pointer access into an acquire; as-is
+ * it only provides dependency-ordering implied by
+ * READ_ONCE().
+ *
+ * Normally this is not needed, if we were to continue
+ * using the stack_record pointer only. But, the pointer
+ * returned here is not actually used to lookup entries.
+ * Instead, the handle is returned, from which a pointer
+ * may then be reconstructed in depot_fetch_stack().
+ *
+ * Therefore, it is required to upgrade the ordering
+ * from dependency-ordering only to at least acquire to
+ * be able to use the handle as another reference to the
+ * same stack record.
+ */
+ smp_mb();
+ }
+
+ ret = stack;
+ break;
}
- return NULL;
+
+ rcu_read_unlock();
+
+ return ret;
}
depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
@@ -508,7 +632,6 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
struct page *page = NULL;
void *prealloc = NULL;
bool can_alloc = depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC;
- bool need_alloc = false;
unsigned long flags;
u32 hash;
@@ -531,31 +654,16 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
hash = hash_stack(entries, nr_entries);
bucket = &stack_table[hash & stack_hash_mask];
- read_lock_irqsave(&pool_rwlock, flags);
- printk_deferred_enter();
-
- /* Fast path: look the stack trace up without full locking. */
- found = find_stack(bucket, entries, nr_entries, hash);
- if (found) {
- if (depot_flags & STACK_DEPOT_FLAG_GET)
- refcount_inc(&found->count);
- printk_deferred_exit();
- read_unlock_irqrestore(&pool_rwlock, flags);
+ /* Fast path: look the stack trace up without locking. */
+ found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
+ if (found)
goto exit;
- }
-
- /* Take note if another stack pool needs to be allocated. */
- if (new_pool_required)
- need_alloc = true;
-
- printk_deferred_exit();
- read_unlock_irqrestore(&pool_rwlock, flags);
/*
* Allocate memory for a new pool if required now:
* we won't be able to do that under the lock.
*/
- if (unlikely(can_alloc && need_alloc)) {
+ if (unlikely(can_alloc && READ_ONCE(new_pool_required))) {
/*
* Zero out zone modifiers, as we don't have specific zone
* requirements. Keep the flags related to allocation in atomic
@@ -569,31 +677,36 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
prealloc = page_address(page);
}
- write_lock_irqsave(&pool_rwlock, flags);
+ raw_spin_lock_irqsave(&pool_lock, flags);
printk_deferred_enter();
- found = find_stack(bucket, entries, nr_entries, hash);
+ /* Try to find again, to avoid concurrently inserting duplicates. */
+ found = find_stack(bucket, entries, nr_entries, hash, depot_flags);
if (!found) {
struct stack_record *new =
depot_alloc_stack(entries, nr_entries, hash, &prealloc);
if (new) {
- list_add(&new->list, bucket);
+ /*
+ * This releases the stack record into the bucket and
+ * makes it visible to readers in find_stack().
+ */
+ list_add_rcu(&new->hash_list, bucket);
found = new;
}
- } else {
- if (depot_flags & STACK_DEPOT_FLAG_GET)
- refcount_inc(&found->count);
+ }
+
+ if (prealloc) {
/*
- * Stack depot already contains this stack trace, but let's
- * keep the preallocated memory for future.
+ * Either stack depot already contains this stack trace, or
+ * depot_alloc_stack() did not consume the preallocated memory.
+ * Try to keep the preallocated memory for future.
*/
- if (prealloc)
- depot_keep_new_pool(&prealloc);
+ depot_keep_new_pool(&prealloc);
}
printk_deferred_exit();
- write_unlock_irqrestore(&pool_rwlock, flags);
+ raw_spin_unlock_irqrestore(&pool_lock, flags);
exit:
if (prealloc) {
/* Stack depot didn't use this memory, free it. */
@@ -618,7 +731,6 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
unsigned long **entries)
{
struct stack_record *stack;
- unsigned long flags;
*entries = NULL;
/*
@@ -630,13 +742,13 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
if (!handle || stack_depot_disabled)
return 0;
- read_lock_irqsave(&pool_rwlock, flags);
- printk_deferred_enter();
-
stack = depot_fetch_stack(handle);
-
- printk_deferred_exit();
- read_unlock_irqrestore(&pool_rwlock, flags);
+ /*
+ * Should never be NULL, otherwise this is a use-after-put (or just a
+ * corrupt handle).
+ */
+ if (WARN(!stack, "corrupt handle or use after stack_depot_put()"))
+ return 0;
*entries = stack->entries;
return stack->size;
@@ -646,29 +758,20 @@ EXPORT_SYMBOL_GPL(stack_depot_fetch);
void stack_depot_put(depot_stack_handle_t handle)
{
struct stack_record *stack;
- unsigned long flags;
if (!handle || stack_depot_disabled)
return;
- write_lock_irqsave(&pool_rwlock, flags);
- printk_deferred_enter();
-
stack = depot_fetch_stack(handle);
- if (WARN_ON(!stack))
- goto out;
-
- if (refcount_dec_and_test(&stack->count)) {
- /* Unlink stack from the hash table. */
- list_del(&stack->list);
+ /*
+ * Should always be able to find the stack record, otherwise this is an
+ * unbalanced put attempt (or corrupt handle).
+ */
+ if (WARN(!stack, "corrupt handle or unbalanced stack_depot_put()"))
+ return;
- /* Free stack. */
+ if (refcount_dec_and_test(&stack->count))
depot_free_stack(stack);
- }
-
-out:
- printk_deferred_exit();
- write_unlock_irqrestore(&pool_rwlock, flags);
}
EXPORT_SYMBOL_GPL(stack_depot_put);
--
2.43.0.275.g3460e3d667-goog
On Fri, Jan 12, 2024 at 11:15:05PM +0100, Marco Elver wrote:
> + /*
> + * Stack traces of size 0 are never saved, and we can simply use
> + * the size field as an indicator if this is a new unused stack
> + * record in the freelist.
> + */
> + stack->size = 0;
I would use WRITE_ONCE here too, at least for TSan.
> + return NULL;
> +
> + /*
> + * We maintain the invariant that the elements in front are least
> + * recently used, and are therefore more likely to be associated with an
> + * RCU grace period in the past. Consequently it is sufficient to only
> + * check the first entry.
> + */
> + stack = list_first_entry(&free_stacks, struct stack_record, free_list);
> + if (stack->size && !poll_state_synchronize_rcu(stack->rcu_state))
READ_ONCE (also for TSan, and might be safer long term in case the
compiler considers some fancy code transformation)
> + return NULL;
>
> + stack = depot_pop_free();
> + if (WARN_ON(!stack))
Won't you get nesting problems here if this triggers due to the print?
I assume the nmi safe printk won't consider it like an NMI.
> counters[DEPOT_COUNTER_FREELIST_SIZE]++;
> counters[DEPOT_COUNTER_FREES]++;
> counters[DEPOT_COUNTER_INUSE]--;
> +
> + printk_deferred_exit();
Ah this handles the WARN_ON? Should be ok then.
-Andi
On Sat, 13 Jan 2024 at 02:24, Andi Kleen <[email protected]> wrote:
>
> On Fri, Jan 12, 2024 at 11:15:05PM +0100, Marco Elver wrote:
> > + /*
> > + * Stack traces of size 0 are never saved, and we can simply use
> > + * the size field as an indicator if this is a new unused stack
> > + * record in the freelist.
> > + */
> > + stack->size = 0;
>
> I would use WRITE_ONCE here too, at least for TSan.
This is written with the pool_lock held.
> > + return NULL;
> > +
> > + /*
> > + * We maintain the invariant that the elements in front are least
> > + * recently used, and are therefore more likely to be associated with an
> > + * RCU grace period in the past. Consequently it is sufficient to only
> > + * check the first entry.
> > + */
> > + stack = list_first_entry(&free_stacks, struct stack_record, free_list);
> > + if (stack->size && !poll_state_synchronize_rcu(stack->rcu_state))
>
> READ_ONCE (also for TSan, and might be safer long term in case the
> compiler considers some fancy code transformation)
And this is also only read with the pool_lock held, so it's impossible
that there'd be a data race due to size. (And if there is a data race,
I'd want KCSAN to tell us because that'd be a bug then.)
depot_pop_free() can't be used w/o the lock because it's manipulating
the freelist.
To be sure, I'm adding a lockdep_assert_held() to depot_pop_free().
> > + return NULL;
> >
> > + stack = depot_pop_free();
> > + if (WARN_ON(!stack))
>
> Won't you get nesting problems here if this triggers due to the print?
> I assume the nmi safe printk won't consider it like an NMI.
>
> > counters[DEPOT_COUNTER_FREELIST_SIZE]++;
> > counters[DEPOT_COUNTER_FREES]++;
> > counters[DEPOT_COUNTER_INUSE]--;
> > +
> > + printk_deferred_exit();
>
> Ah this handles the WARN_ON? Should be ok then.
Yes, the pool_lock critical sections are surrounded by printk_deferred.
Thanks,
-- Marco
On Sat, 13 Jan 2024 at 10:19, Andi Kleen <[email protected]> wrote:
>
> On Sat, Jan 13, 2024 at 10:12:21AM +0100, Marco Elver wrote:
> > On Sat, 13 Jan 2024 at 02:24, Andi Kleen <[email protected]> wrote:
> > >
> > > On Fri, Jan 12, 2024 at 11:15:05PM +0100, Marco Elver wrote:
> > > > + /*
> > > > + * Stack traces of size 0 are never saved, and we can simply use
> > > > + * the size field as an indicator if this is a new unused stack
> > > > + * record in the freelist.
> > > > + */
> > > > + stack->size = 0;
> > >
> > > I would use WRITE_ONCE here too, at least for TSan.
> >
> > This is written with the pool_lock held.
>
> ...which doesn't help because the readers don't take it?
This function is only refilling the freelist. Readers don't see it yet
because it's in none of the hash table buckets. The freelist is only
ever accessed under the lock.
Once an entry is allocated from the freelist, its size is overwritten
with something non-zero (since it then contains a stack trace). Those
updates are released into the right hash table bucket with
list_add_rcu() (which implies a release).
Am I missing something else?
On Sat, 13 Jan 2024 at 10:23, Marco Elver <[email protected]> wrote:
>
> On Sat, 13 Jan 2024 at 10:19, Andi Kleen <[email protected]> wrote:
> >
> > On Sat, Jan 13, 2024 at 10:12:21AM +0100, Marco Elver wrote:
> > > On Sat, 13 Jan 2024 at 02:24, Andi Kleen <[email protected]> wrote:
> > > >
> > > > On Fri, Jan 12, 2024 at 11:15:05PM +0100, Marco Elver wrote:
> > > > > + /*
> > > > > + * Stack traces of size 0 are never saved, and we can simply use
> > > > > + * the size field as an indicator if this is a new unused stack
> > > > > + * record in the freelist.
> > > > > + */
> > > > > + stack->size = 0;
> > > >
> > > > I would use WRITE_ONCE here too, at least for TSan.
> > >
> > > This is written with the pool_lock held.
> >
> > ...which doesn't help because the readers don't take it?
>
> This function is only refilling the freelist. Readers don't see it yet
> because it's in none of the hash table buckets. The freelist is only
> ever accessed under the lock.
>
> Once an entry is allocated from the freelist, its size is overwritten
> with something non-zero (since it then contains a stack trace). Those
> updates are released into the right hash table bucket with
> list_add_rcu() (which implies a release).
>
> Am I missing something else?
FWIW, the current version (draft) of this can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/melver/linux.git/log/?h=kasan/dev
I'll send the 2 patches next week - they should apply cleanly on
current mainline.
Thanks,
-- Marco
> This function is only refilling the freelist. Readers don't see it yet
> because it's in none of the hash table buckets. The freelist is only
> ever accessed under the lock.
>
> Once an entry is allocated from the freelist, its size is overwritten
> with something non-zero (since it then contains a stack trace). Those
> updates are released into the right hash table bucket with
> list_add_rcu() (which implies a release).
>
> Am I missing something else?
It's probably ok semantically here, but at least I would be consistent with
using the macro for a specific field.
-Andi
Hello Andrey,
On Mon, Nov 20, 2023 at 06:47:10PM +0100, [email protected] wrote:
> From: Andrey Konovalov <[email protected]>
>
> Currently, stack depot uses the following locking scheme:
>
> 1. Lock-free accesses when looking up a stack record, which allows to
> have multiple users to look up records in parallel;
> 2. Spinlock for protecting the stack depot pools and the hash table
> when adding a new record.
>
> For implementing the eviction of stack traces from stack depot, the
> lock-free approach is not going to work anymore, as we will need to be
> able to also remove records from the hash table.
>
> Convert the spinlock into a read/write lock, and drop the atomic accesses,
> as they are no longer required.
>
> Looking up stack traces is now protected by the read lock and adding new
> records - by the write lock. One of the following patches will add a new
> function for evicting stack records, which will be protected by the write
> lock as well.
>
> With this change, multiple users can still look up records in parallel.
>
> This is preparatory patch for implementing the eviction of stack records
> from the stack depot.
I am testing quite recent "debug" kernel (with KASAN, Lockdep, etc
enabled). This kernel is based on
9f8413c4a66f2fb776d3dc3c9ed20bf435eb305e, and I found the following
lockdep DEADLOCK splat recently. Before investigating further, I would
like to share here to check if this is a known problem:
======================================================
WARNING: possible circular locking dependency detected
6.7.0-0_fbk700_debug_kbuilder_0_g9f8413c4a66f #1 Tainted: G S EL N
------------------------------------------------------
btrfs-transacti/566 is trying to acquire lock:
ffffffff84563058 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock (kernel/locking/semaphore.c:135)
but task is already holding lock:
ffffffff850a2a70 (pool_lock#2){-.-.}-{2:2}, at: __stack_depot_save (lib/stackdepot.c:415)
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (pool_lock#2){-.-.}-{2:2}:
lock_acquire (kernel/locking/lockdep.c:462 kernel/locking/lockdep.c:5753)
_raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:110 kernel/locking/spinlock.c:162)
__stack_depot_save (lib/stackdepot.c:415)
kasan_save_stack (mm/kasan/common.c:46)
__kasan_record_aux_stack (mm/kasan/generic.c:492)
task_work_add (kernel/task_work.c:44)
scheduler_tick (kernel/sched/core.c:5677)
update_process_times (kernel/time/timer.c:2070 kernel/time/timer.c:2086)
tick_nohz_highres_handler (kernel/time/tick-sched.c:? kernel/time/tick-sched.c:1512)
__hrtimer_run_queues (kernel/time/hrtimer.c:484 kernel/time/hrtimer.c:1656 kernel/time/hrtimer.c:1752)
hrtimer_interrupt (kernel/time/hrtimer.c:? kernel/time/hrtimer.c:1796)
__sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1052 arch/x86/kernel/apic/apic.c:1082)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076)
asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:649)
_raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
debug_object_activate (lib/debugobjects.c:? lib/debugobjects.c:732)
call_rcu (kernel/rcu/rcu.h:227 kernel/rcu/tree.c:2666 kernel/rcu/tree.c:2795)
mas_wr_modify (lib/maple_tree.c:3375 lib/maple_tree.c:3482 lib/maple_tree.c:4198 lib/maple_tree.c:4236)
mas_wr_store_entry (lib/maple_tree.c:1389 lib/maple_tree.c:4250)
mas_store_prealloc (lib/maple_tree.c:5364 lib/maple_tree.c:5458)
vma_complete (mm/mmap.c:523)
__split_vma (include/linux/mmap_lock.h:71 include/linux/mm.h:694 include/linux/mm.h:713 mm/mmap.c:2399)
do_vmi_align_munmap (include/linux/maple_tree.h:621 mm/mmap.c:2547)
do_vmi_munmap (mm/mmap.c:2729)
__vm_munmap (mm/mmap.c:3010)
elf_load (include/linux/instrumented.h:68 include/asm-generic/bitops/instrumented-non-atomic.h:141 include/linux/thread_info.h:118 fs/binfmt_elf.c:382 fs/binfmt_elf.c:408)
load_elf_binary (fs/binfmt_elf.c:?)
bprm_execve (fs/exec.c:? fs/exec.c:1854)
kernel_execve (fs/exec.c:2009)
kernel_init (init/main.c:1468)
ret_from_fork (arch/x86/kernel/process.c:153)
ret_from_fork_asm (arch/x86/entry/entry_64.S:250)
-> #2 (&rq->__lock){-.-.}-{2:2}:
lock_acquire (kernel/locking/lockdep.c:462 kernel/locking/lockdep.c:5753)
_raw_spin_lock_nested (kernel/locking/spinlock.c:378)
raw_spin_rq_lock_nested (kernel/sched/core.c:560)
task_fork_fair (kernel/sched/fair.c:12644)
sched_cgroup_fork (kernel/sched/sched.h:2024 kernel/sched/sched.h:2047 kernel/sched/core.c:4833)
copy_process (include/linux/instrumented.h:82 include/linux/atomic/atomic-instrumented.h:67 include/linux/refcount.h:136 kernel/fork.c:1868 kernel/fork.c:2499)
kernel_clone (kernel/fork.c:2898)
user_mode_thread (kernel/fork.c:2977)
rest_init (init/main.c:695)
arch_call_rest_init+0xf/0x10
start_kernel (init/main.c:1011)
x86_64_start_reservations (??:?)
x86_64_start_kernel (??:?)
secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:461)
-> #1 (&p->pi_lock){-.-.}-{2:2}:
lock_acquire (kernel/locking/lockdep.c:462 kernel/locking/lockdep.c:5753)
_raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:110 kernel/locking/spinlock.c:162)
try_to_wake_up (include/linux/spinlock.h:? kernel/sched/core.c:4252)
up (kernel/locking/semaphore.c:189)
console_unlock+0xcb/0x1e0a <- releases console semaphore
vprintk_emit (arch/x86/include/asm/irqflags.h:? arch/x86/include/asm/irqflags.h:67 arch/x86/include/asm/irqflags.h:127 kernel/printk/printk.c:1968 kernel/printk/printk.c:2302)
_printk (kernel/printk/printk.c:?)
_fat_msg (fs/fat/misc.c:?)
fat_fill_super (fs/fat/inode.c:? fs/fat/inode.c:1646)
mount_bdev (fs/super.c:?)
legacy_get_tree (fs/fs_context.c:662)
vfs_get_tree (fs/super.c:1784)
path_mount (fs/namespace.c:? fs/namespace.c:3662)
__se_sys_mount (fs/namespace.c:? fs/namespace.c:3882 fs/namespace.c:3864)
do_syscall_64 (arch/x86/entry/common.c:?)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
-> #0 ((console_sem).lock){-.-.}-{2:2}:
check_prevs_add (kernel/locking/lockdep.c:223 kernel/locking/lockdep.c:3098 kernel/locking/lockdep.c:3253)
__lock_acquire+0x2399/0x24f0 * Trying to get the sem_console
lock_acquire (kernel/locking/lockdep.c:462 kernel/locking/lockdep.c:5753)
_raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:110 kernel/locking/spinlock.c:162)
down_trylock (kernel/locking/semaphore.c:135)
__down_trylock_console_sem (kernel/printk/printk.c:322)
vprintk_emit (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:457 include/linux/atomic/atomic-instrumented.h:33 kernel/printk/printk.c:2621 kernel/printk/printk.c:2657 kernel/printk/printk.c:1923 kernel/printk/printk.c:2302)
_printk (kernel/printk/printk.c:?)
__warn_printk (include/linux/context_tracking.h:131 include/linux/context_tracking.h:145 kernel/panic.c:718)
__stack_depot_save+0x685/0x690 * <- Got the *pool* lock
kasan_set_track (mm/kasan/common.c:52)
__kasan_slab_alloc (mm/kasan/common.c:331)
kmem_cache_alloc (include/linux/kasan.h:188 mm/slab.h:763 mm/slub.c:3478 mm/slub.c:3486 mm/slub.c:3493 mm/slub.c:3502)
btrfs_add_delayed_tree_ref (fs/btrfs/delayed-ref.c:1027)
btrfs_free_tree_block (fs/btrfs/delayed-ref.h:316 fs/btrfs/extent-tree.c:3452)
btrfs_force_cow_block (include/asm-generic/unaligned.h:52 fs/btrfs/accessors.h:681 fs/btrfs/accessors.h:721 fs/btrfs/ctree.c:574)
btrfs_cow_block (include/asm-generic/unaligned.h:37 fs/btrfs/accessors.h:678 fs/btrfs/ctree.c:678 fs/btrfs/ctree.c:727)
balance_level (fs/btrfs/ctree.c:960)
btrfs_search_slot (fs/btrfs/ctree.c:2097)
lookup_inline_extent_backref (fs/btrfs/accessors.h:371 fs/btrfs/extent-tree.c:814)
__btrfs_free_extent (fs/btrfs/extent-tree.c:3113)
__btrfs_run_delayed_refs (include/linux/instrumented.h:96 include/linux/atomic/atomic-instrumented.h:592 fs/btrfs/extent-tree.c:2045 fs/btrfs/extent-tree.c:2134)
btrfs_run_delayed_refs (include/linux/instrumented.h:68 include/asm-generic/bitops/instrumented-non-atomic.h:141 fs/btrfs/extent-tree.c:2238)
btrfs_commit_transaction (fs/btrfs/transaction.c:2217)
transaction_kthread (fs/btrfs/disk-io.c:1558)
kthread (kernel/kthread.c:373)
ret_from_fork (arch/x86/kernel/process.c:153)
ret_from_fork_asm (arch/x86/entry/entry_64.S:250)
other info that might help us debug this:
Chain exists of:
(console_sem).lock --> &rq->__lock --> pool_lock#2
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(pool_lock#2);
lock(&rq->__lock);
lock(pool_lock#2);
lock((console_sem).lock);
*** DEADLOCK ***
11 locks held by btrfs-transacti/566:
#0: ffff8883221147d8 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread (fs/btrfs/disk-io.c:?)
#1: ffff888322116390 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction (include/linux/spinlock.h:? fs/btrfs/transaction.c:285)
#2: ffff8883221163b8 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction (include/linux/spinlock.h:? fs/btrfs/transaction.c:285)
#3: ffff8883221163e0 (btrfs_trans_commit_prep){.+.+}-{0:0}, at: btrfs_commit_transaction (fs/btrfs/transaction.c:2205)
#4: ffff8889220f2240 (&head_ref->mutex){+.+.}-{3:3}, at: btrfs_delayed_ref_lock (include/linux/lockdep.h:288 fs/btrfs/delayed-ref.c:503)
#5: ffff8885b372a980 (btrfs-extent-02){++++}-{3:3}, at: btrfs_lock_root_node (arch/x86/include/asm/current.h:41 fs/btrfs/locking.c:211 fs/btrfs/locking.c:217 fs/btrfs/locking.c:270)
#6: ffff8888635b5538 (btrfs-extent-01){++++}-{3:3}, at: btrfs_tree_lock (arch/x86/include/asm/current.h:41 fs/btrfs/locking.c:211 fs/btrfs/locking.c:217)
#7: ffff8885df3e2d30 (btrfs-extent-01/2){+.+.}-{3:3}, at: __btrfs_tree_lock (arch/x86/include/asm/current.h:41 fs/btrfs/locking.c:211)
#8: ffff88832bb33490 (btrfs-extent-01/3){+.+.}-{3:3}, at: __btrfs_tree_lock (arch/x86/include/asm/current.h:41 fs/btrfs/locking.c:211)
#9: ffff888708195c98 (btrfs-extent-01/5){+.+.}-{3:3}, at: __btrfs_tree_lock (arch/x86/include/asm/current.h:41 fs/btrfs/locking.c:211)
#10: ffffffff850a2a70 (pool_lock#2){-.-.}-{2:2}, at: __stack_depot_save (lib/stackdepot.c:415)
On Wed, 24 Jan 2024 at 15:16, Breno Leitao <[email protected]> wrote:
>
> Hello Andrey,
>
> On Mon, Nov 20, 2023 at 06:47:10PM +0100, [email protected] wrote:
> > From: Andrey Konovalov <[email protected]>
> >
> > Currently, stack depot uses the following locking scheme:
> >
> > 1. Lock-free accesses when looking up a stack record, which allows to
> > have multiple users to look up records in parallel;
> > 2. Spinlock for protecting the stack depot pools and the hash table
> > when adding a new record.
> >
> > For implementing the eviction of stack traces from stack depot, the
> > lock-free approach is not going to work anymore, as we will need to be
> > able to also remove records from the hash table.
> >
> > Convert the spinlock into a read/write lock, and drop the atomic accesses,
> > as they are no longer required.
> >
> > Looking up stack traces is now protected by the read lock and adding new
> > records - by the write lock. One of the following patches will add a new
> > function for evicting stack records, which will be protected by the write
> > lock as well.
> >
> > With this change, multiple users can still look up records in parallel.
> >
> > This is preparatory patch for implementing the eviction of stack records
> > from the stack depot.
>
> I am testing quite recent "debug" kernel (with KASAN, Lockdep, etc
> enabled). This kernel is based on
> 9f8413c4a66f2fb776d3dc3c9ed20bf435eb305e, and I found the following
This version predates this series, as far as I can tell. Can you try linux-next?
Thanks,
-- Marco
On Wed, Jan 24, 2024 at 03:21:26PM +0100, Marco Elver wrote:
> On Wed, 24 Jan 2024 at 15:16, Breno Leitao <[email protected]> wrote:
> >
> > Hello Andrey,
> >
> > On Mon, Nov 20, 2023 at 06:47:10PM +0100, [email protected] wrote:
> > > From: Andrey Konovalov <[email protected]>
> > >
> > > Currently, stack depot uses the following locking scheme:
> > >
> > > 1. Lock-free accesses when looking up a stack record, which allows to
> > > have multiple users to look up records in parallel;
> > > 2. Spinlock for protecting the stack depot pools and the hash table
> > > when adding a new record.
> > >
> > > For implementing the eviction of stack traces from stack depot, the
> > > lock-free approach is not going to work anymore, as we will need to be
> > > able to also remove records from the hash table.
> > >
> > > Convert the spinlock into a read/write lock, and drop the atomic accesses,
> > > as they are no longer required.
> > >
> > > Looking up stack traces is now protected by the read lock and adding new
> > > records - by the write lock. One of the following patches will add a new
> > > function for evicting stack records, which will be protected by the write
> > > lock as well.
> > >
> > > With this change, multiple users can still look up records in parallel.
> > >
> > > This is preparatory patch for implementing the eviction of stack records
> > > from the stack depot.
> >
> > I am testing quite recent "debug" kernel (with KASAN, Lockdep, etc
> > enabled). This kernel is based on
> > 9f8413c4a66f2fb776d3dc3c9ed20bf435eb305e, and I found the following
>
> This version predates this series, as far as I can tell. Can you try linux-next?
That is true. I will retest and let you know if it is still
reproducible.
Thanks.