Changes v6 -> v7:
- Rebased on top of Andrey Konovalov's libstackdepot patchset
- Reformulated the changelogs
Changes v5 -> v6:
- Rebase on top of v6.7-rc1
- Move stack_record struct to the header
- Addressed feedback from Vlastimil
(some code tweaks and changelogs suggestions)
Changes v4 -> v5:
- Addressed feedback from Alexander Potapenko
Changes v3 -> v4:
- Rebase (long time has passed)
- Use boolean instead of enum for action (suggested by Alexander Potapenko)
- (I left some feedback unaddressed because it has been a while;
I would rather discuss it here now than revamp
an old thread)
Changes v2 -> v3:
- Replace interface in favor of seq operations
(suggested by Vlastimil)
- Use debugfs interface to store/read values (suggested by Ammar)
page_owner is a great debugging tool that lets us know
about all pages that have been allocated/freed along with the specific
stacktrace that did so.
This comes in very handy when debugging memory leaks, since with
some scripting we can see the outstanding allocations, which might point
to a memory leak.
In my experience, that is one of the most useful cases, but screening
through all pages and trying to reconstruct the stack <-> allocated/freed
relationship can get really tedious, and most of the time becomes a
daunting and slow process when we have tons of allocation/free operations.
This patchset aims to ease that by adding new functionality to
page_owner.
It creates a new read-only file called "page_owner_stacks",
which prints out all the stacks followed by their outstanding number
of allocations (that is, the number of times the stacktrace has
allocated pages that have not been freed yet).
This gives us a clear and quick overview of the stack <-> outstanding
allocations relationship.
We take advantage of the new refcount_t count field that struct
stack_record gained, and increment/decrement the stack refcount on every
__set_page_owner() (alloc operation) and __reset_page_owner() (free
operation) call.
Unfortunately, we cannot use the new stackdepot API (STACK_DEPOT_FLAG_GET
plus stack_depot_put()) because it does not fulfill page_owner's needs:
we would have to special-case things, at which point it makes more sense
for page_owner to do its own incrementing and decrementing of the stack
refcounts.
E.g.: with stack_depot_put(), once the refcount reaches 0 the stack gets
evicted, so page_owner would lose information.
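To make the above concrete, the counting page_owner ends up doing is as
simple as the sketch below (a minimal outline of what patch 2 of this
series implements; see that patch for the actual code):

static void inc_stack_record_count(depot_stack_handle_t handle)
{
	struct stack_record *stack = stack_depot_get_stack(handle);

	/* Plain counter: dropping to 0 never evicts the record. */
	if (stack)
		refcount_inc(&stack->count);
}

With stack_depot_put() instead, the final decrement would evict the
record and the stacktrace would vanish from the report.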
This patchset also creates a new file called 'page_owner_threshold'.
By writing a value to it, stacks whose refcount is below that
value will be filtered out.
In order to better exercise the path in stack_depot_get_next_stack(),
I artificially filled the buckets with more than one stack, making sure
I was getting all of them when reading from it.
On a side note, stack_depot_get_next_stack() could be reworked to live in
page_owner code, but we would have to move stack_table
into the header so page_owner can access it.
I can do that if that's preferred, so stackdepot.c would not get "polluted".
A PoC can be found below:
# cat /sys/kernel/debug/page_owner_stacks > page_owner_full_stacks.txt
# head -40 page_owner_full_stacks.txt
prep_new_page+0xa9/0x120
get_page_from_freelist+0x801/0x2210
__alloc_pages+0x18b/0x350
alloc_pages_mpol+0x91/0x1f0
folio_alloc+0x14/0x50
filemap_alloc_folio+0xb2/0x100
page_cache_ra_unbounded+0x96/0x180
filemap_get_pages+0xfd/0x590
filemap_read+0xcc/0x330
blkdev_read_iter+0xb8/0x150
vfs_read+0x285/0x320
ksys_read+0xa5/0xe0
do_syscall_64+0x80/0x160
entry_SYSCALL_64_after_hwframe+0x6e/0x76
stack_count: 521
prep_new_page+0xa9/0x120
get_page_from_freelist+0x801/0x2210
__alloc_pages+0x18b/0x350
alloc_pages_mpol+0x91/0x1f0
folio_alloc+0x14/0x50
filemap_alloc_folio+0xb2/0x100
__filemap_get_folio+0x14a/0x490
ext4_write_begin+0xbd/0x4b0 [ext4]
generic_perform_write+0xc1/0x1e0
ext4_buffered_write_iter+0x68/0xe0 [ext4]
ext4_file_write_iter+0x70/0x740 [ext4]
vfs_write+0x33d/0x420
ksys_write+0xa5/0xe0
do_syscall_64+0x80/0x160
entry_SYSCALL_64_after_hwframe+0x6e/0x76
stack_count: 4609
..
..
# echo 5000 > /sys/kernel/debug/page_owner_threshold
# cat /sys/kernel/debug/page_owner_stacks > page_owner_full_stacks_5000.txt
# head -40 page_owner_full_stacks_5000.txt
prep_new_page+0xa9/0x120
get_page_from_freelist+0x801/0x2210
__alloc_pages+0x18b/0x350
alloc_pages_mpol+0x91/0x1f0
folio_alloc+0x14/0x50
filemap_alloc_folio+0xb2/0x100
__filemap_get_folio+0x14a/0x490
ext4_write_begin+0xbd/0x4b0 [ext4]
generic_perform_write+0xc1/0x1e0
ext4_buffered_write_iter+0x68/0xe0 [ext4]
ext4_file_write_iter+0x70/0x740 [ext4]
vfs_write+0x33d/0x420
ksys_pwrite64+0x75/0x90
do_syscall_64+0x80/0x160
entry_SYSCALL_64_after_hwframe+0x6e/0x76
stack_count: 6781
prep_new_page+0xa9/0x120
get_page_from_freelist+0x801/0x2210
__alloc_pages+0x18b/0x350
pcpu_populate_chunk+0xec/0x350
pcpu_balance_workfn+0x2d1/0x4a0
process_scheduled_works+0x84/0x380
worker_thread+0x12a/0x2a0
kthread+0xe3/0x110
ret_from_fork+0x30/0x50
ret_from_fork_asm+0x1b/0x30
stack_count: 8641
Oscar Salvador (4):
lib/stackdepot: Move stack_record struct definition into the header
mm,page_owner: Implement the tracking of the stacks count
mm,page_owner: Display all stacks and their count
mm,page_owner: Filter out stacks by a threshold
include/linux/stackdepot.h | 72 ++++++++++++++++++++
lib/stackdepot.c | 97 ++++++++++++++------------
mm/page_owner.c | 136 +++++++++++++++++++++++++++++++++++++
3 files changed, 262 insertions(+), 43 deletions(-)
--
2.43.0
In order to move the heavy lifting into page_owner code, page_owner
needs access to the stack_record structure, which right now
sits in lib/stackdepot.c.
Move it to the stackdepot.h header so page_owner can access
stack_record's fields.
Signed-off-by: Oscar Salvador <[email protected]>
---
include/linux/stackdepot.h | 44 ++++++++++++++++++++++++++++++++++++++
lib/stackdepot.c | 43 -------------------------------------
2 files changed, 44 insertions(+), 43 deletions(-)
diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
index adcbb8f23600..d0dcf4aebfb4 100644
--- a/include/linux/stackdepot.h
+++ b/include/linux/stackdepot.h
@@ -30,6 +30,50 @@ typedef u32 depot_stack_handle_t;
*/
#define STACK_DEPOT_EXTRA_BITS 5
+#define DEPOT_HANDLE_BITS (sizeof(depot_stack_handle_t) * 8)
+
+#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages */
+#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT + DEPOT_POOL_ORDER))
+#define DEPOT_STACK_ALIGN 4
+#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT - DEPOT_STACK_ALIGN)
+#define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_OFFSET_BITS - \
+ STACK_DEPOT_EXTRA_BITS)
+
+/* Compact structure that stores a reference to a stack. */
+union handle_parts {
+ depot_stack_handle_t handle;
+ struct {
+ u32 pool_index : DEPOT_POOL_INDEX_BITS;
+ u32 offset : DEPOT_OFFSET_BITS;
+ u32 extra : STACK_DEPOT_EXTRA_BITS;
+ };
+};
+
+struct stack_record {
+ struct list_head hash_list; /* Links in the hash table */
+ u32 hash; /* Hash in hash table */
+ u32 size; /* Number of stored frames */
+ union handle_parts handle; /* Constant after initialization */
+ refcount_t count;
+ union {
+ unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
+ struct {
+ /*
+ * An important invariant of the implementation is to
+ * only place a stack record onto the freelist iff its
+ * refcount is zero. Because stack records with a zero
+ * refcount are never considered as valid, it is safe to
+ * union @entries and freelist management state below.
+ * Conversely, as soon as an entry is off the freelist
+ * and its refcount becomes non-zero, the below must not
+ * be accessed until being placed back on the freelist.
+ */
+ struct list_head free_list; /* Links in the freelist */
+ unsigned long rcu_state; /* RCU cookie */
+ };
+ };
+};
+
typedef u32 depot_flags_t;
/*
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 5caa1f566553..16c8a1bf0008 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -35,14 +35,6 @@
#include <linux/memblock.h>
#include <linux/kasan-enabled.h>
-#define DEPOT_HANDLE_BITS (sizeof(depot_stack_handle_t) * 8)
-
-#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages */
-#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT + DEPOT_POOL_ORDER))
-#define DEPOT_STACK_ALIGN 4
-#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT - DEPOT_STACK_ALIGN)
-#define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_OFFSET_BITS - \
- STACK_DEPOT_EXTRA_BITS)
#if IS_ENABLED(CONFIG_KMSAN) && CONFIG_STACKDEPOT_MAX_FRAMES >= 32
/*
* KMSAN is frequently used in fuzzing scenarios and thus saves a lot of stack
@@ -58,41 +50,6 @@
(((1LL << (DEPOT_POOL_INDEX_BITS)) < DEPOT_POOLS_CAP) ? \
(1LL << (DEPOT_POOL_INDEX_BITS)) : DEPOT_POOLS_CAP)
-/* Compact structure that stores a reference to a stack. */
-union handle_parts {
- depot_stack_handle_t handle;
- struct {
- u32 pool_index : DEPOT_POOL_INDEX_BITS;
- u32 offset : DEPOT_OFFSET_BITS;
- u32 extra : STACK_DEPOT_EXTRA_BITS;
- };
-};
-
-struct stack_record {
- struct list_head hash_list; /* Links in the hash table */
- u32 hash; /* Hash in hash table */
- u32 size; /* Number of stored frames */
- union handle_parts handle; /* Constant after initialization */
- refcount_t count;
- union {
- unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
- struct {
- /*
- * An important invariant of the implementation is to
- * only place a stack record onto the freelist iff its
- * refcount is zero. Because stack records with a zero
- * refcount are never considered as valid, it is safe to
- * union @entries and freelist management state below.
- * Conversely, as soon as an entry is off the freelist
- * and its refcount becomes non-zero, the below must not
- * be accessed until being placed back on the freelist.
- */
- struct list_head free_list; /* Links in the freelist */
- unsigned long rcu_state; /* RCU cookie */
- };
- };
-};
-
#define DEPOT_STACK_RECORD_SIZE \
ALIGN(sizeof(struct stack_record), 1 << DEPOT_STACK_ALIGN)
--
2.43.0
This patch adds a new file called 'page_owner_stacks', which
shows all stacks that were added by page_owner, followed by
their count, giving us a clear overview of the stack <-> count
relationship.
E.g.:
prep_new_page+0xa9/0x120
get_page_from_freelist+0x801/0x2210
__alloc_pages+0x18b/0x350
alloc_pages_mpol+0x91/0x1f0
folio_alloc+0x14/0x50
filemap_alloc_folio+0xb2/0x100
__filemap_get_folio+0x14a/0x490
ext4_write_begin+0xbd/0x4b0 [ext4]
generic_perform_write+0xc1/0x1e0
ext4_buffered_write_iter+0x68/0xe0 [ext4]
ext4_file_write_iter+0x70/0x740 [ext4]
vfs_write+0x33d/0x420
ksys_write+0xa5/0xe0
do_syscall_64+0x80/0x160
entry_SYSCALL_64_after_hwframe+0x6e/0x76
stack_count: 4578
In order to show all the stacks, we implement stack_depot_get_next_stack(),
which walks all buckets while retrieving the stacks stored in them.
stack_depot_get_next_stack() returns the stacks one at a time: it first
finds a non-empty bucket and returns the first stack stored in it; on
subsequent calls it keeps walking that same bucket until all of its stacks
have been returned.
Once we have completely gone through a bucket, we move on to the next
non-empty one and repeat the same steps, until all buckets have been checked.
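For illustration, a minimal (hypothetical) caller of this interface could
look like:

	unsigned long table = 0;
	struct list_head *bucket = NULL;
	struct stack_record *stack = NULL;

	/* Yields one stack_record per call, NULL once all buckets are done. */
	while ((stack = stack_depot_get_next_stack(&table, &bucket, &stack)))
		pr_info("%u frames, count: %d\n", stack->size,
			refcount_read(&stack->count));

The real consumer is the seq_file iterator added to mm/page_owner.c below,
which keeps the same three cursor variables in its private data.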
Signed-off-by: Oscar Salvador <[email protected]>
---
include/linux/stackdepot.h | 20 +++++++++
lib/stackdepot.c | 46 +++++++++++++++++++++
mm/page_owner.c | 85 ++++++++++++++++++++++++++++++++++++++
3 files changed, 151 insertions(+)
diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
index ac62de4d4999..d851ec821e6f 100644
--- a/include/linux/stackdepot.h
+++ b/include/linux/stackdepot.h
@@ -183,6 +183,26 @@ depot_stack_handle_t stack_depot_save(unsigned long *entries,
*/
struct stack_record *stack_depot_get_stack(depot_stack_handle_t handle);
+/**
+ * stack_depot_get_next_stack - Returns all stacks, one at a time
+ *
+ * @table: Current table we are checking
+ * @bucket: Current bucket we are checking
+ * @last_found: Last stack that was found
+ *
+ * This function finds first a non-empty bucket and returns the first stack
+ * stored in it. On consequent calls, it walks the bucket to see whether
+ * it contains more stacks.
+ * Once we have walked all the stacks in a bucket, we check
+ * the next one, and we repeat the same steps until we have checked all of them
+ *
+ * Return: A pointer a to stack_record struct, or NULL when we have walked all
+ * buckets.
+ */
+struct stack_record *stack_depot_get_next_stack(unsigned long *table,
+ struct list_head **bucket,
+ struct stack_record **last_found);
+
/**
* stack_depot_fetch - Fetch a stack trace from stack depot
*
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 197c355601f9..107bd0174cd6 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -782,6 +782,52 @@ unsigned int stack_depot_get_extra_bits(depot_stack_handle_t handle)
}
EXPORT_SYMBOL(stack_depot_get_extra_bits);
+struct stack_record *stack_depot_get_next_stack(unsigned long *table,
+ struct list_head **curr_bucket,
+ struct stack_record **last_found)
+{
+ struct list_head *bucket = *curr_bucket;
+ unsigned long nr_table = *table;
+ struct stack_record *found = NULL;
+ unsigned long stack_table_entries = stack_hash_mask + 1;
+
+ rcu_read_lock_sched_notrace();
+ if (!bucket) {
+ /*
+ * Find a non-empty bucket. Once we have found it,
+ * we will use list_for_each_entry_continue_rcu() on the next
+ * call to keep walking the bucket.
+ */
+new_table:
+ bucket = &stack_table[nr_table];
+ list_for_each_entry_rcu(found, bucket, hash_list) {
+ goto out;
+ }
+ } else {
+ /* Check whether we have more stacks in this bucket */
+ found = *last_found;
+ list_for_each_entry_continue_rcu(found, bucket, hash_list) {
+ goto out;
+ }
+ }
+
+ /* No more stacks in this bucket, check the next one */
+ nr_table++;
+ if (nr_table < stack_table_entries)
+ goto new_table;
+
+ /* We are done walking all buckets */
+ found = NULL;
+
+out:
+ *table = nr_table;
+ *curr_bucket = bucket;
+ *last_found = found;
+ rcu_read_unlock_sched_notrace();
+
+ return found;
+}
+
static int stats_show(struct seq_file *seq, void *v)
{
/*
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 0adf41702b9d..aea212734557 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -749,6 +749,89 @@ static const struct file_operations proc_page_owner_operations = {
.llseek = lseek_page_owner,
};
+struct stack_iterator {
+ unsigned long nr_table;
+ struct list_head *bucket;
+ struct stack_record *last_stack;
+};
+
+static void *stack_start(struct seq_file *m, loff_t *ppos)
+{
+ struct stack_iterator *iter = m->private;
+
+ if (*ppos == -1UL)
+ return NULL;
+
+ return stack_depot_get_next_stack(&iter->nr_table,
+ &iter->bucket,
+ &iter->last_stack);
+}
+
+static void *stack_next(struct seq_file *m, void *v, loff_t *ppos)
+{
+ struct stack_iterator *iter = m->private;
+ struct stack_record *stack;
+
+ stack = stack_depot_get_next_stack(&iter->nr_table,
+ &iter->bucket,
+ &iter->last_stack);
+ *ppos = stack ? *ppos + 1 : -1UL;
+
+ return stack;
+}
+
+static int stack_print(struct seq_file *m, void *v)
+{
+ char *buf;
+ int ret = 0;
+ struct stack_iterator *iter = m->private;
+ struct stack_record *stack = iter->last_stack;
+
+ if (!stack->size || stack->size < 0 || refcount_read(&stack->count) < 2)
+ return 0;
+
+ buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+ ret += stack_trace_snprint(buf, PAGE_SIZE, stack->entries, stack->size,
+ 0);
+ if (!ret)
+ goto out;
+
+ scnprintf(buf + ret, PAGE_SIZE - ret, "stack_count: %d\n\n",
+ refcount_read(&stack->count));
+
+ seq_printf(m, buf);
+ seq_puts(m, "\n\n");
+out:
+ kfree(buf);
+
+ return 0;
+}
+
+static void stack_stop(struct seq_file *m, void *v)
+{
+}
+
+static const struct seq_operations page_owner_stack_op = {
+ .start = stack_start,
+ .next = stack_next,
+ .stop = stack_stop,
+ .show = stack_print
+};
+
+static int page_owner_stack_open(struct inode *inode, struct file *file)
+{
+ return seq_open_private(file, &page_owner_stack_op,
+ sizeof(struct stack_iterator));
+}
+
+const struct file_operations page_owner_stack_operations = {
+ .open = page_owner_stack_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
static int __init pageowner_init(void)
{
if (!static_branch_unlikely(&page_owner_inited)) {
@@ -758,6 +841,8 @@ static int __init pageowner_init(void)
debugfs_create_file("page_owner", 0400, NULL, NULL,
&proc_page_owner_operations);
+ debugfs_create_file("page_owner_stacks", 0400, NULL, NULL,
+ &page_owner_stack_operations);
return 0;
}
--
2.43.0
We want to be able to filter out stacks based on
a threshold we can tune.
By writing a value to the 'page_owner_threshold' file, we can adjust
that threshold: stacks whose refcount is below it will be filtered out.
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/page_owner.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/mm/page_owner.c b/mm/page_owner.c
index aea212734557..d95a73cf2581 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -780,14 +780,18 @@ static void *stack_next(struct seq_file *m, void *v, loff_t *ppos)
return stack;
}
+static unsigned long page_owner_stack_threshold;
+
static int stack_print(struct seq_file *m, void *v)
{
char *buf;
int ret = 0;
struct stack_iterator *iter = m->private;
struct stack_record *stack = iter->last_stack;
+ int stack_count = refcount_read(&stack->count);
- if (!stack->size || stack->size < 0 || refcount_read(&stack->count) < 2)
+ if (!stack->size || stack->size < 0 || stack_count < 2 ||
+ stack_count < page_owner_stack_threshold)
return 0;
buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
@@ -832,6 +836,21 @@ const struct file_operations page_owner_stack_operations = {
.release = seq_release,
};
+static int page_owner_threshold_get(void *data, u64 *val)
+{
+ *val = page_owner_stack_threshold;
+ return 0;
+}
+
+static int page_owner_threshold_set(void *data, u64 val)
+{
+ page_owner_stack_threshold = val;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(proc_page_owner_threshold, &page_owner_threshold_get,
+ &page_owner_threshold_set, "%llu");
+
static int __init pageowner_init(void)
{
if (!static_branch_unlikely(&page_owner_inited)) {
@@ -843,6 +862,8 @@ static int __init pageowner_init(void)
&proc_page_owner_operations);
debugfs_create_file("page_owner_stacks", 0400, NULL, NULL,
&page_owner_stack_operations);
+ debugfs_create_file("page_owner_threshold", 0600, NULL, NULL,
+ &proc_page_owner_threshold);
return 0;
}
--
2.43.0
page_owner needs to increment a stack_record refcount when a new allocation
occurs, and decrement it on a free operation.
In order to do that, we need to have a way to get a stack_record from a
handle.
Implement stack_depot_get_stack(), which does just that, and make it public
so page_owner can use it.
Also implement {inc,dec}_stack_record_count(), which increment or
decrement the refcount on the respective allocation and free operations,
via __set_page_owner() (alloc operation) and __reset_page_owner() (free
operation).
Signed-off-by: Oscar Salvador <[email protected]>
---
include/linux/stackdepot.h | 8 ++++++++
lib/stackdepot.c | 8 ++++++++
mm/page_owner.c | 30 ++++++++++++++++++++++++++++++
3 files changed, 46 insertions(+)
diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
index d0dcf4aebfb4..ac62de4d4999 100644
--- a/include/linux/stackdepot.h
+++ b/include/linux/stackdepot.h
@@ -175,6 +175,14 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
depot_stack_handle_t stack_depot_save(unsigned long *entries,
unsigned int nr_entries, gfp_t gfp_flags);
+/**
+ * stack_depo_get_stack - Get a pointer to a stack struct
+ * @handle: Stack depot handle
+ *
+ * Return: Returns a pointer to a stack struct
+ */
+struct stack_record *stack_depot_get_stack(depot_stack_handle_t handle);
+
/**
* stack_depot_fetch - Fetch a stack trace from stack depot
*
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 16c8a1bf0008..197c355601f9 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -681,6 +681,14 @@ depot_stack_handle_t stack_depot_save(unsigned long *entries,
}
EXPORT_SYMBOL_GPL(stack_depot_save);
+struct stack_record *stack_depot_get_stack(depot_stack_handle_t handle)
+{
+ if (!handle)
+ return NULL;
+
+ return depot_fetch_stack(handle);
+}
+
unsigned int stack_depot_fetch(depot_stack_handle_t handle,
unsigned long **entries)
{
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 5634e5d890f8..0adf41702b9d 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -61,6 +61,22 @@ static __init bool need_page_owner(void)
return page_owner_enabled;
}
+static void inc_stack_record_count(depot_stack_handle_t handle)
+{
+ struct stack_record *stack = stack_depot_get_stack(handle);
+
+ if (stack)
+ refcount_inc(&stack->count);
+}
+
+static void dec_stack_record_count(depot_stack_handle_t handle)
+{
+ struct stack_record *stack = stack_depot_get_stack(handle);
+
+ if (stack)
+ refcount_dec(&stack->count);
+}
+
static __always_inline depot_stack_handle_t create_dummy_stack(void)
{
unsigned long entries[4];
@@ -140,6 +156,7 @@ void __reset_page_owner(struct page *page, unsigned short order)
int i;
struct page_ext *page_ext;
depot_stack_handle_t handle;
+ depot_stack_handle_t alloc_handle;
struct page_owner *page_owner;
u64 free_ts_nsec = local_clock();
@@ -147,6 +164,9 @@ void __reset_page_owner(struct page *page, unsigned short order)
if (unlikely(!page_ext))
return;
+ page_owner = get_page_owner(page_ext);
+ alloc_handle = page_owner->handle;
+
handle = save_stack(GFP_NOWAIT | __GFP_NOWARN);
for (i = 0; i < (1 << order); i++) {
__clear_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags);
@@ -158,6 +178,15 @@ void __reset_page_owner(struct page *page, unsigned short order)
page_ext = page_ext_next(page_ext);
}
page_ext_put(page_ext);
+ if (alloc_handle != early_handle)
+ /*
+ * early_handle is being set as a handle for all those
+ * early allocated pages. See init_pages_in_zone().
+ * Since their refcount is not being incremented because
+ * the machinery is not ready yet, we cannot decrement
+ * their refcount either.
+ */
+ dec_stack_record_count(alloc_handle);
}
static inline void __set_page_owner_handle(struct page_ext *page_ext,
@@ -199,6 +228,7 @@ noinline void __set_page_owner(struct page *page, unsigned short order,
return;
__set_page_owner_handle(page_ext, handle, order, gfp_mask);
page_ext_put(page_ext);
+ inc_stack_record_count(handle);
}
void __set_page_owner_migrate_reason(struct page *page, int reason)
--
2.43.0
On Fri, 9 Feb 2024 00:45:35 +0100 Oscar Salvador <[email protected]> wrote:
> page_owner is a great debug functionality tool that lets us know
> about all pages that have been allocated/freed and their specific
> stacktrace.
> This comes very handy when debugging memory leaks, since with
> some scripting we can see the outstanding allocations, which might point
> to a memory leak.
>
> In my experience, that is one of the most useful cases, but it can get
> really tedious to screen through all pages and try to reconstruct the
> stack <-> allocated/freed relationship, becoming most of the time a
> daunting and slow process when we have tons of allocation/free operations.
>
> This patchset aims to ease that by adding a new functionality into
> page_owner.
> This functionality creates a new read-only file called "page_owner_stacks",
The full path would be appreciated.
> which prints out all the stacks followed by their outstanding number
> of allocations (being that the times the stacktrace has allocated
> but not freed yet).
> This gives us a clear and a quick overview of stacks <-> allocated/free.
>
> We take advantage of the new refcount_f field that stack_record struct
> gained, and increment/decrement the stack refcount on every
> __set_page_owner() (alloc operation) and __reset_page_owner (free operation)
> call.
>
> Unfortunately, we cannot use the new stackdepot api
> STACK_DEPOT_FLAG_{GET,PUT} because it does not fulfill page_owner needs,
> meaning we would have to special case things, at which point
> makes more sense for page_owner to do its own {dec,inc}rementing
> of the stacks.
> E.g: Using STACK_DEPOT_FLAG_PUT, once the refcount reaches 0,
> such stack gets evicted, so page_owner would lose information.
>
> This patch also creates a new file called 'page_owner_threshold'.
> By writing a value to it, the stacks which refcount is below such
> value will be filtered out.
>
> In order to better exercise the path in stack_depot_get_next_stack(),
> I artificially filled the buckets with more than one stack, making sure
> I was getting all of then when reading from it.
>
> On a side note, stack_depot_get_next_stack() could be somehow reconstructed
> to be in page_owner code, but we would have to move stack_table
> into the header, so page_owner can access it.
> I can do that if that's preferred, so stackdepot.c would not get "poluted".
>
> A PoC can be found below:
>
> # cat /sys/kernel/debug/page_owner_stacks > page_owner_full_stacks.txt
Oh, there it is. I wonder why we didn't use /sys/kernel/mm/
Would a new /sys/kernel/debug/page_owner_something/ be a good idea? We
might add more things later. Then it can be
/sys/kernel/debug/page_owner_something/full_stacks.
/sys/kernel/debug/page_owner/ would be nice, but it's too late for
that.
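For illustration, such a layout could be created with something like the
below (hypothetical names, not part of this series):

	struct dentry *dir = debugfs_create_dir("page_owner_stacks", NULL);

	debugfs_create_file("show_stacks", 0400, dir, NULL,
			    &page_owner_stack_operations);
	debugfs_create_file("count_threshold", 0600, dir, NULL,
			    &proc_page_owner_threshold);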
> Oscar Salvador (4):
> lib/stackdepot: Move stack_record struct definition into the header
> mm,page_owner: Implement the tracking of the stacks count
> mm,page_owner: Display all stacks and their count
> mm,page_owner: Filter out stacks by a threshold
>
> include/linux/stackdepot.h | 72 ++++++++++++++++++++
> lib/stackdepot.c | 97 ++++++++++++++------------
> mm/page_owner.c | 136 +++++++++++++++++++++++++++++++++++++
> 3 files changed, 262 insertions(+), 43 deletions(-)
Documentation/mm/page_owner.rst?
On Fri, 9 Feb 2024 at 00:45, Oscar Salvador <[email protected]> wrote:
>
> page_owner needs to increment a stack_record refcount when a new allocation
> occurs, and decrement it on a free operation.
> In order to do that, we need to have a way to get a stack_record from a
> handle.
> Implement stack_depot_get_stack() which just does that, and make it public
> so page_owner can use it.
>
> Also implement {inc,dec}_stack_record_count() which increments
> or decrements on respective allocation and free operations, via
> __reset_page_owner() (free operation) and __set_page_owner() (alloc
> operation).
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> include/linux/stackdepot.h | 8 ++++++++
> lib/stackdepot.c | 8 ++++++++
> mm/page_owner.c | 30 ++++++++++++++++++++++++++++++
> 3 files changed, 46 insertions(+)
>
> diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
> index d0dcf4aebfb4..ac62de4d4999 100644
> --- a/include/linux/stackdepot.h
> +++ b/include/linux/stackdepot.h
> @@ -175,6 +175,14 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
> depot_stack_handle_t stack_depot_save(unsigned long *entries,
> unsigned int nr_entries, gfp_t gfp_flags);
>
> +/**
> + * stack_depo_get_stack - Get a pointer to a stack struct
Typo: "depo" -> depot
I would also write "stack_record struct", because "stack struct" does not exist.
> + * @handle: Stack depot handle
> + *
> + * Return: Returns a pointer to a stack struct
> + */
> +struct stack_record *stack_depot_get_stack(depot_stack_handle_t handle);
I don't know what other usecases there are for this, but I'd want to
make sure we give users a big hint to avoid unnecessary uses of
this function.
Perhaps we also want to mark it as somewhat internal, e.g. by
prefixing it with __. So I'd call it __stack_depot_get_stack_record().
> /**
> * stack_depot_fetch - Fetch a stack trace from stack depot
> *
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index 16c8a1bf0008..197c355601f9 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -681,6 +681,14 @@ depot_stack_handle_t stack_depot_save(unsigned long *entries,
> }
> EXPORT_SYMBOL_GPL(stack_depot_save);
>
> +struct stack_record *stack_depot_get_stack(depot_stack_handle_t handle)
> +{
> + if (!handle)
> + return NULL;
> +
> + return depot_fetch_stack(handle);
> +}
> +
> unsigned int stack_depot_fetch(depot_stack_handle_t handle,
> unsigned long **entries)
> {
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 5634e5d890f8..0adf41702b9d 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -61,6 +61,22 @@ static __init bool need_page_owner(void)
> return page_owner_enabled;
> }
>
> +static void inc_stack_record_count(depot_stack_handle_t handle)
> +{
> + struct stack_record *stack = stack_depot_get_stack(handle);
> +
> + if (stack)
> + refcount_inc(&stack->count);
> +}
In the latest stackdepot version in -next, the count is initialized to
REFCOUNT_SATURATED to warn if a non-refcounted entry is suddenly used
as a refcounted one. In your case this is intentional and there is no
risk that the entry will be evicted, so that's ok. But you need to set
the refcount to 1 somewhere here on the initial stack_depot_save().
On Fri, 9 Feb 2024 at 00:45, Oscar Salvador <[email protected]> wrote:
>
> In order to move the heavy lifting into page_owner code, this one
> needs to have access to the stack_record structure, which right now
> sits in lib/stackdepot.c.
> Move it to the stackdepot.h header so page_owner can access
> stack_record's struct fields.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> include/linux/stackdepot.h | 44 ++++++++++++++++++++++++++++++++++++++
> lib/stackdepot.c | 43 -------------------------------------
> 2 files changed, 44 insertions(+), 43 deletions(-)
>
> diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
> index adcbb8f23600..d0dcf4aebfb4 100644
> --- a/include/linux/stackdepot.h
> +++ b/include/linux/stackdepot.h
> @@ -30,6 +30,50 @@ typedef u32 depot_stack_handle_t;
> */
> #define STACK_DEPOT_EXTRA_BITS 5
>
> +#define DEPOT_HANDLE_BITS (sizeof(depot_stack_handle_t) * 8)
> +
> +#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages */
> +#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT + DEPOT_POOL_ORDER))
> +#define DEPOT_STACK_ALIGN 4
> +#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT - DEPOT_STACK_ALIGN)
> +#define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_OFFSET_BITS - \
> + STACK_DEPOT_EXTRA_BITS)
> +
> +/* Compact structure that stores a reference to a stack. */
> +union handle_parts {
> + depot_stack_handle_t handle;
> + struct {
> + u32 pool_index : DEPOT_POOL_INDEX_BITS;
> + u32 offset : DEPOT_OFFSET_BITS;
> + u32 extra : STACK_DEPOT_EXTRA_BITS;
> + };
> +};
> +
> +struct stack_record {
> + struct list_head hash_list; /* Links in the hash table */
> + u32 hash; /* Hash in hash table */
> + u32 size; /* Number of stored frames */
> + union handle_parts handle; /* Constant after initialization */
> + refcount_t count;
> + union {
> + unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
> + struct {
> + /*
> + * An important invariant of the implementation is to
> + * only place a stack record onto the freelist iff its
> + * refcount is zero. Because stack records with a zero
> + * refcount are never considered as valid, it is safe to
> + * union @entries and freelist management state below.
> + * Conversely, as soon as an entry is off the freelist
> + * and its refcount becomes non-zero, the below must not
> + * be accessed until being placed back on the freelist.
> + */
> + struct list_head free_list; /* Links in the freelist */
> + unsigned long rcu_state; /* RCU cookie */
> + };
> + };
> +};
> +
> typedef u32 depot_flags_t;
>
> /*
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index 5caa1f566553..16c8a1bf0008 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -35,14 +35,6 @@
> #include <linux/memblock.h>
> #include <linux/kasan-enabled.h>
>
> -#define DEPOT_HANDLE_BITS (sizeof(depot_stack_handle_t) * 8)
> -
> -#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages */
> -#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT + DEPOT_POOL_ORDER))
> -#define DEPOT_STACK_ALIGN 4
> -#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT - DEPOT_STACK_ALIGN)
> -#define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_OFFSET_BITS - \
> - STACK_DEPOT_EXTRA_BITS)
> #if IS_ENABLED(CONFIG_KMSAN) && CONFIG_STACKDEPOT_MAX_FRAMES >= 32
> /*
> * KMSAN is frequently used in fuzzing scenarios and thus saves a lot of stack
^^ This hunk no longer exists, try to rebase against the version in -next.
Other than that, this looks fine.
On Fri, 9 Feb 2024 at 00:45, Oscar Salvador <[email protected]> wrote:
>
> This patch adds a new file called 'page_owner_stacks', which
> will show all stacks that were added by page_owner followed by
> their counting, giving us a clear overview of stack <-> count
> relationship.
>
> E.g:
>
> prep_new_page+0xa9/0x120
> get_page_from_freelist+0x801/0x2210
> __alloc_pages+0x18b/0x350
> alloc_pages_mpol+0x91/0x1f0
> folio_alloc+0x14/0x50
> filemap_alloc_folio+0xb2/0x100
> __filemap_get_folio+0x14a/0x490
> ext4_write_begin+0xbd/0x4b0 [ext4]
> generic_perform_write+0xc1/0x1e0
> ext4_buffered_write_iter+0x68/0xe0 [ext4]
> ext4_file_write_iter+0x70/0x740 [ext4]
> vfs_write+0x33d/0x420
> ksys_write+0xa5/0xe0
> do_syscall_64+0x80/0x160
> entry_SYSCALL_64_after_hwframe+0x6e/0x76
> stack_count: 4578
>
> In order to show all the stacks, we implement stack_depot_get_next_stack(),
> which walks all buckets while retrieving the stacks stored in them.
> stack_depot_get_next_stack() will return all stacks, one at a time,
> by first finding a non-empty bucket, and then retrieving all the stacks
> stored in that bucket.
> Once we have completely gone through it, we get the next non-empty bucket
> and repeat the same steps, and so on until we have completely checked all
> buckets.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> include/linux/stackdepot.h | 20 +++++++++
> lib/stackdepot.c | 46 +++++++++++++++++++++
> mm/page_owner.c | 85 ++++++++++++++++++++++++++++++++++++++
> 3 files changed, 151 insertions(+)
>
> diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
> index ac62de4d4999..d851ec821e6f 100644
> --- a/include/linux/stackdepot.h
> +++ b/include/linux/stackdepot.h
> @@ -183,6 +183,26 @@ depot_stack_handle_t stack_depot_save(unsigned long *entries,
> */
> struct stack_record *stack_depot_get_stack(depot_stack_handle_t handle);
>
> +/**
> + * stack_depot_get_next_stack - Returns all stacks, one at a time
"Returns all stack_records" to be clear that this is returning the struct.
> + *
> + * @table: Current table we are checking
> + * @bucket: Current bucket we are checking
> + * @last_found: Last stack that was found
> + *
> + * This function finds first a non-empty bucket and returns the first stack
> + * stored in it. On consequent calls, it walks the bucket to see whether
> + * it contains more stacks.
> + * Once we have walked all the stacks in a bucket, we check
> + * the next one, and we repeat the same steps until we have checked all of them
I think for this function it's important to say that no entry returned
from this function can be evicted.
I.e. the easiest way to ensure this is that the caller makes sure the
entries returned are never passed to stack_depot_put() - which is
certainly the case for your usecase because you do not use
stack_depot_put().
> + * Return: A pointer a to stack_record struct, or NULL when we have walked all
> + * buckets.
> + */
> +struct stack_record *stack_depot_get_next_stack(unsigned long *table,
To keep things consistent, I'd also call this
__stack_depot_get_next_stack_record(), so that we're clear this is
more of an internal function, not for general usage.
> + struct list_head **bucket,
> + struct stack_record **last_found);
> +
> /**
> * stack_depot_fetch - Fetch a stack trace from stack depot
> *
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index 197c355601f9..107bd0174cd6 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -782,6 +782,52 @@ unsigned int stack_depot_get_extra_bits(depot_stack_handle_t handle)
> }
> EXPORT_SYMBOL(stack_depot_get_extra_bits);
>
> +struct stack_record *stack_depot_get_next_stack(unsigned long *table,
> + struct list_head **curr_bucket,
> + struct stack_record **last_found)
> +{
> + struct list_head *bucket = *curr_bucket;
> + unsigned long nr_table = *table;
> + struct stack_record *found = NULL;
> + unsigned long stack_table_entries = stack_hash_mask + 1;
> +
> + rcu_read_lock_sched_notrace();
We are returning pointers to stack_records out of the RCU-read
critical section, which are then later used to continue the iteration.
list_for_each_entry_continue_rcu() says this is fine if "... you held
some sort of non-RCU reference (such as a reference count) ...".
Updating the function's documentation to say none of these entries can
be evicted via a stack_depot_put() is required.
> + if (!bucket) {
> + /*
> + * Find a non-empty bucket. Once we have found it,
> + * we will use list_for_each_entry_continue_rcu() on the next
> + * call to keep walking the bucket.
> + */
> +new_table:
> + bucket = &stack_table[nr_table];
> + list_for_each_entry_rcu(found, bucket, hash_list) {
> + goto out;
> + }
> + } else {
> + /* Check whether we have more stacks in this bucket */
> + found = *last_found;
> + list_for_each_entry_continue_rcu(found, bucket, hash_list) {
> + goto out;
> + }
> + }
> +
> + /* No more stacks in this bucket, check the next one */
> + nr_table++;
> + if (nr_table < stack_table_entries)
> + goto new_table;
> +
> + /* We are done walking all buckets */
> + found = NULL;
> +
> +out:
> + *table = nr_table;
> + *curr_bucket = bucket;
> + *last_found = found;
> + rcu_read_unlock_sched_notrace();
> +
> + return found;
> +}
> +
> static int stats_show(struct seq_file *seq, void *v)
> {
> /*
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 0adf41702b9d..aea212734557 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -749,6 +749,89 @@ static const struct file_operations proc_page_owner_operations = {
> .llseek = lseek_page_owner,
> };
>
> +struct stack_iterator {
> + unsigned long nr_table;
> + struct list_head *bucket;
> + struct stack_record *last_stack;
> +};
> +
> +static void *stack_start(struct seq_file *m, loff_t *ppos)
> +{
> + struct stack_iterator *iter = m->private;
> +
> + if (*ppos == -1UL)
> + return NULL;
> +
> + return stack_depot_get_next_stack(&iter->nr_table,
> + &iter->bucket,
> + &iter->last_stack);
> +}
> +
> +static void *stack_next(struct seq_file *m, void *v, loff_t *ppos)
> +{
> + struct stack_iterator *iter = m->private;
> + struct stack_record *stack;
> +
> + stack = stack_depot_get_next_stack(&iter->nr_table,
> + &iter->bucket,
> + &iter->last_stack);
> + *ppos = stack ? *ppos + 1 : -1UL;
> +
> + return stack;
> +}
> +
> +static int stack_print(struct seq_file *m, void *v)
> +{
> + char *buf;
> + int ret = 0;
> + struct stack_iterator *iter = m->private;
> + struct stack_record *stack = iter->last_stack;
> +
> + if (!stack->size || stack->size < 0 || refcount_read(&stack->count) < 2)
> + return 0;
> +
> + buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +
> + ret += stack_trace_snprint(buf, PAGE_SIZE, stack->entries, stack->size,
> + 0);
> + if (!ret)
> + goto out;
> +
> + scnprintf(buf + ret, PAGE_SIZE - ret, "stack_count: %d\n\n",
> + refcount_read(&stack->count));
> +
> + seq_printf(m, buf);
> + seq_puts(m, "\n\n");
> +out:
> + kfree(buf);
> +
> + return 0;
> +}
> +
> +static void stack_stop(struct seq_file *m, void *v)
> +{
> +}
> +
> +static const struct seq_operations page_owner_stack_op = {
> + .start = stack_start,
> + .next = stack_next,
> + .stop = stack_stop,
> + .show = stack_print
> +};
> +
> +static int page_owner_stack_open(struct inode *inode, struct file *file)
> +{
> + return seq_open_private(file, &page_owner_stack_op,
> + sizeof(struct stack_iterator));
> +}
> +
> +const struct file_operations page_owner_stack_operations = {
> + .open = page_owner_stack_open,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = seq_release,
> +};
> +
> static int __init pageowner_init(void)
> {
> if (!static_branch_unlikely(&page_owner_inited)) {
> @@ -758,6 +841,8 @@ static int __init pageowner_init(void)
>
> debugfs_create_file("page_owner", 0400, NULL, NULL,
> &proc_page_owner_operations);
> + debugfs_create_file("page_owner_stacks", 0400, NULL, NULL,
> + &page_owner_stack_operations);
>
> return 0;
> }
> --
> 2.43.0
>
On Fri, 9 Feb 2024 at 00:45, Oscar Salvador <[email protected]> wrote:
>
> Changes v6 -> v7:
> - Rebased on top of Andrey Konovalov's libstackdepot patchset
> - Reformulated the changelogs
Overall looks fine, but I think it's not rebased against the latest
stackdepot version in -next. That version re-introduces variable-sized
records and there's some trickery required so you do not end up with
refcount warnings.
Hi Oscar,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-nonmm-unstable]
[also build test ERROR on linus/master v6.8-rc3]
[cannot apply to akpm-mm/mm-everything next-20240209]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Oscar-Salvador/lib-stackdepot-Move-stack_record-struct-definition-into-the-header/20240209-074611
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-nonmm-unstable
patch link: https://lore.kernel.org/r/20240208234539.19113-2-osalvador%40suse.de
patch subject: [PATCH v7 1/4] lib/stackdepot: Move stack_record struct definition into the header
config: openrisc-defconfig (https://download.01.org/0day-ci/archive/20240210/[email protected]/config)
compiler: or1k-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240210/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
In file included from include/linux/page_ext.h:7,
from include/linux/mm.h:22,
from include/linux/pid_namespace.h:7,
from include/linux/ptrace.h:10,
from arch/openrisc/kernel/asm-offsets.c:28:
>> include/linux/stackdepot.h:59:39: error: 'CONFIG_STACKDEPOT_MAX_FRAMES' undeclared here (not in a function)
59 | unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
make[3]: *** [scripts/Makefile.build:116: arch/openrisc/kernel/asm-offsets.s] Error 1
make[3]: Target 'prepare' not remade because of errors.
make[2]: *** [Makefile:1191: prepare0] Error 2
make[2]: Target 'prepare' not remade because of errors.
make[1]: *** [Makefile:240: __sub-make] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:240: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.
vim +/CONFIG_STACKDEPOT_MAX_FRAMES +59 include/linux/stackdepot.h
51
52 struct stack_record {
53 struct list_head hash_list; /* Links in the hash table */
54 u32 hash; /* Hash in hash table */
55 u32 size; /* Number of stored frames */
56 union handle_parts handle; /* Constant after initialization */
57 refcount_t count;
58 union {
> 59 unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
60 struct {
61 /*
62 * An important invariant of the implementation is to
63 * only place a stack record onto the freelist iff its
64 * refcount is zero. Because stack records with a zero
65 * refcount are never considered as valid, it is safe to
66 * union @entries and freelist management state below.
67 * Conversely, as soon as an entry is off the freelist
68 * and its refcount becomes non-zero, the below must not
69 * be accessed until being placed back on the freelist.
70 */
71 struct list_head free_list; /* Links in the freelist */
72 unsigned long rcu_state; /* RCU cookie */
73 };
74 };
75 };
76
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Thu, Feb 08, 2024 at 04:28:18PM -0800, Andrew Morton wrote:
> Oh, there it is. I wonder why we didn't use /sys/kernel/mm/
I just followed the convention we use at the moment, which happens
to be /sys/kernel/debug/page_owner_xxx.
> Would a new /sys/kernel/debug/page_owner_something/ be a good idea? We
> might add more things later. Then it can be
> /sys/kernel/debug/page_owner_something/full_stacks.
> /sys/kernel/debug/page_owner/ would be nice, but it's too late for
> that.
Well, we could certainly do that, so we do not scatter the files
in /sys/kernel/debug/ but rather gather them in a page_owner directory.
Let me see if I can come up with a proper name.
> > Oscar Salvador (4):
> > lib/stackdepot: Move stack_record struct definition into the header
> > mm,page_owner: Implement the tracking of the stacks count
> > mm,page_owner: Display all stacks and their count
> > mm,page_owner: Filter out stacks by a threshold
> >
> > include/linux/stackdepot.h | 72 ++++++++++++++++++++
> > lib/stackdepot.c | 97 ++++++++++++++------------
> > mm/page_owner.c | 136 +++++++++++++++++++++++++++++++++++++
> > 3 files changed, 262 insertions(+), 43 deletions(-)
>
> Documentation/mm/page_owner.rst?
Heh, we are definitely going to need some documentation.
Let me prepare something.
Thanks Andrew
--
Oscar Salvador
SUSE Labs
On Fri, Feb 09, 2024 at 09:03:21AM +0100, Marco Elver wrote:
> On Fri, 9 Feb 2024 at 00:45, Oscar Salvador <[email protected]> wrote:
> >
> > Changes v6 -> v7:
> > - Rebased on top of Andrey Konovalov's libstackdepot patchset
> > - Reformulated the changelogs
>
> Overall looks fine, but I think it's not rebased against the latest
> stackdepot version in -next. That version re-introduces variable-sized
> records and there's some trickery required so you do not end up with
> refcount warnings.
Thanks a lot Marco for having a look and for the feedback, that's highly
appreciated.
--
Oscar Salvador
SUSE Labs
On Fri, Feb 09, 2024 at 08:45:21AM +0100, Marco Elver wrote:
> > -#define DEPOT_HANDLE_BITS (sizeof(depot_stack_handle_t) * 8)
> > -
> > -#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages */
> > -#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT + DEPOT_POOL_ORDER))
> > -#define DEPOT_STACK_ALIGN 4
> > -#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT - DEPOT_STACK_ALIGN)
> > -#define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_OFFSET_BITS - \
> > - STACK_DEPOT_EXTRA_BITS)
> > #if IS_ENABLED(CONFIG_KMSAN) && CONFIG_STACKDEPOT_MAX_FRAMES >= 32
> > /*
> > * KMSAN is frequently used in fuzzing scenarios and thus saves a lot of stack
>
> ^^ This hunk no longer exists, try to rebase against the version in -next.
>
> Other than that, this looks fine.
Yeah, I noticed it later.
I have already synced with the latest -next and rebased on top of it.
Thanks
--
Oscar Salvador
SUSE Labs
On Fri, Feb 09, 2024 at 08:45:00AM +0100, Marco Elver wrote:
> > +/**
> > + * stack_depo_get_stack - Get a pointer to a stack struct
>
> Typo: "depo" -> depot
>
> I would also write "stack_record struct", because "stack struct" does not exist.
Fixed.
> > + * @handle: Stack depot handle
> > + *
> > + * Return: Returns a pointer to a stack struct
> > + */
> > +struct stack_record *stack_depot_get_stack(depot_stack_handle_t handle);
>
> I don't know what other usecases there are for this, but I'd want to
> make make sure we give users a big hint to avoid unnecessary uses of
> this function.
>
> Perhaps we also want to mark it as somewhat internal, e.g. by
> prefixing it with __. So I'd call it __stack_depot_get_stack_record().
Yes, I went with __stack_depot_get_stack_record(), and I updated its doc
in stackdepot.h, mentioning that it is only for internal purposes.
> > +static void inc_stack_record_count(depot_stack_handle_t handle)
> > +{
> > + struct stack_record *stack = stack_depot_get_stack(handle);
> > +
> > + if (stack)
> > + refcount_inc(&stack->count);
> > +}
>
> In the latest stackdepot version in -next, the count is initialized to
> REFCOUNT_SATURATED to warn if a non-refcounted entry is suddenly used
> as a refcounted one. In your case this is intentional and there is no
> risk that the entry will be evicted, so that's ok. But you need to set
> the refcount to 1 somewhere here on the initial stack_depot_save().
Well, I went with something like:
static void inc_stack_record_count(depot_stack_handle_t handle)
{
	struct stack_record *stack = __stack_depot_get_stack_record(handle);

	if (stack) {
		/*
		 * New stack_records that do not use STACK_DEPOT_FLAG_GET start
		 * with REFCOUNT_SATURATED to catch spurious increments of their
		 * refcount.
		 * Since we do not use STACK_DEPOT_FLAG_{GET,PUT} API, let us
		 * set a refcount of 1 ourselves.
		 */
		if (refcount_read(&stack->count) == REFCOUNT_SATURATED)
			refcount_set(&stack->count, 1);
		refcount_inc(&stack->count);
	}
}
--
Oscar Salvador
SUSE Labs
On Fri, 9 Feb 2024 at 22:42, Marco Elver <[email protected]> wrote:
>
> On Fri, 9 Feb 2024 at 22:37, Oscar Salvador <[email protected]> wrote:
> >
> > On Fri, Feb 09, 2024 at 08:45:00AM +0100, Marco Elver wrote:
> > > > +/**
> > > > + * stack_depo_get_stack - Get a pointer to a stack struct
> > >
> > > Typo: "depo" -> depot
> > >
> > > I would also write "stack_record struct", because "stack struct" does not exist.
> >
> > Fixed.
> >
> > > > + * @handle: Stack depot handle
> > > > + *
> > > > + * Return: Returns a pointer to a stack struct
> > > > + */
> > > > +struct stack_record *stack_depot_get_stack(depot_stack_handle_t handle);
> > >
> > > I don't know what other usecases there are for this, but I'd want to
> > > make make sure we give users a big hint to avoid unnecessary uses of
> > > this function.
> > >
> > > Perhaps we also want to mark it as somewhat internal, e.g. by
> > > prefixing it with __. So I'd call it __stack_depot_get_stack_record().
> >
> > Yes, I went with __stack_depot_get_stack_record(), and I updated its doc
> > in stackdepot.h, mentioning that is only for internal purposes.
> >
> > > > +static void inc_stack_record_count(depot_stack_handle_t handle)
> > > > +{
> > > > + struct stack_record *stack = stack_depot_get_stack(handle);
> > > > +
> > > > + if (stack)
> > > > + refcount_inc(&stack->count);
> > > > +}
> > >
> > > In the latest stackdepot version in -next, the count is initialized to
> > > REFCOUNT_SATURATED to warn if a non-refcounted entry is suddenly used
> > > as a refcounted one. In your case this is intentional and there is no
> > > risk that the entry will be evicted, so that's ok. But you need to set
> > > the refcount to 1 somewhere here on the initial stack_depot_save().
> >
> > Well, I went with something like:
> >
> > static void inc_stack_record_count(depot_stack_handle_t handle)
> > {
> > 	struct stack_record *stack = __stack_depot_get_stack_record(handle);
> >
> > 	if (stack) {
> > 		/*
> > 		 * New stack_records that do not use STACK_DEPOT_FLAG_GET start
> > 		 * with REFCOUNT_SATURATED to catch spurious increments of their
> > 		 * refcount.
> > 		 * Since we do not use STACK_DEPOT_FLAG_{GET,PUT} API, let us
>
> There is no FLAG_PUT, only stack_depot_put(). Saying you do not use
> the refcount to free any entries should hopefully make it clear that
> even if the refcount saturates and you wrap around to 1, nothing
> catastrophic will happen.
>
> > 		 * set a refcount of 1 ourselves.
> > 		 */
> > 		if (refcount_read(&stack->count) == REFCOUNT_SATURATED)
> > 			refcount_set(&stack->count, 1);
Do you need to inc the first allocation? Should there be an "else"
here instead of always doing refcount_inc()?
> > 		refcount_inc(&stack->count);
> > 	}
> > }
>
> That looks reasonable.
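(For reference, the minimal change implied by the question above might be
an else branch, e.g.:

	if (refcount_read(&stack->count) == REFCOUNT_SATURATED)
		refcount_set(&stack->count, 1);
	else
		refcount_inc(&stack->count);

so that the first allocation is counted once rather than twice.)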
On Fri, Feb 09, 2024 at 09:00:40AM +0100, Marco Elver wrote:
> > +/**
> > + * stack_depot_get_next_stack - Returns all stacks, one at a time
>
> "Returns all stack_records" to be clear that this is returning the struct.
Fixed.
>
> > + *
> > + * @table: Current table we are checking
> > + * @bucket: Current bucket we are checking
> > + * @last_found: Last stack that was found
> > + *
> > + * This function finds first a non-empty bucket and returns the first stack
> > + * stored in it. On consequent calls, it walks the bucket to see whether
> > + * it contains more stacks.
> > + * Once we have walked all the stacks in a bucket, we check
> > + * the next one, and we repeat the same steps until we have checked all of them
>
> I think for this function it's important to say that no entry returned
> from this function can be evicted.
>
> I.e. the easiest way to ensure this is that the caller makes sure the
> entries returned are never passed to stack_depot_put() - which is
> certainly the case for your usecase because you do not use
> stack_depot_put().
>
> > + * Return: A pointer a to stack_record struct, or NULL when we have walked all
> > + * buckets.
> > + */
> > +struct stack_record *stack_depot_get_next_stack(unsigned long *table,
>
> To keep consistent, I'd also call this
> __stack_depot_get_next_stack_record(), so that we're clear this is
> more of an internal function not for general usage.
>
> > + struct list_head **bucket,
> > + struct stack_record **last_found);
> > +
> > /**
> > * stack_depot_fetch - Fetch a stack trace from stack depot
> > *
> > diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> > index 197c355601f9..107bd0174cd6 100644
> > --- a/lib/stackdepot.c
> > +++ b/lib/stackdepot.c
> > @@ -782,6 +782,52 @@ unsigned int stack_depot_get_extra_bits(depot_stack_handle_t handle)
> > }
> > EXPORT_SYMBOL(stack_depot_get_extra_bits);
> >
> > +struct stack_record *stack_depot_get_next_stack(unsigned long *table,
> > + struct list_head **curr_bucket,
> > + struct stack_record **last_found)
> > +{
> > + struct list_head *bucket = *curr_bucket;
> > + unsigned long nr_table = *table;
> > + struct stack_record *found = NULL;
> > + unsigned long stack_table_entries = stack_hash_mask + 1;
> > +
> > + rcu_read_lock_sched_notrace();
>
> We are returning pointers to stack_records out of the RCU-read
> critical section, which are then later used to continue the iteration.
> list_for_each_entry_continue_rcu() says this is fine if "... you held
> some sort of non-RCU reference (such as a reference count) ...".
> Updating the function's documentation to say none of these entries can
> be evicted via a stack_depot_put() is required.
Thinking about it some more, I think I made a mistake:
I am walking all buckets, and within those buckets there are not only
page_owner stack_records, which means that I could return a stack_record
from e.g. KASAN (which I think can evict stack_records) and then
everything goes off the rails.
Which means I cannot walk the buckets like that.
Actually, I think that having something like the following
struct list_stack_records {
struct stack_record *stack;
struct list_stack_records *next;
};
in page_owner would make sense.
Then the only thing I would have to do is to add a new record for every
new stack_record, and then I could just walk it as a plain linked
list.
Which means that the function stack_depot_get_next_stack() could be
killed because everything would happen in page_owner code.
e.g:
static void inc_stack_record_count(depot_stack_handle_t handle)
{
struct stack_record *stack = __stack_depot_get_stack_record(handle);
if (stack) {
/*
* New stack_records that do not use STACK_DEPOT_FLAG_GET start
* with REFCOUNT_SATURATED to catch spurious increments of their
* refcount.
* Since we do not use STACK_DEPOT_FLAG_{GET,PUT} API, let us
* set a refcount of 1 ourselves.
*/
if (refcount_read(&stack->count) == REFCOUNT_SATURATED) {
refcount_set(&stack->count, 1);
add_new_stack_record_into_the_list(stack);
}
refcount_inc(&stack->count);
}
}
and then just walk the list_stack_records list whenever we want to
show the stacktraces and their counts.
I think that overall this approach is cleaner and safer.
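For reference, a rough sketch of how that helper could look (names,
gfp flags and locking are only illustrative; allocating here is
delicate because we are already inside the page allocator):

static struct list_stack_records *stack_list;
static DEFINE_SPINLOCK(stack_list_lock);

static void add_new_stack_record_into_the_list(struct stack_record *stack)
{
	struct list_stack_records *new;
	unsigned long flags;

	/* Called once per stack_record, on its first allocation. */
	new = kmalloc(sizeof(*new), GFP_ATOMIC);
	if (!new)
		return;

	new->stack = stack;
	spin_lock_irqsave(&stack_list_lock, flags);
	new->next = stack_list;
	stack_list = new;
	spin_unlock_irqrestore(&stack_list_lock, flags);
}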
--
Oscar Salvador
SUSE Labs
On Fri, 9 Feb 2024 at 22:37, Oscar Salvador <[email protected]> wrote:
>
> On Fri, Feb 09, 2024 at 08:45:00AM +0100, Marco Elver wrote:
> > > +/**
> > > + * stack_depo_get_stack - Get a pointer to a stack struct
> >
> > Typo: "depo" -> depot
> >
> > I would also write "stack_record struct", because "stack struct" does not exist.
>
> Fixed.
>
> > > + * @handle: Stack depot handle
> > > + *
> > > + * Return: Returns a pointer to a stack struct
> > > + */
> > > +struct stack_record *stack_depot_get_stack(depot_stack_handle_t handle);
> >
> > I don't know what other usecases there are for this, but I'd want to
> > make sure we give users a big hint to avoid unnecessary uses of
> > this function.
> >
> > Perhaps we also want to mark it as somewhat internal, e.g. by
> > prefixing it with __. So I'd call it __stack_depot_get_stack_record().
>
> Yes, I went with __stack_depot_get_stack_record(), and I updated its doc
> in stackdepot.h, mentioning that it is only for internal purposes.
>
> > > +static void inc_stack_record_count(depot_stack_handle_t handle)
> > > +{
> > > + struct stack_record *stack = stack_depot_get_stack(handle);
> > > +
> > > + if (stack)
> > > + refcount_inc(&stack->count);
> > > +}
> >
> > In the latest stackdepot version in -next, the count is initialized to
> > REFCOUNT_SATURATED to warn if a non-refcounted entry is suddenly used
> > as a refcounted one. In your case this is intentional and there is no
> > risk that the entry will be evicted, so that's ok. But you need to set
> > the refcount to 1 somewhere here on the initial stack_depot_save().
>
> Well, I went with something like:
>
> static void inc_stack_record_count(depot_stack_handle_t handle)
> {
> struct stack_record *stack = __stack_depot_get_stack_record(handle);
>
> if (stack) {
> /*
> * New stack_records that do not use STACK_DEPOT_FLAG_GET start
> * with REFCOUNT_SATURATED to catch spurious increments of their
> * refcount.
> * Since we do not use STACK_DEPOT_FLAG_{GET,PUT} API, let us
There is no FLAG_PUT, only stack_depot_put(). Saying you do not use
the refcount to free any entries should hopefully make it clear that
even if the refcount saturates and you wrap around to 1, nothing
catastrophic will happen.
> * set a refcount of 1 ourselves.
> */
> if (refcount_read(&stack->count) == REFCOUNT_SATURATED)
> refcount_set(&stack->count, 1);
> refcount_inc(&stack->count);
> }
> }
That looks reasonable.
Hi Oscar,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-nonmm-unstable]
[also build test WARNING on linus/master v6.8-rc3]
[cannot apply to akpm-mm/mm-everything next-20240209]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Oscar-Salvador/lib-stackdepot-Move-stack_record-struct-definition-into-the-header/20240209-074611
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-nonmm-unstable
patch link: https://lore.kernel.org/r/20240208234539.19113-4-osalvador%40suse.de
patch subject: [PATCH v7 3/4] mm,page_owner: Display all stacks and their count
config: x86_64-randconfig-121-20240209 (https://download.01.org/0day-ci/archive/20240210/[email protected]/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240210/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
sparse warnings: (new ones prefixed by >>)
>> mm/page_owner.c:828:30: sparse: sparse: symbol 'page_owner_stack_operations' was not declared. Should it be static?
mm/page_owner.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/xarray.h, ...):
include/linux/page-flags.h:242:46: sparse: sparse: self-comparison always evaluates to false
vim +/page_owner_stack_operations +828 mm/page_owner.c
827
> 828 const struct file_operations page_owner_stack_operations = {
829 .open = page_owner_stack_open,
830 .read = seq_read,
831 .llseek = seq_lseek,
832 .release = seq_release,
833 };
834
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Fri, Feb 09, 2024 at 10:52:48PM +0100, Oscar Salvador wrote:
> Thinking about it some more, I think I made a mistake:
>
> I am walking all buckets, and within those buckets there are not only
> page_owner stack_records, which means that I could return a stack_record
> from e.g. KASAN (which I think can evict stack_records) and then
> everything goes off the rails.
> Which means I cannot walk the buckets like that.
>
> Actually, I think that having something like the following
>
> struct list_stack_records {
> struct stack_record *stack;
> struct list_stack_records *next;
> };
Or, I could use the extra_bits field from handle_parts to flag that
a depot_stack_handle_t is used by page_owner.
Then __stack_depot_get_next_stack_record() would check whether
stack_record->handle.extra_bits has the page_owner bit set, and only
return those stacks that have it.
This would solve the problem of returning a potentially evictable stack,
by only returning page_owner's stack_records, and I would not have
to maintain my own list.
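Roughly along these lines, I guess (hand-wavy sketch; the flag name and
bit value are made up):

#define PAGE_OWNER_STACK_BIT	0x1	/* made-up extra-bits flag */

/* On the allocation side, tag the handle as belonging to page_owner: */
handle = stack_depot_save(entries, nr_entries, gfp_mask);
handle = stack_depot_set_extra_bits(handle, PAGE_OWNER_STACK_BIT);

/* And in the iterator, skip stack_records from other users: */
if (!(stack_depot_get_extra_bits(stack->handle.handle) &
      PAGE_OWNER_STACK_BIT))
	continue;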
I have yet to see how that would look in detail, but it sounds promising.
Do you think that is feasible, Marco?
Thanks
--
Oscar Salvador
SUSE Labs
On Sat, 10 Feb 2024 at 00:13, Oscar Salvador <[email protected]> wrote:
>
> On Fri, Feb 09, 2024 at 10:52:48PM +0100, Oscar Salvador wrote:
> > Thinking about it some more, I think I made a mistake:
> >
> > I am walking all buckets, and within those buckets there are not only
> > page_owner stack_records, which means that I could return a stack_record
> > from e.g. KASAN (which I think can evict stack_records) and then
> > everything goes off the rails.
> > Which means I cannot walk the buckets like that.
> >
> > Actually, I think that having something like the following
> >
> > struct list_stack_records {
> > struct stack_record *stack;
> > struct list_stack_records *next;
> > };
>
> Or, I could use the extra_bits field from handle_parts to flag that
> a depot_stack_handle_t is used by page_owner.
>
> Then __stack_depot_get_next_stack_record() would check whether
> stack_record->handle.extra_bits has the page_owner bit set, and only
> return those stacks that have it.
> This would solve the problem of returning a potentially evictable stack,
> by only returning page_owner's stack_records, and I would not have
> to maintain my own list.
>
> I have yet to see how that would look in detail, but it sounds promising.
> Do you think that is feasible, Marco?
The extra bits are used by KMSAN, and might conflict if enabled at the
same time. I think the safest option is to keep your own list. I think
that will also be more performant if there are other stackdepot users
because you do not have to traverse any of the other entries.
Hi Oscar,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-nonmm-unstable]
[also build test ERROR on linus/master v6.8-rc3]
[cannot apply to akpm-mm/mm-everything next-20240209]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Oscar-Salvador/lib-stackdepot-Move-stack_record-struct-definition-into-the-header/20240209-074611
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-nonmm-unstable
patch link: https://lore.kernel.org/r/20240208234539.19113-2-osalvador%40suse.de
patch subject: [PATCH v7 1/4] lib/stackdepot: Move stack_record struct definition into the header
config: mips-xway_defconfig (https://download.01.org/0day-ci/archive/20240210/[email protected]/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project ac0577177f053ba7e7016e1b7e44cf5932d00b03)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240210/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
In file included from arch/mips/kernel/asm-offsets.c:15:
In file included from include/linux/mm.h:22:
In file included from include/linux/page_ext.h:7:
>> include/linux/stackdepot.h:59:25: error: use of undeclared identifier 'CONFIG_STACKDEPOT_MAX_FRAMES'
59 | unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
| ^
1 error generated.
make[3]: *** [scripts/Makefile.build:116: arch/mips/kernel/asm-offsets.s] Error 1
make[3]: Target 'prepare' not remade because of errors.
make[2]: *** [Makefile:1191: prepare0] Error 2
make[2]: Target 'prepare' not remade because of errors.
make[1]: *** [Makefile:240: __sub-make] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:240: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.
vim +/CONFIG_STACKDEPOT_MAX_FRAMES +59 include/linux/stackdepot.h
51
52 struct stack_record {
53 struct list_head hash_list; /* Links in the hash table */
54 u32 hash; /* Hash in hash table */
55 u32 size; /* Number of stored frames */
56 union handle_parts handle; /* Constant after initialization */
57 refcount_t count;
58 union {
> 59 unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
60 struct {
61 /*
62 * An important invariant of the implementation is to
63 * only place a stack record onto the freelist iff its
64 * refcount is zero. Because stack records with a zero
65 * refcount are never considered as valid, it is safe to
66 * union @entries and freelist management state below.
67 * Conversely, as soon as an entry is off the freelist
68 * and its refcount becomes non-zero, the below must not
69 * be accessed until being placed back on the freelist.
70 */
71 struct list_head free_list; /* Links in the freelist */
72 unsigned long rcu_state; /* RCU cookie */
73 };
74 };
75 };
76
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Sat, Feb 10, 2024 at 08:52:25AM +0100, Marco Elver wrote:
> The extra bits are used by KMSAN, and might conflict if enabled at the
> same time. I think the safest option is to keep your own list. I think
> that will also be more performant if there are other stackdepot users
> because you do not have to traverse any of the other entries.
Ok, I thought we had spare bits for other users.
But thinking about it some more, yes, it makes sense for page_owner to
maintain its own list, so traversing it is faster and we do not have
to place code to traverse the buckets in stackdepot.
--
Oscar Salvador
SUSE Labs
On Fri, Feb 09, 2024 at 10:44:23PM +0100, Marco Elver wrote:
> > > * set a refcount of 1 ourselves.
> > > */
> > > if (refcount_read(&stack->count) == REFCOUNT_SATURATED)
> > > refcount_set(&stack->count, 1);
>
> Do you need to inc on the first allocation? Should there be an "else"
> here instead of always doing refcount_inc()?
Yes, I need to inc on the first allocation as well; otherwise the first
free op would drop the refcount to 0, and when the next allocation came
around I would get a warning for going from refcount 0 to 1.
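I.e., with the count seeded to 1 on the first allocation:

  alloc #1: refcount_set(1) + refcount_inc() -> 2
  free  #1: refcount_dec()                   -> 1
  alloc #2: refcount_inc()                   -> 2

so the count never hits 0, and the number of outstanding allocations
is always refcount - 1.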
--
Oscar Salvador
SUSE Labs
On 2/11/24 21:39, Oscar Salvador wrote:
> On Sat, Feb 10, 2024 at 08:52:25AM +0100, Marco Elver wrote:
>> The extra bits are used by KMSAN, and might conflict if enabled at the
>> same time. I think the safest option is to keep your own list. I think
>> that will also be more performant if there are other stackdepot users
>> because you do not have to traverse any of the other entries.
>
> Ok, I thought we had spare bits for other users.
> But thinking about it some more, yes, it makes sense for page_owner to
> maintain its own list, so traversing it is faster and we do not have
> to place code to traverse the buckets in stackdepot.
Would it make sense to introduce per-user stack depot instances? ("user"
being a subsystem, i.e. kasan or page_owner). I'd expect each to have a
distinct set of stacks, so there's no benefit in using the same hash table,
only the downside of longer collision lists?
I can imagine this would be easier for users that don't need the early-init
kind of stackdepot, but maybe even there it could be feasible to have a
small fixed-size array of hash table roots and every user would get a
separate index?
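Something like this, conceptually (purely hypothetical sketch, none of
this exists today):

/* Hypothetical: one hash table root per stackdepot user. */
enum stack_depot_user {
	STACK_DEPOT_USER_KASAN,
	STACK_DEPOT_USER_PAGE_OWNER,
	STACK_DEPOT_NR_USERS,
};

static struct list_head *stack_tables[STACK_DEPOT_NR_USERS];

/* Each entry point would then take the user as an extra argument. */
depot_stack_handle_t stack_depot_save_user(unsigned long *entries,
					   unsigned int nr_entries,
					   gfp_t alloc_flags,
					   enum stack_depot_user user);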