2024-02-12 22:29:35

by Oscar Salvador

Subject: [PATCH v8 0/5] page_owner: print stacks and their outstanding allocations

Changes v7 -> v8:
- Rebased on top of -next
- page_owner maintains its own stack_records list now
- Kill auxiliary stackdepot function to traverse buckets
- page_owner_stacks is now a directory with 'show_stacks'
and 'set_threshold'
- Update Documentation/mm/page_owner.rst
- Addressed feedback from Marco

Changes v6 -> v7:
- Rebased on top of Andrey Konovalov's libstackdepot patchset
- Reformulated the changelogs

Changes v5 -> v6:
- Rebase on top of v6.7-rc1
- Move stack_record struct to the header
- Addressed feedback from Vlastimil
(some code tweaks and changelogs suggestions)

Changes v4 -> v5:
- Addressed feedback from Alexander Potapenko

Changes v3 -> v4:
- Rebase (long time has passed)
- Use a boolean instead of an enum for the action (suggested by Alexander Potapenko)
- (I left some feedback untouched because it's been a long time and
I would rather discuss it here now than re-vamp
an old thread)

Changes v2 -> v3:
- Replace interface in favor of seq operations
(suggested by Vlastimil)
- Use debugfs interface to store/read values (suggested by Ammar)


page_owner is a great debugging tool that lets us know about all pages
that have been allocated/freed, together with the specific stacktrace
that allocated/freed them.
This comes in very handy when debugging memory leaks, since with
some scripting we can spot the outstanding allocations, which might point
to a memory leak.

In my experience, that is one of the most useful cases, but it can get
really tedious to screen through all pages and try to reconstruct the
stack <-> allocated/freed relationship; when we have tons of
allocation/free operations, that becomes a daunting and slow process.
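For example, a crude version of that screening, using the tools the
kernel tree already ships (see Documentation/mm/page_owner.rst):

# cat /sys/kernel/debug/page_owner > page_owner_full.txt
# ./page_owner_sort page_owner_full.txt sorted_page_owner.txt

and then manually matching the allocation and free stacktraces in
sorted_page_owner.txt, which is exactly the tedious part described
above.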

This patchset aims to ease that by adding a new functionality to
page_owner.
It creates a new directory called 'page_owner_stacks'
under '/sys/kernel/debug' with a read-only file called 'show_stacks',
which prints out all the stacks followed by their outstanding number
of allocations (that is, the number of times each stacktrace has
allocated pages that have not been freed yet).
This gives us a clear and quick overview of the stack <-> outstanding
allocations relationship.

We take advantage of the new refcount_t count field that the
stack_record struct gained, and increment/decrement the stack refcount
on every __set_page_owner() (alloc operation) and __reset_page_owner()
(free operation) call.

Unfortunately, we cannot use the new stackdepot API
(STACK_DEPOT_FLAG_GET plus stack_depot_put()) because it does not
fulfill page_owner's needs, meaning we would have to special-case
things, at which point it makes more sense for page_owner to do its own
{inc,dec}rementing of the stacks.
E.g.: with stack_depot_put(), once the refcount reaches 0,
such a stack gets evicted, so page_owner would lose information.
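
To illustrate (a rough sketch of that API's usage, not actual
page_owner code):

	/* each save with STACK_DEPOT_FLAG_GET takes a reference */
	handle = stack_depot_save_flags(entries, nr_entries, gfp_flags,
					STACK_DEPOT_FLAG_CAN_ALLOC |
					STACK_DEPOT_FLAG_GET);
	...
	/*
	 * the put that drops the refcount to 0 evicts the stack_record,
	 * so page_owner would lose the stack entirely once all pages
	 * allocated from it were freed
	 */
	stack_depot_put(handle);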

This patchset also creates a new file called 'set_threshold' within the
'page_owner_stacks' directory; by writing a value to it, the stacks
whose refcount is below that value will be filtered out.

A PoC can be found below:

# cat /sys/kernel/debug/page_owner_stacks/show_stacks > page_owner_full_stacks.txt
# head -40 page_owner_full_stacks.txt
prep_new_page+0xa9/0x120
get_page_from_freelist+0x801/0x2210
__alloc_pages+0x18b/0x350
alloc_pages_mpol+0x91/0x1f0
folio_alloc+0x14/0x50
filemap_alloc_folio+0xb2/0x100
page_cache_ra_unbounded+0x96/0x180
filemap_get_pages+0xfd/0x590
filemap_read+0xcc/0x330
blkdev_read_iter+0xb8/0x150
vfs_read+0x285/0x320
ksys_read+0xa5/0xe0
do_syscall_64+0x80/0x160
entry_SYSCALL_64_after_hwframe+0x6e/0x76
stack_count: 521



prep_new_page+0xa9/0x120
get_page_from_freelist+0x801/0x2210
__alloc_pages+0x18b/0x350
alloc_pages_mpol+0x91/0x1f0
folio_alloc+0x14/0x50
filemap_alloc_folio+0xb2/0x100
__filemap_get_folio+0x14a/0x490
ext4_write_begin+0xbd/0x4b0 [ext4]
generic_perform_write+0xc1/0x1e0
ext4_buffered_write_iter+0x68/0xe0 [ext4]
ext4_file_write_iter+0x70/0x740 [ext4]
vfs_write+0x33d/0x420
ksys_write+0xa5/0xe0
do_syscall_64+0x80/0x160
entry_SYSCALL_64_after_hwframe+0x6e/0x76
stack_count: 4609
..
..

# echo 5000 > /sys/kernel/debug/page_owner_stacks/set_threshold
# cat /sys/kernel/debug/page_owner_stacks/show_stacks > page_owner_full_stacks_5000.txt
# head -40 page_owner_full_stacks_5000.txt
prep_new_page+0xa9/0x120
get_page_from_freelist+0x801/0x2210
__alloc_pages+0x18b/0x350
alloc_pages_mpol+0x91/0x1f0
folio_alloc+0x14/0x50
filemap_alloc_folio+0xb2/0x100
__filemap_get_folio+0x14a/0x490
ext4_write_begin+0xbd/0x4b0 [ext4]
generic_perform_write+0xc1/0x1e0
ext4_buffered_write_iter+0x68/0xe0 [ext4]
ext4_file_write_iter+0x70/0x740 [ext4]
vfs_write+0x33d/0x420
ksys_pwrite64+0x75/0x90
do_syscall_64+0x80/0x160
entry_SYSCALL_64_after_hwframe+0x6e/0x76
stack_count: 6781



prep_new_page+0xa9/0x120
get_page_from_freelist+0x801/0x2210
__alloc_pages+0x18b/0x350
pcpu_populate_chunk+0xec/0x350
pcpu_balance_workfn+0x2d1/0x4a0
process_scheduled_works+0x84/0x380
worker_thread+0x12a/0x2a0
kthread+0xe3/0x110
ret_from_fork+0x30/0x50
ret_from_fork_asm+0x1b/0x30
stack_count: 8641

Oscar Salvador (5):
lib/stackdepot: Move stack_record struct definition into the header
mm,page_owner: Implement the tracking of the stacks count
mm,page_owner: Display all stacks and their count
mm,page_owner: Filter out stacks by a threshold
mm,page_owner: Update Documentation regarding page_owner_stacks

Documentation/mm/page_owner.rst | 44 ++++++++
include/linux/stackdepot.h | 53 +++++++++
lib/stackdepot.c | 51 ++-------
mm/page_owner.c | 190 ++++++++++++++++++++++++++++++++
4 files changed, 295 insertions(+), 43 deletions(-)

--
2.43.0



2024-02-12 22:29:51

by Oscar Salvador

Subject: [PATCH v8 1/5] lib/stackdepot: Move stack_record struct definition into the header

In order to move the heavy lifting into page_owner code, the latter
needs access to the stack_record structure, which right now
sits in lib/stackdepot.c.
Move it to the stackdepot.h header so page_owner can access
the stack_record struct fields.

Signed-off-by: Oscar Salvador <[email protected]>
---
include/linux/stackdepot.h | 44 ++++++++++++++++++++++++++++++++++++++
lib/stackdepot.c | 43 -------------------------------------
2 files changed, 44 insertions(+), 43 deletions(-)

diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
index adcbb8f23600..90274860fd8e 100644
--- a/include/linux/stackdepot.h
+++ b/include/linux/stackdepot.h
@@ -30,6 +30,50 @@ typedef u32 depot_stack_handle_t;
*/
#define STACK_DEPOT_EXTRA_BITS 5

+#define DEPOT_HANDLE_BITS (sizeof(depot_stack_handle_t) * 8)
+
+#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages */
+#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT + DEPOT_POOL_ORDER))
+#define DEPOT_STACK_ALIGN 4
+#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT - DEPOT_STACK_ALIGN)
+#define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_OFFSET_BITS - \
+ STACK_DEPOT_EXTRA_BITS)
+
+/* Compact structure that stores a reference to a stack. */
+union handle_parts {
+ depot_stack_handle_t handle;
+ struct {
+ u32 pool_index : DEPOT_POOL_INDEX_BITS;
+ u32 offset : DEPOT_OFFSET_BITS;
+ u32 extra : STACK_DEPOT_EXTRA_BITS;
+ };
+};
+
+struct stack_record {
+ struct list_head hash_list; /* Links in the hash table */
+ u32 hash; /* Hash in hash table */
+ u32 size; /* Number of stored frames */
+ union handle_parts handle; /* Constant after initialization */
+ refcount_t count;
+ union {
+ unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
+ struct {
+ /*
+ * An important invariant of the implementation is to
+ * only place a stack record onto the freelist iff its
+ * refcount is zero. Because stack records with a zero
+ * refcount are never considered as valid, it is safe to
+ * union @entries and freelist management state below.
+ * Conversely, as soon as an entry is off the freelist
+ * and its refcount becomes non-zero, the below must not
+ * be accessed until being placed back on the freelist.
+ */
+ struct list_head free_list; /* Links in the freelist */
+ unsigned long rcu_state; /* RCU cookie */
+ };
+ };
+};
+
typedef u32 depot_flags_t;

/*
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 4a7055a63d9f..6f9095374847 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -36,54 +36,11 @@
#include <linux/memblock.h>
#include <linux/kasan-enabled.h>

-#define DEPOT_HANDLE_BITS (sizeof(depot_stack_handle_t) * 8)
-
-#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages */
-#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT + DEPOT_POOL_ORDER))
-#define DEPOT_STACK_ALIGN 4
-#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT - DEPOT_STACK_ALIGN)
-#define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_OFFSET_BITS - \
- STACK_DEPOT_EXTRA_BITS)
#define DEPOT_POOLS_CAP 8192
#define DEPOT_MAX_POOLS \
(((1LL << (DEPOT_POOL_INDEX_BITS)) < DEPOT_POOLS_CAP) ? \
(1LL << (DEPOT_POOL_INDEX_BITS)) : DEPOT_POOLS_CAP)

-/* Compact structure that stores a reference to a stack. */
-union handle_parts {
- depot_stack_handle_t handle;
- struct {
- u32 pool_index : DEPOT_POOL_INDEX_BITS;
- u32 offset : DEPOT_OFFSET_BITS;
- u32 extra : STACK_DEPOT_EXTRA_BITS;
- };
-};
-
-struct stack_record {
- struct list_head hash_list; /* Links in the hash table */
- u32 hash; /* Hash in hash table */
- u32 size; /* Number of stored frames */
- union handle_parts handle; /* Constant after initialization */
- refcount_t count;
- union {
- unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
- struct {
- /*
- * An important invariant of the implementation is to
- * only place a stack record onto the freelist iff its
- * refcount is zero. Because stack records with a zero
- * refcount are never considered as valid, it is safe to
- * union @entries and freelist management state below.
- * Conversely, as soon as an entry is off the freelist
- * and its refcount becomes non-zero, the below must not
- * be accessed until being placed back on the freelist.
- */
- struct list_head free_list; /* Links in the freelist */
- unsigned long rcu_state; /* RCU cookie */
- };
- };
-};
-
static bool stack_depot_disabled;
static bool __stack_depot_early_init_requested __initdata = IS_ENABLED(CONFIG_STACKDEPOT_ALWAYS_INIT);
static bool __stack_depot_early_init_passed __initdata;
--
2.43.0


2024-02-12 22:30:05

by Oscar Salvador

Subject: [PATCH v8 2/5] mm,page_owner: Implement the tracking of the stacks count

page_owner needs to increment a stack_record refcount when a new allocation
occurs, and decrement it on a free operation.
In order to do that, we need to have a way to get a stack_record from a
handle.
Implement __stack_depot_get_stack_record() which just does that, and make
it public so page_owner can use it.

Also implement {inc,dec}_stack_record_count(), which increment
or decrement the refcount on the respective allocation and free
operations, via __set_page_owner() (alloc operation) and
__reset_page_owner() (free operation).

Traversing all stackdepot buckets comes with its own complexity,
plus we would have to implement a way to mark only those stack_records
that originated from page_owner, as those are the ones we are
interested in.
For that reason, page_owner maintains its own list of stack_records:
traversing that list is faster than traversing all buckets, while
keeping the complexity low at the same time.
inc_stack_record_count() is responsible for adding new stack_records
to the stack_list list.

Modifications on the list are protected via a spinlock with irqs
disabled, since this code can also be reached from IRQ context.

Signed-off-by: Oscar Salvador <[email protected]>
---
include/linux/stackdepot.h | 9 +++++
lib/stackdepot.c | 8 +++++
mm/page_owner.c | 73 ++++++++++++++++++++++++++++++++++++++
3 files changed, 90 insertions(+)

diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
index 90274860fd8e..f3c2162bf615 100644
--- a/include/linux/stackdepot.h
+++ b/include/linux/stackdepot.h
@@ -175,6 +175,15 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
depot_stack_handle_t stack_depot_save(unsigned long *entries,
unsigned int nr_entries, gfp_t gfp_flags);

+/**
+ * __stack_depot_get_stack_record - Get a pointer to a stack_record struct
+ * This function is only for internal purposes.
+ * @handle: Stack depot handle
+ *
+ * Return: Returns a pointer to a stack_record struct
+ */
+struct stack_record *__stack_depot_get_stack_record(depot_stack_handle_t handle);
+
/**
* stack_depot_fetch - Fetch a stack trace from stack depot
*
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 6f9095374847..fdb09450a538 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -685,6 +685,14 @@ depot_stack_handle_t stack_depot_save(unsigned long *entries,
}
EXPORT_SYMBOL_GPL(stack_depot_save);

+struct stack_record *__stack_depot_get_stack_record(depot_stack_handle_t handle)
+{
+ if (!handle)
+ return NULL;
+
+ return depot_fetch_stack(handle);
+}
+
unsigned int stack_depot_fetch(depot_stack_handle_t handle,
unsigned long **entries)
{
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 5634e5d890f8..7d1b3f75cef3 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -36,6 +36,14 @@ struct page_owner {
pid_t free_tgid;
};

+struct stack {
+ struct stack_record *stack_record;
+ struct stack *next;
+};
+
+static struct stack *stack_list;
+static DEFINE_SPINLOCK(stack_list_lock);
+
static bool page_owner_enabled __initdata;
DEFINE_STATIC_KEY_FALSE(page_owner_inited);

@@ -61,6 +69,57 @@ static __init bool need_page_owner(void)
return page_owner_enabled;
}

+static void add_stack_record_to_list(struct stack_record *stack_record)
+{
+ unsigned long flags;
+ struct stack *stack;
+
+ stack = kmalloc(sizeof(*stack), GFP_KERNEL);
+ if (stack) {
+ stack->stack_record = stack_record;
+ stack->next = NULL;
+
+ spin_lock_irqsave(&stack_list_lock, flags);
+ if (!stack_list) {
+ stack_list = stack;
+ } else {
+ stack->next = stack_list;
+ stack_list = stack;
+ }
+ spin_unlock_irqrestore(&stack_list_lock, flags);
+ }
+}
+
+static void inc_stack_record_count(depot_stack_handle_t handle)
+{
+ struct stack_record *stack_record = __stack_depot_get_stack_record(handle);
+
+ if (stack_record) {
+ /*
+ * New stack_record's that do not use STACK_DEPOT_FLAG_GET start
+ * with REFCOUNT_SATURATED to catch spurious increments of their
+ * refcount.
+ * Since we do not use STACK_DEPOT_FLAG_{GET,PUT} API, let us
+ * set a refcount of 1 ourselves.
+ */
+ if (refcount_read(&stack_record->count) == REFCOUNT_SATURATED) {
+ refcount_set(&stack_record->count, 1);
+
+ /* Add the new stack_record to our list */
+ add_stack_record_to_list(stack_record);
+ }
+ refcount_inc(&stack_record->count);
+ }
+}
+
+static void dec_stack_record_count(depot_stack_handle_t handle)
+{
+ struct stack_record *stack_record = __stack_depot_get_stack_record(handle);
+
+ if (stack_record)
+ refcount_dec(&stack_record->count);
+}
+
static __always_inline depot_stack_handle_t create_dummy_stack(void)
{
unsigned long entries[4];
@@ -140,6 +199,7 @@ void __reset_page_owner(struct page *page, unsigned short order)
int i;
struct page_ext *page_ext;
depot_stack_handle_t handle;
+ depot_stack_handle_t alloc_handle;
struct page_owner *page_owner;
u64 free_ts_nsec = local_clock();

@@ -147,6 +207,9 @@ void __reset_page_owner(struct page *page, unsigned short order)
if (unlikely(!page_ext))
return;

+ page_owner = get_page_owner(page_ext);
+ alloc_handle = page_owner->handle;
+
handle = save_stack(GFP_NOWAIT | __GFP_NOWARN);
for (i = 0; i < (1 << order); i++) {
__clear_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags);
@@ -158,6 +221,15 @@ void __reset_page_owner(struct page *page, unsigned short order)
page_ext = page_ext_next(page_ext);
}
page_ext_put(page_ext);
+ if (alloc_handle != early_handle)
+ /*
+ * early_handle is being set as a handle for all those
+ * early allocated pages. See init_pages_in_zone().
+ * Since their refcount is not being incremented because
+ * the machinery is not ready yet, we cannot decrement
+ * their refcount either.
+ */
+ dec_stack_record_count(alloc_handle);
}

static inline void __set_page_owner_handle(struct page_ext *page_ext,
@@ -199,6 +271,7 @@ noinline void __set_page_owner(struct page *page, unsigned short order,
return;
__set_page_owner_handle(page_ext, handle, order, gfp_mask);
page_ext_put(page_ext);
+ inc_stack_record_count(handle);
}

void __set_page_owner_migrate_reason(struct page *page, int reason)
--
2.43.0


2024-02-12 22:30:38

by Oscar Salvador

Subject: [PATCH v8 4/5] mm,page_owner: Filter out stacks by a threshold

We want to be able to filter out the stacks based on a threshold we
can tune.
By writing to the 'set_threshold' file, we can adjust the threshold value.

Signed-off-by: Oscar Salvador <[email protected]>
---
mm/page_owner.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 3e4b7cd7c8f8..c4f9e5506e93 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -832,15 +832,18 @@ static void *stack_next(struct seq_file *m, void *v, loff_t *ppos)
return stack;
}

+static unsigned long page_owner_stack_threshold;
+
static int stack_print(struct seq_file *m, void *v)
{
char *buf;
int ret = 0;
struct stack *stack = v;
struct stack_record *stack_record = stack->stack_record;
+ int stack_count = refcount_read(&stack_record->count);

if (!stack_record->size || stack_record->size < 0 ||
- refcount_read(&stack_record->count) < 2)
+ stack_count < 2 || stack_count < page_owner_stack_threshold)
return 0;

buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
@@ -851,7 +854,7 @@ static int stack_print(struct seq_file *m, void *v)
goto out;

scnprintf(buf + ret, PAGE_SIZE - ret, "stack_count: %d\n\n",
- refcount_read(&stack_record->count));
+ stack_count);

seq_printf(m, buf);
seq_puts(m, "\n\n");
@@ -884,6 +887,21 @@ static const struct file_operations page_owner_stack_operations = {
.release = seq_release,
};

+static int page_owner_threshold_get(void *data, u64 *val)
+{
+ *val = page_owner_stack_threshold;
+ return 0;
+}
+
+static int page_owner_threshold_set(void *data, u64 val)
+{
+ page_owner_stack_threshold = val;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(proc_page_owner_threshold, &page_owner_threshold_get,
+ &page_owner_threshold_set, "%llu");
+
static int __init pageowner_init(void)
{
struct dentry *dir;
@@ -898,6 +916,8 @@ static int __init pageowner_init(void)
dir = debugfs_create_dir("page_owner_stacks", NULL);
debugfs_create_file("show_stacks", 0400, dir, NULL,
&page_owner_stack_operations);
+ debugfs_create_file("set_threshold", 0600, dir, NULL,
+ &proc_page_owner_threshold);

return 0;
}
--
2.43.0


2024-02-12 22:30:43

by Oscar Salvador

Subject: [PATCH v8 5/5] mm,page_owner: Update Documentation regarding page_owner_stacks

Update the page_owner documentation to cover the new page_owner_stacks
feature and show how it can be used.

Signed-off-by: Oscar Salvador <[email protected]>
---
Documentation/mm/page_owner.rst | 44 +++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)

diff --git a/Documentation/mm/page_owner.rst b/Documentation/mm/page_owner.rst
index 62e3f7ab23cc..bcde81bf0902 100644
--- a/Documentation/mm/page_owner.rst
+++ b/Documentation/mm/page_owner.rst
@@ -24,6 +24,11 @@ fragmentation statistics can be obtained through gfp flag information of
each page. It is already implemented and activated if page owner is
enabled. Other usages are more than welcome.

+It can also be used to show all the stacks and their outstanding
+allocations, which gives us a quick overview of where the memory is going
+without the need to screen through all the pages and match the allocation
+and free operation.
+
page owner is disabled by default. So, if you'd like to use it, you need
to add "page_owner=on" to your boot cmdline. If the kernel is built
with page owner and page owner is disabled in runtime due to not enabling
@@ -68,6 +73,45 @@ Usage

4) Analyze information from page owner::

+ cat /sys/kernel/debug/page_owner_stacks/show_stacks > stacks.txt
+ cat stacks.txt
+ prep_new_page+0xa9/0x120
+ get_page_from_freelist+0x7e6/0x2140
+ __alloc_pages+0x18a/0x370
+ new_slab+0xc8/0x580
+ ___slab_alloc+0x1f2/0xaf0
+ __slab_alloc.isra.86+0x22/0x40
+ kmem_cache_alloc+0x31b/0x350
+ __khugepaged_enter+0x39/0x100
+ dup_mmap+0x1c7/0x5ce
+ copy_process+0x1afe/0x1c90
+ kernel_clone+0x9a/0x3c0
+ __do_sys_clone+0x66/0x90
+ do_syscall_64+0x7f/0x160
+ entry_SYSCALL_64_after_hwframe+0x6c/0x74
+ stack_count: 234
+ ...
+ ...
+ echo 7000 > /sys/kernel/debug/page_owner_stacks/set_threshold > stacks_7000.txt
+ cat stacks_7000.txt
+ prep_new_page+0xa9/0x120
+ get_page_from_freelist+0x7e6/0x2140
+ __alloc_pages+0x18a/0x370
+ alloc_pages_mpol+0xdf/0x1e0
+ folio_alloc+0x14/0x50
+ filemap_alloc_folio+0xb0/0x100
+ page_cache_ra_unbounded+0x97/0x180
+ filemap_fault+0x4b4/0x1200
+ __do_fault+0x2d/0x110
+ do_pte_missing+0x4b0/0xa30
+ __handle_mm_fault+0x7fa/0xb70
+ handle_mm_fault+0x125/0x300
+ do_user_addr_fault+0x3c9/0x840
+ exc_page_fault+0x68/0x150
+ asm_exc_page_fault+0x22/0x30
+ stack_count: 8248
+ ...
+
cat /sys/kernel/debug/page_owner > page_owner_full.txt
./page_owner_sort page_owner_full.txt sorted_page_owner.txt

--
2.43.0


2024-02-12 22:38:44

by Oscar Salvador

Subject: [PATCH v8 3/5] mm,page_owner: Display all stacks and their count

This patch adds a new directory called 'page_owner_stacks' under
/sys/kernel/debug/, with a file called 'show_stacks' in it.
Reading from that file will show all stacks that were added by
page_owner, followed by their counts, giving us a clear overview of the
stack <-> count relationship.

E.g:

prep_new_page+0xa9/0x120
get_page_from_freelist+0x801/0x2210
__alloc_pages+0x18b/0x350
alloc_pages_mpol+0x91/0x1f0
folio_alloc+0x14/0x50
filemap_alloc_folio+0xb2/0x100
__filemap_get_folio+0x14a/0x490
ext4_write_begin+0xbd/0x4b0 [ext4]
generic_perform_write+0xc1/0x1e0
ext4_buffered_write_iter+0x68/0xe0 [ext4]
ext4_file_write_iter+0x70/0x740 [ext4]
vfs_write+0x33d/0x420
ksys_write+0xa5/0xe0
do_syscall_64+0x80/0x160
entry_SYSCALL_64_after_hwframe+0x6e/0x76
stack_count: 4578

The seq stack_{start,next} functions will iterate through the list
stack_list in order to print all stacks.

Signed-off-by: Oscar Salvador <[email protected]>
---
mm/page_owner.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 98 insertions(+), 1 deletion(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 7d1b3f75cef3..3e4b7cd7c8f8 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -84,7 +84,12 @@ static void add_stack_record_to_list(struct stack_record *stack_record)
stack_list = stack;
} else {
stack->next = stack_list;
- stack_list = stack;
+ /* This pairs with smp_load_acquire() from function
+ * stack_start(). This guarantees that stack_start()
+ * will see an updated stack_list before starting to
+ * traverse the list.
+ */
+ smp_store_release(&stack_list, stack);
}
spin_unlock_irqrestore(&stack_list_lock, flags);
}
@@ -792,8 +797,97 @@ static const struct file_operations proc_page_owner_operations = {
.llseek = lseek_page_owner,
};

+static void *stack_start(struct seq_file *m, loff_t *ppos)
+{
+ struct stack *stack;
+
+ if (*ppos == -1UL)
+ return NULL;
+
+ if (!*ppos) {
+ /*
+ * This pairs with smp_store_release() from function
+ * add_stack_record_to_list(), so we get a consistent
+ * value of stack_list.
+ */
+ stack = smp_load_acquire(&stack_list);
+ } else {
+ stack = m->private;
+ stack = stack->next;
+ }
+
+ m->private = stack;
+
+ return stack;
+}
+
+static void *stack_next(struct seq_file *m, void *v, loff_t *ppos)
+{
+ struct stack *stack = v;
+
+ stack = stack->next;
+ *ppos = stack ? *ppos + 1 : -1UL;
+ m->private = stack;
+
+ return stack;
+}
+
+static int stack_print(struct seq_file *m, void *v)
+{
+ char *buf;
+ int ret = 0;
+ struct stack *stack = v;
+ struct stack_record *stack_record = stack->stack_record;
+
+ if (!stack_record->size || stack_record->size < 0 ||
+ refcount_read(&stack_record->count) < 2)
+ return 0;
+
+ buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+ ret += stack_trace_snprint(buf, PAGE_SIZE, stack_record->entries,
+ stack_record->size, 0);
+ if (!ret)
+ goto out;
+
+ scnprintf(buf + ret, PAGE_SIZE - ret, "stack_count: %d\n\n",
+ refcount_read(&stack_record->count));
+
+ seq_printf(m, buf);
+ seq_puts(m, "\n\n");
+out:
+ kfree(buf);
+
+ return 0;
+}
+
+static void stack_stop(struct seq_file *m, void *v)
+{
+}
+
+static const struct seq_operations page_owner_stack_op = {
+ .start = stack_start,
+ .next = stack_next,
+ .stop = stack_stop,
+ .show = stack_print
+};
+
+static int page_owner_stack_open(struct inode *inode, struct file *file)
+{
+ return seq_open_private(file, &page_owner_stack_op, 0);
+}
+
+static const struct file_operations page_owner_stack_operations = {
+ .open = page_owner_stack_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
static int __init pageowner_init(void)
{
+ struct dentry *dir;
+
if (!static_branch_unlikely(&page_owner_inited)) {
pr_info("page_owner is disabled\n");
return 0;
@@ -801,6 +895,9 @@ static int __init pageowner_init(void)

debugfs_create_file("page_owner", 0400, NULL, NULL,
&proc_page_owner_operations);
+ dir = debugfs_create_dir("page_owner_stacks", NULL);
+ debugfs_create_file("show_stacks", 0400, dir, NULL,
+ &page_owner_stack_operations);

return 0;
}
--
2.43.0


2024-02-13 08:30:57

by Marco Elver

Subject: Re: [PATCH v8 1/5] lib/stackdepot: Move stack_record struct definition into the header

On Mon, 12 Feb 2024 at 23:29, Oscar Salvador <[email protected]> wrote:
>
> In order to move the heavy lifting into page_owner code, this one
> needs to have access to the stack_record structure, which right now
> sits in lib/stackdepot.c.
> Move it to the stackdepot.h header so page_owner can access
> stack_record's struct fields.
>
> Signed-off-by: Oscar Salvador <[email protected]>

Reviewed-by: Marco Elver <[email protected]>

> ---
> include/linux/stackdepot.h | 44 ++++++++++++++++++++++++++++++++++++++
> lib/stackdepot.c | 43 -------------------------------------
> 2 files changed, 44 insertions(+), 43 deletions(-)
>
> diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
> index adcbb8f23600..90274860fd8e 100644
> --- a/include/linux/stackdepot.h
> +++ b/include/linux/stackdepot.h
> @@ -30,6 +30,50 @@ typedef u32 depot_stack_handle_t;
> */
> #define STACK_DEPOT_EXTRA_BITS 5
>
> +#define DEPOT_HANDLE_BITS (sizeof(depot_stack_handle_t) * 8)
> +
> +#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages */
> +#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT + DEPOT_POOL_ORDER))
> +#define DEPOT_STACK_ALIGN 4
> +#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT - DEPOT_STACK_ALIGN)
> +#define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_OFFSET_BITS - \
> + STACK_DEPOT_EXTRA_BITS)
> +
> +/* Compact structure that stores a reference to a stack. */
> +union handle_parts {
> + depot_stack_handle_t handle;
> + struct {
> + u32 pool_index : DEPOT_POOL_INDEX_BITS;
> + u32 offset : DEPOT_OFFSET_BITS;
> + u32 extra : STACK_DEPOT_EXTRA_BITS;
> + };
> +};
> +
> +struct stack_record {
> + struct list_head hash_list; /* Links in the hash table */
> + u32 hash; /* Hash in hash table */
> + u32 size; /* Number of stored frames */
> + union handle_parts handle; /* Constant after initialization */
> + refcount_t count;
> + union {
> + unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
> + struct {
> + /*
> + * An important invariant of the implementation is to
> + * only place a stack record onto the freelist iff its
> + * refcount is zero. Because stack records with a zero
> + * refcount are never considered as valid, it is safe to
> + * union @entries and freelist management state below.
> + * Conversely, as soon as an entry is off the freelist
> + * and its refcount becomes non-zero, the below must not
> + * be accessed until being placed back on the freelist.
> + */
> + struct list_head free_list; /* Links in the freelist */
> + unsigned long rcu_state; /* RCU cookie */
> + };
> + };
> +};
> +
> typedef u32 depot_flags_t;
>
> /*
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index 4a7055a63d9f..6f9095374847 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -36,54 +36,11 @@
> #include <linux/memblock.h>
> #include <linux/kasan-enabled.h>
>
> -#define DEPOT_HANDLE_BITS (sizeof(depot_stack_handle_t) * 8)
> -
> -#define DEPOT_POOL_ORDER 2 /* Pool size order, 4 pages */
> -#define DEPOT_POOL_SIZE (1LL << (PAGE_SHIFT + DEPOT_POOL_ORDER))
> -#define DEPOT_STACK_ALIGN 4
> -#define DEPOT_OFFSET_BITS (DEPOT_POOL_ORDER + PAGE_SHIFT - DEPOT_STACK_ALIGN)
> -#define DEPOT_POOL_INDEX_BITS (DEPOT_HANDLE_BITS - DEPOT_OFFSET_BITS - \
> - STACK_DEPOT_EXTRA_BITS)
> #define DEPOT_POOLS_CAP 8192
> #define DEPOT_MAX_POOLS \
> (((1LL << (DEPOT_POOL_INDEX_BITS)) < DEPOT_POOLS_CAP) ? \
> (1LL << (DEPOT_POOL_INDEX_BITS)) : DEPOT_POOLS_CAP)
>
> -/* Compact structure that stores a reference to a stack. */
> -union handle_parts {
> - depot_stack_handle_t handle;
> - struct {
> - u32 pool_index : DEPOT_POOL_INDEX_BITS;
> - u32 offset : DEPOT_OFFSET_BITS;
> - u32 extra : STACK_DEPOT_EXTRA_BITS;
> - };
> -};
> -
> -struct stack_record {
> - struct list_head hash_list; /* Links in the hash table */
> - u32 hash; /* Hash in hash table */
> - u32 size; /* Number of stored frames */
> - union handle_parts handle; /* Constant after initialization */
> - refcount_t count;
> - union {
> - unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */
> - struct {
> - /*
> - * An important invariant of the implementation is to
> - * only place a stack record onto the freelist iff its
> - * refcount is zero. Because stack records with a zero
> - * refcount are never considered as valid, it is safe to
> - * union @entries and freelist management state below.
> - * Conversely, as soon as an entry is off the freelist
> - * and its refcount becomes non-zero, the below must not
> - * be accessed until being placed back on the freelist.
> - */
> - struct list_head free_list; /* Links in the freelist */
> - unsigned long rcu_state; /* RCU cookie */
> - };
> - };
> -};
> -
> static bool stack_depot_disabled;
> static bool __stack_depot_early_init_requested __initdata = IS_ENABLED(CONFIG_STACKDEPOT_ALWAYS_INIT);
> static bool __stack_depot_early_init_passed __initdata;
> --
> 2.43.0
>

2024-02-13 08:31:12

by Marco Elver

Subject: Re: [PATCH v8 2/5] mm,page_owner: Implement the tracking of the stacks count

On Mon, 12 Feb 2024 at 23:29, Oscar Salvador <[email protected]> wrote:
>
> page_owner needs to increment a stack_record refcount when a new allocation
> occurs, and decrement it on a free operation.
> In order to do that, we need to have a way to get a stack_record from a
> handle.
> Implement __stack_depot_get_stack_record() which just does that, and make
> it public so page_owner can use it.
>
> Also implement {inc,dec}_stack_record_count() which increments
> or decrements on respective allocation and free operations, via
> __reset_page_owner() (free operation) and __set_page_owner() (alloc
> operation).
>
> Traversing all stackdepot buckets comes with its own complexity,
> plus we would have to implement a way to mark only those stack_records
> that were originated from page_owner, as those are the ones we are
> interested in.
> For that reason, page_owner maintains its own list of stack_records,
> because traversing that list is faster than traversing all buckets
> while keeping at the same time a low complexity.
> inc_stack_record_count() is responsible of adding new stack_records
> into the list stack_list.
>
> Modifications on the list are protected via a spinlock with irqs
> disabled, since this code can also be reached from IRQ context.
>
> Signed-off-by: Oscar Salvador <[email protected]>

For the code:

Reviewed-by: Marco Elver <[email protected]>

But see minor comments below.

> ---
> include/linux/stackdepot.h | 9 +++++
> lib/stackdepot.c | 8 +++++
> mm/page_owner.c | 73 ++++++++++++++++++++++++++++++++++++++
> 3 files changed, 90 insertions(+)
>
> diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
> index 90274860fd8e..f3c2162bf615 100644
> --- a/include/linux/stackdepot.h
> +++ b/include/linux/stackdepot.h
> @@ -175,6 +175,15 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
> depot_stack_handle_t stack_depot_save(unsigned long *entries,
> unsigned int nr_entries, gfp_t gfp_flags);
>
> +/**
> + * __stack_depot_get_stack_record - Get a pointer to a stack_record struct
> + * This function is only for internal purposes.

I think the body of the kernel doc needs to go after argument declarations.

> + * @handle: Stack depot handle
> + *
> + * Return: Returns a pointer to a stack_record struct
> + */
> +struct stack_record *__stack_depot_get_stack_record(depot_stack_handle_t handle);
> +
> /**
> * stack_depot_fetch - Fetch a stack trace from stack depot
> *
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index 6f9095374847..fdb09450a538 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -685,6 +685,14 @@ depot_stack_handle_t stack_depot_save(unsigned long *entries,
> }
> EXPORT_SYMBOL_GPL(stack_depot_save);
>
> +struct stack_record *__stack_depot_get_stack_record(depot_stack_handle_t handle)
> +{
> + if (!handle)
> + return NULL;
> +
> + return depot_fetch_stack(handle);
> +}
> +
> unsigned int stack_depot_fetch(depot_stack_handle_t handle,
> unsigned long **entries)
> {
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 5634e5d890f8..7d1b3f75cef3 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -36,6 +36,14 @@ struct page_owner {
> pid_t free_tgid;
> };
>
> +struct stack {
> + struct stack_record *stack_record;
> + struct stack *next;
> +};
> +
> +static struct stack *stack_list;
> +static DEFINE_SPINLOCK(stack_list_lock);
> +
> static bool page_owner_enabled __initdata;
> DEFINE_STATIC_KEY_FALSE(page_owner_inited);
>
> @@ -61,6 +69,57 @@ static __init bool need_page_owner(void)
> return page_owner_enabled;
> }
>
> +static void add_stack_record_to_list(struct stack_record *stack_record)
> +{
> + unsigned long flags;
> + struct stack *stack;
> +
> + stack = kmalloc(sizeof(*stack), GFP_KERNEL);
> + if (stack) {

It's usually more elegant to write

if (!stack)
	return;

If the rest of the function is conditional.

> + stack->stack_record = stack_record;
> + stack->next = NULL;
> +
> + spin_lock_irqsave(&stack_list_lock, flags);
> + if (!stack_list) {
> + stack_list = stack;
> + } else {
> + stack->next = stack_list;
> + stack_list = stack;
> + }
> + spin_unlock_irqrestore(&stack_list_lock, flags);
> + }
> +}
> +
> +static void inc_stack_record_count(depot_stack_handle_t handle)
> +{
> + struct stack_record *stack_record = __stack_depot_get_stack_record(handle);
> +
> + if (stack_record) {
> + /*
> + * New stack_record's that do not use STACK_DEPOT_FLAG_GET start
> + * with REFCOUNT_SATURATED to catch spurious increments of their
> + * refcount.
> + * Since we do not use STACK_DEPOT_FLAG_{GET,PUT} API, let us

I think I mentioned this in the other email, there is no
STACK_DEPOT_FLAG_PUT, only stack_depot_put().

> + * set a refcount of 1 ourselves.
> + */
> + if (refcount_read(&stack_record->count) == REFCOUNT_SATURATED) {
> + refcount_set(&stack_record->count, 1);
> +
> + /* Add the new stack_record to our list */
> + add_stack_record_to_list(stack_record);
> + }
> + refcount_inc(&stack_record->count);
> + }
> +}
> +
> +static void dec_stack_record_count(depot_stack_handle_t handle)
> +{
> + struct stack_record *stack_record = __stack_depot_get_stack_record(handle);
> +
> + if (stack_record)
> + refcount_dec(&stack_record->count);
> +}
> +
> static __always_inline depot_stack_handle_t create_dummy_stack(void)
> {
> unsigned long entries[4];
> @@ -140,6 +199,7 @@ void __reset_page_owner(struct page *page, unsigned short order)
> int i;
> struct page_ext *page_ext;
> depot_stack_handle_t handle;
> + depot_stack_handle_t alloc_handle;
> struct page_owner *page_owner;
> u64 free_ts_nsec = local_clock();
>
> @@ -147,6 +207,9 @@ void __reset_page_owner(struct page *page, unsigned short order)
> if (unlikely(!page_ext))
> return;
>
> + page_owner = get_page_owner(page_ext);
> + alloc_handle = page_owner->handle;
> +
> handle = save_stack(GFP_NOWAIT | __GFP_NOWARN);
> for (i = 0; i < (1 << order); i++) {
> __clear_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags);
> @@ -158,6 +221,15 @@ void __reset_page_owner(struct page *page, unsigned short order)
> page_ext = page_ext_next(page_ext);
> }
> page_ext_put(page_ext);
> + if (alloc_handle != early_handle)
> + /*
> + * early_handle is being set as a handle for all those
> + * early allocated pages. See init_pages_in_zone().
> + * Since their refcount is not being incremented because
> + * the machinery is not ready yet, we cannot decrement
> + * their refcount either.
> + */
> + dec_stack_record_count(alloc_handle);
> }
>
> static inline void __set_page_owner_handle(struct page_ext *page_ext,
> @@ -199,6 +271,7 @@ noinline void __set_page_owner(struct page *page, unsigned short order,
> return;
> __set_page_owner_handle(page_ext, handle, order, gfp_mask);
> page_ext_put(page_ext);
> + inc_stack_record_count(handle);
> }
>
> void __set_page_owner_migrate_reason(struct page *page, int reason)
> --
> 2.43.0
>

2024-02-13 08:43:22

by Marco Elver

Subject: Re: [PATCH v8 4/5] mm,page_owner: Filter out stacks by a threshold

On Mon, 12 Feb 2024 at 23:29, Oscar Salvador <[email protected]> wrote:
>
> We want to be able to filter out the stacks based on a threshold we can
> can tune.
> By writing to 'set_threshold' file, we can adjust the threshold value.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> mm/page_owner.c | 24 ++++++++++++++++++++++--
> 1 file changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 3e4b7cd7c8f8..c4f9e5506e93 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -832,15 +832,18 @@ static void *stack_next(struct seq_file *m, void *v, loff_t *ppos)
> return stack;
> }
>
> +static unsigned long page_owner_stack_threshold;
> +
> static int stack_print(struct seq_file *m, void *v)
> {
> char *buf;
> int ret = 0;
> struct stack *stack = v;
> struct stack_record *stack_record = stack->stack_record;
> + int stack_count = refcount_read(&stack_record->count);
>
> if (!stack_record->size || stack_record->size < 0 ||
> - refcount_read(&stack_record->count) < 2)
> + stack_count < 2 || stack_count < page_owner_stack_threshold)
> return 0;
>
> buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
> @@ -851,7 +854,7 @@ static int stack_print(struct seq_file *m, void *v)
> goto out;
>
> scnprintf(buf + ret, PAGE_SIZE - ret, "stack_count: %d\n\n",
> - refcount_read(&stack_record->count));
> + stack_count);
>
> seq_printf(m, buf);
> seq_puts(m, "\n\n");
> @@ -884,6 +887,21 @@ static const struct file_operations page_owner_stack_operations = {
> .release = seq_release,
> };
>
> +static int page_owner_threshold_get(void *data, u64 *val)
> +{
> + *val = page_owner_stack_threshold;
> + return 0;
> +}
> +
> +static int page_owner_threshold_set(void *data, u64 val)
> +{
> + page_owner_stack_threshold = val;

This could be written concurrently, so to avoid data races, you can
use WRITE_ONCE(page_owner_stack_threshold, val) and use
READ_ONCE(page_owner_stack_threshold) where it's read.

> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(proc_page_owner_threshold, &page_owner_threshold_get,
> + &page_owner_threshold_set, "%llu");
> +
> static int __init pageowner_init(void)
> {
> struct dentry *dir;
> @@ -898,6 +916,8 @@ static int __init pageowner_init(void)
> dir = debugfs_create_dir("page_owner_stacks", NULL);
> debugfs_create_file("show_stacks", 0400, dir, NULL,
> &page_owner_stack_operations);
> + debugfs_create_file("set_threshold", 0600, dir, NULL,
> + &proc_page_owner_threshold);
>
> return 0;
> }
> --
> 2.43.0
>

2024-02-13 08:49:30

by Marco Elver

Subject: Re: [PATCH v8 4/5] mm,page_owner: Filter out stacks by a threshold

On Mon, 12 Feb 2024 at 23:29, Oscar Salvador <[email protected]> wrote:
>
> We want to be able to filter out the stacks based on a threshold we can
> can tune.
> By writing to 'set_threshold' file, we can adjust the threshold value.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> mm/page_owner.c | 24 ++++++++++++++++++++++--
> 1 file changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 3e4b7cd7c8f8..c4f9e5506e93 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -832,15 +832,18 @@ static void *stack_next(struct seq_file *m, void *v, loff_t *ppos)
> return stack;
> }
>
> +static unsigned long page_owner_stack_threshold;
> +
> static int stack_print(struct seq_file *m, void *v)
> {
> char *buf;
> int ret = 0;
> struct stack *stack = v;
> struct stack_record *stack_record = stack->stack_record;
> + int stack_count = refcount_read(&stack_record->count);
>
> if (!stack_record->size || stack_record->size < 0 ||
> - refcount_read(&stack_record->count) < 2)
> + stack_count < 2 || stack_count < page_owner_stack_threshold)
> return 0;
>
> buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
> @@ -851,7 +854,7 @@ static int stack_print(struct seq_file *m, void *v)
> goto out;
>
> scnprintf(buf + ret, PAGE_SIZE - ret, "stack_count: %d\n\n",
> - refcount_read(&stack_record->count));
> + stack_count);
>
> seq_printf(m, buf);
> seq_puts(m, "\n\n");
> @@ -884,6 +887,21 @@ static const struct file_operations page_owner_stack_operations = {
> .release = seq_release,
> };
>
> +static int page_owner_threshold_get(void *data, u64 *val)
> +{
> + *val = page_owner_stack_threshold;
> + return 0;
> +}
> +
> +static int page_owner_threshold_set(void *data, u64 val)
> +{
> + page_owner_stack_threshold = val;
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(proc_page_owner_threshold, &page_owner_threshold_get,
> + &page_owner_threshold_set, "%llu");
> +
> static int __init pageowner_init(void)
> {
> struct dentry *dir;
> @@ -898,6 +916,8 @@ static int __init pageowner_init(void)
> dir = debugfs_create_dir("page_owner_stacks", NULL);
> debugfs_create_file("show_stacks", 0400, dir, NULL,
> &page_owner_stack_operations);
> + debugfs_create_file("set_threshold", 0600, dir, NULL,
> + &proc_page_owner_threshold);

Can't you also read from "set_threshold"? If so, the name
"set_threshold" is misleading. Why not just "threshold"?

2024-02-13 08:49:58

by Marco Elver

Subject: Re: [PATCH v8 5/5] mm,page_owner: Update Documentation regarding page_owner_stacks

On Mon, 12 Feb 2024 at 23:29, Oscar Salvador <[email protected]> wrote:
>
> Update page_owner documentation including the new page_owner_stacks
> feature to show how it can be used.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> Documentation/mm/page_owner.rst | 44 +++++++++++++++++++++++++++++++++
> 1 file changed, 44 insertions(+)
>
> diff --git a/Documentation/mm/page_owner.rst b/Documentation/mm/page_owner.rst
> index 62e3f7ab23cc..bcde81bf0902 100644
> --- a/Documentation/mm/page_owner.rst
> +++ b/Documentation/mm/page_owner.rst
> @@ -24,6 +24,11 @@ fragmentation statistics can be obtained through gfp flag information of
> each page. It is already implemented and activated if page owner is
> enabled. Other usages are more than welcome.
>
> +It can also be used to show all the stacks and their outstanding
> +allocations, which gives us a quick overview of where the memory is going
> +without the need to screen through all the pages and match the allocation
> +and free operation.
> +
> page owner is disabled by default. So, if you'd like to use it, you need
> to add "page_owner=on" to your boot cmdline. If the kernel is built
> with page owner and page owner is disabled in runtime due to not enabling
> @@ -68,6 +73,45 @@ Usage
>
> 4) Analyze information from page owner::
>
> + cat /sys/kernel/debug/page_owner_stacks/show_stacks > stacks.txt
> + cat stacks.txt
> + prep_new_page+0xa9/0x120
> + get_page_from_freelist+0x7e6/0x2140
> + __alloc_pages+0x18a/0x370
> + new_slab+0xc8/0x580
> + ___slab_alloc+0x1f2/0xaf0
> + __slab_alloc.isra.86+0x22/0x40
> + kmem_cache_alloc+0x31b/0x350
> + __khugepaged_enter+0x39/0x100
> + dup_mmap+0x1c7/0x5ce
> + copy_process+0x1afe/0x1c90
> + kernel_clone+0x9a/0x3c0
> + __do_sys_clone+0x66/0x90
> + do_syscall_64+0x7f/0x160
> + entry_SYSCALL_64_after_hwframe+0x6c/0x74
> + stack_count: 234
> + ...
> + ...
> + echo 7000 > /sys/kernel/debug/page_owner_stacks/set_threshold > stacks_7000.txt

I think this example command is wrong.

2024-02-13 09:12:30

by Oscar Salvador

Subject: Re: [PATCH v8 5/5] mm,page_owner: Update Documentation regarding page_owner_stacks

On Tue, Feb 13, 2024 at 09:45:28AM +0100, Marco Elver wrote:
> On Mon, 12 Feb 2024 at 23:29, Oscar Salvador <[email protected]> wrote:
> > + echo 7000 > /sys/kernel/debug/page_owner_stacks/set_threshold > stacks_7000.txt
>
> I think this example command is wrong.

Yes, it is, I will correct that in the next spin.

thanks!

--
Oscar Salvador
SUSE Labs

2024-02-13 09:15:43

by Oscar Salvador

Subject: Re: [PATCH v8 2/5] mm,page_owner: Implement the tracking of the stacks count

On Tue, Feb 13, 2024 at 09:30:25AM +0100, Marco Elver wrote:
> On Mon, 12 Feb 2024 at 23:29, Oscar Salvador <[email protected]> wrote:
> > Signed-off-by: Oscar Salvador <[email protected]>
>
> For the code:
>
> Reviewed-by: Marco Elver <[email protected]>

Thanks!

> But see minor comments below.

> > +/**
> > + * __stack_depot_get_stack_record - Get a pointer to a stack_record struct
> > + * This function is only for internal purposes.
>
> I think the body of the kernel doc needs to go after argument declarations.

I see. I will amend that.

> > +static void add_stack_record_to_list(struct stack_record *stack_record)
> > +{
> > + unsigned long flags;
> > + struct stack *stack;
> > +
> > + stack = kmalloc(sizeof(*stack), GFP_KERNEL);
> > + if (stack) {
>
> It's usually more elegant to write
>
> if (!stack)
> return;
>
> If the rest of the function is conditional.

Yeah, probably better to save some indentation.

> > + if (stack_record) {
> > + /*
> > + * New stack_record's that do not use STACK_DEPOT_FLAG_GET start
> > + * with REFCOUNT_SATURATED to catch spurious increments of their
> > + * refcount.
> > + * Since we do not use STACK_DEPOT_FLAG_{GET,PUT} API, let us
>
> I think I mentioned this in the other email, there is no
> STACK_DEPOT_FLAG_PUT, only stack_depot_put().

Yes, you did. This was an oversight.
I will fix that.


Thanks for the feedback Marco!

--
Oscar Salvador
SUSE Labs

2024-02-13 09:16:27

by Vlastimil Babka

Subject: Re: [PATCH v8 2/5] mm,page_owner: Implement the tracking of the stacks count

On 2/12/24 23:30, Oscar Salvador wrote:
> page_owner needs to increment a stack_record refcount when a new allocation
> occurs, and decrement it on a free operation.
> In order to do that, we need to have a way to get a stack_record from a
> handle.
> Implement __stack_depot_get_stack_record() which just does that, and make
> it public so page_owner can use it.
>
> Also implement {inc,dec}_stack_record_count() which increments
> or decrements on respective allocation and free operations, via
> __reset_page_owner() (free operation) and __set_page_owner() (alloc
> operation).
>
> Traversing all stackdepot buckets comes with its own complexity,
> plus we would have to implement a way to mark only those stack_records
> that were originated from page_owner, as those are the ones we are
> interested in.
> For that reason, page_owner maintains its own list of stack_records,
> because traversing that list is faster than traversing all buckets
> while keeping at the same time a low complexity.
> inc_stack_record_count() is responsible of adding new stack_records
> into the list stack_list.
>
> Modifications on the list are protected via a spinlock with irqs
> disabled, since this code can also be reached from IRQ context.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> include/linux/stackdepot.h | 9 +++++
> lib/stackdepot.c | 8 +++++
> mm/page_owner.c | 73 ++++++++++++++++++++++++++++++++++++++
> 3 files changed, 90 insertions(+)

..


> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -36,6 +36,14 @@ struct page_owner {
> pid_t free_tgid;
> };
>
> +struct stack {
> + struct stack_record *stack_record;
> + struct stack *next;
> +};
> +
> +static struct stack *stack_list;
> +static DEFINE_SPINLOCK(stack_list_lock);
> +
> static bool page_owner_enabled __initdata;
> DEFINE_STATIC_KEY_FALSE(page_owner_inited);
>
> @@ -61,6 +69,57 @@ static __init bool need_page_owner(void)
> return page_owner_enabled;
> }
>
> +static void add_stack_record_to_list(struct stack_record *stack_record)
> +{
> + unsigned long flags;
> + struct stack *stack;
> +
> + stack = kmalloc(sizeof(*stack), GFP_KERNEL);

I doubt you can use GFP_KERNEL unconditionally? Think you need to pass down
the gfp flags from __set_page_owner() here?
And what about the alloc failure case, this will just leave the stack record
unlinked forever? Can we somehow know which ones we failed to link, and try
next time? Probably easier by not recording the stack for the page at all in
that case, so the next attempt with the same stack looks like the very first
again and attempts the add to list.
Still not happy that these extra tracking objects are needed, but I guess
the per-user stack depot instances I suggested would be a major change.

> + if (stack) {
> + stack->stack_record = stack_record;
> + stack->next = NULL;
> +
> + spin_lock_irqsave(&stack_list_lock, flags);
> + if (!stack_list) {
> + stack_list = stack;
> + } else {
> + stack->next = stack_list;
> + stack_list = stack;
> + }
> + spin_unlock_irqrestore(&stack_list_lock, flags);
> + }
> +}
> +
> +static void inc_stack_record_count(depot_stack_handle_t handle)
> +{
> + struct stack_record *stack_record = __stack_depot_get_stack_record(handle);
> +
> + if (stack_record) {
> + /*
> + * New stack_record's that do not use STACK_DEPOT_FLAG_GET start
> + * with REFCOUNT_SATURATED to catch spurious increments of their
> + * refcount.
> + * Since we do not use STACK_DEPOT_FLAG_{GET,PUT} API, let us
> + * set a refcount of 1 ourselves.
> + */
> + if (refcount_read(&stack_record->count) == REFCOUNT_SATURATED) {
> + refcount_set(&stack_record->count, 1);

Isn't this racy? Shouldn't we use some atomic cmpxchg operation to change
from REFCOUNT_SATURATED to 1?

> + /* Add the new stack_record to our list */
> + add_stack_record_to_list(stack_record);
> + }
> + refcount_inc(&stack_record->count);
> + }
> +}
> +


2024-02-13 09:18:47

by Oscar Salvador

Subject: Re: [PATCH v8 3/5] mm,page_owner: Display all stacks and their count

On Tue, Feb 13, 2024 at 09:38:43AM +0100, Marco Elver wrote:
> On Mon, 12 Feb 2024 at 23:29, Oscar Salvador <[email protected]> wrote:
> > Signed-off-by: Oscar Salvador <[email protected]>
>
> Acked-by: Marco Elver <[email protected]>

Thanks!


> > + /* This pairs with smp_load_acquire() from function
>
> Comment should be
>
> /*
> *
> ...
> */

Yap, fat fingers here.

> > + if (!*ppos) {
> > + /*
> > + * This pairs with smp_store_release() from function
> > + * add_stack_record_to_list(), so we get a consistent
> > + * value of stack_list.
> > + */
> > + stack = smp_load_acquire(&stack_list);
>
> I'm not sure if it'd make your code simpler or not: there is
> <linux/llist.h> for singly-linked linked lists, although the code to
> manage the list is simple enough I'm indifferent here. Only consider
> it if it helps you make the code simpler.

I will check if it eases the code somehow.
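
For reference, a rough sketch of what an llist variant could look like
(untested, names illustrative; assumes a read-only traversal of ->first
is fine on the seq file side):

	struct stack {
		struct stack_record *stack_record;
		struct llist_node node;
	};
	static LLIST_HEAD(stack_list);

	/* add: lockless, safe from IRQ context, no stack_list_lock */
	llist_add(&stack->node, &stack_list);

	/* traverse: */
	llist_for_each_entry(stack, stack_list.first, node)
		...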


> > +static void stack_stop(struct seq_file *m, void *v)
> > +{
> > +}
>
> Is this function even needed if it's empty? I recall there were some
> boilerplate "nop" functions that could be used.

I will check if seq already provides a dummy function for these cases.

--
Oscar Salvador
SUSE Labs

2024-02-13 09:20:51

by Oscar Salvador

Subject: Re: [PATCH v8 4/5] mm,page_owner: Filter out stacks by a threshold

On Tue, Feb 13, 2024 at 09:44:47AM +0100, Marco Elver wrote:
> On Mon, 12 Feb 2024 at 23:29, Oscar Salvador <[email protected]> wrote:
> > + debugfs_create_file("set_threshold", 0600, dir, NULL,
> > + &proc_page_owner_threshold);
>
> Can't you also read from "set_threshold", so the name "set_threshold"
> is misleading. Why not just "threshold"?

Yes, it can also be read.
I guess I was too focused on the set part.
I will rename that one, plus add the {READ,WRITE}_ONCE.

Thanks!

--
Oscar Salvador
SUSE Labs

2024-02-13 09:22:02

by Marco Elver

Subject: Re: [PATCH v8 2/5] mm,page_owner: Implement the tracking of the stacks count

On Tue, 13 Feb 2024 at 10:16, Vlastimil Babka <[email protected]> wrote:
>
> On 2/12/24 23:30, Oscar Salvador wrote:
> > page_owner needs to increment a stack_record refcount when a new allocation
> > occurs, and decrement it on a free operation.
> > In order to do that, we need to have a way to get a stack_record from a
> > handle.
> > Implement __stack_depot_get_stack_record() which just does that, and make
> > it public so page_owner can use it.
> >
> > Also implement {inc,dec}_stack_record_count() which increments
> > or decrements on respective allocation and free operations, via
> > __reset_page_owner() (free operation) and __set_page_owner() (alloc
> > operation).
> >
> > Traversing all stackdepot buckets comes with its own complexity,
> > plus we would have to implement a way to mark only those stack_records
> > that were originated from page_owner, as those are the ones we are
> > interested in.
> > For that reason, page_owner maintains its own list of stack_records,
> > because traversing that list is faster than traversing all buckets
> > while keeping at the same time a low complexity.
> > inc_stack_record_count() is responsible of adding new stack_records
> > into the list stack_list.
> >
> > Modifications on the list are protected via a spinlock with irqs
> > disabled, since this code can also be reached from IRQ context.
> >
> > Signed-off-by: Oscar Salvador <[email protected]>
> > ---
> > include/linux/stackdepot.h | 9 +++++
> > lib/stackdepot.c | 8 +++++
> > mm/page_owner.c | 73 ++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 90 insertions(+)
>
> ...
>
>
> > --- a/mm/page_owner.c
> > +++ b/mm/page_owner.c
> > @@ -36,6 +36,14 @@ struct page_owner {
> > pid_t free_tgid;
> > };
> >
> > +struct stack {
> > + struct stack_record *stack_record;
> > + struct stack *next;
> > +};
> > +
> > +static struct stack *stack_list;
> > +static DEFINE_SPINLOCK(stack_list_lock);
> > +
> > static bool page_owner_enabled __initdata;
> > DEFINE_STATIC_KEY_FALSE(page_owner_inited);
> >
> > @@ -61,6 +69,57 @@ static __init bool need_page_owner(void)
> > return page_owner_enabled;
> > }
> >
> > +static void add_stack_record_to_list(struct stack_record *stack_record)
> > +{
> > + unsigned long flags;
> > + struct stack *stack;
> > +
> > + stack = kmalloc(sizeof(*stack), GFP_KERNEL);
>
> I doubt you can use GFP_KERNEL unconditionally? Think you need to pass down
> the gfp flags from __set_page_owner() here?
> And what about the alloc failure case, this will just leave the stack record
> unlinked forever? Can we somehow know which ones we failed to link, and try
> next time? Probably easier by not recording the stack for the page at all in
> that case, so the next attempt with the same stack looks like the very first
> again and attemps the add to list.
> Still not happy that these extra tracking objects are needed, but I guess
> the per-users stack depot instances I suggested would be a major change.
>
> > + if (stack) {
> > + stack->stack_record = stack_record;
> > + stack->next = NULL;
> > +
> > + spin_lock_irqsave(&stack_list_lock, flags);
> > + if (!stack_list) {
> > + stack_list = stack;
> > + } else {
> > + stack->next = stack_list;
> > + stack_list = stack;
> > + }
> > + spin_unlock_irqrestore(&stack_list_lock, flags);
> > + }
> > +}
> > +
> > +static void inc_stack_record_count(depot_stack_handle_t handle)
> > +{
> > + struct stack_record *stack_record = __stack_depot_get_stack_record(handle);
> > +
> > + if (stack_record) {
> > + /*
> > + * New stack_record's that do not use STACK_DEPOT_FLAG_GET start
> > + * with REFCOUNT_SATURATED to catch spurious increments of their
> > + * refcount.
> > + * Since we do not use STACK_DEPOT_FLAG_{GET,PUT} API, let us
> > + * set a refcount of 1 ourselves.
> > + */
> > + if (refcount_read(&stack_record->count) == REFCOUNT_SATURATED) {
> > + refcount_set(&stack_record->count, 1);
>
> Isn't this racy? Shouldn't we use some atomic cmpxchg operation to change
> from REFCOUNT_SATURATED to 1?

If 2 threads race here, both will want to add it to the list as well
and take the lock. So this could just be solved with double-checked
locking:

	if (count == REFCOUNT_SATURATED) {
		spin_lock(...);
		if (count == REFCOUNT_SATURATED) {
			refcount_set(.., 1);
			.. add to list ...
		}
		spin_unlock(..);
	}

2024-02-13 09:53:55

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v8 2/5] mm,page_owner: Implement the tracking of the stacks count

On Tue, Feb 13, 2024 at 10:16:04AM +0100, Vlastimil Babka wrote:
> On 2/12/24 23:30, Oscar Salvador wrote:
> > +static void add_stack_record_to_list(struct stack_record *stack_record)
> > +{
> > + unsigned long flags;
> > + struct stack *stack;
> > +
> > + stack = kmalloc(sizeof(*stack), GFP_KERNEL);
>
> I doubt you can use GFP_KERNEL unconditionally? Think you need to pass down
> the gfp flags from __set_page_owner() here?

Yes, we should re-use the same gfp flags.

> And what about the alloc failure case, this will just leave the stack record
> unlinked forever? Can we somehow know which ones we failed to link, and try
> next time? Probably easier by not recording the stack for the page at all in
> that case, so the next attempt with the same stack looks like the very first
> again and attempts the add to list.

Well, it is not that easy.
We could, before even calling into stackdepot to save the trace, try to
allocate a "struct stack", and skip everything if that fails, so
subsequent calls will try again as if this was the first time.

The thing I do not like about that is that it gets in the way of the
traditional page_owner functionality (to give it a name): if we cannot
allocate a "struct stack", we also do not log the page and the stack as
we usually do, which means we lose that information.

One could argue that if we are so screwed that we cannot allocate that
struct, we may also fail deep in stackdepot code when allocating
the stack_record struct, but who knows.
I tried to keep both features as independent as possible.
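Something along these lines, perhaps (untested sketch, not part of this
series; save_stack() and __set_page_owner_handle() are the existing
page_owner helpers, and the two-argument inc_stack_record_count() that
consumes the pre-allocated object is hypothetical):

	noinline void __set_page_owner(struct page *page, unsigned short order,
				       gfp_t gfp_mask)
	{
		/* Pre-allocate the tracking object before anything else. */
		struct stack *stack = kmalloc(sizeof(*stack), gfp_mask);
		depot_stack_handle_t handle;
		struct page_ext *page_ext;

		/*
		 * Bail out early on failure, so the next allocation with the
		 * same stack retries from scratch -- at the cost of not
		 * logging this page at all.
		 */
		if (!stack)
			return;

		handle = save_stack(gfp_mask);
		page_ext = page_ext_get(page);
		if (unlikely(!page_ext)) {
			kfree(stack);
			return;
		}
		__set_page_owner_handle(page_ext, handle, order, gfp_mask);
		page_ext_put(page_ext);
		/* Variant that links 'stack' into stack_list, or frees it. */
		inc_stack_record_count(handle, stack);
	}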

> Still not happy that these extra tracking objects are needed, but I guess
> the per-user stack depot instances I suggested would be a major change.

Well, it is definitely something I could look into in the short term,
to see if it makes sense, but for now I would go with the current state,
as it strikes a good balance between complexity and optimization.

> > + if (stack_record) {
> > + /*
> > + * New stack_record's that do not use STACK_DEPOT_FLAG_GET start
> > + * with REFCOUNT_SATURATED to catch spurious increments of their
> > + * refcount.
> > + * Since we do not use STACK_DEPOT_FLAG_{GET,PUT} API, let us
> > + * set a refcount of 1 ourselves.
> > + */
> > + if (refcount_read(&stack_record->count) == REFCOUNT_SATURATED) {
> > + refcount_set(&stack_record->count, 1);
>
> Isn't this racy? Shouldn't we use some atomic cmpxchg operation to change
> from REFCOUNT_SATURATED to 1?

Yeah, missed that. Probably check it under the lock as Marco suggested.

Thanks Vlastimil for the feedback!


--
Oscar Salvador
SUSE Labs

2024-02-13 11:12:19

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v8 1/5] lib/stackdepot: Move stack_record struct definition into the header

On 2/12/24 23:30, Oscar Salvador wrote:
> In order to move the heavy lifting into page_owner code, this one
> needs to have access to the stack_record structure, which right now
> sits in lib/stackdepot.c.
> Move it to the stackdepot.h header so page_owner can access
> stack_record's struct fields.
>
> Signed-off-by: Oscar Salvador <[email protected]>

Reviewed-by: Vlastimil Babka <[email protected]>


2024-02-13 11:37:49

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v8 2/5] mm,page_owner: Implement the tracking of the stacks count

On 2/13/24 10:21, Marco Elver wrote:
> On Tue, 13 Feb 2024 at 10:16, Vlastimil Babka <[email protected]> wrote:
>>
>> On 2/12/24 23:30, Oscar Salvador wrote:
>> > page_owner needs to increment a stack_record refcount when a new allocation
>> > occurs, and decrement it on a free operation.
>> > In order to do that, we need to have a way to get a stack_record from a
>> > handle.
>> > Implement __stack_depot_get_stack_record() which just does that, and make
>> > it public so page_owner can use it.
>> >
>> > Also implement {inc,dec}_stack_record_count(), which increment or
>> > decrement the refcount on the respective allocation and free
>> > operations, via __set_page_owner() (alloc operation) and
>> > __reset_page_owner() (free operation).
>> >
>> > Traversing all stackdepot buckets comes with its own complexity,
>> > plus we would have to implement a way to mark only those stack_records
>> > that originated from page_owner, as those are the ones we are
>> > interested in.
>> > For that reason, page_owner maintains its own list of stack_records:
>> > traversing that list is faster than traversing all buckets, while
>> > keeping the complexity low.
>> > inc_stack_record_count() is responsible for adding new stack_records
>> > into stack_list.
>> >
>> > Modifications on the list are protected via a spinlock with irqs
>> > disabled, since this code can also be reached from IRQ context.
>> >
>> > Signed-off-by: Oscar Salvador <[email protected]>
>> > ---
>> > include/linux/stackdepot.h | 9 +++++
>> > lib/stackdepot.c | 8 +++++
>> > mm/page_owner.c | 73 ++++++++++++++++++++++++++++++++++++++
>> > 3 files changed, 90 insertions(+)
>>
>> ...
>>
>>
>> > --- a/mm/page_owner.c
>> > +++ b/mm/page_owner.c
>> > @@ -36,6 +36,14 @@ struct page_owner {
>> > pid_t free_tgid;
>> > };
>> >
>> > +struct stack {
>> > + struct stack_record *stack_record;
>> > + struct stack *next;
>> > +};
>> > +
>> > +static struct stack *stack_list;
>> > +static DEFINE_SPINLOCK(stack_list_lock);
>> > +
>> > static bool page_owner_enabled __initdata;
>> > DEFINE_STATIC_KEY_FALSE(page_owner_inited);
>> >
>> > @@ -61,6 +69,57 @@ static __init bool need_page_owner(void)
>> > return page_owner_enabled;
>> > }
>> >
>> > +static void add_stack_record_to_list(struct stack_record *stack_record)
>> > +{
>> > + unsigned long flags;
>> > + struct stack *stack;
>> > +
>> > + stack = kmalloc(sizeof(*stack), GFP_KERNEL);
>>
>> I doubt you can use GFP_KERNEL unconditionally? Think you need to pass down
>> the gfp flags from __set_page_owner() here?
>> And what about the alloc failure case, this will just leave the stack record
>> unlinked forever? Can we somehow know which ones we failed to link, and try
>> next time? Probably easier by not recording the stack for the page at all in
>> that case, so the next attempt with the same stack looks like the very first
>> again and attempts the add to list.
>> Still not happy that these extra tracking objects are needed, but I guess
>> the per-user stack depot instances I suggested would be a major change.
>>
>> > + if (stack) {
>> > + stack->stack_record = stack_record;
>> > + stack->next = NULL;
>> > +
>> > + spin_lock_irqsave(&stack_list_lock, flags);
>> > + if (!stack_list) {
>> > + stack_list = stack;
>> > + } else {
>> > + stack->next = stack_list;
>> > + stack_list = stack;
>> > + }
>> > + spin_unlock_irqrestore(&stack_list_lock, flags);
>> > + }
>> > +}
>> > +
>> > +static void inc_stack_record_count(depot_stack_handle_t handle)
>> > +{
>> > + struct stack_record *stack_record = __stack_depot_get_stack_record(handle);
>> > +
>> > + if (stack_record) {
>> > + /*
>> > + * New stack_record's that do not use STACK_DEPOT_FLAG_GET start
>> > + * with REFCOUNT_SATURATED to catch spurious increments of their
>> > + * refcount.
>> > + * Since we do not use STACK_DEPOT_FLAG_{GET,PUT} API, let us
>> > + * set a refcount of 1 ourselves.
>> > + */
>> > + if (refcount_read(&stack_record->count) == REFCOUNT_SATURATED) {
>> > + refcount_set(&stack_record->count, 1);
>>
>> Isn't this racy? Shouldn't we use some atomic cmpxchg operation to change
>> from REFCOUNT_SATURATED to 1?
>
> If 2 threads race here, both will want to add it to the list as well
> and take the lock. So this could just be solved with double-checked
> locking:
>
> 	if (count == REFCOUNT_SATURATED) {
> 		spin_lock(...);

Yeah, probably stack_list_lock could be taken here already. But then the
kmalloc() of struct stack must also happen here, before taking the lock.

> 		if (count == REFCOUNT_SATURATED) {
> 			refcount_set(.., 1);
> 			.. add to list ...
> 		}
> 		spin_unlock(..);
> 	}
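Putting the double-checked locking together with that constraint, a rough
sketch (untested; the extra gfp argument reflects the earlier point about
not using GFP_KERNEL unconditionally, and all other names are the ones
from this patch):

	static void inc_stack_record_count(depot_stack_handle_t handle, gfp_t gfp)
	{
		struct stack_record *stack_record = __stack_depot_get_stack_record(handle);
		struct stack *new = NULL;
		unsigned long flags;

		if (!stack_record)
			return;

		if (refcount_read(&stack_record->count) == REFCOUNT_SATURATED) {
			/* Allocate before taking the lock; kmalloc() may sleep. */
			new = kmalloc(sizeof(*new), gfp);
			spin_lock_irqsave(&stack_list_lock, flags);
			/* Re-check under the lock: another CPU may have won. */
			if (new && refcount_read(&stack_record->count) ==
				   REFCOUNT_SATURATED) {
				refcount_set(&stack_record->count, 1);
				new->stack_record = stack_record;
				new->next = stack_list;
				stack_list = new;
				new = NULL;
			}
			spin_unlock_irqrestore(&stack_list_lock, flags);
			/* Allocation failed or we lost the race: drop the spare. */
			kfree(new);
		}
		refcount_inc(&stack_record->count);
	}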


2024-02-13 12:39:35

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v8 2/5] mm,page_owner: Implement the tracking of the stacks count

On Tue, Feb 13, 2024 at 12:34:55PM +0100, Vlastimil Babka wrote:
> On 2/13/24 10:21, Marco Elver wrote:
> > On Tue, 13 Feb 2024 at 10:16, Vlastimil Babka <[email protected]> wrote:
> >> Isn't this racy? Shouldn't we use some atomic cmpxchg operation to change
> >> from REFCOUNT_SATURATED to 1?
> >
> > If 2 threads race here, both will want to add it to the list as well
> > and take the lock. So this could just be solved with double-checked
> > locking:
> >
> > 	if (count == REFCOUNT_SATURATED) {
> > 		spin_lock(...);
>
> Yeah, probably stack_list_lock could be taken here already. But then the
> kmalloc() of struct stack must also happen here, before taking the lock.

I am thinking about what would be a less expensive and safer option here.
Of course, taking the spinlock is easier, but having the allocation
inside it is tricky, and having it outside could mean that we might have
to free the struct once we see, __within__ the lock, that the refcount
is no longer REFCOUNT_SATURATED. No big deal, but a bit sub-optimal.

On the other hand, IIUC, cmpxchg has some memory ordering, like
store_release/load_acquire do, so would it be safe to use it
instead of taking the lock?


--
Oscar Salvador
SUSE Labs

2024-02-13 12:59:31

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH v8 2/5] mm,page_owner: Implement the tracking of the stacks count

On Tue, 13 Feb 2024 at 13:39, Oscar Salvador <[email protected]> wrote:
>
> On Tue, Feb 13, 2024 at 12:34:55PM +0100, Vlastimil Babka wrote:
> > On 2/13/24 10:21, Marco Elver wrote:
> > > On Tue, 13 Feb 2024 at 10:16, Vlastimil Babka <[email protected]> wrote:
> > >> Isn't this racy? Shouldn't we use some atomic cmpxchg operation to change
> > >> from REFCOUNT_SATURATED to 1?
> > >
> > > If 2 threads race here, both will want to add it to the list as well
> > > and take the lock. So this could just be solved with double-checked
> > > locking:
> > >
> > > 	if (count == REFCOUNT_SATURATED) {
> > > 		spin_lock(...);
> >
> > Yeah, probably stack_list_lock could be taken here already. But then the
> > kmalloc() of struct stack must also happen here, before taking the lock.
>
> I am thinking about what would be a less expensive and safer option here.
> Of course, taking the spinlock is easier, but having the allocation
> inside it is tricky, and having it outside could mean that we might have
> to free the struct once we see, __within__ the lock, that the refcount
> is no longer REFCOUNT_SATURATED. No big deal, but a bit sub-optimal.
>
> On the other hand, IIUC, cmpxchg has some memory ordering, like
> store_release/load_acquire do, so would it be safe to use it
> instead of taking the lock?

Memory ordering here is secondary because the count is not used to
release and later acquire any memory (the list is used for that, you
change list head reads/writes to smp_load_acquire/smp_store_release in
the later patch). The problem is mutual exclusion. You can do mutual
exclusion with something like this as well:

> 	if (refcount_read(&stack->count) == REFCOUNT_SATURATED) {
> 		int old = REFCOUNT_SATURATED;
>
> 		if (atomic_try_cmpxchg_relaxed(&stack->count.refs, &old, 1))
> 			add_stack_record_to_list(...);
> 	}
> 	refcount_inc(&stack->count);

Probably simpler.

2024-02-13 13:42:38

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v8 2/5] mm,page_owner: Implement the tracking of the stacks count

On 2/12/24 23:30, Oscar Salvador wrote:
> page_owner needs to increment a stack_record refcount when a new allocation
> occurs, and decrement it on a free operation.
> In order to do that, we need to have a way to get a stack_record from a
> handle.
> Implement __stack_depot_get_stack_record() which just does that, and make
> it public so page_owner can use it.
>
> Also implement {inc,dec}_stack_record_count(), which increment or
> decrement the refcount on the respective allocation and free
> operations, via __set_page_owner() (alloc operation) and
> __reset_page_owner() (free operation).
>
> Traversing all stackdepot buckets comes with its own complexity,
> plus we would have to implement a way to mark only those stack_records
> that originated from page_owner, as those are the ones we are
> interested in.
> For that reason, page_owner maintains its own list of stack_records:
> traversing that list is faster than traversing all buckets, while
> keeping the complexity low.
> inc_stack_record_count() is responsible for adding new stack_records
> into stack_list.
>
> Modifications on the list are protected via a spinlock with irqs
> disabled, since this code can also be reached from IRQ context.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> include/linux/stackdepot.h | 9 +++++
> lib/stackdepot.c | 8 +++++
> mm/page_owner.c | 73 ++++++++++++++++++++++++++++++++++++++
> 3 files changed, 90 insertions(+)
> static inline void __set_page_owner_handle(struct page_ext *page_ext,
> @@ -199,6 +271,7 @@ noinline void __set_page_owner(struct page *page, unsigned short order,
> return;
> __set_page_owner_handle(page_ext, handle, order, gfp_mask);
> page_ext_put(page_ext);
> + inc_stack_record_count(handle);

What if this is the dummy handle, which means we have recursed in page
owner, and we'll be trying to kmalloc() its struct stack and link it to
the stack_list because it was returned for the first time? Also
failure_handle.
Could you pre-create static (not kmalloc) struct stacks for these handles
with a refcount of 0 and insert them into stack_list, all during
init_page_owner()? Bonus: no longer treating stack_list == NULL in a
special way in add_stack_record_to_list() (although you don't need to
handle it extra even now, AFAICS).

> }
>
> void __set_page_owner_migrate_reason(struct page *page, int reason)
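A sketch of that pre-creation (hypothetical helper name; dummy_handle and
failure_handle are page_owner's existing globals, and the initial refcount
of 0 vs. 1 is debated further down the thread):

	static struct stack dummy_stack;
	static struct stack failure_stack;

	static void __init init_page_owner_stack_list(void)
	{
		dummy_stack.stack_record =
			__stack_depot_get_stack_record(dummy_handle);
		failure_stack.stack_record =
			__stack_depot_get_stack_record(failure_handle);

		/* Refcount of 0 to catch spurious increments. */
		refcount_set(&dummy_stack.stack_record->count, 0);
		refcount_set(&failure_stack.stack_record->count, 0);

		/* stack_list is never NULL afterwards. */
		dummy_stack.next = &failure_stack;
		stack_list = &dummy_stack;
	}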


2024-02-13 15:00:14

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v8 4/5] mm,page_owner: Filter out stacks by a threshold

On 2/13/24 10:21, Oscar Salvador wrote:
> On Tue, Feb 13, 2024 at 09:44:47AM +0100, Marco Elver wrote:
>> On Mon, 12 Feb 2024 at 23:29, Oscar Salvador <[email protected]>
>> wrote:
>>> + debugfs_create_file("set_threshold", 0600, dir, NULL,
>>> + &proc_page_owner_threshold);
>>
>> Can't you also read from "set_threshold", so the name "set_threshold"
>> is misleading. Why not just "threshold"?
>
> Yes, it can also be read. I guess I was too focused on the set part.

Maybe name it "count_threshold" so it's more clear of whate exactly it's the
threshold of?

> I will rename that one, plus add the {READ,WRITE}_ONCE.
>
> Thanks!
>
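For reference, a minimal sketch of a readable/writable threshold file with
the {READ,WRITE}_ONCE annotations ('dir' is the page_owner_stacks debugfs
directory from the patch; the attribute wiring below is a hypothetical
stand-in for the patch's actual proc_page_owner_threshold fops):

	static u64 page_owner_stack_threshold;

	static int page_owner_threshold_get(void *data, u64 *val)
	{
		*val = READ_ONCE(page_owner_stack_threshold);
		return 0;
	}

	static int page_owner_threshold_set(void *data, u64 val)
	{
		WRITE_ONCE(page_owner_stack_threshold, val);
		return 0;
	}

	DEFINE_SIMPLE_ATTRIBUTE(proc_page_owner_threshold,
				page_owner_threshold_get,
				page_owner_threshold_set, "%llu\n");

	/* At init time, with the renamed file: */
	debugfs_create_file("count_threshold", 0600, dir, NULL,
			    &proc_page_owner_threshold);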


2024-02-13 15:28:40

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v8 2/5] mm,page_owner: Implement the tracking of the stacks count

On Tue, Feb 13, 2024 at 02:42:25PM +0100, Vlastimil Babka wrote:
> On 2/12/24 23:30, Oscar Salvador wrote:
> > __set_page_owner_handle(page_ext, handle, order, gfp_mask);
> > page_ext_put(page_ext);
> > + inc_stack_record_count(handle);
>
> What if this is the dummy handle, which means we have recursed in page
> owner, and we'll be trying to kmalloc() its struct stack and link it to
> the stack_list because it was returned for the first time? Also
> failure_handle.
> Could you pre-create static (not kmalloc) struct stacks for these handles
> with a refcount of 0 and insert them into stack_list, all during
> init_page_owner()? Bonus: no longer treating stack_list == NULL in a
> special way in add_stack_record_to_list() (although you don't need to
> handle it extra even now, AFAICS).

Good catch. I did not think about this scenario, but this could
definitely happen.
Yeah, maybe create an array of 2 structs for {dummy,failure}_handle
and link them into stack_list.

I thought about giving them a refcount of 1, because we only print
stacks whose refcount is > 1 anyway, but setting it to 0 comes with
the advantage of catching spurious increments, should someone call
refcount_inc on those (which should not really happen).

I will try to implement it.

Thanks

--
Oscar Salvador
SUSE Labs

2024-02-13 15:32:34

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v8 3/5] mm,page_owner: Display all stacks and their count

On Tue, Feb 13, 2024 at 03:25:26PM +0100, Vlastimil Babka wrote:
> On 2/12/24 23:30, Oscar Salvador wrote:
> > +static int stack_print(struct seq_file *m, void *v)
> > +{
> > + char *buf;
> > + int ret = 0;
> > + struct stack *stack = v;
> > + struct stack_record *stack_record = stack->stack_record;
> > +
> > + if (!stack_record->size || stack_record->size < 0 ||
> > + refcount_read(&stack_record->count) < 2)
> > + return 0;
> > +
> > + buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > +
> > + ret += stack_trace_snprint(buf, PAGE_SIZE, stack_record->entries,
> > + stack_record->size, 0);
> > + if (!ret)
> > + goto out;
> > +
> > + scnprintf(buf + ret, PAGE_SIZE - ret, "stack_count: %d\n\n",
> > + refcount_read(&stack_record->count));
> > +
> > + seq_printf(m, buf);
> > + seq_puts(m, "\n\n");
> > +out:
> > + kfree(buf);
>
> Seems rather wasteful to do kzalloc/kfree so you can print into that buffer
> first and then print/copy it again using seq_printf. If you give up on using
> stack_trace_snprint() it's not much harder to print the stack directly with
> a loop of seq_printf. See e.g. slab_debugfs_show().

Well, I thought about not reinventing the wheel there, but fair enough:
performing a kmalloc/kfree op on every print might be suboptimal.
I will try to do it with seq_printf alone.
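Something like this, perhaps, modeled loosely on slab_debugfs_show()
(untested sketch, not the actual respin; names are the ones from this
series):

	static int stack_print(struct seq_file *m, void *v)
	{
		int i;
		struct stack *stack = v;
		struct stack_record *stack_record = stack->stack_record;

		if (!stack_record->size || stack_record->size < 0 ||
		    refcount_read(&stack_record->count) < 2)
			return 0;

		/* Print each frame symbolically; no temporary buffer needed. */
		for (i = 0; i < stack_record->size; i++)
			seq_printf(m, " %pS\n",
				   (void *)stack_record->entries[i]);
		seq_printf(m, "stack_count: %d\n\n",
			   refcount_read(&stack_record->count));

		return 0;
	}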

Thanks


--
Oscar Salvador
SUSE Labs

2024-02-13 16:03:47

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v8 2/5] mm,page_owner: Implement the tracking of the stacks count

On Tue, Feb 13, 2024 at 04:29:28PM +0100, Oscar Salvador wrote:
> I thought about giving them a refcount of 1, because we only print
> stacks whose refcount is > 1 anyway, but setting it to 0 comes with
> the advantage of catching spurious increments, should someone call
> refcount_inc on those (which should not really happen).

On second thought, I think we want a refcount of 1 for these stacks
from the beginning.
That way, should we e.g. re-enter page_owner, we would increment
dummy_stack's refcount; if the refcount were 0, we would get a warning.



--
Oscar Salvador
SUSE Labs