Subject: Re: [PATCH v8 2/5] mm,page_owner: Implement the tracking of the stacks count
From: Vlastimil Babka
To: Oscar Salvador, Andrew Morton
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko, Marco Elver, Andrey Konovalov, Alexander Potapenko
Date: Tue, 13 Feb 2024 10:16:04 +0100
References: <20240212223029.30769-1-osalvador@suse.de> <20240212223029.30769-3-osalvador@suse.de>
In-Reply-To: <20240212223029.30769-3-osalvador@suse.de>

On 2/12/24 23:30, Oscar Salvador wrote:
> page_owner needs to increment a stack_record refcount when a new
> allocation occurs, and decrement it on a free operation.
> In order to do that, we need a way to get a stack_record from a
> handle. Implement __stack_depot_get_stack_record(), which does just
> that, and make it public so page_owner can use it.
> 
> Also implement {inc,dec}_stack_record_count(), which increment or
> decrement the refcount on the respective allocation and free
> operations, via __set_page_owner() (alloc operation) and
> __reset_page_owner() (free operation).
> 
> Traversing all stackdepot buckets comes with its own complexity, and
> we would also have to implement a way to mark only those stack_records
> that originated from page_owner, as those are the ones we are
> interested in. For that reason, page_owner maintains its own list of
> stack_records: traversing that list is faster than traversing all
> buckets, while keeping the complexity low.
> inc_stack_record_count() is responsible for adding new stack_records
> to the stack_list list.
> 
> Modifications to the list are protected by a spinlock with IRQs
> disabled, since this code can also be reached from IRQ context.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  include/linux/stackdepot.h |  9 +++++
>  lib/stackdepot.c           |  8 +++++
>  mm/page_owner.c            | 73 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 90 insertions(+)

...
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -36,6 +36,14 @@ struct page_owner {
>  	pid_t free_tgid;
>  };
>  
> +struct stack {
> +	struct stack_record *stack_record;
> +	struct stack *next;
> +};
> +
> +static struct stack *stack_list;
> +static DEFINE_SPINLOCK(stack_list_lock);
> +
>  static bool page_owner_enabled __initdata;
>  DEFINE_STATIC_KEY_FALSE(page_owner_inited);
>  
> @@ -61,6 +69,57 @@ static __init bool need_page_owner(void)
>  	return page_owner_enabled;
>  }
>  
> +static void add_stack_record_to_list(struct stack_record *stack_record)
> +{
> +	unsigned long flags;
> +	struct stack *stack;
> +
> +	stack = kmalloc(sizeof(*stack), GFP_KERNEL);

I doubt you can use GFP_KERNEL unconditionally here? I think you need to
pass down the gfp flags from __set_page_owner().

And what about the allocation failure case? This will just leave the
stack_record unlinked forever. Can we somehow know which ones we failed
to link, and try again next time? It is probably easier to not record
the stack for the page at all in that case, so the next attempt with the
same stack looks like the very first one again and attempts the add to
the list. (See the sketch at the end of this mail for what I mean by
passing the gfp flags down.)

I'm still not happy that these extra tracking objects are needed, but I
guess the per-user stack depot instances I suggested would be a major
change.

> +	if (stack) {
> +		stack->stack_record = stack_record;
> +		stack->next = NULL;
> +
> +		spin_lock_irqsave(&stack_list_lock, flags);
> +		if (!stack_list) {
> +			stack_list = stack;
> +		} else {
> +			stack->next = stack_list;
> +			stack_list = stack;
> +		}
> +		spin_unlock_irqrestore(&stack_list_lock, flags);
> +	}
> +}
> +
> +static void inc_stack_record_count(depot_stack_handle_t handle)
> +{
> +	struct stack_record *stack_record = __stack_depot_get_stack_record(handle);
> +
> +	if (stack_record) {
> +		/*
> +		 * New stack_record's that do not use STACK_DEPOT_FLAG_GET start
> +		 * with REFCOUNT_SATURATED to catch spurious increments of their
> +		 * refcount.
> +		 * Since we do not use STACK_DEPOT_FLAG_{GET,PUT} API, let us
> +		 * set a refcount of 1 ourselves.
> +		 */
> +		if (refcount_read(&stack_record->count) == REFCOUNT_SATURATED) {
> +			refcount_set(&stack_record->count, 1);

Isn't this racy? Shouldn't we use some atomic cmpxchg operation to
change from REFCOUNT_SATURATED to 1? (Again, a sketch below.)

> +			/* Add the new stack_record to our list */
> +			add_stack_record_to_list(stack_record);
> +		}
> +		refcount_inc(&stack_record->count);
> +	}
> +}
> +
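
Concretely, for the gfp flags, something like the untested sketch below
is what I have in mind. Filtering the caller's mask with
GFP_RECLAIM_MASK | GFP_BOOT_MASK (from mm/internal.h) is my assumption
of how you would sanitize it before reusing it for this internal
allocation, not something your patch does today. Note also that the
if/else on stack_list can be collapsed: prepending works for the empty
list too, since stack->next simply becomes NULL.

static void add_stack_record_to_list(struct stack_record *stack_record,
				     gfp_t gfp_mask)
{
	unsigned long flags;
	struct stack *stack;

	/*
	 * Keep only the reclaim-relevant bits of the caller's mask, so
	 * we don't inherit flags that make no sense for this internal
	 * allocation (assumption: same filtering the allocator applies).
	 */
	gfp_mask &= (GFP_RECLAIM_MASK | GFP_BOOT_MASK);
	stack = kmalloc(sizeof(*stack), gfp_mask);
	if (!stack)
		return;

	stack->stack_record = stack_record;

	/* Prepend to stack_list; irqsave as this can run in IRQ context. */
	spin_lock_irqsave(&stack_list_lock, flags);
	stack->next = stack_list;
	stack_list = stack;
	spin_unlock_irqrestore(&stack_list_lock, flags);
}

__set_page_owner() would then just pass its gfp_mask through
inc_stack_record_count(handle, gfp_mask).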
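
And for the REFCOUNT_SATURATED transition, I mean something like this
(again untested). refcount_t has no cmpxchg helper, so this reaches into
the underlying atomic_t directly; maybe there is a nicer way:

static void inc_stack_record_count(depot_stack_handle_t handle,
				   gfp_t gfp_mask)
{
	struct stack_record *stack_record = __stack_depot_get_stack_record(handle);

	if (!stack_record)
		return;

	if (refcount_read(&stack_record->count) == REFCOUNT_SATURATED) {
		int old = REFCOUNT_SATURATED;

		/*
		 * Only one context can win the transition from
		 * REFCOUNT_SATURATED to 1, so only the winner links the
		 * stack_record into stack_list; the losers just fall
		 * through to the refcount_inc() below.
		 */
		if (atomic_try_cmpxchg_relaxed(&stack_record->count.refs,
					       &old, 1))
			add_stack_record_to_list(stack_record, gfp_mask);
	}
	refcount_inc(&stack_record->count);
}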