For simplification 991e7673859e ("mm: memcontrol: account kernel stack
per node") has changed the per zone vmalloc backed stack pages
accounting to per node. By doing that we have lost a certain precision
because those pages might live in different NUMA nodes. In the end
NR_KERNEL_STACK_KB exported to the userspace might be over estimated on
some nodes while underestimated on others.
This doesn't impose any real problem to correctnes of the kernel
behavior as the counter is not used for any internal processing but it
can cause some confusion to the userspace.
Address the problem by accounting each vmalloc backing page to its own
node.
Fixes: 991e7673859e ("mm: memcontrol: account kernel stack per node")
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
---
Changelog in v2:
- Rework commit log suggested by Michal.
Thanks to Michal and Shakeel for review.
kernel/fork.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index d66cd1014211..6e2201feb524 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -379,14 +379,19 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
void *stack = task_stack_page(tsk);
struct vm_struct *vm = task_stack_vm_area(tsk);
+ if (vm) {
+ int i;
- /* All stack pages are in the same node. */
- if (vm)
- mod_lruvec_page_state(vm->pages[0], NR_KERNEL_STACK_KB,
- account * (THREAD_SIZE / 1024));
- else
+ BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
+
+ for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+ mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB,
+ account * (PAGE_SIZE / 1024));
+ } else {
+ /* All stack pages are in the same node. */
mod_lruvec_kmem_state(stack, NR_KERNEL_STACK_KB,
account * (THREAD_SIZE / 1024));
+ }
}
static int memcg_charge_kernel_stack(struct task_struct *tsk)
--
2.11.0
On Wed 03-03-21 17:39:56, Muchun Song wrote:
> For simplification 991e7673859e ("mm: memcontrol: account kernel stack
> per node") has changed the per zone vmalloc backed stack pages
> accounting to per node. By doing that we have lost a certain precision
> because those pages might live in different NUMA nodes. In the end
> NR_KERNEL_STACK_KB exported to the userspace might be over estimated on
> some nodes while underestimated on others.
>
> This doesn't impose any real problem to correctnes of the kernel
> behavior as the counter is not used for any internal processing but it
> can cause some confusion to the userspace.
You have skipped over one part of the changelog I have proposed and that
is to provide an actual data.
> Address the problem by accounting each vmalloc backing page to its own
> node.
>
> Fixes: 991e7673859e ("mm: memcontrol: account kernel stack per node")
Fixes tag might make somebody assume this is worth backporting but I
highly doubt so.
> Signed-off-by: Muchun Song <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>
Anyway
Acked-by: Michal Hocko <[email protected]>
as the patch is correct with one comment below
> ---
> Changelog in v2:
> - Rework commit log suggested by Michal.
>
> Thanks to Michal and Shakeel for review.
>
> kernel/fork.c | 15 ++++++++++-----
> 1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d66cd1014211..6e2201feb524 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -379,14 +379,19 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
> void *stack = task_stack_page(tsk);
> struct vm_struct *vm = task_stack_vm_area(tsk);
>
> + if (vm) {
> + int i;
>
> - /* All stack pages are in the same node. */
> - if (vm)
> - mod_lruvec_page_state(vm->pages[0], NR_KERNEL_STACK_KB,
> - account * (THREAD_SIZE / 1024));
> - else
> + BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
I do not think we need this BUG_ON. What kind of purpose does it serve?
> +
> + for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
> + mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB,
> + account * (PAGE_SIZE / 1024));
> + } else {
> + /* All stack pages are in the same node. */
> mod_lruvec_kmem_state(stack, NR_KERNEL_STACK_KB,
> account * (THREAD_SIZE / 1024));
> + }
> }
>
> static int memcg_charge_kernel_stack(struct task_struct *tsk)
> --
> 2.11.0
--
Michal Hocko
SUSE Labs
On Wed, Mar 03, 2021 at 05:39:56PM +0800, Muchun Song wrote:
> For simplification 991e7673859e ("mm: memcontrol: account kernel stack
> per node") has changed the per zone vmalloc backed stack pages
> accounting to per node. By doing that we have lost a certain precision
> because those pages might live in different NUMA nodes. In the end
> NR_KERNEL_STACK_KB exported to the userspace might be over estimated on
> some nodes while underestimated on others.
>
> This doesn't impose any real problem to correctnes of the kernel
> behavior as the counter is not used for any internal processing but it
> can cause some confusion to the userspace.
>
> Address the problem by accounting each vmalloc backing page to its own
> node.
>
> Fixes: 991e7673859e ("mm: memcontrol: account kernel stack per node")
> Signed-off-by: Muchun Song <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>
With changes proposed by Shakeel and Michal, this looks good to me.
Acked-by: Johannes Weiner <[email protected]>
I guess the BUG_ON() was inspired by memcg_charge_kernel_stack() - not
really your fault for following that example. But yeah, please drop it
from your patch. Thanks!