The alloc_thread_stack_node() cannot guarantee that allocated stack pages
are in the same node when CONFIG_VMAP_STACK. Because we do not specify
__GFP_THISNODE to __vmalloc_node_range(). Fix it by caling
mod_lruvec_page_state() for each page one by one.
Fixes: 991e7673859e ("mm: memcontrol: account kernel stack per node")
Signed-off-by: Muchun Song <[email protected]>
---
kernel/fork.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index d66cd1014211..6e2201feb524 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -379,14 +379,19 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
 	void *stack = task_stack_page(tsk);
 	struct vm_struct *vm = task_stack_vm_area(tsk);
 
+	if (vm) {
+		int i;
 
-	/* All stack pages are in the same node. */
-	if (vm)
-		mod_lruvec_page_state(vm->pages[0], NR_KERNEL_STACK_KB,
-				      account * (THREAD_SIZE / 1024));
-	else
+		BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
+
+		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
+			mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB,
+					      account * (PAGE_SIZE / 1024));
+	} else {
+		/* All stack pages are in the same node. */
 		mod_lruvec_kmem_state(stack, NR_KERNEL_STACK_KB,
 				      account * (THREAD_SIZE / 1024));
+	}
 }
 
 static int memcg_charge_kernel_stack(struct task_struct *tsk)
--
2.11.0
On Tue 02-03-21 06:02:14, Shakeel Butt wrote:
> On Mon, Mar 1, 2021 at 11:52 PM Muchun Song <[email protected]> wrote:
> >
> > The alloc_thread_stack_node() cannot guarantee that allocated stack pages
> > are in the same node when CONFIG_VMAP_STACK. Because we do not specify
> > __GFP_THISNODE to __vmalloc_node_range().
>
> Instead of __GFP_THISNODE, mention that kernel_clone() passes
> NUMA_NO_NODE, which is then used for __vmalloc_node_range().
If we really want to do this then I would recommend reasoning along the
following lines:
"
For simplification, 991e7673859e ("mm: memcontrol: account kernel stack
per node") changed the accounting of vmalloc-backed stack pages from per
zone to per node. By doing that we have lost a certain precision, because
those pages might live in different NUMA nodes. In the end,
NR_KERNEL_STACK_KB exported to userspace might be overestimated on
some nodes while underestimated on others.
< some examples would go here ideally >
This doesn't pose any real problem for the correctness of the kernel
behavior, as the counter is not used for any internal processing, but it
can cause some confusion in userspace.
Address the problem by accounting each vmalloc backing page to its own
node.
"
--
Michal Hocko
SUSE Labs
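To make the over- and underestimation described above concrete, here is a
small userspace model of the two accounting schemes. It is only a sketch
under stated assumptions, not kernel code: the MODEL_* constants, struct
node_stats, and both helper functions are hypothetical names invented for
illustration, the 4 KiB page and 16 KiB stack sizes are assumed values, and
the page-to-node placement is supplied by the caller rather than taken from
a real allocator.

```c
#include <assert.h>

/* Assumed sizes for illustration; real values depend on arch and config. */
#define MODEL_PAGE_SIZE   4096
#define MODEL_THREAD_SIZE (4 * MODEL_PAGE_SIZE)
#define MODEL_NR_PAGES    (MODEL_THREAD_SIZE / MODEL_PAGE_SIZE)
#define MODEL_MAX_NODES   4

/* Per-node counter in KiB, standing in for NR_KERNEL_STACK_KB. */
struct node_stats {
	long stack_kb[MODEL_MAX_NODES];
};

/*
 * Models the pre-patch behaviour: the whole stack is charged to the
 * node that holds the first backing page.
 */
static void account_first_page(struct node_stats *s, const int *page_node,
			       int nr_pages, int account)
{
	assert(nr_pages == MODEL_NR_PAGES);
	s->stack_kb[page_node[0]] += account * (MODEL_THREAD_SIZE / 1024);
}

/*
 * Models the patched behaviour: each backing page is charged to the
 * node it actually lives on, one page at a time.
 */
static void account_per_page(struct node_stats *s, const int *page_node,
			     int nr_pages, int account)
{
	int i;

	/* Userspace stand-in for the BUG_ON() check in the patch. */
	assert(nr_pages == MODEL_NR_PAGES);
	for (i = 0; i < nr_pages; i++)
		s->stack_kb[page_node[i]] += account * (MODEL_PAGE_SIZE / 1024);
}
```

With the four pages placed on nodes {0, 0, 1, 1}, account_first_page()
reports 16 KiB on node 0 and nothing on node 1, while account_per_page()
reports 8 KiB on each node. The total is the same in both cases; only the
per-node split differs.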
On Mon, Mar 1, 2021 at 11:52 PM Muchun Song <[email protected]> wrote:
>
> The alloc_thread_stack_node() cannot guarantee that allocated stack pages
> are in the same node when CONFIG_VMAP_STACK. Because we do not specify
> __GFP_THISNODE to __vmalloc_node_range().
Instead of __GFP_THISNODE, mention that kernel_clone() passes
NUMA_NO_NODE, which is then used for __vmalloc_node_range().
> Fix it by caling
calling
> mod_lruvec_page_state() for each page one by one.
>
> Fixes: 991e7673859e ("mm: memcontrol: account kernel stack per node")
> Signed-off-by: Muchun Song <[email protected]>
Please follow Michal's suggestion to update the commit message.
After that:
Reviewed-by: Shakeel Butt <[email protected]>
On Tue, Mar 02, 2021 at 03:37:33PM +0800, Muchun Song wrote:
> The alloc_thread_stack_node() cannot guarantee that allocated stack pages
> are in the same node when CONFIG_VMAP_STACK. Because we do not specify
> __GFP_THISNODE to __vmalloc_node_range(). Fix it by caling
> mod_lruvec_page_state() for each page one by one.
Hm, I actually wonder if it makes any sense to split the stack over multiple
nodes? Maybe we should fix this instead?
On Tue 02-03-21 10:50:32, Roman Gushchin wrote:
> On Tue, Mar 02, 2021 at 03:37:33PM +0800, Muchun Song wrote:
> > The alloc_thread_stack_node() cannot guarantee that allocated stack pages
> > are in the same node when CONFIG_VMAP_STACK. Because we do not specify
> > __GFP_THISNODE to __vmalloc_node_range(). Fix it by caling
> > mod_lruvec_page_state() for each page one by one.
>
> Hm, I actually wonder if it makes any sense to split the stack over multiple
> nodes? Maybe we should fix this instead?
While this is not ideal, I am not sure it is an actual problem worth
complicating the code for. I am pretty sure this would quickly grow
into a trickier problem (e.g. proper memory policy handling).
--
Michal Hocko
SUSE Labs
On Tue, Mar 02, 2021 at 08:33:20PM +0100, Michal Hocko wrote:
> On Tue 02-03-21 10:50:32, Roman Gushchin wrote:
> > On Tue, Mar 02, 2021 at 03:37:33PM +0800, Muchun Song wrote:
> > > The alloc_thread_stack_node() cannot guarantee that allocated stack pages
> > > are in the same node when CONFIG_VMAP_STACK. Because we do not specify
> > > __GFP_THISNODE to __vmalloc_node_range(). Fix it by caling
> > > mod_lruvec_page_state() for each page one by one.
> >
> > Hm, I actually wonder if it makes any sense to split the stack over multiple
> > nodes? Maybe we should fix this instead?
>
> While this is not ideal, I am not sure it is an actual problem worth
> complicating the code for. I am pretty sure this would quickly grow
> into a trickier problem (e.g. proper memory policy handling).
I agree, and IMO accounting a couple of pages to a different node
is an even smaller problem.