2024-02-19 04:19:52

by Byungchul Park

[permalink] [raw]
Subject: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes

Changes from v3:
1. Rewrite the comment in code and the commit message to make it
more clear. (feedbacked by Oscar Salvador)
2. Add "Reviewed-by: Oscar Salvador <[email protected]>"

Changes from v2:
1. Rewrite the comment in code and the commit message becasue it
turns out that this patch is not the real fix for the oops
descriped. The real fix goes in another patch below:

https://lore.kernel.org/lkml/[email protected]/

Changes from v1:
1. Trim the verbose oops in the commit message. (feedbacked by
Phil Auld)
2. Rewrite a comment in code. (feedbacked by Phil Auld)

--->8---
From 98f5d472c08e3ed5b9b6543290d392a2e50fcf3c Mon Sep 17 00:00:00 2001
From: Byungchul Park <[email protected]>
Date: Mon, 19 Feb 2024 13:10:47 +0900
Subject: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes

Memoryless nodes do not have any memory to migrate to, so stop trying
it.

Signed-off-by: Byungchul Park <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
---
kernel/sched/fair.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..3e3b44ae72d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1828,6 +1828,12 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
int dst_nid = cpu_to_node(dst_cpu);
int last_cpupid, this_cpupid;

+ /*
+ * Cannot migrate to memoryless nodes.
+ */
+ if (!node_state(dst_nid, N_MEMORY))
+ return false;
+
/*
* The pages in slow memory node should be migrated according
* to hot/cold instead of private/shared.
--
2.17.1



2024-02-19 08:46:48

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes

Hi, Byungchul,

Byungchul Park <[email protected]> writes:

> Changes from v3:
> 1. Rewrite the comment in code and the commit message to make it
> more clear. (feedbacked by Oscar Salvador)
> 2. Add "Reviewed-by: Oscar Salvador <[email protected]>"
>
> Changes from v2:
> 1. Rewrite the comment in code and the commit message becasue it
> turns out that this patch is not the real fix for the oops
> descriped. The real fix goes in another patch below:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Changes from v1:
> 1. Trim the verbose oops in the commit message. (feedbacked by
> Phil Auld)
> 2. Rewrite a comment in code. (feedbacked by Phil Auld)
>
> --->8---
>>From 98f5d472c08e3ed5b9b6543290d392a2e50fcf3c Mon Sep 17 00:00:00 2001
> From: Byungchul Park <[email protected]>
> Date: Mon, 19 Feb 2024 13:10:47 +0900
> Subject: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes
>
> Memoryless nodes do not have any memory to migrate to, so stop trying
> it.
>
> Signed-off-by: Byungchul Park <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> ---
> kernel/sched/fair.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d7a3c63a2171..3e3b44ae72d1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1828,6 +1828,12 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> int dst_nid = cpu_to_node(dst_cpu);
> int last_cpupid, this_cpupid;
>
> + /*
> + * Cannot migrate to memoryless nodes.
> + */
> + if (!node_state(dst_nid, N_MEMORY))
> + return false;
> +
> /*
> * The pages in slow memory node should be migrated according
> * to hot/cold instead of private/shared.

Good catch!

IIUC, you will use patch as fix to the issue in

https://lore.kernel.org/lkml/[email protected]/

If so, we need the Fixes: tag to make it land in -stable properly.

Feel free to add

Reviewed-by: "Huang, Ying" <[email protected]>

2024-02-19 15:06:43

by Phil Auld

[permalink] [raw]
Subject: Re: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes

On Mon, Feb 19, 2024 at 01:19:20PM +0900 Byungchul Park wrote:
> Changes from v3:
> 1. Rewrite the comment in code and the commit message to make it
> more clear. (feedbacked by Oscar Salvador)
> 2. Add "Reviewed-by: Oscar Salvador <[email protected]>"
>
> Changes from v2:
> 1. Rewrite the comment in code and the commit message becasue it
> turns out that this patch is not the real fix for the oops
> descriped. The real fix goes in another patch below:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Changes from v1:
> 1. Trim the verbose oops in the commit message. (feedbacked by
> Phil Auld)
> 2. Rewrite a comment in code. (feedbacked by Phil Auld)
>
> --->8---
> From 98f5d472c08e3ed5b9b6543290d392a2e50fcf3c Mon Sep 17 00:00:00 2001
> From: Byungchul Park <[email protected]>
> Date: Mon, 19 Feb 2024 13:10:47 +0900
> Subject: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes
>
> Memoryless nodes do not have any memory to migrate to, so stop trying
> it.
>
> Signed-off-by: Byungchul Park <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> ---
> kernel/sched/fair.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d7a3c63a2171..3e3b44ae72d1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1828,6 +1828,12 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> int dst_nid = cpu_to_node(dst_cpu);
> int last_cpupid, this_cpupid;
>
> + /*
> + * Cannot migrate to memoryless nodes.
> + */
> + if (!node_state(dst_nid, N_MEMORY))
> + return false;
> +
> /*
> * The pages in slow memory node should be migrated according
> * to hot/cold instead of private/shared.
> --
> 2.17.1
>
>

Sorry, hadn't gotten far enough to see this version when I replied to v3...

Reviewed-by: Phil Auld <[email protected]>


Cheers,
Phil

--


2024-02-20 01:45:17

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes

On Mon, 19 Feb 2024 16:43:36 +0800 "Huang, Ying" <[email protected]> wrote:

> > + /*
> > + * Cannot migrate to memoryless nodes.
> > + */
> > + if (!node_state(dst_nid, N_MEMORY))
> > + return false;
> > +
> > /*
> > * The pages in slow memory node should be migrated according
> > * to hot/cold instead of private/shared.
>
> Good catch!
>
> IIUC, you will use patch as fix to the issue in
>
> https://lore.kernel.org/lkml/[email protected]/
>
> If so, we need the Fixes: tag to make it land in -stable properly.

Yes, this changelog is missing rather a lot of important information.

I pulled together the below, please check.


From: Byungchul Park <[email protected]>
Subject: sched/numa, mm: do not try to migrate memory to memoryless nodes
Date: Mon, 19 Feb 2024 13:10:47 +0900

With numa balancing on, when a numa system is running where a numa node
doesn't have its local memory so it has no managed zones, the following
oops has been observed. It's because wakeup_kswapd() is called with a
wrong zone index, -1. Fixed it by checking the index before calling
wakeup_kswapd().

> BUG: unable to handle page fault for address: 00000000000033f3
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> Code: (omitted)
> RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> FS: 00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
> <TASK>
> ? __die
> ? page_fault_oops
> ? __pte_offset_map_lock
> ? exc_page_fault
> ? asm_exc_page_fault
> ? wakeup_kswapd
> migrate_misplaced_page
> __handle_mm_fault
> handle_mm_fault
> do_user_addr_fault
> exc_page_fault
> asm_exc_page_fault
> RIP: 0033:0x55b897ba0808
> Code: (omitted)
> RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
> RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
> RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
> RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
> R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
> </TASK>

Fix this by avoiding any attempt to migrate memory to memoryless nodes.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system")
Signed-off-by: Byungchul Park <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Reviewed-by: Phil Auld <[email protected]>
Cc: Benjamin Segall <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

kernel/sched/fair.c | 6 ++++++
1 file changed, 6 insertions(+)

--- a/kernel/sched/fair.c~sched-numa-mm-do-not-try-to-migrate-memory-to-memoryless-nodes
+++ a/kernel/sched/fair.c
@@ -1831,6 +1831,12 @@ bool should_numa_migrate_memory(struct t
int last_cpupid, this_cpupid;

/*
+ * Cannot migrate to memoryless nodes.
+ */
+ if (!node_state(dst_nid, N_MEMORY))
+ return false;
+
+ /*
* The pages in slow memory node should be migrated according
* to hot/cold instead of private/shared.
*/
_


2024-02-20 01:54:03

by Byungchul Park

[permalink] [raw]
Subject: Re: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes

On Mon, Feb 19, 2024 at 05:45:08PM -0800, Andrew Morton wrote:
> On Mon, 19 Feb 2024 16:43:36 +0800 "Huang, Ying" <[email protected]> wrote:
>
> > > + /*
> > > + * Cannot migrate to memoryless nodes.
> > > + */
> > > + if (!node_state(dst_nid, N_MEMORY))
> > > + return false;
> > > +
> > > /*
> > > * The pages in slow memory node should be migrated according
> > > * to hot/cold instead of private/shared.
> >
> > Good catch!
> >
> > IIUC, you will use patch as fix to the issue in
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > If so, we need the Fixes: tag to make it land in -stable properly.
>
> Yes, this changelog is missing rather a lot of important information.

This is not the root cause fix any more but just optimization. That's
why I didn't add Fixes: tag and cc [email protected] in here.

Instead, I added Fixes: tag and cc'ed [email protected] in the real
fix patch. check the following link please:

https://lore.kernel.org/lkml/[email protected]/

Byungchul

> I pulled together the below, please check.
>
>
> From: Byungchul Park <[email protected]>
> Subject: sched/numa, mm: do not try to migrate memory to memoryless nodes
> Date: Mon, 19 Feb 2024 13:10:47 +0900
>
> With numa balancing on, when a numa system is running where a numa node
> doesn't have its local memory so it has no managed zones, the following
> oops has been observed. It's because wakeup_kswapd() is called with a
> wrong zone index, -1. Fixed it by checking the index before calling
> wakeup_kswapd().
>
> > BUG: unable to handle page fault for address: 00000000000033f3
> > #PF: supervisor read access in kernel mode
> > #PF: error_code(0x0000) - not-present page
> > PGD 0 P4D 0
> > Oops: 0000 [#1] PREEMPT SMP NOPTI
> > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> > Code: (omitted)
> > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> > FS: 00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > PKRU: 55555554
> > Call Trace:
> > <TASK>
> > ? __die
> > ? page_fault_oops
> > ? __pte_offset_map_lock
> > ? exc_page_fault
> > ? asm_exc_page_fault
> > ? wakeup_kswapd
> > migrate_misplaced_page
> > __handle_mm_fault
> > handle_mm_fault
> > do_user_addr_fault
> > exc_page_fault
> > asm_exc_page_fault
> > RIP: 0033:0x55b897ba0808
> > Code: (omitted)
> > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
> > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
> > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
> > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
> > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
> > </TASK>
>
> Fix this by avoiding any attempt to migrate memory to memoryless nodes.
>
> Link: https://lkml.kernel.org/r/[email protected]
> Link: https://lkml.kernel.org/r/[email protected]
> Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system")
> Signed-off-by: Byungchul Park <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Reviewed-by: "Huang, Ying" <[email protected]>
> Reviewed-by: Phil Auld <[email protected]>
> Cc: Benjamin Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> kernel/sched/fair.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> --- a/kernel/sched/fair.c~sched-numa-mm-do-not-try-to-migrate-memory-to-memoryless-nodes
> +++ a/kernel/sched/fair.c
> @@ -1831,6 +1831,12 @@ bool should_numa_migrate_memory(struct t
> int last_cpupid, this_cpupid;
>
> /*
> + * Cannot migrate to memoryless nodes.
> + */
> + if (!node_state(dst_nid, N_MEMORY))
> + return false;
> +
> + /*
> * The pages in slow memory node should be migrated according
> * to hot/cold instead of private/shared.
> */
> _

2024-02-20 02:08:16

by Byungchul Park

[permalink] [raw]
Subject: Re: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes

On Mon, Feb 19, 2024 at 05:45:08PM -0800, Andrew Morton wrote:
> On Mon, 19 Feb 2024 16:43:36 +0800 "Huang, Ying" <[email protected]> wrote:
>
> > > + /*
> > > + * Cannot migrate to memoryless nodes.
> > > + */
> > > + if (!node_state(dst_nid, N_MEMORY))
> > > + return false;
> > > +
> > > /*
> > > * The pages in slow memory node should be migrated according
> > > * to hot/cold instead of private/shared.
> >
> > Good catch!
> >
> > IIUC, you will use patch as fix to the issue in
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > If so, we need the Fixes: tag to make it land in -stable properly.
>
> Yes, this changelog is missing rather a lot of important information.
>
> I pulled together the below, please check.

Plus, two guys gave the tags for the optimization patch as well:

Reviewed-by: Davidlohr Bueso <[email protected]>
Acked-by: David Hildenbrand <[email protected]>

Byungchul

> From: Byungchul Park <[email protected]>
> Subject: sched/numa, mm: do not try to migrate memory to memoryless nodes
> Date: Mon, 19 Feb 2024 13:10:47 +0900
>
> With numa balancing on, when a numa system is running where a numa node
> doesn't have its local memory so it has no managed zones, the following
> oops has been observed. It's because wakeup_kswapd() is called with a
> wrong zone index, -1. Fixed it by checking the index before calling
> wakeup_kswapd().
>
> > BUG: unable to handle page fault for address: 00000000000033f3
> > #PF: supervisor read access in kernel mode
> > #PF: error_code(0x0000) - not-present page
> > PGD 0 P4D 0
> > Oops: 0000 [#1] PREEMPT SMP NOPTI
> > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> > Code: (omitted)
> > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> > FS: 00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > PKRU: 55555554
> > Call Trace:
> > <TASK>
> > ? __die
> > ? page_fault_oops
> > ? __pte_offset_map_lock
> > ? exc_page_fault
> > ? asm_exc_page_fault
> > ? wakeup_kswapd
> > migrate_misplaced_page
> > __handle_mm_fault
> > handle_mm_fault
> > do_user_addr_fault
> > exc_page_fault
> > asm_exc_page_fault
> > RIP: 0033:0x55b897ba0808
> > Code: (omitted)
> > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
> > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
> > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
> > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
> > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
> > </TASK>
>
> Fix this by avoiding any attempt to migrate memory to memoryless nodes.
>
> Link: https://lkml.kernel.org/r/[email protected]
> Link: https://lkml.kernel.org/r/[email protected]
> Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system")
> Signed-off-by: Byungchul Park <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Reviewed-by: "Huang, Ying" <[email protected]>
> Reviewed-by: Phil Auld <[email protected]>
> Cc: Benjamin Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> kernel/sched/fair.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> --- a/kernel/sched/fair.c~sched-numa-mm-do-not-try-to-migrate-memory-to-memoryless-nodes
> +++ a/kernel/sched/fair.c
> @@ -1831,6 +1831,12 @@ bool should_numa_migrate_memory(struct t
> int last_cpupid, this_cpupid;
>
> /*
> + * Cannot migrate to memoryless nodes.
> + */
> + if (!node_state(dst_nid, N_MEMORY))
> + return false;
> +
> + /*
> * The pages in slow memory node should be migrated according
> * to hot/cold instead of private/shared.
> */
> _

2024-02-20 02:33:32

by Byungchul Park

[permalink] [raw]
Subject: Re: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes

On Mon, Feb 19, 2024 at 05:45:08PM -0800, Andrew Morton wrote:
> On Mon, 19 Feb 2024 16:43:36 +0800 "Huang, Ying" <[email protected]> wrote:
>
> > > + /*
> > > + * Cannot migrate to memoryless nodes.
> > > + */
> > > + if (!node_state(dst_nid, N_MEMORY))
> > > + return false;
> > > +
> > > /*
> > > * The pages in slow memory node should be migrated according
> > > * to hot/cold instead of private/shared.
> >
> > Good catch!
> >
> > IIUC, you will use patch as fix to the issue in
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > If so, we need the Fixes: tag to make it land in -stable properly.
>
> Yes, this changelog is missing rather a lot of important information.
>
> I pulled together the below, please check.

To make it more clear, I need to explain it more. I posted the following
two patches while resolving the oops issue. However, two are going on
for different purposes.

1) https://lkml.kernel.org/r/[email protected]

I started this patch as the fix for the oops. However, I found the
root cause comes from using -1 as an array index. So let the root
cause fix go with another thread, 2). Nevertheless, 1) is still
necessary as a *reasonable optimization* but not the real fix any
more.

2) https://lkml.kernel.org/r/[email protected]

I found the root cause of the oops comes from using -1 as an array
index. So moved all the oops message, Fixes: tag, and cc stable to
here. Long story short, 2) is the *real fix* for the oops.

Byungchul

> From: Byungchul Park <[email protected]>
> Subject: sched/numa, mm: do not try to migrate memory to memoryless nodes
> Date: Mon, 19 Feb 2024 13:10:47 +0900
>
> With numa balancing on, when a numa system is running where a numa node
> doesn't have its local memory so it has no managed zones, the following
> oops has been observed. It's because wakeup_kswapd() is called with a
> wrong zone index, -1. Fixed it by checking the index before calling
> wakeup_kswapd().
>
> > BUG: unable to handle page fault for address: 00000000000033f3
> > #PF: supervisor read access in kernel mode
> > #PF: error_code(0x0000) - not-present page
> > PGD 0 P4D 0
> > Oops: 0000 [#1] PREEMPT SMP NOPTI
> > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> > Code: (omitted)
> > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> > FS: 00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > PKRU: 55555554
> > Call Trace:
> > <TASK>
> > ? __die
> > ? page_fault_oops
> > ? __pte_offset_map_lock
> > ? exc_page_fault
> > ? asm_exc_page_fault
> > ? wakeup_kswapd
> > migrate_misplaced_page
> > __handle_mm_fault
> > handle_mm_fault
> > do_user_addr_fault
> > exc_page_fault
> > asm_exc_page_fault
> > RIP: 0033:0x55b897ba0808
> > Code: (omitted)
> > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
> > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
> > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
> > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
> > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
> > </TASK>
>
> Fix this by avoiding any attempt to migrate memory to memoryless nodes.
>
> Link: https://lkml.kernel.org/r/[email protected]
> Link: https://lkml.kernel.org/r/[email protected]
> Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system")
> Signed-off-by: Byungchul Park <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Reviewed-by: "Huang, Ying" <[email protected]>
> Reviewed-by: Phil Auld <[email protected]>
> Cc: Benjamin Segall <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Valentin Schneider <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> kernel/sched/fair.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> --- a/kernel/sched/fair.c~sched-numa-mm-do-not-try-to-migrate-memory-to-memoryless-nodes
> +++ a/kernel/sched/fair.c
> @@ -1831,6 +1831,12 @@ bool should_numa_migrate_memory(struct t
> int last_cpupid, this_cpupid;
>
> /*
> + * Cannot migrate to memoryless nodes.
> + */
> + if (!node_state(dst_nid, N_MEMORY))
> + return false;
> +
> + /*
> * The pages in slow memory node should be migrated according
> * to hot/cold instead of private/shared.
> */
> _

2024-02-20 03:05:27

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes

On Tue, 20 Feb 2024 10:53:43 +0900 Byungchul Park <[email protected]> wrote:

> > > IIUC, you will use patch as fix to the issue in
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > If so, we need the Fixes: tag to make it land in -stable properly.
> >
> > Yes, this changelog is missing rather a lot of important information.
>
> This is not the root cause fix any more but just optimization.

It would have been helpful to have told us this in the changelog :(

> That's
> why I didn't add Fixes: tag and cc [email protected] in here.
>
> Instead, I added Fixes: tag and cc'ed [email protected] in the real
> fix patch. check the following link please:
>
> https://lore.kernel.org/lkml/[email protected]/

But doesn't this patch "sched/numa, mm: do not try to migrate memory to
memoryless nodes" also fix the bug? Do we truly need both?

2024-02-20 03:20:29

by Byungchul Park

[permalink] [raw]
Subject: Re: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes

On Mon, Feb 19, 2024 at 07:05:20PM -0800, Andrew Morton wrote:
> On Tue, 20 Feb 2024 10:53:43 +0900 Byungchul Park <[email protected]> wrote:
>
> > > > IIUC, you will use patch as fix to the issue in
> > > >
> > > > https://lore.kernel.org/lkml/[email protected]/
> > > >
> > > > If so, we need the Fixes: tag to make it land in -stable properly.
> > >
> > > Yes, this changelog is missing rather a lot of important information.
> >
> > This is not the root cause fix any more but just optimization.
>
> It would have been helpful to have told us this in the changelog :(

Sorry for that and making you guys confused.

> > That's
> > why I didn't add Fixes: tag and cc [email protected] in here.
> >
> > Instead, I added Fixes: tag and cc'ed [email protected] in the real
> > fix patch. check the following link please:
> >
> > https://lore.kernel.org/lkml/[email protected]/
>
> But doesn't this patch "sched/numa, mm: do not try to migrate memory to
> memoryless nodes" also fix the bug? Do we truly need both?

Yes. The oops is gone with "sched/numa, mm: do not try to migrate memory
to memoryless nodes". However, as you know, wrongly manipulating array
index is very dangerous - what hackers are looking for. Even with security
in mind, both are necessary. Plus, no one gurantees the problematic code
is gone through with a numa node that has no managed zones in the future.

Byungchul

2024-02-20 03:28:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes

On Tue, 20 Feb 2024 11:33:04 +0900 Byungchul Park <[email protected]> wrote:

> > Yes, this changelog is missing rather a lot of important information.
> >
> > I pulled together the below, please check.
>
> To make it more clear, I need to explain it more. I posted the following
> two patches while resolving the oops issue. However, two are going on
> for different purposes.
>
> 1) https://lkml.kernel.org/r/[email protected]
>
> I started this patch as the fix for the oops. However, I found the
> root cause comes from using -1 as an array index. So let the root
> cause fix go with another thread, 2). Nevertheless, 1) is still
> necessary as a *reasonable optimization* but not the real fix any
> more.

Well I altered this patch's changelog to tell readers that it is an
optimization. But one does wonder why it isn't simply a bugfix.
Attempting to migrate to a memoryless node is clearly as error.
Presumably the called code handles it somehow, but in what fashion and
at what cost?

> 2) https://lkml.kernel.org/r/[email protected]
>
> I found the root cause of the oops comes from using -1 as an array
> index. So moved all the oops message, Fixes: tag, and cc stable to
> here. Long story short, 2) is the *real fix* for the oops.
>

2024-02-20 04:11:32

by Byungchul Park

[permalink] [raw]
Subject: Re: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes

On Mon, Feb 19, 2024 at 07:28:41PM -0800, Andrew Morton wrote:
> On Tue, 20 Feb 2024 11:33:04 +0900 Byungchul Park <[email protected]> wrote:
>
> > > Yes, this changelog is missing rather a lot of important information.
> > >
> > > I pulled together the below, please check.
> >
> > To make it more clear, I need to explain it more. I posted the following
> > two patches while resolving the oops issue. However, two are going on
> > for different purposes.
> >
> > 1) https://lkml.kernel.org/r/[email protected]
> >
> > I started this patch as the fix for the oops. However, I found the
> > root cause comes from using -1 as an array index. So let the root
> > cause fix go with another thread, 2). Nevertheless, 1) is still
> > necessary as a *reasonable optimization* but not the real fix any
> > more.
>
> Well I altered this patch's changelog to tell readers that it is an
> optimization. But one does wonder why it isn't simply a bugfix.
> Attempting to migrate to a memoryless node is clearly as error.

I agree with what Oscar Salvador said:

"As this is not a bug fix but an optimization, as we will fail anyways
in migrate_misplaced_folio() when migrate_balanced_pgdat() notices
that we do not have any memory on that node."

https://lore.kernel.org/lkml/[email protected]/

So assuming all the related code works correctly, the migration will
safely fail even without this optimization patch.

Byungchul

> Presumably the called code handles it somehow, but in what fashion and
> at what cost?
>
> > 2) https://lkml.kernel.org/r/[email protected]
> >
> > I found the root cause of the oops comes from using -1 as an array
> > index. So moved all the oops message, Fixes: tag, and cc stable to
> > here. Long story short, 2) is the *real fix* for the oops.
> >