2022-02-14 19:20:32

by Huang, Ying

Subject: [PATCH -V3 2/2] NUMA balancing: avoid migrating task to CPU-less node

In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
nodes. But if the number of hint page faults on a PMEM node is the
maximum for a task, the current NUMA balancing policy may try to place
the task on the PMEM node instead of a DRAM node. This is unreasonable,
because there's no CPU in PMEM NUMA nodes. To fix this, this patch
ignores CPU-less nodes when searching for the migration target node of
a task.
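
The core of the change is visible in the hunks below: node iteration
switches from all online nodes to nodes with the N_CPU state, and
task_numa_placement() gains a fallback that redirects a CPU-less
preferred node to the nearest node that does have CPUs. As a rough
user-space sketch of that fallback (the 4-node topology and distance
table are invented for illustration, not taken from any real machine):

/* Sketch only: the nearest-with-CPU fallback, mirroring the hunk added
 * to task_numa_placement() below.  Nodes 0-1 model DRAM (with CPUs),
 * nodes 2-3 model CPU-less PMEM; distances are invented. */
#include <stdio.h>
#include <limits.h>

#define NR_NODES 4

static const int node_has_cpu[NR_NODES] = { 1, 1, 0, 0 };
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 21, 17, 28 },
	{ 21, 10, 28, 17 },
	{ 17, 28, 10, 28 },
	{ 28, 17, 28, 10 },
};

static int nearest_cpu_node(int max_nid)
{
	int nid, near_nid = max_nid, near_distance = INT_MAX;

	if (node_has_cpu[max_nid])
		return max_nid;		/* already a usable target */

	for (nid = 0; nid < NR_NODES; nid++) {
		if (!node_has_cpu[nid])
			continue;
		if (distance[max_nid][nid] < near_distance) {
			near_nid = nid;
			near_distance = distance[max_nid][nid];
		}
	}
	return near_nid;
}

int main(void)
{
	/* Hint faults peaked on PMEM node 3; the task lands on DRAM node 1. */
	printf("faults peaked on node 3 -> place task on node %d\n",
	       nearest_cpu_node(3));
	return 0;
}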

To test the patch, we run a workload that accesses more memory in the
PMEM node than in the DRAM node. Without the patch, the PMEM node is
chosen as the preferred node in task_numa_placement(); with the patch,
the DRAM node is chosen instead.
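
For reference, a reproducer along these lines should exercise the path;
this is a sketch assuming libnuma, and the node numbers (DRAM node 0,
CPU-less PMEM node 2) are assumptions about the test machine:

/* Sketch of a reproducer: touch more memory on the (assumed) PMEM node
 * than on the DRAM node, so hint page faults peak on the PMEM node.
 * Build with: gcc repro.c -o repro -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define DRAM_NODE 0	/* assumed DRAM node id */
#define PMEM_NODE 2	/* assumed CPU-less PMEM node id */

int main(void)
{
	size_t dram_sz = 1UL << 28;	/* 256 MB on DRAM */
	size_t pmem_sz = 1UL << 30;	/* 1 GB on PMEM */
	char *dram, *pmem;
	int i;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}
	dram = numa_alloc_onnode(dram_sz, DRAM_NODE);
	pmem = numa_alloc_onnode(pmem_sz, PMEM_NODE);
	if (!dram || !pmem)
		return 1;

	/* Keep touching both regions; NUMA balancing samples the accesses
	 * via hint page faults while the task runs. */
	for (i = 0; i < 100; i++) {
		memset(dram, i, dram_sz);
		memset(pmem, i, pmem_sz);
	}

	numa_free(dram, dram_sz);
	numa_free(pmem, pmem_sz);
	return 0;
}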

Known issue: I don't have systems to test complex NUMA topology types,
for example, NUMA_BACKPLANE or NUMA_GLUELESS_MESH.

v3:

- Fix several places that were missed and could still choose a CPU-less
  node as the migration target.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 25 ++++++++++++++++++++-----
1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04968f3f9b6d..a3f0ea216ccb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1988,7 +1988,7 @@ static int task_numa_migrate(struct task_struct *p)
*/
ng = deref_curr_numa_group(p);
if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) {
- for_each_online_node(nid) {
+ for_each_node_state(nid, N_CPU) {
if (nid == env.src_nid || nid == p->numa_preferred_nid)
continue;

@@ -2086,13 +2086,13 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
unsigned long faults, max_faults = 0;
int nid, active_nodes = 0;

- for_each_online_node(nid) {
+ for_each_node_state(nid, N_CPU) {
faults = group_faults_cpu(numa_group, nid);
if (faults > max_faults)
max_faults = faults;
}

- for_each_online_node(nid) {
+ for_each_node_state(nid, N_CPU) {
faults = group_faults_cpu(numa_group, nid);
if (faults * ACTIVE_NODE_FRACTION > max_faults)
active_nodes++;
@@ -2246,7 +2246,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)

dist = sched_max_numa_distance;

- for_each_online_node(node) {
+ for_each_node_state(node, N_CPU) {
score = group_weight(p, node, dist);
if (score > max_score) {
max_score = score;
@@ -2265,7 +2265,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
* inside the highest scoring group of nodes. The nodemask tricks
* keep the complexity of the search down.
*/
- nodes = node_online_map;
+ nodes = node_states[N_CPU];
for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) {
unsigned long max_faults = 0;
nodemask_t max_group = NODE_MASK_NONE;
@@ -2404,6 +2404,21 @@ static void task_numa_placement(struct task_struct *p)
}
}

+ /* Cannot migrate task to CPU-less node */
+ if (!node_state(max_nid, N_CPU)) {
+ int near_nid = max_nid;
+ int distance, near_distance = INT_MAX;
+
+ for_each_node_state(nid, N_CPU) {
+ distance = node_distance(max_nid, nid);
+ if (distance < near_distance) {
+ near_nid = nid;
+ near_distance = distance;
+ }
+ }
+ max_nid = near_nid;
+ }
+
if (ng) {
numa_group_count_active_nodes(ng);
spin_unlock_irq(group_lock);
--
2.30.2


2022-02-17 23:04:59

by tip-bot2 for Huang Ying

Subject: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 5c7b1aaf139dab5072311853bacc40fc3457d1f9
Gitweb: https://git.kernel.org/tip/5c7b1aaf139dab5072311853bacc40fc3457d1f9
Author: Huang Ying <[email protected]>
AuthorDate: Mon, 14 Feb 2022 20:15:53 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 16 Feb 2022 15:57:53 +01:00

sched/numa: Avoid migrating task to CPU-less node

In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
nodes. But if the number of hint page faults on a PMEM node is the
maximum for a task, the current NUMA balancing policy may try to place
the task on the PMEM node instead of a DRAM node. This is unreasonable,
because there's no CPU in PMEM NUMA nodes. To fix this, this patch
ignores CPU-less nodes when searching for the migration target node of
a task.

To test the patch, we run a workload that accesses more memory in the
PMEM node than in the DRAM node. Without the patch, the PMEM node is
chosen as the preferred node in task_numa_placement(); with the patch,
the DRAM node is chosen instead.

Signed-off-by: "Huang, Ying" <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 25 ++++++++++++++++++++-----
1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da3230b..11a72e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1989,7 +1989,7 @@ static int task_numa_migrate(struct task_struct *p)
*/
ng = deref_curr_numa_group(p);
if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) {
- for_each_online_node(nid) {
+ for_each_node_state(nid, N_CPU) {
if (nid == env.src_nid || nid == p->numa_preferred_nid)
continue;

@@ -2087,13 +2087,13 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
unsigned long faults, max_faults = 0;
int nid, active_nodes = 0;

- for_each_online_node(nid) {
+ for_each_node_state(nid, N_CPU) {
faults = group_faults_cpu(numa_group, nid);
if (faults > max_faults)
max_faults = faults;
}

- for_each_online_node(nid) {
+ for_each_node_state(nid, N_CPU) {
faults = group_faults_cpu(numa_group, nid);
if (faults * ACTIVE_NODE_FRACTION > max_faults)
active_nodes++;
@@ -2247,7 +2247,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)

dist = sched_max_numa_distance;

- for_each_online_node(node) {
+ for_each_node_state(node, N_CPU) {
score = group_weight(p, node, dist);
if (score > max_score) {
max_score = score;
@@ -2266,7 +2266,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
* inside the highest scoring group of nodes. The nodemask tricks
* keep the complexity of the search down.
*/
- nodes = node_online_map;
+ nodes = node_states[N_CPU];
for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) {
unsigned long max_faults = 0;
nodemask_t max_group = NODE_MASK_NONE;
@@ -2405,6 +2405,21 @@ static void task_numa_placement(struct task_struct *p)
}
}

+ /* Cannot migrate task to CPU-less node */
+ if (!node_state(max_nid, N_CPU)) {
+ int near_nid = max_nid;
+ int distance, near_distance = INT_MAX;
+
+ for_each_node_state(nid, N_CPU) {
+ distance = node_distance(max_nid, nid);
+ if (distance < near_distance) {
+ near_nid = nid;
+ near_distance = distance;
+ }
+ }
+ max_nid = near_nid;
+ }
+
if (ng) {
numa_group_count_active_nodes(ng);
spin_unlock_irq(group_lock);

2022-03-02 11:17:31

by Huang, Ying

Subject: Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node

Qian Cai <[email protected]> writes:

> On Thu, Feb 17, 2022 at 06:56:52PM -0000, tip-bot2 for Huang Ying wrote:
>> The following commit has been merged into the sched/core branch of tip:
>>
>> Commit-ID: 5c7b1aaf139dab5072311853bacc40fc3457d1f9
>> Gitweb: https://git.kernel.org/tip/5c7b1aaf139dab5072311853bacc40fc3457d1f9
>> Author: Huang Ying <[email protected]>
>> AuthorDate: Mon, 14 Feb 2022 20:15:53 +08:00
>> Committer: Peter Zijlstra <[email protected]>
>> CommitterDate: Wed, 16 Feb 2022 15:57:53 +01:00
>>
>> sched/numa: Avoid migrating task to CPU-less node
>>
>> In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
>> nodes. But if the number of hint page faults on a PMEM node is the
>> maximum for a task, the current NUMA balancing policy may try to place
>> the task on the PMEM node instead of a DRAM node. This is unreasonable,
>> because there's no CPU in PMEM NUMA nodes. To fix this, this patch
>> ignores CPU-less nodes when searching for the migration target node of
>> a task.
>>
>> To test the patch, we run a workload that accesses more memory in the
>> PMEM node than in the DRAM node. Without the patch, the PMEM node is
>> chosen as the preferred node in task_numa_placement(); with the patch,
>> the DRAM node is chosen instead.
>>
>> Signed-off-by: "Huang, Ying" <[email protected]>
>> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>> Link: https://lkml.kernel.org/r/[email protected]
>
> Reverting this commit on the top of today's linux-next fixed a boot crash
> on arm64 NUMA systems.
>
> Unable to handle kernel paging request at virtual address ffff7a6601694aec
> KASAN: maybe wild-memory-access in range [0xffffd3300b4a5760-0xffffd3300b4a5767]
> Mem abort info:
> ESR = 0x96000005
> EC = 0x25: DABT (current EL), IL = 32 bits
> mlx5_core 0007:02:00.0: enabling device (0100 -> 0102)
> SET = 0, FnV = 0
> EA = 0, S1PTW = 0
> FSC = 0x05: level 1 translation fault
> Data abort info:
> ISV = 0, ISS = 0x00000005
> CM = 0, WnR = 0
> swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000400b3d6c6000
> [ffff7a6601694aec] pgd=0000403fc007f003, p4d=0000403fc007f003, pud=0000000000000000
> Internal error: Oops: 96000005 [#1] PREEMPT SMP
> Modules linked in: nouveau(+) drm_ttm_helper ttm nvme(+) drm_dp_helper drm_kms_helper mlx5_core(+) mpt3sas(+) xhci_pci(+) nvme_core raid_class xhci_pci_renesas drm
> CPU: 85 PID: 1308 Comm: udevadm Not tainted 5.17.0-rc6-next-20220301 #1
> pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : task_numa_placement
> lr : task_numa_placement
> sp : ffff800031047760
> x29: ffff800031047760 x28: ffff3fffab916c00 x27: 0000000000000020
> x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000000
> x23: ffff07ffe5289a80 x22: ffffd3300b4a5760 x21: 000000000000003f
> x20: ffffd32feb4a5768 x19: 0000000000000000 x18: ffff07ffe528ad88
> x17: ffffd32fe5693a1c x16: 0000000000000000 x15: ffff8000310478e0
> x14: ffff07ffe528ad90 x13: 0000000000000002 x12: dfff80000000000d
> x11: 0000000000000001 x10: 000000000000b6be x9 : 0000000000000000
> x8 : 00000000ffffffff x7 : ffffd32feb4a5780 x6 : 0000000000000000
> x5 : 0000000000000000 x4 : 0000000000000000 x3 : 1ffffa6601694aec
> x2 : 0000000000000000 x1 : dfff800000000000 x0 : 000000001ffffff8
> Call trace:
> task_numa_placement
> arch_test_bit at include/asm-generic/bitops/non-atomic.h:118
> (inlined by) node_state at include/linux/nodemask.h:416
> (inlined by) task_numa_placement at kernel/sched/fair.c:2439
> task_numa_fault
> do_numa_page
> handle_pte_fault
> __handle_mm_fault
> handle_mm_fault
> do_page_fault
> do_translation_fault
> do_mem_abort
> el0_da
> el0t_64_sync_handler
> el0t_64_sync
> Code: 8b000296 d2d00001 f2fbffe1 d343fec3 (38e16861)
> ---[ end trace 0000000000000000 ]---
> Kernel panic - not syncing: Oops: Fatal exception
> SMP: stopping secondary CPUs
> Kernel Offset: 0x532fdcf70000 from 0xffff800008000000
> PHYS_OFFSET: 0x80000000
> CPU features: 0x00,00042c0c,19801c82
> Memory Limit: none
> ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

Thanks for reporting! Can you check whether the following debug patch fixes the issue?

Best Regards,
Huang, Ying

----------------------------8<-------------------------------------------
From 176d185426730111e763eb386d0210561f021dbc Mon Sep 17 00:00:00 2001
From: Huang Ying <[email protected]>
Date: Wed, 2 Mar 2022 08:54:01 +0800
Subject: [PATCH] dbg KASAN error

---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a3f0ea216ccb..1fe7a4510cca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2405,7 +2405,7 @@ static void task_numa_placement(struct task_struct *p)
}

/* Cannot migrate task to CPU-less node */
- if (!node_state(max_nid, N_CPU)) {
+ if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
int near_nid = max_nid;
int distance, near_distance = INT_MAX;

--
2.30.2
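
For what it's worth, the guard is needed because task_numa_placement()
can presumably reach this check with max_nid still NUMA_NO_NODE (-1),
e.g. when no hint-fault statistics have accumulated for the task, and
node_state() then does an unchecked bit test with the node id; that is
the node_state()/arch_test_bit() frame in the trace above. A simplified
user-space mirror of the failing index math (illustrative only; the
kernel's real test_bit() differs in detail):

/* Sketch only: why node_state(NUMA_NO_NODE, N_CPU) walks off the
 * node_states[] bitmap. */
#include <stdio.h>

#define BITS_PER_LONG	(8UL * sizeof(unsigned long))
#define NUMA_NO_NODE	(-1)

static void show_word_offset(int node)
{
	/* A bit test reads the word at index nr / BITS_PER_LONG.  Once -1
	 * is treated as an unsigned index, that offset is enormous, far
	 * outside the bitmap, matching KASAN's "maybe wild-memory-access"
	 * report. */
	unsigned long nr = (unsigned long)node;

	printf("node %d -> word offset %lu\n", node, nr / BITS_PER_LONG);
}

int main(void)
{
	show_word_offset(0);		/* valid node id */
	show_word_offset(NUMA_NO_NODE);	/* out of range */
	return 0;
}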

2022-03-02 16:09:01

by Qian Cai

Subject: Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node

On Thu, Feb 17, 2022 at 06:56:52PM -0000, tip-bot2 for Huang Ying wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: 5c7b1aaf139dab5072311853bacc40fc3457d1f9
> Gitweb: https://git.kernel.org/tip/5c7b1aaf139dab5072311853bacc40fc3457d1f9
> Author: Huang Ying <[email protected]>
> AuthorDate: Mon, 14 Feb 2022 20:15:53 +08:00
> Committer: Peter Zijlstra <[email protected]>
> CommitterDate: Wed, 16 Feb 2022 15:57:53 +01:00
>
> sched/numa: Avoid migrating task to CPU-less node
>
> In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
> nodes. But if the number of hint page faults on a PMEM node is the
> maximum for a task, the current NUMA balancing policy may try to place
> the task on the PMEM node instead of a DRAM node. This is unreasonable,
> because there's no CPU in PMEM NUMA nodes. To fix this, this patch
> ignores CPU-less nodes when searching for the migration target node of
> a task.
>
> To test the patch, we run a workload that accesses more memory in the
> PMEM node than in the DRAM node. Without the patch, the PMEM node is
> chosen as the preferred node in task_numa_placement(); with the patch,
> the DRAM node is chosen instead.
>
> Signed-off-by: "Huang, Ying" <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]

Reverting this commit on the top of today's linux-next fixed a boot crash
on arm64 NUMA systems.

Unable to handle kernel paging request at virtual address ffff7a6601694aec
KASAN: maybe wild-memory-access in range [0xffffd3300b4a5760-0xffffd3300b4a5767]
Mem abort info:
ESR = 0x96000005
EC = 0x25: DABT (current EL), IL = 32 bits
mlx5_core 0007:02:00.0: enabling device (0100 -> 0102)
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x05: level 1 translation fault
Data abort info:
ISV = 0, ISS = 0x00000005
CM = 0, WnR = 0
swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000400b3d6c6000
[ffff7a6601694aec] pgd=0000403fc007f003, p4d=0000403fc007f003, pud=0000000000000000
Internal error: Oops: 96000005 [#1] PREEMPT SMP
Modules linked in: nouveau(+) drm_ttm_helper ttm nvme(+) drm_dp_helper drm_kms_helper mlx5_core(+) mpt3sas(+) xhci_pci(+) nvme_core raid_class xhci_pci_renesas drm
CPU: 85 PID: 1308 Comm: udevadm Not tainted 5.17.0-rc6-next-20220301 #1
pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : task_numa_placement
lr : task_numa_placement
sp : ffff800031047760
x29: ffff800031047760 x28: ffff3fffab916c00 x27: 0000000000000020
x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000000
x23: ffff07ffe5289a80 x22: ffffd3300b4a5760 x21: 000000000000003f
x20: ffffd32feb4a5768 x19: 0000000000000000 x18: ffff07ffe528ad88
x17: ffffd32fe5693a1c x16: 0000000000000000 x15: ffff8000310478e0
x14: ffff07ffe528ad90 x13: 0000000000000002 x12: dfff80000000000d
x11: 0000000000000001 x10: 000000000000b6be x9 : 0000000000000000
x8 : 00000000ffffffff x7 : ffffd32feb4a5780 x6 : 0000000000000000
x5 : 0000000000000000 x4 : 0000000000000000 x3 : 1ffffa6601694aec
x2 : 0000000000000000 x1 : dfff800000000000 x0 : 000000001ffffff8
Call trace:
task_numa_placement
arch_test_bit at include/asm-generic/bitops/non-atomic.h:118
(inlined by) node_state at include/linux/nodemask.h:416
(inlined by) task_numa_placement at kernel/sched/fair.c:2439
task_numa_fault
do_numa_page
handle_pte_fault
__handle_mm_fault
handle_mm_fault
do_page_fault
do_translation_fault
do_mem_abort
el0_da
el0t_64_sync_handler
el0t_64_sync
Code: 8b000296 d2d00001 f2fbffe1 d343fec3 (38e16861)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
Kernel Offset: 0x532fdcf70000 from 0xffff800008000000
PHYS_OFFSET: 0x80000000
CPU features: 0x00,00042c0c,19801c82
Memory Limit: none
---[ end Kernel panic - not syncing: Oops: Fatal exception ]---


2022-03-02 22:36:38

by Qian Cai

[permalink] [raw]
Subject: Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node

On Wed, Mar 02, 2022 at 08:59:55AM +0800, Huang, Ying wrote:
> Thanks for reporting! Can you check whether the following debug patch fixes the issue?

Yes, it prevents the crash.

2022-03-07 07:50:23

by Huang, Ying

Subject: Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node

Hi, Qian,

"Huang, Ying" <[email protected]> writes:

> Qian Cai <[email protected]> writes:
>
>> Reverting this commit on the top of today's linux-next fixed a boot crash
>> on arm64 NUMA systems.
>>
>> [...]
>
> Thanks for reporting! Can you check whether the following debug patch fixes the issue?
>
> Best Regards,
> Huang, Ying
>
> ----------------------------8<-------------------------------------------
> From 176d185426730111e763eb386d0210561f021dbc Mon Sep 17 00:00:00 2001
> From: Huang Ying <[email protected]>
> Date: Wed, 2 Mar 2022 08:54:01 +0800
> Subject: [PATCH] dbg KASAN error
>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a3f0ea216ccb..1fe7a4510cca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2405,7 +2405,7 @@ static void task_numa_placement(struct task_struct *p)
> }
>
> /* Cannot migrate task to CPU-less node */
> - if (!node_state(max_nid, N_CPU)) {
> + if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
> int near_nid = max_nid;
> int distance, near_distance = INT_MAX;

Do you have time to give this patch a try?

Best Regards,
Huang, Ying

2022-03-08 01:09:32

by Huang, Ying

Subject: Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node

Qian Cai <[email protected]> writes:

> On Mon, Mar 07, 2022 at 01:51:55PM +0800, Huang, Ying wrote:
>> > ---
>> > kernel/sched/fair.c | 2 +-
>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index a3f0ea216ccb..1fe7a4510cca 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -2405,7 +2405,7 @@ static void task_numa_placement(struct task_struct *p)
>> > }
>> >
>> > /* Cannot migrate task to CPU-less node */
>> > - if (!node_state(max_nid, N_CPU)) {
>> > + if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
>> > int near_nid = max_nid;
>> > int distance, near_distance = INT_MAX;
>>
>> Do you have time to give this patch a try?
>
> Ah, I thought I had already replied a while ago. Anyway, it works fine.

Thanks!

Best Regards,
Huang, Ying

2022-03-08 17:18:39

by Qian Cai

Subject: Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node

On Mon, Mar 07, 2022 at 01:51:55PM +0800, Huang, Ying wrote:
> > ---
> > kernel/sched/fair.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a3f0ea216ccb..1fe7a4510cca 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2405,7 +2405,7 @@ static void task_numa_placement(struct task_struct *p)
> > }
> >
> > /* Cannot migrate task to CPU-less node */
> > - if (!node_state(max_nid, N_CPU)) {
> > + if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
> > int near_nid = max_nid;
> > int distance, near_distance = INT_MAX;
>
> Do you have time to give this patch a try?

Ah, I thought I had already replied a while ago. Anyway, it works fine.

2022-03-09 01:27:24

by Huang, Ying

Subject: [PATCH -V3 2/2 UPDATE] NUMA balancing: avoid migrating task to CPU-less node

In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
nodes. But if the number of hint page faults on a PMEM node is the
maximum for a task, the current NUMA balancing policy may try to place
the task on the PMEM node instead of a DRAM node. This is unreasonable,
because there's no CPU in PMEM NUMA nodes. To fix this, this patch
ignores CPU-less nodes when searching for the migration target node of
a task.

To test the patch, we run a workload that accesses more memory in the
PMEM node than in the DRAM node. Without the patch, the PMEM node is
chosen as the preferred node in task_numa_placement(); with the patch,
the DRAM node is chosen instead.

Known issue: I don't have systems to test complex NUMA topology types,
for example, NUMA_BACKPLANE or NUMA_GLUELESS_MESH.

v3:

- Fix a boot crash in the corner case where the preferred node is still
  NUMA_NO_NODE. Thanks to Qian Cai for reporting and testing the bug!

- Fix several places that were missed and could still choose a CPU-less
  node as the migration target.

Signed-off-by: "Huang, Ying" <[email protected]>
Reported-and-tested-by: Qian Cai <[email protected]> # boot crash
Cc: Peter Zijlstra <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
---
kernel/sched/fair.c | 25 ++++++++++++++++++++-----
1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04968f3f9b6d..1fe7a4510cca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1988,7 +1988,7 @@ static int task_numa_migrate(struct task_struct *p)
*/
ng = deref_curr_numa_group(p);
if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) {
- for_each_online_node(nid) {
+ for_each_node_state(nid, N_CPU) {
if (nid == env.src_nid || nid == p->numa_preferred_nid)
continue;

@@ -2086,13 +2086,13 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
unsigned long faults, max_faults = 0;
int nid, active_nodes = 0;

- for_each_online_node(nid) {
+ for_each_node_state(nid, N_CPU) {
faults = group_faults_cpu(numa_group, nid);
if (faults > max_faults)
max_faults = faults;
}

- for_each_online_node(nid) {
+ for_each_node_state(nid, N_CPU) {
faults = group_faults_cpu(numa_group, nid);
if (faults * ACTIVE_NODE_FRACTION > max_faults)
active_nodes++;
@@ -2246,7 +2246,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)

dist = sched_max_numa_distance;

- for_each_online_node(node) {
+ for_each_node_state(node, N_CPU) {
score = group_weight(p, node, dist);
if (score > max_score) {
max_score = score;
@@ -2265,7 +2265,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
* inside the highest scoring group of nodes. The nodemask tricks
* keep the complexity of the search down.
*/
- nodes = node_online_map;
+ nodes = node_states[N_CPU];
for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) {
unsigned long max_faults = 0;
nodemask_t max_group = NODE_MASK_NONE;
@@ -2404,6 +2404,21 @@ static void task_numa_placement(struct task_struct *p)
}
}

+ /* Cannot migrate task to CPU-less node */
+ if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
+ int near_nid = max_nid;
+ int distance, near_distance = INT_MAX;
+
+ for_each_node_state(nid, N_CPU) {
+ distance = node_distance(max_nid, nid);
+ if (distance < near_distance) {
+ near_nid = nid;
+ near_distance = distance;
+ }
+ }
+ max_nid = near_nid;
+ }
+
if (ng) {
numa_group_count_active_nodes(ng);
spin_unlock_irq(group_lock);
--
2.30.2