2022-10-31 18:50:37

by Yang Shi

Subject: [PATCH] mm: don't warn if the node is offlined

Syzbot reported the below splat:

WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Modules linked in:
CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
do_madvise mm/madvise.c:1432 [inline]
__do_sys_madvise mm/madvise.c:1432 [inline]
__se_sys_madvise mm/madvise.c:1430 [inline]
__x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f6b48a4eef9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
</TASK>

It is because khugepaged allocates pages with __GFP_THISNODE, but the
preferred node is offlined. The warning was even stronger before commit
8addc2d00fe17 ("mm: do not warn on offline nodes unless the specific node
is explicitly requested"). The commit softened the warning for
__GFP_THISNODE.

But this warning does not seem very useful because:
* There is no guarantee the node is online for __GFP_THISNODE context
for all the callsites.
* The kernel just fails the allocation regardless of the warning, and it
looks like all the callsites handle the allocation failure gracefully
(see the sketch below).

So, removing the warning seems like a good move.
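
For illustration, the pattern at a typical __GFP_THISNODE call site is
roughly the following. This is a simplified, hypothetical sketch (the
function name and details are made up, it is not the actual
mm/khugepaged.c code); the point is only that a NULL return is handled
and no warning is needed:

/*
 * Hypothetical sketch of a __GFP_THISNODE caller: if the requested node
 * cannot satisfy the allocation (e.g. it went offline), the allocator
 * returns NULL and the caller simply backs off.
 */
static int collapse_alloc_hpage(struct page **hpage, int node, gfp_t gfp)
{
	*hpage = __alloc_pages_node(node, gfp | __GFP_THISNODE,
				    HPAGE_PMD_ORDER);
	if (unlikely(!*hpage))
		return -ENOMEM;	/* give up on this collapse attempt */
	return 0;
}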

Reported-by: [email protected]
Signed-off-by: Yang Shi <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Cc: Michal Hocko <[email protected]>
---
include/linux/gfp.h | 2 --
1 file changed, 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ef4aea3b356e..594d6dee5646 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -218,7 +218,6 @@ static inline struct page *
 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 {
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
-	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
 
 	return __alloc_pages(gfp_mask, order, nid, NULL);
 }
@@ -227,7 +226,6 @@ static inline
 struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
 {
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
-	VM_WARN_ON((gfp & __GFP_THISNODE) && !node_online(nid));
 
 	return __folio_alloc(gfp, order, nid, NULL);
 }
--
2.26.3



2022-10-31 21:47:46

by Zach O'Keefe

Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Mon, Oct 31, 2022 at 11:31 AM Yang Shi <[email protected]> wrote:
>
> Syzbot reported the below splat:
>
> WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
> WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> Modules linked in:
> CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
> RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
> RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
> RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
> RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
> FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
> hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
> madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
> madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
> madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
> do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
> do_madvise mm/madvise.c:1432 [inline]
> __do_sys_madvise mm/madvise.c:1432 [inline]
> __se_sys_madvise mm/madvise.c:1430 [inline]
> __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7f6b48a4eef9
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
> RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
> RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
> RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
> R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
> </TASK>
>
> It is because khugepaged allocates pages with __GFP_THISNODE, but the
> preferred node is offlined. The warning was even stronger before commit
> 8addc2d00fe17 ("mm: do not warn on offline nodes unless the specific node
> is explicitly requested"). The commit softened the warning for
> __GFP_THISNODE.
>
> But this warning seems not quite useful because:
> * There is no guarantee the node is online for __GFP_THISNODE context
> for all the callsites.
> * Kernel just fails the allocation regardless the warning, and it looks
> all callsites handle the allocation failure gracefully.
>
> So, removing the warning seems like the good move.
>
> Reported-by: [email protected]
> Signed-off-by: Yang Shi <[email protected]>
> Cc: Zach O'Keefe <[email protected]>
> Cc: Michal Hocko <[email protected]>

Reviewed-by: Zach O'Keefe <[email protected]>

> ---
> include/linux/gfp.h | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index ef4aea3b356e..594d6dee5646 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -218,7 +218,6 @@ static inline struct page *
> __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
> {
> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> - VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
>
> return __alloc_pages(gfp_mask, order, nid, NULL);
> }
> @@ -227,7 +226,6 @@ static inline
> struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
> {
> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> - VM_WARN_ON((gfp & __GFP_THISNODE) && !node_online(nid));
>
> return __folio_alloc(gfp, order, nid, NULL);
> }
> --
> 2.26.3
>

2022-10-31 22:22:14

by Michal Hocko

Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Mon 31-10-22 11:31:22, Yang Shi wrote:
> Syzbot reported the below splat:
>
> WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
> WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> Modules linked in:
> CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
> RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
> RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
> RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
> RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
> FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715

This is quite weird, isn't it? alloc_charge_hpage is selecting the most
busy node (as per collapse_control). How come this can be an offline
node? Is a parallel memory hotplug happening?

[...]

> It is because khugepaged allocates pages with __GFP_THISNODE, but the
> preferred node is offlined. The warning was even stronger before commit
> 8addc2d00fe17 ("mm: do not warn on offline nodes unless the specific node
> is explicitly requested"). The commit softened the warning for
> __GFP_THISNODE.
>
> But this warning seems not quite useful because:
> * There is no guarantee the node is online for __GFP_THISNODE context
> for all the callsites.

The original idea IIRC was to catch buggy code which mishandled node
assignment. But this looks like perfectly valid code. There is no
synchronization with memory hotplug, so it is possible that memory gets
offlined during a longer-running scan.

I do agree that the warning is not really helpful in this case. It is
actually even harmful for those running in panic-on-warn mode.

> * Kernel just fails the allocation regardless the warning, and it looks
> all callsites handle the allocation failure gracefully.
>
> So, removing the warning seems like the good move.
>
> Reported-by: [email protected]
> Signed-off-by: Yang Shi <[email protected]>
> Cc: Zach O'Keefe <[email protected]>
> Cc: Michal Hocko <[email protected]>

Unless I am wrong in my above statement, I would appreciate extending the
changelog to describe that the affected code is actually correct, so the
warning is harmful.

Acked-by: Michal Hocko <[email protected]>

> ---
> include/linux/gfp.h | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index ef4aea3b356e..594d6dee5646 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -218,7 +218,6 @@ static inline struct page *
> __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
> {
> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> - VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
>
> return __alloc_pages(gfp_mask, order, nid, NULL);
> }
> @@ -227,7 +226,6 @@ static inline
> struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
> {
> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> - VM_WARN_ON((gfp & __GFP_THISNODE) && !node_online(nid));
>
> return __folio_alloc(gfp, order, nid, NULL);
> }
> --
> 2.26.3

--
Michal Hocko
SUSE Labs

2022-11-01 00:51:44

by Zach O'Keefe

Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <[email protected]> wrote:
>
> On Mon 31-10-22 11:31:22, Yang Shi wrote:
> > Syzbot reported the below splat:
> >
> > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
> > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > Modules linked in:
> > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
> > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
> > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
> > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
> > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
> > FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> > <TASK>
> > collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
>
> This is quite weird, isn't it? alloc_charge_hpage is selecting the most
> busy node (as per collapse_control). How come this can be an offline
> node? Is a parallel memory hotplug happening?

TBH -- I did not look closely at the syzbot reproducer (let alone
attempt to run it) and assumed this was the case. Taking a quick look,
at least memory hot remove is enabled:

CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
CONFIG_MEMORY_HOTREMOVE=y

But looking at the C reproducer, I don't immediately see anywhere
where we offline nodes. I'll try to run this tomorrow to make sure I'm
not missing something real here.

Thanks,
Zach


> [...]
>
> > It is because khugepaged allocates pages with __GFP_THISNODE, but the
> > preferred node is offlined. The warning was even stronger before commit
> > 8addc2d00fe17 ("mm: do not warn on offline nodes unless the specific node
> > is explicitly requested"). The commit softened the warning for
> > __GFP_THISNODE.
> >
> > But this warning seems not quite useful because:
> > * There is no guarantee the node is online for __GFP_THISNODE context
> > for all the callsites.
>
> The original idea IIRC was to catch a buggy code which mishandled node
> assignment. But this looks like a perfectly valid code. There is no
> synchronization with the memory hotplug so it is possible that memory
> gets offline during a longer taking scanning.
>
> I do agree that the warning is not really helpful in this case. It is
> actually even harmful for those running in panic-on-warn mode.
>
> > * Kernel just fails the allocation regardless the warning, and it looks
> > all callsites handle the allocation failure gracefully.
> >
> > So, removing the warning seems like the good move.
> >
> > Reported-by: [email protected]
> > Signed-off-by: Yang Shi <[email protected]>
> > Cc: Zach O'Keefe <[email protected]>
> > Cc: Michal Hocko <[email protected]>
>
> Unless I am wrong in my above statement I would appreciate extending the
> changelog to describe the actual code is correct so the warning is
> harmful.
>
> Acked-by: Michal Hocko <[email protected]>
>
> > ---
> > include/linux/gfp.h | 2 --
> > 1 file changed, 2 deletions(-)
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index ef4aea3b356e..594d6dee5646 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -218,7 +218,6 @@ static inline struct page *
> > __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
> > {
> > VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> > - VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
> >
> > return __alloc_pages(gfp_mask, order, nid, NULL);
> > }
> > @@ -227,7 +226,6 @@ static inline
> > struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
> > {
> > VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> > - VM_WARN_ON((gfp & __GFP_THISNODE) && !node_online(nid));
> >
> > return __folio_alloc(gfp, order, nid, NULL);
> > }
> > --
> > 2.26.3
>
> --
> Michal Hocko
> SUSE Labs

2022-11-01 08:01:21

by Michal Hocko

Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Mon 31-10-22 17:05:06, Zach O'Keefe wrote:
> On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 31-10-22 11:31:22, Yang Shi wrote:
> > > Syzbot reported the below splat:
> > >
> > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
> > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > Modules linked in:
> > > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
> > > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
> > > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
> > > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
> > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
> > > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
> > > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
> > > FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > Call Trace:
> > > <TASK>
> > > collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
> >
> > This is quite weird, isn't it? alloc_charge_hpage is selecting the most
> > busy node (as per collapse_control). How come this can be an offline
> > node? Is a parallel memory hotplug happening?
>
> TBH -- I did not look closely at the syzbot reproducer (let alone
> attempt to run it) and assumed this was the case. Taking a quick look,
> at least memory hot remove is enabled:
>
> CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
> CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
> CONFIG_MEMORY_HOTPLUG=y
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
> CONFIG_MEMORY_HOTREMOVE=y
>
> But looking at the C reproducer, I don't immediately see anywhere
> where we offline nodes. I'll try to run this tomorrow to make sure I'm
> not missing something real here.

Looking slightly closer at hpage_collapse_scan_file I think that it is
possible that xas_for_each simply doesn't find any entries in the page
cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back
to collapse_file even without any real entries.

But the mere possibility of the hotplug race should be sufficient
grounds to remove those WARN_ONs.


Thanks!
--
Michal Hocko
SUSE Labs

2022-11-01 17:51:06

by Yang Shi

Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Tue, Nov 1, 2022 at 12:54 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 31-10-22 17:05:06, Zach O'Keefe wrote:
> > On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 31-10-22 11:31:22, Yang Shi wrote:
> > > > Syzbot reported the below splat:
> > > >
> > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
> > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > > Modules linked in:
> > > > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
> > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
> > > > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
> > > > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
> > > > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
> > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > > > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
> > > > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
> > > > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
> > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
> > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > Call Trace:
> > > > <TASK>
> > > > collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
> > >
> > > This is quite weird, isn't it? alloc_charge_hpage is selecting the most
> > > busy node (as per collapse_control). How come this can be an offline
> > > node? Is a parallel memory hotplug happening?
> >
> > TBH -- I did not look closely at the syzbot reproducer (let alone
> > attempt to run it) and assumed this was the case. Taking a quick look,
> > at least memory hot remove is enabled:
> >
> > CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
> > CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
> > CONFIG_MEMORY_HOTPLUG=y
> > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
> > CONFIG_MEMORY_HOTREMOVE=y
> >
> > But looking at the C reproducer, I don't immediately see anywhere
> > where we offline nodes. I'll try to run this tomorrow to make sure I'm
> > not missing something real here.
>
> Looking slightly closer at hpage_collapse_scan_file I think that it is
> possible that xas_for_each simply doesn't find any entries in the page
> cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back
> to collapse_file even without any real entries.

The khugepaged_max_ptes_none can't be HPAGE_PMD_NR, it must be <=
(HPAGE_PMD_NR - 1), but MADV_COLLAPSE does ignore it.

But taking a closer look at how the code picks the preferred node,
there seems to be a corner case for MADV_COLLAPSE.

The code tries to do some balancing if several nodes have the same hit
record. Conceptually it does:
* If target_node <= last_target_node, then iterate from
last_target_node + 1 to MAX_NUMNODES (1024 in the default config)
* If max_value == node_load[nid], then target_node = nid

So assuming the system has 2 nodes, target_node is 0 and
last_target_node is 1, if the MADV_COLLAPSE path is hit, then it may
return 2 for target_node, but that node does not actually exist (it is
offline), so the warning is triggered.
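
To make the corner case concrete, below is a minimal stand-alone sketch
of that selection logic (hypothetical user-space C, not the kernel code;
it assumes node_load[] stayed all zero because nothing was scanned, so
max_value is 0 and every nid "ties"):

#include <stdio.h>

#define MAX_NUMNODES	1024
#define NR_ONLINE_NODES	2	/* only nodes 0 and 1 exist */

int main(void)
{
	int node_load[MAX_NUMNODES] = { 0 };	/* nothing was scanned */
	int last_target_node = 1;	/* previous collapse picked node 1 */
	int nid, max_value = 0, target_node = 0;

	for (nid = 0; nid < MAX_NUMNODES; nid++)
		if (node_load[nid] > max_value) {
			max_value = node_load[nid];
			target_node = nid;
		}

	/* "balance" step: walk past last_target_node looking for a tie */
	if (target_node <= last_target_node)
		for (nid = last_target_node + 1; nid < MAX_NUMNODES; nid++)
			if (max_value == node_load[nid]) {
				target_node = nid;	/* nid == 2 here */
				break;
			}

	printf("target_node = %d (online nodes: 0..%d)\n",
	       target_node, NR_ONLINE_NODES - 1);
	return 0;
}

This prints target_node = 2 even though only nodes 0 and 1 exist, and
that nid is then passed to __alloc_pages_node() with __GFP_THISNODE.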

The below patch should be able to fix it:

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ea0d186bc9d4..d24405e6736b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -787,7 +787,8 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
 	if (target_node <= cc->last_target_node)
 		for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
 		     nid++)
-			if (max_value == cc->node_load[nid]) {
+			if (node_online(nid) &&
+			    max_value == cc->node_load[nid]) {
 				target_node = nid;
 				break;
 			}

> But the mere possibility of the hotplug race should be a sufficient
> ground to remove those WARN_ONs

The warn_on did help to catch this bug. But the reasons for removing
it still stand TBH, so we may consider moving this warn_on to the
callers which care about it?

>
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

2022-11-01 19:23:56

by Zach O'Keefe

Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Tue, Nov 1, 2022 at 10:13 AM Yang Shi <[email protected]> wrote:
>
> On Tue, Nov 1, 2022 at 12:54 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 31-10-22 17:05:06, Zach O'Keefe wrote:
> > > On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Mon 31-10-22 11:31:22, Yang Shi wrote:
> > > > > Syzbot reported the below splat:
> > > > >
> > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
> > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > > > Modules linked in:
> > > > > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
> > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
> > > > > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
> > > > > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > > > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > > > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
> > > > > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
> > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > > > > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
> > > > > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
> > > > > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > > > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
> > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
> > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > Call Trace:
> > > > > <TASK>
> > > > > collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
> > > >
> > > > This is quite weird, isn't it? alloc_charge_hpage is selecting the most
> > > > busy node (as per collapse_control). How come this can be an offline
> > > > node? Is a parallel memory hotplug happening?
> > >
> > > TBH -- I did not look closely at the syzbot reproducer (let alone
> > > attempt to run it) and assumed this was the case. Taking a quick look,
> > > at least memory hot remove is enabled:
> > >
> > > CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
> > > CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
> > > CONFIG_MEMORY_HOTPLUG=y
> > > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
> > > CONFIG_MEMORY_HOTREMOVE=y
> > >
> > > But looking at the C reproducer, I don't immediately see anywhere
> > > where we offline nodes. I'll try to run this tomorrow to make sure I'm
> > > not missing something real here.
> >
> > Looking slightly closer at hpage_collapse_scan_file I think that it is
> > possible that xas_for_each simply doesn't find any entries in the page
> > cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back
> > to collapse_file even without any real entries.
>
> The khugepaged_max_ptes_none can't be HPAGE_PMD_NR, it must be <=
> (HPAGE_PMD_NR - 1), but MADV_COLLAPSE does ignore it.
>
> But a closer look at the code about how to pick up the preferred node,
> there seems to be a corner case for MADV_COLLAPSE.
>
> The code tried to do some balance if several nodes have the same hit
> record. Basically it does conceptually:
> * If the target_node <= last_target_node, then iterate from
> last_target_node + 1 to MAX_NUMNODES (1024 on default config)
> * If the max_value == node_load[nid], then target_node = nid
>
> So assuming the system has 2 nodes, the target_node is 0 and the
> last_target_node is 1, if MADV_COLLAPSE path is hit, then it may
> return 2 for target_node, but it is actually not existing (offline),
> so the warn is triggered.
>

You're one step ahead of me, Yang. I was just debugging the syzbot C
reproducer, and this seems to be exactly the case that is happening.

> The below patch should be able to fix it:
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index ea0d186bc9d4..d24405e6736b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -787,7 +787,8 @@ static int hpage_collapse_find_target_node(struct
> collapse_control *cc)
> if (target_node <= cc->last_target_node)
> for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
> nid++)
> - if (max_value == cc->node_load[nid]) {
> + if (node_online(nid) &&
> + max_value == cc->node_load[nid]) {
> target_node = nid;
> break;
> }
>

Thanks for the patch. I think this is the right place to do the check.

This is slightly tangential -- I don't want to send a new mail about
it -- but I wonder if we should be doing __GFP_THISNODE + an explicit
node vs having hpage_collapse_find_target_node() set a nodemask. We
could then provide fallback nodes for ties, or if some node
contained > some threshold number of pages.
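
For what it's worth, here is a rough, untested sketch of that direction
(a hypothetical helper, assuming the existing __alloc_pages() nodemask
parameter; the name and placement are made up, this is not a real
patch):

/*
 * Hypothetical sketch: instead of __GFP_THISNODE on a single node,
 * build a nodemask of the most-loaded (tied) nodes and let the
 * allocator fall back within that mask.
 */
static struct page *collapse_alloc_from_mask(struct collapse_control *cc,
					     gfp_t gfp, int preferred_nid)
{
	nodemask_t mask = NODE_MASK_NONE;
	int nid;

	for_each_online_node(nid)
		if (cc->node_load[nid] == cc->node_load[preferred_nid])
			node_set(nid, mask);

	return __alloc_pages(gfp, HPAGE_PMD_ORDER, preferred_nid, &mask);
}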

> > But the mere possibility of the hotplug race should be a sufficient
> > ground to remove those WARN_ONs
>

Agreed.

> The warn_on did help to catch this bug. But the reasons for removing
> it still stand TBH, so we may consider to move this warn_on to the
> callers which care about it?

I didn't come across in a cursory search -- but if there are callers
which try to synchronize with hot remove to ensure __GFP_THISNODE
succeeds, then sure, the warn makes sense to them.

> >
> >
> > Thanks!
> > --
> > Michal Hocko
> > SUSE Labs

2022-11-01 20:29:44

by Yang Shi

Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Tue, Nov 1, 2022 at 12:14 PM Zach O'Keefe <[email protected]> wrote:
>
> On Tue, Nov 1, 2022 at 10:13 AM Yang Shi <[email protected]> wrote:
> >
> > On Tue, Nov 1, 2022 at 12:54 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 31-10-22 17:05:06, Zach O'Keefe wrote:
> > > > On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Mon 31-10-22 11:31:22, Yang Shi wrote:
> > > > > > Syzbot reported the below splat:
> > > > > >
> > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
> > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > > > > Modules linked in:
> > > > > > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
> > > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
> > > > > > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
> > > > > > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > > > > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > > > > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
> > > > > > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
> > > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > > > > > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
> > > > > > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
> > > > > > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > > > > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > > FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
> > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
> > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > > Call Trace:
> > > > > > <TASK>
> > > > > > collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
> > > > >
> > > > > This is quite weird, isn't it? alloc_charge_hpage is selecting the most
> > > > > busy node (as per collapse_control). How come this can be an offline
> > > > > node? Is a parallel memory hotplug happening?
> > > >
> > > > TBH -- I did not look closely at the syzbot reproducer (let alone
> > > > attempt to run it) and assumed this was the case. Taking a quick look,
> > > > at least memory hot remove is enabled:
> > > >
> > > > CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
> > > > CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
> > > > CONFIG_MEMORY_HOTPLUG=y
> > > > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
> > > > CONFIG_MEMORY_HOTREMOVE=y
> > > >
> > > > But looking at the C reproducer, I don't immediately see anywhere
> > > > where we offline nodes. I'll try to run this tomorrow to make sure I'm
> > > > not missing something real here.
> > >
> > > Looking slightly closer at hpage_collapse_scan_file I think that it is
> > > possible that xas_for_each simply doesn't find any entries in the page
> > > cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back
> > > to collapse_file even without any real entries.
> >
> > The khugepaged_max_ptes_none can't be HPAGE_PMD_NR, it must be <=
> > (HPAGE_PMD_NR - 1), but MADV_COLLAPSE does ignore it.
> >
> > But a closer look at the code about how to pick up the preferred node,
> > there seems to be a corner case for MADV_COLLAPSE.
> >
> > The code tried to do some balance if several nodes have the same hit
> > record. Basically it does conceptually:
> > * If the target_node <= last_target_node, then iterate from
> > last_target_node + 1 to MAX_NUMNODES (1024 on default config)
> > * If the max_value == node_load[nid], then target_node = nid
> >
> > So assuming the system has 2 nodes, the target_node is 0 and the
> > last_target_node is 1, if MADV_COLLAPSE path is hit, then it may
> > return 2 for target_node, but it is actually not existing (offline),
> > so the warn is triggered.
> >
>
> You're one step ahead of me, Yang. I was just debugging the syzbot C
> reproducer, and this seems to be exactly the case that is happening.
>
> > The below patch should be able to fix it:
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index ea0d186bc9d4..d24405e6736b 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -787,7 +787,8 @@ static int hpage_collapse_find_target_node(struct
> > collapse_control *cc)
> > if (target_node <= cc->last_target_node)
> > for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
> > nid++)
> > - if (max_value == cc->node_load[nid]) {
> > + if (node_online(nid) &&
> > + max_value == cc->node_load[nid]) {
> > target_node = nid;
> > break;
> > }
> >
>
> Thanks for the patch. I think this is the right place to do the check.
>
> This is slightly tangential - but I don't want to send a new mail
> about it -- but I wonder if we should be doing __GFP_THISNODE +
> explicit node vs having hpage_collapse_find_target_node() set a
> nodemask. We could then provide fallback nodes for ties, or if some
> node contained > some threshold number of pages.

We definitely could, but I'm not sure whether we should make this code
more complicated or not. TBH I think it is already overengineered. If
fallback is fine, then just removing __GFP_THISNODE may be fine too.
Actually I doubt there would be any noticeable difference for real-life
workloads.

>
> > > But the mere possibility of the hotplug race should be a sufficient
> > > ground to remove those WARN_ONs
> >
>
> Agreed.
>
> > The warn_on did help to catch this bug. But the reasons for removing
> > it still stand TBH, so we may consider to move this warn_on to the
> > callers which care about it?
>
> I didn't come across in a cursory search -- but if there are callers
> which try to synchronize with hot remove to ensure __GFP_THISNODE
> succeeds, then sure, the warn makes sense to them.

Yeah, we are on the same page, the specific warning could be added by
the caller instead of in the common path.

>
> > >
> > >
> > > Thanks!
> > > --
> > > Michal Hocko
> > > SUSE Labs

2022-11-01 22:13:21

by Zach O'Keefe

Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Tue, Nov 1, 2022 at 1:09 PM Yang Shi <[email protected]> wrote:
>
> On Tue, Nov 1, 2022 at 12:14 PM Zach O'Keefe <[email protected]> wrote:
> >
> > On Tue, Nov 1, 2022 at 10:13 AM Yang Shi <[email protected]> wrote:
> > >
> > > On Tue, Nov 1, 2022 at 12:54 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Mon 31-10-22 17:05:06, Zach O'Keefe wrote:
> > > > > On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <[email protected]> wrote:
> > > > > >
> > > > > > On Mon 31-10-22 11:31:22, Yang Shi wrote:
> > > > > > > Syzbot reported the below splat:
> > > > > > >
> > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
> > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > > > > > Modules linked in:
> > > > > > > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
> > > > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
> > > > > > > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
> > > > > > > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > > > > > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > > > > > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
> > > > > > > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
> > > > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > > > > > > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
> > > > > > > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
> > > > > > > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > > > > > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > > > FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
> > > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
> > > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > > > Call Trace:
> > > > > > > <TASK>
> > > > > > > collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
> > > > > >
> > > > > > This is quite weird, isn't it? alloc_charge_hpage is selecting the most
> > > > > > busy node (as per collapse_control). How come this can be an offline
> > > > > > node? Is a parallel memory hotplug happening?
> > > > >
> > > > > TBH -- I did not look closely at the syzbot reproducer (let alone
> > > > > attempt to run it) and assumed this was the case. Taking a quick look,
> > > > > at least memory hot remove is enabled:
> > > > >
> > > > > CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
> > > > > CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
> > > > > CONFIG_MEMORY_HOTPLUG=y
> > > > > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
> > > > > CONFIG_MEMORY_HOTREMOVE=y
> > > > >
> > > > > But looking at the C reproducer, I don't immediately see anywhere
> > > > > where we offline nodes. I'll try to run this tomorrow to make sure I'm
> > > > > not missing something real here.
> > > >
> > > > Looking slightly closer at hpage_collapse_scan_file I think that it is
> > > > possible that xas_for_each simply doesn't find any entries in the page
> > > > cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back
> > > > to collapse_file even without any real entries.
> > >
> > > The khugepaged_max_ptes_none can't be HPAGE_PMD_NR, it must be <=
> > > (HPAGE_PMD_NR - 1), but MADV_COLLAPSE does ignore it.
> > >
> > > But a closer look at the code about how to pick up the preferred node,
> > > there seems to be a corner case for MADV_COLLAPSE.
> > >
> > > The code tried to do some balance if several nodes have the same hit
> > > record. Basically it does conceptually:
> > > * If the target_node <= last_target_node, then iterate from
> > > last_target_node + 1 to MAX_NUMNODES (1024 on default config)
> > > * If the max_value == node_load[nid], then target_node = nid
> > >
> > > So assuming the system has 2 nodes, the target_node is 0 and the
> > > last_target_node is 1, if MADV_COLLAPSE path is hit, then it may
> > > return 2 for target_node, but it is actually not existing (offline),
> > > so the warn is triggered.
> > >
> >
> > You're one step ahead of me, Yang. I was just debugging the syzbot C
> > reproducer, and this seems to be exactly the case that is happening.
> >
> > > The below patch should be able to fix it:
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index ea0d186bc9d4..d24405e6736b 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -787,7 +787,8 @@ static int hpage_collapse_find_target_node(struct
> > > collapse_control *cc)
> > > if (target_node <= cc->last_target_node)
> > > for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
> > > nid++)
> > > - if (max_value == cc->node_load[nid]) {
> > > + if (node_online(nid) &&
> > > + max_value == cc->node_load[nid]) {
> > > target_node = nid;
> > > break;
> > > }
> > >
> >
> > Thanks for the patch. I think this is the right place to do the check.
> >
> > This is slightly tangential - but I don't want to send a new mail
> > about it -- but I wonder if we should be doing __GFP_THISNODE +
> > explicit node vs having hpage_collapse_find_target_node() set a
> > nodemask. We could then provide fallback nodes for ties, or if some
> > node contained > some threshold number of pages.
>
> We definitely could, but I'm not sure whether we should make this code
> more complicated or not. TBH I think it is already overengineered. If
> fallback is fine, then just removing __GFP_THISNODE may be fine too.
> Actually I doubt there would be any noticeable difference for real
> life workload.
>

It would definitely make it more complicated. Fallback to certain
nodes is ok -- we really would like to avoid allocating a THP if it's
mostly accessed remotely. Anyways -- as you mention, likely no
difference in real workloads, so fine leaving as-is.

> >
> > > > But the mere possibility of the hotplug race should be a sufficient
> > > > ground to remove those WARN_ONs
> > >
> >
> > Agreed.
> >
> > > The warn_on did help to catch this bug. But the reasons for removing
> > > it still stand TBH, so we may consider to move this warn_on to the
> > > callers which care about it?
> >
> > I didn't come across in a cursory search -- but if there are callers
> > which try to synchronize with hot remove to ensure __GFP_THISNODE
> > succeeds, then sure, the warn makes sense to them.
>
> Yeah, we are on the same page, the specific warning could be added by
> the caller instead of in the common path.
>
> >
> > > >
> > > >
> > > > Thanks!
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs

2022-11-02 07:50:56

by Michal Hocko

Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Tue 01-11-22 10:12:49, Yang Shi wrote:
> On Tue, Nov 1, 2022 at 12:54 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 31-10-22 17:05:06, Zach O'Keefe wrote:
> > > On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Mon 31-10-22 11:31:22, Yang Shi wrote:
> > > > > Syzbot reported the below splat:
> > > > >
> > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
> > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > > > Modules linked in:
> > > > > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
> > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
> > > > > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
> > > > > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > > > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > > > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
> > > > > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
> > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > > > > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
> > > > > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
> > > > > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > > > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
> > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
> > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > Call Trace:
> > > > > <TASK>
> > > > > collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
> > > >
> > > > This is quite weird, isn't it? alloc_charge_hpage is selecting the most
> > > > busy node (as per collapse_control). How come this can be an offline
> > > > node? Is a parallel memory hotplug happening?
> > >
> > > TBH -- I did not look closely at the syzbot reproducer (let alone
> > > attempt to run it) and assumed this was the case. Taking a quick look,
> > > at least memory hot remove is enabled:
> > >
> > > CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
> > > CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
> > > CONFIG_MEMORY_HOTPLUG=y
> > > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
> > > CONFIG_MEMORY_HOTREMOVE=y
> > >
> > > But looking at the C reproducer, I don't immediately see anywhere
> > > where we offline nodes. I'll try to run this tomorrow to make sure I'm
> > > not missing something real here.
> >
> > Looking slightly closer at hpage_collapse_scan_file I think that it is
> > possible that xas_for_each simply doesn't find any entries in the page
> > cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back
> > to collapse_file even without any real entries.
>
> The khugepaged_max_ptes_none can't be HPAGE_PMD_NR, it must be <=
> (HPAGE_PMD_NR - 1), but MADV_COLLAPSE does ignore it.

OK, I see.

> But a closer look at the code about how to pick up the preferred node,
> there seems to be a corner case for MADV_COLLAPSE.
>
> The code tried to do some balance if several nodes have the same hit
> record. Basically it does conceptually:
> * If the target_node <= last_target_node, then iterate from
> last_target_node + 1 to MAX_NUMNODES (1024 on default config)
> * If the max_value == node_load[nid], then target_node = nid

Correct

> So assuming the system has 2 nodes, the target_node is 0 and the
> last_target_node is 1, if MADV_COLLAPSE path is hit, then it may
> return 2 for target_node, but it is actually not existing (offline),
> so the warn is triggered.

How can node_load[2] > 0 (IIUC max_value > 0 here) if the node is
offline (other than a race with hotplug)?

> The below patch should be able to fix it:
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index ea0d186bc9d4..d24405e6736b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -787,7 +787,8 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
> if (target_node <= cc->last_target_node)
> for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
> nid++)
> - if (max_value == cc->node_load[nid]) {
> + if (node_online(nid) &&
> + max_value == cc->node_load[nid]) {
> target_node = nid;
> break;
> }

Nope, this is equally racy. node_online might become false right after
the check.
--
Michal Hocko
SUSE Labs

2022-11-02 07:55:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Wed 02-11-22 08:39:25, Michal Hocko wrote:
> On Tue 01-11-22 12:13:35, Zach O'Keefe wrote:
> [...]
> > This is slightly tangential - but I don't want to send a new mail
> > about it -- but I wonder if we should be doing __GFP_THISNODE +
> > explicit node vs having hpage_collapse_find_target_node() set a
> > nodemask. We could then provide fallback nodes for ties, or if some
> > node contained > some threshold number of pages.
>
> I would simply go with something like this (not even compile tested):

Btw. while at it. It is really ugly to allocate 4kB stack space for node
mask for !NUMA configurations! If you are touching that area then this
shouldn't be hard to fix.
--
Michal Hocko
SUSE Labs

2022-11-02 08:05:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Tue 01-11-22 12:13:35, Zach O'Keefe wrote:
[...]
> This is slightly tangential - but I don't want to send a new mail
> about it -- but I wonder if we should be doing __GFP_THISNODE +
> explicit node vs having hpage_collapse_find_target_node() set a
> nodemask. We could then provide fallback nodes for ties, or if some
> node contained > some threshold number of pages.

I would simply go with something like this (not even compile tested):

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4734315f7940..947a5158fe11 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -96,9 +96,6 @@ struct collapse_control {

/* Num pages scanned per node */
u32 node_load[MAX_NUMNODES];
-
- /* Last target selected in hpage_collapse_find_target_node() */
- int last_target_node;
};

/**
@@ -734,7 +731,6 @@ static void khugepaged_alloc_sleep(void)

struct collapse_control khugepaged_collapse_control = {
.is_khugepaged = true,
- .last_target_node = NUMA_NO_NODE,
};

static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
@@ -772,7 +768,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
}

#ifdef CONFIG_NUMA
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int hpage_collapse_find_target_node(struct collapse_control *cc, nodemask_t *alloc_mask)
{
int nid, target_node = 0, max_value = 0;

@@ -783,28 +779,25 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
target_node = nid;
}

+ nodes_clear(*alloc_mask);
/* do some balance if several nodes have the same hit record */
- if (target_node <= cc->last_target_node)
- for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
- nid++)
- if (max_value == cc->node_load[nid]) {
- target_node = nid;
- break;
- }
+ for_each_online_node(nid) {
+ if (max_value == cc->node_load[nid])
+ node_set(nid, *alloc_mask);
+ }

- cc->last_target_node = target_node;
return target_node;
}
#else
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int hpage_collapse_find_target_node(struct collapse_control *cc, nodemask_t *alloc_mask)
{
return 0;
}
#endif

-static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node)
+static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node, nodemask_t *nmask)
{
- *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
+ *hpage = __alloc_pages(gfp, HPAGE_PMD_ORDER, node, nmask);
if (unlikely(!*hpage)) {
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
return false;
@@ -958,9 +951,18 @@ static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
/* Only allocate from the target node */
gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
GFP_TRANSHUGE) | __GFP_THISNODE;
- int node = hpage_collapse_find_target_node(cc);
+ NODEMASK_ALLOC(nodemask_t, nmask, GFP_KERNEL);
+ int node;
+ int ret;
+
+ if (!nmask)
+ return SCAN_ALLOC_HUGE_PAGE_FAIL;
+
+ node = hpage_collapse_find_target_node(cc, nmask);
+ ret = hpage_collapse_alloc_page(hpage, gfp, node, nmask);
+ NODEMASK_FREE(nmask);

- if (!hpage_collapse_alloc_page(hpage, gfp, node))
+ if (!ret)
return SCAN_ALLOC_HUGE_PAGE_FAIL;
if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
return SCAN_CGROUP_CHARGE_FAIL;
@@ -2576,7 +2578,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
if (!cc)
return -ENOMEM;
cc->is_khugepaged = false;
- cc->last_target_node = NUMA_NO_NODE;

mmgrab(mm);
lru_add_drain_all();
--
Michal Hocko
SUSE Labs

2022-11-02 16:32:24

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Wed, Nov 2, 2022 at 12:14 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 01-11-22 10:12:49, Yang Shi wrote:
> > On Tue, Nov 1, 2022 at 12:54 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 31-10-22 17:05:06, Zach O'Keefe wrote:
> > > > On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Mon 31-10-22 11:31:22, Yang Shi wrote:
> > > > > > Syzbot reported the below splat:
> > > > > >
> > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
> > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > > > > Modules linked in:
> > > > > > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
> > > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
> > > > > > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
> > > > > > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > > > > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > > > > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
> > > > > > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
> > > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > > > > > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
> > > > > > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
> > > > > > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > > > > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > > FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
> > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
> > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > > Call Trace:
> > > > > > <TASK>
> > > > > > collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
> > > > >
> > > > > This is quite weird, isn't it? alloc_charge_hpage is selecting the most
> > > > > busy node (as per collapse_control). How come this can be an offline
> > > > > node? Is a parallel memory hotplug happening?
> > > >
> > > > TBH -- I did not look closely at the syzbot reproducer (let alone
> > > > attempt to run it) and assumed this was the case. Taking a quick look,
> > > > at least memory hot remove is enabled:
> > > >
> > > > CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
> > > > CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
> > > > CONFIG_MEMORY_HOTPLUG=y
> > > > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
> > > > CONFIG_MEMORY_HOTREMOVE=y
> > > >
> > > > But looking at the C reproducer, I don't immediately see anywhere
> > > > where we offline nodes. I'll try to run this tomorrow to make sure I'm
> > > > not missing something real here.
> > >
> > > Looking slightly closer at hpage_collapse_scan_file I think that it is
> > > possible that xas_for_each simply doesn't find any entries in the page
> > > cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back
> > > to collapse_file even without any real entries.
> >
> > The khugepaged_max_ptes_none can't be HPAGE_PMD_NR, it must be <=
> > (HPAGE_PMD_NR - 1), but MADV_COLLAPSE does ignore it.
>
> OK, I see.
>
> > But a closer look at the code about how to pick up the preferred node,
> > there seems to be a corner case for MADV_COLLAPSE.
> >
> > The code tried to do some balance if several nodes have the same hit
> > record. Basically it does conceptually:
> > * If the target_node <= last_target_node, then iterate from
> > last_target_node + 1 to MAX_NUMNODES (1024 on default config)
> > * If the max_value == node_load[nid], then target_node = nid
>
> Correct
>
> > So assuming the system has 2 nodes, the target_node is 0 and the
> > last_target_node is 1, if MADV_COLLAPSE path is hit, then it may
> > return 2 for target_node, but it is actually not existing (offline),
> > so the warn is triggered.
>
> How can node_load[2] > 0 (IIUC max_value > 0 here) if the node is
> offline (other than a race with hotplug)?

The max_value may be 0 if there is no entry in page cache for that range IIUC.

>
> > The below patch should be able to fix it:
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index ea0d186bc9d4..d24405e6736b 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -787,7 +787,8 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
> > if (target_node <= cc->last_target_node)
> > for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
> > nid++)
> > - if (max_value == cc->node_load[nid]) {
> > + if (node_online(nid) &&
> > + max_value == cc->node_load[nid]) {
> > target_node = nid;
> > break;
> > }
>
> Nope, this is equally racy. node_online might become false right after
> the check.

Yes, my point is this could filter out the non-existing node case, for
example, the system just has 2 nodes physically, but MAX_NUMNODES > 2.
It doesn't prevent offlining, it is still racy, just like other
__GFP_THISNODE call sites.
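
To make the corner case concrete, here is a small userspace sketch (not
kernel code; the two-online-node setup and the node_online() stand-in are
assumptions) that only simulates the balance loop with an all-zero
node_load[], which is what an empty page cache range gives MADV_COLLAPSE:

#include <stdio.h>
#include <stdbool.h>

#define MAX_NUMNODES 1024
#define NR_ONLINE_NODES 2		/* assumed 2-node machine */

static unsigned int node_load[MAX_NUMNODES];	/* all zero: nothing scanned */

static bool node_online(int nid)
{
	return nid < NR_ONLINE_NODES;
}

static int find_target_node(int last_target_node, bool check_online)
{
	int nid, target_node = 0, max_value = 0;

	for (nid = 0; nid < MAX_NUMNODES; nid++)
		if (node_load[nid] > max_value) {
			max_value = node_load[nid];
			target_node = nid;
		}

	/* do some balance if several nodes have the same hit record */
	if (target_node <= last_target_node)
		for (nid = last_target_node + 1; nid < MAX_NUMNODES; nid++)
			if ((!check_online || node_online(nid)) &&
			    max_value == node_load[nid]) {
				target_node = nid;
				break;
			}

	return target_node;
}

int main(void)
{
	/* target_node 0, last_target_node 1, max_value 0: every nid ties */
	printf("without node_online(): nid %d\n", find_target_node(1, false));
	printf("with node_online():    nid %d\n", find_target_node(1, true));
	return 0;
}

Without the check the tie-break hands back nid 2, which __alloc_pages_node()
then warns about; with the check the loop finds no online candidate above
nid 1 and the target stays on node 0.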

> --
> Michal Hocko
> SUSE Labs

2022-11-02 16:34:03

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 01-11-22 12:13:35, Zach O'Keefe wrote:
> [...]
> > This is slightly tangential - but I don't want to send a new mail
> > about it -- but I wonder if we should be doing __GFP_THISNODE +
> > explicit node vs having hpage_collapse_find_target_node() set a
> > nodemask. We could then provide fallback nodes for ties, or if some
> > node contained > some threshold number of pages.
>
> I would simply go with something like this (not even compile tested):

Thanks, Michal. It is definitely an option. As I talked with Zach, I'm
not sure whether it is worth making the code more complicated for such
micro optimization or not. Removing __GFP_THISNODE or even removing
the node balance code should be fine too IMHO. TBH I doubt there would
be any noticeable difference.

>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4734315f7940..947a5158fe11 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -96,9 +96,6 @@ struct collapse_control {
>
> /* Num pages scanned per node */
> u32 node_load[MAX_NUMNODES];
> -
> - /* Last target selected in hpage_collapse_find_target_node() */
> - int last_target_node;
> };
>
> /**
> @@ -734,7 +731,6 @@ static void khugepaged_alloc_sleep(void)
>
> struct collapse_control khugepaged_collapse_control = {
> .is_khugepaged = true,
> - .last_target_node = NUMA_NO_NODE,
> };
>
> static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
> @@ -772,7 +768,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
> }
>
> #ifdef CONFIG_NUMA
> -static int hpage_collapse_find_target_node(struct collapse_control *cc)
> +static int hpage_collapse_find_target_node(struct collapse_control *cc, nodemask_t *alloc_mask)
> {
> int nid, target_node = 0, max_value = 0;
>
> @@ -783,28 +779,25 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
> target_node = nid;
> }
>
> + nodes_clear(*alloc_mask);
> /* do some balance if several nodes have the same hit record */
> - if (target_node <= cc->last_target_node)
> - for (nid = cc->last_target_node + 1; nid < MAX_NUMNODES;
> - nid++)
> - if (max_value == cc->node_load[nid]) {
> - target_node = nid;
> - break;
> - }
> + for_each_online_node(nid) {
> + if (max_value == cc->node_load[nid])
> + node_set(nid, *alloc_mask);
> + }
>
> - cc->last_target_node = target_node;
> return target_node;
> }
> #else
> -static int hpage_collapse_find_target_node(struct collapse_control *cc)
> +static int hpage_collapse_find_target_node(struct collapse_control *cc, nodemask_t *alloc_mask)
> {
> return 0;
> }
> #endif
>
> -static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node)
> +static bool hpage_collapse_alloc_page(struct page **hpage, gfp_t gfp, int node, nodemask_t *nmask)
> {
> - *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> + *hpage = __alloc_pages(gfp, HPAGE_PMD_ORDER, node, nmask);
> if (unlikely(!*hpage)) {
> count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> return false;
> @@ -958,9 +951,18 @@ static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> /* Only allocate from the target node */
> gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> GFP_TRANSHUGE) | __GFP_THISNODE;
> - int node = hpage_collapse_find_target_node(cc);
> + NODEMASK_ALLOC(nodemask_t, nmask, GFP_KERNEL);
> + int node;
> + int ret;
> +
> + if (!nmask)
> + return SCAN_ALLOC_HUGE_PAGE_FAIL;
> +
> + node = hpage_collapse_find_target_node(cc, nmask);
> + ret = hpage_collapse_alloc_page(hpage, gfp, node, nmask);
> + NODEMASK_FREE(nmask);
>
> - if (!hpage_collapse_alloc_page(hpage, gfp, node))
> + if (!ret)
> return SCAN_ALLOC_HUGE_PAGE_FAIL;
> if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
> return SCAN_CGROUP_CHARGE_FAIL;
> @@ -2576,7 +2578,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> if (!cc)
> return -ENOMEM;
> cc->is_khugepaged = false;
> - cc->last_target_node = NUMA_NO_NODE;
>
> mmgrab(mm);
> lru_add_drain_all();
> --
> Michal Hocko
> SUSE Labs

2022-11-02 16:34:51

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Wed 02-11-22 08:58:09, Yang Shi wrote:
> On Wed, Nov 2, 2022 at 12:14 AM Michal Hocko <[email protected]> wrote:
> >
> > On Tue 01-11-22 10:12:49, Yang Shi wrote:
> > > On Tue, Nov 1, 2022 at 12:54 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Mon 31-10-22 17:05:06, Zach O'Keefe wrote:
> > > > > On Mon, Oct 31, 2022 at 3:08 PM Michal Hocko <[email protected]> wrote:
> > > > > >
> > > > > > On Mon 31-10-22 11:31:22, Yang Shi wrote:
> > > > > > > Syzbot reported the below splat:
> > > > > > >
> > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
> > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > > > > > WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > > > > > Modules linked in:
> > > > > > > CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0
> > > > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
> > > > > > > RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
> > > > > > > RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
> > > > > > > RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
> > > > > > > Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
> > > > > > > RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
> > > > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > > > > > > RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
> > > > > > > RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
> > > > > > > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > > > > > R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > > > > FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
> > > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
> > > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > > > Call Trace:
> > > > > > > <TASK>
> > > > > > > collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
> > > > > >
> > > > > > This is quite weird, isn't it? alloc_charge_hpage is selecting the most
> > > > > > busy node (as per collapse_control). How come this can be an offline
> > > > > > node? Is a parallel memory hotplug happening?
> > > > >
> > > > > TBH -- I did not look closely at the syzbot reproducer (let alone
> > > > > attempt to run it) and assumed this was the case. Taking a quick look,
> > > > > at least memory hot remove is enabled:
> > > > >
> > > > > CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
> > > > > CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
> > > > > CONFIG_MEMORY_HOTPLUG=y
> > > > > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
> > > > > CONFIG_MEMORY_HOTREMOVE=y
> > > > >
> > > > > But looking at the C reproducer, I don't immediately see anywhere
> > > > > where we offline nodes. I'll try to run this tomorrow to make sure I'm
> > > > > not missing something real here.
> > > >
> > > > Looking slightly closer at hpage_collapse_scan_file I think that it is
> > > > possible that xas_for_each simply doesn't find any entries in the page
> > > > cache and with khugepaged_max_ptes_none == HPAGE_PMD_NR we can fall back
> > > > to collapse_file even without any real entries.
> > >
> > > The khugepaged_max_ptes_none can't be HPAGE_PMD_NR, it must be <=
> > > (HPAGE_PMD_NR - 1), but MADV_COLLAPSE does ignore it.
> >
> > OK, I see.
> >
> > > But a closer look at the code about how to pick up the preferred node,
> > > there seems to be a corner case for MADV_COLLAPSE.
> > >
> > > The code tried to do some balance if several nodes have the same hit
> > > record. Basically it does conceptually:
> > > * If the target_node <= last_target_node, then iterate from
> > > last_target_node + 1 to MAX_NUMNODES (1024 on default config)
> > > * If the max_value == node_load[nid], then target_node = nid
> >
> > Correct
> >
> > > So assuming the system has 2 nodes, the target_node is 0 and the
> > > last_target_node is 1, if MADV_COLLAPSE path is hit, then it may
> > > return 2 for target_node, but it is actually not existing (offline),
> > > so the warn is triggered.
> >
> > How can node_load[2] > 0 (IIUC max_value > 0 here) if the node is
> > offline (other than a race with hotplug)?
>
> The max_value may be 0 if there is no entry in page cache for that range IIUC.

Ohh, I am blind. I believe you have already mentioned that but I must
have overlooked it. I have only now noticed the cc->is_khugepaged part of
the check. Presumably the primary idea here is that madvise calls should
de facto create a brand new THP in that case. A creative way to call it
collapsing, but I can see some point in that.
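
Roughly paraphrased (this is a sketch, not a verbatim copy of
mm/khugepaged.c), the gate in hpage_collapse_scan_file() being referred to
looks like:

	/*
	 * khugepaged honours max_ptes_none, MADV_COLLAPSE does not, so an
	 * empty range can reach collapse_file() only on the madvise path.
	 */
	if (result == SCAN_SUCCEED) {
		if (cc->is_khugepaged &&
		    present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
			result = SCAN_EXCEED_NONE_PTE;
			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
		} else {
			result = collapse_file(mm, addr, file, start, cc);
		}
	}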

See my other reply on how to deal with that in a more reasonable (IMHO)
way.

--
Michal Hocko
SUSE Labs

2022-11-02 16:35:03

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Wed 02-11-22 09:03:57, Yang Shi wrote:
> On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <[email protected]> wrote:
> >
> > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote:
> > [...]
> > > This is slightly tangential - but I don't want to send a new mail
> > > about it -- but I wonder if we should be doing __GFP_THISNODE +
> > > explicit node vs having hpage_collapse_find_target_node() set a
> > > nodemask. We could then provide fallback nodes for ties, or if some
> > > node contained > some threshold number of pages.
> >
> > I would simply go with something like this (not even compile tested):
>
> Thanks, Michal. It is definitely an option. As I talked with Zach, I'm
> not sure whether it is worth making the code more complicated for such
> micro optimization or not. Removing __GFP_THISNODE or even removing
> the node balance code should be fine too IMHO. TBH I doubt there would
> be any noticeable difference.

I do agree that an explicit nodes (quasi)round robin sounds over
engineered. It makes some sense to try to target the prevalent node
though because this code can be executed from khugepaged and therefore
allocating with a completely different affinity than the original fault.
--
Michal Hocko
SUSE Labs

2022-11-02 18:02:14

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Wed 02-11-22 10:36:07, Yang Shi wrote:
> On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <[email protected]> wrote:
> >
> > On Wed 02-11-22 09:03:57, Yang Shi wrote:
> > > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote:
> > > > [...]
> > > > > This is slightly tangential - but I don't want to send a new mail
> > > > > about it -- but I wonder if we should be doing __GFP_THISNODE +
> > > > > explicit node vs having hpage_collapse_find_target_node() set a
> > > > > nodemask. We could then provide fallback nodes for ties, or if some
> > > > > node contained > some threshold number of pages.
> > > >
> > > > I would simply go with something like this (not even compile tested):
> > >
> > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm
> > > not sure whether it is worth making the code more complicated for such
> > > micro optimization or not. Removing __GFP_THISNODE or even removing
> > > the node balance code should be fine too IMHO. TBH I doubt there would
> > > be any noticeable difference.
> >
> > I do agree that an explicit nodes (quasi)round robin sounds over
> > engineered. It makes some sense to try to target the prevalent node
> > though because this code can be executed from khugepaged and therefore
> > allocating with a completely different affinity than the original fault.
>
> Yeah, the corner case comes from the node balance code, it just tries
> to balance between multiple prevalent nodes, so you agree to remove it
> IIRC?

Yeah, let's just collect all good nodes into a nodemask and keep
__GFP_THISNODE in place. You can consider having the nodemask per collapse_control
so that you allocate it only once in the struct lifetime.

And as mentioned in other reply it would be really nice to hide this
under CONFIG_NUMA (in a standalone follow-up of course).
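
One way to read that suggestion (an untested sketch in the spirit of the
diff above; the alloc_nmask field name is made up here for illustration)
would be:

struct collapse_control {
	bool is_khugepaged;

	/* Num pages scanned per node */
	u32 node_load[MAX_NUMNODES];

#ifdef CONFIG_NUMA
	/* nodes that are acceptable for the huge page allocation */
	nodemask_t alloc_nmask;
#endif
};

#ifdef CONFIG_NUMA
static int hpage_collapse_find_target_node(struct collapse_control *cc)
{
	int nid, target_node = 0, max_value = 0;

	nodes_clear(cc->alloc_nmask);
	for_each_online_node(nid) {
		if (cc->node_load[nid] > max_value) {
			max_value = cc->node_load[nid];
			target_node = nid;
		}
	}

	/* every online node tied at max_value is an acceptable fallback */
	for_each_online_node(nid) {
		if (max_value == cc->node_load[nid])
			node_set(nid, cc->alloc_nmask);
	}

	return target_node;
}
#endif

The mask then lives as long as the collapse_control itself, so nothing
needs to be allocated per collapse attempt.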

--
Michal Hocko
SUSE Labs

2022-11-02 18:07:00

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 02-11-22 09:03:57, Yang Shi wrote:
> > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote:
> > > [...]
> > > > This is slightly tangential - but I don't want to send a new mail
> > > > about it -- but I wonder if we should be doing __GFP_THISNODE +
> > > > explicit node vs having hpage_collapse_find_target_node() set a
> > > > nodemask. We could then provide fallback nodes for ties, or if some
> > > > node contained > some threshold number of pages.
> > >
> > > I would simply go with something like this (not even compile tested):
> >
> > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm
> > not sure whether it is worth making the code more complicated for such
> > micro optimization or not. Removing __GFP_THISNODE or even removing
> > the node balance code should be fine too IMHO. TBH I doubt there would
> > be any noticeable difference.
>
> I do agree that an explicit nodes (quasi)round robin sounds over
> engineered. It makes some sense to try to target the prevalent node
> though because this code can be executed from khugepaged and therefore
> allocating with a completely different affinity than the original fault.

Yeah, the corner case comes from the node balance code, it just tries
to balance between multiple prevalent nodes, so you agree to remove it
IIRC?

> --
> Michal Hocko
> SUSE Labs

2022-11-02 19:10:13

by Zach O'Keefe

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Wed, Nov 2, 2022 at 11:18 AM Yang Shi <[email protected]> wrote:
>
> On Wed, Nov 2, 2022 at 10:47 AM Michal Hocko <[email protected]> wrote:
> >
> > On Wed 02-11-22 10:36:07, Yang Shi wrote:
> > > On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Wed 02-11-22 09:03:57, Yang Shi wrote:
> > > > > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <[email protected]> wrote:
> > > > > >
> > > > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote:
> > > > > > [...]
> > > > > > > This is slightly tangential - but I don't want to send a new mail
> > > > > > > about it -- but I wonder if we should be doing __GFP_THISNODE +
> > > > > > > explicit node vs having hpage_collapse_find_target_node() set a
> > > > > > > nodemask. We could then provide fallback nodes for ties, or if some
> > > > > > > node contained > some threshold number of pages.
> > > > > >
> > > > > > I would simply go with something like this (not even compile tested):
> > > > >
> > > > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm
> > > > > not sure whether it is worth making the code more complicated for such
> > > > > micro optimization or not. Removing __GFP_THISNODE or even removing
> > > > > the node balance code should be fine too IMHO. TBH I doubt there would
> > > > > be any noticeable difference.
> > > >
> > > > I do agree that an explicit nodes (quasi)round robin sounds over
> > > > engineered. It makes some sense to try to target the prevalent node
> > > > though because this code can be executed from khugepaged and therefore
> > > > allocating with a completely different affinity than the original fault.
> > >
> > > Yeah, the corner case comes from the node balance code, it just tries
> > > to balance between multiple prevalent nodes, so you agree to remove it
> > > IIRC?
> >
> > Yeah, let's just collect all good nodes into a nodemask and keep
> > __GFP_THISNODE in place. You can consider having the nodemask per collapse_control
> > so that you allocate it only once in the struct lifetime.
>
> Actually my intention is more aggressive, just remove that node balance code.
>

The balancing code dates back to 2013 commit 9f1b868a13ac ("mm: thp:
khugepaged: add policy for finding target node") where it was made to
satisfy "numactl --interleave=all". I don't know why any real
workloads would want this -- but there very well could be a valid use
case. If not, I think it could be removed independent of what we do
with __GFP_THISNODE and nodemask.

Balancing aside -- I haven't fully thought through what an ideal (and
further overengineered) solution would be for numa, but one (perceived
- not measured) issue that khugepaged might have (MADV_COLLAPSE
doesn't have the choice) is on systems with many, many nodes with
source pages sprinkled across all of them. Should we collapse these
pages into a single THP from the node with the most (but could still
be a small %) pages? Probably there are better candidates. So, maybe a
khugepaged-only check for max_value > (HPAGE_PMD_NR >> 1) or something
makes sense.
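
A minimal form of that khugepaged-only bail-out (purely illustrative, not
existing code; the helper name and the half-PMD threshold are assumptions)
could be a predicate evaluated while the per-node hit counts are in scope:

static bool hpage_collapse_too_scattered(struct collapse_control *cc,
					 int max_value)
{
	/*
	 * Hypothetical khugepaged-only check: give up when even the most
	 * loaded node backs at most half of the PMD range, so the new THP
	 * would be mostly remote.  MADV_COLLAPSE, which must produce a
	 * THP, would never bail out here.
	 */
	return cc->is_khugepaged && max_value <= (HPAGE_PMD_NR >> 1);
}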

> >
> > And as mentioned in other reply it would be really nice to hide this
> > under CONFIG_NUMA (in a standalone follow-up of course).
>
> The hpage_collapse_find_target_node() function itself is defined under
> CONFIG_NUMA.
>
> >
> > --
> > Michal Hocko
> > SUSE Labs

2022-11-02 19:10:17

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Wed, Nov 2, 2022 at 10:47 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 02-11-22 10:36:07, Yang Shi wrote:
> > On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Wed 02-11-22 09:03:57, Yang Shi wrote:
> > > > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote:
> > > > > [...]
> > > > > > This is slightly tangential - but I don't want to send a new mail
> > > > > > about it -- but I wonder if we should be doing __GFP_THISNODE +
> > > > > > explicit node vs having hpage_collapse_find_target_node() set a
> > > > > > nodemask. We could then provide fallback nodes for ties, or if some
> > > > > > node contained > some threshold number of pages.
> > > > >
> > > > > I would simply go with something like this (not even compile tested):
> > > >
> > > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm
> > > > not sure whether it is worth making the code more complicated for such
> > > > micro optimization or not. Removing __GFP_THISNODE or even removing
> > > > the node balance code should be fine too IMHO. TBH I doubt there would
> > > > be any noticeable difference.
> > >
> > > I do agree that an explicit nodes (quasi)round robin sounds over
> > > engineered. It makes some sense to try to target the prevalent node
> > > though because this code can be executed from khugepaged and therefore
> > > allocating with a completely different affinity than the original fault.
> >
> > Yeah, the corner case comes from the node balance code, it just tries
> > to balance between multiple prevalent nodes, so you agree to remove it
> > IIRC?
>
> Yeah, let's just collect all good nodes into a nodemask and keep
> __GFP_THISNODE in place. You can consider having the nodemask per collapse_control
> so that you allocate it only once in the struct lifetime.

Actually my intention is more aggressive, just remove that node balance code.

>
> And as mentioned in other reply it would be really nice to hide this
> under CONFIG_NUMA (in a standalone follow-up of course).

The hpage_collapse_find_target_node() function itself is defined under
CONFIG_NUMA.

>
> --
> Michal Hocko
> SUSE Labs

2022-11-02 20:46:36

by Zach O'Keefe

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Wed, Nov 2, 2022 at 1:08 PM Yang Shi <[email protected]> wrote:
>
> On Wed, Nov 2, 2022 at 11:59 AM Zach O'Keefe <[email protected]> wrote:
> >
> > On Wed, Nov 2, 2022 at 11:18 AM Yang Shi <[email protected]> wrote:
> > >
> > > On Wed, Nov 2, 2022 at 10:47 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Wed 02-11-22 10:36:07, Yang Shi wrote:
> > > > > On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <[email protected]> wrote:
> > > > > >
> > > > > > On Wed 02-11-22 09:03:57, Yang Shi wrote:
> > > > > > > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote:
> > > > > > > > [...]
> > > > > > > > > This is slightly tangential - but I don't want to send a new mail
> > > > > > > > > about it -- but I wonder if we should be doing __GFP_THISNODE +
> > > > > > > > > explicit node vs having hpage_collapse_find_target_node() set a
> > > > > > > > > nodemask. We could then provide fallback nodes for ties, or if some
> > > > > > > > > node contained > some threshold number of pages.
> > > > > > > >
> > > > > > > > I would simply go with something like this (not even compile tested):
> > > > > > >
> > > > > > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm
> > > > > > > not sure whether it is worth making the code more complicated for such
> > > > > > > micro optimization or not. Removing __GFP_THISNODE or even removing
> > > > > > > the node balance code should be fine too IMHO. TBH I doubt there would
> > > > > > > be any noticeable difference.
> > > > > >
> > > > > > I do agree that an explicit nodes (quasi)round robin sounds over
> > > > > > engineered. It makes some sense to try to target the prevalent node
> > > > > > though because this code can be executed from khugepaged and therefore
> > > > > > allocating with a completely different affinity than the original fault.
> > > > >
> > > > > Yeah, the corner case comes from the node balance code, it just tries
> > > > > to balance between multiple prevalent nodes, so you agree to remove it
> > > > > IIRC?
> > > >
> > > > Yeah, let's just collect all good nodes into a nodemask and keep
> > > > __GFP_THISNODE in place. You can consider having the nodemask per collapse_control
> > > > so that you allocate it only once in the struct lifetime.
> > >
> > > Actually my intention is more aggressive, just remove that node balance code.
> > >
> >
> > The balancing code dates back to 2013 commit 9f1b868a13ac ("mm: thp:
> > khugepaged: add policy for finding target node") where it was made to
> > satisfy "numactl --interleave=all". I don't know why any real
> > workloads would want this -- but there very well could be a valid use
> > case. If not, I think it could be removed independent of what we do
> > with __GFP_THISNODE and nodemask.
>
> Hmm... if the code is used for interleave, I don't think nodemask
> could preserve the behavior IIUC. The nodemask also tries to allocate
> memory from the preferred node, and falls back to the allowed nodes from
> nodemask when the allocation fails on the preferred node. But the
> round robin style node balance tries to distribute the THP on the
> nodes evenly.

Ya, I don't think this has anything to do with nodemask -- I think I
inadvertently started a discussion about it and we now have 2 threads
merged into one :)

> And I just realized that __GFP_THISNODE + nodemask is probably not the
> right combination, right? __GFP_THISNODE does disallow any
> fallback, so the nodemask is actually useless.

Ya I was confused when I read this the first time -- thanks for
clarifying my understanding.

> So I think we narrowed down to two options:
> 1. Preserve the interleave behavior but bail out if the target node is
> not online (it is also racy, but doesn't hurt)
> 2. Remove the node balance code entirely
>

Agreed. Really comes down to if we care about that "numactl
--interleave" use case. My inclination would be to just remove it --
if we didn't have that code today, and someone raised this use case
and asked for the code to be added, I'm not sure it'd be approved.

> >
> > Balancing aside -- I haven't fully thought through what an ideal (and
> > further overengineered) solution would be for numa, but one (perceived
> > - not measured) issue that khugepaged might have (MADV_COLLAPSE
> > doesn't have the choice) is on systems with many, many nodes with
> > source pages sprinkled across all of them. Should we collapse these
> > pages into a single THP from the node with the most (but could still
> > be a small %) pages? Probably there are better candidates. So, maybe a
> > khugepaged-only check for max_value > (HPAGE_PMD_NR >> 1) or something
> > makes sense.
>
> Anyway you have to allocate the THP on one node, and I can't think of a
> better idea to make the node selection fairer. But I'd prefer to wait
> until a real-life use case surfaces.

So, the thought here is that we don't _have_ to allocate a THP. We can
bail-out, just as we do with max_ptes_*, when we think allocating a
THP isn't beneficial. As mentioned, MADV_COLLAPSE still has to
allocate a THP -- but khugepaged need not. I'm fine waiting on this
until needed, however.

> >
> > > >
> > > > And as mentioned in other reply it would be really nice to hide this
> > > > under CONFIG_NUMA (in a standalone follow-up of course).
> > >
> > > The hpage_collapse_find_target_node() function itself is defined under
> > > CONFIG_NUMA.
> > >
> > > >
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs

2022-11-02 21:14:42

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Wed, Nov 2, 2022 at 11:59 AM Zach O'Keefe <[email protected]> wrote:
>
> On Wed, Nov 2, 2022 at 11:18 AM Yang Shi <[email protected]> wrote:
> >
> > On Wed, Nov 2, 2022 at 10:47 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Wed 02-11-22 10:36:07, Yang Shi wrote:
> > > > On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Wed 02-11-22 09:03:57, Yang Shi wrote:
> > > > > > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <[email protected]> wrote:
> > > > > > >
> > > > > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote:
> > > > > > > [...]
> > > > > > > > This is slightly tangential - but I don't want to send a new mail
> > > > > > > > about it -- but I wonder if we should be doing __GFP_THISNODE +
> > > > > > > > explicit node vs having hpage_collapse_find_target_node() set a
> > > > > > > > nodemask. We could then provide fallback nodes for ties, or if some
> > > > > > > > node contained > some threshold number of pages.
> > > > > > >
> > > > > > > I would simply go with something like this (not even compile tested):
> > > > > >
> > > > > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm
> > > > > > not sure whether it is worth making the code more complicated for such
> > > > > > micro optimization or not. Removing __GFP_THISNODE or even removing
> > > > > > the node balance code should be fine too IMHO. TBH I doubt there would
> > > > > > be any noticeable difference.
> > > > >
> > > > > I do agree that an explicit nodes (quasi)round robin sounds over
> > > > > engineered. It makes some sense to try to target the prevalent node
> > > > > though because this code can be executed from khugepaged and therefore
> > > > > allocating with a completely different affinity than the original fault.
> > > >
> > > > Yeah, the corner case comes from the node balance code, it just tries
> > > > to balance between multiple prevalent nodes, so you agree to remove it
> > > > IIRC?
> > >
> > > Yeah, let's just collect all good nodes into a nodemask and keep
> > > __GFP_THISNODE in place. You can consider having the nodemask per collapse_control
> > > so that you allocate it only once in the struct lifetime.
> >
> > Actually my intention is more aggressive, just remove that node balance code.
> >
>
> The balancing code dates back to 2013 commit 9f1b868a13ac ("mm: thp:
> khugepaged: add policy for finding target node") where it was made to
> satisfy "numactl --interleave=all". I don't know why any real
> workloads would want this -- but there very well could be a valid use
> case. If not, I think it could be removed independent of what we do
> with __GFP_THISNODE and nodemask.

Hmm... if the code is used for interleave, I don't think nodemask
could preserve the behavior IIUC. The nodemask also tries to allocate
memory from the preferred node, and falls back to the allowed nodes from
nodemask when the allocation fails on the preferred node. But the
round robin style node balance tries to distribute the THP on the
nodes evenly.

And I just realized that __GFP_THISNODE + nodemask is probably not the
right combination, right? __GFP_THISNODE does disallow any
fallback, so the nodemask is actually useless.

So I think we narrowed down to two options:
1. Preserve the interleave behavior but bail out if the target node is
not online (it is also racy, but doesn't hurt)
2. Remove the node balance code entirely
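
For reference, option 2 boils down to something like this untested sketch
(the round-robin tie-break and the last_target_node bookkeeping are simply
gone):

static int hpage_collapse_find_target_node(struct collapse_control *cc)
{
	int nid, target_node = 0, max_value = 0;

	/* just pick the most loaded online node */
	for_each_online_node(nid) {
		if (cc->node_load[nid] > max_value) {
			max_value = cc->node_load[nid];
			target_node = nid;
		}
	}

	return target_node;
}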

>
> Balancing aside -- I haven't fully thought through what an ideal (and
> further overengineered) solution would be for numa, but one (perceived
> - not measured) issue that khugepaged might have (MADV_COLLAPSE
> doesn't have the choice) is on systems with many, many nodes with
> source pages sprinkled across all of them. Should we collapse these
> pages into a single THP from the node with the most (but could still
> be a small %) pages? Probably there are better candidates. So, maybe a
> khugepaged-only check for max_value > (HPAGE_PMD_NR >> 1) or something
> makes sense.

Anyway you have to allocate the THP on one node, and I can't think of a
better idea to make the node selection fairer. But I'd prefer to wait
until a real-life use case surfaces.

>
> > >
> > > And as mentioned in other reply it would be really nice to hide this
> > > under CONFIG_NUMA (in a standalone follow-up of course).
> >
> > The hpage_collapse_find_target_node() function itself is defined under
> > CONFIG_NUMA.
> >
> > >
> > > --
> > > Michal Hocko
> > > SUSE Labs

2022-11-03 08:06:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Wed 02-11-22 11:58:26, Zach O'Keefe wrote:
> On Wed, Nov 2, 2022 at 11:18 AM Yang Shi <[email protected]> wrote:
> >
> > On Wed, Nov 2, 2022 at 10:47 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Wed 02-11-22 10:36:07, Yang Shi wrote:
> > > > On Wed, Nov 2, 2022 at 9:15 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Wed 02-11-22 09:03:57, Yang Shi wrote:
> > > > > > On Wed, Nov 2, 2022 at 12:39 AM Michal Hocko <[email protected]> wrote:
> > > > > > >
> > > > > > > On Tue 01-11-22 12:13:35, Zach O'Keefe wrote:
> > > > > > > [...]
> > > > > > > > This is slightly tangential - but I don't want to send a new mail
> > > > > > > > about it -- but I wonder if we should be doing __GFP_THISNODE +
> > > > > > > > explicit node vs having hpage_collapse_find_target_node() set a
> > > > > > > > nodemask. We could then provide fallback nodes for ties, or if some
> > > > > > > > node contained > some threshold number of pages.
> > > > > > >
> > > > > > > I would simply go with something like this (not even compile tested):
> > > > > >
> > > > > > Thanks, Michal. It is definitely an option. As I talked with Zach, I'm
> > > > > > not sure whether it is worth making the code more complicated for such
> > > > > > micro optimization or not. Removing __GFP_THISNODE or even removing
> > > > > > the node balance code should be fine too IMHO. TBH I doubt there would
> > > > > > be any noticeable difference.
> > > > >
> > > > > I do agree that an explicit nodes (quasi)round robin sounds over
> > > > > engineered. It makes some sense to try to target the prevalent node
> > > > > though because this code can be executed from khugepaged and therefore
> > > > > allocating with a completely different affinity than the original fault.
> > > >
> > > > Yeah, the corner case comes from the node balance code, it just tries
> > > > to balance between multiple prevalent nodes, so you agree to remove it
> > > > IIRC?
> > >
> > > Yeah, let's just collect all good nodes into a nodemask and keep
> > > __GFP_THISNODE in place. You can consider having the nodemask per collapse_control
> > > so that you allocate it only once in the struct lifetime.
> >
> > Actually my intention is more aggressive, just remove that node balance code.
> >
>
> The balancing code dates back to 2013 commit 9f1b868a13ac ("mm: thp:
> khugepaged: add policy for finding target node") where it was made to
> satisfy "numactl --interleave=all". I don't know why any real
> workloads would want this -- but there very well could be a valid use
> case. If not, I think it could be removed independent of what we do
> with __GFP_THISNODE and nodemask.

Thanks for the reference. The patch is really dubious. If the primary
usecase is a memory policy then one should be used. We have the vma
handy. Sure per task policy would be a bigger problem but interleaving
is a mere hint rather than something that has hard requirements.

> Balancing aside -- I haven't fully thought through what an ideal (and
> further overengineered) solution would be for numa, but one (perceived
> - not measured) issue that khugepaged might have (MADV_COLLAPSE
> doesn't have the choice) is on systems with many, many nodes with
> source pages sprinkled across all of them. Should we collapse these
> pages into a single THP from the node with the most (but could still
> be a small %) pages? Probably there are better candidates. So, maybe a
> khugepaged-only check for max_value > (HPAGE_PMD_NR >> 1) or something
> makes sense.

Honestly I do not see any problem to be solved here.

--
Michal Hocko
SUSE Labs

2022-11-03 08:39:26

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Wed 02-11-22 13:08:08, Yang Shi wrote:
[...]
> So I think we narrowed down to two options:
> 1. Preserve the interleave behavior but bail out if the target node is
> not online (it is also racy, but doesn't hurt)

I do not think there is merit in preserving the interleave behavior. The
patch that introduced it is dubious to say the least.

> 2. Remove the node balance code entirely

Yes, removing the balancing makes sense but I would still hope that we
do not fail too easily if the range is populated on multiple nodes
equally. In practice it will likely not matter much I guess but setting
up all nodes with top score is just easy to achieve.
--
Michal Hocko
SUSE Labs

2022-11-03 17:59:28

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm: don't warn if the node is offlined

On Thu, Nov 3, 2022 at 12:54 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 02-11-22 13:08:08, Yang Shi wrote:
> [...]
> > So I think we narrowed down to two options:
> > 1. Preserve the interleave behavior but bail out if the target node is
> > not online (it is also racy, but doesn't hurt)
>
> I do not think there is merit in preserving the interleave behavior. The
> patch that introduced it is dubious to say the least.
>
> > 2. Remove the node balance code entirely
>
> Yes, removing the balancing makes sense but I would still hope that we
> do not fail too easily if the range is populated on multiple nodes
> equally. In practice it will likely not matter much I guess but setting
> up all nodes with top score is just easy to achieve.

OK, thanks. I will come up with a patch to allow fallback between the
nodes, it should be largely based on the change suggested by you.
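
Purely as a sketch of where that could land (illustrative only, not the
actual follow-up patch; cc->alloc_nmask is the hypothetical per-cc mask of
tied nodes mentioned earlier), the allocation side would stop insisting on
a single node:

static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
			      struct collapse_control *cc)
{
	/* no __GFP_THISNODE: fall back within the nodes in cc->alloc_nmask */
	gfp_t gfp = cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
		    GFP_TRANSHUGE;
	int node = hpage_collapse_find_target_node(cc);

	*hpage = __alloc_pages(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
	if (unlikely(!*hpage)) {
		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
		return SCAN_ALLOC_HUGE_PAGE_FAIL;
	}
	if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
		return SCAN_CGROUP_CHARGE_FAIL;
	return SCAN_SUCCEED;
}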

> --
> Michal Hocko
> SUSE Labs