2021-03-29 05:33:20

by Ilya Lipnitskiy

Subject: [PATCH v2] mm: fix race by making init_zero_pfn() early_initcall

There are code paths that rely on zero_pfn to be fully initialized
before core_initcall. For example, wq_sysfs_init() is a core_initcall
function that eventually results in a call to kernel_execve, which
causes a page fault with a subsequent mmput. If zero_pfn is not
initialized by then, the mm's RSS accounting may not be cleaned up
properly, resulting in an error:
BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1

Here is an analysis of the race as seen on a MIPS device. On this
particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
initialized, at which point it becomes PFN 5120:
1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:
[<80340dc8>] kobject_uevent_env+0x7e4/0x7ec
[<8033f8b8>] kset_register+0x68/0x88
[<803cf824>] bus_register+0xdc/0x34c
[<803cfac8>] subsys_virtual_register+0x34/0x78
[<8086afb0>] wq_sysfs_init+0x1c/0x4c
[<80001648>] do_one_initcall+0x50/0x1a8
[<8086503c>] kernel_init_freeable+0x230/0x2c8
[<8066bca0>] kernel_init+0x10/0x100
[<80003038>] ret_from_kernel_thread+0x14/0x1c

2. kobject_uevent_env() calls call_usermodehelper_exec() which executes
kernel_execve asynchronously.

3. Memory allocations in kernel_execve cause a page fault, bumping the
MM_ANONPAGES RSS counter:
[<8015adb4>] add_mm_counter_fast+0xb4/0xc0
[<80160d58>] handle_mm_fault+0x6e4/0xea0
[<80158aa4>] __get_user_pages.part.78+0x190/0x37c
[<8015992c>] __get_user_pages_remote+0x128/0x360
[<801a6d9c>] get_arg_page+0x34/0xa0
[<801a7394>] copy_string_kernel+0x194/0x2a4
[<801a880c>] kernel_execve+0x11c/0x298
[<800420f4>] call_usermodehelper_exec_async+0x114/0x194

4. In case zero_pfn has not been initialized yet, zap_pte_range does
not decrement the MM_ANONPAGES RSS counter and the BUG message is
triggered shortly afterwards when __mmdrop checks the RSS counters:
[<800285e8>] __mmdrop+0x98/0x1d0
[<801a6de8>] free_bprm+0x44/0x118
[<801a86a8>] kernel_execve+0x160/0x1d8
[<800420f4>] call_usermodehelper_exec_async+0x114/0x194
[<80003198>] ret_from_kernel_thread+0x14/0x1c

To avoid races such as the one described above, run init_zero_pfn() at
early_initcall level. Depending on the architecture, ZERO_PAGE is either
constant or gets initialized even earlier, at paging_init, so there is
no issue with initializing zero_pfn earlier.
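
For illustration only, the ordering problem can be modeled in user
space. The sketch below is not kernel code: the two-level enum, the
link_order field, and boot() are hypothetical names, and it assumes
wq_sysfs_init() is ordered ahead of init_zero_pfn() when both sit at
the same level. It also collapses the asynchronous usermode helper into
a direct read of zero_pfn. The value 5120 is the PFN reported above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model: only the two initcall levels relevant to this patch. */
enum level { LEVEL_EARLY = 0, LEVEL_CORE = 1 };

struct initcall {
	enum level level;
	int link_order;		/* assumed position within a level */
	void (*fn)(void);
};

static unsigned long zero_pfn;			/* 0 until init_zero_pfn() runs */
static unsigned long pfn_seen_by_helper;	/* what the exec path observed */

static void init_zero_pfn(void)
{
	zero_pfn = 5120;	/* PFN reported for the MT7621 device */
}

static void wq_sysfs_init(void)
{
	/* Stand-in for the uevent helper faulting in pages during exec. */
	pfn_seen_by_helper = zero_pfn;
}

/* Order by level first, then by the assumed link order. */
static int cmp_initcall(const void *a, const void *b)
{
	const struct initcall *x = a, *y = b;

	if (x->level != y->level)
		return (int)x->level - (int)y->level;
	return x->link_order - y->link_order;
}

/* Run both initcalls with init_zero_pfn() registered at the given level. */
unsigned long boot(enum level zero_pfn_level)
{
	struct initcall calls[] = {
		{ LEVEL_CORE, 0, wq_sysfs_init },
		{ zero_pfn_level, 1, init_zero_pfn },
	};
	size_t i, n = sizeof(calls) / sizeof(calls[0]);

	zero_pfn = 0;
	pfn_seen_by_helper = 0;
	qsort(calls, n, sizeof(calls[0]), cmp_initcall);
	for (i = 0; i < n; i++)
		calls[i].fn();

	/* The helper saw either the uninitialized or the final value. */
	assert(pfn_seen_by_helper == 0 || pfn_seen_by_helper == zero_pfn);
	return pfn_seen_by_helper;
}
```

Under these assumptions, boot(LEVEL_CORE) returns 0 (the helper ran
against an uninitialized zero_pfn), while boot(LEVEL_EARLY) returns
5120.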

ML discussion: https://lore.kernel.org/lkml/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niAXOpqEbt7w@mail.gmail.com/

Signed-off-by: Ilya Lipnitskiy <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: [email protected]
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 46ef306375bd..a8bbc4fc121f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -166,7 +166,7 @@ static int __init init_zero_pfn(void)
 	zero_pfn = page_to_pfn(ZERO_PAGE(0));
 	return 0;
 }
-core_initcall(init_zero_pfn);
+early_initcall(init_zero_pfn);

 void mm_trace_rss_stat(struct mm_struct *mm, int member, long count)
 {
--
2.31.0


2021-03-30 04:46:48

by Ilya Lipnitskiy

Subject: [PATCH v3] mm: fix race by making init_zero_pfn() early_initcall

There are code paths that rely on zero_pfn to be fully initialized
before core_initcall. For example, wq_sysfs_init() is a core_initcall
function that eventually results in a call to kernel_execve, which
causes a page fault with a subsequent mmput. If zero_pfn is not
initialized by then, the mm's RSS accounting may not be cleaned up
properly, resulting in an error:
BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1

Here is an analysis of the race as seen on a MIPS device. On this
particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
initialized, at which point it becomes PFN 5120:
1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:
[<80340dc8>] kobject_uevent_env+0x7e4/0x7ec
[<8033f8b8>] kset_register+0x68/0x88
[<803cf824>] bus_register+0xdc/0x34c
[<803cfac8>] subsys_virtual_register+0x34/0x78
[<8086afb0>] wq_sysfs_init+0x1c/0x4c
[<80001648>] do_one_initcall+0x50/0x1a8
[<8086503c>] kernel_init_freeable+0x230/0x2c8
[<8066bca0>] kernel_init+0x10/0x100
[<80003038>] ret_from_kernel_thread+0x14/0x1c

2. kobject_uevent_env() calls call_usermodehelper_exec() which executes
kernel_execve asynchronously.

3. Memory allocations in kernel_execve cause a page fault, bumping the
MM_ANONPAGES RSS counter:
[<8015adb4>] add_mm_counter_fast+0xb4/0xc0
[<80160d58>] handle_mm_fault+0x6e4/0xea0
[<80158aa4>] __get_user_pages.part.78+0x190/0x37c
[<8015992c>] __get_user_pages_remote+0x128/0x360
[<801a6d9c>] get_arg_page+0x34/0xa0
[<801a7394>] copy_string_kernel+0x194/0x2a4
[<801a880c>] kernel_execve+0x11c/0x298
[<800420f4>] call_usermodehelper_exec_async+0x114/0x194

4. In case zero_pfn has not been initialized yet, zap_pte_range does
not decrement the MM_ANONPAGES RSS counter and the BUG message is
triggered shortly afterwards when __mmdrop checks the RSS counters:
[<800285e8>] __mmdrop+0x98/0x1d0
[<801a6de8>] free_bprm+0x44/0x118
[<801a86a8>] kernel_execve+0x160/0x1d8
[<800420f4>] call_usermodehelper_exec_async+0x114/0x194
[<80003198>] ret_from_kernel_thread+0x14/0x1c

To avoid races such as the one described above, run init_zero_pfn() at
early_initcall level. Depending on the architecture, ZERO_PAGE is either
constant or gets initialized even earlier, at paging_init, so there is
no issue with initializing zero_pfn earlier.

Discussion: https://lkml.kernel.org/r/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niAXOpqEbt7w@mail.gmail.com

Signed-off-by: Ilya Lipnitskiy <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: [email protected]
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5c3b29d3af66..e66b11ac1659 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -166,7 +166,7 @@ static int __init init_zero_pfn(void)
 	zero_pfn = page_to_pfn(ZERO_PAGE(0));
 	return 0;
 }
-core_initcall(init_zero_pfn);
+early_initcall(init_zero_pfn);

 void mm_trace_rss_stat(struct mm_struct *mm, int member, long count)
 {
--
2.31.0

2021-03-30 05:02:18

by Zhou Yanjie

Subject: Re: [PATCH v3] mm: fix race by making init_zero_pfn() early_initcall

Hi Ilya,

On 2021/3/30 12:42 PM, Ilya Lipnitskiy wrote:
> There are code paths that rely on zero_pfn to be fully initialized
> before core_initcall. For example, wq_sysfs_init() is a core_initcall
> function that eventually results in a call to kernel_execve, which
> causes a page fault with a subsequent mmput. If zero_pfn is not
> initialized by then, the mm's RSS accounting may not be cleaned up
> properly, resulting in an error:
> BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1
>
> Here is an analysis of the race as seen on a MIPS device. On this
> particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
> initialized, at which point it becomes PFN 5120:
> 1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:
> [<80340dc8>] kobject_uevent_env+0x7e4/0x7ec
> [<8033f8b8>] kset_register+0x68/0x88
> [<803cf824>] bus_register+0xdc/0x34c
> [<803cfac8>] subsys_virtual_register+0x34/0x78
> [<8086afb0>] wq_sysfs_init+0x1c/0x4c
> [<80001648>] do_one_initcall+0x50/0x1a8
> [<8086503c>] kernel_init_freeable+0x230/0x2c8
> [<8066bca0>] kernel_init+0x10/0x100
> [<80003038>] ret_from_kernel_thread+0x14/0x1c
>
> 2. kobject_uevent_env() calls call_usermodehelper_exec() which executes
> kernel_execve asynchronously.
>
> 3. Memory allocations in kernel_execve cause a page fault, bumping the
> MM_ANONPAGES RSS counter:
> [<8015adb4>] add_mm_counter_fast+0xb4/0xc0
> [<80160d58>] handle_mm_fault+0x6e4/0xea0
> [<80158aa4>] __get_user_pages.part.78+0x190/0x37c
> [<8015992c>] __get_user_pages_remote+0x128/0x360
> [<801a6d9c>] get_arg_page+0x34/0xa0
> [<801a7394>] copy_string_kernel+0x194/0x2a4
> [<801a880c>] kernel_execve+0x11c/0x298
> [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
>
> 4. In case zero_pfn has not been initialized yet, zap_pte_range does
> not decrement the MM_ANONPAGES RSS counter and the BUG message is
> triggered shortly afterwards when __mmdrop checks the RSS counters:
> [<800285e8>] __mmdrop+0x98/0x1d0
> [<801a6de8>] free_bprm+0x44/0x118
> [<801a86a8>] kernel_execve+0x160/0x1d8
> [<800420f4>] call_usermodehelper_exec_async+0x114/0x194
> [<80003198>] ret_from_kernel_thread+0x14/0x1c
>
> To avoid races such as the one described above, run init_zero_pfn() at
> early_initcall level. Depending on the architecture, ZERO_PAGE is either
> constant or gets initialized even earlier, at paging_init, so there is
> no issue with initializing zero_pfn earlier.
>
> Discussion: https://lkml.kernel.org/r/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niAXOpqEbt7w@mail.gmail.com
>
> Signed-off-by: Ilya Lipnitskiy <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: [email protected]
> ---
> mm/memory.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)


Tested-by: 周琰杰 (Zhou Yanjie) <[email protected]> # on CU1000-Neo/X1000E and CU1830-Neo/X1830


> diff --git a/mm/memory.c b/mm/memory.c
> index 5c3b29d3af66..e66b11ac1659 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -166,7 +166,7 @@ static int __init init_zero_pfn(void)
>  	zero_pfn = page_to_pfn(ZERO_PAGE(0));
>  	return 0;
>  }
> -core_initcall(init_zero_pfn);
> +early_initcall(init_zero_pfn);
>
>  void mm_trace_rss_stat(struct mm_struct *mm, int member, long count)
>  {