LinuxLists.cc - [PATCH] mm/memory_hotplug.c: don't fail hot unplug quite so eagerly

2023-06-20 01:30:49

Subject: [PATCH] mm/memory_hotplug.c: don't fail hot unplug quite so eagerly

mm/memory_hotplug.c: don't fail hot unplug quite so eagerly

Some device drivers add memory to the system via memory hotplug. When
the driver is unloaded, that memory is hot-unplugged.

However, memory hot unplug can fail. And these days, it fails a little
too easily, with respect to the above case. Specifically, if a signal is
pending on the process, hot unplug fails. This leads directly to: the
user must reboot the machine in order to unload the driver, and
therefore the device is unusable until the machine is rebooted.

During teardown paths in the kernel, a higher tolerance for failures or
imperfections is often best. That is, it is often better to continue
with the teardown, than to error out too early.

So in this case, other things (unmovable pages, un-splittable huge
pages) can also cause the above problem. However, those are demonstrably
less common than simply having a pending signal. I've got bug reports
from users who can trivially reproduce this by killing their process
with a "kill -9", for example.

Fix this by soldiering on with memory hot unplug, even in the presence of
pending signals.

Signed-off-by: John Hubbard <[email protected]>
---
mm/memory_hotplug.c | 6 ------
1 file changed, 6 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8e0fa209d533..57a46620a667 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1879,12 +1879,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
do {
pfn = start_pfn;
do {
- if (signal_pending(current)) {
- ret = -EINTR;
- reason = "signal backoff";
- goto failed_removal_isolated;
- }
-
cond_resched();

ret = scan_movable_pages(pfn, end_pfn, &pfn);
--
2.41.0

2023-06-20 01:31:52

by John Hubbard

Subject: [PATCH v2 05/11] selftests/mm: .gitignore: add mkdirty, va_high_addr_switch

These new build products were left out of .gitignore, so add them now.

Reviewed-by: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
tools/testing/selftests/mm/.gitignore | 2 ++
1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index 8917455f4f51..ab215303d8e9 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -39,3 +39,5 @@ local_config.h
local_config.mk
ksm_functional_tests
mdwe_test
+mkdirty
+va_high_addr_switch
\ No newline at end of file
--
2.40.1

2023-06-20 01:33:12

by John Hubbard

Subject: [PATCH v2 04/11] selftests/mm: fix invocation of tests that are run via shell scripts

We cannot depend upon git to reliably retain the executable bit on shell
scripts, or so I was told several years ago while working on this same
run_vmtests.sh script. And sure enough, things such as test_hmm.sh are
lately failing to run, due to lacking execute permissions.

Fix this by explicitly adding "bash" to each of the shell script
invocations. Leave fixing the overall approach to another day.

Acked-by: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
tools/testing/selftests/mm/run_vmtests.sh | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 4893eb60d96d..8f81432e4bac 100644
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -242,18 +242,18 @@ if [ $VADDR64 -ne 0 ]; then
if [ "$ARCH" == "$ARCH_ARM64" ]; then
echo 6 > /proc/sys/vm/nr_hugepages
fi
- CATEGORY="hugevm" run_test ./va_high_addr_switch.sh
+ CATEGORY="hugevm" run_test bash ./va_high_addr_switch.sh
if [ "$ARCH" == "$ARCH_ARM64" ]; then
echo $prev_nr_hugepages > /proc/sys/vm/nr_hugepages
fi
fi # VADDR64

# vmalloc stability smoke test
-CATEGORY="vmalloc" run_test ./test_vmalloc.sh smoke
+CATEGORY="vmalloc" run_test bash ./test_vmalloc.sh smoke

CATEGORY="mremap" run_test ./mremap_dontunmap

-CATEGORY="hmm" run_test ./test_hmm.sh smoke
+CATEGORY="hmm" run_test bash ./test_hmm.sh smoke

# MADV_POPULATE_READ and MADV_POPULATE_WRITE tests
CATEGORY="madv_populate" run_test ./madv_populate
--
2.40.1

2023-06-20 01:34:02

by John Hubbard

Subject: [PATCH v2 01/11] selftests/mm: fix uffd-stress unused function warning

uffd_minor_feature() was unused. Remove it in order to fix the
associated clang build warning.

Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
tools/testing/selftests/mm/uffd-stress.c | 10 ----------
1 file changed, 10 deletions(-)

diff --git a/tools/testing/selftests/mm/uffd-stress.c b/tools/testing/selftests/mm/uffd-stress.c
index f1ad9eef1c3a..995ff13e74c7 100644
--- a/tools/testing/selftests/mm/uffd-stress.c
+++ b/tools/testing/selftests/mm/uffd-stress.c
@@ -88,16 +88,6 @@ static void uffd_stats_reset(struct uffd_args *args, unsigned long n_cpus)
}
}

-static inline uint64_t uffd_minor_feature(void)
-{
- if (test_type == TEST_HUGETLB && map_shared)
- return UFFD_FEATURE_MINOR_HUGETLBFS;
- else if (test_type == TEST_SHMEM)
- return UFFD_FEATURE_MINOR_SHMEM;
- else
- return 0;
-}
-
static void *locking_thread(void *arg)
{
unsigned long cpu = (unsigned long) arg;
--
2.40.1

2023-06-20 01:34:56

by John Hubbard

Subject: [PATCH v2 08/11] selftests/mm: fix uffd-unit-tests.c build failure due to missing MADV_COLLAPSE

MADV_PAGEOUT, MADV_POPULATE_READ, MADV_COLLAPSE are conditionally
defined as necessary. However, that was being done in .c files, and a
new build failure came up that would have been automatically avoided had
these been in a common header file.

So consolidate and move them all to vm_util.h, which fixes the build
failure.
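
As a rough illustration of the consolidated fallback pattern, the sketch below keeps the three guards in one place and uses one of the constants. The numeric values are the ones from the patch; try_collapse() is a hypothetical helper added only for this example and is not part of vm_util.h.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Fallbacks for libc headers that predate these advice values.
 * The numbers are the ones used in the patch.
 */
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/*
 * Hypothetical helper: ask the kernel to collapse a range into huge pages.
 * Returns whatever madvise() returns (0 on success, -1 with errno set).
 */
static int try_collapse(void *addr, size_t len)
{
	assert(addr != NULL);
	return madvise(addr, len, MADV_COLLAPSE);
}
```

Because each guard is an #ifndef, the header stays quiet when the system headers already provide the symbol, which is what avoids the duplicate-definition warnings.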

An alternative approach from Muhammad Usama Anjum was: rely on "make
headers" being required, and include asm-generic/mman-common.h. This
works in the sense that it builds, but it still generates warnings about
duplicate MADV_* symbols, and the goal is to get a fully clean (no
warnings) build here.

Reviewed-by: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Muhammad Usama Anjum <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
tools/testing/selftests/mm/cow.c | 7 -------
tools/testing/selftests/mm/khugepaged.c | 10 ----------
tools/testing/selftests/mm/vm_util.h | 10 ++++++++++
3 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index dc9d6fe86028..8882b05ec9c8 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -30,13 +30,6 @@
#include "../kselftest.h"
#include "vm_util.h"

-#ifndef MADV_PAGEOUT
-#define MADV_PAGEOUT 21
-#endif
-#ifndef MADV_COLLAPSE
-#define MADV_COLLAPSE 25
-#endif
-
static size_t pagesize;
static int pagemap_fd;
static size_t thpsize;
diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 97adc0f34f9c..e88ee039d0eb 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -22,16 +22,6 @@

#include "vm_util.h"

-#ifndef MADV_PAGEOUT
-#define MADV_PAGEOUT 21
-#endif
-#ifndef MADV_POPULATE_READ
-#define MADV_POPULATE_READ 22
-#endif
-#ifndef MADV_COLLAPSE
-#define MADV_COLLAPSE 25
-#endif
-
#define BASE_ADDR ((void *)(1UL << 30))
static unsigned long hpage_pmd_size;
static unsigned long page_size;
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index b950bd16083a..07f39ed2efba 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -63,3 +63,13 @@ int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len,

#define PAGEMAP_PRESENT(ent) (((ent) & (1ull << 63)) != 0)
#define PAGEMAP_PFN(ent) ((ent) & ((1ull << 55) - 1))
+
+#ifndef MADV_PAGEOUT
+#define MADV_PAGEOUT 21
+#endif
+#ifndef MADV_POPULATE_READ
+#define MADV_POPULATE_READ 22
+#endif
+#ifndef MADV_COLLAPSE
+#define MADV_COLLAPSE 25
+#endif
--
2.40.1

2023-06-20 01:35:01

by John Hubbard

Subject: [PATCH v2 06/11] selftests/mm: fix two -Wformat-security warnings in uffd builds

The uffd tests generate two compile time warnings from clang's
-Wformat-security setting. These trigger at the call sites for
uffd_test_start() and uffd_test_skip().

1) Fix the uffd_test_start() issue by removing the intermediate
test_name variable (thanks to David Hildenbrand for showing how to do
this).

2) Fix the uffd_test_skip() issue by observing that there is no need for
a macro and a variable args approach, because all callers of
uffd_test_skip() pass in a simple char* string, without any format
specifiers. So just change uffd_test_skip() into a regular C function.
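
A minimal sketch of why the "%s" form is the safe one, assuming a caller that may pass arbitrary text. format_skip() is a hypothetical stand-in written for this example (it formats into a buffer so the result can be inspected); it is not the selftest's uffd_test_skip().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Illustrative helper, not the selftest function. The message is passed as
 * data through "%s", so any '%' inside it is copied literally instead of
 * being interpreted as a conversion specifier. Passing the message as the
 * format string itself is what -Wformat-security warns about.
 * Returns the snprintf() result.
 */
static int format_skip(char *buf, size_t len, const char *message)
{
	assert(buf != NULL && message != NULL);
	return snprintf(buf, len, "skipped [reason: %s]\n", message);
}
```

Calling format_skip(buf, sizeof(buf), "100%d done") leaves the "%d" in the output untouched, whereas using that string as the format would make printf read an argument that was never supplied.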

Cc: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
tools/testing/selftests/mm/uffd-unit-tests.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c
index 269c86768a02..04d91f144d1c 100644
--- a/tools/testing/selftests/mm/uffd-unit-tests.c
+++ b/tools/testing/selftests/mm/uffd-unit-tests.c
@@ -109,12 +109,11 @@ static void uffd_test_pass(void)
ksft_inc_fail_cnt(); \
} while (0)

-#define uffd_test_skip(...) do { \
- printf("skipped [reason: "); \
- printf(__VA_ARGS__); \
- printf("]\n"); \
- ksft_inc_xskip_cnt(); \
- } while (0)
+static void uffd_test_skip(const char *message)
+{
+ printf("skipped [reason: %s]\n", message);
+ ksft_inc_xskip_cnt();
+}

/*
* Returns 1 if specific userfaultfd supported, 0 otherwise. Note, we'll
@@ -1149,7 +1148,6 @@ int main(int argc, char *argv[])
uffd_test_case_t *test;
mem_type_t *mem_type;
uffd_test_args_t args;
- char test_name[128];
const char *errmsg;
int has_uffd, opt;
int i, j;
@@ -1192,10 +1190,8 @@ int main(int argc, char *argv[])
mem_type = &mem_types[j];
if (!(test->mem_targets & mem_type->mem_flag))
continue;
- snprintf(test_name, sizeof(test_name),
- "%s on %s", test->name, mem_type->name);

- uffd_test_start(test_name);
+ uffd_test_start("%s on %s", test->name, mem_type->name);
if (!uffd_feature_supported(test)) {
uffd_test_skip("feature missing");
continue;
--
2.40.1

2023-06-20 07:34:06

by David Hildenbrand

Subject: Re: [PATCH] mm/memory_hotplug.c: don't fail hot unplug quite so eagerly

On 20.06.23 03:17, John Hubbard wrote:
> mm/memory_hotplug.c: don't fail hot unplug quite so eagerly
>
> Some device drivers add memory to the system via memory hotplug. When
> the driver is unloaded, that memory is hot-unplugged.

Which interfaces are they using to add/remove memory?

>
> However, memory hot unplug can fail. And these days, it fails a little
> too easily, with respect to the above case. Specifically, if a signal is
> pending on the process, hot unplug fails. This leads directly to: the
> user must reboot the machine in order to unload the driver, and
> therefore the device is unusable until the machine is rebooted.

Why can't they retry in user space when offlining fails with -EINTR, or
re-trigger driver unloading?

>
> During teardown paths in the kernel, a higher tolerance for failures or
> imperfections is often best. That is, it is often better to continue
> with the teardown, than to error out too early.
>
> So in this case, other things (unmovable pages, un-splittable huge
> pages) can also cause the above problem. However, those are demonstrably
> less common than simply having a pending signal. I've got bug reports
> from users who can trivially reproduce this by killing their process
> with a "kill -9", for example.
>
> Fix this by soldiering on with memory hot unplug, even in the presence of
> pending signals.
>
> Signed-off-by: John Hubbard <[email protected]>
> ---
> mm/memory_hotplug.c | 6 ------
> 1 file changed, 6 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 8e0fa209d533..57a46620a667 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1879,12 +1879,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> do {
> pfn = start_pfn;
> do {
> - if (signal_pending(current)) {
> - ret = -EINTR;
> - reason = "signal backoff";
> - goto failed_removal_isolated;
> - }
> -
> cond_resched();
>
> ret = scan_movable_pages(pfn, end_pfn, &pfn);

No, we can't remove that. It's documented behavior that exists precisely
for that reason:

https://docs.kernel.org/admin-guide/mm/memory-hotplug.html#id21

"
When offlining is triggered from user space, the offlining context can
be terminated by sending a fatal signal. A timeout based offlining can
easily be implemented via:

% timeout $TIMEOUT offline_block | failure_handling
"

Otherwise, there is no way to stop a userspace-triggered offline
operation that loops forever in the kernel.
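
The timeout idea quoted above can be sketched in user space roughly as follows. This is only an analogy under stated assumptions: run_with_timeout(), slow_op() and fast_op() are hypothetical names, the child stands in for the process that requested offlining, and SIGTERM is used because that is the default signal sent by timeout(1).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: run op() in a child process and send SIGTERM to the
 * child if it has not finished after roughly timeout_ms milliseconds.
 * Returns 1 on timeout, 0 if op() finished in time, -1 if fork() failed.
 */
static int run_with_timeout(void (*op)(void), int timeout_ms)
{
	const struct timespec tick = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	pid_t pid;
	int waited_ms;

	assert(op != NULL);
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		op();
		_exit(0);
	}

	/* Poll every 10 ms until the child exits or the budget is used up. */
	for (waited_ms = 0; waited_ms < timeout_ms; waited_ms += 10) {
		if (waitpid(pid, NULL, WNOHANG) == pid)
			return 0;
		nanosleep(&tick, NULL);
	}
	if (waitpid(pid, NULL, WNOHANG) == pid)
		return 0;

	/* Budget exhausted: interrupt the child, then reap it. */
	kill(pid, SIGTERM);
	waitpid(pid, NULL, 0);
	return 1;
}

/* Stand-ins for an offline request that hangs and one that completes. */
static void slow_op(void) { sleep(30); }
static void fast_op(void) { }
```

The trick only works if the long-running operation notices the signal, which is the role the signal_pending() check plays inside offline_pages().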

I guess switching to fatal_signal_pending() might help to some degree,
it should keep the timeout trick working.

But it wouldn't help in your case, where root kills arbitrary
processes. I'm not sure if that is something we should be paying
attention to.
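
To make the distinction concrete, here is a toy user-space model of the two predicates. It is not kernel code: struct toy_task and the toy_* functions are invented for this sketch, and the model assumes only that the "fatal" check looks for a pending SIGKILL while the general check looks for any pending signal.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Toy stand-in for a task's pending-signal state. Not a kernel type. */
struct toy_task {
	unsigned long long pending;	/* bit (sig - 1) is set while sig is queued */
};

/* Queue a signal on the toy task. */
static void toy_send(struct toy_task *t, int sig)
{
	assert(sig >= 1 && sig <= 64);
	t->pending |= 1ULL << (sig - 1);
}

/* Models signal_pending(): true if any signal at all is queued. */
static bool toy_signal_pending(const struct toy_task *t)
{
	return t->pending != 0;
}

/* Models fatal_signal_pending(): true only if SIGKILL is queued. */
static bool toy_fatal_signal_pending(const struct toy_task *t)
{
	return (t->pending & (1ULL << (SIGKILL - 1))) != 0;
}
```

In this model a queued SIGUSR1 trips only the first predicate, while a queued SIGKILL trips both, which is why switching predicates would not change the outcome for "kill -9".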

--
Cheers,

David / dhildenb

2023-06-20 10:39:52

by Muhammad Usama Anjum

Subject: Re: [PATCH v2 08/11] selftests/mm: fix uffd-unit-tests.c build failure due to missing MADV_COLLAPSE

On 6/20/23 6:17 AM, John Hubbard wrote:
> MADV_PAGEOUT, MADV_POPULATE_READ, MADV_COLLAPSE are conditionally
> defined as necessary. However, that was being done in .c files, and a
> new build failure came up that would have been automatically avoided had
> these been in a common header file.
>
> So consolidate and move them all to vm_util.h, which fixes the build
> failure.
>
> An alternative approach from Muhammad Usama Anjum was: rely on "make
> headers" being required, and include asm-generic/mman-common.h. This
> works in the sense that it builds, but it still generates warnings about
> duplicate MADV_* symbols, and the goal here is to get a fully clean (no
> warnings) build here.
I haven't looked in detail, but it seems your first revision was merged,
and my cleanup was merged after that. My cleanup patch adds the correct
header files and removes these duplicate defines. It is in mm-stable now.
https://lore.kernel.org/all/[email protected]

>
> Reviewed-by: David Hildenbrand <[email protected]>
> Cc: Peter Xu <[email protected]>
> Cc: Muhammad Usama Anjum <[email protected]>
> Signed-off-by: John Hubbard <[email protected]>
> ---
> tools/testing/selftests/mm/cow.c | 7 -------
> tools/testing/selftests/mm/khugepaged.c | 10 ----------
> tools/testing/selftests/mm/vm_util.h | 10 ++++++++++
> 3 files changed, 10 insertions(+), 17 deletions(-)
>
> diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
> index dc9d6fe86028..8882b05ec9c8 100644
> --- a/tools/testing/selftests/mm/cow.c
> +++ b/tools/testing/selftests/mm/cow.c
> @@ -30,13 +30,6 @@
> #include "../kselftest.h"
> #include "vm_util.h"
>
> -#ifndef MADV_PAGEOUT
> -#define MADV_PAGEOUT 21
> -#endif
> -#ifndef MADV_COLLAPSE
> -#define MADV_COLLAPSE 25
> -#endif
> -
> static size_t pagesize;
> static int pagemap_fd;
> static size_t thpsize;
> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
> index 97adc0f34f9c..e88ee039d0eb 100644
> --- a/tools/testing/selftests/mm/khugepaged.c
> +++ b/tools/testing/selftests/mm/khugepaged.c
> @@ -22,16 +22,6 @@
>
> #include "vm_util.h"
>
> -#ifndef MADV_PAGEOUT
> -#define MADV_PAGEOUT 21
> -#endif
> -#ifndef MADV_POPULATE_READ
> -#define MADV_POPULATE_READ 22
> -#endif
> -#ifndef MADV_COLLAPSE
> -#define MADV_COLLAPSE 25
> -#endif
> -
> #define BASE_ADDR ((void *)(1UL << 30))
> static unsigned long hpage_pmd_size;
> static unsigned long page_size;
> diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
> index b950bd16083a..07f39ed2efba 100644
> --- a/tools/testing/selftests/mm/vm_util.h
> +++ b/tools/testing/selftests/mm/vm_util.h
> @@ -63,3 +63,13 @@ int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len,
>
> #define PAGEMAP_PRESENT(ent) (((ent) & (1ull << 63)) != 0)
> #define PAGEMAP_PFN(ent) ((ent) & ((1ull << 55) - 1))
> +
> +#ifndef MADV_PAGEOUT
> +#define MADV_PAGEOUT 21
> +#endif
> +#ifndef MADV_POPULATE_READ
> +#define MADV_POPULATE_READ 22
> +#endif
> +#ifndef MADV_COLLAPSE
> +#define MADV_COLLAPSE 25
> +#endif

--
BR,
Muhammad Usama Anjum

2023-06-20 10:40:51

by David Hildenbrand

Subject: Re: [PATCH v2 08/11] selftests/mm: fix uffd-unit-tests.c build failure due to missing MADV_COLLAPSE

On 20.06.23 12:17, Muhammad Usama Anjum wrote:
> On 6/20/23 6:17 AM, John Hubbard wrote:
>> MADV_PAGEOUT, MADV_POPULATE_READ, MADV_COLLAPSE are conditionally
>> defined as necessary. However, that was being done in .c files, and a
>> new build failure came up that would have been automatically avoided had
>> these been in a common header file.
>>
>> So consolidate and move them all to vm_util.h, which fixes the build
>> failure.
>>
>> An alternative approach from Muhammad Usama Anjum was: rely on "make
>> headers" being required, and include asm-generic/mman-common.h. This
>> works in the sense that it builds, but it still generates warnings about
>> duplicate MADV_* symbols, and the goal here is to get a fully clean (no
>> warnings) build here.
> I've not looked in detail. But it seems like your first revision was merged
> and after that my cleanup has also been merged. My cleanup patch is adding
> correct header files and removing these duplicate defines: It is in
> mm-stable now.
> https://lore.kernel.org/all/[email protected]

See

https://lkml.kernel.org/r/[email protected]

--
Cheers,

David / dhildenb

2023-06-20 22:11:05

by John Hubbard

Subject: Re: [PATCH] mm/memory_hotplug.c: don't fail hot unplug quite so eagerly

On 6/20/23 00:12, David Hildenbrand wrote:
> On 20.06.23 03:17, John Hubbard wrote:
>> mm/memory_hotplug.c: don't fail hot unplug quite so eagerly
>>
>> Some device drivers add memory to the system via memory hotplug. When
>> the driver is unloaded, that memory is hot-unplugged.
>
> Which interfaces are they using to add/remove memory?

It's coming in from the kernel driver, like this:

offline_and_remove_memory()
walk_memory_blocks()
try_offline_memory_block()
device_offline()
memory_subsys_offline()
offline_pages()

...and the above is getting invoked as part of killing a user space
process that was helping (for performance reasons) by holding the device
nodes open. That triggers a final close of the file descriptors and
leads to tearing down the driver. The teardown succeeds even though
the memory was not offlined, and now everything is, to use a technical
term, "stuck". :)

More below...

>
>>
>> However, memory hot unplug can fail. And these days, it fails a little
>> too easily, with respect to the above case. Specifically, if a signal is
>> pending on the process, hot unplug fails. This leads directly to: the
>> user must reboot the machine in order to unload the driver, and
>> therefore the device is unusable until the machine is rebooted.
>
> Why can't they retry in user space when offlining fails with -EINTR, or re-trigger driver unloading?

If someone uses "kill -9" to kill that process, then we get here,
because user space cannot trap that signal.
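
The point about SIGKILL can be checked from user space with a short sketch. try_install_handler() and noop_handler() are hypothetical names used only here; the behavior relied on is that sigaction() refuses to install a handler for SIGKILL and fails with EINVAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>

/* Handler that does nothing; it only needs to exist. */
static void noop_handler(int sig)
{
	(void)sig;
}

/*
 * Hypothetical helper: try to install a handler for sig.
 * Returns 0 on success, otherwise the errno reported by sigaction().
 */
static int try_install_handler(int sig)
{
	struct sigaction sa;

	assert(sig > 0);
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = noop_handler;
	sigemptyset(&sa.sa_mask);

	if (sigaction(sig, &sa, NULL) == 0)
		return 0;
	return errno;
}
```

A process can therefore catch SIGTERM and retry or clean up, but it gets no chance to react to SIGKILL, so any retry logic has to live somewhere other than the killed process.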

...
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1879,12 +1879,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>>       do {
>>           pfn = start_pfn;
>>           do {
>> -            if (signal_pending(current)) {
>> -                ret = -EINTR;
>> -                reason = "signal backoff";
>> -                goto failed_removal_isolated;
>> -            }
>> -
>>               cond_resched();
>>               ret = scan_movable_pages(pfn, end_pfn, &pfn);
>
> No, we can't remove that. It's documented behavior that exists precisely for that reason:
>
> https://docs.kernel.org/admin-guide/mm/memory-hotplug.html#id21
>
> "
> When offlining is triggered from user space, the offlining context can be terminated by sending a fatal signal. A timeout based offlining can easily be implemented via:
>
> % timeout $TIMEOUT offline_block | failure_handling
> "
>
> Otherwise, there is no way to stop a userspace-triggered offline operation that loops forever in the kernel.

OK yes, I see.

>
> I guess switching to fatal_signal_pending() might help to some degree, it should keep the timeout trick working.
>
> But it wouldn't help in your case, where root kills arbitrary processes. I'm not sure if that is something we should be paying attention to.
>

Right. I think it would be more accurate perhaps, but it wouldn't help
this particular complaint.

Perhaps it is reasonable to claim that, "well, kill -9 *means* that you
end up here!" :) And the above patch clearly is not the way to go, but...

...what about discerning between "user initiated offline_pages" and
"offline pages as part of a driver shutdown/unload"?

thanks,
--
John Hubbard
NVIDIA

2023-06-21 08:26:06

by David Hildenbrand

Subject: Re: [PATCH] mm/memory_hotplug.c: don't fail hot unplug quite so eagerly

On 20.06.23 23:54, John Hubbard wrote:
> On 6/20/23 00:12, David Hildenbrand wrote:
>> On 20.06.23 03:17, John Hubbard wrote:
>>> mm/memory_hotplug.c: don't fail hot unplug quite so eagerly
>>>
>>> Some device drivers add memory to the system via memory hotplug. When
>>> the driver is unloaded, that memory is hot-unplugged.
>>
>> Which interfaces are they using to add/remove memory?
>
> It's coming in from the kernel driver, like this:
>
> offline_and_remove_memory()
> walk_memory_blocks()
> try_offline_memory_block()
> device_offline()
> memory_subsys_offline()
> offline_pages()
>
> ...and the above is getting invoked as part of killing a user space
> process that was helping (for performance reasons) holding the device
> nodes open. That triggers a final close of the file descriptors and
> leads to tearing down the driver. The teardown succeeds even though
> the memory was not offlined, and now everything is, to use a technical
> term, "stuck". :)
>

Ah, I see, thanks! I thought it would just be offlining from user space.

> More below...
>
>>
>>>
>>> However, memory hot unplug can fail. And these days, it fails a little
>>> too easily, with respect to the above case. Specifically, if a signal is
>>> pending on the process, hot unplug fails. This leads directly to: the
>>> user must reboot the machine in order to unload the driver, and
>>> therefore the device is unusable until the machine is rebooted.
>>
>> Why can't they retry in user space when offlining fails with -EINTR, or re-trigger driver unloading?
>
> If someone uses "kill -9" to kill that process, then we get here,
> because user space cannot trap that signal.

Understood, thanks!

>
>
> ...
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -1879,12 +1879,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>>>       do {
>>>           pfn = start_pfn;
>>>           do {
>>> -            if (signal_pending(current)) {
>>> -                ret = -EINTR;
>>> -                reason = "signal backoff";
>>> -                goto failed_removal_isolated;
>>> -            }
>>> -
>>>               cond_resched();
>>>               ret = scan_movable_pages(pfn, end_pfn, &pfn);
>>
>> No, we can't remove that. It's documented behavior that exists precisely for that reason:
>>
>> https://docs.kernel.org/admin-guide/mm/memory-hotplug.html#id21
>>
>> "
>> When offlining is triggered from user space, the offlining context can be terminated by sending a fatal signal. A timeout based offlining can easily be implemented via:
>>
>> % timeout $TIMEOUT offline_block | failure_handling
>> "
>>
>> Otherwise, there is no way to stop a userspace-triggered offline operation that loops forever in the kernel.
>
> OK yes, I see.
>
>>
>> I guess switching to fatal_signal_pending() might help to some degree, it should keep the timeout trick working.
>>
>> But it wouldn't help in your case, where root kills arbitrary processes. I'm not sure if that is something we should be paying attention to.
>>
>
> Right. I think it would be more accurate perhaps, but it wouldn't help
> this particular complaint.
>
> Perhaps it is reasonable to claim that, "well, kill -9 *means* that you
> end up here!" :) And the above patch clearly is not the way to go, but...
>
> ...what about discerning between "user initiated offline_pages" and
> "offline pages as part of a driver shutdown/unload"?

Makes sense to me.

There are two ways for triggering it directly from user space:

1) drivers/base/core.c:online_store()
2) drivers/base/memory.c:state_store()

We cannot easily hook into 2) to indicate "we're offlining directly
from user space". So we might have to do it the other way around.

Something along the following lines should do the trick (expect whitespace damage):

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 53ee7654f009..acd4b739505a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -152,6 +152,13 @@ void put_online_mems(void)

bool movable_node_enabled = false;

+/*
+ * Protected by the device hotplug lock. Indicates whether device offlining
+ * is triggered from try_offline_memory_block() such that we don't fail memory
+ * offlining if a signal is pending.
+ */
+static bool mhp_in_try_offline_memory_block;
+
#ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
int mhp_default_online_type = MMOP_OFFLINE;
#else
@@ -1860,7 +1867,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
do {
pfn = start_pfn;
do {
- if (signal_pending(current)) {
+ if (!mhp_in_try_offline_memory_block &&
+ signal_pending(current)) {
ret = -EINTR;
reason = "signal backoff";
goto failed_removal_isolated;
@@ -2177,7 +2185,9 @@ static int try_offline_memory_block(struct memory_block *mem, void *arg)
if (page && zone_idx(page_zone(page)) == ZONE_MOVABLE)
online_type = MMOP_ONLINE_MOVABLE;

+ mhp_in_try_offline_memory_block = true;
rc = device_offline(&mem->dev);
+ mhp_in_try_offline_memory_block = false;
/*
* Default is MMOP_OFFLINE - change it only if offlining succeeded,
* so try_reonline_memory_block() can do the right thing.

There is still arch/powerpc/platforms/pseries/hotplug-memory.c that calls
device_offline() and would fail on signals (not sure if relevant, like for virtio-mem it
shouldn't be that relevant).

I guess dlpar_remove_lmb() can now simply call offline_and_remove_memory().
[I might craft a patch later]
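
The shape of the proposed change can be modeled in a few lines of user-space C. This is a toy sketch, not the kernel code: in_driver_offline plays the role of mhp_in_try_offline_memory_block, toy_signal_is_pending stands in for signal_pending(current), and every toy_* name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model only. Set while offlining is driven from the in-kernel path. */
static bool in_driver_offline;

/* Stands in for signal_pending(current) in this model. */
static bool toy_signal_is_pending;

/* Models the check at the top of the offlining loop. */
static int toy_offline_pages(void)
{
	if (!in_driver_offline && toy_signal_is_pending)
		return -EINTR;	/* "signal backoff" */
	return 0;
}

/* Models the in-kernel caller that brackets the offline call with the flag. */
static int toy_try_offline_memory_block(void)
{
	int rc;

	/*
	 * The real flag is described as protected by the device hotplug lock;
	 * the assert models "no nested or concurrent use".
	 */
	assert(!in_driver_offline);
	in_driver_offline = true;
	rc = toy_offline_pages();
	in_driver_offline = false;
	return rc;
}
```

With a signal pending, the direct call backs off with -EINTR while the bracketed call proceeds, which mirrors the intent of keeping the documented user-space behavior and relaxing only the driver-initiated path.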

--
Cheers,

David / dhildenb

2023-06-21 08:40:36

by David Hildenbrand

Subject: Re: [PATCH] mm/memory_hotplug.c: don't fail hot unplug quite so eagerly

[...]

>
> There is still arch/powerpc/platforms/pseries/hotplug-memory.c that calls
> device_offline() and would fail on signals (not sure if relevant, like for virtio-mem it
> shouldn't be that relevant).

Oh, and of course the ACPI-triggered device_offline().

--
Cheers,

David / dhildenb

2023-06-22 03:05:10

by John Hubbard

Subject: Re: [PATCH] mm/memory_hotplug.c: don't fail hot unplug quite so eagerly

On 6/21/23 01:11, David Hildenbrand wrote:
>> ...what about discerning between "user initiated offline_pages" and
>> "offline pages as part of a driver shutdown/unload"?
>
> Makes sense to me.
>
> There are two ways for triggering it directly from user space:
>
> 1) drivers/base/core.c:online_store()
> 2) drivers/base/memory.c:state_store()
>
> We cannot easily hook into 2) to indicate "we're offlining directly
> from user space". So we might have to do it the other way around.
>
>
> Something along the following lines should do the trick (expect whitespace damage):
>
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 53ee7654f009..acd4b739505a 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -152,6 +152,13 @@ void put_online_mems(void)
>
> bool movable_node_enabled = false;
>
> +/*
> + * Protected by the device hotplug lock. Indicates whether device offlining
> + * is triggered from try_offline_memory_block() such that we don't fail memory
> + * offlining if a signal is pending.
> + */
> +static bool mhp_in_try_offline_memory_block;
> +
> #ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
> int mhp_default_online_type = MMOP_OFFLINE;
> #else
> @@ -1860,7 +1867,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>         do {
>                 pfn = start_pfn;
>                 do {
> -                       if (signal_pending(current)) {
> +                       if (!mhp_in_try_offline_memory_block &&
> +                           signal_pending(current)) {
>                                 ret = -EINTR;
>                                 reason = "signal backoff";
>                                 goto failed_removal_isolated;
> @@ -2177,7 +2185,9 @@ static int try_offline_memory_block(struct memory_block *mem, void *arg)
>         if (page && zone_idx(page_zone(page)) == ZONE_MOVABLE)
>                 online_type = MMOP_ONLINE_MOVABLE;
>
> +       mhp_in_try_offline_memory_block = true;
>         rc = device_offline(&mem->dev);
> +       mhp_in_try_offline_memory_block = false;
>         /*
>          * Default is MMOP_OFFLINE - change it only if offlining succeeded,
>          * so try_reonline_memory_block() can do the right thing.
>
>
>
> There is still arch/powerpc/platforms/pseries/hotplug-memory.c that calls
> device_offline() and would fail on signals (not sure if relevant, like for virtio-mem it
> shouldn't be that relevant).
>
> I guess dlpar_remove_lmb() can now simply call offline_and_remove_memory().
> [I might craft a patch later]
>

This direction looks good to me, I'd love to see a patch if you
put something together.

thanks,
--
John Hubbard
NVIDIA