2023-09-22 02:55:42

by Eric DeVolder

[permalink] [raw]
Subject: [PATCH] kexec: change locking mechanism to a mutex

Scaled up testing has revealed that the kexec_trylock()
implementation leads to failures within the crash hotplug
infrastructure due to the inability to acquire the lock,
specifically the message:

crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate

When hotplug events occur, the crash hotplug infrastructure first
attempts to obtain the lock via the kexec_trylock(). However, the
implementation either acquires the lock, or fails and returns; there
is no waiting on the lock. Here is the comment/explanation from
kernel/kexec_internal.h:kexec_trylock():

* Whatever is used to serialize accesses to the kexec_crash_image needs to be
* NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use a
* "simple" atomic variable that is acquired with a cmpxchg().

While this in theory can happen for either CPU or memory hoptlug,
this problem is most prone to occur for memory hotplug.

When memory is hot plugged, the memory is converted into smaller
128MiB memblocks (typically). As each memblock is processed, a
kernel thread and a udev event thread are created. The udev thread
tries for the lock via the reading of the sysfs node
/sys/devices/system/memory/crash_hotplug node, and the kernel
worker thread tries for the lock upon entering the crash hotplug
infrastructure.

These threads then compete for the kexec lock.

For example, a 1GiB DIMM is converted into 8 memblocks, each
spawning two threads for a total of 16 threads that create a small
"swarm" all trying to acquire the lock. The larger the DIMM, the
more the memblocks and the larger the swarm.

At the root of the problem is the atomic lock behind kexec_trylock();
it works well for low lock traffic; ie loading/unloading a capture
kernel, things that happen basically once. But with the introduction
of crash hotplug, the traffic through the lock increases significantly,
and more importantly in bursts occurring at roughly the same time. Thus
there is a need to wait on the lock.

A possible workaround is to simply retry the lock, say up to N times.
There is, of course, the problem of determining a value of N that works for
all implementations, and for all the other call sites of kexec_trylock().
Not ideal.

The design decision to use the atomic lock is described in the comment
from kexec_internal.h, cited above. However, examining the code of
__crash_kexec():

if (kexec_trylock()) {
if (kexec_crash_image) {
...
}
kexec_unlock();
}

reveals that the use of kexec_trylock() here is actually a "best effort"
due to the atomic lock. This atomic lock, prior to crash hotplug,
would almost always be assured (another kexec syscall could hold the lock
and prevent this, but that is about it).

So at the point where the capture kernel would be invoked, if the lock
is not obtained, then kdump doesn't occur.

It is possible to instead use a mutex with proper waiting, and utilize
mutex_trylock() as the "best effort" in __crash_kexec(). The use of a
mutex then avoids all the lock acquisition problems that were revealed
by the crash hotplug activity.

Convert the atomic lock to a mutex.

Signed-off-by: Eric DeVolder <[email protected]>
---
kernel/crash_core.c | 10 ++--------
kernel/kexec.c | 3 +--
kernel/kexec_core.c | 13 +++++--------
kernel/kexec_file.c | 3 +--
kernel/kexec_internal.h | 12 +++---------
5 files changed, 12 insertions(+), 29 deletions(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 03a7932cde0a..9a8378fbdafa 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -749,10 +749,7 @@ int crash_check_update_elfcorehdr(void)
int rc = 0;

/* Obtain lock while reading crash information */
- if (!kexec_trylock()) {
- pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
- return 0;
- }
+ kexec_lock();
if (kexec_crash_image) {
if (kexec_crash_image->file_mode)
rc = 1;
@@ -784,10 +781,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
struct kimage *image;

/* Obtain lock while changing crash information */
- if (!kexec_trylock()) {
- pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
- return;
- }
+ kexec_lock();

/* Check kdump is not loaded */
if (!kexec_crash_image)
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 107f355eac10..a2f687900bb5 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -96,8 +96,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
* crash kernels we need a serialization here to prevent multiple crash
* kernels from attempting to load simultaneously.
*/
- if (!kexec_trylock())
- return -EBUSY;
+ kexec_lock();

if (flags & KEXEC_ON_CRASH) {
dest_image = &kexec_crash_image;
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 9dc728982d79..202e4590fc1c 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -47,7 +47,7 @@
#include <crypto/hash.h>
#include "kexec_internal.h"

-atomic_t __kexec_lock = ATOMIC_INIT(0);
+DEFINE_MUTEX(__kexec_lock);

/* Flag to indicate we are going to kexec a new kernel */
bool kexec_in_progress = false;
@@ -1057,7 +1057,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
* of memory the xchg(&kexec_crash_image) would be
* sufficient. But since I reuse the memory...
*/
- if (kexec_trylock()) {
+ if (mutex_trylock(&__kexec_lock)) {
if (kexec_crash_image) {
struct pt_regs fixed_regs;

@@ -1103,8 +1103,7 @@ ssize_t crash_get_memory_size(void)
{
ssize_t size = 0;

- if (!kexec_trylock())
- return -EBUSY;
+ kexec_lock();

size += crash_resource_size(&crashk_res);
size += crash_resource_size(&crashk_low_res);
@@ -1146,8 +1145,7 @@ int crash_shrink_memory(unsigned long new_size)
int ret = 0;
unsigned long old_size, low_size;

- if (!kexec_trylock())
- return -EBUSY;
+ kexec_lock();

if (kexec_crash_image) {
ret = -ENOENT;
@@ -1229,8 +1227,7 @@ int kernel_kexec(void)
{
int error = 0;

- if (!kexec_trylock())
- return -EBUSY;
+ kexec_lock();
if (!kexec_image) {
error = -EINVAL;
goto Unlock;
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index f9a419cd22d4..a4daaaab7fa7 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -341,8 +341,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,

image = NULL;

- if (!kexec_trylock())
- return -EBUSY;
+ kexec_lock();

if (image_type == KEXEC_TYPE_CRASH) {
dest_image = &kexec_crash_image;
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 74da1409cd14..4fdae59767b6 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -18,15 +18,9 @@ int kimage_is_destination_range(struct kimage *image,
* NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use a
* "simple" atomic variable that is acquired with a cmpxchg().
*/
-extern atomic_t __kexec_lock;
-static inline bool kexec_trylock(void)
-{
- return atomic_cmpxchg_acquire(&__kexec_lock, 0, 1) == 0;
-}
-static inline void kexec_unlock(void)
-{
- atomic_set_release(&__kexec_lock, 0);
-}
+extern struct mutex __kexec_lock;
+#define kexec_lock() mutex_lock(&__kexec_lock)
+#define kexec_unlock() mutex_unlock(&__kexec_lock)

#ifdef CONFIG_KEXEC_FILE
#include <linux/purgatory.h>
--
2.39.3


2023-09-22 07:14:01

by Eric DeVolder

[permalink] [raw]
Subject: Re: [PATCH] kexec: change locking mechanism to a mutex



On 9/21/23 19:22, Andrew Morton wrote:
> On Thu, 21 Sep 2023 17:59:38 -0400 Eric DeVolder <[email protected]> wrote:
>
>> Scaled up testing has revealed that the kexec_trylock()
>> implementation leads to failures within the crash hotplug
>> infrastructure due to the inability to acquire the lock,
>> specifically the message:
>>
>> ...
>>
>> Convert the atomic lock to a mutex.
>>
>
> Do you think this problem is serious enough to warrant a backport into
> -stable kernels?

I do not since it will be the lock traffic created by the crash hotplug infrastructure that will
reveal the weak locking mechanism. Until this crash hotplug shows up in a stable kernel, it should
not be an issue; there isn't anything else that easily exercise it to reveal the problem.

eric

2023-09-22 11:25:42

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH] kexec: change locking mechanism to a mutex

[Cced Valentin Schneider as he added the trylocks]

On Fri, 22 Sept 2023 at 06:04, Eric DeVolder <[email protected]> wrote:
>
> Scaled up testing has revealed that the kexec_trylock()
> implementation leads to failures within the crash hotplug
> infrastructure due to the inability to acquire the lock,
> specifically the message:
>
> crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
>
> When hotplug events occur, the crash hotplug infrastructure first
> attempts to obtain the lock via the kexec_trylock(). However, the
> implementation either acquires the lock, or fails and returns; there
> is no waiting on the lock. Here is the comment/explanation from
> kernel/kexec_internal.h:kexec_trylock():
>
> * Whatever is used to serialize accesses to the kexec_crash_image needs to be
> * NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use a
> * "simple" atomic variable that is acquired with a cmpxchg().
>
> While this in theory can happen for either CPU or memory hoptlug,
> this problem is most prone to occur for memory hotplug.
>
> When memory is hot plugged, the memory is converted into smaller
> 128MiB memblocks (typically). As each memblock is processed, a
> kernel thread and a udev event thread are created. The udev thread
> tries for the lock via the reading of the sysfs node
> /sys/devices/system/memory/crash_hotplug node, and the kernel
> worker thread tries for the lock upon entering the crash hotplug
> infrastructure.
>
> These threads then compete for the kexec lock.
>
> For example, a 1GiB DIMM is converted into 8 memblocks, each
> spawning two threads for a total of 16 threads that create a small
> "swarm" all trying to acquire the lock. The larger the DIMM, the
> more the memblocks and the larger the swarm.
>
> At the root of the problem is the atomic lock behind kexec_trylock();
> it works well for low lock traffic; ie loading/unloading a capture
> kernel, things that happen basically once. But with the introduction
> of crash hotplug, the traffic through the lock increases significantly,
> and more importantly in bursts occurring at roughly the same time. Thus
> there is a need to wait on the lock.
>
> A possible workaround is to simply retry the lock, say up to N times.
> There is, of course, the problem of determining a value of N that works for
> all implementations, and for all the other call sites of kexec_trylock().
> Not ideal.
>
> The design decision to use the atomic lock is described in the comment
> from kexec_internal.h, cited above. However, examining the code of
> __crash_kexec():
>
> if (kexec_trylock()) {
> if (kexec_crash_image) {
> ...
> }
> kexec_unlock();
> }
>
> reveals that the use of kexec_trylock() here is actually a "best effort"
> due to the atomic lock. This atomic lock, prior to crash hotplug,
> would almost always be assured (another kexec syscall could hold the lock
> and prevent this, but that is about it).
>
> So at the point where the capture kernel would be invoked, if the lock
> is not obtained, then kdump doesn't occur.
>
> It is possible to instead use a mutex with proper waiting, and utilize
> mutex_trylock() as the "best effort" in __crash_kexec(). The use of a
> mutex then avoids all the lock acquisition problems that were revealed
> by the crash hotplug activity.
>
> Convert the atomic lock to a mutex.
>
> Signed-off-by: Eric DeVolder <[email protected]>
> ---
> kernel/crash_core.c | 10 ++--------
> kernel/kexec.c | 3 +--
> kernel/kexec_core.c | 13 +++++--------
> kernel/kexec_file.c | 3 +--
> kernel/kexec_internal.h | 12 +++---------
> 5 files changed, 12 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 03a7932cde0a..9a8378fbdafa 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -749,10 +749,7 @@ int crash_check_update_elfcorehdr(void)
> int rc = 0;
>
> /* Obtain lock while reading crash information */
> - if (!kexec_trylock()) {
> - pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> - return 0;
> - }
> + kexec_lock();
> if (kexec_crash_image) {
> if (kexec_crash_image->file_mode)
> rc = 1;
> @@ -784,10 +781,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
> struct kimage *image;
>
> /* Obtain lock while changing crash information */
> - if (!kexec_trylock()) {
> - pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> - return;
> - }
> + kexec_lock();
>
> /* Check kdump is not loaded */
> if (!kexec_crash_image)
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index 107f355eac10..a2f687900bb5 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -96,8 +96,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
> * crash kernels we need a serialization here to prevent multiple crash
> * kernels from attempting to load simultaneously.
> */
> - if (!kexec_trylock())
> - return -EBUSY;
> + kexec_lock();
>
> if (flags & KEXEC_ON_CRASH) {
> dest_image = &kexec_crash_image;
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 9dc728982d79..202e4590fc1c 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -47,7 +47,7 @@
> #include <crypto/hash.h>
> #include "kexec_internal.h"
>
> -atomic_t __kexec_lock = ATOMIC_INIT(0);
> +DEFINE_MUTEX(__kexec_lock);
>
> /* Flag to indicate we are going to kexec a new kernel */
> bool kexec_in_progress = false;
> @@ -1057,7 +1057,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
> * of memory the xchg(&kexec_crash_image) would be
> * sufficient. But since I reuse the memory...
> */
> - if (kexec_trylock()) {
> + if (mutex_trylock(&__kexec_lock)) {
> if (kexec_crash_image) {
> struct pt_regs fixed_regs;
>
> @@ -1103,8 +1103,7 @@ ssize_t crash_get_memory_size(void)
> {
> ssize_t size = 0;
>
> - if (!kexec_trylock())
> - return -EBUSY;
> + kexec_lock();
>
> size += crash_resource_size(&crashk_res);
> size += crash_resource_size(&crashk_low_res);
> @@ -1146,8 +1145,7 @@ int crash_shrink_memory(unsigned long new_size)
> int ret = 0;
> unsigned long old_size, low_size;
>
> - if (!kexec_trylock())
> - return -EBUSY;
> + kexec_lock();
>
> if (kexec_crash_image) {
> ret = -ENOENT;
> @@ -1229,8 +1227,7 @@ int kernel_kexec(void)
> {
> int error = 0;
>
> - if (!kexec_trylock())
> - return -EBUSY;
> + kexec_lock();
> if (!kexec_image) {
> error = -EINVAL;
> goto Unlock;
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index f9a419cd22d4..a4daaaab7fa7 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -341,8 +341,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
>
> image = NULL;
>
> - if (!kexec_trylock())
> - return -EBUSY;
> + kexec_lock();
>
> if (image_type == KEXEC_TYPE_CRASH) {
> dest_image = &kexec_crash_image;
> diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
> index 74da1409cd14..4fdae59767b6 100644
> --- a/kernel/kexec_internal.h
> +++ b/kernel/kexec_internal.h
> @@ -18,15 +18,9 @@ int kimage_is_destination_range(struct kimage *image,
> * NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use a
> * "simple" atomic variable that is acquired with a cmpxchg().
> */
> -extern atomic_t __kexec_lock;
> -static inline bool kexec_trylock(void)
> -{
> - return atomic_cmpxchg_acquire(&__kexec_lock, 0, 1) == 0;
> -}
> -static inline void kexec_unlock(void)
> -{
> - atomic_set_release(&__kexec_lock, 0);
> -}
> +extern struct mutex __kexec_lock;
> +#define kexec_lock() mutex_lock(&__kexec_lock)
> +#define kexec_unlock() mutex_unlock(&__kexec_lock)
>
> #ifdef CONFIG_KEXEC_FILE
> #include <linux/purgatory.h>
> --
> 2.39.3
>

2023-09-22 13:54:04

by Eric DeVolder

[permalink] [raw]
Subject: Re: [PATCH] kexec: change locking mechanism to a mutex



On 9/21/23 19:26, Andrew Morton wrote:
> On Thu, 21 Sep 2023 17:59:38 -0400 Eric DeVolder <[email protected]> wrote:
>
>> Scaled up testing has revealed that the kexec_trylock()
>> implementation leads to failures within the crash hotplug
>> infrastructure due to the inability to acquire the lock,
>> specifically the message:
>>
>> crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
>>
>> When hotplug events occur, the crash hotplug infrastructure first
>> attempts to obtain the lock via the kexec_trylock(). However, the
>> implementation either acquires the lock, or fails and returns; there
>> is no waiting on the lock. Here is the comment/explanation from
>> kernel/kexec_internal.h:kexec_trylock():
>>
>> * Whatever is used to serialize accesses to the kexec_crash_image needs to be
>> * NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use a
>> * "simple" atomic variable that is acquired with a cmpxchg().
>>
>> While this in theory can happen for either CPU or memory hoptlug,
>> this problem is most prone to occur for memory hotplug.
>>
>> When memory is hot plugged, the memory is converted into smaller
>> 128MiB memblocks (typically). As each memblock is processed, a
>> kernel thread and a udev event thread are created. The udev thread
>> tries for the lock via the reading of the sysfs node
>> /sys/devices/system/memory/crash_hotplug node, and the kernel
>> worker thread tries for the lock upon entering the crash hotplug
>> infrastructure.
>>
>> These threads then compete for the kexec lock.
>>
>> For example, a 1GiB DIMM is converted into 8 memblocks, each
>> spawning two threads for a total of 16 threads that create a small
>> "swarm" all trying to acquire the lock. The larger the DIMM, the
>> more the memblocks and the larger the swarm.
>>
>> At the root of the problem is the atomic lock behind kexec_trylock();
>> it works well for low lock traffic; ie loading/unloading a capture
>> kernel, things that happen basically once. But with the introduction
>> of crash hotplug, the traffic through the lock increases significantly,
>> and more importantly in bursts occurring at roughly the same time. Thus
>> there is a need to wait on the lock.
>>
>> A possible workaround is to simply retry the lock, say up to N times.
>> There is, of course, the problem of determining a value of N that works for
>> all implementations, and for all the other call sites of kexec_trylock().
>> Not ideal.
>>
>> The design decision to use the atomic lock is described in the comment
>> from kexec_internal.h, cited above. However, examining the code of
>> __crash_kexec():
>>
>> if (kexec_trylock()) {
>> if (kexec_crash_image) {
>> ...
>> }
>> kexec_unlock();
>> }
>>
>> reveals that the use of kexec_trylock() here is actually a "best effort"
>> due to the atomic lock. This atomic lock, prior to crash hotplug,
>> would almost always be assured (another kexec syscall could hold the lock
>> and prevent this, but that is about it).
>>
>> So at the point where the capture kernel would be invoked, if the lock
>> is not obtained, then kdump doesn't occur.
>>
>> It is possible to instead use a mutex with proper waiting, and utilize
>> mutex_trylock() as the "best effort" in __crash_kexec(). The use of a
>> mutex then avoids all the lock acquisition problems that were revealed
>> by the crash hotplug activity.
>>
>> Convert the atomic lock to a mutex.
>>
>> ...
>>
>> --- a/kernel/kexec_core.c
>> +++ b/kernel/kexec_core.c
>> @@ -47,7 +47,7 @@
>> #include <crypto/hash.h>
>> #include "kexec_internal.h"
>>
>> -atomic_t __kexec_lock = ATOMIC_INIT(0);
>> +DEFINE_MUTEX(__kexec_lock);
>>
>> /* Flag to indicate we are going to kexec a new kernel */
>> bool kexec_in_progress = false;
>> @@ -1057,7 +1057,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
>> * of memory the xchg(&kexec_crash_image) would be
>> * sufficient. But since I reuse the memory...
>> */
>> - if (kexec_trylock()) {
>> + if (mutex_trylock(&__kexec_lock)) {
>> if (kexec_crash_image) {
>> struct pt_regs fixed_regs;
>
> What's happening here? If someone else held the lock we silently fail
> to run the kexec? Shouldn't we at least alert the user to what just
> happened?
>
>
Yes, I believe it would silently "fail" and not run the kexec kernel.
I do not have a good feel to know if logging is going to be functional,
and reliable, at this point in time (on a panic path)...
eric

2023-09-22 16:09:02

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] kexec: change locking mechanism to a mutex

On Thu, 21 Sep 2023 17:59:38 -0400 Eric DeVolder <[email protected]> wrote:

> Scaled up testing has revealed that the kexec_trylock()
> implementation leads to failures within the crash hotplug
> infrastructure due to the inability to acquire the lock,
> specifically the message:
>
> ...
>
> Convert the atomic lock to a mutex.
>

Do you think this problem is serious enough to warrant a backport into
-stable kernels?

2023-09-22 17:49:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] kexec: change locking mechanism to a mutex

On Thu, 21 Sep 2023 17:59:38 -0400 Eric DeVolder <[email protected]> wrote:

> Scaled up testing has revealed that the kexec_trylock()
> implementation leads to failures within the crash hotplug
> infrastructure due to the inability to acquire the lock,
> specifically the message:
>
> crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
>
> When hotplug events occur, the crash hotplug infrastructure first
> attempts to obtain the lock via the kexec_trylock(). However, the
> implementation either acquires the lock, or fails and returns; there
> is no waiting on the lock. Here is the comment/explanation from
> kernel/kexec_internal.h:kexec_trylock():
>
> * Whatever is used to serialize accesses to the kexec_crash_image needs to be
> * NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use a
> * "simple" atomic variable that is acquired with a cmpxchg().
>
> While this in theory can happen for either CPU or memory hoptlug,
> this problem is most prone to occur for memory hotplug.
>
> When memory is hot plugged, the memory is converted into smaller
> 128MiB memblocks (typically). As each memblock is processed, a
> kernel thread and a udev event thread are created. The udev thread
> tries for the lock via the reading of the sysfs node
> /sys/devices/system/memory/crash_hotplug node, and the kernel
> worker thread tries for the lock upon entering the crash hotplug
> infrastructure.
>
> These threads then compete for the kexec lock.
>
> For example, a 1GiB DIMM is converted into 8 memblocks, each
> spawning two threads for a total of 16 threads that create a small
> "swarm" all trying to acquire the lock. The larger the DIMM, the
> more the memblocks and the larger the swarm.
>
> At the root of the problem is the atomic lock behind kexec_trylock();
> it works well for low lock traffic; ie loading/unloading a capture
> kernel, things that happen basically once. But with the introduction
> of crash hotplug, the traffic through the lock increases significantly,
> and more importantly in bursts occurring at roughly the same time. Thus
> there is a need to wait on the lock.
>
> A possible workaround is to simply retry the lock, say up to N times.
> There is, of course, the problem of determining a value of N that works for
> all implementations, and for all the other call sites of kexec_trylock().
> Not ideal.
>
> The design decision to use the atomic lock is described in the comment
> from kexec_internal.h, cited above. However, examining the code of
> __crash_kexec():
>
> if (kexec_trylock()) {
> if (kexec_crash_image) {
> ...
> }
> kexec_unlock();
> }
>
> reveals that the use of kexec_trylock() here is actually a "best effort"
> due to the atomic lock. This atomic lock, prior to crash hotplug,
> would almost always be assured (another kexec syscall could hold the lock
> and prevent this, but that is about it).
>
> So at the point where the capture kernel would be invoked, if the lock
> is not obtained, then kdump doesn't occur.
>
> It is possible to instead use a mutex with proper waiting, and utilize
> mutex_trylock() as the "best effort" in __crash_kexec(). The use of a
> mutex then avoids all the lock acquisition problems that were revealed
> by the crash hotplug activity.
>
> Convert the atomic lock to a mutex.
>
> ...
>
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -47,7 +47,7 @@
> #include <crypto/hash.h>
> #include "kexec_internal.h"
>
> -atomic_t __kexec_lock = ATOMIC_INIT(0);
> +DEFINE_MUTEX(__kexec_lock);
>
> /* Flag to indicate we are going to kexec a new kernel */
> bool kexec_in_progress = false;
> @@ -1057,7 +1057,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
> * of memory the xchg(&kexec_crash_image) would be
> * sufficient. But since I reuse the memory...
> */
> - if (kexec_trylock()) {
> + if (mutex_trylock(&__kexec_lock)) {
> if (kexec_crash_image) {
> struct pt_regs fixed_regs;

What's happening here? If someone else held the lock we silently fail
to run the kexec? Shouldn't we at least alert the user to what just
happened?


2023-09-22 19:37:27

by Eric DeVolder

[permalink] [raw]
Subject: Re: [PATCH] kexec: change locking mechanism to a mutex



On 9/22/23 03:06, Baoquan He wrote:
> On 09/22/23 at 11:36am, Dave Young wrote:
>> [Cced Valentin Schneider as he added the trylocks]
>>
>> On Fri, 22 Sept 2023 at 06:04, Eric DeVolder <[email protected]> wrote:
>>>
>>> Scaled up testing has revealed that the kexec_trylock()
>>> implementation leads to failures within the crash hotplug
>>> infrastructure due to the inability to acquire the lock,
>>> specifically the message:
>>>
>>> crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
>>>
>>> When hotplug events occur, the crash hotplug infrastructure first
>>> attempts to obtain the lock via the kexec_trylock(). However, the
>>> implementation either acquires the lock, or fails and returns; there
>>> is no waiting on the lock. Here is the comment/explanation from
>>> kernel/kexec_internal.h:kexec_trylock():
>>>
>>> * Whatever is used to serialize accesses to the kexec_crash_image needs to be
>>> * NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use a
>>> * "simple" atomic variable that is acquired with a cmpxchg().
>>>
>>> While this in theory can happen for either CPU or memory hoptlug,
>>> this problem is most prone to occur for memory hotplug.
>>>
>>> When memory is hot plugged, the memory is converted into smaller
>>> 128MiB memblocks (typically). As each memblock is processed, a
>>> kernel thread and a udev event thread are created. The udev thread
>>> tries for the lock via the reading of the sysfs node
>>> /sys/devices/system/memory/crash_hotplug node, and the kernel
>>> worker thread tries for the lock upon entering the crash hotplug
>>> infrastructure.
>>>
>>> These threads then compete for the kexec lock.
>>>
>>> For example, a 1GiB DIMM is converted into 8 memblocks, each
>>> spawning two threads for a total of 16 threads that create a small
>>> "swarm" all trying to acquire the lock. The larger the DIMM, the
>>> more the memblocks and the larger the swarm.
>>>
>>> At the root of the problem is the atomic lock behind kexec_trylock();
>>> it works well for low lock traffic; ie loading/unloading a capture
>>> kernel, things that happen basically once. But with the introduction
>>> of crash hotplug, the traffic through the lock increases significantly,
>>> and more importantly in bursts occurring at roughly the same time. Thus
>>> there is a need to wait on the lock.
>
> Yeah, the atomic __kexec_lock is used to lock the door of operation on
> kimage. Among kexec/kdump kernel load/unload/shrink/jumping, once any one
> is in progress, the later attempt doesn't make sense. And these events are
> rare.
>
> Crash hotplug event is different, there will be many during one period.
> The main problem you are encountering is the cocurrent handling of hotplug
> event, right? Wondering if we can define another mutex lock to serialize
> the handling of hotplug event like below. Just a sterotype to state my
> thought.

I've tried this patch (with slight change) against my regression setup and it works as well.
Eric

>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 03a7932cde0a..39b9a57a4177 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -783,6 +783,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
> {
> struct kimage *image;
>
> + crash_hotplug_lock();
> /* Obtain lock while changing crash information */
> if (!kexec_trylock()) {
> pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> @@ -852,6 +853,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
> out:
> /* Release lock now that update complete */
> kexec_unlock();
> + crash_hotplug_unlock();
> }
>
> static int crash_memhp_notifier(struct notifier_block *nb, unsigned long val, void *v)
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index 9dc728982d79..b95a73f35d9a 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -48,6 +48,7 @@
> #include "kexec_internal.h"
>
> atomic_t __kexec_lock = ATOMIC_INIT(0);
> +DEFINE_MUTEX(__crash_hotplug_lock);
>
> /* Flag to indicate we are going to kexec a new kernel */
> bool kexec_in_progress = false;
> diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
> index 74da1409cd14..32cb890bb059 100644
> --- a/kernel/kexec_internal.h
> +++ b/kernel/kexec_internal.h
> @@ -28,6 +28,10 @@ static inline void kexec_unlock(void)
> atomic_set_release(&__kexec_lock, 0);
> }
>
> +extern struct mutex __crash_hotplug_lock;
> +#define crash_hotplug_lock() mutex_lock(&__crash_hotplug_lock)
> +#define crash_hotplug_unlock() mutex_unlock(&__crash_hotplug_lock)
> +
> #ifdef CONFIG_KEXEC_FILE
> #include <linux/purgatory.h>
> void kimage_file_post_load_cleanup(struct kimage *image);
>

2023-09-22 22:13:15

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH] kexec: change locking mechanism to a mutex

On 09/22/23 at 11:36am, Dave Young wrote:
> [Cced Valentin Schneider as he added the trylocks]
>
> On Fri, 22 Sept 2023 at 06:04, Eric DeVolder <[email protected]> wrote:
> >
> > Scaled up testing has revealed that the kexec_trylock()
> > implementation leads to failures within the crash hotplug
> > infrastructure due to the inability to acquire the lock,
> > specifically the message:
> >
> > crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
> >
> > When hotplug events occur, the crash hotplug infrastructure first
> > attempts to obtain the lock via the kexec_trylock(). However, the
> > implementation either acquires the lock, or fails and returns; there
> > is no waiting on the lock. Here is the comment/explanation from
> > kernel/kexec_internal.h:kexec_trylock():
> >
> > * Whatever is used to serialize accesses to the kexec_crash_image needs to be
> > * NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use a
> > * "simple" atomic variable that is acquired with a cmpxchg().
> >
> > While this in theory can happen for either CPU or memory hoptlug,
> > this problem is most prone to occur for memory hotplug.
> >
> > When memory is hot plugged, the memory is converted into smaller
> > 128MiB memblocks (typically). As each memblock is processed, a
> > kernel thread and a udev event thread are created. The udev thread
> > tries for the lock via the reading of the sysfs node
> > /sys/devices/system/memory/crash_hotplug node, and the kernel
> > worker thread tries for the lock upon entering the crash hotplug
> > infrastructure.
> >
> > These threads then compete for the kexec lock.
> >
> > For example, a 1GiB DIMM is converted into 8 memblocks, each
> > spawning two threads for a total of 16 threads that create a small
> > "swarm" all trying to acquire the lock. The larger the DIMM, the
> > more the memblocks and the larger the swarm.
> >
> > At the root of the problem is the atomic lock behind kexec_trylock();
> > it works well for low lock traffic; ie loading/unloading a capture
> > kernel, things that happen basically once. But with the introduction
> > of crash hotplug, the traffic through the lock increases significantly,
> > and more importantly in bursts occurring at roughly the same time. Thus
> > there is a need to wait on the lock.

Yeah, the atomic __kexec_lock is used to lock the door of operation on
kimage. Among kexec/kdump kernel load/unload/shrink/jumping, once any one
is in progress, the later attempt doesn't make sense. And these events are
rare.

Crash hotplug event is different, there will be many during one period.
The main problem you are encountering is the cocurrent handling of hotplug
event, right? Wondering if we can define another mutex lock to serialize
the handling of hotplug event like below. Just a sterotype to state my
thought.

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 03a7932cde0a..39b9a57a4177 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -783,6 +783,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
{
struct kimage *image;

+ crash_hotplug_lock();
/* Obtain lock while changing crash information */
if (!kexec_trylock()) {
pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
@@ -852,6 +853,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
out:
/* Release lock now that update complete */
kexec_unlock();
+ crash_hotplug_unlock();
}

static int crash_memhp_notifier(struct notifier_block *nb, unsigned long val, void *v)
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 9dc728982d79..b95a73f35d9a 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -48,6 +48,7 @@
#include "kexec_internal.h"

atomic_t __kexec_lock = ATOMIC_INIT(0);
+DEFINE_MUTEX(__crash_hotplug_lock);

/* Flag to indicate we are going to kexec a new kernel */
bool kexec_in_progress = false;
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 74da1409cd14..32cb890bb059 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -28,6 +28,10 @@ static inline void kexec_unlock(void)
atomic_set_release(&__kexec_lock, 0);
}

+extern struct mutex __crash_hotplug_lock;
+#define crash_hotplug_lock() mutex_lock(&__crash_hotplug_lock)
+#define crash_hotplug_unlock() mutex_unlock(&__crash_hotplug_lock)
+
#ifdef CONFIG_KEXEC_FILE
#include <linux/purgatory.h>
void kimage_file_post_load_cleanup(struct kimage *image);

2023-09-22 23:26:06

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH] kexec: change locking mechanism to a mutex

On 21/09/23 17:59, Eric DeVolder wrote:
> The design decision to use the atomic lock is described in the comment
> from kexec_internal.h, cited above. However, examining the code of
> __crash_kexec():
>
> if (kexec_trylock()) {
> if (kexec_crash_image) {
> ...
> }
> kexec_unlock();
> }
>
> reveals that the use of kexec_trylock() here is actually a "best effort"
> due to the atomic lock. This atomic lock, prior to crash hotplug,
> would almost always be assured (another kexec syscall could hold the lock
> and prevent this, but that is about it).
>
> So at the point where the capture kernel would be invoked, if the lock
> is not obtained, then kdump doesn't occur.
>
> It is possible to instead use a mutex with proper waiting, and utilize
> mutex_trylock() as the "best effort" in __crash_kexec(). The use of a
> mutex then avoids all the lock acquisition problems that were revealed
> by the crash hotplug activity.
>

@Dave thanks for the Cc, I'd have missed this otherwise.


Prior to the atomic thingie, we actually had a mutex and did
mutex_trylock() in __crash_kexec(). I'm a bit confused as this looks like a
revert of
05c6257433b7 ("panic, kexec: make __crash_kexec() NMI safe")
with just the helpers kept in - this doesn't seem to address any of the
original issues regarding NMIs?

Sebastian raised some good points in [1] regarding these issues.
The main hurdle pointed out there is, if we end up in the slowpath during
the unlock, then we can can up acquiring the ->wait_lock which isn't NMI
safe.

This is even worse on PREEMPT_RT, as both trylock and the unlock can end up
acquiring the ->wait_lock.

[1]: https://lore.kernel.org/all/YqyZ%[email protected]/

2023-09-23 03:02:31

by Eric DeVolder

[permalink] [raw]
Subject: Re: [PATCH] kexec: change locking mechanism to a mutex



On 9/22/23 11:28, Valentin Schneider wrote:
> On 21/09/23 17:59, Eric DeVolder wrote:
>> The design decision to use the atomic lock is described in the comment
>> from kexec_internal.h, cited above. However, examining the code of
>> __crash_kexec():
>>
>> if (kexec_trylock()) {
>> if (kexec_crash_image) {
>> ...
>> }
>> kexec_unlock();
>> }
>>
>> reveals that the use of kexec_trylock() here is actually a "best effort"
>> due to the atomic lock. This atomic lock, prior to crash hotplug,
>> would almost always be assured (another kexec syscall could hold the lock
>> and prevent this, but that is about it).
>>
>> So at the point where the capture kernel would be invoked, if the lock
>> is not obtained, then kdump doesn't occur.
>>
>> It is possible to instead use a mutex with proper waiting, and utilize
>> mutex_trylock() as the "best effort" in __crash_kexec(). The use of a
>> mutex then avoids all the lock acquisition problems that were revealed
>> by the crash hotplug activity.
>>
>
> @Dave thanks for the Cc, I'd have missed this otherwise.
>
>
> Prior to the atomic thingie, we actually had a mutex and did
> mutex_trylock() in __crash_kexec(). I'm a bit confused as this looks like a
> revert of
> 05c6257433b7 ("panic, kexec: make __crash_kexec() NMI safe")
> with just the helpers kept in - this doesn't seem to address any of the
> original issues regarding NMIs?
>
> Sebastian raised some good points in [1] regarding these issues.
> The main hurdle pointed out there is, if we end up in the slowpath during
> the unlock, then we can can up acquiring the ->wait_lock which isn't NMI
> safe.
>
> This is even worse on PREEMPT_RT, as both trylock and the unlock can end up
> acquiring the ->wait_lock.
>
> [1]: https://lore.kernel.org/all/YqyZ%[email protected]/
>
Having reviewed the references, it would seem that Baoquan's approach of a new
lock to handle the hotplug activity is the way to go?
Eric

2023-09-23 06:51:09

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH] kexec: change locking mechanism to a mutex

On 09/22/23 at 12:35pm, Eric DeVolder wrote:
>
>
> On 9/22/23 11:28, Valentin Schneider wrote:
> > On 21/09/23 17:59, Eric DeVolder wrote:
> > > The design decision to use the atomic lock is described in the comment
> > > from kexec_internal.h, cited above. However, examining the code of
> > > __crash_kexec():
> > >
> > > if (kexec_trylock()) {
> > > if (kexec_crash_image) {
> > > ...
> > > }
> > > kexec_unlock();
> > > }
> > >
> > > reveals that the use of kexec_trylock() here is actually a "best effort"
> > > due to the atomic lock. This atomic lock, prior to crash hotplug,
> > > would almost always be assured (another kexec syscall could hold the lock
> > > and prevent this, but that is about it).
> > >
> > > So at the point where the capture kernel would be invoked, if the lock
> > > is not obtained, then kdump doesn't occur.
> > >
> > > It is possible to instead use a mutex with proper waiting, and utilize
> > > mutex_trylock() as the "best effort" in __crash_kexec(). The use of a
> > > mutex then avoids all the lock acquisition problems that were revealed
> > > by the crash hotplug activity.
> > >
> >
> > @Dave thanks for the Cc, I'd have missed this otherwise.
> >
> >
> > Prior to the atomic thingie, we actually had a mutex and did
> > mutex_trylock() in __crash_kexec(). I'm a bit confused as this looks like a
> > revert of
> > 05c6257433b7 ("panic, kexec: make __crash_kexec() NMI safe")
> > with just the helpers kept in - this doesn't seem to address any of the
> > original issues regarding NMIs?
> >
> > Sebastian raised some good points in [1] regarding these issues.
> > The main hurdle pointed out there is, if we end up in the slowpath during
> > the unlock, then we can can up acquiring the ->wait_lock which isn't NMI
> > safe.
> >
> > This is even worse on PREEMPT_RT, as both trylock and the unlock can end up
> > acquiring the ->wait_lock.
> >
> > [1]: https://lore.kernel.org/all/YqyZ%[email protected]/
> >
> Having reviewed the references, it would seem that Baoquan's approach of a new
> lock to handle the hotplug activity is the way to go?

If so, I have posted a formal one. It's simple and should work to fix
the issue.