Hi folks,
Here's ~v3~ v4 where we now completely get rid of kexec_mutex.
o Patch 1 makes sure all kexec_mutex acquisitions are trylocks. This prevents
having to add any while(atomic_cmpxchg()) loops which I'd really hate to see
here. If that can't be done then I think we're better off with the combined
mutex+atomic var approach.
o Patch 2 does the mutex -> atomic var switch.
Revisions
=========
v3 -> v4
++++++++
o Someone forgot to Cc LKML on v3...
v2 -> v3
++++++++
o Dropped kexec_mutex entirely and made the atomic variable the one true lock
for kexec (Eric)
v1 -> v2
++++++++
o Changed from Peterson-like synchronization to simpler atomic_cmpxchg
(Petr)
o Slightly reworded changelog
o Added Fixes: tag. Technically should be up to since kexec can happen
in an NMI, but that isn't such a clear target
Cheers,
Valentin
Valentin Schneider (2):
kexec: Turn all kexec_mutex acquisitions into trylocks
panic, kexec: Make __crash_kexec() NMI safe
include/linux/kexec.h | 2 +-
kernel/kexec.c | 11 ++++-------
kernel/kexec_core.c | 28 ++++++++++++++++------------
kernel/kexec_file.c | 4 ++--
kernel/kexec_internal.h | 15 ++++++++++++++-
kernel/ksysfs.c | 7 ++++++-
6 files changed, 43 insertions(+), 24 deletions(-)
--
2.31.1
Attempting to get a crash dump out of a debug PREEMPT_RT kernel via an NMI
panic() doesn't work. The cause of that lies in the PREEMPT_RT definition
of mutex_trylock():
if (IS_ENABLED(CONFIG_DEBUG_RT_MUTEXES) && WARN_ON_ONCE(!in_task()))
return 0;
This prevents an nmi_panic() from executing the main body of
__crash_kexec() which does the actual kexec into the kdump kernel.
The warning and return are explained by:
6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context")
[...]
The reasons for this are:
1) There is a potential deadlock in the slowpath
2) Another cpu which blocks on the rtmutex will boost the task
which allegedly locked the rtmutex, but that cannot work
because the hard/softirq context borrows the task context.
Furthermore, grabbing the lock isn't NMI safe, so do away with kexec_mutex
and replace it with an atomic variable. This is somewhat overzealous
as *some* callsites could keep using a mutex (e.g. the sysfs-facing ones
like crash_shrink_memory()), but this has the benefit of involving a single
unified lock and preventing any future NMI-related surprises.
Tested by triggering NMI panics via:
$ echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi
$ echo 1 > /proc/sys/kernel/unknown_nmi_panic
$ echo 1 > /proc/sys/kernel/panic
$ ipmitool power diag
Fixes: 6ce47fd961fa ("rtmutex: Warn if trylock is called from hard/softirq context")
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/kexec.c | 11 ++++-------
kernel/kexec_core.c | 20 ++++++++++----------
kernel/kexec_file.c | 4 ++--
kernel/kexec_internal.h | 15 ++++++++++++++-
4 files changed, 30 insertions(+), 20 deletions(-)
diff --git a/kernel/kexec.c b/kernel/kexec.c
index b5e40f069768..cb8e6e6f983c 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -93,13 +93,10 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
/*
* Because we write directly to the reserved memory region when loading
- * crash kernels we need a mutex here to prevent multiple crash kernels
- * from attempting to load simultaneously, and to prevent a crash kernel
- * from loading over the top of a in use crash kernel.
- *
- * KISS: always take the mutex.
+ * crash kernels we need a serialization here to prevent multiple crash
+ * kernels from attempting to load simultaneously.
*/
- if (!mutex_trylock(&kexec_mutex))
+ if (!kexec_trylock())
return -EBUSY;
if (flags & KEXEC_ON_CRASH) {
@@ -165,7 +162,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
kimage_free(image);
out_unlock:
- mutex_unlock(&kexec_mutex);
+ kexec_unlock();
return ret;
}
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 16370926b21a..b03859a0fbaa 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -46,7 +46,7 @@
#include <crypto/hash.h>
#include "kexec_internal.h"
-DEFINE_MUTEX(kexec_mutex);
+atomic_t __kexec_lock = ATOMIC_INIT(0);
/* Per cpu memory for storing cpu states in case of system crash. */
note_buf_t __percpu *crash_notes;
@@ -964,7 +964,7 @@ late_initcall(kexec_core_sysctl_init);
*/
void __noclone __crash_kexec(struct pt_regs *regs)
{
- /* Take the kexec_mutex here to prevent sys_kexec_load
+ /* Take the kexec_lock here to prevent sys_kexec_load
* running on one cpu from replacing the crash kernel
* we are using after a panic on a different cpu.
*
@@ -972,7 +972,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
* of memory the xchg(&kexec_crash_image) would be
* sufficient. But since I reuse the memory...
*/
- if (mutex_trylock(&kexec_mutex)) {
+ if (kexec_trylock()) {
if (kexec_crash_image) {
struct pt_regs fixed_regs;
@@ -981,7 +981,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
machine_crash_shutdown(&fixed_regs);
machine_kexec(kexec_crash_image);
}
- mutex_unlock(&kexec_mutex);
+ kexec_unlock();
}
}
STACK_FRAME_NON_STANDARD(__crash_kexec);
@@ -1013,13 +1013,13 @@ ssize_t crash_get_memory_size(void)
{
ssize_t size = 0;
- if (!mutex_trylock(&kexec_mutex))
+ if (!kexec_trylock())
return -EBUSY;
if (crashk_res.end != crashk_res.start)
size = resource_size(&crashk_res);
- mutex_unlock(&kexec_mutex);
+ kexec_unlock();
return size;
}
@@ -1039,7 +1039,7 @@ int crash_shrink_memory(unsigned long new_size)
unsigned long old_size;
struct resource *ram_res;
- if (!mutex_trylock(&kexec_mutex))
+ if (!kexec_trylock())
return -EBUSY;
if (kexec_crash_image) {
@@ -1078,7 +1078,7 @@ int crash_shrink_memory(unsigned long new_size)
insert_resource(&iomem_resource, ram_res);
unlock:
- mutex_unlock(&kexec_mutex);
+ kexec_unlock();
return ret;
}
@@ -1150,7 +1150,7 @@ int kernel_kexec(void)
{
int error = 0;
- if (!mutex_trylock(&kexec_mutex))
+ if (!kexec_trylock())
return -EBUSY;
if (!kexec_image) {
error = -EINVAL;
@@ -1226,7 +1226,7 @@ int kernel_kexec(void)
#endif
Unlock:
- mutex_unlock(&kexec_mutex);
+ kexec_unlock();
return error;
}
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 145321a5e798..42b95bf58daf 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -334,7 +334,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
image = NULL;
- if (!mutex_trylock(&kexec_mutex))
+ if (!kexec_trylock())
return -EBUSY;
dest_image = &kexec_image;
@@ -406,7 +406,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
if ((flags & KEXEC_FILE_ON_CRASH) && kexec_crash_image)
arch_kexec_protect_crashkres();
- mutex_unlock(&kexec_mutex);
+ kexec_unlock();
kimage_free(image);
return ret;
}
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 48aaf2ac0d0d..74da1409cd14 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -13,7 +13,20 @@ void kimage_terminate(struct kimage *image);
int kimage_is_destination_range(struct kimage *image,
unsigned long start, unsigned long end);
-extern struct mutex kexec_mutex;
+/*
+ * Whatever is used to serialize accesses to the kexec_crash_image needs to be
+ * NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use a
+ * "simple" atomic variable that is acquired with a cmpxchg().
+ */
+extern atomic_t __kexec_lock;
+static inline bool kexec_trylock(void)
+{
+ return atomic_cmpxchg_acquire(&__kexec_lock, 0, 1) == 0;
+}
+static inline void kexec_unlock(void)
+{
+ atomic_set_release(&__kexec_lock, 0);
+}
#ifdef CONFIG_KEXEC_FILE
#include <linux/purgatory.h>
--
2.31.1
Most acquistions of kexec_mutex are done via mutex_trylock() - those were a
direct "translation" from:
8c5a1cf0ad3a ("kexec: use a mutex for locking rather than xchg()")
there has however been two additions since then that use mutex_lock():
crash_get_memory_size() and crash_shrink_memory().
A later commit will replace said mutex with an atomic variable, and locking
operations will become atomic_cmpxchg(). Rather than having those
mutex_lock() become while (atomic_cmpxchg(&lock, 0, 1)), turn them into
trylocks that can return -EBUSY on acquisition failure.
This does halve the printable size of the crash kernel, but that's still
neighbouring 2G for 32bit kernels which should be ample enough.
Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/kexec.h | 2 +-
kernel/kexec_core.c | 12 ++++++++----
kernel/ksysfs.c | 7 ++++++-
3 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index ce6536f1d269..54d7030d3c41 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -390,7 +390,7 @@ extern int kexec_load_disabled;
extern bool kexec_in_progress;
int crash_shrink_memory(unsigned long new_size);
-size_t crash_get_memory_size(void);
+ssize_t crash_get_memory_size(void);
void crash_free_reserved_phys_range(unsigned long begin, unsigned long end);
void arch_kexec_protect_crashkres(void);
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 4d34c78334ce..16370926b21a 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1009,13 +1009,16 @@ void crash_kexec(struct pt_regs *regs)
}
}
-size_t crash_get_memory_size(void)
+ssize_t crash_get_memory_size(void)
{
- size_t size = 0;
+ ssize_t size = 0;
+
+ if (!mutex_trylock(&kexec_mutex))
+ return -EBUSY;
- mutex_lock(&kexec_mutex);
if (crashk_res.end != crashk_res.start)
size = resource_size(&crashk_res);
+
mutex_unlock(&kexec_mutex);
return size;
}
@@ -1036,7 +1039,8 @@ int crash_shrink_memory(unsigned long new_size)
unsigned long old_size;
struct resource *ram_res;
- mutex_lock(&kexec_mutex);
+ if (!mutex_trylock(&kexec_mutex))
+ return -EBUSY;
if (kexec_crash_image) {
ret = -ENOENT;
diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index b1292a57c2a5..65dba9076f31 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -105,7 +105,12 @@ KERNEL_ATTR_RO(kexec_crash_loaded);
static ssize_t kexec_crash_size_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
- return sprintf(buf, "%zu\n", crash_get_memory_size());
+ ssize_t size = crash_get_memory_size();
+
+ if (size < 0)
+ return size;
+
+ return sprintf(buf, "%zd\n", size);
}
static ssize_t kexec_crash_size_store(struct kobject *kobj,
struct kobj_attribute *attr,
--
2.31.1
Hi,
On 06/30/22 at 11:32pm, Valentin Schneider wrote:
> Hi folks,
>
> Here's ~v3~ v4 where we now completely get rid of kexec_mutex.
>
> o Patch 1 makes sure all kexec_mutex acquisitions are trylocks. This prevents
> having to add any while(atomic_cmpxchg()) loops which I'd really hate to see
> here. If that can't be done then I think we're better off with the combined
> mutex+atomic var approach.
> o Patch 2 does the mutex -> atomic var switch.
This series looks good, has it been taken into any tree?
Thanks
Baoquan
>
> Revisions
> =========
>
> v3 -> v4
> ++++++++
>
> o Someone forgot to Cc LKML on v3...
>
> v2 -> v3
> ++++++++
>
> o Dropped kexec_mutex entirely and made the atomic variable the one true lock
> for kexec (Eric)
>
> v1 -> v2
> ++++++++
>
> o Changed from Peterson-like synchronization to simpler atomic_cmpxchg
> (Petr)
> o Slightly reworded changelog
> o Added Fixes: tag. Technically should be up to since kexec can happen
> in an NMI, but that isn't such a clear target
>
> Cheers,
> Valentin
>
> Valentin Schneider (2):
> kexec: Turn all kexec_mutex acquisitions into trylocks
> panic, kexec: Make __crash_kexec() NMI safe
>
> include/linux/kexec.h | 2 +-
> kernel/kexec.c | 11 ++++-------
> kernel/kexec_core.c | 28 ++++++++++++++++------------
> kernel/kexec_file.c | 4 ++--
> kernel/kexec_internal.h | 15 ++++++++++++++-
> kernel/ksysfs.c | 7 ++++++-
> 6 files changed, 43 insertions(+), 24 deletions(-)
>
> --
> 2.31.1
>
On 12/07/22 10:47, Baoquan He wrote:
> Hi,
>
> On 06/30/22 at 11:32pm, Valentin Schneider wrote:
>> Hi folks,
>>
>> Here's ~v3~ v4 where we now completely get rid of kexec_mutex.
>>
>> o Patch 1 makes sure all kexec_mutex acquisitions are trylocks. This prevents
>> having to add any while(atomic_cmpxchg()) loops which I'd really hate to see
>> here. If that can't be done then I think we're better off with the combined
>> mutex+atomic var approach.
>> o Patch 2 does the mutex -> atomic var switch.
>
> This series looks good, has it been taken into any tree?
>
I don't think so, briefly poked around git and haven't seen it anywhere.
> Thanks
> Baoquan
On Tue, 12 Jul 2022 12:13:03 +0100 Valentin Schneider <[email protected]> wrote:
> On 12/07/22 10:47, Baoquan He wrote:
> > Hi,
> >
> > On 06/30/22 at 11:32pm, Valentin Schneider wrote:
> >> Hi folks,
> >>
> >> Here's ~v3~ v4 where we now completely get rid of kexec_mutex.
> >>
> >> o Patch 1 makes sure all kexec_mutex acquisitions are trylocks. This prevents
> >> having to add any while(atomic_cmpxchg()) loops which I'd really hate to see
> >> here. If that can't be done then I think we're better off with the combined
> >> mutex+atomic var approach.
> >> o Patch 2 does the mutex -> atomic var switch.
> >
> > This series looks good, has it been taken into any tree?
> >
>
> I don't think so, briefly poked around git and haven't seen it anywhere.
I'll stash it away for consideration after -rc1.
On Thu 2022-06-30 23:32:56, Valentin Schneider wrote:
> Valentin Schneider (2):
> kexec: Turn all kexec_mutex acquisitions into trylocks
> panic, kexec: Make __crash_kexec() NMI safe
This version looks good to me. For both patches:
Reviewed-by: Petr Mladek <[email protected]>
Best Regards,
Petr
Hi,
On 13/07/22 10:39, Andrew Morton wrote:
> On Tue, 12 Jul 2022 12:13:03 +0100 Valentin Schneider <[email protected]> wrote:
>
>> On 12/07/22 10:47, Baoquan He wrote:
>> > Hi,
>> >
>> > This series looks good, has it been taken into any tree?
>> >
>>
>> I don't think so, briefly poked around git and haven't seen it anywhere.
>
> I'll stash it away for consideration after -rc1.
I've seen them in linux-next for a while, am I right in assuming they'll be
part of a 6.1 PR?
On Mon, 03 Oct 2022 14:20:51 +0100 Valentin Schneider <[email protected]> wrote:
> > I'll stash it away for consideration after -rc1.
>
> I've seen them in linux-next for a while, am I right in assuming they'll be
> part of a 6.1 PR?
yup.