Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.
However, there are use cases where this mode of operation is not what we
actually want. In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched.
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem. If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory. See James' and my Linux Plumbers
Conference 2023 presentation for details:
https://lpc.events/event/17/contributions/1485/
To start us on the journey to support all the use cases above, this
patch implements basic infrastructure to allow hand over of kernel state
across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
With this patch set applied, you can read ftrace records from the
pre-kexec environment in your post-kexec one. This creates a very powerful
debugging and performance analysis tool for kexec. It's also slightly
easier to reason about than full blown VFIO state preservation.
== Alternatives ==
There are alternative approaches to (parts of) the problems above:
* Memory Pools [1] - preallocated persistent memory region + allocator
* PRMEM [2] - resizable persistent memory regions with fixed metadata
pointer on the kernel command line + allocator
* Pkernfs [3] - preallocated file system for in-kernel data with fixed
address location on the kernel command line
* PKRAM [4] - handover of user space pages using a fixed metadata page
specified via command line
All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command
line to pass data (including memory reservations) between kexec'ing
kernels.
KHO provides that base foundation. We will determine later whether we
still need any of the approaches above for fast bulk memory handover of for
example IOMMU page tables. But IMHO they would all be users of KHO, with
KHO providing the foundational primitive to pass metadata and bulk memory
reservations as well as provide easy versioning for data.
== Documentation ==
If people are happy with the approach in this patch set, I will write up
conclusive documentation including schemas for the metadata as part of its
next iteration. For now, here's a rudimentary overview:
We introduce a metadata file that the kernels pass between each other. How
they pass it is architecture specific. The file's format is a Flattened
Device Tree (fdt) which has a generator and parser already included in
Linux. When the root user enables KHO through /sys/kernel/kho/active, the
kernel invokes callbacks to every driver that supports KHO to serialize
its state. When the actual kexec happens, the fdt is part of the image
set that we boot into. In addition, we keep a "scratch region" available
for kexec: A physically contiguous memory region that is guaranteed to
not have any memory that KHO would preserve. The new kernel bootstraps
itself using the scratch region and sets all handed over memory as in use.
When drivers initialize that support KHO, they introspect the fdt and
recover their state from it. This includes memory reservations, where the
driver can either discard or claim reservations.
== Limitations ==
I currently only implemented file based kexec. The kernel interfaces
in the patch set are already in place to support user space kexec as well,
but I have not implemented it yet.
== How to Use ==
To use the code, please boot the kernel with the "kho_scratch=" command
line parameter set: "kho_scratch=512M". KHO requires a scratch region.
Make sure to fill ftrace with contents that you want to observe after
kexec. Then, before you invoke file based "kexec -l", activate KHO:
# echo 1 > /sys/kernel/kho/active
# kexec -l Image --initrd=initrd -s
# kexec -e
The new kernel will boot up and contain the previous kernel's trace
buffers in /sys/kernel/debug/tracing/trace.
Alex
[1] https://lore.kernel.org/all/169645773092.11424.7258549771090599226.stgit@skinsburskii./
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lpc.events/event/17/contributions/1485/attachments/1296/2650/jgowans-preserving-across-kexec.pdf
[4] https://lore.kernel.org/kexec/[email protected]/
Alexander Graf (15):
mm,memblock: Add support for scratch memory
memblock: Declare scratch memory as CMA
kexec: Add Kexec HandOver (KHO) generation helpers
kexec: Add KHO parsing support
kexec: Add KHO support to kexec file loads
arm64: Add KHO support
x86: Add KHO support
tracing: Introduce names for ring buffers
tracing: Introduce names for events
tracing: Introduce kho serialization
tracing: Add kho serialization of trace buffers
tracing: Recover trace buffers from kexec handover
tracing: Add kho serialization of trace events
tracing: Recover trace events from kexec handover
tracing: Add config option for kexec handover
Documentation/ABI/testing/sysfs-firmware-kho | 9 +
Documentation/ABI/testing/sysfs-kernel-kho | 53 ++
.../admin-guide/kernel-parameters.txt | 10 +
MAINTAINERS | 2 +
arch/arm64/Kconfig | 12 +
arch/arm64/kernel/setup.c | 2 +
arch/arm64/mm/init.c | 8 +
arch/x86/Kconfig | 12 +
arch/x86/boot/compressed/kaslr.c | 55 ++
arch/x86/include/uapi/asm/bootparam.h | 15 +-
arch/x86/kernel/e820.c | 9 +
arch/x86/kernel/kexec-bzimage64.c | 39 ++
arch/x86/kernel/setup.c | 46 ++
arch/x86/mm/init_32.c | 7 +
arch/x86/mm/init_64.c | 7 +
drivers/of/fdt.c | 41 ++
drivers/of/kexec.c | 36 ++
include/linux/kexec.h | 56 ++
include/linux/memblock.h | 19 +
include/linux/ring_buffer.h | 9 +-
include/linux/trace_events.h | 1 +
include/trace/trace_events.h | 2 +
include/uapi/linux/kexec.h | 6 +
kernel/Makefile | 2 +
kernel/kexec_file.c | 41 ++
kernel/kexec_kho_in.c | 298 ++++++++++
kernel/kexec_kho_out.c | 526 ++++++++++++++++++
kernel/trace/Kconfig | 13 +
kernel/trace/blktrace.c | 1 +
kernel/trace/ring_buffer.c | 267 ++++++++-
kernel/trace/trace.c | 76 ++-
kernel/trace/trace_branch.c | 1 +
kernel/trace/trace_events.c | 3 +
kernel/trace/trace_functions_graph.c | 4 +-
kernel/trace/trace_output.c | 106 +++-
kernel/trace/trace_output.h | 1 +
kernel/trace/trace_probe.c | 3 +
kernel/trace/trace_syscalls.c | 29 +
mm/Kconfig | 4 +
mm/memblock.c | 83 ++-
40 files changed, 1901 insertions(+), 13 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-firmware-kho
create mode 100644 Documentation/ABI/testing/sysfs-kernel-kho
create mode 100644 kernel/kexec_kho_in.c
create mode 100644 kernel/kexec_kho_out.c
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
When we finish populating our memory, we don't want to lose the scratch
region as memory we can use for useful data. Do do that, we mark it as
CMA memory. That means that any allocation within it only happens with
movable memory which we can then happily discard for the next kexec.
That way we don't lose the scratch region's memory anymore for
allocations after boot.
Signed-off-by: Alexander Graf <[email protected]>
---
mm/memblock.c | 30 ++++++++++++++++++++++++++----
1 file changed, 26 insertions(+), 4 deletions(-)
diff --git a/mm/memblock.c b/mm/memblock.c
index e89e6c8f9d75..44741424dab7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -16,6 +16,7 @@
#include <linux/kmemleak.h>
#include <linux/seq_file.h>
#include <linux/memblock.h>
+#include <linux/page-isolation.h>
#include <asm/sections.h>
#include <linux/io.h>
@@ -1100,10 +1101,6 @@ static bool should_skip_region(struct memblock_type *type,
if ((flags & MEMBLOCK_SCRATCH) && !memblock_is_scratch(m))
return true;
- /* Leave scratch memory alone after scratch-only phase */
- if (!(flags & MEMBLOCK_SCRATCH) && memblock_is_scratch(m))
- return true;
-
return false;
}
@@ -2153,6 +2150,20 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
}
}
+static void reserve_scratch_mem(phys_addr_t start, phys_addr_t end)
+{
+#ifdef CONFIG_MEMBLOCK_SCRATCH
+ ulong start_pfn = pageblock_start_pfn(PFN_DOWN(start));
+ ulong end_pfn = pageblock_align(PFN_UP(end));
+ ulong pfn;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ /* Mark as CMA to prevent kernel allocations in it */
+ set_pageblock_migratetype(pfn_to_page(pfn), MIGRATE_CMA);
+ }
+#endif
+}
+
static unsigned long __init __free_memory_core(phys_addr_t start,
phys_addr_t end)
{
@@ -2214,6 +2225,17 @@ static unsigned long __init free_low_memory_core_early(void)
memmap_init_reserved_pages();
+#ifdef CONFIG_MEMBLOCK_SCRATCH
+ /*
+ * Mark scratch mem as CMA before we return it. That way we ensure that
+ * no kernel allocations happen on it. That means we can reuse it as
+ * scratch memory again later.
+ */
+ __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
+ MEMBLOCK_SCRATCH, &start, &end, NULL)
+ reserve_scratch_mem(start, end);
+#endif
+
/*
* We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
* because in some case like Node0 doesn't have RAM installed
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
This patch adds the core infrastructure to generate Kexec HandOver
metadata. Kexec HandOver is a mechanism that allows Linux to preserve
state - arbitrary properties as well as memory locations - across kexec.
It does so using 3 concepts:
1) Device Tree - Every KHO kexec carries a KHO specific flattened
device tree blob that describes the state of the system. Device
drivers can register to KHO to serialize their state before kexec.
2) Mem cache - A memblocks like structure that contains full page
ranges of reservations. These can not be part of the architectural
reservations, because they differ on every kexec.
3) Scratch Region - A CMA region that we allocate in the first kernel.
CMA gives us the guarantee that no handover pages land in that
region, because handover pages must be at a static physical memory
location. We use this region as the place to load future kexec
images into which then won't collide with any handover data.
Signed-off-by: Alexander Graf <[email protected]>
---
Documentation/ABI/testing/sysfs-kernel-kho | 53 +++
.../admin-guide/kernel-parameters.txt | 10 +
MAINTAINERS | 1 +
include/linux/kexec.h | 24 ++
include/uapi/linux/kexec.h | 6 +
kernel/Makefile | 1 +
kernel/kexec_kho_out.c | 316 ++++++++++++++++++
7 files changed, 411 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-kho
create mode 100644 kernel/kexec_kho_out.c
diff --git a/Documentation/ABI/testing/sysfs-kernel-kho b/Documentation/ABI/testing/sysfs-kernel-kho
new file mode 100644
index 000000000000..f69e7b81a337
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-kho
@@ -0,0 +1,53 @@
+What: /sys/kernel/kho/active
+Date: December 2023
+Contact: Alexander Graf <[email protected]>
+Description:
+ Kexec HandOver (KHO) allows Linux to transition the state of
+ compatible drivers into the next kexec'ed kernel. To do so,
+ device drivers will serialize their current state into a DT.
+ While the state is serialized, they are unable to perform
+ any modifications to state that was serialized, such as
+ handed over memory allocations.
+
+ When this file contains "1", the system is in the transition
+ state. When contains "0", it is not. To switch between the
+ two states, echo the respective number into this file.
+
+What: /sys/kernel/kho/dt_max
+Date: December 2023
+Contact: Alexander Graf <[email protected]>
+Description:
+ KHO needs to allocate a buffer for the DT that gets
+ generated before it knows the final size. By default, it
+ will allocate 10 MiB for it. You can write to this file
+ to modify the size of that allocation.
+
+What: /sys/kernel/kho/scratch_len
+Date: December 2023
+Contact: Alexander Graf <[email protected]>
+Description:
+ To support continuous KHO kexecs, we need to reserve a
+ physically contiguous memory region that will always stay
+ available for future kexec allocations. This file describes
+ the length of that memory region. Kexec user space tooling
+ can use this to determine where it should place its payload
+ images.
+
+What: /sys/kernel/kho/scratch_phys
+Date: December 2023
+Contact: Alexander Graf <[email protected]>
+Description:
+ To support continuous KHO kexecs, we need to reserve a
+ physically contiguous memory region that will always stay
+ available for future kexec allocations. This file describes
+ the physical location of that memory region. Kexec user space
+ tooling can use this to determine where it should place its
+ payload images.
+
+What: /sys/kernel/kho/dt
+Date: December 2023
+Contact: Alexander Graf <[email protected]>
+Description:
+ When KHO is active, the kernel exposes the generated DT that
+ carries its current KHO state in this file. Kexec user space
+ tooling can use this as input file for the KHO payload image.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 51575cd31741..efeef075617e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2504,6 +2504,16 @@
kgdbwait [KGDB] Stop kernel execution and enter the
kernel debugger at the earliest opportunity.
+ kho_scratch=n[KMG] [KEXEC] Sets the size of the KHO scratch
+ region. The KHO scratch region is a physically
+ memory range that can only be used for non-kernel
+ allocations. That way, even when memory is heavily
+ fragmented with handed over memory, kexec will always
+ be able to find contiguous memory to place the next
+ kernel for kexec into.
+
+ The default is 0.
+
kmac= [MIPS] Korina ethernet MAC address.
Configure the RouterBoard 532 series on-chip
Ethernet adapter MAC address.
diff --git a/MAINTAINERS b/MAINTAINERS
index 788be9ab5b73..4ebf7c5fd424 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11769,6 +11769,7 @@ M: Eric Biederman <[email protected]>
L: [email protected]
S: Maintained
W: http://kernel.org/pub/linux/utils/kernel/kexec/
+F: Documentation/ABI/testing/sysfs-kernel-kho
F: include/linux/kexec.h
F: include/uapi/linux/kexec.h
F: kernel/kexec*
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 8227455192b7..db2597e5550d 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -21,6 +21,8 @@
#include <uapi/linux/kexec.h>
#include <linux/verification.h>
+#include <linux/libfdt.h>
+#include <linux/notifier.h>
extern note_buf_t __percpu *crash_notes;
@@ -516,6 +518,28 @@ void set_kexec_sig_enforced(void);
static inline void set_kexec_sig_enforced(void) {}
#endif
+#ifdef CONFIG_KEXEC_KHO
+/* Notifier index */
+enum kho_event {
+ KEXEC_KHO_DUMP = 0,
+ KEXEC_KHO_ABORT = 1,
+};
+
+extern phys_addr_t kho_scratch_phys;
+extern phys_addr_t kho_scratch_len;
+
+/* egest handover metadata */
+void kho_reserve(void);
+int register_kho_notifier(struct notifier_block *nb);
+int unregister_kho_notifier(struct notifier_block *nb);
+bool kho_is_active(void);
+#else
+static inline void kho_reserve(void) { }
+static inline int register_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
+static inline int unregister_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
+static inline bool kho_is_active(void) { return false; }
+#endif
+
#endif /* !defined(__ASSEBMLY__) */
#endif /* LINUX_KEXEC_H */
diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
index 01766dd839b0..d02ffd5960d6 100644
--- a/include/uapi/linux/kexec.h
+++ b/include/uapi/linux/kexec.h
@@ -49,6 +49,12 @@
/* The artificial cap on the number of segments passed to kexec_load. */
#define KEXEC_SEGMENT_MAX 16
+/* KHO passes an array of kho_mem as "mem cache" to the new kernel */
+struct kho_mem {
+ __u64 addr;
+ __u64 len;
+};
+
#ifndef __KERNEL__
/*
* This structure is used to hold the arguments that are used when
diff --git a/kernel/Makefile b/kernel/Makefile
index 3947122d618b..a6bd31e22c09 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -73,6 +73,7 @@ obj-$(CONFIG_KEXEC_CORE) += kexec_core.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
+obj-$(CONFIG_KEXEC_KHO) += kexec_kho_out.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup/
diff --git a/kernel/kexec_kho_out.c b/kernel/kexec_kho_out.c
new file mode 100644
index 000000000000..e6184bde5c10
--- /dev/null
+++ b/kernel/kexec_kho_out.c
@@ -0,0 +1,316 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kexec_kho_out.c - kexec handover code to egest metadata.
+ * Copyright (C) 2023 Alexander Graf <[email protected]>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/cma.h>
+#include <linux/kexec.h>
+#include <linux/device.h>
+#include <linux/compiler.h>
+#include <linux/kmsg_dump.h>
+
+struct kho_out {
+ struct kobject *kobj;
+ bool active;
+ struct cma *cma;
+ struct blocking_notifier_head chain_head;
+ void *dt;
+ u64 dt_len;
+ u64 dt_max;
+ struct mutex lock;
+};
+
+static struct kho_out kho = {
+ .dt_max = (1024 * 1024 * 10),
+ .chain_head = BLOCKING_NOTIFIER_INIT(kho.chain_head),
+ .lock = __MUTEX_INITIALIZER(kho.lock),
+};
+
+/*
+ * Size for scratch (non-KHO) memory. With KHO enabled, memory can become
+ * fragmented because KHO regions may be anywhere in physical address
+ * space. The scratch region gives us a safe zone that we will never see
+ * KHO allocations from. This is where we can later safely load our new kexec
+ * images into.
+ */
+static phys_addr_t kho_scratch_size __initdata;
+
+int register_kho_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_register(&kho.chain_head, nb);
+}
+EXPORT_SYMBOL_GPL(register_kho_notifier);
+
+int unregister_kho_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_unregister(&kho.chain_head, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_kho_notifier);
+
+bool kho_is_active(void)
+{
+ return kho.active;
+}
+EXPORT_SYMBOL_GPL(kho_is_active);
+
+static ssize_t raw_read(struct file *file, struct kobject *kobj,
+ struct bin_attribute *attr, char *buf,
+ loff_t pos, size_t count)
+{
+ mutex_lock(&kho.lock);
+ memcpy(buf, attr->private + pos, count);
+ mutex_unlock(&kho.lock);
+
+ return count;
+}
+
+static BIN_ATTR(dt, 0400, raw_read, NULL, 0);
+
+static int kho_expose_dt(void *fdt)
+{
+ long fdt_len = fdt_totalsize(fdt);
+ int err;
+
+ kho.dt = fdt;
+ kho.dt_len = fdt_len;
+
+ bin_attr_dt.size = fdt_totalsize(fdt);
+ bin_attr_dt.private = fdt;
+ err = sysfs_create_bin_file(kho.kobj, &bin_attr_dt);
+
+ return err;
+}
+
+static void kho_abort(void)
+{
+ if (!kho.active)
+ return;
+
+ sysfs_remove_bin_file(kho.kobj, &bin_attr_dt);
+
+ kvfree(kho.dt);
+ kho.dt = NULL;
+ kho.dt_len = 0;
+
+ blocking_notifier_call_chain(&kho.chain_head, KEXEC_KHO_ABORT, NULL);
+
+ kho.active = false;
+}
+
+static int kho_serialize(void)
+{
+ void *fdt = NULL;
+ int err;
+
+ kho.active = true;
+ err = -ENOMEM;
+
+ fdt = kvmalloc(kho.dt_max, GFP_KERNEL);
+ if (!fdt)
+ goto out;
+
+ if (fdt_create(fdt, kho.dt_max)) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ err = fdt_finish_reservemap(fdt);
+ if (err)
+ goto out;
+
+ err = fdt_begin_node(fdt, "");
+ if (err)
+ goto out;
+
+ err = fdt_property_string(fdt, "compatible", "kho-v1");
+ if (err)
+ goto out;
+
+ /* Loop through all kho dump functions */
+ err = blocking_notifier_call_chain(&kho.chain_head, KEXEC_KHO_DUMP, fdt);
+ err = notifier_to_errno(err);
+ if (err)
+ goto out;
+
+ /* Close / */
+ err = fdt_end_node(fdt);
+ if (err)
+ goto out;
+
+ err = fdt_finish(fdt);
+ if (err)
+ goto out;
+
+ if (WARN_ON(fdt_check_header(fdt))) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ err = kho_expose_dt(fdt);
+
+out:
+ if (err) {
+ pr_err("kho failed to serialize state: %d", err);
+ kho_abort();
+ }
+ return err;
+}
+
+/* Handling for /sys/kernel/kho */
+
+#define KHO_ATTR_RO(_name) static struct kobj_attribute _name##_attr = __ATTR_RO_MODE(_name, 0400)
+#define KHO_ATTR_RW(_name) static struct kobj_attribute _name##_attr = __ATTR_RW_MODE(_name, 0600)
+
+static ssize_t active_store(struct kobject *dev, struct kobj_attribute *attr,
+ const char *buf, size_t size)
+{
+ ssize_t retsize = size;
+ bool val = false;
+ int ret;
+
+ if (kstrtobool(buf, &val) < 0)
+ return -EINVAL;
+
+ if (!kho_scratch_len)
+ return -ENOMEM;
+
+ mutex_lock(&kho.lock);
+ if (val != kho.active) {
+ if (val) {
+ ret = kho_serialize();
+ if (ret) {
+ retsize = -EINVAL;
+ goto out;
+ }
+ } else {
+ kho_abort();
+ }
+ }
+
+out:
+ mutex_unlock(&kho.lock);
+ return retsize;
+}
+
+static ssize_t active_show(struct kobject *dev, struct kobj_attribute *attr,
+ char *buf)
+{
+ ssize_t ret;
+
+ mutex_lock(&kho.lock);
+ ret = sysfs_emit(buf, "%d\n", kho.active);
+ mutex_unlock(&kho.lock);
+
+ return ret;
+}
+KHO_ATTR_RW(active);
+
+static ssize_t dt_max_store(struct kobject *dev, struct kobj_attribute *attr,
+ const char *buf, size_t size)
+{
+ u64 val;
+
+ if (kstrtoull(buf, 0, &val))
+ return -EINVAL;
+
+ kho.dt_max = val;
+
+ return size;
+}
+
+static ssize_t dt_max_show(struct kobject *dev, struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "0x%llx\n", kho.dt_max);
+}
+KHO_ATTR_RW(dt_max);
+
+static ssize_t scratch_len_show(struct kobject *dev, struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "0x%llx\n", kho_scratch_len);
+}
+KHO_ATTR_RO(scratch_len);
+
+static ssize_t scratch_phys_show(struct kobject *dev, struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "0x%llx\n", kho_scratch_phys);
+}
+KHO_ATTR_RO(scratch_phys);
+
+static __init int kho_out_init(void)
+{
+ int ret = 0;
+
+ kho.kobj = kobject_create_and_add("kho", kernel_kobj);
+ if (!kho.kobj) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ ret = sysfs_create_file(kho.kobj, &active_attr.attr);
+ if (ret)
+ goto err;
+
+ ret = sysfs_create_file(kho.kobj, &dt_max_attr.attr);
+ if (ret)
+ goto err;
+
+ ret = sysfs_create_file(kho.kobj, &scratch_phys_attr.attr);
+ if (ret)
+ goto err;
+
+ ret = sysfs_create_file(kho.kobj, &scratch_len_attr.attr);
+ if (ret)
+ goto err;
+
+err:
+ return ret;
+}
+late_initcall(kho_out_init);
+
+static int __init early_kho_scratch(char *p)
+{
+ kho_scratch_size = memparse(p, &p);
+ return 0;
+}
+early_param("kho_scratch", early_kho_scratch);
+
+/**
+ * kho_reserve - Reserve a contiguous chunk of memory for kexec
+ *
+ * With KHO we can preserve arbitrary pages in the system. To ensure we still
+ * have a large contiguous region of memory when we search the physical address
+ * space for target memory, let's make sure we always have a large CMA region
+ * active. This CMA region will only be used for movable pages which are not a
+ * problem for us during KHO because we can just move them somewhere else.
+ */
+__init void kho_reserve(void)
+{
+ int r;
+
+ if (kho_get_fdt()) {
+ /*
+ * We came from a previous KHO handover, so we already have
+ * a known good scratch region that we preserve. No need to
+ * allocate another.
+ */
+ return;
+ }
+
+ /* Only allocate KHO scratch memory when we're asked to */
+ if (!kho_scratch_size)
+ return;
+
+ r = cma_declare_contiguous_nid(0, kho_scratch_size, 0, PAGE_SIZE, 0,
+ false, "kho", &kho.cma, NUMA_NO_NODE);
+ if (WARN_ON(r))
+ return;
+
+ kho_scratch_phys = cma_get_base(kho.cma);
+ kho_scratch_len = cma_get_size(kho.cma);
+}
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
When we have a KHO kexec, we get a device tree, mem cache and scratch
region to populate the state of the system. Provide helper functions
that allow architecture code to easily handle memory reservations based
on them and give device drivers visibility into the KHO DT and memory
reservations so they can recover their own state.
Signed-off-by: Alexander Graf <[email protected]>
---
Documentation/ABI/testing/sysfs-firmware-kho | 9 +
MAINTAINERS | 1 +
include/linux/kexec.h | 23 ++
kernel/Makefile | 1 +
kernel/kexec_kho_in.c | 298 +++++++++++++++++++
5 files changed, 332 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-firmware-kho
create mode 100644 kernel/kexec_kho_in.c
diff --git a/Documentation/ABI/testing/sysfs-firmware-kho b/Documentation/ABI/testing/sysfs-firmware-kho
new file mode 100644
index 000000000000..e4ed2cb7c810
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-firmware-kho
@@ -0,0 +1,9 @@
+What: /sys/firmware/kho/dt
+Date: December 2023
+Contact: Alexander Graf <[email protected]>
+Description:
+ When the kernel was booted with Kexec HandOver (KHO),
+ the device tree that carries metadata about the previous
+ kernel's state is in this file. This file may disappear
+ when all consumers of it finished to interpret their
+ metadata.
diff --git a/MAINTAINERS b/MAINTAINERS
index 4ebf7c5fd424..ec92a0dd628d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11769,6 +11769,7 @@ M: Eric Biederman <[email protected]>
L: [email protected]
S: Maintained
W: http://kernel.org/pub/linux/utils/kernel/kexec/
+F: Documentation/ABI/testing/sysfs-firmware-kho
F: Documentation/ABI/testing/sysfs-kernel-kho
F: include/linux/kexec.h
F: include/uapi/linux/kexec.h
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index db2597e5550d..a3c4fee6f86a 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -528,12 +528,35 @@ enum kho_event {
extern phys_addr_t kho_scratch_phys;
extern phys_addr_t kho_scratch_len;
+/* ingest handover metadata */
+void kho_reserve_mem(void);
+void kho_populate(phys_addr_t dt_phys, phys_addr_t scratch_phys, u64 scratch_len,
+ phys_addr_t mem_phys, u64 mem_len);
+void kho_populate_refcount(void);
+void *kho_get_fdt(void);
+void kho_return_mem(const struct kho_mem *mem);
+void *kho_claim_mem(const struct kho_mem *mem);
+static inline bool is_kho_boot(void)
+{
+ return !!kho_scratch_phys;
+}
+
/* egest handover metadata */
void kho_reserve(void);
int register_kho_notifier(struct notifier_block *nb);
int unregister_kho_notifier(struct notifier_block *nb);
bool kho_is_active(void);
#else
+/* ingest handover metadata */
+static inline void kho_reserve_mem(void) { }
+static inline bool is_kho_boot(void) { return false; }
+static inline void kho_populate(phys_addr_t dt_phys, phys_addr_t scratch_phys,
+ u64 scratch_len, phys_addr_t mem_phys,
+ u64 mem_len) { }
+static inline void kho_populate_refcount(void) { }
+static inline void *kho_get_fdt(void) { return NULL; }
+
+/* egest handover metadata */
static inline void kho_reserve(void) { }
static inline int register_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
static inline int unregister_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
diff --git a/kernel/Makefile b/kernel/Makefile
index a6bd31e22c09..7c3065e40c75 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -73,6 +73,7 @@ obj-$(CONFIG_KEXEC_CORE) += kexec_core.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
+obj-$(CONFIG_KEXEC_KHO) += kexec_kho_in.o
obj-$(CONFIG_KEXEC_KHO) += kexec_kho_out.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
diff --git a/kernel/kexec_kho_in.c b/kernel/kexec_kho_in.c
new file mode 100644
index 000000000000..12ec54fc537a
--- /dev/null
+++ b/kernel/kexec_kho_in.c
@@ -0,0 +1,298 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kexec_kho_in.c - kexec handover code to ingest metadata.
+ * Copyright (C) 2023 Alexander Graf <[email protected]>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kexec.h>
+#include <linux/device.h>
+#include <linux/compiler.h>
+#include <linux/io.h>
+#include <linux/kmsg_dump.h>
+#include <linux/memblock.h>
+
+/* The kho dt during runtime */
+static void *fdt;
+
+/* Globals to hand over phys/len from early to runtime */
+static phys_addr_t handover_phys __initdata;
+static u32 handover_len __initdata;
+
+static phys_addr_t mem_phys __initdata;
+static u32 mem_len __initdata;
+
+phys_addr_t kho_scratch_phys;
+phys_addr_t kho_scratch_len;
+
+void *kho_get_fdt(void)
+{
+ return fdt;
+}
+EXPORT_SYMBOL_GPL(kho_get_fdt);
+
+/**
+ * kho_populate_refcount - Scan the DT for any memory ranges. Increase the
+ * affected pages' refcount by 1 for each.
+ */
+__init void kho_populate_refcount(void)
+{
+ void *fdt = kho_get_fdt();
+ void *mem_virt = __va(mem_phys);
+ int offset = 0, depth = 0, initial_depth = 0, len;
+
+ if (!fdt)
+ return;
+
+ /* Go through the mem list and add 1 for each reference */
+ for (offset = 0;
+ offset >= 0 && depth >= initial_depth;
+ offset = fdt_next_node(fdt, offset, &depth)) {
+ const struct kho_mem *mems;
+ u32 i;
+
+ mems = fdt_getprop(fdt, offset, "mem", &len);
+ if (!mems || len & (sizeof(*mems) - 1))
+ continue;
+
+ for (i = 0; i < len; i += sizeof(*mems)) {
+ const struct kho_mem *mem = ((void *)mems) + i;
+ u64 start_pfn = PFN_DOWN(mem->addr);
+ u64 end_pfn = PFN_UP(mem->addr + mem->len);
+ u64 pfn;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++)
+ get_page(pfn_to_page(pfn));
+ }
+ }
+
+ /*
+ * Then reduce the reference count by 1 to offset the initial ref count
+ * of 1. In addition, unreserve the page. That way, we can free_page()
+ * it for every consumer and automatically free it to the global memory
+ * pool when everyone is done.
+ */
+ for (offset = 0; offset < mem_len; offset += sizeof(struct kho_mem)) {
+ struct kho_mem *mem = mem_virt + offset;
+ u64 start_pfn = PFN_DOWN(mem->addr);
+ u64 end_pfn = PFN_UP(mem->addr + mem->len);
+ u64 pfn;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ struct page *page = pfn_to_page(pfn);
+
+ /*
+ * This is similar to free_reserved_page(), but
+ * preserves the reference count
+ */
+ ClearPageReserved(page);
+ __free_page(page);
+ adjust_managed_page_count(page, 1);
+ }
+ }
+}
+
+static void kho_return_pfn(ulong pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+
+ if (WARN_ON(!page))
+ return;
+ __free_page(page);
+}
+
+/**
+ * kho_return_mem - Notify the kernel that initially reserved memory is no
+ * longer needed. When the last consumer of a page returns their mem, kho
+ * returns the page to the buddy allocator as free page.
+ */
+void kho_return_mem(const struct kho_mem *mem)
+{
+ uint64_t start_pfn, end_pfn, pfn;
+
+ start_pfn = PFN_DOWN(mem->addr);
+ end_pfn = PFN_UP(mem->addr + mem->len);
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++)
+ kho_return_pfn(pfn);
+}
+EXPORT_SYMBOL_GPL(kho_return_mem);
+
+static void kho_claim_pfn(ulong pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+
+ WARN_ON(!page);
+ if (WARN_ON(page_count(page) != 1))
+ pr_err("Claimed non kho pfn %lx", pfn);
+}
+
+/**
+ * kho_claim_mem - Notify the kernel that a handed over memory range is now in
+ * use by a kernel subsystem and considered an allocated page. This function
+ * removes the reserved state for all pages that the mem spans.
+ */
+void *kho_claim_mem(const struct kho_mem *mem)
+{
+ u64 start_pfn, end_pfn, pfn;
+ void *va = __va(mem->addr);
+
+ start_pfn = PFN_DOWN(mem->addr);
+ end_pfn = PFN_UP(mem->addr + mem->len);
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++)
+ kho_claim_pfn(pfn);
+
+ return va;
+}
+EXPORT_SYMBOL_GPL(kho_claim_mem);
+
+/**
+ * kho_reserve_mem - Adds all memory reservations into memblocks
+ * and moves us out of the scratch only phase. Must be called after page tables
+ * are initialized and memblock_allow_resize().
+ */
+void __init kho_reserve_mem(void)
+{
+ void *mem_virt = __va(mem_phys);
+ int off, err;
+
+ if (!handover_phys || !mem_phys)
+ return;
+
+ /*
+ * We reached here because we are running inside a working linear map
+ * that allows us to resize memblocks dynamically. Use the chance and
+ * populate the global fdt pointer
+ */
+ fdt = __va(handover_phys);
+
+ off = fdt_path_offset(fdt, "/");
+ if (off < 0) {
+ fdt = NULL;
+ return;
+ }
+
+ err = fdt_node_check_compatible(fdt, off, "kho-v1");
+ if (err) {
+ pr_warn("KHO has invalid compatible, disabling.");
+ return;
+ }
+
+ /* Then populate all preserved memory areas as reserved */
+ for (off = 0; off < mem_len; off += sizeof(struct kho_mem)) {
+ struct kho_mem *mem = mem_virt + off;
+
+ memblock_reserve(mem->addr, mem->len);
+ }
+
+ /* Unreserve the mem cache - we don't need it from here on */
+ memblock_phys_free(mem_phys, mem_len);
+
+ /*
+ * Now we know about all memory reservations, release the scratch only
+ * constraint and allow normal allocations from the scratch region.
+ */
+ memblock_clear_scratch_only();
+}
+
+/* Handling for /sys/firmware/kho */
+static struct kobject *kho_kobj;
+
+static ssize_t raw_read(struct file *file, struct kobject *kobj,
+ struct bin_attribute *attr, char *buf,
+ loff_t pos, size_t count)
+{
+ memcpy(buf, attr->private + pos, count);
+ return count;
+}
+
+static BIN_ATTR(dt, 0400, raw_read, NULL, 0);
+
+static __init int kho_in_init(void)
+{
+ int ret = 0;
+
+ if (!fdt)
+ return 0;
+
+ kho_kobj = kobject_create_and_add("kho", firmware_kobj);
+ if (!kho_kobj) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ bin_attr_dt.size = fdt_totalsize(fdt);
+ bin_attr_dt.private = fdt;
+ ret = sysfs_create_bin_file(kho_kobj, &bin_attr_dt);
+ if (ret)
+ goto err;
+
+err:
+ return ret;
+}
+subsys_initcall(kho_in_init);
+
+void __init kho_populate(phys_addr_t handover_dt_phys, phys_addr_t scratch_phys,
+ u64 scratch_len, phys_addr_t mem_cache_phys,
+ u64 mem_cache_len)
+{
+ void *handover_dt;
+
+ /* Determine the real size of the DT */
+ handover_dt = early_memremap(handover_dt_phys, sizeof(struct fdt_header));
+ if (!handover_dt) {
+ pr_warn("setup: failed to memremap kexec FDT (0x%llx)\n", handover_dt_phys);
+ return;
+ }
+
+ if (fdt_check_header(handover_dt)) {
+ pr_warn("setup: kexec handover FDT is invalid (0x%llx)\n", handover_dt_phys);
+ early_memunmap(handover_dt, PAGE_SIZE);
+ return;
+ }
+
+ handover_len = fdt_totalsize(handover_dt);
+ handover_phys = handover_dt_phys;
+
+ /* Reserve the DT so we can still access it in late boot */
+ memblock_reserve(handover_phys, handover_len);
+
+ /* Reserve the mem cache so we can still access it later */
+ memblock_reserve(mem_cache_phys, mem_cache_len);
+
+ /*
+ * We pass a safe contiguous block of memory to use for early boot purporses from
+ * the previous kernel so that we can resize the memblock array as needed.
+ */
+ memblock_add(scratch_phys, scratch_len);
+
+ if (WARN_ON(memblock_mark_scratch(scratch_phys, scratch_len))) {
+ pr_err("Kexec failed to mark the scratch region. Disabling KHO.");
+ handover_len = 0;
+ handover_phys = 0;
+ return;
+ }
+ pr_debug("Marked 0x%lx+0x%lx as scratch", (long)scratch_phys, (long)scratch_len);
+
+ /*
+ * Now that we have a viable region of scratch memory, let's tell the memblocks
+ * allocator to only use that for any allocations. That way we ensure that nothing
+ * scribbles over in use data while we initialize the page tables which we will need
+ * to ingest all memory reservations from the previous kernel.
+ */
+ memblock_set_scratch_only();
+
+ early_memunmap(handover_dt, sizeof(struct fdt_header));
+
+ /* Remember the mem cache location for kho_reserve_mem() */
+ mem_len = mem_cache_len;
+ mem_phys = mem_cache_phys;
+
+ /* Remember the scratch block - we will reuse it again for the next kexec */
+ kho_scratch_phys = scratch_phys;
+ kho_scratch_len = scratch_len;
+
+ pr_info("setup: Found kexec handover data. Will skip init for some devices\n");
+}
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Kexec has 2 modes: A user space driven mode and a kernel driven mode.
For the kernel driven mode, kernel code determines the physical
addresses of all target buffers that the payload gets copied into.
With KHO, we can only safely copy payloads into the "scratch area".
Teach the kexec file loader about it, so it only allocates for that
area. In addition, enlighten it with support to ask the KHO subsystem
for its respective payloads to copy into target memory. Also teach the
KHO subsystem how to fill the images for file loads.
Signed-off-by: Alexander Graf <[email protected]>
---
include/linux/kexec.h | 9 ++
kernel/kexec_file.c | 41 ++++++++
kernel/kexec_kho_out.c | 210 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 260 insertions(+)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index a3c4fee6f86a..c8859a2ca872 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -362,6 +362,13 @@ struct kimage {
size_t ima_buffer_size;
#endif
+#ifdef CONFIG_KEXEC_KHO
+ struct {
+ struct kexec_buf dt;
+ struct kexec_buf mem_cache;
+ } kho;
+#endif
+
/* Core ELF header buffer */
void *elf_headers;
unsigned long elf_headers_sz;
@@ -543,6 +550,7 @@ static inline bool is_kho_boot(void)
/* egest handover metadata */
void kho_reserve(void);
+int kho_fill_kimage(struct kimage *image);
int register_kho_notifier(struct notifier_block *nb);
int unregister_kho_notifier(struct notifier_block *nb);
bool kho_is_active(void);
@@ -558,6 +566,7 @@ static inline void *kho_get_fdt(void) { return NULL; }
/* egest handover metadata */
static inline void kho_reserve(void) { }
+static inline int kho_fill_kimage(struct kimage *image) { return 0; }
static inline int register_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
static inline int unregister_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
static inline bool kho_is_active(void) { return false; }
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index f9a419cd22d4..d895d0a49bd9 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -113,6 +113,13 @@ void kimage_file_post_load_cleanup(struct kimage *image)
image->ima_buffer = NULL;
#endif /* CONFIG_IMA_KEXEC */
+#ifdef CONFIG_KEXEC_KHO
+ kvfree(image->kho.mem_cache.buffer);
+ image->kho.mem_cache = (struct kexec_buf) {};
+ kvfree(image->kho.dt.buffer);
+ image->kho.dt = (struct kexec_buf) {};
+#endif
+
/* See if architecture has anything to cleanup post load */
arch_kimage_file_post_load_cleanup(image);
@@ -249,6 +256,11 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
/* IMA needs to pass the measurement list to the next kernel. */
ima_add_kexec_buffer(image);
+ /* If KHO is active, add its images to the list */
+ ret = kho_fill_kimage(image);
+ if (ret)
+ goto out;
+
/* Call image load handler */
ldata = kexec_image_load_default(image);
@@ -518,6 +530,24 @@ static int locate_mem_hole_callback(struct resource *res, void *arg)
return locate_mem_hole_bottom_up(start, end, kbuf);
}
+#ifdef CONFIG_KEXEC_KHO
+static int kexec_walk_kho_scratch(struct kexec_buf *kbuf,
+ int (*func)(struct resource *, void *))
+{
+ int ret = 0;
+
+ struct resource res = {
+ .start = kho_scratch_phys,
+ .end = kho_scratch_phys + kho_scratch_len,
+ };
+
+ /* Try to fit the kimage into our KHO scratch region */
+ ret = func(&res, kbuf);
+
+ return ret;
+}
+#endif
+
#ifdef CONFIG_ARCH_KEEP_MEMBLOCK
static int kexec_walk_memblock(struct kexec_buf *kbuf,
int (*func)(struct resource *, void *))
@@ -612,6 +642,17 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN)
return 0;
+#ifdef CONFIG_KEXEC_KHO
+ /*
+ * If KHO is active, only use KHO scratch memory. All other memory
+ * could potentially be handed over.
+ */
+ if (kho_is_active() && kbuf->image->type != KEXEC_TYPE_CRASH) {
+ ret = kexec_walk_kho_scratch(kbuf, locate_mem_hole_callback);
+ return ret == 1 ? 0 : -EADDRNOTAVAIL;
+ }
+#endif
+
if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
ret = kexec_walk_resources(kbuf, locate_mem_hole_callback);
else
diff --git a/kernel/kexec_kho_out.c b/kernel/kexec_kho_out.c
index e6184bde5c10..24ced6c3013f 100644
--- a/kernel/kexec_kho_out.c
+++ b/kernel/kexec_kho_out.c
@@ -50,6 +50,216 @@ int unregister_kho_notifier(struct notifier_block *nb)
}
EXPORT_SYMBOL_GPL(unregister_kho_notifier);
+static int kho_mem_cache_add(void *fdt, struct kho_mem *mem_cache, int size,
+ struct kho_mem *new_mem)
+{
+ int entries = size / sizeof(*mem_cache);
+ u64 new_start = new_mem->addr;
+ u64 new_end = new_mem->addr + new_mem->len;
+ u64 prev_start = 0;
+ u64 prev_end = 0;
+ int i;
+
+ if (WARN_ON((new_start < (kho_scratch_phys + kho_scratch_len)) &&
+ (new_end > kho_scratch_phys))) {
+ pr_err("KHO memory runs over scratch memory");
+ return -EINVAL;
+ }
+
+ /*
+ * We walk the existing sorted mem cache and find the spot where this
+ * new entry would start, so we can insert it right there.
+ */
+ for (i = 0; i < entries; i++) {
+ struct kho_mem *mem = &mem_cache[i];
+ u64 mem_end = (mem->addr + mem->len);
+
+ if (mem_end < new_start) {
+ /* No overlap */
+ prev_start = mem->addr;
+ prev_end = mem->addr + mem->len;
+ continue;
+ } else if ((new_start >= mem->addr) && (new_end <= mem_end)) {
+ /* new_mem fits into mem, skip */
+ return size;
+ } else if ((new_end >= mem->addr) && (new_start <= mem_end)) {
+ /* new_mem and mem overlap, fold them */
+ bool remove = false;
+
+ mem->addr = min(new_start, mem->addr);
+ mem->len = max(mem_end, new_end) - mem->addr;
+ mem_end = (mem->addr + mem->len);
+
+ if (i > 0 && prev_end >= mem->addr) {
+ /* We now overlap with the previous mem, fold */
+ struct kho_mem *prev = &mem_cache[i - 1];
+
+ prev->addr = min(prev->addr, mem->addr);
+ prev->len = max(mem_end, prev_end) - prev->addr;
+ remove = true;
+ } else if (i < (entries - 1) && mem_end >= mem_cache[i + 1].addr) {
+ /* We now overlap with the next mem, fold */
+ struct kho_mem *next = &mem_cache[i + 1];
+ u64 next_end = (next->addr + next->len);
+
+ next->addr = min(next->addr, mem->addr);
+ next->len = max(mem_end, next_end) - next->addr;
+ remove = true;
+ }
+
+ if (remove) {
+ /* We folded this mem into another, remove it */
+ memmove(mem, mem + 1, (entries - i - 1) * sizeof(*mem));
+ size -= sizeof(*new_mem);
+ }
+
+ return size;
+ } else if (mem->addr > new_end) {
+ /*
+ * The mem cache is sorted. If we find the current
+ * entry start after our new_mem's end, we shot over
+ * which means we need to add it by creating a new
+ * hole right after the current entry.
+ */
+ memmove(mem + 1, mem, (entries - i) * sizeof(*mem));
+ break;
+ }
+ }
+
+ mem_cache[i] = *new_mem;
+ size += sizeof(*new_mem);
+
+ return size;
+}
+
+/**
+ * kho_alloc_mem_cache - Allocate and initialize the mem cache kexec_buf
+ */
+static int kho_alloc_mem_cache(struct kimage *image, void *fdt)
+{
+ int offset, depth, initial_depth, len;
+ void *mem_cache;
+ int size;
+
+ /* Count the elements inside all "mem" properties in the DT */
+ size = offset = depth = initial_depth = 0;
+ for (offset = 0;
+ offset >= 0 && depth >= initial_depth;
+ offset = fdt_next_node(fdt, offset, &depth)) {
+ const struct kho_mem *mems;
+
+ mems = fdt_getprop(fdt, offset, "mem", &len);
+ if (!mems || len & (sizeof(*mems) - 1))
+ continue;
+ size += len;
+ }
+
+ /* Allocate based on the max size we determined */
+ mem_cache = kvmalloc(size, GFP_KERNEL);
+ if (!mem_cache)
+ return -ENOMEM;
+
+ /* And populate the array */
+ size = offset = depth = initial_depth = 0;
+ for (offset = 0;
+ offset >= 0 && depth >= initial_depth;
+ offset = fdt_next_node(fdt, offset, &depth)) {
+ const struct kho_mem *mems;
+ int nr_mems, i;
+
+ mems = fdt_getprop(fdt, offset, "mem", &len);
+ if (!mems || len & (sizeof(*mems) - 1))
+ continue;
+
+ for (i = 0, nr_mems = len / sizeof(*mems); i < nr_mems; i++) {
+ const struct kho_mem *mem = &mems[i];
+ ulong mstart = PAGE_ALIGN_DOWN(mem->addr);
+ ulong mend = PAGE_ALIGN(mem->addr + mem->len);
+ struct kho_mem cmem = {
+ .addr = mstart,
+ .len = (mend - mstart),
+ };
+
+ size = kho_mem_cache_add(fdt, mem_cache, size, &cmem);
+ if (size < 0)
+ return size;
+ }
+ }
+
+ image->kho.mem_cache.buffer = mem_cache;
+ image->kho.mem_cache.bufsz = size;
+ image->kho.mem_cache.memsz = size;
+
+ return 0;
+}
+
+int kho_fill_kimage(struct kimage *image)
+{
+ int err = 0;
+ void *dt;
+
+ mutex_lock(&kho.lock);
+
+ if (!kho.active)
+ goto out;
+
+ /* Initialize kexec_buf for mem_cache */
+ image->kho.mem_cache = (struct kexec_buf) {
+ .image = image,
+ .buffer = NULL,
+ .bufsz = 0,
+ .mem = KEXEC_BUF_MEM_UNKNOWN,
+ .memsz = 0,
+ .buf_align = SZ_64K, /* Makes it easier to map */
+ .buf_max = ULONG_MAX,
+ .top_down = true,
+ };
+
+ /*
+ * We need to make all allocations visible here via the mem_cache so that
+ * kho_is_destination_range() can identify overlapping regions and ensure
+ * that no kimage (including the DT one) lands on handed over memory.
+ *
+ * Since we conveniently already built an array of all allocations, let's
+ * pass that on to the target kernel so that reuse it to initialize its
+ * memory blocks.
+ */
+ err = kho_alloc_mem_cache(image, kho.dt);
+ if (err)
+ goto out;
+
+ err = kexec_add_buffer(&image->kho.mem_cache);
+ if (err)
+ goto out;
+
+ /*
+ * Create a kexec copy of the DT here. We need this because lifetime may
+ * be different between kho.dt and the kimage
+ */
+ dt = kvmemdup(kho.dt, kho.dt_len, GFP_KERNEL);
+ if (!dt) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ /* Allocate target memory for kho dt */
+ image->kho.dt = (struct kexec_buf) {
+ .image = image,
+ .buffer = dt,
+ .bufsz = kho.dt_len,
+ .mem = KEXEC_BUF_MEM_UNKNOWN,
+ .memsz = kho.dt_len,
+ .buf_align = SZ_64K, /* Makes it easier to map */
+ .buf_max = ULONG_MAX,
+ .top_down = true,
+ };
+ err = kexec_add_buffer(&image->kho.dt);
+
+out:
+ mutex_unlock(&kho.lock);
+ return err;
+}
+
bool kho_is_active(void)
{
return kho.active;
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
With KHO (Kexec HandOver), we want to preserve trace buffers across
kexec. To carry over their state between kernels, the kernel needs a
common handle for them that exists on both sides. As handle we introduce
names for ring buffers. In a follow-up patch, the kernel can then use
these names to recover buffer contents for specific ring buffers.
Signed-off-by: Alexander Graf <[email protected]>
---
include/linux/ring_buffer.h | 7 ++++---
kernel/trace/ring_buffer.c | 5 ++++-
kernel/trace/trace.c | 7 ++++---
3 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 782e14f62201..f34538f97c75 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -85,17 +85,18 @@ void ring_buffer_discard_commit(struct trace_buffer *buffer,
* size is in bytes for each per CPU buffer.
*/
struct trace_buffer *
-__ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *key);
+__ring_buffer_alloc(const char *name, unsigned long size, unsigned flags,
+ struct lock_class_key *key);
/*
* Because the ring buffer is generic, if other users of the ring buffer get
* traced by ftrace, it can produce lockdep warnings. We need to keep each
* ring buffer's lock class separate.
*/
-#define ring_buffer_alloc(size, flags) \
+#define ring_buffer_alloc(name, size, flags) \
({ \
static struct lock_class_key __key; \
- __ring_buffer_alloc((size), (flags), &__key); \
+ __ring_buffer_alloc((name), (size), (flags), &__key); \
})
int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 43cc47d7faaf..eaaf823ddedb 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -557,6 +557,7 @@ struct trace_buffer {
struct rb_irq_work irq_work;
bool time_stamp_abs;
+ const char *name;
};
struct ring_buffer_iter {
@@ -1801,7 +1802,8 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
* when the buffer wraps. If this flag is not set, the buffer will
* drop data when the tail hits the head.
*/
-struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
+struct trace_buffer *__ring_buffer_alloc(const char *name,
+ unsigned long size, unsigned flags,
struct lock_class_key *key)
{
struct trace_buffer *buffer;
@@ -1823,6 +1825,7 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
buffer->flags = flags;
buffer->clock = trace_clock_local;
buffer->reader_lock_key = key;
+ buffer->name = name;
init_irq_work(&buffer->irq_work.work, rb_wake_up_waiters);
init_waitqueue_head(&buffer->irq_work.waiters);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 9aebf904ff97..7700ca1be2a5 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9384,7 +9384,8 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size
buf->tr = tr;
- buf->buffer = ring_buffer_alloc(size, rb_flags);
+ buf->buffer = ring_buffer_alloc(tr->name ? tr->name : "global_trace",
+ size, rb_flags);
if (!buf->buffer)
return -ENOMEM;
@@ -9421,7 +9422,7 @@ static int allocate_trace_buffers(struct trace_array *tr, int size)
return ret;
#ifdef CONFIG_TRACER_MAX_TRACE
- ret = allocate_trace_buffer(tr, &tr->max_buffer,
+ ret = allocate_trace_buffer(NULL, &tr->max_buffer,
allocate_snapshot ? size : 1);
if (MEM_FAIL(ret, "Failed to allocate trace buffer\n")) {
free_trace_buffer(&tr->array_buffer);
@@ -10473,7 +10474,7 @@ __init static int tracer_alloc_buffers(void)
goto out_free_cpumask;
/* Used for event triggers */
ret = -ENOMEM;
- temp_buffer = ring_buffer_alloc(PAGE_SIZE, RB_FL_OVERWRITE);
+ temp_buffer = ring_buffer_alloc("temp_buffer", PAGE_SIZE, RB_FL_OVERWRITE);
if (!temp_buffer)
goto out_rm_hp_state;
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
We now have all bits in place to support KHO kexecs. This patch adds
awareness of KHO in the kexec file as well as boot path for x86 and
adds the respective kconfig option to the architecture so that it can
use KHO successfully.
In addition, it enlightens it decompression code with KHO so that its
KASLR location finder only considers memory regions that are not already
occupied by KHO memory.
Signed-off-by: Alexander Graf <[email protected]>
---
arch/x86/Kconfig | 12 ++++++
arch/x86/boot/compressed/kaslr.c | 55 +++++++++++++++++++++++++++
arch/x86/include/uapi/asm/bootparam.h | 15 +++++++-
arch/x86/kernel/e820.c | 9 +++++
arch/x86/kernel/kexec-bzimage64.c | 39 +++++++++++++++++++
arch/x86/kernel/setup.c | 46 ++++++++++++++++++++++
arch/x86/mm/init_32.c | 7 ++++
arch/x86/mm/init_64.c | 7 ++++
8 files changed, 189 insertions(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3762f41bb092..849e6ddc5d94 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2103,6 +2103,18 @@ config ARCH_SUPPORTS_CRASH_HOTPLUG
config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
def_bool CRASH_CORE
+config KEXEC_KHO
+ bool "kexec handover"
+ depends on KEXEC
+ select MEMBLOCK_SCRATCH
+ select LIBFDT
+ select CMA
+ help
+ Allow kexec to hand over state across kernels by generating and
+ passing additional metadata to the target kernel. This is useful
+ to keep data or state alive across the kexec. For this to work,
+ both source and target kernels need to have this option enabled.
+
config PHYSICAL_START
hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
default "0x1000000"
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index dec961c6d16a..93ea292e4c18 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -29,6 +29,7 @@
#include <linux/uts.h>
#include <linux/utsname.h>
#include <linux/ctype.h>
+#include <uapi/linux/kexec.h>
#include <generated/utsversion.h>
#include <generated/utsrelease.h>
@@ -472,6 +473,60 @@ static bool mem_avoid_overlap(struct mem_vector *img,
}
}
+#ifdef CONFIG_KEXEC_KHO
+ if (ptr->type == SETUP_KEXEC_KHO) {
+ struct kho_data *kho = (struct kho_data *)ptr->data;
+ struct kho_mem *mems = (void *)kho->mem_cache_addr;
+ int nr_mems = kho->mem_cache_size / sizeof(*mems);
+ int i;
+
+ /* Avoid the mem cache */
+ avoid = (struct mem_vector) {
+ .start = kho->mem_cache_addr,
+ .size = kho->mem_cache_size,
+ };
+
+ if (mem_overlaps(img, &avoid) && (avoid.start < earliest)) {
+ *overlap = avoid;
+ earliest = overlap->start;
+ is_overlapping = true;
+ }
+
+ /* And the KHO DT */
+ avoid = (struct mem_vector) {
+ .start = kho->dt_addr,
+ .size = kho->dt_size,
+ };
+
+ if (mem_overlaps(img, &avoid) && (avoid.start < earliest)) {
+ *overlap = avoid;
+ earliest = overlap->start;
+ is_overlapping = true;
+ }
+
+ /* As well as any other KHO memory reservations */
+ for (i = 0; i < nr_mems; i++) {
+ avoid = (struct mem_vector) {
+ .start = mems[i].addr,
+ .size = mems[i].len,
+ };
+
+ /*
+ * This mem starts after our current break.
+ * The array is sorted, so we're done.
+ */
+ if (avoid.start >= earliest)
+ break;
+
+ if (mem_overlaps(img, &avoid)) {
+ *overlap = avoid;
+ earliest = overlap->start;
+ is_overlapping = true;
+ }
+ }
+ }
+#endif
+
ptr = (struct setup_data *)(unsigned long)ptr->next;
}
diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index 01d19fc22346..013af38a9673 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -13,7 +13,8 @@
#define SETUP_CC_BLOB 7
#define SETUP_IMA 8
#define SETUP_RNG_SEED 9
-#define SETUP_ENUM_MAX SETUP_RNG_SEED
+#define SETUP_KEXEC_KHO 10
+#define SETUP_ENUM_MAX SETUP_KEXEC_KHO
#define SETUP_INDIRECT (1<<31)
#define SETUP_TYPE_MAX (SETUP_ENUM_MAX | SETUP_INDIRECT)
@@ -181,6 +182,18 @@ struct ima_setup_data {
__u64 size;
} __attribute__((packed));
+/*
+ * Locations of kexec handover metadata
+ */
+struct kho_data {
+ __u64 dt_addr;
+ __u64 dt_size;
+ __u64 scratch_addr;
+ __u64 scratch_size;
+ __u64 mem_cache_addr;
+ __u64 mem_cache_size;
+} __attribute__((packed));
+
/* The so-called "zeropage" */
struct boot_params {
struct screen_info screen_info; /* 0x000 */
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index fb8cf953380d..c891b83f5b1c 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1341,6 +1341,15 @@ void __init e820__memblock_setup(void)
continue;
memblock_add(entry->addr, entry->size);
+
+ /*
+ * At this point with KHO we only allocate from scratch memory
+ * and only from memory below ISA_END_ADDRESS. Make sure that
+ * when we add memory for the eligible range, we add it as
+ * scratch memory so that we can resize the memblocks array.
+ */
+ if (is_kho_boot() && (end <= ISA_END_ADDRESS))
+ memblock_mark_scratch(entry->addr, end);
}
/* Throw away partial pages: */
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index a61c12c01270..0cb8d0650a02 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -15,6 +15,7 @@
#include <linux/slab.h>
#include <linux/kexec.h>
#include <linux/kernel.h>
+#include <linux/libfdt.h>
#include <linux/mm.h>
#include <linux/efi.h>
#include <linux/random.h>
@@ -233,6 +234,33 @@ setup_ima_state(const struct kimage *image, struct boot_params *params,
#endif /* CONFIG_IMA_KEXEC */
}
+static void setup_kho(const struct kimage *image, struct boot_params *params,
+ unsigned long params_load_addr,
+ unsigned int setup_data_offset)
+{
+#ifdef CONFIG_KEXEC_KHO
+ struct setup_data *sd = (void *)params + setup_data_offset;
+ struct kho_data *kho = (void *)sd + sizeof(*sd);
+
+ sd->type = SETUP_KEXEC_KHO;
+ sd->len = sizeof(struct kho_data);
+
+ /* Only add if we have all KHO images in place */
+ if (!image->kho.dt.buffer || !image->kho.mem_cache.buffer)
+ return;
+
+ /* Add setup data */
+ kho->dt_addr = image->kho.dt.mem;
+ kho->dt_size = image->kho.dt.bufsz;
+ kho->scratch_addr = kho_scratch_phys;
+ kho->scratch_size = kho_scratch_len;
+ kho->mem_cache_addr = image->kho.mem_cache.mem;
+ kho->mem_cache_size = image->kho.mem_cache.bufsz;
+ sd->next = params->hdr.setup_data;
+ params->hdr.setup_data = params_load_addr + setup_data_offset;
+#endif /* CONFIG_KEXEC_KHO */
+}
+
static int
setup_boot_parameters(struct kimage *image, struct boot_params *params,
unsigned long params_load_addr,
@@ -305,6 +333,13 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
sizeof(struct ima_setup_data);
}
+ if (IS_ENABLED(CONFIG_KEXEC_KHO)) {
+ /* Setup space to store preservation metadata */
+ setup_kho(image, params, params_load_addr, setup_data_offset);
+ setup_data_offset += sizeof(struct setup_data) +
+ sizeof(struct kho_data);
+ }
+
/* Setup RNG seed */
setup_rng_seed(params, params_load_addr, setup_data_offset);
@@ -470,6 +505,10 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
kbuf.bufsz += sizeof(struct setup_data) +
sizeof(struct ima_setup_data);
+ if (IS_ENABLED(CONFIG_KEXEC_KHO))
+ kbuf.bufsz += sizeof(struct setup_data) +
+ sizeof(struct kho_data);
+
params = kzalloc(kbuf.bufsz, GFP_KERNEL);
if (!params)
return ERR_PTR(-ENOMEM);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 1526747bedf2..196414c9c9e6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -382,6 +382,29 @@ int __init ima_get_kexec_buffer(void **addr, size_t *size)
}
#endif
+static void __init add_kho(u64 phys_addr, u32 data_len)
+{
+#ifdef CONFIG_KEXEC_KHO
+ struct kho_data *kho;
+ u64 addr = phys_addr + sizeof(struct setup_data);
+ u64 size = data_len - sizeof(struct setup_data);
+
+ kho = early_memremap(addr, size);
+ if (!kho) {
+ pr_warn("setup: failed to memremap kho data (0x%llx, 0x%llx)\n",
+ addr, size);
+ return;
+ }
+
+ kho_populate(kho->dt_addr, kho->scratch_addr, kho->scratch_size,
+ kho->mem_cache_addr, kho->mem_cache_size);
+
+ early_memunmap(kho, size);
+#else
+ pr_warn("Passed KHO data, but CONFIG_KEXEC_KHO not set. Ignoring.\n");
+#endif
+}
+
static void __init parse_setup_data(void)
{
struct setup_data *data;
@@ -410,6 +433,9 @@ static void __init parse_setup_data(void)
case SETUP_IMA:
add_early_ima_buffer(pa_data);
break;
+ case SETUP_KEXEC_KHO:
+ add_kho(pa_data, data_len);
+ break;
case SETUP_RNG_SEED:
data = early_memremap(pa_data, data_len);
add_bootloader_randomness(data->data, data->len);
@@ -989,8 +1015,26 @@ void __init setup_arch(char **cmdline_p)
cleanup_highmap();
memblock_set_current_limit(ISA_END_ADDRESS);
+
e820__memblock_setup();
+ /*
+ * We can resize memblocks at this point, let's dump all KHO
+ * reservations in and switch from scratch-only to normal allocations
+ */
+ kho_reserve_mem();
+
+ /* Allocations now skip scratch mem, return low 1M to the pool */
+ if (is_kho_boot()) {
+ u64 i;
+ phys_addr_t base, end;
+
+ __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
+ MEMBLOCK_SCRATCH, &base, &end, NULL)
+ if (end <= ISA_END_ADDRESS)
+ memblock_clear_scratch(base, end - base);
+ }
+
/*
* Needs to run after memblock setup because it needs the physical
* memory size.
@@ -1106,6 +1150,8 @@ void __init setup_arch(char **cmdline_p)
*/
arch_reserve_crashkernel();
+ kho_reserve();
+
memblock_find_dma_reserve();
if (!early_xdbc_setup_hardware())
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index b63403d7179d..6c3810afed04 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -20,6 +20,7 @@
#include <linux/smp.h>
#include <linux/init.h>
#include <linux/highmem.h>
+#include <linux/kexec.h>
#include <linux/pagemap.h>
#include <linux/pci.h>
#include <linux/pfn.h>
@@ -738,6 +739,12 @@ void __init mem_init(void)
after_bootmem = 1;
x86_init.hyper.init_after_bootmem();
+ /*
+ * Now that all KHO pages are marked as reserved, let's flip them back
+ * to normal pages with accurate refcount.
+ */
+ kho_populate_refcount();
+
/*
* Check boundaries twice: Some fundamental inconsistencies can
* be detected at build time already.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a190aae8ceaf..3ce1a4767610 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -20,6 +20,7 @@
#include <linux/smp.h>
#include <linux/init.h>
#include <linux/initrd.h>
+#include <linux/kexec.h>
#include <linux/pagemap.h>
#include <linux/memblock.h>
#include <linux/proc_fs.h>
@@ -1339,6 +1340,12 @@ void __init mem_init(void)
after_bootmem = 1;
x86_init.hyper.init_after_bootmem();
+ /*
+ * Now that all KHO pages are marked as reserved, let's flip them back
+ * to normal pages with accurate refcount.
+ */
+ kho_populate_refcount();
+
/*
* Must be done after boot memory is put on freelist, because here we
* might set fields in deferred struct pages that have not yet been
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
We want to be able to transfer ftrace state from one kernel to the next.
To start off with, let's establish all the boiler plate to get a write
hook when KHO wants to serialize and fill out basic data.
Follow-up patches will fill in serialization of ring buffers and events.
Signed-off-by: Alexander Graf <[email protected]>
---
kernel/trace/trace.c | 52 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 7700ca1be2a5..3e7f61cf773e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -32,6 +32,7 @@
#include <linux/percpu.h>
#include <linux/splice.h>
#include <linux/kdebug.h>
+#include <linux/kexec.h>
#include <linux/string.h>
#include <linux/mount.h>
#include <linux/rwsem.h>
@@ -866,6 +867,10 @@ static struct tracer *trace_types __read_mostly;
*/
DEFINE_MUTEX(trace_types_lock);
+#ifdef CONFIG_FTRACE_KHO
+static bool trace_in_kho;
+#endif
+
/*
* serialize the access of the ring buffer
*
@@ -10591,12 +10596,59 @@ void __init early_trace_init(void)
init_events();
}
+#ifdef CONFIG_FTRACE_KHO
+static int trace_kho_notifier(struct notifier_block *self,
+ unsigned long cmd,
+ void *v)
+{
+ const char compatible[] = "ftrace-v1";
+ void *fdt = v;
+ int err = 0;
+
+ switch (cmd) {
+ case KEXEC_KHO_ABORT:
+ if (trace_in_kho)
+ mutex_unlock(&trace_types_lock);
+ trace_in_kho = false;
+ return NOTIFY_DONE;
+ case KEXEC_KHO_DUMP:
+ /* Handled below */
+ break;
+ default:
+ return NOTIFY_BAD;
+ }
+
+ if (unlikely(tracing_disabled))
+ return NOTIFY_DONE;
+
+ err |= fdt_begin_node(fdt, "ftrace");
+ err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+ err |= fdt_end_node(fdt);
+
+ if (!err) {
+ /* Hold all future allocations */
+ mutex_lock(&trace_types_lock);
+ trace_in_kho = true;
+ }
+
+ return err ? NOTIFY_BAD : NOTIFY_DONE;
+}
+
+static struct notifier_block trace_kho_nb = {
+ .notifier_call = trace_kho_notifier,
+};
+#endif
+
void __init trace_init(void)
{
trace_event_init();
if (boot_instance_index)
enable_instances();
+
+#ifdef CONFIG_FTRACE_KHO
+ register_kho_notifier(&trace_kho_nb);
+#endif
}
__init static void clear_boot_tracer(void)
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
We now have all bits in place to support KHO kexecs. This patch adds
awareness of KHO in the kexec file as well as boot path for arm64 and
adds the respective kconfig option to the architecture so that it can
use KHO successfully.
Signed-off-by: Alexander Graf <[email protected]>
---
arch/arm64/Kconfig | 12 ++++++++++++
arch/arm64/kernel/setup.c | 2 ++
arch/arm64/mm/init.c | 8 ++++++++
drivers/of/fdt.c | 41 +++++++++++++++++++++++++++++++++++++++
drivers/of/kexec.c | 36 ++++++++++++++++++++++++++++++++++
5 files changed, 99 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7b071a00425d..1ba338ce7598 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1501,6 +1501,18 @@ config ARCH_SUPPORTS_CRASH_DUMP
config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
def_bool CRASH_CORE
+config KEXEC_KHO
+ bool "kexec handover"
+ depends on KEXEC
+ select MEMBLOCK_SCRATCH
+ select LIBFDT
+ select CMA
+ help
+ Allow kexec to hand over state across kernels by generating and
+ passing additional metadata to the target kernel. This is useful
+ to keep data or state alive across the kexec. For this to work,
+ both source and target kernels need to have this option enabled.
+
config TRANS_TABLE
def_bool y
depends on HIBERNATION || KEXEC_CORE
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 417a8a86b2db..8035b673d96d 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -346,6 +346,8 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
paging_init();
+ kho_reserve_mem();
+
acpi_table_upgrade();
/* Parse the ACPI tables for possible boot-time configuration */
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 74c1db8ce271..254d82f3383a 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -358,6 +358,8 @@ void __init bootmem_init(void)
*/
arch_reserve_crashkernel();
+ kho_reserve();
+
memblock_dump_all();
}
@@ -386,6 +388,12 @@ void __init mem_init(void)
/* this will put all unused low memory onto the freelists */
memblock_free_all();
+ /*
+ * Now that all KHO pages are marked as reserved, let's flip them back
+ * to normal pages with accurate refcount.
+ */
+ kho_populate_refcount();
+
/*
* Check boundaries twice: Some fundamental inconsistencies can be
* detected at build time already.
diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index bf502ba8da95..af95139351ed 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -1006,6 +1006,44 @@ void __init early_init_dt_check_for_usable_mem_range(void)
memblock_add(rgn[i].base, rgn[i].size);
}
+/**
+ * early_init_dt_check_kho - Decode info required for kexec handover from DT
+ */
+void __init early_init_dt_check_kho(void)
+{
+#ifdef CONFIG_KEXEC_KHO
+ unsigned long node = chosen_node_offset;
+ u64 kho_start, scratch_start, scratch_size, mem_start, mem_size;
+ const __be32 *p;
+ int l;
+
+ if ((long)node < 0)
+ return;
+
+ p = of_get_flat_dt_prop(node, "linux,kho-dt", &l);
+ if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
+ return;
+
+ kho_start = dt_mem_next_cell(dt_root_addr_cells, &p);
+
+ p = of_get_flat_dt_prop(node, "linux,kho-scratch", &l);
+ if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
+ return;
+
+ scratch_start = dt_mem_next_cell(dt_root_addr_cells, &p);
+ scratch_size = dt_mem_next_cell(dt_root_addr_cells, &p);
+
+ p = of_get_flat_dt_prop(node, "linux,kho-mem", &l);
+ if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
+ return;
+
+ mem_start = dt_mem_next_cell(dt_root_addr_cells, &p);
+ mem_size = dt_mem_next_cell(dt_root_addr_cells, &p);
+
+ kho_populate(kho_start, scratch_start, scratch_size, mem_start, mem_size);
+#endif
+}
+
#ifdef CONFIG_SERIAL_EARLYCON
int __init early_init_dt_scan_chosen_stdout(void)
@@ -1304,6 +1342,9 @@ void __init early_init_dt_scan_nodes(void)
/* Handle linux,usable-memory-range property */
early_init_dt_check_for_usable_mem_range();
+
+ /* Handle kexec handover */
+ early_init_dt_check_kho();
}
bool __init early_init_dt_scan(void *params)
diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index 68278340cecf..a612e6bb8c75 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -264,6 +264,37 @@ static inline int setup_ima_buffer(const struct kimage *image, void *fdt,
}
#endif /* CONFIG_IMA_KEXEC */
+static int kho_add_chosen(const struct kimage *image, void *fdt, int chosen_node)
+{
+ int ret = 0;
+
+#ifdef CONFIG_KEXEC_KHO
+ if (!image->kho.dt.buffer || !image->kho.mem_cache.buffer)
+ goto out;
+
+ pr_debug("Adding kho metadata to DT");
+
+ ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-dt",
+ image->kho.dt.mem, image->kho.dt.memsz);
+ if (ret)
+ goto out;
+
+ ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-scratch",
+ kho_scratch_phys, kho_scratch_len);
+ if (ret)
+ goto out;
+
+ ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-mem",
+ image->kho.mem_cache.mem,
+ image->kho.mem_cache.bufsz);
+ if (ret)
+ goto out;
+
+out:
+#endif
+ return ret;
+}
+
/*
* of_kexec_alloc_and_setup_fdt - Alloc and setup a new Flattened Device Tree
*
@@ -412,6 +443,11 @@ void *of_kexec_alloc_and_setup_fdt(const struct kimage *image,
}
}
+ /* Add kho metadata if this is a KHO image */
+ ret = kho_add_chosen(image, fdt, chosen_node);
+ if (ret)
+ goto out;
+
/* add bootargs */
if (cmdline) {
ret = fdt_setprop_string(fdt, chosen_node, "bootargs", cmdline);
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
With KHO (Kexec HandOver), we want to preserve trace buffers. To parse
them, we need to ensure that all trace events that exist in the logs are
identical to the ones we parse as. That means we need to match the
events before and after kexec.
As a first step towards that, let's give every event a unique name. That
way we can clearly identify the event before and after kexec and restore
its ID post-kexec.
Signed-off-by: Alexander Graf <[email protected]>
---
include/linux/trace_events.h | 1 +
include/trace/trace_events.h | 2 ++
kernel/trace/blktrace.c | 1 +
kernel/trace/trace_branch.c | 1 +
kernel/trace/trace_events.c | 3 +++
kernel/trace/trace_functions_graph.c | 4 +++-
kernel/trace/trace_output.c | 13 +++++++++++++
kernel/trace/trace_probe.c | 3 +++
kernel/trace/trace_syscalls.c | 29 ++++++++++++++++++++++++++++
9 files changed, 56 insertions(+), 1 deletion(-)
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f..7670224aa92d 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -149,6 +149,7 @@ struct trace_event {
struct hlist_node node;
int type;
struct trace_event_functions *funcs;
+ const char *name;
};
extern int register_trace_event(struct trace_event *event);
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index c2f9cabf154d..bb4e6a33eef9 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -443,6 +443,7 @@ static struct trace_event_call __used event_##call = { \
.tp = &__tracepoint_##call, \
}, \
.event.funcs = &trace_event_type_funcs_##template, \
+ .event.name = __stringify(call), \
.print_fmt = print_fmt_##template, \
.flags = TRACE_EVENT_FL_TRACEPOINT, \
}; \
@@ -460,6 +461,7 @@ static struct trace_event_call __used event_##call = { \
.tp = &__tracepoint_##call, \
}, \
.event.funcs = &trace_event_type_funcs_##call, \
+ .event.name = __stringify(template), \
.print_fmt = print_fmt_##call, \
.flags = TRACE_EVENT_FL_TRACEPOINT, \
}; \
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index d5d94510afd3..7f86fd41b38e 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -1584,6 +1584,7 @@ static struct trace_event_functions trace_blk_event_funcs = {
static struct trace_event trace_blk_event = {
.type = TRACE_BLK,
.funcs = &trace_blk_event_funcs,
+ .name = "blk",
};
static int __init init_blk_tracer(void)
diff --git a/kernel/trace/trace_branch.c b/kernel/trace/trace_branch.c
index e47fdb4c92fb..3372070f2e85 100644
--- a/kernel/trace/trace_branch.c
+++ b/kernel/trace/trace_branch.c
@@ -168,6 +168,7 @@ static struct trace_event_functions trace_branch_funcs = {
static struct trace_event trace_branch_event = {
.type = TRACE_BRANCH,
.funcs = &trace_branch_funcs,
+ .name = "branch",
};
static struct tracer branch_trace __read_mostly =
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index f29e815ca5b2..4f5d37f96a17 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2658,6 +2658,9 @@ static int event_init(struct trace_event_call *call)
if (WARN_ON(!name))
return -EINVAL;
+ if (!call->event.name)
+ call->event.name = name;
+
if (call->class->raw_init) {
ret = call->class->raw_init(call);
if (ret < 0 && ret != -ENOSYS)
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index c35fbaab2a47..088dfd4a1a56 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -1342,11 +1342,13 @@ static struct trace_event_functions graph_functions = {
static struct trace_event graph_trace_entry_event = {
.type = TRACE_GRAPH_ENT,
.funcs = &graph_functions,
+ .name = "graph_ent",
};
static struct trace_event graph_trace_ret_event = {
.type = TRACE_GRAPH_RET,
- .funcs = &graph_functions
+ .funcs = &graph_functions,
+ .name = "graph_ret",
};
static struct tracer graph_trace __tracer_data = {
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..f3677e0da795 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1067,6 +1067,7 @@ static struct trace_event_functions trace_fn_funcs = {
static struct trace_event trace_fn_event = {
.type = TRACE_FN,
.funcs = &trace_fn_funcs,
+ .name = "fn",
};
/* TRACE_CTX an TRACE_WAKE */
@@ -1207,6 +1208,7 @@ static struct trace_event_functions trace_ctx_funcs = {
static struct trace_event trace_ctx_event = {
.type = TRACE_CTX,
.funcs = &trace_ctx_funcs,
+ .name = "ctx",
};
static struct trace_event_functions trace_wake_funcs = {
@@ -1219,6 +1221,7 @@ static struct trace_event_functions trace_wake_funcs = {
static struct trace_event trace_wake_event = {
.type = TRACE_WAKE,
.funcs = &trace_wake_funcs,
+ .name = "wake",
};
/* TRACE_STACK */
@@ -1256,6 +1259,7 @@ static struct trace_event_functions trace_stack_funcs = {
static struct trace_event trace_stack_event = {
.type = TRACE_STACK,
.funcs = &trace_stack_funcs,
+ .name = "stack",
};
/* TRACE_USER_STACK */
@@ -1309,6 +1313,7 @@ static struct trace_event_functions trace_user_stack_funcs = {
static struct trace_event trace_user_stack_event = {
.type = TRACE_USER_STACK,
.funcs = &trace_user_stack_funcs,
+ .name = "user_stack",
};
/* TRACE_HWLAT */
@@ -1373,6 +1378,7 @@ static struct trace_event_functions trace_hwlat_funcs = {
static struct trace_event trace_hwlat_event = {
.type = TRACE_HWLAT,
.funcs = &trace_hwlat_funcs,
+ .name = "hwlat",
};
/* TRACE_OSNOISE */
@@ -1443,6 +1449,7 @@ static struct trace_event_functions trace_osnoise_funcs = {
static struct trace_event trace_osnoise_event = {
.type = TRACE_OSNOISE,
.funcs = &trace_osnoise_funcs,
+ .name = "osnoise",
};
/* TRACE_TIMERLAT */
@@ -1491,6 +1498,7 @@ static struct trace_event_functions trace_timerlat_funcs = {
static struct trace_event trace_timerlat_event = {
.type = TRACE_TIMERLAT,
.funcs = &trace_timerlat_funcs,
+ .name = "timerlat",
};
/* TRACE_BPUTS */
@@ -1535,6 +1543,7 @@ static struct trace_event_functions trace_bputs_funcs = {
static struct trace_event trace_bputs_event = {
.type = TRACE_BPUTS,
.funcs = &trace_bputs_funcs,
+ .name = "bputs",
};
/* TRACE_BPRINT */
@@ -1579,6 +1588,7 @@ static struct trace_event_functions trace_bprint_funcs = {
static struct trace_event trace_bprint_event = {
.type = TRACE_BPRINT,
.funcs = &trace_bprint_funcs,
+ .name = "bprint",
};
/* TRACE_PRINT */
@@ -1616,6 +1626,7 @@ static struct trace_event_functions trace_print_funcs = {
static struct trace_event trace_print_event = {
.type = TRACE_PRINT,
.funcs = &trace_print_funcs,
+ .name = "print",
};
static enum print_line_t trace_raw_data(struct trace_iterator *iter, int flags,
@@ -1645,6 +1656,7 @@ static struct trace_event_functions trace_raw_data_funcs = {
static struct trace_event trace_raw_data_event = {
.type = TRACE_RAW_DATA,
.funcs = &trace_raw_data_funcs,
+ .name = "raw_data",
};
static enum print_line_t
@@ -1691,6 +1703,7 @@ static struct trace_event_functions trace_func_repeats_funcs = {
static struct trace_event trace_func_repeats_event = {
.type = TRACE_FUNC_REPEATS,
.funcs = &trace_func_repeats_funcs,
+ .name = "func_repeats",
};
static struct trace_event *events[] __initdata = {
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 4dc74d73fc1d..9356f3f2fdc0 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1835,6 +1835,9 @@ int trace_probe_register_event_call(struct trace_probe *tp)
trace_probe_name(tp)))
return -EEXIST;
+ if (!call->event.name)
+ call->event.name = call->name;
+
ret = register_trace_event(&call->event);
if (!ret)
return -ENODEV;
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 9c581d6da843..3e7e10b691f5 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -441,8 +441,29 @@ static void unreg_event_syscall_exit(struct trace_event_file *file,
mutex_unlock(&syscall_trace_lock);
}
+/**
+ * trace_csum - a simple checksum generator
+ *
+ * This returns a checksum for data that should not generate
+ * a lot of collisions, but is trivial to read.
+ */
+static u32 __init trace_csum(void *data, u32 len)
+{
+ u32 r = 0, i;
+ char *p = data;
+
+ if (!data)
+ return 0;
+
+ for (i = 0; i < len; i++)
+ r = (r >> 31) + (r << 1) + p[i];
+
+ return (r << 4) + len;
+}
+
static int __init init_syscall_trace(struct trace_event_call *call)
{
+ u32 csum;
int id;
int num;
@@ -456,9 +477,17 @@ static int __init init_syscall_trace(struct trace_event_call *call)
if (set_syscall_print_fmt(call) < 0)
return -ENOMEM;
+ csum = (trace_csum(call->print_fmt, strlen(call->print_fmt)) << 4) + num;
+ call->event.name = kasprintf(GFP_KERNEL, "sc-%x", csum);
+ if (!call->event.name) {
+ free_syscall_print_fmt(call);
+ return -ENOMEM;
+ }
+
id = trace_event_raw_init(call);
if (id < 0) {
+ kfree(call->event.name);
free_syscall_print_fmt(call);
return id;
}
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
When we do a kexec handover, we want to preserve previous ftrace data
into the new kernel. At the point when we write out the handover data,
ftrace may still be running and recording new events and we want to
capture all of those too.
To allow the new kernel to revive all trace data up to reboot, we store
all locations of trace buffers as well as their linked list metadata. We
can then later reuse the linked list to reconstruct the head pointer.
This patch implements the write-out logic for trace buffers.
Signed-off-by: Alexander Graf <[email protected]>
---
include/linux/ring_buffer.h | 2 +
kernel/trace/ring_buffer.c | 89 +++++++++++++++++++++++++++++++++++++
kernel/trace/trace.c | 16 +++++++
3 files changed, 107 insertions(+)
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index f34538f97c75..049565677ef8 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -212,4 +212,6 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
#define trace_rb_cpu_prepare NULL
#endif
+int trace_kho_write_trace_buffer(void *fdt, struct trace_buffer *buffer);
+
#endif /* _LINUX_RING_BUFFER_H */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index eaaf823ddedb..691d1236eeb1 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -20,6 +20,7 @@
#include <linux/percpu.h>
#include <linux/mutex.h>
#include <linux/delay.h>
+#include <linux/kexec.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/hash.h>
@@ -5921,6 +5922,94 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
return 0;
}
+#ifdef CONFIG_FTRACE_KHO
+static int trace_kho_write_cpu(void *fdt, struct trace_buffer *buffer, int cpu)
+{
+ int i = 0;
+ int err = 0;
+ struct list_head *tmp;
+ const char compatible[] = "ftrace,cpu-v1";
+ char name[] = "cpuffffffff";
+ int nr_pages;
+ struct ring_buffer_per_cpu *cpu_buffer;
+ bool first_loop = true;
+ struct kho_mem *mem;
+ uint64_t mem_len;
+
+ if (!cpumask_test_cpu(cpu, buffer->cpumask))
+ return 0;
+
+ cpu_buffer = buffer->buffers[cpu];
+
+ nr_pages = cpu_buffer->nr_pages;
+ mem_len = sizeof(*mem) * nr_pages * 2;
+ mem = vmalloc(mem_len);
+
+ snprintf(name, sizeof(name), "cpu%x", cpu);
+
+ err |= fdt_begin_node(fdt, name);
+ err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+ err |= fdt_property(fdt, "cpu", &cpu, sizeof(cpu));
+
+ for (tmp = rb_list_head(cpu_buffer->pages);
+ tmp != rb_list_head(cpu_buffer->pages) || first_loop;
+ tmp = rb_list_head(tmp->next), first_loop = false) {
+ struct buffer_page *bpage = (struct buffer_page *)tmp;
+
+ /* Ring is larger than it should be? */
+ if (i >= (nr_pages * 2)) {
+ pr_err("ftrace ring has more pages than nr_pages (%d / %d)", i, nr_pages);
+ err = -EINVAL;
+ break;
+ }
+
+ /* First describe the bpage */
+ mem[i++] = (struct kho_mem) {
+ .addr = __pa(bpage),
+ .len = sizeof(*bpage)
+ };
+
+ /* Then the data page */
+ mem[i++] = (struct kho_mem) {
+ .addr = __pa(bpage->page),
+ .len = PAGE_SIZE
+ };
+ }
+
+ err |= fdt_property(fdt, "mem", mem, mem_len);
+ err |= fdt_end_node(fdt);
+
+ vfree(mem);
+ return err;
+}
+
+int trace_kho_write_trace_buffer(void *fdt, struct trace_buffer *buffer)
+{
+ const char compatible[] = "ftrace,buffer-v1";
+ char name[] = "buffer";
+ int err;
+ int i;
+
+ err = fdt_begin_node(fdt, name);
+ if (err)
+ return err;
+
+ fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+
+ for (i = 0; i < buffer->cpus; i++) {
+ err = trace_kho_write_cpu(fdt, buffer, i);
+ if (err)
+ return err;
+ }
+
+ err = fdt_end_node(fdt);
+ if (err)
+ return err;
+
+ return 0;
+}
+#endif
+
#ifdef CONFIG_RING_BUFFER_STARTUP_TEST
/*
* This is a basic integrity check of the ring buffer.
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 3e7f61cf773e..71c249cc5b43 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -10597,6 +10597,21 @@ void __init early_trace_init(void)
}
#ifdef CONFIG_FTRACE_KHO
+static int trace_kho_write_trace_array(void *fdt, struct trace_array *tr)
+{
+ const char *name = tr->name ? tr->name : "global_trace";
+ const char compatible[] = "ftrace,array-v1";
+ int err = 0;
+
+ err |= fdt_begin_node(fdt, name);
+ err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+ err |= fdt_property(fdt, "trace_flags", &tr->trace_flags, sizeof(tr->trace_flags));
+ err |= trace_kho_write_trace_buffer(fdt, tr->array_buffer.buffer);
+ err |= fdt_end_node(fdt);
+
+ return err;
+}
+
static int trace_kho_notifier(struct notifier_block *self,
unsigned long cmd,
void *v)
@@ -10623,6 +10638,7 @@ static int trace_kho_notifier(struct notifier_block *self,
err |= fdt_begin_node(fdt, "ftrace");
err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+ err |= trace_kho_write_trace_array(fdt, &global_trace);
err |= fdt_end_node(fdt);
if (!err) {
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Events and thus their parsing handle in ftrace have dynamic IDs that get
assigned whenever the event is added to the system. If we want to parse
trace events after kexec, we need to link event IDs back to the original
trace event that existed before we kexec'ed.
There are broadly 2 paths we could take for that:
1) Save full event description across KHO, restore after kexec,
merge identical trace events into a single identifier.
2) Recover the ID of post-kexec added events so they get the same
ID after kexec that they had before kexec
This patch implements the second option. It's simpler and thus less
intrusive. However, it means we can not fully parse affected events
when the kernel removes or modifies trace events across a kho kexec.
Signed-off-by: Alexander Graf <[email protected]>
---
kernel/trace/trace.c | 1 +
kernel/trace/trace_output.c | 28 ++++++++++++++++++++++++++++
kernel/trace/trace_output.h | 1 +
3 files changed, 30 insertions(+)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 71c249cc5b43..26edfd2a85fd 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -10639,6 +10639,7 @@ static int trace_kho_notifier(struct notifier_block *self,
err |= fdt_begin_node(fdt, "ftrace");
err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
err |= trace_kho_write_trace_array(fdt, &global_trace);
+ err |= trace_kho_write_events(fdt);
err |= fdt_end_node(fdt);
if (!err) {
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index f3677e0da795..113de40c616f 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -12,6 +12,7 @@
#include <linux/sched/clock.h>
#include <linux/sched/mm.h>
#include <linux/idr.h>
+#include <linux/kexec.h>
#include "trace_output.h"
@@ -669,6 +670,33 @@ int trace_print_lat_context(struct trace_iterator *iter)
return !trace_seq_has_overflowed(s);
}
+int trace_kho_write_events(void *fdt)
+{
+#ifdef CONFIG_FTRACE_KHO
+ const char compatible[] = "ftrace,events-v1";
+ const char *name = "events";
+ struct trace_event *event;
+ unsigned key;
+ int err = 0;
+
+ err |= fdt_begin_node(fdt, name);
+ err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+
+ for (key = 0; key < EVENT_HASHSIZE; key++) {
+ hlist_for_each_entry(event, &event_hash[key], node)
+ err |= fdt_property(fdt, event->name, &event->type,
+ sizeof(event->type));
+ }
+
+ err |= fdt_end_node(fdt);
+
+ return err;
+#else
+ return 0;
+#endif
+}
+
+
/**
* ftrace_find_event - find a registered event
* @type: the type of event to look for
diff --git a/kernel/trace/trace_output.h b/kernel/trace/trace_output.h
index dca40f1f1da4..36dc7963269e 100644
--- a/kernel/trace/trace_output.h
+++ b/kernel/trace/trace_output.h
@@ -25,6 +25,7 @@ extern enum print_line_t print_event_fields(struct trace_iterator *iter,
extern void trace_event_read_lock(void);
extern void trace_event_read_unlock(void);
extern struct trace_event *ftrace_find_event(int type);
+extern int trace_kho_write_events(void *fdt);
extern enum print_line_t trace_nop_print(struct trace_iterator *iter,
int flags, struct trace_event *event);
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
This patch implements all logic necessary to match a new trace event
that we add against preserved trace events from kho. If we find a match,
we give the new trace event the old event's identifier. That way, trace
read-outs are able to make sense of buffer contents again because the
parsing code for events looks at the same identifiers.
Signed-off-by: Alexander Graf <[email protected]>
---
kernel/trace/trace_output.c | 65 ++++++++++++++++++++++++++++++++++++-
1 file changed, 64 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 113de40c616f..d2e2a6346322 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -749,6 +749,67 @@ void trace_event_read_unlock(void)
up_read(&trace_event_sem);
}
+/**
+ * trace_kho_fill_event_type - restore event type info from KHO
+ * @event: the event type to enumerate
+ *
+ * Event types are semi-dynamically generated. To ensure that
+ * their identifiers match before and after kexec with KHO,
+ * let's match up unique name identifiers and fill in the
+ * respective ID information if we booted with KHO.
+ */
+static bool trace_kho_fill_event_type(struct trace_event *event)
+{
+#ifdef CONFIG_FTRACE_KHO
+ const char *path = "/ftrace/events";
+ void *fdt = kho_get_fdt();
+ int err, len, off, id;
+ const void *p;
+
+ if (!fdt)
+ return false;
+
+ if (WARN_ON(!event->name))
+ return false;
+
+ pr_debug("Trying to revive event '%s'", event->name);
+
+ off = fdt_path_offset(fdt, path);
+ if (off < 0) {
+ pr_debug("Could not find '%s' in DT", path);
+ return false;
+ }
+
+ err = fdt_node_check_compatible(fdt, off, "ftrace,events-v1");
+ if (err) {
+ pr_warn("Node '%s' has invalid compatible", path);
+ return false;
+ }
+
+ p = fdt_getprop(fdt, off, event->name, &len);
+ if (!p) {
+ pr_warn("Event '%s' not found", event->name);
+ return false;
+ }
+
+ if (len != sizeof(event->type)) {
+ pr_warn("Event '%s' has invalid length", event->name);
+ return false;
+ }
+
+ id = *(const u32 *)p;
+
+ /* Mark ID as in use */
+ if (ida_alloc_range(&trace_event_ida, id, id, GFP_KERNEL) != id)
+ return false;
+
+ event->type = id;
+ return true;
+#endif
+
+ return false;
+}
+
/**
* register_trace_event - register output for an event type
* @event: the event type to register
@@ -777,7 +838,9 @@ int register_trace_event(struct trace_event *event)
if (WARN_ON(!event->funcs))
goto out;
- if (!event->type) {
+ if (trace_kho_fill_event_type(event)) {
+ pr_debug("Recovered '%s' as id=%d", event->name, event->type);
+ } else if (!event->type) {
event->type = alloc_trace_event_type();
if (!event->type)
goto out;
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
When kexec handover is in place, we now know the location of all
previous buffers for ftrace rings. With this patch applied, ftrace
reassembles any new trace buffer that carries the same name as a
previous one with the same data pages that the previous buffer had.
That way, a buffer that we had in place before kexec becomes readable
after kexec again as soon as it gets initialized with the same name.
Signed-off-by: Alexander Graf <[email protected]>
---
kernel/trace/ring_buffer.c | 173 ++++++++++++++++++++++++++++++++++++-
1 file changed, 171 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 691d1236eeb1..f3d07cb90762 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -575,6 +575,28 @@ struct ring_buffer_iter {
int missed_events;
};
+struct trace_kho_cpu {
+ const struct kho_mem *mem;
+ uint32_t nr_mems;
+};
+
+#ifdef CONFIG_FTRACE_KHO
+static int trace_kho_replace_buffers(struct ring_buffer_per_cpu *cpu_buffer,
+ struct trace_kho_cpu *kho);
+static int trace_kho_read_cpu(const char *name, int cpu, struct trace_kho_cpu *kho);
+#else
+static int trace_kho_replace_buffers(struct ring_buffer_per_cpu *cpu_buffer,
+ struct trace_kho_cpu *kho)
+{
+ return -EINVAL;
+}
+
+static int trace_kho_read_cpu(const char *name, int cpu, struct trace_kho_cpu *kho)
+{
+ return -EINVAL;
+}
+#endif
+
#ifdef RB_TIME_32
/*
@@ -1807,10 +1829,12 @@ struct trace_buffer *__ring_buffer_alloc(const char *name,
unsigned long size, unsigned flags,
struct lock_class_key *key)
{
+ int cpu = raw_smp_processor_id();
+ struct trace_kho_cpu kho = {};
struct trace_buffer *buffer;
+ bool use_kho = false;
long nr_pages;
int bsize;
- int cpu;
int ret;
/* keep it in its own cache line */
@@ -1823,6 +1847,12 @@ struct trace_buffer *__ring_buffer_alloc(const char *name,
goto fail_free_buffer;
nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+ if (!trace_kho_read_cpu(name, cpu, &kho) && kho.nr_mems > 4) {
+ nr_pages = kho.nr_mems / 2;
+ use_kho = true;
+ pr_debug("Using kho for buffer '%s' on CPU [%03d]", name, cpu);
+ }
+
buffer->flags = flags;
buffer->clock = trace_clock_local;
buffer->reader_lock_key = key;
@@ -1843,12 +1873,14 @@ struct trace_buffer *__ring_buffer_alloc(const char *name,
if (!buffer->buffers)
goto fail_free_cpumask;
- cpu = raw_smp_processor_id();
cpumask_set_cpu(cpu, buffer->cpumask);
buffer->buffers[cpu] = rb_allocate_cpu_buffer(buffer, nr_pages, cpu);
if (!buffer->buffers[cpu])
goto fail_free_buffers;
+ if (use_kho && trace_kho_replace_buffers(buffer->buffers[cpu], &kho))
+ pr_warn("Could not revive all previous trace data");
+
ret = cpuhp_state_add_instance(CPUHP_TRACE_RB_PREPARE, &buffer->node);
if (ret < 0)
goto fail_free_buffers;
@@ -5886,7 +5918,9 @@ EXPORT_SYMBOL_GPL(ring_buffer_read_page);
*/
int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
{
+ struct trace_kho_cpu kho = {};
struct trace_buffer *buffer;
+ bool use_kho = false;
long nr_pages_same;
int cpu_i;
unsigned long nr_pages;
@@ -5910,6 +5944,12 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
/* allocate minimum pages, user can later expand it */
if (!nr_pages_same)
nr_pages = 2;
+
+ if (!trace_kho_read_cpu(buffer->name, cpu, &kho) && kho.nr_mems > 4) {
+ nr_pages = kho.nr_mems / 2;
+ use_kho = true;
+ }
+
buffer->buffers[cpu] =
rb_allocate_cpu_buffer(buffer, nr_pages, cpu);
if (!buffer->buffers[cpu]) {
@@ -5917,12 +5957,141 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
cpu);
return -ENOMEM;
}
+
+ if (use_kho && trace_kho_replace_buffers(buffer->buffers[cpu], &kho))
+ pr_warn("Could not revive all previous trace data");
+
smp_wmb();
cpumask_set_cpu(cpu, buffer->cpumask);
return 0;
}
#ifdef CONFIG_FTRACE_KHO
+static int trace_kho_replace_buffers(struct ring_buffer_per_cpu *cpu_buffer,
+ struct trace_kho_cpu *kho)
+{
+ bool first_loop = true;
+ struct list_head *tmp;
+ int err = 0;
+ int i = 0;
+
+ if (kho->nr_mems != cpu_buffer->nr_pages * 2)
+ return -EINVAL;
+
+ for (tmp = rb_list_head(cpu_buffer->pages);
+ tmp != rb_list_head(cpu_buffer->pages) || first_loop;
+ tmp = rb_list_head(tmp->next), first_loop = false) {
+ struct buffer_page *bpage = (struct buffer_page *)tmp;
+ const struct kho_mem *mem_bpage = &kho->mem[i++];
+ const struct kho_mem *mem_page = &kho->mem[i++];
+ const uint64_t rb_page_head = 1;
+ struct buffer_page *old_bpage;
+ void *old_page;
+
+ old_bpage = __va(mem_bpage->addr);
+ if (!bpage)
+ goto out;
+
+ if ((ulong)old_bpage->list.next & rb_page_head) {
+ struct list_head *new_lhead;
+ struct buffer_page *new_head;
+
+ new_lhead = rb_list_head(bpage->list.next);
+ new_head = (struct buffer_page *)new_lhead;
+
+ /* Assume the buffer is completely full */
+ cpu_buffer->tail_page = bpage;
+ cpu_buffer->commit_page = bpage;
+ /* Set the head pointers to what they were before */
+ cpu_buffer->head_page->list.prev->next = (struct list_head *)
+ ((ulong)cpu_buffer->head_page->list.prev->next & ~rb_page_head);
+ cpu_buffer->head_page = new_head;
+ bpage->list.next = (struct list_head *)((ulong)new_lhead | rb_page_head);
+ }
+
+ if (rb_page_entries(old_bpage) || rb_page_write(old_bpage)) {
+ /*
+ * We want to recycle the pre-kho page, it contains
+ * trace data. To do so, we unreserve it and swap the
+ * current data page with the pre-kho one
+ */
+ old_page = kho_claim_mem(mem_page);
+
+ /* Recycle the old page, it contains data */
+ free_page((ulong)bpage->page);
+ bpage->page = old_page;
+
+ bpage->write = old_bpage->write;
+ bpage->entries = old_bpage->entries;
+ bpage->real_end = old_bpage->real_end;
+
+ local_inc(&cpu_buffer->pages_touched);
+ } else {
+ kho_return_mem(mem_page);
+ }
+
+ kho_return_mem(mem_bpage);
+ }
+
+out:
+ return err;
+}
+
+static int trace_kho_read_cpu(const char *name, int cpu,
+ struct trace_kho_cpu *kho)
+{
+ void *fdt = kho_get_fdt();
+ int mem_len;
+ int err = 0;
+ char *path;
+ int off;
+
+ if (!fdt)
+ return -ENOENT;
+
+ if (!kho)
+ return -EINVAL;
+
+ path = kasprintf(GFP_KERNEL, "/ftrace/%s/buffer/cpu%x", name, cpu);
+ if (!path)
+ return -ENOMEM;
+
+ pr_debug("Trying to revive trace buffer '%s'", path);
+
+ off = fdt_path_offset(fdt, path);
+ if (off < 0) {
+ pr_debug("Could not find '%s' in DT", path);
+ err = -ENOENT;
+ goto out;
+ }
+
+ err = fdt_node_check_compatible(fdt, off, "ftrace,cpu-v1");
+ if (err) {
+ pr_warn("Node '%s' has invalid compatible", path);
+ err = -EINVAL;
+ goto out;
+ }
+
+ kho->mem = fdt_getprop(fdt, off, "mem", &mem_len);
+ if (!kho->mem) {
+ pr_warn("Node '%s' has invalid mem property", path);
+ err = -EINVAL;
+ goto out;
+ }
+
+ kho->nr_mems = mem_len / sizeof(*kho->mem);
+
+ /* Should follow "bpage 0, page 0, bpage 1, page 1, ..." pattern */
+ if ((kho->nr_mems & 1)) {
+ err = -EINVAL;
+ goto out;
+ }
+
+out:
+ kfree(path);
+ return err;
+}
+
static int trace_kho_write_cpu(void *fdt, struct trace_buffer *buffer, int cpu)
{
int i = 0;
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Now that all bits are in place to allow ftrace to pass its trace data
into the next kernel on kexec, let's give users a kconfig option to
enable the functionality.
Signed-off-by: Alexander Graf <[email protected]>
---
kernel/trace/Kconfig | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..af83ee755b9e 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -1169,6 +1169,19 @@ config HIST_TRIGGERS_DEBUG
If unsure, say N.
+config FTRACE_KHO
+ bool "Ftrace Kexec handover support"
+ depends on KEXEC_KHO
+ help
+ Enable support for ftrace to pass metadata across kexec so the new
+ kernel continues to use the previous kernel's trace buffers.
+
+ This can be useful when debugging kexec performance or correctness
+ issues: The new kernel can dump the old kernel's trace buffer which
+ contains all events until reboot.
+
+ If unsure, say N.
+
source "kernel/trace/rv/Kconfig"
endif # FTRACE
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On Wed, 13 Dec 2023 00:04:45 +0000
Alexander Graf <[email protected]> wrote:
> With KHO (Kexec HandOver), we want to preserve trace buffers across
> kexec. To carry over their state between kernels, the kernel needs a
> common handle for them that exists on both sides. As handle we introduce
> names for ring buffers. In a follow-up patch, the kernel can then use
> these names to recover buffer contents for specific ring buffers.
>
Is there a way to use the trace_array name instead?
The trace_array is the structure that represents each tracing instance. And
it already has a name field. And if you can get the associated ring buffer
from that too.
struct trace_array *tr;
tr->array_buffer.buffer
tr->name
When you do: mkdir /sys/kernel/tracing/instance/foo
You create a new trace_array instance where tr->name = "foo" and allocates
the buffer for it as well.
-- Steve
Hi Steve,
On 13.12.23 01:15, Steven Rostedt wrote:
>
> On Wed, 13 Dec 2023 00:04:45 +0000
> Alexander Graf <[email protected]> wrote:
>
>> With KHO (Kexec HandOver), we want to preserve trace buffers across
>> kexec. To carry over their state between kernels, the kernel needs a
>> common handle for them that exists on both sides. As handle we introduce
>> names for ring buffers. In a follow-up patch, the kernel can then use
>> these names to recover buffer contents for specific ring buffers.
>>
> Is there a way to use the trace_array name instead?
>
> The trace_array is the structure that represents each tracing instance. And
> it already has a name field. And if you can get the associated ring buffer
> from that too.
>
> struct trace_array *tr;
>
> tr->array_buffer.buffer
>
> tr->name
>
> When you do: mkdir /sys/kernel/tracing/instance/foo
>
> You create a new trace_array instance where tr->name = "foo" and allocates
> the buffer for it as well.
The name in the ring buffer is pretty much just a copy of the trace
array name. I use it to reconstruct which buffer we're actually
referring to inside __ring_buffer_alloc().
I'm all ears for alternative suggestions. I suppose we could pass tr as
argument to ring_buffer_alloc() instead of the name?
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On Wed, 13 Dec 2023 01:35:16 +0100
Alexander Graf <[email protected]> wrote:
> > The trace_array is the structure that represents each tracing instance. And
> > it already has a name field. And if you can get the associated ring buffer
> > from that too.
> >
> > struct trace_array *tr;
> >
> > tr->array_buffer.buffer
> >
> > tr->name
> >
> > When you do: mkdir /sys/kernel/tracing/instance/foo
> >
> > You create a new trace_array instance where tr->name = "foo" and allocates
> > the buffer for it as well.
>
> The name in the ring buffer is pretty much just a copy of the trace
> array name. I use it to reconstruct which buffer we're actually
> referring to inside __ring_buffer_alloc().
No, I rather not tie the ring buffer to the trace_array.
>
> I'm all ears for alternative suggestions. I suppose we could pass tr as
> argument to ring_buffer_alloc() instead of the name?
I'll have to spend some time (that I don't currently have :-( ) on looking
at this more. I really don't like the copying of the name into the ring
buffer allocation, as it may be an unneeded burden to maintain, not to
mention the duplicate field.
-- Steve
On Wed, 13 Dec 2023 00:04:46 +0000
Alexander Graf <[email protected]> wrote:
> With KHO (Kexec HandOver), we want to preserve trace buffers. To parse
> them, we need to ensure that all trace events that exist in the logs are
> identical to the ones we parse as. That means we need to match the
> events before and after kexec.
>
> As a first step towards that, let's give every event a unique name. That
> way we can clearly identify the event before and after kexec and restore
> its ID post-kexec.
>
> Signed-off-by: Alexander Graf <[email protected]>
> ---
> include/linux/trace_events.h | 1 +
> include/trace/trace_events.h | 2 ++
> kernel/trace/blktrace.c | 1 +
> kernel/trace/trace_branch.c | 1 +
> kernel/trace/trace_events.c | 3 +++
> kernel/trace/trace_functions_graph.c | 4 +++-
> kernel/trace/trace_output.c | 13 +++++++++++++
> kernel/trace/trace_probe.c | 3 +++
> kernel/trace/trace_syscalls.c | 29 ++++++++++++++++++++++++++++
> 9 files changed, 56 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index d68ff9b1247f..7670224aa92d 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -149,6 +149,7 @@ struct trace_event {
> struct hlist_node node;
> int type;
> struct trace_event_functions *funcs;
> + const char *name;
> };
OK, this is a hard no. We definitely need to find a different way to do
this. I'm trying hard to lower the footprint of tracing, and this just
added 8 bytes to every event on a 64 bit machine.
On my box I have 1953 events, and they are constantly growing. This just
added 15,624 bytes of tracing overhead to that machine.
That may not sound like much, but as this is only for this feature, it just
added 15K to the overhead for the majority of users.
I'm not sure how easy it is to make this a config option that takes away
that field when not set. But I would need that at a minimum.
-- Steve
Hi Alexander,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/x86/core]
[also build test ERROR on arm64/for-next/core akpm-mm/mm-everything linus/master v6.7-rc5 next-20231213]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Alexander-Graf/mm-memblock-Add-support-for-scratch-memory/20231213-080941
base: tip/x86/core
patch link: https://lore.kernel.org/r/20231213000452.88295-9-graf%40amazon.com
patch subject: [PATCH 08/15] tracing: Introduce names for ring buffers
config: i386-buildonly-randconfig-003-20231213 (https://download.01.org/0day-ci/archive/20231213/[email protected]/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231213/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
>> kernel/trace/ring_buffer_benchmark.c:435:53: error: too few arguments provided to function-like macro invocation
buffer = ring_buffer_alloc(1000000, RB_FL_OVERWRITE);
^
include/linux/ring_buffer.h:96:9: note: macro 'ring_buffer_alloc' defined here
#define ring_buffer_alloc(name, size, flags) \
^
>> kernel/trace/ring_buffer_benchmark.c:435:11: error: use of undeclared identifier 'ring_buffer_alloc'; did you mean '__ring_buffer_alloc'?
buffer = ring_buffer_alloc(1000000, RB_FL_OVERWRITE);
^~~~~~~~~~~~~~~~~
__ring_buffer_alloc
include/linux/ring_buffer.h:88:1: note: '__ring_buffer_alloc' declared here
__ring_buffer_alloc(const char *name, unsigned long size, unsigned flags,
^
2 errors generated.
--
>> kernel/trace/ring_buffer.c:6096:65: error: too few arguments provided to function-like macro invocation
buffer = ring_buffer_alloc(RB_TEST_BUFFER_SIZE, RB_FL_OVERWRITE);
^
include/linux/ring_buffer.h:96:9: note: macro 'ring_buffer_alloc' defined here
#define ring_buffer_alloc(name, size, flags) \
^
>> kernel/trace/ring_buffer.c:6096:11: error: use of undeclared identifier 'ring_buffer_alloc'; did you mean '__ring_buffer_alloc'?
buffer = ring_buffer_alloc(RB_TEST_BUFFER_SIZE, RB_FL_OVERWRITE);
^~~~~~~~~~~~~~~~~
__ring_buffer_alloc
kernel/trace/ring_buffer.c:1873:19: note: '__ring_buffer_alloc' declared here
EXPORT_SYMBOL_GPL(__ring_buffer_alloc);
^
2 errors generated.
vim +435 kernel/trace/ring_buffer_benchmark.c
5092dbc96f3acd Steven Rostedt 2009-05-05 429
5092dbc96f3acd Steven Rostedt 2009-05-05 430 static int __init ring_buffer_benchmark_init(void)
5092dbc96f3acd Steven Rostedt 2009-05-05 431 {
5092dbc96f3acd Steven Rostedt 2009-05-05 432 int ret;
5092dbc96f3acd Steven Rostedt 2009-05-05 433
5092dbc96f3acd Steven Rostedt 2009-05-05 434 /* make a one meg buffer in overwite mode */
5092dbc96f3acd Steven Rostedt 2009-05-05 @435 buffer = ring_buffer_alloc(1000000, RB_FL_OVERWRITE);
5092dbc96f3acd Steven Rostedt 2009-05-05 436 if (!buffer)
5092dbc96f3acd Steven Rostedt 2009-05-05 437 return -ENOMEM;
5092dbc96f3acd Steven Rostedt 2009-05-05 438
5092dbc96f3acd Steven Rostedt 2009-05-05 439 if (!disable_reader) {
5092dbc96f3acd Steven Rostedt 2009-05-05 440 consumer = kthread_create(ring_buffer_consumer_thread,
5092dbc96f3acd Steven Rostedt 2009-05-05 441 NULL, "rb_consumer");
5092dbc96f3acd Steven Rostedt 2009-05-05 442 ret = PTR_ERR(consumer);
5092dbc96f3acd Steven Rostedt 2009-05-05 443 if (IS_ERR(consumer))
5092dbc96f3acd Steven Rostedt 2009-05-05 444 goto out_fail;
5092dbc96f3acd Steven Rostedt 2009-05-05 445 }
5092dbc96f3acd Steven Rostedt 2009-05-05 446
5092dbc96f3acd Steven Rostedt 2009-05-05 447 producer = kthread_run(ring_buffer_producer_thread,
5092dbc96f3acd Steven Rostedt 2009-05-05 448 NULL, "rb_producer");
5092dbc96f3acd Steven Rostedt 2009-05-05 449 ret = PTR_ERR(producer);
5092dbc96f3acd Steven Rostedt 2009-05-05 450
5092dbc96f3acd Steven Rostedt 2009-05-05 451 if (IS_ERR(producer))
5092dbc96f3acd Steven Rostedt 2009-05-05 452 goto out_kill;
5092dbc96f3acd Steven Rostedt 2009-05-05 453
98e4833ba3c314 Ingo Molnar 2009-11-23 454 /*
98e4833ba3c314 Ingo Molnar 2009-11-23 455 * Run them as low-prio background tasks by default:
98e4833ba3c314 Ingo Molnar 2009-11-23 456 */
7ac07434048001 Steven Rostedt 2009-11-25 457 if (!disable_reader) {
4fd5750af02ab7 Peter Zijlstra 2020-07-20 458 if (consumer_fifo >= 2)
4fd5750af02ab7 Peter Zijlstra 2020-07-20 459 sched_set_fifo(consumer);
4fd5750af02ab7 Peter Zijlstra 2020-07-20 460 else if (consumer_fifo == 1)
4fd5750af02ab7 Peter Zijlstra 2020-07-20 461 sched_set_fifo_low(consumer);
4fd5750af02ab7 Peter Zijlstra 2020-07-20 462 else
7ac07434048001 Steven Rostedt 2009-11-25 463 set_user_nice(consumer, consumer_nice);
7ac07434048001 Steven Rostedt 2009-11-25 464 }
7ac07434048001 Steven Rostedt 2009-11-25 465
4fd5750af02ab7 Peter Zijlstra 2020-07-20 466 if (producer_fifo >= 2)
4fd5750af02ab7 Peter Zijlstra 2020-07-20 467 sched_set_fifo(producer);
4fd5750af02ab7 Peter Zijlstra 2020-07-20 468 else if (producer_fifo == 1)
4fd5750af02ab7 Peter Zijlstra 2020-07-20 469 sched_set_fifo_low(producer);
4fd5750af02ab7 Peter Zijlstra 2020-07-20 470 else
7ac07434048001 Steven Rostedt 2009-11-25 471 set_user_nice(producer, producer_nice);
98e4833ba3c314 Ingo Molnar 2009-11-23 472
5092dbc96f3acd Steven Rostedt 2009-05-05 473 return 0;
5092dbc96f3acd Steven Rostedt 2009-05-05 474
5092dbc96f3acd Steven Rostedt 2009-05-05 475 out_kill:
5092dbc96f3acd Steven Rostedt 2009-05-05 476 if (consumer)
5092dbc96f3acd Steven Rostedt 2009-05-05 477 kthread_stop(consumer);
5092dbc96f3acd Steven Rostedt 2009-05-05 478
5092dbc96f3acd Steven Rostedt 2009-05-05 479 out_fail:
5092dbc96f3acd Steven Rostedt 2009-05-05 480 ring_buffer_free(buffer);
5092dbc96f3acd Steven Rostedt 2009-05-05 481 return ret;
5092dbc96f3acd Steven Rostedt 2009-05-05 482 }
5092dbc96f3acd Steven Rostedt 2009-05-05 483
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Alexander,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/x86/core]
[also build test WARNING on arm64/for-next/core akpm-mm/mm-everything linus/master v6.7-rc5 next-20231213]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Alexander-Graf/mm-memblock-Add-support-for-scratch-memory/20231213-080941
base: tip/x86/core
patch link: https://lore.kernel.org/r/20231213000452.88295-7-graf%40amazon.com
patch subject: [PATCH 06/15] arm64: Add KHO support
config: i386-buildonly-randconfig-001-20231213 (https://download.01.org/0day-ci/archive/20231213/[email protected]/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231213/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
>> drivers/of/fdt.c:1012:13: warning: no previous prototype for function 'early_init_dt_check_kho' [-Wmissing-prototypes]
void __init early_init_dt_check_kho(void)
^
drivers/of/fdt.c:1012:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void __init early_init_dt_check_kho(void)
^
static
1 warning generated.
vim +/early_init_dt_check_kho +1012 drivers/of/fdt.c
1008
1009 /**
1010 * early_init_dt_check_kho - Decode info required for kexec handover from DT
1011 */
> 1012 void __init early_init_dt_check_kho(void)
1013 {
1014 #ifdef CONFIG_KEXEC_KHO
1015 unsigned long node = chosen_node_offset;
1016 u64 kho_start, scratch_start, scratch_size, mem_start, mem_size;
1017 const __be32 *p;
1018 int l;
1019
1020 if ((long)node < 0)
1021 return;
1022
1023 p = of_get_flat_dt_prop(node, "linux,kho-dt", &l);
1024 if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
1025 return;
1026
1027 kho_start = dt_mem_next_cell(dt_root_addr_cells, &p);
1028
1029 p = of_get_flat_dt_prop(node, "linux,kho-scratch", &l);
1030 if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
1031 return;
1032
1033 scratch_start = dt_mem_next_cell(dt_root_addr_cells, &p);
1034 scratch_size = dt_mem_next_cell(dt_root_addr_cells, &p);
1035
1036 p = of_get_flat_dt_prop(node, "linux,kho-mem", &l);
1037 if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
1038 return;
1039
1040 mem_start = dt_mem_next_cell(dt_root_addr_cells, &p);
1041 mem_size = dt_mem_next_cell(dt_root_addr_cells, &p);
1042
1043 kho_populate(kho_start, scratch_start, scratch_size, mem_start, mem_size);
1044 #endif
1045 }
1046
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Alexander,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/x86/core]
[also build test WARNING on arm64/for-next/core akpm-mm/mm-everything linus/master v6.7-rc5 next-20231213]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Alexander-Graf/mm-memblock-Add-support-for-scratch-memory/20231213-080941
base: tip/x86/core
patch link: https://lore.kernel.org/r/20231213000452.88295-3-graf%40amazon.com
patch subject: [PATCH 02/15] memblock: Declare scratch memory as CMA
config: arc-allnoconfig (https://download.01.org/0day-ci/archive/20231213/[email protected]/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231213/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
>> mm/memblock.c:2153:13: warning: 'reserve_scratch_mem' defined but not used [-Wunused-function]
2153 | static void reserve_scratch_mem(phys_addr_t start, phys_addr_t end)
| ^~~~~~~~~~~~~~~~~~~
vim +/reserve_scratch_mem +2153 mm/memblock.c
2152
> 2153 static void reserve_scratch_mem(phys_addr_t start, phys_addr_t end)
2154 {
2155 #ifdef CONFIG_MEMBLOCK_SCRATCH
2156 ulong start_pfn = pageblock_start_pfn(PFN_DOWN(start));
2157 ulong end_pfn = pageblock_align(PFN_UP(end));
2158 ulong pfn;
2159
2160 for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
2161 /* Mark as CMA to prevent kernel allocations in it */
2162 set_pageblock_migratetype(pfn_to_page(pfn), MIGRATE_CMA);
2163 }
2164 #endif
2165 }
2166
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Alexander,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/x86/core]
[also build test WARNING on arm64/for-next/core akpm-mm/mm-everything linus/master v6.7-rc5 next-20231213]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Alexander-Graf/mm-memblock-Add-support-for-scratch-memory/20231213-080941
base: tip/x86/core
patch link: https://lore.kernel.org/r/20231213000452.88295-7-graf%40amazon.com
patch subject: [PATCH 06/15] arm64: Add KHO support
config: microblaze-randconfig-r133-20231213 (https://download.01.org/0day-ci/archive/20231213/[email protected]/config)
compiler: microblaze-linux-gcc (GCC) 13.2.0
reproduce: (https://download.01.org/0day-ci/archive/20231213/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
sparse warnings: (new ones prefixed by >>)
>> drivers/of/fdt.c:1012:13: sparse: sparse: symbol 'early_init_dt_check_kho' was not declared. Should it be static?
vim +/early_init_dt_check_kho +1012 drivers/of/fdt.c
1008
1009 /**
1010 * early_init_dt_check_kho - Decode info required for kexec handover from DT
1011 */
> 1012 void __init early_init_dt_check_kho(void)
1013 {
1014 #ifdef CONFIG_KEXEC_KHO
1015 unsigned long node = chosen_node_offset;
1016 u64 kho_start, scratch_start, scratch_size, mem_start, mem_size;
1017 const __be32 *p;
1018 int l;
1019
1020 if ((long)node < 0)
1021 return;
1022
1023 p = of_get_flat_dt_prop(node, "linux,kho-dt", &l);
1024 if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
1025 return;
1026
1027 kho_start = dt_mem_next_cell(dt_root_addr_cells, &p);
1028
1029 p = of_get_flat_dt_prop(node, "linux,kho-scratch", &l);
1030 if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
1031 return;
1032
1033 scratch_start = dt_mem_next_cell(dt_root_addr_cells, &p);
1034 scratch_size = dt_mem_next_cell(dt_root_addr_cells, &p);
1035
1036 p = of_get_flat_dt_prop(node, "linux,kho-mem", &l);
1037 if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
1038 return;
1039
1040 mem_start = dt_mem_next_cell(dt_root_addr_cells, &p);
1041 mem_size = dt_mem_next_cell(dt_root_addr_cells, &p);
1042
1043 kho_populate(kho_start, scratch_start, scratch_size, mem_start, mem_size);
1044 #endif
1045 }
1046
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On 13.12.23 19:36, Stanislav Kinsburskii wrote:
> On Wed, Dec 13, 2023 at 12:04:40AM +0000, Alexander Graf wrote:
>> +int register_kho_notifier(struct notifier_block *nb)
>> +{
>> + return blocking_notifier_chain_register(&kho.chain_head, nb);
>> +}
>> +EXPORT_SYMBOL_GPL(register_kho_notifier);
>> +
>> +int unregister_kho_notifier(struct notifier_block *nb)
>> +{
>> + return blocking_notifier_chain_unregister(&kho.chain_head, nb);
>> +}
>> +EXPORT_SYMBOL_GPL(unregister_kho_notifier);
>> +
>> +bool kho_is_active(void)
>> +{
>> + return kho.active;
>> +}
>> +EXPORT_SYMBOL_GPL(kho_is_active);
>> +
> Why should these helpers be restricted to GPL code?
That's a simple one: Everything should be EXPORT_SYMBOL_GPL by default.
You need to have really good reasons to export anything for non-GPL
modules. I don't have a good reason for them, so it's GPL only :)
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Alexander Graf <[email protected]> writes:
> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
>
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See James' and my Linux Plumbers
> Conference 2023 presentation for details:
>
> https://lpc.events/event/17/contributions/1485/
>
> To start us on the journey to support all the use cases above, this
> patch implements basic infrastructure to allow hand over of kernel state
> across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
> With this patch set applied, you can read ftrace records from the
> pre-kexec environment in your post-kexec one. This creates a very powerful
> debugging and performance analysis tool for kexec. It's also slightly
> easier to reason about than full blown VFIO state preservation.
>
> == Alternatives ==
>
> There are alternative approaches to (parts of) the problems above:
>
> * Memory Pools [1] - preallocated persistent memory region + allocator
> * PRMEM [2] - resizable persistent memory regions with fixed metadata
> pointer on the kernel command line + allocator
> * Pkernfs [3] - preallocated file system for in-kernel data with fixed
> address location on the kernel command line
> * PKRAM [4] - handover of user space pages using a fixed metadata page
> specified via command line
>
> All of the approaches above fundamentally have the same problem: They
> require the administrator to explicitly carve out a physical memory
> location because they have no mechanism outside of the kernel command
> line to pass data (including memory reservations) between kexec'ing
> kernels.
>
> KHO provides that base foundation. We will determine later whether we
> still need any of the approaches above for fast bulk memory handover of for
> example IOMMU page tables. But IMHO they would all be users of KHO, with
> KHO providing the foundational primitive to pass metadata and bulk memory
> reservations as well as provide easy versioning for data.
What you are describe in many ways is the same problem as
kexec-on-panic. The goal of leaving devices running absolutely requires
carving out memory for the new kernel to live in while it is coming up
so that DMA from a device that was not shutdown down does not stomp the
kernel coming up.
If I understand the virtualization case some of those virtual machines
are going to have virtual NICs that are going to want to DMA memory to
the host system. Which if I understand things correctly means that
among the devices you explicitly want to keep running there is a not
a way to avoid the chance of DMA coming in while the kernel is being
changed.
There is also a huge maintenance challenge associated with all of this.
If you go with something that is essentially kexec-on-panic and then
add a little bit to help find things in the memory of the previous
kernel while the new kernel is coming up I can see it as a possibility.
As an example I think preserving ftrace data of kexec seems bizarre.
I don't see how that is an interesting use case at all. Not in
the situation of preserving virtual machines, and not in the situation
of kexec on panic.
If you are doing an orderly shutdown and kernel switch you should be
able to manually change the memory. If you are not doing an orderly
shutdown then I really don't get it.
I don't hate the capability you are trying to build.
I have not read or looked at most of this so I am probably
missing subtle details.
As you are currently describing things I have the sense you have
completely misframed the problem and are trying to solve the wrong parts
of the problem.
Eric
Hey Eric,
On 14.12.23 15:58, Eric W. Biederman wrote:
> Alexander Graf <[email protected]> writes:
>
>> Kexec today considers itself purely a boot loader: When we enter the new
>> kernel, any state the previous kernel left behind is irrelevant and the
>> new kernel reinitializes the system.
>>
>> However, there are use cases where this mode of operation is not what we
>> actually want. In virtualization hosts for example, we want to use kexec
>> to update the host kernel while virtual machine memory stays untouched.
>> When we add device assignment to the mix, we also need to ensure that
>> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
>> need to do the same for the PCI subsystem. If we want to kexec while an
>> SEV-SNP enabled virtual machine is running, we need to preserve the VM
>> context pages and physical memory. See James' and my Linux Plumbers
>> Conference 2023 presentation for details:
>>
>> https://lpc.events/event/17/contributions/1485/
>>
>> To start us on the journey to support all the use cases above, this
>> patch implements basic infrastructure to allow hand over of kernel state
>> across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
>> With this patch set applied, you can read ftrace records from the
>> pre-kexec environment in your post-kexec one. This creates a very powerful
>> debugging and performance analysis tool for kexec. It's also slightly
>> easier to reason about than full blown VFIO state preservation.
>>
>> == Alternatives ==
>>
>> There are alternative approaches to (parts of) the problems above:
>>
>> * Memory Pools [1] - preallocated persistent memory region + allocator
>> * PRMEM [2] - resizable persistent memory regions with fixed metadata
>> pointer on the kernel command line + allocator
>> * Pkernfs [3] - preallocated file system for in-kernel data with fixed
>> address location on the kernel command line
>> * PKRAM [4] - handover of user space pages using a fixed metadata page
>> specified via command line
>>
>> All of the approaches above fundamentally have the same problem: They
>> require the administrator to explicitly carve out a physical memory
>> location because they have no mechanism outside of the kernel command
>> line to pass data (including memory reservations) between kexec'ing
>> kernels.
>>
>> KHO provides that base foundation. We will determine later whether we
>> still need any of the approaches above for fast bulk memory handover of for
>> example IOMMU page tables. But IMHO they would all be users of KHO, with
>> KHO providing the foundational primitive to pass metadata and bulk memory
>> reservations as well as provide easy versioning for data.
> What you are describe in many ways is the same problem as
> kexec-on-panic. The goal of leaving devices running absolutely requires
> carving out memory for the new kernel to live in while it is coming up
> so that DMA from a device that was not shutdown down does not stomp the
> kernel coming up.
Yes, part of the problem is similar: We need a safe space to boot from
that doesn't overwrite existing data. What happens after is different:
With panics, you're trying to rescue previous state for post-mortem
analysis. You may even have intrinsic knowledge of the environment you
came from, so you can optimize that rescuing. Nobody wants to continue
running the system as if nothing happened after a panic.
With KHO, the kernels establish an ABI between each other to communicate
any state that needs to get preserved and the rest gets reinitialized.
After KHO, the new kernel continues executing workloads that were
running before.
The ABI is important because the next environment may not have a chance
to know about the previous environment's setup. Think for example of
roll-out and roll-back scenarios: If I roll back into my previous
environment because I determined something didn't work as expected after
update, I'm moving the system into an environment that was built when
the kexec source environment didn't even exist yet.
> If I understand the virtualization case some of those virtual machines
> are going to have virtual NICs that are going to want to DMA memory to
> the host system. Which if I understand things correctly means that
No, to the *guest* system. This is about device assignment: The guest is
in full control of the NICs that do DMA, so we have no chance to quiesce
them.
> among the devices you explicitly want to keep running there is a not
> a way to avoid the chance of DMA coming in while the kernel is being
> changed.
Correct, because the host doesn't own the driver :).
> There is also a huge maintenance challenge associated with all of this.
>
> If you go with something that is essentially kexec-on-panic and then
> add a little bit to help find things in the memory of the previous
> kernel while the new kernel is coming up I can see it as a possibility.
That's roughly what the patch set is doing, yes. It avoids a static
allocation ahead of time for next-kernel memory, because I only know the
size of all components when we're actually doing the kexec. But the
principle is similar.
The bit where the new kernel finds bits in the old memory is the KHO DT:
A flattened device tree structure the old kernel passes to the new
kernel. That contains all memory locations as well as additional
metadata to "help find things" in a way that doesn't immediately break
on every kernel change.
> As an example I think preserving ftrace data of kexec seems bizarre.
> I don't see how that is an interesting use case at all. Not in
> the situation of preserving virtual machines, and not in the situation
> of kexec on panic.
It's super useful as self debugging aid: I already used it to profile
the kexec path to find a few performance issues :). It's also really
helpful - even without device assignment support yet - when you use it
in combination with KVM trace points: You have a VM running backed by a
DAX pmem device, then serialize its virtual device state, kexec, restore
from the virtual device state, then the VM misbehaves.
With ftrace handover in place, you get a full trace of the flow which
simplifies debugging of issues that happen during/because of the
serialization/deserialization flow of KVM state.
But the main reason I chose ftrace to start with is that all other use
cases require another concept: fd preservation. All the typical
"objects" you want to preserve across kexec are anonymous file
descriptors. So we need to also build a way in Linux that allows user
space to request the kernel to preserve an fd using the kexec handover
framework in this patch set. But that is another big discussion I wanted
to keep separate: Ftrace is from kernel, to kernel and hence "easy".
> If you are doing an orderly shutdown and kernel switch you should be
> able to manually change the memory. If you are not doing an orderly
> shutdown then I really don't get it.
I don't follow the paragraph above?
> I don't hate the capability you are trying to build.
>
> I have not read or looked at most of this so I am probably
> missing subtle details.
>
> As you are currently describing things I have the sense you have
> completely misframed the problem and are trying to solve the wrong parts
> of the problem.
Very well possible :). I hope the above clarifies it a bit. If not,
please let me know where exactly it's unclear so I can elaborate.
If you have a few minutes, it would also be great if you could have a
look at our slides [1] or even video [2] from LPC 2023 which go into
detail of the end problem. Beware that I'm consciously *not* trying to
solve the end problem yet: I want to take baby steps towards it. Nobody
wants to review an 80 patches patch set where everything depends on
everything else.
Alex
[1]
https://lpc.events/event/17/contributions/1485/attachments/1296/2650/jgowans-preserving-across-kexec.pdf
[2] https://www.youtube.com/watch?v=cYrlV4bK1Y4
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On Wed, Dec 13, 2023 at 12:04:43AM +0000, Alexander Graf wrote:
> We now have all bits in place to support KHO kexecs. This patch adds
> awareness of KHO in the kexec file as well as boot path for arm64 and
> adds the respective kconfig option to the architecture so that it can
> use KHO successfully.
>
> Signed-off-by: Alexander Graf <[email protected]>
> ---
> arch/arm64/Kconfig | 12 ++++++++++++
> arch/arm64/kernel/setup.c | 2 ++
> arch/arm64/mm/init.c | 8 ++++++++
> drivers/of/fdt.c | 41 +++++++++++++++++++++++++++++++++++++++
> drivers/of/kexec.c | 36 ++++++++++++++++++++++++++++++++++
> 5 files changed, 99 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7b071a00425d..1ba338ce7598 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1501,6 +1501,18 @@ config ARCH_SUPPORTS_CRASH_DUMP
> config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
> def_bool CRASH_CORE
>
> +config KEXEC_KHO
> + bool "kexec handover"
> + depends on KEXEC
> + select MEMBLOCK_SCRATCH
> + select LIBFDT
> + select CMA
> + help
> + Allow kexec to hand over state across kernels by generating and
> + passing additional metadata to the target kernel. This is useful
> + to keep data or state alive across the kexec. For this to work,
> + both source and target kernels need to have this option enabled.
Why do we have the same kconfig entry twice? Here and x86.
> +
> config TRANS_TABLE
> def_bool y
> depends on HIBERNATION || KEXEC_CORE
> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 417a8a86b2db..8035b673d96d 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -346,6 +346,8 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
>
> paging_init();
>
> + kho_reserve_mem();
> +
> acpi_table_upgrade();
>
> /* Parse the ACPI tables for possible boot-time configuration */
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 74c1db8ce271..254d82f3383a 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -358,6 +358,8 @@ void __init bootmem_init(void)
> */
> arch_reserve_crashkernel();
>
> + kho_reserve();
> +
reserve what? It is not obvious what the difference between
kho_reserve_mem() and kho_reserve() are.
> memblock_dump_all();
> }
>
> @@ -386,6 +388,12 @@ void __init mem_init(void)
> /* this will put all unused low memory onto the freelists */
> memblock_free_all();
>
> + /*
> + * Now that all KHO pages are marked as reserved, let's flip them back
> + * to normal pages with accurate refcount.
> + */
> + kho_populate_refcount();
> +
> /*
> * Check boundaries twice: Some fundamental inconsistencies can be
> * detected at build time already.
> diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
> index bf502ba8da95..af95139351ed 100644
> --- a/drivers/of/fdt.c
> +++ b/drivers/of/fdt.c
> @@ -1006,6 +1006,44 @@ void __init early_init_dt_check_for_usable_mem_range(void)
> memblock_add(rgn[i].base, rgn[i].size);
> }
>
> +/**
> + * early_init_dt_check_kho - Decode info required for kexec handover from DT
> + */
> +void __init early_init_dt_check_kho(void)
> +{
> +#ifdef CONFIG_KEXEC_KHO
if (!IS_ENABLED(CONFIG_KEXEC_KHO))
return;
You'll need a kho_populate() stub.
> + unsigned long node = chosen_node_offset;
> + u64 kho_start, scratch_start, scratch_size, mem_start, mem_size;
> + const __be32 *p;
> + int l;
> +
> + if ((long)node < 0)
> + return;
> +
> + p = of_get_flat_dt_prop(node, "linux,kho-dt", &l);
> + if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
> + return;
> +
> + kho_start = dt_mem_next_cell(dt_root_addr_cells, &p);
> +
> + p = of_get_flat_dt_prop(node, "linux,kho-scratch", &l);
> + if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
> + return;
> +
> + scratch_start = dt_mem_next_cell(dt_root_addr_cells, &p);
> + scratch_size = dt_mem_next_cell(dt_root_addr_cells, &p);
> +
> + p = of_get_flat_dt_prop(node, "linux,kho-mem", &l);
> + if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
> + return;
> +
> + mem_start = dt_mem_next_cell(dt_root_addr_cells, &p);
> + mem_size = dt_mem_next_cell(dt_root_addr_cells, &p);
> +
> + kho_populate(kho_start, scratch_start, scratch_size, mem_start, mem_size);
> +#endif
> +}
> +
> #ifdef CONFIG_SERIAL_EARLYCON
>
> int __init early_init_dt_scan_chosen_stdout(void)
> @@ -1304,6 +1342,9 @@ void __init early_init_dt_scan_nodes(void)
>
> /* Handle linux,usable-memory-range property */
> early_init_dt_check_for_usable_mem_range();
> +
> + /* Handle kexec handover */
> + early_init_dt_check_kho();
> }
>
> bool __init early_init_dt_scan(void *params)
> diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
> index 68278340cecf..a612e6bb8c75 100644
> --- a/drivers/of/kexec.c
> +++ b/drivers/of/kexec.c
> @@ -264,6 +264,37 @@ static inline int setup_ima_buffer(const struct kimage *image, void *fdt,
> }
> #endif /* CONFIG_IMA_KEXEC */
>
> +static int kho_add_chosen(const struct kimage *image, void *fdt, int chosen_node)
> +{
> + int ret = 0;
> +
> +#ifdef CONFIG_KEXEC_KHO
ditto
Though perhaps image->kho is not defined?
> + if (!image->kho.dt.buffer || !image->kho.mem_cache.buffer)
> + goto out;
> +
> + pr_debug("Adding kho metadata to DT");
> +
> + ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-dt",
> + image->kho.dt.mem, image->kho.dt.memsz);
> + if (ret)
> + goto out;
> +
> + ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-scratch",
> + kho_scratch_phys, kho_scratch_len);
> + if (ret)
> + goto out;
> +
> + ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-mem",
> + image->kho.mem_cache.mem,
> + image->kho.mem_cache.bufsz);
> + if (ret)
> + goto out;
> +
> +out:
> +#endif
> + return ret;
> +}
> +
> /*
> * of_kexec_alloc_and_setup_fdt - Alloc and setup a new Flattened Device Tree
> *
> @@ -412,6 +443,11 @@ void *of_kexec_alloc_and_setup_fdt(const struct kimage *image,
> }
> }
>
> + /* Add kho metadata if this is a KHO image */
> + ret = kho_add_chosen(image, fdt, chosen_node);
> + if (ret)
> + goto out;
> +
> /* add bootargs */
> if (cmdline) {
> ret = fdt_setprop_string(fdt, chosen_node, "bootargs", cmdline);
> --
> 2.40.1
>
>
>
>
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
>
>
>
Hey Rob!
On 14.12.23 23:36, Rob Herring wrote:
> On Wed, Dec 13, 2023 at 12:04:43AM +0000, Alexander Graf wrote:
>> We now have all bits in place to support KHO kexecs. This patch adds
>> awareness of KHO in the kexec file as well as boot path for arm64 and
>> adds the respective kconfig option to the architecture so that it can
>> use KHO successfully.
>>
>> Signed-off-by: Alexander Graf <[email protected]>
>> ---
>> arch/arm64/Kconfig | 12 ++++++++++++
>> arch/arm64/kernel/setup.c | 2 ++
>> arch/arm64/mm/init.c | 8 ++++++++
>> drivers/of/fdt.c | 41 +++++++++++++++++++++++++++++++++++++++
>> drivers/of/kexec.c | 36 ++++++++++++++++++++++++++++++++++
>> 5 files changed, 99 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 7b071a00425d..1ba338ce7598 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -1501,6 +1501,18 @@ config ARCH_SUPPORTS_CRASH_DUMP
>> config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
>> def_bool CRASH_CORE
>>
>> +config KEXEC_KHO
>> + bool "kexec handover"
>> + depends on KEXEC
>> + select MEMBLOCK_SCRATCH
>> + select LIBFDT
>> + select CMA
>> + help
>> + Allow kexec to hand over state across kernels by generating and
>> + passing additional metadata to the target kernel. This is useful
>> + to keep data or state alive across the kexec. For this to work,
>> + both source and target kernels need to have this option enabled.
> Why do we have the same kconfig entry twice? Here and x86.
This was how the kexec config options were done when I wrote the patches
originally. Since then, looks like Eric DeVolder has cleaned up things
quite nicely. I'll adapt the new way.
>
>> +
>> config TRANS_TABLE
>> def_bool y
>> depends on HIBERNATION || KEXEC_CORE
>> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
>> index 417a8a86b2db..8035b673d96d 100644
>> --- a/arch/arm64/kernel/setup.c
>> +++ b/arch/arm64/kernel/setup.c
>> @@ -346,6 +346,8 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
>>
>> paging_init();
>>
>> + kho_reserve_mem();
>> +
>> acpi_table_upgrade();
>>
>> /* Parse the ACPI tables for possible boot-time configuration */
>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>> index 74c1db8ce271..254d82f3383a 100644
>> --- a/arch/arm64/mm/init.c
>> +++ b/arch/arm64/mm/init.c
>> @@ -358,6 +358,8 @@ void __init bootmem_init(void)
>> */
>> arch_reserve_crashkernel();
>>
>> + kho_reserve();
>> +
> reserve what? It is not obvious what the difference between
> kho_reserve_mem() and kho_reserve() are.
Yeah, I agree. I was struggling to find good names for them. What they
do is:
kho_reserve() - Reserve CMA memory for later kexec. We use this memory
region as scratch memory later.
kho_reserve_mem() - Post-KHO. Creates memory reservations inside
memblocks for pre-KHO handed over memory.
For v2, I'll change them to kho_reserve_scratch() and
kho_reserve_previous_mem() unless you have better ideas :)
>
>> memblock_dump_all();
>> }
>>
>> @@ -386,6 +388,12 @@ void __init mem_init(void)
>> /* this will put all unused low memory onto the freelists */
>> memblock_free_all();
>>
>> + /*
>> + * Now that all KHO pages are marked as reserved, let's flip them back
>> + * to normal pages with accurate refcount.
>> + */
>> + kho_populate_refcount();
>> +
>> /*
>> * Check boundaries twice: Some fundamental inconsistencies can be
>> * detected at build time already.
>> diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
>> index bf502ba8da95..af95139351ed 100644
>> --- a/drivers/of/fdt.c
>> +++ b/drivers/of/fdt.c
>> @@ -1006,6 +1006,44 @@ void __init early_init_dt_check_for_usable_mem_range(void)
>> memblock_add(rgn[i].base, rgn[i].size);
>> }
>>
>> +/**
>> + * early_init_dt_check_kho - Decode info required for kexec handover from DT
>> + */
>> +void __init early_init_dt_check_kho(void)
>> +{
>> +#ifdef CONFIG_KEXEC_KHO
> if (!IS_ENABLED(CONFIG_KEXEC_KHO))
> return;
>
> You'll need a kho_populate() stub.
Always happy to remove #ifdefs :)
>
>> + unsigned long node = chosen_node_offset;
>> + u64 kho_start, scratch_start, scratch_size, mem_start, mem_size;
>> + const __be32 *p;
>> + int l;
>> +
>> + if ((long)node < 0)
>> + return;
>> +
>> + p = of_get_flat_dt_prop(node, "linux,kho-dt", &l);
>> + if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
>> + return;
>> +
>> + kho_start = dt_mem_next_cell(dt_root_addr_cells, &p);
>> +
>> + p = of_get_flat_dt_prop(node, "linux,kho-scratch", &l);
>> + if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
>> + return;
>> +
>> + scratch_start = dt_mem_next_cell(dt_root_addr_cells, &p);
>> + scratch_size = dt_mem_next_cell(dt_root_addr_cells, &p);
>> +
>> + p = of_get_flat_dt_prop(node, "linux,kho-mem", &l);
>> + if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
>> + return;
>> +
>> + mem_start = dt_mem_next_cell(dt_root_addr_cells, &p);
>> + mem_size = dt_mem_next_cell(dt_root_addr_cells, &p);
>> +
>> + kho_populate(kho_start, scratch_start, scratch_size, mem_start, mem_size);
>> +#endif
>> +}
>> +
>> #ifdef CONFIG_SERIAL_EARLYCON
>>
>> int __init early_init_dt_scan_chosen_stdout(void)
>> @@ -1304,6 +1342,9 @@ void __init early_init_dt_scan_nodes(void)
>>
>> /* Handle linux,usable-memory-range property */
>> early_init_dt_check_for_usable_mem_range();
>> +
>> + /* Handle kexec handover */
>> + early_init_dt_check_kho();
>> }
>>
>> bool __init early_init_dt_scan(void *params)
>> diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
>> index 68278340cecf..a612e6bb8c75 100644
>> --- a/drivers/of/kexec.c
>> +++ b/drivers/of/kexec.c
>> @@ -264,6 +264,37 @@ static inline int setup_ima_buffer(const struct kimage *image, void *fdt,
>> }
>> #endif /* CONFIG_IMA_KEXEC */
>>
>> +static int kho_add_chosen(const struct kimage *image, void *fdt, int chosen_node)
>> +{
>> + int ret = 0;
>> +
>> +#ifdef CONFIG_KEXEC_KHO
> ditto
>
> Though perhaps image->kho is not defined?
Correct, it is not. But I'm happy to have a few local variables that I
stash the image->kho contents inside an ifdef into so we can at least
compile check all libfdt invocations.
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879