LinuxLists.cc - [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers

2024-01-17 14:47:35

by Alexander Graf

Subject: [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers

Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.

However, there are use cases where this mode of operation is not what we
actually want. In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched.
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem. If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory. See James' and my Linux Plumbers
Conference 2023 presentation for details:

https://lpc.events/event/17/contributions/1485/

To start us on the journey to support all the use cases above, this
patch set implements basic infrastructure to allow handover of kernel state
across kexec (Kexec HandOver, aka KHO). As an example target, we use ftrace:
With this patch set applied, you can read ftrace records from the
pre-kexec environment in your post-kexec one. This creates a very powerful
debugging and performance analysis tool for kexec. It's also slightly
easier to reason about than full blown VFIO state preservation.

== Alternatives ==

There are alternative approaches to (parts of) the problems above:

* Memory Pools [1] - preallocated persistent memory region + allocator
* PRMEM [2] - resizable persistent memory regions with fixed metadata
pointer on the kernel command line + allocator
* Pkernfs [3] - preallocated file system for in-kernel data with fixed
address location on the kernel command line
* PKRAM [4] - handover of user space pages using a fixed metadata page
specified via command line

All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command
line to pass data (including memory reservations) between kexec'ing
kernels.

KHO provides that base foundation. We will determine later whether we
still need any of the approaches above for fast bulk memory handover of,
for example, IOMMU page tables. But IMHO they would all be users of KHO,
with KHO providing the foundational primitive to pass metadata and bulk
memory reservations and to provide easy versioning for data.

== Overview ==

We introduce a metadata file that the kernels pass between each other. How
they pass it is architecture specific. The file's format is a Flattened
Device Tree (fdt) which has a generator and parser already included in
Linux. When the root user enables KHO through /sys/kernel/kho/active, the
kernel invokes callbacks to every driver that supports KHO to serialize
its state. When the actual kexec happens, the fdt is part of the image
set that we boot into. In addition, we keep a "scratch region" available
for kexec: A physically contiguous memory region that is guaranteed to
not have any memory that KHO would preserve. The new kernel bootstraps
itself using the scratch region and sets all handed over memory as in use.
When drivers that support KHO initialize, they inspect the fdt and
recover their state from it. This includes memory reservations, where the
driver can either discard or claim reservations.
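
The scratch region guarantee described above can be stated as a simple invariant: no preserved range may overlap the scratch region. The following is a minimal user-space sketch of that check, not code from this series. The names struct range, ranges_overlap() and scratch_is_clean() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a physical range (address + length). */
struct range {
	uint64_t addr;
	uint64_t len;
};

/* True if the non-empty half-open ranges [addr, addr + len) intersect. */
static bool ranges_overlap(const struct range *a, const struct range *b)
{
	/* Guard against wrap-around in addr + len. */
	assert(a->addr + a->len >= a->addr);
	assert(b->addr + b->len >= b->addr);

	return a->addr < b->addr + b->len && b->addr < a->addr + a->len;
}

/* The scratch region is usable only if no preserved range touches it. */
static bool scratch_is_clean(const struct range *scratch,
			     const struct range *preserved, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (ranges_overlap(scratch, &preserved[i]))
			return false;
	}
	return true;
}
```

With a 512M scratch region at 0x40000000, a preserved range that starts exactly at 0x60000000 is fine, while one that starts a page earlier and crosses that boundary violates the invariant.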

== Limitations ==

I currently only implemented file based kexec. The kernel interfaces
in the patch set are already in place to support user space kexec as well,
but I have not implemented it yet inside kexec tools.

== How to Use ==

To use the code, please boot the kernel with the "kho_scratch=" command
line parameter set: "kho_scratch=512M". KHO requires a scratch region.

Make sure to fill ftrace with contents that you want to observe after
kexec. Then, before you invoke file based "kexec -l", activate KHO:

# echo 1 > /sys/kernel/kho/active
# kexec -l Image --initrd=initrd -s
# kexec -e

The new kernel will boot up and contain the previous kernel's trace
buffers in /sys/kernel/debug/tracing/trace.

== Changelog ==

v1 -> v2:
- Removed: tracing: Introduce names for ring buffers
- Removed: tracing: Introduce names for events
- New: kexec: Add config option for KHO
- New: kexec: Add documentation for KHO
- New: tracing: Initialize fields before registering
- New: devicetree: Add bindings for ftrace KHO
- test bot warning fixes
- Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
- s/kho_reserve_mem/kho_reserve_previous_mem/g
- s/kho_reserve/kho_reserve_scratch/g
- Remove / reduce ifdefs
- Select crc32
- Leave anything that requires a name in trace.c to keep buffers
unnamed entities
- Put events as array into a property, use fingerprint instead of
names to identify them
- Reduce footprint without CONFIG_FTRACE_KHO
- s/kho_reserve_mem/kho_reserve_previous_mem/g
- make kho_get_fdt() const
- Add stubs for return_mem and claim_mem
- make kho_get_fdt() const
- Get events as array from a property, use fingerprint instead of
names to identify events
- Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
- s/kho_reserve_mem/kho_reserve_previous_mem/g
- s/kho_reserve/kho_reserve_scratch/g
- Leave the node generation code that needs to know the name in
trace.c so that ring buffers can stay anonymous
- s/kho_reserve/kho_reserve_scratch/g
- Move kho enums out of ifdef
- Move from names to fdt offsets. That way, trace.c can find the trace
array offset and then the ring buffer code only needs to read out
its per-CPU data. That way it can stay oblivious to its name.
- Make kho_get_fdt() const

v2 -> v3:

- Fix make dt_binding_check
- Add descriptions for each object
- s/trace_flags/trace-flags/
- s/global_trace/global-trace/
- Make all additionalProperties false
- Change subject to reflect subsystem (dt-bindings)
- Fix indentation
- Remove superfluous examples
- Convert to 64bit syntax
- Move to kho directory
- s/"global_trace"/"global-trace"/
- s/"global_trace"/"global-trace"/
- s/"trace_flags"/"trace-flags"/
- Fix wording
- Add Documentation to MAINTAINERS file
- Remove kho reference on read error
- Move handover_dt unmap up
- s/reserve_scratch_mem/mark_phys_as_cma/
- Remove ifdeffery
- Remove superfluous comment

Alexander Graf (17):
mm,memblock: Add support for scratch memory
memblock: Declare scratch memory as CMA
kexec: Add Kexec HandOver (KHO) generation helpers
kexec: Add KHO parsing support
kexec: Add KHO support to kexec file loads
kexec: Add config option for KHO
kexec: Add documentation for KHO
arm64: Add KHO support
x86: Add KHO support
tracing: Initialize fields before registering
tracing: Introduce kho serialization
tracing: Add kho serialization of trace buffers
tracing: Recover trace buffers from kexec handover
tracing: Add kho serialization of trace events
tracing: Recover trace events from kexec handover
tracing: Add config option for kexec handover
Documentation: KHO: Add ftrace bindings

Documentation/ABI/testing/sysfs-firmware-kho | 9 +
Documentation/ABI/testing/sysfs-kernel-kho | 53 ++
.../admin-guide/kernel-parameters.txt | 10 +
.../kho/bindings/ftrace/ftrace-array.yaml | 38 ++
.../kho/bindings/ftrace/ftrace-cpu.yaml | 43 ++
Documentation/kho/bindings/ftrace/ftrace.yaml | 62 +++
Documentation/kho/concepts.rst | 88 +++
Documentation/kho/index.rst | 19 +
Documentation/kho/usage.rst | 57 ++
Documentation/subsystem-apis.rst | 1 +
MAINTAINERS | 3 +
arch/arm64/Kconfig | 3 +
arch/arm64/kernel/setup.c | 2 +
arch/arm64/mm/init.c | 8 +
arch/x86/Kconfig | 3 +
arch/x86/boot/compressed/kaslr.c | 55 ++
arch/x86/include/uapi/asm/bootparam.h | 15 +-
arch/x86/kernel/e820.c | 9 +
arch/x86/kernel/kexec-bzimage64.c | 39 ++
arch/x86/kernel/setup.c | 46 ++
arch/x86/mm/init_32.c | 7 +
arch/x86/mm/init_64.c | 7 +
drivers/of/fdt.c | 39 ++
drivers/of/kexec.c | 54 ++
include/linux/kexec.h | 58 ++
include/linux/memblock.h | 19 +
include/linux/ring_buffer.h | 17 +-
include/linux/trace_events.h | 1 +
include/uapi/linux/kexec.h | 6 +
kernel/Kconfig.kexec | 13 +
kernel/Makefile | 2 +
kernel/kexec_file.c | 41 ++
kernel/kexec_kho_in.c | 298 ++++++++++
kernel/kexec_kho_out.c | 526 ++++++++++++++++++
kernel/trace/Kconfig | 14 +
kernel/trace/ring_buffer.c | 243 +++++++-
kernel/trace/trace.c | 96 +++-
kernel/trace/trace_events.c | 14 +-
kernel/trace/trace_events_synth.c | 14 +-
kernel/trace/trace_events_user.c | 4 +
kernel/trace/trace_output.c | 247 +++++++-
kernel/trace/trace_output.h | 5 +
kernel/trace/trace_probe.c | 4 +
mm/Kconfig | 4 +
mm/memblock.c | 79 ++-
45 files changed, 2351 insertions(+), 24 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-firmware-kho
create mode 100644 Documentation/ABI/testing/sysfs-kernel-kho
create mode 100644 Documentation/kho/bindings/ftrace/ftrace-array.yaml
create mode 100644 Documentation/kho/bindings/ftrace/ftrace-cpu.yaml
create mode 100644 Documentation/kho/bindings/ftrace/ftrace.yaml
create mode 100644 Documentation/kho/concepts.rst
create mode 100644 Documentation/kho/index.rst
create mode 100644 Documentation/kho/usage.rst
create mode 100644 kernel/kexec_kho_in.c
create mode 100644 kernel/kexec_kho_out.c

--
2.40.1

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879

2024-01-17 14:48:09

by Alexander Graf

Subject: [PATCH v3 01/17] mm,memblock: Add support for scratch memory

With KHO (Kexec HandOver), we need a way to ensure that the new kernel
does not allocate memory on top of any memory regions that the previous
kernel was handing over. But to know where those are, we need to include
them in the reserved memblocks array which may not be big enough to hold
all allocations. To resize the array, we need to allocate memory. That
puts us in a catch-22 situation.

The solution to that is the scratch region: a safe region to operate in.
KHO provides a "scratch region" as part of its metadata. This scratch
region is a single, contiguous memory block that we know does not
contain any KHO allocations. We allocate exclusively from there until
kernel initialization reaches the point where it knows about all the
KHO memory reservations. We introduce a new memblock_set_scratch_only()
function that allows KHO to indicate that any memblock allocation must
happen from the scratch region.

Later, we may want to perform another KHO kexec. For that, we reuse the
same scratch region. To ensure that no data that will later be handed over
gets allocated inside that scratch region, we flip the semantics of the
scratch region with memblock_clear_scratch_only(): After that call, no
allocations may happen from scratch memblock regions. We will lift that
restriction in the next patch.
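
The two phases described above can be modeled in user space. This is a simplified sketch under stated assumptions, not the memblock implementation: struct region, skip_region() and pick_region() are hypothetical names, the mirror-memory handling of choose_memblock_flags() is left out, and only the two scratch checks that this patch adds to should_skip_region() are reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the memblock flags used in the patch. */
enum region_flags {
	REGION_NONE    = 0x0,
	REGION_SCRATCH = 0x20,	/* mirrors MEMBLOCK_SCRATCH */
};

struct region {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

/* Models the switch toggled by memblock_{set,clear}_scratch_only(). */
static bool scratch_only;

/* The same two scratch checks that this patch adds to should_skip_region(). */
static bool skip_region(const struct region *r, unsigned int want)
{
	bool is_scratch = r->flags & REGION_SCRATCH;

	if ((want & REGION_SCRATCH) && !is_scratch)
		return true;	/* scratch-only phase: ignore normal memory */
	if (!(want & REGION_SCRATCH) && is_scratch)
		return true;	/* afterwards: leave scratch memory alone */
	return false;
}

/* Return the index of the first region an allocation may use, or -1. */
static int pick_region(const struct region *regions, size_t n, uint64_t size)
{
	unsigned int want = scratch_only ? REGION_SCRATCH : REGION_NONE;

	assert(size > 0);
	for (size_t i = 0; i < n; i++) {
		if (!skip_region(&regions[i], want) && regions[i].size >= size)
			return (int)i;
	}
	return -1;
}
```

In this model, an allocation made while scratch_only is true can only land in the scratch region, and once the flag is cleared the same request is served from normal memory instead.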

Signed-off-by: Alexander Graf <[email protected]>
---
include/linux/memblock.h | 19 +++++++++++++
mm/Kconfig | 4 +++
mm/memblock.c | 61 +++++++++++++++++++++++++++++++++++++++-
3 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index b695f9e946da..7e9788f05dea 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,6 +42,10 @@ extern unsigned long long max_possible_pfn;
* kernel resource tree.
* @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
* not initialized (only for reserved regions).
+ * @MEMBLOCK_SCRATCH: memory region that kexec can pass to the next kernel in
+ * handover mode. During early boot, we do not know about all memory reservations
+ * yet, so we get scratch memory from the previous kernel that we know is good
+ * to use. It is the only memory that allocations may happen from in this phase.
*/
enum memblock_flags {
MEMBLOCK_NONE = 0x0, /* No special request */
@@ -50,6 +54,7 @@ enum memblock_flags {
MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */
+ MEMBLOCK_SCRATCH = 0x20, /* scratch memory for kexec handover */
};

/**
@@ -130,6 +135,8 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
int memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size);
+int memblock_mark_scratch(phys_addr_t base, phys_addr_t size);
+int memblock_clear_scratch(phys_addr_t base, phys_addr_t size);

void memblock_free_all(void);
void memblock_free(void *ptr, size_t size);
@@ -274,6 +281,11 @@ static inline bool memblock_is_driver_managed(struct memblock_region *m)
return m->flags & MEMBLOCK_DRIVER_MANAGED;
}

+static inline bool memblock_is_scratch(struct memblock_region *m)
+{
+ return m->flags & MEMBLOCK_SCRATCH;
+}
+
int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
unsigned long *end_pfn);
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
@@ -611,5 +623,12 @@ static inline void early_memtest(phys_addr_t start, phys_addr_t end) { }
static inline void memtest_report_meminfo(struct seq_file *m) { }
#endif

+#ifdef CONFIG_MEMBLOCK_SCRATCH
+void memblock_set_scratch_only(void);
+void memblock_clear_scratch_only(void);
+#else
+static inline void memblock_set_scratch_only(void) { }
+static inline void memblock_clear_scratch_only(void) { }
+#endif

#endif /* _LINUX_MEMBLOCK_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 1902cfe4cc4f..6cd5e16203ba 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -489,6 +489,10 @@ config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
config HAVE_MEMBLOCK_PHYS_MAP
bool

+# Enable memblock support for scratch memory which is needed for KHO
+config MEMBLOCK_SCRATCH
+ bool
+
config HAVE_FAST_GUP
depends on MMU
bool
diff --git a/mm/memblock.c b/mm/memblock.c
index 8c194d8afeec..fbb98981a202 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -106,6 +106,13 @@ unsigned long min_low_pfn;
unsigned long max_pfn;
unsigned long long max_possible_pfn;

+#ifdef CONFIG_MEMBLOCK_SCRATCH
+/* When set to true, only allocate from MEMBLOCK_SCRATCH ranges */
+static bool scratch_only;
+#else
+#define scratch_only false
+#endif
+
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_MEMORY_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_RESERVED_REGIONS] __initdata_memblock;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
@@ -168,6 +175,10 @@ bool __init_memblock memblock_has_mirror(void)

static enum memblock_flags __init_memblock choose_memblock_flags(void)
{
+ /* skip non-scratch memory for kho early boot allocations */
+ if (scratch_only)
+ return MEMBLOCK_SCRATCH;
+
return system_has_some_mirror ? MEMBLOCK_MIRROR : MEMBLOCK_NONE;
}

@@ -643,7 +654,7 @@ static int __init_memblock memblock_add_range(struct memblock_type *type,
#ifdef CONFIG_NUMA
WARN_ON(nid != memblock_get_region_node(rgn));
#endif
- WARN_ON(flags != rgn->flags);
+ WARN_ON(flags != (rgn->flags & ~MEMBLOCK_SCRATCH));
nr_new++;
if (insert) {
if (start_rgn == -1)
@@ -924,6 +935,18 @@ int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size)
}
#endif

+#ifdef CONFIG_MEMBLOCK_SCRATCH
+__init_memblock void memblock_set_scratch_only(void)
+{
+ scratch_only = true;
+}
+
+__init_memblock void memblock_clear_scratch_only(void)
+{
+ scratch_only = false;
+}
+#endif
+
/**
* memblock_setclr_flag - set or clear flag for a memory region
* @type: memblock type to set/clear flag for
@@ -1049,6 +1072,33 @@ int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t
MEMBLOCK_RSRV_NOINIT);
}

+/**
+ * memblock_mark_scratch - Mark a memory region with flag MEMBLOCK_SCRATCH.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Only memory regions marked with %MEMBLOCK_SCRATCH will be considered for
+ * allocations during early boot with kexec handover.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_scratch(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(&memblock.memory, base, size, 1, MEMBLOCK_SCRATCH);
+}
+
+/**
+ * memblock_clear_scratch - Clear flag MEMBLOCK_SCRATCH for a specified region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_clear_scratch(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(&memblock.memory, base, size, 0, MEMBLOCK_SCRATCH);
+}
+
static bool should_skip_region(struct memblock_type *type,
struct memblock_region *m,
int nid, int flags)
@@ -1080,6 +1130,14 @@ static bool should_skip_region(struct memblock_type *type,
if (!(flags & MEMBLOCK_DRIVER_MANAGED) && memblock_is_driver_managed(m))
return true;

+ /* In early alloc during kho, we can only consider scratch allocations */
+ if ((flags & MEMBLOCK_SCRATCH) && !memblock_is_scratch(m))
+ return true;
+
+ /* Leave scratch memory alone after scratch-only phase */
+ if (!(flags & MEMBLOCK_SCRATCH) && memblock_is_scratch(m))
+ return true;
+
return false;
}

@@ -2246,6 +2304,7 @@ static const char * const flagname[] = {
[ilog2(MEMBLOCK_MIRROR)] = "MIRROR",
[ilog2(MEMBLOCK_NOMAP)] = "NOMAP",
[ilog2(MEMBLOCK_DRIVER_MANAGED)] = "DRV_MNG",
+ [ilog2(MEMBLOCK_SCRATCH)] = "SCRATCH",
};

static int memblock_debug_show(struct seq_file *m, void *private)
--
2.40.1

2024-01-17 14:48:35

by Alexander Graf

Subject: [PATCH v3 04/17] kexec: Add KHO parsing support

When we have a KHO kexec, we get a device tree, mem cache and scratch
region to populate the state of the system. Provide helper functions
that allow architecture code to easily handle memory reservations based
on them and give device drivers visibility into the KHO DT and memory
reservations so they can recover their own state.
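
The page lifetime that the helpers in this patch aim for can be illustrated with a toy user-space model: every handed over range holds one reference on each page it touches, and a page becomes free once the last consumer returns it. This is a sketch under stated assumptions, not the kernel code. struct toy_mem only mirrors the addr and len fields that the patch reads from struct kho_mem, toy_populate() and toy_return() are hypothetical names, and the model starts each counter at zero instead of offsetting an initial reference as kho_populate_refcount() does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12			/* assumes 4 KiB pages */
#define NR_PAGES   64			/* size of the toy physical memory */
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
#define PFN_UP(x)   (((x) + (1ULL << PAGE_SHIFT) - 1) >> PAGE_SHIFT)

/* Hypothetical mirror of the addr/len pair carried in a "mem" property. */
struct toy_mem {
	uint64_t addr;
	uint64_t len;
};

static unsigned int refcount[NR_PAGES];	/* one counter per toy page */

/* Take one reference on every page the range touches. */
static void toy_populate(const struct toy_mem *mem)
{
	uint64_t end_pfn = PFN_UP(mem->addr + mem->len);

	assert(end_pfn <= NR_PAGES);
	for (uint64_t pfn = PFN_DOWN(mem->addr); pfn < end_pfn; pfn++)
		refcount[pfn]++;
}

/* Drop one reference per page and return how many pages became free. */
static size_t toy_return(const struct toy_mem *mem)
{
	uint64_t end_pfn = PFN_UP(mem->addr + mem->len);
	size_t freed = 0;

	assert(end_pfn <= NR_PAGES);
	for (uint64_t pfn = PFN_DOWN(mem->addr); pfn < end_pfn; pfn++) {
		assert(refcount[pfn] > 0);
		if (--refcount[pfn] == 0)
			freed++;
	}
	return freed;
}
```

Two ranges that share a page show the point of the counting: returning the first range frees only the pages it owns exclusively, and the shared page is freed when the second range is returned.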

Signed-off-by: Alexander Graf <[email protected]>

---

v1 -> v2:

- s/kho_reserve_mem/kho_reserve_previous_mem/g
- make kho_get_fdt() const
- Add stubs for return_mem and claim_mem

v2 -> v3:

- Remove kho reference on read error
- Move handover_dt unmap up
---
Documentation/ABI/testing/sysfs-firmware-kho | 9 +
MAINTAINERS | 1 +
include/linux/kexec.h | 27 +-
kernel/Makefile | 1 +
kernel/kexec_kho_in.c | 298 +++++++++++++++++++
5 files changed, 335 insertions(+), 1 deletion(-)
create mode 100644 Documentation/ABI/testing/sysfs-firmware-kho
create mode 100644 kernel/kexec_kho_in.c

diff --git a/Documentation/ABI/testing/sysfs-firmware-kho b/Documentation/ABI/testing/sysfs-firmware-kho
new file mode 100644
index 000000000000..e4ed2cb7c810
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-firmware-kho
@@ -0,0 +1,9 @@
+What: /sys/firmware/kho/dt
+Date: December 2023
+Contact: Alexander Graf <[email protected]>
+Description:
+ When the kernel was booted with Kexec HandOver (KHO),
+ the device tree that carries metadata about the previous
+ kernel's state is in this file. This file may disappear
+ when all consumers of it have finished interpreting their
+ metadata.
diff --git a/MAINTAINERS b/MAINTAINERS
index 6ec4be8874b9..88bf6730d801 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11824,6 +11824,7 @@ M: Eric Biederman <[email protected]>
L: [email protected]
S: Maintained
W: http://kernel.org/pub/linux/utils/kernel/kexec/
+F: Documentation/ABI/testing/sysfs-firmware-kho
F: Documentation/ABI/testing/sysfs-kernel-kho
F: include/linux/kexec.h
F: include/uapi/linux/kexec.h
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 19ffc00b5e7b..eabf9536466a 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -535,13 +535,38 @@ enum kho_event {
extern phys_addr_t kho_scratch_phys;
extern phys_addr_t kho_scratch_len;

+/* ingest handover metadata */
+void kho_reserve_previous_mem(void);
+void kho_populate(phys_addr_t dt_phys, phys_addr_t scratch_phys, u64 scratch_len,
+ phys_addr_t mem_phys, u64 mem_len);
+void kho_populate_refcount(void);
+const void *kho_get_fdt(void);
+void kho_return_mem(const struct kho_mem *mem);
+void *kho_claim_mem(const struct kho_mem *mem);
+static inline bool is_kho_boot(void)
+{
+ return !!kho_scratch_phys;
+}
+
/* egest handover metadata */
void kho_reserve_scratch(void);
int register_kho_notifier(struct notifier_block *nb);
int unregister_kho_notifier(struct notifier_block *nb);
bool kho_is_active(void);
#else
-static inline void kho_reserve_scratch(void) {}
+/* ingest handover metadata */
+static inline void kho_reserve_previous_mem(void) { }
+static inline void kho_populate(phys_addr_t dt_phys, phys_addr_t scratch_phys,
+ u64 scratch_len, phys_addr_t mem_phys,
+ u64 mem_len) { }
+static inline void kho_populate_refcount(void) { }
+static inline const void *kho_get_fdt(void) { return NULL; }
+static inline void kho_return_mem(const struct kho_mem *mem) { }
+static inline void *kho_claim_mem(const struct kho_mem *mem) { return NULL; }
+static inline bool is_kho_boot(void) { return false; }
+
+/* egest handover metadata */
+static inline void kho_reserve_scratch(void) { }
static inline int register_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
static inline int unregister_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
static inline bool kho_is_active(void) { return false; }
diff --git a/kernel/Makefile b/kernel/Makefile
index b182b7b4e7d1..5edf37b5b5cb 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -73,6 +73,7 @@ obj-$(CONFIG_KEXEC_CORE) += kexec_core.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
+obj-$(CONFIG_KEXEC_KHO) += kexec_kho_in.o
obj-$(CONFIG_KEXEC_KHO) += kexec_kho_out.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
diff --git a/kernel/kexec_kho_in.c b/kernel/kexec_kho_in.c
new file mode 100644
index 000000000000..3f498952a8ea
--- /dev/null
+++ b/kernel/kexec_kho_in.c
@@ -0,0 +1,298 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kexec_kho_in.c - kexec handover code to ingest metadata.
+ * Copyright (C) 2023 Alexander Graf <[email protected]>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kexec.h>
+#include <linux/device.h>
+#include <linux/compiler.h>
+#include <linux/io.h>
+#include <linux/kmsg_dump.h>
+#include <linux/memblock.h>
+
+/* The kho dt during runtime */
+static void *fdt;
+
+/* Globals to hand over phys/len from early to runtime */
+static phys_addr_t handover_phys __initdata;
+static u32 handover_len __initdata;
+
+static phys_addr_t mem_phys __initdata;
+static u32 mem_len __initdata;
+
+phys_addr_t kho_scratch_phys;
+phys_addr_t kho_scratch_len;
+
+const void *kho_get_fdt(void)
+{
+ return fdt;
+}
+EXPORT_SYMBOL_GPL(kho_get_fdt);
+
+/**
+ * kho_populate_refcount - Scan the DT for any memory ranges. Increase the
+ * affected pages' refcount by 1 for each.
+ */
+__init void kho_populate_refcount(void)
+{
+ const void *fdt = kho_get_fdt();
+ void *mem_virt = __va(mem_phys);
+ int offset = 0, depth = 0, initial_depth = 0, len;
+
+ if (!fdt)
+ return;
+
+ /* Go through the mem list and add 1 for each reference */
+ for (offset = 0;
+ offset >= 0 && depth >= initial_depth;
+ offset = fdt_next_node(fdt, offset, &depth)) {
+ const struct kho_mem *mems;
+ u32 i;
+
+ mems = fdt_getprop(fdt, offset, "mem", &len);
+ if (!mems || len & (sizeof(*mems) - 1))
+ continue;
+
+ for (i = 0; i < len; i += sizeof(*mems)) {
+ const struct kho_mem *mem = ((void *)mems) + i;
+ u64 start_pfn = PFN_DOWN(mem->addr);
+ u64 end_pfn = PFN_UP(mem->addr + mem->len);
+ u64 pfn;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++)
+ get_page(pfn_to_page(pfn));
+ }
+ }
+
+ /*
+ * Then reduce the reference count by 1 to offset the initial ref count
+ * of 1. In addition, unreserve the page. That way, we can free_page()
+ * it for every consumer and automatically free it to the global memory
+ * pool when everyone is done.
+ */
+ for (offset = 0; offset < mem_len; offset += sizeof(struct kho_mem)) {
+ struct kho_mem *mem = mem_virt + offset;
+ u64 start_pfn = PFN_DOWN(mem->addr);
+ u64 end_pfn = PFN_UP(mem->addr + mem->len);
+ u64 pfn;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ struct page *page = pfn_to_page(pfn);
+
+ /*
+ * This is similar to free_reserved_page(), but
+ * preserves the reference count
+ */
+ ClearPageReserved(page);
+ __free_page(page);
+ adjust_managed_page_count(page, 1);
+ }
+ }
+}
+
+static void kho_return_pfn(ulong pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+
+ if (WARN_ON(!page))
+ return;
+ __free_page(page);
+}
+
+/**
+ * kho_return_mem - Notify the kernel that initially reserved memory is no
+ * longer needed. When the last consumer of a page returns their mem, kho
+ * returns the page to the buddy allocator as free page.
+ */
+void kho_return_mem(const struct kho_mem *mem)
+{
+ uint64_t start_pfn, end_pfn, pfn;
+
+ start_pfn = PFN_DOWN(mem->addr);
+ end_pfn = PFN_UP(mem->addr + mem->len);
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++)
+ kho_return_pfn(pfn);
+}
+EXPORT_SYMBOL_GPL(kho_return_mem);
+
+static void kho_claim_pfn(ulong pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+
+ WARN_ON(!page);
+ if (WARN_ON(page_count(page) != 1))
+ pr_err("Claimed non kho pfn %lx", pfn);
+}
+
+/**
+ * kho_claim_mem - Notify the kernel that a handed over memory range is now in
+ * use by a kernel subsystem and considered an allocated page. This function
+ * removes the reserved state for all pages that the mem spans.
+ */
+void *kho_claim_mem(const struct kho_mem *mem)
+{
+ u64 start_pfn, end_pfn, pfn;
+ void *va = __va(mem->addr);
+
+ start_pfn = PFN_DOWN(mem->addr);
+ end_pfn = PFN_UP(mem->addr + mem->len);
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn++)
+ kho_claim_pfn(pfn);
+
+ return va;
+}
+EXPORT_SYMBOL_GPL(kho_claim_mem);
+
+/**
+ * kho_reserve_previous_mem - Adds all memory reservations into memblocks
+ * and moves us out of the scratch only phase. Must be called after page tables
+ * are initialized and memblock_allow_resize().
+ */
+void __init kho_reserve_previous_mem(void)
+{
+ void *mem_virt = __va(mem_phys);
+ int off, err = 0;
+
+ if (!handover_phys || !mem_phys)
+ return;
+
+ /*
+ * We reached here because we are running inside a working linear map
+ * that allows us to resize memblocks dynamically. Use the chance and
+ * populate the global fdt pointer
+ */
+ fdt = __va(handover_phys);
+
+ off = fdt_path_offset(fdt, "/");
+ if (off < 0)
+ err = -EINVAL;
+
+ if (fdt)
+ err |= fdt_node_check_compatible(fdt, off, "kho-v1");
+
+ if (err) {
+ pr_warn("KHO invalid, disabling.");
+ fdt = NULL;
+ } else {
+ /* Populate all preserved memory areas as reserved */
+ for (off = 0; off < mem_len; off += sizeof(struct kho_mem)) {
+ struct kho_mem *mem = mem_virt + off;
+
+ memblock_reserve(mem->addr, mem->len);
+ }
+ }
+
+ /* Unreserve the mem cache - we don't need it from here on */
+ memblock_phys_free(mem_phys, mem_len);
+
+ /*
+ * Now we know about all memory reservations, release the scratch only
+ * constraint and allow normal allocations from the scratch region.
+ */
+ memblock_clear_scratch_only();
+}
+
+/* Handling for /sys/firmware/kho */
+static struct kobject *kho_kobj;
+
+static ssize_t raw_read(struct file *file, struct kobject *kobj,
+ struct bin_attribute *attr, char *buf,
+ loff_t pos, size_t count)
+{
+ memcpy(buf, attr->private + pos, count);
+ return count;
+}
+
+static BIN_ATTR(dt, 0400, raw_read, NULL, 0);
+
+static __init int kho_in_init(void)
+{
+ int ret = 0;
+
+ if (!fdt)
+ return 0;
+
+ kho_kobj = kobject_create_and_add("kho", firmware_kobj);
+ if (!kho_kobj) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ bin_attr_dt.size = fdt_totalsize(fdt);
+ bin_attr_dt.private = fdt;
+ ret = sysfs_create_bin_file(kho_kobj, &bin_attr_dt);
+ if (ret)
+ goto err;
+
+err:
+ return ret;
+}
+subsys_initcall(kho_in_init);
+
+void __init kho_populate(phys_addr_t handover_dt_phys, phys_addr_t scratch_phys,
+ u64 scratch_len, phys_addr_t mem_cache_phys,
+ u64 mem_cache_len)
+{
+ void *handover_dt;
+
+ /* Determine the real size of the DT */
+ handover_dt = early_memremap(handover_dt_phys, sizeof(struct fdt_header));
+ if (!handover_dt) {
+ pr_warn("setup: failed to memremap kexec FDT (0x%llx)\n", handover_dt_phys);
+ return;
+ }
+
+ if (fdt_check_header(handover_dt)) {
+ pr_warn("setup: kexec handover FDT is invalid (0x%llx)\n", handover_dt_phys);
+ early_memunmap(handover_dt, sizeof(struct fdt_header));
+ return;
+ }
+
+ handover_len = fdt_totalsize(handover_dt);
+ handover_phys = handover_dt_phys;
+
+ early_memunmap(handover_dt, sizeof(struct fdt_header));
+
+ /* Reserve the DT so we can still access it in late boot */
+ memblock_reserve(handover_phys, handover_len);
+
+ /* Reserve the mem cache so we can still access it later */
+ memblock_reserve(mem_cache_phys, mem_cache_len);
+
+ /*
+ * We pass a safe contiguous block of memory to use for early boot purposes from
+ * the previous kernel so that we can resize the memblock array as needed.
+ */
+ memblock_add(scratch_phys, scratch_len);
+
+ if (WARN_ON(memblock_mark_scratch(scratch_phys, scratch_len))) {
+ pr_err("Kexec failed to mark the scratch region. Disabling KHO.");
+ handover_len = 0;
+ handover_phys = 0;
+ return;
+ }
+ pr_debug("Marked 0x%lx+0x%lx as scratch", (long)scratch_phys, (long)scratch_len);
+
+ /*
+ * Now that we have a viable region of scratch memory, let's tell the memblocks
+ * allocator to only use that for any allocations. That way we ensure that nothing
+ * scribbles over in use data while we initialize the page tables which we will need
+ * to ingest all memory reservations from the previous kernel.
+ */
+ memblock_set_scratch_only();
+
+ /* Remember the mem cache location for kho_reserve_previous_mem() */
+ mem_len = mem_cache_len;
+ mem_phys = mem_cache_phys;
+
+ /* Remember the scratch block - we will reuse it again for the next kexec */
+ kho_scratch_phys = scratch_phys;
+ kho_scratch_len = scratch_len;
+
+ pr_info("setup: Found kexec handover data. Will skip init for some devices\n");
+}
--
2.40.1

2024-01-17 14:48:36

by Alexander Graf

Subject: [PATCH v3 02/17] memblock: Declare scratch memory as CMA

When we finish populating our memory, we don't want to lose the scratch
region as memory we can use for useful data. To do that, we mark it as
CMA memory. That means that any allocation within it only happens with
movable memory, which we can then happily discard for the next kexec.

That way we don't lose the scratch region's memory anymore for
allocations after boot.
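
Because the migrate type is set per pageblock, the physical range is widened to pageblock boundaries before it is marked. The following user-space sketch works through that arithmetic. It is an illustration only: the page size and pageblock order are assumptions (4 KiB pages, 512 pages per pageblock), and struct pfn_span, pageblock_span() and pageblocks_in() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT       12	/* assumes 4 KiB pages */
#define PAGEBLOCK_ORDER  9	/* assumes 512 pages (2 MiB) per pageblock */
#define PAGEBLOCK_PAGES  (1ULL << PAGEBLOCK_ORDER)

struct pfn_span {
	uint64_t start_pfn;	/* first pfn, pageblock aligned */
	uint64_t end_pfn;	/* one past the last pfn, pageblock aligned */
};

/* Widen the physical range [start, end) to whole pageblocks. */
static struct pfn_span pageblock_span(uint64_t start, uint64_t end)
{
	uint64_t page = 1ULL << PAGE_SHIFT;
	struct pfn_span s;

	assert(start <= end);
	/* PFN_DOWN(start), rounded down to a pageblock boundary. */
	s.start_pfn = (start >> PAGE_SHIFT) & ~(PAGEBLOCK_PAGES - 1);
	/* PFN_UP(end), rounded up to a pageblock boundary. */
	s.end_pfn = (((end + page - 1) >> PAGE_SHIFT) + PAGEBLOCK_PAGES - 1) &
		    ~(PAGEBLOCK_PAGES - 1);
	return s;
}

/* Number of pageblocks whose migrate type would be changed. */
static uint64_t pageblocks_in(struct pfn_span s)
{
	return (s.end_pfn - s.start_pfn) / PAGEBLOCK_PAGES;
}
```

Under these assumptions, a 4 MiB range that starts 1 MiB past a pageblock boundary spans three pageblocks, which suggests that an unaligned scratch region would also turn some neighbouring pages into CMA pages.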

Signed-off-by: Alexander Graf <[email protected]>

---

v1 -> v2:

- test bot warning fix

v2 -> v3:

- s/reserve_scratch_mem/mark_phys_as_cma/
- Declare scratch memory as CMA: Remove ifdeffery
- Declare scratch memory as CMA: Remove superfluous comment
---
mm/memblock.c | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index fbb98981a202..56530d0469a8 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -16,6 +16,7 @@
#include <linux/kmemleak.h>
#include <linux/seq_file.h>
#include <linux/memblock.h>
+#include <linux/page-isolation.h>

#include <asm/sections.h>
#include <linux/io.h>
@@ -1134,10 +1135,6 @@ static bool should_skip_region(struct memblock_type *type,
if ((flags & MEMBLOCK_SCRATCH) && !memblock_is_scratch(m))
return true;

- /* Leave scratch memory alone after scratch-only phase */
- if (!(flags & MEMBLOCK_SCRATCH) && memblock_is_scratch(m))
- return true;
-
return false;
}

@@ -2188,6 +2185,16 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
}
}

+static void mark_phys_as_cma(phys_addr_t start, phys_addr_t end)
+{
+ ulong start_pfn = pageblock_start_pfn(PFN_DOWN(start));
+ ulong end_pfn = pageblock_align(PFN_UP(end));
+ ulong pfn;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages)
+ set_pageblock_migratetype(pfn_to_page(pfn), MIGRATE_CMA);
+}
+
static unsigned long __init __free_memory_core(phys_addr_t start,
phys_addr_t end)
{
@@ -2249,6 +2256,17 @@ static unsigned long __init free_low_memory_core_early(void)

memmap_init_reserved_pages();

+ if (IS_ENABLED(CONFIG_MEMBLOCK_SCRATCH)) {
+ /*
+ * Mark scratch mem as CMA before we return it. That way we
+ * ensure that no kernel allocations happen on it. That means
+ * we can reuse it as scratch memory again later.
+ */
+ __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
+ MEMBLOCK_SCRATCH, &start, &end, NULL)
+ mark_phys_as_cma(start, end);
+ }
+
/*
* We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
* because in some case like Node0 doesn't have RAM installed
--
2.40.1

2024-01-17 14:48:57

by Alexander Graf

Subject: [PATCH v3 05/17] kexec: Add KHO support to kexec file loads

Kexec has 2 modes: A user space driven mode and a kernel driven mode.
For the kernel driven mode, kernel code determines the physical
addresses of all target buffers that the payload gets copied into.

With KHO, we can only safely copy payloads into the "scratch area".
Teach the kexec file loader about it, so it only allocates for that
area. In addition, enlighten it with support to ask the KHO subsystem
for its respective payloads to copy into target memory. Also teach the
KHO subsystem how to fill the images for file loads.
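
As a rough illustration of what restricting the loader to the scratch area means, the following user-space sketch places one buffer top-down inside a single region with a power-of-two alignment. The struct and function names are hypothetical and the logic is a simplification; the patch itself reuses the existing locate_mem_hole_callback() on a struct resource that describes the scratch region.

```c
#include <stdbool.h>
#include <stdint.h>

struct region {
	uint64_t start;	/* first byte of the region */
	uint64_t len;	/* size in bytes */
};

/*
 * Find the highest aligned address inside @scratch that fits @size bytes.
 * @align must be a power of two. Returns true and stores the address in
 * @out on success, false if the buffer does not fit.
 */
static bool place_in_scratch(const struct region *scratch, uint64_t size,
			     uint64_t align, uint64_t *out)
{
	uint64_t end = scratch->start + scratch->len;	/* exclusive */
	uint64_t addr;

	if (!size || size > scratch->len)
		return false;

	addr = (end - size) & ~(align - 1);	/* top-down, round down */
	if (addr < scratch->start)
		return false;

	*out = addr;
	return true;
}
```

A request that does not fit simply fails instead of falling back to memory outside the region, which mirrors the -EADDRNOTAVAIL behaviour in kexec_locate_mem_hole() below.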

Signed-off-by: Alexander Graf <[email protected]>
---
include/linux/kexec.h | 9 ++
kernel/kexec_file.c | 41 ++++++++
kernel/kexec_kho_out.c | 210 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 260 insertions(+)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index eabf9536466a..225ef2222eb9 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -362,6 +362,13 @@ struct kimage {
size_t ima_buffer_size;
#endif

+#ifdef CONFIG_KEXEC_KHO
+ struct {
+ struct kexec_buf dt;
+ struct kexec_buf mem_cache;
+ } kho;
+#endif
+
/* Core ELF header buffer */
void *elf_headers;
unsigned long elf_headers_sz;
@@ -550,6 +557,7 @@ static inline bool is_kho_boot(void)

/* egest handover metadata */
void kho_reserve_scratch(void);
+int kho_fill_kimage(struct kimage *image);
int register_kho_notifier(struct notifier_block *nb);
int unregister_kho_notifier(struct notifier_block *nb);
bool kho_is_active(void);
@@ -567,6 +575,7 @@ static inline bool is_kho_boot(void) { return false; }

/* egest handover metadata */
static inline void kho_reserve_scratch(void) { }
+static inline int kho_fill_kimage(struct kimage *image) { return 0; }
static inline int register_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
static inline int unregister_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
static inline bool kho_is_active(void) { return false; }
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index bef2f6f2571b..28fa60b51828 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -113,6 +113,13 @@ void kimage_file_post_load_cleanup(struct kimage *image)
image->ima_buffer = NULL;
#endif /* CONFIG_IMA_KEXEC */

+#ifdef CONFIG_KEXEC_KHO
+ kvfree(image->kho.mem_cache.buffer);
+ image->kho.mem_cache = (struct kexec_buf) {};
+ kvfree(image->kho.dt.buffer);
+ image->kho.dt = (struct kexec_buf) {};
+#endif
+
/* See if architecture has anything to cleanup post load */
arch_kimage_file_post_load_cleanup(image);

@@ -253,6 +260,11 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
/* IMA needs to pass the measurement list to the next kernel. */
ima_add_kexec_buffer(image);

+ /* If KHO is active, add its images to the list */
+ ret = kho_fill_kimage(image);
+ if (ret)
+ goto out;
+
/* Call image load handler */
ldata = kexec_image_load_default(image);

@@ -526,6 +538,24 @@ static int locate_mem_hole_callback(struct resource *res, void *arg)
return locate_mem_hole_bottom_up(start, end, kbuf);
}

+#ifdef CONFIG_KEXEC_KHO
+static int kexec_walk_kho_scratch(struct kexec_buf *kbuf,
+ int (*func)(struct resource *, void *))
+{
+ int ret = 0;
+
+ struct resource res = {
+ .start = kho_scratch_phys,
+ .end = kho_scratch_phys + kho_scratch_len - 1,
+ };
+
+ /* Try to fit the kimage into our KHO scratch region */
+ ret = func(&res, kbuf);
+
+ return ret;
+}
+#endif
+
#ifdef CONFIG_ARCH_KEEP_MEMBLOCK
static int kexec_walk_memblock(struct kexec_buf *kbuf,
int (*func)(struct resource *, void *))
@@ -622,6 +652,17 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN)
return 0;

+#ifdef CONFIG_KEXEC_KHO
+ /*
+ * If KHO is active, only use KHO scratch memory. All other memory
+ * could potentially be handed over.
+ */
+ if (kho_is_active() && kbuf->image->type != KEXEC_TYPE_CRASH) {
+ ret = kexec_walk_kho_scratch(kbuf, locate_mem_hole_callback);
+ return ret == 1 ? 0 : -EADDRNOTAVAIL;
+ }
+#endif
+
if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
ret = kexec_walk_resources(kbuf, locate_mem_hole_callback);
else
diff --git a/kernel/kexec_kho_out.c b/kernel/kexec_kho_out.c
index 765cf6ba7a46..2cf5755f5e4a 100644
--- a/kernel/kexec_kho_out.c
+++ b/kernel/kexec_kho_out.c
@@ -50,6 +50,216 @@ int unregister_kho_notifier(struct notifier_block *nb)
}
EXPORT_SYMBOL_GPL(unregister_kho_notifier);

+static int kho_mem_cache_add(void *fdt, struct kho_mem *mem_cache, int size,
+ struct kho_mem *new_mem)
+{
+ int entries = size / sizeof(*mem_cache);
+ u64 new_start = new_mem->addr;
+ u64 new_end = new_mem->addr + new_mem->len;
+ u64 prev_start = 0;
+ u64 prev_end = 0;
+ int i;
+
+ if (WARN_ON((new_start < (kho_scratch_phys + kho_scratch_len)) &&
+ (new_end > kho_scratch_phys))) {
+ pr_err("KHO memory runs over scratch memory\n");
+ return -EINVAL;
+ }
+
+ /*
+ * We walk the existing sorted mem cache and find the spot where this
+ * new entry would start, so we can insert it right there.
+ */
+ for (i = 0; i < entries; i++) {
+ struct kho_mem *mem = &mem_cache[i];
+ u64 mem_end = (mem->addr + mem->len);
+
+ if (mem_end < new_start) {
+ /* No overlap */
+ prev_start = mem->addr;
+ prev_end = mem->addr + mem->len;
+ continue;
+ } else if ((new_start >= mem->addr) && (new_end <= mem_end)) {
+ /* new_mem fits into mem, skip */
+ return size;
+ } else if ((new_end >= mem->addr) && (new_start <= mem_end)) {
+ /* new_mem and mem overlap, fold them */
+ bool remove = false;
+
+ mem->addr = min(new_start, mem->addr);
+ mem->len = max(mem_end, new_end) - mem->addr;
+ mem_end = (mem->addr + mem->len);
+
+ if (i > 0 && prev_end >= mem->addr) {
+ /* We now overlap with the previous mem, fold */
+ struct kho_mem *prev = &mem_cache[i - 1];
+
+ prev->addr = min(prev->addr, mem->addr);
+ prev->len = max(mem_end, prev_end) - prev->addr;
+ remove = true;
+ } else if (i < (entries - 1) && mem_end >= mem_cache[i + 1].addr) {
+ /* We now overlap with the next mem, fold */
+ struct kho_mem *next = &mem_cache[i + 1];
+ u64 next_end = (next->addr + next->len);
+
+ next->addr = min(next->addr, mem->addr);
+ next->len = max(mem_end, next_end) - next->addr;
+ remove = true;
+ }
+
+ if (remove) {
+ /* We folded this mem into another, remove it */
+ memmove(mem, mem + 1, (entries - i - 1) * sizeof(*mem));
+ size -= sizeof(*new_mem);
+ }
+
+ return size;
+ } else if (mem->addr > new_end) {
+ /*
+ * The mem cache is sorted. If we find the current
+ * entry start after our new_mem's end, we shot over
+ * which means we need to add it by creating a new
+ * hole right after the current entry.
+ */
+ memmove(mem + 1, mem, (entries - i) * sizeof(*mem));
+ break;
+ }
+ }
+
+ mem_cache[i] = *new_mem;
+ size += sizeof(*new_mem);
+
+ return size;
+}
+
+/**
+ * kho_alloc_mem_cache - Allocate and initialize the mem cache kexec_buf
+ */
+static int kho_alloc_mem_cache(struct kimage *image, void *fdt)
+{
+ int offset, depth, initial_depth, len;
+ void *mem_cache;
+ int size;
+
+ /* Count the elements inside all "mem" properties in the DT */
+ size = offset = depth = initial_depth = 0;
+ for (offset = 0;
+ offset >= 0 && depth >= initial_depth;
+ offset = fdt_next_node(fdt, offset, &depth)) {
+ const struct kho_mem *mems;
+
+ mems = fdt_getprop(fdt, offset, "mem", &len);
+ if (!mems || len & (sizeof(*mems) - 1))
+ continue;
+ size += len;
+ }
+
+ /* Allocate based on the max size we determined */
+ mem_cache = kvmalloc(size, GFP_KERNEL);
+ if (!mem_cache)
+ return -ENOMEM;
+
+ /* And populate the array */
+ size = offset = depth = initial_depth = 0;
+ for (offset = 0;
+ offset >= 0 && depth >= initial_depth;
+ offset = fdt_next_node(fdt, offset, &depth)) {
+ const struct kho_mem *mems;
+ int nr_mems, i;
+
+ mems = fdt_getprop(fdt, offset, "mem", &len);
+ if (!mems || len & (sizeof(*mems) - 1))
+ continue;
+
+ for (i = 0, nr_mems = len / sizeof(*mems); i < nr_mems; i++) {
+ const struct kho_mem *mem = &mems[i];
+ ulong mstart = PAGE_ALIGN_DOWN(mem->addr);
+ ulong mend = PAGE_ALIGN(mem->addr + mem->len);
+ struct kho_mem cmem = {
+ .addr = mstart,
+ .len = (mend - mstart),
+ };
+
+ size = kho_mem_cache_add(fdt, mem_cache, size, &cmem);
+ if (size < 0)
+ return size;
+ }
+ }
+
+ image->kho.mem_cache.buffer = mem_cache;
+ image->kho.mem_cache.bufsz = size;
+ image->kho.mem_cache.memsz = size;
+
+ return 0;
+}
+
+int kho_fill_kimage(struct kimage *image)
+{
+ int err = 0;
+ void *dt;
+
+ mutex_lock(&kho.lock);
+
+ if (!kho.active)
+ goto out;
+
+ /* Initialize kexec_buf for mem_cache */
+ image->kho.mem_cache = (struct kexec_buf) {
+ .image = image,
+ .buffer = NULL,
+ .bufsz = 0,
+ .mem = KEXEC_BUF_MEM_UNKNOWN,
+ .memsz = 0,
+ .buf_align = SZ_64K, /* Makes it easier to map */
+ .buf_max = ULONG_MAX,
+ .top_down = true,
+ };
+
+ /*
+ * We need to make all allocations visible here via the mem_cache so that
+ * kho_is_destination_range() can identify overlapping regions and ensure
+ * that no kimage (including the DT one) lands on handed over memory.
+ *
+ * Since we conveniently already built an array of all allocations, let's
+ * pass that on to the target kernel so that it can reuse it to initialize
+ * its memory blocks.
+ */
+ err = kho_alloc_mem_cache(image, kho.dt);
+ if (err)
+ goto out;
+
+ err = kexec_add_buffer(&image->kho.mem_cache);
+ if (err)
+ goto out;
+
+ /*
+ * Create a kexec copy of the DT here. We need this because lifetime may
+ * be different between kho.dt and the kimage
+ */
+ dt = kvmemdup(kho.dt, kho.dt_len, GFP_KERNEL);
+ if (!dt) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ /* Allocate target memory for kho dt */
+ image->kho.dt = (struct kexec_buf) {
+ .image = image,
+ .buffer = dt,
+ .bufsz = kho.dt_len,
+ .mem = KEXEC_BUF_MEM_UNKNOWN,
+ .memsz = kho.dt_len,
+ .buf_align = SZ_64K, /* Makes it easier to map */
+ .buf_max = ULONG_MAX,
+ .top_down = true,
+ };
+ err = kexec_add_buffer(&image->kho.dt);
+
+out:
+ mutex_unlock(&kho.lock);
+ return err;
+}
+
bool kho_is_active(void)
{
return kho.active;
--
2.40.1

2024-01-17 14:49:27

by Alexander Graf

Subject: [PATCH v3 03/17] kexec: Add Kexec HandOver (KHO) generation helpers

This patch adds the core infrastructure to generate Kexec HandOver
metadata. Kexec HandOver is a mechanism that allows Linux to preserve
state - arbitrary properties as well as memory locations - across kexec.

It does so using 3 concepts:

1) Device Tree - Every KHO kexec carries a KHO specific flattened
device tree blob that describes the state of the system. Device
drivers can register to KHO to serialize their state before kexec.

2) Mem cache - A memblock-like structure that contains full page
ranges of reservations. These cannot be part of the architectural
reservations, because they differ on every kexec.

3) Scratch Region - A CMA region that we allocate in the first kernel.
CMA gives us the guarantee that no handover pages land in that
region, because handover pages must be at a static physical memory
location. We use this region as the place to load future kexec
images into which then won't collide with any handover data.
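
To make the mem cache concept more concrete, here is a small user-space sketch of a sorted array of address ranges in which overlapping or adjacent entries are merged on insertion. This is a simplified illustration and not the series' own implementation; struct range and range_add() are hypothetical names, while the on-the-wire element in this series is struct kho_mem with the same addr/len layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct range {
	uint64_t addr;
	uint64_t len;
};

/*
 * Insert @new into the sorted, non-overlapping array @r holding @n entries.
 * Overlapping or touching entries are merged. The caller must provide room
 * for n + 1 entries. Returns the new entry count.
 */
static size_t range_add(struct range *r, size_t n, struct range new)
{
	uint64_t start = new.addr, end = new.addr + new.len;
	size_t first = 0, last;

	/* Skip entries that end strictly before the new range starts. */
	while (first < n && r[first].addr + r[first].len < start)
		first++;

	/* Absorb every entry that overlaps or touches [start, end). */
	last = first;
	while (last < n && r[last].addr <= end) {
		if (r[last].addr < start)
			start = r[last].addr;
		if (r[last].addr + r[last].len > end)
			end = r[last].addr + r[last].len;
		last++;
	}

	/* Replace entries [first, last) with the single merged entry. */
	memmove(&r[first + 1], &r[last], (n - last) * sizeof(*r));
	r[first].addr = start;
	r[first].len = end - start;

	return n - (last - first) + 1;
}
```

Keeping the array sorted and merged is what lets a consumer stop scanning as soon as it passes the address it cares about.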

Signed-off-by: Alexander Graf <[email protected]>

---

v1 -> v2:

- s/kho_reserve/kho_reserve_scratch/g
- Move kho enums out of ifdef
---
Documentation/ABI/testing/sysfs-kernel-kho | 53 +++
.../admin-guide/kernel-parameters.txt | 10 +
MAINTAINERS | 1 +
include/linux/kexec.h | 24 ++
include/uapi/linux/kexec.h | 6 +
kernel/Makefile | 1 +
kernel/kexec_kho_out.c | 316 ++++++++++++++++++
7 files changed, 411 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-kernel-kho
create mode 100644 kernel/kexec_kho_out.c

diff --git a/Documentation/ABI/testing/sysfs-kernel-kho b/Documentation/ABI/testing/sysfs-kernel-kho
new file mode 100644
index 000000000000..f69e7b81a337
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-kho
@@ -0,0 +1,53 @@
+What: /sys/kernel/kho/active
+Date: December 2023
+Contact: Alexander Graf <[email protected]>
+Description:
+ Kexec HandOver (KHO) allows Linux to transition the state of
+ compatible drivers into the next kexec'ed kernel. To do so,
+ device drivers will serialize their current state into a DT.
+ While the state is serialized, they are unable to perform
+ any modifications to state that was serialized, such as
+ handed over memory allocations.
+
+ When this file contains "1", the system is in the transition
+ state. When it contains "0", it is not. To switch between the
+ two states, echo the respective number into this file.
+
+What: /sys/kernel/kho/dt_max
+Date: December 2023
+Contact: Alexander Graf <[email protected]>
+Description:
+ KHO needs to allocate a buffer for the DT that gets
+ generated before it knows the final size. By default, it
+ will allocate 10 MiB for it. You can write to this file
+ to modify the size of that allocation.
+
+What: /sys/kernel/kho/scratch_len
+Date: December 2023
+Contact: Alexander Graf <[email protected]>
+Description:
+ To support continuous KHO kexecs, we need to reserve a
+ physically contiguous memory region that will always stay
+ available for future kexec allocations. This file describes
+ the length of that memory region. Kexec user space tooling
+ can use this to determine where it should place its payload
+ images.
+
+What: /sys/kernel/kho/scratch_phys
+Date: December 2023
+Contact: Alexander Graf <[email protected]>
+Description:
+ To support continuous KHO kexecs, we need to reserve a
+ physically contiguous memory region that will always stay
+ available for future kexec allocations. This file describes
+ the physical location of that memory region. Kexec user space
+ tooling can use this to determine where it should place its
+ payload images.
+
+What: /sys/kernel/kho/dt
+Date: December 2023
+Contact: Alexander Graf <[email protected]>
+Description:
+ When KHO is active, the kernel exposes the generated DT that
+ carries its current KHO state in this file. Kexec user space
+ tooling can use this as input file for the KHO payload image.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index cc9cc33d1121..f8785df17d67 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2515,6 +2515,16 @@
kgdbwait [KGDB] Stop kernel execution and enter the
kernel debugger at the earliest opportunity.

+ kho_scratch=n[KMG] [KEXEC] Sets the size of the KHO scratch
+ region. The KHO scratch region is a physically contiguous
+ memory range that can only be used for non-kernel
+ allocations. That way, even when memory is heavily
+ fragmented with handed over memory, kexec will always
+ be able to find contiguous memory to place the next
+ kernel for kexec into.
+
+ The default is 0.
+
kmac= [MIPS] Korina ethernet MAC address.
Configure the RouterBoard 532 series on-chip
Ethernet adapter MAC address.
diff --git a/MAINTAINERS b/MAINTAINERS
index 391bbb855cbe..6ec4be8874b9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11824,6 +11824,7 @@ M: Eric Biederman <[email protected]>
L: [email protected]
S: Maintained
W: http://kernel.org/pub/linux/utils/kernel/kexec/
+F: Documentation/ABI/testing/sysfs-kernel-kho
F: include/linux/kexec.h
F: include/uapi/linux/kexec.h
F: kernel/kexec*
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 400cb6c02176..19ffc00b5e7b 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -21,6 +21,8 @@

#include <uapi/linux/kexec.h>
#include <linux/verification.h>
+#include <linux/libfdt.h>
+#include <linux/notifier.h>

extern note_buf_t __percpu *crash_notes;

@@ -523,6 +525,28 @@ void set_kexec_sig_enforced(void);
static inline void set_kexec_sig_enforced(void) {}
#endif

+/* Notifier index */
+enum kho_event {
+ KEXEC_KHO_DUMP = 0,
+ KEXEC_KHO_ABORT = 1,
+};
+
+#ifdef CONFIG_KEXEC_KHO
+extern phys_addr_t kho_scratch_phys;
+extern phys_addr_t kho_scratch_len;
+
+/* egest handover metadata */
+void kho_reserve_scratch(void);
+int register_kho_notifier(struct notifier_block *nb);
+int unregister_kho_notifier(struct notifier_block *nb);
+bool kho_is_active(void);
+#else
+static inline void kho_reserve_scratch(void) {}
+static inline int register_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
+static inline int unregister_kho_notifier(struct notifier_block *nb) { return -EINVAL; }
+static inline bool kho_is_active(void) { return false; }
+#endif
+
#endif /* !defined(__ASSEBMLY__) */

#endif /* LINUX_KEXEC_H */
diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
index c17bb096ea68..ad9e95b88b34 100644
--- a/include/uapi/linux/kexec.h
+++ b/include/uapi/linux/kexec.h
@@ -50,6 +50,12 @@
/* The artificial cap on the number of segments passed to kexec_load. */
#define KEXEC_SEGMENT_MAX 16

+/* KHO passes an array of kho_mem as "mem cache" to the new kernel */
+struct kho_mem {
+ __u64 addr;
+ __u64 len;
+};
+
#ifndef __KERNEL__
/*
* This structure is used to hold the arguments that are used when
diff --git a/kernel/Makefile b/kernel/Makefile
index ce105a5558fc..b182b7b4e7d1 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -73,6 +73,7 @@ obj-$(CONFIG_KEXEC_CORE) += kexec_core.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
+obj-$(CONFIG_KEXEC_KHO) += kexec_kho_out.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup/
diff --git a/kernel/kexec_kho_out.c b/kernel/kexec_kho_out.c
new file mode 100644
index 000000000000..765cf6ba7a46
--- /dev/null
+++ b/kernel/kexec_kho_out.c
@@ -0,0 +1,316 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kexec_kho_out.c - kexec handover code to egest metadata.
+ * Copyright (C) 2023 Alexander Graf <[email protected]>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/cma.h>
+#include <linux/kexec.h>
+#include <linux/device.h>
+#include <linux/compiler.h>
+#include <linux/kmsg_dump.h>
+
+struct kho_out {
+ struct kobject *kobj;
+ bool active;
+ struct cma *cma;
+ struct blocking_notifier_head chain_head;
+ void *dt;
+ u64 dt_len;
+ u64 dt_max;
+ struct mutex lock;
+};
+
+static struct kho_out kho = {
+ .dt_max = (1024 * 1024 * 10),
+ .chain_head = BLOCKING_NOTIFIER_INIT(kho.chain_head),
+ .lock = __MUTEX_INITIALIZER(kho.lock),
+};
+
+/*
+ * Size for scratch (non-KHO) memory. With KHO enabled, memory can become
+ * fragmented because KHO regions may be anywhere in physical address
+ * space. The scratch region gives us a safe zone that we will never see
+ * KHO allocations from. This is where we can later safely load our new kexec
+ * images into.
+ */
+static phys_addr_t kho_scratch_size __initdata;
+
+int register_kho_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_register(&kho.chain_head, nb);
+}
+EXPORT_SYMBOL_GPL(register_kho_notifier);
+
+int unregister_kho_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_unregister(&kho.chain_head, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_kho_notifier);
+
+bool kho_is_active(void)
+{
+ return kho.active;
+}
+EXPORT_SYMBOL_GPL(kho_is_active);
+
+static ssize_t raw_read(struct file *file, struct kobject *kobj,
+ struct bin_attribute *attr, char *buf,
+ loff_t pos, size_t count)
+{
+ mutex_lock(&kho.lock);
+ memcpy(buf, attr->private + pos, count);
+ mutex_unlock(&kho.lock);
+
+ return count;
+}
+
+static BIN_ATTR(dt, 0400, raw_read, NULL, 0);
+
+static int kho_expose_dt(void *fdt)
+{
+ long fdt_len = fdt_totalsize(fdt);
+ int err;
+
+ kho.dt = fdt;
+ kho.dt_len = fdt_len;
+
+ bin_attr_dt.size = fdt_totalsize(fdt);
+ bin_attr_dt.private = fdt;
+ err = sysfs_create_bin_file(kho.kobj, &bin_attr_dt);
+
+ return err;
+}
+
+static void kho_abort(void)
+{
+ if (!kho.active)
+ return;
+
+ sysfs_remove_bin_file(kho.kobj, &bin_attr_dt);
+
+ kvfree(kho.dt);
+ kho.dt = NULL;
+ kho.dt_len = 0;
+
+ blocking_notifier_call_chain(&kho.chain_head, KEXEC_KHO_ABORT, NULL);
+
+ kho.active = false;
+}
+
+static int kho_serialize(void)
+{
+ void *fdt = NULL;
+ int err;
+
+ kho.active = true;
+ err = -ENOMEM;
+
+ fdt = kvmalloc(kho.dt_max, GFP_KERNEL);
+ if (!fdt)
+ goto out;
+
+ if (fdt_create(fdt, kho.dt_max)) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ err = fdt_finish_reservemap(fdt);
+ if (err)
+ goto out;
+
+ err = fdt_begin_node(fdt, "");
+ if (err)
+ goto out;
+
+ err = fdt_property_string(fdt, "compatible", "kho-v1");
+ if (err)
+ goto out;
+
+ /* Loop through all kho dump functions */
+ err = blocking_notifier_call_chain(&kho.chain_head, KEXEC_KHO_DUMP, fdt);
+ err = notifier_to_errno(err);
+ if (err)
+ goto out;
+
+ /* Close / */
+ err = fdt_end_node(fdt);
+ if (err)
+ goto out;
+
+ err = fdt_finish(fdt);
+ if (err)
+ goto out;
+
+ if (WARN_ON(fdt_check_header(fdt))) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ err = kho_expose_dt(fdt);
+
+out:
+ if (err) {
+ pr_err("kho failed to serialize state: %d\n", err);
+ kho_abort();
+ }
+ return err;
+}
+
+/* Handling for /sys/kernel/kho */
+
+#define KHO_ATTR_RO(_name) static struct kobj_attribute _name##_attr = __ATTR_RO_MODE(_name, 0400)
+#define KHO_ATTR_RW(_name) static struct kobj_attribute _name##_attr = __ATTR_RW_MODE(_name, 0600)
+
+static ssize_t active_store(struct kobject *dev, struct kobj_attribute *attr,
+ const char *buf, size_t size)
+{
+ ssize_t retsize = size;
+ bool val = false;
+ int ret;
+
+ if (kstrtobool(buf, &val) < 0)
+ return -EINVAL;
+
+ if (!kho_scratch_len)
+ return -ENOMEM;
+
+ mutex_lock(&kho.lock);
+ if (val != kho.active) {
+ if (val) {
+ ret = kho_serialize();
+ if (ret) {
+ retsize = -EINVAL;
+ goto out;
+ }
+ } else {
+ kho_abort();
+ }
+ }
+
+out:
+ mutex_unlock(&kho.lock);
+ return retsize;
+}
+
+static ssize_t active_show(struct kobject *dev, struct kobj_attribute *attr,
+ char *buf)
+{
+ ssize_t ret;
+
+ mutex_lock(&kho.lock);
+ ret = sysfs_emit(buf, "%d\n", kho.active);
+ mutex_unlock(&kho.lock);
+
+ return ret;
+}
+KHO_ATTR_RW(active);
+
+static ssize_t dt_max_store(struct kobject *dev, struct kobj_attribute *attr,
+ const char *buf, size_t size)
+{
+ u64 val;
+
+ if (kstrtoull(buf, 0, &val))
+ return -EINVAL;
+
+ kho.dt_max = val;
+
+ return size;
+}
+
+static ssize_t dt_max_show(struct kobject *dev, struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "0x%llx\n", kho.dt_max);
+}
+KHO_ATTR_RW(dt_max);
+
+static ssize_t scratch_len_show(struct kobject *dev, struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "0x%llx\n", kho_scratch_len);
+}
+KHO_ATTR_RO(scratch_len);
+
+static ssize_t scratch_phys_show(struct kobject *dev, struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "0x%llx\n", kho_scratch_phys);
+}
+KHO_ATTR_RO(scratch_phys);
+
+static __init int kho_out_init(void)
+{
+ int ret = 0;
+
+ kho.kobj = kobject_create_and_add("kho", kernel_kobj);
+ if (!kho.kobj) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ ret = sysfs_create_file(kho.kobj, &active_attr.attr);
+ if (ret)
+ goto err;
+
+ ret = sysfs_create_file(kho.kobj, &dt_max_attr.attr);
+ if (ret)
+ goto err;
+
+ ret = sysfs_create_file(kho.kobj, &scratch_phys_attr.attr);
+ if (ret)
+ goto err;
+
+ ret = sysfs_create_file(kho.kobj, &scratch_len_attr.attr);
+ if (ret)
+ goto err;
+
+err:
+ return ret;
+}
+late_initcall(kho_out_init);
+
+static int __init early_kho_scratch(char *p)
+{
+ kho_scratch_size = memparse(p, &p);
+ return 0;
+}
+early_param("kho_scratch", early_kho_scratch);
+
+/**
+ * kho_reserve_scratch - Reserve a contiguous chunk of memory for kexec
+ *
+ * With KHO we can preserve arbitrary pages in the system. To ensure we still
+ * have a large contiguous region of memory when we search the physical address
+ * space for target memory, let's make sure we always have a large CMA region
+ * active. This CMA region will only be used for movable pages which are not a
+ * problem for us during KHO because we can just move them somewhere else.
+ */
+__init void kho_reserve_scratch(void)
+{
+ int r;
+
+ if (kho_get_fdt()) {
+ /*
+ * We came from a previous KHO handover, so we already have
+ * a known good scratch region that we preserve. No need to
+ * allocate another.
+ */
+ return;
+ }
+
+ /* Only allocate KHO scratch memory when we're asked to */
+ if (!kho_scratch_size)
+ return;
+
+ r = cma_declare_contiguous_nid(0, kho_scratch_size, 0, PAGE_SIZE, 0,
+ false, "kho", &kho.cma, NUMA_NO_NODE);
+ if (WARN_ON(r))
+ return;
+
+ kho_scratch_phys = cma_get_base(kho.cma);
+ kho_scratch_len = cma_get_size(kho.cma);
+}
--
2.40.1

2024-01-17 14:49:51

by Alexander Graf

Subject: [PATCH v3 06/17] kexec: Add config option for KHO

We have all generic code in place now to support Kexec with KHO. This
patch adds a config option that depends on architecture support to
enable KHO support.

Signed-off-by: Alexander Graf <[email protected]>
---
kernel/Kconfig.kexec | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index 946dffa048b7..6fb5a6ae9697 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -93,6 +93,19 @@ config KEXEC_JUMP
Jump between original kernel and kexeced kernel and invoke
code in physical address mode via KEXEC

+config KEXEC_KHO
+ bool "kexec handover"
+ depends on ARCH_SUPPORTS_KEXEC_KHO
+ depends on KEXEC
+ select MEMBLOCK_SCRATCH
+ select LIBFDT
+ select CMA
+ help
+ Allow kexec to hand over state across kernels by generating and
+ passing additional metadata to the target kernel. This is useful
+ to keep data or state alive across the kexec. For this to work,
+ both source and target kernels need to have this option enabled.
+
config CRASH_DUMP
bool "kernel crash dumps"
depends on ARCH_SUPPORTS_CRASH_DUMP
--
2.40.1

2024-01-17 14:50:39

by Alexander Graf

Subject: [PATCH v3 09/17] x86: Add KHO support

We now have all bits in place to support KHO kexecs. This patch adds
awareness of KHO in the kexec file as well as boot path for x86 and
adds the respective kconfig option to the architecture so that it can
use KHO successfully.

In addition, it enlightens the decompression code with KHO so that its
KASLR location finder only considers memory regions that are not already
occupied by KHO memory.
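
The avoidance check can be pictured with the following user-space sketch: given a candidate image slot and a sorted array of handed over ranges, report the first range that collides with the slot and stop scanning once the ranges start beyond it. The names struct vec, overlaps() and first_overlap() are hypothetical, and the overlap test is a generic half-open interval check rather than the kernel's mem_overlaps() helper.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vec {
	uint64_t start;
	uint64_t size;
};

/* True if the half-open ranges [start, start + size) of @a and @b overlap. */
static bool overlaps(const struct vec *a, const struct vec *b)
{
	return a->start < b->start + b->size && b->start < a->start + a->size;
}

/*
 * Find the lowest-addressed entry in the sorted array @mems that overlaps
 * @img. Returns its index, or -1 if the candidate slot is free.
 */
static int first_overlap(const struct vec *img, const struct vec *mems, size_t n)
{
	uint64_t img_end = img->start + img->size;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Sorted input: nothing at or beyond img_end can overlap. */
		if (mems[i].start >= img_end)
			break;
		if (overlaps(img, &mems[i]))
			return (int)i;
	}

	return -1;
}
```

The early break only works because the mem cache is handed over sorted by address; an unsorted array would require a full scan for every candidate slot.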

Signed-off-by: Alexander Graf <[email protected]>

---

v1 -> v2:

- Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
- s/kho_reserve_mem/kho_reserve_previous_mem/g
- s/kho_reserve/kho_reserve_scratch/g
---
arch/x86/Kconfig | 3 ++
arch/x86/boot/compressed/kaslr.c | 55 +++++++++++++++++++++++++++
arch/x86/include/uapi/asm/bootparam.h | 15 +++++++-
arch/x86/kernel/e820.c | 9 +++++
arch/x86/kernel/kexec-bzimage64.c | 39 +++++++++++++++++++
arch/x86/kernel/setup.c | 46 ++++++++++++++++++++++
arch/x86/mm/init_32.c | 7 ++++
arch/x86/mm/init_64.c | 7 ++++
8 files changed, 180 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53f2e7797b1d..c05d6d75f256 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2091,6 +2091,9 @@ config ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG
config ARCH_SUPPORTS_KEXEC_JUMP
def_bool y

+config ARCH_SUPPORTS_KEXEC_KHO
+ def_bool y
+
config ARCH_SUPPORTS_CRASH_DUMP
def_bool X86_64 || (X86_32 && HIGHMEM)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index dec961c6d16a..93ea292e4c18 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -29,6 +29,7 @@
#include <linux/uts.h>
#include <linux/utsname.h>
#include <linux/ctype.h>
+#include <uapi/linux/kexec.h>
#include <generated/utsversion.h>
#include <generated/utsrelease.h>

@@ -472,6 +473,60 @@ static bool mem_avoid_overlap(struct mem_vector *img,
}
}

+#ifdef CONFIG_KEXEC_KHO
+ if (ptr->type == SETUP_KEXEC_KHO) {
+ struct kho_data *kho = (struct kho_data *)ptr->data;
+ struct kho_mem *mems = (void *)kho->mem_cache_addr;
+ int nr_mems = kho->mem_cache_size / sizeof(*mems);
+ int i;
+
+ /* Avoid the mem cache */
+ avoid = (struct mem_vector) {
+ .start = kho->mem_cache_addr,
+ .size = kho->mem_cache_size,
+ };
+
+ if (mem_overlaps(img, &avoid) && (avoid.start < earliest)) {
+ *overlap = avoid;
+ earliest = overlap->start;
+ is_overlapping = true;
+ }
+
+ /* And the KHO DT */
+ avoid = (struct mem_vector) {
+ .start = kho->dt_addr,
+ .size = kho->dt_size,
+ };
+
+ if (mem_overlaps(img, &avoid) && (avoid.start < earliest)) {
+ *overlap = avoid;
+ earliest = overlap->start;
+ is_overlapping = true;
+ }
+
+ /* As well as any other KHO memory reservations */
+ for (i = 0; i < nr_mems; i++) {
+ avoid = (struct mem_vector) {
+ .start = mems[i].addr,
+ .size = mems[i].len,
+ };
+
+ /*
+ * This mem starts after our current break.
+ * The array is sorted, so we're done.
+ */
+ if (avoid.start >= earliest)
+ break;
+
+ if (mem_overlaps(img, &avoid)) {
+ *overlap = avoid;
+ earliest = overlap->start;
+ is_overlapping = true;
+ }
+ }
+ }
+#endif
+
ptr = (struct setup_data *)(unsigned long)ptr->next;
}

diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index 01d19fc22346..013af38a9673 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -13,7 +13,8 @@
#define SETUP_CC_BLOB 7
#define SETUP_IMA 8
#define SETUP_RNG_SEED 9
-#define SETUP_ENUM_MAX SETUP_RNG_SEED
+#define SETUP_KEXEC_KHO 10
+#define SETUP_ENUM_MAX SETUP_KEXEC_KHO

#define SETUP_INDIRECT (1<<31)
#define SETUP_TYPE_MAX (SETUP_ENUM_MAX | SETUP_INDIRECT)
@@ -181,6 +182,18 @@ struct ima_setup_data {
__u64 size;
} __attribute__((packed));

+/*
+ * Locations of kexec handover metadata
+ */
+struct kho_data {
+ __u64 dt_addr;
+ __u64 dt_size;
+ __u64 scratch_addr;
+ __u64 scratch_size;
+ __u64 mem_cache_addr;
+ __u64 mem_cache_size;
+} __attribute__((packed));
+
/* The so-called "zeropage" */
struct boot_params {
struct screen_info screen_info; /* 0x000 */
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index fb8cf953380d..c891b83f5b1c 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1341,6 +1341,15 @@ void __init e820__memblock_setup(void)
continue;

memblock_add(entry->addr, entry->size);
+
+ /*
+ * At this point with KHO we only allocate from scratch memory
+ * and only from memory below ISA_END_ADDRESS. Make sure that
+ * when we add memory for the eligible range, we add it as
+ * scratch memory so that we can resize the memblocks array.
+ */
+ if (is_kho_boot() && (end <= ISA_END_ADDRESS))
+ memblock_mark_scratch(entry->addr, end);
}

/* Throw away partial pages: */
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 2a422e00ed4b..af521bfab861 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -15,6 +15,7 @@
#include <linux/slab.h>
#include <linux/kexec.h>
#include <linux/kernel.h>
+#include <linux/libfdt.h>
#include <linux/mm.h>
#include <linux/efi.h>
#include <linux/random.h>
@@ -233,6 +234,33 @@ setup_ima_state(const struct kimage *image, struct boot_params *params,
#endif /* CONFIG_IMA_KEXEC */
}

+static void setup_kho(const struct kimage *image, struct boot_params *params,
+ unsigned long params_load_addr,
+ unsigned int setup_data_offset)
+{
+#ifdef CONFIG_KEXEC_KHO
+ struct setup_data *sd = (void *)params + setup_data_offset;
+ struct kho_data *kho = (void *)sd + sizeof(*sd);
+
+ sd->type = SETUP_KEXEC_KHO;
+ sd->len = sizeof(struct kho_data);
+
+ /* Only add if we have all KHO images in place */
+ if (!image->kho.dt.buffer || !image->kho.mem_cache.buffer)
+ return;
+
+ /* Add setup data */
+ kho->dt_addr = image->kho.dt.mem;
+ kho->dt_size = image->kho.dt.bufsz;
+ kho->scratch_addr = kho_scratch_phys;
+ kho->scratch_size = kho_scratch_len;
+ kho->mem_cache_addr = image->kho.mem_cache.mem;
+ kho->mem_cache_size = image->kho.mem_cache.bufsz;
+ sd->next = params->hdr.setup_data;
+ params->hdr.setup_data = params_load_addr + setup_data_offset;
+#endif /* CONFIG_KEXEC_KHO */
+}
+
static int
setup_boot_parameters(struct kimage *image, struct boot_params *params,
unsigned long params_load_addr,
@@ -310,6 +338,13 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
sizeof(struct ima_setup_data);
}

+ if (IS_ENABLED(CONFIG_KEXEC_KHO)) {
+ /* Setup space to store preservation metadata */
+ setup_kho(image, params, params_load_addr, setup_data_offset);
+ setup_data_offset += sizeof(struct setup_data) +
+ sizeof(struct kho_data);
+ }
+
/* Setup RNG seed */
setup_rng_seed(params, params_load_addr, setup_data_offset);

@@ -475,6 +510,10 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
kbuf.bufsz += sizeof(struct setup_data) +
sizeof(struct ima_setup_data);

+ if (IS_ENABLED(CONFIG_KEXEC_KHO))
+ kbuf.bufsz += sizeof(struct setup_data) +
+ sizeof(struct kho_data);
+
params = kzalloc(kbuf.bufsz, GFP_KERNEL);
if (!params)
return ERR_PTR(-ENOMEM);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ec2c21a1844e..1ef0f63c3400 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -380,6 +380,29 @@ int __init ima_get_kexec_buffer(void **addr, size_t *size)
}
#endif

+static void __init add_kho(u64 phys_addr, u32 data_len)
+{
+#ifdef CONFIG_KEXEC_KHO
+ struct kho_data *kho;
+ u64 addr = phys_addr + sizeof(struct setup_data);
+ u64 size = data_len - sizeof(struct setup_data);
+
+ kho = early_memremap(addr, size);
+ if (!kho) {
+ pr_warn("setup: failed to memremap kho data (0x%llx, 0x%llx)\n",
+ addr, size);
+ return;
+ }
+
+ kho_populate(kho->dt_addr, kho->scratch_addr, kho->scratch_size,
+ kho->mem_cache_addr, kho->mem_cache_size);
+
+ early_memunmap(kho, size);
+#else
+ pr_warn("Passed KHO data, but CONFIG_KEXEC_KHO not set. Ignoring.\n");
+#endif
+}
+
static void __init parse_setup_data(void)
{
struct setup_data *data;
@@ -408,6 +431,9 @@ static void __init parse_setup_data(void)
case SETUP_IMA:
add_early_ima_buffer(pa_data);
break;
+ case SETUP_KEXEC_KHO:
+ add_kho(pa_data, data_len);
+ break;
case SETUP_RNG_SEED:
data = early_memremap(pa_data, data_len);
add_bootloader_randomness(data->data, data->len);
@@ -987,8 +1013,26 @@ void __init setup_arch(char **cmdline_p)
cleanup_highmap();

memblock_set_current_limit(ISA_END_ADDRESS);
+
e820__memblock_setup();

+ /*
+ * We can resize memblocks at this point, let's dump all KHO
+ * reservations in and switch from scratch-only to normal allocations
+ */
+ kho_reserve_previous_mem();
+
+ /* Allocations now skip scratch mem, return low 1M to the pool */
+ if (is_kho_boot()) {
+ u64 i;
+ phys_addr_t base, end;
+
+ __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
+ MEMBLOCK_SCRATCH, &base, &end, NULL)
+ if (end <= ISA_END_ADDRESS)
+ memblock_clear_scratch(base, end - base);
+ }
+
/*
* Needs to run after memblock setup because it needs the physical
* memory size.
@@ -1104,6 +1148,8 @@ void __init setup_arch(char **cmdline_p)
*/
arch_reserve_crashkernel();

+ kho_reserve_scratch();
+
memblock_find_dma_reserve();

if (!early_xdbc_setup_hardware())
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index b63403d7179d..6c3810afed04 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -20,6 +20,7 @@
#include <linux/smp.h>
#include <linux/init.h>
#include <linux/highmem.h>
+#include <linux/kexec.h>
#include <linux/pagemap.h>
#include <linux/pci.h>
#include <linux/pfn.h>
@@ -738,6 +739,12 @@ void __init mem_init(void)
after_bootmem = 1;
x86_init.hyper.init_after_bootmem();

+ /*
+ * Now that all KHO pages are marked as reserved, let's flip them back
+ * to normal pages with accurate refcount.
+ */
+ kho_populate_refcount();
+
/*
* Check boundaries twice: Some fundamental inconsistencies can
* be detected at build time already.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a0dffaca6d2b..0c64790b126b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -20,6 +20,7 @@
#include <linux/smp.h>
#include <linux/init.h>
#include <linux/initrd.h>
+#include <linux/kexec.h>
#include <linux/pagemap.h>
#include <linux/memblock.h>
#include <linux/proc_fs.h>
@@ -1339,6 +1340,12 @@ void __init mem_init(void)
after_bootmem = 1;
x86_init.hyper.init_after_bootmem();

+ /*
+ * Now that all KHO pages are marked as reserved, let's flip them back
+ * to normal pages with accurate refcount.
+ */
+ kho_populate_refcount();
+
/*
* Must be done after boot memory is put on freelist, because here we
* might set fields in deferred struct pages that have not yet been
--
2.40.1

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
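
Note on the setup_data handling above: setup_kho() prepends a SETUP_KEXEC_KHO node to the boot parameter chain, and parse_setup_data() in the new kernel walks that chain by type. The following is a minimal user-space sketch of that idea, not the kernel code. The `params_buf` type and the `push_node()`/`find_node()` helpers are hypothetical names made up for illustration, and buffer offsets plus `load_addr` stand in for physical addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define SETUP_RNG_SEED   9
#define SETUP_KEXEC_KHO 10

/* Simplified stand-in for the boot protocol's chain header. */
struct setup_data {
	uint64_t next;	/* "physical" address of the next node, 0 ends the chain */
	uint32_t type;
	uint32_t len;	/* payload length, not counting this header */
};

/* Same fields as struct kho_data in the patch. */
struct kho_data {
	uint64_t dt_addr, dt_size;
	uint64_t scratch_addr, scratch_size;
	uint64_t mem_cache_addr, mem_cache_size;
};

/* Hypothetical model of the params blob; load_addr must be non-zero. */
struct params_buf {
	uint64_t load_addr;	/* models params_load_addr */
	uint64_t head;		/* models params->hdr.setup_data */
	unsigned char buf[512];
};

/* Write a node at offset off and make it the new head of the chain. */
static void push_node(struct params_buf *p, size_t off, uint32_t type,
		      const void *payload, uint32_t len)
{
	struct setup_data sd = { .next = p->head, .type = type, .len = len };

	assert(off + sizeof(sd) + len <= sizeof(p->buf));
	memcpy(p->buf + off, &sd, sizeof(sd));
	memcpy(p->buf + off + sizeof(sd), payload, len);
	p->head = p->load_addr + off;
}

/* Walk the chain and copy out the first payload of the given type. */
static int find_node(const struct params_buf *p, uint32_t type,
		     void *out, uint32_t len)
{
	uint64_t pa = p->head;

	while (pa) {
		struct setup_data sd;
		size_t off = (size_t)(pa - p->load_addr);

		memcpy(&sd, p->buf + off, sizeof(sd));
		if (sd.type == type && sd.len == len) {
			memcpy(out, p->buf + off + sizeof(sd), len);
			return 1;
		}
		pa = sd.next;
	}
	return 0;
}
```

A caller would push a `kho_data` payload with `push_node(&p, 0, SETUP_KEXEC_KHO, &kho, sizeof(kho))` and later recover it with `find_node()`. Nodes pushed afterwards, such as an RNG seed, end up in front of it in the chain.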

2024-01-17 14:51:23

by Alexander Graf

Subject: [PATCH v3 11/17] tracing: Introduce kho serialization

We want to be able to transfer ftrace state from one kernel to the next.
To start off, let's establish all the boilerplate needed to get a write
hook when KHO wants to serialize, and fill out basic data.

Follow-up patches will fill in serialization of ring buffers and events.

Signed-off-by: Alexander Graf <[email protected]>

---

v1 -> v2:

- Remove ifdefs
---
kernel/trace/trace.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index a0defe156b57..9a0d96975c9c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -32,6 +32,7 @@
#include <linux/percpu.h>
#include <linux/splice.h>
#include <linux/kdebug.h>
+#include <linux/kexec.h>
#include <linux/string.h>
#include <linux/mount.h>
#include <linux/rwsem.h>
@@ -866,6 +867,8 @@ static struct tracer *trace_types __read_mostly;
*/
DEFINE_MUTEX(trace_types_lock);

+static bool trace_in_kho;
+
/*
* serialize the access of the ring buffer
*
@@ -10574,12 +10577,56 @@ void __init early_trace_init(void)
init_events();
}

+static int trace_kho_notifier(struct notifier_block *self,
+ unsigned long cmd,
+ void *v)
+{
+ const char compatible[] = "ftrace-v1";
+ void *fdt = v;
+ int err = 0;
+
+ switch (cmd) {
+ case KEXEC_KHO_ABORT:
+ if (trace_in_kho)
+ mutex_unlock(&trace_types_lock);
+ trace_in_kho = false;
+ return NOTIFY_DONE;
+ case KEXEC_KHO_DUMP:
+ /* Handled below */
+ break;
+ default:
+ return NOTIFY_BAD;
+ }
+
+ if (unlikely(tracing_disabled))
+ return NOTIFY_DONE;
+
+ err |= fdt_begin_node(fdt, "ftrace");
+ err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+ err |= fdt_end_node(fdt);
+
+ if (!err) {
+ /* Hold all future allocations */
+ mutex_lock(&trace_types_lock);
+ trace_in_kho = true;
+ }
+
+ return err ? NOTIFY_BAD : NOTIFY_DONE;
+}
+
+static struct notifier_block trace_kho_nb = {
+ .notifier_call = trace_kho_notifier,
+};
+
void __init trace_init(void)
{
trace_event_init();

if (boot_instance_index)
enable_instances();
+
+ if (IS_ENABLED(CONFIG_FTRACE_KHO))
+ register_kho_notifier(&trace_kho_nb);
}

__init static void clear_boot_tracer(void)
--
2.40.1
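
Note on the notifier above: on KEXEC_KHO_DUMP it serializes and, if that worked, takes trace_types_lock and sets trace_in_kho. On KEXEC_KHO_ABORT it drops the lock again. A minimal user-space sketch of that state machine follows. All names in it (`trace_model`, `model_notify()`, the `MODEL_*` return codes) are hypothetical, a plain bool stands in for the mutex, and the guard against a second DUMP exists only in the model, not in the patch.

```c
#include <assert.h>
#include <stdbool.h>

enum kho_cmd { KHO_DUMP, KHO_ABORT };
enum model_ret { MODEL_DONE, MODEL_BAD };

struct trace_model {
	bool lock_held;	/* stands in for trace_types_lock */
	bool in_kho;	/* stands in for trace_in_kho */
};

/*
 * serialize_err models the accumulated result of the fdt_* calls:
 * zero means the "ftrace" node was written successfully.
 */
static enum model_ret model_notify(struct trace_model *t, enum kho_cmd cmd,
				   int serialize_err)
{
	assert(t);

	switch (cmd) {
	case KHO_ABORT:
		/* Only release the lock if a previous DUMP took it. */
		if (t->in_kho)
			t->lock_held = false;
		t->in_kho = false;
		return MODEL_DONE;
	case KHO_DUMP:
		break;
	default:
		return MODEL_BAD;
	}

	/* Model-only guard: a second DUMP would block on the held lock. */
	if (t->in_kho)
		return MODEL_BAD;

	if (serialize_err)
		return MODEL_BAD;

	/* Hold all future allocations until abort or kexec. */
	t->lock_held = true;
	t->in_kho = true;
	return MODEL_DONE;
}
```

The point of the flag is that ABORT can arrive without a preceding successful DUMP, in which case the lock must not be released.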

2024-01-17 14:51:33

by Alexander Graf

Subject: [PATCH v3 07/17] kexec: Add documentation for KHO

With KHO in place, let's add documentation that describes what it is and
how to use it.

Signed-off-by: Alexander Graf <[email protected]>

---

v2 -> v3:

- Fix wording
- Add Documentation to MAINTAINERS file
---
Documentation/kho/concepts.rst | 88 ++++++++++++++++++++++++++++++++
Documentation/kho/index.rst | 19 +++++++
Documentation/kho/usage.rst | 57 +++++++++++++++++++++
Documentation/subsystem-apis.rst | 1 +
MAINTAINERS | 1 +
5 files changed, 166 insertions(+)
create mode 100644 Documentation/kho/concepts.rst
create mode 100644 Documentation/kho/index.rst
create mode 100644 Documentation/kho/usage.rst

diff --git a/Documentation/kho/concepts.rst b/Documentation/kho/concepts.rst
new file mode 100644
index 000000000000..cb8330bcb06c
--- /dev/null
+++ b/Documentation/kho/concepts.rst
@@ -0,0 +1,88 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+=======================
+Kexec Handover Concepts
+=======================
+
+Kexec HandOver (KHO) is a mechanism that allows Linux to preserve state -
+arbitrary properties as well as memory locations - across kexec.
+
+It introduces multiple concepts:
+
+KHO Device Tree
+---------------
+
+Every KHO kexec carries a KHO specific flattened device tree blob that
+describes the state of the system. Device drivers can register to KHO to
+serialize their state before kexec. After KHO, device drivers can read
+the device tree and extract previous state.
+
+KHO only uses the fdt container format and libfdt library, but does not
+adhere to the same property semantics that normal device trees do: Properties
+are passed in native endianness and standardized properties like ``regs`` and
+``ranges`` do not exist, hence there are no ``#...-cells`` properties.
+
+KHO introduces a new concept to its device tree: ``mem`` properties. A
+``mem`` property can be inside any subnode in the device tree. When present,
+it contains an array of physical memory ranges that the new kernel must mark
+as reserved on boot. It is recommended, but not required, to make these ranges
+as physically contiguous as possible to reduce the number of array elements ::
+
+ struct kho_mem {
+ __u64 addr;
+ __u64 len;
+ };
+
+After boot, drivers can call the kho subsystem to transfer ownership of memory
+that was reserved via a ``mem`` property to themselves to continue using memory
+from the previous execution.
+
+The KHO device tree follows the in-Linux schema requirements. Any element in
+the device tree is documented via device tree schema yamls that explain what
+data gets transferred.
+
+Mem cache
+---------
+
+The new kernel needs to know about all memory reservations, but is unable to
+parse the device tree yet in early bootup code because of memory limitations.
+To simplify the initial memory reservation flow, the old kernel passes a
+preprocessed array of physically contiguous reserved ranges to the new kernel.
+
+These reservations have to be separate from architectural memory maps and
+reservations because they differ on every kexec, while the architectural ones
+get passed directly between invocations.
+
+The fewer entries this cache contains, the faster the new kernel will boot.
+
+Scratch Region
+--------------
+
+To boot into kexec, we need to have a physically contiguous memory range that
+contains no handed over memory. Kexec then places the target kernel and initrd
+into that region. The new kernel exclusively uses this region for memory
+allocations before it ingests the mem cache.
+
+We guarantee that we always have such a region through the scratch region: On
+first boot, you can pass the ``kho_scratch`` kernel command line option. When
+it is set, Linux allocates a CMA region of the given size. CMA gives us the
+guarantee that no handover pages land in that region, because handover
+pages must be at a static physical memory location and CMA enforces that
+only movable pages can be located inside.
+
+After KHO kexec, we ignore the ``kho_scratch`` kernel command line option and
+instead reuse the exact same region that was originally allocated. This allows
+us to recursively execute any amount of KHO kexecs. Because we used this region
+for boot memory allocations and as target memory for kexec blobs, some parts
+of that memory region may be reserved. These reservations are irrelevant for
+the next KHO, because kexec can overwrite even the original kernel.
+
+KHO active phase
+----------------
+
+To enable a user space based kexec file loader, the kernel needs to be able to
+provide the device tree that describes the previous kernel's state before
+performing the actual kexec. The process of generating that device tree is
+called serialization. When the device tree is generated, some properties
+of the system may become immutable because they are already written down
+in the device tree. That state is called the KHO active phase.
diff --git a/Documentation/kho/index.rst b/Documentation/kho/index.rst
new file mode 100644
index 000000000000..5e7eeeca8520
--- /dev/null
+++ b/Documentation/kho/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+========================
+Kexec Handover Subsystem
+========================
+
+.. toctree::
+ :maxdepth: 1
+
+ concepts
+ usage
+
+.. only:: subproject and html
+
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/kho/usage.rst b/Documentation/kho/usage.rst
new file mode 100644
index 000000000000..59e82f609f75
--- /dev/null
+++ b/Documentation/kho/usage.rst
@@ -0,0 +1,57 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+====================
+Kexec Handover Usage
+====================
+
+Kexec HandOver (KHO) is a mechanism that allows Linux to preserve state -
+arbitrary properties as well as memory locations - across kexec.
+
+This document expects that you are familiar with the base KHO
+:ref:`Documentation/kho/concepts.rst <concepts>`. If you have not read
+them yet, please do so now.
+
+Prerequisites
+-------------
+
+KHO is available when the ``CONFIG_KEXEC_KHO`` config option is set to y
+at compile time. Every KHO producer has its own config option that you
+need to enable if you would like to preserve their respective state across
+kexec.
+
+To use KHO, please boot the kernel with the ``kho_scratch`` command
+line parameter set to allocate a scratch region. For example
+``kho_scratch=512M`` will reserve a 512 MiB scratch region on boot.
+
+Perform a KHO kexec
+-------------------
+
+Before you can perform a KHO kexec, you need to move the system into the
+:ref:`Documentation/kho/concepts.rst <KHO active phase>` ::
+
+ $ echo 1 > /sys/kernel/kho/active
+
+After this command, the KHO device tree is available in ``/sys/kernel/kho/dt``.
+
+Next, load the target payload and kexec into it. It is important that you
+use the ``-s`` parameter to use the in-kernel kexec file loader, as user
+space kexec tooling currently has no support for KHO with the user space
+based file loader ::
+
+ # kexec -l Image --initrd=initrd -s
+ # kexec -e
+
+The new kernel will boot up and contain some of the previous kernel's state.
+
+For example, if you enabled ``CONFIG_FTRACE_KHO``, the new kernel will contain
+the old kernel's trace buffers in ``/sys/kernel/debug/tracing/trace``.
+
+Abort a KHO kexec
+-----------------
+
+You can move the system out of KHO active phase again by calling ::
+
+ $ echo 0 > /sys/kernel/kho/active
+
+After this command, the KHO device tree is no longer available in
+``/sys/kernel/kho/dt``.
diff --git a/Documentation/subsystem-apis.rst b/Documentation/subsystem-apis.rst
index 2d353fb8ea26..7c366337db5d 100644
--- a/Documentation/subsystem-apis.rst
+++ b/Documentation/subsystem-apis.rst
@@ -87,3 +87,4 @@ Storage interfaces
peci/index
wmi/index
tee/index
+ kho/index
diff --git a/MAINTAINERS b/MAINTAINERS
index 88bf6730d801..1c48e4ea4005 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11826,6 +11826,7 @@ S: Maintained
W: http://kernel.org/pub/linux/utils/kernel/kexec/
F: Documentation/ABI/testing/sysfs-firmware-kho
F: Documentation/ABI/testing/sysfs-kernel-kho
+F: Documentation/kho/
F: include/linux/kexec.h
F: include/uapi/linux/kexec.h
F: kernel/kexec*
--
2.40.1
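
Note on the "Mem cache" concept documented above: the old kernel hands over a preprocessed array of physically contiguous reserved ranges, and fewer entries mean a faster boot. One way such an array could be produced from the individual ``mem`` entries is to sort them by address and merge neighbours. The sketch below illustrates that under stated assumptions. `kho_mem_merge()` is a hypothetical helper, not a function from this series, and it ignores address overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct kho_mem {
	uint64_t addr;
	uint64_t len;
};

static int cmp_mem(const void *a, const void *b)
{
	const struct kho_mem *x = a, *y = b;

	if (x->addr < y->addr)
		return -1;
	if (x->addr > y->addr)
		return 1;
	return 0;
}

/*
 * Sort the ranges by start address, then fold every range that touches
 * or overlaps its predecessor into it. Works in place and returns the
 * number of entries that remain.
 */
static size_t kho_mem_merge(struct kho_mem *mems, size_t n)
{
	size_t out = 0, i;

	assert(mems || n == 0);
	if (n == 0)
		return 0;

	qsort(mems, n, sizeof(*mems), cmp_mem);

	for (i = 1; i < n; i++) {
		uint64_t end = mems[out].addr + mems[out].len;

		if (mems[i].addr <= end) {
			uint64_t iend = mems[i].addr + mems[i].len;

			/* Extend the current range if this one reaches further. */
			if (iend > end)
				mems[out].len = iend - mems[out].addr;
		} else {
			mems[++out] = mems[i];
		}
	}

	return out + 1;
}
```

Three adjacent 4 KiB pages collapse into one 12 KiB entry this way, which is why the documentation recommends keeping preserved memory as physically contiguous as possible.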

2024-01-17 14:52:30

by Alexander Graf

Subject: [PATCH v3 08/17] arm64: Add KHO support

We now have all bits in place to support KHO kexecs. This patch adds
awareness of KHO in the kexec file as well as boot path for arm64 and
adds the respective kconfig option to the architecture so that it can
use KHO successfully.

Signed-off-by: Alexander Graf <[email protected]>

---

v1 -> v2:

- test bot warning fix
- Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
- s/kho_reserve_mem/kho_reserve_previous_mem/g
- s/kho_reserve/kho_reserve_scratch/g
- Remove / reduce ifdefs for kho fdt code
---
arch/arm64/Kconfig | 3 +++
arch/arm64/kernel/setup.c | 2 ++
arch/arm64/mm/init.c | 8 ++++++
drivers/of/fdt.c | 39 ++++++++++++++++++++++++++++
drivers/of/kexec.c | 54 +++++++++++++++++++++++++++++++++++++++
5 files changed, 106 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 8f6cf1221b6a..44d8923d9db4 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1496,6 +1496,9 @@ config ARCH_SUPPORTS_KEXEC_IMAGE_VERIFY_SIG
config ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG
def_bool y

+config ARCH_SUPPORTS_KEXEC_KHO
+ def_bool y
+
config ARCH_SUPPORTS_CRASH_DUMP
def_bool y

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 417a8a86b2db..9aa05b84d202 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -346,6 +346,8 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)

paging_init();

+ kho_reserve_previous_mem();
+
acpi_table_upgrade();

/* Parse the ACPI tables for possible boot-time configuration */
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 74c1db8ce271..1a8fc91509af 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -358,6 +358,8 @@ void __init bootmem_init(void)
*/
arch_reserve_crashkernel();

+ kho_reserve_scratch();
+
memblock_dump_all();
}

@@ -386,6 +388,12 @@ void __init mem_init(void)
/* this will put all unused low memory onto the freelists */
memblock_free_all();

+ /*
+ * Now that all KHO pages are marked as reserved, let's flip them back
+ * to normal pages with accurate refcount.
+ */
+ kho_populate_refcount();
+
/*
* Check boundaries twice: Some fundamental inconsistencies can be
* detected at build time already.
diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index bf502ba8da95..f9b9a36fb722 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -1006,6 +1006,42 @@ void __init early_init_dt_check_for_usable_mem_range(void)
memblock_add(rgn[i].base, rgn[i].size);
}

+/**
+ * early_init_dt_check_kho - Decode info required for kexec handover from DT
+ */
+static void __init early_init_dt_check_kho(void)
+{
+ unsigned long node = chosen_node_offset;
+ u64 kho_start, scratch_start, scratch_size, mem_start, mem_size;
+ const __be32 *p;
+ int l;
+
+ if (!IS_ENABLED(CONFIG_KEXEC_KHO) || (long)node < 0)
+ return;
+
+ p = of_get_flat_dt_prop(node, "linux,kho-dt", &l);
+ if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
+ return;
+
+ kho_start = dt_mem_next_cell(dt_root_addr_cells, &p);
+
+ p = of_get_flat_dt_prop(node, "linux,kho-scratch", &l);
+ if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
+ return;
+
+ scratch_start = dt_mem_next_cell(dt_root_addr_cells, &p);
+ scratch_size = dt_mem_next_cell(dt_root_size_cells, &p);
+
+ p = of_get_flat_dt_prop(node, "linux,kho-mem", &l);
+ if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
+ return;
+
+ mem_start = dt_mem_next_cell(dt_root_addr_cells, &p);
+ mem_size = dt_mem_next_cell(dt_root_size_cells, &p);
+
+ kho_populate(kho_start, scratch_start, scratch_size, mem_start, mem_size);
+}
+
#ifdef CONFIG_SERIAL_EARLYCON

int __init early_init_dt_scan_chosen_stdout(void)
@@ -1304,6 +1340,9 @@ void __init early_init_dt_scan_nodes(void)

/* Handle linux,usable-memory-range property */
early_init_dt_check_for_usable_mem_range();
+
+ /* Handle kexec handover */
+ early_init_dt_check_kho();
}

bool __init early_init_dt_scan(void *params)
diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index 68278340cecf..59070b09ad45 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -264,6 +264,55 @@ static inline int setup_ima_buffer(const struct kimage *image, void *fdt,
}
#endif /* CONFIG_IMA_KEXEC */

+static int kho_add_chosen(const struct kimage *image, void *fdt, int chosen_node)
+{
+ void *dt = NULL;
+ phys_addr_t dt_mem = 0;
+ phys_addr_t dt_len = 0;
+ phys_addr_t scratch_mem = 0;
+ phys_addr_t scratch_len = 0;
+ void *mem_cache = NULL;
+ phys_addr_t mem_cache_mem = 0;
+ phys_addr_t mem_cache_len = 0;
+ int ret = 0;
+
+#ifdef CONFIG_KEXEC_KHO
+ dt = image->kho.dt.buffer;
+ dt_mem = image->kho.dt.mem;
+ dt_len = image->kho.dt.bufsz;
+
+ scratch_mem = kho_scratch_phys;
+ scratch_len = kho_scratch_len;
+
+ mem_cache = image->kho.mem_cache.buffer;
+ mem_cache_mem = image->kho.mem_cache.mem;
+ mem_cache_len = image->kho.mem_cache.bufsz;
+#endif
+
+ if (!dt || !mem_cache)
+ goto out;
+
+ pr_debug("Adding kho metadata to DT\n");
+
+ ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-dt",
+ dt_mem, dt_len);
+ if (ret)
+ goto out;
+
+ ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-scratch",
+ scratch_mem, scratch_len);
+ if (ret)
+ goto out;
+
+ ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, "linux,kho-mem",
+ mem_cache_mem, mem_cache_len);
+ if (ret)
+ goto out;
+
+out:
+ return ret;
+}
+
/*
* of_kexec_alloc_and_setup_fdt - Alloc and setup a new Flattened Device Tree
*
@@ -412,6 +461,11 @@ void *of_kexec_alloc_and_setup_fdt(const struct kimage *image,
}
}

+ /* Add kho metadata if this is a KHO image */
+ ret = kho_add_chosen(image, fdt, chosen_node);
+ if (ret)
+ goto out;
+
/* add bootargs */
if (cmdline) {
ret = fdt_setprop_string(fdt, chosen_node, "bootargs", cmdline);
--
2.40.1
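
Note on the /chosen properties above: "linux,kho-dt", "linux,kho-scratch" and "linux,kho-mem" are each a single address/size pair, encoded as big-endian 32-bit cells whose counts come from the root node's #address-cells and #size-cells. The sketch below shows how such a pair decodes in plain user-space C. `next_cell()` and `decode_range()` are hypothetical stand-ins written for this note. They mirror the idea of dt_mem_next_cell() but are not the kernel helpers.

```c
#include <assert.h>
#include <stdint.h>

struct range {
	uint64_t addr;
	uint64_t size;
};

/* Read `cells` big-endian 32-bit cells and advance the cursor. */
static uint64_t next_cell(int cells, const uint8_t **p)
{
	uint64_t v = 0;

	while (cells--) {
		uint32_t cell = (uint32_t)(*p)[0] << 24 |
				(uint32_t)(*p)[1] << 16 |
				(uint32_t)(*p)[2] << 8 |
				(uint32_t)(*p)[3];

		v = (v << 32) | cell;
		*p += 4;
	}

	return v;
}

/*
 * Decode one address/size pair. The property length must match the
 * cell counts exactly, otherwise the property is rejected.
 */
static int decode_range(const uint8_t *prop, int len, int addr_cells,
			int size_cells, struct range *out)
{
	const uint8_t *p = prop;

	assert(prop && out);
	if (len != (addr_cells + size_cells) * 4)
		return -1;

	out->addr = next_cell(addr_cells, &p);
	out->size = next_cell(size_cells, &p);

	return 0;
}
```

When the two cell counts differ, the size has to be read with the size cell count. Reading it with the address cell count would consume bytes beyond the end of the property.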

2024-01-17 14:52:30

by Alexander Graf

Subject: [PATCH v3 12/17] tracing: Add kho serialization of trace buffers

When we do a kexec handover, we want to preserve previous ftrace data
into the new kernel. At the point when we write out the handover data,
ftrace may still be running and recording new events and we want to
capture all of those too.

To allow the new kernel to revive all trace data up to reboot, we store
all locations of trace buffers as well as their linked list metadata. We
can then later reuse the linked list to reconstruct the head pointer.

This patch implements the write-out logic for trace buffers.

Signed-off-by: Alexander Graf <[email protected]>

---

v1 -> v2:

- Leave the node generation code that needs to know the name in
trace.c so that ring buffers can stay anonymous

v2 -> v3:

- s/"global_trace"/"global-trace"/
- s/"trace_flags"/"trace-flags"/
---
include/linux/ring_buffer.h | 2 +
kernel/trace/ring_buffer.c | 76 +++++++++++++++++++++++++++++++++++++
kernel/trace/trace.c | 16 ++++++++
3 files changed, 94 insertions(+)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 782e14f62201..1c5eb33f0cb5 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -211,4 +211,6 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
#define trace_rb_cpu_prepare NULL
#endif

+int ring_buffer_kho_write(void *fdt, struct trace_buffer *buffer);
+
#endif /* _LINUX_RING_BUFFER_H */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9286f88fcd32..33b41013cda9 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -20,6 +20,7 @@
#include <linux/percpu.h>
#include <linux/mutex.h>
#include <linux/delay.h>
+#include <linux/kexec.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/hash.h>
@@ -5859,6 +5860,81 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
return 0;
}

+#ifdef CONFIG_FTRACE_KHO
+static int rb_kho_write_cpu(void *fdt, struct trace_buffer *buffer, int cpu)
+{
+ int i = 0;
+ int err = 0;
+ struct list_head *tmp;
+ const char compatible[] = "ftrace,cpu-v1";
+ char name[] = "cpuffffffff";
+ int nr_pages;
+ struct ring_buffer_per_cpu *cpu_buffer;
+ bool first_loop = true;
+ struct kho_mem *mem;
+ uint64_t mem_len;
+
+ if (!cpumask_test_cpu(cpu, buffer->cpumask))
+ return 0;
+
+ cpu_buffer = buffer->buffers[cpu];
+
+ nr_pages = cpu_buffer->nr_pages;
+ mem_len = sizeof(*mem) * nr_pages * 2;
+ mem = vmalloc(mem_len);
+
+ snprintf(name, sizeof(name), "cpu%x", cpu);
+
+ err |= fdt_begin_node(fdt, name);
+ err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+ err |= fdt_property(fdt, "cpu", &cpu, sizeof(cpu));
+
+ for (tmp = rb_list_head(cpu_buffer->pages);
+ tmp != rb_list_head(cpu_buffer->pages) || first_loop;
+ tmp = rb_list_head(tmp->next), first_loop = false) {
+ struct buffer_page *bpage = (struct buffer_page *)tmp;
+
+ /* Ring is larger than it should be? */
+ if (i >= (nr_pages * 2)) {
+ pr_err("ftrace ring has more pages than nr_pages (%d / %d)\n", i, nr_pages);
+ err = -EINVAL;
+ break;
+ }
+
+ /* First describe the bpage */
+ mem[i++] = (struct kho_mem) {
+ .addr = __pa(bpage),
+ .len = sizeof(*bpage)
+ };
+
+ /* Then the data page */
+ mem[i++] = (struct kho_mem) {
+ .addr = __pa(bpage->page),
+ .len = PAGE_SIZE
+ };
+ }
+
+ err |= fdt_property(fdt, "mem", mem, mem_len);
+ err |= fdt_end_node(fdt);
+
+ vfree(mem);
+ return err;
+}
+
+int ring_buffer_kho_write(void *fdt, struct trace_buffer *buffer)
+{
+ int err, i;
+
+ for (i = 0; i < buffer->cpus; i++) {
+ err = rb_kho_write_cpu(fdt, buffer, i);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+#endif
+
#ifdef CONFIG_RING_BUFFER_STARTUP_TEST
/*
* This is a basic integrity check of the ring buffer.
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 9a0d96975c9c..9505a929a726 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -10577,6 +10577,21 @@ void __init early_trace_init(void)
init_events();
}

+static int trace_kho_write_trace_array(void *fdt, struct trace_array *tr)
+{
+ const char *name = tr->name ? tr->name : "global-trace";
+ const char compatible[] = "ftrace,array-v1";
+ int err = 0;
+
+ err |= fdt_begin_node(fdt, name);
+ err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+ err |= fdt_property(fdt, "trace-flags", &tr->trace_flags, sizeof(tr->trace_flags));
+ err |= ring_buffer_kho_write(fdt, tr->array_buffer.buffer);
+ err |= fdt_end_node(fdt);
+
+ return err;
+}
+
static int trace_kho_notifier(struct notifier_block *self,
unsigned long cmd,
void *v)
@@ -10603,6 +10618,7 @@ static int trace_kho_notifier(struct notifier_block *self,

err |= fdt_begin_node(fdt, "ftrace");
err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+ err |= trace_kho_write_trace_array(fdt, &global_trace);
err |= fdt_end_node(fdt);

if (!err) {
--
2.40.1
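
Note on rb_kho_write_cpu() above: it walks the circular list of buffer pages once and emits two ``mem`` entries per page, one for the page descriptor and one for the data page, while checking that the ring is not longer than nr_pages. The following user-space sketch models that walk. `model_bpage` and `ring_to_mem()` are hypothetical names, the "physical addresses" are plain numbers, and the flag bits that rb_list_head() strips from list pointers in the kernel are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u

struct kho_mem {
	uint64_t addr;
	uint64_t len;
};

/* Simplified stand-in for struct buffer_page in a circular list. */
struct model_bpage {
	struct model_bpage *next;
	uint64_t meta_pa;	/* pretend physical address of the descriptor */
	uint64_t data_pa;	/* pretend physical address of the data page */
};

/*
 * Walk the ring exactly once, starting at head. The mem array must
 * have room for nr_pages * 2 entries. Returns the number of entries
 * written, or -1 if the ring holds more than nr_pages pages.
 */
static int ring_to_mem(const struct model_bpage *head, int nr_pages,
		       struct kho_mem *mem)
{
	const struct model_bpage *bp = head;
	int i = 0;

	assert(head && mem);

	do {
		/* Ring is larger than it should be? */
		if (i >= nr_pages * 2)
			return -1;

		/* First describe the descriptor itself */
		mem[i++] = (struct kho_mem) {
			.addr = bp->meta_pa,
			.len = sizeof(*bp),
		};

		/* Then the data page it points to */
		mem[i++] = (struct kho_mem) {
			.addr = bp->data_pa,
			.len = MODEL_PAGE_SIZE,
		};

		bp = bp->next;
	} while (bp != head);

	return i;
}
```

The do/while form expresses the same "visit the head once, then stop when we come back to it" logic that the patch implements with the first_loop flag.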

2024-01-17 14:53:24

by Alexander Graf

Subject: [PATCH v3 15/17] tracing: Recover trace events from kexec handover

This patch implements all logic necessary to match a new trace event
that we add against preserved trace events from kho. If we find a match,
we give the new trace event the old event's identifier. That way, trace
read-outs are able to make sense of buffer contents again because the
parsing code for events looks at the same identifiers.

Signed-off-by: Alexander Graf <[email protected]>

---

v1 -> v2:

- make kho_get_fdt() const
- Get events as array from a property, use fingerprint instead of
names to identify events
- Remove ifdefs
---
kernel/trace/trace_output.c | 158 +++++++++++++++++++++++++++++++++++-
1 file changed, 156 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 7d8815352e20..937002a204e1 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -24,6 +24,8 @@ DECLARE_RWSEM(trace_event_sem);

static struct hlist_head event_hash[EVENT_HASHSIZE] __read_mostly;

+static bool trace_is_kho_event(int type);
+
enum print_line_t trace_print_bputs_msg_only(struct trace_iterator *iter)
{
struct trace_seq *s = &iter->seq;
@@ -784,7 +786,7 @@ static DEFINE_IDA(trace_event_ida);

static void free_trace_event_type(int type)
{
- if (type >= __TRACE_LAST_TYPE)
+ if (type >= __TRACE_LAST_TYPE && !trace_is_kho_event(type))
ida_free(&trace_event_ida, type);
}

@@ -810,6 +812,156 @@ void trace_event_read_unlock(void)
up_read(&trace_event_sem);
}

+
+/**
+ * trace_kho_get_map - Return the KHO event map
+ * @pmap: Pointer to a trace map array. Will be filled on success.
+ * @plen: Pointer to the length of the map. Will be filled on success.
+ * @unallocated: True if the event does not have an ID yet
+ *
+ * Event types are semi-dynamically generated. To ensure that
+ * their identifiers match before and after kexec with KHO,
+ * we store an event map in the KHO DT. Whenever we need the
+ * map, this function provides it.
+ *
+ * The first time a map is requested with @unallocated set, this function
+ * also walks through it and reserves all identifiers, so that later event
+ * registration finds its identifier already reserved.
+ */
+static int trace_kho_get_map(const struct trace_event_map **pmap, int *plen,
+ bool unallocated)
+{
+ static const struct trace_event_map *event_map;
+ static int event_map_len;
+ static bool event_map_reserved;
+ const struct trace_event_map *map = NULL;
+ const void *fdt = kho_get_fdt();
+ const char *path = "/ftrace";
+ int off, err, len = 0;
+ int i;
+
+ if (!IS_ENABLED(CONFIG_FTRACE_KHO) || !fdt)
+ return -EINVAL;
+
+ if (event_map) {
+ map = event_map;
+ len = event_map_len;
+ }
+
+ if (!map) {
+ off = fdt_path_offset(fdt, path);
+
+ if (off < 0) {
+ pr_debug("Could not find '%s' in DT", path);
+ return -EINVAL;
+ }
+
+ err = fdt_node_check_compatible(fdt, off, "ftrace-v1");
+ if (err) {
+ pr_warn("Node '%s' has invalid compatible", path);
+ return -EINVAL;
+ }
+
+ map = fdt_getprop(fdt, off, "events", &len);
+ if (!map)
+ return -EINVAL;
+
+ event_map = map;
+ event_map_len = len;
+ }
+
+ if (unallocated && !event_map_reserved) {
+ /*
+ * Reserve all IDs in our IDA. We only have a working IDA later
+ * in boot, so restrict it to when we allocate a dynamic type id
+ * for an event.
+ */
+ for (i = 0; i < len; i += sizeof(*map)) {
+ const struct trace_event_map *imap = (void *)map + i;
+
+ if (imap->type < __TRACE_LAST_TYPE)
+ continue;
+ if (ida_alloc_range(&trace_event_ida, imap->type, imap->type,
+ GFP_KERNEL) != imap->type) {
+ pr_warn("Unable to reserve id %d", imap->type);
+ return -EINVAL;
+ }
+ }
+
+ event_map_reserved = true;
+ }
+
+ *pmap = map;
+ *plen = len;
+
+ return 0;
+}
+
+/**
+ * trace_is_kho_event - returns true if the event type is KHO reserved
+ * @type: the event type to check
+ *
+ * With KHO, we reserve all previous kernel's trace event types in the
+ * KHO DT. Then, when we allocate a type, we just reuse the previous
+ * kernel's value. However, that means we have to keep these type identifiers
+ * reserved across the lifetime of the system, because we may get a new event
+ * that matches the old kernel's event fingerprint. This function is a small
+ * helper that allows us to check whether a type ID is in use by KHO.
+ */
+static bool trace_is_kho_event(int type)
+{
+ const struct trace_event_map *map = NULL;
+ int len, i;
+
+ if (trace_kho_get_map(&map, &len, false))
+ return false;
+
+ if (!map)
+ return false;
+
+ for (i = 0; i < len; i += sizeof(*map), map++)
+ if (map->type == type)
+ return true;
+
+ return false;
+}
+
+/**
+ * trace_kho_fill_event_type - restore event type info from KHO
+ * @event: the event to enumerate
+ *
+ * Event types are semi-dynamically generated. To ensure that
+ * their identifiers match before and after kexec with KHO,
+ * we match up their unique fingerprint - either their predetermined
+ * type or their crc32 value - and fill in the respective type
+ * information if we booted with KHO.
+ */
+static bool trace_kho_fill_event_type(struct trace_event *event)
+{
+ const struct trace_event_map *map = NULL;
+ int len = 0, i;
+ u32 crc32;
+
+ if (trace_kho_get_map(&map, &len, !event->type))
+ return false;
+
+ crc32 = event2fp(event);
+
+ for (i = 0; i < len; i += sizeof(*map), map++) {
+ if (map->crc32 == crc32) {
+ if (!map->type)
+ return false;
+
+ event->type = map->type;
+ return true;
+ }
+ }
+
+ pr_debug("Could not find event");
+
+ return false;
+}
+
/**
* register_trace_event - register output for an event type
* @event: the event type to register
@@ -838,7 +990,9 @@ int register_trace_event(struct trace_event *event)
if (WARN_ON(!event->funcs))
goto out;

- if (!event->type) {
+ if (trace_kho_fill_event_type(event)) {
+ pr_debug("Recovered id=%d", event->type);
+ } else if (!event->type) {
event->type = alloc_trace_event_type();
if (!event->type)
goto out;
--
2.40.1

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879

2024-01-17 14:53:42

by Alexander Graf

Subject: [PATCH v3 10/17] tracing: Initialize fields before registering

With KHO, we need to know all event fields before we allocate an event
type for a trace event so that we can recover it based on a previous
execution context.

Before this patch, fields were only initialized after we allocated a
type id. After this patch, we also initialize them before registering
the event, so they are known by the time the type id is chosen.

This patch leaves the old late initialization logic in place. The field
init code already validates whether there are any fields present, which
means it's legal to call it multiple times. This way we're sure we don't
miss any call sites.
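
To illustrate why calling the field initialization more than once is
harmless, here is a minimal userspace sketch. It is not kernel code:
struct demo_call, demo_define_fields() and demo_register() are
hypothetical stand-ins for struct trace_event_call,
trace_event_define_fields() and register_trace_event(), and the field
count of 3 is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a trace event call; not the kernel structure. */
struct demo_call {
	int nr_fields;   /* number of fields currently defined */
	int define_runs; /* how often the definition work was actually done */
	int type;        /* 0 means no type id has been assigned yet */
};

/* Define the fields once; later calls find them present and do nothing. */
static int demo_define_fields(struct demo_call *call)
{
	assert(call != NULL);

	if (call->nr_fields > 0)
		return 0; /* already initialized, nothing to do */

	call->nr_fields = 3; /* assumption: pretend the event has three fields */
	call->define_runs++;
	return 0;
}

/* Register an event: fields are defined before a type id is chosen. */
static int demo_register(struct demo_call *call, int next_free_type)
{
	int ret = demo_define_fields(call);

	if (ret)
		return ret;

	if (!call->type)
		call->type = next_free_type;

	return call->type;
}
```

Calling demo_register() and then demo_define_fields() again on the same
object leaves define_runs at 1, which is the property the old late
initialization call sites depend on.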

Signed-off-by: Alexander Graf <graf@amazon.com>
---
include/linux/trace_events.h | 1 +
kernel/trace/trace_events.c | 14 +++++++++-----
kernel/trace/trace_events_synth.c | 14 +++++++++-----
kernel/trace/trace_events_user.c | 4 ++++
kernel/trace/trace_probe.c | 4 ++++
5 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f..8fe8970b48e3 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -842,6 +842,7 @@ extern int trace_define_field(struct trace_event_call *call, const char *type,
extern int trace_add_event_call(struct trace_event_call *call);
extern int trace_remove_event_call(struct trace_event_call *call);
extern int trace_event_get_offsets(struct trace_event_call *call);
+extern int trace_event_define_fields(struct trace_event_call *call);

int ftrace_set_clr_event(struct trace_array *tr, char *buf, int set);
int trace_set_clr_event(const char *system, const char *event, int set);
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index f29e815ca5b2..fbf8be1d2806 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -462,6 +462,11 @@ static void test_event_printk(struct trace_event_call *call)
int trace_event_raw_init(struct trace_event_call *call)
{
int id;
+ int ret;
+
+ ret = trace_event_define_fields(call);
+ if (ret)
+ return ret;

id = register_trace_event(&call->event);
if (!id)
@@ -2402,8 +2407,7 @@ event_subsystem_dir(struct trace_array *tr, const char *name,
return NULL;
}

-static int
-event_define_fields(struct trace_event_call *call)
+int trace_event_define_fields(struct trace_event_call *call)
{
struct list_head *head;
int ret = 0;
@@ -2592,7 +2596,7 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)

file->ei = ei;

- ret = event_define_fields(call);
+ ret = trace_event_define_fields(call);
if (ret < 0) {
pr_warn("Could not initialize trace point events/%s\n", name);
return ret;
@@ -2978,7 +2982,7 @@ __trace_add_new_event(struct trace_event_call *call, struct trace_array *tr)
if (eventdir_initialized)
return event_create_dir(tr->event_dir, file);
else
- return event_define_fields(call);
+ return trace_event_define_fields(call);
}

static void trace_early_triggers(struct trace_event_file *file, const char *name)
@@ -3015,7 +3019,7 @@ __trace_early_add_new_event(struct trace_event_call *call,
if (!file)
return -ENOMEM;

- ret = event_define_fields(call);
+ ret = trace_event_define_fields(call);
if (ret)
return ret;

diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
index e7af286af4f1..debfe852b0d8 100644
--- a/kernel/trace/trace_events_synth.c
+++ b/kernel/trace/trace_events_synth.c
@@ -880,17 +880,21 @@ static int register_synth_event(struct synth_event *event)
INIT_LIST_HEAD(&call->class->fields);
call->event.funcs = &synth_event_funcs;
call->class->fields_array = synth_event_fields_array;
+ call->flags = TRACE_EVENT_FL_TRACEPOINT;
+ call->class->reg = trace_event_reg;
+ call->class->probe = trace_event_raw_event_synth;
+ call->data = event;
+ call->tp = event->tp;
+
+ ret = trace_event_define_fields(call);
+ if (ret)
+ goto out;

ret = register_trace_event(&call->event);
if (!ret) {
ret = -ENODEV;
goto out;
}
- call->flags = TRACE_EVENT_FL_TRACEPOINT;
- call->class->reg = trace_event_reg;
- call->class->probe = trace_event_raw_event_synth;
- call->data = event;
- call->tp = event->tp;

ret = trace_add_event_call(call);
if (ret) {
diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index e76f5e1efdf2..7b7e13260932 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -1900,6 +1900,10 @@ static int user_event_trace_register(struct user_event *user)
{
int ret;

+ ret = trace_event_define_fields(&user->call);
+ if (ret)
+ return ret;
+
ret = register_trace_event(&user->call.event);

if (!ret)
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 4dc74d73fc1d..da73a02246d8 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1835,6 +1835,10 @@ int trace_probe_register_event_call(struct trace_probe *tp)
trace_probe_name(tp)))
return -EEXIST;

+ ret = trace_event_define_fields(call);
+ if (ret)
+ return ret;
+
ret = register_trace_event(&call->event);
if (!ret)
return -ENODEV;
--
2.40.1

2024-01-17 14:55:39

by Alexander Graf

Subject: [PATCH v3 14/17] tracing: Add kho serialization of trace events

Events and thus their parsing handle in ftrace have dynamic IDs that get
assigned whenever the event is added to the system. If we want to parse
trace events after kexec, we need to link event IDs back to the original
trace event that existed before we kexec'ed.

There are broadly 2 paths we could take for that:

1) Save full event description across KHO, restore after kexec,
merge identical trace events into a single identifier.
2) Recover the ID of post-kexec added events so they get the same
ID after kexec that they had before kexec

This patch implements the second option. It's simpler and thus less
intrusive. However, it means we cannot fully parse affected events
when the kernel removes or modifies trace events across a kho kexec.
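
The matching that the second option relies on can be sketched in
userspace as follows. This is only an illustration under stated
assumptions: struct demo_event_map mirrors the { crc32, type } pairs
stored in the "events" property, demo_crc32() is a plain bitwise CRC-32
update assumed to behave like the kernel's crc32_le(), and
demo_fingerprint() and demo_lookup_type() are hypothetical helpers. The
static low type numbers, which the patch fingerprints by their type
value, are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the { crc32, type } pairs in the "events" property. */
struct demo_event_map {
	uint32_t crc32;
	uint32_t type;
};

/*
 * Bitwise CRC-32 update with the reflected polynomial 0xEDB88320 and no
 * pre- or post-inversion. Userspace stand-in (assumption) for crc32_le().
 */
static uint32_t demo_crc32(uint32_t crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Fingerprint of a dynamic event: its name followed by every field size. */
static uint32_t demo_fingerprint(const char *name, const int *sizes, size_t n)
{
	uint32_t crc = ~0u;

	assert(name != NULL);
	crc = demo_crc32(crc, name, strlen(name));
	for (size_t i = 0; i < n; i++)
		crc = demo_crc32(crc, &sizes[i], sizeof(sizes[i]));
	return crc;
}

/* Return the preserved type id for a fingerprint, or 0 if none matches. */
static uint32_t demo_lookup_type(const struct demo_event_map *map, size_t n,
				 uint32_t fp)
{
	for (size_t i = 0; i < n; i++)
		if (map[i].crc32 == fp)
			return map[i].type;
	return 0;
}
```

An event whose name and field sizes are unchanged across kexec produces
the same fingerprint and gets its old type id back. An event whose field
sizes changed produces a different fingerprint, finds no entry, and
falls back to a freshly allocated id, which is why such events cannot be
fully parsed from pre-kexec buffers.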

Signed-off-by: Alexander Graf <graf@amazon.com>

---

v1 -> v2:

- Leave anything that requires a name in trace.c to keep buffers
unnamed entities
- Put events as array into a property, use fingerprint instead of
names to identify them
- Reduce footprint without CONFIG_FTRACE_KHO

v2 -> v3:

- s/"global_trace"/"global-trace"/
---
kernel/trace/trace.c | 3 +-
kernel/trace/trace_output.c | 89 +++++++++++++++++++++++++++++++++++++
kernel/trace/trace_output.h | 5 +++
3 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index a5d7f5b4c19f..b5a6a2115b75 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9364,7 +9364,7 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer);

static int trace_kho_off_tr(struct trace_array *tr)
{
- const char *name = tr->name ? tr->name : "global_trace";
+ const char *name = tr->name ? tr->name : "global-trace";
const void *fdt = kho_get_fdt();
char *path;
int off;
@@ -10648,6 +10648,7 @@ static int trace_kho_notifier(struct notifier_block *self,

err |= fdt_begin_node(fdt, "ftrace");
err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+ err |= trace_kho_write_events(fdt);
err |= trace_kho_write_trace_array(fdt, &global_trace);
err |= fdt_end_node(fdt);

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 3e7fa44dc2b2..7d8815352e20 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -12,6 +12,8 @@
#include <linux/sched/clock.h>
#include <linux/sched/mm.h>
#include <linux/idr.h>
+#include <linux/kexec.h>
+#include <linux/crc32.h>

#include "trace_output.h"

@@ -669,6 +671,93 @@ int trace_print_lat_context(struct trace_iterator *iter)
return !trace_seq_has_overflowed(s);
}

+/**
+ * event2fp - Return fingerprint of an event
+ * @event: The event to fingerprint
+ *
+ * For KHO, we need to match events before and after kexec to recover their
+ * type ids. This function returns a hash that combines an event's name and
+ * the sizes of all of its fields.
+ */
+static u32 event2fp(struct trace_event *event)
+{
+ struct ftrace_event_field *field;
+ struct trace_event_call *call;
+ struct list_head *head;
+ const char *name;
+ u32 crc32 = ~0;
+
+ /* Low type numbers are static, nothing to checksum */
+ if (event->type && event->type < __TRACE_LAST_TYPE)
+ return event->type;
+
+ call = container_of(event, struct trace_event_call, event);
+ name = trace_event_name(call);
+ if (name)
+ crc32 = crc32_le(crc32, name, strlen(name));
+
+ head = trace_get_fields(call);
+ list_for_each_entry(field, head, link)
+ crc32 = crc32_le(crc32, (char *)&field->size, sizeof(field->size));
+
+ return crc32;
+}
+
+struct trace_event_map {
+ u32 crc32;
+ u32 type;
+};
+
+static int __maybe_unused _trace_kho_write_events(void *fdt)
+{
+ struct trace_event_call *call;
+ int count = __TRACE_LAST_TYPE - 1;
+ struct trace_event_map *map;
+ int err = 0;
+ int i;
+
+ down_read(&trace_event_sem);
+ /* Allocate an array that we can place all maps into */
+ list_for_each_entry(call, &ftrace_events, list)
+ count++;
+
+ map = vmalloc(count * sizeof(*map));
+ if (!map)
+ return -ENOMEM;
+
+ /* Then fill the array with all crc32 values */
+ count = 0;
+ for (i = 1; i < __TRACE_LAST_TYPE; i++)
+ map[count++] = (struct trace_event_map) {
+ .crc32 = i,
+ .type = i,
+ };
+
+ list_for_each_entry(call, &ftrace_events, list) {
+ struct trace_event *event = &call->event;
+
+ map[count++] = (struct trace_event_map) {
+ .crc32 = event2fp(event),
+ .type = event->type,
+ };
+ }
+ up_read(&trace_event_sem);
+
+ /* And finally write it into a DT variable */
+ err |= fdt_property(fdt, "events", map, count * sizeof(*map));
+
+ vfree(map);
+ return err;
+}
+
+#ifdef CONFIG_FTRACE_KHO
+int trace_kho_write_events(void *fdt)
+{
+ return _trace_kho_write_events(fdt);
+}
+#endif
+
+
/**
* ftrace_find_event - find a registered event
* @type: the type of event to look for
diff --git a/kernel/trace/trace_output.h b/kernel/trace/trace_output.h
index dca40f1f1da4..07481f295436 100644
--- a/kernel/trace/trace_output.h
+++ b/kernel/trace/trace_output.h
@@ -25,6 +25,11 @@ extern enum print_line_t print_event_fields(struct trace_iterator *iter,
extern void trace_event_read_lock(void);
extern void trace_event_read_unlock(void);
extern struct trace_event *ftrace_find_event(int type);
+#ifdef CONFIG_FTRACE_KHO
+extern int trace_kho_write_events(void *fdt);
+#else
+static inline int trace_kho_write_events(void *fdt) { return -EINVAL; }
+#endif

extern enum print_line_t trace_nop_print(struct trace_iterator *iter,
int flags, struct trace_event *event);
--
2.40.1

2024-01-17 14:56:51

by Alexander Graf

Subject: [PATCH v3 13/17] tracing: Recover trace buffers from kexec handover

When kexec handover is in place, we now know the location of all
previous buffers for ftrace rings. With this patch applied, ftrace
reassembles any new trace buffer that carries the same name as a
previous one with the same data pages that the previous buffer had.

That way, a buffer that we had in place before kexec becomes readable
after kexec again as soon as it gets initialized with the same name.
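
The per-CPU "mem" property that this patch reads follows a "bpage 0,
page 0, bpage 1, page 1, ..." pattern. A minimal userspace sketch of how
such an array can be validated and indexed is shown below. struct
demo_mem, demo_nr_pages() and demo_data_page() are hypothetical names
for this illustration; only the { address, length } element layout and
the pairing rule are taken from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of one { u64 phys_addr, u64 len } element. */
struct demo_mem {
	uint64_t addr;
	uint64_t len;
};

/*
 * Each ring buffer page uses two elements (metadata, then data), so a
 * valid array has an even element count. Returns the number of ring
 * buffer pages, or -1 if the count is odd.
 */
static long demo_nr_pages(size_t nr_mems)
{
	if (nr_mems & 1)
		return -1;
	return (long)(nr_mems / 2);
}

/*
 * Return the data page element of ring buffer page @idx, or NULL if the
 * array is malformed or @idx is out of range. The matching metadata
 * element sits directly before it at index 2 * idx.
 */
static const struct demo_mem *demo_data_page(const struct demo_mem *mem,
					     size_t nr_mems, size_t idx)
{
	assert(mem != NULL);

	if (demo_nr_pages(nr_mems) < 0 || 2 * idx + 1 >= nr_mems)
		return NULL;
	return &mem[2 * idx + 1];
}
```

With a four-element array, demo_nr_pages() reports two pages and
demo_data_page(mem, 4, 1) returns the fourth element. An odd element
count is rejected, matching the check in rb_kho_read_cpu().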

Signed-off-by: Alexander Graf <graf@amazon.com>

---

v1 -> v2:

- Move from names to fdt offsets. That way, trace.c can find the trace
array offset and then the ring buffer code only needs to read out
its per-CPU data. That way it can stay oblivious to its name.
- Make kho_get_fdt() const
- Remove ifdefs
---
include/linux/ring_buffer.h | 15 ++--
kernel/trace/ring_buffer.c | 171 ++++++++++++++++++++++++++++++++++--
kernel/trace/trace.c | 32 ++++++-
3 files changed, 206 insertions(+), 12 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 1c5eb33f0cb5..f6d6ce441890 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -84,20 +84,23 @@ void ring_buffer_discard_commit(struct trace_buffer *buffer,
/*
* size is in bytes for each per CPU buffer.
*/
-struct trace_buffer *
-__ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *key);
+struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
+ struct lock_class_key *key,
+ int tr_off);

/*
* Because the ring buffer is generic, if other users of the ring buffer get
* traced by ftrace, it can produce lockdep warnings. We need to keep each
* ring buffer's lock class separate.
*/
-#define ring_buffer_alloc(size, flags) \
-({ \
- static struct lock_class_key __key; \
- __ring_buffer_alloc((size), (flags), &__key); \
+#define ring_buffer_alloc_kho(size, flags, tr_off) \
+({ \
+ static struct lock_class_key __key; \
+ __ring_buffer_alloc((size), (flags), &__key, tr_off); \
})

+#define ring_buffer_alloc(size, flags) ring_buffer_alloc_kho(size, flags, 0)
+
int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
__poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
struct file *filp, poll_table *poll_table, int full);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 33b41013cda9..49da2e54126b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -558,6 +558,7 @@ struct trace_buffer {

struct rb_irq_work irq_work;
bool time_stamp_abs;
+ int tr_off;
};

struct ring_buffer_iter {
@@ -574,6 +575,15 @@ struct ring_buffer_iter {
int missed_events;
};

+struct rb_kho_cpu {
+ const struct kho_mem *mem;
+ uint32_t nr_mems;
+};
+
+static int rb_kho_replace_buffers(struct ring_buffer_per_cpu *cpu_buffer,
+ struct rb_kho_cpu *kho);
+static int rb_kho_read_cpu(int tr_off, int cpu, struct rb_kho_cpu *kho);
+
#ifdef RB_TIME_32

/*
@@ -1768,12 +1778,15 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
* drop data when the tail hits the head.
*/
struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
- struct lock_class_key *key)
+ struct lock_class_key *key,
+ int tr_off)
{
+ int cpu = raw_smp_processor_id();
+ struct rb_kho_cpu kho = {};
struct trace_buffer *buffer;
+ bool use_kho = false;
long nr_pages;
int bsize;
- int cpu;
int ret;

/* keep it in its own cache line */
@@ -1786,9 +1799,16 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
goto fail_free_buffer;

nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+ if (!rb_kho_read_cpu(tr_off, cpu, &kho) && kho.nr_mems > 4) {
+ nr_pages = kho.nr_mems / 2;
+ use_kho = true;
+ pr_debug("Using kho on CPU [%03d]", cpu);
+ }
+
buffer->flags = flags;
buffer->clock = trace_clock_local;
buffer->reader_lock_key = key;
+ buffer->tr_off = tr_off;

init_irq_work(&buffer->irq_work.work, rb_wake_up_waiters);
init_waitqueue_head(&buffer->irq_work.waiters);
@@ -1805,12 +1825,14 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
if (!buffer->buffers)
goto fail_free_cpumask;

- cpu = raw_smp_processor_id();
cpumask_set_cpu(cpu, buffer->cpumask);
buffer->buffers[cpu] = rb_allocate_cpu_buffer(buffer, nr_pages, cpu);
if (!buffer->buffers[cpu])
goto fail_free_buffers;

+ if (use_kho && rb_kho_replace_buffers(buffer->buffers[cpu], &kho))
+ pr_warn("Could not revive all previous trace data");
+
ret = cpuhp_state_add_instance(CPUHP_TRACE_RB_PREPARE, &buffer->node);
if (ret < 0)
goto fail_free_buffers;
@@ -5824,7 +5846,9 @@ EXPORT_SYMBOL_GPL(ring_buffer_read_page);
*/
int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
{
+ struct rb_kho_cpu kho = {};
struct trace_buffer *buffer;
+ bool use_kho = false;
long nr_pages_same;
int cpu_i;
unsigned long nr_pages;
@@ -5848,6 +5872,12 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
/* allocate minimum pages, user can later expand it */
if (!nr_pages_same)
nr_pages = 2;
+
+ if (!rb_kho_read_cpu(buffer->tr_off, cpu, &kho) && kho.nr_mems > 4) {
+ nr_pages = kho.nr_mems / 2;
+ use_kho = true;
+ }
+
buffer->buffers[cpu] =
rb_allocate_cpu_buffer(buffer, nr_pages, cpu);
if (!buffer->buffers[cpu]) {
@@ -5855,13 +5885,143 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
cpu);
return -ENOMEM;
}
+
+ if (use_kho && rb_kho_replace_buffers(buffer->buffers[cpu], &kho))
+ pr_warn("Could not revive all previous trace data");
+
smp_wmb();
cpumask_set_cpu(cpu, buffer->cpumask);
return 0;
}

-#ifdef CONFIG_FTRACE_KHO
-static int rb_kho_write_cpu(void *fdt, struct trace_buffer *buffer, int cpu)
+static int rb_kho_replace_buffers(struct ring_buffer_per_cpu *cpu_buffer,
+ struct rb_kho_cpu *kho)
+{
+ bool first_loop = true;
+ struct list_head *tmp;
+ int err = 0;
+ int i = 0;
+
+ if (!IS_ENABLED(CONFIG_FTRACE_KHO))
+ return -EINVAL;
+
+ if (kho->nr_mems != cpu_buffer->nr_pages * 2)
+ return -EINVAL;
+
+ for (tmp = rb_list_head(cpu_buffer->pages);
+ tmp != rb_list_head(cpu_buffer->pages) || first_loop;
+ tmp = rb_list_head(tmp->next), first_loop = false) {
+ struct buffer_page *bpage = (struct buffer_page *)tmp;
+ const struct kho_mem *mem_bpage = &kho->mem[i++];
+ const struct kho_mem *mem_page = &kho->mem[i++];
+ const uint64_t rb_page_head = 1;
+ struct buffer_page *old_bpage;
+ void *old_page;
+
+ old_bpage = __va(mem_bpage->addr);
+ if (!bpage)
+ goto out;
+
+ if ((ulong)old_bpage->list.next & rb_page_head) {
+ struct list_head *new_lhead;
+ struct buffer_page *new_head;
+
+ new_lhead = rb_list_head(bpage->list.next);
+ new_head = (struct buffer_page *)new_lhead;
+
+ /* Assume the buffer is completely full */
+ cpu_buffer->tail_page = bpage;
+ cpu_buffer->commit_page = bpage;
+ /* Set the head pointers to what they were before */
+ cpu_buffer->head_page->list.prev->next = (struct list_head *)
+ ((ulong)cpu_buffer->head_page->list.prev->next & ~rb_page_head);
+ cpu_buffer->head_page = new_head;
+ bpage->list.next = (struct list_head *)((ulong)new_lhead | rb_page_head);
+ }
+
+ if (rb_page_entries(old_bpage) || rb_page_write(old_bpage)) {
+ /*
+ * We want to recycle the pre-kho page, it contains
+ * trace data. To do so, we unreserve it and swap the
+ * current data page with the pre-kho one
+ */
+ old_page = kho_claim_mem(mem_page);
+
+ /* Recycle the old page, it contains data */
+ free_page((ulong)bpage->page);
+ bpage->page = old_page;
+
+ bpage->write = old_bpage->write;
+ bpage->entries = old_bpage->entries;
+ bpage->real_end = old_bpage->real_end;
+
+ local_inc(&cpu_buffer->pages_touched);
+ } else {
+ kho_return_mem(mem_page);
+ }
+
+ kho_return_mem(mem_bpage);
+ }
+
+out:
+ return err;
+}
+
+static int rb_kho_read_cpu(int tr_off, int cpu, struct rb_kho_cpu *kho)
+{
+ const void *fdt = kho_get_fdt();
+ int mem_len;
+ int err = 0;
+ char *path;
+ int off;
+
+ if (!IS_ENABLED(CONFIG_FTRACE_KHO))
+ return -EINVAL;
+
+ if (!tr_off || !fdt || !kho)
+ return -EINVAL;
+
+ path = kasprintf(GFP_KERNEL, "cpu%x", cpu);
+ if (!path)
+ return -ENOMEM;
+
+ pr_debug("Trying to revive trace cpu '%s'", path);
+
+ off = fdt_subnode_offset(fdt, tr_off, path);
+ if (off < 0) {
+ pr_debug("Could not find '%s' in DT", path);
+ err = -ENOENT;
+ goto out;
+ }
+
+ err = fdt_node_check_compatible(fdt, off, "ftrace,cpu-v1");
+ if (err) {
+ pr_warn("Node '%s' has invalid compatible", path);
+ err = -EINVAL;
+ goto out;
+ }
+
+ kho->mem = fdt_getprop(fdt, off, "mem", &mem_len);
+ if (!kho->mem) {
+ pr_warn("Node '%s' has invalid mem property", path);
+ err = -EINVAL;
+ goto out;
+ }
+
+ kho->nr_mems = mem_len / sizeof(*kho->mem);
+
+ /* Should follow "bpage 0, page 0, bpage 1, page 1, ..." pattern */
+ if ((kho->nr_mems & 1)) {
+ err = -EINVAL;
+ goto out;
+ }
+
+out:
+ kfree(path);
+ return err;
+}
+
+static int __maybe_unused rb_kho_write_cpu(void *fdt, struct trace_buffer *buffer, int cpu)
{
int i = 0;
int err = 0;
@@ -5921,6 +6081,7 @@ static int rb_kho_write_cpu(void *fdt, struct trace_buffer *buffer, int cpu)
return err;
}

+#ifdef CONFIG_FTRACE_KHO
int ring_buffer_kho_write(void *fdt, struct trace_buffer *buffer)
{
int err, i;
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 9505a929a726..a5d7f5b4c19f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9362,16 +9362,46 @@ static struct dentry *trace_instance_dir;
static void
init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer);

+static int trace_kho_off_tr(struct trace_array *tr)
+{
+ const char *name = tr->name ? tr->name : "global_trace";
+ const void *fdt = kho_get_fdt();
+ char *path;
+ int off;
+
+ if (!IS_ENABLED(CONFIG_FTRACE_KHO))
+ return 0;
+
+ if (!fdt)
+ return 0;
+
+ path = kasprintf(GFP_KERNEL, "/ftrace/%s", name);
+ if (!path)
+ return -ENOMEM;
+
+ pr_debug("Trying to revive trace buffer '%s'", path);
+
+ off = fdt_path_offset(fdt, path);
+ if (off < 0) {
+ pr_debug("Could not find '%s' in DT", path);
+ off = 0;
+ }
+
+ kfree(path);
+ return off;
+}
+
static int
allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size)
{
+ int tr_off = trace_kho_off_tr(tr);
enum ring_buffer_flags rb_flags;

rb_flags = tr->trace_flags & TRACE_ITER_OVERWRITE ? RB_FL_OVERWRITE : 0;

buf->tr = tr;

- buf->buffer = ring_buffer_alloc(size, rb_flags);
+ buf->buffer = ring_buffer_alloc_kho(size, rb_flags, tr_off);
if (!buf->buffer)
return -ENOMEM;

--
2.40.1

2024-01-17 14:58:07

by Alexander Graf

Subject: [PATCH v3 17/17] Documentation: KHO: Add ftrace bindings

We introduced KHO into Linux: A framework that allows Linux to pass
metadata and memory across kexec from Linux to Linux. KHO reuses fdt
as file format and shares a lot of the same properties of firmware-to-
Linux boot formats: It needs a stable, documented ABI that allows for
forward and backward compatibility as well as versioning.

As the first user of KHO, we introduced ftrace support, which can now
preserve trace contents across kexec, so you can use the post-kexec
kernel to read traces from the pre-kexec kernel.

This patch adds ftrace schemas, similar to the device tree bindings for
devices, to a new kho bindings directory. This allows us to require
contributors to document the data that moves across KHO kexecs and to
catch breaking changes during review.

Signed-off-by: Alexander Graf <graf@amazon.com>

---

v2 -> v3:

- Fix make dt_binding_check
- Add descriptions for each object
- s/trace_flags/trace-flags/
- s/global_trace/global-trace/
- Make all additionalProperties false
- Change subject to reflect subsystem (dt-bindings)
- Fix indentation
- Remove superfluous examples
- Convert to 64bit syntax
- Move to kho directory
---
.../kho/bindings/ftrace/ftrace-array.yaml | 38 ++++++++++++
.../kho/bindings/ftrace/ftrace-cpu.yaml | 43 +++++++++++++
Documentation/kho/bindings/ftrace/ftrace.yaml | 62 +++++++++++++++++++
3 files changed, 143 insertions(+)
create mode 100644 Documentation/kho/bindings/ftrace/ftrace-array.yaml
create mode 100644 Documentation/kho/bindings/ftrace/ftrace-cpu.yaml
create mode 100644 Documentation/kho/bindings/ftrace/ftrace.yaml

diff --git a/Documentation/kho/bindings/ftrace/ftrace-array.yaml b/Documentation/kho/bindings/ftrace/ftrace-array.yaml
new file mode 100644
index 000000000000..aa0007595b95
--- /dev/null
+++ b/Documentation/kho/bindings/ftrace/ftrace-array.yaml
@@ -0,0 +1,38 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/kho/ftrace/ftrace-array.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Ftrace trace array
+
+maintainers:
+ - Alexander Graf <graf@amazon.com>
+
+description: |
+ Ftrace can create and expose multiple different trace instances, see
+ https://docs.kernel.org/trace/ftrace.html#instances. Each instance is
+ backed by a single trace array which contains all information about where
+ the corresponding trace buffers are located and how they are configured.
+
+properties:
+ compatible:
+ enum:
+ - ftrace,array-v1
+
+ trace-flags:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: |
+ Bitmap of all the trace flags that were enabled in the trace array at the
+ point of serialization.
+
+patternProperties:
+ cpu[0-9a-f]*:
+ $ref: ftrace-cpu.yaml#
+ description: Trace buffer location for each CPU
+
+required:
+ - compatible
+ - trace-flags
+
+additionalProperties: false
diff --git a/Documentation/kho/bindings/ftrace/ftrace-cpu.yaml b/Documentation/kho/bindings/ftrace/ftrace-cpu.yaml
new file mode 100644
index 000000000000..95dec1c94fc3
--- /dev/null
+++ b/Documentation/kho/bindings/ftrace/ftrace-cpu.yaml
@@ -0,0 +1,43 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/kho/ftrace/ftrace-cpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Ftrace per-CPU ring buffer contents
+
+maintainers:
+ - Alexander Graf <graf@amazon.com>
+
+description: |
+ An ftrace trace array contains a ring buffer for each CPU. This
+ object describes the buffers of such a single CPU. It describes which
+ CPU it was used in and which memory was backing the ring buffer.
+
+properties:
+ compatible:
+ enum:
+ - ftrace,cpu-v1
+
+ cpu:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: |
+ CPU number of the CPU that this ring buffer belonged to when it was
+ serialized.
+
+ mem:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ description: |
+ Array of { u64 phys_addr, u64 len } elements that describe a list of ring
+ buffer pages. Each page consists of two elements. The first element
+ describes the location of the struct buffer_page that contains metadata
+ for a given ring buffer page, such as the ring's head indicator. The
+ second element points to the ring buffer data page which contains the raw
+ trace data.
+
+required:
+ - compatible
+ - cpu
+ - mem
+
+additionalProperties: false
diff --git a/Documentation/kho/bindings/ftrace/ftrace.yaml b/Documentation/kho/bindings/ftrace/ftrace.yaml
new file mode 100644
index 000000000000..4a7308be8dbf
--- /dev/null
+++ b/Documentation/kho/bindings/ftrace/ftrace.yaml
@@ -0,0 +1,62 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/kho/ftrace/ftrace.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Ftrace core data
+
+maintainers:
+ - Alexander Graf <graf@amazon.com>
+
+description: |
+ Ftrace can serialize its current trace buffers across kexec through KHO.
+ For each instance, it preserves the backing ring buffers. It also
+ preserves event ID associations. The post-KHO kernel can then consume
+ these bits to reassemble trace data (not configuration!) for each trace
+ instance and that way expose pre-KHO traces in post-KHO ftrace files.
+
+properties:
+ compatible:
+ enum:
+ - ftrace-v1
+
+ events:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ description:
+ Array of { u32 crc, u32 type } elements. Each element contains a unique
+ identifier for an event, followed by the identifier that this event had
+ in the previous kernel's trace buffers.
+
+# Every subnode has to be a trace array
+patternProperties:
+ ^(?!compatible|events)$:
+ $ref: ftrace-array.yaml#
+ description: Trace array description for each trace instance
+
+required:
+ - compatible
+ - events
+
+additionalProperties: true
+
+examples:
+ - |
+ ftrace {
+ compatible = "ftrace-v1";
+ events = < 1 1 2 2 3 3 >;
+
+ global-trace {
+ compatible = "ftrace,array-v1";
+ trace-flags = < 0x3354601 >;
+
+ cpu0 {
+ compatible = "ftrace,cpu-v1";
+ cpu = < 0x00 >;
+ mem = /bits/ 64 < 0x101000000 0x38
+ 0x101000100 0x1000
+ 0x101000038 0x38
+ 0x101002000 0x1000 >;
+ };
+ };
+ };
--
2.40.1

2024-01-17 14:58:44

by Alexander Graf

Subject: [PATCH v3 16/17] tracing: Add config option for kexec handover

Now that all bits are in place to allow ftrace to pass its trace data
into the next kernel on kexec, let's give users a kconfig option to
enable the functionality.

Signed-off-by: Alexander Graf <graf@amazon.com>

---

v1 -> v2:

- Select crc32
---
kernel/trace/Kconfig | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..418a5ae11aac 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -1169,6 +1169,20 @@ config HIST_TRIGGERS_DEBUG

If unsure, say N.

+config FTRACE_KHO
+ bool "Ftrace Kexec handover support"
+ depends on KEXEC_KHO
+ select CRC32
+ help
+ Enable support for ftrace to pass metadata across kexec so the new
+ kernel continues to use the previous kernel's trace buffers.
+
+ This can be useful when debugging kexec performance or correctness
+ issues: The new kernel can dump the old kernel's trace buffer which
+ contains all events until reboot.
+
+ If unsure, say N.
+
source "kernel/trace/rv/Kconfig"

endif # FTRACE
--
2.40.1

2024-01-18 05:23:33

by kernel test robot

Subject: Re: [PATCH v3 14/17] tracing: Add kho serialization of trace events

Hi Alexander,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[cannot apply to tip/x86/core arm64/for-next/core akpm-mm/mm-everything v6.7 next-20240117]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Alexander-Graf/mm-memblock-Add-support-for-scratch-memory/20240117-225136
base: linus/master
patch link: https://lore.kernel.org/r/20240117144704.602-15-graf%40amazon.com
patch subject: [PATCH v3 14/17] tracing: Add kho serialization of trace events
config: i386-randconfig-141-20240118 (https://download.01.org/0day-ci/archive/20240118/[email protected]/config)
compiler: ClangBuiltLinux clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240118/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> kernel/trace/trace_output.c:731:12: warning: unsequenced modification and access to 'count' [-Wunsequenced]
731 | map[count++] = (struct trace_event_map) {
| ^
732 | .crc32 = count,
| ~~~~~
1 warning generated.

vim +/count +731 kernel/trace/trace_output.c

710
711 static int __maybe_unused _trace_kho_write_events(void *fdt)
712 {
713 struct trace_event_call *call;
714 int count = __TRACE_LAST_TYPE - 1;
715 struct trace_event_map *map;
716 int err = 0;
717 int i;
718
719 down_read(&trace_event_sem);
720 /* Allocate an array that we can place all maps into */
721 list_for_each_entry(call, &ftrace_events, list)
722 count++;
723
724 map = vmalloc(count * sizeof(*map));
725 if (!map)
726 return -ENOMEM;
727
728 /* Then fill the array with all crc32 values */
729 count = 0;
730 for (i = 1; i < __TRACE_LAST_TYPE; i++)
> 731 map[count++] = (struct trace_event_map) {
732 .crc32 = count,
733 .type = count,
734 };
735
736 list_for_each_entry(call, &ftrace_events, list) {
737 struct trace_event *event = &call->event;
738
739 map[count++] = (struct trace_event_map) {
740 .crc32 = event2fp(event),
741 .type = event->type,
742 };
743 }
744 up_read(&trace_event_sem);
745
746 /* And finally write it into a DT variable */
747 err |= fdt_property(fdt, "events", map, count * sizeof(*map));
748
749 vfree(map);
750 return err;
751 }
752

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
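
The -Wunsequenced warning above comes from reading 'count' inside the
initializer while 'count++' modifies it in the same expression, so the stored
value is not well defined. A minimal stand-alone sketch of one way to avoid
that follows. It assumes the intent was to record the loop's type number; the
report does not say which value was meant. The names struct event_map and
fill_static_events() are hypothetical and not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's struct trace_event_map. */
struct event_map {
	uint32_t crc32;
	uint32_t type;
};

/*
 * Write identity entries for the built-in event types 1..last_type-1.
 * The index is read and advanced in separate statements, so there is no
 * unsequenced access. Returns the number of entries written.
 */
static size_t fill_static_events(struct event_map *map, uint32_t last_type)
{
	size_t count = 0;
	uint32_t type;

	for (type = 1; type < last_type; type++) {
		map[count] = (struct event_map) {
			.crc32 = type,
			.type = type,
		};
		count++;
	}

	return count;
}
```

Keeping the increment on its own line costs one statement and makes the
stored value independent of the compiler's evaluation order.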

2024-01-18 06:47:45

by kernel test robot

Subject: Re: [PATCH v3 13/17] tracing: Recover trace buffers from kexec handover

Hi Alexander,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[cannot apply to tip/x86/core arm64/for-next/core akpm-mm/mm-everything v6.7 next-20240117]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Alexander-Graf/mm-memblock-Add-support-for-scratch-memory/20240117-225136
base: linus/master
patch link: https://lore.kernel.org/r/20240117144704.602-14-graf%40amazon.com
patch subject: [PATCH v3 13/17] tracing: Recover trace buffers from kexec handover
config: arc-defconfig (https://download.01.org/0day-ci/archive/20240118/[email protected]/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240118/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

kernel/trace/ring_buffer.c: In function 'rb_kho_replace_buffers':
>> kernel/trace/ring_buffer.c:5936:66: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
5936 | cpu_buffer->head_page->list.prev->next = (struct list_head *)
| ^
kernel/trace/ring_buffer.c:5939:44: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
5939 | bpage->list.next = (struct list_head *)((ulong)new_lhead | rb_page_head);
| ^
--
>> kernel/trace/ring_buffer.c:1783: warning: Function parameter or struct member 'tr_off' not described in '__ring_buffer_alloc'

vim +5936 kernel/trace/ring_buffer.c

5896
5897 static int rb_kho_replace_buffers(struct ring_buffer_per_cpu *cpu_buffer,
5898 struct rb_kho_cpu *kho)
5899 {
5900 bool first_loop = true;
5901 struct list_head *tmp;
5902 int err = 0;
5903 int i = 0;
5904
5905 if (!IS_ENABLED(CONFIG_FTRACE_KHO))
5906 return -EINVAL;
5907
5908 if (kho->nr_mems != cpu_buffer->nr_pages * 2)
5909 return -EINVAL;
5910
5911 for (tmp = rb_list_head(cpu_buffer->pages);
5912 tmp != rb_list_head(cpu_buffer->pages) || first_loop;
5913 tmp = rb_list_head(tmp->next), first_loop = false) {
5914 struct buffer_page *bpage = (struct buffer_page *)tmp;
5915 const struct kho_mem *mem_bpage = &kho->mem[i++];
5916 const struct kho_mem *mem_page = &kho->mem[i++];
5917 const uint64_t rb_page_head = 1;
5918 struct buffer_page *old_bpage;
5919 void *old_page;
5920
5921 old_bpage = __va(mem_bpage->addr);
5922 if (!bpage)
5923 goto out;
5924
5925 if ((ulong)old_bpage->list.next & rb_page_head) {
5926 struct list_head *new_lhead;
5927 struct buffer_page *new_head;
5928
5929 new_lhead = rb_list_head(bpage->list.next);
5930 new_head = (struct buffer_page *)new_lhead;
5931
5932 /* Assume the buffer is completely full */
5933 cpu_buffer->tail_page = bpage;
5934 cpu_buffer->commit_page = bpage;
5935 /* Set the head pointers to what they were before */
> 5936 cpu_buffer->head_page->list.prev->next = (struct list_head *)
5937 ((ulong)cpu_buffer->head_page->list.prev->next & ~rb_page_head);
5938 cpu_buffer->head_page = new_head;
5939 bpage->list.next = (struct list_head *)((ulong)new_lhead | rb_page_head);
5940 }
5941
5942 if (rb_page_entries(old_bpage) || rb_page_write(old_bpage)) {
5943 /*
5944 * We want to recycle the pre-kho page, it contains
5945 * trace data. To do so, we unreserve it and swap the
5946 * current data page with the pre-kho one
5947 */
5948 old_page = kho_claim_mem(mem_page);
5949
5950 /* Recycle the old page, it contains data */
5951 free_page((ulong)bpage->page);
5952 bpage->page = old_page;
5953
5954 bpage->write = old_bpage->write;
5955 bpage->entries = old_bpage->entries;
5956 bpage->real_end = old_bpage->real_end;
5957
5958 local_inc(&cpu_buffer->pages_touched);
5959 } else {
5960 kho_return_mem(mem_page);
5961 }
5962
5963 kho_return_mem(mem_bpage);
5964 }
5965
5966 out:
5967 return err;
5968 }
5969

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
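
The -Wint-to-pointer-cast warnings above appear on 32-bit builds because the
flag constant is declared as uint64_t: the masked value is promoted to 64 bits
and then cast to a 32-bit pointer. A minimal stand-alone sketch of pointer
tagging with a pointer-sized integer type follows. It uses uintptr_t, where
kernel code would typically use unsigned long, and the names PAGE_HEAD_FLAG,
struct node, tag_head(), untag() and is_head() are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag kept in the low bit of a list pointer; same width as a pointer. */
#define PAGE_HEAD_FLAG ((uintptr_t)1)

struct node {
	struct node *next;
};

/* Set the flag bit; relies on struct node being at least 2-byte aligned. */
static struct node *tag_head(struct node *p)
{
	return (struct node *)((uintptr_t)p | PAGE_HEAD_FLAG);
}

/* Clear the flag bit to recover the real pointer. */
static struct node *untag(struct node *p)
{
	return (struct node *)((uintptr_t)p & ~PAGE_HEAD_FLAG);
}

/* Test whether the flag bit is set. */
static bool is_head(const struct node *p)
{
	return ((uintptr_t)p & PAGE_HEAD_FLAG) != 0;
}
```

Because every intermediate value has pointer width, the casts are
size-preserving on both 32-bit and 64-bit targets.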

2024-01-18 15:17:44

by kernel test robot

Subject: Re: [PATCH v3 13/17] tracing: Recover trace buffers from kexec handover

Hi Alexander,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[cannot apply to tip/x86/core arm64/for-next/core akpm-mm/mm-everything v6.7 next-20240118]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Alexander-Graf/mm-memblock-Add-support-for-scratch-memory/20240117-225136
base: linus/master
patch link: https://lore.kernel.org/r/20240117144704.602-14-graf%40amazon.com
patch subject: [PATCH v3 13/17] tracing: Recover trace buffers from kexec handover
config: i386-randconfig-061-20240118 (https://download.01.org/0day-ci/archive/20240118/[email protected]/config)
compiler: ClangBuiltLinux clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240118/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

sparse warnings: (new ones prefixed by >>)
kernel/trace/ring_buffer.c:1105:32: sparse: sparse: incorrect type in return expression (different base types) @@ expected restricted __poll_t @@ got int @@
kernel/trace/ring_buffer.c:1105:32: sparse: expected restricted __poll_t
kernel/trace/ring_buffer.c:1105:32: sparse: got int
kernel/trace/ring_buffer.c:4955:9: sparse: sparse: context imbalance in 'ring_buffer_peek' - different lock contexts for basic block
kernel/trace/ring_buffer.c:5041:9: sparse: sparse: context imbalance in 'ring_buffer_consume' - different lock contexts for basic block
kernel/trace/ring_buffer.c:5421:17: sparse: sparse: context imbalance in 'ring_buffer_empty' - different lock contexts for basic block
kernel/trace/ring_buffer.c:5451:9: sparse: sparse: context imbalance in 'ring_buffer_empty_cpu' - different lock contexts for basic block
>> kernel/trace/ring_buffer.c:5937:82: sparse: sparse: non size-preserving integer to pointer cast
kernel/trace/ring_buffer.c:5939:84: sparse: sparse: non size-preserving integer to pointer cast

vim +5937 kernel/trace/ring_buffer.c

5896
5897 static int rb_kho_replace_buffers(struct ring_buffer_per_cpu *cpu_buffer,
5898 struct rb_kho_cpu *kho)
5899 {
5900 bool first_loop = true;
5901 struct list_head *tmp;
5902 int err = 0;
5903 int i = 0;
5904
5905 if (!IS_ENABLED(CONFIG_FTRACE_KHO))
5906 return -EINVAL;
5907
5908 if (kho->nr_mems != cpu_buffer->nr_pages * 2)
5909 return -EINVAL;
5910
5911 for (tmp = rb_list_head(cpu_buffer->pages);
5912 tmp != rb_list_head(cpu_buffer->pages) || first_loop;
5913 tmp = rb_list_head(tmp->next), first_loop = false) {
5914 struct buffer_page *bpage = (struct buffer_page *)tmp;
5915 const struct kho_mem *mem_bpage = &kho->mem[i++];
5916 const struct kho_mem *mem_page = &kho->mem[i++];
5917 const uint64_t rb_page_head = 1;
5918 struct buffer_page *old_bpage;
5919 void *old_page;
5920
5921 old_bpage = __va(mem_bpage->addr);
5922 if (!bpage)
5923 goto out;
5924
5925 if ((ulong)old_bpage->list.next & rb_page_head) {
5926 struct list_head *new_lhead;
5927 struct buffer_page *new_head;
5928
5929 new_lhead = rb_list_head(bpage->list.next);
5930 new_head = (struct buffer_page *)new_lhead;
5931
5932 /* Assume the buffer is completely full */
5933 cpu_buffer->tail_page = bpage;
5934 cpu_buffer->commit_page = bpage;
5935 /* Set the head pointers to what they were before */
5936 cpu_buffer->head_page->list.prev->next = (struct list_head *)
> 5937 ((ulong)cpu_buffer->head_page->list.prev->next & ~rb_page_head);
5938 cpu_buffer->head_page = new_head;
5939 bpage->list.next = (struct list_head *)((ulong)new_lhead | rb_page_head);
5940 }
5941
5942 if (rb_page_entries(old_bpage) || rb_page_write(old_bpage)) {
5943 /*
5944 * We want to recycle the pre-kho page, it contains
5945 * trace data. To do so, we unreserve it and swap the
5946 * current data page with the pre-kho one
5947 */
5948 old_page = kho_claim_mem(mem_page);
5949
5950 /* Recycle the old page, it contains data */
5951 free_page((ulong)bpage->page);
5952 bpage->page = old_page;
5953
5954 bpage->write = old_bpage->write;
5955 bpage->entries = old_bpage->entries;
5956 bpage->real_end = old_bpage->real_end;
5957
5958 local_inc(&cpu_buffer->pages_touched);
5959 } else {
5960 kho_return_mem(mem_page);
5961 }
5962
5963 kho_return_mem(mem_bpage);
5964 }
5965
5966 out:
5967 return err;
5968 }
5969

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-01-29 16:40:37

by Philipp Rudo

Subject: Re: [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers

Hi Alex,

adding linux-integrity as there are some synergies with IMA_KEXEC (in case we
get KHO to work).

First of all, I believe that having a generic framework to pass information from
one kernel to the other across kexec would be a good thing. But I'm afraid that
you are ignoring some fundamental problems which make it extremely hard, if
not impossible, to reliably transfer the kernel's state from one kernel to the
other.

One thing I don't understand is how reusing the scratch area is working. Sure,
you pass its location via the dt/boot_params, but I don't see any code that
makes it a CMA region. So IIUC the scratch area won't be available for the 2nd
kernel. Which is probably for the better as IIUC the 2nd kernel gets loaded and
runs inside that area and I don't believe the CMA design ever considered that
the kernel image could be included in a CMA area.

Staying at reusing the scratch area. One thing that is broken for sure is that
you reuse the scratch area without ever checking the kho_scratch parameter of
the 2nd kernel's command line. Remember, with kexec you are dealing with two
different kernels with two different command lines. Meaning you can only reuse
the scratch area if the requested size in the 2nd kernel is identical to the
one of the 1st kernel. In all other cases you need to adjust the scratch area's
size or reserve a new one.

This directly leads to the next problem. In kho_reserve_previous_mem you are
reusing the different memory regions wherever the 1st kernel allocated them.
But that also means you are handing over the 1st kernel's memory
fragmentation to the 2nd kernel and you do that extremely early during boot.
Which means that users who need to allocate large contiguous physical memory,
like the scratch area or the crashkernel memory, will have an increasing chance
of not finding a suitable area. Which IMHO is unacceptable.

Finally, and that's the big elephant in the room, there is your lax handling of
the unstable kernel internal ABI. Remember, you are dealing with two different
kernels, that also means two different source levels and two different configs.
So just because both the 1st and 2nd kernel have, e.g., a struct buffer_page
doesn't mean that they have the same struct buffer_page. But that's what your
code implicitly assumes. For KHO ever to make it upstream you need to make sure
that both kernels are "speaking the same language".

Personally I see two possible solutions:

1) You introduce a stable intermediate format for every subsystem similar to
what IMA_KEXEC does. This should work for simple types like struct buffer_page
but for complex ones like struct vfio_device that's basically impossible.

2) You also hand over the ABI version for every given type (basically just a
hash over all fields including all the dependencies). So the 2nd kernel can
verify that the data handed over is in a format it can handle and if not bail
out with a descriptive error message rather than reading garbage. Plus side is
that once such a system is in place you can reuse it to automatically resolve
all dependencies so you no longer need to manually store the buffer_page and
its buffer_data_page separately.
The downside is that traversing the debuginfo (including the ones from modules)
is not a simple task and I expect that such a system will be way more complex
than the rest of KHO. In addition there are some cases that the versioning won't
be able to capture. For example if a type contains a "void *"-field. Then
although the definition of the type is identical in both kernels the field can
be cast to different types when used. Another problem will be function pointers which
you first need to resolve in the 1st kernel and then map to the identical
function in the 2nd kernel. This will become particularly "fun" when the
function is part of a module that isn't loaded at the time when you try to
recreate the kernel's state.

So to summarize, while it would be nice to have a generic framework like KHO to
pass data from one kernel to the other via kexec there are good reasons why it
doesn't exist, yet.

Thanks
Philipp
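
Option 2 above (handing over an ABI version per type) can be illustrated with
a small stand-alone sketch. It is not part of the patch set and only shows the
idea under stated assumptions: the first kernel hashes a description of each
field (name, offset, size), and the second kernel recomputes the hash from its
own layout and refuses the data on mismatch. The names struct field_desc,
layout_fingerprint() and check_layout() are hypothetical, FNV-1a is an
arbitrary choice of hash, and the sketch does not address the "void *" or
function pointer cases mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One field of a type as laid out by the kernel that describes it. */
struct field_desc {
	const char *name;
	uint32_t offset;
	uint32_t size;
};

/* 32-bit FNV-1a over a byte range, chained through 'hash'. */
static uint32_t fnv1a(uint32_t hash, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--) {
		hash ^= *p++;
		hash *= 16777619u;
	}
	return hash;
}

/*
 * Hash name, offset and size of every field. Integers are hashed in
 * native byte order, which assumes both kernels run on the same machine.
 */
static uint32_t layout_fingerprint(const struct field_desc *f, size_t n)
{
	uint32_t hash = 2166136261u;
	size_t i;

	for (i = 0; i < n; i++) {
		hash = fnv1a(hash, f[i].name, strlen(f[i].name) + 1);
		hash = fnv1a(hash, &f[i].offset, sizeof(f[i].offset));
		hash = fnv1a(hash, &f[i].size, sizeof(f[i].size));
	}
	return hash;
}

/* Returns 0 if the handed-over fingerprint matches our layout, -1 if not. */
static int check_layout(uint32_t handed_over, const struct field_desc *f,
			size_t n)
{
	return layout_fingerprint(f, n) == handed_over ? 0 : -1;
}
```

A receiving kernel would call check_layout() before touching the handed-over
structure and bail out with an error message on -1 rather than reading
garbage.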

2024-01-31 14:49:52

by Rob Herring (Arm)

Subject: Re: [PATCH v3 08/17] arm64: Add KHO support

On Wed, Jan 17, 2024 at 02:46:55PM +0000, Alexander Graf wrote:
> We now have all bits in place to support KHO kexecs. This patch adds
> awareness of KHO in the kexec file as well as boot path for arm64 and
> adds the respective kconfig option to the architecture so that it can
> use KHO successfully.
>
> Signed-off-by: Alexander Graf <graf@amazon.com>
>
> ---
>
> v1 -> v2:
>
> - test bot warning fix
> - Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
> - s/kho_reserve_mem/kho_reserve_previous_mem/g
> - s/kho_reserve/kho_reserve_scratch/g
> - Remove / reduce ifdefs for kho fdt code
> ---
> arch/arm64/Kconfig | 3 +++
> arch/arm64/kernel/setup.c | 2 ++
> arch/arm64/mm/init.c | 8 ++++++
> drivers/of/fdt.c | 39 ++++++++++++++++++++++++++++
> drivers/of/kexec.c | 54 +++++++++++++++++++++++++++++++++++++++
> 5 files changed, 106 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 8f6cf1221b6a..44d8923d9db4 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1496,6 +1496,9 @@ config ARCH_SUPPORTS_KEXEC_IMAGE_VERIFY_SIG
> config ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG
> def_bool y
>
> +config ARCH_SUPPORTS_KEXEC_KHO
> + def_bool y
> +
> config ARCH_SUPPORTS_CRASH_DUMP
> def_bool y
>
> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 417a8a86b2db..9aa05b84d202 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -346,6 +346,8 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
>
> paging_init();
>
> + kho_reserve_previous_mem();
> +
> acpi_table_upgrade();
>
> /* Parse the ACPI tables for possible boot-time configuration */
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 74c1db8ce271..1a8fc91509af 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -358,6 +358,8 @@ void __init bootmem_init(void)
> */
> arch_reserve_crashkernel();
>
> + kho_reserve_scratch();
> +
> memblock_dump_all();
> }
>
> @@ -386,6 +388,12 @@ void __init mem_init(void)
> /* this will put all unused low memory onto the freelists */
> memblock_free_all();
>
> + /*
> + * Now that all KHO pages are marked as reserved, let's flip them back
> + * to normal pages with accurate refcount.
> + */
> + kho_populate_refcount();
> +
> /*
> * Check boundaries twice: Some fundamental inconsistencies can be
> * detected at build time already.
> diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
> index bf502ba8da95..f9b9a36fb722 100644
> --- a/drivers/of/fdt.c
> +++ b/drivers/of/fdt.c
> @@ -1006,6 +1006,42 @@ void __init early_init_dt_check_for_usable_mem_range(void)
> memblock_add(rgn[i].base, rgn[i].size);
> }
>
> +/**
> + * early_init_dt_check_kho - Decode info required for kexec handover from DT
> + */
> +static void __init early_init_dt_check_kho(void)
> +{
> + unsigned long node = chosen_node_offset;
> + u64 kho_start, scratch_start, scratch_size, mem_start, mem_size;
> + const __be32 *p;
> + int l;
> +
> + if (!IS_ENABLED(CONFIG_KEXEC_KHO) || (long)node < 0)
> + return;
> +
> + p = of_get_flat_dt_prop(node, "linux,kho-dt", &l);

These need to be documented. chosen node schema lives in dtschema.

> + if (l != (dt_root_addr_cells + dt_root_size_cells) * sizeof(__be32))
> + return;

I would just make all these fixed 64-bit values rather than based on
address and size cells. That's what we've done on more recent chosen
properties describing regions.

Rob
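
The suggestion above, fixed 64-bit values instead of widths derived from
#address-cells and #size-cells, can be sketched as follows. This is a
stand-alone user-space illustration, not kernel code: parse_region64() and
read_be64() are hypothetical names, and the property is assumed to hold
exactly one <start size> pair of big-endian 64-bit values.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one big-endian 64-bit value, as stored in a flattened device tree. */
static uint64_t read_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Parse a property consisting of exactly two 64-bit values: <start size>.
 * The expected length is a constant, independent of the root node's cell
 * counts. Returns 0 on success, -1 if the property is missing or has the
 * wrong length.
 */
static int parse_region64(const void *prop, int len,
			  uint64_t *start, uint64_t *size)
{
	const unsigned char *p = prop;

	if (!prop || len != 2 * (int)sizeof(uint64_t))
		return -1;

	*start = read_be64(p);
	*size = read_be64(p + 8);
	return 0;
}
```

With a fixed layout the length check becomes a comparison against a constant,
and the parser no longer depends on state read from the root node.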

2024-02-02 12:59:27

by Alexander Graf

Subject: Re: [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers

Hi Philipp,

On 29.01.24 17:34, Philipp Rudo wrote:
> Hi Alex,
>
> adding linux-integrity as there are some synergies with IMA_KEXEC (in case we
> get KHO to work).
>
> First of all, I believe that having a generic framework to pass information from
> one kernel to the other across kexec would be a good thing. But I'm afraid that

Thanks, I'm happy to hear that you agree with the basic motivation :).
There are fundamentally 2 problems with passing data:

* Passing structured data in a cross-architecture way
* Passing memory

KHO tackles both. It proposes a common FDT based format that allows us
to pass per-subsystem properties. That way, a subsystem does not need to
know whether it's running on ARM, x86, RISC-V or s390x. It just gains
awareness for KHO and can pass data.

On top of that, it proposes a standardized "mem" property (and some
magic around that) which allows subsystems to pass memory.

> you are ignoring some fundamental problems which make it extremely hard, if
> not impossible, to reliably transfer the kernel's state from one kernel to the
> other.
>
> One thing I don't understand is how reusing the scratch area is working. Sure,
> you pass its location via the dt/boot_params, but I don't see any code that
> makes it a CMA region. So IIUC the scratch area won't be available for the 2nd
> kernel. Which is probably for the better as IIUC the 2nd kernel gets loaded and
> runs inside that area and I don't believe the CMA design ever considered that
> the kernel image could be included in a CMA area.

That one took me a long time to figure out sensibly (with recursion all the
way down) while building KHO :). I hope I detailed it sensibly in the
documentation - please let me know how to improve it in case it's
unclear: https://lore.kernel.org/lkml/[email protected]/

Let me explain inline using different words as well what happens:

The first (and only the first) kernel that boots allocates a CMA region
as "scratch region". It loads the new kernel into that region. It passes
that region as "scratch region" to the next kernel. The next kernel now
takes it and marks every page block that the scratch region spans as CMA:

https://lore.kernel.org/lkml/[email protected]/

The CMA hint doesn't mean we create an actual CMA region. It mostly
means that the kernel won't use this memory for any kernel allocations.
Kernel allocations up to this point are allocations we don't need to
pass on with KHO again. Kernel allocations past that point may be
allocations that we want to pass, so we just never place them into the
"scratch region" again.

And because we now already have a scratch region from the previous
kernel, we keep reusing that forever with any new KHO kexec.
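
To picture what "every page block that the scratch region spans" covers, here is a minimal sketch that rounds a region outward to page block boundaries. It assumes 4 KiB pages and a page block order of 9 (2 MiB page blocks); the EX_* constants, struct pageblock_span and scratch_pageblocks() are illustrative names and are not taken from the patch set.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 4 KiB pages, 512 pages per page block (2 MiB). */
#define EX_PAGE_SHIFT		12
#define EX_PAGEBLOCK_ORDER	9
#define EX_PAGEBLOCK_SIZE	(1ULL << (EX_PAGE_SHIFT + EX_PAGEBLOCK_ORDER))

/* Index of the first page block and number of page blocks touched. */
struct pageblock_span {
	uint64_t first;
	uint64_t count;
};

/*
 * Compute which page blocks a physical region [base, base + size) touches.
 * A region that is not page block aligned still claims the partially
 * covered blocks at both ends.
 */
static struct pageblock_span scratch_pageblocks(uint64_t base, uint64_t size)
{
	struct pageblock_span span;
	uint64_t start, end;

	assert(size > 0);

	start = base & ~(EX_PAGEBLOCK_SIZE - 1);
	end = (base + size + EX_PAGEBLOCK_SIZE - 1) & ~(EX_PAGEBLOCK_SIZE - 1);

	span.first = start / EX_PAGEBLOCK_SIZE;
	span.count = (end - start) / EX_PAGEBLOCK_SIZE;
	return span;
}
```

With these assumptions, an aligned 512M scratch region covers 256 page blocks, while the same region shifted by 1 MiB touches 257.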

> Staying at reusing the scratch area. One thing that is broken for sure is that
> you reuse the scratch area without ever checking the kho_scratch parameter of
> the 2nd kernel's command line. Remember, with kexec you are dealing with two
> different kernels with two different command lines. Meaning you can only reuse
> the scratch area if the requested size in the 2nd kernel is identical to the
> one of the 1st kernel. In all other cases you need to adjust the scratch area's
> size or reserve a new one.

Hm. So you're saying a user may want to change the size of the scratch
area with a KHO kexec. That's insanely risky because you (as rightfully
pointed out below) may have significant fragmentation at that point. And
we will only know when we're in the new kernel so it's too late to
abort. IMHO it's better to just declare the scratch region as immutable
during KHO to avoid that pitfall.

> This directly leads to the next problem. In kho_reserve_previous_mem you are
> reusing the different memory regions wherever the 1st kernel allocated them.
> But that also means you are handing over the 1st kernel's memory
> fragmentation to the 2nd kernel and you do that extremely early during boot.
> Which means that users who need to allocate large continuous physical memory,
> like the scratch area or the crashkernel memory, will have increasing chance to
> not find a suitable area. Which IMHO is unacceptable.

Correct :). It basically means you want to pass large allocations from
the 1st kernel that you want to preserve on to the next. So if the 1st
kernel allocated a large crash area, it's safest to pass that allocation
using KHO to ensure the next kernel also has the region fully reserved.
Otherwise the next kernel may accidentally place data into the
previously reserved crash region (which would be contiguously free at
early init of the 2nd kernel) and fragment it again.

> Finally, and that's the big elephant in the room, is your lax handling of the
> unstable kernel internal ABI. Remember, you are dealing with two different
> kernels, that also means two different source levels and two different configs.
> Just because both the 1st and 2nd kernel have e.g. a struct buffer_page
> doesn't mean that they have the same struct buffer_page. But that's what your
> code implicitly assumes. For KHO ever to make it upstream you need to make sure
> that both kernels are "speaking the same language".

Wow, I hope it didn't come across as that! The whole point of using FDT
and compatible strings in KHO is to solve exactly that problem. Any time
a passed over data structure changes incompatibly, you would need to
modify the compatible string of the subsystem that owns the now
incompatible data.

So in the example of struct buffer_page, it means that if anyone changes
the few bits we care about in struct buffer_page, we need to ensure that
the new kernel emits "ftrace,cpu-v2" compatible strings. We can at that
point choose whether we want to implement compat handling for
"ftrace,cpu-v1" style struct buffer_pages or only support same version
ingestion.
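
A minimal sketch of that versioning decision, written as standalone C: the "ftrace,cpu-v1" and "ftrace,cpu-v2" strings come from the paragraph above, while the enum, the table and ftrace_kho_match() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Formats the receiving kernel knows how to ingest (illustrative). */
enum ftrace_kho_format {
	FTRACE_KHO_UNSUPPORTED = 0,
	FTRACE_KHO_V1,
	FTRACE_KHO_V2,
};

/* One entry per compatible string this kernel is willing to accept. */
static const struct {
	const char *compatible;
	enum ftrace_kho_format format;
} known_formats[] = {
	{ "ftrace,cpu-v2", FTRACE_KHO_V2 },
	{ "ftrace,cpu-v1", FTRACE_KHO_V1 },	/* optional compat handling */
};

/*
 * Map the compatible string found in the handed-over node to a format.
 * Anything unknown is rejected instead of being parsed as garbage.
 */
static enum ftrace_kho_format ftrace_kho_match(const char *compatible)
{
	assert(compatible != NULL);

	for (size_t i = 0; i < sizeof(known_formats) / sizeof(known_formats[0]); i++) {
		if (strcmp(compatible, known_formats[i].compatible) == 0)
			return known_formats[i].format;
	}
	return FTRACE_KHO_UNSUPPORTED;
}
```

A kernel that only supports same-version ingestion would simply drop the "ftrace,cpu-v1" entry from the table. Inside the kernel the comparison would more likely go through libfdt's fdt_node_check_compatible(); strcmp() is used here only to keep the sketch self-contained.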

The one thing that we could improve on here today IMHO is to have
compile time errors if any part of struct buffer_page changes
semantically: So we'd create a few defines for the bits we want in
"ftrace,cpu-v1" as well as size of struct buffer_page and then compare
them to what the struct offsets are at compile time to ensure they stay
identical.
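
A minimal sketch of such compile time checks, using C11 static_assert and offsetof. The struct below is a stand-in and NOT the real struct buffer_page: its fields and all FTRACE_CPU_V1_* values are made up for illustration.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof */
#include <stdint.h>

/*
 * Illustrative stand-in for the handed-over structure. Only the idea of
 * "a few fields we care about" is taken from the discussion; the field
 * names and types are hypothetical.
 */
struct example_buffer_page {
	uint64_t write;		/* hypothetical field */
	uint64_t entries;	/* hypothetical field */
	uint64_t page_addr;	/* hypothetical field */
};

/* Layout frozen under the "ftrace,cpu-v1" compatible string (made-up values). */
#define FTRACE_CPU_V1_OFF_WRITE		0
#define FTRACE_CPU_V1_OFF_ENTRIES	8
#define FTRACE_CPU_V1_OFF_PAGE_ADDR	16
#define FTRACE_CPU_V1_SIZE		24

/* The build breaks as soon as the struct drifts away from the v1 layout. */
static_assert(offsetof(struct example_buffer_page, write) == FTRACE_CPU_V1_OFF_WRITE,
	      "write moved: bump the compatible string");
static_assert(offsetof(struct example_buffer_page, entries) == FTRACE_CPU_V1_OFF_ENTRIES,
	      "entries moved: bump the compatible string");
static_assert(offsetof(struct example_buffer_page, page_addr) == FTRACE_CPU_V1_OFF_PAGE_ADDR,
	      "page_addr moved: bump the compatible string");
static_assert(sizeof(struct example_buffer_page) == FTRACE_CPU_V1_SIZE,
	      "size changed: bump the compatible string");
```

Checks like these only catch layout changes (offsets and size). A field that keeps its position but changes meaning would still need a manual bump of the compatible string. In kernel code the same idea could be expressed with static_assert() or BUILD_BUG_ON().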

Please let me know how I can clarify that more in the documentation. It
really is the absolute core of KHO.

> Personally I see two possible solutions:
>
> 1) You introduce a stable intermediate format for every subsystem similar to
> what IMA_KEXEC does. This should work for simple types like struct buffer_page
> but for complex ones like struct vfio_device that's basically impossible.

I don't see why. The only reason KHO passes struct buffer_page as memory
is because we want to be able to produce traces even after KHO
serialization is done. For vfio_device, I think it's perfectly
reasonable to serialize any data we need to preserve directly into FDT
properties.

> 2) You also hand over the ABI version for every given type (basically just a
> hash over all fields including all the dependencies). So the 2nd kernel can
> verify that the data handed over is in a format it can handle and if not bail
> out with a descriptive error message rather than reading garbage. Plus side is
> that once such a system is in place you can reuse it to automatically resolve
> all dependencies so you no longer need to manually store the buffer_page and
> its buffer_data_page separately.
> Down side is that traversing the debuginfo (including the ones from modules) is
> not a simple task and I expect that such a system will be way more complex than
> the rest of KHO. In addition there are some cases that the versioning won't be
> able to capture. For example if a type contains a "void *"-field. Then although
> the definition of the type is identical in both kernels the field can be cast
> to different types when used. An other problem will be function pointers which
> you first need to resolve in the 1st kernel and then map to the identical
> function in the 2nd kernel. This will become particularly "fun" when the
> function is part of a module that isn't loaded at the time when you try to
> recreate the kernel's state.

The whole point of KHO is to leave it to the subsystem which path they
want to take. The subsystem can either pass binary data and validate as
part of FDT properties (like compatible strings). That data can be
identical to today's in-kernel data structures (usually a bad idea) or
can be a new intermediate data format. But the subsystem can also choose
to fully serialize into FDT properties and not pass any memory at all
for state that would be in structs. Or something in between.

> So to summarize, while it would be nice to have a generic framework like KHO to
> pass data from one kernel to the other via kexec there are good reasons why it
> doesn't exist, yet.

I hope my explanations above clarify things a bit. Let me know if you're
at FOSDEM, happy to talk about the internals there as well :)

Alex

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879

2024-02-06 08:18:23

by Oleksij Rempel

Subject: Re: [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers

Hi Alexander,

Nice work!

On Wed, Jan 17, 2024 at 02:46:47PM +0000, Alexander Graf wrote:
> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
>
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See James' and my Linux Plumbers
> Conference 2023 presentation for details:
>
> https://lpc.events/event/17/contributions/1485/
>
> To start us on the journey to support all the use cases above, this
> patch implements basic infrastructure to allow hand over of kernel state
> across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
> With this patch set applied, you can read ftrace records from the
> pre-kexec environment in your post-kexec one. This creates a very powerful
> debugging and performance analysis tool for kexec. It's also slightly
> easier to reason about than full blown VFIO state preservation.
>
> == Alternatives ==
>
> There are alternative approaches to (parts of) the problems above:
>
> * Memory Pools [1] - preallocated persistent memory region + allocator
> * PRMEM [2] - resizable persistent memory regions with fixed metadata
> pointer on the kernel command line + allocator
> * Pkernfs [3] - preallocated file system for in-kernel data with fixed
> address location on the kernel command line
> * PKRAM [4] - handover of user space pages using a fixed metadata page
> specified via command line
>
> All of the approaches above fundamentally have the same problem: They
> require the administrator to explicitly carve out a physical memory
> location because they have no mechanism outside of the kernel command
> line to pass data (including memory reservations) between kexec'ing
> kernels.
>
> KHO provides that base foundation. We will determine later whether we
> still need any of the approaches above for fast bulk memory handover of for
> example IOMMU page tables. But IMHO they would all be users of KHO, with
> KHO providing the foundational primitive to pass metadata and bulk memory
> reservations as well as provide easy versioning for data.
>
> == Overview ==
>
> We introduce a metadata file that the kernels pass between each other. How
> they pass it is architecture specific. The file's format is a Flattened
> Device Tree (fdt) which has a generator and parser already included in
> Linux. When the root user enables KHO through /sys/kernel/kho/active, the
> kernel invokes callbacks to every driver that supports KHO to serialize
> its state. When the actual kexec happens, the fdt is part of the image
> set that we boot into. In addition, we keep a "scratch region" available
> for kexec: A physically contiguous memory region that is guaranteed to
> not have any memory that KHO would preserve. The new kernel bootstraps
> itself using the scratch region and sets all handed over memory as in use.
> When drivers initialize that support KHO, they introspect the fdt and
> recover their state from it. This includes memory reservations, where the
> driver can either discard or claim reservations.
>
> == Limitations ==
>
> I currently only implemented file based kexec. The kernel interfaces
> in the patch set are already in place to support user space kexec as well,
> but I have not implemented it yet inside kexec tools.
>
> == How to Use ==
>
> To use the code, please boot the kernel with the "kho_scratch=" command
> line parameter set: "kho_scratch=512M". KHO requires a scratch region.
>
> Make sure to fill ftrace with contents that you want to observe after
> kexec. Then, before you invoke file based "kexec -l", activate KHO:
>
> # echo 1 > /sys/kernel/kho/active
> # kexec -l Image --initrd=initrd -s
> # kexec -e
>
> The new kernel will boot up and contain the previous kernel's trace
> buffers in /sys/kernel/debug/tracing/trace.

Assuming:
- we want to start tracing as early as possible, before rootfs
or initrd would be able to configure it.
- traces are stored on a different device, not RAM. For example NVMEM.
- Location of NVMEM is different for different board types, but
bootloader is able to give the right configuration to the kernel.

What would be the best way, acceptable for mainline, to provide this
kind of configuration? At least part of this information does not
describe devices or device states, so it would not fit into the devicetree
universe. The amount of possible information would not fit into bootconfig
either.

Another, more or less overlapping, use case I have in mind is a netbootable
embedded system with a requirement to boot as fast as possible. Since the
bootloader has already established a link and obtained all the needed IP
configuration, it would be able to hand over the Ethernet controller and IP
configuration states. Would KHO be the way to go for this use case?

Regards,
Oleksij
--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |

2024-02-06 13:45:53

by Alexander Graf

Subject: Re: [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers

Hey Oleksij!

On 06.02.24 09:17, Oleksij Rempel wrote:
> Hi Alexander,
>
> Nice work!
>
> Assuming:
> - we want to start tracing as early as possible, before rootfs
> or initrd would be able to configure it.
> - traces are stored on a different device, not RAM. For example NVMEM.
> - Location of NVMEM is different for different board types, but
> bootloader is able to give the right configuration to the kernel.

Let me try to really understand what you're tracing here. Are we talking
about exposing boot loader traces into Linux [1]? In that case, I think
a mechanism like [2] is what you're looking for.

Or do you want to transfer genuine Linux ftrace traces? In that case,
why would you want to store them outside of RAM?

>
> What would be the best way, acceptable for mainline, to provide this
> kind of configuration? At least part of this information does not
> describe devices or device states, so it would not fit into the devicetree
> universe. The amount of possible information would not fit into bootconfig
> either.

We have precedent for configuration in device tree: You can use device
tree to describe partitions on a NAND device, you can use it to specify
MAC address overrides of devices attached to USB, etc. At the end of
the day when people say they don't want configuration in device tree,
what they mean is that device tree should be a hand over data structure
from firmware to kernel, not from OS integrator to kernel :). If your
firmware is the place that knows about offsets and you need to pass
those offsets, IMHO DT is a good fit.

> Another, more or less overlapping, use case I have in mind is a netbootable
> embedded system with a requirement to boot as fast as possible. Since the
> bootloader has already established a link and obtained all the needed IP
> configuration, it would be able to hand over the Ethernet controller and IP
> configuration states. Would KHO be the way to go for this use case?

That's an interesting one too. I would lean towards "try with normal
device tree first" here as well. It's again a very clear case of
"firmware wants to tell OS about things it knows, but the OS doesn't
know" to me. That means device tree should be fine to describe it.

Alex

[1] https://www.youtube.com/watch?v=RaFm5FfzFaM /
https://edk2.groups.io/g/devel/topic/91368904
[2]
https://github.com/agraf/linux/commit/b1fe0c296ec923e9b1f544862b0eb9365a8da7cb

2024-02-06 15:13:43

by Oleksij Rempel

Subject: Re: [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers

On Tue, Feb 06, 2024 at 02:43:15PM +0100, Alexander Graf wrote:
> Hey Oleksij!
>
> On 06.02.24 09:17, Oleksij Rempel wrote:
> > Hi Alexander,
> >
> > Nice work!
> >
> > On Wed, Jan 17, 2024 at 02:46:47PM +0000, Alexander Graf wrote:
> > > Make sure to fill ftrace with contents that you want to observe after
> > > kexec. Then, before you invoke file based "kexec -l", activate KHO:
> > >
> > > # echo 1 > /sys/kernel/kho/active
> > > # kexec -l Image --initrd=initrd -s
> > > # kexec -e
> > >
> > > The new kernel will boot up and contain the previous kernel's trace
> > > buffers in /sys/kernel/debug/tracing/trace.
> > Assuming:
> > - we want to start tracing as early as possible, before rootfs
> > or initrd would be able to configure it.
> > - traces are stored on a different device, not RAM. For example NVMEM.
> > - Location of NVMEM is different for different board types, but
> > bootloader is able to give the right configuration to the kernel.
>
>
> Let me try to really understand what you're tracing here. Are we talking
> about exposing boot loader traces into Linux [1]? In that case, I think a
> mechanism like [2] is what you're looking for.
>
> Or do you want to transfer genuine Linux ftrace traces? In that case, why
> would you want to store them outside of RAM?

The high-level objective is to find out how embedded systems in the
field break. Since these devices should be always on, there are
different situations in which the system may reboot: for example, voltage
related issues, temperature, scheduled system updates, HW or SW errors.

To get a better understanding of what is going on, information should be
collected. But there are some limitations:
- voltage drops can be recorded only with prepared HW:
https://www.spinics.net/lists/devicetree/msg644030.html

- In case of voltage drops, RAM or block devices can't be used. Instead,
some variant of NVMEM should be used. In my case, NVMEM has 8 bits of
storage :) So, only one entry of the "trace" is compressed into this storage.
https://lore.kernel.org/all/[email protected]
The reset reason information is provided by the kernel and used by the
firmware and the kernel on the next reboot.

The implementation is not a big deal. The problematic part is how
the system should get information about the existence of the recorder and
where the recorder should store things, for example an NVMEM cell.

In my initial implementation I used devicetree to configure the software
based recorder and linked it with an NVMEM cell. But that goes against the DT
purpose of describing only HW, and it makes this recorder unusable for
non-DT-based systems.

Krzysztof is suggesting to configure it from initrd. This has its own
limitations as well:
- the recorder can't be used before initrd.
- we have multiple configuration points for board specific information -
firmware (bootloader) and initrd.
- initrd takes up space and increases boot time for devices which did not
need it before.

Other variants like kernel command-line and/or module parameters seem
not to be acceptable, depending on the maintainer. So, I'm still seeking a
proper, acceptable, portable way to hand over non-HW-specific
information to the kernel.

> > What would be the best way, acceptable for mainline, to provide this
> > kind of configuration? At least part of this information does not
> > describe devices or device states, so it would not fit into the devicetree
> > universe. The amount of possible information would not fit into bootconfig
> > either.
>
>
> We have precedent for configuration in device tree: You can use device tree
> to describe partitions on a NAND device, you can use it to specify MAC
> address overrides of devices attached to USB, etc etc. At the end of the day
> when people say they don't want configuration in device tree, what they mean
> is that device tree should be a hand over data structure from firmware to
> kernel, not from OS integrator to kernel :). If your firmware is the place
> that knows about offsets and you need to pass those offsets, IMHO DT is a
> good fit.

Yes, the layout of the NVMEM can be described in the DT. How can I tell
the system that this NVMEM cell should be used by some recorder or
tracer? Before sysfs is available, anyhow. @Krzysztof ?

> > Another, more or less overlapping, use case I have in mind is a netbootable
> > embedded system with a requirement to boot as fast as possible. Since the
> > bootloader has already established a link and obtained all the needed IP
> > configuration, it would be able to hand over the Ethernet controller and IP
> > configuration states. Would KHO be the way to go for this use case?
>
>
> That's an interesting one too. I would lean towards "try with normal device
> tree first" here as well. It's again a very clear case of "firmware wants to
> tell OS about things it knows, but the OS doesn't know" to me. That means
> device tree should be fine to describe it.

I can imagine a description of the PHY and MAC state. But the IP configuration
state of the firmware seems to be out of DT scope?

Regards,
Oleksij

2024-02-09 17:01:49

by Philipp Rudo

Subject: Re: [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers

Hi Alex,

On Fri, 2 Feb 2024 13:58:52 +0100
Alexander Graf <[email protected]> wrote:

> Hi Philipp,
>
> On 29.01.24 17:34, Philipp Rudo wrote:
> > Hi Alex,
> >
> > adding linux-integrity as there are some synergies with IMA_KEXEC (in case we
> > get KHO to work).
> >
> > First of all I believe that having a generic framework to pass information from
> > one kernel to the other across kexec would be a good thing. But I'm afraid that
>
>
> Thanks, I'm happy to hear that you agree with the basic motivation :).
> There are fundamentally 2 problems with passing data:
>
> * Passing structured data in a cross-architecture way
> * Passing memory
>
> KHO tackles both. It proposes a common FDT based format that allows us
> to pass per-subsystem properties. That way, a subsystem does not need to
> know whether it's running on ARM, x86, RISC-V or s390x. It just gains
> awareness for KHO and can pass data.
>
> On top of that, it proposes a standardized "mem" property (and some
> magic around that) which allows subsystems to pass memory.
>
>
> > you are ignoring some fundamental problems which makes it extremely hard, if
> > not impossible, to reliably transfer the kernel's state from one kernel to the
> > other.
> >
> > One thing I don't understand is how reusing the scratch area is working. Sure
> > you pass its location via the dt/boot_params but I don't see any code that
> > makes it a CMA region. So IIUC the scratch area won't be available for the 2nd
> > kernel. Which is probably for the better as IIUC the 2nd kernel gets loaded and
> > runs inside that area and I don't believe the CMA design ever considered that
> > the kernel image could be included in a CMA area.
>
>
> That one took me a lot to figure out sensibly (with recursion all the
> way down) while building KHO :). I hope I detailed it sensibly in the
> documentation - please let me know how to improve it in case it's
> unclear: https://lore.kernel.org/lkml/[email protected]/
>
> Let me explain inline using different words as well what happens:
>
> The first (and only the first) kernel that boots allocates a CMA region
> as "scratch region". It loads the new kernel into that region. It passes
> that region as "scratch region" to the next kernel. The next kernel now
> takes it and marks every page block that the scratch region spans as CMA:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> The CMA hint doesn't mean we create an actual CMA region. It mostly
> means that the kernel won't use this memory for any kernel allocations.
> Kernel allocations up to this point are allocations we don't need to
> pass on with KHO again. Kernel allocations past that point may be
> allocations that we want to pass, so we just never place them into the
> "scratch region" again.
>
> And because we now already have a scratch region from the previous
> kernel, we keep reusing that forever with any new KHO kexec.

Thanks for the explanation. I've missed the memblock_mark_scratch in
kho_populate. The code makes much more sense now :-)

That said, for a complex series like this one I like to do the
review on a branch in my local git, as that helps to avoid problems like that
(or at least makes them less likely). But your patches didn't apply. Can
you tell me what your base is or make your git branch available? That
would be very helpful to me. Thanks!

> > Staying at reusing the scratch area. One thing that is broken for sure is that
> > you reuse the scratch area without ever checking the kho_scratch parameter of
> > the 2nd kernel's command line. Remember, with kexec you are dealing with two
> > different kernels with two different command lines. Meaning you can only reuse
> > the scratch area if the requested size in the 2nd kernel is identical to the
> > one of the 1st kernel. In all other cases you need to adjust the scratch area's
> > size or reserve a new one.
>
>
> Hm. So you're saying a user may want to change the size of the scratch
> area with a KHO kexec. That's insanely risky because you (as rightfully
> pointed out below) may have significant fragmentation at that point. And
> we will only know when we're in the new kernel so it's too late to
> abort. IMHO it's better to just declare the scratch region as immutable
> during KHO to avoid that pitfall.

Yes, a user can set any command line with kexec. My expectation as a
user is that the kernel respects whatever I set on the command line and
doesn't think it knows better and simply ignores what I tell it to do.
So even when you set the scratch area immutable during boot you have to
make sure that in the end the kernel respects what the user has set on the
2nd kernel's command line.

> > This directly leads to the next problem. In kho_reserve_previous_mem you are
> > reusing the different memory regions wherever the 1st kernel allocated them.
> > But that also means you are handing over the 1st kernel's memory
> > fragmentation to the 2nd kernel and you do that extremely early during boot.
> > Which means that users who need to allocate large continuous physical memory,
> > like the scratch area or the crashkernel memory, will have increasing chance to
> > not find a suitable area. Which IMHO is unacceptable.
>
>
> Correct :). It basically means you want to pass large allocations from
> the 1st kernel that you want to preserve on to the next. So if the 1st
> kernel allocated a large crash area, it's safest to pass that allocation
> using KHO to ensure the next kernel also has the region fully reserved.
> Otherwise the next kernel may accidentally place data into the
> previously reserved crash region (which would be contiguously free at
> early init of the 2nd kernel) and fragment it again.

I don't think that this is an option. For one, your suggestion means that
every "large allocation" (whatever that means) needs to be tracked
manually for it to work together with KHO. In addition there is still
the problem that the 2nd kernel may need a larger allocation than the
1st one. Be it because it's a command line parameter, e.g. kho_scratch
or crashkernel, or just a new feature that requires additional memory
the 2nd kernel has. IMO it's inevitable that KHO finds a way to
remove/reduce memory fragmentation.

> > Finally, and that's the big elephant in the room, is your lax handling of the
> > unstable kernel internal ABI. Remember, you are dealing with two different
> > kernels, that also means two different source levels and two different configs.
> > So just because both the 1st and 2nd kernel have e.g. a struct buffer_page
> > doesn't mean that they have the same struct buffer_page. But that's what your
> > code implicitly assumes. For KHO ever to make it upstream you need to make sure
> > that both kernels are "speaking the same language".
>
>
> Wow, I hope it didn't come across as that! The whole point of using FDT
> and compatible strings in KHO is to solve exactly that problem. Any time
> a passed over data structure changes incompatibly, you would need to
> modify the compatible string of the subsystem that owns the now
> incompatible data.
>
> So in the example of struct buffer_page, it means that if anyone changes
> the few bits we care about in struct buffer_page, we need to ensure that
> the new kernel emits "ftrace,cpu-v2" compatible strings. We can at that
> point choose whether we want to implement compat handling for
> "ftrace,cpu-v1" style struct buffer_pages or only support same version
> ingestion.

Well, it came across like that because there was absolutely no
explanation on when those versions need to be bumped up so far.

> The one thing that we could improve on here today IMHO is to have
> compile time errors if any part of struct buffer_page changes
> semantically: So we'd create a few defines for the bits we want in
> "ftrace,cpu-v1" as well as size of struct buffer_page and then compare
> them to what the struct offsets are at compile time to ensure they stay
> identical.

What do you imagine those macros would look like? How would they work with
structs that change their layout depending on the config?
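
For illustration only, macros of that kind could look roughly like the
user-space sketch below. It uses C11 static_assert and offsetof. The
struct demo_page, the DEMO_V1_* defines and DEMO_CHECK_FIELD are all
hypothetical stand-ins invented for this example; none of them exist in
the patch set and demo_page is not the kernel's struct buffer_page.

```c
#include <assert.h>   /* static_assert (C11), assert */
#include <stddef.h>   /* offsetof */

/* Stand-in for a handed-over structure; NOT the kernel's struct buffer_page. */
struct demo_page {
	unsigned long write;    /* bytes written so far */
	unsigned long entries;  /* number of entries */
	void *data;             /* pointer to the payload */
};

/* Layout that a hypothetical "demo,cpu-v1" compatible string promises. */
#define DEMO_V1_OFF_WRITE   0
#define DEMO_V1_OFF_ENTRIES sizeof(unsigned long)
#define DEMO_V1_OFF_DATA    (2 * sizeof(unsigned long))
#define DEMO_V1_SIZE        (2 * sizeof(unsigned long) + sizeof(void *))

/* Break the build when a field no longer sits where the format expects it. */
#define DEMO_CHECK_FIELD(type, field, off) \
	static_assert(offsetof(type, field) == (off), \
		      #type "." #field " moved: bump the compatible string")

DEMO_CHECK_FIELD(struct demo_page, write,   DEMO_V1_OFF_WRITE);
DEMO_CHECK_FIELD(struct demo_page, entries, DEMO_V1_OFF_ENTRIES);
DEMO_CHECK_FIELD(struct demo_page, data,    DEMO_V1_OFF_DATA);
static_assert(sizeof(struct demo_page) == DEMO_V1_SIZE,
	      "struct demo_page changed size: bump the compatible string");
```

Such checks only fire in builds that actually compile the file, and a
struct whose layout depends on the config would need one set of defines
per layout variant.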

Personally, I highly doubt that any system that manages these different
versions manually will work reliably. It might be possible for
something as simple as struct buffer_page but once it gets more
complicated, e.g. by depending on the kernel config or simply having
more dependencies to common data structures, it will be a constant
source of pain.
Just assume, although extremely unlikely, that struct list_head is
changed. Most likely the person who makes the change won't be from the
ftrace team and thus won't know that they need to bump the
version. Even the compile time errors will only help if
CONFIG_FTRACE_KHO is enabled, which most likely won't be the case.
Ultimately this means that KHO will break silently until someone tries
to kexec in the new kernel with KHO enabled. But even then there will
only be a cryptic error message (if any) as you have basically
introduced a memory corruption to the 2nd kernel. The more complex the
structs become and the deeper the dependency list goes the more likely
it becomes that such a breaking change is made.

The way I see it there is no way around generating the version based on
the actual memory layout for this particular build.
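
As a rough sketch of what a layout-derived version could mean, the
user-space example below hashes field names, offsets and sizes with
64-bit FNV-1a and compares the result against a handed-over value. All
names (struct demo_page, struct field_desc, layout_hash(),
layout_compatible()) are assumptions made up for this illustration. The
field table here is written by hand, so it only shows the comparison
step and not the automatic generation from debuginfo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in structure; NOT the kernel's struct buffer_page. */
struct demo_page {
	unsigned long write;
	unsigned long entries;
	void *data;
};

/* One entry per field that crosses the kexec boundary. */
struct field_desc {
	const char *name;
	size_t offset;
	size_t size;
};

#define FIELD(type, f) { #f, offsetof(type, f), sizeof(((type *)0)->f) }

static const struct field_desc demo_page_layout[] = {
	FIELD(struct demo_page, write),
	FIELD(struct demo_page, entries),
	FIELD(struct demo_page, data),
};
#define DEMO_PAGE_NFIELDS \
	(sizeof(demo_page_layout) / sizeof(demo_page_layout[0]))

#define FNV_OFFSET 0xcbf29ce484222325ULL
#define FNV_PRIME  0x100000001b3ULL

/* 64-bit FNV-1a over a byte string. */
static uint64_t hash_bytes(uint64_t h, const char *s, size_t len)
{
	while (len--) {
		h ^= (unsigned char)*s++;
		h *= FNV_PRIME;
	}
	return h;
}

/* Feed a number lowest byte first so the result is endian independent. */
static uint64_t hash_u64(uint64_t h, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++) {
		h ^= (v >> (8 * i)) & 0xff;
		h *= FNV_PRIME;
	}
	return h;
}

/* Fingerprint of a layout: every field's name, offset and size, plus total size. */
static uint64_t layout_hash(const struct field_desc *f, size_t n, size_t total)
{
	uint64_t h = FNV_OFFSET;
	size_t i;

	for (i = 0; i < n; i++) {
		h = hash_bytes(h, f[i].name, strlen(f[i].name));
		h = hash_u64(h, f[i].offset);
		h = hash_u64(h, f[i].size);
	}
	return hash_u64(h, total);
}

/* The receiving side accepts the data only if the fingerprints match. */
static int layout_compatible(uint64_t handed_over)
{
	return handed_over == layout_hash(demo_page_layout, DEMO_PAGE_NFIELDS,
					  sizeof(struct demo_page));
}
```

A mismatch lets the receiving side bail out with a clear message instead
of reading garbage. A fingerprint like this still says nothing about how
a "void *" field is interpreted.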

> Please let me know how I can clarify that more in the documentation. It
> really is the absolute core of KHO.
>
>
> > Personally I see two possible solutions:
> >
> > 1) You introduce a stable intermediate format for every subsystem similar to
> > what IMA_KEXEC does. This should work for simple types like struct buffer_page
> > but for complex ones like struct vfio_device that's basically impossible.
>
>
> I don't see why. The only reason KHO passes struct buffer_page as memory
> is because we want to be able to produce traces even after KHO
> serialization is done. For vfio_device, I think it's perfectly
> reasonable to serialize any data we need to preserve directly into FDT
> properties.
>
>
>
> > 2) You also hand over the ABI version for every given type (basically just a
> > hash over all fields including all the dependencies). So the 2nd kernel can
> > verify that the data handed over is in a format it can handle and if not bail
> > out with a descriptive error message rather than reading garbage. Plus side is
> > that once such a system is in place you can reuse it to automatically resolve
> > all dependencies so you no longer need to manually store the buffer_page and
> > its buffer_data_page separately.
> > Down side is that traversing the debuginfo (including the ones from modules) is
> > not a simple task and I expect that such a system will be way more complex than
> > the rest of KHO. In addition there are some cases that the versioning won't be
> > able to capture. For example if a type contains a "void *"-field. Then although
> > the definition of the type is identical in both kernels the field can be cast
> > to different types when used. Another problem will be function pointers which
> > you first need to resolve in the 1st kernel and then map to the identical
> > function in the 2nd kernel. This will become particularly "fun" when the
> > function is part of a module that isn't loaded at the time when you try to
> > recreate the kernel's state.
>
>
> The whole point of KHO is to leave it to the subsystem which path they
> want to take. The subsystem can either pass binary data and validate as
> part of FDT properties (like compatible strings). That data can be
> identical to today's in-kernel data structures (usually a bad idea) or
> can be a new intermediate data format. But the subsystem can also choose
> to fully serialize into FDT properties and not pass any memory at all
> for state that would be in structs. Or something in between.

That's totally fine. My point is that there are simply too many ways to
fuck it up and break the 2nd kernel. That's why I don't believe that we
can rely on the subsystems to "do it right" and "remember to bump the
version". In other words, KHO needs to provide a reliable, automatic
mechanism with which the 2nd kernel can decide whether it can handle the
passed data or not.

> > So to summarize, while it would be nice to have a generic framework like KHO to
> > pass data from one kernel to the other via kexec there are good reasons why it
> > doesn't exist, yet.
>
>
> I hope my explanations above clarify things a bit. Let me know if you're
> at FOSDEM, happy to talk about the internals there as well :)

Sorry, I couldn't make it to FOSDEM but I plan to be at LPC later this
year. In fact I had your talk on my list last year. Unfortunately it ran
in parallel to the kernel debugger MC...

Thanks
Philipp

> Alex
>
>
>
>
>
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879

2024-02-16 15:41:38

by Pratyush Yadav

Subject: Re: [PATCH v3 11/17] tracing: Introduce kho serialization

Hi,

On Wed, Jan 17 2024, Alexander Graf wrote:

> We want to be able to transfer ftrace state from one kernel to the next.
> To start off with, let's establish all the boiler plate to get a write
> hook when KHO wants to serialize and fill out basic data.
>
> Follow-up patches will fill in serialization of ring buffers and events.
>
> Signed-off-by: Alexander Graf <[email protected]>
>
> ---
>
> v1 -> v2:
>
> - Remove ifdefs
> ---
> kernel/trace/trace.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 47 insertions(+)
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index a0defe156b57..9a0d96975c9c 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -32,6 +32,7 @@
> #include <linux/percpu.h>
> #include <linux/splice.h>
> #include <linux/kdebug.h>
> +#include <linux/kexec.h>
> #include <linux/string.h>
> #include <linux/mount.h>
> #include <linux/rwsem.h>
> @@ -866,6 +867,8 @@ static struct tracer *trace_types __read_mostly;
> */
> DEFINE_MUTEX(trace_types_lock);
>
> +static bool trace_in_kho;
> +
> /*
> * serialize the access of the ring buffer
> *
> @@ -10574,12 +10577,56 @@ void __init early_trace_init(void)
> init_events();
> }
>
> +static int trace_kho_notifier(struct notifier_block *self,
> + unsigned long cmd,
> + void *v)
> +{
> + const char compatible[] = "ftrace-v1";
> + void *fdt = v;
> + int err = 0;
> +
> + switch (cmd) {
> + case KEXEC_KHO_ABORT:
> + if (trace_in_kho)
> + mutex_unlock(&trace_types_lock);
> + trace_in_kho = false;
> + return NOTIFY_DONE;
> + case KEXEC_KHO_DUMP:
> + /* Handled below */
> + break;
> + default:
> + return NOTIFY_BAD;
> + }
> +
> + if (unlikely(tracing_disabled))
> + return NOTIFY_DONE;
> +
> + err |= fdt_begin_node(fdt, "ftrace");
> + err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
> + err |= fdt_end_node(fdt);
> +
> + if (!err) {
> + /* Hold all future allocations */
> + mutex_lock(&trace_types_lock);

Say I do "echo 1 | tee /sys/kernel/kho/active". Then the lock is taken by
tee, which exits while still holding it. Later I do "echo 0 | tee
/sys/kernel/kho/active". This time another tee task unlocks the lock, so
it is not being unlocked by the same task that locked it. The comment
above the mutex_lock() definition says:

The mutex must later on be released by the same task that acquired
it. Recursive locking is not allowed. The task may not exit without
first unlocking the mutex.

I tested your code and it happens to work because the unlock always
happened to take the fast path which does not sanity-check the owner.
Still, this is not the correct thing to do.
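
To make the ownership rule concrete, here is a small user-space toy
model. It is not the kernel mutex implementation. struct toy_mutex and
the toy_*() helpers are invented for this sketch, and tasks are plain
integer IDs. It contrasts a release path that ignores the owner with one
that checks it.

```c
#include <assert.h>
#include <errno.h>

/* Toy lock: owner is 0 when unlocked, else the ID of the holding task. */
struct toy_mutex {
	int owner;
};

/* Take the lock for @task; fails if somebody already holds it. */
static int toy_lock(struct toy_mutex *m, int task)
{
	if (m->owner)
		return -EBUSY;
	m->owner = task;
	return 0;
}

/* Release without looking at the caller: a foreign unlock goes unnoticed. */
static void toy_unlock_unchecked(struct toy_mutex *m)
{
	m->owner = 0;
}

/* Release with an owner check: a foreign unlock is rejected. */
static int toy_unlock_checked(struct toy_mutex *m, int task)
{
	if (m->owner != task)
		return -EPERM;
	m->owner = 0;
	return 0;
}
```

With task 1 as the first tee and task 2 as the second, the unchecked
path lets task 2 drop the lock silently, while the checked path returns
-EPERM and leaves the lock held.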

> + trace_in_kho = true;
> + }
> +
> + return err ? NOTIFY_BAD : NOTIFY_DONE;
> +}
> +
> +static struct notifier_block trace_kho_nb = {
> + .notifier_call = trace_kho_notifier,
> +};
> +
> void __init trace_init(void)
> {
> trace_event_init();
>
> if (boot_instance_index)
> enable_instances();
> +
> + if (IS_ENABLED(CONFIG_FTRACE_KHO))
> + register_kho_notifier(&trace_kho_nb);
> }
>
> __init static void clear_boot_tracer(void)

--
Regards,
Pratyush Yadav

2024-02-16 15:41:58

by Pratyush Yadav

Subject: Re: [PATCH v3 05/17] kexec: Add KHO support to kexec file loads

Hi,

On Wed, Jan 17 2024, Alexander Graf wrote:

> Kexec has 2 modes: A user space driven mode and a kernel driven mode.
> For the kernel driven mode, kernel code determines the physical
> addresses of all target buffers that the payload gets copied into.
>
> With KHO, we can only safely copy payloads into the "scratch area".
> Teach the kexec file loader about it, so it only allocates for that
> area. In addition, enlighten it with support to ask the KHO subsystem
> for its respective payloads to copy into target memory. Also teach the
> KHO subsystem how to fill the images for file loads.

This patch causes compilation failures when CONFIG_KEXEC_FILE is not
enabled. I am not listing them all here since there are a bunch. You can
try disabling it and see them for yourself.

Since Documentation/kho/usage.rst says:

It is important that you use the ``-s`` parameter to use the
in-kernel kexec file loader, as user space kexec tooling currently
has no support for KHO with the user space based file loader.

you can just make CONFIG_KEXEC_FILE a dependency for CONFIG_KEXEC_KHO to
get rid of these errors.

Or, if you foresee wanting the user space tooling to support KHO as
well, then you should refactor your code to work with the option both
enabled and disabled.
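
A common way to make code build in both configurations is to give the
optional function a static inline stub when the option is off. The
sketch below shows the idiom in plain user-space C.
CONFIG_DEMO_KEXEC_FILE, demo_kho_fill_image() and demo_load() are
hypothetical names for this example only and are not taken from the
patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical switch; left undefined here to exercise the stub path. */
/* #define CONFIG_DEMO_KEXEC_FILE 1 */

#ifdef CONFIG_DEMO_KEXEC_FILE
/* Real implementation lives in a file that is only built with the option. */
int demo_kho_fill_image(void *image);
#else
/* Stub keeps callers compiling and reports that the feature is absent. */
static inline int demo_kho_fill_image(void *image)
{
	(void)image;
	return -EOPNOTSUPP;
}
#endif

/* Caller needs no #ifdef: it treats "not supported" as "nothing to hand over". */
static int demo_load(void *image)
{
	int ret = demo_kho_fill_image(image);

	if (ret == -EOPNOTSUPP)
		return 0;
	return ret;
}
```

Callers then stay free of #ifdef blocks, and the compiler still type
checks the call in every configuration.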

>
> Signed-off-by: Alexander Graf <[email protected]>
[...]

--
Regards,
Pratyush Yadav

2024-02-16 15:51:15

by Pratyush Yadav

Subject: Re: [PATCH v3 00/17] kexec: Allow preservation of ftrace buffers

Hi Alex,

On Wed, Jan 17 2024, Alexander Graf wrote:

> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
>
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See James' and my Linux Plumbers
> Conference 2023 presentation for details:

I am working on handing userspace pages across kexec. This can be useful
for applications with large in-memory state that can be time consuming
to rebuild. If they can hand their state over across kexec, it allows for
kernel upgrades with lower downtime. As a part of this problem, I have
been looking at plugging all of this into CRIU [0] so I don't have to
modify the applications to use this feature. I can just use CRIU to do
the checkpoint and restore quickly over kexec.

I hacked together some patches for this (which are not yet polished
enough to publish) and ended up implementing something like KHO in a
much more crude way. I have since refactored my patches to use KHO and I
find it quite useful. So thanks for working on this :-)

It was easy enough to get KHO working with my patches, though I had to
look into your ftrace patches to get the whole picture. The
documentation can be improved to show how it can be used from the
driver/subsystem perspective. For example, I had to read your ftrace
patches to figure out that I should use kho_get_fdt(), or that I should
register a notifier via register_kho_notifier(). I would be happy to
contribute some documentation improvements.

Have you done any analysis of the performance or memory overhead? If
so, it would be nice to look at some data. I have some concerns about
performance and memory overhead, especially for more fragmented memory,
but I don't have numbers to present yet.

[0] https://github.com/checkpoint-restore/criu

>
> https://lpc.events/event/17/contributions/1485/
>
> To start us on the journey to support all the use cases above, this
> patch implements basic infrastructure to allow hand over of kernel state
> across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
> With this patch set applied, you can read ftrace records from the
> pre-kexec environment in your post-kexec one. This creates a very powerful
> debugging and performance analysis tool for kexec. It's also slightly
> easier to reason about than full blown VFIO state preservation.
>
> == Alternatives ==
>
> There are alternative approaches to (parts of) the problems above:
>
> * Memory Pools [1] - preallocated persistent memory region + allocator
> * PRMEM [2] - resizable persistent memory regions with fixed metadata
> pointer on the kernel command line + allocator
> * Pkernfs [3] - preallocated file system for in-kernel data with fixed
> address location on the kernel command line
> * PKRAM [4] - handover of user space pages using a fixed metadata page
> specified via command line

FYI, you forgot to paste the links in v3 but I can find them from v2.

Of all these options, PKRAM seems somewhat useful for my use case but
with CRIU it would need to copy all the application pages to the PKRAM
FS and would need at least as much free memory as application memory.

Instead, I have built a simple system that gives an API to userspace to
hand over its pages and to request them back. It then keeps track of the
PID and PA -> VA mappings (essentially a page table). This lets me keep
the pages in-place and avoid needing lots of free memory or expensive
copying. KHO plays a crucial role there in handing those pages and page
tables across to the next kernel.

The FDT format works fairly well for my use case. Since page tables are
a stable data structure, I don't need to worry about their format
changing between kernel versions and can directly pass them through.
This might not be true for many other data structures so subsystems
using those either need to serialize them to FDT or invent their own
serialization formats.

I also wonder how the "mem" array will work for more fragmented
allocations. It might grow very large with lots of scattered elements. I
wonder how both KHO's parsing and memblock will behave in this case. I
have not yet tried stressing it so I can't say for myself.
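
To put a rough shape on that concern, the user-space sketch below
collapses a sorted list of page frame numbers into (addr, len) ranges
and reports how many entries are needed. struct demo_mem,
demo_build_mem() and the 4 KiB page size are assumptions for this
illustration. It does not reproduce how KHO actually encodes the "mem"
property.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12                      /* assumed 4 KiB pages */
#define DEMO_PAGE_SIZE  (1ULL << DEMO_PAGE_SHIFT)

/* Hypothetical range entry: physical start address and length in bytes. */
struct demo_mem {
	uint64_t addr;
	uint64_t len;
};

/*
 * Collapse sorted, unique page frame numbers into contiguous ranges.
 * Returns the number of ranges written to @out, or SIZE_MAX if more
 * than @max ranges would be needed.
 */
static size_t demo_build_mem(const uint64_t *pfns, size_t n,
			     struct demo_mem *out, size_t max)
{
	size_t used = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t addr = pfns[i] << DEMO_PAGE_SHIFT;

		/* Page directly follows the previous range: extend it. */
		if (used && out[used - 1].addr + out[used - 1].len == addr) {
			out[used - 1].len += DEMO_PAGE_SIZE;
			continue;
		}
		if (used == max)
			return SIZE_MAX;
		out[used].addr = addr;
		out[used].len = DEMO_PAGE_SIZE;
		used++;
	}
	return used;
}
```

Four adjacent pages collapse into a single entry, while four pages with
gaps between them need four entries, so in the worst case the array
grows linearly with the number of preserved pages.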

>
> All of the approaches above fundamentally have the same problem: They
> require the administrator to explicitly carve out a physical memory
> location because they have no mechanism outside of the kernel command
> line to pass data (including memory reservations) between kexec'ing
> kernels.
>
> KHO provides that base foundation. We will determine later whether we
> still need any of the approaches above for fast bulk memory handover of for
> example IOMMU page tables. But IMHO they would all be users of KHO, with
> KHO providing the foundational primitive to pass metadata and bulk memory
> reservations as well as provide easy versioning for data.
>
> == Overview ==
>
> We introduce a metadata file that the kernels pass between each other. How
> they pass it is architecture specific. The file's format is a Flattened
> Device Tree (fdt) which has a generator and parser already included in
> Linux. When the root user enables KHO through /sys/kernel/kho/active, the
> kernel invokes callbacks to every driver that supports KHO to serialize
> its state. When the actual kexec happens, the fdt is part of the image
> set that we boot into. In addition, we keep a "scratch region" available
> for kexec: A physically contiguous memory region that is guaranteed to
> not have any memory that KHO would preserve. The new kernel bootstraps
> itself using the scratch region and sets all handed over memory as in use.
> When drivers initialize that support KHO, they introspect the fdt and
> recover their state from it. This includes memory reservations, where the
> driver can either discard or claim reservations.
>
> == Limitations ==
>
> I currently only implemented file based kexec. The kernel interfaces
> in the patch set are already in place to support user space kexec as well,
> but I have not implemented it yet inside kexec tools.
>
> == How to Use ==
>
> To use the code, please boot the kernel with the "kho_scratch=" command
> line parameter set: "kho_scratch=512M". KHO requires a scratch region.
>
> Make sure to fill ftrace with contents that you want to observe after
> kexec. Then, before you invoke file based "kexec -l", activate KHO:
>
> # echo 1 > /sys/kernel/kho/active
> # kexec -l Image --initrd=initrd -s
> # kexec -e
>
> The new kernel will boot up and contain the previous kernel's trace
> buffers in /sys/kernel/debug/tracing/trace.
>
[...]

Overall, I think KHO is quite useful and I would be happy to see it
evolve and eventually make it into the kernel. It would certainly make
my life a lot easier.

Since I have used it in my patches, I have done some basic testing for
it. Nothing fancy, just handed a few pages across. It works as
advertised. For that,

Tested-by: Pratyush Yadav <[email protected]>

--
Regards,
Pratyush Yadav

2024-02-16 15:59:19

by Pratyush Yadav

Subject: Re: [PATCH v3 04/17] kexec: Add KHO parsing support

Hi,

On Wed, Jan 17 2024, Alexander Graf wrote:

> When we have a KHO kexec, we get a device tree, mem cache and scratch
> region to populate the state of the system. Provide helper functions
> that allow architecture code to easily handle memory reservations based
> on them and give device drivers visibility into the KHO DT and memory
> reservations so they can recover their own state.
>
> Signed-off-by: Alexander Graf <[email protected]>
>
> ---
>
[...]
> +/**
> + * kho_return_mem - Notify the kernel that initially reserved memory is no
> + * longer needed. When the last consumer of a page returns their mem, kho
> + * returns the page to the buddy allocator as free page.
> + */
> +void kho_return_mem(const struct kho_mem *mem)
> +{
> + uint64_t start_pfn, end_pfn, pfn;
> +
> + start_pfn = PFN_DOWN(mem->addr);
> + end_pfn = PFN_UP(mem->addr + mem->len);
> +
> + for (pfn = start_pfn; pfn < end_pfn; pfn++)
> + kho_return_pfn(pfn);
> +}
> +EXPORT_SYMBOL_GPL(kho_return_mem);
> +
> +static void kho_claim_pfn(ulong pfn)
> +{
> + struct page *page = pfn_to_page(pfn);
> +
> + WARN_ON(!page);
> + if (WARN_ON(page_count(page) != 1))
> + pr_err("Claimed non kho pfn %lx", pfn);

You do sanity checks but then never actually change anything on the
page. kho_claim_mem()'s documentation says: "This function removes the
reserved state for all pages that the mem spans". So this function
should at the very least call ClearPageReserved().

Also, checking the page count is a very rough heuristic. There can be
other non-KHO pages with page count == 1. Do you think it would make
more sense to use one of the private pageflags bits to mark a page
KHO-owned? If not, shouldn't you at least also check if the page is
reserved?
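
As a sketch of the stricter check, here is a user-space toy model.
struct toy_page and toy_claim_page() are invented for this example, and
the kho_owned field stands in for a hypothetical private page flag. It
is not the kernel's struct page or page flag API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy page state; NOT the kernel's struct page. */
struct toy_page {
	int refcount;
	bool reserved;   /* models the reserved state */
	bool kho_owned;  /* models a hypothetical "handed over by KHO" flag */
};

/*
 * Claim a handed-over page. Refuse pages that merely happen to have a
 * refcount of 1, and clear the reserved state on success so that the
 * function does what its documentation promises.
 */
static int toy_claim_page(struct toy_page *page)
{
	if (!page->kho_owned || !page->reserved || page->refcount != 1)
		return -EINVAL;

	page->reserved = false;
	return 0;
}
```

An ordinary page with a refcount of 1 is rejected here, and claiming the
same page twice fails the second time because it is no longer reserved.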

> +}
> +
> +/**
> + * kho_claim_mem - Notify the kernel that a handed over memory range is now in
> + * use by a kernel subsystem and considered an allocated page. This function
> + * removes the reserved state for all pages that the mem spans.
> + */
> +void *kho_claim_mem(const struct kho_mem *mem)
> +{
> + u64 start_pfn, end_pfn, pfn;
> + void *va = __va(mem->addr);
> +
> + start_pfn = PFN_DOWN(mem->addr);
> + end_pfn = PFN_UP(mem->addr + mem->len);
> +
> + for (pfn = start_pfn; pfn < end_pfn; pfn++)
> + kho_claim_pfn(pfn);
> +
> + return va;
> +}
> +EXPORT_SYMBOL_GPL(kho_claim_mem);
> +
[...]

--
Regards,
Pratyush Yadav

2024-02-20 10:31:12

by Mike Rapoport

Subject: Re: [PATCH v3 09/17] x86: Add KHO support

Hi Alex,

On Wed, Jan 17, 2024 at 02:46:56PM +0000, Alexander Graf wrote:
> We now have all bits in place to support KHO kexecs. This patch adds
> awareness of KHO in the kexec file as well as boot path for x86 and
> adds the respective kconfig option to the architecture so that it can
> use KHO successfully.
>
> In addition, it enlightens the decompression code with KHO so that its
> KASLR location finder only considers memory regions that are not already
> occupied by KHO memory.
>
> Signed-off-by: Alexander Graf <[email protected]>
>
> ---
>
> v1 -> v2:
>
> - Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
> - s/kho_reserve_mem/kho_reserve_previous_mem/g
> - s/kho_reserve/kho_reserve_scratch/g
> ---
> arch/x86/Kconfig | 3 ++
> arch/x86/boot/compressed/kaslr.c | 55 +++++++++++++++++++++++++++
> arch/x86/include/uapi/asm/bootparam.h | 15 +++++++-
> arch/x86/kernel/e820.c | 9 +++++
> arch/x86/kernel/kexec-bzimage64.c | 39 +++++++++++++++++++
> arch/x86/kernel/setup.c | 46 ++++++++++++++++++++++
> arch/x86/mm/init_32.c | 7 ++++
> arch/x86/mm/init_64.c | 7 ++++
> 8 files changed, 180 insertions(+), 1 deletion(-)

..

> @@ -987,8 +1013,26 @@ void __init setup_arch(char **cmdline_p)
> cleanup_highmap();
>
> memblock_set_current_limit(ISA_END_ADDRESS);
> +
> e820__memblock_setup();
>
> + /*
> + * We can resize memblocks at this point, let's dump all KHO
> + * reservations in and switch from scratch-only to normal allocations
> + */
> + kho_reserve_previous_mem();
> +
> + /* Allocations now skip scratch mem, return low 1M to the pool */
> + if (is_kho_boot()) {
> + u64 i;
> + phys_addr_t base, end;
> +
> + __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
> + MEMBLOCK_SCRATCH, &base, &end, NULL)
> + if (end <= ISA_END_ADDRESS)
> + memblock_clear_scratch(base, end - base);
> + }

You had to mark the lower 16M as MEMBLOCK_SCRATCH because at this point the
mapping of the physical memory is not ready yet and page tables only cover
the lower 16M and the memory mapped in kexec::init_pgtable(). Hence the call
to memblock_set_current_limit(ISA_END_ADDRESS) slightly above, which
essentially makes scratch mem reserved by KHO unusable for allocations.

I'd suggest moving kho_reserve_previous_mem() earlier, probably even right
next to kho_populate().
kho_populate() already does memblock_add(scratch) and at that point it's
the only physical memory that memblock knows of, so if it has to
allocate, the allocations will end up there.

Also, there are no kernel allocations before e820__memblock_setup(), so the
only memory that might need to be allocated is for memblock_double_array()
and that will be discarded later anyway.

With this, it seems that MEMBLOCK_SCRATCH is not needed, as the scratch
memory is anyway the only usable memory up to e820__memblock_setup().

> /*
> * Needs to run after memblock setup because it needs the physical
> * memory size.
> @@ -1104,6 +1148,8 @@ void __init setup_arch(char **cmdline_p)
> */
> arch_reserve_crashkernel();
>
> + kho_reserve_scratch();
> +
> memblock_find_dma_reserve();
>
> if (!early_xdbc_setup_hardware())
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index b63403d7179d..6c3810afed04 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -20,6 +20,7 @@
> #include <linux/smp.h>
> #include <linux/init.h>
> #include <linux/highmem.h>
> +#include <linux/kexec.h>
> #include <linux/pagemap.h>
> #include <linux/pci.h>
> #include <linux/pfn.h>
> @@ -738,6 +739,12 @@ void __init mem_init(void)
> after_bootmem = 1;
> x86_init.hyper.init_after_bootmem();
>
> + /*
> + * Now that all KHO pages are marked as reserved, let's flip them back
> + * to normal pages with accurate refcount.
> + */
> + kho_populate_refcount();

This should go to mm_core_init(), there's nothing architecture specific
there.

> +
> /*
> * Check boundaries twice: Some fundamental inconsistencies can
> * be detected at build time already.

--
Sincerely yours,
Mike.

2024-02-23 15:54:44

by Pratyush Yadav

Subject: Re: [PATCH v3 02/17] memblock: Declare scratch memory as CMA

Hi,

On Wed, Jan 17 2024, Alexander Graf wrote:

> When we finish populating our memory, we don't want to lose the scratch
> region as memory we can use for useful data. To do that, we mark it as
> CMA memory. That means that any allocation within it only happens with
> movable memory which we can then happily discard for the next kexec.
>
> That way we don't lose the scratch region's memory anymore for
> allocations after boot.
>
> Signed-off-by: Alexander Graf <[email protected]>
>
[...]
> @@ -2188,6 +2185,16 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
> }
> }
>
> +static void mark_phys_as_cma(phys_addr_t start, phys_addr_t end)
> +{
> + ulong start_pfn = pageblock_start_pfn(PFN_DOWN(start));
> + ulong end_pfn = pageblock_align(PFN_UP(end));
> + ulong pfn;
> +
> + for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages)
> + set_pageblock_migratetype(pfn_to_page(pfn), MIGRATE_CMA);

This fails to compile if CONFIG_CMA is disabled. I think you should add
it as a dependency for CONFIG_MEMBLOCK_SCRATCH.

> +}
> +
> static unsigned long __init __free_memory_core(phys_addr_t start,
> phys_addr_t end)
> {
> @@ -2249,6 +2256,17 @@ static unsigned long __init free_low_memory_core_early(void)
>
> memmap_init_reserved_pages();
>
> + if (IS_ENABLED(CONFIG_MEMBLOCK_SCRATCH)) {
> + /*
> + * Mark scratch mem as CMA before we return it. That way we
> + * ensure that no kernel allocations happen on it. That means
> + * we can reuse it as scratch memory again later.
> + */
> + __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
> + MEMBLOCK_SCRATCH, &start, &end, NULL)
> + mark_phys_as_cma(start, end);
> + }
> +
> /*
> * We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
> * because in some case like Node0 doesn't have RAM installed

--
Regards,
Pratyush Yadav
