2023-04-27 00:19:12

by Anthony Yznaga

Subject: [RFC v3 00/21] Preserved-over-Kexec RAM

Sending out this RFC in part to gauge community interest.
This patchset implements preserved-over-kexec memory storage or PKRAM as a
method for saving memory pages of the currently executing kernel so that
they may be restored after kexec into a new kernel. The patches are adapted
from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
introduce the PKRAM kernel API.

One use case for PKRAM is preserving guest memory and/or auxiliary
supporting data (e.g. iommu data) across kexec to support reboot of the
host with minimal disruption to the guest. PKRAM provides a flexible way
of doing this without requiring that a fixed amount of memory be
reserved a priori. Another use case is for databases to preserve their
block caches in shared memory across reboot.

Changes since RFC v2
- Rebased onto 6.3
- Updated API to save/load folios rather than file pages
- Omitted previous patches for implementing and optimizing preservation
and restoration of shmem files to reduce the number of patches and
focus on core functionality.

Changes since RFC v1
- Rebased onto 5.12-rc4
- Refined the API to reduce the number of calls
and better support multithreading.
- Allow preserving byte data of arbitrary length
(was previously limited to one page).
- Build a new memblock reserved list with the
preserved ranges and then substitute it for
the existing one. (Mike Rapoport)
- Use mem_avoid_overlap() to avoid kaslr stepping
on preserved ranges. (Kees Cook)

-- Implementation details --

* To aid in quickly finding contiguous ranges of memory containing
preserved pages, a pseudo physical mapping pagetable is populated
with pages as they are preserved.
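
As a rough illustration of the kind of query this enables, here is how
the pkram_find_preserved() interface added later in the series might be
used (the callback below is a hypothetical example):

    /* sum the sizes of all preserved ranges */
    static int count_bytes_cb(unsigned long base, unsigned long size,
                              void *private)
    {
        *(unsigned long *)private += size;
        return 0;
    }

    unsigned long total = 0;
    pkram_find_preserved(0, PHYS_ADDR_MAX, &total, count_bytes_cb);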

* If a page to be preserved is found to be in range of memory that was
previously reserved during early boot or in range of memory where the
kernel will be loaded to on kexec, the page will be copied to a page
outside of those ranges and the new page will be preserved. A compound
page will be copied to and preserved as individual base pages.
Note that this means that a page that cannot be moved (e.g. pinned for
DMA) currently cannot safely be preserved. This could be addressed by
adding functionality to kexec to reconfigure the destination addresses
for the sections of an already-loaded kexec kernel.

* A single page is allocated for the PKRAM super block. So that the next
kernel can find the preserved memory metadata after kexec, the pfn of the
PKRAM super block, which is exported via /sys/kernel/pkram, is passed in
the 'pkram' boot option.
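
A minimal sketch of the receiving side (parse_pkram_sb_pfn() is the
parser used by this series; treating the value as hex is an assumption
based on the sysfs file printing the pfn with %lx):

    static unsigned long pkram_sb_pfn __initdata;

    static int __init parse_pkram_sb_pfn(char *arg)
    {
        return kstrtoul(arg, 16, &pkram_sb_pfn);
    }
    early_param("pkram", parse_pkram_sb_pfn);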

* In the newly booted kernel, PKRAM adds all preserved pages to the memblock
reserve list during early boot so that they will not be recycled.

* Since kexec may load the new kernel code to any memory region, it could
destroy preserved memory. When the kernel selects the memory region
(kexec_file_load syscall), kexec will avoid preserved pages. When the
user selects the kexec memory region to use (kexec_load syscall), kexec
load will fail if there is a conflict with preserved pages. Pages preserved
after a kexec kernel is loaded will be relocated if they conflict with
the selected memory region.
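
Patches 15-17 implement these checks. Conceptually, the test applied to
a candidate kexec destination range has the following shape (the helper
name and call site here are illustrative rather than the exact code from
those patches):

    /* fail or skip a candidate destination range for kexec */
    if (pkram_has_preserved_pages(start, end))
        return -EBUSY;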

[1] https://lkml.org/lkml/2013/7/1/211

Anthony Yznaga (21):
mm: add PKRAM API stubs and Kconfig
mm: PKRAM: implement node load and save functions
mm: PKRAM: implement object load and save functions
mm: PKRAM: implement folio stream operations
mm: PKRAM: implement byte stream operations
mm: PKRAM: link nodes by pfn before reboot
mm: PKRAM: introduce super block
PKRAM: track preserved pages in a physical mapping pagetable
PKRAM: pass a list of preserved ranges to the next kernel
PKRAM: prepare for adding preserved ranges to memblock reserved
mm: PKRAM: reserve preserved memory at boot
PKRAM: free the preserved ranges list
PKRAM: prevent inadvertent use of a stale superblock
PKRAM: provide a way to ban pages from use by PKRAM
kexec: PKRAM: prevent kexec clobbering preserved pages in some cases
PKRAM: provide a way to check if a memory range has preserved pages
kexec: PKRAM: avoid clobbering already preserved pages
mm: PKRAM: allow preserved memory to be freed from userspace
PKRAM: disable feature when running the kdump kernel
x86/KASLR: PKRAM: support physical kaslr
x86/boot/compressed/64: use 1GB pages for mappings

arch/x86/boot/compressed/Makefile | 3 +
arch/x86/boot/compressed/ident_map_64.c | 9 +-
arch/x86/boot/compressed/kaslr.c | 10 +-
arch/x86/boot/compressed/misc.h | 10 +
arch/x86/boot/compressed/pkram.c | 110 ++
arch/x86/kernel/setup.c | 3 +
arch/x86/mm/init_64.c | 3 +
include/linux/pkram.h | 116 ++
kernel/kexec.c | 9 +
kernel/kexec_core.c | 3 +
kernel/kexec_file.c | 15 +
mm/Kconfig | 9 +
mm/Makefile | 1 +
mm/pkram.c | 1753 +++++++++++++++++++++++++++++++
mm/pkram_pagetable.c | 377 +++++++
15 files changed, 2425 insertions(+), 6 deletions(-)
create mode 100644 arch/x86/boot/compressed/pkram.c
create mode 100644 include/linux/pkram.h
create mode 100644 mm/pkram.c
create mode 100644 mm/pkram_pagetable.c

--
1.9.4


2023-04-27 00:19:15

by Anthony Yznaga

Subject: [RFC v3 08/21] PKRAM: track preserved pages in a physical mapping pagetable

Later patches in this series will need a way to efficiently identify
physically contiguous ranges of preserved pages independent of their
virtual addresses. To facilitate this, all pages to be preserved across
kexec are added to a pseudo identity mapping pagetable.

The pagetable makes use of the existing architecture definitions for
building a memory mapping pagetable except that a bitmap is used to
represent the presence or absence of preserved pages at the PTE level.
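
As a worked example of the encoding, assuming 4K pages and
PTRS_PER_PTE == 512 (see make_bitmap_pmd()/get_bitmap_addr() in the new
file below):

    /*
     * A PTE-level bitmap occupies PTRS_PER_PTE / BITS_PER_BYTE =
     * 512 / 8 = 64 bytes, so one page holds 4096 / 64 = 64 bitmaps.
     * A 64-byte-aligned bitmap address has its low 6 bits clear, so
     * the pmd can encode both the bitmap page and the slot within it:
     *
     *   pmd_val = __pa(bitmap page) | (offset_in_page(bitmap) / 64)
     */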

Signed-off-by: Anthony Yznaga <[email protected]>
---
mm/Makefile | 2 +-
mm/pkram.c | 30 ++-
mm/pkram_pagetable.c | 377 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 405 insertions(+), 4 deletions(-)
create mode 100644 mm/pkram_pagetable.c

diff --git a/mm/Makefile b/mm/Makefile
index 7a8d5a286d48..7a1a33b67de6 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,4 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
-obj-$(CONFIG_PKRAM) += pkram.o
+obj-$(CONFIG_PKRAM) += pkram.o pkram_pagetable.o
diff --git a/mm/pkram.c b/mm/pkram.c
index c66b2ae4d520..e6c0f3c52465 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -101,6 +101,9 @@ struct pkram_super_block {
static unsigned long pkram_sb_pfn __initdata;
static struct pkram_super_block *pkram_sb;

+extern int pkram_add_identity_map(struct page *page);
+extern void pkram_remove_identity_map(struct page *page);
+
/*
* For convenience sake PKRAM nodes are kept in an auxiliary doubly-linked list
* connected through the lru field of the page struct.
@@ -119,11 +122,24 @@ static int __init parse_pkram_sb_pfn(char *arg)

static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
{
- return alloc_page(gfp_mask);
+ struct page *page;
+ int err;
+
+ page = alloc_page(gfp_mask);
+ if (page) {
+ err = pkram_add_identity_map(page);
+ if (err) {
+ __free_page(page);
+ page = NULL;
+ }
+ }
+
+ return page;
}

static inline void pkram_free_page(void *addr)
{
+ pkram_remove_identity_map(virt_to_page(addr));
free_page((unsigned long)addr);
}

@@ -161,6 +177,7 @@ static void pkram_truncate_link(struct pkram_link *link)
if (!p)
continue;
page = pfn_to_page(PHYS_PFN(p));
+ pkram_remove_identity_map(page);
put_page(page);
}
}
@@ -610,10 +627,15 @@ int pkram_save_folio(struct pkram_access *pa, struct folio *folio)
{
struct pkram_node *node = pa->ps->node;
struct page *page = folio_page(folio, 0);
+ int err;

BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);

- return __pkram_save_page(pa, page, page->index);
+ err = __pkram_save_page(pa, page, page->index);
+ if (!err)
+ err = pkram_add_identity_map(page);
+
+ return err;
}

static int __pkram_bytes_save_page(struct pkram_access *pa, struct page *page)
@@ -658,6 +680,8 @@ static struct page *__pkram_prep_load_page(pkram_entry_t p)

page_ref_unfreeze(page, 1);

+ pkram_remove_identity_map(page);
+
return page;

out_error:
@@ -914,7 +938,7 @@ static int __init pkram_init_sb(void)
if (!pkram_sb) {
struct page *page;

- page = pkram_alloc_page(GFP_KERNEL | __GFP_ZERO);
+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!page) {
pr_err("PKRAM: Failed to allocate super block\n");
return 0;
diff --git a/mm/pkram_pagetable.c b/mm/pkram_pagetable.c
new file mode 100644
index 000000000000..85e34301ef1e
--- /dev/null
+++ b/mm/pkram_pagetable.c
@@ -0,0 +1,377 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bitops.h>
+#include <linux/mm.h>
+
+static pgd_t *pkram_pgd;
+static DEFINE_SPINLOCK(pkram_pgd_lock);
+
+#define set_p4d(p4dp, p4d) WRITE_ONCE(*(p4dp), (p4d))
+
+#define PKRAM_PTE_BM_BYTES (PTRS_PER_PTE / BITS_PER_BYTE)
+#define PKRAM_PTE_BM_MASK (PAGE_SIZE / PKRAM_PTE_BM_BYTES - 1)
+
+static pmd_t make_bitmap_pmd(unsigned long *bitmap)
+{
+ unsigned long val;
+
+ val = __pa(ALIGN_DOWN((unsigned long)bitmap, PAGE_SIZE));
+ val |= (((unsigned long)bitmap & ~PAGE_MASK) / PKRAM_PTE_BM_BYTES);
+
+ return __pmd(val);
+}
+
+static unsigned long *get_bitmap_addr(pmd_t pmd)
+{
+ unsigned long val, off;
+
+ val = pmd_val(pmd);
+ off = (val & PKRAM_PTE_BM_MASK) * PKRAM_PTE_BM_BYTES;
+
+ val = (val & PAGE_MASK) + off;
+
+ return __va(val);
+}
+
+int pkram_add_identity_map(struct page *page)
+{
+ unsigned long paddr;
+ unsigned long *bitmap;
+ unsigned int index;
+ struct page *pg;
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ if (!pkram_pgd) {
+ spin_lock(&pkram_pgd_lock);
+ if (!pkram_pgd) {
+ pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+ if (!pg)
+ goto nomem;
+ pkram_pgd = page_address(pg);
+ }
+ spin_unlock(&pkram_pgd_lock);
+ }
+
+ paddr = __pa(page_address(page));
+ pgd = pkram_pgd;
+ pgd += pgd_index(paddr);
+ if (pgd_none(*pgd)) {
+ spin_lock(&pkram_pgd_lock);
+ if (pgd_none(*pgd)) {
+ pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+ if (!pg)
+ goto nomem;
+ p4d = page_address(pg);
+ set_pgd(pgd, __pgd(__pa(p4d)));
+ }
+ spin_unlock(&pkram_pgd_lock);
+ }
+ p4d = p4d_offset(pgd, paddr);
+ if (p4d_none(*p4d)) {
+ spin_lock(&pkram_pgd_lock);
+ if (p4d_none(*p4d)) {
+ pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+ if (!pg)
+ goto nomem;
+ pud = page_address(pg);
+ set_p4d(p4d, __p4d(__pa(pud)));
+ }
+ spin_unlock(&pkram_pgd_lock);
+ }
+ pud = pud_offset(p4d, paddr);
+ if (pud_none(*pud)) {
+ spin_lock(&pkram_pgd_lock);
+ if (pud_none(*pud)) {
+ pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+ if (!pg)
+ goto nomem;
+ pmd = page_address(pg);
+ set_pud(pud, __pud(__pa(pmd)));
+ }
+ spin_unlock(&pkram_pgd_lock);
+ }
+ pmd = pmd_offset(pud, paddr);
+ if (pmd_none(*pmd)) {
+ spin_lock(&pkram_pgd_lock);
+ if (pmd_none(*pmd)) {
+ if (PageTransHuge(page)) {
+ set_pmd(pmd, pmd_mkhuge(*pmd));
+ spin_unlock(&pkram_pgd_lock);
+ goto done;
+ }
+ bitmap = bitmap_zalloc(PTRS_PER_PTE, GFP_ATOMIC);
+ if (!bitmap)
+ goto nomem;
+ set_pmd(pmd, make_bitmap_pmd(bitmap));
+ } else {
+ BUG_ON(pmd_leaf(*pmd));
+ bitmap = get_bitmap_addr(*pmd);
+ }
+ spin_unlock(&pkram_pgd_lock);
+ } else {
+ BUG_ON(pmd_leaf(*pmd));
+ bitmap = get_bitmap_addr(*pmd);
+ }
+
+ index = pte_index(paddr);
+ BUG_ON(test_bit(index, bitmap));
+ set_bit(index, bitmap);
+ smp_mb__after_atomic();
+ if (bitmap_full(bitmap, PTRS_PER_PTE))
+ set_pmd(pmd, pmd_mkhuge(*pmd));
+done:
+ return 0;
+nomem:
+ /* all allocation failure paths arrive here with pkram_pgd_lock held */
+ spin_unlock(&pkram_pgd_lock);
+ return -ENOMEM;
+}
+
+void pkram_remove_identity_map(struct page *page)
+{
+ unsigned long *bitmap;
+ unsigned long paddr;
+ unsigned int index;
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ /*
+ * pkram_pgd will be null when freeing metadata pages after a reboot
+ */
+ if (!pkram_pgd)
+ return;
+
+ paddr = __pa(page_address(page));
+ pgd = pkram_pgd;
+ pgd += pgd_index(paddr);
+ if (pgd_none(*pgd)) {
+ WARN_ONCE(1, "PKRAM: %s: no pgd for 0x%lx\n", __func__, paddr);
+ return;
+ }
+ p4d = p4d_offset(pgd, paddr);
+ if (p4d_none(*p4d)) {
+ WARN_ONCE(1, "PKRAM: %s: no p4d for 0x%lx\n", __func__, paddr);
+ return;
+ }
+ pud = pud_offset(p4d, paddr);
+ if (pud_none(*pud)) {
+ WARN_ONCE(1, "PKRAM: %s: no pud for 0x%lx\n", __func__, paddr);
+ return;
+ }
+ pmd = pmd_offset(pud, paddr);
+ if (pmd_none(*pmd)) {
+ WARN_ONCE(1, "PKRAM: %s: no pmd for 0x%lx\n", __func__, paddr);
+ return;
+ }
+ if (PageTransHuge(page)) {
+ BUG_ON(!pmd_leaf(*pmd));
+ pmd_clear(pmd);
+ return;
+ }
+
+ if (pmd_leaf(*pmd)) {
+ spin_lock(&pkram_pgd_lock);
+ if (pmd_leaf(*pmd))
+ set_pmd(pmd, __pmd(pte_val(pte_clrhuge(*(pte_t *)pmd))));
+ spin_unlock(&pkram_pgd_lock);
+ }
+
+ bitmap = get_bitmap_addr(*pmd);
+ index = pte_index(paddr);
+ clear_bit(index, bitmap);
+ smp_mb__after_atomic();
+
+ spin_lock(&pkram_pgd_lock);
+ if (!pmd_none(*pmd) && bitmap_empty(bitmap, PTRS_PER_PTE)) {
+ pmd_clear(pmd);
+ spin_unlock(&pkram_pgd_lock);
+ bitmap_free(bitmap);
+ } else {
+ spin_unlock(&pkram_pgd_lock);
+ }
+}
+
+struct pkram_pg_state {
+ int (*range_cb)(unsigned long base, unsigned long size, void *private);
+ unsigned long start_addr;
+ unsigned long curr_addr;
+ unsigned long min_addr;
+ unsigned long max_addr;
+ void *private;
+ bool tracking;
+};
+
+#define pgd_none(a) (pgtable_l5_enabled() ? pgd_none(a) : p4d_none(__p4d(pgd_val(a))))
+
+static int note_page(struct pkram_pg_state *st, unsigned long addr, bool present)
+{
+ if (!st->tracking && present) {
+ if (addr >= st->max_addr)
+ return 1;
+ /*
+ * addr can be < min_addr if the page straddles the
+ * boundary
+ */
+ st->start_addr = max(addr, st->min_addr);
+ st->tracking = true;
+ } else if (st->tracking) {
+ unsigned long base, size;
+ int ret;
+
+ /* Continue tracking if upper bound has not been reached */
+ if (present && addr < st->max_addr)
+ return 0;
+
+ addr = min(addr, st->max_addr);
+
+ base = st->start_addr;
+ size = addr - st->start_addr;
+ st->tracking = false;
+
+ ret = st->range_cb(base, size, st->private);
+
+ if (addr == st->max_addr)
+ return 1;
+ else
+ return ret;
+ }
+
+ return 0;
+}
+
+static int walk_pte_level(struct pkram_pg_state *st, pmd_t addr, unsigned long P)
+{
+ unsigned long *bitmap;
+ int present;
+ int i, ret = 0;
+
+ bitmap = get_bitmap_addr(addr);
+ for (i = 0; i < PTRS_PER_PTE; i++) {
+ unsigned long curr_addr = P + i * PAGE_SIZE;
+
+ if (curr_addr < st->min_addr)
+ continue;
+ present = test_bit(i, bitmap);
+ ret = note_page(st, curr_addr, present);
+ if (ret)
+ break;
+ }
+
+ return ret;
+}
+
+static int walk_pmd_level(struct pkram_pg_state *st, pud_t addr, unsigned long P)
+{
+ pmd_t *start;
+ int i, ret = 0;
+
+ start = pud_pgtable(addr);
+ for (i = 0; i < PTRS_PER_PMD; i++, start++) {
+ unsigned long curr_addr = P + i * PMD_SIZE;
+
+ if (curr_addr + PMD_SIZE <= st->min_addr)
+ continue;
+ if (!pmd_none(*start)) {
+ if (pmd_leaf(*start))
+ ret = note_page(st, curr_addr, true);
+ else
+ ret = walk_pte_level(st, *start, curr_addr);
+ } else
+ ret = note_page(st, curr_addr, false);
+ if (ret)
+ break;
+ }
+
+ return ret;
+}
+
+static int walk_pud_level(struct pkram_pg_state *st, p4d_t addr, unsigned long P)
+{
+ pud_t *start;
+ int i, ret = 0;
+
+ start = p4d_pgtable(addr);
+ for (i = 0; i < PTRS_PER_PUD; i++, start++) {
+ unsigned long curr_addr = P + i * PUD_SIZE;
+
+ if (curr_addr + PUD_SIZE <= st->min_addr)
+ continue;
+ if (!pud_none(*start)) {
+ if (pud_leaf(*start))
+ ret = note_page(st, curr_addr, true);
+ else
+ ret = walk_pmd_level(st, *start, curr_addr);
+ } else
+ ret = note_page(st, curr_addr, false);
+ if (ret)
+ break;
+ }
+
+ return ret;
+}
+
+static int walk_p4d_level(struct pkram_pg_state *st, pgd_t addr, unsigned long P)
+{
+ p4d_t *start;
+ int i, ret = 0;
+
+ if (PTRS_PER_P4D == 1)
+ return walk_pud_level(st, __p4d(pgd_val(addr)), P);
+
+ start = (p4d_t *)pgd_page_vaddr(addr);
+ for (i = 0; i < PTRS_PER_P4D; i++, start++) {
+ unsigned long curr_addr = P + i * P4D_SIZE;
+
+ if (curr_addr + P4D_SIZE <= st->min_addr)
+ continue;
+ if (!p4d_none(*start)) {
+ if (p4d_leaf(*start))
+ ret = note_page(st, curr_addr, true);
+ else
+ ret = walk_pud_level(st, *start, curr_addr);
+ } else
+ ret = note_page(st, curr_addr, false);
+ if (ret)
+ break;
+ }
+
+ return ret;
+}
+
+void pkram_walk_pgt(struct pkram_pg_state *st, pgd_t *pgd)
+{
+ pgd_t *start = pgd;
+ int i, ret = 0;
+
+ for (i = 0; i < PTRS_PER_PGD; i++, start++) {
+ unsigned long curr_addr = i * PGDIR_SIZE;
+
+ if (curr_addr + PGDIR_SIZE <= st->min_addr)
+ continue;
+ if (!pgd_none(*start))
+ ret = walk_p4d_level(st, *start, curr_addr);
+ else
+ ret = note_page(st, curr_addr, false);
+ if (ret)
+ break;
+ }
+}
+
+void pkram_find_preserved(unsigned long start, unsigned long end, void *private, int (*callback)(unsigned long base, unsigned long size, void *private))
+{
+ struct pkram_pg_state st = {
+ .range_cb = callback,
+ .min_addr = start,
+ .max_addr = end,
+ .private = private,
+ };
+
+ if (!pkram_pgd)
+ return;
+
+ pkram_walk_pgt(&st, pkram_pgd);
+}
--
1.9.4

2023-04-27 00:19:29

by Anthony Yznaga

Subject: [RFC v3 05/21] mm: PKRAM: implement byte stream operations

This patch adds the ability to save an arbitrary byte stream to a
PKRAM object using pkram_write(), to be restored later using pkram_read().
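
For example, a load-side sequence using the byte-stream access type
might look like the following sketch (based on the API in this series,
with error handling elided):

    PKRAM_ACCESS(pa_bytes, &ps, bytes);

    nread = pkram_read(&pa_bytes, buf, len); /* may be < len at end of data */
    pkram_finish_access(&pa_bytes, true);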

Originally-by: Vladimir Davydov <[email protected]>
Signed-off-by: Anthony Yznaga <[email protected]>
---
include/linux/pkram.h | 11 +++++
mm/pkram.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 130 insertions(+), 4 deletions(-)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index 130ab5c2d94a..b614e9059bba 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -14,10 +14,12 @@
* enum pkram_data_flags - definition of data types contained in a pkram obj
* @PKRAM_DATA_none: No data types configured
* @PKRAM_DATA_folios: obj contains folio data
+ * @PKRAM_DATA_bytes: obj contains byte data
*/
enum pkram_data_flags {
PKRAM_DATA_none = 0x0, /* No data types configured */
PKRAM_DATA_folios = 0x1, /* Contains folio data */
+ PKRAM_DATA_bytes = 0x2, /* Contains byte data */
};

struct pkram_data_stream {
@@ -36,18 +38,27 @@ struct pkram_stream {

__u64 *folios_head_link_pfnp;
__u64 *folios_tail_link_pfnp;
+
+ __u64 *bytes_head_link_pfnp;
+ __u64 *bytes_tail_link_pfnp;
};

struct pkram_folios_access {
unsigned long next_index;
};

+struct pkram_bytes_access {
+ struct page *data_page; /* current page */
+ unsigned int data_offset; /* offset into current page */
+};
+
struct pkram_access {
enum pkram_data_flags dtype;
struct pkram_stream *ps;
struct pkram_data_stream pds;

struct pkram_folios_access folios;
+ struct pkram_bytes_access bytes;
};

#define PKRAM_NAME_MAX 256 /* including nul */
diff --git a/mm/pkram.c b/mm/pkram.c
index 610ff7a88c98..eac8cf6b0cdf 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/err.h>
#include <linux/gfp.h>
+#include <linux/highmem.h>
#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/list.h>
@@ -44,6 +45,9 @@ struct pkram_link {
struct pkram_obj {
__u64 folios_head_link_pfn; /* the first folios link of the object */
__u64 folios_tail_link_pfn; /* the last folios link of the object */
+ __u64 bytes_head_link_pfn; /* the first bytes link of the object */
+ __u64 bytes_tail_link_pfn; /* the last bytes link of the object */
+ __u64 data_len; /* byte data size */
__u64 obj_pfn; /* points to the next object in the list */
};

@@ -138,6 +142,11 @@ static void pkram_truncate_obj(struct pkram_obj *obj)
pkram_truncate_links(obj->folios_head_link_pfn);
obj->folios_head_link_pfn = 0;
obj->folios_tail_link_pfn = 0;
+
+ pkram_truncate_links(obj->bytes_head_link_pfn);
+ obj->bytes_head_link_pfn = 0;
+ obj->bytes_tail_link_pfn = 0;
+ obj->data_len = 0;
}

static void pkram_truncate_node(struct pkram_node *node)
@@ -310,7 +319,7 @@ int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)

BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);

- if (flags & ~PKRAM_DATA_folios)
+ if (flags & ~(PKRAM_DATA_folios | PKRAM_DATA_bytes))
return -EINVAL;

page = pkram_alloc_page(ps->gfp_mask | __GFP_ZERO);
@@ -326,6 +335,10 @@ int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
ps->folios_head_link_pfnp = &obj->folios_head_link_pfn;
ps->folios_tail_link_pfnp = &obj->folios_tail_link_pfn;
}
+ if (flags & PKRAM_DATA_bytes) {
+ ps->bytes_head_link_pfnp = &obj->bytes_head_link_pfn;
+ ps->bytes_tail_link_pfnp = &obj->bytes_tail_link_pfn;
+ }
ps->obj = obj;
return 0;
}
@@ -432,7 +445,7 @@ int pkram_prepare_load_obj(struct pkram_stream *ps)
return -ENODATA;

obj = pfn_to_kaddr(node->obj_pfn);
- if (!obj->folios_head_link_pfn) {
+ if (!obj->folios_head_link_pfn && !obj->bytes_head_link_pfn) {
WARN_ON(1);
return -EINVAL;
}
@@ -443,6 +456,10 @@ int pkram_prepare_load_obj(struct pkram_stream *ps)
ps->folios_head_link_pfnp = &obj->folios_head_link_pfn;
ps->folios_tail_link_pfnp = &obj->folios_tail_link_pfn;
}
+ if (obj->bytes_head_link_pfn) {
+ ps->bytes_head_link_pfnp = &obj->bytes_head_link_pfn;
+ ps->bytes_tail_link_pfnp = &obj->bytes_tail_link_pfn;
+ }
ps->obj = obj;
return 0;
}
@@ -493,6 +510,9 @@ void pkram_finish_access(struct pkram_access *pa, bool status_ok)

if (pa->pds.link)
pkram_truncate_link(pa->pds.link);
+
+ if ((pa->dtype == PKRAM_DATA_bytes) && (pa->bytes.data_page))
+ pkram_free_page(page_address(pa->bytes.data_page));
}

/*
@@ -547,6 +567,22 @@ int pkram_save_folio(struct pkram_access *pa, struct folio *folio)
return __pkram_save_page(pa, page, page->index);
}

+static int __pkram_bytes_save_page(struct pkram_access *pa, struct page *page)
+{
+ struct pkram_data_stream *pds = &pa->pds;
+ struct pkram_link *link = pds->link;
+
+ if (!link || pds->entry_idx >= PKRAM_LINK_ENTRIES_MAX) {
+ link = pkram_new_link(pds, pa->ps->gfp_mask);
+ if (!link)
+ return -ENOMEM;
+ }
+
+ pkram_add_link_entry(pds, page);
+
+ return 0;
+}
+
static struct page *__pkram_prep_load_page(pkram_entry_t p)
{
struct page *page;
@@ -662,10 +698,53 @@ struct folio *pkram_load_folio(struct pkram_access *pa, unsigned long *index)
*
* On success, returns the number of bytes written, which is always equal to
* @count. On failure, -errno is returned.
+ *
+ * Error values:
+ * %ENOMEM: insufficient amount of memory available
*/
ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count)
{
- return -EINVAL;
+ struct pkram_node *node = pa->ps->node;
+ struct pkram_obj *obj = pa->ps->obj;
+ size_t copy_count, write_count = 0;
+ void *addr;
+
+ BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
+
+ while (count > 0) {
+ if (!pa->bytes.data_page) {
+ gfp_t gfp_mask = pa->ps->gfp_mask;
+ struct page *page;
+ int err;
+
+ page = pkram_alloc_page((gfp_mask & GFP_RECLAIM_MASK) |
+ __GFP_HIGHMEM | __GFP_ZERO);
+ if (!page)
+ return -ENOMEM;
+ err = __pkram_bytes_save_page(pa, page);
+ if (err) {
+ pkram_free_page(page_address(page));
+ return err;
+ }
+ pa->bytes.data_page = page;
+ pa->bytes.data_offset = 0;
+ }
+
+ copy_count = min_t(size_t, count, PAGE_SIZE - pa->bytes.data_offset);
+ addr = kmap_local_page(pa->bytes.data_page);
+ memcpy(addr + pa->bytes.data_offset, buf, copy_count);
+ kunmap_local(addr);
+
+ buf += copy_count;
+ obj->data_len += copy_count;
+ pa->bytes.data_offset += copy_count;
+ if (pa->bytes.data_offset >= PAGE_SIZE)
+ pa->bytes.data_page = NULL;
+
+ write_count += copy_count;
+ count -= copy_count;
+ }
+ return write_count;
}

/**
@@ -679,5 +758,41 @@ ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count)
*/
size_t pkram_read(struct pkram_access *pa, void *buf, size_t count)
{
- return 0;
+ struct pkram_node *node = pa->ps->node;
+ struct pkram_obj *obj = pa->ps->obj;
+ size_t copy_count, read_count = 0;
+ char *addr;
+
+ BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
+
+ while (count > 0 && obj->data_len > 0) {
+ if (!pa->bytes.data_page) {
+ struct page *page;
+
+ page = __pkram_load_page(pa, NULL);
+ if (IS_ERR_OR_NULL(page))
+ break;
+ pa->bytes.data_page = page;
+ pa->bytes.data_offset = 0;
+ }
+
+ copy_count = min_t(size_t, count, PAGE_SIZE - pa->bytes.data_offset);
+ if (copy_count > obj->data_len)
+ copy_count = obj->data_len;
+ addr = kmap_local_page(pa->bytes.data_page);
+ memcpy(buf, addr + pa->bytes.data_offset, copy_count);
+ kunmap_local(addr);
+
+ buf += copy_count;
+ obj->data_len -= copy_count;
+ pa->bytes.data_offset += copy_count;
+ if (pa->bytes.data_offset >= PAGE_SIZE || !obj->data_len) {
+ put_page(pa->bytes.data_page);
+ pa->bytes.data_page = NULL;
+ }
+
+ read_count += copy_count;
+ count -= copy_count;
+ }
+ return read_count;
}
--
1.9.4

2023-04-27 00:19:31

by Anthony Yznaga

Subject: [RFC v3 01/21] mm: add PKRAM API stubs and Kconfig

Preserved-over-kexec memory or PKRAM is a method for saving memory
pages of the currently executing kernel and restoring them after kexec
boot into a new one. This can be utilized for preserving guest VM state,
large in-memory databases, process memory, etc. across reboot. While
DRAM-as-PMEM or actual persistent memory could be used to accomplish
these things, PKRAM provides the latency of DRAM with the flexibility
of dynamically determining the amount of memory to preserve.

The proposed API:

* Preserved memory is divided into nodes which can be saved or loaded
independently of each other. The nodes are identified by unique name
strings. A PKRAM node is created when save is initiated by calling
pkram_prepare_save(). A PKRAM node is removed when load is initiated by
calling pkram_prepare_load(). See the example sequences below.

* A node is further divided into objects. An object represents closely
coupled data in the form of a grouping of folios and/or a stream of
byte data. For example, the folios and attributes of a file.
After initiating an operation on a PKRAM node, PKRAM objects are
initialized for saving or loading by calling pkram_prepare_save_obj()
or pkram_prepare_load_obj().

* For saving/loading data from a PKRAM node/object, instances of the
pkram_stream and pkram_access structs are used. pkram_stream tracks
the node and object being operated on while pkram_access tracks the
data type and position within an object.

The pkram_stream struct is initialized by calling pkram_prepare_save()
or pkram_prepare_load() and then pkram_prepare_save_obj() or
pkram_prepare_load_obj().

Once a pkram_stream is fully initialized, a pkram_access struct
is initialized for each data type associated with the object.
After save or load of a data type for the object is complete,
pkram_finish_access() is called.

After save or load is complete for the object, pkram_finish_save_obj()
or pkram_finish_load_obj() must be called followed by pkram_finish_save()
or pkram_finish_load() when save or load is completed for the node.
If an error occurred during save, the saved data and the PKRAM node
may be freed by calling pkram_discard_save() instead of
pkram_finish_save().

* Both folio data and byte data can separately be streamed to a PKRAM
object. pkram_save_folio() and pkram_load_folio() are used
to stream folio data while pkram_write() and pkram_read() are used to
stream byte data.

A sequence of operations for saving/loading data from PKRAM would
look like:

* For saving data to PKRAM:

/* create a PKRAM node and do initial stream setup */
pkram_prepare_save()

/* create a PKRAM object associated with the PKRAM node and complete stream initialization */
pkram_prepare_save_obj()

/* save data to the node/object */
PKRAM_ACCESS(pa_folios,...)
PKRAM_ACCESS(pa_bytes,...)
pkram_save_folio(pa_folios,...)[,...] /* for file folios */
pkram_write(pa_bytes,...)[,...] /* for a byte stream */
pkram_finish_access(pa_folios)
pkram_finish_access(pa_bytes)

pkram_finish_save_obj()

/* commit the save or discard and delete the node */
pkram_finish_save() /* on success, or */
pkram_discard_save() /* ... in case of error */

* For loading data from PKRAM:

/* remove a PKRAM node from the list and do initial stream setup */
pkram_prepare_load()

/* Remove a PKRAM object from the node and complete stream initialization for loading data from it. */
pkram_prepare_load_obj()

/* load data from the node/object */
PKRAM_ACCESS(pa_folios,...)
PKRAM_ACCESS(pa_bytes,...)
pkram_load_folio(pa_folios,...)[,...] /* for file folios */
pkram_read(pa_bytes,...)[,...] /* for a byte stream */
pkram_finish_access(pa_folios)
pkram_finish_access(pa_bytes)

/* free the object */
pkram_finish_load_obj()

/* free the node */
pkram_finish_load()
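
As a more concrete (hypothetical) example, a helper that preserves a
named byte buffer could be built on this API as follows. It uses the
byte-stream data type added later in the series, and error handling is
simplified:

    static int example_preserve_buf(const char *name, const void *buf,
                                    size_t len)
    {
        struct pkram_stream ps;
        ssize_t n;
        int err;

        err = pkram_prepare_save(&ps, name, GFP_KERNEL);
        if (err)
            return err;

        err = pkram_prepare_save_obj(&ps, PKRAM_DATA_bytes);
        if (err)
            goto discard;

        {
            PKRAM_ACCESS(pa, &ps, bytes);

            n = pkram_write(&pa, buf, len);
            pkram_finish_access(&pa, n == len);
            if (n != len) {
                err = n < 0 ? n : -EIO;
                goto discard;
            }
        }

        pkram_finish_save_obj(&ps);
        pkram_finish_save(&ps);
        return 0;

    discard:
        pkram_discard_save(&ps);
        return err;
    }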

Originally-by: Vladimir Davydov <[email protected]>
Signed-off-by: Anthony Yznaga <[email protected]>
---
include/linux/pkram.h | 47 +++++++++++++
mm/Kconfig | 9 +++
mm/Makefile | 1 +
mm/pkram.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 236 insertions(+)
create mode 100644 include/linux/pkram.h
create mode 100644 mm/pkram.c

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
new file mode 100644
index 000000000000..57b8db4229a4
--- /dev/null
+++ b/include/linux/pkram.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PKRAM_H
+#define _LINUX_PKRAM_H
+
+#include <linux/gfp.h>
+#include <linux/types.h>
+#include <linux/mm_types.h>
+
+/**
+ * enum pkram_data_flags - definition of data types contained in a pkram obj
+ * @PKRAM_DATA_none: No data types configured
+ */
+enum pkram_data_flags {
+ PKRAM_DATA_none = 0x0, /* No data types configured */
+};
+
+struct pkram_stream;
+struct pkram_access;
+
+#define PKRAM_NAME_MAX 256 /* including nul */
+
+int pkram_prepare_save(struct pkram_stream *ps, const char *name,
+ gfp_t gfp_mask);
+int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags);
+
+void pkram_finish_save(struct pkram_stream *ps);
+void pkram_finish_save_obj(struct pkram_stream *ps);
+void pkram_discard_save(struct pkram_stream *ps);
+
+int pkram_prepare_load(struct pkram_stream *ps, const char *name);
+int pkram_prepare_load_obj(struct pkram_stream *ps);
+
+void pkram_finish_load(struct pkram_stream *ps);
+void pkram_finish_load_obj(struct pkram_stream *ps);
+
+#define PKRAM_ACCESS(name, stream, type) \
+ struct pkram_access name
+
+void pkram_finish_access(struct pkram_access *pa, bool status_ok);
+
+int pkram_save_folio(struct pkram_access *pa, struct folio *folio);
+struct folio *pkram_load_folio(struct pkram_access *pa, unsigned long *index);
+
+ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count);
+size_t pkram_read(struct pkram_access *pa, void *buf, size_t count);
+
+#endif /* _LINUX_PKRAM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 4751031f3f05..10f089f4a181 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1202,6 +1202,15 @@ config LRU_GEN_STATS
This option has a per-memcg and per-node memory overhead.
# }

+config PKRAM
+ bool "Preserved-over-kexec memory storage"
+ default n
+ help
+ This option adds the kernel API that enables saving memory pages of
+ the currently executing kernel and restoring them after a kexec in
+ the newly booted one. This can be utilized for speeding up reboot by
+ leaving process memory and/or FS caches in-place.
+
source "mm/damon/Kconfig"

endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..7a8d5a286d48 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_PKRAM) += pkram.o
diff --git a/mm/pkram.c b/mm/pkram.c
new file mode 100644
index 000000000000..421de8211e05
--- /dev/null
+++ b/mm/pkram.c
@@ -0,0 +1,179 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/err.h>
+#include <linux/gfp.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/pkram.h>
+#include <linux/types.h>
+
+/**
+ * Create a preserved memory node with name @name and initialize stream @ps
+ * for saving data to it.
+ *
+ * @gfp_mask specifies the memory allocation mask to be used when saving data.
+ *
+ * Returns 0 on success, -errno on failure.
+ *
+ * After the save has finished, pkram_finish_save() (or pkram_discard_save() in
+ * case of failure) is to be called.
+ */
+int pkram_prepare_save(struct pkram_stream *ps, const char *name, gfp_t gfp_mask)
+{
+ return -EINVAL;
+}
+
+/**
+ * Create a preserved memory object and initialize stream @ps for saving data
+ * to it.
+ *
+ * Returns 0 on success, -errno on failure.
+ *
+ * After the save has finished, pkram_finish_save_obj() (or pkram_discard_save()
+ * in case of failure) is to be called.
+ */
+int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
+{
+ return -EINVAL;
+}
+
+/**
+ * Commit the object started with pkram_prepare_save_obj() to preserved memory.
+ */
+void pkram_finish_save_obj(struct pkram_stream *ps)
+{
+ WARN_ON_ONCE(1);
+}
+
+/**
+ * Commit the save to preserved memory started with pkram_prepare_save().
+ * After the call, the stream may not be used any more.
+ */
+void pkram_finish_save(struct pkram_stream *ps)
+{
+ WARN_ON_ONCE(1);
+}
+
+/**
+ * Cancel the save to preserved memory started with pkram_prepare_save() and
+ * destroy the corresponding preserved memory node freeing any data already
+ * saved to it.
+ */
+void pkram_discard_save(struct pkram_stream *ps)
+{
+ WARN_ON_ONCE(1);
+}
+
+/**
+ * Remove the preserved memory node with name @name and initialize stream @ps
+ * for loading data from it.
+ *
+ * Returns 0 on success, -errno on failure.
+ *
+ * After the load has finished, pkram_finish_load() is to be called.
+ */
+int pkram_prepare_load(struct pkram_stream *ps, const char *name)
+{
+ return -EINVAL;
+}
+
+/**
+ * Remove the next preserved memory object from the stream @ps and
+ * initialize stream @ps for loading data from it.
+ *
+ * Returns 0 on success, -errno on failure.
+ *
+ * After the load has finished, pkram_finish_load_obj() is to be called.
+ */
+int pkram_prepare_load_obj(struct pkram_stream *ps)
+{
+ return -EINVAL;
+}
+
+/**
+ * Finish the load of a preserved memory object started with
+ * pkram_prepare_load_obj() freeing the object and any data that has not
+ * been loaded from it.
+ */
+void pkram_finish_load_obj(struct pkram_stream *ps)
+{
+ WARN_ON_ONCE(1);
+}
+
+/**
+ * Finish the load from preserved memory started with pkram_prepare_load()
+ * freeing the corresponding preserved memory node and any data that has
+ * not been loaded from it.
+ */
+void pkram_finish_load(struct pkram_stream *ps)
+{
+ WARN_ON_ONCE(1);
+}
+
+/**
+ * Finish the data access to or from the preserved memory node and object
+ * associated with pkram stream access @pa. The access must have been
+ * initialized with PKRAM_ACCESS().
+ */
+void pkram_finish_access(struct pkram_access *pa, bool status_ok)
+{
+ WARN_ON_ONCE(1);
+}
+
+/**
+ * Save folio @folio to the preserved memory node and object associated
+ * with pkram stream access @pa. The stream must have been initialized with
+ * pkram_prepare_save() and pkram_prepare_save_obj() and access initialized
+ * with PKRAM_ACCESS().
+ *
+ * Returns 0 on success, -errno on failure.
+ */
+int pkram_save_folio(struct pkram_access *pa, struct folio *folio)
+{
+ return -EINVAL;
+}
+
+/**
+ * Load the next folio from the preserved memory node and object associated
+ * with pkram stream access @pa. The stream must have been initialized with
+ * pkram_prepare_load() and pkram_prepare_load_obj() and access initialized
+ * with PKRAM_ACCESS().
+ *
+ * If not NULL, @index is initialized with the preserved mapping offset of the
+ * folio loaded.
+ *
+ * Returns the folio loaded or NULL if the node is empty.
+ *
+ * The folio loaded has its refcount incremented.
+ */
+struct folio *pkram_load_folio(struct pkram_access *pa, unsigned long *index)
+{
+ return NULL;
+}
+
+/**
+ * Copy @count bytes from @buf to the preserved memory node and object
+ * associated with pkram stream access @pa. The stream must have been
+ * initialized with pkram_prepare_save() and pkram_prepare_save_obj()
+ * and access initialized with PKRAM_ACCESS().
+ *
+ * On success, returns the number of bytes written, which is always equal to
+ * @count. On failure, -errno is returned.
+ */
+ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count)
+{
+ return -EINVAL;
+}
+
+/**
+ * Copy up to @count bytes from the preserved memory node and object
+ * associated with pkram stream access @pa to @buf. The stream must have been
+ * initialized with pkram_prepare_load() and pkram_prepare_load_obj() and
+ * access initialized with PKRAM_ACCESS().
+ *
+ * Returns the number of bytes read, which may be less than @count if the node
+ * has fewer bytes available.
+ */
+size_t pkram_read(struct pkram_access *pa, void *buf, size_t count)
+{
+ return 0;
+}
--
1.9.4

2023-04-27 00:19:37

by Anthony Yznaga

Subject: [RFC v3 18/21] mm: PKRAM: allow preserved memory to be freed from userspace

To free all space utilized for preserved memory, one can write 0 to
/sys/kernel/pkram. This will destroy all PKRAM nodes that are not
currently being read or written.
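
For example:

  # echo 0 > /sys/kernel/pkram

Reading the file still returns the pfn of the PKRAM super block; only
the store operation is new in this patch.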

Originally-by: Vladimir Davydov <[email protected]>
Signed-off-by: Anthony Yznaga <[email protected]>
---
mm/pkram.c | 39 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/mm/pkram.c b/mm/pkram.c
index 474fb6fc8355..d404e415f3cb 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -493,6 +493,32 @@ static void pkram_truncate_node(struct pkram_node *node)
node->obj_pfn = 0;
}

+/*
+ * Free all nodes that are not under operation.
+ */
+static void pkram_truncate(void)
+{
+ struct page *page, *tmp;
+ struct pkram_node *node;
+ LIST_HEAD(dispose);
+
+ mutex_lock(&pkram_mutex);
+ list_for_each_entry_safe(page, tmp, &pkram_nodes, lru) {
+ node = page_address(page);
+ if (!(node->flags & PKRAM_ACCMODE_MASK))
+ list_move(&page->lru, &dispose);
+ }
+ mutex_unlock(&pkram_mutex);
+
+ while (!list_empty(&dispose)) {
+ page = list_first_entry(&dispose, struct page, lru);
+ list_del(&page->lru);
+ node = page_address(page);
+ pkram_truncate_node(node);
+ pkram_free_page(node);
+ }
+}
+
static void pkram_add_link(struct pkram_link *link, struct pkram_data_stream *pds)
{
__u64 link_pfn = page_to_pfn(virt_to_page(link));
@@ -1252,8 +1278,19 @@ static ssize_t show_pkram_sb_pfn(struct kobject *kobj,
return sprintf(buf, "%lx\n", pfn);
}

+static ssize_t store_pkram_sb_pfn(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ int val;
+
+ if (kstrtoint(buf, 0, &val) || val)
+ return -EINVAL;
+ pkram_truncate();
+ return count;
+}
+
static struct kobj_attribute pkram_sb_pfn_attr =
- __ATTR(pkram, 0444, show_pkram_sb_pfn, NULL);
+ __ATTR(pkram, 0644, show_pkram_sb_pfn, store_pkram_sb_pfn);

static struct attribute *pkram_attrs[] = {
&pkram_sb_pfn_attr.attr,
--
1.9.4

2023-04-27 00:19:45

by Anthony Yznaga

Subject: [RFC v3 09/21] PKRAM: pass a list of preserved ranges to the next kernel

In order to build a new memblock reserved list during boot that
includes ranges preserved by the previous kernel, a list of preserved
ranges is passed to the next kernel via the pkram superblock. The
ranges are stored in ascending order in a linked list of pages. A more
complete memblock list is not prepared to avoid possible conflicts with
changes in a newer kernel and to avoid having to allocate a contiguous
range larger than a page.
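
Assuming 4K pages and a 64-bit phys_addr_t, the capacity of each list
page works out as follows (see PKRAM_REGIONS_LIST_MAX in the patch):

    /*
     * sizeof(struct pkram_region)      = 16 bytes (base + size)
     * sizeof(struct pkram_region_list) = 16 bytes (prev_pfn + next_pfn)
     * regions per page                 = (4096 - 16) / 16 = 255
     */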

Signed-off-by: Anthony Yznaga <[email protected]>
---
mm/pkram.c | 184 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 177 insertions(+), 7 deletions(-)

diff --git a/mm/pkram.c b/mm/pkram.c
index e6c0f3c52465..3790e5180feb 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -84,6 +84,20 @@ struct pkram_node {
#define PKRAM_LOAD 2
#define PKRAM_ACCMODE_MASK 3

+struct pkram_region {
+ phys_addr_t base;
+ phys_addr_t size;
+};
+
+struct pkram_region_list {
+ __u64 prev_pfn;
+ __u64 next_pfn;
+
+ struct pkram_region regions[];
+};
+
+#define PKRAM_REGIONS_LIST_MAX \
+ ((PAGE_SIZE-sizeof(struct pkram_region_list))/sizeof(struct pkram_region))
/*
* The PKRAM super block contains data needed to restore the preserved memory
* structure on boot. The pointer to it (pfn) should be passed via the 'pkram'
@@ -96,13 +110,21 @@ struct pkram_node {
*/
struct pkram_super_block {
__u64 node_pfn; /* first element of the node list */
+ __u64 region_list_pfn;
+ __u64 nr_regions;
};

+static struct pkram_region_list *pkram_regions_list;
+static int pkram_init_regions_list(void);
+static unsigned long pkram_populate_regions_list(void);
+
static unsigned long pkram_sb_pfn __initdata;
static struct pkram_super_block *pkram_sb;

extern int pkram_add_identity_map(struct page *page);
extern void pkram_remove_identity_map(struct page *page);
+extern void pkram_find_preserved(unsigned long start, unsigned long end, void *private,
+ int (*callback)(unsigned long base, unsigned long size, void *private));

/*
* For convenience sake PKRAM nodes are kept in an auxiliary doubly-linked list
@@ -878,21 +900,48 @@ static void __pkram_reboot(void)
struct page *page;
struct pkram_node *node;
unsigned long node_pfn = 0;
+ unsigned long rl_pfn = 0;
+ unsigned long nr_regions = 0;
+ int err = 0;

- list_for_each_entry_reverse(page, &pkram_nodes, lru) {
- node = page_address(page);
- if (WARN_ON(node->flags & PKRAM_ACCMODE_MASK))
- continue;
- node->node_pfn = node_pfn;
- node_pfn = page_to_pfn(page);
+ if (!list_empty(&pkram_nodes)) {
+ err = pkram_add_identity_map(virt_to_page(pkram_sb));
+ if (err) {
+ pr_err("PKRAM: failed to add super block to pagetable\n");
+ goto done;
+ }
+ list_for_each_entry_reverse(page, &pkram_nodes, lru) {
+ node = page_address(page);
+ if (WARN_ON(node->flags & PKRAM_ACCMODE_MASK))
+ continue;
+ node->node_pfn = node_pfn;
+ node_pfn = page_to_pfn(page);
+ }
+ err = pkram_init_regions_list();
+ if (err) {
+ pr_err("PKRAM: failed to init regions list\n");
+ goto done;
+ }
+ nr_regions = pkram_populate_regions_list();
+ if (IS_ERR_VALUE(nr_regions)) {
+ err = nr_regions;
+ pr_err("PKRAM: failed to populate regions list\n");
+ goto done;
+ }
+ rl_pfn = page_to_pfn(virt_to_page(pkram_regions_list));
}

+done:
/*
* Zero out pkram_sb completely since it may have been passed from
* the previous boot.
*/
memset(pkram_sb, 0, PAGE_SIZE);
- pkram_sb->node_pfn = node_pfn;
+ if (!err && node_pfn) {
+ pkram_sb->node_pfn = node_pfn;
+ pkram_sb->region_list_pfn = rl_pfn;
+ pkram_sb->nr_regions = nr_regions;
+ }
}

static int pkram_reboot(struct notifier_block *notifier,
@@ -968,3 +1017,124 @@ static int __init pkram_init(void)
return 0;
}
module_init(pkram_init);
+
+static int count_region_cb(unsigned long base, unsigned long size, void *private)
+{
+ unsigned long *nr_regions = (unsigned long *)private;
+
+ (*nr_regions)++;
+ return 0;
+}
+
+static unsigned long pkram_count_regions(void)
+{
+ unsigned long nr_regions = 0;
+
+ pkram_find_preserved(0, PHYS_ADDR_MAX, &nr_regions, count_region_cb);
+
+ return nr_regions;
+}
+
+/*
+ * To facilitate rapidly building a new memblock reserved list during boot
+ * with the addition of preserved memory ranges a regions list is built
+ * before reboot.
+ * The regions list is a linked list of pages with each page containing an
+ * array of preserved memory ranges. The ranges are stored in each page
+ * and across the list in address order. A linked list is used rather than
+ * a single contiguous range to mitigate against the possibility that a
+ * larger, contiguous allocation may fail due to fragmentation.
+ *
+ * Since the pages of the regions list must be preserved and the pkram
+ * pagetable is used to determine what ranges are preserved, the list pages
+ * must be allocated and represented in the pkram pagetable before they can
+ * be populated. Rather than recounting the number of regions after
+ * allocating pages and repeating until a precise number of pages are
+ * allocated, the number of pages needed is estimated.
+ */
+static int pkram_init_regions_list(void)
+{
+ struct pkram_region_list *rl;
+ unsigned long nr_regions;
+ unsigned long nr_lpages;
+ struct page *page;
+
+ nr_regions = pkram_count_regions();
+
+ nr_lpages = DIV_ROUND_UP(nr_regions, PKRAM_REGIONS_LIST_MAX);
+ nr_regions += nr_lpages;
+ nr_lpages = DIV_ROUND_UP(nr_regions, PKRAM_REGIONS_LIST_MAX);
+
+ for (; nr_lpages; nr_lpages--) {
+ page = pkram_alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ return -ENOMEM;
+ rl = page_address(page);
+ if (pkram_regions_list) {
+ rl->next_pfn = page_to_pfn(virt_to_page(pkram_regions_list));
+ pkram_regions_list->prev_pfn = page_to_pfn(page);
+ }
+ pkram_regions_list = rl;
+ }
+
+ return 0;
+}
+
+struct pkram_regions_priv {
+ struct pkram_region_list *curr;
+ struct pkram_region_list *last;
+ unsigned long nr_regions;
+ int idx;
+};
+
+static int add_region_cb(unsigned long base, unsigned long size, void *private)
+{
+ struct pkram_regions_priv *priv;
+ struct pkram_region_list *rl;
+ int i;
+
+ priv = (struct pkram_regions_priv *)private;
+ rl = priv->curr;
+ i = priv->idx;
+
+ if (!rl) {
+ WARN_ON(1);
+ return 1;
+ }
+
+ if (!i)
+ priv->last = priv->curr;
+
+ rl->regions[i].base = base;
+ rl->regions[i].size = size;
+
+ priv->nr_regions++;
+ i++;
+ if (i == PKRAM_REGIONS_LIST_MAX) {
+ u64 next_pfn = rl->next_pfn;
+
+ if (next_pfn)
+ priv->curr = pfn_to_kaddr(next_pfn);
+ else
+ priv->curr = NULL;
+
+ i = 0;
+ }
+ priv->idx = i;
+
+ return 0;
+}
+
+static unsigned long pkram_populate_regions_list(void)
+{
+ struct pkram_regions_priv priv = { .curr = pkram_regions_list };
+
+ pkram_find_preserved(0, PHYS_ADDR_MAX, &priv, add_region_cb);
+
+ /*
+ * Link the first node to the last populated one.
+ */
+ pkram_regions_list->prev_pfn = page_to_pfn(virt_to_page(priv.last));
+
+ return priv.nr_regions;
+}
--
1.9.4

2023-04-27 00:19:50

by Anthony Yznaga

Subject: [RFC v3 04/21] mm: PKRAM: implement folio stream operations

Implement pkram_save_folio() to populate a PKRAM object with in-memory
folios and pkram_load_folio() to load folios from a PKRAM object.
Saving a folio to PKRAM is accomplished by recording its pfn, order,
and mapping index and incrementing its refcount so that it will not
be freed after the last user puts it.
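
The pfn, order, and flags are packed into a single 64-bit entry; page
alignment of the physical address leaves the low 12 bits free for them
(see the PKRAM_ENTRY_* masks in the patch). A hypothetical example:

    /*
     * entry = page_to_phys(page) | order | (flags << 5)
     *
     * A 2MB (order-9) folio at physical address 0x40000000 is recorded
     * as 0x40000009: bits 0-4 hold the order, bits 5-11 the flags, and
     * the remaining high bits the physical address.
     */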

Originally-by: Vladimir Davydov <[email protected]>
Signed-off-by: Anthony Yznaga <[email protected]>
---
include/linux/pkram.h | 42 ++++++-
mm/pkram.c | 311 +++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 346 insertions(+), 7 deletions(-)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index 83718ad0e416..130ab5c2d94a 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -8,22 +8,47 @@

struct pkram_node;
struct pkram_obj;
+struct pkram_link;

/**
* enum pkram_data_flags - definition of data types contained in a pkram obj
* @PKRAM_DATA_none: No data types configured
+ * @PKRAM_DATA_folios: obj contains folio data
*/
enum pkram_data_flags {
- PKRAM_DATA_none = 0x0, /* No data types configured */
+ PKRAM_DATA_none = 0x0, /* No data types configured */
+ PKRAM_DATA_folios = 0x1, /* Contains folio data */
+};
+
+struct pkram_data_stream {
+ /* List of link pages to add/remove from */
+ __u64 *head_link_pfnp;
+ __u64 *tail_link_pfnp;
+
+ struct pkram_link *link; /* current link */
+ unsigned int entry_idx; /* next entry in link */
};

struct pkram_stream {
gfp_t gfp_mask;
struct pkram_node *node;
struct pkram_obj *obj;
+
+ __u64 *folios_head_link_pfnp;
+ __u64 *folios_tail_link_pfnp;
+};
+
+struct pkram_folios_access {
+ unsigned long next_index;
};

-struct pkram_access;
+struct pkram_access {
+ enum pkram_data_flags dtype;
+ struct pkram_stream *ps;
+ struct pkram_data_stream pds;
+
+ struct pkram_folios_access folios;
+};

#define PKRAM_NAME_MAX 256 /* including nul */

@@ -41,8 +66,19 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name,
void pkram_finish_load(struct pkram_stream *ps);
void pkram_finish_load_obj(struct pkram_stream *ps);

+#define PKRAM_PDS_INIT(name, stream, type) { \
+ .head_link_pfnp = (stream)->type##_head_link_pfnp, \
+ .tail_link_pfnp = (stream)->type##_tail_link_pfnp, \
+ }
+
+#define PKRAM_ACCESS_INIT(name, stream, type) { \
+ .dtype = PKRAM_DATA_##type, \
+ .ps = (stream), \
+ .pds = PKRAM_PDS_INIT(name, stream, type), \
+ }
+
#define PKRAM_ACCESS(name, stream, type) \
- struct pkram_access name
+ struct pkram_access name = PKRAM_ACCESS_INIT(name, stream, type)

void pkram_finish_access(struct pkram_access *pa, bool status_ok);

diff --git a/mm/pkram.c b/mm/pkram.c
index 6e3895cb9872..610ff7a88c98 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/err.h>
#include <linux/gfp.h>
+#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/mm.h>
@@ -10,8 +11,40 @@
#include <linux/string.h>
#include <linux/types.h>

+#include "internal.h"
+
+
+/*
+ * Represents a reference to a data page saved to PKRAM.
+ */
+typedef __u64 pkram_entry_t;
+
+#define PKRAM_ENTRY_FLAGS_SHIFT 0x5
+#define PKRAM_ENTRY_FLAGS_MASK 0x7f
+#define PKRAM_ENTRY_ORDER_MASK 0x1f
+
+/*
+ * Keeps references to folios saved to PKRAM.
+ * The structure occupies a memory page.
+ */
+struct pkram_link {
+ __u64 link_pfn; /* points to the next link of the object */
+ __u64 index; /* mapping index of first pkram_entry_t */
+
+ /*
+ * the array occupies the rest of the link page; if the link is not
+ * full, the rest of the array must be filled with zeros
+ */
+ pkram_entry_t entry[];
+};
+
+#define PKRAM_LINK_ENTRIES_MAX \
+ ((PAGE_SIZE-sizeof(struct pkram_link))/sizeof(pkram_entry_t))
+
struct pkram_obj {
- __u64 obj_pfn; /* points to the next object in the list */
+ __u64 folios_head_link_pfn; /* the first folios link of the object */
+ __u64 folios_tail_link_pfn; /* the last folios link of the object */
+ __u64 obj_pfn; /* points to the next object in the list */
};

/*
@@ -19,6 +52,10 @@ struct pkram_obj {
* independently of each other. The nodes are identified by unique name
* strings.
*
+ * References to folios saved to a preserved memory node are kept in a
+ * singly-linked list of PKRAM link structures (see above); the node has
+ * a pointer to the head of the list.
+ *
* The structure occupies a memory page.
*/
struct pkram_node {
@@ -68,6 +105,41 @@ static struct pkram_node *pkram_find_node(const char *name)
return NULL;
}

+static void pkram_truncate_link(struct pkram_link *link)
+{
+ struct page *page;
+ pkram_entry_t p;
+ int i;
+
+ for (i = 0; i < PKRAM_LINK_ENTRIES_MAX; i++) {
+ p = link->entry[i];
+ if (!p)
+ continue;
+ page = pfn_to_page(PHYS_PFN(p));
+ put_page(page);
+ }
+}
+
+static void pkram_truncate_links(unsigned long link_pfn)
+{
+ struct pkram_link *link;
+
+ while (link_pfn) {
+ link = pfn_to_kaddr(link_pfn);
+ pkram_truncate_link(link);
+ link_pfn = link->link_pfn;
+ pkram_free_page(link);
+ cond_resched();
+ }
+}
+
+static void pkram_truncate_obj(struct pkram_obj *obj)
+{
+ pkram_truncate_links(obj->folios_head_link_pfn);
+ obj->folios_head_link_pfn = 0;
+ obj->folios_tail_link_pfn = 0;
+}
+
static void pkram_truncate_node(struct pkram_node *node)
{
unsigned long obj_pfn;
@@ -76,6 +148,7 @@ static void pkram_truncate_node(struct pkram_node *node)
obj_pfn = node->obj_pfn;
while (obj_pfn) {
obj = pfn_to_kaddr(obj_pfn);
+ pkram_truncate_obj(obj);
obj_pfn = obj->obj_pfn;
pkram_free_page(obj);
cond_resched();
@@ -83,6 +156,84 @@ static void pkram_truncate_node(struct pkram_node *node)
node->obj_pfn = 0;
}

+static void pkram_add_link(struct pkram_link *link, struct pkram_data_stream *pds)
+{
+ __u64 link_pfn = page_to_pfn(virt_to_page(link));
+
+ if (!*pds->head_link_pfnp) {
+ *pds->head_link_pfnp = link_pfn;
+ *pds->tail_link_pfnp = link_pfn;
+ } else {
+ struct pkram_link *tail = pfn_to_kaddr(*pds->tail_link_pfnp);
+
+ tail->link_pfn = link_pfn;
+ *pds->tail_link_pfnp = link_pfn;
+ }
+}
+
+static struct pkram_link *pkram_remove_link(struct pkram_data_stream *pds)
+{
+ struct pkram_link *link;
+
+ if (!*pds->head_link_pfnp)
+ return NULL;
+
+ link = pfn_to_kaddr(*pds->head_link_pfnp);
+ *pds->head_link_pfnp = link->link_pfn;
+ if (!*pds->head_link_pfnp)
+ *pds->tail_link_pfnp = 0;
+ else
+ link->link_pfn = 0;
+
+ return link;
+}
+
+static struct pkram_link *pkram_new_link(struct pkram_data_stream *pds, gfp_t gfp_mask)
+{
+ struct pkram_link *link;
+ struct page *link_page;
+
+ link_page = pkram_alloc_page((gfp_mask & GFP_RECLAIM_MASK) |
+ __GFP_ZERO);
+ if (!link_page)
+ return NULL;
+
+ link = page_address(link_page);
+ pkram_add_link(link, pds);
+ pds->link = link;
+ pds->entry_idx = 0;
+
+ return link;
+}
+
+static void pkram_add_link_entry(struct pkram_data_stream *pds, struct page *page)
+{
+ struct pkram_link *link = pds->link;
+ pkram_entry_t p;
+ short flags = 0;
+
+ p = page_to_phys(page);
+ p |= compound_order(page);
+ p |= ((flags & PKRAM_ENTRY_FLAGS_MASK) << PKRAM_ENTRY_FLAGS_SHIFT);
+ link->entry[pds->entry_idx] = p;
+ pds->entry_idx++;
+}
+
+static int pkram_next_link(struct pkram_data_stream *pds, struct pkram_link **linkp)
+{
+ struct pkram_link *link;
+
+ link = pkram_remove_link(pds);
+ if (!link)
+ return -ENODATA;
+
+ pds->link = link;
+ pds->entry_idx = 0;
+ *linkp = link;
+
+ return 0;
+}
+
static void pkram_stream_init(struct pkram_stream *ps,
struct pkram_node *node, gfp_t gfp_mask)
{
@@ -159,6 +310,9 @@ int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)

BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);

+ if (flags & ~PKRAM_DATA_folios)
+ return -EINVAL;
+
page = pkram_alloc_page(ps->gfp_mask | __GFP_ZERO);
if (!page)
return -ENOMEM;
@@ -168,6 +322,10 @@ int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
obj->obj_pfn = node->obj_pfn;
node->obj_pfn = page_to_pfn(page);

+ if (flags & PKRAM_DATA_folios) {
+ ps->folios_head_link_pfnp = &obj->folios_head_link_pfn;
+ ps->folios_tail_link_pfnp = &obj->folios_tail_link_pfn;
+ }
ps->obj = obj;
return 0;
}
@@ -274,8 +432,17 @@ int pkram_prepare_load_obj(struct pkram_stream *ps)
return -ENODATA;

obj = pfn_to_kaddr(node->obj_pfn);
+ if (!obj->folios_head_link_pfn) {
+ WARN_ON(1);
+ return -EINVAL;
+ }
+
node->obj_pfn = obj->obj_pfn;

+ if (obj->folios_head_link_pfn) {
+ ps->folios_head_link_pfnp = &obj->folios_head_link_pfn;
+ ps->folios_tail_link_pfnp = &obj->folios_tail_link_pfn;
+ }
ps->obj = obj;
return 0;
}
@@ -292,6 +459,7 @@ void pkram_finish_load_obj(struct pkram_stream *ps)

BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);

+ pkram_truncate_obj(obj);
pkram_free_page(obj);
}

@@ -317,7 +485,41 @@ void pkram_finish_load(struct pkram_stream *ps)
*/
void pkram_finish_access(struct pkram_access *pa, bool status_ok)
{
- WARN_ON_ONCE(1);
+ if (status_ok)
+ return;
+
+ if (pa->ps->node->flags == PKRAM_SAVE)
+ return;
+
+ if (pa->pds.link)
+ pkram_truncate_link(pa->pds.link);
+}
+
+/*
+ * Add a page to a PKRAM obj allocating a new PKRAM link if necessary.
+ */
+static int __pkram_save_page(struct pkram_access *pa, struct page *page,
+ unsigned long index)
+{
+ struct pkram_data_stream *pds = &pa->pds;
+ struct pkram_link *link = pds->link;
+
+ if (!link || pds->entry_idx >= PKRAM_LINK_ENTRIES_MAX ||
+ index != pa->folios.next_index) {
+ link = pkram_new_link(pds, pa->ps->gfp_mask);
+ if (!link)
+ return -ENOMEM;
+
+ pa->folios.next_index = link->index = index;
+ }
+
+ get_page(page);
+
+ pkram_add_link_entry(pds, page);
+
+ pa->folios.next_index += compound_nr(page);
+
+ return 0;
}

/**
@@ -327,10 +529,102 @@ void pkram_finish_access(struct pkram_access *pa, bool status_ok)
* with PKRAM_ACCESS().
*
* Returns 0 on success, -errno on failure.
+ *
+ * Error values:
+ * %ENOMEM: insufficient amount of memory available
+ *
+ * Saving a folio to preserved memory is simply incrementing its refcount so
+ * that it will not get freed after the last user puts it. That means it is
+ * safe to use the folio as usual after it has been saved.
*/
int pkram_save_folio(struct pkram_access *pa, struct folio *folio)
{
- return -EINVAL;
+ struct pkram_node *node = pa->ps->node;
+ struct page *page = folio_page(folio, 0);
+
+ BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
+
+ return __pkram_save_page(pa, page, page->index);
+}
+
+static struct page *__pkram_prep_load_page(pkram_entry_t p)
+{
+ struct page *page;
+ int order;
+ short flags;
+
+ flags = (p >> PKRAM_ENTRY_FLAGS_SHIFT) & PKRAM_ENTRY_FLAGS_MASK;
+ order = p & PKRAM_ENTRY_ORDER_MASK;
+ if (order >= MAX_ORDER)
+ goto out_error;
+
+ page = pfn_to_page(PHYS_PFN(p));
+
+ if (!page_ref_freeze(page, 1)) {
+ pr_err("PKRAM preserved page has unexpected inflated ref count\n");
+ goto out_error;
+ }
+
+ if (order) {
+ prep_compound_page(page, order);
+ if (order > 1)
+ prep_transhuge_page(page);
+ }
+
+ page_ref_unfreeze(page, 1);
+
+ return page;
+
+out_error:
+ return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Extract the next page from preserved memory freeing a PKRAM link if it
+ * becomes empty.
+ */
+static struct page *__pkram_load_page(struct pkram_access *pa, unsigned long *index)
+{
+ struct pkram_data_stream *pds = &pa->pds;
+ struct pkram_link *link = pds->link;
+ struct page *page;
+ pkram_entry_t p;
+ int ret;
+
+ if (!link) {
+ ret = pkram_next_link(pds, &link);
+ if (ret)
+ return NULL;
+
+ if (index)
+ pa->folios.next_index = link->index;
+ }
+
+ BUG_ON(pds->entry_idx >= PKRAM_LINK_ENTRIES_MAX);
+
+ p = link->entry[pds->entry_idx];
+ BUG_ON(!p);
+
+ page = __pkram_prep_load_page(p);
+ if (IS_ERR(page))
+ return page;
+
+ if (index) {
+ *index = pa->folios.next_index;
+ pa->folios.next_index += compound_nr(page);
+ }
+
+ /* clear to avoid double free (see pkram_truncate_link()) */
+ link->entry[pds->entry_idx] = 0;
+
+ pds->entry_idx++;
+ if (pds->entry_idx >= PKRAM_LINK_ENTRIES_MAX ||
+ !link->entry[pds->entry_idx]) {
+ pds->link = NULL;
+ pkram_free_page(link);
+ }
+
+ return page;
}

/**
@@ -348,7 +642,16 @@ int pkram_save_folio(struct pkram_access *pa, struct folio *folio)
*/
struct folio *pkram_load_folio(struct pkram_access *pa, unsigned long *index)
{
- return NULL;
+ struct pkram_node *node = pa->ps->node;
+ struct page *page;
+
+ BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
+
+ page = __pkram_load_page(pa, index);
+ if (IS_ERR_OR_NULL(page))
+ return (struct folio *)page;
+ else
+ return page_folio(page);
}

/**
--
1.9.4
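
For a sense of how the pieces above fit together, a save-side caller
might look roughly like the sketch below. Only pkram_prepare_save_obj(),
PKRAM_ACCESS(), pkram_save_folio() and pkram_finish_access() appear in
this patch; the open/close helpers (pkram_prepare_save(),
pkram_finish_save_obj(), pkram_finish_save(), pkram_discard_save()), the
exact PKRAM_ACCESS() arguments and the node name are assumed from the
rest of the series:

	/* Sketch: preserve a single folio under a named PKRAM node. */
	static int example_save_folio(struct folio *folio, gfp_t gfp_mask)
	{
		struct pkram_stream ps;
		int err;

		/* Open a save stream for a node (name is made up here). */
		err = pkram_prepare_save(&ps, "example", gfp_mask);
		if (err)
			return err;

		err = pkram_prepare_save_obj(&ps, PKRAM_DATA_folios);
		if (!err) {
			/* Declares a pkram_access for the folios stream. */
			PKRAM_ACCESS(pa, &ps, folios);

			/* Takes a reference; the folio stays usable after. */
			err = pkram_save_folio(&pa, folio);
			pkram_finish_access(&pa, err == 0);
			if (!err)
				pkram_finish_save_obj(&ps);
		}

		if (err)
			pkram_discard_save(&ps);
		else
			pkram_finish_save(&ps);
		return err;
	}

The load side would mirror this: pkram_prepare_load() (assumed),
pkram_prepare_load_obj(), then pkram_load_folio() in a loop until it
returns NULL, followed by pkram_finish_load_obj() and
pkram_finish_load().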

2023-05-26 14:32:45

by James Gowans

[permalink] [raw]
Subject: Re: [RFC v3 00/21] Preserved-over-Kexec RAM

On Wed, 2023-04-26 at 17:08 -0700, Anthony Yznaga wrote:
> Sending out this RFC in part to gauge community interest.
> This patchset implements preserved-over-kexec memory storage or PKRAM as a
> method for saving memory pages of the currently executing kernel so that
> they may be restored after kexec into a new kernel. The patches are adapted
> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
> introduce the PKRAM kernel API.
>
> One use case for PKRAM is preserving guest memory and/or auxiliary
> supporting data (e.g. iommu data) across kexec to support reboot of the
> host with minimal disruption to the guest.

Hi Anthony,

Thanks for re-posting this - I've been wanting to rekindle the discussion
on preserving memory across kexec for a while now.

There are a few aspects at play in this space of memory management
designed specifically for the virtualisation and live update (kexec) use-
case which I think we should consider:

1. Preserving userspace-accessible memory across kexec: this is what pkram
addresses.

2. Preserving kernel state: This would include memory required for kexec
with DMA passthrough devices, like IOMMU root page and page tables, DMA-
able buffers for drivers, etc. Also certain structures for improved kernel
boot performance after kexec, like a PCI device cache, clock LPJ and
possibly others, sort of what Xen breadcrumbs [0] achieves. The pkram RFC
indicates that this should be possible, though IMO this could be more
straightforward to do with a new filesystem with first-class support for
kernel persistence via something like inode types for kernel data.

3. Ensuring huge/gigantic memory allocations: to improve the TLB perf of
2-stage translations it's beneficial to allocate guest memory in large
contiguous blocks, preferably PUD-level blocks for multi-GiB guests. If
the buddy allocator is used this may be a challenge both from an
implementation and a fragmentation perspective, and it may be desirable to
have stronger guarantees about allocation sizes.

4. Removing struct page overhead: When doing the huge/gigantic
allocations, in general it won't be necessary to have per-4 KiB struct
pages. This is something dmemfs [1, 2] tries to achieve by using a
large chunk of reserved memory and managing it with a new filesystem
(see the arithmetic after this list).

5. More "advanced" memory management APIs/ioctls for virtualisation: Being
able to support things like DMA-driven post-copy live migration, memory
oversubscription, carving out chunks of memory from a VM to launch side-
car VMs, more fine-grain control of IOMMU or MMU permissions, etc. This
may be easier to achieve with a new filesystem, rather than coupling to
tmpfs semantics and ioctls.
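
On the scale of point 4: assuming the typical 64-byte struct page on
x86_64, the metadata overhead is 64 / 4096, i.e. about 1.6% of managed
memory - roughly 16 GiB of struct pages for a 1 TiB host - which is the
cost a reserved-memory approach like dmemfs avoids paying.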

Overall, with the above in mind, my take is that we may have a smoother
path to implement a more comprehensive solution by going the route of a
new purpose-built file system on top of reserved memory. Sort of like
dmemfs with persistence and specifically support for kernel persistence.

Does my take here make sense?

I'm hoping to put together an RFC for something like the above (dmemfs
with persistence) soon, focusing on how the IOMMU persistence will work.
This is an important differentiating factor to cover in the RFC, IMO.

> PKRAM provides a flexible way
> for doing this without requiring that the amount of memory used be a fixed
> size created a priori.

AFAICT the main down-side of what I'm suggesting here compared to pkram,
is that as you say here: pkram doesn't require the up-front reserving of
memory - allocations from the global shared pool are dynamic. I'm on the
fence as to whether this is actually a desirable property though. Carving
out a large chunk of system memory as reserved memory for a persisted
filesystem (as I'm suggesting) has the advantages of removing struct page
overhead, providing better guarantees about huge/gigantic page
allocations, and probably makes the kexec restore path simpler and more
self-contained.

I think there's an argument to be made that having a clearly-defined large
range of memory which is persisted, and the rest is normal "ephemeral"
kernel memory may be preferable.

Keen to hear your (and others) thoughts!

JG

[0] http://david.woodhou.se/live-update-handover.pdf
[1] https://lwn.net/Articles/839216/
[2] https://lkml.org/lkml/2020/12/7/342

2023-05-31 23:42:41

by Anthony Yznaga

[permalink] [raw]
Subject: Re: [RFC v3 00/21] Preserved-over-Kexec RAM


On 5/26/23 6:57 AM, Gowans, James wrote:
> On Wed, 2023-04-26 at 17:08 -0700, Anthony Yznaga wrote:
>> Sending out this RFC in part to gauge community interest.
>> This patchset implements preserved-over-kexec memory storage or PKRAM as a
>> method for saving memory pages of the currently executing kernel so that
>> they may be restored after kexec into a new kernel. The patches are adapted
>> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
>> introduce the PKRAM kernel API.
>>
>> One use case for PKRAM is preserving guest memory and/or auxiliary
>> supporting data (e.g. iommu data) across kexec to support reboot of the
>> host with minimal disruption to the guest.
> Hi Anthony,

Hi James,

Thank you for looking at this.

>
> Thanks for re-posting this - I've been wanting to rekindle the discussion
> on preserving memory across kexec for a while now.
>
> There are a few aspects at play in this space of memory management
> designed specifically for the virtualisation and live update (kexec) use-
> case which I think we should consider:
>
> 1. Preserving userspace-accessible memory across kexec: this is what pkram
> addresses.
>
> 2. Preserving kernel state: This would include memory required for kexec
> with DMA passthrough devices, like IOMMU root page and page tables, DMA-
> able buffers for drivers, etc. Also certain structures for improved kernel
> boot performance after kexec, like a PCI device cache, clock LPJ and
> possibly others, sort of what Xen breadcrumbs [0] achieves. The pkram RFC
> indicates that this should be possible, though IMO this could be more
> straightforward to do with a new filesystem with first-class support for
> kernel persistence via something like inode types for kernel data.

PKRAM as it is now can preserve kernel data by streaming bytes to a
PKRAM object, but the data must be location-independent since it is
stored in newly allocated 4k pages rather than preserved in place.
This really isn't usable for things like page tables or memory that is
expected not to move because of DMA, etc.
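
For illustration only, the byte-stream call pattern would look roughly
like the sketch below; PKRAM_DATA_bytes, the "bytes" access type and
pkram_write() are assumed from the byte-stream patch in the series, and
the stream setup/teardown is elided:

	/*
	 * Sketch (names assumed as noted above): stream a
	 * location-independent blob into an already-open save stream ps.
	 */
	err = pkram_prepare_save_obj(&ps, PKRAM_DATA_bytes);
	if (!err) {
		PKRAM_ACCESS(pa, &ps, bytes);
		ssize_t ret;

		ret = pkram_write(&pa, buf, len);
		if (ret != len)	/* short writes not handled in this sketch */
			err = ret < 0 ? ret : -ENOSPC;
		pkram_finish_access(&pa, err == 0);
	}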

One issue with preserving non-relocatable, regular memory that is not
partitioned from the kernel is the risk that a kexec kernel has already
been loaded and that its pre-computed destination, where it will be
copied to on reboot, will overwrite the preserved memory. Either some
way of re-processing the kexec kernel to load somewhere else would be
needed, or kexec load would need to be restricted from loading where
memory might be preserved. Plusses for a partitioning approach.


>
> 3. Ensuring huge/gigantic memory allocations: to improve the TLB perf of
> 2-stage translations it's beneficial to allocate guest memory in large
> contiguous blocks, preferably PUD-level blocks for multi-GiB guests. If
> the buddy allocator is used this may be a challenge both from an
> implementation and a fragmentation perspective, and it may be desirable to
> have stronger guarantees about allocation sizes.
Agreed that guaranteeing large blocks and avoiding fragmentation are
issues for PKRAM. One possible avenue to address this could be to
support preserving hugetlb pages.


>
> 4. Removing struct page overhead: When doing the huge/gigantic
> allocations, in general it won't be necessary to have per-4 KiB struct
> pages. This is something dmemfs [1, 2] tries to achieve by using a
> large chunk of reserved memory and managing it with a new filesystem.
Has using DAX been considered? I'm not familiar with dmemfs, but it
sounds functionally similar.


>
> 5. More "advanced" memory management APIs/ioctls for virtualisation: Being
> able to support things like DMA-driven post-copy live migration, memory
> oversubscription, carving out chunks of memory from a VM to launch side-
> car VMs, more fine-grain control of IOMMU or MMU permissions, etc. This
> may be easier to achieve with a new filesystem, rather than coupling to
> tmpfs semantics and ioctls.
>
> Overall, with the above in mind, my take is that we may have a smoother
> path to implement a more comprehensive solution by going the route of a
> new purpose-built file system on top of reserved memory. Sort of like
> dmemfs with persistence and specifically support for kernel persistence.
>
> Does my take here make sense?
Yes, I believe so. There are some serious issues with PKRAM to address
before it could be truly viable (fragmentation, relocation, etc.), so
a memory partitioning approach might be the way to go.


>
> I'm hoping to put together an RFC for something like the above (dmemfs
> with persistence) soon, focusing on how the IOMMU persistence will work.
> This is an important differentiating factor to cover in the RFC, IMO.

Great! I'll keep an eye out for it.


Anthony


>
>> PKRAM provides a flexible way
>> for doing this without requiring that the amount of memory used be a fixed
>> size created a priori.
> AFAICT the main down-side of what I'm suggesting here compared to pkram,
> is that as you say here: pkram doesn't require the up-front reserving of
> memory - allocations from the global shared pool are dynamic. I'm on the
> fence as to whether this is actually a desirable property though. Carving
> out a large chunk of system memory as reserved memory for a persisted
> filesystem (as I'm suggesting) has the advantages of removing struct page
> overhead, providing better guarantees about huge/gigantic page
> allocations, and probably makes the kexec restore path simpler and more
> self-contained.
>
> I think there's an argument to be made that having a clearly-defined large
> range of memory which is persisted, and the rest is normal "ephemeral"
> kernel memory may be preferable.
>
> Keen to hear your (and others) thoughts!
>
> JG
>
> [0] http://david.woodhou.se/live-update-handover.pdf
> [1] https://lwn.net/Articles/839216/
> [2] https://lkml.org/lkml/2020/12/7/342

2023-06-01 02:40:35

by Baoquan He

[permalink] [raw]
Subject: Re: [RFC v3 00/21] Preserved-over-Kexec RAM

On 04/26/23 at 05:08pm, Anthony Yznaga wrote:
> Sending out this RFC in part to gauge community interest.
> This patchset implements preserved-over-kexec memory storage or PKRAM as a
> method for saving memory pages of the currently executing kernel so that
> they may be restored after kexec into a new kernel. The patches are adapted
> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
> introduce the PKRAM kernel API.
>
> One use case for PKRAM is preserving guest memory and/or auxiliary
> supporting data (e.g. iommu data) across kexec to support reboot of the
> host with minimal disruption to the guest. PKRAM provides a flexible way
> for doing this without requiring that the amount of memory used be a fixed
> size created a priori. Another use case is for databases to preserve their
> block caches in shared memory across reboot.

If so, I have some confusion; could someone help clarify:
1) Why was kexec reboot introduced, and what do we expect a kexec
reboot to do?

2) If we need to keep this data and that data, can we simply not
reboot? Then it all definitely stays in place w/o any concern.

3) What if systems for AI, edge computing, HPC, etc., enlightened by
this patch, also want to carry various kinds of userspace or kernel
data, system status, registers, etc. when a kexec reboot is needed?

Thanks
Baoquan


2023-06-02 00:17:58

by Anthony Yznaga

[permalink] [raw]
Subject: Re: [RFC v3 00/21] Preserved-over-Kexec RAM


On 5/31/23 7:15 PM, Baoquan He wrote:
> On 04/26/23 at 05:08pm, Anthony Yznaga wrote:
>> Sending out this RFC in part to gauge community interest.
>> This patchset implements preserved-over-kexec memory storage or PKRAM as a
>> method for saving memory pages of the currently executing kernel so that
>> they may be restored after kexec into a new kernel. The patches are adapted
>> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
>> introduce the PKRAM kernel API.
>>
>> One use case for PKRAM is preserving guest memory and/or auxiliary
>> supporting data (e.g. iommu data) across kexec to support reboot of the
>> host with minimal disruption to the guest. PKRAM provides a flexible way
>> for doing this without requiring that the amount of memory used be a fixed
>> size created a priori. Another use case is for databases to preserve their
>> block caches in shared memory across reboot.
> If so, I have some confusion; could someone help clarify:
> 1) Why was kexec reboot introduced, and what do we expect a kexec
> reboot to do?
>
> 2) If we need to keep this data and that data, can we simply not
> reboot? Then it all definitely stays in place w/o any concern.
>
> 3) What if systems for AI, edge computing, HPC, etc., enlightened by
> this patch, also want to carry various kinds of userspace or kernel
> data, system status, registers, etc. when a kexec reboot is needed?

Hi Baoquan,

Avoiding a more significant disruption from having to halt or migrate
VMs, failover services, etc. when a reboot is necessary to pick up
security fixes is one motivation for exploring preserving memory
across the reboot.


Anthony

>
> Thanks
> Baoquan
>