2023-10-16 23:33:08

by Madhavan T. Venkataraman

Subject: [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem)

From: "Madhavan T. Venkataraman" <[email protected]>

Introduction
============

This feature can be used to persist kernel and user data in RAM across kexec
reboots, for various uses. E.g., persisting:

- cached data. E.g., database caches.
- state. E.g., KVM guest states.
- historical information since the last cold boot. E.g., events, logs
and journals.
- measurements for integrity checks on the next boot.
- driver data.
- IOMMU mappings.
- MMIO config information.

This is useful on systems where there is no non-volatile storage or
non-volatile storage is too small or too slow.

The following sections describe the implementation.

I have enhanced the ram disk block device driver to provide persistent ram
disks on which any filesystem can be created. This is for persisting user data.
I have also implemented DAX support for the persistent ram disks.

I am also working on making ZRAM persistent.

I have also briefly discussed the following use cases:

- Persisting IOMMU mappings
- Remembering DMA pages
- Reserving pages that encounter memory errors
- Remembering IMA measurements for integrity checks
- Remembering MMIO config info
- Implementing prmemfs (special filesystem tailored for persistence)

Allocate metadata
=================

Define a metadata structure to store all persistent memory related information.
The metadata fits into one page. On a cold boot, allocate and initialize the
metadata page.

Allocate data
=============

On a cold boot, allocate some memory for storing persistent data. Call it
persistent memory. Specify the size in a command line parameter:

prmem=size[KMG][,max_size[KMG]]

size Initial amount of memory allocated to prmem during boot
max_size Maximum amount of memory that can be allocated to prmem

When the initial memory is exhausted via allocations, expand prmem dynamically
up to max_size. Expansion is done by allocating from the buddy allocator.
Record all allocations in the metadata.

Remember the metadata
=====================

On all (kexec) reboots, remember the metadata page address. This is done via
a new kernel command line parameter:

prmem_meta=address

When a kexec image is loaded, the kexec command line is set up. Append the
above parameter to the command line automatically.

In early boot, extract the metadata page address from the command line and
reserve the metadata page. From the metadata, get the persistent memory that
has been allocated before and reserve it as well.

Manage persistent memory
========================

Manage persistent memory with the Gen Pool allocator (lib/genalloc.c). This
is so we don't have to implement a new allocator. Make the Gen Pool
persistent so allocations can be remembered across kexecs.

Provide functions for allocating and freeing persistent memory. These are
just wrappers around the Gen Pool functions:

prmem_alloc_pages() (looks like alloc_pages())
prmem_free_pages() (looks like __free_pages())
prmem_alloc() (looks like kmalloc())
prmem_free() (looks like kfree())

Create persistent instances
===========================

Consumers store information in the form of data structures. To persist a data
structure across a kexec, a consumer has to do two things:

1. Allocate persistent memory for the data structure.

2. Remember the data structure in a named persistent instance.

A persistent instance has the following attributes:

Subsystem name Name of the subsystem/module/driver that created the
instance. E.g., "ramdisk" for the ramdisk driver.
Instance name Name of the instance within the subsystem. E.g.,
"pram0" for a persistent ram disk.
Data Pointer to instance data.
Size Size of instance data.

Provide functions to create and manage persistent instances:

prmem_get() Get/Create a persistent instance.
prmem_set_data() Record the instance data pointer and size.
prmem_get_data() Retrieve the instance data pointer and size.
prmem_put() Destroy a persistent instance.
prmem_list() Enumerate the instances of a subsystem.

Complex data structures
=======================

A persistent instance may have more than one data structure to remember across
kexec.

Data structures can be connected to other data structures using pointers,
arrays, linked lists, RB trees, etc. As long as each structure is placed in
persistent memory, the whole set of data structures can be remembered
across a kexec.

It is expected that a consumer will create a top level data structure for
an instance from which all other data structures belonging to the instance
can be reached. So, only the top level data structure needs to be registered
as instance data.

Linked list nodes and RB nodes are embedded in data structures. So, persisting
linked lists and RB trees is straightforward. But the XArray needs a little
more work. The XArray itself can be embedded in a persistent data structure.
But the XA nodes are currently allocated from normal memory using the kmem
allocator. Enhance XArrays to include a persistent option so that the XA nodes
as well can be allocated from persistent memory. Then, the whole XArray becomes
persistent.

Since Radix Trees are implemented with XArrays, we get persistent Radix
Trees as well.

The ram disk uses an XArray. Some other use cases can also use an XArray.

Persistent virtual addresses
============================

Apart from consumer data structures, Prmem metadata structures must be
persisted as well. In either case, data structures point to one another
using virtual addresses.

To keep the implementation simple, the virtual addresses used within persistent
memory must not change on a kexec. The alternative is to remap everything on
each kexec. This can be complex and cumbersome.

prmem uses direct map addresses for this reason. However, if PAGE_OFFSET is
randomized by KASLR, this will not work. Until I find an alternative for this,
prmem is currently not supported if kernel memory randomization is enabled.
prmem checks for this at runtime and disables itself. So, for now, include
"nokaslr" in the command line to allow prmem.

Note that kernel text randomization does not affect prmem. So, if an
architecture does not support randomization of PAGE_OFFSET, then there is
no need to include "nokaslr" in the command line.

Validation of metadata
======================

The metadata must be validated on a kexec before it can be used. To allow this,
compute a checksum on the metadata just before the kexec reboot and store it in
the metadata.

After kexec, in early boot, use the checksum to validate the metadata. If the
validation fails, discard the metadata. Treat it as a cold boot. That is,
allocate a new metadata page and initial region and start over.

This means that all persistent data will be lost on a validation failure.

Dynamic Expansion
=================

For some use cases, it may be hard to predict how much actual memory is
needed to store persistent data. This may depend on the workload. Either
we would have to overcommit memory for persistent data, or we could
allow dynamic expansion of prmem memory.

Implement dynamic expansion of prmem. When there is no free persistent memory,
call alloc_pages(MAX_ORDER) to allocate a max order page. Add it to prmem.

Choosing a max order page means that no fragmentation is created for
transparent huge pages or kmem slabs. But fragmentation may be created for
1GB pages. This is not a problem for 1GB pages that are reserved up front
during boot. This could be a problem for 1GB pages that are allocated at run
time dynamically.

As mentioned before, dynamic expansion is optional. If a max_size is not
specified in the command line, then dynamic expansion does not happen.

Persistent Ramdisks
===================

I have implemented one main use case in this patchset - persistent ram disks.
Any filesystem can be installed on a persistent ram disk. User data can be
persisted on the filesystem.

One problem with using a ramdisk is that the page cache will contain redundant
copies of ramdisk pages. To avoid this, I have implemented DAX support for
persistent ramdisks. This can be enabled by creating a filesystem with DAX
support on the ram disks.

Normal ramdisk devices are named "ram0", "ram1", "ram2", etc. Persistent
ramdisk devices will be named "pram0", "pram1", "pram2", etc.

For normal ramdisks, ramdisk pages are allocated using alloc_pages(). For
persistent ones, ramdisk pages are allocated using prmem_alloc_pages().

Each ram disk has a device structure (struct brd_device). This is allocated
from kmem for normal ram disks and from persistent memory for persistent ram
disks. This becomes the instance data. This structure contains an XArray
of pages allocated to the ram disk. A persistent XArray will be used.

The disk size for all normal ramdisks is specified via a module parameter
"rd_size". This forces all of the ramdisks to have the same size. For
persistent ram disks, take a different approach. Define a module parameter
called "prd_sizes" which specifies a comma-separated list of sizes. The
sizes are applied in the order in which they appear to "pram0", "pram1",
etc.

Persistent Ram Disk Usage:

sudo modprobe brd prd_sizes="1G,2G"

This creates two persistent ram disks with the specified sizes.
That is, /dev/pram0 will have a size of 1G. /dev/pram1 will
have a size of 2G.

sudo mkfs.ext4 /dev/pram0
sudo mkfs.ext4 /dev/pram1

Make filesystems on the persistent ram disks.

sudo mount -t ext4 /dev/pram0 /path/to/mountpoint0
sudo mount -t ext4 -o dax /dev/pram1 /path/to/mountpoint1

Mount them somewhere. Note that the -o dax option can be used
to enable DAX.

sudo umount /path/to/mountpoint0
sudo umount /path/to/mountpoint1

Unmount the filesystems.

On subsequent kexecs, you can load the module with or without specifying the
sizes. The previous devices and sizes will be remembered. After that, simply
mount the filesystems and use them.

sudo modprobe brd
sudo mount -t ext4 /dev/pram0 /path/to/mountpoint0
sudo mount -t ext4 -o dax /dev/pram1 /path/to/mountpoint1

The persistent ramdisk devices are destroyed when the module is explicitly
unloaded (rmmod). But if a reboot happens without the module unload, the
devices are persisted.

Other use cases
===============

I believe that it is possible to implement most use cases. I have listed some
examples below. I am not an expert in these areas. These are just suggestions.
Please let me know if there are any mistakes. Comments are most welcome.

- IOMMU mappings
The IOVA to PFN mappings can be remembered using a persistent XArray.

- DMA pages
Someone mentioned this use case. IIUC, DMA operations may be in flight
when a kexec happens. Instead of waiting for the DMA operations to
complete, drivers could remember the DMA pages in a persistent XArray.
Then, in early boot, retrieve the XArray from prmem and reserve those
individual pages early. Once the DMA operations complete, the pages can
be removed from the XArray and freed into the buddy allocator.

- Pages that encounter memory errors
These could be remembered in a persistent XArray. Then, in early boot,
retrieve the XArray from prmem and reserve the pages so they are never
used.

- IMA
IMA tries to remember measurements across a kexec so that integrity
checks can be performed on a kexec reboot. Currently, IIUC, IMA
uses a kexec buffer to remember measurements. However, the buffer
has to be allocated up front when the kexec image is loaded. If the gap
between loading a kexec image and executing it is large, the
measurements that come in during that time may not fit into the
pre-allocated buffer.

The solution could be to remember measurements using prmem. I am
working on this. I will add this in a future version of this patchset.

- ZRAM
The ZRAM block device is a candidate for persistence. This is still
work in progress. I will add this in a future version of this patchset
once I get it working.

- MMIO
I am not familiar with what exactly needs to be persisted for this.
I will state my understanding of the use case. Please correct me if
I am wrong. IIUC, during PCI discovery, I/O devices are enumerated,
memory space allocation is done and the I/O devices are configured.
If the enumerated devices and their configuration can be remembered
across kexec, then the discovery phase can be skipped after kexec.
This will speed up PCI init.

I believe the MMIO config info can be persisted using prmem.

- prmemfs
It may be simpler and more efficient if we could implement a special
filesystem that is tailored for persistence. We don't have to support
anything that is not required for persistent data. E.g., FIFOs,
special files, hard links, using the page cache, etc. When files are
deleted, the memory can be freed back into prmem.

The instance data for the filesystem would be the superblock. The
following need to be allocated from persistent memory - the superblock,
the inodes and the data pages. The data pages can be remembered in a
persistent XArray.

I am looking into this as well.

TBD
===

- Reservations.
Consumers must be able to reserve persistent memory to guarantee
sizes for their instances. E.g., for a persistent ramdisk.

- NUMA support.

- Memory Leak detection.
Something similar to kmemleak may need to be implemented to detect
memory leaks in persistent memory.

---

Madhavan T. Venkataraman (10):
mm/prmem: Allocate memory during boot for storing persistent data
mm/prmem: Reserve metadata and persistent regions in early boot after
kexec
mm/prmem: Manage persistent memory with the gen pool allocator.
mm/prmem: Implement a page allocator for persistent memory
mm/prmem: Implement a buffer allocator for persistent memory
mm/prmem: Implement persistent XArray (and Radix Tree)
mm/prmem: Implement named Persistent Instances.
mm/prmem: Implement Persistent Ramdisk instances.
mm/prmem: Implement DAX support for Persistent Ramdisks.
mm/prmem: Implement dynamic expansion of prmem.

arch/x86/kernel/kexec-bzimage64.c | 5 +-
arch/x86/kernel/setup.c | 4 +
drivers/block/Kconfig | 11 +
drivers/block/brd.c | 320 ++++++++++++++++++++++++++++--
include/linux/genalloc.h | 6 +
include/linux/memblock.h | 2 +
include/linux/prmem.h | 158 +++++++++++++++
include/linux/radix-tree.h | 4 +
include/linux/xarray.h | 15 ++
kernel/Makefile | 1 +
kernel/prmem/Makefile | 4 +
kernel/prmem/prmem_allocator.c | 222 +++++++++++++++++++++
kernel/prmem/prmem_init.c | 48 +++++
kernel/prmem/prmem_instance.c | 139 +++++++++++++
kernel/prmem/prmem_misc.c | 86 ++++++++
kernel/prmem/prmem_parse.c | 80 ++++++++
kernel/prmem/prmem_region.c | 87 ++++++++
kernel/prmem/prmem_reserve.c | 125 ++++++++++++
kernel/reboot.c | 2 +
lib/genalloc.c | 45 +++--
lib/radix-tree.c | 49 ++++-
lib/xarray.c | 11 +-
mm/memblock.c | 12 ++
mm/mm_init.c | 2 +
24 files changed, 1400 insertions(+), 38 deletions(-)
create mode 100644 include/linux/prmem.h
create mode 100644 kernel/prmem/Makefile
create mode 100644 kernel/prmem/prmem_allocator.c
create mode 100644 kernel/prmem/prmem_init.c
create mode 100644 kernel/prmem/prmem_instance.c
create mode 100644 kernel/prmem/prmem_misc.c
create mode 100644 kernel/prmem/prmem_parse.c
create mode 100644 kernel/prmem/prmem_region.c
create mode 100644 kernel/prmem/prmem_reserve.c


base-commit: 2dde18cd1d8fac735875f2e4987f11817cc0bc2c
--
2.25.1


2023-10-16 23:33:15

by Madhavan T. Venkataraman

Subject: [RFC PATCH v1 08/10] mm/prmem: Implement Persistent Ramdisk instances.

From: "Madhavan T. Venkataraman" <[email protected]>

Using the prmem APIs, any kernel subsystem can persist its data. For
persisting user data, we need a filesystem.

Implement persistent ramdisk block device instances so that any filesystem
can be created on it.

Normal ramdisk devices are named "ram0", "ram1", "ram2", etc. Persistent
ramdisk devices will be named "pram0", "pram1", "pram2", etc.

For normal ramdisks, ramdisk pages are allocated using alloc_pages(). For
persistent ones, ramdisk pages are allocated using prmem_alloc_pages().

Each ram disk has a device structure - struct brd_device. For persistent
ram disks, allocate this from persistent memory and record it as the
instance data of the ram disk instance. The structure contains an XArray
of pages allocated to the ram disk. Make it a persistent XArray.

The disk size for all normal ramdisks is specified via a module parameter
"rd_size". This forces all of the ramdisks to have the same size.

For persistent ram disks, take a different approach. Define a module
parameter called "prd_sizes" which specifies a comma-separated list of
sizes. The sizes are applied in the order in which they are listed to
"pram0", "pram1", etc.

Ram Disk Usage
--------------

sudo modprobe brd prd_sizes="1G,2G"

This creates two ram disks with the specified sizes. That
is, /dev/pram0 will have a size of 1G. /dev/pram1 will
have a size of 2G.

sudo mkfs.ext4 /dev/pram0
sudo mkfs.ext4 /dev/pram1

Make filesystems on the persistent ram disks.

sudo mount -t ext4 /dev/pram0 /path/to/mountpoint0
sudo mount -t ext4 /dev/pram1 /path/to/mountpoint1

Mount them somewhere.

sudo umount /path/to/mountpoint0
sudo umount /path/to/mountpoint1

Unmount the filesystems.

After kexec
-----------

sudo modprobe brd (you may omit "prd_sizes")

This remembers the previously created persistent ram disks.

sudo mount -t ext4 /dev/pram0 /path/to/mountpoint0
sudo mount -t ext4 /dev/pram1 /path/to/mountpoint1

Mount the same filesystems.

The maximum number of persistent ram disk instances is specified via
CONFIG_BLK_DEV_PRAM_MAX. By default, this is zero.

Signed-off-by: Madhavan T. Venkataraman <[email protected]>
---
drivers/block/Kconfig | 11 +++
drivers/block/brd.c | 214 +++++++++++++++++++++++++++++++++++++++---
2 files changed, 213 insertions(+), 12 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 5b9d4aaebb81..08fa40f6e2de 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -256,6 +256,17 @@ config BLK_DEV_RAM_SIZE
The default value is 4096 kilobytes. Only change this if you know
what you are doing.

+config BLK_DEV_PRAM_MAX
+ int "Maximum number of Persistent RAM disks"
+ default "0"
+ depends on BLK_DEV_RAM
+ help
+ This allows the creation of persistent RAM disks. Persistent RAM
+ disks are used to remember data across a kexec reboot. The default
+ value is 0 Persistent RAM disks. Change this if you know what you
+ are doing. The sizes of the ram disks are specified via the boot
+ arg "prd_sizes" as a comma-separated list of sizes.
+
config CDROM_PKTCDVD
tristate "Packet writing on CD/DVD media (DEPRECATED)"
depends on !UML
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 970bd6ff38c4..3a05e56ca16f 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -24,9 +24,12 @@
#include <linux/slab.h>
#include <linux/backing-dev.h>
#include <linux/debugfs.h>
+#include <linux/prmem.h>

#include <linux/uaccess.h>

+enum brd_type { BRD_NORMAL = 0, BRD_PERSISTENT, };
+
/*
* Each block ramdisk device has a xarray brd_pages of pages that stores
* the pages containing the block device's contents. A brd page's ->index is
@@ -36,6 +39,7 @@
*/
struct brd_device {
int brd_number;
+ enum brd_type brd_type;
struct gendisk *brd_disk;
struct list_head brd_list;

@@ -46,6 +50,15 @@ struct brd_device {
u64 brd_nr_pages;
};

+/* Each of these functions performs an action based on brd_type. */
+static struct brd_device *brd_alloc_device(int i, enum brd_type type);
+static void brd_free_device(struct brd_device *brd);
+static struct page *brd_alloc_page(struct brd_device *brd, gfp_t gfp);
+static void brd_free_page(struct brd_device *brd, struct page *page);
+static void brd_xa_init(struct brd_device *brd);
+static void brd_init_name(struct brd_device *brd, char *name);
+static void brd_set_capacity(struct brd_device *brd);
+
/*
* Look up and return a brd's page for a given sector.
*/
@@ -75,7 +88,7 @@ static int brd_insert_page(struct brd_device *brd, sector_t sector, gfp_t gfp)
if (page)
return 0;

- page = alloc_page(gfp | __GFP_ZERO | __GFP_HIGHMEM);
+ page = brd_alloc_page(brd, gfp | __GFP_ZERO | __GFP_HIGHMEM);
if (!page)
return -ENOMEM;

@@ -87,7 +100,7 @@ static int brd_insert_page(struct brd_device *brd, sector_t sector, gfp_t gfp)
cur = __xa_cmpxchg(&brd->brd_pages, idx, NULL, page, gfp);

if (unlikely(cur)) {
- __free_page(page);
+ brd_free_page(brd, page);
ret = xa_err(cur);
if (!ret && (cur->index != idx))
ret = -EIO;
@@ -110,7 +123,7 @@ static void brd_free_pages(struct brd_device *brd)
pgoff_t idx;

xa_for_each(&brd->brd_pages, idx, page) {
- __free_page(page);
+ brd_free_page(brd, page);
cond_resched();
}

@@ -287,6 +300,18 @@ unsigned long rd_size = CONFIG_BLK_DEV_RAM_SIZE;
module_param(rd_size, ulong, 0444);
MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");

+/* Sizes of persistent ram disks are specified in a comma-separated list. */
+static char *prd_sizes;
+module_param(prd_sizes, charp, 0444);
+MODULE_PARM_DESC(prd_sizes, "Sizes of persistent RAM disks.");
+
+/* Persistent ram disk specific data. */
+struct prd_data {
+ struct prmem_instance *instance;
+ unsigned long size;
+};
+static struct prd_data prd_data[CONFIG_BLK_DEV_PRAM_MAX];
+
static int max_part = 1;
module_param(max_part, int, 0444);
MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
@@ -295,6 +320,32 @@ MODULE_LICENSE("GPL");
MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
MODULE_ALIAS("rd");

+void __init brd_parse(void)
+{
+ unsigned long size;
+ char *cur, *tmp;
+ int i = 0;
+
+ if (!CONFIG_BLK_DEV_PRAM_MAX || !prd_sizes)
+ return;
+
+ /* Parse persistent ram disk sizes. */
+ cur = prd_sizes;
+ do {
+ /* Get the size of a ramdisk. Sanity check it. */
+ size = memparse(cur, &tmp);
+ if (cur == tmp || !size) {
+ pr_warn("%s: Memory value expected\n", __func__);
+ return;
+ }
+ cur = tmp;
+
+ /* Add the ramdisk size. */
+ prd_data[i++].size = size;
+
+ } while (*cur++ == ',' && i < CONFIG_BLK_DEV_PRAM_MAX);
+}
+
#ifndef MODULE
/* Legacy boot options - nonmodular */
static int __init ramdisk_size(char *str)
@@ -314,23 +365,33 @@ static struct dentry *brd_debugfs_dir;

static int brd_alloc(int i)
{
+ int brd_number;
+ enum brd_type brd_type;
struct brd_device *brd;
struct gendisk *disk;
char buf[DISK_NAME_LEN];
int err = -ENOMEM;

+ if (i < rd_nr) {
+ brd_number = i;
+ brd_type = BRD_NORMAL;
+ } else {
+ brd_number = i - rd_nr;
+ brd_type = BRD_PERSISTENT;
+ }
+
list_for_each_entry(brd, &brd_devices, brd_list)
- if (brd->brd_number == i)
+ if (brd->brd_number == i && brd->brd_type == brd_type)
return -EEXIST;
- brd = kzalloc(sizeof(*brd), GFP_KERNEL);
+ brd = brd_alloc_device(brd_number, brd_type);
if (!brd)
return -ENOMEM;
- brd->brd_number = i;
+ brd->brd_number = brd_number;
list_add_tail(&brd->brd_list, &brd_devices);

- xa_init(&brd->brd_pages);
+ brd_xa_init(brd);

- snprintf(buf, DISK_NAME_LEN, "ram%d", i);
+ brd_init_name(brd, buf);
if (!IS_ERR_OR_NULL(brd_debugfs_dir))
debugfs_create_u64(buf, 0444, brd_debugfs_dir,
&brd->brd_nr_pages);
@@ -345,7 +406,7 @@ static int brd_alloc(int i)
disk->fops = &brd_fops;
disk->private_data = brd;
strscpy(disk->disk_name, buf, DISK_NAME_LEN);
- set_capacity(disk, rd_size * 2);
+ brd_set_capacity(brd);

/*
* This is so fdisk will align partitions on 4k, because of
@@ -370,7 +431,7 @@ static int brd_alloc(int i)
put_disk(disk);
out_free_dev:
list_del(&brd->brd_list);
- kfree(brd);
+ brd_free_device(brd);
return err;
}

@@ -390,7 +451,7 @@ static void brd_cleanup(void)
put_disk(brd->brd_disk);
brd_free_pages(brd);
list_del(&brd->brd_list);
- kfree(brd);
+ brd_free_device(brd);
}
}

@@ -427,13 +488,21 @@ static int __init brd_init(void)
goto out_free;
}

+ /* Parse persistent ram disk sizes. */
+ brd_parse();
+
+ /* Create persistent ram disks. */
+ for (i = 0; i < CONFIG_BLK_DEV_PRAM_MAX; i++)
+ brd_alloc(i + rd_nr);
+
/*
* brd module now has a feature to instantiate underlying device
* structure on-demand, provided that there is an access dev node.
*
* (1) if rd_nr is specified, create that many upfront. else
* it defaults to CONFIG_BLK_DEV_RAM_COUNT
- * (2) User can further extend brd devices by create dev node themselves
+ * (2) if prd_sizes is specified, create that many upfront.
+ * (3) User can further extend brd devices by create dev node themselves
* and have kernel automatically instantiate actual device
* on-demand. Example:
* mknod /path/devnod_name b 1 X # 1 is the rd major
@@ -469,3 +538,124 @@ static void __exit brd_exit(void)
module_init(brd_init);
module_exit(brd_exit);

+/* Each of these functions performs an action based on brd_type. */
+
+static struct brd_device *brd_alloc_device(int i, enum brd_type type)
+{
+ char name[PRMEM_MAX_NAME];
+ struct brd_device *brd;
+ struct prmem_instance *instance;
+ size_t size;
+ bool create;
+
+ if (type == BRD_NORMAL)
+ return kzalloc(sizeof(struct brd_device), GFP_KERNEL);
+
+ /*
+ * Get the persistent ramdisk instance. If it does not exist, it will
+ * be created, if a size has been specified.
+ */
+ create = !!prd_data[i].size;
+ snprintf(name, PRMEM_MAX_NAME, "pram%d", i);
+ instance = prmem_get("ramdisk", name, create);
+ if (!instance)
+ return NULL;
+
+ prmem_get_data(instance, (void **) &brd, &size);
+ if (brd) {
+ /* Existing instance. Ignore the module parameter. */
+ prd_data[i].size = size;
+ prd_data[i].instance = instance;
+ return brd;
+ }
+
+ /*
+ * New instance. Allocate brd from persistent memory and set it as
+ * instance data.
+ */
+ brd = prmem_alloc(sizeof(*brd), __GFP_ZERO);
+ if (!brd) {
+ prmem_put(instance);
+ return NULL;
+ }
+ brd->brd_type = BRD_PERSISTENT;
+ prmem_set_data(instance, brd, prd_data[i].size);
+
+ prd_data[i].instance = instance;
+ return brd;
+}
+
+static void brd_free_device(struct brd_device *brd)
+{
+ struct prmem_instance *instance;
+
+ if (brd->brd_type == BRD_NORMAL) {
+ kfree(brd);
+ return;
+ }
+
+ instance = prd_data[brd->brd_number].instance;
+ prmem_set_data(instance, NULL, 0);
+ prmem_free(brd, sizeof(*brd));
+ prmem_put(instance);
+}
+
+static struct page *brd_alloc_page(struct brd_device *brd, gfp_t gfp)
+{
+ if (brd->brd_type == BRD_NORMAL)
+ return alloc_page(gfp);
+ return prmem_alloc_pages(0, gfp);
+}
+
+static void brd_free_page(struct brd_device *brd, struct page *page)
+{
+ if (brd->brd_type == BRD_NORMAL)
+ __free_page(page);
+ else
+ prmem_free_pages(page, 0);
+}
+
+static void brd_xa_init(struct brd_device *brd)
+{
+ if (brd->brd_type == BRD_NORMAL) {
+ xa_init(&brd->brd_pages);
+ return;
+ }
+
+ if (brd->brd_nr_pages) {
+ /* Existing persistent instance. */
+ struct page *page;
+ pgoff_t idx;
+
+ /*
+ * The xarray of pages is persistent. However, the page
+ * indexes are not. Set them here.
+ */
+ xa_for_each(&brd->brd_pages, idx, page) {
+ page->index = idx;
+ }
+ } else {
+ /* New persistent instance. */
+ xa_init(&brd->brd_pages);
+ xa_persistent(&brd->brd_pages);
+ }
+}
+
+static void brd_init_name(struct brd_device *brd, char *name)
+{
+ if (brd->brd_type == BRD_NORMAL)
+ snprintf(name, DISK_NAME_LEN, "ram%d", brd->brd_number);
+ else
+ snprintf(name, DISK_NAME_LEN, "pram%d", brd->brd_number);
+}
+
+static void brd_set_capacity(struct brd_device *brd)
+{
+ unsigned long disksize;
+
+ if (brd->brd_type == BRD_NORMAL)
+ disksize = rd_size;
+ else
+ disksize = prd_data[brd->brd_number].size;
+ set_capacity(brd->brd_disk, disksize * 2);
+}
--
2.25.1

2023-10-16 23:33:15

by Madhavan T. Venkataraman

Subject: [RFC PATCH v1 10/10] mm/prmem: Implement dynamic expansion of prmem.

From: "Madhavan T. Venkataraman" <[email protected]>

For some use cases, it is hard to predict how much actual memory is
needed to store persistent data. This will depend on the workload. Either
we would have to overcommit memory for persistent data, or we could
allow dynamic expansion of prmem memory.

Implement dynamic expansion of prmem. When the allocator runs out of memory
it calls alloc_pages(MAX_ORDER) to allocate a max order page. It creates a
region for that memory and adds it to the list of regions. Then, the
allocator can allocate from that region.

To allow this, extend the command line parameter:

prmem=size[KMG][,max_size[KMG]]

Size is allocated upfront as mentioned before. Between size and max_size,
prmem is expanded dynamically as mentioned above.

Choosing a max order page means that no fragmentation is created for
transparent huge pages and kmem slabs. But fragmentation may be created
for 1GB pages. This is not a problem for 1GB pages that are reserved
up front. This could be a problem for 1GB pages that are allocated at
run time dynamically.

If max_size is omitted from the command line parameter, no dynamic
expansion will happen.

Signed-off-by: Madhavan T. Venkataraman <[email protected]>
---
include/linux/prmem.h | 8 +++++++
kernel/prmem/prmem_allocator.c | 38 ++++++++++++++++++++++++++++++++++
kernel/prmem/prmem_init.c | 1 +
kernel/prmem/prmem_misc.c | 3 ++-
kernel/prmem/prmem_parse.c | 20 +++++++++++++++++-
kernel/prmem/prmem_region.c | 1 +
kernel/prmem/prmem_reserve.c | 1 +
7 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index c7034690f7cb..bb552946cb5b 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -83,6 +83,9 @@ struct prmem_instance {
* metadata Physical address of the metadata page.
* size Size of initial memory allocated to prmem.
*
+ * cur_size Current amount of memory allocated to prmem.
+ * max_size Maximum amount of memory that can be allocated to prmem.
+ *
* regions List of memory regions.
*
* instances Persistent instances.
@@ -95,6 +98,10 @@ struct prmem {
unsigned long metadata;
size_t size;

+ /* Dynamic expansion. */
+ size_t cur_size;
+ size_t max_size;
+
/* Persistent Regions. */
struct list_head regions;

@@ -109,6 +116,7 @@ extern struct prmem *prmem;
extern unsigned long prmem_metadata;
extern unsigned long prmem_pa;
extern size_t prmem_size;
+extern size_t prmem_max_size;
extern bool prmem_inited;
extern spinlock_t prmem_lock;

diff --git a/kernel/prmem/prmem_allocator.c b/kernel/prmem/prmem_allocator.c
index f12975bc6777..1cb3eae8a3e7 100644
--- a/kernel/prmem/prmem_allocator.c
+++ b/kernel/prmem/prmem_allocator.c
@@ -9,17 +9,55 @@

/* Page Allocation functions. */

+static void prmem_expand(void)
+{
+ struct prmem_region *region;
+ struct page *pages;
+ unsigned int order = MAX_ORDER;
+ size_t size = (1UL << order) << PAGE_SHIFT;
+
+ if (prmem->cur_size + size > prmem->max_size)
+ return;
+
+ spin_unlock(&prmem_lock);
+ pages = alloc_pages(GFP_NOWAIT, order);
+ spin_lock(&prmem_lock);
+
+ if (!pages)
+ return;
+
+ /* cur_size may have changed. Recheck. */
+ if (prmem->cur_size + size > prmem->max_size)
+ goto free;
+
+ region = prmem_add_region(page_to_phys(pages), size);
+ if (!region)
+ goto free;
+
+ pr_warn("%s: prmem expanded by %ld\n", __func__, size);
+ return;
+free:
+ __free_pages(pages, order);
+}
+
void *prmem_alloc_pages_locked(unsigned int order)
{
struct prmem_region *region;
void *va;
size_t size = (1UL << order) << PAGE_SHIFT;
+ bool expand = true;

+retry:
list_for_each_entry(region, &prmem->regions, node) {
va = prmem_alloc_pool(region, size, size);
if (va)
return va;
}
+ if (expand) {
+ expand = false;
+ prmem_expand();
+ goto retry;
+ }
return NULL;
}

diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
index 166fca688ab3..f4814cc88508 100644
--- a/kernel/prmem/prmem_init.c
+++ b/kernel/prmem/prmem_init.c
@@ -20,6 +20,7 @@ void __init prmem_init(void)
/* Cold boot. */
prmem->metadata = prmem_metadata;
prmem->size = prmem_size;
+ prmem->max_size = prmem_max_size;
INIT_LIST_HEAD(&prmem->regions);
INIT_LIST_HEAD(&prmem->instances);

diff --git a/kernel/prmem/prmem_misc.c b/kernel/prmem/prmem_misc.c
index 49b6a7232c1a..3100662d2cbe 100644
--- a/kernel/prmem/prmem_misc.c
+++ b/kernel/prmem/prmem_misc.c
@@ -68,7 +68,8 @@ bool __init prmem_validate(void)
unsigned long checksum;

/* Sanity check the boot parameter. */
- if (prmem_metadata != prmem->metadata || prmem_size != prmem->size) {
+ if (prmem_metadata != prmem->metadata || prmem_size != prmem->size ||
+ prmem_max_size != prmem->max_size) {
pr_warn("%s: Boot parameter mismatch\n", __func__);
return false;
}
diff --git a/kernel/prmem/prmem_parse.c b/kernel/prmem/prmem_parse.c
index 6c1a23c6b84e..3a57b37fa191 100644
--- a/kernel/prmem/prmem_parse.c
+++ b/kernel/prmem/prmem_parse.c
@@ -8,9 +8,11 @@
#include <linux/prmem.h>

/*
- * Syntax: prmem=size[KMG]
+ * Syntax: prmem=size[KMG][,max_size[KMG]]
*
* Specifies the size of the initial memory to be allocated to prmem.
+ * Optionally, specifies the maximum amount of memory to be allocated to
+ * prmem. prmem will expand dynamically between size and max_size.
*/
static int __init prmem_size_parse(char *cmdline)
{
@@ -28,6 +30,22 @@ static int __init prmem_size_parse(char *cmdline)
}

prmem_size = size;
+ prmem_max_size = size;
+
+ cur = tmp;
+ if (*cur++ == ',') {
+ /* Get max size. */
+ size = memparse(cur, &tmp);
+ if (cur == tmp || !size || size & (PAGE_SIZE - 1) ||
+ size <= prmem_size) {
+ prmem_size = 0;
+ prmem_max_size = 0;
+ pr_warn("%s: Incorrect max size %lx\n", __func__, size);
+ return -EINVAL;
+ }
+ prmem_max_size = size;
+ }
+
return 0;
}
early_param("prmem", prmem_size_parse);
diff --git a/kernel/prmem/prmem_region.c b/kernel/prmem/prmem_region.c
index 6dc88c74d9c8..390329a34b74 100644
--- a/kernel/prmem/prmem_region.c
+++ b/kernel/prmem/prmem_region.c
@@ -82,5 +82,6 @@ struct prmem_region *prmem_add_region(unsigned long pa, size_t size)
return NULL;

list_add_tail(&region->node, &prmem->regions);
+ prmem->cur_size += size;
return region;
}
diff --git a/kernel/prmem/prmem_reserve.c b/kernel/prmem/prmem_reserve.c
index 8000fff05402..c5ae5d7d8f0a 100644
--- a/kernel/prmem/prmem_reserve.c
+++ b/kernel/prmem/prmem_reserve.c
@@ -11,6 +11,7 @@ struct prmem *prmem;
unsigned long prmem_metadata;
unsigned long prmem_pa;
unsigned long prmem_size;
+size_t prmem_max_size;

void __init prmem_reserve_early(void)
{
--
2.25.1

2023-10-16 23:33:16

by Madhavan T. Venkataraman

Subject: [RFC PATCH v1 03/10] mm/prmem: Manage persistent memory with the gen pool allocator.

From: "Madhavan T. Venkataraman" <[email protected]>

The memory in a prmem region must be managed by an allocator. Use
the Gen Pool allocator (lib/genalloc.c) for that purpose. This is so we
don't have to write a new allocator.

Now, the Gen Pool allocator uses a "struct gen_pool_chunk" to manage a
contiguous range of memory. The chunk is normally allocated using the kmem
allocator. However, for prmem, the chunk must be persisted across a
kexec reboot so that the allocations can be "remembered". To allow this,
allocate the chunk from the region itself and initialize it. Then, pass
the chunk to the Gen Pool allocator. In other words, persist the chunk.

Inside the Gen Pool allocator, distinguish between a chunk that is
allocated internally from kmem and a chunk that is passed by the caller
and handle it properly when the pool is destroyed.

Provide wrapper functions around the Gen Pool allocator functions so we
can change the allocator in the future if we want to:

prmem_create_pool()
prmem_alloc_pool()
prmem_free_pool()

Signed-off-by: Madhavan T. Venkataraman <[email protected]>
---
include/linux/genalloc.h | 6 ++++
include/linux/prmem.h | 8 +++++
kernel/prmem/prmem_init.c | 8 +++++
kernel/prmem/prmem_region.c | 67 ++++++++++++++++++++++++++++++++++++-
lib/genalloc.c | 45 ++++++++++++++++++-------
5 files changed, 121 insertions(+), 13 deletions(-)

diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index 0bd581003cd5..186757b0aec7 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -73,6 +73,7 @@ struct gen_pool_chunk {
struct list_head next_chunk; /* next chunk in pool */
atomic_long_t avail;
phys_addr_t phys_addr; /* physical starting address of memory chunk */
+ bool external; /* Chunk is passed by caller. */
void *owner; /* private data to retrieve at alloc time */
unsigned long start_addr; /* start address of memory chunk */
unsigned long end_addr; /* end address of memory chunk (inclusive) */
@@ -121,6 +122,11 @@ static inline int gen_pool_add(struct gen_pool *pool, unsigned long addr,
{
return gen_pool_add_virt(pool, addr, -1, size, nid);
}
+extern unsigned long gen_pool_chunk_size(size_t size, int min_alloc_order);
+extern void gen_pool_init_chunk(struct gen_pool_chunk *chunk,
+ unsigned long addr, phys_addr_t phys,
+ size_t size, bool external, void *owner);
+void gen_pool_add_chunk(struct gen_pool *pool, struct gen_pool_chunk *chunk);
extern void gen_pool_destroy(struct gen_pool *);
unsigned long gen_pool_alloc_algo_owner(struct gen_pool *pool, size_t size,
genpool_algo_t algo, void *data, void **owner);
diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index bc8054a86f49..f43f5b0d2b9c 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -24,6 +24,7 @@
* non-volatile storage is too slow.
*/
#include <linux/types.h>
+#include <linux/genalloc.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/memblock.h>
@@ -38,11 +39,15 @@
* node List node.
* pa Physical address of the region.
* size Size of the region in bytes.
+ * pool Gen Pool to manage region memory.
+ * chunk Persistent Gen Pool chunk.
*/
struct prmem_region {
struct list_head node;
unsigned long pa;
size_t size;
+ struct gen_pool *pool;
+ struct gen_pool_chunk *chunk;
};

/*
@@ -80,6 +85,9 @@ int prmem_cmdline_size(void);

/* Internal functions. */
struct prmem_region *prmem_add_region(unsigned long pa, size_t size);
+bool prmem_create_pool(struct prmem_region *region, bool new_region);
+void *prmem_alloc_pool(struct prmem_region *region, size_t size, int align);
+void prmem_free_pool(struct prmem_region *region, void *va, size_t size);
unsigned long prmem_checksum(void *start, size_t size);
bool __init prmem_validate(void);
void prmem_cmdline(char *cmdline);
diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
index 9cea1cd3b6a5..56df1e6d3ebc 100644
--- a/kernel/prmem/prmem_init.c
+++ b/kernel/prmem/prmem_init.c
@@ -22,6 +22,14 @@ void __init prmem_init(void)

if (!prmem_add_region(prmem_pa, prmem_size))
return;
+ } else {
+ /* Warm boot. */
+ struct prmem_region *region;
+
+ list_for_each_entry(region, &prmem->regions, node) {
+ if (!prmem_create_pool(region, false))
+ return;
+ }
}
prmem_inited = true;
}
diff --git a/kernel/prmem/prmem_region.c b/kernel/prmem/prmem_region.c
index 8254dafcee13..6dc88c74d9c8 100644
--- a/kernel/prmem/prmem_region.c
+++ b/kernel/prmem/prmem_region.c
@@ -1,12 +1,74 @@
// SPDX-License-Identifier: GPL-2.0-only
/*
- * Persistent-Across-Kexec memory (prmem) - Regions.
+ * Persistent-Across-Kexec memory (prmem) - Regions and Region Pools.
*
* Copyright (C) 2023 Microsoft Corporation
* Author: Madhavan T. Venkataraman ([email protected])
*/
#include <linux/prmem.h>

+bool prmem_create_pool(struct prmem_region *region, bool new_region)
+{
+ size_t chunk_size, total_size;
+
+ chunk_size = gen_pool_chunk_size(region->size, PAGE_SHIFT);
+ total_size = sizeof(*region) + chunk_size;
+ total_size = ALIGN(total_size, PAGE_SIZE);
+
+ if (new_region) {
+ /*
+ * We place the region structure at the base of the region
+ * itself. Part of the region is a genpool chunk that is used
+ * to manage the region memory.
+ *
+ * Normally, the chunk is allocated from regular memory by
+ * genpool. But in the case of prmem, the chunk must be
+ * persisted across kexecs so allocations can be remembered.
+ * That is why it is allocated from the region memory itself
+ * and passed to genpool.
+ *
+ * Make sure there is enough space for the region and the chunk.
+ */
+ if (total_size >= region->size) {
+ pr_warn("%s: region size too small\n", __func__);
+ return false;
+ }
+
+ /* Initialize the persistent genpool chunk. */
+ region->chunk = (void *) (region + 1);
+ memset(region->chunk, 0, chunk_size);
+ gen_pool_init_chunk(region->chunk, (unsigned long) region,
+ region->pa, region->size, true, NULL);
+ }
+
+ region->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
+ if (!region->pool) {
+ pr_warn("%s: Could not create genpool\n", __func__);
+ return false;
+ }
+
+ gen_pool_add_chunk(region->pool, region->chunk);
+
+ if (new_region) {
+ /* Reserve the region and chunk. */
+ gen_pool_alloc(region->pool, total_size);
+ }
+ return true;
+}
+
+void *prmem_alloc_pool(struct prmem_region *region, size_t size, int align)
+{
+ struct genpool_data_align data = { .align = align, };
+
+ return (void *) gen_pool_alloc_algo(region->pool, size,
+ gen_pool_first_fit_align, &data);
+}
+
+void prmem_free_pool(struct prmem_region *region, void *va, size_t size)
+{
+ gen_pool_free(region->pool, (unsigned long) va, size);
+}
+
struct prmem_region *prmem_add_region(unsigned long pa, size_t size)
{
struct prmem_region *region;
@@ -16,6 +78,9 @@ struct prmem_region *prmem_add_region(unsigned long pa, size_t size)
region->pa = pa;
region->size = size;

+ if (!prmem_create_pool(region, true))
+ return NULL;
+
list_add_tail(&region->node, &prmem->regions);
return region;
}
diff --git a/lib/genalloc.c b/lib/genalloc.c
index 6c644f954bc5..655db7b47ea9 100644
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -165,6 +165,33 @@ struct gen_pool *gen_pool_create(int min_alloc_order, int nid)
}
EXPORT_SYMBOL(gen_pool_create);

+size_t gen_pool_chunk_size(size_t size, int min_alloc_order)
+{
+ unsigned long nbits = size >> min_alloc_order;
+ unsigned long nbytes = sizeof(struct gen_pool_chunk) +
+ BITS_TO_LONGS(nbits) * sizeof(long);
+ return nbytes;
+}
+
+void gen_pool_init_chunk(struct gen_pool_chunk *chunk, unsigned long virt,
+ phys_addr_t phys, size_t size, bool external,
+ void *owner)
+{
+ chunk->phys_addr = phys;
+ chunk->start_addr = virt;
+ chunk->end_addr = virt + size - 1;
+ chunk->external = external;
+ chunk->owner = owner;
+ atomic_long_set(&chunk->avail, size);
+}
+
+void gen_pool_add_chunk(struct gen_pool *pool, struct gen_pool_chunk *chunk)
+{
+ spin_lock(&pool->lock);
+ list_add_rcu(&chunk->next_chunk, &pool->chunks);
+ spin_unlock(&pool->lock);
+}
+
/**
* gen_pool_add_owner- add a new chunk of special memory to the pool
* @pool: pool to add new memory chunk to
@@ -183,23 +210,14 @@ int gen_pool_add_owner(struct gen_pool *pool, unsigned long virt, phys_addr_t ph
size_t size, int nid, void *owner)
{
struct gen_pool_chunk *chunk;
- unsigned long nbits = size >> pool->min_alloc_order;
- unsigned long nbytes = sizeof(struct gen_pool_chunk) +
- BITS_TO_LONGS(nbits) * sizeof(long);
+ unsigned long nbytes = gen_pool_chunk_size(size, pool->min_alloc_order);

chunk = vzalloc_node(nbytes, nid);
if (unlikely(chunk == NULL))
return -ENOMEM;

- chunk->phys_addr = phys;
- chunk->start_addr = virt;
- chunk->end_addr = virt + size - 1;
- chunk->owner = owner;
- atomic_long_set(&chunk->avail, size);
-
- spin_lock(&pool->lock);
- list_add_rcu(&chunk->next_chunk, &pool->chunks);
- spin_unlock(&pool->lock);
+ gen_pool_init_chunk(chunk, virt, phys, size, false, owner);
+ gen_pool_add_chunk(pool, chunk);

return 0;
}
@@ -248,6 +266,9 @@ void gen_pool_destroy(struct gen_pool *pool)
chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk);
list_del(&chunk->next_chunk);

+ if (chunk->external)
+ continue;
+
end_bit = chunk_size(chunk) >> order;
bit = find_first_bit(chunk->bits, end_bit);
BUG_ON(bit < end_bit);
--
2.25.1

2023-10-16 23:33:19

by Madhavan T. Venkataraman

Subject: [RFC PATCH v1 02/10] mm/prmem: Reserve metadata and persistent regions in early boot after kexec

From: "Madhavan T. Venkataraman" <[email protected]>

Currently, only one memory region is given to prmem to store persistent
data. In the future, regions may be added dynamically.

The prmem metadata and the regions need to be reserved during early boot
after a kexec. For this to happen, the kernel must know where the metadata
is. To allow this, introduce a kernel command line parameter:

prmem_meta=metadata_address

When a kexec image is loaded into the kernel, add this parameter to the
kexec cmdline. Upon a kexec boot, get the metadata page from the cmdline
and reserve it. Then, walk the list of regions in the metadata and reserve
the regions.

Note that the cmdline modification is done automatically within the kernel.
Userland does not have to do anything.

The metadata needs to be validated before it can be used. To allow this,
compute a checksum on the metadata and store it in the metadata at the end
of shutdown. During early boot, validate the metadata with the checksum.

If the validation fails, discard the metadata. Treat it as a cold boot.
That is, allocate a new metadata page and initial region and start over.
Similarly, if the reservation of the regions fails, treat it as a cold
boot and start over.

This means that all persistent data will be lost on any of these failures.
Note that there will be no memory leak when this happens.

Signed-off-by: Madhavan T. Venkataraman <[email protected]>
---
arch/x86/kernel/kexec-bzimage64.c | 5 +-
arch/x86/kernel/setup.c | 2 +
include/linux/memblock.h | 2 +
include/linux/prmem.h | 11 ++++
kernel/prmem/Makefile | 2 +-
kernel/prmem/prmem_init.c | 9 ++++
kernel/prmem/prmem_misc.c | 85 +++++++++++++++++++++++++++++++
kernel/prmem/prmem_parse.c | 29 +++++++++++
kernel/prmem/prmem_reserve.c | 70 ++++++++++++++++++++++++-
kernel/reboot.c | 2 +
mm/memblock.c | 12 +++++
11 files changed, 226 insertions(+), 3 deletions(-)
create mode 100644 kernel/prmem/prmem_misc.c

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index a61c12c01270..a19f172be410 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -18,6 +18,7 @@
#include <linux/mm.h>
#include <linux/efi.h>
#include <linux/random.h>
+#include <linux/prmem.h>

#include <asm/bootparam.h>
#include <asm/setup.h>
@@ -82,6 +83,8 @@ static int setup_cmdline(struct kimage *image, struct boot_params *params,

cmdline_ptr[cmdline_len - 1] = '\0';

+ prmem_cmdline(cmdline_ptr);
+
pr_debug("Final command line is: %s\n", cmdline_ptr);
cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
@@ -458,7 +461,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
*/
efi_map_sz = efi_get_runtime_map_size();
params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
- MAX_ELFCOREHDR_STR_LEN;
+ MAX_ELFCOREHDR_STR_LEN + prmem_cmdline_size();
params_cmdline_sz = ALIGN(params_cmdline_sz, 16);
kbuf.bufsz = params_cmdline_sz + ALIGN(efi_map_sz, 16) +
sizeof(struct setup_data) +
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f2b13b3d3ead..22f5cd494291 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1137,6 +1137,8 @@ void __init setup_arch(char **cmdline_p)
*/
efi_reserve_boot_services();

+ prmem_reserve_early();
+
/* preallocate 4k for mptable mpc */
e820__memblock_alloc_reserved_mpc_new();

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f71ff9f0ec81..584bbb884c8e 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -114,6 +114,8 @@ int memblock_add(phys_addr_t base, phys_addr_t size);
int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_phys_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
+void memblock_unreserve(phys_addr_t base, phys_addr_t size);
+
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
int memblock_physmem_add(phys_addr_t base, phys_addr_t size);
#endif
diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index 7f22016c4ad2..bc8054a86f49 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -48,12 +48,16 @@ struct prmem_region {
/*
* PRMEM metadata.
*
+ * checksum Just before reboot, a checksum is computed on the metadata. On
+ * the next kexec reboot, the metadata is validated with the
+ * checksum to make sure that the metadata has not been corrupted.
* metadata Physical address of the metadata page.
* size Size of initial memory allocated to prmem.
*
* regions List of memory regions.
*/
struct prmem {
+ unsigned long checksum;
unsigned long metadata;
size_t size;

@@ -65,12 +69,19 @@ extern struct prmem *prmem;
extern unsigned long prmem_metadata;
extern unsigned long prmem_pa;
extern size_t prmem_size;
+extern bool prmem_inited;

/* Kernel API. */
+void prmem_reserve_early(void);
void prmem_reserve(void);
void prmem_init(void);
+void prmem_fini(void);
+int prmem_cmdline_size(void);

/* Internal functions. */
struct prmem_region *prmem_add_region(unsigned long pa, size_t size);
+unsigned long prmem_checksum(void *start, size_t size);
+bool __init prmem_validate(void);
+void prmem_cmdline(char *cmdline);

#endif /* _LINUX_PRMEM_H */
diff --git a/kernel/prmem/Makefile b/kernel/prmem/Makefile
index 11a53d49312a..9b0a693bfee1 100644
--- a/kernel/prmem/Makefile
+++ b/kernel/prmem/Makefile
@@ -1,3 +1,3 @@
# SPDX-License-Identifier: GPL-2.0

-obj-y += prmem_parse.o prmem_reserve.o prmem_init.o prmem_region.o
+obj-y += prmem_parse.o prmem_reserve.o prmem_init.o prmem_region.o prmem_misc.o
diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
index 97b550252028..9cea1cd3b6a5 100644
--- a/kernel/prmem/prmem_init.c
+++ b/kernel/prmem/prmem_init.c
@@ -25,3 +25,12 @@ void __init prmem_init(void)
}
prmem_inited = true;
}
+
+void prmem_fini(void)
+{
+ if (!prmem_inited)
+ return;
+
+ /* Compute checksum over the metadata. */
+ prmem->checksum = prmem_checksum(prmem, sizeof(*prmem));
+}
diff --git a/kernel/prmem/prmem_misc.c b/kernel/prmem/prmem_misc.c
new file mode 100644
index 000000000000..49b6a7232c1a
--- /dev/null
+++ b/kernel/prmem/prmem_misc.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Persistent-Across-Kexec memory (prmem) - Miscellaneous functions.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman ([email protected])
+ */
+#include <linux/prmem.h>
+
+#define MAX_META_LENGTH 31
+
+/*
+ * On a kexec, modify the kernel command line to include the boot parameter
+ * "prmem_meta=" so that the metadata can be found on the next boot. If the
+ * parameter is already present in cmdline, overwrite it. Else, add it.
+ */
+void prmem_cmdline(char *cmdline)
+{
+ char meta[MAX_META_LENGTH], *str;
+ unsigned long metadata;
+
+ metadata = prmem_inited ? prmem->metadata : 0;
+ snprintf(meta, MAX_META_LENGTH, " prmem_meta=0x%.16lx", metadata);
+
+ str = strstr(cmdline, " prmem_meta");
+ if (str) {
+ /*
+ * Boot parameter already exists. Overwrite it. We deliberately
+ * use strncpy() and rely on the fact that it will not
+ * NUL-terminate the copy.
+ */
+ strncpy(str, meta, MAX_META_LENGTH - 1);
+ return;
+ }
+ if (prmem_inited) {
+ /* Boot parameter does not exist. Add it. */
+ strcat(cmdline, meta);
+ }
+}
+
+/*
+ * Make sure that the kexec command line can accommodate the prmem_meta
+ * command line parameter.
+ */
+int prmem_cmdline_size(void)
+{
+ return MAX_META_LENGTH;
+}
+
+unsigned long prmem_checksum(void *start, size_t size)
+{
+ unsigned long checksum = 0;
+ unsigned long *ptr;
+ void *end;
+
+ end = start + size;
+ for (ptr = start; (void *) ptr < end; ptr++)
+ checksum += *ptr;
+ return checksum;
+}
+
+/*
+ * Check if the metadata is sane. It would not be sane on a cold boot or if the
+ * metadata has been corrupted. In the latter case, we treat it as a cold boot.
+ */
+bool __init prmem_validate(void)
+{
+ unsigned long checksum;
+
+ /* Sanity check the boot parameter. */
+ if (prmem_metadata != prmem->metadata || prmem_size != prmem->size) {
+ pr_warn("%s: Boot parameter mismatch\n", __func__);
+ return false;
+ }
+
+ /* Compute and check the checksum of the metadata. */
+ checksum = prmem->checksum;
+ prmem->checksum = 0;
+
+ if (checksum != prmem_checksum(prmem, sizeof(*prmem))) {
+ pr_warn("%s: Checksum mismatch\n", __func__);
+ return false;
+ }
+ return true;
+}
diff --git a/kernel/prmem/prmem_parse.c b/kernel/prmem/prmem_parse.c
index 191655b53545..6c1a23c6b84e 100644
--- a/kernel/prmem/prmem_parse.c
+++ b/kernel/prmem/prmem_parse.c
@@ -31,3 +31,32 @@ static int __init prmem_size_parse(char *cmdline)
return 0;
}
early_param("prmem", prmem_size_parse);
+
+/*
+ * Syntax: prmem_meta=metadata_address
+ *
+ * Specifies the address of a single page where the prmem metadata resides.
+ *
+ * On a kexec, the following will be appended to the kernel command line -
+ * "prmem_meta=metadata_address". This is so that the metadata can be located
+ * easily on kexec reboots.
+ */
+static int __init prmem_meta_parse(char *cmdline)
+{
+ char *tmp, *cur = cmdline;
+ unsigned long addr;
+
+ if (!cur)
+ return -EINVAL;
+
+ /* Get metadata address. */
+ addr = memparse(cur, &tmp);
+ if (cur == tmp || addr & (PAGE_SIZE - 1)) {
+ pr_warn("%s: Incorrect address %lx\n", __func__, addr);
+ return -EINVAL;
+ }
+
+ prmem_metadata = addr;
+ return 0;
+}
+early_param("prmem_meta", prmem_meta_parse);
diff --git a/kernel/prmem/prmem_reserve.c b/kernel/prmem/prmem_reserve.c
index e20e31a61d12..8000fff05402 100644
--- a/kernel/prmem/prmem_reserve.c
+++ b/kernel/prmem/prmem_reserve.c
@@ -12,11 +12,79 @@ unsigned long prmem_metadata;
unsigned long prmem_pa;
unsigned long prmem_size;

+void __init prmem_reserve_early(void)
+{
+ struct prmem_region *region;
+ unsigned long nregions;
+
+ /* Need to specify an initial size to enable prmem. */
+ if (!prmem_size)
+ return;
+
+ /* Nothing to be done if it is a cold boot. */
+ if (!prmem_metadata)
+ return;
+
+ /*
+ * prmem uses direct map addresses. If PAGE_OFFSET is randomized,
+ * these addresses will change across kexecs. Persistence cannot
+ * be supported.
+ */
+ if (kaslr_memory_enabled()) {
+ pr_warn("%s: Cannot support persistence because of KASLR.\n",
+ __func__);
+ return;
+ }
+
+ /*
+ * This is a kexec reboot. If any step fails here, treat this like a
+ * cold boot. That is, forget all persistent data and start over.
+ */
+
+ /* Reserve metadata page. */
+ if (memblock_reserve(prmem_metadata, PAGE_SIZE)) {
+ pr_warn("%s: Unable to reserve metadata at %lx\n", __func__,
+ prmem_metadata);
+ return;
+ }
+ prmem = __va(prmem_metadata);
+
+ /* Make sure that the metadata is sane. */
+ if (!prmem_validate())
+ goto unreserve_metadata;
+
+ /* Reserve regions that were added to prmem. */
+ nregions = 0;
+ list_for_each_entry(region, &prmem->regions, node) {
+ if (memblock_reserve(region->pa, region->size)) {
+ pr_warn("%s: Unable to reserve %lx, %lx\n", __func__,
+ region->pa, region->size);
+ goto unreserve_regions;
+ }
+ nregions++;
+ }
+ return;
+
+unreserve_regions:
+ /* Unreserve regions. */
+ list_for_each_entry(region, &prmem->regions, node) {
+ if (!nregions)
+ break;
+ memblock_unreserve(region->pa, region->size);
+ nregions--;
+ }
+
+unreserve_metadata:
+ /* Unreserve the metadata page. */
+ memblock_unreserve(prmem_metadata, PAGE_SIZE);
+ prmem = NULL;
+}
+
void __init prmem_reserve(void)
{
BUILD_BUG_ON(sizeof(*prmem) > PAGE_SIZE);

- if (!prmem_size)
+ if (!prmem_size || prmem)
return;

/*
diff --git a/kernel/reboot.c b/kernel/reboot.c
index 3bba88c7ffc6..b4595b7e77f3 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -13,6 +13,7 @@
#include <linux/kexec.h>
#include <linux/kmod.h>
#include <linux/kmsg_dump.h>
+#include <linux/prmem.h>
#include <linux/reboot.h>
#include <linux/suspend.h>
#include <linux/syscalls.h>
@@ -84,6 +85,7 @@ void kernel_restart_prepare(char *cmd)
system_state = SYSTEM_RESTART;
usermodehelper_disable();
device_shutdown();
+ prmem_fini();
}

/**
diff --git a/mm/memblock.c b/mm/memblock.c
index f9e61e565a53..1f5070f7b5bc 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -873,6 +873,18 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
}

+void __init_memblock memblock_unreserve(phys_addr_t base, phys_addr_t size)
+{
+ phys_addr_t end = base + size - 1;
+
+ memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
+ &base, &end, (void *)_RET_IP_);
+
+ if (memblock_remove_range(&memblock.reserved, base, size))
+ return;
+ memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
+}
+
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size)
{
--
2.25.1

2023-10-16 23:33:37

by Madhavan T. Venkataraman

Subject: [RFC PATCH v1 07/10] mm/prmem: Implement named Persistent Instances.

From: "Madhavan T. Venkataraman" <[email protected]>

To persist any data, a consumer needs to do the following:

- Create a persistent instance for it. The instance gets recorded
in the metadata.

- Name the instance.

- Record the instance data in the instance.

- Retrieve the instance by name after kexec.

- Retrieve instance data.

Implement the following API for consumers:

prmem_get(subsystem, name, create)

Get/Create a persistent instance. The consumer provides the name
of the subsystem and the name of the instance within the subsystem.
E.g., for a persistent ramdisk block device:
subsystem = "ramdisk"
instance = "pram0"

prmem_set_data()

Record a data pointer and a size for the instance. An instance may
contain many data structures connected to each other using pointers,
etc. A consumer is expected to record the top level data structure
in the instance. All other data structures must be reachable from
the top level data structure.

prmem_get_data()

Retrieve the data pointer and the size for the instance.

prmem_put()

Destroy a persistent instance. The instance data must be NULL at
this point. That is, the consumer is responsible for freeing the
instance data and setting it to NULL in the instance prior to
destroying it.

prmem_list()

Walk the instances of a subsystem and call a callback for each.
This allows a consumer to enumerate all of the instances associated
with a subsystem.

Signed-off-by: Madhavan T. Venkataraman <[email protected]>
---
include/linux/prmem.h | 36 +++++++++
kernel/prmem/Makefile | 2 +-
kernel/prmem/prmem_init.c | 1 +
kernel/prmem/prmem_instance.c | 139 ++++++++++++++++++++++++++++++++++
4 files changed, 177 insertions(+), 1 deletion(-)
create mode 100644 kernel/prmem/prmem_instance.c

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index 1cb4660cf35e..c7034690f7cb 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -50,6 +50,28 @@ struct prmem_region {
struct gen_pool_chunk *chunk;
};

+#define PRMEM_MAX_NAME 32
+
+/*
+ * To persist any data, a persistent instance is created for it and the data is
+ * "remembered" in the instance.
+ *
+ * node List node
+ * subsystem Subsystem/driver/module that created the instance. E.g.,
+ * "ramdisk" for the ramdisk driver.
+ * name Instance name within the subsystem/driver/module. E.g., "pram0"
+ * for a persistent ramdisk instance.
+ * data Pointer to data. E.g., the radix tree of pages in a ram disk.
+ * size Size of data.
+ */
+struct prmem_instance {
+ struct list_head node;
+ char subsystem[PRMEM_MAX_NAME];
+ char name[PRMEM_MAX_NAME];
+ void *data;
+ size_t size;
+};
+
#define PRMEM_MAX_CACHES 14

/*
@@ -63,6 +85,8 @@ struct prmem_region {
*
* regions List of memory regions.
*
+ * instances Persistent instances.
+ *
* caches Caches for different object sizes. For allocations smaller than
* PAGE_SIZE, these caches are used.
*/
@@ -74,6 +98,9 @@ struct prmem {
/* Persistent Regions. */
struct list_head regions;

+ /* Persistent Instances. */
+ struct list_head instances;
+
/* Allocation caches. */
void *caches[PRMEM_MAX_CACHES];
};
@@ -85,6 +112,8 @@ extern size_t prmem_size;
extern bool prmem_inited;
extern spinlock_t prmem_lock;

+typedef int (*prmem_list_func_t)(struct prmem_instance *instance, void *arg);
+
/* Kernel API. */
void prmem_reserve_early(void);
void prmem_reserve(void);
@@ -98,6 +127,13 @@ void prmem_free_pages(struct page *pages, unsigned int order);
void *prmem_alloc(size_t size, gfp_t gfp);
void prmem_free(void *va, size_t size);

+/* Persistent Instance API. */
+void *prmem_get(char *subsystem, char *name, bool create);
+void prmem_set_data(struct prmem_instance *instance, void *data, size_t size);
+void prmem_get_data(struct prmem_instance *instance, void **data, size_t *size);
+bool prmem_put(struct prmem_instance *instance);
+int prmem_list(char *subsystem, prmem_list_func_t func, void *arg);
+
/* Internal functions. */
struct prmem_region *prmem_add_region(unsigned long pa, size_t size);
bool prmem_create_pool(struct prmem_region *region, bool new_region);
diff --git a/kernel/prmem/Makefile b/kernel/prmem/Makefile
index 99bb19f0afd3..0ed7976580d6 100644
--- a/kernel/prmem/Makefile
+++ b/kernel/prmem/Makefile
@@ -1,4 +1,4 @@
# SPDX-License-Identifier: GPL-2.0

obj-y += prmem_parse.o prmem_reserve.o prmem_init.o prmem_region.o prmem_misc.o
-obj-y += prmem_allocator.o
+obj-y += prmem_allocator.o prmem_instance.o
diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
index d23833d296fe..166fca688ab3 100644
--- a/kernel/prmem/prmem_init.c
+++ b/kernel/prmem/prmem_init.c
@@ -21,6 +21,7 @@ void __init prmem_init(void)
prmem->metadata = prmem_metadata;
prmem->size = prmem_size;
INIT_LIST_HEAD(&prmem->regions);
+ INIT_LIST_HEAD(&prmem->instances);

if (!prmem_add_region(prmem_pa, prmem_size))
return;
diff --git a/kernel/prmem/prmem_instance.c b/kernel/prmem/prmem_instance.c
new file mode 100644
index 000000000000..ee3554d0ab8b
--- /dev/null
+++ b/kernel/prmem/prmem_instance.c
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Persistent-Across-Kexec memory (prmem) - Persistent instances.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman ([email protected])
+ */
+#include <linux/prmem.h>
+
+static struct prmem_instance *prmem_find(char *subsystem, char *name)
+{
+ struct prmem_instance *instance;
+
+ list_for_each_entry(instance, &prmem->instances, node) {
+ if (!strcmp(instance->subsystem, subsystem) &&
+ !strcmp(instance->name, name)) {
+ return instance;
+ }
+ }
+ return NULL;
+}
+
+void *prmem_get(char *subsystem, char *name, bool create)
+{
+ int subsystem_len = strlen(subsystem);
+ int name_len = strlen(name);
+ struct prmem_instance *instance;
+
+ /*
+ * In early boot, you are allowed to get an existing instance. But
+ * you are not allowed to create one until prmem is fully initialized.
+ */
+ if (!prmem || (!prmem_inited && create))
+ return NULL;
+
+ if (!subsystem_len || subsystem_len >= PRMEM_MAX_NAME ||
+ !name_len || name_len >= PRMEM_MAX_NAME) {
+ return NULL;
+ }
+
+ spin_lock(&prmem_lock);
+
+ /* Check if it already exists. */
+ instance = prmem_find(subsystem, name);
+ if (instance || !create)
+ goto unlock;
+
+ instance = prmem_alloc_locked(sizeof(*instance));
+ if (!instance)
+ goto unlock;
+
+ strcpy(instance->subsystem, subsystem);
+ strcpy(instance->name, name);
+ instance->data = NULL;
+ instance->size = 0;
+
+ list_add_tail(&instance->node, &prmem->instances);
+unlock:
+ spin_unlock(&prmem_lock);
+ return instance;
+}
+EXPORT_SYMBOL_GPL(prmem_get);
+
+void prmem_set_data(struct prmem_instance *instance, void *data, size_t size)
+{
+ if (!prmem_inited)
+ return;
+
+ spin_lock(&prmem_lock);
+ instance->data = data;
+ instance->size = size;
+ spin_unlock(&prmem_lock);
+}
+EXPORT_SYMBOL_GPL(prmem_set_data);
+
+void prmem_get_data(struct prmem_instance *instance, void **data, size_t *size)
+{
+ if (!prmem)
+ return;
+
+ spin_lock(&prmem_lock);
+ *data = instance->data;
+ *size = instance->size;
+ spin_unlock(&prmem_lock);
+}
+EXPORT_SYMBOL_GPL(prmem_get_data);
+
+bool prmem_put(struct prmem_instance *instance)
+{
+ if (!prmem_inited)
+ return true;
+
+ spin_lock(&prmem_lock);
+
+ if (instance->data) {
+ /*
+ * Caller is responsible for freeing instance data and setting
+ * it to NULL.
+ */
+ spin_unlock(&prmem_lock);
+ return false;
+ }
+
+ /* Free instance. */
+ list_del(&instance->node);
+ prmem_free_locked(instance, sizeof(*instance));
+
+ spin_unlock(&prmem_lock);
+ return true;
+}
+EXPORT_SYMBOL_GPL(prmem_put);
+
+int prmem_list(char *subsystem, prmem_list_func_t func, void *arg)
+{
+ int subsystem_len = strlen(subsystem);
+ struct prmem_instance *instance;
+	int ret = 0;
+
+ if (!prmem)
+ return 0;
+
+ if (!subsystem_len || subsystem_len >= PRMEM_MAX_NAME)
+ return -EINVAL;
+
+ spin_lock(&prmem_lock);
+
+ list_for_each_entry(instance, &prmem->instances, node) {
+ if (strcmp(instance->subsystem, subsystem))
+ continue;
+
+ ret = func(instance, arg);
+ if (ret)
+ break;
+ }
+
+ spin_unlock(&prmem_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(prmem_list);
--
2.25.1

2023-10-16 23:33:39

by Madhavan T. Venkataraman

[permalink] [raw]
Subject: [RFC PATCH v1 09/10] mm/prmem: Implement DAX support for Persistent Ramdisks.

From: "Madhavan T. Venkataraman" <[email protected]>

One problem with using a ramdisk is that the page cache will contain
redundant copies of ramdisk data. To avoid this, implement DAX support
for persistent ramdisks.

To make use of this, the filesystem installed on the ramdisk must
support DAX (e.g., ext4). Mount the filesystem with the dax option. E.g.,

sudo mount -t ext4 -o dax /dev/pram0 /path/to/mountpoint

Signed-off-by: Madhavan T. Venkataraman <[email protected]>
---
drivers/block/brd.c | 106 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 106 insertions(+)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3a05e56ca16f..d4a42d3bd212 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -25,6 +25,9 @@
#include <linux/backing-dev.h>
#include <linux/debugfs.h>
#include <linux/prmem.h>
+#include <linux/pfn_t.h>
+#include <linux/dax.h>
+#include <linux/uio.h>

#include <linux/uaccess.h>

@@ -42,6 +45,7 @@ struct brd_device {
enum brd_type brd_type;
struct gendisk *brd_disk;
struct list_head brd_list;
+ struct dax_device *brd_dax;

/*
* Backing store of pages. This is the contents of the block device.
@@ -58,6 +62,8 @@ static void brd_free_page(struct brd_device *brd, struct page *page);
static void brd_xa_init(struct brd_device *brd);
static void brd_init_name(struct brd_device *brd, char *name);
static void brd_set_capacity(struct brd_device *brd);
+static int brd_dax_init(struct brd_device *brd);
+static void brd_dax_cleanup(struct brd_device *brd);

/*
* Look up and return a brd's page for a given sector.
@@ -408,6 +414,9 @@ static int brd_alloc(int i)
strscpy(disk->disk_name, buf, DISK_NAME_LEN);
brd_set_capacity(brd);

+ if (brd_dax_init(brd))
+ goto out_clean_dax;
+
/*
* This is so fdisk will align partitions on 4k, because of
* direct_access API needing 4k alignment, returning a PFN
@@ -421,6 +430,8 @@ static int brd_alloc(int i)
blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, disk->queue);
blk_queue_flag_set(QUEUE_FLAG_NOWAIT, disk->queue);
+ if (brd->brd_dax)
+ blk_queue_flag_set(QUEUE_FLAG_DAX, disk->queue);
err = add_disk(disk);
if (err)
goto out_cleanup_disk;
@@ -429,6 +440,8 @@ static int brd_alloc(int i)

out_cleanup_disk:
put_disk(disk);
+out_clean_dax:
+ brd_dax_cleanup(brd);
out_free_dev:
list_del(&brd->brd_list);
brd_free_device(brd);
@@ -447,6 +460,7 @@ static void brd_cleanup(void)
debugfs_remove_recursive(brd_debugfs_dir);

list_for_each_entry_safe(brd, next, &brd_devices, brd_list) {
+ brd_dax_cleanup(brd);
del_gendisk(brd->brd_disk);
put_disk(brd->brd_disk);
brd_free_pages(brd);
@@ -659,3 +673,95 @@ static void brd_set_capacity(struct brd_device *brd)
disksize = prd_data[brd->brd_number].size;
set_capacity(brd->brd_disk, disksize * 2);
}
+
+static bool prd_dax_enabled = IS_ENABLED(CONFIG_FS_DAX);
+
+static long brd_dax_direct_access(struct dax_device *dax_dev,
+ pgoff_t pgoff, long nr_pages,
+ enum dax_access_mode mode,
+ void **kaddr, pfn_t *pfn);
+static int brd_dax_zero_page_range(struct dax_device *dax_dev,
+ pgoff_t pgoff, size_t nr_pages);
+
+static const struct dax_operations brd_dax_ops = {
+ .direct_access = brd_dax_direct_access,
+ .zero_page_range = brd_dax_zero_page_range,
+};
+
+static int brd_dax_init(struct brd_device *brd)
+{
+ if (!prd_dax_enabled || brd->brd_type == BRD_NORMAL)
+ return 0;
+
+ brd->brd_dax = alloc_dax(brd, &brd_dax_ops);
+ if (IS_ERR(brd->brd_dax)) {
+ pr_warn("%s: DAX failed\n", __func__);
+ brd->brd_dax = NULL;
+ return -ENOMEM;
+ }
+
+ if (dax_add_host(brd->brd_dax, brd->brd_disk)) {
+ pr_warn("%s: DAX add failed\n", __func__);
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static void brd_dax_cleanup(struct brd_device *brd)
+{
+ if (!prd_dax_enabled || brd->brd_type == BRD_NORMAL)
+ return;
+
+ if (brd->brd_dax) {
+ dax_remove_host(brd->brd_disk);
+ kill_dax(brd->brd_dax);
+ put_dax(brd->brd_dax);
+ }
+}
+
+static int brd_dax_zero_page_range(struct dax_device *dax_dev,
+ pgoff_t pgoff, size_t nr_pages)
+{
+ long rc;
+ void *kaddr;
+
+ rc = dax_direct_access(dax_dev, pgoff, nr_pages, DAX_ACCESS,
+ &kaddr, NULL);
+ if (rc < 0)
+ return rc;
+ memset(kaddr, 0, nr_pages << PAGE_SHIFT);
+ return 0;
+}
+
+static long __brd_direct_access(struct brd_device *brd, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn)
+{
+ struct page *page;
+ sector_t sector = (sector_t) pgoff << PAGE_SECTORS_SHIFT;
+ int ret;
+
+ if (!brd)
+ return -ENODEV;
+
+ ret = brd_insert_page(brd, sector, GFP_NOWAIT);
+ if (ret)
+ return ret;
+
+ page = brd_lookup_page(brd, sector);
+ if (!page)
+ return -ENOSPC;
+
+ *kaddr = page_address(page);
+ if (pfn)
+ *pfn = page_to_pfn_t(page);
+
+ return 1;
+}
+
+static long brd_dax_direct_access(struct dax_device *dax_dev,
+ pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+ void **kaddr, pfn_t *pfn)
+{
+ struct brd_device *brd = dax_get_private(dax_dev);
+
+ return __brd_direct_access(brd, pgoff, nr_pages, kaddr, pfn);
+}
--
2.25.1

2023-10-17 08:32:13

by Alexander Graf

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem)

Hey Madhavan!

This patch set looks super exciting - thanks a lot for putting it
together. We've been poking at a very similar direction for a while as
well and will discuss the fundamental problem of how to persist kernel
metadata across kexec at LPC:

  https://lpc.events/event/17/contributions/1485/

It would be great to have you in the room as well then.

Some more comments inline.

On 17.10.23 01:32, [email protected] wrote:
> From: "Madhavan T. Venkataraman" <[email protected]>
>
> Introduction
> ============
>
> This feature can be used to persist kernel and user data across kexec reboots
> in RAM for various uses. E.g., persisting:
>
> - cached data. E.g., database caches.
> - state. E.g., KVM guest states.
> - historical information since the last cold boot. E.g., events, logs
> and journals.
> - measurements for integrity checks on the next boot.
> - driver data.
> - IOMMU mappings.
> - MMIO config information.
>
> This is useful on systems where there is no non-volatile storage or
> non-volatile storage is too small or too slow.


This is useful in more situations. We for example need it to do a kexec
while a virtual machine is in suspended state, but has IOMMU mappings
intact (Live Update). For that, we need to ensure DMA can still reach
the VM memory and that everything gets reassembled identically and
without interruptions on the receiving end.


> The following sections describe the implementation.
>
> I have enhanced the ram disk block device driver to provide persistent ram
> disks on which any filesystem can be created. This is for persisting user data.
> I have also implemented DAX support for the persistent ram disks.


This is probably the least interesting of the enablements, right? You
can already today reserve RAM on boot as DAX block device and use it for
that purpose.


> I am also working on making ZRAM persistent.
>
> I have also briefly discussed the following use cases:
>
> - Persisting IOMMU mappings
> - Remembering DMA pages
> - Reserving pages that encounter memory errors
> - Remembering IMA measurements for integrity checks
> - Remembering MMIO config info
> - Implementing prmemfs (special filesystem tailored for persistence)
>
> Allocate metadata
> =================
>
> Define a metadata structure to store all persistent memory related information.
> The metadata fits into one page. On a cold boot, allocate and initialize the
> metadata page.
>
> Allocate data
> =============
>
> On a cold boot, allocate some memory for storing persistent data. Call it
> persistent memory. Specify the size in a command line parameter:
>
> prmem=size[KMG][,max_size[KMG]]
>
> size Initial amount of memory allocated to prmem during boot
> max_size Maximum amount of memory that can be allocated to prmem
>
> When the initial memory is exhausted via allocations, expand prmem dynamically
> up to max_size. Expansion is done by allocating from the buddy allocator.
> Record all allocations in the metadata.


I don't understand why we need a separate allocator. Why can't we just
use normal Linux allocations and serialize their location for handover?
We would obviously still need to find a large contiguous piece of memory
for the target kernel to bootstrap itself into until it can read which
pages it can and can not use, but we can do that allocation in the
source environment using CMA, no?

What I'm trying to say is: I think we're better off separating the
handover mechanism from the allocation mechanism. If we can implement
handover without a new allocator, we can use it for simple things with a
slight runtime penalty. To accelerate the handover then, we can later
add a compacting allocator that can use the handover mechanism we
already built to persist itself.



I have a WIP branch where I'm toying with such a handover mechanism that
uses device tree to serialize/deserialize state. By standardizing the
property naming, we can in the receiving kernel mark all persistent
allocations as reserved and then slowly either free them again or mark
them as in-use one by one:

https://github.com/agraf/linux/commit/fd5736a21d549a9a86c178c91acb29ed7f364f42

I used ftrace as example payload to persist: With the handover mechanism
in place, we serialize/deserialize ftrace ring buffer metadata and are
thus able to read traces of the previous system after kexec. This way,
you can for example profile the kexec exit path.

It's not even in RFC state yet, there are a few things where I would
need a couple days to think hard about data structures, layouts and
other problems :). But I believe from the patch you get the idea.

One such user of kho could be a new allocator like prmem and each
subsystem's serialization code could choose to rely on the prmem
subsystem to persist data instead of doing it themselves. That way you
get a very non-intrusive enablement path for kexec handover, easily
amendable data structures that can change compatibly over time as well
as the ability to recreate ephemeral data structure based on persistent
information - which will be necessary to persist VFIO containers.


Alex




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879


2023-10-17 18:09:29

by Madhavan T. Venkataraman

[permalink] [raw]
Subject: Re: [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem)

Hey Alex,

Thanks a lot for your comments!

On 10/17/23 03:31, Alexander Graf wrote:
> Hey Madhavan!
>
> This patch set looks super exciting - thanks a lot for putting it together. We've been poking at a very similar direction for a while as well and will discuss the fundamental problem of how to persist kernel metadata across kexec at LPC:
>
>   https://lpc.events/event/17/contributions/1485/
>
> It would be great to have you in the room as well then.
>

Yes. I am planning to attend. But I am attending virtually as I am not able to travel.

> Some more comments inline.
>
> On 17.10.23 01:32, [email protected] wrote:
>> From: "Madhavan T. Venkataraman" <[email protected]>
>>
>> Introduction
>> ============
>>
>> This feature can be used to persist kernel and user data across kexec reboots
>> in RAM for various uses. E.g., persisting:
>>
>>          - cached data. E.g., database caches.
>>          - state. E.g., KVM guest states.
>>          - historical information since the last cold boot. E.g., events, logs
>>            and journals.
>>          - measurements for integrity checks on the next boot.
>>          - driver data.
>>          - IOMMU mappings.
>>          - MMIO config information.
>>
>> This is useful on systems where there is no non-volatile storage or
>> non-volatile storage is too small or too slow.
>
>
> This is useful in more situations. We for example need it to do a kexec while a virtual machine is in suspended state, but has IOMMU mappings intact (Live Update). For that, we need to ensure DMA can still reach the VM memory and that everything gets reassembled identically and without interruptions on the receiving end.
>
>

I see.

>> The following sections describe the implementation.
>>
>> I have enhanced the ram disk block device driver to provide persistent ram
>> disks on which any filesystem can be created. This is for persisting user data.
>> I have also implemented DAX support for the persistent ram disks.
>
>
> This is probably the least interesting of the enablements, right? You can already today reserve RAM on boot as DAX block device and use it for that purpose.
>

Yes. pmem provides that functionality.

There are a few differences, though I don't have a good feel for how important these differences are to users. Maybe they are not very significant. E.g.,

- pmem regions need some setup using the ndctl command.
- IIUC, one needs to specify a starting address and a size for a pmem region. Having to specify a starting address may make it somewhat less flexible from a configuration point of view.
- In the case of pmem, the entire range of memory is set aside. In the case of the prmem persistent ram disk, pages are allocated as needed. So, persistent memory is shared among multiple
consumers more flexibly.

Also, Greg H. wanted to see a filesystem-based use case presented for persistent memory so we can see how it all comes together. I am working on prmemfs (a special FS tailored for persistence), but that will take some time. So, I wanted to present this ram disk use case as a more flexible alternative to pmem.

But you are right. They are equivalent for all practical purposes.

>
>> I am also working on making ZRAM persistent.
>>
>> I have also briefly discussed the following use cases:
>>
>>          - Persisting IOMMU mappings
>>          - Remembering DMA pages
>>          - Reserving pages that encounter memory errors
>>          - Remembering IMA measurements for integrity checks
>>          - Remembering MMIO config info
>>          - Implementing prmemfs (special filesystem tailored for persistence)
>>
>> Allocate metadata
>> =================
>>
>> Define a metadata structure to store all persistent memory related information.
>> The metadata fits into one page. On a cold boot, allocate and initialize the
>> metadata page.
>>
>> Allocate data
>> =============
>>
>> On a cold boot, allocate some memory for storing persistent data. Call it
>> persistent memory. Specify the size in a command line parameter:
>>
>>          prmem=size[KMG][,max_size[KMG]]
>>
>>          size            Initial amount of memory allocated to prmem during boot
>>          max_size        Maximum amount of memory that can be allocated to prmem
>>
>> When the initial memory is exhausted via allocations, expand prmem dynamically
>> up to max_size. Expansion is done by allocating from the buddy allocator.
>> Record all allocations in the metadata.
>
>
> I don't understand why we need a separate allocator. Why can't we just use normal Linux allocations and serialize their location for handover? We would obviously still need to find a large contiguous piece of memory for the target kernel to bootstrap itself into until it can read which pages it can and can not use, but we can do that allocation in the source environment using CMA, no?
>
> What I'm trying to say is: I think we're better off separating the handover mechanism from the allocation mechanism. If we can implement handover without a new allocator, we can use it for simple things with a slight runtime penalty. To accelerate the handover then, we can later add a compacting allocator that can use the handover mechanism we already built to persist itself.
>
>
>
> I have a WIP branch where I'm toying with such a handover mechanism that uses device tree to serialize/deserialize state. By standardizing the property naming, we can in the receiving kernel mark all persistent allocations as reserved and then slowly either free them again or mark them as in-use one by one:
>
> https://github.com/agraf/linux/commit/fd5736a21d549a9a86c178c91acb29ed7f364f42
>
> I used ftrace as example payload to persist: With the handover mechanism in place, we serialize/deserialize ftrace ring buffer metadata and are thus able to read traces of the previous system after kexec. This way, you can for example profile the kexec exit path.
>
> It's not even in RFC state yet, there are a few things where I would need a couple days to think hard about data structures, layouts and other problems :). But I believe from the patch you get the idea.
>
> One such user of kho could be a new allocator like prmem and each subsystem's serialization code could choose to rely on the prmem subsystem to persist data instead of doing it themselves. That way you get a very non-intrusive enablement path for kexec handover, easily amendable data structures that can change compatibly over time as well as the ability to recreate ephemeral data structure based on persistent information - which will be necessary to persist VFIO containers.
>

OK. I will study your changes and your comments. I will send my feedback as well.

Thanks again!

Madhavan

>
> Alex
>
>
>
>
>
>