From: Nabeel M Mohamed <[email protected]>
This patch series introduces the mpool object storage media pool driver.
Mpool implements a simple transactional object store on top of block
storage devices.
Mpool was developed for the Heterogeneous-Memory Storage Engine (HSE)
project, which is a high-performance key-value storage engine designed
for SSDs. HSE stores its data exclusively in mpool.
Mpool is readily applicable to other storage systems built on immutable
objects. For example, it suits the many databases that store records
in immutable SSTables organized as an LSM-tree or similar data
structure.
We developed mpool for HSE storage, rather than building on a file
system or raw block device, for several reasons.
A primary motivator was the need for a storage model that maps naturally
to conventional block storage devices, as well as to emerging device
interfaces we plan to support in the future, such as
* NVMe Zoned Namespaces (ZNS)
* NVMe Streams
* Persistent memory accessed via CXL or similar technologies
Another motivator was the need for a storage model that readily supports
multiple classes of storage devices or media in a single storage pool,
such as
* QLC SSDs for storing the bulk of objects, and
* 3DXP SSDs or persistent memory for storing objects requiring
low-latency access
The mpool object storage model meets these needs. It also provides
other features that benefit storage systems built on immutable objects,
including
* Facilities to memory-map a specified collection of objects into a
linear address space
* Concurrent direct and memory-mapped access to object data, which
  greatly reduces page cache pollution from background operations such
  as LSM-tree compaction
* Proactive eviction of object data from the page cache, based on
object-level metrics, to avoid excessive memory pressure and its
associated performance impacts
* High concurrency and short code paths for efficient access to
low-latency storage devices
HSE takes advantage of all these mpool features to achieve high
throughput with low tail-latencies.
Mpool is implemented as a character device driver where
* /dev/mpoolctl is the control file (minor number 0) supporting mpool
management ioctls
* /dev/mpool/<mpool-name> are mpool files (minor numbers >0), one per
mpool, supporting object management ioctls
CLI/UAPI access to /dev/mpoolctl and /dev/mpool/<mpool-name> is
controlled by their UID, GID, and mode bits. To provide a familiar look
and feel, the mpool management model and CLI are intentionally aligned
to those of LVM to the degree practical.
An mpool is created with a block storage device specified for its
required capacity media class, and optionally a second block storage
device specified for its staging media class. We recommend virtual
block devices (such as LVM logical volumes) to aggregate the performance
and capacity of multiple physical block devices, to enable sharing of
physical block devices between mpools (or for other uses), and to
support extending the size of a block device used for an mpool media
class. The libblkid library recognizes mpool formatted block devices as
of util-linux v2.32.
Mpool implements a transactional object store with two simple object
abstractions: mblocks and mlogs.
Mblock objects are containers comprising a linear sequence of bytes that
can be written exactly once, are immutable after writing, and can be
read in whole or in part as needed until deleted. Mblocks in a media
class are currently a fixed size, configured when an mpool is created,
though the amount of data written to individual mblocks will differ.
Mlog objects are containers for record logging. Records of arbitrary
size can be appended to an mlog until it is full. Once full, an mlog
must be erased before additional records can be appended. Mlog records
can be read sequentially from the beginning at any time. Mlogs in a
media class are always a multiple of the mblock size for that media
class.
Mblock and mlog writes avoid the page cache. Mblocks are written,
committed, and made immutable before they can be read either directly
(avoiding the page cache) or memory-mapped. Mlogs are always read and
updated directly (avoiding the page cache) and cannot be memory-mapped.
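As an illustration, the mblock lifecycle maps onto a small set of
UAPI calls. The sketch below is illustrative only: the call names
follow the out-of-tree libmpool (linked below), but the header path
and exact signatures are assumptions, not part of this patchset, and
error handling is omitted.

  #include <stdint.h>
  #include <sys/uio.h>
  #include <mpool/mpool.h>   /* UAPI header; path assumed */

  void mblock_example(struct mpool *mp, void *buf, size_t len)
  {
          struct iovec iov = { .iov_base = buf, .iov_len = len };
          uint64_t oid;

          mpool_mblock_alloc(mp, MP_MED_CAPACITY, &oid, NULL); /* allocate */
          mpool_mblock_write(mp, oid, &iov, 1);    /* write exactly once */
          mpool_mblock_commit(mp, oid);            /* now immutable */
          mpool_mblock_read(mp, oid, &iov, 1, 0);  /* read until... */
          mpool_mblock_delete(mp, oid);            /* ...deleted */
  }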
Mpool also provides the metadata container (MDC) APIs that clients can
use to simplify storing and maintaining metadata. These MDC APIs are
helper functions built on a pair of mlogs per MDC.
The mpool Wiki contains full details on the
* Management model in the "Configure mpools" section
* Object model in the "Develop mpool Applications" section
* Kernel module architecture in the "Explore mpool Internals" section,
which provides context for reviewing this patch series
See https://github.com/hse-project/mpool/wiki
The mpool UAPI library and out-of-tree kernel module are available on
GitHub at:
https://github.com/hse-project/mpool
https://github.com/hse-project/mpool-kmod
The HSE key-value storage engine is available on GitHub at:
https://github.com/hse-project/hse
Nabeel M Mohamed (22):
mpool: add utility routines and ioctl definitions
mpool: add in-memory struct definitions
mpool: add on-media struct definitions
mpool: add pool drive component which handles mpool IO using the block
layer API
mpool: add space map component which manages free space on mpool
devices
mpool: add on-media pack, unpack and upgrade routines
mpool: add superblock management routines
mpool: add pool metadata routines to manage object lifecycle and IO
mpool: add mblock lifecycle management and IO routines
mpool: add mlog IO utility routines
mpool: add mlog lifecycle management and IO routines
mpool: add metadata container or mlog-pair framework
mpool: add utility routines for mpool lifecycle management
mpool: add pool metadata routines to create persistent mpools
mpool: add mpool lifecycle management routines
mpool: add mpool control plane utility routines
mpool: add mpool lifecycle management ioctls
mpool: add object lifecycle management ioctls
mpool: add support to mmap arbitrary collection of mblocks
mpool: add support to proactively evict cached mblock data from the
page-cache
mpool: add documentation
mpool: add Kconfig and Makefile
drivers/Kconfig | 2 +
drivers/Makefile | 1 +
drivers/mpool/Kconfig | 28 +
drivers/mpool/Makefile | 11 +
drivers/mpool/assert.h | 25 +
drivers/mpool/init.c | 126 ++
drivers/mpool/init.h | 17 +
drivers/mpool/mblock.c | 432 +++++
drivers/mpool/mblock.h | 161 ++
drivers/mpool/mcache.c | 1036 ++++++++++++
drivers/mpool/mcache.h | 102 ++
drivers/mpool/mclass.c | 103 ++
drivers/mpool/mclass.h | 137 ++
drivers/mpool/mdc.c | 486 ++++++
drivers/mpool/mdc.h | 106 ++
drivers/mpool/mlog.c | 1667 ++++++++++++++++++
drivers/mpool/mlog.h | 212 +++
drivers/mpool/mlog_utils.c | 1352 +++++++++++++++
drivers/mpool/mlog_utils.h | 63 +
drivers/mpool/mp.c | 1086 ++++++++++++
drivers/mpool/mp.h | 231 +++
drivers/mpool/mpcore.c | 987 +++++++++++
drivers/mpool/mpcore.h | 354 ++++
drivers/mpool/mpctl.c | 2801 +++++++++++++++++++++++++++++++
drivers/mpool/mpctl.h | 59 +
drivers/mpool/mpool-locking.rst | 90 +
drivers/mpool/mpool_ioctl.h | 636 +++++++
drivers/mpool/mpool_printk.h | 44 +
drivers/mpool/omf.c | 1320 +++++++++++++++
drivers/mpool/omf.h | 593 +++++++
drivers/mpool/omf_if.h | 381 +++++
drivers/mpool/params.h | 116 ++
drivers/mpool/pd.c | 426 +++++
drivers/mpool/pd.h | 202 +++
drivers/mpool/pmd.c | 2046 ++++++++++++++++++++++
drivers/mpool/pmd.h | 379 +++++
drivers/mpool/pmd_obj.c | 1569 +++++++++++++++++
drivers/mpool/pmd_obj.h | 499 ++++++
drivers/mpool/reaper.c | 692 ++++++++
drivers/mpool/reaper.h | 71 +
drivers/mpool/sb.c | 625 +++++++
drivers/mpool/sb.h | 162 ++
drivers/mpool/smap.c | 1031 ++++++++++++
drivers/mpool/smap.h | 334 ++++
drivers/mpool/sysfs.c | 48 +
drivers/mpool/sysfs.h | 48 +
drivers/mpool/upgrade.c | 138 ++
drivers/mpool/upgrade.h | 128 ++
drivers/mpool/uuid.h | 59 +
49 files changed, 23222 insertions(+)
create mode 100644 drivers/mpool/Kconfig
create mode 100644 drivers/mpool/Makefile
create mode 100644 drivers/mpool/assert.h
create mode 100644 drivers/mpool/init.c
create mode 100644 drivers/mpool/init.h
create mode 100644 drivers/mpool/mblock.c
create mode 100644 drivers/mpool/mblock.h
create mode 100644 drivers/mpool/mcache.c
create mode 100644 drivers/mpool/mcache.h
create mode 100644 drivers/mpool/mclass.c
create mode 100644 drivers/mpool/mclass.h
create mode 100644 drivers/mpool/mdc.c
create mode 100644 drivers/mpool/mdc.h
create mode 100644 drivers/mpool/mlog.c
create mode 100644 drivers/mpool/mlog.h
create mode 100644 drivers/mpool/mlog_utils.c
create mode 100644 drivers/mpool/mlog_utils.h
create mode 100644 drivers/mpool/mp.c
create mode 100644 drivers/mpool/mp.h
create mode 100644 drivers/mpool/mpcore.c
create mode 100644 drivers/mpool/mpcore.h
create mode 100644 drivers/mpool/mpctl.c
create mode 100644 drivers/mpool/mpctl.h
create mode 100644 drivers/mpool/mpool-locking.rst
create mode 100644 drivers/mpool/mpool_ioctl.h
create mode 100644 drivers/mpool/mpool_printk.h
create mode 100644 drivers/mpool/omf.c
create mode 100644 drivers/mpool/omf.h
create mode 100644 drivers/mpool/omf_if.h
create mode 100644 drivers/mpool/params.h
create mode 100644 drivers/mpool/pd.c
create mode 100644 drivers/mpool/pd.h
create mode 100644 drivers/mpool/pmd.c
create mode 100644 drivers/mpool/pmd.h
create mode 100644 drivers/mpool/pmd_obj.c
create mode 100644 drivers/mpool/pmd_obj.h
create mode 100644 drivers/mpool/reaper.c
create mode 100644 drivers/mpool/reaper.h
create mode 100644 drivers/mpool/sb.c
create mode 100644 drivers/mpool/sb.h
create mode 100644 drivers/mpool/smap.c
create mode 100644 drivers/mpool/smap.h
create mode 100644 drivers/mpool/sysfs.c
create mode 100644 drivers/mpool/sysfs.h
create mode 100644 drivers/mpool/upgrade.c
create mode 100644 drivers/mpool/upgrade.h
create mode 100644 drivers/mpool/uuid.h
--
2.17.2
From: Nabeel M Mohamed <[email protected]>
This adds the open, release and mpool management ioctls for
the mpool driver.
The create, destroy, activate, deactivate and rename ioctls
are issued to the mpool control device (/dev/mpoolctl),
and the rest are issued to the mpool device
(/dev/mpool/<mpool_name>).
The mpool control device is owned by (root, disk) with
mode 0664. A non-default uid, gid, and mode can be assigned to
an mpool device either at create time or post-creation using
the params-set ioctl (MPIOC_PARAMS_SET).
Both the per-mpool and common parameters are available in
the sysfs device tree path created by the kernel for each mpool
device minor (/sys/devices/virtual/mpool). Mpool parameters
cannot be changed via the sysfs tree at this point.
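A client can read the per-mpool parameters directly from sysfs; a
minimal sketch, assuming an mpool named "mp1" (the "parameters"
attribute group is created by mpc_params_register() below):

  #include <stdio.h>

  int main(void)
  {
          FILE *fp = fopen("/sys/devices/virtual/mpool/mp1/parameters/mode", "r");
          char buf[16];

          if (fp && fgets(buf, sizeof(buf), fp))
                  printf("mp1 mode: %s", buf);    /* e.g. "0660" */
          if (fp)
                  fclose(fp);
          return 0;
  }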
The mpool management ioctl handlers invoke the mpool lifecycle
management routines to administer mpools. Activating an mpool
creates a unit object which stores key state: a reference to the
device object, a reference to the mpc_mpool instance containing the
per-mpool private data, device properties, ownership and mode bits,
the device open count, flags, and so on. The
per-mpool parameters are persisted in MDC0 at activation.
Deactivating an mpool tears down the unit object and releases
all its associated resources.
An mpool can be renamed only when it's deactivated. Renaming
an mpool updates the superblock on all its constituent storage
volumes with the new mpool name.
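As a rough sketch of the call flow from user space (the ioctl
command, struct, and field names below are from this series; the
device path handling is simplified and error checks are trimmed):

  #include <fcntl.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include "mpool_ioctl.h"

  /* Activate mpool "mp1" on one drive described by props. */
  int activate_mp1(struct pd_prop *props, char *dpath)
  {
          struct mpioc_mpool mp = { };
          int fd, rc;

          fd = open("/dev/mpoolctl", O_RDWR);
          if (fd < 0)
                  return -1;

          strcpy(mp.mp_params.mp_name, "mp1");
          mp.mp_dpathc = 1;                  /* one member drive */
          mp.mp_dpaths = dpath;              /* newline-separated paths */
          mp.mp_dpathssz = strlen(dpath) + 1;
          mp.mp_pd_prop = props;             /* one pd_prop per drive */

          rc = ioctl(fd, MPIOC_MP_ACTIVATE, &mp); /* -> mpioc_mp_cmd() */
          close(fd);
          return rc;
  }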
Co-developed-by: Greg Becker <[email protected]>
Signed-off-by: Greg Becker <[email protected]>
Co-developed-by: Pierre Labat <[email protected]>
Signed-off-by: Pierre Labat <[email protected]>
Co-developed-by: John Groves <[email protected]>
Signed-off-by: John Groves <[email protected]>
Signed-off-by: Nabeel M Mohamed <[email protected]>
---
drivers/mpool/mpctl.c | 1719 +++++++++++++++++++++++++++++++++++++++--
1 file changed, 1667 insertions(+), 52 deletions(-)
diff --git a/drivers/mpool/mpctl.c b/drivers/mpool/mpctl.c
index 4f3600840ff0..002321c8689b 100644
--- a/drivers/mpool/mpctl.c
+++ b/drivers/mpool/mpctl.c
@@ -78,9 +78,14 @@ static const struct file_operations mpc_fops_default;
static struct mpc_softstate mpc_softstate;
+static struct backing_dev_info *mpc_bdi;
+
static unsigned int mpc_ctl_uid __read_mostly;
static unsigned int mpc_ctl_gid __read_mostly = 6;
static unsigned int mpc_ctl_mode __read_mostly = 0664;
+static unsigned int mpc_default_uid __read_mostly;
+static unsigned int mpc_default_gid __read_mostly = 6;
+static unsigned int mpc_default_mode __read_mostly = 0660;
static const struct mpc_uinfo mpc_uinfo_ctl = {
.ui_typename = "mpoolctl",
@@ -112,6 +117,202 @@ static inline gid_t mpc_current_gid(void)
return from_kgid(current_user_ns(), current_gid());
}
+#define MPC_MPOOL_PARAMS_CNT 7
+
+static ssize_t mpc_uid_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+ return scnprintf(buf, PAGE_SIZE, "%d\n", dev_to_unit(dev)->un_uid);
+}
+
+static ssize_t mpc_gid_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+ return scnprintf(buf, PAGE_SIZE, "%d\n", dev_to_unit(dev)->un_gid);
+}
+
+static ssize_t mpc_mode_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+ return scnprintf(buf, PAGE_SIZE, "0%o\n", dev_to_unit(dev)->un_mode);
+}
+
+static ssize_t mpc_ra_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+ return scnprintf(buf, PAGE_SIZE, "%u\n", dev_to_unit(dev)->un_ra_pages_max);
+}
+
+static ssize_t mpc_label_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+ return scnprintf(buf, PAGE_SIZE, "%s\n", dev_to_unit(dev)->un_label);
+}
+
+static ssize_t mpc_type_show(struct device *dev, struct device_attribute *da, char *buf)
+{
+ struct mpool_uuid uuid;
+ char uuid_str[MPOOL_UUID_STRING_LEN + 1] = { };
+
+ memcpy(uuid.uuid, dev_to_unit(dev)->un_utype.b, MPOOL_UUID_SIZE);
+ mpool_unparse_uuid(&uuid, uuid_str);
+
+ return scnprintf(buf, PAGE_SIZE, "%s\n", uuid_str);
+}
+
+static void mpc_mpool_params_add(struct device_attribute *dattr)
+{
+ MPC_ATTR_RO(dattr++, uid);
+ MPC_ATTR_RO(dattr++, gid);
+ MPC_ATTR_RO(dattr++, mode);
+ MPC_ATTR_RO(dattr++, ra);
+ MPC_ATTR_RO(dattr++, label);
+ MPC_ATTR_RO(dattr, type);
+}
+
+static int mpc_params_register(struct mpc_unit *unit, int cnt)
+{
+ struct device_attribute *dattr;
+ struct mpc_attr *attr;
+ int rc;
+
+ attr = mpc_attr_create(unit->un_device, "parameters", cnt);
+ if (!attr)
+ return -ENOMEM;
+
+ dattr = attr->a_dattr;
+
+ /* Per-mpool parameters */
+ if (mpc_unit_ismpooldev(unit))
+ mpc_mpool_params_add(dattr);
+
+ rc = mpc_attr_group_create(attr);
+ if (rc) {
+ mpc_attr_destroy(attr);
+ return rc;
+ }
+
+ unit->un_attr = attr;
+
+ return 0;
+}
+
+static void mpc_params_unregister(struct mpc_unit *unit)
+{
+ mpc_attr_group_destroy(unit->un_attr);
+ mpc_attr_destroy(unit->un_attr);
+ unit->un_attr = NULL;
+}
+
+/**
+ * mpc_toascii() - convert string in place to restricted ASCII
+ * @str: string to convert; the unused remainder of str[] is zeroed
+ * @sz: size of str[]; returns the resulting string length
+ */
+static size_t mpc_toascii(char *str, size_t sz)
+{
+ size_t len = 0;
+ int i;
+
+ if (!str || sz < 1)
+ return 0;
+
+ if (str[0] == '-')
+ str[0] = '_';
+
+ for (i = 0; i < (sz - 1) && str[i]; ++i) {
+ if (isalnum(str[i]) || strchr("_.-", str[i]))
+ continue;
+
+ str[i] = '_';
+ }
+
+ len = i;
+
+ while (i < sz)
+ str[i++] = '\000';
+
+ return len;
+}
+
+static void mpool_params_merge_defaults(struct mpool_params *params)
+{
+ if (params->mp_spare_cap == MPOOL_SPARES_INVALID)
+ params->mp_spare_cap = MPOOL_SPARES_DEFAULT;
+
+ if (params->mp_spare_stg == MPOOL_SPARES_INVALID)
+ params->mp_spare_stg = MPOOL_SPARES_DEFAULT;
+
+ if (params->mp_ra_pages_max == U32_MAX)
+ params->mp_ra_pages_max = MPOOL_RA_PAGES_MAX;
+ params->mp_ra_pages_max = clamp_t(u32, params->mp_ra_pages_max, 0, MPOOL_RA_PAGES_MAX);
+
+ if (params->mp_mode != -1)
+ params->mp_mode &= 0777;
+
+ params->mp_rsvd0 = 0;
+ params->mp_rsvd1 = 0;
+ params->mp_rsvd2 = 0;
+ params->mp_rsvd3 = 0;
+ params->mp_rsvd4 = 0;
+
+ if (!strcmp(params->mp_label, MPOOL_LABEL_INVALID))
+ strcpy(params->mp_label, MPOOL_LABEL_DEFAULT);
+
+ mpc_toascii(params->mp_label, sizeof(params->mp_label));
+}
+
+static void mpool_to_mpcore_params(struct mpool_params *params, struct mpcore_params *mpc_params)
+{
+ u64 mdc0cap, mdcncap;
+ u32 mdcnum;
+
+ mpcore_params_defaults(mpc_params);
+
+ mdc0cap = (u64)params->mp_mdc0cap << 20;
+ mdcncap = (u64)params->mp_mdcncap << 20;
+ mdcnum = params->mp_mdcnum;
+
+ if (mdc0cap != 0)
+ mpc_params->mp_mdc0cap = mdc0cap;
+
+ if (mdcncap != 0)
+ mpc_params->mp_mdcncap = mdcncap;
+
+ if (mdcnum != 0)
+ mpc_params->mp_mdcnum = mdcnum;
+}
+
+static bool mpool_params_merge_config(struct mpool_params *params, struct mpool_config *cfg)
+{
+ uuid_le uuidnull = { };
+ bool changed = false;
+
+ if (params->mp_uid != -1 && params->mp_uid != cfg->mc_uid) {
+ cfg->mc_uid = params->mp_uid;
+ changed = true;
+ }
+
+ if (params->mp_gid != -1 && params->mp_gid != cfg->mc_gid) {
+ cfg->mc_gid = params->mp_gid;
+ changed = true;
+ }
+
+ if (params->mp_mode != -1 && params->mp_mode != cfg->mc_mode) {
+ cfg->mc_mode = params->mp_mode;
+ changed = true;
+ }
+
+	if (memcmp(&uuidnull, &params->mp_utype, sizeof(uuidnull)) &&
+	    memcmp(&params->mp_utype, &cfg->mc_utype, sizeof(params->mp_utype))) {
+		memcpy(&cfg->mc_utype, &params->mp_utype, sizeof(cfg->mc_utype));
+ changed = true;
+ }
+
+ if (strcmp(params->mp_label, MPOOL_LABEL_DEFAULT) &&
+ strncmp(params->mp_label, cfg->mc_label, sizeof(params->mp_label))) {
+ strlcpy(cfg->mc_label, params->mp_label, sizeof(cfg->mc_label));
+ changed = true;
+ }
+
+ return changed;
+}
+
/**
* mpc_mpool_release() - release kref handler for mpc_mpool object
* @refp: kref pointer
@@ -216,6 +417,9 @@ static void mpc_unit_release(struct kref *refp)
if (unit->un_mpool)
mpc_mpool_put(unit->un_mpool);
+ if (unit->un_attr)
+ mpc_params_unregister(unit);
+
if (unit->un_device)
device_destroy(ss->ss_class, unit->un_devno);
@@ -228,6 +432,89 @@ static void mpc_unit_put(struct mpc_unit *unit)
kref_put(&unit->un_ref, mpc_unit_release);
}
+/**
+ * mpc_unit_lookup() - Look up a unit by minor number.
+ * @minor: minor number
+ * @unitp: unit ptr
+ *
+ * Returns a referenced ptr to the unit (via *unitp) if found,
+ * otherwise it sets *unitp to NULL.
+ */
+static void mpc_unit_lookup(int minor, struct mpc_unit **unitp)
+{
+ struct mpc_softstate *ss = &mpc_softstate;
+ struct mpc_unit *unit;
+
+ *unitp = NULL;
+
+ mutex_lock(&ss->ss_lock);
+ unit = idr_find(&ss->ss_unitmap, minor);
+ if (unit) {
+ kref_get(&unit->un_ref);
+ *unitp = unit;
+ }
+ mutex_unlock(&ss->ss_lock);
+}
+
+/**
+ * mpc_unit_lookup_by_name_itercb() - Test to see if unit matches arg.
+ * @item: unit ptr
+ * @arg: argument vector base ptr
+ *
+ * This iterator callback is called by mpc_unit_lookup_by_name()
+ * for each unit in the units table.
+ *
+ * Return: ITERCB_DONE if a unit matching the given name is found, with
+ * a referenced unit ptr stored in argv[2]; otherwise ITERCB_NEXT.
+ */
+static int mpc_unit_lookup_by_name_itercb(int minor, void *item, void *arg)
+{
+ struct mpc_unit *unit = item;
+ void **argv = arg;
+ struct mpc_unit *parent = argv[0];
+ const char *name = argv[1];
+
+ if (!unit)
+ return ITERCB_NEXT;
+
+ if (mpc_unit_isctldev(parent) && !mpc_unit_ismpooldev(unit))
+ return ITERCB_NEXT;
+
+ if (parent->un_mpool && unit->un_mpool != parent->un_mpool)
+ return ITERCB_NEXT;
+
+ if (strcmp(unit->un_name, name) == 0) {
+ kref_get(&unit->un_ref);
+ argv[2] = unit;
+ return ITERCB_DONE;
+ }
+
+ return ITERCB_NEXT;
+}
+
+/**
+ * mpc_unit_lookup_by_name() - Look up an mpool unit by name.
+ * @parent: parent unit
+ * @name: unit name. This is not the mpool name.
+ * @unitp: unit ptr
+ *
+ * If a unit exists in the system which has the given name and parent
+ * then it is referenced and returned via *unitp. Otherwise, *unitp
+ * is set to NULL.
+ */
+static void mpc_unit_lookup_by_name(struct mpc_unit *parent, const char *name,
+ struct mpc_unit **unitp)
+{
+ struct mpc_softstate *ss = &mpc_softstate;
+ void *argv[] = { parent, (void *)name, NULL };
+
+ mutex_lock(&ss->ss_lock);
+ idr_for_each(&ss->ss_unitmap, mpc_unit_lookup_by_name_itercb, argv);
+ mutex_unlock(&ss->ss_lock);
+
+ *unitp = argv[2];
+}
+
/**
* mpc_unit_setup() - Create a device unit object and special file
* @uinfo:
@@ -328,6 +615,36 @@ static int mpc_unit_setup(const struct mpc_uinfo *uinfo, const char *name,
return rc;
}
+
+static int mpc_cf_journal(struct mpc_unit *unit)
+{
+ struct mpool_config cfg = { };
+ struct mpc_mpool *mpool;
+ int rc;
+
+ mpool = unit->un_mpool;
+ if (!mpool)
+ return -EINVAL;
+
+ down_write(&mpool->mp_lock);
+
+ cfg.mc_uid = unit->un_uid;
+ cfg.mc_gid = unit->un_gid;
+ cfg.mc_mode = unit->un_mode;
+ cfg.mc_oid1 = unit->un_ds_oidv[0];
+ cfg.mc_oid2 = unit->un_ds_oidv[1];
+ cfg.mc_captgt = unit->un_mdc_captgt;
+ cfg.mc_ra_pages_max = unit->un_ra_pages_max;
+ memcpy(&cfg.mc_utype, &unit->un_utype, sizeof(cfg.mc_utype));
+ strlcpy(cfg.mc_label, unit->un_label, sizeof(cfg.mc_label));
+
+ rc = mpool_config_store(mpool->mp_desc, &cfg);
+
+ up_write(&mpool->mp_lock);
+
+ return rc;
+}
+
/**
* mpc_uevent() - Hook to intercept and modify uevents before they're posted to udev
* @dev: mpc driver device
@@ -348,86 +665,1384 @@ static int mpc_uevent(struct device *dev, struct kobj_uevent_env *env)
return 0;
}
-static int mpc_exit_unit(int minor, void *item, void *arg)
+/**
+ * mpc_mp_chown() - Change ownership of an mpool.
+ * @unit: mpool unit ptr
+ * @params: mpool parameters ptr
+ *
+ * Return: Returns 0 if successful, -errno otherwise...
+ */
+static int mpc_mp_chown(struct mpc_unit *unit, struct mpool_params *params)
{
- mpc_unit_put(item);
+ mode_t mode;
+ uid_t uid;
+ gid_t gid;
+ int rc = 0;
- return ITERCB_NEXT;
+ if (!mpc_unit_ismpooldev(unit))
+ return -EINVAL;
+
+ uid = params->mp_uid;
+ gid = params->mp_gid;
+ mode = params->mp_mode;
+
+ if (mode != -1)
+ mode &= 0777;
+
+ if (uid != -1 && uid != unit->un_uid && !capable(CAP_CHOWN))
+ return -EPERM;
+
+ if (gid != -1 && gid != unit->un_gid && !capable(CAP_CHOWN))
+ return -EPERM;
+
+ if (mode != -1 && mode != unit->un_mode && !capable(CAP_FOWNER))
+ return -EPERM;
+
+	if (uid != -1)
+		unit->un_uid = uid;
+	if (gid != -1)
+		unit->un_gid = gid;
+	if (mode != -1)
+		unit->un_mode = mode;
+
+ if (uid != -1 || gid != -1 || mode != -1)
+ rc = kobject_uevent(&unit->un_device->kobj, KOBJ_CHANGE);
+
+ return rc;
}
/**
- * mpctl_exit() - Tear down and unload the mpool control module.
+ * mpioc_params_get() - get parameters of an activated mpool
+ * @unit: mpool unit ptr
+ * @get: mpool params
+ *
+ * MPIOC_PARAMS_GET ioctl handler to get mpool parameters
+ *
+ * Return: Returns 0 if successful, -errno otherwise...
*/
-void mpctl_exit(void)
+static int mpioc_params_get(struct mpc_unit *unit, struct mpioc_params *get)
{
struct mpc_softstate *ss = &mpc_softstate;
+ struct mpool_descriptor *desc;
+ struct mpool_params *params;
+ struct mpool_xprops xprops = { };
+ u8 mclass;
- if (ss->ss_inited) {
- idr_for_each(&ss->ss_unitmap, mpc_exit_unit, NULL);
- idr_destroy(&ss->ss_unitmap);
+ if (!mpc_unit_ismpooldev(unit))
+ return -EINVAL;
- if (ss->ss_devno != NODEV) {
- if (ss->ss_class) {
- if (ss->ss_cdev.ops)
- cdev_del(&ss->ss_cdev);
- class_destroy(ss->ss_class);
- }
- unregister_chrdev_region(ss->ss_devno, maxunits);
- }
+ desc = unit->un_mpool->mp_desc;
- ss->ss_inited = false;
- }
+ mutex_lock(&ss->ss_lock);
+
+ params = &get->mps_params;
+ memset(params, 0, sizeof(*params));
+ params->mp_uid = unit->un_uid;
+ params->mp_gid = unit->un_gid;
+ params->mp_mode = unit->un_mode;
+ params->mp_mdc_captgt = MPOOL_ROOT_LOG_CAP;
+ params->mp_oidv[0] = unit->un_ds_oidv[0];
+ params->mp_oidv[1] = unit->un_ds_oidv[1];
+ params->mp_ra_pages_max = unit->un_ra_pages_max;
+ memcpy(¶ms->mp_utype, &unit->un_utype, sizeof(params->mp_utype));
+ strlcpy(params->mp_label, unit->un_label, sizeof(params->mp_label));
+ strlcpy(params->mp_name, unit->un_name, sizeof(params->mp_name));
+
+ /* Get mpool properties.. */
+ mpool_get_xprops(desc, &xprops);
+
+ for (mclass = 0; mclass < MP_MED_NUMBER; mclass++)
+ params->mp_mblocksz[mclass] = xprops.ppx_params.mp_mblocksz[mclass];
+
+ params->mp_spare_cap = xprops.ppx_drive_spares[MP_MED_CAPACITY];
+ params->mp_spare_stg = xprops.ppx_drive_spares[MP_MED_STAGING];
+
+ memcpy(params->mp_poolid.b, xprops.ppx_params.mp_poolid.b, MPOOL_UUID_SIZE);
+
+ mutex_unlock(&ss->ss_lock);
+
+ return 0;
}
/**
- * mpctl_init() - Load and initialize the mpool control module.
+ * mpioc_params_set() - set parameters of an activated mpool
+ * @unit: mpool unit ptr
+ * @set: mpool params
+ *
+ * MPIOC_PARAMS_SET ioctl handler to set mpool parameters
+ *
+ * Return: Returns 0 if successful, -errno otherwise...
*/
-int mpctl_init(void)
+static int mpioc_params_set(struct mpc_unit *unit, struct mpioc_params *set)
{
struct mpc_softstate *ss = &mpc_softstate;
- struct mpool_config *cfg = NULL;
- struct mpc_unit *ctlunit;
- const char *errmsg = NULL;
- int rc;
+ struct mpool_descriptor *mp;
+ struct mpool_params *params;
+ uuid_le uuidnull = { };
+ int rerr = 0, err = 0;
+ bool journal = false;
- if (ss->ss_inited)
- return -EBUSY;
+ if (!mpc_unit_ismpooldev(unit))
+ return -EINVAL;
- ctlunit = NULL;
+ params = &set->mps_params;
- maxunits = clamp_t(uint, maxunits, 8, 8192);
+ mutex_lock(&ss->ss_lock);
+ if (params->mp_uid != -1 || params->mp_gid != -1 || params->mp_mode != -1) {
+ err = mpc_mp_chown(unit, params);
+ if (err) {
+ mutex_unlock(&ss->ss_lock);
+ return err;
+ }
+ journal = true;
+ }
- cdev_init(&ss->ss_cdev, &mpc_fops_default);
- ss->ss_cdev.owner = THIS_MODULE;
+ if (params->mp_label[0]) {
+ mpc_toascii(params->mp_label, sizeof(params->mp_label));
+ strlcpy(unit->un_label, params->mp_label, sizeof(unit->un_label));
+ journal = true;
+ }
- mutex_init(&ss->ss_lock);
- idr_init(&ss->ss_unitmap);
- ss->ss_class = NULL;
- ss->ss_devno = NODEV;
- sema_init(&ss->ss_op_sema, 1);
- ss->ss_inited = true;
+	if (memcmp(&uuidnull, &params->mp_utype, sizeof(uuidnull))) {
+		memcpy(&unit->un_utype, &params->mp_utype, sizeof(unit->un_utype));
+ journal = true;
+ }
- rc = alloc_chrdev_region(&ss->ss_devno, 0, maxunits, "mpool");
- if (rc) {
- errmsg = "cannot allocate control device major";
- ss->ss_devno = NODEV;
- goto errout;
+ if (params->mp_ra_pages_max != U32_MAX) {
+ unit->un_ra_pages_max = clamp_t(u32, params->mp_ra_pages_max,
+ 0, MPOOL_RA_PAGES_MAX);
+ journal = true;
}
- ss->ss_class = class_create(THIS_MODULE, module_name(THIS_MODULE));
- if (IS_ERR(ss->ss_class)) {
- errmsg = "class_create() failed";
- rc = PTR_ERR(ss->ss_class);
- ss->ss_class = NULL;
- goto errout;
+ if (journal)
+ err = mpc_cf_journal(unit);
+ mutex_unlock(&ss->ss_lock);
+
+ if (err) {
+ mp_pr_err("%s: params commit failed", err, unit->un_name);
+ return err;
}
- ss->ss_class->dev_uevent = mpc_uevent;
+ mp = unit->un_mpool->mp_desc;
- rc = cdev_add(&ss->ss_cdev, ss->ss_devno, maxunits);
- if (rc) {
- errmsg = "cdev_add() failed";
- ss->ss_cdev.ops = NULL;
+ if (params->mp_spare_cap != MPOOL_SPARES_INVALID) {
+ err = mpool_drive_spares(mp, MP_MED_CAPACITY, params->mp_spare_cap);
+ if (err && err != -ENOENT)
+ rerr = err;
+ }
+
+ if (params->mp_spare_stg != MPOOL_SPARES_INVALID) {
+ err = mpool_drive_spares(mp, MP_MED_STAGING, params->mp_spare_stg);
+ if (err && err != -ENOENT)
+ rerr = err;
+ }
+
+ return rerr;
+}
+
+/**
+ * mpioc_mp_mclass_get() - get information regarding an mpool's mclasses
+ * @unit: mpool unit ptr
+ * @mcl: mclass info struct
+ *
+ * MPIOC_MP_MCLASS_GET ioctl handler to get mclass information
+ *
+ * Return: Returns 0 if successful, -errno otherwise...
+ */
+static int mpioc_mp_mclass_get(struct mpc_unit *unit, struct mpioc_mclass *mcl)
+{
+ struct mpool_descriptor *desc = unit->un_mpool->mp_desc;
+ struct mpool_mclass_xprops mcxv[MP_MED_NUMBER];
+ uint32_t mcxc = ARRAY_SIZE(mcxv);
+ int rc;
+
+ if (!mcl || !desc)
+ return -EINVAL;
+
+ if (!mcl->mcl_xprops) {
+ mpool_mclass_get_cnt(desc, &mcl->mcl_cnt);
+ return 0;
+ }
+
+ memset(mcxv, 0, sizeof(mcxv));
+
+ rc = mpool_mclass_get(desc, &mcxc, mcxv);
+ if (rc)
+ return rc;
+
+ if (mcxc > mcl->mcl_cnt)
+ mcxc = mcl->mcl_cnt;
+ mcl->mcl_cnt = mcxc;
+
+ rc = copy_to_user(mcl->mcl_xprops, mcxv, sizeof(mcxv[0]) * mcxc);
+
+ return rc ? -EFAULT : 0;
+}
+
+/**
+ * mpioc_devprops_get() - Get device properties
+ * @unit: mpool unit ptr
+ *
+ * MPIOC_DEVPROPS_GET ioctl handler to retrieve properties for the specified device.
+ */
+static int mpioc_devprops_get(struct mpc_unit *unit, struct mpioc_devprops *devprops)
+{
+ int rc = 0;
+
+ if (unit->un_mpool) {
+ struct mpool_descriptor *mp = unit->un_mpool->mp_desc;
+
+ rc = mpool_get_devprops_by_name(mp, devprops->dpr_pdname, &devprops->dpr_devprops);
+ }
+
+ return rc;
+}
+
+/**
+ * mpioc_prop_get() - Get mpool properties.
+ * @unit: mpool unit ptr
+ *
+ * MPIOC_PROP_GET ioctl handler to retrieve properties for the specified mpool.
+ */
+static void mpioc_prop_get(struct mpc_unit *unit, struct mpioc_prop *kprop)
+{
+ struct mpool_descriptor *desc = unit->un_mpool->mp_desc;
+ struct mpool_params *params;
+ struct mpool_xprops *xprops;
+
+ memset(kprop, 0, sizeof(*kprop));
+
+ /* Get unit properties.. */
+ params = &kprop->pr_xprops.ppx_params;
+ params->mp_uid = unit->un_uid;
+ params->mp_gid = unit->un_gid;
+ params->mp_mode = unit->un_mode;
+ params->mp_mdc_captgt = unit->un_mdc_captgt;
+ params->mp_oidv[0] = unit->un_ds_oidv[0];
+ params->mp_oidv[1] = unit->un_ds_oidv[1];
+ params->mp_ra_pages_max = unit->un_ra_pages_max;
+ memcpy(¶ms->mp_utype, &unit->un_utype, sizeof(params->mp_utype));
+ strlcpy(params->mp_label, unit->un_label, sizeof(params->mp_label));
+ strlcpy(params->mp_name, unit->un_name, sizeof(params->mp_name));
+
+ /* Get mpool properties.. */
+ xprops = &kprop->pr_xprops;
+ mpool_get_xprops(desc, xprops);
+ mpool_get_usage(desc, MP_MED_ALL, &kprop->pr_usage);
+
+ params->mp_spare_cap = xprops->ppx_drive_spares[MP_MED_CAPACITY];
+ params->mp_spare_stg = xprops->ppx_drive_spares[MP_MED_STAGING];
+
+ kprop->pr_mcxc = ARRAY_SIZE(kprop->pr_mcxv);
+ mpool_mclass_get(desc, &kprop->pr_mcxc, kprop->pr_mcxv);
+}
+
+/**
+ * mpioc_proplist_get_itercb() - Get properties iterator callback.
+ * @item: unit ptr
+ * @arg: argument list
+ *
+ * Return: Returns properties for each unit matching the input criteria.
+ */
+static int mpioc_proplist_get_itercb(int minor, void *item, void *arg)
+{
+ struct mpc_unit *unit = item;
+ struct mpioc_prop __user *uprop;
+ struct mpioc_prop kprop;
+ struct mpc_unit *match;
+ struct mpioc_list *ls;
+ void **argv = arg;
+ int *cntp, rc;
+ int *errp;
+
+ if (!unit)
+ return ITERCB_NEXT;
+
+ match = argv[0];
+ ls = argv[1];
+
+ if (mpc_unit_isctldev(match) && !mpc_unit_ismpooldev(unit) &&
+ ls->ls_cmd != MPIOC_LIST_CMD_PROP_GET)
+ return ITERCB_NEXT;
+
+ if (mpc_unit_ismpooldev(match) && !mpc_unit_ismpooldev(unit) &&
+ ls->ls_cmd != MPIOC_LIST_CMD_PROP_GET)
+ return ITERCB_NEXT;
+
+ if (mpc_unit_ismpooldev(match) && unit->un_mpool != match->un_mpool)
+ return ITERCB_NEXT;
+
+ cntp = argv[2];
+ errp = argv[3];
+
+ mpioc_prop_get(unit, &kprop);
+
+ uprop = (struct mpioc_prop __user *)ls->ls_listv + *cntp;
+
+ rc = copy_to_user(uprop, &kprop, sizeof(*uprop));
+ if (rc) {
+ *errp = -EFAULT;
+ return ITERCB_DONE;
+ }
+
+ return (++(*cntp) >= ls->ls_listc) ? ITERCB_DONE : ITERCB_NEXT;
+}
+
+/**
+ * mpioc_proplist_get() - Get mpool properties.
+ * @unit: mpool unit ptr
+ * @ls: properties parameter block
+ *
+ * MPIOC_PROP_GET ioctl handler to retrieve properties for one
+ * or more mpools.
+ *
+ * Return: Returns 0 if successful, -errno otherwise...
+ */
+static int mpioc_proplist_get(struct mpc_unit *unit, struct mpioc_list *ls)
+{
+ struct mpc_softstate *ss = &mpc_softstate;
+ int err = 0;
+ int cnt = 0;
+ void *argv[] = { unit, ls, &cnt, &err };
+
+ if (!ls || ls->ls_listc < 1 || ls->ls_cmd == MPIOC_LIST_CMD_INVALID)
+ return -EINVAL;
+
+ mutex_lock(&ss->ss_lock);
+ idr_for_each(&ss->ss_unitmap, mpioc_proplist_get_itercb, argv);
+ mutex_unlock(&ss->ss_lock);
+
+ ls->ls_listc = cnt;
+
+ return err;
+}
+
+/**
+ * mpc_mpool_open() - Open the mpool specified by the given drive paths,
+ * and then create an mpool object to track the
+ * underlying mpool.
+ * @dpathc: drive count
+ * @dpathv: drive path name vector
+ * @mpoolp: mpool ptr. Set only if success.
+ * @pd_prop: PDs properties
+ *
+ * Return: Returns 0 if successful and sets *mpoolp.
+ * Returns -errno on error.
+ */
+static int mpc_mpool_open(uint dpathc, char **dpathv, struct mpc_mpool **mpoolp,
+ struct pd_prop *pd_prop, struct mpool_params *params, u32 flags)
+{
+ struct mpc_softstate *ss = &mpc_softstate;
+ struct mpcore_params mpc_params;
+ struct mpc_mpool *mpool;
+ size_t mpoolsz, len;
+ int rc;
+
+ if (!ss || !dpathv || !mpoolp || !params)
+ return -EINVAL;
+
+ len = mpc_toascii(params->mp_name, sizeof(params->mp_name));
+ if (len < 1 || len >= MPOOL_NAMESZ_MAX)
+ return (len < 1) ? -EINVAL : -ENAMETOOLONG;
+
+ mpoolsz = sizeof(*mpool) + len + 1;
+
+ mpool = kzalloc(mpoolsz, GFP_KERNEL);
+ if (!mpool)
+ return -ENOMEM;
+
+ if (!try_module_get(THIS_MODULE)) {
+ kfree(mpool);
+ return -EBUSY;
+ }
+
+ mpool_to_mpcore_params(params, &mpc_params);
+
+ rc = mpool_activate(dpathc, dpathv, pd_prop, MPOOL_ROOT_LOG_CAP,
+ &mpc_params, flags, &mpool->mp_desc);
+ if (rc) {
+ mp_pr_err("Activating %s failed", rc, params->mp_name);
+ module_put(THIS_MODULE);
+ kfree(mpool);
+ return rc;
+ }
+
+ kref_init(&mpool->mp_ref);
+ init_rwsem(&mpool->mp_lock);
+ mpool->mp_dpathc = dpathc;
+ mpool->mp_dpathv = dpathv;
+ strcpy(mpool->mp_name, params->mp_name);
+
+ *mpoolp = mpool;
+
+ return 0;
+}
+
+/**
+ * mpioc_mp_create() - create an mpool.
+ * @mp: mpool parameter block
+ * @pd_prop:
+ * @dpathv:
+ *
+ * MPIOC_MP_CREATE ioctl handler to create an mpool.
+ *
+ * Return: Returns 0 if the mpool is created, -errno otherwise...
+ */
+static int mpioc_mp_create(struct mpc_unit *ctl, struct mpioc_mpool *mp,
+ struct pd_prop *pd_prop, char ***dpathv)
+{
+ struct mpc_softstate *ss = &mpc_softstate;
+ struct mpcore_params mpc_params;
+ struct mpool_config cfg = { };
+ struct mpc_mpool *mpool = NULL;
+ struct mpc_unit *unit = NULL;
+ size_t len;
+ mode_t mode;
+ uid_t uid;
+ gid_t gid;
+ int rc;
+
+ if (!ctl || !mp || !pd_prop || !dpathv)
+ return -EINVAL;
+
+ len = mpc_toascii(mp->mp_params.mp_name, sizeof(mp->mp_params.mp_name));
+ if (len < 1 || len >= MPOOL_NAMESZ_MAX)
+ return (len < 1) ? -EINVAL : -ENAMETOOLONG;
+
+ mpool_params_merge_defaults(&mp->mp_params);
+
+ uid = mp->mp_params.mp_uid;
+ gid = mp->mp_params.mp_gid;
+ mode = mp->mp_params.mp_mode;
+
+ if (uid == -1)
+ uid = mpc_default_uid;
+ if (gid == -1)
+ gid = mpc_default_gid;
+ if (mode == -1)
+ mode = mpc_default_mode;
+
+ mode &= 0777;
+
+ if (uid != mpc_current_uid() && !capable(CAP_CHOWN)) {
+ rc = -EPERM;
+ mp_pr_err("chown permission denied, uid %d", rc, uid);
+ return rc;
+ }
+
+ if (gid != mpc_current_gid() && !capable(CAP_CHOWN)) {
+ rc = -EPERM;
+ mp_pr_err("chown permission denied, gid %d", rc, gid);
+ return rc;
+ }
+
+ if (!capable(CAP_SYS_ADMIN)) {
+ rc = -EPERM;
+ mp_pr_err("chmod/activate permission denied", rc);
+ return rc;
+ }
+
+ mpool_to_mpcore_params(&mp->mp_params, &mpc_params);
+
+ rc = mpool_create(mp->mp_params.mp_name, mp->mp_flags, *dpathv,
+ pd_prop, &mpc_params, MPOOL_ROOT_LOG_CAP);
+ if (rc) {
+ mp_pr_err("%s: create failed", rc, mp->mp_params.mp_name);
+ return rc;
+ }
+
+ /*
+ * Create an mpc_mpool object through which we can (re)open and manage
+ * the mpool. If successful, mpc_mpool_open() adopts dpathv.
+ */
+ mpool_params_merge_defaults(&mp->mp_params);
+
+ rc = mpc_mpool_open(mp->mp_dpathc, *dpathv, &mpool, pd_prop, &mp->mp_params, mp->mp_flags);
+ if (rc) {
+ mp_pr_err("%s: mpc_mpool_open failed", rc, mp->mp_params.mp_name);
+ mpool_destroy(mp->mp_dpathc, *dpathv, pd_prop, mp->mp_flags);
+ return rc;
+ }
+
+ *dpathv = NULL;
+
+ mlog_lookup_rootids(&cfg.mc_oid1, &cfg.mc_oid2);
+ cfg.mc_uid = uid;
+ cfg.mc_gid = gid;
+ cfg.mc_mode = mode;
+ cfg.mc_rsvd0 = mp->mp_params.mp_rsvd0;
+ cfg.mc_captgt = MPOOL_ROOT_LOG_CAP;
+ cfg.mc_ra_pages_max = mp->mp_params.mp_ra_pages_max;
+ cfg.mc_rsvd1 = mp->mp_params.mp_rsvd1;
+ cfg.mc_rsvd2 = mp->mp_params.mp_rsvd2;
+ cfg.mc_rsvd3 = mp->mp_params.mp_rsvd3;
+ cfg.mc_rsvd4 = mp->mp_params.mp_rsvd4;
+ memcpy(&cfg.mc_utype, &mp->mp_params.mp_utype, sizeof(cfg.mc_utype));
+ strlcpy(cfg.mc_label, mp->mp_params.mp_label, sizeof(cfg.mc_label));
+
+ rc = mpool_config_store(mpool->mp_desc, &cfg);
+ if (rc) {
+ mp_pr_err("%s: config store failed", rc, mp->mp_params.mp_name);
+ goto errout;
+ }
+
+ /* A unit is born with two references: A birth reference, and one for the caller. */
+ rc = mpc_unit_setup(&mpc_uinfo_mpool, mp->mp_params.mp_name,
+ &cfg, mpool, &unit);
+ if (rc) {
+ mp_pr_err("%s: unit setup failed", rc, mp->mp_params.mp_name);
+ goto errout;
+ }
+
+ /* Return resolved params to caller. */
+ mp->mp_params.mp_uid = uid;
+ mp->mp_params.mp_gid = gid;
+ mp->mp_params.mp_mode = mode;
+ mp->mp_params.mp_mdc_captgt = cfg.mc_captgt;
+ mp->mp_params.mp_oidv[0] = cfg.mc_oid1;
+ mp->mp_params.mp_oidv[1] = cfg.mc_oid2;
+
+ rc = mpc_params_register(unit, MPC_MPOOL_PARAMS_CNT);
+ if (rc) {
+ mpc_unit_put(unit); /* drop birth ref */
+ goto errout;
+ }
+
+ mutex_lock(&ss->ss_lock);
+ idr_replace(&ss->ss_unitmap, unit, MINOR(unit->un_devno));
+ mutex_unlock(&ss->ss_lock);
+
+ mpool = NULL;
+
+errout:
+ if (mpool) {
+ mpool_deactivate(mpool->mp_desc);
+ mpool->mp_desc = NULL;
+ mpool_destroy(mp->mp_dpathc, mpool->mp_dpathv, pd_prop, mp->mp_flags);
+ }
+
+ /*
+	 * For failures after mpc_unit_setup() (i.e., unit != NULL)
+ * dropping the final unit ref will release the mpool ref.
+ */
+ if (unit)
+ mpc_unit_put(unit); /* Drop caller's ref */
+ else if (mpool)
+ mpc_mpool_put(mpool);
+
+ return rc;
+}
+
+/**
+ * mpioc_mp_activate() - activate an mpool.
+ * @mp: mpool parameter block
+ * @pd_prop:
+ * @dpathv:
+ *
+ * MPIOC_MP_ACTIVATE ioctl handler to activate an mpool.
+ *
+ * Return: Returns 0 if the mpool is activated, -errno otherwise...
+ */
+static int mpioc_mp_activate(struct mpc_unit *ctl, struct mpioc_mpool *mp,
+ struct pd_prop *pd_prop, char ***dpathv)
+{
+ struct mpc_softstate *ss = &mpc_softstate;
+ struct mpool_config cfg;
+ struct mpc_mpool *mpool = NULL;
+ struct mpc_unit *unit = NULL;
+ size_t len;
+ int rc;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (!ctl || !mp || !pd_prop || !dpathv)
+ return -EINVAL;
+
+ len = mpc_toascii(mp->mp_params.mp_name, sizeof(mp->mp_params.mp_name));
+ if (len < 1 || len >= MPOOL_NAMESZ_MAX)
+ return (len < 1) ? -EINVAL : -ENAMETOOLONG;
+
+ mpool_params_merge_defaults(&mp->mp_params);
+
+ /*
+ * Create an mpc_mpool object through which we can (re)open and manage
+ * the mpool. If successful, mpc_mpool_open() adopts dpathv.
+ */
+ rc = mpc_mpool_open(mp->mp_dpathc, *dpathv, &mpool, pd_prop, &mp->mp_params, mp->mp_flags);
+ if (rc) {
+ mp_pr_err("%s: mpc_mpool_open failed", rc, mp->mp_params.mp_name);
+ return rc;
+ }
+
+ *dpathv = NULL; /* Was adopted by successful mpc_mpool_open() */
+
+ rc = mpool_config_fetch(mpool->mp_desc, &cfg);
+ if (rc) {
+ mp_pr_err("%s config fetch failed", rc, mp->mp_params.mp_name);
+ goto errout;
+ }
+
+ if (mpool_params_merge_config(&mp->mp_params, &cfg))
+ mpool_config_store(mpool->mp_desc, &cfg);
+
+ /* A unit is born with two references: A birth reference, and one for the caller. */
+ rc = mpc_unit_setup(&mpc_uinfo_mpool, mp->mp_params.mp_name,
+ &cfg, mpool, &unit);
+ if (rc) {
+ mp_pr_err("%s unit setup failed", rc, mp->mp_params.mp_name);
+ goto errout;
+ }
+
+ /* Return resolved params to caller. */
+ mp->mp_params.mp_uid = cfg.mc_uid;
+ mp->mp_params.mp_gid = cfg.mc_gid;
+ mp->mp_params.mp_mode = cfg.mc_mode;
+ mp->mp_params.mp_mdc_captgt = cfg.mc_captgt;
+ mp->mp_params.mp_oidv[0] = cfg.mc_oid1;
+ mp->mp_params.mp_oidv[1] = cfg.mc_oid2;
+ mp->mp_params.mp_ra_pages_max = cfg.mc_ra_pages_max;
+ mp->mp_params.mp_vma_size_max = cfg.mc_vma_size_max;
+ memcpy(&mp->mp_params.mp_utype, &cfg.mc_utype, sizeof(mp->mp_params.mp_utype));
+ strlcpy(mp->mp_params.mp_label, cfg.mc_label, sizeof(mp->mp_params.mp_label));
+
+ rc = mpc_params_register(unit, MPC_MPOOL_PARAMS_CNT);
+ if (rc) {
+ mpc_unit_put(unit); /* drop birth ref */
+ goto errout;
+ }
+
+ mutex_lock(&ss->ss_lock);
+ idr_replace(&ss->ss_unitmap, unit, MINOR(unit->un_devno));
+ mutex_unlock(&ss->ss_lock);
+
+ mpool = NULL;
+
+errout:
+ /*
+	 * For failures after mpc_unit_setup() (i.e., unit != NULL)
+ * dropping the final unit ref will release the mpool ref.
+ */
+ if (unit)
+ mpc_unit_put(unit); /* drop caller's ref */
+ else if (mpool)
+ mpc_mpool_put(mpool);
+
+ return rc;
+}
+
+/**
+ * mpioc_mp_deactivate_impl() - deactivate an mpool.
+ * @unit: control device unit ptr
+ * @mp: mpool parameter block
+ *
+ * MPIOC_MP_DEACTIVATE ioctl handler to deactivate an mpool.
+ */
+static int mp_deactivate_impl(struct mpc_unit *ctl, struct mpioc_mpool *mp, bool locked)
+{
+ struct mpc_softstate *ss = &mpc_softstate;
+ struct mpc_unit *unit = NULL;
+ size_t len;
+ int rc;
+
+ if (!ctl || !mp)
+ return -EINVAL;
+
+ if (!mpc_unit_isctldev(ctl))
+ return -ENOTTY;
+
+ len = mpc_toascii(mp->mp_params.mp_name, sizeof(mp->mp_params.mp_name));
+ if (len < 1 || len >= MPOOL_NAMESZ_MAX)
+ return (len < 1) ? -EINVAL : -ENAMETOOLONG;
+
+ if (!locked) {
+ rc = down_interruptible(&ss->ss_op_sema);
+ if (rc)
+ return rc;
+ }
+
+ mpc_unit_lookup_by_name(ctl, mp->mp_params.mp_name, &unit);
+ if (!unit) {
+ rc = -ENXIO;
+ goto errout;
+ }
+
+ /*
+ * In order to be determined idle, a unit shall not be open
+ * and shall have a ref count of exactly two (the birth ref
+ * and the lookup ref from above).
+ */
+ mutex_lock(&ss->ss_lock);
+ if (unit->un_open_cnt > 0 || kref_read(&unit->un_ref) != 2) {
+ rc = -EBUSY;
+ mp_pr_err("%s: busy, cannot deactivate", rc, unit->un_name);
+ } else {
+ idr_replace(&ss->ss_unitmap, NULL, MINOR(unit->un_devno));
+ rc = 0;
+ }
+ mutex_unlock(&ss->ss_lock);
+
+ if (!rc)
+ mpc_unit_put(unit); /* drop birth ref */
+
+ mpc_unit_put(unit); /* drop lookup ref */
+
+errout:
+ if (!locked)
+ up(&ss->ss_op_sema);
+
+ return rc;
+}
+
+static int mpioc_mp_deactivate(struct mpc_unit *ctl, struct mpioc_mpool *mp)
+{
+ return mp_deactivate_impl(ctl, mp, false);
+}
+
+static int mpioc_mp_cmd(struct mpc_unit *ctl, uint cmd, struct mpioc_mpool *mp)
+{
+ struct mpc_softstate *ss = &mpc_softstate;
+ struct mpc_unit *unit = NULL;
+ struct pd_prop *pd_prop = NULL;
+ char **dpathv = NULL, *dpaths;
+ size_t dpathvsz, pd_prop_sz;
+ const char *action;
+ size_t len;
+ int rc, i;
+
+ if (!ctl || !mp)
+ return -EINVAL;
+
+ if (!mpc_unit_isctldev(ctl))
+ return -EOPNOTSUPP;
+
+ if (mp->mp_dpathc < 1 || mp->mp_dpathc > MPOOL_DRIVES_MAX)
+ return -EDOM;
+
+ len = mpc_toascii(mp->mp_params.mp_name, sizeof(mp->mp_params.mp_name));
+ if (len < 1 || len >= MPOOL_NAMESZ_MAX)
+ return (len < 1) ? -EINVAL : -ENAMETOOLONG;
+
+ switch (cmd) {
+ case MPIOC_MP_CREATE:
+ action = "create";
+ break;
+
+ case MPIOC_MP_DESTROY:
+ action = "destroy";
+ break;
+
+ case MPIOC_MP_ACTIVATE:
+ action = "activate";
+ break;
+
+ case MPIOC_MP_RENAME:
+ action = "rename";
+ break;
+
+ default:
+ return -EINVAL;
+ }
+
+ if (!mp->mp_pd_prop || !mp->mp_dpaths) {
+ rc = -EINVAL;
+ mp_pr_err("%s: %s, (%d drives), drives names %p or PD props %p invalid",
+ rc, mp->mp_params.mp_name, action, mp->mp_dpathc,
+ mp->mp_dpaths, mp->mp_pd_prop);
+
+ return rc;
+ }
+
+ if (mp->mp_dpathssz > (mp->mp_dpathc + 1) * PATH_MAX)
+ return -EINVAL;
+
+ rc = down_interruptible(&ss->ss_op_sema);
+ if (rc)
+ return rc;
+
+ /*
+ * If mpc_unit_lookup_by_name() succeeds it will have acquired
+ * a reference on unit. We release that reference at the
+ * end of this function.
+ */
+ mpc_unit_lookup_by_name(ctl, mp->mp_params.mp_name, &unit);
+
+ if (unit && cmd != MPIOC_MP_DESTROY) {
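+		/* Activating an already-active mpool is treated as success. */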
+ if (cmd == MPIOC_MP_ACTIVATE)
+ goto errout;
+ rc = -EEXIST;
+ mp_pr_err("%s: mpool already activated", rc, mp->mp_params.mp_name);
+ goto errout;
+ }
+
+ /*
+ * The device path names are in one long string separated by
+ * newlines. Here we allocate one chunk of memory to hold
+ * all the device paths and a vector of ptrs to them.
+ */
+ dpathvsz = mp->mp_dpathc * sizeof(*dpathv) + mp->mp_dpathssz;
+ if (dpathvsz > MPOOL_DRIVES_MAX * (PATH_MAX + sizeof(*dpathv))) {
+ rc = -E2BIG;
+ mp_pr_err("%s: %s, too many member drives %zu",
+ rc, mp->mp_params.mp_name, action, dpathvsz);
+ goto errout;
+ }
+
+ dpathv = kmalloc(dpathvsz, GFP_KERNEL);
+ if (!dpathv) {
+ rc = -ENOMEM;
+ goto errout;
+ }
+
+ dpaths = (char *)dpathv + mp->mp_dpathc * sizeof(*dpathv);
+
+ rc = copy_from_user(dpaths, mp->mp_dpaths, mp->mp_dpathssz);
+ if (rc) {
+ rc = -EFAULT;
+ goto errout;
+ }
+
+ for (i = 0; i < mp->mp_dpathc; ++i) {
+ dpathv[i] = strsep(&dpaths, "\n");
+ if (!dpathv[i]) {
+ rc = -EINVAL;
+ goto errout;
+ }
+ }
+
+ /* Get the PDs properties from user space buffer. */
+ pd_prop_sz = mp->mp_dpathc * sizeof(*pd_prop);
+ pd_prop = kmalloc(pd_prop_sz, GFP_KERNEL);
+ if (!pd_prop) {
+ rc = -ENOMEM;
+ mp_pr_err("%s: %s, alloc pd prop %zu failed",
+ rc, mp->mp_params.mp_name, action, pd_prop_sz);
+ goto errout;
+ }
+
+ rc = copy_from_user(pd_prop, mp->mp_pd_prop, pd_prop_sz);
+ if (rc) {
+ rc = -EFAULT;
+ mp_pr_err("%s: %s, copyin pd prop %zu failed",
+ rc, mp->mp_params.mp_name, action, pd_prop_sz);
+ goto errout;
+ }
+
+ switch (cmd) {
+ case MPIOC_MP_CREATE:
+ rc = mpioc_mp_create(ctl, mp, pd_prop, &dpathv);
+ break;
+
+ case MPIOC_MP_ACTIVATE:
+ rc = mpioc_mp_activate(ctl, mp, pd_prop, &dpathv);
+ break;
+
+ case MPIOC_MP_DESTROY:
+ if (unit) {
+ mpc_unit_put(unit);
+ unit = NULL;
+
+ rc = mp_deactivate_impl(ctl, mp, true);
+ if (rc) {
+ action = "deactivate";
+ break;
+ }
+ }
+ rc = mpool_destroy(mp->mp_dpathc, dpathv, pd_prop, mp->mp_flags);
+ break;
+
+ case MPIOC_MP_RENAME:
+ rc = mpool_rename(mp->mp_dpathc, dpathv, pd_prop, mp->mp_flags,
+ mp->mp_params.mp_name);
+ break;
+ }
+
+ if (rc)
+ mp_pr_err("%s: %s failed", rc, mp->mp_params.mp_name, action);
+
+errout:
+ mpc_unit_put(unit);
+ up(&ss->ss_op_sema);
+
+ kfree(pd_prop);
+ kfree(dpathv);
+
+ return rc;
+}
+
+/**
+ * mpioc_mp_add() - add a device to an existing mpool
+ * @unit: mpool unit ptr
+ * @drv: mpool device parameter block
+ *
+ * MPIOC_MP_ADD ioctl handler to add a drive to an activated mpool
+ *
+ * Return: Returns 0 if successful, -errno otherwise...
+ */
+static int mpioc_mp_add(struct mpc_unit *unit, struct mpioc_drive *drv)
+{
+ struct mpool_descriptor *desc = unit->un_mpool->mp_desc;
+ size_t pd_prop_sz, dpathvsz;
+ struct pd_prop *pd_prop;
+ char **dpathv, *dpaths;
+ int rc, i;
+
+ /*
+ * The device path names are in one long string separated by
+ * newlines. Here we allocate one chunk of memory to hold
+ * all the device paths and a vector of ptrs to them.
+ */
+ dpathvsz = drv->drv_dpathc * sizeof(*dpathv) + drv->drv_dpathssz;
+ if (drv->drv_dpathc > MPOOL_DRIVES_MAX ||
+ dpathvsz > MPOOL_DRIVES_MAX * (PATH_MAX + sizeof(*dpathv))) {
+ rc = -E2BIG;
+ mp_pr_err("%s: invalid pathc %u, pathsz %zu",
+ rc, unit->un_name, drv->drv_dpathc, dpathvsz);
+ return rc;
+ }
+
+ dpathv = kmalloc(dpathvsz, GFP_KERNEL);
+ if (!dpathv) {
+ rc = -ENOMEM;
+ mp_pr_err("%s: alloc dpathv %zu failed", rc, unit->un_name, dpathvsz);
+ return rc;
+ }
+
+ dpaths = (char *)dpathv + drv->drv_dpathc * sizeof(*dpathv);
+ rc = copy_from_user(dpaths, drv->drv_dpaths, drv->drv_dpathssz);
+ if (rc) {
+ rc = -EFAULT;
+ mp_pr_err("%s: copyin dpaths %u failed", rc, unit->un_name, drv->drv_dpathssz);
+ kfree(dpathv);
+ return rc;
+ }
+
+ for (i = 0; i < drv->drv_dpathc; ++i) {
+ dpathv[i] = strsep(&dpaths, "\n");
+ if (!dpathv[i] || (strlen(dpathv[i]) > PATH_MAX - 1)) {
+ rc = -EINVAL;
+ mp_pr_err("%s: ill-formed dpathv list ", rc, unit->un_name);
+ kfree(dpathv);
+ return rc;
+ }
+ }
+
+ /* Get the PDs properties from user space buffer. */
+ pd_prop_sz = drv->drv_dpathc * sizeof(*pd_prop);
+
+ pd_prop = kmalloc(pd_prop_sz, GFP_KERNEL);
+ if (!pd_prop) {
+ rc = -ENOMEM;
+ mp_pr_err("%s: alloc pd prop %zu failed", rc, unit->un_name, pd_prop_sz);
+ kfree(dpathv);
+ return rc;
+ }
+
+ rc = copy_from_user(pd_prop, drv->drv_pd_prop, pd_prop_sz);
+ if (rc) {
+ rc = -EFAULT;
+ mp_pr_err("%s: copyin pd prop %zu failed", rc, unit->un_name, pd_prop_sz);
+ kfree(pd_prop);
+ kfree(dpathv);
+ return rc;
+ }
+
+ for (i = 0; i < drv->drv_dpathc; ++i) {
+ rc = mpool_drive_add(desc, dpathv[i], &pd_prop[i]);
+ if (rc)
+ break;
+ }
+
+ kfree(pd_prop);
+ kfree(dpathv);
+
+ return rc;
+}
+
+static struct mpc_softstate *mpc_cdev2ss(struct cdev *cdev)
+{
+ if (!cdev || cdev->owner != THIS_MODULE) {
+ mp_pr_crit("module dissociated", -EINVAL);
+ return NULL;
+ }
+
+ return container_of(cdev, struct mpc_softstate, ss_cdev);
+}
+
+static int mpc_bdi_alloc(void)
+{
+ mpc_bdi = bdi_alloc(NUMA_NO_NODE);
+ if (!mpc_bdi)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void mpc_bdi_save(struct mpc_unit *unit, struct inode *ip)
+{
+ unit->un_saved_bdi = ip->i_sb->s_bdi;
+ ip->i_sb->s_bdi = bdi_get(mpc_bdi);
+}
+
+static void mpc_bdi_restore(struct mpc_unit *unit, struct inode *ip)
+{
+ ip->i_sb->s_bdi = unit->un_saved_bdi;
+ bdi_put(mpc_bdi);
+}
+
+static int mpc_bdi_setup(void)
+{
+ int rc;
+
+ rc = mpc_bdi_alloc();
+ if (rc)
+ return rc;
+
+ mpc_bdi->capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK;
+ mpc_bdi->ra_pages = MPOOL_RA_PAGES_MAX;
+
+ return 0;
+}
+
+static void mpc_bdi_teardown(void)
+{
+ bdi_put(mpc_bdi);
+}
+
+/*
+ * MPCTL file operations.
+ */
+
+/**
+ * mpc_open() - Open an mpool device.
+ * @ip: inode ptr
+ * @fp: file ptr
+ *
+ * Return: Returns 0 on success, -errno otherwise...
+ */
+static int mpc_open(struct inode *ip, struct file *fp)
+{
+ struct mpc_softstate *ss;
+ struct mpc_unit *unit;
+ bool firstopen;
+ int rc = 0;
+
+ ss = mpc_cdev2ss(ip->i_cdev);
+ if (!ss || ss != &mpc_softstate)
+ return -EBADFD;
+
+ /* Acquire a reference on the unit object. We'll release it in mpc_release(). */
+ mpc_unit_lookup(iminor(fp->f_inode), &unit);
+ if (!unit)
+ return -ENODEV;
+
+ if (down_trylock(&unit->un_open_lock)) {
+ rc = (fp->f_flags & O_NONBLOCK) ? -EWOULDBLOCK :
+ down_interruptible(&unit->un_open_lock);
+
+ if (rc)
+ goto errout;
+ }
+
+ firstopen = (unit->un_open_cnt == 0);
+
+ if (!firstopen) {
+ if (fp->f_mapping != unit->un_mapping)
+ rc = -EBUSY;
+ else if (unit->un_open_excl || (fp->f_flags & O_EXCL))
+ rc = -EBUSY;
+ goto unlock;
+ }
+
+ if (!mpc_unit_ismpooldev(unit)) {
+ unit->un_open_excl = !!(fp->f_flags & O_EXCL);
+ goto unlock; /* control device */
+ }
+
+ /* First open of an mpool unit (not the control device). */
+ if (!fp->f_mapping || fp->f_mapping != ip->i_mapping) {
+ rc = -EINVAL;
+ goto unlock;
+ }
+
+ fp->f_op = &mpc_fops_default;
+
+ mpc_bdi_save(unit, ip);
+
+ unit->un_mapping = fp->f_mapping;
+
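+	/* Advertise a maximal i_size so object mmaps can address any offset. */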
+ inode_lock(ip);
+ i_size_write(ip, 1ul << 63);
+ inode_unlock(ip);
+
+ unit->un_open_excl = !!(fp->f_flags & O_EXCL);
+
+unlock:
+ if (!rc) {
+ fp->private_data = unit;
+ nonseekable_open(ip, fp);
+ ++unit->un_open_cnt;
+ }
+ up(&unit->un_open_lock);
+
+errout:
+ if (rc) {
+ if (rc != -EBUSY)
+ mp_pr_err("open %s failed", rc, unit->un_name);
+ mpc_unit_put(unit);
+ }
+
+ return rc;
+}
+
+/**
+ * mpc_release() - Close the specified mpool device.
+ * @ip: inode ptr
+ * @fp: file ptr
+ *
+ * Return: Returns 0 on success, -errno otherwise...
+ */
+static int mpc_release(struct inode *ip, struct file *fp)
+{
+ struct mpc_unit *unit;
+ bool lastclose;
+
+ unit = fp->private_data;
+ if (!unit)
+ return -EBADFD;
+
+ down(&unit->un_open_lock);
+ lastclose = (--unit->un_open_cnt == 0);
+ if (!lastclose)
+ goto errout;
+
+ if (mpc_unit_ismpooldev(unit)) {
+ unit->un_mapping = NULL;
+
+ mpc_bdi_restore(unit, ip);
+ }
+
+ unit->un_open_excl = false;
+
+errout:
+ up(&unit->un_open_lock);
+
+ mpc_unit_put(unit);
+
+ return 0;
+}
+
+/**
+ * mpc_ioctl() - mpc driver ioctl entry point
+ * @fp: file pointer
+ * @cmd: an mpool ioctl command (i.e., MPIOC_*)
+ * @arg: varies..
+ *
+ * Perform the specified mpool ioctl command.
+ *
+ * Return: Returns 0 on success, -errno otherwise...
+ */
+static long mpc_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
+{
+ char argbuf[256] __aligned(16);
+ struct mpc_unit *unit;
+ size_t argbufsz;
+ void *argp;
+ ulong iosz;
+ int rc;
+
+ if (_IOC_TYPE(cmd) != MPIOC_MAGIC)
+ return -ENOTTY;
+
+ if ((fp->f_flags & O_ACCMODE) == O_RDONLY) {
+ switch (cmd) {
+ case MPIOC_PROP_GET:
+ case MPIOC_DEVPROPS_GET:
+ case MPIOC_MP_MCLASS_GET:
+ break;
+
+ default:
+ return -EINVAL;
+ }
+ }
+
+ unit = fp->private_data;
+ argbufsz = sizeof(argbuf);
+ iosz = _IOC_SIZE(cmd);
+ argp = (void *)arg;
+
+ if (!unit || (iosz > sizeof(union mpioc_union)))
+ return -EINVAL;
+
+ /* Set up argp/argbuf for read/write requests. */
+ if (_IOC_DIR(cmd) & (_IOC_READ | _IOC_WRITE)) {
+ argp = argbuf;
+ if (iosz > argbufsz) {
+ argbufsz = roundup_pow_of_two(iosz);
+
+ argp = kzalloc(argbufsz, GFP_KERNEL);
+ if (!argp)
+ return -ENOMEM;
+ }
+
+ if (_IOC_DIR(cmd) & _IOC_WRITE) {
+ if (copy_from_user(argp, (const void __user *)arg, iosz)) {
+ if (argp != argbuf)
+ kfree(argp);
+ return -EFAULT;
+ }
+ }
+ }
+
+ switch (cmd) {
+ case MPIOC_MP_CREATE:
+ case MPIOC_MP_ACTIVATE:
+ case MPIOC_MP_DESTROY:
+ case MPIOC_MP_RENAME:
+ rc = mpioc_mp_cmd(unit, cmd, argp);
+ break;
+
+ case MPIOC_MP_DEACTIVATE:
+ rc = mpioc_mp_deactivate(unit, argp);
+ break;
+
+ case MPIOC_DRV_ADD:
+ rc = mpioc_mp_add(unit, argp);
+ break;
+
+ case MPIOC_PARAMS_SET:
+ rc = mpioc_params_set(unit, argp);
+ break;
+
+ case MPIOC_PARAMS_GET:
+ rc = mpioc_params_get(unit, argp);
+ break;
+
+ case MPIOC_MP_MCLASS_GET:
+ rc = mpioc_mp_mclass_get(unit, argp);
+ break;
+
+ case MPIOC_PROP_GET:
+ rc = mpioc_proplist_get(unit, argp);
+ break;
+
+ case MPIOC_DEVPROPS_GET:
+ rc = mpioc_devprops_get(unit, argp);
+ break;
+
+ default:
+ rc = -ENOTTY;
+ mp_pr_rl("invalid command %x: dir=%u type=%c nr=%u size=%u",
+ rc, cmd, _IOC_DIR(cmd), _IOC_TYPE(cmd), _IOC_NR(cmd), _IOC_SIZE(cmd));
+ break;
+ }
+
+ if (!rc && _IOC_DIR(cmd) & _IOC_READ) {
+ if (copy_to_user((void __user *)arg, argp, iosz))
+ rc = -EFAULT;
+ }
+
+ if (argp != argbuf)
+ kfree(argp);
+
+ return rc;
+}
+
+static const struct file_operations mpc_fops_default = {
+ .owner = THIS_MODULE,
+ .open = mpc_open,
+ .release = mpc_release,
+ .unlocked_ioctl = mpc_ioctl,
+};
+
+static int mpc_exit_unit(int minor, void *item, void *arg)
+{
+ mpc_unit_put(item);
+
+ return ITERCB_NEXT;
+}
+
+/**
+ * mpctl_exit() - Tear down and unload the mpool control module.
+ */
+void mpctl_exit(void)
+{
+ struct mpc_softstate *ss = &mpc_softstate;
+
+ if (ss->ss_inited) {
+ idr_for_each(&ss->ss_unitmap, mpc_exit_unit, NULL);
+ idr_destroy(&ss->ss_unitmap);
+
+ if (ss->ss_devno != NODEV) {
+ if (ss->ss_class) {
+ if (ss->ss_cdev.ops)
+ cdev_del(&ss->ss_cdev);
+ class_destroy(ss->ss_class);
+ }
+ unregister_chrdev_region(ss->ss_devno, maxunits);
+ }
+
+ ss->ss_inited = false;
+ }
+
+ mpc_bdi_teardown();
+}
+
+/**
+ * mpctl_init() - Load and initialize the mpool control module.
+ */
+int mpctl_init(void)
+{
+ struct mpc_softstate *ss = &mpc_softstate;
+ struct mpool_config *cfg = NULL;
+ struct mpc_unit *ctlunit;
+ const char *errmsg = NULL;
+ int rc;
+
+ if (ss->ss_inited)
+ return -EBUSY;
+
+ ctlunit = NULL;
+
+ maxunits = clamp_t(uint, maxunits, 8, 8192);
+
+ cdev_init(&ss->ss_cdev, &mpc_fops_default);
+ ss->ss_cdev.owner = THIS_MODULE;
+
+ mutex_init(&ss->ss_lock);
+ idr_init(&ss->ss_unitmap);
+ ss->ss_class = NULL;
+ ss->ss_devno = NODEV;
+ sema_init(&ss->ss_op_sema, 1);
+ ss->ss_inited = true;
+
+ rc = alloc_chrdev_region(&ss->ss_devno, 0, maxunits, "mpool");
+ if (rc) {
+ errmsg = "cannot allocate control device major";
+ ss->ss_devno = NODEV;
+ goto errout;
+ }
+
+ ss->ss_class = class_create(THIS_MODULE, module_name(THIS_MODULE));
+ if (IS_ERR(ss->ss_class)) {
+ errmsg = "class_create() failed";
+ rc = PTR_ERR(ss->ss_class);
+ ss->ss_class = NULL;
+ goto errout;
+ }
+
+ ss->ss_class->dev_uevent = mpc_uevent;
+
+ rc = cdev_add(&ss->ss_cdev, ss->ss_devno, maxunits);
+ if (rc) {
+ errmsg = "cdev_add() failed";
+ ss->ss_cdev.ops = NULL;
+ goto errout;
+ }
+
+ rc = mpc_bdi_setup();
+ if (rc) {
+ errmsg = "mpc bdi setup failed";
goto errout;
}
--
2.17.2
From: Nabeel M Mohamed <[email protected]>
Metadata containers (MDCs) are used for storing and maintaining
metadata. The MDC APIs are implemented as helper functions built on a
pair of mlogs per MDC. They embody the concept of compaction to deal
with one mlog of the pair filling; what it means to compact is
use-case dependent.
The MDC APIs make it easy for a client to:
- Append metadata update records to the active mlog of an MDC
until it is full (or exceeds some client-specific threshold)
- Flag the start of a compaction which marks the other mlog of
the MDC as active
- Re-serialize its metadata by appending it to the (newly)
active mlog of the MDC
- Flag the end of the compaction
- Continue appending metadata update records to the MDC until
the above process repeats
The MDC API functions handle all failures, including crash
recovery, by using special markers recognized by the mlog
implementation.
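To illustrate, here is a minimal sketch of this cycle from a
hypothetical client. The mp_mdc_* functions are the APIs added by
this patch; client_mdc_full() and client_serialize() stand in for
the client-specific fullness check and metadata re-serialization,
and error handling is abbreviated:

  static int client_log_update(struct mpool_descriptor *mp, u64 oid1,
                               u64 oid2, void *rec, size_t reclen)
  {
          struct mp_mdc *mdc;
          int rc;

          rc = mp_mdc_open(mp, oid1, oid2, 0, &mdc);
          if (rc)
                  return rc;

          /* Append a metadata update record to the active mlog. */
          rc = mp_mdc_append(mdc, rec, reclen, true);
          if (rc)
                  goto out;

          /* Compact once the client-specific threshold is exceeded. */
          if (client_mdc_full(mdc)) {
                  rc = mp_mdc_cstart(mdc);    /* swaps the active mlog */
                  if (rc)
                          return rc;          /* cstart closed the MDC */

                  rc = client_serialize(mdc); /* re-append live metadata */
                  if (rc)
                          goto out;

                  rc = mp_mdc_cend(mdc);
                  if (rc)
                          return rc;          /* cend closed the MDC */
          }
  out:
          return mp_mdc_close(mdc) ? : rc;
  }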
Co-developed-by: Greg Becker <[email protected]>
Signed-off-by: Greg Becker <[email protected]>
Co-developed-by: Pierre Labat <[email protected]>
Signed-off-by: Pierre Labat <[email protected]>
Co-developed-by: John Groves <[email protected]>
Signed-off-by: John Groves <[email protected]>
Signed-off-by: Nabeel M Mohamed <[email protected]>
---
drivers/mpool/mdc.c | 486 ++++++++++++++++++++++++++++++++++++++++++++
drivers/mpool/mdc.h | 106 ++++++++++
2 files changed, 592 insertions(+)
create mode 100644 drivers/mpool/mdc.c
create mode 100644 drivers/mpool/mdc.h
diff --git a/drivers/mpool/mdc.c b/drivers/mpool/mdc.c
new file mode 100644
index 000000000000..1a8abd5f815e
--- /dev/null
+++ b/drivers/mpool/mdc.c
@@ -0,0 +1,486 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc. All rights reserved.
+ */
+
+#include <linux/slab.h>
+#include <linux/string.h>
+
+#include "mpool_printk.h"
+#include "mpool_ioctl.h"
+#include "mpcore.h"
+#include "mp.h"
+#include "mlog.h"
+#include "mdc.h"
+
+#define mdc_logerr(_mpname, _msg, _mlh, _objid, _gen1, _gen2, _err)	\
+	mp_pr_err("mpool %s, mdc open, %s "				\
+		  "mlog %p objid 0x%lx gen1 %lu gen2 %lu",		\
+		  (_err), (_mpname), (_msg),				\
+		  (_mlh), (ulong)(_objid), (ulong)(_gen1),		\
+		  (ulong)(_gen2))
+
+#define OP_COMMIT 0
+#define OP_DELETE 1
+
+/**
+ * mdc_acquire() - Validate mdc handle and acquire mdc_lock
+ * @mlh: MDC handle
+ * @rw: read/append?
+ */
+static inline int mdc_acquire(struct mp_mdc *mdc, bool rw)
+{
+ if (!mdc || mdc->mdc_magic != MPC_MDC_MAGIC || !mdc->mdc_valid)
+ return -EINVAL;
+
+ if (rw && (mdc->mdc_flags & MDC_OF_SKIP_SER))
+ return 0;
+
+ mutex_lock(&mdc->mdc_lock);
+
+ /* Validate again after acquiring lock */
+ if (!mdc->mdc_valid) {
+ mutex_unlock(&mdc->mdc_lock);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+/**
+ * mdc_release() - Release mdc_lock
+ * @mdc: MDC handle
+ * @rw: read/append?
+ */
+static inline void mdc_release(struct mp_mdc *mdc, bool rw)
+{
+ if (rw && (mdc->mdc_flags & MDC_OF_SKIP_SER))
+ return;
+
+ mutex_unlock(&mdc->mdc_lock);
+}
+
+/**
+ * mdc_invalidate() - Invalidates MDC handle by resetting the magic
+ * @mdc: MDC handle
+ */
+static inline void mdc_invalidate(struct mp_mdc *mdc)
+{
+ mdc->mdc_magic = MPC_NO_MAGIC;
+}
+
+/**
+ * mdc_get_mpname() - Get mpool name from mpool descriptor
+ * @mp: mpool descriptor
+ * @mpname: buffer to store the mpool name (output)
+ * @mplen: buffer len
+ */
+static int mdc_get_mpname(struct mpool_descriptor *mp, char *mpname, size_t mplen)
+{
+ if (!mp || !mpname)
+ return -EINVAL;
+
+ return mpool_get_mpname(mp, mpname, mplen);
+}
+
+/**
+ * mdc_find_get() - Wrapper around get for mlog pair.
+ */
+static void mdc_find_get(struct mpool_descriptor *mp, u64 *logid, bool do_put,
+ struct mlog_props *props, struct mlog_descriptor **mlh, int *ferr)
+{
+ int i;
+
+ for (i = 0; i < 2; ++i)
+ ferr[i] = mlog_find_get(mp, logid[i], 0, &props[i], &mlh[i]);
+
+ if (do_put && ((ferr[0] && !ferr[1]) || (ferr[1] && !ferr[0]))) {
+ if (ferr[0])
+ mlog_put(mlh[1]);
+ else
+ mlog_put(mlh[0]);
+ }
+}
+
+/**
+ * mdc_put() - Wrapper around put for mlog pair.
+ */
+static void mdc_put(struct mlog_descriptor *mlh1, struct mlog_descriptor *mlh2)
+{
+ mlog_put(mlh1);
+ mlog_put(mlh2);
+}
+
+int mp_mdc_open(struct mpool_descriptor *mp, u64 logid1, u64 logid2, u8 flags,
+ struct mp_mdc **mdc_out)
+{
+ struct mlog_descriptor *mlh[2];
+ struct mlog_props *props = NULL;
+ struct mp_mdc *mdc;
+
+ int err = 0, err1 = 0, err2 = 0;
+ int ferr[2] = {0};
+ u64 gen1 = 0, gen2 = 0;
+ bool empty = false;
+ u8 mlflags = 0;
+ u64 id[2];
+ char *mpname;
+
+ if (!mp || !mdc_out)
+ return -EINVAL;
+
+ mdc = kzalloc(sizeof(*mdc), GFP_KERNEL);
+ if (!mdc)
+ return -ENOMEM;
+
+ mdc->mdc_valid = 0;
+ mdc->mdc_mp = mp;
+ mdc_get_mpname(mp, mdc->mdc_mpname, sizeof(mdc->mdc_mpname));
+
+ mpname = mdc->mdc_mpname;
+
+ if (logid1 == logid2) {
+ err = -EINVAL;
+ goto exit;
+ }
+
+ props = kcalloc(2, sizeof(*props), GFP_KERNEL);
+ if (!props) {
+ err = -ENOMEM;
+ goto exit;
+ }
+
+ /*
+ * This mdc_find_get can go away once mp_mdc_open is modified to
+ * operate on handles.
+ */
+ id[0] = logid1;
+ id[1] = logid2;
+ mdc_find_get(mp, id, true, props, mlh, ferr);
+ if (ferr[0] || ferr[1]) {
+ err = ferr[0] ? : ferr[1];
+ goto exit;
+ }
+ mdc->mdc_logh1 = mlh[0];
+ mdc->mdc_logh2 = mlh[1];
+
+ if (flags & MDC_OF_SKIP_SER)
+ mlflags |= MLOG_OF_SKIP_SER;
+
+ mlflags |= MLOG_OF_COMPACT_SEM;
+
+ err1 = mlog_open(mp, mdc->mdc_logh1, mlflags, &gen1);
+ err2 = mlog_open(mp, mdc->mdc_logh2, mlflags, &gen2);
+
+ if (err1 && err1 != -EMSGSIZE && err1 != -EBUSY) {
+ err = err1;
+ } else if (err2 && err2 != -EMSGSIZE && err2 != -EBUSY) {
+ err = err2;
+ } else if ((err1 && err2) || (!err1 && !err2 && gen1 && gen1 == gen2)) {
+
+ err = -EINVAL;
+
+ /* Bad pair; both have failed erases/compactions or equal non-0 gens. */
+ mp_pr_err("mpool %s, mdc open, bad mlog handle, mlog1 %p logid1 0x%lx errno %d gen1 %lu, mlog2 %p logid2 0x%lx errno %d gen2 %lu",
+ err, mpname, mdc->mdc_logh1, (ulong)logid1, err1, (ulong)gen1,
+ mdc->mdc_logh2, (ulong)logid2, err2, (ulong)gen2);
+ } else {
+ /* Active log is valid log with smallest gen */
+ if (err1 || (!err2 && gen2 < gen1)) {
+ mdc->mdc_alogh = mdc->mdc_logh2;
+ if (!err1) {
+ err = mlog_empty(mp, mdc->mdc_logh1, &empty);
+ if (err)
+ mdc_logerr(mpname, "mlog1 empty check failed",
+ mdc->mdc_logh1, logid1, gen1, gen2, err);
+ }
+ if (!err && (err1 || !empty)) {
+ err = mlog_erase(mp, mdc->mdc_logh1, gen2 + 1);
+ if (!err) {
+ err = mlog_open(mp, mdc->mdc_logh1, mlflags, &gen1);
+ if (err)
+ mdc_logerr(mpname, "mlog1 open failed",
+ mdc->mdc_logh1, logid1, gen1, gen2, err);
+ } else {
+ mdc_logerr(mpname, "mlog1 erase failed", mdc->mdc_logh1,
+ logid1, gen1, gen2, err);
+ }
+ }
+ } else {
+ mdc->mdc_alogh = mdc->mdc_logh1;
+ if (!err2) {
+ err = mlog_empty(mp, mdc->mdc_logh2, &empty);
+ if (err)
+ mdc_logerr(mpname, "mlog2 empty check failed",
+ mdc->mdc_logh2, logid2, gen1, gen2, err);
+ }
+ if (!err && (err2 || gen2 == gen1 || !empty)) {
+ err = mlog_erase(mp, mdc->mdc_logh2, gen1 + 1);
+ if (!err) {
+ err = mlog_open(mp, mdc->mdc_logh2, mlflags, &gen2);
+ if (err)
+ mdc_logerr(mpname, "mlog2 open failed",
+ mdc->mdc_logh2, logid2, gen1, gen2, err);
+ } else {
+ mdc_logerr(mpname, "mlog2 erase failed", mdc->mdc_logh2,
+ logid2, gen1, gen2, err);
+ }
+ }
+ }
+
+ if (!err) {
+ err = mlog_empty(mp, mdc->mdc_alogh, &empty);
+ if (!err && empty) {
+ /*
+ * First use of log pair so need to add
+ * cstart/cend recs; above handles case of
+ * failure between adding cstart and cend
+ */
+ err = mlog_append_cstart(mp, mdc->mdc_alogh);
+ if (!err) {
+ err = mlog_append_cend(mp, mdc->mdc_alogh);
+ if (err)
+ mdc_logerr(mpname,
+ "adding cend to active mlog failed",
+ mdc->mdc_alogh,
+ mdc->mdc_alogh == mdc->mdc_logh1 ?
+ logid1 : logid2, gen1, gen2, err);
+ } else {
+ mdc_logerr(mpname, "adding cstart to active mlog failed",
+ mdc->mdc_alogh,
+ mdc->mdc_alogh == mdc->mdc_logh1 ?
+ logid1 : logid2, gen1, gen2, err);
+ }
+
+ } else if (err) {
+ mdc_logerr(mpname, "active mlog empty check failed",
+ mdc->mdc_alogh, mdc->mdc_alogh == mdc->mdc_logh1 ?
+ logid1 : logid2, gen1, gen2, err);
+ }
+ }
+ }
+
+ if (!err) {
+ /*
+ * Inform pre-compaction of the size of the active
+ * mlog and how much is used. This is applicable
+ * only for mpool core's internal MDCs.
+ */
+ mlog_precompact_alsz(mp, mdc->mdc_alogh);
+
+ mdc->mdc_valid = 1;
+ mdc->mdc_magic = MPC_MDC_MAGIC;
+ mdc->mdc_flags = flags;
+ mutex_init(&mdc->mdc_lock);
+
+ *mdc_out = mdc;
+ } else {
+ err1 = mlog_close(mp, mdc->mdc_logh1);
+ err2 = mlog_close(mp, mdc->mdc_logh2);
+
+ mdc_put(mdc->mdc_logh1, mdc->mdc_logh2);
+ }
+
+exit:
+ if (err)
+ kfree(mdc);
+
+ kfree(props);
+
+ return err;
+}
+
+int mp_mdc_cstart(struct mp_mdc *mdc)
+{
+ struct mlog_descriptor *tgth = NULL;
+ struct mpool_descriptor *mp;
+ bool rw = false;
+ int rc;
+
+ if (!mdc)
+ return -EINVAL;
+
+ rc = mdc_acquire(mdc, rw);
+ if (rc)
+ return rc;
+
+ mp = mdc->mdc_mp;
+
+ if (mdc->mdc_alogh == mdc->mdc_logh1)
+ tgth = mdc->mdc_logh2;
+ else
+ tgth = mdc->mdc_logh1;
+
+ rc = mlog_append_cstart(mp, tgth);
+ if (rc) {
+ mdc_release(mdc, rw);
+
+ mp_pr_err("mpool %s, mdc %p cstart failed, mlog %p",
+ rc, mdc->mdc_mpname, mdc, tgth);
+
+ (void)mp_mdc_close(mdc);
+
+ return rc;
+ }
+
+ mdc->mdc_alogh = tgth;
+ mdc_release(mdc, rw);
+
+ return 0;
+}
+
+int mp_mdc_cend(struct mp_mdc *mdc)
+{
+ struct mlog_descriptor *srch = NULL;
+ struct mlog_descriptor *tgth = NULL;
+ struct mpool_descriptor *mp;
+ u64 gentgt = 0;
+ bool rw = false;
+ int rc;
+
+ if (!mdc)
+ return -EINVAL;
+
+ rc = mdc_acquire(mdc, rw);
+ if (rc)
+ return rc;
+
+ mp = mdc->mdc_mp;
+
+ if (mdc->mdc_alogh == mdc->mdc_logh1) {
+ tgth = mdc->mdc_logh1;
+ srch = mdc->mdc_logh2;
+ } else {
+ tgth = mdc->mdc_logh2;
+ srch = mdc->mdc_logh1;
+ }
+
+ rc = mlog_append_cend(mp, tgth);
+ if (!rc) {
+ rc = mlog_gen(tgth, &gentgt);
+ if (!rc)
+ rc = mlog_erase(mp, srch, gentgt + 1);
+ }
+
+ if (rc) {
+ mdc_release(mdc, rw);
+
+ mp_pr_err("mpool %s, mdc %p cend failed, mlog %p",
+ rc, mdc->mdc_mpname, mdc, tgth);
+
+ mp_mdc_close(mdc);
+
+ return rc;
+ }
+
+ mdc_release(mdc, rw);
+
+ return rc;
+}
+
+int mp_mdc_close(struct mp_mdc *mdc)
+{
+ struct mpool_descriptor *mp;
+ int rval = 0, rc;
+ bool rw = false;
+
+ if (!mdc)
+ return -EINVAL;
+
+ rc = mdc_acquire(mdc, rw);
+ if (rc)
+ return rc;
+
+ mp = mdc->mdc_mp;
+
+ mdc->mdc_valid = 0;
+
+ rc = mlog_close(mp, mdc->mdc_logh1);
+ if (rc) {
+ mp_pr_err("mpool %s, mdc %p close failed, mlog1 %p",
+ rc, mdc->mdc_mpname, mdc, mdc->mdc_logh1);
+ rval = rc;
+ }
+
+ rc = mlog_close(mp, mdc->mdc_logh2);
+ if (rc) {
+ mp_pr_err("mpool %s, mdc %p close failed, mlog2 %p",
+ rc, mdc->mdc_mpname, mdc, mdc->mdc_logh2);
+ rval = rc;
+ }
+
+ mdc_put(mdc->mdc_logh1, mdc->mdc_logh2);
+
+ mdc_invalidate(mdc);
+ mdc_release(mdc, false);
+
+ kfree(mdc);
+
+ return rval;
+}
+
+int mp_mdc_rewind(struct mp_mdc *mdc)
+{
+ bool rw = false;
+ int rc;
+
+ if (!mdc)
+ return -EINVAL;
+
+ rc = mdc_acquire(mdc, rw);
+ if (rc)
+ return rc;
+
+ rc = mlog_read_data_init(mdc->mdc_alogh);
+ if (rc)
+ mp_pr_err("mpool %s, mdc %p rewind failed, mlog %p",
+ rc, mdc->mdc_mpname, mdc, mdc->mdc_alogh);
+
+ mdc_release(mdc, rw);
+
+ return rc;
+}
+
+int mp_mdc_read(struct mp_mdc *mdc, void *data, size_t len, size_t *rdlen)
+{
+ bool rw = true;
+ int rc;
+
+ if (!mdc || !data)
+ return -EINVAL;
+
+ rc = mdc_acquire(mdc, rw);
+ if (rc)
+ return rc;
+
+ rc = mlog_read_data_next(mdc->mdc_mp, mdc->mdc_alogh, data, (u64)len, (u64 *)rdlen);
+ if (rc && rc != -EOVERFLOW)
+ mp_pr_err("mpool %s, mdc %p read failed, mlog %p len %lu",
+ rc, mdc->mdc_mpname, mdc, mdc->mdc_alogh, len);
+
+ mdc_release(mdc, rw);
+
+ return rc;
+}
+
+int mp_mdc_append(struct mp_mdc *mdc, void *data, ssize_t len, bool sync)
+{
+ bool rw = true;
+ int rc;
+
+ if (!mdc || !data)
+ return -EINVAL;
+
+ rc = mdc_acquire(mdc, rw);
+ if (rc)
+ return rc;
+
+ rc = mlog_append_data(mdc->mdc_mp, mdc->mdc_alogh, data, (u64)len, sync);
+ if (rc)
+ mp_pr_rl("mpool %s, mdc %p append failed, mlog %p, len %lu sync %d",
+ rc, mdc->mdc_mpname, mdc, mdc->mdc_alogh, len, sync);
+
+ mdc_release(mdc, rw);
+
+ return rc;
+}
diff --git a/drivers/mpool/mdc.h b/drivers/mpool/mdc.h
new file mode 100644
index 000000000000..7ab1de261eff
--- /dev/null
+++ b/drivers/mpool/mdc.h
@@ -0,0 +1,106 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc. All rights reserved.
+ */
+
+#ifndef MPOOL_MDC_PRIV_H
+#define MPOOL_MDC_PRIV_H
+
+#include <linux/mutex.h>
+
+#define MPC_MDC_MAGIC 0xFEEDFEED
+#define MPC_NO_MAGIC 0xFADEFADE
+
+struct mpool_descriptor;
+struct mlog_descriptor;
+
+/**
+ * struct mp_mdc - MDC handle
+ * @mdc_mp: mpool handle
+ * @mdc_logh1: mlog 1 handle
+ * @mdc_logh2: mlog 2 handle
+ * @mdc_alogh: active mlog handle
+ * @mdc_lock: mdc mutex
+ * @mdc_mpname: mpool name
+ * @mdc_valid: is the handle valid?
+ * @mdc_magic: MDC handle magic
+ * @mdc_flags: MDC flags
+ */
+struct mp_mdc {
+ struct mpool_descriptor *mdc_mp;
+ struct mlog_descriptor *mdc_logh1;
+ struct mlog_descriptor *mdc_logh2;
+ struct mlog_descriptor *mdc_alogh;
+ struct mutex mdc_lock;
+ char mdc_mpname[MPOOL_NAMESZ_MAX];
+ int mdc_valid;
+ int mdc_magic;
+ u8 mdc_flags;
+};
+
+/* MDC (Metadata Container) APIs */
+
+/**
+ * mp_mdc_open() - Open MDC by OIDs
+ * @mp: mpool handle
+ * @logid1: Mlog ID 1
+ * @logid2: Mlog ID 2
+ * @flags: MDC Open flags (enum mdc_open_flags)
+ * @mdc_out: MDC handle
+ */
+int
+mp_mdc_open(struct mpool_descriptor *mp, u64 logid1, u64 logid2, u8 flags, struct mp_mdc **mdc_out);
+
+/**
+ * mp_mdc_close() - Close MDC
+ * @mdc: MDC handle
+ */
+int mp_mdc_close(struct mp_mdc *mdc);
+
+/**
+ * mp_mdc_rewind() - Rewind MDC to first record
+ * @mdc: MDC handle
+ */
+int mp_mdc_rewind(struct mp_mdc *mdc);
+
+/**
+ * mp_mdc_read() - Read next record from MDC
+ * @mdc: MDC handle
+ * @data: buffer to receive data
+ * @len: length of supplied buffer
+ * @rdlen: number of bytes read
+ *
+ * Return:
+ * If the return value is -EOVERFLOW, then the receive buffer "data"
+ * is too small and must be resized according to the value returned
+ * in "rdlen".
+ */
+int mp_mdc_read(struct mp_mdc *mdc, void *data, size_t len, size_t *rdlen);
+
+/**
+ * mp_mdc_append() - append record to MDC
+ * @mdc: MDC handle
+ * @data: data to write
+ * @len: length of data
+ * @sync: flag to defer return until IO is complete
+ */
+int mp_mdc_append(struct mp_mdc *mdc, void *data, ssize_t len, bool sync);
+
+/**
+ * mp_mdc_cstart() - Initiate MDC compaction
+ * @mdc: MDC handle
+ *
+ * Swap active (ostensibly full) and inactive (empty) mlogs
+ * Append a compaction start marker to newly active mlog
+ */
+int mp_mdc_cstart(struct mp_mdc *mdc);
+
+/**
+ * mp_mdc_cend() - End MDC compaction
+ * @mdc: MDC handle
+ *
+ * Append a compaction end marker to the active mlog
+ */
+int mp_mdc_cend(struct mp_mdc *mdc);
+
+#endif /* MPOOL_MDC_PRIV_H */
--
2.17.2
From: Nabeel M Mohamed <[email protected]>
This adds headers containing the following on-media formats:
- Mpool superblock
- Object management records: create, update, delete, and erase
- Mpool configuration record
- Media class config and spare record
- OID checkpoint and version record
- Mlog page header and framing records
Co-developed-by: Greg Becker <[email protected]>
Signed-off-by: Greg Becker <[email protected]>
Co-developed-by: Pierre Labat <[email protected]>
Signed-off-by: Pierre Labat <[email protected]>
Co-developed-by: John Groves <[email protected]>
Signed-off-by: John Groves <[email protected]>
Signed-off-by: Nabeel M Mohamed <[email protected]>
---
drivers/mpool/omf.h | 593 ++++++++++++++++++++++++++++++++++++++++
drivers/mpool/omf_if.h | 381 ++++++++++++++++++++++++++
drivers/mpool/upgrade.h | 128 +++++++++
3 files changed, 1102 insertions(+)
create mode 100644 drivers/mpool/omf.h
create mode 100644 drivers/mpool/omf_if.h
create mode 100644 drivers/mpool/upgrade.h
diff --git a/drivers/mpool/omf.h b/drivers/mpool/omf.h
new file mode 100644
index 000000000000..c750573720dd
--- /dev/null
+++ b/drivers/mpool/omf.h
@@ -0,0 +1,593 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc. All rights reserved.
+ */
+/*
+ * Pool on-drive format (omf) module.
+ *
+ * Defines:
+ * + on-drive format for mpool superblocks
+ * + on-drive formats for mlogs, mblocks, and metadata containers (mdc)
+ * + utility functions for working with these on-drive formats
+ * This includes the structures and enums used by the on-drive format.
+ *
+ * All mpool metadata is versioned and stored on media in little-endian format.
+ *
+ * Naming conventions:
+ * -------------------
+ * Structure names end with _omf.
+ * Structure member names start with a "p", which means "packed".
+ */
+
+#ifndef MPOOL_OMF_H
+#define MPOOL_OMF_H
+
+#include <linux/bug.h>
+#include <asm/byteorder.h>
+
+/*
+ * The following two macros exist solely to enable the OMF_SETGET macros to
+ * work on 8 bit members as well as 16, 32 and 64 bit members.
+ */
+#define le8_to_cpu(x) (x)
+#define cpu_to_le8(x) (x)
+
+
+/* Helper macro to define set/get methods for 8, 16, 32 or 64 bit scalar OMF struct members. */
+#define OMF_SETGET(type, member, bits) \
+ OMF_SETGET2(type, member, bits, member)
+
+#define OMF_SETGET2(type, member, bits, name) \
+ static __always_inline u##bits omf_##name(const type * s) \
+ { \
+ BUILD_BUG_ON(sizeof(((type *)0)->member)*8 != (bits)); \
+ return le##bits##_to_cpu(s->member); \
+ } \
+ static __always_inline void omf_set_##name(type *s, u##bits val)\
+ { \
+ s->member = cpu_to_le##bits(val); \
+ }
+
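+/*
+ * For example, OMF_SETGET(struct layout_descriptor_omf, pol_zcnt, 32)
+ * generates omf_pol_zcnt() and omf_set_pol_zcnt(), which convert between
+ * host byte order and the packed little-endian on-media format.
+ */
+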
+/* Helper macro to define set/get methods for character strings embedded in OMF structures. */
+#define OMF_SETGET_CHBUF(type, member) \
+ OMF_SETGET_CHBUF2(type, member, member)
+
+#define OMF_SETGET_CHBUF2(type, member, name) \
+ static inline void omf_set_##name(type *s, const void *p, size_t plen) \
+ { \
+ size_t len = sizeof(((type *)0)->member); \
+ memcpy(s->member, p, len < plen ? len : plen); \
+ } \
+ static inline void omf_##name(const type *s, void *p, size_t plen)\
+ { \
+ size_t len = sizeof(((type *)0)->member); \
+ memcpy(p, s->member, len < plen ? len : plen); \
+ }
+
+
+/* MPOOL_NAMESZ_MAX should match OMF_MPOOL_NAME_LEN */
+#define OMF_MPOOL_NAME_LEN 32
+
+/* MPOOL_UUID_SIZE should match OMF_UUID_PACKLEN */
+#define OMF_UUID_PACKLEN 16
+
+/**
+ * enum mc_features_omf - Drive features that participate in the media class
+ *	definition. These values are OR'ed into a 64-bit field.
+ */
+enum mc_features_omf {
+ OMF_MC_FEAT_MLOG_TGT = 0x1,
+ OMF_MC_FEAT_MBLOCK_TGT = 0x2,
+ OMF_MC_FEAT_CHECKSUM = 0x4,
+};
+
+
+/**
+ * enum devtype_omf -
+ * @OMF_PD_DEV_TYPE_BLOCK_STREAM: Block device implementing streams.
+ * @OMF_PD_DEV_TYPE_BLOCK_STD: Standard (non-streams) device (SSD, HDD).
+ * @OMF_PD_DEV_TYPE_FILE: File in user space for UT.
+ * @OMF_PD_DEV_TYPE_MEM: Memory semantic device. Such as NVDIMM
+ * direct access (raw or dax mode).
+ * @OMF_PD_DEV_TYPE_ZONE: zone-like device, such as open channel SSD
+ * (OC-SSD) and SMR HDD (using ZBC/ZAC).
+ * @OMF_PD_DEV_TYPE_BLOCK_NVDIMM: Standard (non-streams) NVDIMM in sector mode.
+ */
+enum devtype_omf {
+ OMF_PD_DEV_TYPE_BLOCK_STREAM = 1,
+ OMF_PD_DEV_TYPE_BLOCK_STD = 2,
+ OMF_PD_DEV_TYPE_FILE = 3,
+ OMF_PD_DEV_TYPE_MEM = 4,
+ OMF_PD_DEV_TYPE_ZONE = 5,
+ OMF_PD_DEV_TYPE_BLOCK_NVDIMM = 6,
+};
+
+
+/**
+ * struct layout_descriptor_omf - Layout descriptor version 1.
+ * @pol_zcnt: number of zones
+ * @pol_zaddr: zone start addr
+ *
+ * Introduced with binary version 1.0.0.0.
+ * "pol_" = packed omf layout
+ */
+struct layout_descriptor_omf {
+ __le32 pol_zcnt;
+ __le64 pol_zaddr;
+} __packed;
+
+/* Define set/get methods for layout_descriptor_omf */
+OMF_SETGET(struct layout_descriptor_omf, pol_zcnt, 32)
+OMF_SETGET(struct layout_descriptor_omf, pol_zaddr, 64)
+#define OMF_LAYOUT_DESC_PACKLEN (sizeof(struct layout_descriptor_omf))
+
+
+/**
+ * struct devparm_descriptor_omf - packed omf devparm descriptor
+ * @podp_devid: UUID for drive
+ * @podp_zonetot: total number of zones
+ * @podp_devsz: size of partition in bytes
+ * @podp_features: Features, ored bits of enum mc_features_omf
+ * @podp_mclassp: enum mp_media_classp
+ * @podp_devtype: PD type (enum devtype_omf)
+ * @podp_sectorsz: 2^podp_sectorsz = sector size
+ * @podp_zonepg: zone size in number of zone pages
+ *
+ * The fields mclassp, devtype, sectorsz, and zonepg uniquely identify the media class of the PD.
+ * All drives in a media class must have the same values in these fields.
+ */
+struct devparm_descriptor_omf {
+ u8 podp_mclassp;
+ u8 podp_devtype;
+ u8 podp_sectorsz;
+ u8 podp_devid[OMF_UUID_PACKLEN];
+ u8 podp_pad[5];
+ __le32 podp_zonepg;
+ __le32 podp_zonetot;
+ __le64 podp_devsz;
+ __le64 podp_features;
+} __packed;
+
+/* Define set/get methods for devparm_descriptor_omf */
+OMF_SETGET(struct devparm_descriptor_omf, podp_mclassp, 8)
+OMF_SETGET(struct devparm_descriptor_omf, podp_devtype, 8)
+OMF_SETGET(struct devparm_descriptor_omf, podp_sectorsz, 8)
+OMF_SETGET_CHBUF(struct devparm_descriptor_omf, podp_devid)
+OMF_SETGET(struct devparm_descriptor_omf, podp_zonepg, 32)
+OMF_SETGET(struct devparm_descriptor_omf, podp_zonetot, 32)
+OMF_SETGET(struct devparm_descriptor_omf, podp_devsz, 64)
+OMF_SETGET(struct devparm_descriptor_omf, podp_features, 64)
+#define OMF_DEVPARM_DESC_PACKLEN (sizeof(struct devparm_descriptor_omf))
+
+
+/*
+ * mlog structure:
+ * + An mlog comprises a consecutive sequence of log blocks,
+ * where each log block is a single page within a zone
+ * + A log block comprises a header and a consecutive sequence of records
+ * + A record is a typed blob
+ *
+ * Log block headers must be versioned. Log block records do not
+ * require version numbers because they are typed and new types can
+ * always be added.
+ */
+
+/*
+ * Log block format -- version 1
+ *
+ * log block := header record+ eolb? trailer?
+ *
+ * header := struct omf_logblock_header where vers=1
+ *
+ * record := lrd byte*
+ *
+ * lrd := struct omf_logrec_descriptor with value
+ * (<record length>, <chunk length>, enum logrec_type_omf value)
+ *
+ * eolb (end of log block marker) := struct omf_logrec_descriptor with value
+ * (0, 0, enum logrec_type_omf.EOLB/0)
+ *
+ * trailer := zero bytes from end of last log block record to end of log block
+ */
+
+/**
+ * enum logrec_type_omf -
+ * @OMF_LOGREC_EOLB: end of log block marker (start of trailer)
+ * @OMF_LOGREC_DATAFULL: data record; contains all specified data
+ * @OMF_LOGREC_DATAFIRST: data record; contains first part of specified data
+ * @OMF_LOGREC_DATAMID: data record; contains interior part of data
+ * @OMF_LOGREC_DATALAST: data record; contains final part of specified data
+ * @OMF_LOGREC_CSTART: compaction start marker
+ * @OMF_LOGREC_CEND: compaction end marker
+ *
+ * A log record type of 0 signifies EOLB. This is really the start of the
+ * trailer but this simplifies parsing for partially filled log blocks.
+ * DATAFIRST, -MID, -LAST types are used for chunking logical data records.
+ * OMF_LOGREC_CEND must be the maximum value for this enum.
+ */
+enum logrec_type_omf {
+ OMF_LOGREC_EOLB = 0,
+ OMF_LOGREC_DATAFULL = 1,
+ OMF_LOGREC_DATAFIRST = 2,
+ OMF_LOGREC_DATAMID = 3,
+ OMF_LOGREC_DATALAST = 4,
+ OMF_LOGREC_CSTART = 5,
+ OMF_LOGREC_CEND = 6,
+};
+
+
+/**
+ * struct logrec_descriptor_omf - packed omf logrec descriptor
+ * @polr_tlen: logical length of data record (all chunks)
+ * @polr_rlen: length of data chunk in this log record
+ * @polr_rtype: enum logrec_type_omf value
+ */
+struct logrec_descriptor_omf {
+ __le32 polr_tlen;
+ __le16 polr_rlen;
+ u8 polr_rtype;
+ u8 polr_pad;
+} __packed;
+
+/* Define set/get methods for logrec_descriptor_omf */
+OMF_SETGET(struct logrec_descriptor_omf, polr_tlen, 32)
+OMF_SETGET(struct logrec_descriptor_omf, polr_rlen, 16)
+OMF_SETGET(struct logrec_descriptor_omf, polr_rtype, 8)
+#define OMF_LOGREC_DESC_PACKLEN (sizeof(struct logrec_descriptor_omf))
+#define OMF_LOGREC_DESC_RLENMAX 65535
+
+
+#define OMF_LOGBLOCK_VERS 1
+
+/**
+ * struct logblock_header_omf - packed omf logblock header for all versions
+ * @polh_vers: log block hdr version, offset 0 in all vers
+ * @polh_magic: unique magic per mlog
+ * @polh_pfsetid: flush set ID of the previous log block
+ * @polh_cfsetid: flush set ID this log block belongs to
+ * @polh_gen: generation number
+ */
+struct logblock_header_omf {
+ __le16 polh_vers;
+ u8 polh_magic[OMF_UUID_PACKLEN];
+ u8 polh_pad[6];
+ __le32 polh_pfsetid;
+ __le32 polh_cfsetid;
+ __le64 polh_gen;
+} __packed;
+
+/* Define set/get methods for logblock_header_omf */
+OMF_SETGET(struct logblock_header_omf, polh_vers, 16)
+OMF_SETGET_CHBUF(struct logblock_header_omf, polh_magic)
+OMF_SETGET(struct logblock_header_omf, polh_pfsetid, 32)
+OMF_SETGET(struct logblock_header_omf, polh_cfsetid, 32)
+OMF_SETGET(struct logblock_header_omf, polh_gen, 64)
+/* On-media log block header length */
+#define OMF_LOGBLOCK_HDR_PACKLEN (sizeof(struct logblock_header_omf))
+
+
+/*
+ * Metadata container (mdc) mlog data record formats.
+ *
+ * NOTE: mdc records are typed and as such do not need a version number as new
+ * types can always be added as required.
+ */
+/**
+ * enum mdcrec_type_omf -
+ * @OMF_MDR_UNDEF: undefined; should never occur
+ * @OMF_MDR_OCREATE: object create
+ * @OMF_MDR_OUPDATE: object update
+ * @OMF_MDR_ODELETE: object delete
+ * @OMF_MDR_OIDCKPT: object id checkpoint
+ * @OMF_MDR_OERASE: object erase, also log mlog gen number
+ * @OMF_MDR_MCCONFIG: media class config
+ * @OMF_MDR_MCSPARE: media class spare zones set
+ * @OMF_MDR_VERSION: MDC content version.
+ * @OMF_MDR_MPCONFIG: mpool config record
+ */
+enum mdcrec_type_omf {
+ OMF_MDR_UNDEF = 0,
+ OMF_MDR_OCREATE = 1,
+ OMF_MDR_OUPDATE = 2,
+ OMF_MDR_ODELETE = 3,
+ OMF_MDR_OIDCKPT = 4,
+ OMF_MDR_OERASE = 5,
+ OMF_MDR_MCCONFIG = 6,
+ OMF_MDR_MCSPARE = 7,
+ OMF_MDR_VERSION = 8,
+ OMF_MDR_MPCONFIG = 9,
+ OMF_MDR_MAX = 10,
+};
+
+/**
+ * struct mdcver_omf - packed mdc version, version of an mpool MDC content.
+ * @pv_rtype: OMF_MDR_VERSION
+ * @pv_mdcv_major: to compare with MAJOR in binary version.
+ * @pv_mdcv_minor: to compare with MINOR in binary version.
+ * @pv_mdcv_patch: to compare with PATCH in binary version.
+ * @pv_mdcv_dev: used during development cycle when the above
+ * numbers don't change.
+ *
+ * This is not the version of the message framing used for the MDC. This is
+ * the version of the binary that introduced that version of the MDC content.
+ */
+struct mdcver_omf {
+ u8 pv_rtype;
+ u8 pv_pad;
+ __le16 pv_mdcv_major;
+ __le16 pv_mdcv_minor;
+ __le16 pv_mdcv_patch;
+ __le16 pv_mdcv_dev;
+} __packed;
+
+/* Define set/get methods for mdcver_omf */
+OMF_SETGET(struct mdcver_omf, pv_rtype, 8)
+OMF_SETGET(struct mdcver_omf, pv_mdcv_major, 16)
+OMF_SETGET(struct mdcver_omf, pv_mdcv_minor, 16)
+OMF_SETGET(struct mdcver_omf, pv_mdcv_patch, 16)
+OMF_SETGET(struct mdcver_omf, pv_mdcv_dev, 16)
+
+
+/**
+ * struct mdcrec_data_odelete_omf - packed data record odelete
+ * @pdro_rtype: mdcrec_type_omf: OMF_MDR_ODELETE or OMF_MDR_OIDCKPT
+ * @pdro_objid: object identifier
+ */
+struct mdcrec_data_odelete_omf {
+ u8 pdro_rtype;
+ u8 pdro_pad[7];
+ __le64 pdro_objid;
+} __packed;
+
+/* Define set/get methods for mdcrec_data_odelete_omf */
+OMF_SETGET(struct mdcrec_data_odelete_omf, pdro_rtype, 8)
+OMF_SETGET(struct mdcrec_data_odelete_omf, pdro_objid, 64)
+
+
+/**
+ * struct mdcrec_data_oerase_omf - packed data record oerase
+ * @pdrt_rtype: mdcrec_type_omf: OMF_MDR_OERASE
+ * @pdrt_objid: object identifier
+ * @pdrt_gen: object generation number
+ */
+struct mdcrec_data_oerase_omf {
+ u8 pdrt_rtype;
+ u8 pdrt_pad[7];
+ __le64 pdrt_objid;
+ __le64 pdrt_gen;
+} __packed;
+
+/* Define set/get methods for mdcrec_data_oerase_omf */
+OMF_SETGET(struct mdcrec_data_oerase_omf, pdrt_rtype, 8)
+OMF_SETGET(struct mdcrec_data_oerase_omf, pdrt_objid, 64)
+OMF_SETGET(struct mdcrec_data_oerase_omf, pdrt_gen, 64)
+#define OMF_MDCREC_OERASE_PACKLEN (sizeof(struct mdcrec_data_oerase_omf))
+
+
+/**
+ * struct mdcrec_data_mcconfig_omf - packed data record mclass config
+ * @pdrs_rtype: mdcrec_type_omf: OMF_MDR_MCCONFIG
+ * @pdrs_parm:
+ */
+struct mdcrec_data_mcconfig_omf {
+ u8 pdrs_rtype;
+ u8 pdrs_pad[7];
+ struct devparm_descriptor_omf pdrs_parm;
+} __packed;
+
+
+OMF_SETGET(struct mdcrec_data_mcconfig_omf, pdrs_rtype, 8)
+#define OMF_MDCREC_MCCONFIG_PACKLEN (sizeof(struct mdcrec_data_mcconfig_omf))
+
+
+/**
+ * struct mdcrec_data_mcspare_omf - packed data record mcspare
+ * @pdra_rtype: mdcrec_type_omf: OMF_MDR_MCSPARE
+ * @pdra_mclassp: enum mp_media_classp
+ * @pdra_spzone: percent spare zones for drives in media class
+ */
+struct mdcrec_data_mcspare_omf {
+ u8 pdra_rtype;
+ u8 pdra_mclassp;
+ u8 pdra_spzone;
+} __packed;
+
+/* Define set/get methods for mdcrec_data_mcspare_omf */
+OMF_SETGET(struct mdcrec_data_mcspare_omf, pdra_rtype, 8)
+OMF_SETGET(struct mdcrec_data_mcspare_omf, pdra_mclassp, 8)
+OMF_SETGET(struct mdcrec_data_mcspare_omf, pdra_spzone, 8)
+#define OMF_MDCREC_CLS_SPARE_PACKLEN (sizeof(struct mdcrec_data_mcspare_omf))
+
+
+/**
+ * struct mdcrec_data_ocreate_omf - packed data record ocreate
+ * @pdrc_rtype: mdcrec_type_omf: OMF_MDR_OCREATE or OMF_MDR_OUPDATE
+ * @pdrc_mclass:
+ * @pdrc_ld:
+ * @pdrc_objid: object identifier
+ * @pdrc_gen: object generation number
+ * @pdrc_mblen: amount of data written in the mblock, for mlog this is 0
+ * @pdrc_uuid: Used only for mlogs. Must be at the end of this struct.
+ */
+struct mdcrec_data_ocreate_omf {
+ u8 pdrc_rtype;
+ u8 pdrc_mclass;
+ u8 pdrc_pad[2];
+ struct layout_descriptor_omf pdrc_ld;
+ __le64 pdrc_objid;
+ __le64 pdrc_gen;
+ __le64 pdrc_mblen;
+ u8 pdrc_uuid[];
+} __packed;
+
+/* Define set/get methods for mdcrec_data_ocreate_omf */
+OMF_SETGET(struct mdcrec_data_ocreate_omf, pdrc_rtype, 8)
+OMF_SETGET(struct mdcrec_data_ocreate_omf, pdrc_mclass, 8)
+OMF_SETGET(struct mdcrec_data_ocreate_omf, pdrc_objid, 64)
+OMF_SETGET(struct mdcrec_data_ocreate_omf, pdrc_gen, 64)
+OMF_SETGET(struct mdcrec_data_ocreate_omf, pdrc_mblen, 64)
+#define OMF_MDCREC_OBJCMN_PACKLEN (sizeof(struct mdcrec_data_ocreate_omf) + \
+ OMF_UUID_PACKLEN)
+
+
+/**
+ * struct mdcrec_data_mpconfig_omf - packed data mpool config
+ * @pdmc_rtype:
+ * @pdmc_oid1:
+ * @pdmc_oid2:
+ * @pdmc_uid:
+ * @pdmc_gid:
+ * @pdmc_mode:
+ * @pdmc_mclassp:
+ * @pdmc_captgt:
+ * @pdmc_ra_pages_max:
+ * @pdmc_vma_size_max:
+ * @pdmc_utype: user-defined type (uuid)
+ * @pdmc_label: user-defined label (ascii)
+ */
+struct mdcrec_data_mpconfig_omf {
+ u8 pdmc_rtype;
+ u8 pdmc_pad[7];
+ __le64 pdmc_oid1;
+ __le64 pdmc_oid2;
+ __le32 pdmc_uid;
+ __le32 pdmc_gid;
+ __le32 pdmc_mode;
+ __le32 pdmc_rsvd0;
+ __le64 pdmc_captgt;
+ __le32 pdmc_ra_pages_max;
+ __le32 pdmc_vma_size_max;
+ __le32 pdmc_rsvd1;
+ __le32 pdmc_rsvd2;
+ __le64 pdmc_rsvd3;
+ __le64 pdmc_rsvd4;
+ u8 pdmc_utype[16];
+ u8 pdmc_label[MPOOL_LABELSZ_MAX];
+} __packed;
+
+/* Define set/get methods for mdcrec_data_mpconfig_omf */
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_rtype, 8)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_oid1, 64)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_oid2, 64)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_uid, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_gid, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_mode, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_rsvd0, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_captgt, 64)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_ra_pages_max, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_vma_size_max, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_rsvd1, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_rsvd2, 32)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_rsvd3, 64)
+OMF_SETGET(struct mdcrec_data_mpconfig_omf, pdmc_rsvd4, 64)
+OMF_SETGET_CHBUF(struct mdcrec_data_mpconfig_omf, pdmc_utype)
+OMF_SETGET_CHBUF(struct mdcrec_data_mpconfig_omf, pdmc_label)
+#define OMF_MDCREC_MPCONFIG_PACKLEN (sizeof(struct mdcrec_data_mpconfig_omf))
+
+
+/*
+ * Object types embedded in opaque uint64 object ids by the pmd module.
+ * This encoding is also present in the object ids stored in the
+ * data records on media.
+ *
+ * The obj_type field is 4 bits. There are two valid obj types.
+ */
+enum obj_type_omf {
+ OMF_OBJ_UNDEF = 0,
+ OMF_OBJ_MBLOCK = 1,
+ OMF_OBJ_MLOG = 2,
+};
+
+/**
+ * enum sb_descriptor_ver_omf - Mpool superblock version
+ * @OMF_SB_DESC_UNDEF: value not on media
+ */
+enum sb_descriptor_ver_omf {
+ OMF_SB_DESC_UNDEF = 0,
+	OMF_SB_DESC_V1 = 1,
+};
+#define OMF_SB_DESC_VER_LAST OMF_SB_DESC_V1
+
+
+/**
+ * struct sb_descriptor_omf - packed super block, super block descriptor format version 1.
+ * @psb_magic: mpool magic value; offset 0 in all vers
+ * @psb_name: mpool name
+ * @psb_poolid: UUID of pool this drive belongs to
+ * @psb_vers: sb format version; offset 56
+ * @psb_gen: sb generation number on this drive
+ * @psb_cksum1: checksum of all fields above
+ * @psb_parm: parameters for this drive
+ * @psb_cksum2: checksum of psb_parm
+ * @psb_mdc01gen: mdc0 log1 generation number
+ * @psb_mdc01uuid:
+ * @psb_mdc01devid: mdc0 log1 device UUID
+ * @psb_mdc01strip: mdc0 log1 strip desc.
+ * @psb_mdc01desc: mdc0 log1 layout
+ * @psb_mdc02gen: mdc0 log2 generation number
+ * @psb_mdc02uuid:
+ * @psb_mdc02devid: mdc0 log2 device UUID
+ * @psb_mdc02strip: mdc0 log2 strip desc.
+ * @psb_mdc02desc: mdc0 log2 layout
+ * @psb_mdc0dev: drive param for mdc0 strip
+ *
+ * Note: the fields up to and including psb_cksum1 are known to libblkid;
+ * they cannot change without havoc. Fields from psb_magic through
+ * psb_cksum1 are at the same offset in all versions.
+ */
+struct sb_descriptor_omf {
+ __le64 psb_magic;
+ u8 psb_name[OMF_MPOOL_NAME_LEN];
+ u8 psb_poolid[OMF_UUID_PACKLEN];
+ __le16 psb_vers;
+ __le32 psb_gen;
+ u8 psb_cksum1[4];
+
+ u8 psb_pad1[6];
+ struct devparm_descriptor_omf psb_parm;
+ u8 psb_cksum2[4];
+
+ u8 psb_pad2[4];
+ __le64 psb_mdc01gen;
+ u8 psb_mdc01uuid[OMF_UUID_PACKLEN];
+ u8 psb_mdc01devid[OMF_UUID_PACKLEN];
+ struct layout_descriptor_omf psb_mdc01desc;
+
+ u8 psb_pad3[4];
+ __le64 psb_mdc02gen;
+ u8 psb_mdc02uuid[OMF_UUID_PACKLEN];
+ u8 psb_mdc02devid[OMF_UUID_PACKLEN];
+ struct layout_descriptor_omf psb_mdc02desc;
+
+ u8 psb_pad4[4];
+ struct devparm_descriptor_omf psb_mdc0dev;
+} __packed;
+
+OMF_SETGET(struct sb_descriptor_omf, psb_magic, 64)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_name)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_poolid)
+OMF_SETGET(struct sb_descriptor_omf, psb_vers, 16)
+OMF_SETGET(struct sb_descriptor_omf, psb_gen, 32)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_cksum1)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_cksum2)
+OMF_SETGET(struct sb_descriptor_omf, psb_mdc01gen, 64)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_mdc01uuid)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_mdc01devid)
+OMF_SETGET(struct sb_descriptor_omf, psb_mdc02gen, 64)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_mdc02uuid)
+OMF_SETGET_CHBUF(struct sb_descriptor_omf, psb_mdc02devid)
+#define OMF_SB_DESC_PACKLEN (sizeof(struct sb_descriptor_omf))
+
+/*
+ * Maximum packed length of an MDC record. Among the object-related records,
+ * OCREATE/OUPDATE is the largest: rtype + objid + gen + layout desc + uuid.
+ */
+#define OMF_MDCREC_PACKLEN_MAX max(OMF_MDCREC_OBJCMN_PACKLEN, \
+ max(OMF_MDCREC_MCCONFIG_PACKLEN, \
+ max(OMF_MDCREC_CLS_SPARE_PACKLEN, \
+ OMF_MDCREC_MPCONFIG_PACKLEN)))
+
+#endif /* MPOOL_OMF_H */
diff --git a/drivers/mpool/omf_if.h b/drivers/mpool/omf_if.h
new file mode 100644
index 000000000000..5f11a03ef500
--- /dev/null
+++ b/drivers/mpool/omf_if.h
@@ -0,0 +1,381 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc. All rights reserved.
+ */
+
+#ifndef MPOOL_OMF_IF_H
+#define MPOOL_OMF_IF_H
+
+#include "uuid.h"
+#include "mpool_ioctl.h"
+
+#include "mp.h"
+#include "omf.h"
+
+struct mpool_descriptor;
+struct pmd_layout;
+
+/*
+ * Common defs: versioned via version number field of enclosing structs
+ */
+
+/**
+ * struct omf_layout_descriptor - version 1 layout descriptor
+ * @ol_zaddr:
+ * @ol_zcnt: number of zones
+ * @ol_pdh:
+ */
+struct omf_layout_descriptor {
+ u64 ol_zaddr;
+ u32 ol_zcnt;
+ u16 ol_pdh;
+};
+
+/**
+ * struct omf_devparm_descriptor - version 1 devparm descriptor
+ * @odp_devid: UUID for drive
+ * @odp_devsz: size, in bytes, of the volume/device
+ * @odp_zonetot: total number of zones
+ * @odp_zonepg: zone size in number of zone pages
+ * @odp_mclassp: enum mp_media_classp
+ * @odp_devtype: PD type. Enum pd_devtype
+ * @odp_sectorsz: 2^odp_sectorsz = sector size
+ * @odp_features: Features, ored bits of enum mp_mc_features
+ *
+ * The fields zonepg, mclassp, devtype, sectorsz, and features uniquely
+ * identify the media class of the PD.
+ * All drives in a media class must have the same values in these fields.
+ */
+struct omf_devparm_descriptor {
+ struct mpool_uuid odp_devid;
+ u64 odp_devsz;
+ u32 odp_zonetot;
+
+ u32 odp_zonepg;
+ u8 odp_mclassp;
+ u8 odp_devtype;
+ u8 odp_sectorsz;
+ u64 odp_features;
+};
+
+/*
+ * Superblock (sb) -- version 1
+ *
+ * Note: the 8-byte magic is byte-reversed so that it reads in correct ASCII order
+ */
+#define OMF_SB_MAGIC 0x7665446c6f6f706dULL /* ASCII mpoolDev - no null */
+
+/**
+ * struct omf_sb_descriptor - version 1 superblock descriptor
+ * @osb_magic: mpool magic value
+ * @osb_name: mpool name, contains a terminating 0 byte
+ * @osb_cktype: enum mp_cksum_type value
+ * @osb_vers: sb format version
+ * @osb_poolid: UUID of pool this drive belongs to
+ * @osb_gen: sb generation number on this drive
+ * @osb_parm: parameters for this drive
+ * @osb_mdc01gen: mdc0 log1 generation number
+ * @osb_mdc01uuid:
+ * @osb_mdc01devid:
+ * @osb_mdc01desc: mdc0 log1 layout
+ * @osb_mdc02gen: mdc0 log2 generation number
+ * @osb_mdc02uuid:
+ * @osb_mdc02devid:
+ * @osb_mdc02desc: mdc0 log2 layout
+ * @osb_mdc0dev: drive param for mdc0
+ */
+struct omf_sb_descriptor {
+ u64 osb_magic;
+ u8 osb_name[MPOOL_NAMESZ_MAX];
+ u8 osb_cktype;
+ u16 osb_vers;
+ struct mpool_uuid osb_poolid;
+ u32 osb_gen;
+ struct omf_devparm_descriptor osb_parm;
+
+ u64 osb_mdc01gen;
+ struct mpool_uuid osb_mdc01uuid;
+ struct mpool_uuid osb_mdc01devid;
+ struct omf_layout_descriptor osb_mdc01desc;
+
+ u64 osb_mdc02gen;
+ struct mpool_uuid osb_mdc02uuid;
+ struct mpool_uuid osb_mdc02devid;
+ struct omf_layout_descriptor osb_mdc02desc;
+
+ struct omf_devparm_descriptor osb_mdc0dev;
+};
+
+/**
+ * struct omf_logrec_descriptor -
+ * @olr_tlen: logical length of data record (all chunks)
+ * @olr_rlen: length of data chunk in this log record
+ * @olr_rtype: enum logrec_type_omf value
+ *
+ */
+struct omf_logrec_descriptor {
+ u32 olr_tlen;
+ u16 olr_rlen;
+ u8 olr_rtype;
+};
+
+/**
+ * struct omf_logblock_header -
+ * @olh_magic: unique ID per mlog
+ * @olh_pfsetid: flush set ID of the previous log block
+ * @olh_cfsetid: flush set ID this log block belongs to
+ * @olh_gen: generation number
+ * @olh_vers: log block format version
+ */
+struct omf_logblock_header {
+ struct mpool_uuid olh_magic;
+ u32 olh_pfsetid;
+ u32 olh_cfsetid;
+ u64 olh_gen;
+ u16 olh_vers;
+};
+
+/**
+ * struct omf_mdcver - version of an mpool MDC content.
+ * @mdcver:
+ *
+ * mdcver[0]: major version number
+ * mdcver[1]: minor version number
+ * mdcver[2]: patch version number
+ * mdcver[3]: development version number. Used during development cycle when
+ * the above numbers don't change.
+ *
+ * This is not the version of the message framing used for the MDC.
+ * This is the version of the binary that introduced that version of the MDC
+ * content.
+ */
+struct omf_mdcver {
+ u16 mdcver[4];
+};
+
+#define mdcv_major mdcver[0]
+#define mdcv_minor mdcver[1]
+#define mdcv_patch mdcver[2]
+#define mdcv_dev mdcver[3]
+
+/**
+ * struct omf_mdcrec_data -
+ * @omd_version: OMF_MDR_VERSION record
+ * @omd_objid: object identifier
+ * @omd_gen: object generation number
+ * @omd_layout:
+ * @omd_mblen: Length of written data in object
+ * @omd_old:
+ * @omd_uuid:
+ * @omd_parm:
+ * @omd_mclassp: mp_media_classp
+ * @omd_spzone: percent spare zones for drives in media class
+ * @omd_cfg:
+ * @omd_rtype: enum mdcrec_type_omf value
+ *
+ * object-related rtypes:
+ * ODELETE, OIDCKPT: objid field only; others ignored
+ * OERASE: objid and gen fields only; others ignored
+ * OCREATE, OUPDATE: layout field only; others ignored
+ */
+struct omf_mdcrec_data {
+ union ustruct {
+ struct omf_mdcver omd_version;
+
+ struct object {
+ u64 omd_objid;
+ u64 omd_gen;
+ struct pmd_layout *omd_layout;
+ u64 omd_mblen;
+ struct omf_layout_descriptor omd_old;
+ struct mpool_uuid omd_uuid;
+ u8 omd_mclass;
+ } obj;
+
+ struct drive_state {
+ struct omf_devparm_descriptor omd_parm;
+ } dev;
+
+ struct media_cls_spare {
+ u8 omd_mclassp;
+ u8 omd_spzone;
+ } mcs;
+
+ struct mpool_config omd_cfg;
+ } u;
+
+ u8 omd_rtype;
+};
+
+/**
+ * objid_type() - Return the type field from an objid
+ * @objid:
+ */
+static inline int objid_type(u64 objid)
+{
+ return ((objid & 0xF00) >> 8);
+}
+
+static inline bool objtype_valid(enum obj_type_omf otype)
+{
+	return otype == OMF_OBJ_MBLOCK || otype == OMF_OBJ_MLOG;
+}
+
+/*
+ * omf API functions -- exported functions for working with omf structures
+ */
+
+/**
+ * omf_sb_pack_htole() - pack superblock
+ * @sb: struct omf_sb_descriptor *
+ * @outbuf: char *
+ *
+ * Pack superblock into outbuf little-endian computing specified checksum.
+ *
+ * Return: 0 if successful, -EINVAL otherwise
+ */
+int omf_sb_pack_htole(struct omf_sb_descriptor *sb, char *outbuf);
+
+/**
+ * omf_sb_unpack_letoh() - unpack superblock
+ * @sb: struct omf_sb_descriptor *
+ * @inbuf: char *
+ * @omf_ver: on-media-format superblock version
+ *
+ * Unpack little-endian superblock from inbuf into sb verifying checksum.
+ *
+ * Return: 0 if successful, -errno otherwise
+ */
+int omf_sb_unpack_letoh(struct omf_sb_descriptor *sb, const char *inbuf, u16 *omf_ver);
+
+/**
+ * omf_sb_has_magic_le() - Determine if buffer has superblock magic value
+ * @inbuf: char *
+ *
+ * Determine if little-endian buffer inbuf has superblock magic value
+ * where expected; does NOT imply inbuf is a valid superblock.
+ *
+ * Return: 1 if true; 0 otherwise
+ */
+bool omf_sb_has_magic_le(const char *inbuf);
+
+/**
+ * omf_logblock_header_pack_htole() - pack log block header
+ * @lbh: struct omf_logblock_header *
+ * @outbuf: char *
+ *
+ * Pack header into little-endian log block buffer lbuf, ex-checksum.
+ *
+ * Return: 0 if successful, -errno otherwise
+ */
+int omf_logblock_header_pack_htole(struct omf_logblock_header *lbh, char *lbuf);
+
+/**
+ * omf_logblock_header_len_le() - Determine header length of log block
+ * @lbuf: char *
+ *
+ * Check little-endian log block in lbuf to determine header length.
+ *
+ * Return: bytes in packed header; -EINVAL if invalid header vers
+ */
+int omf_logblock_header_len_le(char *lbuf);
+
+/**
+ * omf_logblock_header_unpack_letoh() - unpack log block header
+ * @lbh: struct omf_logblock_header *
+ * @inbuf: char *
+ *
+ * Unpack little-endian log block header from lbuf into lbh; does not
+ * verify checksum.
+ *
+ * Return: 0 if successful, -EINVAL if invalid log block header vers
+ */
+int omf_logblock_header_unpack_letoh(struct omf_logblock_header *lbh, const char *inbuf);
+
+/**
+ * omf_logrec_desc_pack_htole() - pack log record descriptor
+ * @lrd: struct omf_logrec_descriptor *
+ * @outbuf: char *
+ *
+ * Pack log record descriptor into outbuf little-endian.
+ *
+ * Return: 0 if successful, -EINVAL if invalid log rec type
+ */
+int omf_logrec_desc_pack_htole(struct omf_logrec_descriptor *lrd, char *outbuf);
+
+/**
+ * omf_logrec_desc_unpack_letoh() - unpack log record descriptor
+ * @lrd: struct omf_logrec_descriptor *
+ * @inbuf: char *
+ *
+ * Unpack little-endian log record descriptor from inbuf into lrd.
+ */
+void omf_logrec_desc_unpack_letoh(struct omf_logrec_descriptor *lrd, const char *inbuf);
+
+/**
+ * omf_mdcrec_pack_htole() - pack mdc record
+ * @mp: struct mpool_descriptor *
+ * @cdr: struct omf_mdcrec_data *
+ * @outbuf: char *
+ *
+ * Pack mdc record into outbuf little-endian.
+ * NOTE: Assumes outbuf has enough space for the layout structure.
+ *
+ * Return: bytes packed if successful, -EINVAL otherwise
+ */
+int omf_mdcrec_pack_htole(struct mpool_descriptor *mp, struct omf_mdcrec_data *cdr, char *outbuf);
+
+/**
+ * omf_mdcrec_unpack_letoh() - unpack mdc record
+ * @mdcver: mdc content version of the mdc from which this data comes.
+ * NULL means latest MDC content version known by this binary.
+ * @mp: struct mpool_descriptor *
+ * @cdr: struct omf_mdcrec_data *
+ * @inbuf: char *
+ *
+ * Unpack little-endian mdc record from inbuf into cdr.
+ *
+ * Return: 0 if successful, -errno on error
+ */
+int omf_mdcrec_unpack_letoh(struct omf_mdcver *mdcver, struct mpool_descriptor *mp,
+ struct omf_mdcrec_data *cdr, const char *inbuf);
+
+/**
+ * omf_mdcrec_isobj_le() - determine if mdc record is object-related
+ * @inbuf: char *
+ *
+ * Return true if little-endian mdc record in inbuf is object-related.
+ */
+int omf_mdcrec_isobj_le(const char *inbuf);
+
+/**
+ * omf_mdcver_unpack_letoh() - Unpack le mdc version record from inbuf.
+ * @cdr:
+ * @inbuf:
+ */
+void omf_mdcver_unpack_letoh(struct omf_mdcrec_data *cdr, const char *inbuf);
+
+/**
+ * omf_mdcrec_unpack_type_letoh() - extract the record type from a packed MDC record.
+ * @inbuf: packed MDC record.
+ */
+u8 omf_mdcrec_unpack_type_letoh(const char *inbuf);
+
+/**
+ * logrec_type_datarec() - data record or not
+ * @rtype:
+ *
+ * Return: true if the log record type is related to a data record.
+ */
+bool logrec_type_datarec(enum logrec_type_omf rtype);
+
+/**
+ * omf_sbver_to_mdcver() - Returns the matching mdc version for a given superblock version
+ * @sbver: superblock version
+ */
+struct omf_mdcver *omf_sbver_to_mdcver(enum sb_descriptor_ver_omf sbver);
+
+int omf_init(void) __cold;
+void omf_exit(void) __cold;
+
+#endif /* MPOOL_OMF_IF_H */
diff --git a/drivers/mpool/upgrade.h b/drivers/mpool/upgrade.h
new file mode 100644
index 000000000000..3b3748c47a3e
--- /dev/null
+++ b/drivers/mpool/upgrade.h
@@ -0,0 +1,128 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc. All rights reserved.
+ */
+
+/*
+ * Defines structures for upgrading mpool metadata
+ */
+
+#ifndef MPOOL_UPGRADE_H
+#define MPOOL_UPGRADE_H
+
+#include "omf_if.h"
+
+/*
+ * Size of version converted to string.
+ * 4 * (5 bytes for a u16) + 3 * (1 byte for the '.') + 1 byte for \0
+ */
+#define MAX_MDCVERSTR 24
+
+/*
+ * Naming conventions:
+ *
+ * omf structures:
+ * ---------------
+ * The old structure names end with _omf_v<version number>.
+ * For example: layout_descriptor_omf_v1
+ * The current/latest structure names end simply with _omf.
+ * For example: layout_descriptor_omf
+ *
+ * Conversion functions:
+ * ---------------------
+ * They are named like:
+ * omf_convert_<blabla>_<maj>_<min>_<patch>_<dev>to<maj>_<min>_<patch>_<dev>()
+ *
+ * For example: omf_convert_sb_1_0_0_0to1_0_0_1()
+ *
+ * They are not named like omf_convert_<blabla>_v1tov2() because sometimes the
+ * input and output structures are exactly the same and the conversion is
+ * related to some subtle interpretation of structure field(s) content.
+ *
+ * Unpack functions:
+ * -----------------
+ * They are named like:
+ * omf_<blabla>_unpack_letoh_v<version number>()
+ * <version number> being the version of the structure.
+ *
+ * For example: omf_layout_unpack_letoh_v1()
+ * Note that for the latest/current version of the structure we cannot
+ * name the unpack function omf_<blabla>_unpack_letoh() because that would
+ * introduce a name conflict with the top unpack function that calls
+ * omf_unpack_letoh_and_convert()
+ *
+ * For example for layout we have:
+ * omf_layout_unpack_letoh_v1() unpacks layout_descriptor_omf_v1
+ * omf_layout_unpack_letoh_v2() unpacks layout_descriptor_omf
+ * omf_layout_unpack_letoh() calls one of the two above.
+ */
+
+/**
+ * struct upgrade_history - mpool metadata upgrade history entry
+ * @uh_size: size of the current version in-memory structure
+ * @uh_unpack: unpacking function from on-media format to in-memory format
+ * @uh_conv: conversion function from previous version to current version,
+ * set to NULL for the first version
+ * @uh_sbver: corresponding superblock version since which the change has
+ * been introduced. If this structure is not used by superblock
+ * set uh_sbver = OMF_SB_DESC_UNDEF.
+ * @uh_mdcver: corresponding mdc ver since which the change has been
+ * introduced
+ *
+ * Every time we update a nested structure in superblock or MDC, we need to
+ * save the following information about this update, such that we can keep the
+ * update history of this structure
+ */
+struct upgrade_history {
+ size_t uh_size;
+ int (*uh_unpack)(void *out, const char *inbuf);
+ int (*uh_conv)(const void *pre, void *cur);
+ enum sb_descriptor_ver_omf uh_sbver;
+ struct omf_mdcver uh_mdcver;
+};
+
+/**
+ * omfu_mdcver_cur() - Return the latest mpool MDC content version understood by this binary
+ */
+struct omf_mdcver *omfu_mdcver_cur(void);
+
+/**
+ * omfu_mdcver_comment() - Return the comment for the MDC content version passed in via "mdcver".
+ * @mdcver:
+ */
+const char *omfu_mdcver_comment(struct omf_mdcver *mdcver);
+
+/**
+ * omfu_mdcver_to_str() - convert a version into a string.
+ * @mdcver: version to convert
+ * @buf: buffer in which to place the conversion.
+ * @sz: size of "buf" in bytes.
+ *
+ * Returns "buf"
+ */
+char *omfu_mdcver_to_str(struct omf_mdcver *mdcver, char *buf, size_t sz);
+
+/**
+ * omfu_mdcver_cmp() - compare two versions a and b
+ * @a: first version
+ * @op: compare operator (C syntax), can be "<", "<=", ">", ">=", "==".
+ * @b: second version
+ *
+ * Return: true if (a op b)
+ */
+bool omfu_mdcver_cmp(struct omf_mdcver *a, char *op, struct omf_mdcver *b);
+
+/**
+ * omfu_mdcver_cmp2() - compare two versions
+ * @a: first version
+ * @op: compare operator (C syntax), can be "<", "<=", ">", ">=", "==".
+ * @major: major, minor, patch and dev, which compose the second version
+ * @minor:
+ * @patch:
+ * @dev:
+ *
+ * Return: true if (a op b)
+ */
+bool omfu_mdcver_cmp2(struct omf_mdcver *a, char *op, u16 major, u16 minor, u16 patch, u16 dev);
+
+#endif /* MPOOL_UPGRADE_H */
--
2.17.2
From: Nabeel M Mohamed <[email protected]>
This adds mpool lifecycle management functions: create,
activate, deactivate, destroy, rename, add a new media
class, fetch properties etc.
An mpool is created with a mandatory capacity media class
volume. A pool drive (PD) instance is initialized for each
media class volume using the attributes pushed from the mpool
user library. The metadata manager interfaces are then
invoked to activate this mpool and allocate the initial set
of metadata containers. The media class attributes, spare
percent, mpool configuration etc. are persisted in MDC-0.
At mpool activation, the records from MDC-0 containing the
mpool properties and metadata for accessing MDC-1 through
MDC-N are first loaded into memory, initializing all the
necessary in-core structures using the metadata manager and
space map interfaces. Then the records from MDC-1 through
MDC-N containing the metadata for accessing client mblock
and mlog objects are loaded into memory, again initializing
all the necessary in-core structures using the metadata
manager and space map interfaces.
An mpool is destroyed by erasing the superblock on all its
constituent media class volumes. Renaming an mpool updates
the superblock on all the media class volumes with the new
name. Adding a new media class volume to an activated mpool
is handled like initializing a volume at mpool create.
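For orientation, the call sequence is roughly as follows. The device
path, pd_prop contents, and mlog capacity below are placeholders; in
practice these attributes are pushed down from the mpool user library:

  char *dpaths[] = { "/dev/nvme0n1" };  /* capacity media class volume */
  struct pd_prop pdp = { };             /* filled in by device discovery */
  struct mpool_descriptor *mp;
  u64 mlog_cap = 0;                     /* placeholder root mlog capacity */
  int rc;

  /* Format the volume: write superblocks, activate, create the MDCs. */
  rc = mpool_create("mp1", 0, dpaths, &pdp, NULL, mlog_cap);

  /* Activate: replay MDC-0 through MDC-N into in-core structures. */
  if (!rc)
          rc = mpool_activate(1, dpaths, &pdp, mlog_cap, NULL, 0, &mp);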
Co-developed-by: Greg Becker <[email protected]>
Signed-off-by: Greg Becker <[email protected]>
Co-developed-by: Pierre Labat <[email protected]>
Signed-off-by: Pierre Labat <[email protected]>
Co-developed-by: John Groves <[email protected]>
Signed-off-by: John Groves <[email protected]>
Signed-off-by: Nabeel M Mohamed <[email protected]>
---
drivers/mpool/mp.c | 1086 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 1086 insertions(+)
create mode 100644 drivers/mpool/mp.c
diff --git a/drivers/mpool/mp.c b/drivers/mpool/mp.c
new file mode 100644
index 000000000000..6b8c51c23fec
--- /dev/null
+++ b/drivers/mpool/mp.c
@@ -0,0 +1,1086 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc. All rights reserved.
+ */
+
+/*
+ * Media pool (mpool) manager module.
+ *
+ * Defines functions to create and maintain mpools comprising multiple drives
+ * in multiple media classes used for storing mblocks and mlogs.
+ */
+
+#include <linux/string.h>
+#include <linux/mutex.h>
+#include <crypto/hash.h>
+
+#include "assert.h"
+#include "mpool_printk.h"
+
+#include "sb.h"
+#include "upgrade.h"
+#include "mpcore.h"
+#include "mp.h"
+
+/*
+ * Lock for serializing certain mpool ops where required/desirable; could be per
+ * mpool in some cases but no meaningful performance benefit for these rare ops;
+ * also protects mpool_pools and certain mpool_descriptor fields.
+ */
+static DEFINE_MUTEX(mpool_s_lock);
+
+int mpool_create(const char *mpname, u32 flags, char **dpaths, struct pd_prop *pd_prop,
+ struct mpcore_params *params, u64 mlog_cap)
+{
+ struct omf_sb_descriptor *sbmdc0;
+ struct mpool_descriptor *mp;
+ struct pmd_layout *mdc01, *mdc02;
+ bool active, sbvalid;
+ u16 sidx;
+ int err;
+
+ if (!mpname || !*mpname || !dpaths || !pd_prop)
+ return -EINVAL;
+
+ mdc01 = mdc02 = NULL;
+ active = sbvalid = false;
+
+ mp = mpool_desc_alloc();
+ if (!mp) {
+ err = -ENOMEM;
+ mp_pr_err("mpool %s, alloc desc failed", err, mpname);
+ return err;
+ }
+
+ sbmdc0 = &(mp->pds_sbmdc0);
+ strlcpy((char *)mp->pds_name, mpname, sizeof(mp->pds_name));
+ mpool_generate_uuid(&mp->pds_poolid);
+
+ if (params)
+ mp->pds_params = *params;
+
+ mp->pds_pdvcnt = 0;
+
+ mutex_lock(&mpool_s_lock);
+
+ /*
+ * Allocate the per-mpool workqueue.
+ * TODO: Make this per-driver
+ */
+ mp->pds_erase_wq = alloc_workqueue("mperasewq", WQ_HIGHPRI, 0);
+ if (!mp->pds_erase_wq) {
+ err = -ENOMEM;
+ mp_pr_err("mpool %s, alloc per-mpool wq failed", err, mpname);
+ goto errout;
+ }
+
+ /*
+	 * Set the device parameters from the ones placed by discovery
+ * in pd_prop.
+ */
+ err = mpool_dev_init_all(mp->pds_pdv, 1, dpaths, pd_prop);
+ if (err) {
+ mp_pr_err("mpool %s, failed to get device parameters", err, mpname);
+ goto errout;
+ }
+
+ mp->pds_pdvcnt = 1;
+
+ mpool_mdc_cap_init(mp, &mp->pds_pdv[0]);
+
+	/* Init the new pool drive's uuid */
+ mpool_generate_uuid(&mp->pds_pdv[0].pdi_devid);
+
+ /*
+ * Init mpool descriptor from new drive info.
+	 * Creates the media classes and places the PDs in them.
+ * Determine the media class used for the metadata.
+ */
+ err = mpool_desc_init_newpool(mp, flags);
+ if (err) {
+ mp_pr_err("mpool %s, desc init from new drive info failed", err, mpname);
+ goto errout;
+ }
+
+ /*
+	 * Alloc empty mdc0 and write superblocks to all drives; if we
+	 * crash here, drives with superblocks will not be recognized as
+	 * mpool members because there are not yet any drive state
+	 * records in mdc0.
+ */
+ sbvalid = true;
+ err = mpool_dev_sbwrite_newpool(mp, sbmdc0);
+ if (err) {
+ mp_pr_err("mpool %s, couldn't write superblocks", err, mpname);
+ goto errout;
+ }
+
+ /* Alloc mdc0 mlog layouts and activate mpool with empty mdc0 */
+ err = mpool_mdc0_sb2obj(mp, sbmdc0, &mdc01, &mdc02);
+ if (err) {
+ mp_pr_err("mpool %s, alloc of MDC0 mlogs failed", err, mpname);
+ goto errout;
+ }
+
+ err = pmd_mpool_activate(mp, mdc01, mdc02, 1);
+ if (err) {
+ mp_pr_err("mpool %s, activation failed", err, mpname);
+ goto errout;
+ }
+
+ active = true;
+
+ /*
+ * Add the version record (always first record) in MDC0.
+ * The version record is used only from version 1.0.0.1.
+ */
+ if (omfu_mdcver_cmp2(omfu_mdcver_cur(), ">=", 1, 0, 0, 1)) {
+ err = pmd_mdc_addrec_version(mp, 0);
+ if (err) {
+ mp_pr_err("mpool %s, writing MDC version record in MDC0 failed",
+ err, mpname);
+ goto errout;
+ }
+ }
+
+ /*
+	 * Add drive state records to mdc0; if we crash before completing,
+	 * an attempt to open the same drive list will be detected; it may
+	 * be possible to open, without detection, the subset of the drive
+	 * list for which state records were written, in which case the
+	 * other drives can be added.
+ */
+ err = pmd_prop_mcconfig(mp, &mp->pds_pdv[0], false);
+ if (err) {
+ mp_pr_err("mpool %s, add drive state to MDC0 failed", err, mpname);
+ goto errout;
+ }
+
+ /*
+ * Create mdcs so user can create mlog/mblock objects;
+	 * if we crash before all the configured mdcs are created, or if
+	 * create fails, activate will detect this and retry.
+ *
+ * mp_cmdcn corresponds to the number of MDCNs used for client
+ * objects, i.e., [1 - mp_cmdcn]
+ */
+ for (sidx = 1; sidx <= mp->pds_params.mp_mdcnum; sidx++) {
+ err = pmd_mdc_alloc(mp, mp->pds_params.mp_mdcncap, sidx - 1);
+ if (err) {
+ mp_pr_info("mpool %s, only %u MDCs out of %lu MDCs were created",
+ mpname, sidx - 1, (ulong)mp->pds_params.mp_mdcnum);
+ /*
+ * For MDCN creation failure, mask the error and
+ * continue further with create.
+ */
+ err = 0;
+ break;
+ }
+ }
+ pmd_update_credit(mp);
+
+ /*
+ * Attempt root mlog creation only if MDC1 was successfully created.
+ * If MDC1 doesn't exist, it will be re-created during activate.
+ */
+ if (sidx > 1) {
+ err = mpool_create_rmlogs(mp, mlog_cap);
+ if (err) {
+ mp_pr_info("mpool %s, root mlog creation failed", mpname);
+ /*
+ * If root mlog creation fails, mask the error and
+ * proceed with create. root mlogs will be re-created
+ * during activate.
+ */
+ err = 0;
+ }
+ }
+
+ /* Add mp to the list of all open mpools */
+ uuid_to_mpdesc_insert(&mpool_pools, mp);
+
+errout:
+
+ if (mp->pds_erase_wq)
+ destroy_workqueue(mp->pds_erase_wq);
+
+ if (active)
+ pmd_mpool_deactivate(mp);
+
+ if (err && sbvalid) {
+ struct mpool_dev_info *pd;
+ int err1;
+
+ /* Erase super blocks on the drives */
+ pd = &mp->pds_pdv[0];
+ if (mpool_pd_status_get(pd) != PD_STAT_ONLINE) {
+ err1 = -EIO;
+ mp_pr_err("%s:%s unavailable or offline, status %d",
+ err1, mp->pds_name, pd->pdi_name, mpool_pd_status_get(pd));
+ } else {
+ err1 = sb_erase(&pd->pdi_parm);
+ if (err1)
+ mp_pr_info("%s: cleanup, sb erase failed on device %s",
+ mp->pds_name, pd->pdi_name);
+ }
+ }
+
+ mpool_desc_free(mp);
+
+ mutex_unlock(&mpool_s_lock);
+
+ return err;
+}
+
+int mpool_activate(u64 dcnt, char **dpaths, struct pd_prop *pd_prop, u64 mlog_cap,
+ struct mpcore_params *params, u32 flags, struct mpool_descriptor **mpp)
+{
+ struct omf_sb_descriptor *sbmdc0;
+ struct mpool_descriptor *mp;
+ struct pmd_layout *mdc01 = NULL;
+ struct pmd_layout *mdc02 = NULL;
+ struct media_class *mcmeta;
+ u64 mdcmax, mdcnum, mdcncap, mdc0cap;
+ bool force = ((flags & (1 << MP_FLAGS_FORCE)) != 0);
+ bool mc_resize[MP_MED_NUMBER] = { };
+ bool active;
+ int dup, doff, err, i;
+ u8 pdh;
+
+ active = false;
+ *mpp = NULL;
+
+ if (dcnt > MPOOL_DRIVES_MAX) {
+ err = -EINVAL;
+ mp_pr_err("too many drives in input %lu, first drive path %s",
+ err, (ulong)dcnt, dpaths[0]);
+ return err;
+ }
+
+	/* Verify no duplicate drive paths */
+ err = check_for_dups(dpaths, dcnt, &dup, &doff);
+ if (err) {
+ mp_pr_err("duplicate drive check failed", err);
+ return err;
+ } else if (dup) {
+ err = -EINVAL;
+ mp_pr_err("duplicate drive path %s", err, (doff == -1) ? "" : dpaths[doff]);
+ return err;
+ }
+
+	/* Alloc mpool descriptor and fill in device-independent values */
+ mp = mpool_desc_alloc();
+ if (!mp) {
+ err = -ENOMEM;
+ mp_pr_err("alloc mpool desc failed", err);
+ return err;
+ }
+
+ sbmdc0 = &(mp->pds_sbmdc0);
+
+ mp->pds_pdvcnt = 0;
+
+ if (params)
+ mp->pds_params = *params;
+
+ mutex_lock(&mpool_s_lock);
+
+ mp->pds_workq = alloc_workqueue("mpoolwq", WQ_UNBOUND, 0);
+ if (!mp->pds_workq) {
+ err = -ENOMEM;
+ mp_pr_err("alloc mpoolwq failed, first drive path %s", err, dpaths[0]);
+ goto errout;
+ }
+
+ mp->pds_erase_wq = alloc_workqueue("mperasewq", WQ_HIGHPRI, 0);
+ if (!mp->pds_erase_wq) {
+ err = -ENOMEM;
+ mp_pr_err("alloc mperasewq failed, first drive path %s", err, dpaths[0]);
+ goto errout;
+ }
+
+ /* Get device parm for all drive paths */
+ err = mpool_dev_init_all(mp->pds_pdv, dcnt, dpaths, pd_prop);
+ if (err) {
+ mp_pr_err("can't get drive device params, first drive path %s", err, dpaths[0]);
+ goto errout;
+ }
+
+ /* Set mp.pdvcnt so dpaths will get closed in cleanup if activate fails. */
+ mp->pds_pdvcnt = dcnt;
+
+ /* Init mpool descriptor from superblocks on drives */
+ err = mpool_desc_init_sb(mp, sbmdc0, flags, mc_resize);
+ if (err) {
+ mp_pr_err("mpool_desc_init_sb failed, first drive path %s", err, dpaths[0]);
+ goto errout;
+ }
+
+ mcmeta = &mp->pds_mc[mp->pds_mdparm.md_mclass];
+ if (mcmeta->mc_pdmc < 0) {
+ err = -ENODEV;
+ mp_pr_err("mpool %s, too many unavailable drives", err, mp->pds_name);
+ goto errout;
+ }
+
+ /* Alloc mdc0 mlog layouts from superblock and activate mpool */
+ err = mpool_mdc0_sb2obj(mp, sbmdc0, &mdc01, &mdc02);
+ if (err) {
+ mp_pr_err("mpool %s, allocation of MDC0 mlogs layouts failed", err, mp->pds_name);
+ goto errout;
+ }
+
+ err = pmd_mpool_activate(mp, mdc01, mdc02, 0);
+ if (err) {
+ mp_pr_err("mpool %s, activation failed", err, mp->pds_name);
+ goto errout;
+ }
+
+ active = true;
+
+ for (pdh = 0; pdh < mp->pds_pdvcnt; pdh++) {
+ struct mpool_dev_info *pd;
+
+ pd = &mp->pds_pdv[pdh];
+
+ if (mc_resize[pd->pdi_mclass]) {
+ err = pmd_prop_mcconfig(mp, pd, false);
+ if (err) {
+ mp_pr_err("mpool %s, updating MCCONFIG record for resize failed",
+ err, mp->pds_name);
+ goto errout;
+ }
+ }
+
+ if (pd->pdi_mclass == MP_MED_CAPACITY)
+ mpool_mdc_cap_init(mp, pd);
+ }
+
+ /* Tolerate unavailable drives only if force flag specified */
+ for (i = 0; !force && i < MP_MED_NUMBER; i++) {
+ struct media_class *mc;
+
+ mc = &mp->pds_mc[i];
+ if (mc->mc_uacnt) {
+ err = -ENODEV;
+ mp_pr_err("mpool %s, unavailable drives present", err, mp->pds_name);
+ goto errout;
+ }
+ }
+
+ /*
+ * Create mdcs if needed so user can create mlog/mblock objects;
+ * Only needed if the configured number of mdcs did not get created
+ * during mpool create due to crash or failure.
+ */
+ mdcmax = mdcncap = mdc0cap = 0;
+ mdcnum = mp->pds_params.mp_mdcnum;
+
+ pmd_mdc_cap(mp, &mdcmax, &mdcncap, &mdc0cap);
+
+ if (mdc0cap)
+ mp->pds_params.mp_mdc0cap = mdc0cap;
+
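+	/* mdcncap as reported here is an aggregate across MDCNs; convert to per-MDC. */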
+ if (mdcncap && mdcmax) {
+ mdcncap = mdcncap / mdcmax;
+ mp->pds_params.mp_mdcncap = mdcncap;
+ mp->pds_params.mp_mdcnum = mdcmax;
+ }
+
+ if (mdcmax < mdcnum) {
+ mp_pr_info("mpool %s, detected missing MDCs %lu %lu",
+ mp->pds_name, (ulong)mdcnum, (ulong)mdcmax);
+
+ for (mdcmax++; mdcmax <= mdcnum; mdcmax++) {
+
+ err = pmd_mdc_alloc(mp, mp->pds_params.mp_mdcncap,
+ mdcmax);
+ if (!err)
+ continue;
+
+ /* MDC1 creation failure - non-functional mpool */
+ if (mdcmax < 2) {
+ mp_pr_err("mpool %s, MDC1 can't be created", err, mp->pds_name);
+ goto errout;
+ }
+
+ mp_pr_notice("mpool %s, couldn't create %lu MDCs out of %lu MDCs",
+ mp->pds_name, (ulong)(mdcnum - mdcmax + 1), (ulong)mdcnum);
+
+ /*
+ * For MDCN (N > 1) creation failure, log a warning,
+ * mask the error and continue with activate. Mpool
+ * only needs a minimum of 1 MDC to be functional.
+ */
+ err = 0;
+
+ break;
+ }
+ mp->pds_params.mp_mdcnum = mdcmax - 1;
+ }
+
+ pmd_update_credit(mp);
+
+ /*
+ * If we reach here, then MDC1 must exist. Now, make sure that the
+ * root mlogs also exist and if they don't, re-create them.
+ */
+ err = mpool_create_rmlogs(mp, mlog_cap);
+ if (err) {
+ /* Root mlogs creation failure - non-functional mpool */
+ mp_pr_err("mpool %s, root mlogs creation failed", err, mp->pds_name);
+ goto errout;
+ }
+
+ /* Add mp to the list of all activated mpools */
+ uuid_to_mpdesc_insert(&mpool_pools, mp);
+
+ /* Start the background thread doing pre-compaction of MDC1/255 */
+ pmd_precompact_start(mp);
+
+errout:
+ if (err) {
+ if (mp->pds_workq)
+ destroy_workqueue(mp->pds_workq);
+ if (mp->pds_erase_wq)
+ destroy_workqueue(mp->pds_erase_wq);
+
+ if (active)
+ pmd_mpool_deactivate(mp);
+
+ mpool_desc_free(mp);
+ mp = NULL;
+ }
+
+ mutex_unlock(&mpool_s_lock);
+
+ *mpp = mp;
+
+ if (!err) {
+ /*
+ * Start the periodic background job which logs a message
+ * when an mpool's usable space is close to its limits.
+ */
+ struct smap_usage_work *usagew;
+
+ usagew = &mp->pds_smap_usage_work;
+
+ INIT_DELAYED_WORK(&usagew->smapu_wstruct, smap_log_mpool_usage);
+ usagew->smapu_mp = mp;
+ smap_log_mpool_usage(&usagew->smapu_wstruct.work);
+ }
+
+ return err;
+}
+
+int mpool_deactivate(struct mpool_descriptor *mp)
+{
+ pmd_precompact_stop(mp);
+ smap_wait_usage_done(mp);
+
+ mutex_lock(&mpool_s_lock);
+ destroy_workqueue(mp->pds_workq);
+ destroy_workqueue(mp->pds_erase_wq);
+
+ pmd_mpool_deactivate(mp);
+
+ mpool_desc_free(mp);
+ mutex_unlock(&mpool_s_lock);
+
+ return 0;
+}
+
+int mpool_destroy(u64 dcnt, char **dpaths, struct pd_prop *pd_prop, u32 flags)
+{
+ struct omf_sb_descriptor *sbmdc0;
+ struct mpool_descriptor *mp;
+ int dup, doff;
+ int err, i;
+
+ if (dcnt > MPOOL_DRIVES_MAX) {
+ err = -EINVAL;
+ mp_pr_err("first pd %s, too many drives %lu %d",
+ err, dpaths[0], (ulong)dcnt, MPOOL_DRIVES_MAX);
+ return err;
+ } else if (dcnt == 0) {
+ return -EINVAL;
+ }
+
+	/* Verify no duplicate drive paths */
+ err = check_for_dups(dpaths, dcnt, &dup, &doff);
+ if (err) {
+ mp_pr_err("check_for_dups failed, dcnt %lu", err, (ulong)dcnt);
+ return err;
+ } else if (dup) {
+		err = -EINVAL;
+ mp_pr_err("duplicate drives found", err);
+ return err;
+ }
+
+ sbmdc0 = kzalloc(sizeof(*sbmdc0), GFP_KERNEL);
+ if (!sbmdc0) {
+ err = -ENOMEM;
+ mp_pr_err("alloc sb %zu failed", err, sizeof(*sbmdc0));
+ return err;
+ }
+
+ mp = mpool_desc_alloc();
+ if (!mp) {
+ err = -ENOMEM;
+ mp_pr_err("alloc mpool desc failed", err);
+ kfree(sbmdc0);
+ return err;
+ }
+
+ mp->pds_pdvcnt = 0;
+
+ mutex_lock(&mpool_s_lock);
+
+ /* Get device parm for all drive paths */
+ err = mpool_dev_init_all(mp->pds_pdv, dcnt, dpaths, pd_prop);
+ if (err) {
+ mp_pr_err("first pd %s, get device params failed", err, dpaths[0]);
+ goto errout;
+ }
+
+ /* Set pdvcnt so dpaths will get closed in cleanup if open fails. */
+ mp->pds_pdvcnt = dcnt;
+
+ /* Init mpool descriptor from superblocks on drives */
+ err = mpool_desc_init_sb(mp, sbmdc0, flags, NULL);
+ if (err) {
+ mp_pr_err("mpool %s, first pd %s, mpool desc init from sb failed",
+ err, (mp->pds_name == NULL) ? "" : mp->pds_name, dpaths[0]);
+ goto errout;
+ }
+
+ /* Erase super blocks on the drives */
+ for (i = 0; i < mp->pds_pdvcnt; i++) {
+ struct mpool_dev_info *pd;
+
+ pd = &mp->pds_pdv[i];
+ if (mpool_pd_status_get(pd) != PD_STAT_ONLINE) {
+ err = -EIO;
+ mp_pr_err("pd %s unavailable or offline, status %d",
+ err, pd->pdi_name, mpool_pd_status_get(pd));
+ } else {
+ err = sb_erase(&pd->pdi_parm);
+ if (err)
+ mp_pr_err("pd %s, sb erase failed", err, pd->pdi_name);
+ }
+
+ if (err)
+ break;
+ }
+
+errout:
+ mpool_desc_free(mp);
+
+ mutex_unlock(&mpool_s_lock);
+
+ kfree(sbmdc0);
+
+ return err;
+}
+
+int mpool_rename(u64 dcnt, char **dpaths, struct pd_prop *pd_prop,
+ u32 flags, const char *mp_newname)
+{
+ struct omf_sb_descriptor *sb;
+ struct mpool_descriptor *mp;
+ struct mpool_dev_info *pd = NULL;
+ u16 omf_ver = OMF_SB_DESC_UNDEF;
+ bool force = ((flags & (1 << MP_FLAGS_FORCE)) != 0);
+ u8 pdh;
+ int dup, doff;
+ int err = 0;
+
+ if (!mp_newname || dcnt == 0)
+ return -EINVAL;
+
+ if (dcnt > MPOOL_DRIVES_MAX) {
+ err = -EINVAL;
+ mp_pr_err("first pd %s, too many drives %lu %d",
+ err, dpaths[0], (ulong)dcnt, MPOOL_DRIVES_MAX);
+ return err;
+ }
+
+	/* Verify no duplicate drive paths */
+ err = check_for_dups(dpaths, dcnt, &dup, &doff);
+ if (err) {
+ mp_pr_err("check_for_dups failed, dcnt %lu", err, (ulong)dcnt);
+ return err;
+ } else if (dup) {
+		err = -EINVAL;
+ mp_pr_err("duplicate drives found", err);
+ return err;
+ }
+
+ sb = kzalloc(sizeof(*sb), GFP_KERNEL);
+ if (!sb) {
+ err = -ENOMEM;
+ mp_pr_err("alloc sb %zu failed", err, sizeof(*sb));
+ return err;
+ }
+
+ mp = mpool_desc_alloc();
+ if (!mp) {
+ err = -ENOMEM;
+ mp_pr_err("alloc mpool desc failed", err);
+ kfree(sb);
+ return err;
+ }
+
+ mp->pds_pdvcnt = 0;
+
+ mutex_lock(&mpool_s_lock);
+
+ /* Get device parm for all drive paths */
+ err = mpool_dev_init_all(mp->pds_pdv, dcnt, dpaths, pd_prop);
+ if (err) {
+ mp_pr_err("first pd %s, get device params failed", err, dpaths[0]);
+ goto errout;
+ }
+
+	/* Set pdvcnt so dpaths will get closed in cleanup if open fails. */
+ mp->pds_pdvcnt = dcnt;
+
+ for (pdh = 0; pdh < mp->pds_pdvcnt; pdh++) {
+ pd = &mp->pds_pdv[pdh];
+
+ if (mpool_pd_status_get(pd) != PD_STAT_ONLINE) {
+ err = -EIO;
+ mp_pr_err("pd %s unavailable or offline, status %d",
+ err, pd->pdi_name, mpool_pd_status_get(pd));
+ goto errout;
+ }
+
+ /*
+ * Read superblock; init and validate pool drive info
+ * from device parameters stored in the super block.
+ */
+ err = sb_read(&pd->pdi_parm, sb, &omf_ver, force);
+ if (err) {
+ mp_pr_err("pd %s, sb read failed", err, pd->pdi_name);
+ goto errout;
+ }
+
+		if (omf_ver != OMF_SB_DESC_VER_LAST) {
+ err = -EOPNOTSUPP;
+ mp_pr_err("pd %s, invalid sb version %d %d",
+ err, pd->pdi_name, omf_ver, OMF_SB_DESC_VER_LAST);
+ goto errout;
+ }
+
+ if (!strcmp(mp_newname, sb->osb_name))
+ continue;
+
+ strlcpy(sb->osb_name, mp_newname, sizeof(sb->osb_name));
+
+ err = sb_write_update(&pd->pdi_parm, sb);
+ if (err) {
+ mp_pr_err("Failed to rename mpool %s on device %s",
+ err, mp->pds_name, pd->pdi_name);
+ goto errout;
+ }
+ }
+
+errout:
+ mutex_unlock(&mpool_s_lock);
+
+ mpool_desc_free(mp);
+ kfree(sb);
+
+ return err;
+}
+
+int mpool_drive_add(struct mpool_descriptor *mp, char *dpath, struct pd_prop *pd_prop)
+{
+ struct mpool_dev_info *pd;
+ struct mc_smap_parms mcsp;
+ char *dpathv[1] = { dpath };
+ bool erase = false;
+ bool smap = false;
+ int err;
+
+ /*
+ * All device list changes are serialized via mpool_s_lock so
+ * don't need to acquire mp.pdvlock until ready to update mpool
+ * descriptor
+ */
+ mutex_lock(&mpool_s_lock);
+
+ if (mp->pds_pdvcnt >= MPOOL_DRIVES_MAX) {
+ mutex_unlock(&mpool_s_lock);
+
+ mp_pr_warn("%s: pd %s, too many drives %u %d",
+ mp->pds_name, dpath, mp->pds_pdvcnt, MPOOL_DRIVES_MAX);
+ return -EINVAL;
+ }
+
+ /*
+ * get device parm for dpath; use next slot in mp.pdv which won't
+ * be visible until we update mp.pdvcnt
+ */
+ pd = &mp->pds_pdv[mp->pds_pdvcnt];
+
+ /*
+	 * Leftover state may be present from a previous attempt to add a
+	 * PD at this position. Clear it.
+ */
+ memset(pd, 0, sizeof(*pd));
+
+ err = mpool_dev_init_all(pd, 1, dpathv, pd_prop);
+ if (err) {
+ mutex_unlock(&mpool_s_lock);
+
+ mp_pr_err("%s: pd %s, getting drive params failed", err, mp->pds_name, dpath);
+ return err;
+ }
+
+ /* Confirm drive meets all criteria for adding to this mpool */
+ err = mpool_dev_check_new(mp, pd);
+ if (err) {
+ mp_pr_err("%s: pd %s, drive doesn't pass criteria", err, mp->pds_name, dpath);
+ goto errout;
+ }
+
+ /*
+ * Check that the drive can be added in a media class.
+ */
+ down_read(&mp->pds_pdvlock);
+ err = mpool_desc_pdmc_add(mp, mp->pds_pdvcnt, NULL, true);
+ up_read(&mp->pds_pdvlock);
+ if (err) {
+ mp_pr_err("%s: pd %s, can't place in any media class", err, mp->pds_name, dpath);
+ goto errout;
+ }
+
+ mpool_generate_uuid(&pd->pdi_devid);
+
+ /* Write mpool superblock to drive */
+ erase = true;
+ err = mpool_dev_sbwrite(mp, pd, NULL);
+ if (err) {
+ mp_pr_err("%s: pd %s, sb write failed", err, mp->pds_name, dpath);
+ goto errout;
+ }
+
+ /* Get percent spare */
+ down_read(&mp->pds_pdvlock);
+ err = mc_smap_parms_get(&mp->pds_mc[pd->pdi_mclass], &mp->pds_params, &mcsp);
+ up_read(&mp->pds_pdvlock);
+ if (err)
+ goto errout;
+
+ /* Alloc space map for drive */
+ err = smap_drive_init(mp, &mcsp, mp->pds_pdvcnt);
+ if (err) {
+ mp_pr_err("%s: pd %s, smap init failed", err, mp->pds_name, dpath);
+ goto errout;
+ }
+ smap = true;
+
+ /*
+ * Take MDC0 compact lock to prevent race with MDC0 compaction.
+ * Take it across memory and media update.
+ */
+ PMD_MDC0_COMPACTLOCK(mp);
+
+ /*
+ * Add drive state record to mdc0; if crash any time prior to adding
+ * this record the drive will not be recognized as an mpool member
+ * on next open
+ */
+ err = pmd_prop_mcconfig(mp, pd, false);
+ if (err) {
+ PMD_MDC0_COMPACTUNLOCK(mp);
+ mp_pr_err("%s: pd %s, adding drive state to MDC0 failed", err, mp->pds_name, dpath);
+ goto errout;
+ }
+
+ /* Make new drive visible in mpool */
+ down_write(&mp->pds_pdvlock);
+ mp->pds_pdvcnt++;
+
+ /*
+	 * Add the PD to its class. This should NOT fail because we already
+	 * checked that the drive can be added in a media class.
+ */
+ err = mpool_desc_pdmc_add(mp, mp->pds_pdvcnt - 1, NULL, false);
+ if (err)
+ mp->pds_pdvcnt--;
+
+ up_write(&mp->pds_pdvlock);
+ PMD_MDC0_COMPACTUNLOCK(mp);
+
+errout:
+ if (err) {
+ /*
+		 * No pd could have been added at mp->pds_pdvcnt since we
+ * dropped pds_pdvlock because mpool_s_lock is held.
+ */
+ if (smap)
+ smap_drive_free(mp, mp->pds_pdvcnt);
+
+ /*
+ * Erase the pd super blocks only if the pd doesn't already
+ * belong to this mpool or another one.
+ */
+ if (erase)
+ sb_erase(&pd->pdi_parm);
+
+ pd_dev_close(&pd->pdi_parm);
+ }
+
+ mutex_unlock(&mpool_s_lock);
+
+ return err;
+}
+
+void mpool_mclass_get_cnt(struct mpool_descriptor *mp, u32 *cnt)
+{
+ int i;
+
+ *cnt = 0;
+
+ down_read(&mp->pds_pdvlock);
+ for (i = 0; i < MP_MED_NUMBER; i++) {
+ struct media_class *mc;
+
+ mc = &mp->pds_mc[i];
+ if (mc->mc_pdmc >= 0)
+ (*cnt)++;
+ }
+ up_read(&mp->pds_pdvlock);
+}
+
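+/*
+ * Fill mcxv[] with the extended properties of each configured media class;
+ * on return, *mcxc is the number of entries filled.
+ */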
+int mpool_mclass_get(struct mpool_descriptor *mp, u32 *mcxc, struct mpool_mclass_xprops *mcxv)
+{
+ int i, n;
+
+ if (!mp || !mcxc || !mcxv)
+ return -EINVAL;
+
+ mutex_lock(&mpool_s_lock);
+ down_read(&mp->pds_pdvlock);
+
+ for (n = i = 0; i < MP_MED_NUMBER && n < *mcxc; i++) {
+ struct media_class *mc;
+
+ mc = &mp->pds_mc[i];
+ if (mc->mc_pdmc < 0)
+ continue;
+
+ mcxv->mc_mclass = mc->mc_parms.mcp_classp;
+ mcxv->mc_devtype = mc->mc_parms.mcp_devtype;
+ mcxv->mc_spare = mc->mc_sparms.mcsp_spzone;
+
+ mcxv->mc_zonepg = mc->mc_parms.mcp_zonepg;
+ mcxv->mc_sectorsz = mc->mc_parms.mcp_sectorsz;
+ mcxv->mc_features = mc->mc_parms.mcp_features;
+ mcxv->mc_uacnt = mc->mc_uacnt;
+ smap_mclass_usage(mp, i, &mcxv->mc_usage);
+
+ ++mcxv;
+ ++n;
+ }
+
+ up_read(&mp->pds_pdvlock);
+ mutex_unlock(&mpool_s_lock);
+
+ *mcxc = n;
+
+ return 0;
+}
+
+int mpool_drive_spares(struct mpool_descriptor *mp, enum mp_media_classp mclassp, u8 drive_spares)
+{
+ struct media_class *mc;
+ int err;
+
+ if (!mclass_isvalid(mclassp) || drive_spares > 100) {
+ err = -EINVAL;
+ mp_pr_err("mpool %s, setting percent %u spare for drives in media class %d failed",
+ err, mp->pds_name, drive_spares, mclassp);
+ return err;
+ }
+
+ /*
+ * Do not write the spare record or try updating spare if there are
+ * no PDs in the specified media class.
+ */
+ down_read(&mp->pds_pdvlock);
+ mc = &mp->pds_mc[mclassp];
+ up_read(&mp->pds_pdvlock);
+
+ if (mc->mc_pdmc < 0) {
+ err = -ENOENT;
+ goto skip_update;
+ }
+
+ mutex_lock(&mpool_s_lock);
+
+ /*
+ * Take mdc0 compact lock to prevent race with mdc0 compaction.
+ * Also make memory and media update to look atomic to compaction.
+ */
+ PMD_MDC0_COMPACTLOCK(mp);
+
+ /*
+ * update media class spare record in mdc0; no effect if crash before
+ * complete
+ */
+ err = pmd_prop_mcspare(mp, mclassp, drive_spares, false);
+ if (err) {
+ mp_pr_err("mpool %s, setting spare %u mclass %d failed, could not record in MDC0",
+ err, mp->pds_name, drive_spares, mclassp);
+ } else {
+ /* Update spare zone accounting for media class */
+ down_write(&mp->pds_pdvlock);
+
+ err = mc_set_spzone(&mp->pds_mc[mclassp], drive_spares);
+ if (err)
+ mp_pr_err("mpool %s, setting spare %u mclass %d failed",
+ err, mp->pds_name, drive_spares, mclassp);
+ else
+ /*
+ * smap accounting update always succeeds when
+ * mclassp/zone are valid
+ */
+ smap_drive_spares(mp, mclassp, drive_spares);
+
+ up_write(&mp->pds_pdvlock);
+ }
+
+ PMD_MDC0_COMPACTUNLOCK(mp);
+
+ mutex_unlock(&mpool_s_lock);
+
+skip_update:
+ return err;
+}
+
+void mpool_get_xprops(struct mpool_descriptor *mp, struct mpool_xprops *xprops)
+{
+ struct media_class *mc;
+ int mclassp, i;
+ u16 ftmax;
+
+ mutex_lock(&mpool_s_lock);
+ down_read(&mp->pds_pdvlock);
+
+ memcpy(xprops->ppx_params.mp_poolid.b, mp->pds_poolid.uuid, MPOOL_UUID_SIZE);
+ ftmax = 0;
+
+ for (mclassp = 0; mclassp < MP_MED_NUMBER; mclassp++) {
+ xprops->ppx_pd_mclassv[mclassp] = MP_MED_INVALID;
+
+ mc = &mp->pds_mc[mclassp];
+ if (mc->mc_pdmc < 0) {
+ xprops->ppx_drive_spares[mclassp] = 0;
+ xprops->ppx_uacnt[mclassp] = 0;
+
+ xprops->ppx_params.mp_mblocksz[mclassp] = 0;
+ continue;
+ }
+
+ xprops->ppx_drive_spares[mclassp] = mc->mc_sparms.mcsp_spzone;
+ xprops->ppx_uacnt[mclassp] = mc->mc_uacnt;
+ ftmax = max((u16)ftmax, (u16)(xprops->ppx_uacnt[mclassp]));
+ xprops->ppx_params.mp_mblocksz[mclassp] =
+ (mc->mc_parms.mcp_zonepg << PAGE_SHIFT) >> 20;
+ }
+
+ for (i = 0; i < mp->pds_pdvcnt; ++i) {
+ mc = &mp->pds_mc[mp->pds_pdv[i].pdi_mclass];
+ if (mc->mc_pdmc < 0)
+ continue;
+
+ xprops->ppx_pd_mclassv[i] = mc->mc_parms.mcp_classp;
+
+ strlcpy(xprops->ppx_pd_namev[i], mp->pds_pdv[i].pdi_name,
+ sizeof(xprops->ppx_pd_namev[i]));
+ }
+
+ up_read(&mp->pds_pdvlock);
+ mutex_unlock(&mpool_s_lock);
+
+ xprops->ppx_params.mp_stat = ftmax ? MPOOL_STAT_FAULTED : MPOOL_STAT_OPTIMAL;
+}
+
+int mpool_get_devprops_by_name(struct mpool_descriptor *mp, char *pdname,
+ struct mpool_devprops *dprop)
+{
+ int i;
+
+ down_read(&mp->pds_pdvlock);
+
+ for (i = 0; i < mp->pds_pdvcnt; i++) {
+ if (!strcmp(pdname, mp->pds_pdv[i].pdi_name))
+ fill_in_devprops(mp, i, dprop);
+ }
+
+ up_read(&mp->pds_pdvlock);
+
+ return 0;
+}
+
+void mpool_get_usage(struct mpool_descriptor *mp, enum mp_media_classp mclassp,
+ struct mpool_usage *usage)
+{
+ memset(usage, 0, sizeof(*usage));
+
+ down_read(&mp->pds_pdvlock);
+ if (mclassp != MP_MED_ALL) {
+ struct media_class *mc;
+
+ ASSERT(mclassp < MP_MED_NUMBER);
+
+ mc = &mp->pds_mc[mclassp];
+ if (mc->mc_pdmc < 0) {
+ /* Not an error, this media class is empty. */
+ up_read(&mp->pds_pdvlock);
+ return;
+ }
+ }
+ smap_mpool_usage(mp, mclassp, usage);
+ up_read(&mp->pds_pdvlock);
+
+ if (mclassp == MP_MED_ALL)
+ pmd_mpool_usage(mp, usage);
+}
+
+int mpool_config_store(struct mpool_descriptor *mp, const struct mpool_config *cfg)
+{
+ int err;
+
+ if (!mp || !cfg)
+ return -EINVAL;
+
+ mp->pds_cfg = *cfg;
+
+ err = pmd_prop_mpconfig(mp, cfg, false);
+ if (err)
+ mp_pr_err("mpool %s, logging config record failed", err, mp->pds_name);
+
+ return err;
+}
+
+int mpool_config_fetch(struct mpool_descriptor *mp, struct mpool_config *cfg)
+{
+ if (!mp || !cfg)
+ return -EINVAL;
+
+ *cfg = mp->pds_cfg;
+
+ return 0;
+}
--
2.17.2
From: Nabeel M Mohamed <[email protected]>
Mpool metadata is stored in metadata containers (MDC). An
mpool can have a maximum of 256 MDCs, MDC-0 through MDC-255.
The following metadata manager functionality is added here
for object persistence:
- Initialize and validate MDC0
- Allocate and initialize MDC 1-N. An mpool is created with
16 MDCs to provide the requisite concurrency.
- Dynamically scale up the number of MDCs when running low
on space and the garbage is below a certain threshold
across all MDCs
- Deserialize metadata records from MDC 0-N at mpool activation
and setup the corresponding in-memory structures
- Pre-compact MDC-K based on its usage, when the garbage in
  MDC-K is above a certain threshold. A pre-compacting MDC is
  not chosen for object allocation.
MDC0 is a distinguished container that stores both the metadata
for accessing MDC-1 through MDC-255 and all mpool properties.
MDC-1 through MDC-255 store the metadata for accessing client
allocated mblocks and mlogs. Metadata for accessing the mlogs
comprising MDC-0 is in the superblock for the capacity media
class.
In the context of MDC-1/255, compacting MDC-K is simply
serializing the in-memory metadata for accessing the still-live
client objects associated with MDC-K. In the context of MDC-0,
compacting is simply serializing the in-memory mpool properties
and in-memory metadata for accessing MDC-1/255.
An instance of struct pmd_mdc_info is created for each MDC in
an mpool. This struct hosts both the uncommitted and committed
object trees and a lock protecting each of the two trees.
Compacting an MDC requires freezing both the list of committed
objects in that MDC and the metadata for those objects,
which is facilitated by the compact lock in each MDC instance.
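Each MDC-K is backed by a pair of mlogs whose object IDs encode K.
As a rough userspace sketch of that pairing: the bit layout below is
hypothetical, and the driver's actual encoding lives in objid_make()
and logid_make(); only the pairing scheme (uniquifiers 2K and 2K+1 in
slot 0) mirrors the driver.

    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical stand-in for logid_make(uniq, 0); MDC mlogs
     * always use slot 0.
     */
    static uint64_t logid_make(uint64_t uniq, uint8_t slot)
    {
            return (uniq << 8) | slot;
    }

    int main(void)
    {
            uint64_t k;

            for (k = 0; k < 256; k++) {
                    uint64_t logid1 = logid_make(2 * k, 0);
                    uint64_t logid2 = logid_make(2 * k + 1, 0);

                    /* Recover K from either mlog as uniq >> 1. */
                    assert(((logid1 >> 8) >> 1) == k);
                    assert(((logid2 >> 8) >> 1) == k);
            }

            return 0;
    }

Recovering K as uniq >> 1 matches how pmd_mdc0_validate() attributes
mlogs to their MDC when verifying that each MDC owns exactly two
mlogs.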
Co-developed-by: Greg Becker <[email protected]>
Signed-off-by: Greg Becker <[email protected]>
Co-developed-by: Pierre Labat <[email protected]>
Signed-off-by: Pierre Labat <[email protected]>
Co-developed-by: John Groves <[email protected]>
Signed-off-by: John Groves <[email protected]>
Signed-off-by: Nabeel M Mohamed <[email protected]>
---
drivers/mpool/pmd.c     | 2046 +++++++++++++++++++++++++++++++++++++++
drivers/mpool/pmd_obj.c |    8 -
2 files changed, 2046 insertions(+), 8 deletions(-)
create mode 100644 drivers/mpool/pmd.c
diff --git a/drivers/mpool/pmd.c b/drivers/mpool/pmd.c
new file mode 100644
index 000000000000..07e08b5eed43
--- /dev/null
+++ b/drivers/mpool/pmd.c
@@ -0,0 +1,2046 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc. All rights reserved.
+ */
+
+/*
+ * DOC: Module info.
+ *
+ * Pool metadata (pmd) module.
+ *
+ * Implements the metadata containers (MDCs) used to persist and access
+ * object metadata and mpool properties.
+ */
+
+#include <linux/workqueue.h>
+#include <linux/atomic.h>
+#include <linux/rwsem.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+
+#include "assert.h"
+#include "mpool_printk.h"
+
+#include "mpool_ioctl.h"
+#include "mdc.h"
+#include "upgrade.h"
+#include "smap.h"
+#include "omf_if.h"
+#include "mpcore.h"
+#include "pmd.h"
+
+static DEFINE_MUTEX(pmd_s_lock);
+
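+/* Walk the committed-object tree of an MDC slot in key order. */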
+#define pmd_co_foreach(_cinfo, _node) \
+ for ((_node) = rb_first(&(_cinfo)->mmi_co_root); (_node); (_node) = rb_next((_node)))
+
+static int pmd_mdc0_validate(struct mpool_descriptor *mp, int activation);
+
+static void pmd_mda_init(struct mpool_descriptor *mp)
+{
+ int i;
+
+ spin_lock_init(&mp->pds_mda.mdi_slotvlock);
+ mp->pds_mda.mdi_slotvcnt = 0;
+
+ for (i = 0; i < MDC_SLOTS; ++i) {
+ struct pmd_mdc_info *pmi = mp->pds_mda.mdi_slotv + i;
+
+ mutex_init(&pmi->mmi_compactlock);
+ mutex_init(&pmi->mmi_uc_lock);
+ pmi->mmi_uc_root = RB_ROOT;
+ init_rwsem(&pmi->mmi_co_lock);
+ pmi->mmi_co_root = RB_ROOT;
+ mutex_init(&pmi->mmi_uqlock);
+ pmi->mmi_luniq = 0;
+ pmi->mmi_recbuf = NULL;
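+		/* Last checkpointed objid for this slot (see OMF_MDR_OIDCKPT). */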
+ pmi->mmi_lckpt = objid_make(0, OMF_OBJ_UNDEF, i);
+ memset(&pmi->mmi_stats, 0, sizeof(pmi->mmi_stats));
+
+		/* Initial mpool metadata content version. */
+ pmi->mmi_mdcver.mdcv_major = 1;
+ pmi->mmi_mdcver.mdcv_minor = 0;
+ pmi->mmi_mdcver.mdcv_patch = 0;
+ pmi->mmi_mdcver.mdcv_dev = 0;
+
+ pmi->mmi_credit.ci_slot = i;
+
+ mutex_init(&pmi->mmi_stats_lock);
+ }
+
+ mp->pds_mda.mdi_slotv[1].mmi_luniq = UROOT_OBJID_MAX;
+	atomic_set(&mp->pds_mda.mdi_sel.mds_tbl_idx, 0);
+}
+
+static void pmd_mda_free(struct mpool_descriptor *mp)
+{
+ int sidx;
+
+ /*
+ * close mdc0 last because closing other mdc logs can result in
+ * mdc0 updates
+ */
+ for (sidx = mp->pds_mda.mdi_slotvcnt - 1; sidx > -1; sidx--) {
+ struct pmd_layout *layout, *tmp;
+ struct pmd_mdc_info *cinfo;
+
+ cinfo = &mp->pds_mda.mdi_slotv[sidx];
+
+ mp_mdc_close(cinfo->mmi_mdc);
+ kfree(cinfo->mmi_recbuf);
+ cinfo->mmi_recbuf = NULL;
+
+ /* Release committed objects... */
+ rbtree_postorder_for_each_entry_safe(
+ layout, tmp, &cinfo->mmi_co_root, eld_nodemdc) {
+
+ pmd_obj_put(layout);
+ }
+
+ /* Release uncommitted objects... */
+ rbtree_postorder_for_each_entry_safe(
+ layout, tmp, &cinfo->mmi_uc_root, eld_nodemdc) {
+
+ pmd_obj_put(layout);
+ }
+ }
+}
+
+static int pmd_mdc0_init(struct mpool_descriptor *mp, struct pmd_layout *mdc01,
+ struct pmd_layout *mdc02)
+{
+ struct pmd_mdc_info *cinfo = &mp->pds_mda.mdi_slotv[0];
+ int rc;
+
+ cinfo->mmi_recbuf = kzalloc(OMF_MDCREC_PACKLEN_MAX, GFP_KERNEL);
+ if (!cinfo->mmi_recbuf) {
+ rc = -ENOMEM;
+ mp_pr_err("mpool %s, log rec buffer alloc %zu failed",
+ rc, mp->pds_name, OMF_MDCREC_PACKLEN_MAX);
+ return rc;
+ }
+
+ /*
+ * we put the mdc0 mlog layouts in mdc 0 because mdc0 mlog objids have a
+ * slot # of 0 so the rest of the code expects to find the layout there.
+ * this allows the majority of the code to treat mdc0 mlog metadata
+ * exactly the same as for mdcN (and user mlogs), even though mdc0
+ * metadata is actually stored in superblocks. however there are a few
+ * places that need to recognize mdc0 mlogs are special, including
+ * pmd_mdc_compact() and pmd_obj_erase().
+ */
+
+ mp->pds_mda.mdi_slotvcnt = 1;
+ pmd_co_insert(cinfo, mdc01);
+ pmd_co_insert(cinfo, mdc02);
+
+ rc = mp_mdc_open(mp, mdc01->eld_objid, mdc02->eld_objid, MDC_OF_SKIP_SER, &cinfo->mmi_mdc);
+ if (rc) {
+ mp_pr_err("mpool %s, MDC0 open failed", rc, mp->pds_name);
+
+ pmd_co_remove(cinfo, mdc01);
+ pmd_co_remove(cinfo, mdc02);
+
+ kfree(cinfo->mmi_recbuf);
+ cinfo->mmi_recbuf = NULL;
+
+ mp->pds_mda.mdi_slotvcnt = 0;
+ }
+
+ return rc;
+}
+
+/**
+ * pmd_mdc0_validate() - validate MDC0 and clean up failed mdc allocs
+ * @mp: mpool descriptor
+ * @activation: set if called during mpool activation
+ *
+ * Called during mpool activation and mdc alloc because a failed mdc
+ * alloc can result in extraneous mdc mlog objects, which we attempt to
+ * clean up here. When called during activation we may need to adjust
+ * mp.mda; this is not so when called from mdc alloc, and in fact
+ * decreasing slotvcnt post activation would violate a key invariant.
+ */
+static int pmd_mdc0_validate(struct mpool_descriptor *mp, int activation)
+{
+ struct pmd_mdc_info *cinfo;
+ struct pmd_layout *layout;
+ struct rb_node *node;
+ int err = 0, err1, err2, i;
+ u64 mdcn, mdcmax = 0;
+ u64 logid1, logid2;
+ u16 slotvcnt;
+ u8 *lcnt;
+
+ /*
+ * Activation is single-threaded and mdc alloc is serialized
+ * so the number of active mdc (slotvcnt) will not change.
+ */
+ spin_lock(&mp->pds_mda.mdi_slotvlock);
+ slotvcnt = mp->pds_mda.mdi_slotvcnt;
+ spin_unlock(&mp->pds_mda.mdi_slotvlock);
+
+ if (!slotvcnt) {
+ /* Must be at least mdc0 */
+ err = -EINVAL;
+ mp_pr_err("mpool %s, no MDC0", err, mp->pds_name);
+ return err;
+ }
+
+ cinfo = &mp->pds_mda.mdi_slotv[0];
+
+ lcnt = kcalloc(MDC_SLOTS, sizeof(*lcnt), GFP_KERNEL);
+ if (!lcnt) {
+ err = -ENOMEM;
+ mp_pr_err("mpool %s, lcnt alloc failed", err, mp->pds_name);
+ return err;
+ }
+
+ pmd_co_rlock(cinfo, 0);
+
+ pmd_co_foreach(cinfo, node) {
+ layout = rb_entry(node, typeof(*layout), eld_nodemdc);
+
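+		/* Each MDC owns an mlog pair with uniquifiers 2*mdcn and 2*mdcn + 1. */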
+ mdcn = objid_uniq(layout->eld_objid) >> 1;
+ if (mdcn < MDC_SLOTS) {
+ lcnt[mdcn] = lcnt[mdcn] + 1;
+ mdcmax = max(mdcmax, mdcn);
+ }
+ if (mdcn >= MDC_SLOTS || lcnt[mdcn] > 2 ||
+ objid_type(layout->eld_objid) != OMF_OBJ_MLOG ||
+ objid_slot(layout->eld_objid)) {
+ err = -EINVAL;
+ mp_pr_err("mpool %s, MDC0 number of MDCs %lu %u or bad otype, objid 0x%lx",
+ err, mp->pds_name, (ulong)mdcn,
+ lcnt[mdcn], (ulong)layout->eld_objid);
+ break;
+ }
+ }
+
+ pmd_co_runlock(cinfo);
+
+ if (err)
+ goto exit;
+
+ if (!mdcmax) {
+ /*
+ * trivial case of mdc0 only; no mdc alloc failure to
+ * clean-up
+ */
+ if (lcnt[0] != 2 || slotvcnt != 1) {
+ err = -EINVAL;
+ mp_pr_err("mpool %s, inconsistent number of MDCs or slots %d %d",
+ err, mp->pds_name, lcnt[0], slotvcnt);
+ }
+
+ goto exit;
+ }
+
+ if ((mdcmax != (slotvcnt - 1)) && mdcmax != slotvcnt) {
+ err = -EINVAL;
+
+ /*
+ * mdcmax is normally slotvcnt-1; can be slotvcnt if
+ * mdc alloc failed
+ */
+ mp_pr_err("mpool %s, inconsistent max number of MDCs %lu %u",
+ err, mp->pds_name, (ulong)mdcmax, slotvcnt);
+ goto exit;
+ }
+
+ /* Both logs must always exist below mdcmax */
+ for (i = 0; i < mdcmax; i++) {
+ if (lcnt[i] != 2) {
+ err = -ENOENT;
+ mp_pr_err("mpool %s, MDC0 missing mlogs %lu %d %u",
+ err, mp->pds_name, (ulong)mdcmax, i, lcnt[i]);
+ goto exit;
+ }
+ }
+
+ /* Clean-up from failed mdc alloc if needed */
+ if (lcnt[mdcmax] != 2 || mdcmax == slotvcnt) {
+ /* Note: if activation then mdcmax == slotvcnt-1 always */
+ err1 = 0;
+ err2 = 0;
+ logid1 = logid_make(2 * mdcmax, 0);
+ logid2 = logid_make(2 * mdcmax + 1, 0);
+
+ layout = pmd_obj_find_get(mp, logid1, 1);
+ if (layout) {
+ err1 = pmd_obj_delete(mp, layout);
+ if (err1)
+ mp_pr_err("mpool %s, MDC0 %d, can't delete mlog %lu %lu %u %u",
+ err1, mp->pds_name, activation, (ulong)logid1,
+ (ulong)mdcmax, lcnt[mdcmax], slotvcnt);
+ }
+
+ layout = pmd_obj_find_get(mp, logid2, 1);
+ if (layout) {
+ err2 = pmd_obj_delete(mp, layout);
+ if (err2)
+ mp_pr_err("mpool %s, MDC0 %d, can't delete mlog %lu %lu %u %u",
+ err2, mp->pds_name, activation, (ulong)logid2,
+ (ulong)mdcmax, lcnt[mdcmax], slotvcnt);
+ }
+
+ if (activation) {
+ /*
+ * Mpool activation can ignore mdc alloc clean-up
+ * failures; single-threaded; don't need slotvlock
+ * or uqlock to adjust mda
+ */
+ cinfo->mmi_luniq = mdcmax - 1;
+ mp->pds_mda.mdi_slotvcnt = mdcmax;
+ mp_pr_warn("mpool %s, MDC0 alloc recovery: uniq %llu slotvcnt %d",
+ mp->pds_name, (unsigned long long)cinfo->mmi_luniq,
+ mp->pds_mda.mdi_slotvcnt);
+ } else {
+ /* MDC alloc cannot tolerate clean-up failures */
+ if (err1)
+ err = err1;
+ else if (err2)
+ err = err2;
+
+ if (err)
+ mp_pr_err("mpool %s, MDC0 alloc recovery, cleanup failed %lu %u %u",
+ err, mp->pds_name, (ulong)mdcmax, lcnt[mdcmax], slotvcnt);
+ else
+ mp_pr_warn("mpool %s, MDC0 alloc recovery", mp->pds_name);
+
+ }
+ }
+
+exit:
+ kfree(lcnt);
+
+ return err;
+}
+
+int pmd_mdc_alloc(struct mpool_descriptor *mp, u64 mincap, u32 iter)
+{
+ struct pmd_obj_capacity ocap;
+ enum mp_media_classp mclassp;
+ struct pmd_mdc_info *cinfo, *cinew;
+ struct pmd_layout *layout1, *layout2;
+ const char *msg = "(no detail)";
+ u64 mdcslot, logid1, logid2;
+ bool reverse = false;
+ u32 pdcnt;
+ int err;
+
+ /*
+ * serialize to prevent gap in mdc slot space in event of failure
+ */
+ mutex_lock(&pmd_s_lock);
+
+ /*
+	 * Recover a previously failed mdc alloc if needed; we cannot
+	 * continue if this fails.
+	 * Note: there is an unlikely corner case where we logically delete
+	 * an mlog from a previously failed mdc alloc but a background op
+	 * is preventing its full removal; this will show up later in this
+	 * fn as a failed alloc.
+ */
+ err = pmd_mdc0_validate(mp, 0);
+ if (err) {
+ mutex_unlock(&pmd_s_lock);
+
+ mp_pr_err("mpool %s, allocating an MDC, inconsistent MDC0", err, mp->pds_name);
+ return err;
+ }
+
+ /* MDC0 exists by definition; created as part of mpool creation */
+ cinfo = &mp->pds_mda.mdi_slotv[0];
+
+ pmd_mdc_lock(&cinfo->mmi_uqlock, 0);
+ mdcslot = cinfo->mmi_luniq;
+ pmd_mdc_unlock(&cinfo->mmi_uqlock);
+
+ if (mdcslot >= MDC_SLOTS - 1) {
+ mutex_unlock(&pmd_s_lock);
+
+ err = -ENOSPC;
+ mp_pr_err("mpool %s, allocating an MDC, too many %lu",
+ err, mp->pds_name, (ulong)mdcslot);
+ return err;
+ }
+ mdcslot = mdcslot + 1;
+
+ /*
+ * Alloc rec buf for new mdc slot; not visible so don't need to
+ * lock fields.
+ */
+ cinew = &mp->pds_mda.mdi_slotv[mdcslot];
+ cinew->mmi_recbuf = kzalloc(OMF_MDCREC_PACKLEN_MAX, GFP_KERNEL);
+ if (!cinew->mmi_recbuf) {
+ mutex_unlock(&pmd_s_lock);
+
+ mp_pr_warn("mpool %s, MDC%lu pack/unpack buf alloc failed %lu",
+ mp->pds_name, (ulong)mdcslot, (ulong)OMF_MDCREC_PACKLEN_MAX);
+ return -ENOMEM;
+ }
+ cinew->mmi_credit.ci_slot = mdcslot;
+
+ mclassp = MP_MED_CAPACITY;
+ pdcnt = 1;
+
+ /*
+ * Create new mdcs with same parameters and on same media class
+ * as mdc0.
+ */
+ ocap.moc_captgt = mincap;
+ ocap.moc_spare = false;
+
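+	/* The new slot's mlog pair gets uniquifiers 2*mdcslot and 2*mdcslot + 1. */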
+ logid1 = logid_make(2 * mdcslot, 0);
+ logid2 = logid_make(2 * mdcslot + 1, 0);
+
+ if (!(pdcnt & 0x1) && ((iter * 2 / pdcnt) & 0x1)) {
+ /*
+ * Reverse the allocation order.
+ * The goal is to have active mlogs on all the mpool PDs.
+ * If 2 PDs, no parity, no reserve, the active mlogs
+ * will be on PDs 0,1,0,1,0,1,0,1 etc
+ * instead of 0,0,0,0,0 etc without reversing.
+ * No need to reverse if the number of PDs is odd.
+ */
+ reverse = true;
+ }
+
+ /*
+ * Each mlog must meet mincap since only one is active at a
+ * time.
+ */
+ layout1 = NULL;
+ err = pmd_obj_alloc_cmn(mp, reverse ? logid2 : logid1, OMF_OBJ_MLOG,
+ &ocap, mclassp, 0, false, &layout1);
+ if (err) {
+ if (err != -ENOENT)
+ msg = "allocation of first mlog failed";
+ goto exit;
+ }
+
+ layout2 = NULL;
+ err = pmd_obj_alloc_cmn(mp, reverse ? logid1 : logid2, OMF_OBJ_MLOG,
+ &ocap, mclassp, 0, false, &layout2);
+ if (err) {
+ pmd_obj_abort(mp, layout1);
+ if (err != -ENOENT)
+ msg = "allocation of second mlog failed";
+ goto exit;
+ }
+
+ /*
+ * Must erase before commit to guarantee new mdc logs start
+	 * empty; the mlogs are not yet committed, so pmd_obj_erase()
+	 * is not needed for atomicity.
+ */
+ pmd_obj_wrlock(layout1);
+ err = pmd_layout_erase(mp, layout1);
+ pmd_obj_wrunlock(layout1);
+
+ if (err) {
+ msg = "erase of first mlog failed";
+ } else {
+ pmd_obj_wrlock(layout2);
+ err = pmd_layout_erase(mp, layout2);
+ pmd_obj_wrunlock(layout2);
+
+ if (err)
+ msg = "erase of second mlog failed";
+ }
+ if (err) {
+ pmd_obj_abort(mp, layout1);
+ pmd_obj_abort(mp, layout2);
+ goto exit;
+ }
+
+ /*
+ * don't need to commit logid1 and logid2 atomically; mdc0
+ * validation deletes non-paired mdc logs to handle failing part
+ * way through this process
+ */
+	err = pmd_obj_commit(mp, layout1);
+	if (err) {
+		pmd_obj_abort(mp, layout1);
+		pmd_obj_abort(mp, layout2);
+		msg = "commit of first mlog failed";
+		goto exit;
+	}
+
+	err = pmd_obj_commit(mp, layout2);
+	if (err) {
+		pmd_obj_delete(mp, layout1);
+		pmd_obj_abort(mp, layout2);
+		msg = "commit of second mlog failed";
+		goto exit;
+	}
+
+ /*
+ * Finalize new mdc slot before making visible; don't need to
+ * lock fields.
+ */
+ err = mp_mdc_open(mp, logid1, logid2, MDC_OF_SKIP_SER, &cinew->mmi_mdc);
+ if (err) {
+ msg = "mdc open failed";
+
+		/*
+		 * Failed open, so just delete logid1/2; we don't need to
+		 * delete atomically since mdc0 validation will clean up
+		 * any detritus.
+		 */
+ pmd_obj_delete(mp, layout1);
+ pmd_obj_delete(mp, layout2);
+ goto exit;
+ }
+
+ /*
+ * Append the version record.
+ */
+ if (omfu_mdcver_cmp2(omfu_mdcver_cur(), ">=", 1, 0, 0, 1)) {
+ err = pmd_mdc_addrec_version(mp, mdcslot);
+ if (err) {
+ msg = "error adding the version record";
+ /*
+			 * The absence of a version record in an MDC will
+			 * trigger an MDC compaction if activation is
+			 * attempted later with this empty MDC. The
+			 * compaction will add the version record to that
+			 * empty MDC.
+			 * Same error handling as above.
+ */
+ pmd_obj_delete(mp, layout1);
+ pmd_obj_delete(mp, layout2);
+ goto exit;
+ }
+ }
+
+ /* Make new mdc visible */
+ pmd_mdc_lock(&cinfo->mmi_uqlock, 0);
+
+ spin_lock(&mp->pds_mda.mdi_slotvlock);
+ cinfo->mmi_luniq = mdcslot;
+ mp->pds_mda.mdi_slotvcnt = mdcslot + 1;
+ spin_unlock(&mp->pds_mda.mdi_slotvlock);
+
+ pmd_mdc_unlock(&cinfo->mmi_uqlock);
+
+exit:
+ if (err) {
+ kfree(cinew->mmi_recbuf);
+ cinew->mmi_recbuf = NULL;
+ }
+
+ mutex_unlock(&pmd_s_lock);
+
+ mp_pr_debug("new mdc logid1 %llu logid2 %llu",
+ 0, (unsigned long long)logid1, (unsigned long long)logid2);
+
+	if (err) {
+		mp_pr_err("mpool %s, MDC%lu: %s", err, mp->pds_name, (ulong)mdcslot, msg);
+	} else {
+		mp_pr_debug("mpool %s, delta slotvcnt from %u to %llu", 0, mp->pds_name,
+			    mp->pds_mda.mdi_slotvcnt, (unsigned long long)mdcslot + 1);
+	}
+
+	return err;
+}
+
+/**
+ * pmd_mdc_alloc_set() - allocates a set of MDCs
+ * @mp: mpool descriptor
+ *
+ * Creates MDCs in multiples of MPOOL_MDC_SET_SZ. If an allocation
+ * failed in a prior iteration, allocates enough MDCs to restore an
+ * even multiple of MPOOL_MDC_SET_SZ.
+ *
+ * Locking: lock should not be held when calling this function.
+ */
+static void pmd_mdc_alloc_set(struct mpool_descriptor *mp)
+{
+ u8 mdc_cnt, sidx;
+ int rc;
+
+ /*
+	 * MDCs are created in multiples of MPOOL_MDC_SET_SZ. However,
+	 * if a past allocation failed there may not be an even multiple
+	 * of MDCs; in that case create the remaining MDCs to get an
+	 * even multiple.
+ */
+ mdc_cnt = MPOOL_MDC_SET_SZ - ((mp->pds_mda.mdi_slotvcnt - 1) % MPOOL_MDC_SET_SZ);
+
+ mdc_cnt = min(mdc_cnt, (u8)(MDC_SLOTS - (mp->pds_mda.mdi_slotvcnt)));
+
+ for (sidx = 1; sidx <= mdc_cnt; sidx++) {
+ rc = pmd_mdc_alloc(mp, mp->pds_params.mp_mdcncap, 0);
+ if (rc) {
+ mp_pr_err("mpool %s, only %u of %u MDCs created",
+				  rc, mp->pds_name, sidx - 1, mdc_cnt);
+
+ /*
+			 * For MDCN creation failure, ignore the error.
+			 * Attempt to create any remaining MDCs the next
+			 * time new MDCs are required.
+ */
+ rc = 0;
+ break;
+ }
+ }
+}
+
+/**
+ * pmd_cmp_drv_mdc0() - compare the drive info read from the MDC0 drive list
+ * to what is obtained from the drive itself or from the configuration.
+ * @mp: mpool descriptor
+ * @pdh: drive index in mp->pds_pdv[]
+ * @omd: devparm descriptor from the MDC0 drive list record
+ *
+ * The drive is in the list passed to mpool open, or is an UNAVAIL mdc0 drive.
+ */
+static int pmd_cmp_drv_mdc0(struct mpool_descriptor *mp, u8 pdh,
+ struct omf_devparm_descriptor *omd)
+{
+ const char *msg __maybe_unused;
+ struct mc_parms mcp_mdc0list, mcp_pd;
+ struct mpool_dev_info *pd;
+
+ pd = &mp->pds_pdv[pdh];
+
+ mc_pd_prop2mc_parms(&(pd->pdi_parm.dpr_prop), &mcp_pd);
+ mc_omf_devparm2mc_parms(omd, &mcp_mdc0list);
+
+ if (!memcmp(&mcp_pd, &mcp_mdc0list, sizeof(mcp_pd)))
+ return 0;
+
+ if (mpool_pd_status_get(pd) == PD_STAT_UNAVAIL)
+ msg = "UNAVAIL mdc0 drive parms don't match those in drive list record";
+ else
+ msg = "mismatch between MDC0 drive list record and drive parms";
+
+ mp_pr_warn("mpool %s, %s for %s, mclassp %d %d zonepg %u %u sectorsz %u %u devtype %u %u features %lu %lu",
+ mp->pds_name, msg, pd->pdi_name, mcp_pd.mcp_classp, mcp_mdc0list.mcp_classp,
+ mcp_pd.mcp_zonepg, mcp_mdc0list.mcp_zonepg, mcp_pd.mcp_sectorsz,
+ mcp_mdc0list.mcp_sectorsz, mcp_pd.mcp_devtype, mcp_mdc0list.mcp_devtype,
+ (ulong)mcp_pd.mcp_features, (ulong)mcp_mdc0list.mcp_features);
+
+ return -EINVAL;
+}
+
+static const char *msg_unavail1 __maybe_unused =
+	"defunct and unavailable drive still belongs to the mpool";
+
+static const char *msg_unavail2 __maybe_unused =
+	"defunct and available drive still belongs to the mpool";
+
+static int pmd_props_load(struct mpool_descriptor *mp)
+{
+ struct omf_devparm_descriptor netdev[MP_MED_NUMBER] = { };
+ struct omf_mdcrec_data *cdr;
+ enum mp_media_classp mclassp;
+ struct pmd_mdc_info *cinfo;
+ struct media_class *mc;
+ bool zombie[MPOOL_DRIVES_MAX];
+ int spzone[MP_MED_NUMBER], i;
+ size_t rlen = 0;
+ u64 pdh, buflen;
+ int err;
+
+ cinfo = &mp->pds_mda.mdi_slotv[0];
+ buflen = OMF_MDCREC_PACKLEN_MAX;
+
+ /* Note: single threaded here so don't need any locks */
+
+ /* Set mpool properties to defaults; overwritten by property records (if any). */
+ for (mclassp = 0; mclassp < MP_MED_NUMBER; mclassp++)
+ spzone[mclassp] = -1;
+
+ /*
+ * read mdc0 to capture net of drives, content version & other
+ * properties; ignore obj records
+ */
+ err = mp_mdc_rewind(cinfo->mmi_mdc);
+ if (err) {
+ mp_pr_err("mpool %s, MDC0 init for read properties failed", err, mp->pds_name);
+ return err;
+ }
+
+ cdr = kzalloc(sizeof(*cdr), GFP_KERNEL);
+ if (!cdr) {
+ err = -ENOMEM;
+ mp_pr_err("mpool %s, cdr alloc failed", err, mp->pds_name);
+ return err;
+ }
+
+ while (true) {
+ err = mp_mdc_read(cinfo->mmi_mdc, cinfo->mmi_recbuf, buflen, &rlen);
+ if (err) {
+ mp_pr_err("mpool %s, MDC0 read next failed %lu",
+ err, mp->pds_name, (ulong)rlen);
+ break;
+ }
+ if (rlen == 0)
+ /* Hit end of log */
+ break;
+
+ /*
+ * skip object-related mdcrec in mdc0; not ready to unpack
+ * these yet
+ */
+ if (omf_mdcrec_isobj_le(cinfo->mmi_recbuf))
+ continue;
+
+ err = omf_mdcrec_unpack_letoh(&(cinfo->mmi_mdcver), mp, cdr, cinfo->mmi_recbuf);
+ if (err) {
+ mp_pr_err("mpool %s, MDC0 property unpack failed", err, mp->pds_name);
+ break;
+ }
+
+ if (cdr->omd_rtype == OMF_MDR_MCCONFIG) {
+ struct omf_devparm_descriptor *src;
+
+ src = &cdr->u.dev.omd_parm;
+ ASSERT(src->odp_mclassp < MP_MED_NUMBER);
+
+ memcpy(&netdev[src->odp_mclassp], src, sizeof(*src));
+ continue;
+ }
+
+ if (cdr->omd_rtype == OMF_MDR_MCSPARE) {
+ mclassp = cdr->u.mcs.omd_mclassp;
+ if (mclass_isvalid(mclassp)) {
+ spzone[mclassp] = cdr->u.mcs.omd_spzone;
+ } else {
+ err = -EINVAL;
+
+ /* Should never happen */
+ mp_pr_err("mpool %s, MDC0 mclass spare record, invalid mclassp %u",
+ err, mp->pds_name, mclassp);
+ break;
+ }
+ continue;
+ }
+
+ if (cdr->omd_rtype == OMF_MDR_VERSION) {
+ cinfo->mmi_mdcver = cdr->u.omd_version;
+ if (omfu_mdcver_cmp(&cinfo->mmi_mdcver, ">", omfu_mdcver_cur())) {
+ char *buf1, *buf2 = NULL;
+
+ buf1 = kmalloc(2 * MAX_MDCVERSTR, GFP_KERNEL);
+ if (buf1) {
+ buf2 = buf1 + MAX_MDCVERSTR;
+					omfu_mdcver_to_str(&cinfo->mmi_mdcver, buf1, MAX_MDCVERSTR);
+					omfu_mdcver_to_str(omfu_mdcver_cur(), buf2, MAX_MDCVERSTR);
+ }
+
+ err = -EOPNOTSUPP;
+ mp_pr_err("mpool %s, MDC0 version %s, binary version %s",
+ err, mp->pds_name, buf1, buf2);
+ kfree(buf1);
+ break;
+ }
+ continue;
+ }
+
+ if (cdr->omd_rtype == OMF_MDR_MPCONFIG)
+ mp->pds_cfg = cdr->u.omd_cfg;
+ }
+
+ if (err) {
+ kfree(cdr);
+ return err;
+ }
+
+ /* Reconcile net drive list with those in mpool descriptor */
+ for (i = 0; i < mp->pds_pdvcnt; i++)
+ zombie[i] = true;
+
+ for (i = 0; i < MP_MED_NUMBER; i++) {
+ struct omf_devparm_descriptor *omd;
+ int j;
+
+ omd = &netdev[i];
+
+ if (mpool_uuid_is_null(&omd->odp_devid))
+ continue;
+
+ j = mp->pds_pdvcnt;
+ while (j--) {
+ if (mpool_uuid_compare(&mp->pds_pdv[j].pdi_devid, &omd->odp_devid) == 0)
+ break;
+ }
+
+ if (j >= 0) {
+ zombie[j] = false;
+ err = pmd_cmp_drv_mdc0(mp, j, omd);
+ if (err)
+ break;
+ } else {
+ err = mpool_desc_unavail_add(mp, omd);
+ if (err)
+ break;
+ zombie[mp->pds_pdvcnt - 1] = false;
+ }
+ }
+
+ /* Check for zombie drives and recompute uacnt[] */
+ if (!err) {
+ for (i = 0; i < MP_MED_NUMBER; i++) {
+ mc = &mp->pds_mc[i];
+ mc->mc_uacnt = 0;
+ }
+
+ for (pdh = 0; pdh < mp->pds_pdvcnt; pdh++) {
+ struct mpool_dev_info *pd;
+
+ mc = &mp->pds_mc[mp->pds_pdv[pdh].pdi_mclass];
+ pd = &mp->pds_pdv[pdh];
+ if (zombie[pdh]) {
+ char uuid_str[40];
+
+ mpool_unparse_uuid(&pd->pdi_devid, uuid_str);
+ err = -ENXIO;
+
+ if (mpool_pd_status_get(pd) == PD_STAT_UNAVAIL)
+ mp_pr_err("mpool %s, drive %s %s %s", err, mp->pds_name,
+ uuid_str, pd->pdi_name, msg_unavail1);
+ else
+ mp_pr_err("mpool %s, drive %s %s %s", err, mp->pds_name,
+ uuid_str, pd->pdi_name, msg_unavail2);
+ break;
+ } else if (mpool_pd_status_get(pd) == PD_STAT_UNAVAIL) {
+ mc->mc_uacnt += 1;
+ }
+ }
+ }
+
+ /*
+	 * Now it is possible to update the percent spare: all the media
+	 * classes of the mpool have been created, because all the mpool
+	 * PDs have been added to their classes.
+ */
+ if (!err) {
+ for (mclassp = 0; mclassp < MP_MED_NUMBER; mclassp++) {
+ if (spzone[mclassp] >= 0) {
+ err = mc_set_spzone(&mp->pds_mc[mclassp], spzone[mclassp]);
+ /*
+				 * Should never happen; a class with perf
+				 * level mclassp and at least 1 PD should exist.
+ */
+ if (err)
+ break;
+ }
+ }
+ if (err)
+ mp_pr_err("mpool %s, can't set spare %u because the class %u has no PD",
+ err, mp->pds_name, spzone[mclassp], mclassp);
+ }
+
+ kfree(cdr);
+
+ return err;
+}
+
+static int pmd_objs_load(struct mpool_descriptor *mp, u8 cslot)
+{
+ struct omf_mdcrec_data *cdr = NULL;
+ struct pmd_mdc_info *cinfo;
+ struct rb_node *node;
+ u64 argv[2] = { 0 };
+ const char *msg;
+ size_t recbufsz;
+ char *recbuf;
+ u64 mdcmax;
+ int err;
+
+ /* Note: single threaded here so don't need any locks */
+
+ recbufsz = OMF_MDCREC_PACKLEN_MAX;
+ msg = "(no detail)";
+ mdcmax = 0;
+
+ cinfo = &mp->pds_mda.mdi_slotv[cslot];
+
+ /* Initialize mdc if not mdc0. */
+ if (cslot) {
+ u64 logid1 = logid_make(2 * cslot, 0);
+ u64 logid2 = logid_make(2 * cslot + 1, 0);
+
+ /* Freed in pmd_mda_free() */
+ cinfo->mmi_recbuf = kmalloc(recbufsz, GFP_KERNEL);
+ if (!cinfo->mmi_recbuf) {
+ msg = "MDC recbuf alloc failed";
+ err = -ENOMEM;
+ goto errout;
+ }
+
+ err = mp_mdc_open(mp, logid1, logid2, MDC_OF_SKIP_SER, &cinfo->mmi_mdc);
+ if (err) {
+ msg = "mdc open failed";
+ goto errout;
+ }
+ }
+
+ /* Read mdc and capture net result of object data records. */
+ err = mp_mdc_rewind(cinfo->mmi_mdc);
+ if (err) {
+ msg = "mdc rewind failed";
+ goto errout;
+ }
+
+	/* Cache this pointer to simplify the ensuing code. */
+ recbuf = cinfo->mmi_recbuf;
+
+ cdr = kzalloc(sizeof(*cdr), GFP_KERNEL);
+ if (!cdr) {
+		msg = "cdr alloc failed";
+		err = -ENOMEM;
+		goto errout;
+ }
+
+ while (true) {
+ struct pmd_layout *layout, *found;
+ size_t rlen = 0;
+ u64 objid;
+
+ err = mp_mdc_read(cinfo->mmi_mdc, recbuf, recbufsz, &rlen);
+ if (err) {
+ msg = "mdc read data failed";
+ break;
+ }
+ if (rlen == 0)
+ break; /* Hit end of log */
+
+ /*
+ * Version record, if present, must be first.
+ */
+ if (omf_mdcrec_unpack_type_letoh(recbuf) == OMF_MDR_VERSION) {
+ omf_mdcver_unpack_letoh(cdr, recbuf);
+ cinfo->mmi_mdcver = cdr->u.omd_version;
+
+ if (omfu_mdcver_cmp(&cinfo->mmi_mdcver, ">", omfu_mdcver_cur())) {
+ char *buf1, *buf2 = NULL;
+
+ buf1 = kmalloc(2 * MAX_MDCVERSTR, GFP_KERNEL);
+ if (buf1) {
+ buf2 = buf1 + MAX_MDCVERSTR;
+					omfu_mdcver_to_str(&cinfo->mmi_mdcver, buf1, MAX_MDCVERSTR);
+					omfu_mdcver_to_str(omfu_mdcver_cur(), buf2, MAX_MDCVERSTR);
+ }
+
+ err = -EOPNOTSUPP;
+ mp_pr_err("mpool %s, MDC%u version %s, binary version %s",
+ err, mp->pds_name, cslot, buf1, buf2);
+ kfree(buf1);
+ break;
+ }
+ continue;
+ }
+
+		/*
+		 * Skip non object-related mdcrec in mdc0, i.e., property
+		 * records.
+		 */
+ if (!cslot && !omf_mdcrec_isobj_le(recbuf))
+ continue;
+
+ err = omf_mdcrec_unpack_letoh(&cinfo->mmi_mdcver, mp, cdr, recbuf);
+ if (err) {
+ msg = "mlog record unpack failed";
+ break;
+ }
+
+ objid = cdr->u.obj.omd_objid;
+
+ if (objid_slot(objid) != cslot) {
+ msg = "mlog record wrong slot";
+ err = -EBADSLT;
+ break;
+ }
+
+ if (cdr->omd_rtype == OMF_MDR_OCREATE) {
+ layout = cdr->u.obj.omd_layout;
+ layout->eld_state = PMD_LYT_COMMITTED;
+
+ found = pmd_co_insert(cinfo, layout);
+ if (found) {
+ msg = "OCREATE duplicate object ID";
+ pmd_obj_put(layout);
+ err = -EEXIST;
+ break;
+ }
+
+ atomic_inc(&cinfo->mmi_pco_cnt.pcc_cr);
+ atomic_inc(&cinfo->mmi_pco_cnt.pcc_cobj);
+
+ continue;
+ }
+
+ if (cdr->omd_rtype == OMF_MDR_ODELETE) {
+ found = pmd_co_find(cinfo, objid);
+ if (!found) {
+ msg = "ODELETE object not found";
+ err = -ENOENT;
+ break;
+ }
+
+ pmd_co_remove(cinfo, found);
+ pmd_obj_put(found);
+
+ atomic_inc(&cinfo->mmi_pco_cnt.pcc_del);
+ atomic_dec(&cinfo->mmi_pco_cnt.pcc_cobj);
+
+ continue;
+ }
+
+ if (cdr->omd_rtype == OMF_MDR_OIDCKPT) {
+ /*
+ * objid == mmi_lckpt == 0 is legit. Such records
+ * are appended by mpool MDC compaction due to a
+ * mpool metadata upgrade on an empty mpool.
+ */
+ if ((objid_uniq(objid) || objid_uniq(cinfo->mmi_lckpt))
+ && (objid_uniq(objid) <= objid_uniq(cinfo->mmi_lckpt))) {
+ msg = "OIDCKPT cdr ckpt %lu <= cinfo ckpt %lu";
+ argv[0] = objid_uniq(objid);
+ argv[1] = objid_uniq(cinfo->mmi_lckpt);
+ err = -EINVAL;
+ break;
+ }
+
+ cinfo->mmi_lckpt = objid;
+ continue;
+ }
+
+ if (cdr->omd_rtype == OMF_MDR_OERASE) {
+ layout = pmd_co_find(cinfo, objid);
+ if (!layout) {
+ msg = "OERASE object not found";
+ err = -ENOENT;
+ break;
+ }
+
+ /* Note: OERASE gen can equal layout gen after a compaction. */
+ if (cdr->u.obj.omd_gen < layout->eld_gen) {
+ msg = "OERASE cdr gen %lu < layout gen %lu";
+ argv[0] = cdr->u.obj.omd_gen;
+ argv[1] = layout->eld_gen;
+ err = -EINVAL;
+ break;
+ }
+
+ layout->eld_gen = cdr->u.obj.omd_gen;
+
+ atomic_inc(&cinfo->mmi_pco_cnt.pcc_er);
+ continue;
+ }
+
+ if (cdr->omd_rtype == OMF_MDR_OUPDATE) {
+ layout = cdr->u.obj.omd_layout;
+
+ found = pmd_co_find(cinfo, objid);
+ if (!found) {
+ msg = "OUPDATE object not found";
+ pmd_obj_put(layout);
+ err = -ENOENT;
+ break;
+ }
+
+ pmd_co_remove(cinfo, found);
+ pmd_obj_put(found);
+
+ layout->eld_state = PMD_LYT_COMMITTED;
+ pmd_co_insert(cinfo, layout);
+
+ atomic_inc(&cinfo->mmi_pco_cnt.pcc_up);
+
+ continue;
+ }
+ }
+
+ if (err)
+ goto errout;
+
+ /*
+ * Add all existing objects to space map.
+ * Also add/update per-mpool space usage stats
+ */
+ pmd_co_foreach(cinfo, node) {
+ struct pmd_layout *layout;
+
+ layout = rb_entry(node, typeof(*layout), eld_nodemdc);
+
+ /* Remember objid and gen in case of error... */
+ cdr->u.obj.omd_objid = layout->eld_objid;
+ cdr->u.obj.omd_gen = layout->eld_gen;
+
+ if (objid_slot(layout->eld_objid) != cslot) {
+ msg = "layout wrong slot";
+ err = -EBADSLT;
+ break;
+ }
+
+ err = pmd_smap_insert(mp, layout);
+ if (err) {
+ msg = "smap insert failed";
+ break;
+ }
+
+ pmd_update_obj_stats(mp, layout, cinfo, PMD_OBJ_LOAD);
+
+ /* For mdc0 track last logical mdc created. */
+ if (!cslot)
+ mdcmax = max(mdcmax, (objid_uniq(layout->eld_objid) >> 1));
+ }
+
+ if (err)
+ goto errout;
+
+ cdr->u.obj.omd_objid = 0;
+ cdr->u.obj.omd_gen = 0;
+
+ if (!cslot) {
+ /* MDC0: finish initializing mda */
+ cinfo->mmi_luniq = mdcmax;
+ mp->pds_mda.mdi_slotvcnt = mdcmax + 1;
+
+ /* MDC0 only: validate other mdc metadata; may make adjustments to mp.mda. */
+ err = pmd_mdc0_validate(mp, 1);
+ if (err)
+ msg = "MDC0 validation failed";
+ } else {
+ /*
+ * other mdc: set luniq to guaranteed max value
+ * previously used and ensure next objid allocation
+ * will be checkpointed; supports realloc of
+ * uncommitted objects after a crash
+ */
+ cinfo->mmi_luniq = objid_uniq(cinfo->mmi_lckpt) + OBJID_UNIQ_DELTA - 1;
+ }
+
+errout:
+ if (err) {
+ char *msgbuf;
+
+ msgbuf = kmalloc(64, GFP_KERNEL);
+ if (msgbuf)
+ snprintf(msgbuf, 64, msg, argv[0], argv[1]);
+
+ mp_pr_err("mpool %s, %s: cslot %u, ckpt %lx, %lx/%lu",
+ err, mp->pds_name, msgbuf, cslot, (ulong)cinfo->mmi_lckpt,
+			  cdr ? (ulong)cdr->u.obj.omd_objid : 0,
+			  cdr ? (ulong)cdr->u.obj.omd_gen : 0);
+
+ kfree(msgbuf);
+ }
+
+ kfree(cdr);
+
+ return err;
+}
+
+/**
+ * pmd_objs_load_worker() - load objects from user MDCs
+ * @ws: work struct
+ *
+ * Worker thread for loading user MDC 1~N.
+ * Each worker instance will do the following (not counting errors):
+ * * Grab an MDC number atomically from olw->olw_progress
+ * * If the MDC number is invalid, exit
+ * * Load the objects from that MDC
+ *
+ * If an error occurs in this or any other worker, don't load any more MDCs.
+ */
+static void pmd_objs_load_worker(struct work_struct *ws)
+{
+ struct pmd_obj_load_work *olw;
+ int sidx, rc;
+
+ olw = container_of(ws, struct pmd_obj_load_work, olw_work);
+
+ while (atomic_read(olw->olw_err) == 0) {
+ sidx = atomic_fetch_add(1, olw->olw_progress);
+ if (sidx >= olw->olw_mp->pds_mda.mdi_slotvcnt)
+ break; /* No more MDCs to load */
+
+ rc = pmd_objs_load(olw->olw_mp, sidx);
+ if (rc)
+ atomic_set(olw->olw_err, rc);
+ }
+}
+
+/**
+ * pmd_objs_load_parallel() - load MDC 1~N in parallel
+ * @mp: mpool descriptor
+ *
+ * By loading user MDCs in parallel, we can reduce the mpool activate
+ * time, since the jobs of loading MDC 1~N are independent.
+ * On the other hand, we don't want to start all the jobs at once.
+ * If any one fails, we don't have to start others.
+ */
+static int pmd_objs_load_parallel(struct mpool_descriptor *mp)
+{
+ struct pmd_obj_load_work *olwv;
+ atomic_t err = ATOMIC_INIT(0);
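+	/* Slot 0 (MDC0) is loaded separately; worker slots start at 1. */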
+ atomic_t progress = ATOMIC_INIT(1);
+ uint njobs, inc, cpu, i;
+
+ if (mp->pds_mda.mdi_slotvcnt < 2)
+ return 0; /* No user MDCs allocated */
+
+ njobs = mp->pds_params.mp_objloadjobs;
+ njobs = clamp_t(uint, njobs, 1, mp->pds_mda.mdi_slotvcnt - 1);
+
+ if (mp->pds_mda.mdi_slotvcnt / njobs >= 4 && num_online_cpus() > njobs)
+ njobs *= 2;
+
+ olwv = kcalloc(njobs, sizeof(*olwv), GFP_KERNEL);
+ if (!olwv)
+ return -ENOMEM;
+
+ inc = (num_online_cpus() / njobs) & ~1u;
+ cpu = raw_smp_processor_id();
+
+ /*
+ * Each of njobs workers will atomically grab MDC numbers from &progress
+ * and load them, until all valid user MDCs have been loaded.
+ */
+ for (i = 0; i < njobs; ++i) {
+ INIT_WORK(&olwv[i].olw_work, pmd_objs_load_worker);
+ olwv[i].olw_progress = &progress;
+ olwv[i].olw_err = &err;
+ olwv[i].olw_mp = mp;
+
+ /*
+ * Try to distribute work across all NUMA nodes.
+ * queue_work_node() would be preferable, but
+ * it's not available on older kernels.
+ */
+ cpu = (cpu + inc) % nr_cpumask_bits;
+ cpu = cpumask_next_wrap(cpu, cpu_online_mask, nr_cpumask_bits, false);
+ queue_work_on(cpu, mp->pds_workq, &olwv[i].olw_work);
+ }
+
+ /* Wait for all worker threads to complete */
+ flush_workqueue(mp->pds_workq);
+
+ kfree(olwv);
+
+ return atomic_read(&err);
+}
+
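+/* Pack cdr into the slot's record buffer and append it to the MDC's mlog. */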
+static int pmd_mdc_append(struct mpool_descriptor *mp, u8 cslot,
+ struct omf_mdcrec_data *cdr, int sync)
+{
+ struct pmd_mdc_info *cinfo = &mp->pds_mda.mdi_slotv[cslot];
+ s64 plen;
+
+ plen = omf_mdcrec_pack_htole(mp, cdr, cinfo->mmi_recbuf);
+ if (plen < 0) {
+ mp_pr_warn("mpool %s, MDC%u append failed", mp->pds_name, cslot);
+ return plen;
+ }
+
+ return mp_mdc_append(cinfo->mmi_mdc, cinfo->mmi_recbuf, plen, sync);
+}
+
+/**
+ * pmd_log_all_mdc_cobjs() - write in the new active mlog the object records.
+ * @mp: mpool descriptor
+ * @cslot: MDC slot number
+ * @cdr: mdc record buffer
+ * @compacted: output; count of object records appended
+ * @total: output; count of committed objects visited
+ */
+static int pmd_log_all_mdc_cobjs(struct mpool_descriptor *mp, u8 cslot,
+ struct omf_mdcrec_data *cdr, u32 *compacted, u32 *total)
+{
+ struct pmd_mdc_info *cinfo;
+ struct pmd_layout *layout;
+ struct rb_node *node;
+ int rc;
+
+ cinfo = &mp->pds_mda.mdi_slotv[cslot];
+ rc = 0;
+
+ pmd_co_foreach(cinfo, node) {
+ layout = rb_entry(node, typeof(*layout), eld_nodemdc);
+
+ if (!objid_mdc0log(layout->eld_objid)) {
+ cdr->omd_rtype = OMF_MDR_OCREATE;
+ cdr->u.obj.omd_layout = layout;
+
+ rc = pmd_mdc_append(mp, cslot, cdr, 0);
+ if (rc) {
+ mp_pr_err("mpool %s, MDC%u log committed obj failed, objid 0x%lx",
+ rc, mp->pds_name, cslot, (ulong)layout->eld_objid);
+ break;
+ }
+
+ ++(*compacted);
+ }
+ ++(*total);
+ }
+
+ for (; node; node = rb_next(node))
+ ++(*total);
+
+ return rc;
+}
+
+/**
+ * pmd_log_mdc0_cobjs() - write in the new active mlog (of MDC0) the MDC0
+ * records that are particular to MDC0.
+ * @mp: mpool descriptor
+ */
+static int pmd_log_mdc0_cobjs(struct mpool_descriptor *mp)
+{
+ struct mpool_dev_info *pd;
+ int rc = 0, i;
+
+ /*
+ * Log a drive record (OMF_MDR_MCCONFIG) for every drive in pds_pdv[]
+ * that is not defunct.
+ */
+ for (i = 0; i < mp->pds_pdvcnt; i++) {
+ pd = &(mp->pds_pdv[i]);
+ rc = pmd_prop_mcconfig(mp, pd, true);
+ if (rc)
+ return rc;
+ }
+
+ /*
+ * Log a media class spare record (OMF_MDR_MCSPARE) for every media
+ * class.
+	 * The mc count can't change now because the MDC0 compact lock is
+	 * held, which blocks the addition of PDs to the mpool.
+ */
+ for (i = 0; i < MP_MED_NUMBER; i++) {
+ struct media_class *mc;
+
+ mc = &mp->pds_mc[i];
+ if (mc->mc_pdmc >= 0) {
+ rc = pmd_prop_mcspare(mp, mc->mc_parms.mcp_classp,
+ mc->mc_sparms.mcsp_spzone, true);
+ if (rc)
+ return rc;
+ }
+ }
+
+ return pmd_prop_mpconfig(mp, &mp->pds_cfg, true);
+}
+
+/**
+ * pmd_log_non_mdc0_cobjs() - write to the new active mlog (of MDCi, i > 0)
+ * the MDCi records that are particular to MDCi (not used by MDC0).
+ * @mp: mpool descriptor
+ * @cslot: MDC slot number
+ * @cdr: MDC record data buffer
+ */
+static int pmd_log_non_mdc0_cobjs(struct mpool_descriptor *mp, u8 cslot,
+ struct omf_mdcrec_data *cdr)
+{
+ struct pmd_mdc_info *cinfo;
+
+ cinfo = &mp->pds_mda.mdi_slotv[cslot];
+
+ /*
+ * For MDCi with i > 0, log the last objid checkpoint to support
+ * realloc of uncommitted objects after a crash and to guarantee
+ * objids are never reused.
+ */
+ cdr->omd_rtype = OMF_MDR_OIDCKPT;
+ cdr->u.obj.omd_objid = cinfo->mmi_lckpt;
+
+ return pmd_mdc_append(mp, cslot, cdr, 0);
+}
+
+/**
+ * pmd_pre_compact_reset() - called on MDCi i>0
+ * @cinfo: MDC info
+ * @compacted: number of object create records appended to the new active mlog
+ *
+ * Locking:
+ * MDCi compact lock is held by the caller.
+ */
+static void pmd_pre_compact_reset(struct pmd_mdc_info *cinfo, u32 compacted)
+{
+ struct pre_compact_ctrs *pco_cnt;
+
+ pco_cnt = &cinfo->mmi_pco_cnt;
+ ASSERT(pco_cnt->pcc_cobj.counter == compacted);
+
+ atomic_set(&pco_cnt->pcc_cr, compacted);
+ atomic_set(&pco_cnt->pcc_cobj, compacted);
+ atomic_set(&pco_cnt->pcc_up, 0);
+ atomic_set(&pco_cnt->pcc_del, 0);
+ atomic_set(&pco_cnt->pcc_er, 0);
+}
+
+/**
+ * pmd_mdc_compact() - compact an mpool MDCi with i >= 0.
+ * @mp: mpool descriptor
+ * @cslot: the "i" of MDCi
+ *
+ * Locking:
+ * 1) caller must hold MDCi compact lock
+ * 2) MDC compaction freezes the state of all the MDC's objects [and for
+ * MDC0 also freezes all mpool properties] by simply holding the MDC
+ * mmi_compactlock mutex. Hence, MDC compaction does not need to
+ * read-lock individual object layouts or mpool property data
+ * structures to read them. That is why this function and its callees
+ * take no other locks.
+ *
+ * Note: this function and its callees must call pmd_mdc_append() with no
+ * sync instead of pmd_mdc_addrec() to avoid triggering nested compaction
+ * of the same MDCi.
+ * The sync/flush is done by the append of cend; there is no need to sync
+ * before that.
+ */
+static int pmd_mdc_compact(struct mpool_descriptor *mp, u8 cslot)
+{
+ struct pmd_mdc_info *cinfo = &mp->pds_mda.mdi_slotv[cslot];
+ u64 logid1 = logid_make(2 * cslot, 0);
+ u64 logid2 = logid_make(2 * cslot + 1, 0);
+ struct omf_mdcrec_data *cdr;
+ int retry = 0, rc = 0;
+
+ cdr = kzalloc(sizeof(*cdr), GFP_KERNEL);
+ if (!cdr) {
+ rc = -ENOMEM;
+ mp_pr_crit("mpool %s, alloc failure during compact", rc, mp->pds_name);
+ return rc;
+ }
+
+ for (retry = 0; retry < MPOOL_MDC_COMPACT_RETRY_DEFAULT; retry++) {
+ u32 compacted = 0;
+ u32 total = 0;
+
+ if (rc) {
+ rc = mp_mdc_open(mp, logid1, logid2, MDC_OF_SKIP_SER, &cinfo->mmi_mdc);
+ if (rc)
+ continue;
+ }
+
+ mp_pr_debug("mpool %s, MDC%u start: mlog1 gen %lu mlog2 gen %lu",
+ rc, mp->pds_name, cslot,
+ (ulong)((struct pmd_layout *)cinfo->mmi_mdc->mdc_logh1)->eld_gen,
+ (ulong)((struct pmd_layout *)cinfo->mmi_mdc->mdc_logh2)->eld_gen);
+
+ rc = mp_mdc_cstart(cinfo->mmi_mdc);
+ if (rc)
+ continue;
+
+ if (omfu_mdcver_cmp2(omfu_mdcver_cur(), ">=", 1, 0, 0, 1)) {
+ rc = pmd_mdc_addrec_version(mp, cslot);
+ if (rc) {
+ mp_mdc_close(cinfo->mmi_mdc);
+ continue;
+ }
+ }
+
+ if (cslot)
+ rc = pmd_log_non_mdc0_cobjs(mp, cslot, cdr);
+ else
+ rc = pmd_log_mdc0_cobjs(mp);
+ if (rc)
+ continue;
+
+ rc = pmd_log_all_mdc_cobjs(mp, cslot, cdr, &compacted, &total);
+
+ mp_pr_debug("mpool %s, MDC%u compacted %u of %u objects: retry=%d",
+ rc, mp->pds_name, cslot, compacted, total, retry);
+
+ /*
+ * Append the compaction end record in the new active
+ * mlog, and flush/sync all the previous records
+ * appended in the new active log by the compaction
+ * above.
+ */
+ if (!rc)
+ rc = mp_mdc_cend(cinfo->mmi_mdc);
+
+ if (!rc) {
+ if (cslot) {
+ /*
+ * MDCi i>0 compacted successfully
+ * MDCi compact lock is held.
+ */
+ pmd_pre_compact_reset(cinfo, compacted);
+ }
+
+ mp_pr_debug("mpool %s, MDC%u end: mlog1 gen %lu mlog2 gen %lu",
+ rc, mp->pds_name, cslot,
+ (ulong)((struct pmd_layout *)
+ cinfo->mmi_mdc->mdc_logh1)->eld_gen,
+ (ulong)((struct pmd_layout *)
+ cinfo->mmi_mdc->mdc_logh2)->eld_gen);
+ break;
+ }
+ }
+
+ if (rc)
+ mp_pr_crit("mpool %s, MDC%u compaction failed", rc, mp->pds_name, cslot);
+
+ kfree(cdr);
+
+ return rc;
+}
+
+static int pmd_mdc_addrec(struct mpool_descriptor *mp, u8 cslot, struct omf_mdcrec_data *cdr)
+{
+ int rc;
+
+ rc = pmd_mdc_append(mp, cslot, cdr, 1);
+
+ if (rc == -EFBIG) {
+ rc = pmd_mdc_compact(mp, cslot);
+ if (!rc)
+ rc = pmd_mdc_append(mp, cslot, cdr, 1);
+ }
+
+ if (rc)
+ mp_pr_rl("mpool %s, MDC%u append failed%s", rc, mp->pds_name, cslot,
+ (rc == -EFBIG) ? " post compaction" : "");
+
+ return rc;
+}
+
+int pmd_mdc_addrec_version(struct mpool_descriptor *mp, u8 cslot)
+{
+ struct omf_mdcrec_data cdr;
+ struct omf_mdcver *ver;
+
+ cdr.omd_rtype = OMF_MDR_VERSION;
+
+ ver = omfu_mdcver_cur();
+ cdr.u.omd_version = *ver;
+
+ return pmd_mdc_addrec(mp, cslot, &cdr);
+}
+
+int pmd_prop_mcconfig(struct mpool_descriptor *mp, struct mpool_dev_info *pd, bool compacting)
+{
+ struct omf_mdcrec_data cdr;
+ struct mc_parms mc_parms;
+
+ cdr.omd_rtype = OMF_MDR_MCCONFIG;
+ mpool_uuid_copy(&cdr.u.dev.omd_parm.odp_devid, &pd->pdi_devid);
+ mc_pd_prop2mc_parms(&pd->pdi_parm.dpr_prop, &mc_parms);
+ mc_parms2omf_devparm(&mc_parms, &cdr.u.dev.omd_parm);
+ cdr.u.dev.omd_parm.odp_zonetot = pd->pdi_parm.dpr_zonetot;
+ cdr.u.dev.omd_parm.odp_devsz = pd->pdi_parm.dpr_devsz;
+
+ /* If compacting, no sync is needed; also avoid triggering another compaction. */
+ if (compacting)
+ return pmd_mdc_append(mp, 0, &cdr, 0);
+
+ return pmd_mdc_addrec(mp, 0, &cdr);
+}
+
+int pmd_prop_mcspare(struct mpool_descriptor *mp, enum mp_media_classp mclassp,
+ u8 spzone, bool compacting)
+{
+ struct omf_mdcrec_data cdr;
+ int rc;
+
+ if (!mclass_isvalid(mclassp) || spzone > 100) {
+ rc = -EINVAL;
+ mp_pr_err("persisting %s spare zone info, invalid arguments %d %u",
+ rc, mp->pds_name, mclassp, spzone);
+ return rc;
+ }
+
+ cdr.omd_rtype = OMF_MDR_MCSPARE;
+ cdr.u.mcs.omd_mclassp = mclassp;
+ cdr.u.mcs.omd_spzone = spzone;
+
+ /* If compacting, no sync is needed; also avoid triggering another compaction. */
+ if (compacting)
+ return pmd_mdc_append(mp, 0, &cdr, 0);
+
+ return pmd_mdc_addrec(mp, 0, &cdr);
+}
+
+int pmd_log_delete(struct mpool_descriptor *mp, u64 objid)
+{
+ struct omf_mdcrec_data cdr;
+
+ cdr.omd_rtype = OMF_MDR_ODELETE;
+ cdr.u.obj.omd_objid = objid;
+
+ return pmd_mdc_addrec(mp, objid_slot(objid), &cdr);
+}
+
+int pmd_log_create(struct mpool_descriptor *mp, struct pmd_layout *layout)
+{
+ struct omf_mdcrec_data cdr;
+
+ cdr.omd_rtype = OMF_MDR_OCREATE;
+ cdr.u.obj.omd_layout = layout;
+
+ return pmd_mdc_addrec(mp, objid_slot(layout->eld_objid), &cdr);
+}
+
+int pmd_log_erase(struct mpool_descriptor *mp, u64 objid, u64 gen)
+{
+ struct omf_mdcrec_data cdr;
+
+ cdr.omd_rtype = OMF_MDR_OERASE;
+ cdr.u.obj.omd_objid = objid;
+ cdr.u.obj.omd_gen = gen;
+
+ return pmd_mdc_addrec(mp, objid_slot(objid), &cdr);
+}
+
+int pmd_log_idckpt(struct mpool_descriptor *mp, u64 objid)
+{
+ struct omf_mdcrec_data cdr;
+
+ cdr.omd_rtype = OMF_MDR_OIDCKPT;
+ cdr.u.obj.omd_objid = objid;
+
+ return pmd_mdc_addrec(mp, objid_slot(objid), &cdr);
+}
+
+int pmd_prop_mpconfig(struct mpool_descriptor *mp, const struct mpool_config *cfg, bool compacting)
+{
+ struct omf_mdcrec_data cdr = { };
+
+ cdr.omd_rtype = OMF_MDR_MPCONFIG;
+ cdr.u.omd_cfg = *cfg;
+
+ if (compacting)
+ return pmd_mdc_append(mp, 0, &cdr, 0);
+
+ return pmd_mdc_addrec(mp, 0, &cdr);
+}
+
+/**
+ * pmd_need_compact() - determine whether the MDCi corresponding to cslot needs compaction.
+ * @mp: mpool descriptor
+ * @cslot: MDC slot number
+ * @msgbuf: buffer in which to write a summary message (optional)
+ * @msgsz: size of @msgbuf
+ *
+ * The MDCi needs compaction if the active mlog is filled above some threshold
+ * and if there is enough garbage (that can be eliminated by the compaction).
+ *
+ * Locking: no lock needs to be held when calling this function.
+ * As a result of not holding a lock, the answer may be off if a compaction
+ * of MDCi (with i = cslot) is taking place at the same time.
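+ *
+ * Example (illustrative numbers): with mp_pcopctfull = 70 and
+ * mp_pcopctgarbage = 40, an active mlog that is 80% full (pct = 80)
+ * and holds rec = 1000 records of which cobj = 400 describe committed
+ * objects has garbage = (1000 - 400) * 100 / 1000 = 60%, so this
+ * function returns true.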
+ */
+static bool pmd_need_compact(struct mpool_descriptor *mp, u8 cslot, char *msgbuf, size_t msgsz)
+{
+ struct pre_compact_ctrs *pco_cnt;
+ struct pmd_mdc_info *cinfo;
+ u64 rec, cobj, len, cap;
+ u32 garbage, pct;
+
+ ASSERT(cslot > 0);
+
+ cinfo = &mp->pds_mda.mdi_slotv[cslot];
+ pco_cnt = &(cinfo->mmi_pco_cnt);
+
+ cap = atomic64_read(&pco_cnt->pcc_cap);
+ if (cap == 0)
+ return false; /* MDC closed for now. */
+
+ len = atomic64_read(&pco_cnt->pcc_len);
+ rec = atomic_read(&pco_cnt->pcc_cr) + atomic_read(&pco_cnt->pcc_up) +
+ atomic_read(&pco_cnt->pcc_del) + atomic_read(&pco_cnt->pcc_er);
+ cobj = atomic_read(&pco_cnt->pcc_cobj);
+
+ pct = (len * 100) / cap;
+ if (pct < mp->pds_params.mp_pcopctfull)
+ return false; /* Active mlog not filled enough */
+
+ if (rec > cobj) {
+ garbage = (rec - cobj) * 100;
+ garbage /= rec;
+ } else {
+
+ /*
+ * We may arrive here rarely if the caller doesn't
+ * hold the compact lock. In that case, the update of
+ * the counters may be seen out of order or a compaction
+ * may take place at the same time.
+ */
+ garbage = 0;
+ }
+
+ if (garbage < mp->pds_params.mp_pcopctgarbage)
+ return false;
+
+ if (msgbuf)
+ snprintf(msgbuf, msgsz,
+ "bytes used %lu, total %lu, pct %u, records %lu, objects %lu, garbage %u",
+ (ulong)len, (ulong)cap, pct, (ulong)rec, (ulong)cobj, garbage);
+
+ return true;
+}
+
+/**
+ * pmd_mdc_needed() - determine whether new MDCn's should be created
+ * @mp: mpool descriptor
+ *
+ * New MDCs are created if the total used capacity across all MDCs is above
+ * a threshold value and the garbage that compaction could reclaim is below
+ * a garbage threshold.
+ *
+ * Locking: no lock needs to be held when calling this function.
+ *
+ * NOTES:
+ * - Skip non-active MDCs
+ * - Accumulate total capacity, total garbage and total in-use capacity
+ * across all active MDCs.
+ * - Return true if the total used capacity across all MDCs is above a
+ * threshold and the garbage is below a threshold that would yield
+ * significant free space upon compaction.
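+ *
+ * Example (illustrative thresholds): with mp_crtmdcpctfull = 80 and
+ * mp_crtmdcpctgrbg = 20, if the active MDCs are 85% full in aggregate
+ * (pct = 85) but only 10% of their records are garbage (pctg = 10),
+ * compaction would reclaim little space, so a new MDC is warranted.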
+ */
+static bool pmd_mdc_needed(struct mpool_descriptor *mp)
+{
+ struct pre_compact_ctrs *pco_cnt;
+ struct pmd_mdc_info *cinfo;
+ u64 cap, tcap, used, garbage, record, rec, cobj;
+ u32 pct, pctg, mdccnt;
+ u16 cslot;
+
+ ASSERT(mp->pds_mda.mdi_slotvcnt <= MDC_SLOTS);
+
+ cap = used = garbage = record = pctg = 0;
+
+ if (mp->pds_mda.mdi_slotvcnt == MDC_SLOTS)
+ return false;
+
+ for (cslot = 1, mdccnt = 0; cslot < mp->pds_mda.mdi_slotvcnt; cslot++) {
+
+ cinfo = &mp->pds_mda.mdi_slotv[cslot];
+ pco_cnt = &(cinfo->mmi_pco_cnt);
+
+ tcap = atomic64_read(&pco_cnt->pcc_cap);
+ if (tcap == 0) {
+ /*
+ * MDC closed for now and will not be considered
+ * in making a decision to create new MDC.
+ */
+ mp_pr_warn("MDC %u not open", cslot);
+ continue;
+ }
+ cap += tcap;
+
+ mdccnt++;
+
+ used += atomic64_read(&pco_cnt->pcc_len);
+ rec = atomic_read(&pco_cnt->pcc_cr) + atomic_read(&pco_cnt->pcc_up) +
+ atomic_read(&pco_cnt->pcc_del) + atomic_read(&pco_cnt->pcc_er);
+
+ cobj = atomic_read(&pco_cnt->pcc_cobj);
+
+ if (rec > cobj)
+ garbage += (rec - cobj);
+
+ record += rec;
+ }
+
+ if (mdccnt == 0) {
+ mp_pr_warn("No mpool MDCs available");
+ return false;
+ }
+
+ /* Percentage capacity used across all MDCs */
+ pct = (used * 100) / cap;
+
+ /* Percentage garbage available across all MDCs */
+ if (garbage)
+ pctg = (garbage * 100) / record;
+
+ if (pct > mp->pds_params.mp_crtmdcpctfull && pctg < mp->pds_params.mp_crtmdcpctgrbg) {
+ mp_pr_debug("MDCn %u cap %u used %u rec %u grbg %u pct used %u grbg %u Thres %u-%u",
+ 0, mdccnt, (u32)cap, (u32)used, (u32)record, (u32)garbage, pct, pctg,
+ (u32)mp->pds_params.mp_crtmdcpctfull,
+ (u32)mp->pds_params.mp_crtmdcpctgrbg);
+ return true;
+ }
+
+ return false;
+}
+
+/**
+ * pmd_precompact() - precompact an mpool MDC
+ * @work: work struct
+ *
+ * The goal of this work is to minimize the commit time of application
+ * objects. It pre-compacts MDC1/255 so that MDC1/255 compaction does
+ * not occur in the context of an application object commit.
+ */
+static void pmd_precompact(struct work_struct *work)
+{
+ struct pre_compact_ctrl *pco;
+ struct mpool_descriptor *mp;
+ struct pmd_mdc_info *cinfo;
+ char msgbuf[128];
+ uint nmtoc, delay;
+ bool compact;
+ u8 cslot;
+
+ pco = container_of(work, typeof(*pco), pco_dwork.work);
+ mp = pco->pco_mp;
+
+ nmtoc = atomic_fetch_add(1, &pco->pco_nmtoc);
+
+ /* Only compact MDC1/255, not MDC0. */
+ cslot = (nmtoc % (mp->pds_mda.mdi_slotvcnt - 1)) + 1;
+
+ /*
+ * Check if the next mpool MDC to compact needs compaction.
+ *
+ * Note that this check is done without taking any lock.
+ * This is safe because the mpool MDCs don't go away as long as
+ * the mpool is activated. The mpool can't be deactivated before
+ * this work item exits.
+ */
+ compact = pmd_need_compact(mp, cslot, NULL, 0);
+ if (compact) {
+ cinfo = &mp->pds_mda.mdi_slotv[cslot];
+
+ /*
+ * Check a second time while we hold the compact lock
+ * to avoid doing a useless compaction.
+ */
+ pmd_mdc_lock(&cinfo->mmi_compactlock, cslot);
+ compact = pmd_need_compact(mp, cslot, msgbuf, sizeof(msgbuf));
+ if (compact)
+ pmd_mdc_compact(mp, cslot);
+ pmd_mdc_unlock(&cinfo->mmi_compactlock);
+
+ if (compact)
+ mp_pr_info("mpool %s, MDC%u %s", mp->pds_name, cslot, msgbuf);
+ }
+
+ /* If running low on MDC space create new MDCs */
+ if (pmd_mdc_needed(mp))
+ pmd_mdc_alloc_set(mp);
+
+ pmd_update_credit(mp);
+
+ delay = clamp_t(uint, mp->pds_params.mp_pcoperiod, 1, 3600);
+
+ queue_delayed_work(mp->pds_workq, &pco->pco_dwork, msecs_to_jiffies(delay * 1000));
+}
+
+void pmd_precompact_start(struct mpool_descriptor *mp)
+{
+ struct pre_compact_ctrl *pco;
+
+ pco = &mp->pds_pco;
+ pco->pco_mp = mp;
+ atomic_set(&pco->pco_nmtoc, 0);
+
+ INIT_DELAYED_WORK(&pco->pco_dwork, pmd_precompact);
+ queue_delayed_work(mp->pds_workq, &pco->pco_dwork, 1);
+}
+
+void pmd_precompact_stop(struct mpool_descriptor *mp)
+{
+ cancel_delayed_work_sync(&mp->pds_pco.pco_dwork);
+}
+
+static int pmd_write_meta_to_latest_version(struct mpool_descriptor *mp, bool permitted)
+{
+ struct pmd_mdc_info *cinfo_converted = NULL, *cinfo;
+ char buf1[MAX_MDCVERSTR] __maybe_unused;
+ char buf2[MAX_MDCVERSTR] __maybe_unused;
+ u32 cslot;
+ int rc;
+
+ /*
+ * Compact MDC0 first (before MDC1-255 compaction appends in MDC0) to
+ * avoid having a potential mix of new and old records in MDC0.
+ */
+ for (cslot = 0; cslot < mp->pds_mda.mdi_slotvcnt; cslot++) {
+ cinfo = &mp->pds_mda.mdi_slotv[cslot];
+
+ /*
+ * At this point the version on media should be less than or
+ * equal to the latest version supported by this binary.
+ * If that is not the case, the activation fails earlier.
+ */
+ if (omfu_mdcver_cmp(&cinfo->mmi_mdcver, "==", omfu_mdcver_cur()))
+ continue;
+
+ omfu_mdcver_to_str(&cinfo->mmi_mdcver, buf1, sizeof(buf1));
+ omfu_mdcver_to_str(omfu_mdcver_cur(), buf2, sizeof(buf2));
+
+ if (!permitted) {
+ rc = -EPERM;
+ mp_pr_err("mpool %s, MDC%u upgrade needed from version %s to %s",
+ rc, mp->pds_name, cslot, buf1, buf2);
+ return rc;
+ }
+
+ mp_pr_info("mpool %s, MDC%u upgraded from version %s to %s",
+ mp->pds_name, cslot, buf1, buf2);
+
+ cinfo_converted = cinfo;
+
+ pmd_mdc_lock(&cinfo->mmi_compactlock, cslot);
+ rc = pmd_mdc_compact(mp, cslot);
+ pmd_mdc_unlock(&cinfo->mmi_compactlock);
+
+ if (rc) {
+ mp_pr_err("mpool %s, failed to compact MDC %u post upgrade from %s to %s",
+ rc, mp->pds_name, cslot, buf1, buf2);
+ return rc;
+ }
+ }
+
+ if (cinfo_converted != NULL)
+ mp_pr_info("mpool %s, converted MDC from version %s to %s", mp->pds_name,
+ omfu_mdcver_to_str(&cinfo_converted->mmi_mdcver, buf1, sizeof(buf1)),
+ omfu_mdcver_to_str(omfu_mdcver_cur(), buf2, sizeof(buf2)));
+
+ return 0;
+}
+
+void pmd_mdc_cap(struct mpool_descriptor *mp, u64 *mdcmax, u64 *mdccap, u64 *mdc0cap)
+{
+ struct pmd_mdc_info *cinfo = NULL;
+ struct pmd_layout *layout = NULL;
+ struct rb_node *node = NULL;
+ u64 mlogsz;
+ u32 zonepg = 0;
+ u16 mdcn = 0;
+
+ if (!mdcmax || !mdccap || !mdc0cap)
+ return;
+
+ /* Serialize to prevent race with pmd_mdc_alloc() */
+ mutex_lock(&pmd_s_lock);
+
+ /*
+ * Exclude MDC0 from the stats because it is not used for mpool
+ * user object metadata.
+ */
+ cinfo = &mp->pds_mda.mdi_slotv[0];
+
+ pmd_mdc_lock(&cinfo->mmi_uqlock, 0);
+ *mdcmax = cinfo->mmi_luniq;
+ pmd_mdc_unlock(&cinfo->mmi_uqlock);
+
+ /* Taking compactlock to freeze all object layout metadata in mdc0 */
+ pmd_mdc_lock(&cinfo->mmi_compactlock, 0);
+ pmd_co_rlock(cinfo, 0);
+
+ pmd_co_foreach(cinfo, node) {
+ layout = rb_entry(node, typeof(*layout), eld_nodemdc);
+
+ mdcn = objid_uniq(layout->eld_objid) >> 1;
+
+ if (mdcn > *mdcmax)
+ /* Ignore detritus from failed pmd_mdc_alloc() */
+ continue;
+
+ zonepg = mp->pds_pdv[layout->eld_ld.ol_pdh].pdi_parm.dpr_zonepg;
+ mlogsz = (layout->eld_ld.ol_zcnt * zonepg) << PAGE_SHIFT;
+
+ if (!mdcn)
+ *mdc0cap = *mdc0cap + mlogsz;
+ else
+ *mdccap = *mdccap + mlogsz;
+ }
+
+ pmd_co_runlock(cinfo);
+ pmd_mdc_unlock(&cinfo->mmi_compactlock);
+ mutex_unlock(&pmd_s_lock);
+
+ /* Only count capacity of one mlog in each mdc mlog pair */
+ *mdccap = *mdccap >> 1;
+ *mdc0cap = *mdc0cap >> 1;
+}
+
+int pmd_mpool_activate(struct mpool_descriptor *mp, struct pmd_layout *mdc01,
+ struct pmd_layout *mdc02, int create)
+{
+ int rc;
+
+ mp_pr_debug("mdc01: %lu mdc02: %lu", 0, (ulong)mdc01->eld_objid, (ulong)mdc02->eld_objid);
+
+ /* Activation is intense; serialize it when there are multiple mpools */
+ mutex_lock(&pmd_s_lock);
+
+ /* Init metadata array for mpool */
+ pmd_mda_init(mp);
+
+ /* Initialize mdc0 for mpool */
+ rc = pmd_mdc0_init(mp, mdc01, mdc02);
+ if (rc) {
+ /*
+ * pmd_mda_free() will dealloc mdc01/2 on subsequent
+ * activation failures
+ */
+ pmd_obj_put(mdc01);
+ pmd_obj_put(mdc02);
+ goto exit;
+ }
+
+ /* Load mpool properties from mdc0 including drive list and states */
+ if (!create) {
+ rc = pmd_props_load(mp);
+ if (rc)
+ goto exit;
+ }
+
+ /*
+ * initialize smaps for all drives in mpool (now that list
+ * is finalized)
+ */
+ rc = smap_mpool_init(mp);
+ if (rc)
+ goto exit;
+
+ /* Load mdc layouts from mdc0 and finalize mda initialization */
+ rc = pmd_objs_load(mp, 0);
+ if (rc)
+ goto exit;
+
+ /* Load user object layouts from all other mdc */
+ rc = pmd_objs_load_parallel(mp);
+ if (rc) {
+ mp_pr_err("mpool %s, failed to load user MDCs", rc, mp->pds_name);
+ goto exit;
+ }
+
+ /*
+ * If the format of the mpool metadata read from media during activate
+ * is not the latest, it is time to write the metadata on media with
+ * the latest format.
+ */
+ if (!create) {
+ rc = pmd_write_meta_to_latest_version(mp, true);
+ if (rc) {
+ mp_pr_err("mpool %s, failed to compact MDCs (metadata conversion)",
+ rc, mp->pds_name);
+ goto exit;
+ }
+ }
+exit:
+ if (rc) {
+ /* Activation failed; cleanup */
+ pmd_mda_free(mp);
+ smap_mpool_free(mp);
+ }
+
+ mutex_unlock(&pmd_s_lock);
+
+ return rc;
+}
+
+void pmd_mpool_deactivate(struct mpool_descriptor *mp)
+{
+ /* Deactivation is intense; serialize it when there are multiple mpools */
+ mutex_lock(&pmd_s_lock);
+
+ /* Close all open user (non-mdc) mlogs */
+ mlogutil_closeall(mp);
+
+ pmd_mda_free(mp);
+ smap_mpool_free(mp);
+
+ mutex_unlock(&pmd_s_lock);
+}
diff --git a/drivers/mpool/pmd_obj.c b/drivers/mpool/pmd_obj.c
index 8966fc0abd0e..18157fecccfb 100644
--- a/drivers/mpool/pmd_obj.c
+++ b/drivers/mpool/pmd_obj.c
@@ -507,9 +507,7 @@ int pmd_obj_commit(struct mpool_descriptor *mp, struct pmd_layout *layout)
pmd_mdc_lock(&cinfo->mmi_compactlock, cslot);
-#ifdef OBJ_PERSISTENCE_ENABLED
rc = pmd_log_create(mp, layout);
-#endif
if (!rc) {
pmd_uc_lock(cinfo, cslot);
found = pmd_uc_remove(cinfo, layout);
@@ -675,9 +673,7 @@ int pmd_obj_delete(struct mpool_descriptor *mp, struct pmd_layout *layout)
return (refcnt > 2) ? -EBUSY : -EINVAL;
}
-#ifdef OBJ_PERSISTENCE_ENABLED
rc = pmd_log_delete(mp, objid);
-#endif
if (!rc) {
pmd_co_wlock(cinfo, cslot);
found = pmd_co_remove(cinfo, layout);
@@ -763,9 +759,7 @@ int pmd_obj_erase(struct mpool_descriptor *mp, struct pmd_layout *layout, u64 ge
pmd_mdc_lock(&cinfo->mmi_compactlock, cslot);
-#ifdef OBJ_PERSISTENCE_ENABLED
rc = pmd_log_erase(mp, layout->eld_objid, gen);
-#endif
if (!rc) {
layout->eld_gen = gen;
if (cslot)
@@ -830,9 +824,7 @@ static int pmd_alloc_idgen(struct mpool_descriptor *mp, enum obj_type_omf otype,
* to prevent a race with mdc compaction.
*/
pmd_mdc_lock(&cinfo->mmi_compactlock, cslot);
-#ifdef OBJ_PERSISTENCE_ENABLED
rc = pmd_log_idckpt(mp, *objid);
-#endif
if (!rc)
cinfo->mmi_lckpt = *objid;
pmd_mdc_unlock(&cinfo->mmi_compactlock);
--
2.17.2
From: Nabeel M Mohamed <[email protected]>
This implements the mlog lifecycle management functions:
allocate, commit, abort, destroy, append, read etc.
Mlog objects are containers for record logging. Mlogs can be
appended with arbitrary sized records and once full, an mlog
must be erased before additional records can be appended.
Mlog records can be read sequentially from the beginning at
any time. Mlogs in a media class are always a multiple of
the mblock size for that media class.
The mlog APIs implement a pattern whereby an mlog is allocated
and then committed or aborted. An mlog is not persistent or
accessible until committed, and a system failure prior to
commit results in the same logical mpool state as if the mlog
had never been allocated. An mlog allocation returns an OID
that is used to commit the mlog, to append, flush, erase, or
read it as needed, and ultimately to delete it.
At mlog open, the read buffer is fully loaded and parsed to
identify the end-of-log and the next flush set ID, to detect
media corruption, to detect bad record formatting, and to
optionally enforce compaction semantics. At mlog close, the
dirty data is flushed and all memory resources are freed.
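The following sketch shows the intended lifecycle (illustrative
only: error handling is elided, MP_MED_CAPACITY names the capacity
media class, and the mpool descriptor mp, buffer buf, and length
buflen are assumed to be in scope):

    struct mlog_capacity capreq = { .lcp_captgt = 1 << 20 };
    struct mlog_props props;
    struct mlog_descriptor *mlh;
    u64 gen;

    /* Allocate and commit; an mlog persists only once committed. */
    mlog_alloc(mp, &capreq, MP_MED_CAPACITY, &props, &mlh);
    mlog_commit(mp, mlh);

    /* Open, synchronously append one record, and close. */
    mlog_open(mp, mlh, 0, &gen);
    mlog_append_data(mp, mlh, buf, buflen, 1);
    mlog_close(mp, mlh);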
Co-developed-by: Greg Becker <[email protected]>
Signed-off-by: Greg Becker <[email protected]>
Co-developed-by: Pierre Labat <[email protected]>
Signed-off-by: Pierre Labat <[email protected]>
Co-developed-by: John Groves <[email protected]>
Signed-off-by: John Groves <[email protected]>
Signed-off-by: Nabeel M Mohamed <[email protected]>
---
drivers/mpool/mlog.c | 1667 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 1667 insertions(+)
create mode 100644 drivers/mpool/mlog.c
diff --git a/drivers/mpool/mlog.c b/drivers/mpool/mlog.c
new file mode 100644
index 000000000000..6ccca00735c1
--- /dev/null
+++ b/drivers/mpool/mlog.c
@@ -0,0 +1,1667 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc. All rights reserved.
+ */
+
+#include <linux/mm.h>
+#include <linux/log2.h>
+#include <linux/blk_types.h>
+#include <asm/page.h>
+
+#include "assert.h"
+#include "mpool_printk.h"
+
+#include "omf_if.h"
+#include "mpcore.h"
+#include "mlog_utils.h"
+
+/**
+ * mlog_alloc_cmn() - Allocate mlog with specified parameters using new or specified objid.
+ *
+ * Returns: 0 if successful, -errno otherwise
+ */
+static int mlog_alloc_cmn(struct mpool_descriptor *mp, u64 objid,
+ struct mlog_capacity *capreq, enum mp_media_classp mclassp,
+ struct mlog_props *prop, struct mlog_descriptor **mlh)
+{
+ struct pmd_obj_capacity ocap;
+ struct pmd_layout *layout;
+ int rc;
+
+ layout = NULL;
+ *mlh = NULL;
+
+ ocap.moc_captgt = capreq->lcp_captgt;
+ ocap.moc_spare = capreq->lcp_spare;
+
+ if (!objid) {
+ rc = pmd_obj_alloc(mp, OMF_OBJ_MLOG, &ocap, mclassp, &layout);
+ if (rc || !layout) {
+ if (rc != -ENOENT)
+ mp_pr_err("mpool %s, allocating mlog failed", rc, mp->pds_name);
+ }
+ } else {
+ rc = pmd_obj_realloc(mp, objid, &ocap, mclassp, &layout);
+ if (rc || !layout) {
+ if (rc != -ENOENT)
+ mp_pr_err("mpool %s, re-allocating mlog 0x%lx failed",
+ rc, mp->pds_name, (ulong)objid);
+ }
+ }
+ if (rc)
+ return rc;
+
+ /*
+ * Mlogs are rarely created and are usually committed immediately, so
+ * erase in-line; the mlog is not yet committed, so pmd_obj_erase() is
+ * not needed for atomicity.
+ */
+ pmd_obj_wrlock(layout);
+ rc = pmd_layout_erase(mp, layout);
+ if (!rc)
+ mlog_getprops_cmn(mp, layout, prop);
+ pmd_obj_wrunlock(layout);
+
+ if (rc) {
+ pmd_obj_abort(mp, layout);
+ mp_pr_err("mpool %s, mlog 0x%lx alloc, erase failed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ return rc;
+ }
+
+ *mlh = layout2mlog(layout);
+
+ return 0;
+}
+
+/**
+ * mlog_alloc() - Allocate mlog with the capacity params specified in capreq.
+ *
+ * Allocate mlog with the capacity params specified in capreq on drives in a
+ * media class mclassp.
+ * If successful mlh is a handle for the mlog and prop contains its properties.
+ *
+ * Note: mlog is not persistent until committed; allocation can be aborted.
+ *
+ * Returns: 0 if successful, -errno otherwise
+ */
+int mlog_alloc(struct mpool_descriptor *mp, struct mlog_capacity *capreq,
+ enum mp_media_classp mclassp, struct mlog_props *prop,
+ struct mlog_descriptor **mlh)
+{
+ return mlog_alloc_cmn(mp, 0, capreq, mclassp, prop, mlh);
+}
+
+
+/**
+ * mlog_realloc() - Allocate mlog with specified objid to support crash recovery.
+ *
+ * Allocate mlog with specified objid to support crash recovery; otherwise
+ * it is equivalent to mlog_alloc().
+ *
+ * Returns: 0 if successful, -errno otherwise
+ * One of the possible errno values:
+ * -EEXIST - if objid exists
+ */
+int mlog_realloc(struct mpool_descriptor *mp, u64 objid,
+ struct mlog_capacity *capreq, enum mp_media_classp mclassp,
+ struct mlog_props *prop, struct mlog_descriptor **mlh)
+{
+ if (!mlog_objid(objid))
+ return -EINVAL;
+
+ return mlog_alloc_cmn(mp, objid, capreq, mclassp, prop, mlh);
+}
+
+/**
+ * mlog_find_get() - Get handle and properties for existing mlog with specified objid.
+ *
+ * Returns: 0 if successful, -errno otherwise
+ */
+int mlog_find_get(struct mpool_descriptor *mp, u64 objid, int which,
+ struct mlog_props *prop, struct mlog_descriptor **mlh)
+{
+ struct pmd_layout *layout;
+
+ *mlh = NULL;
+
+ if (!mlog_objid(objid))
+ return -EINVAL;
+
+ layout = pmd_obj_find_get(mp, objid, which);
+ if (!layout)
+ return -ENOENT;
+
+ if (prop) {
+ pmd_obj_rdlock(layout);
+ mlog_getprops_cmn(mp, layout, prop);
+ pmd_obj_rdunlock(layout);
+ }
+
+ *mlh = layout2mlog(layout);
+
+ return 0;
+}
+
+/**
+ * mlog_put() - Put a reference for mlog with specified objid.
+ */
+void mlog_put(struct mlog_descriptor *mlh)
+{
+ struct pmd_layout *layout;
+
+ layout = mlog2layout(mlh);
+ if (layout)
+ pmd_obj_put(layout);
+}
+
+/**
+ * mlog_lookup_rootids() - Return OIDs of mpctl root MDC.
+ * @id1: (output): OID of one of the mpctl root MDC mlogs.
+ * @id2: (output): OID of the other mpctl root MDC mlog.
+ */
+void mlog_lookup_rootids(u64 *id1, u64 *id2)
+{
+ if (id1)
+ *id1 = UROOT_OBJID_LOG1;
+
+ if (id2)
+ *id2 = UROOT_OBJID_LOG2;
+}
+
+/**
+ * mlog_commit() - Make allocated mlog persistent.
+ *
+ * If fails mlog still exists in an uncommitted state so can retry commit or abort.
+ *
+ * Returns: 0 if successful, -errno otherwise
+ */
+int mlog_commit(struct mpool_descriptor *mp, struct mlog_descriptor *mlh)
+{
+ struct pmd_layout *layout;
+
+ layout = mlog2layout(mlh);
+ if (!layout)
+ return -EINVAL;
+
+ return pmd_obj_commit(mp, layout);
+}
+
+/**
+ * mlog_abort() - Discard uncommitted mlog; if successful mlh is invalid after call.
+ *
+ * Returns: 0 if successful, -errno otherwise
+ */
+int mlog_abort(struct mpool_descriptor *mp, struct mlog_descriptor *mlh)
+{
+ struct pmd_layout *layout;
+
+ layout = mlog2layout(mlh);
+ if (!layout)
+ return -EINVAL;
+
+ return pmd_obj_abort(mp, layout);
+}
+
+/**
+ * mlog_delete() - Delete committed mlog.
+ *
+ * If successful mlh is invalid after call; if fails mlog is closed.
+ *
+ * Returns: 0 if successful, -errno otherwise
+ */
+int mlog_delete(struct mpool_descriptor *mp, struct mlog_descriptor *mlh)
+{
+ struct pmd_layout *layout;
+
+ layout = mlog2layout(mlh);
+ if (!layout)
+ return -EINVAL;
+
+ /* Remove from open list and discard buffered log data */
+ pmd_obj_wrlock(layout);
+ oml_layout_lock(mp);
+ oml_layout_remove(mp, layout->eld_objid);
+ oml_layout_unlock(mp);
+
+ mlog_stat_free(layout);
+ pmd_obj_wrunlock(layout);
+
+ return pmd_obj_delete(mp, layout);
+}
+
+/**
+ * mlog_logrecs_validate() - Validate records in lstat.rbuf relative to lstat state.
+ * @lstat: mlog stat
+ * @midrec: whether the previous log block ended mid data record (input/output)
+ * @rbidx: log page index in the read buffer
+ * @lbidx: sector index within log page @rbidx
+ *
+ * Validate the records in lstat.rbuf relative to the lstat state; updates
+ * lstat to reflect any valid markers found.
+ *
+ * Returns:
+ * 0 if successful; -errno otherwise
+ *
+ * In the output param, i.e., midrec, we store:
+ * 1 if log records are valid and ended mid data record
+ * 0 if log records are valid and did NOT end mid data record
+ */
+static int mlog_logrecs_validate(struct mlog_stat *lstat, int *midrec, u16 rbidx, u16 lbidx)
+{
+ struct omf_logrec_descriptor lrd;
+ u64 recnum = 0;
+ int recoff;
+ int rc = 0;
+ char *rbuf;
+ u16 sectsz = 0;
+
+ sectsz = MLOG_SECSZ(lstat);
+ rbuf = lstat->lst_rbuf[rbidx] + lbidx * sectsz;
+
+ recoff = omf_logblock_header_len_le(rbuf);
+ if (recoff < 0)
+ return -ENODATA;
+
+ while (sectsz - recoff >= OMF_LOGREC_DESC_PACKLEN) {
+ omf_logrec_desc_unpack_letoh(&lrd, &rbuf[recoff]);
+
+ if (lrd.olr_rtype == OMF_LOGREC_CSTART) {
+ if (!lstat->lst_csem || lstat->lst_rsoff || recnum) {
+ rc = -ENODATA;
+
+ /* No compaction or not first rec in first log block */
+ mp_pr_err("no compact marker nor first rec %u %ld %u %u %lu",
+ rc, lstat->lst_csem, lstat->lst_rsoff,
+ rbidx, lbidx, (ulong)recnum);
+ return rc;
+ }
+ lstat->lst_cstart = 1;
+ *midrec = 0;
+ } else if (lrd.olr_rtype == OMF_LOGREC_CEND) {
+ if (!lstat->lst_csem || !lstat->lst_cstart || lstat->lst_cend || *midrec) {
+ rc = -ENODATA;
+
+ /*
+ * No compaction or cend before cstart or more than one cend
+ * or cend mid-record.
+ */
+ mp_pr_err("inconsistent compaction recs %u %u %u %d", rc,
+ lstat->lst_csem, lstat->lst_cstart, lstat->lst_cend,
+ *midrec);
+ return rc;
+ }
+ lstat->lst_cend = 1;
+ } else if (lrd.olr_rtype == OMF_LOGREC_EOLB) {
+ if (*midrec || !recnum) {
+ /* EOLB mid-record or first record. */
+ rc = -ENODATA;
+ mp_pr_err("end of log block marker at wrong place %d %lu",
+ rc, *midrec, (ulong)recnum);
+ return rc;
+ }
+ /* No more records in log buffer */
+ break;
+ } else if (lrd.olr_rtype == OMF_LOGREC_DATAFULL) {
+ if (*midrec && recnum) {
+ rc = -ENODATA;
+
+ /*
+ * Can occur mid data rec only if it is the first rec in the
+ * log block, indicating a partial data rec at the end of the
+ * last log block, which is a valid failure mode; otherwise it
+ * is a logging error.
+ */
+ mp_pr_err("data full marker at wrong place %d %lu",
+ rc, *midrec, (ulong)recnum);
+ return rc;
+ }
+ *midrec = 0;
+ } else if (lrd.olr_rtype == OMF_LOGREC_DATAFIRST) {
+ if (*midrec && recnum) {
+ rc = -ENODATA;
+
+ /* See comment for DATAFULL */
+ mp_pr_err("data first marker at wrong place %d %lu",
+ rc, *midrec, (ulong)recnum);
+ return rc;
+ }
+ *midrec = 1;
+ } else if (lrd.olr_rtype == OMF_LOGREC_DATAMID) {
+ if (!*midrec) {
+ rc = -ENODATA;
+
+ /* Must occur mid data record. */
+ mp_pr_err("data mid marker at wrong place %d %lu",
+ rc, *midrec, (ulong)recnum);
+ return rc;
+ }
+ } else if (lrd.olr_rtype == OMF_LOGREC_DATALAST) {
+ if (!(*midrec)) {
+ rc = -ENODATA;
+
+ /* Must occur mid data record */
+ mp_pr_err("data last marker at wrong place %d %lu",
+ rc, *midrec, (ulong)recnum);
+ return rc;
+ }
+ *midrec = 0;
+ } else {
+ rc = -ENODATA;
+ mp_pr_err("unknown record type %d %lu", rc, lrd.olr_rtype, (ulong)recnum);
+ return rc;
+ }
+
+ recnum = recnum + 1;
+ recoff = recoff + OMF_LOGREC_DESC_PACKLEN + lrd.olr_rlen;
+ }
+
+ return rc;
+}
+
+static inline void max_cfsetid(struct omf_logblock_header *lbh,
+ struct pmd_layout *layout, u32 *fsetid)
+{
+ if (!mpool_uuid_compare(&lbh->olh_magic, &layout->eld_uuid) &&
+ (lbh->olh_gen == layout->eld_gen))
+ *fsetid = max_t(u32, *fsetid, lbh->olh_cfsetid);
+}
+
+/**
+ * mlog_logpage_validate() - Validate log records at log page index 'rbidx' in the read buffer.
+ * @mlh: mlog_descriptor
+ * @lstat: mlog_stat
+ * @rbidx: log page index in the read buffer to validate
+ * @nseclpg: number of sectors in the log page @rbidx
+ * @midrec: refer to mlog_logrecs_validate
+ * @leol_found: true, if LEOL found. false, if LEOL not found/log full (output)
+ * @fsetidmax: maximum flush set ID found in the log (output)
+ * @pfsetid: previous flush set ID, if LEOL found (output)
+ */
+static int mlog_logpage_validate(struct mlog_descriptor *mlh, struct mlog_stat *lstat,
+ u16 rbidx, u16 nseclpg, int *midrec,
+ bool *leol_found, u32 *fsetidmax, u32 *pfsetid)
+{
+ struct pmd_layout *layout = mlog2layout(mlh);
+ char *rbuf;
+ u16 lbidx;
+ u16 sectsz;
+
+ sectsz = MLOG_SECSZ(lstat);
+ rbuf = lstat->lst_rbuf[rbidx];
+
+ /* Loop through nseclpg sectors in the log page @rbidx. */
+ for (lbidx = 0; lbidx < nseclpg; lbidx++) {
+ struct omf_logblock_header lbh;
+ int rc;
+
+ memset(&lbh, 0, sizeof(lbh));
+
+ (void)omf_logblock_header_unpack_letoh(&lbh, rbuf);
+
+ /*
+ * If LEOL is already found, then this loop determines
+ * fsetidmax, i.e., scans through the sectors to determine
+ * any stale flush set id from a prior failed CFS flush.
+ */
+ if (*leol_found) {
+ max_cfsetid(&lbh, layout, fsetidmax);
+ rbuf += sectsz;
+ continue;
+ }
+
+ /*
+ * Check for LEOL based on prev and cur flush set ID.
+ * If LEOL is detected, then no need to validate this and
+ * the log blocks that follow.
+ *
+ * We issue DISCARD commands to erase mlogs. However, the data
+ * read from a discarded block is non-deterministic. It could be
+ * all 0s, all 1s or the last written data.
+ *
+ * We could read the following 5 types of data from an mlog:
+ * 1) Garbage
+ * 2) Stale logs with different log block gen
+ * 3) Stale logs with different flushset ID
+ * 4) Stale logs with different magic (UUID)
+ * 5) Valid logs
+ */
+ if (mpool_uuid_compare(&lbh.olh_magic, &layout->eld_uuid) ||
+ (lbh.olh_gen != layout->eld_gen) || (lbh.olh_pfsetid != *fsetidmax)) {
+ *leol_found = true;
+ *pfsetid = *fsetidmax;
+ rbuf += sectsz;
+ max_cfsetid(&lbh, layout, fsetidmax);
+ continue;
+ }
+
+ *fsetidmax = lbh.olh_cfsetid;
+
+ /* Validate the log block at lbidx. */
+ rc = mlog_logrecs_validate(lstat, midrec, rbidx, lbidx);
+ if (rc) {
+ mp_pr_err("mlog %p,, midrec %d, log pg idx %u, sector idx %u",
+ rc, mlh, *midrec, rbidx, lbidx);
+
+ return rc;
+ }
+
+ ++lstat->lst_wsoff;
+ rbuf += sectsz;
+ }
+
+ return 0;
+}
+
+/**
+ * mlog_read_and_validate() - Read and validate mlog records
+ * @mp: mpool descriptor
+ * @layout: layout descriptor
+ * @lempty: is the log empty? (output)
+ *
+ * Called by mlog_open() to read and validate log records in the mlog.
+ * In addition, determine the previous and current flush
+ * set ID to be used by the next flush.
+ *
+ * Note: this function reads the entire mlog. Doing so allows us to confirm that
+ * the mlog's contents are completely legit, and also to recognize the case
+ * where a compaction started but failed to complete (CSTART with no CEND) -
+ * for which the recovery is to use the other mlog of the mlpair.
+ * If the mlog is huge, or if there are a bazillion of them, this could be an
+ * issue to revisit in future performance or functionality optimizations.
+ *
+ * Transactional logs are expensive; this does some "extra" reading at open
+ * time, with some serious benefits.
+ *
+ * Caller must hold the write lock on the layout, which protects the mutation
+ * of the read buffer.
+ */
+static int mlog_read_and_validate(struct mpool_descriptor *mp,
+ struct pmd_layout *layout, bool *lempty)
+{
+ struct mlog_stat *lstat = &layout->eld_lstat;
+ off_t leol_off = 0, rsoff;
+ int midrec = 0, remsec;
+ bool leol_found = false;
+ bool fsetid_loop = false;
+ bool skip_ser = false;
+ u32 fsetidmax = 0;
+ u32 pfsetid = 0;
+ u16 maxsec, nsecs;
+ u16 nlpgs, nseclpg;
+ int rc = 0;
+
+ remsec = MLOG_TOTSEC(lstat);
+ maxsec = MLOG_NSECMB(lstat);
+ rsoff = lstat->lst_wsoff;
+
+ while (remsec > 0) {
+ u16 rbidx;
+
+ nseclpg = MLOG_NSECLPG(lstat);
+ nsecs = min_t(u32, maxsec, remsec);
+
+ rc = mlog_populate_rbuf(mp, layout, &nsecs, &rsoff, skip_ser);
+ if (rc) {
+ mp_pr_err("mpool %s, mlog 0x%lx validate failed, nsecs: %u, rsoff: 0x%lx",
+ rc, mp->pds_name, (ulong)layout->eld_objid, nsecs, rsoff);
+
+ goto exit;
+ }
+
+ nlpgs = (nsecs + nseclpg - 1) / nseclpg;
+ lstat->lst_rsoff = rsoff;
+
+ /* Validate the read buffer, one log page at a time. */
+ for (rbidx = 0; rbidx < nlpgs; rbidx++) {
+
+ /* No. of sectors in the last log page. */
+ if (rbidx == nlpgs - 1) {
+ nseclpg = nsecs % nseclpg;
+ nseclpg = nseclpg > 0 ? nseclpg : MLOG_NSECLPG(lstat);
+ }
+
+ /* Validate the log block(s) in the log page @rbidx. */
+ rc = mlog_logpage_validate(layout2mlog(layout), lstat, rbidx, nseclpg,
+ &midrec, &leol_found, &fsetidmax, &pfsetid);
+ if (rc) {
+ mp_pr_err("mpool %s, mlog 0x%lx rbuf validate failed, leol: %d, fsetidmax: %u, pfsetid: %u",
+ rc, mp->pds_name, (ulong)layout->eld_objid, leol_found,
+ fsetidmax, pfsetid);
+
+ mlog_free_rbuf(lstat, rbidx, nlpgs - 1);
+ goto exit;
+ }
+
+ mlog_free_rbuf(lstat, rbidx, rbidx);
+
+ /*
+ * If LEOL is found, then note down the LEOL offset
+ * and kick off the scan to identify any stale flush
+ * set id from a prior failed flush. If there's one,
+ * then the next flush set ID must be set one greater
+ * than the stale fsetid.
+ */
+ if (leol_found && !fsetid_loop) {
+ leol_off = lstat->lst_wsoff;
+ fsetid_loop = true;
+ }
+ }
+
+ remsec -= nsecs;
+ if (remsec == 0)
+ break;
+ ASSERT(remsec > 0);
+
+ if (fsetid_loop) {
+ u16 compsec;
+ off_t endoff;
+ /*
+ * To determine the new flush set ID, we need to
+ * scan only through the next min(MLOG_NSECMB, remsec)
+ * sectors. This is because of the max flush size being
+ * 1 MB and hence a failed flush wouldn't have touched
+ * any sectors beyond 1 MB from LEOL.
+ */
+ endoff = rsoff + nsecs - 1;
+ compsec = endoff - leol_off + 1;
+ remsec = min_t(u32, remsec, maxsec - compsec);
+ ASSERT(remsec >= 0);
+
+ rsoff = endoff + 1;
+ } else {
+ rsoff = lstat->lst_wsoff;
+ }
+ }
+
+ /* LEOL wouldn't have been set for a full log. */
+ if (!leol_found)
+ pfsetid = fsetidmax;
+
+ if (pfsetid != 0)
+ *lempty = false;
+
+ lstat->lst_pfsetid = pfsetid;
+ lstat->lst_cfsetid = fsetidmax + 1;
+
+exit:
+ lstat->lst_rsoff = -1;
+
+ return rc;
+}
+
+int mlog_open(struct mpool_descriptor *mp, struct mlog_descriptor *mlh, u8 flags, u64 *gen)
+{
+ struct pmd_layout *layout = mlog2layout(mlh);
+ struct mlog_stat *lstat;
+ bool lempty, csem, skip_ser;
+ int rc = 0;
+
+ lempty = csem = skip_ser = false;
+ lstat = NULL;
+ *gen = 0;
+
+ if (!layout)
+ return -EINVAL;
+
+ pmd_obj_wrlock(layout);
+
+ flags &= MLOG_OF_SKIP_SER | MLOG_OF_COMPACT_SEM;
+
+ if (flags & MLOG_OF_COMPACT_SEM)
+ csem = true;
+
+ if (flags & MLOG_OF_SKIP_SER)
+ skip_ser = true;
+
+ lstat = &layout->eld_lstat;
+
+ if (lstat->lst_abuf) {
+ /* Mlog already open */
+ if (csem && !lstat->lst_csem) {
+ pmd_obj_wrunlock(layout);
+
+ /* Re-open has inconsistent csem flag */
+ rc = -EINVAL;
+ mp_pr_err("mpool %s, re-opening of mlog 0x%lx, inconsistent csem %u %u",
+ rc, mp->pds_name, (ulong)layout->eld_objid,
+ csem, lstat->lst_csem);
+ } else if (skip_ser && !(layout->eld_flags & MLOG_OF_SKIP_SER)) {
+ pmd_obj_wrunlock(layout);
+
+ /* Re-open has inconsistent serialization flag */
+ rc = -EINVAL;
+ mp_pr_err("mpool %s, re-opening of mlog 0x%lx, inconsistent ser %u %u",
+ rc, mp->pds_name, (ulong)layout->eld_objid, skip_ser,
+ layout->eld_flags & MLOG_OF_SKIP_SER);
+ } else {
+ *gen = layout->eld_gen;
+ pmd_obj_wrunlock(layout);
+ }
+ return rc;
+ }
+
+ if (!(layout->eld_state & PMD_LYT_COMMITTED)) {
+ *gen = 0;
+ pmd_obj_wrunlock(layout);
+
+ rc = -EINVAL;
+ mp_pr_err("mpool %s, mlog 0x%lx, not committed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ return rc;
+ }
+
+ if (skip_ser)
+ layout->eld_flags |= MLOG_OF_SKIP_SER;
+
+ rc = mlog_stat_init(mp, mlh, csem);
+ if (rc) {
+ *gen = 0;
+ pmd_obj_wrunlock(layout);
+
+ mp_pr_err("mpool %s, mlog 0x%lx, mlog status initialization failed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ return rc;
+ }
+
+ lempty = true;
+
+ rc = mlog_read_and_validate(mp, layout, &lempty);
+ if (rc) {
+ mlog_stat_free(layout);
+ pmd_obj_wrunlock(layout);
+
+ mp_pr_err("mpool %s, mlog 0x%lx, mlog content validation failed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ return rc;
+ } else if (!lempty && csem) {
+ if (!lstat->lst_cstart) {
+ mlog_stat_free(layout);
+ pmd_obj_wrunlock(layout);
+
+ rc = -ENODATA;
+ mp_pr_err("mpool %s, mlog 0x%lx, compaction start missing",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ return rc;
+ } else if (!lstat->lst_cend) {
+ mlog_stat_free(layout);
+ pmd_obj_wrunlock(layout);
+
+ /* Incomplete compaction */
+ rc = -EMSGSIZE;
+ mp_pr_err("mpool %s, mlog 0x%lx, incomplete compaction",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ return rc;
+ }
+ }
+
+ *gen = layout->eld_gen;
+
+ /* TODO: Verify that the insert succeeded... */
+ oml_layout_lock(mp);
+ oml_layout_insert(mp, &layout->eld_mlpriv);
+ oml_layout_unlock(mp);
+
+ pmd_obj_wrunlock(layout);
+
+ return rc;
+}
+
+/**
+ * mlog_close() - Flush and close log and release resources; no-op if log is not open.
+ *
+ * Returns: 0 on success; -errno otherwise
+ */
+int mlog_close(struct mpool_descriptor *mp, struct mlog_descriptor *mlh)
+{
+ struct pmd_layout *layout = mlog2layout(mlh);
+ struct mlog_stat *lstat;
+ bool skip_ser = false;
+ int rc = 0;
+
+ if (!layout)
+ return -EINVAL;
+
+ /*
+ * Inform pre-compaction that there is no need to try to compact
+ * an mpool MDC that would contain this mlog because it is closed.
+ */
+ pmd_precompact_alsz(mp, layout->eld_objid, 0, 0);
+
+ pmd_obj_wrlock(layout);
+
+ lstat = &layout->eld_lstat;
+ if (!lstat->lst_abuf) {
+ pmd_obj_wrunlock(layout);
+
+ return 0; /* Log already closed */
+ }
+
+ /* Flush log if potentially dirty and remove layout from open list */
+ if (lstat->lst_abdirty) {
+ rc = mlog_logblocks_flush(mp, layout, skip_ser);
+ lstat->lst_abdirty = false;
+ if (rc)
+ mp_pr_err("mpool %s, mlog 0x%lx close, log block flush failed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ }
+
+ oml_layout_lock(mp);
+ oml_layout_remove(mp, layout->eld_objid);
+ oml_layout_unlock(mp);
+
+ mlog_stat_free(layout);
+
+ /* Reset Mlog flags */
+ layout->eld_flags &= (~MLOG_OF_SKIP_SER);
+
+ pmd_obj_wrunlock(layout);
+
+ return rc;
+}
+
+/**
+ * mlog_gen() - Get generation number for log; log can be open or closed.
+ *
+ * Returns: 0 if successful; -errno otherwise
+ */
+int mlog_gen(struct mlog_descriptor *mlh, u64 *gen)
+{
+ struct pmd_layout *layout = mlog2layout(mlh);
+
+ *gen = 0;
+
+ if (!layout)
+ return -EINVAL;
+
+ pmd_obj_rdlock(layout);
+ *gen = layout->eld_gen;
+ pmd_obj_rdunlock(layout);
+
+ return 0;
+}
+
+/**
+ * mlog_empty() - Determine if log is empty; log must be open.
+ *
+ * Returns: 0 if successful; -errno otherwise
+ */
+int mlog_empty(struct mpool_descriptor *mp, struct mlog_descriptor *mlh, bool *empty)
+{
+ struct pmd_layout *layout = mlog2layout(mlh);
+ struct mlog_stat *lstat;
+ int rc = 0;
+
+ *empty = false;
+
+ if (!layout)
+ return -EINVAL;
+
+ pmd_obj_rdlock(layout);
+
+ lstat = &layout->eld_lstat;
+ if (lstat->lst_abuf) {
+ if (!lstat->lst_wsoff && lstat->lst_aoff == OMF_LOGBLOCK_HDR_PACKLEN)
+ *empty = true;
+ } else {
+ rc = -ENOENT;
+ }
+
+ pmd_obj_rdunlock(layout);
+
+ if (rc)
+ mp_pr_err("mpool %s, mlog 0x%lx empty: no mlog status",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+
+ return rc;
+}
+
+/**
+ * mlog_len() - Return the raw mlog bytes consumed; the log must be open.
+ *
+ * Need to account for both metadata and user bytes while computing the log length.
+ */
+static int mlog_len(struct mpool_descriptor *mp, struct mlog_descriptor *mlh, u64 *len)
+{
+ struct pmd_layout *layout = mlog2layout(mlh);
+ struct mlog_stat *lstat;
+ int rc = 0;
+
+ if (!layout)
+ return -EINVAL;
+
+ pmd_obj_rdlock(layout);
+
+ lstat = &layout->eld_lstat;
+ if (lstat->lst_abuf)
+ *len = ((u64) lstat->lst_wsoff * MLOG_SECSZ(lstat)) + lstat->lst_aoff;
+ else
+ rc = -ENOENT;
+
+ pmd_obj_rdunlock(layout);
+
+ if (rc)
+ mp_pr_err("mpool %s, mlog 0x%lx bytes consumed: no mlog status",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+
+ return rc;
+}
+
+/**
+ * mlog_erase() - Erase log, setting generation number to max(current gen + 1, mingen).
+ *
+ * Log can be open or closed, but must be committed; operation is idempotent
+ * and can be retried if fails.
+ *
+ * Returns: 0 on success; -errno otherwise
+ */
+int mlog_erase(struct mpool_descriptor *mp, struct mlog_descriptor *mlh, u64 mingen)
+{
+ struct pmd_layout *layout = mlog2layout(mlh);
+ struct mlog_stat *lstat = NULL;
+ u64 newgen = 0;
+ int rc = 0;
+
+ if (!layout)
+ return -EINVAL;
+
+ pmd_obj_wrlock(layout);
+
+ /* Must be committed to log erase start/end markers */
+ if (!(layout->eld_state & PMD_LYT_COMMITTED)) {
+ pmd_obj_wrunlock(layout);
+
+ rc = -EINVAL;
+ mp_pr_err("mpool %s, erasing mlog 0x%lx, mlog not committed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ return rc;
+ }
+
+ newgen = max(layout->eld_gen + 1, mingen);
+
+ /* If successful updates state and gen in layout */
+ rc = pmd_obj_erase(mp, layout, newgen);
+ if (rc) {
+ pmd_obj_wrunlock(layout);
+
+ mp_pr_err("mpool %s, erasing mlog 0x%lx, logging erase start failed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ return rc;
+ }
+
+ rc = pmd_layout_erase(mp, layout);
+ if (rc) {
+ /*
+ * Log the failure as a debugging message, but ignore the
+ * failure, since discarding blocks here is only advisory
+ */
+ mp_pr_debug("mpool %s, erasing mlog 0x%lx, erase failed ",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ rc = 0;
+ }
+
+ /* If successful updates state in layout */
+ lstat = &layout->eld_lstat;
+ if (lstat->lst_abuf) {
+ /* Log is open so need to update lstat info */
+ mlog_free_abuf(lstat, 0, lstat->lst_abidx);
+ mlog_free_rbuf(lstat, 0, MLOG_NLPGMB(lstat) - 1);
+
+ mlog_stat_init_common(layout, lstat);
+ }
+
+ pmd_obj_wrunlock(layout);
+
+ return rc;
+}
+
+/**
+ * mlog_append_marker() - Append a marker (log rec with zero-length data field) of type mtype.
+ *
+ * Returns: 0 on success; -errno otherwise
+ * One of the possible errno values:
+ * -EFBIG - if no room in log
+ */
+static int mlog_append_marker(struct mpool_descriptor *mp, struct pmd_layout *layout,
+ enum logrec_type_omf mtype)
+{
+ struct mlog_stat *lstat = &layout->eld_lstat;
+ struct omf_logrec_descriptor lrd;
+ u16 sectsz, abidx, aoff;
+ u16 asidx, nseclpg;
+ bool skip_ser = false;
+ char *abuf;
+ off_t lpgoff;
+ int rc;
+
+ sectsz = MLOG_SECSZ(lstat);
+ nseclpg = MLOG_NSECLPG(lstat);
+
+ if (mlog_append_dmax(layout) == -1) {
+ /* Mlog is already full, flush whatever we can */
+ if (lstat->lst_abdirty) {
+ (void)mlog_logblocks_flush(mp, layout, skip_ser);
+ lstat->lst_abdirty = false;
+ }
+
+ return -EFBIG;
+ }
+
+ rc = mlog_update_append_idx(mp, layout, skip_ser);
+ if (rc)
+ return rc;
+
+ abidx = lstat->lst_abidx;
+ abuf = lstat->lst_abuf[abidx];
+ asidx = lstat->lst_wsoff - ((nseclpg * abidx) + lstat->lst_asoff);
+ lpgoff = asidx * sectsz;
+ aoff = lstat->lst_aoff;
+
+ lrd.olr_tlen = 0;
+ lrd.olr_rlen = 0;
+ lrd.olr_rtype = mtype;
+
+ ASSERT(abuf != NULL);
+
+ rc = omf_logrec_desc_pack_htole(&lrd, &abuf[lpgoff + aoff]);
+ if (!rc) {
+ lstat->lst_aoff = aoff + OMF_LOGREC_DESC_PACKLEN;
+
+ rc = mlog_logblocks_flush(mp, layout, skip_ser);
+ lstat->lst_abdirty = false;
+ if (rc)
+ mp_pr_err("mpool %s, mlog 0x%lx log block flush failed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ } else {
+ mp_pr_err("mpool %s, mlog 0x%lx log record descriptor packing failed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ }
+
+ return rc;
+}
+
+/**
+ * mlog_append_cstart() - Append compaction start marker; log must be open with csem flag true.
+ *
+ * Returns: 0 on success; -errno otherwise
+ * One of the possible errno values:
+ * -EFBIG - if no room in log
+ */
+int mlog_append_cstart(struct mpool_descriptor *mp, struct mlog_descriptor *mlh)
+{
+ struct pmd_layout *layout = mlog2layout(mlh);
+ struct mlog_stat *lstat;
+ int rc = 0;
+
+ if (!layout)
+ return -EINVAL;
+
+ pmd_obj_wrlock(layout);
+
+ lstat = &layout->eld_lstat;
+ if (!lstat->lst_abuf) {
+ pmd_obj_wrunlock(layout);
+
+ rc = -ENOENT;
+ mp_pr_err("mpool %s, in mlog 0x%lx, inconsistency: no mlog status",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ return rc;
+ }
+
+ if (!lstat->lst_csem || lstat->lst_cstart) {
+ pmd_obj_wrunlock(layout);
+
+ rc = -EINVAL;
+ mp_pr_err("mpool %s, in mlog 0x%lx, inconsistent state %u %u",
+ rc, mp->pds_name,
+ (ulong)layout->eld_objid, lstat->lst_csem, lstat->lst_cstart);
+ return rc;
+ }
+
+ rc = mlog_append_marker(mp, layout, OMF_LOGREC_CSTART);
+ if (rc) {
+ pmd_obj_wrunlock(layout);
+
+ mp_pr_err("mpool %s, in mlog 0x%lx, marker append failed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ return rc;
+ }
+
+ lstat->lst_cstart = 1;
+ pmd_obj_wrunlock(layout);
+
+ return 0;
+}
+
+/**
+ * mlog_append_cend() - Append compaction end marker; log must be open with csem flag true.
+ *
+ * Returns: 0 on success; -errno otherwise
+ * One of the possible errno values:
+ * -EFBIG - if no room in log
+ */
+int mlog_append_cend(struct mpool_descriptor *mp, struct mlog_descriptor *mlh)
+{
+ struct pmd_layout *layout = mlog2layout(mlh);
+ struct mlog_stat *lstat;
+ int rc = 0;
+
+ if (!layout)
+ return -EINVAL;
+
+ pmd_obj_wrlock(layout);
+
+ lstat = &layout->eld_lstat;
+ if (!lstat->lst_abuf) {
+ pmd_obj_wrunlock(layout);
+
+ rc = -ENOENT;
+ mp_pr_err("mpool %s, mlog 0x%lx, inconsistency: no mlog status",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ return rc;
+ }
+
+ if (!lstat->lst_csem || !lstat->lst_cstart || lstat->lst_cend) {
+ pmd_obj_wrunlock(layout);
+
+ rc = -EINVAL;
+ mp_pr_err("mpool %s, mlog 0x%lx, inconsistent state %u %u %u",
+ rc, mp->pds_name, (ulong)layout->eld_objid, lstat->lst_csem,
+ lstat->lst_cstart, lstat->lst_cend);
+ return rc;
+ }
+
+ rc = mlog_append_marker(mp, layout, OMF_LOGREC_CEND);
+ if (rc) {
+ pmd_obj_wrunlock(layout);
+
+ mp_pr_err("mpool %s, mlog 0x%lx, marker append failed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ return rc;
+ }
+
+ lstat->lst_cend = 1;
+ pmd_obj_wrunlock(layout);
+
+ return 0;
+}
+
+/**
+ * memcpy_from_iov() - Copy contents from an iovec into a destination buffer.
+ * @iov : One or more source buffers in the form of an iovec
+ * @buf : Destination buffer
+ * @buflen : Number of bytes to copy, i.e., the minimum of the total source
+ * and destination lengths
+ * @nextidx: The next index in iov if the copy requires multiple invocations
+ * of memcpy_from_iov.
+ *
+ * No bounds check is done on iov. The caller is expected to pass the minimum
+ * of the source and destination lengths as @buflen.
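+ *
+ * Example (illustrative): with iov = { { "abc", 3 }, { "defg", 4 } } and
+ * buflen = 5, buf receives "abcde", the iov segments are advanced past the
+ * bytes copied, and *nextidx ends at 1 so a later call continues with "fg".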
+ */
+static void memcpy_from_iov(struct kvec *iov, char *buf, size_t buflen, int *nextidx)
+{
+ int i = *nextidx, cp;
+
+ if ((buflen > 0) && (iov[i].iov_len == 0))
+ i++;
+
+ while (buflen > 0) {
+ cp = (buflen < iov[i].iov_len) ? buflen : iov[i].iov_len;
+
+ if (iov[i].iov_base)
+ memcpy(buf, iov[i].iov_base, cp);
+
+ iov[i].iov_len -= cp;
+ iov[i].iov_base += cp;
+ buflen -= cp;
+ buf += cp;
+
+ if (iov[i].iov_len == 0)
+ i++;
+ }
+
+ *nextidx = i;
+}
+
+/**
+ * mlog_append_data_internal() - Append data record with buflen data bytes from buf.
+ * @mp: mpool descriptor
+ * @mlh: mlog descriptor
+ * @iov: iovec containing user data
+ * @buflen: length of the user buffer
+ * @sync: if true, then we do not return until data is on media
+ * @skip_ser: client guarantees serialization
+ *
+ * The log must be open; if the log was opened with csem true, then a
+ * compaction start marker must be in place.
+ *
+ * Returns: 0 on success; -errno otherwise
+ * One of the possible errno values:
+ * -EFBIG - if no room in log
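+ *
+ * Example (illustrative sizes): with sectsz = 4096 and an empty sector, a
+ * record that fits in one sector is written as a single DATAFULL record,
+ * while a 10000-byte record is split into DATAFIRST, DATAMID and DATALAST
+ * chunks, each at most sectsz - OMF_LOGREC_DESC_PACKLEN bytes of payload.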
+ */
+static int mlog_append_data_internal(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+ struct kvec *iov, u64 buflen, int sync, bool skip_ser)
+{
+ struct pmd_layout *layout = mlog2layout(mlh);
+ struct mlog_stat *lstat = &layout->eld_lstat;
+ struct omf_logrec_descriptor lrd;
+ int rc = 0, dfirst, cpidx;
+ u32 datasec;
+ u64 bufoff, rlenmax;
+ u16 aoff, abidx, asidx;
+ u16 nseclpg, sectsz;
+ off_t lpgoff;
+ char *abuf;
+
+ mlog_extract_fsetparms(lstat, §sz, &datasec, NULL, &nseclpg);
+
+ bufoff = 0;
+ dfirst = 1;
+ cpidx = 0;
+
+ lrd.olr_tlen = buflen;
+
+ while (true) {
+ if ((bufoff != buflen) && (mlog_append_dmax(layout) == -1)) {
+
+ /* Mlog is full and there's more to write;
+ * mlog_append_dmax() should prevent this, but it lied.
+ */
+ mp_pr_warn("mpool %s, mlog 0x%lx append, mlog free space incorrect",
+ mp->pds_name, (ulong)layout->eld_objid);
+
+ return -EFBIG;
+ }
+
+ rc = mlog_update_append_idx(mp, layout, skip_ser);
+ if (rc)
+ return rc;
+
+ abidx = lstat->lst_abidx;
+ abuf = lstat->lst_abuf[abidx];
+ asidx = lstat->lst_wsoff - ((nseclpg * abidx) + lstat->lst_asoff);
+ lpgoff = asidx * sectsz;
+ aoff = lstat->lst_aoff;
+
+ ASSERT(abuf != NULL);
+
+ rlenmax = min((u64)(sectsz - aoff - OMF_LOGREC_DESC_PACKLEN),
+ (u64)OMF_LOGREC_DESC_RLENMAX);
+
+ if (buflen - bufoff <= rlenmax) {
+ lrd.olr_rlen = buflen - bufoff;
+ if (dfirst)
+ lrd.olr_rtype = OMF_LOGREC_DATAFULL;
+ else
+ lrd.olr_rtype = OMF_LOGREC_DATALAST;
+ } else {
+ lrd.olr_rlen = rlenmax;
+ if (dfirst) {
+ lrd.olr_rtype = OMF_LOGREC_DATAFIRST;
+ dfirst = 0;
+ } else {
+ lrd.olr_rtype = OMF_LOGREC_DATAMID;
+ }
+ }
+
+ rc = omf_logrec_desc_pack_htole(&lrd, &abuf[lpgoff + aoff]);
+ if (rc) {
+ mp_pr_err("mpool %s, mlog 0x%lx, log record packing failed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ break;
+ }
+
+ lstat->lst_abdirty = true;
+
+ aoff = aoff + OMF_LOGREC_DESC_PACKLEN;
+ if (lrd.olr_rlen) {
+ memcpy_from_iov(iov, &abuf[lpgoff + aoff], lrd.olr_rlen, &cpidx);
+ aoff = aoff + lrd.olr_rlen;
+ bufoff = bufoff + lrd.olr_rlen;
+ }
+ lstat->lst_aoff = aoff;
+
+ /*
+ * Flush log block if sync and no more to write (or)
+ * if the CFS is full.
+ */
+ if ((sync && buflen == bufoff) ||
+ (abidx == MLOG_NLPGMB(lstat) - 1 && asidx == nseclpg - 1 &&
+ sectsz - aoff < OMF_LOGREC_DESC_PACKLEN)) {
+
+ rc = mlog_logblocks_flush(mp, layout, skip_ser);
+ lstat->lst_abdirty = false;
+ if (rc) {
+ mp_pr_err("mpool %s, mlog 0x%lx, log block flush failed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ break;
+ }
+ }
+
+ ASSERT(rc == 0);
+
+ if (bufoff == buflen)
+ break;
+ }
+
+ return rc;
+}
+
+static int mlog_append_datav(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+ struct kvec *iov, u64 buflen, int sync)
+{
+ struct pmd_layout *layout = mlog2layout(mlh);
+ struct mlog_stat *lstat;
+ s64 dmax = 0;
+ bool skip_ser = false;
+ int rc = 0;
+
+ if (!layout)
+ return -EINVAL;
+
+ if (layout->eld_flags & MLOG_OF_SKIP_SER)
+ skip_ser = true;
+
+ if (!skip_ser)
+ pmd_obj_wrlock(layout);
+
+ lstat = &layout->eld_lstat;
+ if (!lstat->lst_abuf) {
+ rc = -ENOENT;
+ mp_pr_err("mpool %s, mlog 0x%lx, inconsistency: no mlog status",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ } else if (lstat->lst_csem && !lstat->lst_cstart) {
+ rc = -EINVAL;
+ mp_pr_err("mpool %s, mlog 0x%lx, inconsistent state %u %u", rc, mp->pds_name,
+ (ulong)layout->eld_objid, lstat->lst_csem, lstat->lst_cstart);
+ } else {
+ dmax = mlog_append_dmax(layout);
+ if (dmax < 0 || buflen > dmax) {
+ rc = -EFBIG;
+ mp_pr_debug("mpool %s, mlog 0x%lx mlog full %ld",
+ rc, mp->pds_name, (ulong)layout->eld_objid, (long)dmax);
+
+ /* Flush whatever we can. */
+ if (lstat->lst_abdirty) {
+ (void)mlog_logblocks_flush(mp, layout, skip_ser);
+ lstat->lst_abdirty = false;
+ }
+ }
+ }
+
+ if (rc) {
+ if (!skip_ser)
+ pmd_obj_wrunlock(layout);
+ return rc;
+ }
+
+ rc = mlog_append_data_internal(mp, mlh, iov, buflen, sync, skip_ser);
+ if (rc) {
+ mp_pr_err("mpool %s, mlog 0x%lx append failed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+
+ /* Flush whatever we can. */
+ if (lstat->lst_abdirty) {
+ (void)mlog_logblocks_flush(mp, layout, skip_ser);
+ lstat->lst_abdirty = false;
+ }
+ }
+
+ if (!skip_ser)
+ pmd_obj_wrunlock(layout);
+
+ return rc;
+}
+
+int mlog_append_data(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+ char *buf, u64 buflen, int sync)
+{
+ struct kvec iov;
+
+ iov.iov_base = buf;
+ iov.iov_len = buflen;
+
+ return mlog_append_datav(mp, mlh, &iov, buflen, sync);
+}
+
+/**
+ * mlog_read_data_init() - Initialize iterator for reading data records from log.
+ *
+ * Log must be open; skips non-data records (markers).
+ *
+ * Returns: 0 on success; -errno otherwise
+ */
+int mlog_read_data_init(struct mlog_descriptor *mlh)
+{
+ struct pmd_layout *layout = mlog2layout(mlh);
+ struct mlog_stat *lstat;
+ struct mlog_read_iter *lri;
+ int rc = 0;
+
+ if (!layout)
+ return -EINVAL;
+
+ pmd_obj_wrlock(layout);
+
+ lstat = &layout->eld_lstat;
+ if (!lstat->lst_abuf) {
+ rc = -ENOENT;
+ } else {
+ lri = &lstat->lst_citr;
+
+ mlog_read_iter_init(layout, lstat, lri);
+ }
+
+ pmd_obj_wrunlock(layout);
+
+ return rc;
+}
+
+/**
+ * mlog_read_data_next_impl() - Read the next data record from an mlog.
+ * @mp:     mpool descriptor
+ * @mlh:    mlog descriptor
+ * @skip:   if true, advance the iterator without copying data out
+ * @buf:    receive buffer
+ * @buflen: receive buffer length in bytes
+ * @rdlen:  number of bytes read (output)
+ *
+ * Return:
+ * -EOVERFLOW: the caller must retry with a larger receive buffer,
+ * the length of an adequate receive buffer is returned in "rdlen".
+ */
+static int mlog_read_data_next_impl(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+ bool skip, char *buf, u64 buflen, u64 *rdlen)
+{
+ struct omf_logrec_descriptor lrd;
+ struct mlog_read_iter *lri = NULL;
+ struct pmd_layout *layout;
+ struct mlog_stat *lstat;
+
+ u64 bufoff = 0, midrec = 0;
+ bool recfirst = false;
+ bool skip_ser = false;
+ char *inbuf = NULL;
+ u32 sectsz = 0;
+ int rc = 0;
+
+ layout = mlog2layout(mlh);
+ if (!layout)
+ return -EINVAL;
+
+ if (!mlog_objid(layout->eld_objid))
+ return -EINVAL;
+
+ if (layout->eld_flags & MLOG_OF_SKIP_SER)
+ skip_ser = true;
+ /*
+ * Need write lock because loading log block to read updates lstat.
+ * Currently have no use case requiring support for concurrent readers.
+ */
+ if (!skip_ser)
+ pmd_obj_wrlock(layout);
+
+ lstat = &layout->eld_lstat;
+ if (lstat->lst_abuf) {
+ sectsz = MLOG_SECSZ(lstat);
+ lri = &lstat->lst_citr;
+
+ if (!lri->lri_valid) {
+ if (!skip_ser)
+ pmd_obj_wrunlock(layout);
+
+ rc = -EINVAL;
+ mp_pr_err("mpool %s, mlog 0x%lx, invalid iterator",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ return rc;
+ }
+ }
+
+ if (!lstat || !lri) {
+ rc = -ENOENT;
+ mp_pr_err("mpool %s, mlog 0x%lx, inconsistency: no mlog status",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ } else if (lri->lri_gen != layout->eld_gen ||
+ lri->lri_soff > lstat->lst_wsoff ||
+ (lri->lri_soff == lstat->lst_wsoff && lri->lri_roff > lstat->lst_aoff) ||
+ lri->lri_roff > sectsz) {
+
+ rc = -EINVAL;
+ mp_pr_err("mpool %s, mlog 0x%lx, invalid args gen %lu %lu offsets %ld %ld %u %u %u",
+ rc, mp->pds_name, (ulong)layout->eld_objid, (ulong)lri->lri_gen,
+ (ulong)layout->eld_gen, lri->lri_soff, lstat->lst_wsoff, lri->lri_roff,
+ lstat->lst_aoff, sectsz);
+ } else if (lri->lri_soff == lstat->lst_wsoff && lri->lri_roff == lstat->lst_aoff) {
+		/* Hit end of log - not an error. */
+ rc = -ENOMSG;
+ }
+
+ if (rc) {
+ if (!skip_ser)
+ pmd_obj_wrunlock(layout);
+ if (rc == -ENOMSG) {
+ rc = 0;
+ if (rdlen)
+ *rdlen = 0;
+ }
+
+ return rc;
+ }
+
+ bufoff = 0;
+ midrec = 0;
+
+ while (true) {
+		/* Get the log block referenced by lri; it may be the accumulating buffer. */
+ rc = mlog_logblock_load(mp, lri, &inbuf, &recfirst);
+ if (rc) {
+ if (rc == -ENOMSG) {
+ if (!skip_ser)
+ pmd_obj_wrunlock(layout);
+ rc = 0;
+ if (rdlen)
+ *rdlen = 0;
+
+ return rc;
+ }
+
+ mp_pr_err("mpool %s, mlog 0x%lx, getting log block failed",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ break;
+ }
+
+ if ((sectsz - lri->lri_roff) < OMF_LOGREC_DESC_PACKLEN) {
+ /* No more records in current log block */
+ if (lri->lri_soff < lstat->lst_wsoff) {
+
+ /* Move to next log block */
+ lri->lri_soff = lri->lri_soff + 1;
+ lri->lri_roff = 0;
+ continue;
+ } else {
+ /*
+				 * Hit end of log: return EOF even if we are
+				 * mid-way through a partial data record, which
+				 * is a valid failure mode and must be ignored.
+ */
+ if (bufoff)
+ rc = -ENODATA;
+
+ bufoff = 0; /* Force EOF on partials! */
+ break;
+ }
+ }
+
+ /* Parse next record in log block */
+ omf_logrec_desc_unpack_letoh(&lrd, &inbuf[lri->lri_roff]);
+
+ if (logrec_type_datarec(lrd.olr_rtype)) {
+ /* Data record */
+ if (lrd.olr_rtype == OMF_LOGREC_DATAFULL ||
+ lrd.olr_rtype == OMF_LOGREC_DATAFIRST) {
+ if (midrec && !recfirst) {
+ rc = -ENODATA;
+
+ /*
+					 * Mid-record here is valid only if this
+					 * is the first record in the log block,
+					 * indicating a partial data record at the
+					 * end of the previous block; otherwise
+					 * it is a logging error.
+ */
+ mp_pr_err("mpool %s, mlog 0x%lx, inconsistent 1 data rec",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ break;
+ }
+ /*
+ * Reset copy-out; set midrec which is needed for DATAFIRST
+ */
+ bufoff = 0;
+ midrec = 1;
+ } else if (lrd.olr_rtype == OMF_LOGREC_DATAMID ||
+ lrd.olr_rtype == OMF_LOGREC_DATALAST) {
+ if (!midrec) {
+ rc = -ENODATA;
+
+ /* Must occur mid data record. */
+ mp_pr_err("mpool %s, mlog 0x%lx, inconsistent 2 data rec",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ break;
+ }
+ }
+
+ /*
+			 * This check is loop-invariant, but it cannot be done
+			 * until after the unpack.
+			 *
+			 * Return the required length to the caller.
+ */
+ if (buflen < lrd.olr_tlen) {
+ if (rdlen)
+ *rdlen = lrd.olr_tlen;
+
+ rc = -EOVERFLOW;
+ break;
+ }
+
+ /* Copy-out data */
+ lri->lri_roff = lri->lri_roff + OMF_LOGREC_DESC_PACKLEN;
+
+ if (!skip)
+ memcpy(&buf[bufoff], &inbuf[lri->lri_roff], lrd.olr_rlen);
+
+ lri->lri_roff = lri->lri_roff + lrd.olr_rlen;
+ bufoff = bufoff + lrd.olr_rlen;
+
+ if (lrd.olr_rtype == OMF_LOGREC_DATAFULL ||
+ lrd.olr_rtype == OMF_LOGREC_DATALAST)
+ break;
+ } else {
+ /*
+ * Non data record; just skip unless midrec which is a logging error
+ */
+ if (midrec) {
+ rc = -ENODATA;
+ mp_pr_err("mpool %s, mlog 0x%lx, inconsistent non-data record",
+ rc, mp->pds_name, (ulong)layout->eld_objid);
+ break;
+ }
+ if (lrd.olr_rtype == OMF_LOGREC_EOLB)
+ lri->lri_roff = sectsz;
+ else
+ lri->lri_roff = lri->lri_roff + OMF_LOGREC_DESC_PACKLEN +
+ lrd.olr_rlen;
+ }
+ }
+ if (!rc && rdlen)
+ *rdlen = bufoff;
+ else if (rc != -EOVERFLOW && rc != -ENOMEM)
+		/* The iterator remains valid only on -EOVERFLOW or -ENOMEM. */
+ lri->lri_valid = 0;
+
+ if (!skip_ser)
+ pmd_obj_wrunlock(layout);
+
+ return rc;
+}
+
+/**
+ * mlog_read_data_next() - Read next data record into buffer buf of length buflen bytes.
+ *
+ * Log must be open; skips non-data records (markers).
+ *
+ * The iterator must be re-initialized if this returns any error
+ * other than -EOVERFLOW or -ENOMEM.
+ *
+ * Returns:
+ * 0 on success, or one of the following errno values on failure:
+ * -EOVERFLOW if buflen is too small to hold the data record; the
+ * caller may retry with a larger buffer
+ * -errno otherwise
+ *
+ * On success, the number of bytes read is returned via rdlen (which
+ * can be 0 if a zero-length data record was appended).
+ */
+int mlog_read_data_next(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+ char *buf, u64 buflen, u64 *rdlen)
+{
+ return mlog_read_data_next_impl(mp, mlh, false, buf, buflen, rdlen);
+}
+
+/**
+ * mlog_get_props() - Return basic mlog properties in prop.
+ *
+ * Returns: 0 if successful; -errno otherwise
+ */
+static int mlog_get_props(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+ struct mlog_props *prop)
+{
+ struct pmd_layout *layout = mlog2layout(mlh);
+
+ if (!layout)
+ return -EINVAL;
+
+ pmd_obj_rdlock(layout);
+ mlog_getprops_cmn(mp, layout, prop);
+ pmd_obj_rdunlock(layout);
+
+ return 0;
+}
+
+/**
+ * mlog_get_props_ex() - Return extended mlog properties in prop.
+ *
+ * Returns: 0 if successful; -errno otherwise
+ */
+int mlog_get_props_ex(struct mpool_descriptor *mp, struct mlog_descriptor *mlh,
+ struct mlog_props_ex *prop)
+{
+ struct pmd_layout *layout;
+ struct pd_prop *pdp;
+
+ layout = mlog2layout(mlh);
+ if (!layout)
+ return -EINVAL;
+
+ pdp = &mp->pds_pdv[layout->eld_ld.ol_pdh].pdi_prop;
+
+ pmd_obj_rdlock(layout);
+ mlog_getprops_cmn(mp, layout, &prop->lpx_props);
+ prop->lpx_zonecnt = layout->eld_ld.ol_zcnt;
+ prop->lpx_state = layout->eld_state;
+ prop->lpx_secshift = PD_SECTORSZ(pdp);
+ prop->lpx_totsec = pmd_layout_cap_get(mp, layout) >> prop->lpx_secshift;
+ pmd_obj_rdunlock(layout);
+
+ return 0;
+}
+
+void mlog_precompact_alsz(struct mpool_descriptor *mp, struct mlog_descriptor *mlh)
+{
+ struct mlog_props prop;
+ u64 len;
+ int rc;
+
+ rc = mlog_get_props(mp, mlh, &prop);
+ if (rc)
+ return;
+
+ rc = mlog_len(mp, mlh, &len);
+ if (rc)
+ return;
+
+ pmd_precompact_alsz(mp, prop.lpr_objid, len, prop.lpr_alloc_cap);
+}
--
2.17.2
From: Nabeel M Mohamed <[email protected]>
This adds utility routines to:
- Create and initialize a media class with an mpool volume
- Initialize and validate superblocks on all media class
volumes
- Open and initialize all media class volumes
- Allocate metadata container 0 (MDC0) and update the
superblock on the capacity media class volume with metadata
for accessing MDC0
- Create and initialize the root MDC
- Initialize the mpool descriptor and track the mapping between
an mpool UUID and its descriptor in an rbtree
When an mpool is created, a pair of mlogs are instantiated with
well-known OIDs comprising the root MDC of the mpool. The root
MDC provides a location for mpool clients to store whatever
metadata they need for start-up.
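For illustration, a minimal sketch of how a client might reach the
root MDC through its well-known OIDs (this assumes an activated mpool
descriptor mp; error handling elided):

    u64 oid1, oid2;
    struct mlog_descriptor *mlh;

    mlog_lookup_rootids(&oid1, &oid2);           /* well-known root OIDs */
    rc = mlog_find_get(mp, oid1, 1, NULL, &mlh); /* resolve to a handle */
    /* ... read or append client start-up metadata ... */
    mlog_put(mlh);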
Co-developed-by: Greg Becker <[email protected]>
Signed-off-by: Greg Becker <[email protected]>
Co-developed-by: Pierre Labat <[email protected]>
Signed-off-by: Pierre Labat <[email protected]>
Co-developed-by: John Groves <[email protected]>
Signed-off-by: John Groves <[email protected]>
Signed-off-by: Nabeel M Mohamed <[email protected]>
---
drivers/mpool/mpcore.c | 987 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 987 insertions(+)
create mode 100644 drivers/mpool/mpcore.c
diff --git a/drivers/mpool/mpcore.c b/drivers/mpool/mpcore.c
new file mode 100644
index 000000000000..246baedcdcec
--- /dev/null
+++ b/drivers/mpool/mpcore.c
@@ -0,0 +1,987 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2015-2020 Micron Technology, Inc. All rights reserved.
+ */
+
+/*
+ * Media pool (mpool) manager module.
+ *
+ * Defines functions to create and maintain mpools comprising multiple drives
+ * in multiple media classes used for storing mblocks and mlogs.
+ */
+
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/sort.h>
+#include <linux/slab.h>
+#include <linux/kref.h>
+#include <linux/rbtree.h>
+
+#include "mpool_ioctl.h"
+
+#include "mpool_printk.h"
+#include "assert.h"
+#include "uuid.h"
+
+#include "mp.h"
+#include "omf.h"
+#include "omf_if.h"
+#include "pd.h"
+#include "smap.h"
+#include "mclass.h"
+#include "pmd_obj.h"
+#include "mpcore.h"
+#include "sb.h"
+#include "upgrade.h"
+
+struct omf_devparm_descriptor;
+struct mpool_descriptor;
+
+/* Rbtree mapping mpool UUID to mpool descriptor node: uuid_to_mpdesc_rb */
+struct rb_root mpool_pools = { NULL };
+
+int uuid_to_mpdesc_insert(struct rb_root *root, struct mpool_descriptor *data)
+{
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ /* Figure out where to put new node */
+ while (*new) {
+ struct mpool_descriptor *this = rb_entry(*new, struct mpool_descriptor, pds_node);
+
+ int result = mpool_uuid_compare(&data->pds_poolid, &this->pds_poolid);
+
+ parent = *new;
+ if (result < 0)
+ new = &((*new)->rb_left);
+ else if (result > 0)
+ new = &((*new)->rb_right);
+ else
+ return false;
+ }
+
+ /* Add new node and rebalance tree. */
+ rb_link_node(&data->pds_node, parent, new);
+ rb_insert_color(&data->pds_node, root);
+
+ return true;
+}
+
+static struct mpool_descriptor *
+uuid_to_mpdesc_search(struct rb_root *root, struct mpool_uuid *key_uuid)
+{
+ struct rb_node *node = root->rb_node;
+
+ while (node) {
+ struct mpool_descriptor *data = rb_entry(node, struct mpool_descriptor, pds_node);
+
+ int result = mpool_uuid_compare(key_uuid, &data->pds_poolid);
+
+ if (result < 0)
+ node = node->rb_left;
+ else if (result > 0)
+ node = node->rb_right;
+ else
+ return data;
+ }
+ return NULL;
+}
+
+int mpool_dev_sbwrite(struct mpool_descriptor *mp, struct mpool_dev_info *pd,
+ struct omf_sb_descriptor *sbmdc0)
+{
+ struct omf_sb_descriptor *sb = NULL;
+ struct mc_parms mc_parms;
+ int rc;
+
+ if (mpool_pd_status_get(pd) != PD_STAT_ONLINE) {
+ rc = -EIO;
+ mp_pr_err("%s:%s unavailable or offline, status %d",
+ rc, mp->pds_name, pd->pdi_name, mpool_pd_status_get(pd));
+ return rc;
+ }
+
+ sb = kzalloc(sizeof(struct omf_sb_descriptor), GFP_KERNEL);
+ if (!sb) {
+ rc = -ENOMEM;
+ mp_pr_err("mpool %s, writing superblock on drive %s, alloc of superblock descriptor failed %lu",
+ rc, mp->pds_name, pd->pdi_name, sizeof(struct omf_sb_descriptor));
+ return rc;
+ }
+
+ /*
+	 * Set superblock values common to all drives in the pool
+	 * (new or extant).
+ */
+ sb->osb_magic = OMF_SB_MAGIC;
+ strlcpy((char *) sb->osb_name, mp->pds_name, sizeof(sb->osb_name));
+ sb->osb_vers = OMF_SB_DESC_VER_LAST;
+ mpool_uuid_copy(&sb->osb_poolid, &mp->pds_poolid);
+ sb->osb_gen = 1;
+
+ /* Set superblock values specific to this drive */
+ mpool_uuid_copy(&sb->osb_parm.odp_devid, &pd->pdi_devid);
+ sb->osb_parm.odp_devsz = pd->pdi_parm.dpr_devsz;
+ sb->osb_parm.odp_zonetot = pd->pdi_parm.dpr_zonetot;
+ mc_pd_prop2mc_parms(&pd->pdi_parm.dpr_prop, &mc_parms);
+ mc_parms2omf_devparm(&mc_parms, &sb->osb_parm);
+
+ if (sbmdc0)
+ sbutil_mdc0_copy(sb, sbmdc0);
+ else
+ sbutil_mdc0_clear(sb);
+
+ rc = sb_write_new(&pd->pdi_parm, sb);
+ if (rc) {
+ mp_pr_err("mpool %s, writing superblock on drive %s, write failed",
+ rc, mp->pds_name, pd->pdi_name);
+ }
+
+ kfree(sb);
+ return rc;
+}
+
+/**
+ * mpool_mdc0_alloc() - Allocate space for the two MDC0 mlogs
+ * @mp:
+ * @sb:
+ *
+ * In the context of a mpool create, allocate space for the two MDC0 mlogs
+ * and update the sb structure with the position of MDC0.
+ *
+ * Note: this function assumes that the media classes have already been
+ * created.
+ */
+static int mpool_mdc0_alloc(struct mpool_descriptor *mp, struct omf_sb_descriptor *sb)
+{
+ struct mpool_dev_info *pd;
+ struct media_class *mc;
+ struct mpool_uuid uuid;
+ u64 zcnt, zonelen;
+ u32 cnt;
+ int rc;
+
+ sbutil_mdc0_clear(sb);
+
+ ASSERT(mp->pds_mdparm.md_mclass < MP_MED_NUMBER);
+
+ mc = &mp->pds_mc[mp->pds_mdparm.md_mclass];
+ if (mc->mc_pdmc < 0) {
+ rc = -ENOSPC;
+ mp_pr_err("%s: sb update memory image MDC0 information, not enough drives",
+ rc, mp->pds_name);
+ return rc;
+ }
+
+ pd = &mp->pds_pdv[mc->mc_pdmc];
+
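+	/* zonelen is the zone size in bytes; zcnt = ceil(mp_mdc0cap / zonelen). */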
+ zonelen = (u64)pd->pdi_parm.dpr_zonepg << PAGE_SHIFT;
+ zcnt = 1 + ((mp->pds_params.mp_mdc0cap - 1) / zonelen);
+
+ cnt = sb_zones_for_sbs(&(pd->pdi_prop));
+ if (cnt < 1) {
+ rc = -EINVAL;
+ mp_pr_err("%s: sb MDC0, getting sb range failed for drive %s %u",
+ rc, mp->pds_name, pd->pdi_name, cnt);
+ return rc;
+ }
+
+ if ((pd->pdi_zonetot - cnt) < zcnt * 2) {
+ rc = -ENOSPC;
+ mp_pr_err("%s: sb MDC0, no room for MDC0 on drive %s %lu %u %lu",
+ rc, mp->pds_name, pd->pdi_name,
+ (ulong)pd->pdi_zonetot, cnt, (ulong)zcnt);
+ return rc;
+ }
+
+ /*
+ * mdc0 log1/2 alloced on first 2 * zcnt zone's
+ */
+ rc = pd_zone_erase(&pd->pdi_parm, cnt, zcnt * 2, true);
+ if (rc) {
+ mp_pr_err("%s: sb MDC0, erase failed on %s %u %lu",
+ rc, mp->pds_name, pd->pdi_name, cnt, (ulong)zcnt);
+ return rc;
+ }
+
+ /*
+ * Fill in common mdc0 log1/2 and drive info.
+ */
+ sb->osb_mdc01gen = 1;
+ sb->osb_mdc01desc.ol_zcnt = zcnt;
+ mpool_generate_uuid(&uuid);
+ mpool_uuid_copy(&sb->osb_mdc01uuid, &uuid);
+
+ sb->osb_mdc02gen = 2;
+ sb->osb_mdc02desc.ol_zcnt = zcnt;
+ mpool_generate_uuid(&uuid);
+ mpool_uuid_copy(&sb->osb_mdc02uuid, &uuid);
+
+ mpool_uuid_copy(&sb->osb_mdc01devid, &pd->pdi_devid);
+ sb->osb_mdc01desc.ol_zaddr = cnt;
+
+ mpool_uuid_copy(&sb->osb_mdc02devid, &pd->pdi_devid);
+ sb->osb_mdc02desc.ol_zaddr = cnt + zcnt;
+
+ mpool_uuid_copy(&sb->osb_mdc0dev.odp_devid, &pd->pdi_devid);
+ sb->osb_mdc0dev.odp_devsz = pd->pdi_parm.dpr_devsz;
+ sb->osb_mdc0dev.odp_zonetot = pd->pdi_parm.dpr_zonetot;
+ mc_parms2omf_devparm(&mc->mc_parms, &sb->osb_mdc0dev);
+
+ return 0;
+}
+
+int mpool_dev_sbwrite_newpool(struct mpool_descriptor *mp, struct omf_sb_descriptor *sbmdc0)
+{
+ struct mpool_dev_info *pd = NULL;
+ u64 pdh = 0;
+ int rc;
+
+ /* Alloc mdc0 and generate mdc0 info for superblocks */
+ rc = mpool_mdc0_alloc(mp, sbmdc0);
+ if (rc) {
+ mp_pr_err("%s: MDC0 allocation failed", rc, mp->pds_name);
+ return rc;
+ }
+
+ for (pdh = 0; pdh < mp->pds_pdvcnt; pdh++) {
+ pd = &mp->pds_pdv[pdh];
+
+ if (pd->pdi_mclass == mp->pds_mdparm.md_mclass)
+ rc = mpool_dev_sbwrite(mp, pd, sbmdc0);
+ else
+ rc = mpool_dev_sbwrite(mp, pd, NULL);
+ if (rc) {
+ mp_pr_err("%s: sb write %s failed, %d %d", rc, mp->pds_name,
+ pd->pdi_name, pd->pdi_mclass, mp->pds_mdparm.md_mclass);
+ break;
+ }
+ }
+
+ return rc;
+}
+
+int mpool_mdc0_sb2obj(struct mpool_descriptor *mp, struct omf_sb_descriptor *sb,
+ struct pmd_layout **l1, struct pmd_layout **l2)
+{
+ int rc, i;
+
+ /* MDC0 mlog1 layout */
+ *l1 = pmd_layout_alloc(&sb->osb_mdc01uuid, MDC0_OBJID_LOG1, sb->osb_mdc01gen, 0,
+ sb->osb_mdc01desc.ol_zcnt);
+ if (!*l1) {
+ *l1 = *l2 = NULL;
+
+ rc = -ENOMEM;
+ mp_pr_err("mpool %s, MDC0 mlog1 allocation failed", rc, mp->pds_name);
+ return rc;
+ }
+
+ (*l1)->eld_state = PMD_LYT_COMMITTED;
+
+ for (i = 0; i < mp->pds_pdvcnt; i++) {
+ if (mpool_uuid_compare(&mp->pds_pdv[i].pdi_devid, &sb->osb_mdc01devid) == 0) {
+ (*l1)->eld_ld.ol_pdh = i;
+ (*l1)->eld_ld.ol_zaddr = sb->osb_mdc01desc.ol_zaddr;
+ break;
+ }
+ }
+
+ if (i >= mp->pds_pdvcnt) {
+ char uuid_str[40];
+
+ /* Should never happen */
+ pmd_obj_put(*l1);
+ *l1 = *l2 = NULL;
+
+ mpool_unparse_uuid(&sb->osb_mdc01devid, uuid_str);
+ rc = -ENOENT;
+ mp_pr_err("mpool %s, allocating MDC0 mlog1, can't find handle for pd uuid %s,",
+ rc, mp->pds_name, uuid_str);
+
+ return rc;
+ }
+
+ /* MDC0 mlog2 layout */
+ *l2 = pmd_layout_alloc(&sb->osb_mdc02uuid, MDC0_OBJID_LOG2, sb->osb_mdc02gen, 0,
+ sb->osb_mdc02desc.ol_zcnt);
+ if (!*l2) {
+ pmd_obj_put(*l1);
+
+ *l1 = *l2 = NULL;
+
+ rc = -ENOMEM;
+ mp_pr_err("mpool %s, MDC0 mlog2 allocation failed", rc, mp->pds_name);
+ return rc;
+ }
+
+ (*l2)->eld_state = PMD_LYT_COMMITTED;
+
+ for (i = 0; i < mp->pds_pdvcnt; i++) {
+ if (mpool_uuid_compare(&mp->pds_pdv[i].pdi_devid, &sb->osb_mdc02devid) == 0) {
+ (*l2)->eld_ld.ol_pdh = i;
+ (*l2)->eld_ld.ol_zaddr = sb->osb_mdc02desc.ol_zaddr;
+ break;
+ }
+ }
+
+ if (i >= mp->pds_pdvcnt) {
+ char uuid_str[40];
+
+ /* Should never happen */
+ pmd_obj_put(*l1);
+ pmd_obj_put(*l2);
+ *l1 = *l2 = NULL;
+
+ mpool_unparse_uuid(&sb->osb_mdc02devid, uuid_str);
+ rc = -ENOENT;
+ mp_pr_err("mpool %s, allocating MDC0 mlog2, can't find handle for pd uuid %s",
+ rc, mp->pds_name, uuid_str);
+
+ return rc;
+ }
+
+ return 0;
+}
+
+/**
+ * mpool_dev_check_new() - check if a drive is ready to be added in an mpool.
+ * @mp:
+ * @pd:
+ */
+int mpool_dev_check_new(struct mpool_descriptor *mp, struct mpool_dev_info *pd)
+{
+ int rval, rc;
+
+ if (mpool_pd_status_get(pd) != PD_STAT_ONLINE) {
+ rc = -EIO;
+ mp_pr_err("%s:%s unavailable or offline, status %d",
+ rc, mp->pds_name, pd->pdi_name, mpool_pd_status_get(pd));
+ return rc;
+ }
+
+ /* Confirm drive does not contain mpool magic value */
+ rval = sb_magic_check(&pd->pdi_parm);
+ if (rval) {
+ if (rval < 0) {
+ rc = rval;
+ mp_pr_err("%s:%s read sb magic failed", rc, mp->pds_name, pd->pdi_name);
+ return rc;
+ }
+
+ rc = -EBUSY;
+ mp_pr_err("%s:%s sb magic already exists", rc, mp->pds_name, pd->pdi_name);
+ return rc;
+ }
+
+ return 0;
+}
+
+int mpool_desc_pdmc_add(struct mpool_descriptor *mp, u16 pdh,
+ struct omf_devparm_descriptor *omf_devparm, bool check_only)
+{
+ struct mpool_dev_info *pd = NULL;
+ struct media_class *mc;
+ struct mc_parms mc_parms;
+ int rc;
+
+ pd = &mp->pds_pdv[pdh];
+ if (omf_devparm == NULL)
+ mc_pd_prop2mc_parms(&pd->pdi_parm.dpr_prop, &mc_parms);
+ else
+ mc_omf_devparm2mc_parms(omf_devparm, &mc_parms);
+
+ if (!mclass_isvalid(mc_parms.mcp_classp)) {
+ rc = -EINVAL;
+ mp_pr_err("%s: media class %u of %s is undefined", rc, mp->pds_name,
+ mc_parms.mcp_classp, pd->pdi_name);
+ return rc;
+ }
+
+ /*
+ * Devices that do not support updatable sectors can't be included
+ * in an mpool. Do not check if in the context of an unavailable PD
+ * during activate, because it is impossible to determine the PD
+ * properties.
+ */
+ if ((omf_devparm == NULL) && !(pd->pdi_cmdopt & PD_CMD_SECTOR_UPDATABLE)) {
+ rc = -EINVAL;
+ mp_pr_err("%s: device %s sectors not updatable", rc, mp->pds_name, pd->pdi_name);
+ return rc;
+ }
+
+ mc = &mp->pds_mc[mc_parms.mcp_classp];
+ if (mc->mc_pdmc < 0) {
+ struct mc_smap_parms mcsp;
+
+ /*
+ * No media class corresponding to the PD class yet, create one.
+ */
+ rc = mc_smap_parms_get(&mp->pds_mc[mc_parms.mcp_classp], &mp->pds_params, &mcsp);
+ if (rc)
+ return rc;
+
+ if (!check_only)
+ mc_init_class(mc, &mc_parms, &mcsp);
+ } else {
+ rc = -EINVAL;
+ mp_pr_err("%s: add %s, only 1 device allowed per media class",
+ rc, mp->pds_name, pd->pdi_name);
+ return rc;
+ }
+
+ if (check_only)
+ return 0;
+
+ mc->mc_pdmc = pdh;
+
+ return 0;
+}
+
+/**
+ * mpool_desc_init_newpool() - Create the media classes and add all the mpool PDs
+ * @mp:
+ * @flags: enum mp_mgmt_flags
+ *
+ * Called on mpool create.
+ * Create the media classes and add all the mpool PDs in their media class.
+ * Update the metadata media class in mp->pds_mdparm
+ *
+ * Note: the PD properties (pd->pdi_parm.dpr_prop) must be updated
+ * and correct when entering this function.
+ */
+int mpool_desc_init_newpool(struct mpool_descriptor *mp, u32 flags)
+{
+ u64 pdh = 0;
+ int rc;
+
+ if (!(flags & (1 << MP_FLAGS_FORCE))) {
+ rc = mpool_dev_check_new(mp, &mp->pds_pdv[pdh]);
+ if (rc)
+ return rc;
+ }
+
+ /*
+	 * Add the drive to its media class, creating the class if this
+	 * is its first drive.
+ */
+ rc = mpool_desc_pdmc_add(mp, pdh, NULL, false);
+ if (rc) {
+ struct mpool_dev_info *pd __maybe_unused;
+
+ pd = &mp->pds_pdv[pdh];
+
+ mp_pr_err("mpool %s, mpool desc init, adding drive %s in a media class failed",
+ rc, mp->pds_name, pd->pdi_name);
+ return rc;
+ }
+
+ mp->pds_mdparm.md_mclass = mp->pds_pdv[pdh].pdi_mclass;
+
+ return 0;
+}
+
+int mpool_dev_init_all(struct mpool_dev_info *pdv, u64 dcnt, char **dpaths,
+ struct pd_prop *pd_prop)
+{
+ char *pdname;
+ int idx, rc;
+
+ if (dcnt == 0)
+ return -EINVAL;
+
+ for (rc = 0, idx = 0; idx < dcnt; idx++, pd_prop++) {
+ rc = pd_dev_open(dpaths[idx], &pdv[idx].pdi_parm, pd_prop);
+ if (rc) {
+ mp_pr_err("opening device %s failed", rc, dpaths[idx]);
+ break;
+ }
+
+ pdname = strrchr(dpaths[idx], '/');
+ pdname = pdname ? pdname + 1 : dpaths[idx];
+ strlcpy(pdv[idx].pdi_name, pdname, sizeof(pdv[idx].pdi_name));
+
+ mpool_pd_status_set(&pdv[idx], PD_STAT_ONLINE);
+ }
+
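+	/* On failure, unwind: close every device opened before the failure. */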
+ while (rc && idx-- > 0)
+ pd_dev_close(&pdv[idx].pdi_parm);
+
+ return rc;
+}
+
+void mpool_mdc_cap_init(struct mpool_descriptor *mp, struct mpool_dev_info *pd)
+{
+ u64 zonesz, defmbsz;
+
+ zonesz = (pd->pdi_zonepg << PAGE_SHIFT) >> 20;
+ defmbsz = MPOOL_MBSIZE_MB_DEFAULT;
+
+ if (mp->pds_params.mp_mdc0cap == 0) {
+ mp->pds_params.mp_mdc0cap = max_t(u64, defmbsz, zonesz);
+ mp->pds_params.mp_mdc0cap <<= 20;
+ }
+
+ if (mp->pds_params.mp_mdcncap == 0) {
+ mp->pds_params.mp_mdcncap = max_t(u64, zonesz, (256 / zonesz));
+ mp->pds_params.mp_mdcncap <<= 20;
+ }
+}
+
+/**
+ * mpool_desc_init_sb() - Read the super blocks of the PDs.
+ * @mp:
+ * @sbmdc0: output. MDC0 information stored in the super blocks.
+ * @flags:
+ *
+ * Adjust the discovered PD properties stored in pd->pdi_parm.dpr_prop with
+ * PD parameters from the super block. Some of the discovered PD
+ * properties are defaults (like zone size) and need to be adjusted
+ * to what the PD actually uses.
+ */
+int mpool_desc_init_sb(struct mpool_descriptor *mp, struct omf_sb_descriptor *sbmdc0,
+ u32 flags, bool *mc_resize)
+{
+ struct omf_sb_descriptor *sb = NULL;
+ struct mpool_dev_info *pd = NULL;
+ u16 omf_ver = OMF_SB_DESC_UNDEF;
+ bool mdc0found = false;
+ bool force = ((flags & (1 << MP_FLAGS_FORCE)) != 0);
+ u8 pdh = 0;
+ int rc;
+
+ sb = kzalloc(sizeof(*sb), GFP_KERNEL);
+ if (!sb) {
+ rc = -ENOMEM;
+ mp_pr_err("sb desc alloc failed %lu", rc, (ulong)sizeof(*sb));
+ return rc;
+ }
+
+ for (pdh = 0; pdh < mp->pds_pdvcnt; pdh++) {
+ struct omf_devparm_descriptor *dparm;
+ bool resize = false;
+ int i;
+
+ pd = &mp->pds_pdv[pdh];
+ if (mpool_pd_status_get(pd) != PD_STAT_ONLINE) {
+ rc = -EIO;
+ mp_pr_err("pd %s unavailable or offline, status %d",
+ rc, pd->pdi_name, mpool_pd_status_get(pd));
+ kfree(sb);
+ return rc;
+ }
+
+ /*
+ * Read superblock; init and validate pool drive info
+ * from device parameters stored in the super block.
+ */
+ rc = sb_read(&pd->pdi_parm, sb, &omf_ver, force);
+ if (rc) {
+ mp_pr_err("sb read from %s failed", rc, pd->pdi_name);
+ kfree(sb);
+ return rc;
+ }
+
+ if (!pdh) {
+ size_t n __maybe_unused;
+
+ /*
+ * First drive; confirm pool not open; set pool-wide
+ * properties
+ */
+ if (uuid_to_mpdesc_search(&mpool_pools, &sb->osb_poolid)) {
+ char *uuid_str;
+
+ uuid_str = kmalloc(MPOOL_UUID_STRING_LEN + 1, GFP_KERNEL);
+ if (uuid_str)
+ mpool_unparse_uuid(&sb->osb_poolid, uuid_str);
+
+ rc = -EBUSY;
+ mp_pr_err("%s: mpool already activated, id %s, pd name %s",
+ rc, sb->osb_name, uuid_str, pd->pdi_name);
+ kfree(sb);
+ kfree(uuid_str);
+ return rc;
+ }
+ mpool_uuid_copy(&mp->pds_poolid, &sb->osb_poolid);
+
+ n = strlcpy(mp->pds_name, (char *)sb->osb_name, sizeof(mp->pds_name));
+ ASSERT(n < sizeof(mp->pds_name));
+ } else {
+ /* Second or later drive; validate pool-wide properties */
+ if (mpool_uuid_compare(&sb->osb_poolid, &mp->pds_poolid) != 0) {
+ char *uuid_str1, *uuid_str2 = NULL;
+
+ uuid_str1 = kmalloc(2 * (MPOOL_UUID_STRING_LEN + 1), GFP_KERNEL);
+ if (uuid_str1) {
+ uuid_str2 = uuid_str1 + MPOOL_UUID_STRING_LEN + 1;
+ mpool_unparse_uuid(&sb->osb_poolid, uuid_str1);
+ mpool_unparse_uuid(&mp->pds_poolid, uuid_str2);
+ }
+
+ rc = -EINVAL;
+ mp_pr_err("%s: pd %s, mpool id %s different from prior id %s",
+ rc, mp->pds_name, pd->pdi_name, uuid_str1, uuid_str2);
+ kfree(sb);
+ kfree(uuid_str1);
+ return rc;
+ }
+ }
+
+ dparm = &sb->osb_parm;
+ if (!force && pd->pdi_devsz > dparm->odp_devsz) {
+ mp_pr_info("%s: pd %s, discovered size %lu > on-media size %lu",
+ mp->pds_name, pd->pdi_name,
+ (ulong)pd->pdi_devsz, (ulong)dparm->odp_devsz);
+
+ if ((flags & (1 << MP_FLAGS_RESIZE)) == 0) {
+ pd->pdi_devsz = dparm->odp_devsz;
+ } else {
+ dparm->odp_devsz = pd->pdi_devsz;
+ dparm->odp_zonetot = pd->pdi_devsz / (pd->pdi_zonepg << PAGE_SHIFT);
+
+ pd->pdi_zonetot = dparm->odp_zonetot;
+ resize = true;
+ }
+ }
+
+ /* Validate mdc0 info in superblock if present */
+ if (!sbutil_mdc0_isclear(sb)) {
+ if (!force && !sbutil_mdc0_isvalid(sb)) {
+ rc = -EINVAL;
+ mp_pr_err("%s: pd %s, invalid sb MDC0",
+ rc, mp->pds_name, pd->pdi_name);
+ kfree(sb);
+ return rc;
+ }
+
+ dparm = &sb->osb_mdc0dev;
+ if (resize) {
+ ASSERT(pd->pdi_devsz > dparm->odp_devsz);
+
+ dparm->odp_devsz = pd->pdi_devsz;
+ dparm->odp_zonetot = pd->pdi_devsz / (pd->pdi_zonepg << PAGE_SHIFT);
+ }
+
+ sbutil_mdc0_copy(sbmdc0, sb);
+ mdc0found = true;
+ }
+
+ /* Set drive info confirming devid is unique and zone parms match */
+ for (i = 0; i < pdh; i++) {
+ if (mpool_uuid_compare(&mp->pds_pdv[i].pdi_devid,
+ &sb->osb_parm.odp_devid) == 0) {
+ char *uuid_str;
+
+ uuid_str = kmalloc(MPOOL_UUID_STRING_LEN + 1, GFP_KERNEL);
+ if (uuid_str)
+ mpool_unparse_uuid(&sb->osb_parm.odp_devid, uuid_str);
+ rc = -EINVAL;
+ mp_pr_err("%s: pd %s, duplicate devices, uuid %s",
+ rc, mp->pds_name, pd->pdi_name, uuid_str);
+ kfree(uuid_str);
+ kfree(sb);
+ return rc;
+ }
+ }
+
+ if (omf_ver > OMF_SB_DESC_VER_LAST) {
+ rc = -EOPNOTSUPP;
+ mp_pr_err("%s: unsupported sb version %d", rc, mp->pds_name, omf_ver);
+ kfree(sb);
+ return rc;
+ } else if (!force && (omf_ver < OMF_SB_DESC_VER_LAST || resize)) {
+ if ((flags & (1 << MP_FLAGS_PERMIT_META_CONV)) == 0) {
+ struct omf_mdcver *mdcver;
+ char *buf1, *buf2 = NULL;
+
+ /*
+				 * We must get permission from the user to
+				 * update the mpool metadata.
+ */
+ mdcver = omf_sbver_to_mdcver(omf_ver);
+ ASSERT(mdcver != NULL);
+
+ buf1 = kmalloc(2 * MAX_MDCVERSTR, GFP_KERNEL);
+ if (buf1) {
+ buf2 = buf1 + MAX_MDCVERSTR;
+				omfu_mdcver_to_str(mdcver, buf1, MAX_MDCVERSTR);
+				omfu_mdcver_to_str(omfu_mdcver_cur(), buf2, MAX_MDCVERSTR);
+ }
+
+ rc = -EPERM;
+ mp_pr_err("%s: reqd sb upgrade from version %s (%s) to %s (%s)",
+ rc, mp->pds_name,
+ buf1, omfu_mdcver_comment(mdcver) ?: "",
+ buf2, omfu_mdcver_comment(omfu_mdcver_cur()));
+ kfree(buf1);
+ kfree(sb);
+ return rc;
+ }
+
+ /* We need to overwrite the old version superblock on the device */
+ rc = sb_write_update(&pd->pdi_parm, sb);
+ if (rc) {
+ mp_pr_err("%s: pd %s, failed to convert or overwrite mpool sb",
+ rc, mp->pds_name, pd->pdi_name);
+ kfree(sb);
+ return rc;
+ }
+
+ if (!resize)
+ mp_pr_info("%s: pd %s, Convert mpool sb, oldv %d newv %d",
+ mp->pds_name, pd->pdi_name, omf_ver, sb->osb_vers);
+ }
+
+ mpool_uuid_copy(&pd->pdi_devid, &sb->osb_parm.odp_devid);
+
+ /* Add drive in its media class. Create the media class if not yet created. */
+ rc = mpool_desc_pdmc_add(mp, pdh, NULL, false);
+ if (rc) {
+ mp_pr_err("%s: pd %s, adding drive in a media class failed",
+ rc, mp->pds_name, pd->pdi_name);
+
+ kfree(sb);
+ return rc;
+ }
+
+ /*
+ * Record the media class used by the MDC0 metadata.
+ */
+ if (mdc0found)
+ mp->pds_mdparm.md_mclass = pd->pdi_mclass;
+
+ if (resize && mc_resize)
+ mc_resize[pd->pdi_mclass] = resize;
+ }
+
+ if (!mdc0found) {
+ rc = -EINVAL;
+ mp_pr_err("%s: MDC0 not found", rc, mp->pds_name);
+ kfree(sb);
+ return rc;
+ }
+
+ kfree(sb);
+
+ return 0;
+}
+
+static int comp_func(const void *c1, const void *c2)
+{
+ return strcmp(*(char **)c1, *(char **)c2);
+}
+
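+/**
+ * check_for_dups() - Check for duplicate device paths in listv[].
+ * @listv:  vector of device path strings
+ * @cnt:    number of entries in listv[]
+ * @dup:    (output) set to 1 if a duplicate path was found
+ * @offset: (output) index in listv[] of the first duplicated path
+ *
+ * Returns: 0 if successful (whether or not a duplicate was found),
+ * -errno otherwise.
+ */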
+int check_for_dups(char **listv, int cnt, int *dup, int *offset)
+{
+ const char **sortedv;
+ const char *prev;
+ int rc, i;
+
+ *dup = 0;
+ *offset = -1;
+
+	if (cnt == 0 || cnt == 1)
+ return 0;
+
+ sortedv = kcalloc(cnt + 1, sizeof(char *), GFP_KERNEL);
+ if (!sortedv) {
+ rc = -ENOMEM;
+ mp_pr_err("kcalloc failed for %d paths, first path %s", rc, cnt, *listv);
+ return rc;
+ }
+
+ /* Make a shallow copy */
+ for (i = 0; i < cnt; i++)
+ sortedv[i] = listv[i];
+
+ sortedv[i] = NULL;
+
+ sort(sortedv, cnt, sizeof(char *), comp_func, NULL);
+
+ prev = sortedv[0];
+ for (i = 1; i < cnt; i++) {
+ if (strcmp(sortedv[i], prev) == 0) {
+ mp_pr_info("path %s is duplicated", prev);
+ *dup = 1;
+ break;
+ }
+
+ prev = sortedv[i];
+ }
+
+ /* Find offset, prev points to first dup */
+ if (*dup) {
+ for (i = 0; i < cnt; i++) {
+ if (prev == listv[i]) {
+ *offset = i;
+ break;
+ }
+ }
+ }
+
+ kfree(sortedv);
+ return 0;
+}
+
+void fill_in_devprops(struct mpool_descriptor *mp, u64 pdh, struct mpool_devprops *dprop)
+{
+ struct mpool_dev_info *pd;
+ struct media_class *mc;
+ int rc;
+
+ pd = &mp->pds_pdv[pdh];
+ memcpy(dprop->pdp_devid.b, pd->pdi_devid.uuid, MPOOL_UUID_SIZE);
+
+ mc = &mp->pds_mc[pd->pdi_mclass];
+ dprop->pdp_mclassp = mc->mc_parms.mcp_classp;
+ dprop->pdp_status = mpool_pd_status_get(pd);
+
+ rc = smap_drive_usage(mp, pdh, dprop);
+ if (rc) {
+ mp_pr_err("mpool %s, can't get drive usage, media class %d",
+ rc, mp->pds_name, dprop->pdp_mclassp);
+ }
+}
+
+int mpool_desc_unavail_add(struct mpool_descriptor *mp, struct omf_devparm_descriptor *omf_devparm)
+{
+ struct mpool_dev_info *pd = NULL;
+ char uuid_str[40];
+ int rc;
+
+ mpool_unparse_uuid(&omf_devparm->odp_devid, uuid_str);
+
+ mp_pr_warn("Activating mpool %s, adding unavailable drive %s", mp->pds_name, uuid_str);
+
+ if (mp->pds_pdvcnt >= MPOOL_DRIVES_MAX) {
+ rc = -EINVAL;
+ mp_pr_err("Activating mpool %s, adding an unavailable drive, too many drives",
+ rc, mp->pds_name);
+ return rc;
+ }
+
+ pd = &mp->pds_pdv[mp->pds_pdvcnt];
+
+ mpool_uuid_copy(&pd->pdi_devid, &omf_devparm->odp_devid);
+
+ /* Update the PD properties from the metadata record. */
+ mpool_pd_status_set(pd, PD_STAT_UNAVAIL);
+ pd_dev_set_unavail(&pd->pdi_parm, omf_devparm);
+
+ /* Add the PD in its media class. */
+ rc = mpool_desc_pdmc_add(mp, mp->pds_pdvcnt, omf_devparm, false);
+ if (rc)
+ return rc;
+
+ mp->pds_pdvcnt = mp->pds_pdvcnt + 1;
+
+ return 0;
+}
+
+int mpool_create_rmlogs(struct mpool_descriptor *mp, u64 mlog_cap)
+{
+ struct mlog_descriptor *ml_desc;
+ struct mlog_capacity mlcap = {
+ .lcp_captgt = mlog_cap,
+ };
+ struct mlog_props mlprops;
+ u64 root_mlog_id[2];
+ int rc, i;
+
+ mlog_lookup_rootids(&root_mlog_id[0], &root_mlog_id[1]);
+
+ for (i = 0; i < 2; ++i) {
+ rc = mlog_find_get(mp, root_mlog_id[i], 1, NULL, &ml_desc);
+ if (!rc) {
+ mlog_put(ml_desc);
+ continue;
+ }
+
+ if (rc != -ENOENT) {
+ mp_pr_err("mpool %s, root mlog find 0x%lx failed",
+ rc, mp->pds_name, (ulong)root_mlog_id[i]);
+ return rc;
+ }
+
+ rc = mlog_realloc(mp, root_mlog_id[i], &mlcap,
+ MP_MED_CAPACITY, &mlprops, &ml_desc);
+ if (rc) {
+ mp_pr_err("mpool %s, root mlog realloc 0x%lx failed",
+ rc, mp->pds_name, (ulong)root_mlog_id[i]);
+ return rc;
+ }
+
+ if (mlprops.lpr_objid != root_mlog_id[i]) {
+ mlog_put(ml_desc);
+ rc = -ENOENT;
+ mp_pr_err("mpool %s, root mlog mismatch 0x%lx 0x%lx", rc,
+ mp->pds_name, (ulong)root_mlog_id[i], (ulong)mlprops.lpr_objid);
+ return rc;
+ }
+
+ rc = mlog_commit(mp, ml_desc);
+ if (rc) {
+ if (mlog_abort(mp, ml_desc))
+ mlog_put(ml_desc);
+
+ mp_pr_err("mpool %s, root mlog commit 0x%lx failed",
+ rc, mp->pds_name, (ulong)root_mlog_id[i]);
+ return rc;
+ }
+
+ mlog_put(ml_desc);
+ }
+
+ return rc;
+}
+
+struct mpool_descriptor *mpool_desc_alloc(void)
+{
+ struct mpool_descriptor *mp;
+ int i;
+
+ mp = kzalloc(sizeof(*mp), GFP_KERNEL);
+ if (!mp)
+ return NULL;
+
+ init_rwsem(&mp->pds_pdvlock);
+
+ mutex_init(&mp->pds_oml_lock);
+ mp->pds_oml_root = RB_ROOT;
+
+ mp->pds_mdparm.md_mclass = MP_MED_INVALID;
+
+ mpcore_params_defaults(&mp->pds_params);
+
+ for (i = 0; i < MP_MED_NUMBER; i++)
+ mp->pds_mc[i].mc_pdmc = -1;
+
+ return mp;
+}
+
+/*
+ * Remove mp from mpool_pools, close all devices, and free mp.
+ */
+void mpool_desc_free(struct mpool_descriptor *mp)
+{
+ struct mpool_descriptor *found_mp = NULL;
+ struct mpool_uuid uuid_zero;
+ int i;
+
+ mpool_uuid_clear(&uuid_zero);
+
+ /*
+	 * Handle the case where the poolid is not in the mappings,
+	 * which can happen when cleaning up from a failed create/open.
+ */
+ found_mp = uuid_to_mpdesc_search(&mpool_pools, &mp->pds_poolid);
+ if (found_mp)
+ rb_erase(&found_mp->pds_node, &mpool_pools);
+
+ for (i = 0; i < mp->pds_pdvcnt; i++) {
+ if (mpool_pd_status_get(&mp->pds_pdv[i]) != PD_STAT_UNAVAIL)
+ pd_dev_close(&mp->pds_pdv[i].pdi_parm);
+ }
+
+ kfree(mp);
+}
--
2.17.2
From: Nabeel M Mohamed <[email protected]>
This adds the mblock and mlog management ioctls: alloc, commit,
abort, destroy, read, write, fetch properties, etc.
The mblock and mlog management ioctl handlers are thin wrappers
around the core mblock/mlog lifecycle management and IO routines
introduced in an earlier patch.
The object read/write ioctl handlers utilize vcache, which is a
small cache of iovec objects and page pointers. This cache is
used for large mblock/mlog IO. It acts as an emergency memory
pool for handling IO requests under memory pressure thereby
reducing tail latencies.
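For example, the large-IO path first tries kmalloc and then falls back
to the vcache, sleeping briefly between attempts rather than failing
the request under memory pressure; a simplified sketch of the pattern
used in mpc_physio() below:

    pagesv = kmalloc(pagesvsz, GFP_NOIO);
    while (!pagesv) {
        pagesv = mpc_vcache_alloc(&mpc_physio_vcache, pagesvsz);
        if (!pagesv)
            usleep_range(750, 1250);
    }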
Co-developed-by: Greg Becker <[email protected]>
Signed-off-by: Greg Becker <[email protected]>
Co-developed-by: Pierre Labat <[email protected]>
Signed-off-by: Pierre Labat <[email protected]>
Co-developed-by: John Groves <[email protected]>
Signed-off-by: John Groves <[email protected]>
Signed-off-by: Nabeel M Mohamed <[email protected]>
---
drivers/mpool/mpctl.c | 670 +++++++++++++++++++++++++++++++++++++++++-
1 file changed, 667 insertions(+), 3 deletions(-)
diff --git a/drivers/mpool/mpctl.c b/drivers/mpool/mpctl.c
index 002321c8689b..03cc0d3c293f 100644
--- a/drivers/mpool/mpctl.c
+++ b/drivers/mpool/mpctl.c
@@ -34,6 +34,7 @@
#include "assert.h"
#include "mpool_ioctl.h"
+#include "mblock.h"
#include "mlog.h"
#include "mp.h"
#include "mpctl.h"
@@ -1302,7 +1303,6 @@ static int mpioc_mp_activate(struct mpc_unit *ctl, struct mpioc_mpool *mp,
mp->mp_params.mp_oidv[0] = cfg.mc_oid1;
mp->mp_params.mp_oidv[1] = cfg.mc_oid2;
mp->mp_params.mp_ra_pages_max = cfg.mc_ra_pages_max;
- mp->mp_params.mp_vma_size_max = cfg.mc_vma_size_max;
memcpy(&mp->mp_params.mp_utype, &cfg.mc_utype, sizeof(mp->mp_params.mp_utype));
strlcpy(mp->mp_params.mp_label, cfg.mc_label, sizeof(mp->mp_params.mp_label));
@@ -1659,6 +1659,596 @@ static int mpioc_mp_add(struct mpc_unit *unit, struct mpioc_drive *drv)
return rc;
}
+
+/**
+ * struct vcache - very-large-buffer cache...
+ */
+struct vcache {
+ spinlock_t vc_lock;
+ void *vc_head;
+ size_t vc_size;
+} ____cacheline_aligned;
+
+static struct vcache mpc_physio_vcache;
+
+static void *mpc_vcache_alloc(struct vcache *vc, size_t sz)
+{
+ void *p;
+
+ if (!vc || sz > vc->vc_size)
+ return NULL;
+
+ spin_lock(&vc->vc_lock);
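+	/* Pop the freelist head: the first word of each free buffer links to the next. */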
+ p = vc->vc_head;
+ if (p)
+ vc->vc_head = *(void **)p;
+ spin_unlock(&vc->vc_lock);
+
+ return p;
+}
+
+static void mpc_vcache_free(struct vcache *vc, void *p)
+{
+ if (!vc || !p)
+ return;
+
+ spin_lock(&vc->vc_lock);
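+	/* Push the buffer onto the freelist, storing the old head in its first word. */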
+ *(void **)p = vc->vc_head;
+ vc->vc_head = p;
+ spin_unlock(&vc->vc_lock);
+}
+
+static int mpc_vcache_init(struct vcache *vc, size_t sz, size_t n)
+{
+ if (!vc || sz < PAGE_SIZE || n < 1)
+ return -EINVAL;
+
+ spin_lock_init(&vc->vc_lock);
+ vc->vc_head = NULL;
+ vc->vc_size = sz;
+
+ while (n-- > 0)
+ mpc_vcache_free(vc, vmalloc(sz));
+
+ return vc->vc_head ? 0 : -ENOMEM;
+}
+
+static void mpc_vcache_fini(struct vcache *vc)
+{
+ void *p;
+
+ while ((p = mpc_vcache_alloc(vc, PAGE_SIZE)))
+ vfree(p);
+}
+
+/**
+ * mpc_physio() - Generic raw device mblock read/write routine.
+ * @mpd: mpool descriptor
+ * @desc: mblock or mlog descriptor
+ * @uiov: vector of iovecs that describe user-space segments
+ * @uioc: count of elements in uiov[]
+ * @offset: offset into the object at which to start the I/O
+ * @objtype: mblock or mlog
+ * @rw: READ or WRITE with regard to the media
+ * @stkbuf: caller provided scratch space
+ * @stkbufsz: size of stkbuf
+ *
+ * This function creates an array of iovec objects each of which
+ * map a portion of the user request into kernel space so that
+ * mpool can directly access the user data. Note that this is
+ * a zero-copy operation.
+ *
+ * Requires that each user-space segment be page aligned and of an
+ * integral number of pages.
+ *
+ * See http://www.makelinux.net/ldd3/chp-15-sect-3 for more detail.
+ */
+static int mpc_physio(struct mpool_descriptor *mpd, void *desc, struct iovec *uiov,
+ int uioc, off_t offset, enum mp_obj_type objtype, int rw,
+ void *stkbuf, size_t stkbufsz)
+{
+ struct kvec *iov_base, *iov;
+ struct iov_iter iter;
+ struct page **pagesv;
+ size_t pagesvsz, pgbase, length;
+ int pagesc, niov, rc, i;
+ ssize_t cc;
+
+ iov = NULL;
+ niov = 0;
+ rc = 0;
+
+ length = iov_length(uiov, uioc);
+
+ if (length < PAGE_SIZE || !IS_ALIGNED(length, PAGE_SIZE))
+ return -EINVAL;
+
+ if (length > (rwsz_max_mb << 20))
+ return -EINVAL;
+
+ /*
+ * Allocate an array of page pointers for iov_iter_get_pages()
+ * and an array of iovecs for mblock_read() and mblock_write().
+ *
+ * Note: the only way we can calculate the number of required
+ * iovecs in advance is to assume that we need one per page.
+ */
+ pagesc = length / PAGE_SIZE;
+ pagesvsz = (sizeof(*pagesv) + sizeof(*iov)) * pagesc;
+
+ /*
+	 * pagesvsz may be large; this buffer is not handed to the block
+	 * stack as an iovec list - the pd layer chunks the I/O for the
+	 * underlying devices (with a separate iovec list per pd).
+ */
+ if (pagesvsz > stkbufsz) {
+ pagesv = NULL;
+
+ if (pagesvsz <= PAGE_SIZE * 2)
+ pagesv = kmalloc(pagesvsz, GFP_NOIO);
+
+ while (!pagesv) {
+ pagesv = mpc_vcache_alloc(&mpc_physio_vcache, pagesvsz);
+ if (!pagesv)
+ usleep_range(750, 1250);
+ }
+ } else {
+ pagesv = stkbuf;
+ }
+
+ if (!pagesv)
+ return -ENOMEM;
+
+ iov_base = (struct kvec *)((char *)pagesv + (sizeof(*pagesv) * pagesc));
+
+ iov_iter_init(&iter, rw, uiov, uioc, length);
+
+ for (i = 0, cc = 0; i < pagesc; i += (cc / PAGE_SIZE)) {
+
+ /* Get struct page vector for the user buffers. */
+ cc = iov_iter_get_pages(&iter, &pagesv[i], length - (i * PAGE_SIZE),
+ pagesc - i, &pgbase);
+ if (cc < 0) {
+ rc = cc;
+ pagesc = i;
+ goto errout;
+ }
+
+ /*
+ * pgbase is the offset into the 1st iovec - our alignment
+ * requirements force it to be 0
+ */
+ if (cc < PAGE_SIZE || pgbase != 0) {
+ rc = -EINVAL;
+ pagesc = i + 1;
+ goto errout;
+ }
+
+ iov_iter_advance(&iter, cc);
+ }
+
+ /* Build an array of iovecs for mpool so that it can directly access the user data. */
+ for (i = 0, iov = iov_base; i < pagesc; ++i, ++iov, ++niov) {
+ iov->iov_len = PAGE_SIZE;
+ iov->iov_base = kmap(pagesv[i]);
+
+ if (!iov->iov_base) {
+ rc = -EINVAL;
+ pagesc = i + 1;
+ goto errout;
+ }
+ }
+
+ switch (objtype) {
+ case MP_OBJ_MBLOCK:
+ if (rw == WRITE)
+ rc = mblock_write(mpd, desc, iov_base, niov, pagesc << PAGE_SHIFT);
+ else
+ rc = mblock_read(mpd, desc, iov_base, niov, offset, pagesc << PAGE_SHIFT);
+ break;
+
+ case MP_OBJ_MLOG:
+ rc = mlog_rw_raw(mpd, desc, iov_base, niov, offset, rw);
+ break;
+
+ default:
+ rc = -EINVAL;
+ goto errout;
+ }
+
+errout:
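+	/* Unmap only the pages that were kmap'd (the first niov), but unpin them all. */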
+ for (i = 0, iov = iov_base; i < pagesc; ++i, ++iov) {
+ if (i < niov)
+ kunmap(pagesv[i]);
+ put_page(pagesv[i]);
+ }
+
+ if (pagesvsz > stkbufsz) {
+ if (pagesvsz > PAGE_SIZE * 2)
+ mpc_vcache_free(&mpc_physio_vcache, pagesv);
+ else
+ kfree(pagesv);
+ }
+
+ return rc;
+}
+
+/**
+ * mpioc_mb_alloc() - Allocate an mblock object.
+ * @unit: mpool unit ptr
+ * @mb: mblock parameter block
+ *
+ * MPIOC_MB_ALLOC ioctl handler to allocate a single mblock.
+ *
+ * Return: Returns 0 if successful, -errno otherwise...
+ */
+static int mpioc_mb_alloc(struct mpc_unit *unit, struct mpioc_mblock *mb)
+{
+ struct mblock_descriptor *mblock;
+ struct mpool_descriptor *mpool;
+ struct mblock_props props;
+ int rc;
+
+ if (!unit || !mb || !unit->un_mpool)
+ return -EINVAL;
+
+ mpool = unit->un_mpool->mp_desc;
+
+ rc = mblock_alloc(mpool, mb->mb_mclassp, mb->mb_spare, &mblock, &props);
+ if (rc)
+ return rc;
+
+ mblock_get_props_ex(mpool, mblock, &mb->mb_props);
+ mblock_put(mblock);
+
+ mb->mb_objid = props.mpr_objid;
+ mb->mb_offset = -1;
+
+ return 0;
+}
+
+/**
+ * mpioc_mb_find() - Find an mblock object by its objid
+ * @unit: mpool unit ptr
+ * @mb: mblock parameter block
+ *
+ * Return: Returns 0 if successful, -errno otherwise...
+ */
+static int mpioc_mb_find(struct mpc_unit *unit, struct mpioc_mblock *mb)
+{
+ struct mblock_descriptor *mblock;
+ struct mpool_descriptor *mpool;
+ int rc;
+
+ if (!unit || !mb || !unit->un_mpool)
+ return -EINVAL;
+
+ if (!mblock_objid(mb->mb_objid))
+ return -EINVAL;
+
+ mpool = unit->un_mpool->mp_desc;
+
+ rc = mblock_find_get(mpool, mb->mb_objid, 0, NULL, &mblock);
+ if (rc)
+ return rc;
+
+ (void)mblock_get_props_ex(mpool, mblock, &mb->mb_props);
+
+ mblock_put(mblock);
+
+ mb->mb_offset = -1;
+
+ return 0;
+}
+
+/**
+ * mpioc_mb_abcomdel() - Abort, commit, or delete an mblock.
+ * @unit: mpool unit ptr
+ * @cmd: MPIOC_MB_ABORT, MPIOC_MB_COMMIT, or MPIOC_MB_DELETE
+ * @mi: mblock parameter block
+ *
+ * MPIOC_MB_ACD ioctl handler to either abort, commit, or delete
+ * the specified mblock.
+ *
+ * Return: Returns 0 if successful, -errno otherwise...
+ */
+static int mpioc_mb_abcomdel(struct mpc_unit *unit, uint cmd, struct mpioc_mblock_id *mi)
+{
+ struct mblock_descriptor *mblock;
+ struct mpool_descriptor *mpool;
+ int which, rc;
+ bool drop;
+
+ if (!unit || !mi || !unit->un_mpool)
+ return -EINVAL;
+
+ if (!mblock_objid(mi->mi_objid))
+ return -EINVAL;
+
+ which = (cmd == MPIOC_MB_DELETE) ? 1 : -1;
+ mpool = unit->un_mpool->mp_desc;
+ drop = true;
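+	/* mblock_abort/delete consume the caller's reference on success; drop it otherwise. */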
+
+ rc = mblock_find_get(mpool, mi->mi_objid, which, NULL, &mblock);
+ if (rc)
+ return rc;
+
+ switch (cmd) {
+ case MPIOC_MB_COMMIT:
+ rc = mblock_commit(mpool, mblock);
+ break;
+
+ case MPIOC_MB_ABORT:
+ rc = mblock_abort(mpool, mblock);
+ drop = !!rc;
+ break;
+
+ case MPIOC_MB_DELETE:
+ rc = mblock_delete(mpool, mblock);
+ drop = !!rc;
+ break;
+
+ default:
+ rc = -ENOTTY;
+ break;
+ }
+
+ if (drop)
+ mblock_put(mblock);
+
+ return rc;
+}
+
+/**
+ * mpioc_mb_rw() - read/write mblock ioctl handler
+ * @unit: mpool unit ptr
+ * @cmd: MPIOC_MB_READ or MPIOC_MB_WRITE
+ * @mbrw: mblock read/write parameter block
+ * @stkbuf: caller-provided scratch space
+ * @stkbufsz: size of stkbuf
+ */
+static int mpioc_mb_rw(struct mpc_unit *unit, uint cmd, struct mpioc_mblock_rw *mbrw,
+ void *stkbuf, size_t stkbufsz)
+{
+ struct mblock_descriptor *mblock;
+ struct mpool_descriptor *mpool;
+ struct iovec *kiov;
+ bool xfree = false;
+ int which, rc;
+ size_t kiovsz;
+
+ if (!unit || !mbrw || !unit->un_mpool)
+ return -EINVAL;
+
+ if (!mblock_objid(mbrw->mb_objid))
+ return -EINVAL;
+
+ /*
+ * For small iovec counts we simply copyin the array of iovecs
+ * to local storage (stkbuf). Otherwise, we must kmalloc a
+ * buffer into which to perform the copyin.
+ */
+ if (mbrw->mb_iov_cnt > MPIOC_KIOV_MAX)
+ return -EINVAL;
+
+ kiovsz = mbrw->mb_iov_cnt * sizeof(*kiov);
+
+ if (kiovsz > stkbufsz) {
+ kiov = kmalloc(kiovsz, GFP_KERNEL);
+ if (!kiov)
+ return -ENOMEM;
+
+ xfree = true;
+ } else {
+ kiov = stkbuf;
+ stkbuf += kiovsz;
+ stkbufsz -= kiovsz;
+ }
+
+ which = (cmd == MPIOC_MB_READ) ? 1 : -1;
+ mpool = unit->un_mpool->mp_desc;
+
+ rc = mblock_find_get(mpool, mbrw->mb_objid, which, NULL, &mblock);
+ if (rc)
+ goto errout;
+
+ if (copy_from_user(kiov, mbrw->mb_iov, kiovsz)) {
+ rc = -EFAULT;
+ } else {
+ rc = mpc_physio(mpool, mblock, kiov, mbrw->mb_iov_cnt, mbrw->mb_offset,
+ MP_OBJ_MBLOCK, (cmd == MPIOC_MB_READ) ? READ : WRITE,
+ stkbuf, stkbufsz);
+ }
+
+ mblock_put(mblock);
+
+errout:
+ if (xfree)
+ kfree(kiov);
+
+ return rc;
+}
+
+/*
+ * Mpctl mlog ioctl handlers
+ */
+static int mpioc_mlog_alloc(struct mpc_unit *unit, struct mpioc_mlog *ml)
+{
+ struct mpool_descriptor *mpool;
+ struct mlog_descriptor *mlog;
+ struct mlog_props props;
+ int rc;
+
+ if (!unit || !unit->un_mpool || !ml)
+ return -EINVAL;
+
+ mpool = unit->un_mpool->mp_desc;
+
+ rc = mlog_alloc(mpool, &ml->ml_cap, ml->ml_mclassp, &props, &mlog);
+ if (rc)
+ return rc;
+
+ mlog_get_props_ex(mpool, mlog, &ml->ml_props);
+ mlog_put(mlog);
+
+ ml->ml_objid = props.lpr_objid;
+
+ return 0;
+}
+
+static int mpioc_mlog_find(struct mpc_unit *unit, struct mpioc_mlog *ml)
+{
+ struct mpool_descriptor *mpool;
+ struct mlog_descriptor *mlog;
+ int rc;
+
+ if (!unit || !unit->un_mpool || !ml || !mlog_objid(ml->ml_objid))
+ return -EINVAL;
+
+ mpool = unit->un_mpool->mp_desc;
+
+ rc = mlog_find_get(mpool, ml->ml_objid, 0, NULL, &mlog);
+ if (!rc) {
+ rc = mlog_get_props_ex(mpool, mlog, &ml->ml_props);
+ mlog_put(mlog);
+ }
+
+ return rc;
+}
+
+static int mpioc_mlog_abcomdel(struct mpc_unit *unit, uint cmd, struct mpioc_mlog_id *mi)
+{
+ struct mpool_descriptor *mpool;
+ struct mlog_descriptor *mlog;
+ struct mlog_props_ex props;
+ int which, rc;
+ bool drop;
+
+ if (!unit || !unit->un_mpool || !mi || !mlog_objid(mi->mi_objid))
+ return -EINVAL;
+
+ which = (cmd == MPIOC_MLOG_DELETE) ? 1 : -1;
+ mpool = unit->un_mpool->mp_desc;
+ drop = true;
+
+ rc = mlog_find_get(mpool, mi->mi_objid, which, NULL, &mlog);
+ if (rc)
+ return rc;
+
+ switch (cmd) {
+ case MPIOC_MLOG_COMMIT:
+ rc = mlog_commit(mpool, mlog);
+ if (!rc) {
+ mlog_get_props_ex(mpool, mlog, &props);
+ mi->mi_gen = props.lpx_props.lpr_gen;
+ mi->mi_state = props.lpx_state;
+ }
+ break;
+
+ case MPIOC_MLOG_ABORT:
+ rc = mlog_abort(mpool, mlog);
+ drop = !!rc;
+ break;
+
+ case MPIOC_MLOG_DELETE:
+ rc = mlog_delete(mpool, mlog);
+ drop = !!rc;
+ break;
+
+ default:
+ rc = -ENOTTY;
+ break;
+ }
+
+ if (drop)
+ mlog_put(mlog);
+
+ return rc;
+}
+
+static int mpioc_mlog_rw(struct mpc_unit *unit, struct mpioc_mlog_io *mi,
+ void *stkbuf, size_t stkbufsz)
+{
+ struct mpool_descriptor *mpool;
+ struct mlog_descriptor *mlog;
+ struct iovec *kiov;
+ bool xfree = false;
+ size_t kiovsz;
+ int rc;
+
+ if (!unit || !unit->un_mpool || !mi || !mlog_objid(mi->mi_objid))
+ return -EINVAL;
+
+ /*
+ * For small iovec counts we simply copyin the array of iovecs
+	 * to local storage (stkbuf). Otherwise, we must kmalloc a
+ * buffer into which to perform the copyin.
+ */
+ if (mi->mi_iovc > MPIOC_KIOV_MAX)
+ return -EINVAL;
+
+ kiovsz = mi->mi_iovc * sizeof(*kiov);
+
+ if (kiovsz > stkbufsz) {
+ kiov = kmalloc(kiovsz, GFP_KERNEL);
+ if (!kiov)
+ return -ENOMEM;
+
+ xfree = true;
+ } else {
+ kiov = stkbuf;
+ stkbuf += kiovsz;
+ stkbufsz -= kiovsz;
+ }
+
+ mpool = unit->un_mpool->mp_desc;
+
+ rc = mlog_find_get(mpool, mi->mi_objid, 1, NULL, &mlog);
+ if (rc)
+ goto errout;
+
+ if (copy_from_user(kiov, mi->mi_iov, kiovsz)) {
+ rc = -EFAULT;
+ } else {
+ rc = mpc_physio(mpool, mlog, kiov, mi->mi_iovc, mi->mi_off, MP_OBJ_MLOG,
+ (mi->mi_op == MPOOL_OP_READ) ? READ : WRITE, stkbuf, stkbufsz);
+ }
+
+ mlog_put(mlog);
+
+errout:
+ if (xfree)
+ kfree(kiov);
+
+ return rc;
+}
+
+static int mpioc_mlog_erase(struct mpc_unit *unit, struct mpioc_mlog_id *mi)
+{
+ struct mpool_descriptor *mpool;
+ struct mlog_descriptor *mlog;
+ struct mlog_props_ex props;
+ int rc;
+
+ if (!unit || !unit->un_mpool || !mi || !mlog_objid(mi->mi_objid))
+ return -EINVAL;
+
+ mpool = unit->un_mpool->mp_desc;
+
+ rc = mlog_find_get(mpool, mi->mi_objid, 0, NULL, &mlog);
+ if (rc)
+ return rc;
+
+ rc = mlog_erase(mpool, mlog, mi->mi_gen);
+ if (!rc) {
+ mlog_get_props_ex(mpool, mlog, &props);
+ mi->mi_gen = props.lpx_props.lpr_gen;
+ mi->mi_state = props.lpx_state;
+ }
+
+ mlog_put(mlog);
+
+ return rc;
+}
+
static struct mpc_softstate *mpc_cdev2ss(struct cdev *cdev)
{
if (!cdev || cdev->owner != THIS_MODULE) {
@@ -1846,8 +2436,8 @@ static long mpc_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
{
char argbuf[256] __aligned(16);
struct mpc_unit *unit;
- size_t argbufsz;
- void *argp;
+ size_t argbufsz, stkbufsz;
+ void *argp, *stkbuf;
ulong iosz;
int rc;
@@ -1858,7 +2448,12 @@ static long mpc_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
switch (cmd) {
case MPIOC_PROP_GET:
case MPIOC_DEVPROPS_GET:
+ case MPIOC_MB_FIND:
+ case MPIOC_MB_READ:
case MPIOC_MP_MCLASS_GET:
+ case MPIOC_MLOG_FIND:
+ case MPIOC_MLOG_READ:
+ case MPIOC_MLOG_PROPS:
break;
default:
@@ -1930,6 +2525,59 @@ static long mpc_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
rc = mpioc_devprops_get(unit, argp);
break;
+ case MPIOC_MB_ALLOC:
+ rc = mpioc_mb_alloc(unit, argp);
+ break;
+
+ case MPIOC_MB_FIND:
+ rc = mpioc_mb_find(unit, argp);
+ break;
+
+ case MPIOC_MB_COMMIT:
+ case MPIOC_MB_DELETE:
+ case MPIOC_MB_ABORT:
+ rc = mpioc_mb_abcomdel(unit, cmd, argp);
+ break;
+
+ case MPIOC_MB_READ:
+ case MPIOC_MB_WRITE:
+ ASSERT(roundup(iosz, 16) < argbufsz);
+
+ stkbufsz = argbufsz - roundup(iosz, 16);
+ stkbuf = argbuf + roundup(iosz, 16);
+
+ rc = mpioc_mb_rw(unit, cmd, argp, stkbuf, stkbufsz);
+ break;
+
+ case MPIOC_MLOG_ALLOC:
+ rc = mpioc_mlog_alloc(unit, argp);
+ break;
+
+ case MPIOC_MLOG_FIND:
+ case MPIOC_MLOG_PROPS:
+ rc = mpioc_mlog_find(unit, argp);
+ break;
+
+ case MPIOC_MLOG_ABORT:
+ case MPIOC_MLOG_COMMIT:
+ case MPIOC_MLOG_DELETE:
+ rc = mpioc_mlog_abcomdel(unit, cmd, argp);
+ break;
+
+ case MPIOC_MLOG_READ:
+ case MPIOC_MLOG_WRITE:
+ ASSERT(roundup(iosz, 16) < argbufsz);
+
+ stkbufsz = argbufsz - roundup(iosz, 16);
+ stkbuf = argbuf + roundup(iosz, 16);
+
+ rc = mpioc_mlog_rw(unit, argp, stkbuf, stkbufsz);
+ break;
+
+ case MPIOC_MLOG_ERASE:
+ rc = mpioc_mlog_erase(unit, argp);
+ break;
+
default:
rc = -ENOTTY;
mp_pr_rl("invalid command %x: dir=%u type=%c nr=%u size=%u",
@@ -1985,6 +2633,8 @@ void mpctl_exit(void)
ss->ss_inited = false;
}
+ mpc_vcache_fini(&mpc_physio_vcache);
+
mpc_bdi_teardown();
}
@@ -1997,6 +2647,7 @@ int mpctl_init(void)
struct mpool_config *cfg = NULL;
struct mpc_unit *ctlunit;
const char *errmsg = NULL;
+ size_t sz;
int rc;
if (ss->ss_inited)
@@ -2006,6 +2657,19 @@ int mpctl_init(void)
maxunits = clamp_t(uint, maxunits, 8, 8192);
+ rwsz_max_mb = clamp_t(ulong, rwsz_max_mb, 1, 128);
+ rwconc_max = clamp_t(ulong, rwconc_max, 1, 32);
+
+ /* Must be same as mpc_physio() pagesvsz calculation. */
+ sz = (rwsz_max_mb << 20) / PAGE_SIZE;
+ sz *= (sizeof(void *) + sizeof(struct iovec));
+
+ rc = mpc_vcache_init(&mpc_physio_vcache, sz, rwconc_max);
+ if (rc) {
+ errmsg = "vcache init failed";
+ goto errout;
+ }
+
cdev_init(&ss->ss_cdev, &mpc_fops_default);
ss->ss_cdev.owner = THIS_MODULE;
--
2.17.2
On 9/28/20 9:45 AM, [email protected] wrote:
> + if (_IOC_TYPE(cmd) != MPIOC_MAGIC)
Hi,
MPIOC_MAGIC is defined in patch 01/22.
It should also be added to Documentation/userspace-api/ioctl/ioctl-number.rst.
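(Entries in that file are one line per ioctl magic; the row below is only
a placeholder sketch, since the actual MPIOC_MAGIC value and UAPI header
path aren't shown here:

  Code  Seq#  Include File     Comments
  0x??  all   mpool_ioctl.h    mpool object storage media pool driver
)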
thanks.
--
~Randy
Hi Randy,
On Tuesday, September 29, 2020 6:13 PM, Randy Dunlap <[email protected]> wrote:
> On 9/28/20 9:45 AM, [email protected] wrote:
> > + if (_IOC_TYPE(cmd) != MPIOC_MAGIC)
> Hi,
>
> MPIOC_MAGIC is defined in patch 01/22.
> It should also be added to Documentation/userspace-api/ioctl/ioctl-number.rst.
>
Sure, thanks! I've made a note of this and will address it in v2.
Thanks,
Nabeel