2024-02-23 17:42:31

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

This patch set introduces famfs[1] - a special-purpose fs-dax file system
for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
CXL-specific in anyway way.

* Famfs creates a simple access method for storing and sharing data in
sharable memory. The memory is exposed and accessed as memory-mappable
dax files.
* Famfs supports multiple hosts mounting the same file system from the
same memory (something existing fs-dax file systems don't do).
* A famfs file system can be created on either a /dev/pmem device in fs-dax
mode, or a /dev/dax device in devdax mode (the latter depending on
patches 2-6 of this series).

The famfs kernel file system is part the famfs framework; additional
components in user space[2] handle metadata and direct the famfs kernel
module to instantiate files that map to specific memory. The famfs user
space has documentation and a reasonably thorough test suite.

The famfs kernel module never accesses the shared memory directly (either
data or metadata). Because of this, shared memory managed by the famfs
framework does not create a RAS "blast radius" problem that should be able
to crash or de-stabilize the kernel. Poison or timeouts in famfs memory
can be expected to kill apps via SIGBUS and cause mounts to be disabled
due to memory failure notifications.

Famfs does not attempt to solve concurrency or coherency problems for apps,
although it does solve these problems in regard to its own data structures.
Apps may encounter hard concurrency problems, but there are use cases that
are imminently useful and uncomplicated from a concurrency perspective:
serial sharing is one (only one host at a time has access), and read-only
concurrent sharing is another (all hosts can read-cache without worry).

Contents:

* famfs kernel documentation [patch 1]. Note that evolving famfs user
documentation is at [2]
* dev_dax_iomap patchset [patches 2-6] - This enables fs-dax to use the
iomap interface via a character /dev/dax device (e.g. /dev/dax0.0). For
historical reasons the iomap infrastructure was enabled only for
/dev/pmem devices (which are dax block devices). As famfs is the first
fs-dax file system that works on /dev/dax, this patch series fills in
the bare minimum infrastructure to enable iomap api usage with /dev/dax.
* famfs patchset [patches 7-20] - this introduces the kernel component of
famfs.

IMPORTANT NOTE: There is a developing consensus that /dev/dax requires
some fundamental re-factoring (e.g. [3]) that is related but outside the
scope of this series.

Some observations about using sharable memory

* It does not make sense to online sharable memory as system-ram.
System-ram gets zeroed when it is onlined, so sharing is basically
nonsense.
* It does not make sense to put struct page's in sharable memory, because
those can't be shared. However, separately providing non-sharable
capacity to be used for struct page's might be a sensible approach if the
size of struct page array for sharable memory is too large to put in
conventional system-ram (albeit with possible RAS implications).
* Sharable memory is pmem-like, in that a host is likely to connect in
order to gain access to data that is already in the memory. Moreover
the power domain for shared memory is separate for that of the server.
Having observed that, famfs is not intended for persistent storage. It is
intended for sharing data sets in memory during a time frame where the
memory and the compute nodes are expected to remain operational - such
as during a clustered data analytics job.

Could we do this with FUSE?

The key performance requirement for famfs is efficient handling of VMA
faults. This requires caching the complete dax extent lists for all active
files so faults can be handled without upcalls, which FUSE does not do.
It would probably be possible to put this capability FUSE, but we think
that keeping famfs separate from FUSE is the simpler approach.

This patch set is available as a branch at [5]

References

[1] https://lpc.events/event/17/contributions/1455/
[2] https://github.com/cxl-micron-reskit/famfs
[3] https://lore.kernel.org/all/166630293549.1017198.3833687373550679565.stgit@dwillia2-xfh.jf.intel.com/
[4] https://www.computeexpresslink.org/download-the-specification
[5] https://github.com/cxl-micron-reskit/famfs-linux

John Groves (20):
famfs: Documentation
dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since
both need it now
dev_dax_iomap: Save the kva from memremap
dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter
famfs: Add include/linux/famfs_ioctl.h
famfs: Add famfs_internal.h
famfs: Add super_operations
famfs: famfs_open_device() & dax_holder_operations
famfs: Add fs_context_operations
famfs: Add inode_operations and file_system_type
famfs: Add iomap_ops
famfs: Add struct file_operations
famfs: Add ioctl to file_operations
famfs: Add fault counters
famfs: Add module stuff
famfs: Support character dax via the dev_dax_iomap patch
famfs: Update MAINTAINERS file
famfs: Add Kconfig and Makefile plumbing

Documentation/filesystems/famfs.rst | 124 +++++
MAINTAINERS | 11 +
drivers/dax/Kconfig | 6 +
drivers/dax/bus.c | 131 ++++++
drivers/dax/dax-private.h | 1 +
drivers/dax/device.c | 38 +-
drivers/dax/super.c | 38 ++
fs/Kconfig | 2 +
fs/Makefile | 1 +
fs/famfs/Kconfig | 10 +
fs/famfs/Makefile | 5 +
fs/famfs/famfs_file.c | 704 ++++++++++++++++++++++++++++
fs/famfs/famfs_inode.c | 586 +++++++++++++++++++++++
fs/famfs/famfs_internal.h | 126 +++++
include/linux/dax.h | 5 +
include/uapi/linux/famfs_ioctl.h | 56 +++
16 files changed, 1821 insertions(+), 23 deletions(-)
create mode 100644 Documentation/filesystems/famfs.rst
create mode 100644 fs/famfs/Kconfig
create mode 100644 fs/famfs/Makefile
create mode 100644 fs/famfs/famfs_file.c
create mode 100644 fs/famfs/famfs_inode.c
create mode 100644 fs/famfs/famfs_internal.h
create mode 100644 include/uapi/linux/famfs_ioctl.h


base-commit: 841c35169323cd833294798e58b9bf63fa4fa1de
--
2.43.0



2024-02-23 17:42:49

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 01/20] famfs: Documentation

Introduce Documentation/filesystems/famfs.rst into the Documentation
tree

Signed-off-by: John Groves <[email protected]>
---
Documentation/filesystems/famfs.rst | 124 ++++++++++++++++++++++++++++
1 file changed, 124 insertions(+)
create mode 100644 Documentation/filesystems/famfs.rst

diff --git a/Documentation/filesystems/famfs.rst b/Documentation/filesystems/famfs.rst
new file mode 100644
index 000000000000..c2cc50c10d03
--- /dev/null
+++ b/Documentation/filesystems/famfs.rst
@@ -0,0 +1,124 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _famfs_index:
+
+==================================================================
+famfs: The kernel component of the famfs shared memory file system
+==================================================================
+
+- Copyright (C) 2024 Micron Technology, Inc.
+
+Introduction
+============
+Compute Express Link (CXL) provides a mechanism for disaggregated or
+fabric-attached memory (FAM). This creates opportunities for data sharing;
+clustered apps that would otherwise have to shard or replicate data can
+share one copy in disaggregated memory.
+
+Famfs, which is not CXL-specific in any way, provides a mechanism for
+multiple hosts to use data in shared memory, by giving it a file system
+interface. With famfs, any app that understands files (which is all of
+them, right?) can access data sets in shared memory. Although famfs
+supports read and write calls, the real point is to support mmap, which
+provides direct (dax) access to the memory - either writable or read-only.
+
+Shared memory can pose complex coherency and synchronization issues, but
+there are also simple cases. Two simple and eminently useful patterns that
+occur frequently in data analytics and AI are:
+
+* Serial Sharing - Only one host or process at a time has access to a file
+* Read-only Sharing - Multiple hosts or processes share read-only access
+ to a file
+
+The famfs kernel file system is part of the famfs framework; User space
+components [1] handle metadata allocation and distribution, and direct the
+famfs kernel module to instantiate files that map to specific memory.
+
+The famfs framework manages coherency of its own metadata and structures,
+but does not attempt to manage coherency for applications.
+
+Famfs also provides data isolation between files. That is, even though
+the host has access to an entire memory "device" (as a dax device), apps
+cannot write to memory for which the file is read-only, and mapping one
+file provides isolation from the memory of all other files. This is pretty
+basic, but some experimental shared memory usage patterns provide no such
+isolation.
+
+Principles of Operation
+=======================
+
+Without its user space components, the famfs kernel module is just a
+semi-functional clone of ramfs with latent fs-dax support. The user space
+components maintain superblocks and metadata logs, and use the famfs kernel
+component to provide a file system view of shared memory across multiple
+hosts.
+
+Each host has an independent instance of the famfs kernel module. After
+mount, files are not visible until the user space component instantiates
+them (normally by playing the famfs metadata log).
+
+Once instantiated, files on each host can point to the same shared memory,
+but in-memory metadata (inodes, etc.) is ephemeral on each host that has a
+famfs instance mounted. Like ramfs, the famfs in-kernel file system has no
+backing store for metadata modifications. If metadata is ever persisted,
+that must be done by the user space components. However, mutations to file
+data are saved to the shared memory - subject to write permission and
+processor cache behavior.
+
+
+Famfs is Not a Conventional File System
+---------------------------------------
+
+Famfs files can be accessed by conventional means, but there are
+limitations. The kernel component of famfs is not involved in the
+allocation of backing memory for files at all; the famfs user space
+creates files and passes the allocation extent lists into the kernel via
+the per-file FAMFSIOC_MAP_CREATE ioctl. A file that lacks this metadata is
+treated as invalid by the famfs kernel module. As a practical matter files
+must be created via the famfs library or cli, but they can be consumed as
+if they were conventional files.
+
+Famfs differs in some important ways from conventional file systems:
+
+* Files must be pre-allocated by the famfs framework; Allocation is never
+ performed on write.
+* Any operation that changes a file's size is considered to put the file
+ in an invalid state, disabling access to the data. It may be possible to
+ revisit this in the future.
+* (Typically the famfs user space can restore files to a valid state by
+ replaying the famfs metadata log.)
+
+Famfs exists to apply the existing file system abstractions on top of
+shared memory so applications and workflows can more easily consume it.
+
+Key Requirements
+================
+
+The primary requirements for famfs are:
+
+1. Must support a file system abstraction backed by sharable dax memory
+2. Files must efficiently handle VMA faults
+3. Must support metadata distribution in a sharable way
+4. Must handle clients with a stale copy of metadata
+
+The famfs kernel component takes care of 1-2 above.
+
+Requirements 3 and 4 are handled by the user space components, and are
+largely orthogonal to the functionality of the famfs kernel module.
+
+Requirements 3 and 4 cannot be met by conventional fs-dax file systems
+(e.g. xfs and ext4) because they use write-back metadata; it is not valid
+to mount such a file system on two hosts from the same in-memory image.
+
+
+Famfs Usage
+===========
+
+Famfs usage is documented at [1].
+
+
+References
+==========
+
+- [1] Famfs user space repository and documentation
+ https://github.com/cxl-micron-reskit/famfs
--
2.43.0


2024-02-23 17:46:02

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 09/20] famfs: Add super_operations

Introduce the famfs superblock operations

Signed-off-by: John Groves <[email protected]>
---
fs/famfs/famfs_inode.c | 72 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 72 insertions(+)
create mode 100644 fs/famfs/famfs_inode.c

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
new file mode 100644
index 000000000000..3329aff000d1
--- /dev/null
+++ b/fs/famfs/famfs_inode.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2024 Micron Technology, inc
+ *
+ * This file system, originally based on ramfs the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/highmem.h>
+#include <linux/time.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/backing-dev.h>
+#include <linux/sched.h>
+#include <linux/parser.h>
+#include <linux/magic.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/fs_context.h>
+#include <linux/fs_parser.h>
+#include <linux/seq_file.h>
+#include <linux/dax.h>
+#include <linux/hugetlb.h>
+#include <linux/uio.h>
+#include <linux/iomap.h>
+#include <linux/path.h>
+#include <linux/namei.h>
+#include <linux/pfn_t.h>
+#include <linux/blkdev.h>
+
+#include "famfs_internal.h"
+
+#define FAMFS_DEFAULT_MODE 0755
+
+static const struct super_operations famfs_ops;
+static const struct inode_operations famfs_file_inode_operations;
+static const struct inode_operations famfs_dir_inode_operations;
+
+/**********************************************************************************
+ * famfs super_operations
+ *
+ * TODO: implement a famfs_statfs() that shows size, free and available space, etc.
+ */
+
+/**
+ * famfs_show_options() - Display the mount options in /proc/mounts.
+ */
+static int famfs_show_options(
+ struct seq_file *m,
+ struct dentry *root)
+{
+ struct famfs_fs_info *fsi = root->d_sb->s_fs_info;
+
+ if (fsi->mount_opts.mode != FAMFS_DEFAULT_MODE)
+ seq_printf(m, ",mode=%o", fsi->mount_opts.mode);
+
+ return 0;
+}
+
+static const struct super_operations famfs_ops = {
+ .statfs = simple_statfs,
+ .drop_inode = generic_delete_inode,
+ .show_options = famfs_show_options,
+};
+
+
+MODULE_LICENSE("GPL");
--
2.43.0


2024-02-23 17:47:06

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 13/20] famfs: Add iomap_ops

This commit introduces the famfs iomap_ops. When either
dax_iomap_fault() or dax_iomap_rw() is called, we get a callback
via our iomap_begin() handler. The question being asked is
"please resolve (file, offset) to (daxdev, offset)". The function
famfs_meta_to_dax_offset() does this.

The per-file metadata is just an extent list to the
backing dax dev. The order of this resolution is O(N) for N
extents. Note with the current user space, files usually have
only one extent.

Signed-off-by: John Groves <[email protected]>
---
fs/famfs/famfs_file.c | 245 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 245 insertions(+)
create mode 100644 fs/famfs/famfs_file.c

diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
new file mode 100644
index 000000000000..fc667d5f7be8
--- /dev/null
+++ b/fs/famfs/famfs_file.c
@@ -0,0 +1,245 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2024 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/dax.h>
+#include <linux/uio.h>
+#include <linux/iomap.h>
+#include <uapi/linux/famfs_ioctl.h>
+#include "famfs_internal.h"
+
+/*********************************************************************
+ * iomap_operations
+ *
+ * This stuff uses the iomap (dax-related) helpers to resolve file offsets to
+ * offsets within a dax device.
+ */
+
+/**
+ * famfs_meta_to_dax_offset()
+ *
+ * This function is called by famfs_iomap_begin() to resolve an offset in a file to
+ * an offset in a dax device. This is upcalled from dax from calls to both
+ * dax_iomap_fault() and dax_iomap_rw(). Dax finishes the job resolving a fault to
+ * a specific physical page (the fault case) or doing a memcpy variant (the rw case)
+ *
+ * Pages can be PTE (4k), PMD (2MiB) or (theoretically) PuD (1GiB)
+ * (these sizes are for X86; may vary on other cpu architectures
+ *
+ * @inode - the file where the fault occurred
+ * @iomap - struct iomap to be filled in to indicate where to find the right memory, relative
+ * to a dax device.
+ * @offset - the offset within the file where the fault occurred (will be page boundary)
+ * @len - the length of the faulted mapping (will be a page multiple)
+ * (will be trimmed in *iomap if it's disjoint in the extent list)
+ * @flags
+ */
+static int
+famfs_meta_to_dax_offset(
+ struct inode *inode,
+ struct iomap *iomap,
+ loff_t offset,
+ loff_t len,
+ unsigned int flags)
+{
+ struct famfs_file_meta *meta = (struct famfs_file_meta *)inode->i_private;
+ int i;
+ loff_t local_offset = offset;
+ struct famfs_fs_info *fsi = inode->i_sb->s_fs_info;
+
+ iomap->offset = offset; /* file offset */
+
+ for (i = 0; i < meta->tfs_extent_ct; i++) {
+ loff_t dax_ext_offset = meta->tfs_extents[i].offset;
+ loff_t dax_ext_len = meta->tfs_extents[i].len;
+
+ if ((dax_ext_offset == 0) && (meta->file_type != FAMFS_SUPERBLOCK))
+ pr_err("%s: zero offset on non-superblock file!!\n", __func__);
+
+ /* local_offset is the offset minus the size of extents skipped so far;
+ * If local_offset < dax_ext_len, the data of interest starts in this extent
+ */
+ if (local_offset < dax_ext_len) {
+ loff_t ext_len_remainder = dax_ext_len - local_offset;
+
+ /*+
+ * OK, we found the file metadata extent where this data begins
+ * @local_offset - The offset within the current extent
+ * @ext_len_remainder - Remaining length of ext after skipping local_offset
+ *
+ * iomap->addr is the offset within the dax device where that data
+ * starts
+ */
+ iomap->addr = dax_ext_offset + local_offset; /* dax dev offset */
+ iomap->offset = offset; /* file offset */
+ iomap->length = min_t(loff_t, len, ext_len_remainder);
+ iomap->dax_dev = fsi->dax_devp;
+ iomap->type = IOMAP_MAPPED;
+ iomap->flags = flags;
+
+ return 0;
+ }
+ local_offset -= dax_ext_len; /* Get ready for the next extent */
+ }
+
+ /* Set iomap to zero length in this case, and return 0
+ * This just means that the r/w is past EOF
+ */
+ iomap->addr = offset;
+ iomap->offset = offset; /* file offset */
+ iomap->length = 0; /* this had better result in no access to dax mem */
+ iomap->dax_dev = fsi->dax_devp;
+ iomap->type = IOMAP_MAPPED;
+ iomap->flags = flags;
+
+ return 0;
+}
+
+/**
+ * famfs_iomap_begin()
+ *
+ * This function is pretty simple because files are
+ * * never partially allocated
+ * * never have holes (never sparse)
+ * * never "allocate on write"
+ */
+static int
+famfs_iomap_begin(
+ struct inode *inode,
+ loff_t offset,
+ loff_t length,
+ unsigned int flags,
+ struct iomap *iomap,
+ struct iomap *srcmap)
+{
+ struct famfs_file_meta *meta = inode->i_private;
+ size_t size;
+ int rc;
+
+ size = i_size_read(inode);
+
+ WARN_ON(size != meta->file_size);
+
+ rc = famfs_meta_to_dax_offset(inode, iomap, offset, length, flags);
+
+ return rc;
+}
+
+/* Note: We never need a special set of write_iomap_ops because famfs never
+ * performs allocation on write.
+ */
+const struct iomap_ops famfs_iomap_ops = {
+ .iomap_begin = famfs_iomap_begin,
+};
+
+/*********************************************************************
+ * vm_operations
+ */
+static vm_fault_t
+__famfs_filemap_fault(
+ struct vm_fault *vmf,
+ unsigned int pe_size,
+ bool write_fault)
+{
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ vm_fault_t ret;
+
+ if (write_fault) {
+ sb_start_pagefault(inode->i_sb);
+ file_update_time(vmf->vma->vm_file);
+ }
+
+ if (IS_DAX(inode)) {
+ pfn_t pfn;
+
+ ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops);
+ if (ret & VM_FAULT_NEEDDSYNC)
+ ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+ } else {
+ /* All famfs faults will be dax... */
+ pr_err("%s: oops, non-dax fault\n", __func__);
+ ret = VM_FAULT_SIGBUS;
+ }
+
+ if (write_fault)
+ sb_end_pagefault(inode->i_sb);
+
+ return ret;
+}
+
+static inline bool
+famfs_is_write_fault(
+ struct vm_fault *vmf)
+{
+ return (vmf->flags & FAULT_FLAG_WRITE) &&
+ (vmf->vma->vm_flags & VM_SHARED);
+}
+
+static vm_fault_t
+famfs_filemap_fault(
+ struct vm_fault *vmf)
+{
+ /* DAX can shortcut the normal fault path on write faults! */
+ return __famfs_filemap_fault(vmf, 0,
+ IS_DAX(file_inode(vmf->vma->vm_file)) && famfs_is_write_fault(vmf));
+}
+
+static vm_fault_t
+famfs_filemap_huge_fault(
+ struct vm_fault *vmf,
+ unsigned int pe_size)
+{
+ if (!IS_DAX(file_inode(vmf->vma->vm_file))) {
+ pr_err("%s: file not marked IS_DAX!!\n", __func__);
+ return VM_FAULT_SIGBUS;
+ }
+
+ /* DAX can shortcut the normal fault path on write faults! */
+ return __famfs_filemap_fault(vmf, pe_size, famfs_is_write_fault(vmf));
+}
+
+static vm_fault_t
+famfs_filemap_page_mkwrite(
+ struct vm_fault *vmf)
+{
+ return __famfs_filemap_fault(vmf, 0, true);
+}
+
+static vm_fault_t
+famfs_filemap_pfn_mkwrite(
+ struct vm_fault *vmf)
+{
+ return __famfs_filemap_fault(vmf, 0, true);
+}
+
+static vm_fault_t
+famfs_filemap_map_pages(
+ struct vm_fault *vmf,
+ pgoff_t start_pgoff,
+ pgoff_t end_pgoff)
+{
+ vm_fault_t ret;
+
+ ret = filemap_map_pages(vmf, start_pgoff, end_pgoff);
+ return ret;
+}
+
+const struct vm_operations_struct famfs_file_vm_ops = {
+ .fault = famfs_filemap_fault,
+ .huge_fault = famfs_filemap_huge_fault,
+ .map_pages = famfs_filemap_map_pages,
+ .page_mkwrite = famfs_filemap_page_mkwrite,
+ .pfn_mkwrite = famfs_filemap_pfn_mkwrite,
+};
+
--
2.43.0


2024-02-23 17:47:23

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 14/20] famfs: Add struct file_operations

This commit introduces the famfs file_operations. We call
thp_get_unmapped_area() to force PMD page alignment. Our read and
write handlers (famfs_dax_read_iter() and famfs_dax_write_iter())
call dax_iomap_rw() to do the work.

famfs_file_invalid() checks for various ways a famfs file can be
in an invalid state so we can fail I/O or fault resolution in those
cases. Those cases include the following:

* No famfs metadata
* file i_size does not match the originally allocated size
* file is not flagged as DAX
* errors were detected previously on the file

An invalid file can often be fixed by replaying the log, or by
umount/mount/log replay - all of which are user space operations.

Signed-off-by: John Groves <[email protected]>
---
fs/famfs/famfs_file.c | 136 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 136 insertions(+)

diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
index fc667d5f7be8..5228e9de1e3b 100644
--- a/fs/famfs/famfs_file.c
+++ b/fs/famfs/famfs_file.c
@@ -19,6 +19,142 @@
#include <uapi/linux/famfs_ioctl.h>
#include "famfs_internal.h"

+/*********************************************************************
+ * file_operations
+ */
+
+/* Reject I/O to files that aren't in a valid state */
+static ssize_t
+famfs_file_invalid(struct inode *inode)
+{
+ size_t i_size = i_size_read(inode);
+ struct famfs_file_meta *meta = inode->i_private;
+
+ if (!meta) {
+ pr_err("%s: un-initialized famfs file\n", __func__);
+ return -EIO;
+ }
+ if (i_size != meta->file_size) {
+ pr_err("%s: something changed the size from %ld to %ld\n",
+ __func__, meta->file_size, i_size);
+ meta->error = 1;
+ return -ENXIO;
+ }
+ if (!IS_DAX(inode)) {
+ pr_err("%s: inode %llx IS_DAX is false\n", __func__, (u64)inode);
+ meta->error = 1;
+ return -ENXIO;
+ }
+ if (meta->error) {
+ pr_err("%s: previously detected metadata errors\n", __func__);
+ meta->error = 1;
+ return -EIO;
+ }
+ return 0;
+}
+
+static ssize_t
+famfs_dax_read_iter(
+ struct kiocb *iocb,
+ struct iov_iter *to)
+{
+ struct inode *inode = iocb->ki_filp->f_mapping->host;
+ size_t i_size = i_size_read(inode);
+ size_t count = iov_iter_count(to);
+ size_t max_count;
+ ssize_t rc;
+
+ rc = famfs_file_invalid(inode);
+ if (rc)
+ return rc;
+
+ max_count = max_t(size_t, 0, i_size - iocb->ki_pos);
+
+ if (count > max_count)
+ iov_iter_truncate(to, max_count);
+
+ if (!iov_iter_count(to))
+ return 0;
+
+ rc = dax_iomap_rw(iocb, to, &famfs_iomap_ops);
+
+ file_accessed(iocb->ki_filp);
+ return rc;
+}
+
+/**
+ * famfs_write_iter()
+ *
+ * We need our own write-iter in order to prevent append
+ */
+static ssize_t
+famfs_dax_write_iter(
+ struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ struct inode *inode = iocb->ki_filp->f_mapping->host;
+ size_t i_size = i_size_read(inode);
+ size_t count = iov_iter_count(from);
+ size_t max_count;
+ ssize_t rc;
+
+ rc = famfs_file_invalid(inode);
+ if (rc)
+ return rc;
+
+ /* Starting offset of write is: iocb->ki_pos
+ * length is iov_iter_count(from)
+ */
+ max_count = max_t(size_t, 0, i_size - iocb->ki_pos);
+
+ /* If write would go past EOF, truncate it to end at EOF since famfs does not
+ * alloc-on-write
+ */
+ if (count > max_count)
+ iov_iter_truncate(from, max_count);
+
+ if (!iov_iter_count(from))
+ return 0;
+
+ return dax_iomap_rw(iocb, from, &famfs_iomap_ops);
+}
+
+static int
+famfs_file_mmap(
+ struct file *file,
+ struct vm_area_struct *vma)
+{
+ struct inode *inode = file_inode(file);
+ ssize_t rc;
+
+ rc = famfs_file_invalid(inode);
+ if (rc)
+ return (int)rc;
+
+ file_accessed(file);
+ vma->vm_ops = &famfs_file_vm_ops;
+ vm_flags_set(vma, VM_HUGEPAGE);
+ return 0;
+}
+
+const struct file_operations famfs_file_operations = {
+ .owner = THIS_MODULE,
+
+ /* Custom famfs operations */
+ .write_iter = famfs_dax_write_iter,
+ .read_iter = famfs_dax_read_iter,
+ .mmap = famfs_file_mmap,
+
+ /* Force PMD alignment for mmap */
+ .get_unmapped_area = thp_get_unmapped_area,
+
+ /* Generic Operations */
+ .fsync = noop_fsync,
+ .splice_read = filemap_splice_read,
+ .splice_write = iter_file_splice_write,
+ .llseek = generic_file_llseek,
+};
+
/*********************************************************************
* iomap_operations
*
--
2.43.0


2024-02-23 17:48:07

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 17/20] famfs: Add module stuff

This commit introduces the module init and exit machinery for famfs.

Signed-off-by: John Groves <[email protected]>
---
fs/famfs/famfs_inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index ab46ec50b70d..0d659820e8ff 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -462,4 +462,48 @@ static struct file_system_type famfs_fs_type = {
.fs_flags = FS_USERNS_MOUNT,
};

+/*****************************************************************************************
+ * Module stuff
+ */
+static struct kobject *famfs_kobj;
+
+static int __init init_famfs_fs(void)
+{
+ int rc;
+
+#if defined(CONFIG_DEV_DAX_IOMAP)
+ pr_notice("%s: Your kernel supports famfs on /dev/dax\n", __func__);
+#else
+ pr_notice("%s: Your kernel does not support famfs on /dev/dax\n", __func__);
+#endif
+ famfs_kobj = kobject_create_and_add(MODULE_NAME, fs_kobj);
+ if (!famfs_kobj) {
+ pr_warn("Failed to create kobject\n");
+ return -ENOMEM;
+ }
+
+ rc = sysfs_create_group(famfs_kobj, &famfs_attr_group);
+ if (rc) {
+ kobject_put(famfs_kobj);
+ pr_warn("%s: Failed to create sysfs group\n", __func__);
+ return rc;
+ }
+
+ return register_filesystem(&famfs_fs_type);
+}
+
+static void
+__exit famfs_exit(void)
+{
+ sysfs_remove_group(famfs_kobj, &famfs_attr_group);
+ kobject_put(famfs_kobj);
+ unregister_filesystem(&famfs_fs_type);
+ pr_info("%s: unregistered\n", __func__);
+}
+
+
+fs_initcall(init_famfs_fs);
+module_exit(famfs_exit);
+
+MODULE_AUTHOR("John Groves, Micron Technology");
MODULE_LICENSE("GPL");
--
2.43.0


2024-02-23 17:48:31

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 18/20] famfs: Support character dax via the dev_dax_iomap patch

This commit introduces the ability to open a character /dev/dax device
instead of a block /dev/pmem device. This rests on the dev_dax_iomap
patches earlier in this series.

Signed-off-by: John Groves <[email protected]>
---
fs/famfs/famfs_inode.c | 97 +++++++++++++++++++++++++++++++++++++-----
1 file changed, 87 insertions(+), 10 deletions(-)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index 0d659820e8ff..7d65ac497147 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -215,6 +215,93 @@ static const struct super_operations famfs_ops = {
.show_options = famfs_show_options,
};

+/*****************************************************************************/
+
+#if defined(CONFIG_DEV_DAX_IOMAP)
+
+/*
+ * famfs dax_operations (for char dax)
+ */
+static int
+famfs_dax_notify_failure(struct dax_device *dax_dev, u64 offset,
+ u64 len, int mf_flags)
+{
+ pr_err("%s: offset %lld len %llu flags %x\n", __func__,
+ offset, len, mf_flags);
+ return -EOPNOTSUPP;
+}
+
+static const struct dax_holder_operations famfs_dax_holder_ops = {
+ .notify_failure = famfs_dax_notify_failure,
+};
+
+/*****************************************************************************/
+
+/**
+ * famfs_open_char_device()
+ *
+ * Open a /dev/dax device. This only works in kernels with the dev_dax_iomap patch
+ */
+static int
+famfs_open_char_device(
+ struct super_block *sb,
+ struct fs_context *fc)
+{
+ struct famfs_fs_info *fsi = sb->s_fs_info;
+ struct dax_device *dax_devp;
+ struct inode *daxdev_inode;
+
+ int rc = 0;
+
+ pr_notice("%s: Opening character dax device %s\n", __func__, fc->source);
+
+ fsi->dax_filp = filp_open(fc->source, O_RDWR, 0);
+ if (IS_ERR(fsi->dax_filp)) {
+ pr_err("%s: failed to open dax device %s\n",
+ __func__, fc->source);
+ fsi->dax_filp = NULL;
+ return PTR_ERR(fsi->dax_filp);
+ }
+
+ daxdev_inode = file_inode(fsi->dax_filp);
+ dax_devp = inode_dax(daxdev_inode);
+ if (IS_ERR(dax_devp)) {
+ pr_err("%s: unable to get daxdev from inode for %s\n",
+ __func__, fc->source);
+ rc = -ENODEV;
+ goto char_err;
+ }
+
+ rc = fs_dax_get(dax_devp, fsi, &famfs_dax_holder_ops);
+ if (rc) {
+ pr_info("%s: err attaching famfs_dax_holder_ops\n", __func__);
+ goto char_err;
+ }
+
+ fsi->bdev_handle = NULL;
+ fsi->dax_devp = dax_devp;
+
+ return 0;
+
+char_err:
+ filp_close(fsi->dax_filp, NULL);
+ return rc;
+}
+
+#else /* CONFIG_DEV_DAX_IOMAP */
+static int
+famfs_open_char_device(
+ struct super_block *sb,
+ struct fs_context *fc)
+{
+ pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
+ __func__, fc->source);
+ return -ENODEV;
+}
+
+
+#endif /* CONFIG_DEV_DAX_IOMAP */
+
/***************************************************************************************
* dax_holder_operations for block dax
*/
@@ -236,16 +323,6 @@ const struct dax_holder_operations famfs_blk_dax_holder_ops = {
.notify_failure = famfs_blk_dax_notify_failure,
};

-static int
-famfs_open_char_device(
- struct super_block *sb,
- struct fs_context *fc)
-{
- pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
- __func__, fc->source);
- return -ENODEV;
-}
-
/**
* famfs_open_device()
*
--
2.43.0


2024-02-23 17:48:48

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 19/20] famfs: Update MAINTAINERS file

This patch introduces famfs into the MAINTAINERS file

Signed-off-by: John Groves <[email protected]>
---
MAINTAINERS | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 73d898383e51..e4e8bf3602bb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8097,6 +8097,17 @@ F: Documentation/networking/failover.rst
F: include/net/failover.h
F: net/core/failover.c

+FAMFS
+M: John Groves <[email protected]>
+M: John Groves <[email protected]>
+M: John Groves <[email protected]>
+L: [email protected]
+L: [email protected]
+S: Supported
+F: Documentation/filesystems/famfs.rst
+F: fs/famfs
+F: include/uapi/linux/famfs_ioctl.h
+
FANOTIFY
M: Jan Kara <[email protected]>
R: Amir Goldstein <[email protected]>
--
2.43.0


2024-02-23 17:52:56

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 02/20] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage

This function should be called by fs-dax file systems after opening the
devdax device. This adds holder_operations.

This function serves the same role as fs_dax_get_by_bdev(), which dax
file systems call after opening the pmem block device.

Signed-off-by: John Groves <[email protected]>
---
drivers/dax/super.c | 38 ++++++++++++++++++++++++++++++++++++++
include/linux/dax.h | 5 +++++
2 files changed, 43 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index f4b635526345..fc96362de237 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -121,6 +121,44 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder)
EXPORT_SYMBOL_GPL(fs_put_dax);
#endif /* CONFIG_BLOCK && CONFIG_FS_DAX */

+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+
+/**
+ * fs_dax_get()
+ *
+ * fs-dax file systems call this function to prepare to use a devdax device for fsdax.
+ * This is like fs_dax_get_by_bdev(), but the caller already has struct dev_dax (and there
+ * is no bdev). The holder makes this exclusive.
+ *
+ * @dax_dev: dev to be prepared for fs-dax usage
+ * @holder: filesystem or mapped device inside the dax_device
+ * @hops: operations for the inner holder
+ *
+ * Returns: 0 on success, -1 on failure
+ */
+int fs_dax_get(
+ struct dax_device *dax_dev,
+ void *holder,
+ const struct dax_holder_operations *hops)
+{
+ /* dax_dev->ops should have been populated by devm_create_dev_dax() */
+ if (WARN_ON(!dax_dev->ops))
+ return -1;
+
+ if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode))
+ return -1;
+
+ if (cmpxchg(&dax_dev->holder_data, NULL, holder)) {
+ pr_warn("%s: holder_data already set\n", __func__);
+ return -1;
+ }
+ dax_dev->holder_ops = hops;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(fs_dax_get);
+#endif /* DEV_DAX_IOMAP */
+
enum dax_device_flags {
/* !alive + rcu grace period == no new operations / mappings */
DAXDEV_ALIVE,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..e973289bfde3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -57,7 +57,12 @@ struct dax_holder_operations {

#if IS_ENABLED(CONFIG_DAX)
struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
+
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
+#endif
void *dax_holder(struct dax_device *dax_dev);
+struct dax_device *inode_dax(struct inode *inode);
void put_dax(struct dax_device *dax_dev);
void kill_dax(struct dax_device *dax_dev);
void dax_write_cache(struct dax_device *dax_dev, bool wc);
--
2.43.0


2024-02-24 03:45:24

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On Fri, Feb 23, 2024 at 11:41:44AM -0600, John Groves wrote:
> This patch set introduces famfs[1] - a special-purpose fs-dax file system
> for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> CXL-specific in anyway way.
>
> * Famfs creates a simple access method for storing and sharing data in
> sharable memory. The memory is exposed and accessed as memory-mappable
> dax files.
> * Famfs supports multiple hosts mounting the same file system from the
> same memory (something existing fs-dax file systems don't do).
> * A famfs file system can be created on either a /dev/pmem device in fs-dax
> mode, or a /dev/dax device in devdax mode (the latter depending on
> patches 2-6 of this series).
>
> The famfs kernel file system is part the famfs framework; additional
> components in user space[2] handle metadata and direct the famfs kernel
> module to instantiate files that map to specific memory. The famfs user
> space has documentation and a reasonably thorough test suite.
>
> The famfs kernel module never accesses the shared memory directly (either
> data or metadata). Because of this, shared memory managed by the famfs
> framework does not create a RAS "blast radius" problem that should be able
> to crash or de-stabilize the kernel. Poison or timeouts in famfs memory
> can be expected to kill apps via SIGBUS and cause mounts to be disabled
> due to memory failure notifications.
>
> Famfs does not attempt to solve concurrency or coherency problems for apps,
> although it does solve these problems in regard to its own data structures.
> Apps may encounter hard concurrency problems, but there are use cases that
> are imminently useful and uncomplicated from a concurrency perspective:
> serial sharing is one (only one host at a time has access), and read-only
> concurrent sharing is another (all hosts can read-cache without worry).

Can you do me a favor, curious if you can run a test like this:

fio -name=ten-1g-per-thread --nrfiles=10 -bs=2M -ioengine=io_uring
-direct=1
--group_reporting=1 --alloc-size=1048576 --filesize=1GiB
--readwrite=write --fallocate=none --numjobs=$(nproc) --create_on_open=1
--directory=/mnt

What do you get for throughput?

The absolute large the system an capacity the better.

Luis

2024-02-26 12:06:00

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [RFC PATCH 02/20] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage

On Fri, 23 Feb 2024 11:41:46 -0600
John Groves <[email protected]> wrote:

> This function should be called by fs-dax file systems after opening the
> devdax device. This adds holder_operations.
>
> This function serves the same role as fs_dax_get_by_bdev(), which dax
> file systems call after opening the pmem block device.
>
> Signed-off-by: John Groves <[email protected]>

A few trivial comments form a first read to get my head around this.

Yeah, it is only an RFC, but who doesn't like tidy code? :)


> ---
> drivers/dax/super.c | 38 ++++++++++++++++++++++++++++++++++++++
> include/linux/dax.h | 5 +++++
> 2 files changed, 43 insertions(+)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index f4b635526345..fc96362de237 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -121,6 +121,44 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder)
> EXPORT_SYMBOL_GPL(fs_put_dax);
> #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
>
> +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> +
> +/**
> + * fs_dax_get()

Smells like kernel doc but fairly sure it needs a short description.
Have you sanity checked for warnings when running scripts/kerneldoc on it?

> + *
> + * fs-dax file systems call this function to prepare to use a devdax device for fsdax.
Trivial but lines too long. Keep under 80 chars unless there is a strong
readability arguement for not doing so.


> + * This is like fs_dax_get_by_bdev(), but the caller already has struct dev_dax (and there
> + * is no bdev). The holder makes this exclusive.

Not familiar with this area: what does exclusive mean here?

> + *
> + * @dax_dev: dev to be prepared for fs-dax usage
> + * @holder: filesystem or mapped device inside the dax_device
> + * @hops: operations for the inner holder
> + *
> + * Returns: 0 on success, -1 on failure

Why not return < 0 and use somewhat useful return values?

> + */
> +int fs_dax_get(
> + struct dax_device *dax_dev,
> + void *holder,
> + const struct dax_holder_operations *hops)

Match local style for indents - it's a bit inconsistent but probably...

int fs_dax_get(struct dad_device *dev_dax, void *holder,
const struct dax_holder_operations *hops)

> +{
> + /* dax_dev->ops should have been populated by devm_create_dev_dax() */
> + if (WARN_ON(!dax_dev->ops))
> + return -1;
> +
> + if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode))

You dereferenced dax_dev on the line above so check is too late or
unnecessary

> + return -1;
> +
> + if (cmpxchg(&dax_dev->holder_data, NULL, holder)) {
> + pr_warn("%s: holder_data already set\n", __func__);

Perhaps nicer to use a pr_fmt() deal with the func name if you need it.
or make it pr_debug and let dynamic debug control formatting if anyone
wants the function name.

> + return -1;
> + }
> + dax_dev->holder_ops = hops;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_get);
> +#endif /* DEV_DAX_IOMAP */
> +
> enum dax_device_flags {
> /* !alive + rcu grace period == no new operations / mappings */
> DAXDEV_ALIVE,
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b463502b16e1..e973289bfde3 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -57,7 +57,12 @@ struct dax_holder_operations {
>
> #if IS_ENABLED(CONFIG_DAX)
> struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
> +
> +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> +int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
line wrap < 80 chars

> +#endif
> void *dax_holder(struct dax_device *dax_dev);
> +struct dax_device *inode_dax(struct inode *inode);

Unrelated change?

> void put_dax(struct dax_device *dax_dev);
> void kill_dax(struct dax_device *dax_dev);
> void dax_write_cache(struct dax_device *dax_dev, bool wc);


2024-02-26 12:52:20

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [RFC PATCH 09/20] famfs: Add super_operations

On Fri, 23 Feb 2024 11:41:53 -0600
John Groves <[email protected]> wrote:

> Introduce the famfs superblock operations
>
> Signed-off-by: John Groves <[email protected]>
> ---
> fs/famfs/famfs_inode.c | 72 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 72 insertions(+)
> create mode 100644 fs/famfs/famfs_inode.c
>
> diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> new file mode 100644
> index 000000000000..3329aff000d1
> --- /dev/null
> +++ b/fs/famfs/famfs_inode.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2024 Micron Technology, inc
> + *
> + * This file system, originally based on ramfs the dax support from xfs,
> + * is intended to allow multiple host systems to mount a common file system
> + * view of dax files that map to shared memory.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/pagemap.h>
> +#include <linux/highmem.h>
> +#include <linux/time.h>
> +#include <linux/init.h>
> +#include <linux/string.h>
> +#include <linux/backing-dev.h>
> +#include <linux/sched.h>
> +#include <linux/parser.h>
> +#include <linux/magic.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/fs_context.h>
> +#include <linux/fs_parser.h>
> +#include <linux/seq_file.h>
> +#include <linux/dax.h>
> +#include <linux/hugetlb.h>
> +#include <linux/uio.h>
> +#include <linux/iomap.h>
> +#include <linux/path.h>
> +#include <linux/namei.h>
> +#include <linux/pfn_t.h>
> +#include <linux/blkdev.h>

That's a lot of header for such a small patch.. I'm going to guess
they aren't all used - bring them in as you need them - I hope
you never need some of these!


> +
> +#include "famfs_internal.h"
> +
> +#define FAMFS_DEFAULT_MODE 0755
> +
> +static const struct super_operations famfs_ops;
> +static const struct inode_operations famfs_file_inode_operations;
> +static const struct inode_operations famfs_dir_inode_operations;

Why are these all up here?

> +
> +/**********************************************************************************
> + * famfs super_operations
> + *
> + * TODO: implement a famfs_statfs() that shows size, free and available space, etc.
> + */
> +
> +/**
> + * famfs_show_options() - Display the mount options in /proc/mounts.
Run kernel doc script + fix all warnings.

> + */
> +static int famfs_show_options(
> + struct seq_file *m,
> + struct dentry *root)
Not that familiar with fs code, but this unusual kernel style. I'd go with
something more common

static int famfs_show_options(struct seq_file *m, struct dentry *root)

> +{
> + struct famfs_fs_info *fsi = root->d_sb->s_fs_info;
> +
> + if (fsi->mount_opts.mode != FAMFS_DEFAULT_MODE)
> + seq_printf(m, ",mode=%o", fsi->mount_opts.mode);
> +
> + return 0;
> +}
> +
> +static const struct super_operations famfs_ops = {
> + .statfs = simple_statfs,
> + .drop_inode = generic_delete_inode,
> + .show_options = famfs_show_options,
> +};
> +
> +
One blank line probably fine.


Add the rest of the stuff a module normally has, author etc in this
patch.

> +MODULE_LICENSE("GPL");


2024-02-26 13:27:35

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On 24/02/23 04:07PM, Luis Chamberlain wrote:
> On Fri, Feb 23, 2024 at 11:41:44AM -0600, John Groves wrote:
> > This patch set introduces famfs[1] - a special-purpose fs-dax file system
> > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> > CXL-specific in anyway way.
> >
> > * Famfs creates a simple access method for storing and sharing data in
> > sharable memory. The memory is exposed and accessed as memory-mappable
> > dax files.
> > * Famfs supports multiple hosts mounting the same file system from the
> > same memory (something existing fs-dax file systems don't do).
> > * A famfs file system can be created on either a /dev/pmem device in fs-dax
> > mode, or a /dev/dax device in devdax mode (the latter depending on
> > patches 2-6 of this series).
> >
> > The famfs kernel file system is part the famfs framework; additional
> > components in user space[2] handle metadata and direct the famfs kernel
> > module to instantiate files that map to specific memory. The famfs user
> > space has documentation and a reasonably thorough test suite.
> >
> > The famfs kernel module never accesses the shared memory directly (either
> > data or metadata). Because of this, shared memory managed by the famfs
> > framework does not create a RAS "blast radius" problem that should be able
> > to crash or de-stabilize the kernel. Poison or timeouts in famfs memory
> > can be expected to kill apps via SIGBUS and cause mounts to be disabled
> > due to memory failure notifications.
> >
> > Famfs does not attempt to solve concurrency or coherency problems for apps,
> > although it does solve these problems in regard to its own data structures.
> > Apps may encounter hard concurrency problems, but there are use cases that
> > are imminently useful and uncomplicated from a concurrency perspective:
> > serial sharing is one (only one host at a time has access), and read-only
> > concurrent sharing is another (all hosts can read-cache without worry).
>
> Can you do me a favor, curious if you can run a test like this:
>
> fio -name=ten-1g-per-thread --nrfiles=10 -bs=2M -ioengine=io_uring
> -direct=1
> --group_reporting=1 --alloc-size=1048576 --filesize=1GiB
> --readwrite=write --fallocate=none --numjobs=$(nproc) --create_on_open=1
> --directory=/mnt
>
> What do you get for throughput?
>
> The absolute large the system an capacity the better.
>
> Luis

Luis,

First, thanks for paying attention. I think I need to clarify a few things
about famfs and then check how that modifies your ask; apologies if some
are obvious. You should tell me whether this is still interesting given
these clarifications and limitations, or if there is something else you'd
like to see tested instead. But read on, I have run the closest tests I
can.

Famfs files just map to dax memory; they don't have a backing store. So the
io_uring and direct=1 options don't work. The coolness is that the files &
memory can be shared, and that apps can deal with files rather than having
to learn new abstractions.

Famfs files are never allocate-on-write, so (--fallocate=none is ok, but
"actual" fallocate doesn't work - and --create_on_open desn't work). But it
seems to be happy if I preallocate the files for the test.

I don't currently have custody of a really beefy system (can get one, just
need to plan ahead). My primary dev system is a 48 HT core E5-2690 v3 @
2.60G (around 10 years old).

I have a 128GB dax device that is backed by ddr4 via efi_fake_mem. So I
can't do 48 x 10 x 1G, but I can do 48 x 10 x 256M. I ran this on
ddr4-backed famfs, and xfs backed by a sata ssd. Probably not fair, but
it's what I have on a Sunday evening.

I can get access to a beefy system with real cxl memory, though don't
assume 100% I can report performance on that - will check into that. But
think about what you're looking for in light of the fact that famfs is just
a shared-memory file system, so no O_DIRECT or io_uring. Basically just
(hopefully efficient) vma fault handling and metadata distribution.

###

Here is famfs. I had to drop the io_uring and script up alloc/creation
of the files (sudo famfs creat -s 256M /mnt/famfs/foo)

$ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=100MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --directory=/mnt/famfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=psync, iodepth=1
...
fio-3.33
Starting 48 processes
Jobs: 40 (f=400)
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=201738: Mon Feb 26 06:48:21 2024
write: IOPS=15.2k, BW=29.6GiB/s (31.8GB/s)(44.7GiB/1511msec); 0 zone resets
clat (usec): min=156, max=54645, avg=2077.40, stdev=1730.77
lat (usec): min=171, max=54686, avg=2404.87, stdev=2056.50
clat percentiles (usec):
| 1.00th=[ 196], 5.00th=[ 243], 10.00th=[ 367], 20.00th=[ 644],
| 30.00th=[ 857], 40.00th=[ 1352], 50.00th=[ 1876], 60.00th=[ 2442],
| 70.00th=[ 2868], 80.00th=[ 3228], 90.00th=[ 3884], 95.00th=[ 4555],
| 99.00th=[ 6390], 99.50th=[ 7439], 99.90th=[16450], 99.95th=[23987],
| 99.99th=[46924]
bw ( MiB/s): min=21544, max=28034, per=81.80%, avg=24789.35, stdev=130.16, samples=81
iops : min=10756, max=14000, avg=12378.00, stdev=65.06, samples=81
lat (usec) : 250=5.42%, 500=9.67%, 750=8.07%, 1000=11.77%
lat (msec) : 2=16.87%, 4=39.59%, 10=8.37%, 20=0.17%, 50=0.07%
lat (msec) : 100=0.01%
cpu : usr=13.26%, sys=81.62%, ctx=2075, majf=0, minf=18159
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,22896,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec

$ sudo famfs fsck -h /mnt/famfs
Famfs Superblock:
Filesystem UUID: 591f3f62-0a79-4543-9ab5-e02dc807c76c
System UUID: 00000000-0000-0000-0000-0cc47aaaa734
sizeof superblock: 168
num_daxdevs: 1
primary: /dev/dax1.0 137438953472

Log stats:
# of log entriesi in use: 480 of 25575
Log size in use: 157488
No allocation errors found

Capacity:
Device capacity: 128.00G
Bitmap capacity: 127.99G
Sum of file sizes: 120.00G
Allocated space: 120.00G
Free space: 7.99G
Space amplification: 1.00
Percent used: 93.8%

Famfs log:
480 of 25575 entries used
480 files
0 directories

###

Here is the same fio command, plus --ioengine=io_uring and --direct=1. It's
apples and oranges, since famfs is a memory interface and not a storage
interface. This is run on an xfs file system on a SATA ssd.

Note units are msec here, usec above.

fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/home/jmg/t1
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1
...
fio-3.33
Starting 48 processes
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
Jobs: 37 (f=370): [W(1),_(2),W(2),_(1),W(1),_(1),W(6),_(1),W(1),_(1),W(1),_(1),W(1),_(1),W(1),_(1),W(13),_(1),W(5),_(1),W(5)][72.1%][w=454MiB/s][w=227 IOPS][eta 01m:32sJobs: 37 (f=370): [W(1),_(2),W(2),_(1),W(1),_(1),W(6),_(1),W(1),_(1),W(1),_(1),W(1),_(1),W(1),_(1),W(13),_(1),W(5),_(1),W(5)][72.4%][w=456MiB/s][w=228 IOPS][eta 01m:31sJobs: 36 (f=360): [W(1),_(2),W(2),_(1),W(1),_(1),W(6),_(1),W(1),_(1),W(1),_(1),W(1),_(3),W(13),_(1),W(5),_(1),W(5)][72.9%][w=454MiB/s][w=227 IOPS][eta 01m:29s] Jobs: 33 (f=330): [_(3),W(2),_(1),W(1),_(1),W(1),_(1),W(4),_(1),W(1),_(1),W(1),_(1),W(1),_(3),W(13),_(1),W(5),_(1),W(2),_(1),W(2)][73.0%][w=458MiB/s][w=229 IOPS][eta 01Jobs: 30 (f=300): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(3),_(1),W(1),_(3),W(1),_(3),W(7),_(1),W(5),_(1),W(5),_(1),W(2),_(1),W(2)][73.6%][w=462MiB/s][w=231 IOPS][eta 01mJobs: 28 (f=280): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(3),_(5),W(1),_(3),W(7),_(1),W(5),_(1),W(5),_(1),W(2),_(2),W(1)][74.1%][w=456MiB/s][w=228 IOPS][eta 01m:25s] Jobs: 25 (f=250): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(1),W(1),_(5),W(1),_(3),W(2),_(1),W(4),_(1),W(5),_(1),W(5),_(2),W(1),_(2),W(1)][75.1%][w=458MiB/s][w=229 IOPJobs: 24 (f=240): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(1),W(1),_(5),W(1),_(3),W(2),_(1),W(3),_(2),W(5),_(1),W(5),_(2),W(1),_(2),W(1)][75.6%][w=456MiB/s][w=228 IOPJobs: 23 (f=230): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(1),W(1),_(5),E(1),_(3),W(2),_(1),W(3),_(2),W(5),_(1),W(5),_(2),W(1),_(2),W(1)][76.2%][w=452MiB/s][w=226 IOPJobs: 20 (f=200): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(11),W(2),_(1),W(3),_(2),W(5),_(1),W(3),_(1),W(1),_(2),W(1),_(3)][76.7%][w=448MiB/s][w=224 IOPS][eta 01m:15sJobs: 19 (f=190): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(11),W(2),_(1),W(3),_(2),W(5),_(2),W(2),_(1),W(1),_(2),W(1),_(3)][77.5%][w=464MiB/s][w=232 IOPS][eta 01m:12sJobs: 18 (f=180): [_(3),W(2),_(3),W(1),_(2),W(1),_(11),W(2),_(1),W(3),_(2),W(5),_(2),W(2),_(1),W(1),_(2),W(1),_(3)][78.8%][w=478MiB/s][w=239 IOPS][eta 01m:07s] Jobs: 4 (f=40): [_(3),W(1),_(22),W(1),_(12),W(1),_(4),W(1),_(3)][92.4%][w=462MiB/s][w=231 IOPS][eta 00m:21s]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=210709: Mon Feb 26 07:20:51 2024
write: IOPS=228, BW=458MiB/s (480MB/s)(114GiB/255942msec); 0 zone resets
slat (usec): min=39, max=776, avg=186.65, stdev=49.13
clat (msec): min=4, max=6718, avg=199.27, stdev=324.82
lat (msec): min=4, max=6718, avg=199.45, stdev=324.82
clat percentiles (msec):
| 1.00th=[ 30], 5.00th=[ 47], 10.00th=[ 60], 20.00th=[ 69],
| 30.00th=[ 78], 40.00th=[ 85], 50.00th=[ 95], 60.00th=[ 114],
| 70.00th=[ 142], 80.00th=[ 194], 90.00th=[ 409], 95.00th=[ 810],
| 99.00th=[ 1703], 99.50th=[ 2140], 99.90th=[ 3037], 99.95th=[ 3440],
| 99.99th=[ 4665]
bw ( KiB/s): min=195570, max=2422953, per=100.00%, avg=653513.53, stdev=8137.30, samples=17556
iops : min= 60, max= 1180, avg=314.22, stdev= 3.98, samples=17556
lat (msec) : 10=0.11%, 20=0.37%, 50=5.35%, 100=47.30%, 250=32.22%
lat (msec) : 500=6.11%, 750=2.98%, 1000=1.98%, 2000=2.97%, >=2000=0.60%
cpu : usr=0.10%, sys=0.01%, ctx=58709, majf=0, minf=669
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=458MiB/s (480MB/s), 458MiB/s-458MiB/s (480MB/s-480MB/s), io=114GiB (123GB), run=255942-255942msec

Disk stats (read/write):
dm-2: ios=11/82263, merge=0/0, ticks=270/13403617, in_queue=13403887, util=97.10%, aggrios=11/152359, aggrmerge=0/5087, aggrticks=271/11493029, aggrin_queue=11494994, aggrutil=100.00%
sdb: ios=11/152359, merge=0/5087, ticks=271/11493029, in_queue=11494994, util=100.00%

###

Let me know what else you'd like to see tried.

Regards,
John


2024-02-26 13:31:12

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [RFC PATCH 13/20] famfs: Add iomap_ops

On Fri, 23 Feb 2024 11:41:57 -0600
John Groves <[email protected]> wrote:

> This commit introduces the famfs iomap_ops. When either
> dax_iomap_fault() or dax_iomap_rw() is called, we get a callback
> via our iomap_begin() handler. The question being asked is
> "please resolve (file, offset) to (daxdev, offset)". The function
> famfs_meta_to_dax_offset() does this.
>
> The per-file metadata is just an extent list to the
> backing dax dev. The order of this resolution is O(N) for N
> extents. Note with the current user space, files usually have
> only one extent.
>
> Signed-off-by: John Groves <[email protected]>

> ---
> fs/famfs/famfs_file.c | 245 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 245 insertions(+)
> create mode 100644 fs/famfs/famfs_file.c
>
> diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
> new file mode 100644
> index 000000000000..fc667d5f7be8
> --- /dev/null
> +++ b/fs/famfs/famfs_file.c
> @@ -0,0 +1,245 @@

> +static int
> +famfs_meta_to_dax_offset(
> + struct inode *inode,
> + struct iomap *iomap,
> + loff_t offset,
> + loff_t len,
> + unsigned int flags)
> +{
> + struct famfs_file_meta *meta = (struct famfs_file_meta *)inode->i_private;

i_private is void * so no need for explicit cast (C spec says this is always fine without)


> +
> +/**
> + * famfs_iomap_begin()
> + *
> + * This function is pretty simple because files are
> + * * never partially allocated
> + * * never have holes (never sparse)
> + * * never "allocate on write"
> + */
> +static int
> +famfs_iomap_begin(
> + struct inode *inode,
> + loff_t offset,
> + loff_t length,
> + unsigned int flags,
> + struct iomap *iomap,
> + struct iomap *srcmap)
> +{
> + struct famfs_file_meta *meta = inode->i_private;
> + size_t size;
> + int rc;
> +
> + size = i_size_read(inode);
> +
> + WARN_ON(size != meta->file_size);
> +
> + rc = famfs_meta_to_dax_offset(inode, iomap, offset, length, flags);
> +
> + return rc;
return famfs_meta_...

> +}


> +static vm_fault_t
> +famfs_filemap_map_pages(
> + struct vm_fault *vmf,
> + pgoff_t start_pgoff,
> + pgoff_t end_pgoff)
> +{
> + vm_fault_t ret;
> +
> + ret = filemap_map_pages(vmf, start_pgoff, end_pgoff);
> + return ret;
return filename_map_pages()....

> +}
> +
>


2024-02-26 13:48:40

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [RFC PATCH 17/20] famfs: Add module stuff

On Fri, 23 Feb 2024 11:42:01 -0600
John Groves <[email protected]> wrote:

> This commit introduces the module init and exit machinery for famfs.
>
> Signed-off-by: John Groves <[email protected]>
I'd prefer to see this from the start with the functionality of the module
built up as you go + build logic in place. Makes it easy to spot places
where the patches aren't appropriately self constrained.
> ---
> fs/famfs/famfs_inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 44 insertions(+)
>
> diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> index ab46ec50b70d..0d659820e8ff 100644
> --- a/fs/famfs/famfs_inode.c
> +++ b/fs/famfs/famfs_inode.c
> @@ -462,4 +462,48 @@ static struct file_system_type famfs_fs_type = {
> .fs_flags = FS_USERNS_MOUNT,
> };
>
> +/*****************************************************************************************
> + * Module stuff

I'd drop these drivers structure comments. They add little beyond
a high possibility of being wrong after the code has evolved a bit.

> + */
> +static struct kobject *famfs_kobj;
> +
> +static int __init init_famfs_fs(void)
> +{
> + int rc;
> +
> +#if defined(CONFIG_DEV_DAX_IOMAP)
> + pr_notice("%s: Your kernel supports famfs on /dev/dax\n", __func__);
> +#else
> + pr_notice("%s: Your kernel does not support famfs on /dev/dax\n", __func__);
> +#endif
> + famfs_kobj = kobject_create_and_add(MODULE_NAME, fs_kobj);
> + if (!famfs_kobj) {
> + pr_warn("Failed to create kobject\n");
> + return -ENOMEM;
> + }
> +
> + rc = sysfs_create_group(famfs_kobj, &famfs_attr_group);
> + if (rc) {
> + kobject_put(famfs_kobj);
> + pr_warn("%s: Failed to create sysfs group\n", __func__);
> + return rc;
> + }
> +
> + return register_filesystem(&famfs_fs_type);

If this fails, do we not leak the kobj and sysfs groups?

> +}
> +
> +static void
> +__exit famfs_exit(void)
> +{
> + sysfs_remove_group(famfs_kobj, &famfs_attr_group);
> + kobject_put(famfs_kobj);
> + unregister_filesystem(&famfs_fs_type);
> + pr_info("%s: unregistered\n", __func__);
> +}
> +
> +
> +fs_initcall(init_famfs_fs);
> +module_exit(famfs_exit);
> +
> +MODULE_AUTHOR("John Groves, Micron Technology");
> MODULE_LICENSE("GPL");


2024-02-26 13:50:57

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [RFC PATCH 14/20] famfs: Add struct file_operations

On Fri, 23 Feb 2024 11:41:58 -0600
John Groves <[email protected]> wrote:

> This commit introduces the famfs file_operations. We call
> thp_get_unmapped_area() to force PMD page alignment. Our read and
> write handlers (famfs_dax_read_iter() and famfs_dax_write_iter())
> call dax_iomap_rw() to do the work.
>
> famfs_file_invalid() checks for various ways a famfs file can be
> in an invalid state so we can fail I/O or fault resolution in those
> cases. Those cases include the following:
>
> * No famfs metadata
> * file i_size does not match the originally allocated size
> * file is not flagged as DAX
> * errors were detected previously on the file
>
> An invalid file can often be fixed by replaying the log, or by
> umount/mount/log replay - all of which are user space operations.
>
> Signed-off-by: John Groves <[email protected]>
> ---
> fs/famfs/famfs_file.c | 136 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 136 insertions(+)
>
> diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
> index fc667d5f7be8..5228e9de1e3b 100644
> --- a/fs/famfs/famfs_file.c
> +++ b/fs/famfs/famfs_file.c
> @@ -19,6 +19,142 @@
> #include <uapi/linux/famfs_ioctl.h>
> #include "famfs_internal.h"
>
> +/*********************************************************************
> + * file_operations
> + */
> +
> +/* Reject I/O to files that aren't in a valid state */
> +static ssize_t
> +famfs_file_invalid(struct inode *inode)
> +{
> + size_t i_size = i_size_read(inode);
> + struct famfs_file_meta *meta = inode->i_private;
> +
> + if (!meta) {
> + pr_err("%s: un-initialized famfs file\n", __func__);
> + return -EIO;
> + }
> + if (i_size != meta->file_size) {
> + pr_err("%s: something changed the size from %ld to %ld\n",
> + __func__, meta->file_size, i_size);
> + meta->error = 1;
> + return -ENXIO;
> + }
> + if (!IS_DAX(inode)) {
> + pr_err("%s: inode %llx IS_DAX is false\n", __func__, (u64)inode);
> + meta->error = 1;
> + return -ENXIO;
> + }
> + if (meta->error) {
> + pr_err("%s: previously detected metadata errors\n", __func__);
> + meta->error = 1;

Already set? If treating it as only a boolean, maybe make it one?

> + return -EIO;
> + }
> + return 0;
> +}


2024-02-26 13:52:54

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [RFC PATCH 18/20] famfs: Support character dax via the dev_dax_iomap patch

On Fri, 23 Feb 2024 11:42:02 -0600
John Groves <[email protected]> wrote:

> This commit introduces the ability to open a character /dev/dax device
> instead of a block /dev/pmem device. This rests on the dev_dax_iomap
> patches earlier in this series.

Not sure the back reference is needed given it's in the series.

>
> Signed-off-by: John Groves <[email protected]>
> ---
> fs/famfs/famfs_inode.c | 97 +++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 87 insertions(+), 10 deletions(-)
>
> diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> index 0d659820e8ff..7d65ac497147 100644
> --- a/fs/famfs/famfs_inode.c
> +++ b/fs/famfs/famfs_inode.c
> @@ -215,6 +215,93 @@ static const struct super_operations famfs_ops = {
> .show_options = famfs_show_options,
> };
>
> +/*****************************************************************************/
> +
> +#if defined(CONFIG_DEV_DAX_IOMAP)
> +
> +/*
> + * famfs dax_operations (for char dax)
> + */
> +static int
> +famfs_dax_notify_failure(struct dax_device *dax_dev, u64 offset,
> + u64 len, int mf_flags)
> +{
> + pr_err("%s: offset %lld len %llu flags %x\n", __func__,
> + offset, len, mf_flags);
> + return -EOPNOTSUPP;
> +}
> +
> +static const struct dax_holder_operations famfs_dax_holder_ops = {
> + .notify_failure = famfs_dax_notify_failure,
> +};
> +
> +/*****************************************************************************/
> +
> +/**
> + * famfs_open_char_device()
> + *
> + * Open a /dev/dax device. This only works in kernels with the dev_dax_iomap patch

That comment you definitely don't need as this won't get merged without
that patch being in place.


> + */
> +static int
> +famfs_open_char_device(
> + struct super_block *sb,
> + struct fs_context *fc)
> +{
> + struct famfs_fs_info *fsi = sb->s_fs_info;
> + struct dax_device *dax_devp;
> + struct inode *daxdev_inode;
> +
> + int rc = 0;
set in all paths where it's used.

> +
> + pr_notice("%s: Opening character dax device %s\n", __func__, fc->source);

pr_debug

> +
> + fsi->dax_filp = filp_open(fc->source, O_RDWR, 0);
> + if (IS_ERR(fsi->dax_filp)) {
> + pr_err("%s: failed to open dax device %s\n",
> + __func__, fc->source);
> + fsi->dax_filp = NULL;
Better to use a local variable

fp = filp_open(fc->source, O_RDWR, 0);
if (IS_ERR(fp)) {
pr_err.
return;
}
fsi->dax_filp = fp;
or similar.

> + return PTR_ERR(fsi->dax_filp);
> + }
> +
> + daxdev_inode = file_inode(fsi->dax_filp);
> + dax_devp = inode_dax(daxdev_inode);
> + if (IS_ERR(dax_devp)) {
> + pr_err("%s: unable to get daxdev from inode for %s\n",
> + __func__, fc->source);
> + rc = -ENODEV;
> + goto char_err;
> + }
> +
> + rc = fs_dax_get(dax_devp, fsi, &famfs_dax_holder_ops);
> + if (rc) {
> + pr_info("%s: err attaching famfs_dax_holder_ops\n", __func__);
> + goto char_err;
> + }
> +
> + fsi->bdev_handle = NULL;
> + fsi->dax_devp = dax_devp;
> +
> + return 0;
> +
> +char_err:
> + filp_close(fsi->dax_filp, NULL);

You carefully set fsi->dax_filp to null in other other error paths.
Why there and not here?

> + return rc;
> +}
> +
> +#else /* CONFIG_DEV_DAX_IOMAP */
> +static int
> +famfs_open_char_device(
> + struct super_block *sb,
> + struct fs_context *fc)
> +{
> + pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
> + __func__, fc->source);
> + return -ENODEV;
> +}
> +
> +
> +#endif /* CONFIG_DEV_DAX_IOMAP */
> +
> /***************************************************************************************
> * dax_holder_operations for block dax
> */
> @@ -236,16 +323,6 @@ const struct dax_holder_operations famfs_blk_dax_holder_ops = {
> .notify_failure = famfs_blk_dax_notify_failure,
> };
>

Put it in right place earlier! Makes this less noisy.

> -static int
> -famfs_open_char_device(
> - struct super_block *sb,
> - struct fs_context *fc)
> -{
> - pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
> - __func__, fc->source);
> - return -ENODEV;
> -}
> -
> /**
> * famfs_open_device()
> *


2024-02-26 15:13:44

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 02/20] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage

On 24/02/26 12:05PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:46 -0600
> John Groves <[email protected]> wrote:
>
> > This function should be called by fs-dax file systems after opening the
> > devdax device. This adds holder_operations.
> >
> > This function serves the same role as fs_dax_get_by_bdev(), which dax
> > file systems call after opening the pmem block device.
> >
> > Signed-off-by: John Groves <[email protected]>
>
> A few trivial comments form a first read to get my head around this.
>
> Yeah, it is only an RFC, but who doesn't like tidy code? :)

Hope your eyes don't burn too much ;)
>
>
> > ---
> > drivers/dax/super.c | 38 ++++++++++++++++++++++++++++++++++++++
> > include/linux/dax.h | 5 +++++
> > 2 files changed, 43 insertions(+)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index f4b635526345..fc96362de237 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -121,6 +121,44 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder)
> > EXPORT_SYMBOL_GPL(fs_put_dax);
> > #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
> >
> > +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> > +
> > +/**
> > + * fs_dax_get()
>
> Smells like kernel doc but fairly sure it needs a short description.
> Have you sanity checked for warnings when running scripts/kerneldoc on it?

Right, and there were other cases. Randy pointed one out, and I've already
gone through and "fixed" them.

>
> > + *
> > + * fs-dax file systems call this function to prepare to use a devdax device for fsdax.
> Trivial but lines too long. Keep under 80 chars unless there is a strong
> readability arguement for not doing so.

I was under the impression the "kids these days" have a 100 column standard.
But I will go through and limit line to 80 except where it gets too awkward.

>
>
> > + * This is like fs_dax_get_by_bdev(), but the caller already has struct dev_dax (and there
> > + * is no bdev). The holder makes this exclusive.
>
> Not familiar with this area: what does exclusive mean here?

The holder_ops are set via cmpxchg, in such a way that if there are already
holder_ops, the call to fs_dax_get() will fail. (as it should)

>
> > + *
> > + * @dax_dev: dev to be prepared for fs-dax usage
> > + * @holder: filesystem or mapped device inside the dax_device
> > + * @hops: operations for the inner holder
> > + *
> > + * Returns: 0 on success, -1 on failure
>
> Why not return < 0 and use somewhat useful return values?

Good idea, will do.

>
> > + */
> > +int fs_dax_get(
> > + struct dax_device *dax_dev,
> > + void *holder,
> > + const struct dax_holder_operations *hops)
>
> Match local style for indents - it's a bit inconsistent but probably...
>
> int fs_dax_get(struct dad_device *dev_dax, void *holder,
> const struct dax_holder_operations *hops)

Done

>
> > +{
> > + /* dax_dev->ops should have been populated by devm_create_dev_dax() */
> > + if (WARN_ON(!dax_dev->ops))
> > + return -1;
> > +
> > + if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode))
>
> You dereferenced dax_dev on the line above so check is too late or
> unnecessary

Good catch, thank you!

>
> > + return -1;
> > +
> > + if (cmpxchg(&dax_dev->holder_data, NULL, holder)) {
> > + pr_warn("%s: holder_data already set\n", __func__);
>
> Perhaps nicer to use a pr_fmt() deal with the func name if you need it.
> or make it pr_debug and let dynamic debug control formatting if anyone
> wants the function name.

Sounds good.

>
> > + return -1;
> > + }
> > + dax_dev->holder_ops = hops;
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(fs_dax_get);
> > +#endif /* DEV_DAX_IOMAP */
> > +
> > enum dax_device_flags {
> > /* !alive + rcu grace period == no new operations / mappings */
> > DAXDEV_ALIVE,
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index b463502b16e1..e973289bfde3 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -57,7 +57,12 @@ struct dax_holder_operations {
> >
> > #if IS_ENABLED(CONFIG_DAX)
> > struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
> > +
> > +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> > +int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
> line wrap < 80 chars

Roger that

>
> > +#endif
> > void *dax_holder(struct dax_device *dax_dev);
> > +struct dax_device *inode_dax(struct inode *inode);
>
> Unrelated change?

Kinda, but I'm not sure there is a better home for this one. Patch 18,
which is a famfs patch, calls inode_dax(). It was already exported but not
prototyped in dax.h.

Mixing it in with other dev_dax_iomap content seems better than mixing it
with famfs content. Could make it a separate patch, but I was trying to
some old docs that said keep patch sets <=15 - which I deemed impossible here.

What say others?

>
> > void put_dax(struct dax_device *dax_dev);
> > void kill_dax(struct dax_device *dax_dev);
> > void dax_write_cache(struct dax_device *dax_dev, bool wc);
>

Thanks Jonathan!


2024-02-26 15:54:39

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> Run status group 0 (all jobs):
> WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec

> This is run on an xfs file system on a SATA ssd.

To compare more closer apples to apples, wouldn't it make more sense
to try this with XFS on pmem (with fio -direct=1)?

Luis

2024-02-26 21:17:07

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On 24/02/26 07:53AM, Luis Chamberlain wrote:
> On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> > Run status group 0 (all jobs):
> > WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
>
> > This is run on an xfs file system on a SATA ssd.
>
> To compare more closer apples to apples, wouldn't it make more sense
> to try this with XFS on pmem (with fio -direct=1)?
>
> Luis

Makes sense. Here is the same command line I used with xfs before, but
now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem
because xfs requires that.

fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1
..
fio-3.33
Starting 48 processes
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
Jobs: 36 (f=360): [W(3),_(1),W(3),_(1),W(1),_(1),W(6),_(1),W(1),_(1),W(1),_(1),W(7),_(1),W(3),_(1),W(2),_(2),W(4),_(1),W(5),_(1)][77.8%][w=15.1GiB/s][w=7750 IOPS][eta 00m:02s]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=8798: Mon Feb 26 15:10:30 2024
write: IOPS=7582, BW=14.8GiB/s (15.9GB/s)(114GiB/7723msec); 0 zone resets
slat (usec): min=23, max=7352, avg=131.80, stdev=151.63
clat (usec): min=385, max=22638, avg=5789.74, stdev=3124.93
lat (usec): min=432, max=22724, avg=5921.54, stdev=3133.18
clat percentiles (usec):
| 1.00th=[ 799], 5.00th=[ 1467], 10.00th=[ 2073], 20.00th=[ 3097],
| 30.00th=[ 3949], 40.00th=[ 4752], 50.00th=[ 5473], 60.00th=[ 6194],
| 70.00th=[ 7046], 80.00th=[ 8029], 90.00th=[ 9634], 95.00th=[11338],
| 99.00th=[16319], 99.50th=[17957], 99.90th=[20055], 99.95th=[20579],
| 99.99th=[21365]
bw ( MiB/s): min=10852, max=26980, per=100.00%, avg=15940.43, stdev=88.61, samples=665
iops : min= 5419, max=13477, avg=7963.08, stdev=44.28, samples=665
lat (usec) : 500=0.15%, 750=0.47%, 1000=1.34%
lat (msec) : 2=7.40%, 4=21.46%, 10=60.57%, 20=8.50%, 50=0.11%
cpu : usr=2.33%, sys=0.32%, ctx=58806, majf=0, minf=36301
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=14.8GiB/s (15.9GB/s), 14.8GiB/s-14.8GiB/s (15.9GB/s-15.9GB/s), io=114GiB (123GB), run=7723-7723msec

Disk stats (read/write):
pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


I only have some educated guesses as to why famfs is faster. Since files
are preallocated, they're always contiguous. And famfs is vastly simpler
because it isn't aimed at general purpose uses cases (and indeed can't
handle them).

Regards,
John


2024-02-26 21:48:11

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 09/20] famfs: Add super_operations

On 24/02/26 12:51PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:53 -0600
> John Groves <[email protected]> wrote:
>
> > Introduce the famfs superblock operations
> >
> > Signed-off-by: John Groves <[email protected]>
> > ---
> > fs/famfs/famfs_inode.c | 72 ++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 72 insertions(+)
> > create mode 100644 fs/famfs/famfs_inode.c
> >
> > diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> > new file mode 100644
> > index 000000000000..3329aff000d1
> > --- /dev/null
> > +++ b/fs/famfs/famfs_inode.c
> > @@ -0,0 +1,72 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * famfs - dax file system for shared fabric-attached memory
> > + *
> > + * Copyright 2023-2024 Micron Technology, inc
> > + *
> > + * This file system, originally based on ramfs the dax support from xfs,
> > + * is intended to allow multiple host systems to mount a common file system
> > + * view of dax files that map to shared memory.
> > + */
> > +
> > +#include <linux/fs.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/highmem.h>
> > +#include <linux/time.h>
> > +#include <linux/init.h>
> > +#include <linux/string.h>
> > +#include <linux/backing-dev.h>
> > +#include <linux/sched.h>
> > +#include <linux/parser.h>
> > +#include <linux/magic.h>
> > +#include <linux/slab.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/fs_context.h>
> > +#include <linux/fs_parser.h>
> > +#include <linux/seq_file.h>
> > +#include <linux/dax.h>
> > +#include <linux/hugetlb.h>
> > +#include <linux/uio.h>
> > +#include <linux/iomap.h>
> > +#include <linux/path.h>
> > +#include <linux/namei.h>
> > +#include <linux/pfn_t.h>
> > +#include <linux/blkdev.h>
>
> That's a lot of header for such a small patch.. I'm going to guess
> they aren't all used - bring them in as you need them - I hope
> you never need some of these!

I didn't phase in headers in this series. Based on these recommendations,
the next version of this series is gonna have to be 100% constructed from
scratch, but okay. My head hurts just thinking about it. I need a nap...

I've been rebasing for 3 weeks to get this series out, and it occurs to
me that maybe there are tools I'm not aware of that make it eaiser? I'm
just typing "rebase -i..." 200 times a day. Is there a less soul-crushing way?

>
>
> > +
> > +#include "famfs_internal.h"
> > +
> > +#define FAMFS_DEFAULT_MODE 0755
> > +
> > +static const struct super_operations famfs_ops;
> > +static const struct inode_operations famfs_file_inode_operations;
> > +static const struct inode_operations famfs_dir_inode_operations;
>
> Why are these all up here?

These forward declarations are needed by a later patch in the series.
They were in famfs_internal.h, but they are only used in this file, so
I moved them here.

For all answers such as this, I will hereafter reply "rebase fu", with
further clarification only if necessary.

>
> > +
> > +/**********************************************************************************
> > + * famfs super_operations
> > + *
> > + * TODO: implement a famfs_statfs() that shows size, free and available space, etc.
> > + */
> > +
> > +/**
> > + * famfs_show_options() - Display the mount options in /proc/mounts.
> Run kernel doc script + fix all warnings.

Will do; I actually think I have already fixed those...

>
> > + */
> > +static int famfs_show_options(
> > + struct seq_file *m,
> > + struct dentry *root)
> Not that familiar with fs code, but this unusual kernel style. I'd go with
> something more common
>
> static int famfs_show_options(struct seq_file *m, struct dentry *root)

Done. To all functions...

>
> > +{
> > + struct famfs_fs_info *fsi = root->d_sb->s_fs_info;
> > +
> > + if (fsi->mount_opts.mode != FAMFS_DEFAULT_MODE)
> > + seq_printf(m, ",mode=%o", fsi->mount_opts.mode);
> > +
> > + return 0;
> > +}
> > +
> > +static const struct super_operations famfs_ops = {
> > + .statfs = simple_statfs,
> > + .drop_inode = generic_delete_inode,
> > + .show_options = famfs_show_options,
> > +};
> > +
> > +
> One blank line probably fine.

Done

>
>
> Add the rest of the stuff a module normally has, author etc in this
> patch.

Because "rebase fu" I'm not sure the order will remain the same. Will
try not to make anybody tell me this again though...

John


2024-02-26 23:00:21

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 13/20] famfs: Add iomap_ops

On 24/02/26 01:30PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:57 -0600
> John Groves <[email protected]> wrote:
>
> > This commit introduces the famfs iomap_ops. When either
> > dax_iomap_fault() or dax_iomap_rw() is called, we get a callback
> > via our iomap_begin() handler. The question being asked is
> > "please resolve (file, offset) to (daxdev, offset)". The function
> > famfs_meta_to_dax_offset() does this.
> >
> > The per-file metadata is just an extent list to the
> > backing dax dev. The order of this resolution is O(N) for N
> > extents. Note with the current user space, files usually have
> > only one extent.
> >
> > Signed-off-by: John Groves <[email protected]>
>
> > ---
> > fs/famfs/famfs_file.c | 245 ++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 245 insertions(+)
> > create mode 100644 fs/famfs/famfs_file.c
> >
> > diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
> > new file mode 100644
> > index 000000000000..fc667d5f7be8
> > --- /dev/null
> > +++ b/fs/famfs/famfs_file.c
> > @@ -0,0 +1,245 @@
>
> > +static int
> > +famfs_meta_to_dax_offset(
> > + struct inode *inode,
> > + struct iomap *iomap,
> > + loff_t offset,
> > + loff_t len,
> > + unsigned int flags)
> > +{
> > + struct famfs_file_meta *meta = (struct famfs_file_meta *)inode->i_private;
>
> i_private is void * so no need for explicit cast (C spec says this is always fine without)

Yessir.

>
>
> > +
> > +/**
> > + * famfs_iomap_begin()
> > + *
> > + * This function is pretty simple because files are
> > + * * never partially allocated
> > + * * never have holes (never sparse)
> > + * * never "allocate on write"
> > + */
> > +static int
> > +famfs_iomap_begin(
> > + struct inode *inode,
> > + loff_t offset,
> > + loff_t length,
> > + unsigned int flags,
> > + struct iomap *iomap,
> > + struct iomap *srcmap)
> > +{
> > + struct famfs_file_meta *meta = inode->i_private;
> > + size_t size;
> > + int rc;
> > +
> > + size = i_size_read(inode);
> > +
> > + WARN_ON(size != meta->file_size);
> > +
> > + rc = famfs_meta_to_dax_offset(inode, iomap, offset, length, flags);
> > +
> > + return rc;
> return famfs_meta_...

Done

>
> > +}
>
>
> > +static vm_fault_t
> > +famfs_filemap_map_pages(
> > + struct vm_fault *vmf,
> > + pgoff_t start_pgoff,
> > + pgoff_t end_pgoff)
> > +{
> > + vm_fault_t ret;
> > +
> > + ret = filemap_map_pages(vmf, start_pgoff, end_pgoff);
> > + return ret;
> return filename_map_pages()....

Done, thanks

John


2024-02-26 23:09:55

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 14/20] famfs: Add struct file_operations

On 24/02/26 01:32PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:58 -0600
> John Groves <[email protected]> wrote:
>
> > This commit introduces the famfs file_operations. We call
> > thp_get_unmapped_area() to force PMD page alignment. Our read and
> > write handlers (famfs_dax_read_iter() and famfs_dax_write_iter())
> > call dax_iomap_rw() to do the work.
> >
> > famfs_file_invalid() checks for various ways a famfs file can be
> > in an invalid state so we can fail I/O or fault resolution in those
> > cases. Those cases include the following:
> >
> > * No famfs metadata
> > * file i_size does not match the originally allocated size
> > * file is not flagged as DAX
> > * errors were detected previously on the file
> >
> > An invalid file can often be fixed by replaying the log, or by
> > umount/mount/log replay - all of which are user space operations.
> >
> > Signed-off-by: John Groves <[email protected]>
> > ---
> > fs/famfs/famfs_file.c | 136 ++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 136 insertions(+)
> >
> > diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
> > index fc667d5f7be8..5228e9de1e3b 100644
> > --- a/fs/famfs/famfs_file.c
> > +++ b/fs/famfs/famfs_file.c
> > @@ -19,6 +19,142 @@
> > #include <uapi/linux/famfs_ioctl.h>
> > #include "famfs_internal.h"
> >
> > +/*********************************************************************
> > + * file_operations
> > + */
> > +
> > +/* Reject I/O to files that aren't in a valid state */
> > +static ssize_t
> > +famfs_file_invalid(struct inode *inode)
> > +{
> > + size_t i_size = i_size_read(inode);
> > + struct famfs_file_meta *meta = inode->i_private;
> > +
> > + if (!meta) {
> > + pr_err("%s: un-initialized famfs file\n", __func__);
> > + return -EIO;
> > + }
> > + if (i_size != meta->file_size) {
> > + pr_err("%s: something changed the size from %ld to %ld\n",
> > + __func__, meta->file_size, i_size);
> > + meta->error = 1;
> > + return -ENXIO;
> > + }
> > + if (!IS_DAX(inode)) {
> > + pr_err("%s: inode %llx IS_DAX is false\n", __func__, (u64)inode);
> > + meta->error = 1;
> > + return -ENXIO;
> > + }
> > + if (meta->error) {
> > + pr_err("%s: previously detected metadata errors\n", __func__);
> > + meta->error = 1;
>
> Already set? If treating it as only a boolean, maybe make it one?

Done, thanks

John


2024-02-27 00:58:38

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On Mon, Feb 26, 2024 at 1:16 PM John Groves <[email protected]> wrote:
>
> On 24/02/26 07:53AM, Luis Chamberlain wrote:
> > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> > > Run status group 0 (all jobs):
> > > WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
> >
> > > This is run on an xfs file system on a SATA ssd.
> >
> > To compare more closer apples to apples, wouldn't it make more sense
> > to try this with XFS on pmem (with fio -direct=1)?
> >
> > Luis
>
> Makes sense. Here is the same command line I used with xfs before, but
> now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem
> because xfs requires that.
>
> fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs

Could you try with mkfs.xfs -d agcount=1024

Luis

2024-02-27 02:06:19

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On 24/02/26 04:58PM, Luis Chamberlain wrote:
> On Mon, Feb 26, 2024 at 1:16 PM John Groves <[email protected]> wrote:
> >
> > On 24/02/26 07:53AM, Luis Chamberlain wrote:
> > > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> > > > Run status group 0 (all jobs):
> > > > WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
> > >
> > > > This is run on an xfs file system on a SATA ssd.
> > >
> > > To compare more closer apples to apples, wouldn't it make more sense
> > > to try this with XFS on pmem (with fio -direct=1)?
> > >
> > > Luis
> >
> > Makes sense. Here is the same command line I used with xfs before, but
> > now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem
> > because xfs requires that.
> >
> > fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
>
> Could you try with mkfs.xfs -d agcount=1024
>
> Luis

$ luis/fio-xfsdax.sh
+ sudo mkfs.xfs -d agcount=1024 -m reflink=0 -f /dev/pmem0
meta-data=/dev/pmem0 isize=512 agcount=1024, agsize=32768 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=0 bigtime=1 inobtcount=1 nrext64=0
data = bsize=4096 blocks=33554432, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=16384, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
+ sudo mount -o dax /dev/pmem0 /mnt/xfs
+ sudo chown jmg:jmg /mnt/xfs
+ ls -al /mnt/xfs
total 0
drwxr-xr-x 2 jmg jmg 6 Feb 26 19:56 .
drwxr-xr-x. 4 root root 30 Feb 26 14:58 ..
++ nproc
+ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1
..
fio-3.33
Starting 48 processes
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
Jobs: 17 (f=170): [_(2),W(1),_(8),W(2),_(7),W(3),_(2),W(2),_(3),W(2),_(2),W(1),_(2),W(1),_(1),W(3),_(4),W(2)][Jobs: 1 (f=10): [_(47),W(1)][100.0%][w=8022MiB/s][w=4011 IOPS][eta 00m:00s]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=141563: Mon Feb 26 19:56:28 2024
write: IOPS=6578, BW=12.8GiB/s (13.8GB/s)(114GiB/8902msec); 0 zone resets
slat (usec): min=18, max=60593, avg=1230.85, stdev=1799.97
clat (usec): min=2, max=98969, avg=5133.25, stdev=5141.07
lat (usec): min=294, max=99725, avg=6364.09, stdev=5440.30
clat percentiles (usec):
| 1.00th=[ 11], 5.00th=[ 46], 10.00th=[ 217], 20.00th=[ 2376],
| 30.00th=[ 2999], 40.00th=[ 3556], 50.00th=[ 3785], 60.00th=[ 3982],
| 70.00th=[ 4228], 80.00th=[ 7504], 90.00th=[13173], 95.00th=[14091],
| 99.00th=[21890], 99.50th=[27919], 99.90th=[45351], 99.95th=[57934],
| 99.99th=[82314]
bw ( MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719
iops : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719
lat (usec) : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02%
lat (usec) : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75%
lat (msec) : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25%
lat (msec) : 100=0.08%
cpu : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=12.8GiB/s (13.8GB/s), 12.8GiB/s-12.8GiB/s (13.8GB/s-13.8GB/s), io=114GiB (123GB), run=8902-8902msec

Disk stats (read/write):
pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


I ran it several times with similar results.

Regards,
John


2024-02-27 10:34:50

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [RFC PATCH 09/20] famfs: Add super_operations

On Mon, 26 Feb 2024 15:47:53 -0600
John Groves <[email protected]> wrote:

> On 24/02/26 12:51PM, Jonathan Cameron wrote:
> > On Fri, 23 Feb 2024 11:41:53 -0600
> > John Groves <[email protected]> wrote:
> >
> > > Introduce the famfs superblock operations
> > >
> > > Signed-off-by: John Groves <[email protected]>
> > > ---
> > > fs/famfs/famfs_inode.c | 72 ++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 72 insertions(+)
> > > create mode 100644 fs/famfs/famfs_inode.c
> > >
> > > diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> > > new file mode 100644
> > > index 000000000000..3329aff000d1
> > > --- /dev/null
> > > +++ b/fs/famfs/famfs_inode.c
> > > @@ -0,0 +1,72 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * famfs - dax file system for shared fabric-attached memory
> > > + *
> > > + * Copyright 2023-2024 Micron Technology, inc
> > > + *
> > > + * This file system, originally based on ramfs the dax support from xfs,
> > > + * is intended to allow multiple host systems to mount a common file system
> > > + * view of dax files that map to shared memory.
> > > + */
> > > +
> > > +#include <linux/fs.h>
> > > +#include <linux/pagemap.h>
> > > +#include <linux/highmem.h>
> > > +#include <linux/time.h>
> > > +#include <linux/init.h>
> > > +#include <linux/string.h>
> > > +#include <linux/backing-dev.h>
> > > +#include <linux/sched.h>
> > > +#include <linux/parser.h>
> > > +#include <linux/magic.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/uaccess.h>
> > > +#include <linux/fs_context.h>
> > > +#include <linux/fs_parser.h>
> > > +#include <linux/seq_file.h>
> > > +#include <linux/dax.h>
> > > +#include <linux/hugetlb.h>
> > > +#include <linux/uio.h>
> > > +#include <linux/iomap.h>
> > > +#include <linux/path.h>
> > > +#include <linux/namei.h>
> > > +#include <linux/pfn_t.h>
> > > +#include <linux/blkdev.h>
> >
> > That's a lot of header for such a small patch.. I'm going to guess
> > they aren't all used - bring them in as you need them - I hope
> > you never need some of these!
>
> I didn't phase in headers in this series. Based on these recommendations,
> the next version of this series is gonna have to be 100% constructed from
> scratch, but okay. My head hurts just thinking about it. I need a nap...
>
> I've been rebasing for 3 weeks to get this series out, and it occurs to
> me that maybe there are tools I'm not aware of that make it eaiser? I'm
> just typing "rebase -i..." 200 times a day. Is there a less soul-crushing way?

Hmm. There are things that make it easier to pick and chose parts of a
big diff for different patches. Some combination of
git reset HEAD~1
and one of the 'graphical' tools like tig that let you pick lines.

That lets you quickly break up a patch where you want to move things, then
you can reorder the patches to put them next to where you want to move
changes to and rely on git rebase -i with f or s to squash them.

Figuring out optimum path to the eventual break up you want is
a skill though. When doing this sort of mangling I tend to get it wrong
and shout at my computer a few times a day ;)
Then git rebase --abort and try again.

End result is that you end up with coherent series and it looks like
you wrote perfect code in nice steps from the start!

Jonathan



2024-02-27 17:48:47

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 09/20] famfs: Add super_operations

On 24/02/26 12:51PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:53 -0600
> John Groves <[email protected]> wrote:
> > + */
> > +static int famfs_show_options(
> > + struct seq_file *m,
> > + struct dentry *root)
> Not that familiar with fs code, but this unusual kernel style. I'd go with
> something more common
>
> static int famfs_show_options(struct seq_file *m, struct dentry *root)

Actually, xfs does function declarations and prototypes this way, not sure if
it's everywhere. But I like this format because changing one argument usually
doesn't put un-changed args into the diff.

So I may keep this style after all.

John

2024-02-27 22:17:30

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 17/20] famfs: Add module stuff

On 24/02/26 01:47PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:42:01 -0600
> John Groves <[email protected]> wrote:
>
> > This commit introduces the module init and exit machinery for famfs.
> >
> > Signed-off-by: John Groves <[email protected]>
> I'd prefer to see this from the start with the functionality of the module
> built up as you go + build logic in place. Makes it easy to spot places
> where the patches aren't appropriately self constrained.
> > ---
> > fs/famfs/famfs_inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 44 insertions(+)
> >
> > diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> > index ab46ec50b70d..0d659820e8ff 100644
> > --- a/fs/famfs/famfs_inode.c
> > +++ b/fs/famfs/famfs_inode.c
> > @@ -462,4 +462,48 @@ static struct file_system_type famfs_fs_type = {
> > .fs_flags = FS_USERNS_MOUNT,
> > };
> >
> > +/*****************************************************************************************
> > + * Module stuff
>
> I'd drop these drivers structure comments. They add little beyond
> a high possibility of being wrong after the code has evolved a bit.

Probably will do with the module-ops-up refactor for v2

>
> > + */
> > +static struct kobject *famfs_kobj;
> > +
> > +static int __init init_famfs_fs(void)
> > +{
> > + int rc;
> > +
> > +#if defined(CONFIG_DEV_DAX_IOMAP)
> > + pr_notice("%s: Your kernel supports famfs on /dev/dax\n", __func__);
> > +#else
> > + pr_notice("%s: Your kernel does not support famfs on /dev/dax\n", __func__);
> > +#endif
> > + famfs_kobj = kobject_create_and_add(MODULE_NAME, fs_kobj);
> > + if (!famfs_kobj) {
> > + pr_warn("Failed to create kobject\n");
> > + return -ENOMEM;
> > + }
> > +
> > + rc = sysfs_create_group(famfs_kobj, &famfs_attr_group);
> > + if (rc) {
> > + kobject_put(famfs_kobj);
> > + pr_warn("%s: Failed to create sysfs group\n", __func__);
> > + return rc;
> > + }
> > +
> > + return register_filesystem(&famfs_fs_type);
>
> If this fails, do we not leak the kobj and sysfs groups?

Good catch, thanks! Fixed for now- but the kobj is also likely to go away. Will
endeavor to get it right...

Thanks,
John


2024-02-27 22:27:26

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 18/20] famfs: Support character dax via the dev_dax_iomap patch

On 24/02/26 01:52PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:42:02 -0600
> John Groves <[email protected]> wrote:
>
> > This commit introduces the ability to open a character /dev/dax device
> > instead of a block /dev/pmem device. This rests on the dev_dax_iomap
> > patches earlier in this series.
>
> Not sure the back reference is needed given it's in the series.

Roger that

>
> >
> > Signed-off-by: John Groves <[email protected]>
> > ---
> > fs/famfs/famfs_inode.c | 97 +++++++++++++++++++++++++++++++++++++-----
> > 1 file changed, 87 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> > index 0d659820e8ff..7d65ac497147 100644
> > --- a/fs/famfs/famfs_inode.c
> > +++ b/fs/famfs/famfs_inode.c
> > @@ -215,6 +215,93 @@ static const struct super_operations famfs_ops = {
> > .show_options = famfs_show_options,
> > };
> >
> > +/*****************************************************************************/
> > +
> > +#if defined(CONFIG_DEV_DAX_IOMAP)
> > +
> > +/*
> > + * famfs dax_operations (for char dax)
> > + */
> > +static int
> > +famfs_dax_notify_failure(struct dax_device *dax_dev, u64 offset,
> > + u64 len, int mf_flags)
> > +{
> > + pr_err("%s: offset %lld len %llu flags %x\n", __func__,
> > + offset, len, mf_flags);
> > + return -EOPNOTSUPP;
> > +}
> > +
> > +static const struct dax_holder_operations famfs_dax_holder_ops = {
> > + .notify_failure = famfs_dax_notify_failure,
> > +};
> > +
> > +/*****************************************************************************/
> > +
> > +/**
> > + * famfs_open_char_device()
> > + *
> > + * Open a /dev/dax device. This only works in kernels with the dev_dax_iomap patch
>
> That comment you definitely don't need as this won't get merged without
> that patch being in place.

This will be gone from v2. I'm 90% sure there is no reason to keep the block
device backing support (pmem), since devdax is the point AND pmem can be
converted to devdax mode. So famfs will become devdax only...etc.

This was under development for quite a few months, and actually working,
I got the dev_dax_iomap right (er, "right enough" for it to work :D). But now
that dev_dax_iomap looks basically stable, pmem/block support can come out.

>
>
> > + */
> > +static int
> > +famfs_open_char_device(
> > + struct super_block *sb,
> > + struct fs_context *fc)
> > +{
> > + struct famfs_fs_info *fsi = sb->s_fs_info;
> > + struct dax_device *dax_devp;
> > + struct inode *daxdev_inode;
> > +
> > + int rc = 0;
> set in all paths where it's used.
>
> > +
> > + pr_notice("%s: Opening character dax device %s\n", __func__, fc->source);
>
> pr_debug

Done

>
> > +
> > + fsi->dax_filp = filp_open(fc->source, O_RDWR, 0);
> > + if (IS_ERR(fsi->dax_filp)) {
> > + pr_err("%s: failed to open dax device %s\n",
> > + __func__, fc->source);
> > + fsi->dax_filp = NULL;
> Better to use a local variable
>
> fp = filp_open(fc->source, O_RDWR, 0);
> if (IS_ERR(fp)) {
> pr_err.
> return;
> }
> fsi->dax_filp = fp;
> or similar.

Done, thanks.

>
> > + return PTR_ERR(fsi->dax_filp);
> > + }
> > +
> > + daxdev_inode = file_inode(fsi->dax_filp);
> > + dax_devp = inode_dax(daxdev_inode);
> > + if (IS_ERR(dax_devp)) {
> > + pr_err("%s: unable to get daxdev from inode for %s\n",
> > + __func__, fc->source);
> > + rc = -ENODEV;
> > + goto char_err;
> > + }
> > +
> > + rc = fs_dax_get(dax_devp, fsi, &famfs_dax_holder_ops);
> > + if (rc) {
> > + pr_info("%s: err attaching famfs_dax_holder_ops\n", __func__);
> > + goto char_err;
> > + }
> > +
> > + fsi->bdev_handle = NULL;
> > + fsi->dax_devp = dax_devp;
> > +
> > + return 0;
> > +
> > +char_err:
> > + filp_close(fsi->dax_filp, NULL);
>
> You carefully set fsi->dax_filp to null in other other error paths.
> Why there and not here?

Why indeed - done now.

>
> > + return rc;
> > +}
> > +
> > +#else /* CONFIG_DEV_DAX_IOMAP */
> > +static int
> > +famfs_open_char_device(
> > + struct super_block *sb,
> > + struct fs_context *fc)
> > +{
> > + pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
> > + __func__, fc->source);
> > + return -ENODEV;
> > +}
> > +
> > +
> > +#endif /* CONFIG_DEV_DAX_IOMAP */
> > +
> > /***************************************************************************************
> > * dax_holder_operations for block dax
> > */
> > @@ -236,16 +323,6 @@ const struct dax_holder_operations famfs_blk_dax_holder_ops = {
> > .notify_failure = famfs_blk_dax_notify_failure,
> > };
> >
>
> Put it in right place earlier! Makes this less noisy.

This will be eliminated by the move to /dev/dax-only

Thanks,
John


2024-02-29 02:15:54

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On Mon, Feb 26, 2024 at 08:05:58PM -0600, John Groves wrote:
> On 24/02/26 04:58PM, Luis Chamberlain wrote:
> > On Mon, Feb 26, 2024 at 1:16 PM John Groves <[email protected]> wrote:
> > >
> > > On 24/02/26 07:53AM, Luis Chamberlain wrote:
> > > > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> > > > > Run status group 0 (all jobs):
> > > > > WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
> > > >
> > > > > This is run on an xfs file system on a SATA ssd.
> > > >
> > > > To compare more closer apples to apples, wouldn't it make more sense
> > > > to try this with XFS on pmem (with fio -direct=1)?
> > > >
> > > > Luis
> > >
> > > Makes sense. Here is the same command line I used with xfs before, but
> > > now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem
> > > because xfs requires that.
> > >
> > > fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
> >
> > Could you try with mkfs.xfs -d agcount=1024

Won't change anything for the better, may make things worse.

> bw ( MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719
> iops : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719
> lat (usec) : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02%
> lat (usec) : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75%
> lat (msec) : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25%

Most of the IO latencies are up round the 4-20ms marks. That seems
kinda high for a 2MB IO. With a memcpy speed of 10GB/s, the 2MB
should only take a couple of hundred microseconds. For Famfs, the
latencies appear to be around 1-4ms.

So where's all that extra time coming from?


> lat (msec) : 100=0.08%
> cpu : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511

And why is system time reporting at almost zero instead of almost
all the remaining cpu time (i.e. up at 80-90%)?

Can you run call-graph kernel profiles for XFS and famfs whilst
running this workload so we have some insight into what is behaving
differently here?

-Dave.
--
Dave Chinner
[email protected]

2024-02-29 06:52:58

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On Fri, Feb 23, 2024 at 7:42 PM John Groves <[email protected]> wrote:
>
> This patch set introduces famfs[1] - a special-purpose fs-dax file system
> for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> CXL-specific in anyway way.
>
> * Famfs creates a simple access method for storing and sharing data in
> sharable memory. The memory is exposed and accessed as memory-mappable
> dax files.
> * Famfs supports multiple hosts mounting the same file system from the
> same memory (something existing fs-dax file systems don't do).
> * A famfs file system can be created on either a /dev/pmem device in fs-dax
> mode, or a /dev/dax device in devdax mode (the latter depending on
> patches 2-6 of this series).
>
> The famfs kernel file system is part the famfs framework; additional
> components in user space[2] handle metadata and direct the famfs kernel
> module to instantiate files that map to specific memory. The famfs user
> space has documentation and a reasonably thorough test suite.
>

So can we say that Famfs is Fuse specialized for DAX?

I am asking because you seem to have asked it first:
https://lore.kernel.org/linux-fsdevel/0100018b2439ebf3-a442db6f-f685-4bc4-b4b0-28dc333f6712-000000@email.amazonses.com/
I guess that you did not get your answers to your questions before or at LPC?

I did not see your question back in October.
Let me try to answer your questions and we can discuss later if a new dedicated
kernel driver + userspace API is really needed, or if FUSE could be used as is
extended for your needs.

You wrote:
"...My naive reading of the existence of some sort of fuse/dax support
for virtiofs
suggested that there might be a way of doing this - but I may be wrong
about that."

I'm not virtiofs expert, but I don't think that you are wrong about this.
IIUC, virtiofsd could map arbitrary memory region to any fuse file mmaped
by virtiofs client.

So what are the gaps between virtiofs and famfs that justify a new filesystem
driver and new userspace API?

Thanks,
Amir.

2024-02-29 14:53:07

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system


Hi Dave!

On 24/02/29 01:15PM, Dave Chinner wrote:
> On Mon, Feb 26, 2024 at 08:05:58PM -0600, John Groves wrote:
> > On 24/02/26 04:58PM, Luis Chamberlain wrote:
> > > On Mon, Feb 26, 2024 at 1:16 PM John Groves <[email protected]> wrote:
> > > >
> > > > On 24/02/26 07:53AM, Luis Chamberlain wrote:
> > > > > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> > > > > > Run status group 0 (all jobs):
> > > > > > WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
> > > > >
> > > > > > This is run on an xfs file system on a SATA ssd.
> > > > >
> > > > > To compare more closer apples to apples, wouldn't it make more sense
> > > > > to try this with XFS on pmem (with fio -direct=1)?
> > > > >
> > > > > Luis
> > > >
> > > > Makes sense. Here is the same command line I used with xfs before, but
> > > > now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem
> > > > because xfs requires that.
> > > >
> > > > fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
> > >
> > > Could you try with mkfs.xfs -d agcount=1024
>
> Won't change anything for the better, may make things worse.

I dropped that arg, though performance looked about the same either way.

>
> > bw ( MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719
> > iops : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719
> > lat (usec) : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02%
> > lat (usec) : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75%
> > lat (msec) : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25%
>
> Most of the IO latencies are up round the 4-20ms marks. That seems
> kinda high for a 2MB IO. With a memcpy speed of 10GB/s, the 2MB
> should only take a couple of hundred microseconds. For Famfs, the
> latencies appear to be around 1-4ms.
>
> So where's all that extra time coming from?

Below, you will see two runs with performance and latency distribution
about the same as famfs (the answer for that was --fallocate=native).

>
>
> > lat (msec) : 100=0.08%
> > cpu : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511
>
> And why is system time reporting at almost zero instead of almost
> all the remaining cpu time (i.e. up at 80-90%)?

Something weird is going on with the cpu reporting. Sometimes sys=~0, but other times
it's about what you would expect. I suspect some sort of measurement error,
like maybe the method doesn't work with my cpu model? (I'm grasping, but with
a somewhat rational basis...)

I pasted two xfs runs below. The first has the wonky cpu sys value, and
the second looks about like what one would expect.

>
> Can you run call-graph kernel profiles for XFS and famfs whilst
> running this workload so we have some insight into what is behaving
> differently here?

Can you point me to an example of how to do that?

>
> -Dave.
> --
> Dave Chinner
> [email protected]


I'd been thinking about the ~2x gap for a few days, and the most obvious
difference is famfs files must be preallocated (like fallocate, but works
a bit differently since allocation happens in user space). I just checked
one of the xfs files, and it had maybe 80 extents (whereas the famfs
files always have 1 extent here).

FWIW I ran xfs with and without io_uring, and there was no apparent
difference (which makes sense to me because it's not block I/O).

The prior ~2x gap still seems like a lot of overhead for extent list
mapping to memory, but adding --fallocate=native to the xfs test brought
it into line with famfs:


+ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=native --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1
..
fio-3.33
Starting 48 processes
Jobs: 38 (f=380): [W(5),_(1),W(12),_(1),W(3),_(1),W(2),_(1),W(2),_(1),W(1),_(1),W(1),_(1),W(6),_(1),W(6),_(2)][57.1%][w=28.0GiB/s][w=14.3k IOPS][eta 00m:03s]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=1452590: Thu Feb 29 07:46:06 2024
write: IOPS=15.3k, BW=29.8GiB/s (32.0GB/s)(114GiB/3838msec); 0 zone resets
slat (usec): min=17, max=55364, avg=668.20, stdev=1120.41
clat (nsec): min=1368, max=99619k, avg=1982477.32, stdev=2198309.32
lat (usec): min=179, max=99813, avg=2650.68, stdev=2485.15
clat percentiles (usec):
| 1.00th=[ 4], 5.00th=[ 14], 10.00th=[ 172], 20.00th=[ 420],
| 30.00th=[ 644], 40.00th=[ 1057], 50.00th=[ 1582], 60.00th=[ 2008],
| 70.00th=[ 2343], 80.00th=[ 3097], 90.00th=[ 4555], 95.00th=[ 5473],
| 99.00th=[ 8717], 99.50th=[11863], 99.90th=[20055], 99.95th=[27657],
| 99.99th=[49546]
bw ( MiB/s): min=20095, max=59216, per=100.00%, avg=35985.47, stdev=318.61, samples=280
iops : min=10031, max=29587, avg=17970.76, stdev=159.29, samples=280
lat (usec) : 2=0.06%, 4=1.02%, 10=2.33%, 20=4.29%, 50=1.85%
lat (usec) : 100=0.20%, 250=3.26%, 500=11.23%, 750=8.87%, 1000=5.82%
lat (msec) : 2=20.95%, 4=26.74%, 10=12.60%, 20=0.66%, 50=0.09%
lat (msec) : 100=0.01%
cpu : usr=15.48%, sys=1.17%, ctx=62654, majf=0, minf=22801
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=29.8GiB/s (32.0GB/s), 29.8GiB/s-29.8GiB/s (32.0GB/s-32.0GB/s), io=114GiB (123GB), run=3838-3838msec

Disk stats (read/write):
pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


## Here is a run where the cpu looks "normal"

+ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=native --numjobs=48 --create_on_open=0 --direct=1 --directory=/mnt/xfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=psync, iodepth=1
..
fio-3.33
Starting 48 processes
Jobs: 19 (f=190): [W(2),_(1),W(2),_(8),W(1),_(3),W(1),_(1),W(2),_(2),W(1),_(1),W(3),_(2),W(1),_(1),W(1),_(2),W(2),_(7),W(3),_(1)][55.6%][w=26.7GiB/s][w=13.6k IOPS][eta 00m:04s]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=1463615: Thu Feb 29 08:19:53 2024
write: IOPS=12.4k, BW=24.1GiB/s (25.9GB/s)(114GiB/4736msec); 0 zone resets
clat (usec): min=138, max=117903, avg=2581.99, stdev=2704.61
lat (usec): min=152, max=120405, avg=3019.04, stdev=2964.47
clat percentiles (usec):
| 1.00th=[ 161], 5.00th=[ 249], 10.00th=[ 627], 20.00th=[ 1270],
| 30.00th=[ 1631], 40.00th=[ 1942], 50.00th=[ 2089], 60.00th=[ 2212],
| 70.00th=[ 2343], 80.00th=[ 2704], 90.00th=[ 5866], 95.00th=[ 6849],
| 99.00th=[12387], 99.50th=[14353], 99.90th=[26084], 99.95th=[38536],
| 99.99th=[78119]
bw ( MiB/s): min=21204, max=47040, per=100.00%, avg=29005.40, stdev=237.31, samples=329
iops : min=10577, max=23497, avg=14479.74, stdev=118.65, samples=329
lat (usec) : 250=5.04%, 500=4.03%, 750=2.37%, 1000=3.13%
lat (msec) : 2=29.39%, 4=41.05%, 10=13.37%, 20=1.45%, 50=0.15%
lat (msec) : 100=0.03%, 250=0.01%
cpu : usr=14.43%, sys=78.18%, ctx=5272, majf=0, minf=15708
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=24.1GiB/s (25.9GB/s), 24.1GiB/s-24.1GiB/s (25.9GB/s-25.9GB/s), io=114GiB (123GB), run=4736-4736msec

Disk stats (read/write):
pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


Cheers,
John


2024-02-29 22:16:50

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On 24/02/29 08:52AM, Amir Goldstein wrote:
> On Fri, Feb 23, 2024 at 7:42 PM John Groves <[email protected]> wrote:
> >
> > This patch set introduces famfs[1] - a special-purpose fs-dax file system
> > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> > CXL-specific in anyway way.
> >
> > * Famfs creates a simple access method for storing and sharing data in
> > sharable memory. The memory is exposed and accessed as memory-mappable
> > dax files.
> > * Famfs supports multiple hosts mounting the same file system from the
> > same memory (something existing fs-dax file systems don't do).
> > * A famfs file system can be created on either a /dev/pmem device in fs-dax
> > mode, or a /dev/dax device in devdax mode (the latter depending on
> > patches 2-6 of this series).
> >
> > The famfs kernel file system is part the famfs framework; additional
> > components in user space[2] handle metadata and direct the famfs kernel
> > module to instantiate files that map to specific memory. The famfs user
> > space has documentation and a reasonably thorough test suite.
> >
>
> So can we say that Famfs is Fuse specialized for DAX?
>
> I am asking because you seem to have asked it first:
> https://lore.kernel.org/linux-fsdevel/0100018b2439ebf3-a442db6f-f685-4bc4-b4b0-28dc333f6712-000000@email.amazonses.com/
> I guess that you did not get your answers to your questions before or at LPC?

Thanks for paying attention Amir. I think there is some validity to thinking
of famfs as Fuse for DAX. Administration / metadata originating in user space
is similar (but doing it this way also helps reduce RAS exposure to memory
that might have a more complex connection path).

One way it differs from fuse is that famfs is very much aimed at use
cases that require performance. *Accessing* files must run at full
memory speeds.

>
> I did not see your question back in October.
> Let me try to answer your questions and we can discuss later if a new dedicated
> kernel driver + userspace API is really needed, or if FUSE could be used as is
> extended for your needs.
>
> You wrote:
> "...My naive reading of the existence of some sort of fuse/dax support
> for virtiofs
> suggested that there might be a way of doing this - but I may be wrong
> about that."
>
> I'm not virtiofs expert, but I don't think that you are wrong about this.
> IIUC, virtiofsd could map arbitrary memory region to any fuse file mmaped
> by virtiofs client.
>
> So what are the gaps between virtiofs and famfs that justify a new filesystem
> driver and new userspace API?

I have a lot of thoughts here, and an actual conversation might be good
sooner rather than later. I hope to be at LSFMM to discuss this - if you agree,
put in a vote for my topic ;). But if you want to talk sooner than that, I'm
interested.

I think one piece of evidence that this isn't possible with Fuse today is that
I had to plumb the iomap interface for /dev/dax in this patch set. That is the
way that fs-dax file systems communicate with the dax layer for fault
resolution. If fuse/virtiofs handles dax somehow without the iomap interface,
I suspect it's doing something somehow simpler, /and/ that might need to get
reconciled with the fs-dax methodology. Or maybe I don't know what I'm talking
about (in which case, please help :D).

I think one thing that might make sense would be to bring up this functionality
as a standalone file system, and then consider merging it into fuse when &
if the time seems right.

Famfs doesn't currently have any up-calls. User space plays the log and tells
the kmod to instantiate files with extent lists to dax. Access happens with
zero user space involvement.

The important thing, the thing I'm currently paid for, is making it
practical to use disaggregated shared memory - it's ultimately not important
which mechanism is used to enable a filesystem access method for memory.

But caching metadata in the kernel for efficient fault handling is the
only way to get it to perform at "memory speeds" so that appears critical.

One final observation: famfs has significantly more code in user space than
in kernel space, and it's the user side that is likely to grow over time.
That logic is at least theoretically independent of the kernel ABI.

>
> Thanks,
> Amir.

Thanks!
John


2024-02-23 17:56:12

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 12/20] famfs: Add inode_operations and file_system_type

This commit introduces the famfs inode_operations. There is nothing really
unique to famfs here in the inode_operations..

This commit also introduces the famfs_file_system_type struct and the
famfs_kill_sb() function.

Signed-off-by: John Groves <[email protected]>
---
fs/famfs/famfs_inode.c | 132 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 132 insertions(+)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index f98f82962d7b..ab46ec50b70d 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -85,6 +85,109 @@ static struct inode *famfs_get_inode(
return inode;
}

+/***************************************************************************
+ * famfs inode_operations: these are currently pretty much boilerplate
+ */
+
+static const struct inode_operations famfs_file_inode_operations = {
+ /* All generic */
+ .setattr = simple_setattr,
+ .getattr = simple_getattr,
+};
+
+
+/*
+ * File creation. Allocate an inode, and we're done..
+ */
+/* SMP-safe */
+static int
+famfs_mknod(
+ struct mnt_idmap *idmap,
+ struct inode *dir,
+ struct dentry *dentry,
+ umode_t mode,
+ dev_t dev)
+{
+ struct inode *inode = famfs_get_inode(dir->i_sb, dir, mode, dev);
+ int error = -ENOSPC;
+
+ if (inode) {
+ struct timespec64 tv;
+
+ d_instantiate(dentry, inode);
+ dget(dentry); /* Extra count - pin the dentry in core */
+ error = 0;
+ tv = inode_set_ctime_current(inode);
+ inode_set_mtime_to_ts(inode, tv);
+ inode_set_atime_to_ts(inode, tv);
+ }
+ return error;
+}
+
+static int famfs_mkdir(
+ struct mnt_idmap *idmap,
+ struct inode *dir,
+ struct dentry *dentry,
+ umode_t mode)
+{
+ int retval = famfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFDIR, 0);
+
+ if (!retval)
+ inc_nlink(dir);
+
+ return retval;
+}
+
+static int famfs_create(
+ struct mnt_idmap *idmap,
+ struct inode *dir,
+ struct dentry *dentry,
+ umode_t mode,
+ bool excl)
+{
+ return famfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFREG, 0);
+}
+
+static int famfs_symlink(
+ struct mnt_idmap *idmap,
+ struct inode *dir,
+ struct dentry *dentry,
+ const char *symname)
+{
+ struct inode *inode;
+ int error = -ENOSPC;
+
+ inode = famfs_get_inode(dir->i_sb, dir, S_IFLNK | 0777, 0);
+ if (inode) {
+ int l = strlen(symname)+1;
+
+ error = page_symlink(inode, symname, l);
+ if (!error) {
+ struct timespec64 tv;
+
+ d_instantiate(dentry, inode);
+ dget(dentry);
+ tv = inode_set_ctime_current(inode);
+ inode_set_mtime_to_ts(inode, tv);
+ inode_set_atime_to_ts(inode, tv);
+ } else
+ iput(inode);
+ }
+ return error;
+}
+
+static const struct inode_operations famfs_dir_inode_operations = {
+ .create = famfs_create,
+ .lookup = simple_lookup,
+ .link = simple_link,
+ .unlink = simple_unlink,
+ .symlink = famfs_symlink,
+ .mkdir = famfs_mkdir,
+ .rmdir = simple_rmdir,
+ .mknod = famfs_mknod,
+ .rename = simple_rename,
+};
+
/**********************************************************************************
* famfs super_operations
*
@@ -329,5 +432,34 @@ static int famfs_init_fs_context(struct fs_context *fc)
return 0;
}

+static void famfs_kill_sb(struct super_block *sb)
+{
+ struct famfs_fs_info *fsi = sb->s_fs_info;
+
+ mutex_lock(&famfs_context_mutex);
+ list_del(&fsi->fsi_list);
+ mutex_unlock(&famfs_context_mutex);
+
+ if (fsi->bdev_handle)
+ bdev_release(fsi->bdev_handle);
+ if (fsi->dax_devp)
+ fs_put_dax(fsi->dax_devp, fsi);
+ if (fsi->dax_filp) /* This only happens if it's char dax */
+ filp_close(fsi->dax_filp, NULL);
+
+ if (fsi && fsi->rootdev)
+ kfree(fsi->rootdev);
+ kfree(fsi);
+ kill_litter_super(sb);
+}
+
+#define MODULE_NAME "famfs"
+static struct file_system_type famfs_fs_type = {
+ .name = MODULE_NAME,
+ .init_fs_context = famfs_init_fs_context,
+ .parameters = famfs_fs_parameters,
+ .kill_sb = famfs_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
+};

MODULE_LICENSE("GPL");
--
2.43.0


2024-02-23 17:55:15

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 08/20] famfs: Add famfs_internal.h

Add the famfs_internal.h include file. This contains internal data
structures such as the per-file metadata structure (famfs_file_meta)
and extent formats.

Signed-off-by: John Groves <[email protected]>
---
fs/famfs/famfs_internal.h | 53 +++++++++++++++++++++++++++++++++++++++
1 file changed, 53 insertions(+)
create mode 100644 fs/famfs/famfs_internal.h

diff --git a/fs/famfs/famfs_internal.h b/fs/famfs/famfs_internal.h
new file mode 100644
index 000000000000..af3990d43305
--- /dev/null
+++ b/fs/famfs/famfs_internal.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2024 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+#ifndef FAMFS_INTERNAL_H
+#define FAMFS_INTERNAL_H
+
+#include <linux/atomic.h>
+#include <linux/famfs_ioctl.h>
+
+#define FAMFS_MAGIC 0x87b282ff
+
+#define FAMFS_BLKDEV_MODE (FMODE_READ|FMODE_WRITE)
+
+extern const struct file_operations famfs_file_operations;
+
+/*
+ * Each famfs dax file has this hanging from its inode->i_private.
+ */
+struct famfs_file_meta {
+ int error;
+ enum famfs_file_type file_type;
+ size_t file_size;
+ enum extent_type tfs_extent_type;
+ size_t tfs_extent_ct;
+ struct famfs_extent tfs_extents[]; /* flexible array */
+};
+
+struct famfs_mount_opts {
+ umode_t mode;
+};
+
+extern const struct iomap_ops famfs_iomap_ops;
+extern const struct vm_operations_struct famfs_file_vm_ops;
+
+#define ROOTDEV_STRLEN 80
+
+struct famfs_fs_info {
+ struct famfs_mount_opts mount_opts;
+ struct file *dax_filp;
+ struct dax_device *dax_devp;
+ struct bdev_handle *bdev_handle;
+ struct list_head fsi_list;
+ char *rootdev;
+};
+
+#endif /* FAMFS_INTERNAL_H */
--
2.43.0


2024-02-23 17:57:04

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing

Add famfs Kconfig and Makefile, and hook into fs/Kconfig and fs/Makefile

Signed-off-by: John Groves <[email protected]>
---
fs/Kconfig | 2 ++
fs/Makefile | 1 +
fs/famfs/Kconfig | 10 ++++++++++
fs/famfs/Makefile | 5 +++++
4 files changed, 18 insertions(+)
create mode 100644 fs/famfs/Kconfig
create mode 100644 fs/famfs/Makefile

diff --git a/fs/Kconfig b/fs/Kconfig
index 89fdbefd1075..8a11625a54a2 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -141,6 +141,8 @@ source "fs/autofs/Kconfig"
source "fs/fuse/Kconfig"
source "fs/overlayfs/Kconfig"

+source "fs/famfs/Kconfig"
+
menu "Caches"

source "fs/netfs/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index c09016257f05..382c1ea4f4c3 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -130,3 +130,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/
obj-$(CONFIG_EROFS_FS) += erofs/
obj-$(CONFIG_VBOXSF_FS) += vboxsf/
obj-$(CONFIG_ZONEFS_FS) += zonefs/
+obj-$(CONFIG_FAMFS) += famfs/
diff --git a/fs/famfs/Kconfig b/fs/famfs/Kconfig
new file mode 100644
index 000000000000..e450928d8912
--- /dev/null
+++ b/fs/famfs/Kconfig
@@ -0,0 +1,10 @@
+
+
+config FAMFS
+ tristate "famfs: shared memory file system"
+ depends on DEV_DAX && FS_DAX
+ help
+ Support for the famfs file system. Famfs is a dax file system that
+ can support scale-out shared access to fabric-attached memory
+ (e.g. CXL shared memory). Famfs is not a general purpose file system;
+ it is an enabler for data sets in shared memory.
diff --git a/fs/famfs/Makefile b/fs/famfs/Makefile
new file mode 100644
index 000000000000..8cac90c090a4
--- /dev/null
+++ b/fs/famfs/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_FAMFS) += famfs.o
+
+famfs-y := famfs_inode.o famfs_file.o
--
2.43.0


2024-02-23 17:44:46

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h

Add uapi include file for famfs. The famfs user space uses ioctl on
individual files to pass in mapping information and file size. This
would be hard to do via sysfs or other means, since it's
file-specific.

Signed-off-by: John Groves <[email protected]>
---
include/uapi/linux/famfs_ioctl.h | 56 ++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)
create mode 100644 include/uapi/linux/famfs_ioctl.h

diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
new file mode 100644
index 000000000000..6b3e6452d02f
--- /dev/null
+++ b/include/uapi/linux/famfs_ioctl.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2024 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+#ifndef FAMFS_IOCTL_H
+#define FAMFS_IOCTL_H
+
+#include <linux/ioctl.h>
+#include <linux/uuid.h>
+
+#define FAMFS_MAX_EXTENTS 2
+
+enum extent_type {
+ SIMPLE_DAX_EXTENT = 13,
+ INVALID_EXTENT_TYPE,
+};
+
+struct famfs_extent {
+ __u64 offset;
+ __u64 len;
+};
+
+enum famfs_file_type {
+ FAMFS_REG,
+ FAMFS_SUPERBLOCK,
+ FAMFS_LOG,
+};
+
+/**
+ * struct famfs_ioc_map
+ *
+ * This is the metadata that indicates where the memory is for a famfs file
+ */
+struct famfs_ioc_map {
+ enum extent_type extent_type;
+ enum famfs_file_type file_type;
+ __u64 file_size;
+ __u64 ext_list_count;
+ struct famfs_extent ext_list[FAMFS_MAX_EXTENTS];
+};
+
+#define FAMFSIOC_MAGIC 'u'
+
+/* famfs file ioctl opcodes */
+#define FAMFSIOC_MAP_CREATE _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
+#define FAMFSIOC_MAP_GET _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
+#define FAMFSIOC_MAP_GETEXT _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
+#define FAMFSIOC_NOP _IO(FAMFSIOC_MAGIC, 4)
+
+#endif /* FAMFS_IOCTL_H */
--
2.43.0


2024-02-23 17:47:35

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 15/20] famfs: Add ioctl to file_operations

This commit introduces the per-file ioctl function famfs_file_ioctl()
into struct file_operations, and introduces the famfs_file_init_dax()
function (which is called by famfs_file_ioct())

famfs_file_init_dax() associates a dax extent list with a file, making
it into a proper famfs file. It is called from the FAMFSIOC_MAP_CREATE
ioctl. Starting with an empty file (which is basically a ramfs file),
this turns the file into a DAX file backed by the specified extent list.

The other ioctls are:

FAMFSIOC_NOP - A convenient way for user space to verify it's a famfs file
FAMFSIOC_MAP_GET - Get the header of the metadata for a file
FAMFSIOC_MAP_GETEXT - Get the extents for a file

The latter two, together, are comparable to xfs_bmap. Our user space tools
use them primarly in testing.

Signed-off-by: John Groves <[email protected]>
---
fs/famfs/famfs_file.c | 226 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 226 insertions(+)

diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
index 5228e9de1e3b..fd42d5966982 100644
--- a/fs/famfs/famfs_file.c
+++ b/fs/famfs/famfs_file.c
@@ -19,6 +19,231 @@
#include <uapi/linux/famfs_ioctl.h>
#include "famfs_internal.h"

+/**
+ * famfs_map_meta_alloc() - Allocate famfs file metadata
+ * @mapp: Pointer to an mcache_map_meta pointer
+ * @ext_count: The number of extents needed
+ */
+static int
+famfs_meta_alloc(
+ struct famfs_file_meta **metap,
+ size_t ext_count)
+{
+ struct famfs_file_meta *meta;
+ size_t metasz;
+
+ *metap = NULL;
+
+ metasz = sizeof(*meta) + sizeof(*(meta->tfs_extents)) * ext_count;
+
+ meta = kzalloc(metasz, GFP_KERNEL);
+ if (!meta)
+ return -ENOMEM;
+
+ meta->tfs_extent_ct = ext_count;
+ *metap = meta;
+
+ return 0;
+}
+
+static void
+famfs_meta_free(
+ struct famfs_file_meta *map)
+{
+ kfree(map);
+}
+
+/**
+ * famfs_file_init_dax() - FAMFSIOC_MAP_CREATE ioctl handler
+ * @file:
+ * @arg: ptr to struct mcioc_map in user space
+ *
+ * Setup the dax mapping for a file. Files are created empty, and then function is called
+ * (by famfs_file_ioctl()) to setup the mapping and set the file size.
+ */
+static int
+famfs_file_init_dax(
+ struct file *file,
+ void __user *arg)
+{
+ struct famfs_extent *tfs_extents = NULL;
+ struct famfs_file_meta *meta = NULL;
+ struct inode *inode;
+ struct famfs_ioc_map imap;
+ struct famfs_fs_info *fsi;
+ struct super_block *sb;
+ int alignment_errs = 0;
+ size_t extent_total = 0;
+ size_t ext_count;
+ int rc = 0;
+ int i;
+
+ rc = copy_from_user(&imap, arg, sizeof(imap));
+ if (rc)
+ return -EFAULT;
+
+ ext_count = imap.ext_list_count;
+ if (ext_count < 1) {
+ rc = -ENOSPC;
+ goto errout;
+ }
+
+ if (ext_count > FAMFS_MAX_EXTENTS) {
+ rc = -E2BIG;
+ goto errout;
+ }
+
+ inode = file_inode(file);
+ if (!inode) {
+ rc = -EBADF;
+ goto errout;
+ }
+ sb = inode->i_sb;
+ fsi = inode->i_sb->s_fs_info;
+
+ tfs_extents = &imap.ext_list[0];
+
+ rc = famfs_meta_alloc(&meta, ext_count);
+ if (rc)
+ goto errout;
+
+ meta->file_type = imap.file_type;
+ meta->file_size = imap.file_size;
+
+ /* Fill in the internal file metadata structure */
+ for (i = 0; i < imap.ext_list_count; i++) {
+ size_t len;
+ off_t offset;
+
+ offset = imap.ext_list[i].offset;
+ len = imap.ext_list[i].len;
+
+ extent_total += len;
+
+ if (WARN_ON(offset == 0 && meta->file_type != FAMFS_SUPERBLOCK)) {
+ rc = -EINVAL;
+ goto errout;
+ }
+
+ meta->tfs_extents[i].offset = offset;
+ meta->tfs_extents[i].len = len;
+
+ /* All extent addresses/offsets must be 2MiB aligned,
+ * and all but the last length must be a 2MiB multiple.
+ */
+ if (!IS_ALIGNED(offset, PMD_SIZE)) {
+ pr_err("%s: error ext %d hpa %lx not aligned\n",
+ __func__, i, offset);
+ alignment_errs++;
+ }
+ if (i < (imap.ext_list_count - 1) && !IS_ALIGNED(len, PMD_SIZE)) {
+ pr_err("%s: error ext %d length %ld not aligned\n",
+ __func__, i, len);
+ alignment_errs++;
+ }
+ }
+
+ /*
+ * File size can be <= ext list size, since extent sizes are constrained
+ * to PMD multiples
+ */
+ if (imap.file_size > extent_total) {
+ pr_err("%s: file size %lld larger than ext list size %lld\n",
+ __func__, (u64)imap.file_size, (u64)extent_total);
+ rc = -EINVAL;
+ goto errout;
+ }
+
+ if (alignment_errs > 0) {
+ pr_err("%s: there were %d alignment errors in the extent list\n",
+ __func__, alignment_errs);
+ rc = -EINVAL;
+ goto errout;
+ }
+
+ /* Publish the famfs metadata on inode->i_private */
+ inode_lock(inode);
+ if (inode->i_private) {
+ rc = -EEXIST; /* file already has famfs metadata */
+ } else {
+ inode->i_private = meta;
+ i_size_write(inode, imap.file_size);
+ inode->i_flags |= S_DAX;
+ }
+ inode_unlock(inode);
+
+ errout:
+ if (rc)
+ famfs_meta_free(meta);
+
+ return rc;
+}
+
+/**
+ * famfs_file_ioctl() - top-level famfs file ioctl handler
+ * @file:
+ * @cmd:
+ * @arg:
+ */
+static
+long
+famfs_file_ioctl(
+ struct file *file,
+ unsigned int cmd,
+ unsigned long arg)
+{
+ long rc;
+
+ switch (cmd) {
+ case FAMFSIOC_NOP:
+ rc = 0;
+ break;
+
+ case FAMFSIOC_MAP_CREATE:
+ rc = famfs_file_init_dax(file, (void *)arg);
+ break;
+
+ case FAMFSIOC_MAP_GET: {
+ struct inode *inode = file_inode(file);
+ struct famfs_file_meta *meta = inode->i_private;
+ struct famfs_ioc_map umeta;
+
+ memset(&umeta, 0, sizeof(umeta));
+
+ if (meta) {
+ /* TODO: do more to harmonize these structures */
+ umeta.extent_type = meta->tfs_extent_type;
+ umeta.file_size = i_size_read(inode);
+ umeta.ext_list_count = meta->tfs_extent_ct;
+
+ rc = copy_to_user((void __user *)arg, &umeta, sizeof(umeta));
+ if (rc)
+ pr_err("%s: copy_to_user returned %ld\n", __func__, rc);
+
+ } else {
+ rc = -EINVAL;
+ }
+ }
+ break;
+ case FAMFSIOC_MAP_GETEXT: {
+ struct inode *inode = file_inode(file);
+ struct famfs_file_meta *meta = inode->i_private;
+
+ if (meta)
+ rc = copy_to_user((void __user *)arg, meta->tfs_extents,
+ meta->tfs_extent_ct * sizeof(struct famfs_extent));
+ else
+ rc = -EINVAL;
+ }
+ break;
+ default:
+ rc = -ENOTTY;
+ break;
+ }
+
+ return rc;
+}
+
/*********************************************************************
* file_operations
*/
@@ -143,6 +368,7 @@ const struct file_operations famfs_file_operations = {
/* Custom famfs operations */
.write_iter = famfs_dax_write_iter,
.read_iter = famfs_dax_read_iter,
+ .unlocked_ioctl = famfs_file_ioctl,
.mmap = famfs_file_mmap,

/* Force PMD alignment for mmap */
--
2.43.0


2024-02-23 17:43:47

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 04/20] dev_dax_iomap: Save the kva from memremap

Save the kva from memremap because we need it for iomap rw support

Prior to famfs, there were no iomap users of /dev/dax - so the virtual
address from memremap was not needed.

Also: in some cases dev_dax_probe() is called with the first
dev_dax->range offset past pgmap[0].range. In those cases we need to
add the difference to virt_addr in order to have the physaddr's in
dev_dax->ranges match dev_dax->virt_addr.

Dragons...

Signed-off-by: John Groves <[email protected]>
---
drivers/dax/dax-private.h | 1 +
drivers/dax/device.c | 15 +++++++++++++++
2 files changed, 16 insertions(+)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 446617b73aea..894eb1c66b4a 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -63,6 +63,7 @@ struct dax_mapping {
struct dev_dax {
struct dax_region *region;
struct dax_device *dax_dev;
+ u64 virt_addr;
unsigned int align;
int target_node;
bool dyn_id;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 40ba660013cf..6cd79d00fe1b 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -372,6 +372,7 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
struct dax_device *dax_dev = dev_dax->dax_dev;
struct device *dev = &dev_dax->dev;
struct dev_pagemap *pgmap;
+ u64 data_offset = 0;
struct inode *inode;
struct cdev *cdev;
void *addr;
@@ -426,6 +427,20 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
if (IS_ERR(addr))
return PTR_ERR(addr);

+ /* Detect whether the data is at a non-zero offset into the memory */
+ if (pgmap->range.start != dev_dax->ranges[0].range.start) {
+ u64 phys = (u64)dev_dax->ranges[0].range.start;
+ u64 pgmap_phys = (u64)dev_dax->pgmap[0].range.start;
+ u64 vmemmap_shift = (u64)dev_dax->pgmap[0].vmemmap_shift;
+
+ if (!WARN_ON(pgmap_phys > phys))
+ data_offset = phys - pgmap_phys;
+
+ pr_notice("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx shift=%llx\n",
+ __func__, phys, pgmap_phys, data_offset, vmemmap_shift);
+ }
+ dev_dax->virt_addr = (u64)addr + data_offset;
+
inode = dax_inode(dax_dev);
cdev = inode->i_cdev;
cdev_init(cdev, &dax_fops);
--
2.43.0


2024-02-23 17:44:08

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 05/20] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax

Notes about this commit:

* These methods are based somewhat loosely on pmem_dax_ops from
drivers/nvdimm/pmem.c

* dev_dax_direct_access() is returns the hpa, pfn and kva. The kva was
newly stored as dev_dax->virt_addr by dev_dax_probe().

* The hpa/pfn are used for mmap (dax_iomap_fault()), and the kva is used
for read/write (dax_iomap_rw())

* dev_dax_recovery_write() and dev_dax_zero_page_range() have not been
tested yet. I'm looking for suggestions as to how to test those.

Signed-off-by: John Groves <[email protected]>
---
drivers/dax/bus.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 107 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 664e8c1b9930..06fcda810674 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -10,6 +10,12 @@
#include "dax-private.h"
#include "bus.h"

+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+#include <linux/backing-dev.h>
+#include <linux/pfn_t.h>
+#include <linux/range.h>
+#endif
+
static DEFINE_MUTEX(dax_bus_lock);

#define DAX_NAME_LEN 30
@@ -1349,6 +1355,101 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
}
EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);

+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+
+static void write_dax(void *pmem_addr, struct page *page,
+ unsigned int off, unsigned int len)
+{
+ unsigned int chunk;
+ void *mem;
+
+ while (len) {
+ mem = kmap_local_page(page);
+ chunk = min_t(unsigned int, len, PAGE_SIZE - off);
+ memcpy_flushcache(pmem_addr, mem + off, chunk);
+ kunmap_local(mem);
+ len -= chunk;
+ off = 0;
+ page++;
+ pmem_addr += chunk;
+ }
+}
+
+static long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+ long nr_pages, enum dax_access_mode mode, void **kaddr,
+ pfn_t *pfn)
+{
+ struct dev_dax *dev_dax = dax_get_private(dax_dev);
+ size_t dax_size = dev_dax_size(dev_dax);
+ size_t size = nr_pages << PAGE_SHIFT;
+ size_t offset = pgoff << PAGE_SHIFT;
+ phys_addr_t phys;
+ u64 virt_addr = dev_dax->virt_addr + offset;
+ pfn_t local_pfn;
+ u64 flags = PFN_DEV|PFN_MAP;
+
+ WARN_ON(!dev_dax->virt_addr); /* virt_addr must be saved for direct_access */
+
+ phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
+
+ if (kaddr)
+ *kaddr = (void *)virt_addr;
+
+ local_pfn = phys_to_pfn_t(phys, flags); /* are flags correct? */
+ if (pfn)
+ *pfn = local_pfn;
+
+ /* This the valid size at the specified address */
+ return PHYS_PFN(min_t(size_t, size, dax_size - offset));
+}
+
+static int dev_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+ size_t nr_pages)
+{
+ long resid = nr_pages << PAGE_SHIFT;
+ long offset = pgoff << PAGE_SHIFT;
+
+ /* Break into one write per dax region */
+ while (resid > 0) {
+ void *kaddr;
+ pgoff_t poff = offset >> PAGE_SHIFT;
+ long len = __dev_dax_direct_access(dax_dev, poff,
+ nr_pages, DAX_ACCESS, &kaddr, NULL);
+ len = min_t(long, len, PAGE_SIZE);
+ write_dax(kaddr, ZERO_PAGE(0), offset, len);
+
+ offset += len;
+ resid -= len;
+ }
+ return 0;
+}
+
+static long dev_dax_direct_access(struct dax_device *dax_dev,
+ pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+ void **kaddr, pfn_t *pfn)
+{
+ return __dev_dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
+}
+
+static size_t dev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ size_t len, off;
+
+ off = offset_in_page(addr);
+ len = PFN_PHYS(PFN_UP(off + bytes));
+
+ return _copy_from_iter_flushcache(addr, bytes, i);
+}
+
+static const struct dax_operations dev_dax_ops = {
+ .direct_access = dev_dax_direct_access,
+ .zero_page_range = dev_dax_zero_page_range,
+ .recovery_write = dev_dax_recovery_write,
+};
+
+#endif /* IS_ENABLED(CONFIG_DEV_DAX_IOMAP) */
+
struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
{
struct dax_region *dax_region = data->dax_region;
@@ -1404,11 +1505,17 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
}
}

+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+ /* holder_ops currently populated separately in a slightly hacky way */
+ dax_dev = alloc_dax(dev_dax, &dev_dax_ops);
+#else
/*
* No dax_operations since there is no access to this device outside of
* mmap of the resulting character device.
*/
dax_dev = alloc_dax(dev_dax, NULL);
+#endif
+
if (IS_ERR(dax_dev)) {
rc = PTR_ERR(dax_dev);
goto err_alloc_dax;
--
2.43.0


2024-02-23 17:54:37

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 06/20] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter

Add the CONFIG_DEV_DAX_IOMAP kernel config parameter to control building
of the iomap functionality to support fsdax on devdax.

Signed-off-by: John Groves <[email protected]>
---
drivers/dax/Kconfig | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index a88744244149..b1ebcc77120b 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -78,4 +78,10 @@ config DEV_DAX_KMEM

Say N if unsure.

+config DEV_DAX_IOMAP
+ depends on DEV_DAX && DAX
+ def_bool y
+ help
+ Support iomap mapping of devdax devices (for FS-DAX file
+ systems that reside on character /dev/dax devices)
endif
--
2.43.0


2024-02-23 17:43:30

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 03/20] dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since both need it now

bus.c can't call functions in device.c - that creates a circular linkage
dependency.

Signed-off-by: John Groves <[email protected]>
---
drivers/dax/bus.c | 24 ++++++++++++++++++++++++
drivers/dax/device.c | 23 -----------------------
2 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1ff1ab5fa105..664e8c1b9930 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1325,6 +1325,30 @@ static const struct device_type dev_dax_type = {
.groups = dax_attribute_groups,
};

+/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
+__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
+ unsigned long size)
+{
+ int i;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+ struct range *range = &dax_range->range;
+ unsigned long long pgoff_end;
+ phys_addr_t phys;
+
+ pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+ if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+ continue;
+ phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
+ if (phys + size - 1 <= range->end)
+ return phys;
+ break;
+ }
+ return -1;
+}
+EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
+
struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
{
struct dax_region *dax_region = data->dax_region;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 93ebedc5ec8c..40ba660013cf 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -50,29 +50,6 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
return 0;
}

-/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
-__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
- unsigned long size)
-{
- int i;
-
- for (i = 0; i < dev_dax->nr_range; i++) {
- struct dev_dax_range *dax_range = &dev_dax->ranges[i];
- struct range *range = &dax_range->range;
- unsigned long long pgoff_end;
- phys_addr_t phys;
-
- pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
- if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
- continue;
- phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
- if (phys + size - 1 <= range->end)
- return phys;
- break;
- }
- return -1;
-}
-
static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn,
unsigned long fault_size)
{
--
2.43.0


2024-02-23 17:46:14

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 10/20] famfs: famfs_open_device() & dax_holder_operations

Famfs works on both /dev/pmem and /dev/dax devices. This commit introduces
the function that opens a block (pmem) device and the struct
dax_holder_operations that are needed for that ABI.

In this commit, support for opening character /dev/dax is stubbed. A
later commit introduces this capability.

Signed-off-by: John Groves <[email protected]>
---
fs/famfs/famfs_inode.c | 83 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 83 insertions(+)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index 3329aff000d1..82c861998093 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -68,5 +68,88 @@ static const struct super_operations famfs_ops = {
.show_options = famfs_show_options,
};

+/***************************************************************************************
+ * dax_holder_operations for block dax
+ */
+
+static int
+famfs_blk_dax_notify_failure(
+ struct dax_device *dax_devp,
+ u64 offset,
+ u64 len,
+ int mf_flags)
+{
+
+ pr_err("%s: dax_devp %llx offset %llx len %lld mf_flags %x\n",
+ __func__, (u64)dax_devp, (u64)offset, (u64)len, mf_flags);
+ return -EOPNOTSUPP;
+}
+
+const struct dax_holder_operations famfs_blk_dax_holder_ops = {
+ .notify_failure = famfs_blk_dax_notify_failure,
+};
+
+static int
+famfs_open_char_device(
+ struct super_block *sb,
+ struct fs_context *fc)
+{
+ pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
+ __func__, fc->source);
+ return -ENODEV;
+}
+
+/**
+ * famfs_open_device()
+ *
+ * Open the memory device. If it looks like /dev/dax, call famfs_open_char_device().
+ * Otherwise try to open it as a block/pmem device.
+ */
+static int
+famfs_open_device(
+ struct super_block *sb,
+ struct fs_context *fc)
+{
+ struct famfs_fs_info *fsi = sb->s_fs_info;
+ struct dax_device *dax_devp;
+ u64 start_off = 0;
+ struct bdev_handle *handlep;
+
+ if (fsi->dax_devp) {
+ pr_err("%s: already mounted\n", __func__);
+ return -EALREADY;
+ }
+
+ if (strstr(fc->source, "/dev/dax")) /* There is probably a better way to check this */
+ return famfs_open_char_device(sb, fc);
+
+ if (!strstr(fc->source, "/dev/pmem")) { /* There is probably a better way to check this */
+ pr_err("%s: primary backing dev (%s) is not pmem\n",
+ __func__, fc->source);
+ return -EINVAL;
+ }
+
+ handlep = bdev_open_by_path(fc->source, FAMFS_BLKDEV_MODE, fsi, &fs_holder_ops);
+ if (IS_ERR(handlep->bdev)) {
+ pr_err("%s: failed blkdev_get_by_path(%s)\n", __func__, fc->source);
+ return PTR_ERR(handlep->bdev);
+ }
+
+ dax_devp = fs_dax_get_by_bdev(handlep->bdev, &start_off,
+ fsi /* holder */,
+ &famfs_blk_dax_holder_ops);
+ if (IS_ERR(dax_devp)) {
+ pr_err("%s: unable to get daxdev from handlep->bdev\n", __func__);
+ bdev_release(handlep);
+ return -ENODEV;
+ }
+ fsi->bdev_handle = handlep;
+ fsi->dax_devp = dax_devp;
+
+ pr_notice("%s: root device is block dax (%s)\n", __func__, fc->source);
+ return 0;
+}
+
+

MODULE_LICENSE("GPL");
--
2.43.0


2024-02-23 17:46:23

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 11/20] famfs: Add fs_context_operations

This commit introduces the famfs fs_context_operations and
famfs_get_inode() which is used by the context operations.

Signed-off-by: John Groves <[email protected]>
---
fs/famfs/famfs_inode.c | 178 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 178 insertions(+)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index 82c861998093..f98f82962d7b 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -41,6 +41,50 @@ static const struct super_operations famfs_ops;
static const struct inode_operations famfs_file_inode_operations;
static const struct inode_operations famfs_dir_inode_operations;

+static struct inode *famfs_get_inode(
+ struct super_block *sb,
+ const struct inode *dir,
+ umode_t mode,
+ dev_t dev)
+{
+ struct inode *inode = new_inode(sb);
+
+ if (inode) {
+ struct timespec64 tv;
+
+ inode->i_ino = get_next_ino();
+ inode_init_owner(&nop_mnt_idmap, inode, dir, mode);
+ inode->i_mapping->a_ops = &ram_aops;
+ mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+ mapping_set_unevictable(inode->i_mapping);
+ tv = inode_set_ctime_current(inode);
+ inode_set_mtime_to_ts(inode, tv);
+ inode_set_atime_to_ts(inode, tv);
+
+ switch (mode & S_IFMT) {
+ default:
+ init_special_inode(inode, mode, dev);
+ break;
+ case S_IFREG:
+ inode->i_op = &famfs_file_inode_operations;
+ inode->i_fop = &famfs_file_operations;
+ break;
+ case S_IFDIR:
+ inode->i_op = &famfs_dir_inode_operations;
+ inode->i_fop = &simple_dir_operations;
+
+ /* Directory inodes start off with i_nlink == 2 (for "." entry) */
+ inc_nlink(inode);
+ break;
+ case S_IFLNK:
+ inode->i_op = &page_symlink_inode_operations;
+ inode_nohighmem(inode);
+ break;
+ }
+ }
+ return inode;
+}
+
/**********************************************************************************
* famfs super_operations
*
@@ -150,6 +194,140 @@ famfs_open_device(
return 0;
}

+/*****************************************************************************************
+ * fs_context_operations
+ */
+static int
+famfs_fill_super(
+ struct super_block *sb,
+ struct fs_context *fc)
+{
+ struct famfs_fs_info *fsi = sb->s_fs_info;
+ struct inode *inode;
+ int rc = 0;
+
+ sb->s_maxbytes = MAX_LFS_FILESIZE;
+ sb->s_blocksize = PAGE_SIZE;
+ sb->s_blocksize_bits = PAGE_SHIFT;
+ sb->s_magic = FAMFS_MAGIC;
+ sb->s_op = &famfs_ops;
+ sb->s_time_gran = 1;
+
+ rc = famfs_open_device(sb, fc);
+ if (rc)
+ goto out;
+
+ inode = famfs_get_inode(sb, NULL, S_IFDIR | fsi->mount_opts.mode, 0);
+ sb->s_root = d_make_root(inode);
+ if (!sb->s_root)
+ rc = -ENOMEM;
+
+out:
+ return rc;
+}
+
+enum famfs_param {
+ Opt_mode,
+ Opt_dax,
+};
+
+const struct fs_parameter_spec famfs_fs_parameters[] = {
+ fsparam_u32oct("mode", Opt_mode),
+ fsparam_string("dax", Opt_dax),
+ {}
+};
+
+static int famfs_parse_param(
+ struct fs_context *fc,
+ struct fs_parameter *param)
+{
+ struct famfs_fs_info *fsi = fc->s_fs_info;
+ struct fs_parse_result result;
+ int opt;
+
+ opt = fs_parse(fc, famfs_fs_parameters, param, &result);
+ if (opt == -ENOPARAM) {
+ opt = vfs_parse_fs_param_source(fc, param);
+ if (opt != -ENOPARAM)
+ return opt;
+
+ return 0;
+ }
+ if (opt < 0)
+ return opt;
+
+ switch (opt) {
+ case Opt_mode:
+ fsi->mount_opts.mode = result.uint_32 & S_IALLUGO;
+ break;
+ case Opt_dax:
+ if (strcmp(param->string, "always"))
+ pr_notice("%s: invalid dax mode %s\n",
+ __func__, param->string);
+ break;
+ }
+
+ return 0;
+}
+
+static DEFINE_MUTEX(famfs_context_mutex);
+static LIST_HEAD(famfs_context_list);
+
+static int famfs_get_tree(struct fs_context *fc)
+{
+ struct famfs_fs_info *fsi_entry;
+ struct famfs_fs_info *fsi = fc->s_fs_info;
+
+ fsi->rootdev = kstrdup(fc->source, GFP_KERNEL);
+ if (!fsi->rootdev)
+ return -ENOMEM;
+
+ /* Fail if famfs is already mounted from the same device */
+ mutex_lock(&famfs_context_mutex);
+ list_for_each_entry(fsi_entry, &famfs_context_list, fsi_list) {
+ if (strcmp(fsi_entry->rootdev, fc->source) == 0) {
+ mutex_unlock(&famfs_context_mutex);
+ pr_err("%s: already mounted from rootdev %s\n", __func__, fc->source);
+ return -EALREADY;
+ }
+ }
+
+ list_add(&fsi->fsi_list, &famfs_context_list);
+ mutex_unlock(&famfs_context_mutex);
+
+ return get_tree_nodev(fc, famfs_fill_super);
+
+}
+
+static void famfs_free_fc(struct fs_context *fc)
+{
+ struct famfs_fs_info *fsi = fc->s_fs_info;
+
+ if (fsi && fsi->rootdev)
+ kfree(fsi->rootdev);
+
+ kfree(fsi);
+}
+
+static const struct fs_context_operations famfs_context_ops = {
+ .free = famfs_free_fc,
+ .parse_param = famfs_parse_param,
+ .get_tree = famfs_get_tree,
+};
+
+static int famfs_init_fs_context(struct fs_context *fc)
+{
+ struct famfs_fs_info *fsi;
+
+ fsi = kzalloc(sizeof(*fsi), GFP_KERNEL);
+ if (!fsi)
+ return -ENOMEM;
+
+ fsi->mount_opts.mode = FAMFS_DEFAULT_MODE;
+ fc->s_fs_info = fsi;
+ fc->ops = &famfs_context_ops;
+ return 0;
+}


MODULE_LICENSE("GPL");
--
2.43.0


2024-02-23 17:47:57

by John Groves

[permalink] [raw]
Subject: [RFC PATCH 16/20] famfs: Add fault counters

One of the key requirements for famfs is that it service vma faults
efficiently. Our metadata helps - the search order is n for n extents,
and n is usually 1. But we can still observe gnarly lock contention
in mm if PTE faults are happening. This commit introduces fault counters
that can be enabled and read via /sys/fs/famfs/...

These counters have proved useful in troubleshooting situations where
PTE faults were happening instead of PMD. No performance impact when
disabled.

Signed-off-by: John Groves <[email protected]>
---
fs/famfs/famfs_file.c | 97 +++++++++++++++++++++++++++++++++++++++
fs/famfs/famfs_internal.h | 73 +++++++++++++++++++++++++++++
2 files changed, 170 insertions(+)

diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
index fd42d5966982..a626f8a89790 100644
--- a/fs/famfs/famfs_file.c
+++ b/fs/famfs/famfs_file.c
@@ -19,6 +19,100 @@
#include <uapi/linux/famfs_ioctl.h>
#include "famfs_internal.h"

+/***************************************************************************************
+ * filemap_fault counters
+ *
+ * The counters and the fault_count_enable file live at
+ * /sys/fs/famfs/
+ */
+struct famfs_fault_counters ffc;
+static int fault_count_enable;
+
+static ssize_t
+fault_count_enable_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%d\n", fault_count_enable);
+}
+
+static ssize_t
+fault_count_enable_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf,
+ size_t count)
+{
+ int value;
+ int rc;
+
+ rc = sscanf(buf, "%d", &value);
+ if (rc != 1)
+ return 0;
+
+ if (value > 0) /* clear fault counters when enabling, but not when disabling */
+ famfs_clear_fault_counters(&ffc);
+
+ fault_count_enable = value;
+ return count;
+}
+
+/* Individual fault counters are read-only */
+static ssize_t
+fault_count_pte_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%llu", famfs_pte_fault_ct(&ffc));
+}
+
+static ssize_t
+fault_count_pmd_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%llu", famfs_pmd_fault_ct(&ffc));
+}
+
+static ssize_t
+fault_count_pud_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%llu", famfs_pud_fault_ct(&ffc));
+}
+
+static struct kobj_attribute fault_count_enable_attribute = __ATTR(fault_count_enable,
+ 0660,
+ fault_count_enable_show,
+ fault_count_enable_store);
+static struct kobj_attribute fault_count_pte_attribute = __ATTR(pte_fault_ct,
+ 0440,
+ fault_count_pte_show,
+ NULL);
+static struct kobj_attribute fault_count_pmd_attribute = __ATTR(pmd_fault_ct,
+ 0440,
+ fault_count_pmd_show,
+ NULL);
+static struct kobj_attribute fault_count_pud_attribute = __ATTR(pud_fault_ct,
+ 0440,
+ fault_count_pud_show,
+ NULL);
+
+
+static struct attribute *attrs[] = {
+ &fault_count_enable_attribute.attr,
+ &fault_count_pte_attribute.attr,
+ &fault_count_pmd_attribute.attr,
+ &fault_count_pud_attribute.attr,
+ NULL,
+};
+
+struct attribute_group famfs_attr_group = {
+ .attrs = attrs,
+};
+
+/* End fault counters */
+
/**
* famfs_map_meta_alloc() - Allocate famfs file metadata
* @mapp: Pointer to an mcache_map_meta pointer
@@ -525,6 +619,9 @@ __famfs_filemap_fault(
if (IS_DAX(inode)) {
pfn_t pfn;

+ if (fault_count_enable)
+ famfs_inc_fault_counter_by_order(&ffc, pe_size);
+
ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops);
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, pe_size, pfn);
diff --git a/fs/famfs/famfs_internal.h b/fs/famfs/famfs_internal.h
index af3990d43305..987cb172a149 100644
--- a/fs/famfs/famfs_internal.h
+++ b/fs/famfs/famfs_internal.h
@@ -50,4 +50,77 @@ struct famfs_fs_info {
char *rootdev;
};

+/*
+ * filemap_fault counters
+ */
+extern struct attribute_group famfs_attr_group;
+
+enum famfs_fault {
+ FAMFS_PTE = 0,
+ FAMFS_PMD,
+ FAMFS_PUD,
+ FAMFS_NUM_FAULT_TYPES,
+};
+
+static inline int valid_fault_type(int type)
+{
+ if (unlikely(type < 0 || type > FAMFS_PUD))
+ return 0;
+ return 1;
+}
+
+struct famfs_fault_counters {
+ atomic64_t fault_ct[FAMFS_NUM_FAULT_TYPES];
+};
+
+extern struct famfs_fault_counters ffc;
+
+static inline void famfs_clear_fault_counters(struct famfs_fault_counters *fc)
+{
+ int i;
+
+ for (i = 0; i < FAMFS_NUM_FAULT_TYPES; i++)
+ atomic64_set(&fc->fault_ct[i], 0);
+}
+
+static inline void famfs_inc_fault_counter(struct famfs_fault_counters *fc,
+ enum famfs_fault type)
+{
+ if (valid_fault_type(type))
+ atomic64_inc(&fc->fault_ct[type]);
+}
+
+static inline void famfs_inc_fault_counter_by_order(struct famfs_fault_counters *fc, int order)
+{
+ int pgf = -1;
+
+ switch (order) {
+ case 0:
+ pgf = FAMFS_PTE;
+ break;
+ case PMD_ORDER:
+ pgf = FAMFS_PMD;
+ break;
+ case PUD_ORDER:
+ pgf = FAMFS_PUD;
+ break;
+ }
+ famfs_inc_fault_counter(fc, pgf);
+}
+
+static inline u64 famfs_pte_fault_ct(struct famfs_fault_counters *fc)
+{
+ return atomic64_read(&fc->fault_ct[FAMFS_PTE]);
+}
+
+static inline u64 famfs_pmd_fault_ct(struct famfs_fault_counters *fc)
+{
+ return atomic64_read(&fc->fault_ct[FAMFS_PMD]);
+}
+
+static inline u64 famfs_pud_fault_ct(struct famfs_fault_counters *fc)
+{
+ return atomic64_read(&fc->fault_ct[FAMFS_PUD]);
+}
+
#endif /* FAMFS_INTERNAL_H */
--
2.43.0


2024-03-11 01:29:42

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On Thu, Feb 29, 2024 at 08:52:48AM -0600, John Groves wrote:
> On 24/02/29 01:15PM, Dave Chinner wrote:
> > On Mon, Feb 26, 2024 at 08:05:58PM -0600, John Groves wrote:
> > > bw ( MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719
> > > iops : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719
> > > lat (usec) : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02%
> > > lat (usec) : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75%
> > > lat (msec) : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25%
> >
> > Most of the IO latencies are up round the 4-20ms marks. That seems
> > kinda high for a 2MB IO. With a memcpy speed of 10GB/s, the 2MB
> > should only take a couple of hundred microseconds. For Famfs, the
> > latencies appear to be around 1-4ms.
> >
> > So where's all that extra time coming from?
>
> Below, you will see two runs with performance and latency distribution
> about the same as famfs (the answer for that was --fallocate=native).

Ah, that is exactly what I suspected, and was wanting profiles
because that will show up in them clearly.

> > > lat (msec) : 100=0.08%
> > > cpu : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511
> >
> > And why is system time reporting at almost zero instead of almost
> > all the remaining cpu time (i.e. up at 80-90%)?
>
> Something weird is going on with the cpu reporting. Sometimes sys=~0, but other times
> it's about what you would expect. I suspect some sort of measurement error,
> like maybe the method doesn't work with my cpu model? (I'm grasping, but with
> a somewhat rational basis...)
>
> I pasted two xfs runs below. The first has the wonky cpu sys value, and
> the second looks about like what one would expect.
>
> >
> > Can you run call-graph kernel profiles for XFS and famfs whilst
> > running this workload so we have some insight into what is behaving
> > differently here?
>
> Can you point me to an example of how to do that?

perf record --call-graph ...
pref report --call-graph ...


> I'd been thinking about the ~2x gap for a few days, and the most obvious
> difference is famfs files must be preallocated (like fallocate, but works
> a bit differently since allocation happens in user space). I just checked
> one of the xfs files, and it had maybe 80 extents (whereas the famfs
> files always have 1 extent here).

Which is about 4MB per extent. Extent size is not the problem for
zero-seek-latency storage hardware, though.

Essentially what you are seeing is interleaving extent allocation
between all the files because they are located in the same
directory. The locality algorithm is trying to place the data
extents close to the owner inode, but the indoes are also all close
together because they are located in the same AG as the parent
directory inode. Allocation concurrency is created by placing new
directories in different allocation groups, so we end up with
workloads in different directories being largely isolated from each
other.

However, that means when you are trying to write to many files in
the same directory at the same time, they are largely all competing
for the same AG lock to do block allocation during IO submission.
That creates interleaving of write() sized extents between different
files. We use speculative preallocation for buffered IO to avoid
this, and for direct IO the application needs to use extent size hints
or preallocation to avoid this contention based interleaving.

IOWs, by using fallocate() to preallocate all the space there will
be no allocation during IO submission and so the serialisation that
occurs due to competing allocations just goes away...

> FWIW I ran xfs with and without io_uring, and there was no apparent
> difference (which makes sense to me because it's not block I/O).
>
> The prior ~2x gap still seems like a lot of overhead for extent list
> mapping to memory, but adding --fallocate=native to the xfs test brought
> it into line with famfs:

As I suspected. :)

As for CPU usage accounting, the number of context switches says it
all.

"Bad":

> cpu : usr=15.48%, sys=1.17%, ctx=62654, majf=0, minf=22801

"good":

> cpu : usr=14.43%, sys=78.18%, ctx=5272, majf=0, minf=15708

I'd say that in the "bad" case most of the kernel work is being
shuffled off to kernel threads to do the work and so it doesn't get
accounted to the submission task. In comparison, in the "good" case
the work is being done in the submission thread and hence there's a
lot fewer context switches and the system time is correctly
accounted to the submission task.

Perhaps an io_uring task accounting problem?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2024-05-17 09:56:07

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On Thu, 29 Feb 2024 at 07:52, Amir Goldstein <[email protected]> wrote:

> I'm not virtiofs expert, but I don't think that you are wrong about this.
> IIUC, virtiofsd could map arbitrary memory region to any fuse file mmaped
> by virtiofs client.
>
> So what are the gaps between virtiofs and famfs that justify a new filesystem
> driver and new userspace API?

Let me try to fill in some gaps. I've looked at the famfs driver
(even tried to set it up in a VM, but got stuck with the EFI stuff).

- famfs has an extent list per file that indicates how each page
within the file should be mapped onto the dax device, IOW it has the
following mapping:

[famfs file, offset] -> [offset, length]

- fuse can currently map a fuse file onto a backing file:

[fuse file] -> [backing file]

The interface for the latter is

backing_id = ioctl(dev_fuse_fd, FUSE_DEV_IOC_BACKING_OPEN, backing_map);
..
fuse_open_out.flags |= FOPEN_PASSTHROUGH;
fuse_open_out.backing_id = backing_id;

This looks suitable for doing the famfs file - > dax device mapping as
well. I wouldn't extend the ioctl with extent information, since
famfs can just use FUSE_DEV_IOC_BACKING_OPEN once to register the dax
device. The flags field could be used to tell the kernel to treat
this fd as a dax device instead of a a regular file.

Letter, when the file is opened the extent list could be sent in the
open reply together with the backing id. The fuse_ext_header
mechanism seems suitable for this.

And I think that's it as far as API's are concerned.

Note: this is already more generic than the current famfs prototype,
since multiple dax devices could be used as backing for famfs files,
with the constraint that a single file can only map data from a single
dax device.

As for implementing dax passthrough, I think that needs a separate
source file, the one used by virtiofs (fs/fuse/dax.c) does not appear
to have many commonalities with this one. That could be renamed to
virtiofs_dax.c as it's pretty much virtiofs specific, AFAICT.

Comments? Am I missing something significant?

Thanks,
Miklos

2024-05-19 05:59:42

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On Fri, May 17, 2024 at 12:55 PM Miklos Szeredi <[email protected]> wrote:
>
> On Thu, 29 Feb 2024 at 07:52, Amir Goldstein <[email protected]> wrote:
>
> > I'm not virtiofs expert, but I don't think that you are wrong about this.
> > IIUC, virtiofsd could map arbitrary memory region to any fuse file mmaped
> > by virtiofs client.
> >
> > So what are the gaps between virtiofs and famfs that justify a new filesystem
> > driver and new userspace API?
>
> Let me try to fill in some gaps. I've looked at the famfs driver
> (even tried to set it up in a VM, but got stuck with the EFI stuff).
>
> - famfs has an extent list per file that indicates how each page
> within the file should be mapped onto the dax device, IOW it has the
> following mapping:
>
> [famfs file, offset] -> [offset, length]
>
> - fuse can currently map a fuse file onto a backing file:
>
> [fuse file] -> [backing file]
>
> The interface for the latter is
>
> backing_id = ioctl(dev_fuse_fd, FUSE_DEV_IOC_BACKING_OPEN, backing_map);
> ...
> fuse_open_out.flags |= FOPEN_PASSTHROUGH;
> fuse_open_out.backing_id = backing_id;

FYI, library and example code was recently merged to libfuse:
https://github.com/libfuse/libfuse/pull/919

>
> This looks suitable for doing the famfs file - > dax device mapping as
> well. I wouldn't extend the ioctl with extent information, since
> famfs can just use FUSE_DEV_IOC_BACKING_OPEN once to register the dax
> device. The flags field could be used to tell the kernel to treat
> this fd as a dax device instead of a a regular file.
>
> Letter, when the file is opened the extent list could be sent in the
> open reply together with the backing id. The fuse_ext_header
> mechanism seems suitable for this.
>
> And I think that's it as far as API's are concerned.
>
> Note: this is already more generic than the current famfs prototype,
> since multiple dax devices could be used as backing for famfs files,
> with the constraint that a single file can only map data from a single
> dax device.
>
> As for implementing dax passthrough, I think that needs a separate
> source file, the one used by virtiofs (fs/fuse/dax.c) does not appear
> to have many commonalities with this one. That could be renamed to
> virtiofs_dax.c as it's pretty much virtiofs specific, AFAICT.
>
> Comments?

Would probably also need to decouple CONFIG_FUSE_DAX
from CONFIG_FUSE_VIRTIO_DAX.

What about fc->dax_mode (i.e. dax= mount option)?

What about FUSE_IS_DAX()? does it apply to both dax implementations?

Sounds like a decent plan.
John, let us know if you need help understanding the details.

> Am I missing something significant?

Would we need to set IS_DAX() on inode init time or can we set it
later on first file open?

Currently, iomodes enforces that all opens are either
mapped to same backing file or none mapped to backing file:

fuse_inode_uncached_io_start()
{
..
/* deny conflicting backing files on same fuse inode */

The iomodes rules will need to be amended to verify that:
- IS_DAX() inode open is always mapped to backing dax device
- All files of the same fuse inode are mapped to the same range
of backing file/dax device.

Thanks,
Amir.

2024-05-22 02:05:37

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

Initial reply to both Amir and Miklos. Sorry for the delay - I took a few
days off after LSFMM and I'm just re-engaging now.

First an observation: these messages are on the famfs v1 patch set thread.
The v2 patch set is at [1]. That is also the default branch now if you clone
the famfs kernel from [2].

Among the biggest changes at v2 is dropping /dev/pmem support and only
supporting /dev/dax (character) devices as backing devs for famfs.

On 24/05/19 08:59AM, Amir Goldstein wrote:
> On Fri, May 17, 2024 at 12:55 PM Miklos Szeredi <[email protected]> wrote:
> >
> > On Thu, 29 Feb 2024 at 07:52, Amir Goldstein <[email protected]> wrote:
> >
> > > I'm not virtiofs expert, but I don't think that you are wrong about this.
> > > IIUC, virtiofsd could map arbitrary memory region to any fuse file mmaped
> > > by virtiofs client.
> > >
> > > So what are the gaps between virtiofs and famfs that justify a new filesystem
> > > driver and new userspace API?
> >
> > Let me try to fill in some gaps. I've looked at the famfs driver
> > (even tried to set it up in a VM, but got stuck with the EFI stuff).

I'm happy to help with that if you care - ping me if so; getting a VM running
in EFI mode is not necessary if you reserve the dax memory via memmap=, or
via libvirt xml.

> >
> > - famfs has an extent list per file that indicates how each page
> > within the file should be mapped onto the dax device, IOW it has the
> > following mapping:
> >
> > [famfs file, offset] -> [offset, length]

More generally, a famfs file extent is [daxdev, offset, len]; there may
be multiple extents per file, and in the future this definitely needs to
generalize to multiple daxdev's.

Disclaimer: I'm still coming up to speed on fuse (slowly and ignorantly,
I think)...

A single backing device (daxdev) will contain extents of many famfs
files (plus metadata - currently a superblock and a log). I'm not sure
it's realistic to have a backing daxdev "open" per famfs file.

In addition there is:

- struct dax_holder_operations - to allow a notify_failure() upcall
from dax. This provides the critical capability to shut down famfs
if there are memory errors. This is filesystem- (or technically daxdev-
wide)

- The pmem or devdax iomap_ops - to allow the fsdax file system (famfs,
and [soon] famfs_fuse) to call dax_iomap_rw() and dax_iomap_fault().
I strongly suspect that famfs_fuse can't be correct unless it uses
this path rather than just the idea of a single backing file.
This interface explicitly supports files that map to disjoint ranges
of one or more dax devices.

- the dev_dax_iomap portion of the famfs patchsets adds iomap_ops to
character devdax.

- Note that dax devices, unlike files, don't support read/write - only
mmap(). I suspect (though I'm still pretty ignorant) that this means
we can't just treat the dax device as an extent-based backing file.


> >
> > - fuse can currently map a fuse file onto a backing file:
> >
> > [fuse file] -> [backing file]
> >
> > The interface for the latter is
> >
> > backing_id = ioctl(dev_fuse_fd, FUSE_DEV_IOC_BACKING_OPEN, backing_map);
> > ...
> > fuse_open_out.flags |= FOPEN_PASSTHROUGH;
> > fuse_open_out.backing_id = backing_id;
>
> FYI, library and example code was recently merged to libfuse:
> https://github.com/libfuse/libfuse/pull/919
>
> >
> > This looks suitable for doing the famfs file - > dax device mapping as
> > well. I wouldn't extend the ioctl with extent information, since
> > famfs can just use FUSE_DEV_IOC_BACKING_OPEN once to register the dax
> > device. The flags field could be used to tell the kernel to treat
> > this fd as a dax device instead of a a regular file.

A dax device to famfs is a lot more like a backing device for a "filesystem"
than a backing file for another file. And, as previously mentioned, there
is the iomap_ops interface and the holder_ops interface that deal with
multiple file tenants on a dax device (plus error notification,
respectively)

Probably doable, but important distinctions...

> >
> > Letter, when the file is opened the extent list could be sent in the
> > open reply together with the backing id. The fuse_ext_header
> > mechanism seems suitable for this.
> >
> > And I think that's it as far as API's are concerned.
> >
> > Note: this is already more generic than the current famfs prototype,
> > since multiple dax devices could be used as backing for famfs files,
> > with the constraint that a single file can only map data from a single
> > dax device.
> >
> > As for implementing dax passthrough, I think that needs a separate
> > source file, the one used by virtiofs (fs/fuse/dax.c) does not appear
> > to have many commonalities with this one. That could be renamed to
> > virtiofs_dax.c as it's pretty much virtiofs specific, AFAICT.
> >
> > Comments?
>
> Would probably also need to decouple CONFIG_FUSE_DAX
> from CONFIG_FUSE_VIRTIO_DAX.
>
> What about fc->dax_mode (i.e. dax= mount option)?
>
> What about FUSE_IS_DAX()? does it apply to both dax implementations?
>
> Sounds like a decent plan.
> John, let us know if you need help understanding the details.

I'm certain I will need some help, but I'll try to do my part.

First question: can you suggest an example fuse file pass-through
file system that I might use as a jumping-off point? Something that
gets the basic pass-through capability from which to start hacking
in famfs/dax capabilities?

When I started on famfs, I used ramfs because it got me all the basic
file system functionality minus a backing store. Then I built the dax
functionality by referring to xfs.

>
> > Am I missing something significant?
>
> Would we need to set IS_DAX() on inode init time or can we set it
> later on first file open?
>
> Currently, iomodes enforces that all opens are either
> mapped to same backing file or none mapped to backing file:
>
> fuse_inode_uncached_io_start()
> {
> ...
> /* deny conflicting backing files on same fuse inode */
>
> The iomodes rules will need to be amended to verify that:
> - IS_DAX() inode open is always mapped to backing dax device
> - All files of the same fuse inode are mapped to the same range
> of backing file/dax device.

I'm confused by the last item. I would think there would be a fuse
inode per famfs file, and that multiple of those would map to separate
extent lists of one or more backing dax devices.

Or maybe I misunderstand the meaning of "fuse inode". Feel free to
assign reading...

>
> Thanks,
> Amir.

Thanks Miklos and Amir,
John

[1] https://lore.kernel.org/linux-fsdevel/[email protected]/T/#m3b11e8d311eca80763c7d6f27d43efd1cdba628b
[2] https://github.com/cxl-micron-reskit/famfs-linux



2024-05-22 08:58:40

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On Wed, 22 May 2024 at 04:05, John Groves <[email protected]> wrote:
> I'm happy to help with that if you care - ping me if so; getting a VM running
> in EFI mode is not necessary if you reserve the dax memory via memmap=, or
> via libvirt xml.

Could you please give an example?

I use a raw qemu command line with a -kernel option and a root fs
image (not a disk image with a bootloader).


> More generally, a famfs file extent is [daxdev, offset, len]; there may
> be multiple extents per file, and in the future this definitely needs to
> generalize to multiple daxdev's.
>
> Disclaimer: I'm still coming up to speed on fuse (slowly and ignorantly,
> I think)...
>
> A single backing device (daxdev) will contain extents of many famfs
> files (plus metadata - currently a superblock and a log). I'm not sure
> it's realistic to have a backing daxdev "open" per famfs file.

That's exactly what I was saying.

The passthrough interface was deliberately done in a way to separate
the mapping into two steps:

1) registering the backing file (which could be a device)

2) mapping from a fuse file to a registered backing file

Step 1 can happen at any time, while step 2 currently happens at open,
but for various other purposes like metadata passthrough it makes
sense to allow the mapping to happen at lookup time and be cached for
the lifetime of the inode.

> In addition there is:
>
> - struct dax_holder_operations - to allow a notify_failure() upcall
> from dax. This provides the critical capability to shut down famfs
> if there are memory errors. This is filesystem- (or technically daxdev-
> wide)

This can be hooked into fuse_is_bad().

> - The pmem or devdax iomap_ops - to allow the fsdax file system (famfs,
> and [soon] famfs_fuse) to call dax_iomap_rw() and dax_iomap_fault().
> I strongly suspect that famfs_fuse can't be correct unless it uses
> this path rather than just the idea of a single backing file.

Agreed.

> - the dev_dax_iomap portion of the famfs patchsets adds iomap_ops to
> character devdax.

You'll need to channel those patches through the respective
maintainers, preferably before the fuse parts are merged.

> - Note that dax devices, unlike files, don't support read/write - only
> mmap(). I suspect (though I'm still pretty ignorant) that this means
> we can't just treat the dax device as an extent-based backing file.

Doesn't matter, it'll use the iomap infrastructure instead of the
passthrough infrastructure.

But the interfaces for regular passthrough and fsdax could be shared.
Conceptually they are very similar: there's a backing store indexable
with byte offsets.

What's currently missing from the API is an extent list in
fuse_open_out. The format could be:

[ {backing_id, offset, length}, ... ]

allowing each extent to map to a different backing device.

> A dax device to famfs is a lot more like a backing device for a "filesystem"
> than a backing file for another file. And, as previously mentioned, there
> is the iomap_ops interface and the holder_ops interface that deal with
> multiple file tenants on a dax device (plus error notification,
> respectively)
>
> Probably doable, but important distinctions...

Yeah, that's why I suggested to create a new source file for this
within fs/fuse. Alternatively we could try splitting up fuse into
modules (core, virtiofs, cuse, fsdax) but I think that can be left as
a cleanup step.

> First question: can you suggest an example fuse file pass-through
> file system that I might use as a jumping-off point? Something that
> gets the basic pass-through capability from which to start hacking
> in famfs/dax capabilities?

An example is in Amir's libfuse repo at

https://github.com/libfuse/libfuse

> I'm confused by the last item. I would think there would be a fuse
> inode per famfs file, and that multiple of those would map to separate
> extent lists of one or more backing dax devices.

Yeah.

> Or maybe I misunderstand the meaning of "fuse inode". Feel free to
> assign reading...

I think Amir meant that each open file could in theory have a
different mapping. This is allowed by the fuse interface, but is
disallowed in practice.

I'm in favor of caching the extent map so it only has to be given on
the first open (or lookup).

Thanks,
Miklos

2024-05-22 10:16:37

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On Wed, May 22, 2024 at 11:58 AM Miklos Szeredi <[email protected]> wrote:
>
> On Wed, 22 May 2024 at 04:05, John Groves <[email protected]> wrote:
> > I'm happy to help with that if you care - ping me if so; getting a VM running
> > in EFI mode is not necessary if you reserve the dax memory via memmap=, or
> > via libvirt xml.
>
> Could you please give an example?
>
> I use a raw qemu command line with a -kernel option and a root fs
> image (not a disk image with a bootloader).
>
>
> > More generally, a famfs file extent is [daxdev, offset, len]; there may
> > be multiple extents per file, and in the future this definitely needs to
> > generalize to multiple daxdev's.
> >
> > Disclaimer: I'm still coming up to speed on fuse (slowly and ignorantly,
> > I think)...
> >
> > A single backing device (daxdev) will contain extents of many famfs
> > files (plus metadata - currently a superblock and a log). I'm not sure
> > it's realistic to have a backing daxdev "open" per famfs file.
>
> That's exactly what I was saying.
>
> The passthrough interface was deliberately done in a way to separate
> the mapping into two steps:
>
> 1) registering the backing file (which could be a device)
>
> 2) mapping from a fuse file to a registered backing file
>
> Step 1 can happen at any time, while step 2 currently happens at open,
> but for various other purposes like metadata passthrough it makes
> sense to allow the mapping to happen at lookup time and be cached for
> the lifetime of the inode.
>
> > In addition there is:
> >
> > - struct dax_holder_operations - to allow a notify_failure() upcall
> > from dax. This provides the critical capability to shut down famfs
> > if there are memory errors. This is filesystem- (or technically daxdev-
> > wide)
>
> This can be hooked into fuse_is_bad().
>
> > - The pmem or devdax iomap_ops - to allow the fsdax file system (famfs,
> > and [soon] famfs_fuse) to call dax_iomap_rw() and dax_iomap_fault().
> > I strongly suspect that famfs_fuse can't be correct unless it uses
> > this path rather than just the idea of a single backing file.
>
> Agreed.
>
> > - the dev_dax_iomap portion of the famfs patchsets adds iomap_ops to
> > character devdax.
>
> You'll need to channel those patches through the respective
> maintainers, preferably before the fuse parts are merged.
>
> > - Note that dax devices, unlike files, don't support read/write - only
> > mmap(). I suspect (though I'm still pretty ignorant) that this means
> > we can't just treat the dax device as an extent-based backing file.
>
> Doesn't matter, it'll use the iomap infrastructure instead of the
> passthrough infrastructure.
>
> But the interfaces for regular passthrough and fsdax could be shared.
> Conceptually they are very similar: there's a backing store indexable
> with byte offsets.
>
> What's currently missing from the API is an extent list in
> fuse_open_out. The format could be:
>
> [ {backing_id, offset, length}, ... ]
>
> allowing each extent to map to a different backing device.
>
> > A dax device to famfs is a lot more like a backing device for a "filesystem"
> > than a backing file for another file. And, as previously mentioned, there
> > is the iomap_ops interface and the holder_ops interface that deal with
> > multiple file tenants on a dax device (plus error notification,
> > respectively)
> >
> > Probably doable, but important distinctions...
>
> Yeah, that's why I suggested to create a new source file for this
> within fs/fuse. Alternatively we could try splitting up fuse into
> modules (core, virtiofs, cuse, fsdax) but I think that can be left as
> a cleanup step.
>
> > First question: can you suggest an example fuse file pass-through
> > file system that I might use as a jumping-off point? Something that
> > gets the basic pass-through capability from which to start hacking
> > in famfs/dax capabilities?
>
> An example is in Amir's libfuse repo at
>
> https://github.com/libfuse/libfuse
>

That's not my repo, it's the official one ;-)
but yeh, my passthrough example got merged last week:
https://github.com/libfuse/libfuse/pull/919

> > I'm confused by the last item. I would think there would be a fuse
> > inode per famfs file, and that multiple of those would map to separate
> > extent lists of one or more backing dax devices.
>
> Yeah.
>
> > Or maybe I misunderstand the meaning of "fuse inode". Feel free to
> > assign reading...
>
> I think Amir meant that each open file could in theory have a
> different mapping. This is allowed by the fuse interface, but is
> disallowed in practice.
>
> I'm in favor of caching the extent map so it only has to be given on
> the first open (or lookup).

Yeh, sorry, that was a bit confusing.
The statement is that because the simples plan as Miklos
suggested is to pass the extent list in reply to open
two different opens of the same inode are not allowed to
pass in different extent lists.

The new iomode.c code does something similar.
Currently fuse_inode has a reference to fuse_backing which
stores the backing file (that can be the dax device) and it also
has a reference to fuse_inode_dax with an rbtree of fuse_dax_mapping
Can we reuse fuse_inode_dax for the needs of famfs?

The first open would cache the extent list in fuse_inode and
second open would verify that the extent list matches.

Last file close could clean the cache extent list or not - that
is an API decision.

Thanks,
Amir.

2024-05-22 11:29:09

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On Wed, 22 May 2024 at 12:16, Amir Goldstein <[email protected]> wrote:

> The first open would cache the extent list in fuse_inode and
> second open would verify that the extent list matches.
>
> Last file close could clean the cache extent list or not - that
> is an API decision.

Well, current API clears the mapping, and I would treat the fi->fb as
a just a special case of the extent list. So by default I'd keep this
behavior, but perhaps it would make sense to optionally allow the
mapping to remain after the last close. For now this is probably not
relevant...

Thanks,
Miklos

2024-05-22 13:42:09

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On Wed, May 22, 2024 at 2:28 PM Miklos Szeredi <[email protected]> wrote:
>
> On Wed, 22 May 2024 at 12:16, Amir Goldstein <[email protected]> wrote:
>
> > The first open would cache the extent list in fuse_inode and
> > second open would verify that the extent list matches.
> >
> > Last file close could clean the cache extent list or not - that
> > is an API decision.
>
> Well, current API clears the mapping, and I would treat the fi->fb as
> a just a special case of the extent list. So by default I'd keep this
> behavior, but perhaps it would make sense to optionally allow the
> mapping to remain after the last close. For now this is probably not
> relevant...

Already in the works ;)

Not tested - probably not working POC:
https://github.com/amir73il/linux/commits/fuse-backing-inode-wip

I am trying an API to opt into inode operation passthrough, which
has a by-product of keeping fi->fb around after last close.

This is designed to be setup on lookup, but could also be setup on
first open.

I have some ideas for how to return backing id with lookup
(and readdirplus) response, but haven't tried them yet.
But setup backing file from lookup response will surely
stick around until inode evict.

Thanks,
Amir.

2024-05-23 02:49:44

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On 24/05/22 10:58AM, Miklos Szeredi wrote:
> On Wed, 22 May 2024 at 04:05, John Groves <[email protected]> wrote:
> > I'm happy to help with that if you care - ping me if so; getting a VM running
> > in EFI mode is not necessary if you reserve the dax memory via memmap=, or
> > via libvirt xml.
>
> Could you please give an example?
>
> I use a raw qemu command line with a -kernel option and a root fs
> image (not a disk image with a bootloader).

That's not the way I'm running VMs, but... I presume you know how to add
kernel command line arguments to VMs that you run this way?

- memmap=<size>!<hpa_offset> will reserve a pretend pmem device at <hpa_offset>
- memmap=<size>$<hpa_offset> will reserve a pretend dax device at <hpa_offset>

Both of the above will work regardless of whether the VM is in EFI mode.
The '$' is harder to escape through grub; and the pmem device can be converted
to devdax via 'ndctl reconfigure-device --mode=devdax...'. A dax device would
likely also need to be put in devdax mode (as the default seems to be
system-ram mode).

Incomplete documentation (that you have probably already seen) is at [1]

I can dig deeper if needed.

Otherwise the feedback in this thread makes sense to me and I'm planning to
start hacking on famfs patches Thursday. Watch this space ;)

Regards,
John

[1] https://github.com/cxl-micron-reskit/famfs/blob/master/markdown/vm-configuration.md


2024-05-23 13:59:25

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

[trimming CC list]

On Thu, 23 May 2024 at 04:49, John Groves <[email protected]> wrote:

> - memmap=<size>!<hpa_offset> will reserve a pretend pmem device at <hpa_offset>
> - memmap=<size>$<hpa_offset> will reserve a pretend dax device at <hpa_offset>

Doesn't get me a /dev/dax or /dev/pmem

Complete qemu command line:

qemu-kvm -s -serial none -parallel none -kernel
/home/mszeredi/git/linux/arch/x86/boot/bzImage -drive
format=raw,file=/home/mszeredi/root_fs,index=0,if=virtio -drive
format=raw,file=/home/mszeredi/images/ubd1,index=1,if=virtio -chardev
stdio,id=virtiocon0,signal=off -device virtio-serial -device
virtconsole,chardev=virtiocon0 -cpu host -m 8G -net user -net
nic,model=virtio -fsdev local,security_model=none,id=fsdev0,path=/home
-device virtio-9p-pci,fsdev=fsdev0,mount_tag=hostshare -device
virtio-rng-pci -smp 4 -append 'root=/dev/vda console=hvc0
memmap=4G$4G'

root@kvm:~/famfs# scripts/chk_efi.sh
This system is neither Ubuntu nor Fedora. It is identified as debian.
/sys/firmware/efi not found; probably not efi
not found; probably nof efi
/boot/efi/EFI not found; probably not efi
/boot/efi/EFI/BOOT not found; probably not efi
/boot/efi/EFI/ not found; probably not efi
/boot/efi/EFI//grub.cfg not found; probably nof efi
Probably not efi; errs=6

Thanks,
Miklos

2024-05-24 00:47:24

by John Groves

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On 24/05/23 03:57PM, Miklos Szeredi wrote:
> [trimming CC list]
>
> On Thu, 23 May 2024 at 04:49, John Groves <[email protected]> wrote:
>
> > - memmap=<size>!<hpa_offset> will reserve a pretend pmem device at <hpa_offset>
> > - memmap=<size>$<hpa_offset> will reserve a pretend dax device at <hpa_offset>
>
> Doesn't get me a /dev/dax or /dev/pmem
>
> Complete qemu command line:
>
> qemu-kvm -s -serial none -parallel none -kernel
> /home/mszeredi/git/linux/arch/x86/boot/bzImage -drive
> format=raw,file=/home/mszeredi/root_fs,index=0,if=virtio -drive
> format=raw,file=/home/mszeredi/images/ubd1,index=1,if=virtio -chardev
> stdio,id=virtiocon0,signal=off -device virtio-serial -device
> virtconsole,chardev=virtiocon0 -cpu host -m 8G -net user -net
> nic,model=virtio -fsdev local,security_model=none,id=fsdev0,path=/home
> -device virtio-9p-pci,fsdev=fsdev0,mount_tag=hostshare -device
> virtio-rng-pci -smp 4 -append 'root=/dev/vda console=hvc0
> memmap=4G$4G'
>
> root@kvm:~/famfs# scripts/chk_efi.sh
> This system is neither Ubuntu nor Fedora. It is identified as debian.
> /sys/firmware/efi not found; probably not efi
> not found; probably nof efi
> /boot/efi/EFI not found; probably not efi
> /boot/efi/EFI/BOOT not found; probably not efi
> /boot/efi/EFI/ not found; probably not efi
> /boot/efi/EFI//grub.cfg not found; probably nof efi
> Probably not efi; errs=6
>
> Thanks,
> Miklos


Apologies, but I'm short on time at the moment - going into a long holiday
weekend in the US with family plans. I should be focused again by middle of
next week.

But can you check /proc/cmdline to see of the memmap arg got through without
getting mangled? The '$' tends to get fubar'd. You might need \$, or I've seen
the need for \\\$. If it's un-mangled, there should be a dax device.

If that doesn't work, it's worth trying '!' instead, which I think would give
you a pmem device - if the arg gets through (but ! is less likely to get
horked). That pmem device can be converted to devdax...

Regards,
John


2024-05-24 07:56:20

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On Fri, 24 May 2024 at 02:47, John Groves <[email protected]> wrote:

> Apologies, but I'm short on time at the moment - going into a long holiday
> weekend in the US with family plans. I should be focused again by middle of
> next week.

NP.

Obviously I'll need to test it before anything is merged, other than
that this is not urgent at all...

> But can you check /proc/cmdline to see of the memmap arg got through without
> getting mangled? The '$' tends to get fubar'd. You might need \$, or I've seen
> the need for \\\$. If it's un-mangled, there should be a dax device.

/proc/cmdline shows the option correctly:

root@kvm:~# cat /proc/cmdline
root=/dev/vda console=hvc0 memmap=4G$4G

> If that doesn't work, it's worth trying '!' instead, which I think would give
> you a pmem device - if the arg gets through (but ! is less likely to get
> horked). That pmem device can be converted to devdax...

That doesn't work either. No device created in /dev (dax or pmem).

free(1) does show that the reserved memory is gone in both cases, so
something does happen.

Attaching my .config as well.

Thanks,
Miklos


Attachments:
.config (87.52 kB)

2024-05-25 22:55:03

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

On Fri, May 24, 2024 at 09:55:48AM +0200, Miklos Szeredi wrote:
> On Fri, 24 May 2024 at 02:47, John Groves <[email protected]> wrote:
>
> > Apologies, but I'm short on time at the moment - going into a long holiday
> > weekend in the US with family plans. I should be focused again by middle of
> > next week.
>
> NP.
>
> Obviously I'll need to test it before anything is merged, other than
> that this is not urgent at all...
>
> > But can you check /proc/cmdline to see of the memmap arg got through without
> > getting mangled? The '$' tends to get fubar'd. You might need \$, or I've seen
> > the need for \\\$. If it's un-mangled, there should be a dax device.
>
> /proc/cmdline shows the option correctly:
>
> root@kvm:~# cat /proc/cmdline
> root=/dev/vda console=hvc0 memmap=4G$4G
>
> > If that doesn't work, it's worth trying '!' instead, which I think would give
> > you a pmem device - if the arg gets through (but ! is less likely to get
> > horked). That pmem device can be converted to devdax...
>
> That doesn't work either. No device created in /dev (dax or pmem).

I think you need to do some ndctl magic to get the memory to be
namespaced correctly for the correct devices to appear.

https://docs.pmem.io/ndctl-user-guide/managing-namespaces

IIRC, need to set the type to pmem and the mode to fsdax, devdax or
raw to get the relevant device nodes to be created for the range..

Cheers,

Dave.

--
Dave Chinner
[email protected]