tl;dr: The proposed S_IOMAP_IMMUTABLE mechanism
The daxfile proposal a few weeks back [1] sought to piggy back on the
swapfile implementation to approximate a block map immutable file. This
is an idea Dave originated last year to solve the dax "flush from
userspace" problem [2].
The discussion yielded several results. First, Christoph pointed out that
swapfiles are subtly broken [3]. Second, Darrick [4]
and Dave [5] proposed how to properly implement a block map immutable file.
Finally, Dave identified some improvements to swapfiles that can be
built on the block-map-immutable mecahanism. These patches seek to
implement the first part of the proposal and save the swapfile work to
build on top once the base mechanism is complete.
While the initial motivation for this feature is support for
byte-addressable updates of persistent memory and managing cache
maintenance from userspace, the applications of the feature are broader.
In addition to being the start of a better swapfile mechanism it can
also support a DMA-to-storage use case. This use case enables
data-acquisition hardware to DMA directly to a storage device address
while being safe in the knowledge that storage mappings will not change.
These patches are relative to Darrick's 'devel' tree. Patch 3 is likely
wrong in the way it sets the new XFS_DIFLAG2_IOMAP_IMMUTABLE flag, but
seems to work with a basic test. The test just turns the flag on and
off, checks that the file is fully allocated and immutable, and
validates that the state persists over a umount / mount cycle. A proper
xfstest is in the works, but comments on this first draft are welcome.
[1]: https://lkml.org/lkml/2017/6/16/790
[2]: https://lkml.org/lkml/2016/9/11/159
[3]: https://lkml.org/lkml/2017/6/18/31
[4]: https://lkml.org/lkml/2017/6/20/49
[5]: https://www.spinics.net/lists/linux-xfs/msg07871.html
---
Dan Williams (3):
fs, xfs: introduce S_IOMAP_IMMUTABLE
fs, xfs: introduce FALLOC_FL_SEAL_BLOCK_MAP
xfs: persist S_IOMAP_IMMUTABLE in di_flags2
fs/attr.c | 10 ++++
fs/namei.c | 3 +
fs/open.c | 28 ++++++++++++
fs/read_write.c | 3 +
fs/xfs/libxfs/xfs_format.h | 5 ++
fs/xfs/xfs_bmap_util.c | 98 ++++++++++++++++++++++++++++++++++++++++++-
fs/xfs/xfs_bmap_util.h | 4 +-
fs/xfs/xfs_file.c | 14 ++++--
fs/xfs/xfs_ioctl.c | 10 ++++
fs/xfs/xfs_iops.c | 8 ++--
include/linux/falloc.h | 3 +
include/linux/fs.h | 2 +
include/uapi/linux/falloc.h | 19 ++++++++
mm/filemap.c | 9 ++++
14 files changed, 200 insertions(+), 16 deletions(-)
An inode with this flag set indicates that the file's block map cannot
be changed, no size change, deletion, hole-punch, range collapse, or
reflink.
The implementation of toggling the flag and sealing the state of the
extent map is saved for a later patch. The functionality provided by
S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
that provided by S_SWAPFILE, and it is targeted to replace it.
For now, only xfs and the core vfs are updated to consider the new flag.
The additional check that is added for this flag, beyond what we are
already doing for swapfiles, is to truncate or abort writes that try to
extend the file size.
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Alexander Viro <[email protected]>
Suggested-by: Dave Chinner <[email protected]>
Suggested-by: "Darrick J. Wong" <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/attr.c | 10 ++++++++++
fs/namei.c | 3 ++-
fs/open.c | 6 ++++++
fs/read_write.c | 3 +++
fs/xfs/xfs_ioctl.c | 6 ++++++
include/linux/fs.h | 2 ++
mm/filemap.c | 9 +++++++++
7 files changed, 38 insertions(+), 1 deletion(-)
diff --git a/fs/attr.c b/fs/attr.c
index 135304146120..8573e364bd06 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare);
*/
int inode_newsize_ok(const struct inode *inode, loff_t offset)
{
+ if (IS_IOMAP_IMMUTABLE(inode)) {
+ /*
+ * Any size change is disallowed. Size increases may
+ * dirty metadata that an application is not prepared to
+ * sync, and a size decrease may expose free blocks to
+ * in-flight DMA.
+ */
+ return -ETXTBSY;
+ }
+
if (inode->i_size < offset) {
unsigned long limit;
diff --git a/fs/namei.c b/fs/namei.c
index ddb6a7c2b3d4..588f1135c42c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2793,7 +2793,8 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
return -EPERM;
if (check_sticky(dir, inode) || IS_APPEND(inode) ||
- IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode))
+ IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode)
+ || IS_IOMAP_IMMUTABLE(inode))
return -EPERM;
if (isdir) {
if (!d_is_dir(victim))
diff --git a/fs/open.c b/fs/open.c
index 35bb784763a4..7395860d7164 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return -ETXTBSY;
/*
+ * We cannot allow any allocation changes on an iomap immutable file
+ */
+ if (IS_IOMAP_IMMUTABLE(inode))
+ return -ETXTBSY;
+
+ /*
* Revalidate the write permissions, in case security policy has
* changed since the files were opened.
*/
diff --git a/fs/read_write.c b/fs/read_write.c
index 0cc7033aa413..dc673be7c7cb 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
return -ETXTBSY;
+ if (IS_IOMAP_IMMUTABLE(inode_in) || IS_IOMAP_IMMUTABLE(inode_out))
+ return -ETXTBSY;
+
/* Don't reflink dirs, pipes, sockets... */
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
return -EISDIR;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index e75c40a47b7d..2e64488bc4de 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1755,6 +1755,12 @@ xfs_ioc_swapext(
goto out_put_tmp_file;
}
+ if (IS_IOMAP_IMMUTABLE(file_inode(f.file)) ||
+ IS_IOMAP_IMMUTABLE(file_inode(tmp.file))) {
+ error = -EINVAL;
+ goto out_put_tmp_file;
+ }
+
/*
* We need to ensure that the fds passed in point to XFS inodes
* before we cast and access them as XFS structures as we have no
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6e1fd5d21248..0a254b768855 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1829,6 +1829,7 @@ struct super_operations {
#else
#define S_DAX 0 /* Make all the DAX code disappear */
#endif
+#define S_IOMAP_IMMUTABLE 16384 /* logical-to-physical extent map is fixed */
/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1867,6 +1868,7 @@ struct super_operations {
#define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
#define IS_DAX(inode) ((inode)->i_flags & S_DAX)
+#define IS_IOMAP_IMMUTABLE(inode) ((inode)->i_flags & S_IOMAP_IMMUTABLE)
#define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \
(inode)->i_rdev == WHITEOUT_DEV)
diff --git a/mm/filemap.c b/mm/filemap.c
index a49702445ce0..e4a6529da2bd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2806,6 +2806,15 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
if (unlikely(pos >= inode->i_sb->s_maxbytes))
return -EFBIG;
+ /* Are we about to mutate the block map on an immutable file? */
+ if (IS_IOMAP_IMMUTABLE(inode)
+ && (pos + iov_iter_count(from) > i_size_read(inode))) {
+ if (pos < i_size_read(inode))
+ iov_iter_truncate(from, i_size_read(inode) - pos);
+ else
+ return -ETXTBSY;
+ }
+
iov_iter_truncate(from, inode->i_sb->s_maxbytes - pos);
return iov_iter_count(from);
}
>From falloc.h:
FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
file logical-to-physical extent offset mappings in the file. The
purpose is to allow an application to assume that there are no holes
or shared extents in the file and that the metadata needed to find
all the physical extents of the file is stable and can never be
dirtied.
For now this patch only permits setting / clearing the in-memory state
of S_IOMAP_IMMMUTABLE, persisting the state is saved for a later patch.
The implementation is careful to not allow the immutable state to change
while any process might have any established mappings. It reuses the
existing xfs_reflink_unshare() and xfs_alloc_file_space() to unshare
extents and fill all holes in the file, or otherwise extend the file
size in the same operation that sets S_IOMAP_IMMUTABLE.
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Alexander Viro <[email protected]>
Suggested-by: Dave Chinner <[email protected]>
Suggested-by: "Darrick J. Wong" <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/open.c | 26 ++++++++++++-
fs/xfs/xfs_bmap_util.c | 86 +++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_bmap_util.h | 2 +
fs/xfs/xfs_file.c | 14 +++++--
include/linux/falloc.h | 3 +-
include/uapi/linux/falloc.h | 19 ++++++++++
6 files changed, 142 insertions(+), 8 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 7395860d7164..df075484fad5 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -241,7 +241,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
struct inode *inode = file_inode(file);
long ret;
- if (offset < 0 || len <= 0)
+ if (offset < 0 || len < 0)
+ return -EINVAL;
+
+ /* Allow zero len only for the unseal operation */
+ if (!(mode & FALLOC_FL_SEAL_BLOCK_MAP) && len == 0)
return -EINVAL;
/* Return error if mode is not supported */
@@ -273,6 +277,17 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
(mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
return -EINVAL;
+ /*
+ * Seal block map should only be used exclusively, and with
+ * the IMMUTABLE capability.
+ */
+ if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
+ if (mode & ~FALLOC_FL_SEAL_BLOCK_MAP)
+ return -EINVAL;
+ if (!capable(CAP_LINUX_IMMUTABLE))
+ return -EPERM;
+ }
+
if (!(file->f_mode & FMODE_WRITE))
return -EBADF;
@@ -292,9 +307,14 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return -ETXTBSY;
/*
- * We cannot allow any allocation changes on an iomap immutable file
+ * We cannot allow any allocation changes on an iomap immutable
+ * file, however if the operation is FALLOC_FL_SEAL_BLOCK_MAP,
+ * call down to ->fallocate() to determine if the operations is
+ * allowed. ->fallocate() may either clear the flag when @len is
+ * zero, or validate that the requested operation is already the
+ * current state of the file.
*/
- if (IS_IOMAP_IMMUTABLE(inode))
+ if (IS_IOMAP_IMMUTABLE(inode) && (!(mode & FALLOC_FL_SEAL_BLOCK_MAP)))
return -ETXTBSY;
/*
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 93e955262d07..c4fc79a0704f 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1387,6 +1387,92 @@ xfs_zero_file_space(
}
+int
+xfs_seal_file_space(
+ struct xfs_inode *ip,
+ xfs_off_t offset,
+ xfs_off_t len)
+{
+ struct inode *inode = VFS_I(ip);
+ struct address_space *mapping = inode->i_mapping;
+ int error = 0;
+
+ if (offset)
+ return -EINVAL;
+
+ i_mmap_lock_read(mapping);
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ if (len == 0) {
+ /*
+ * Clear the immutable flag provided there are no active
+ * mappings. The active mapping check prevents an
+ * application that is assuming a static block map, for
+ * DAX or peer-to-peer DMA, from having this state
+ * silently change behind its back.
+ */
+ if (RB_EMPTY_ROOT(&mapping->i_mmap))
+ inode->i_flags &= ~S_IOMAP_IMMUTABLE;
+ else
+ error = -EBUSY;
+ } else if (IS_IOMAP_IMMUTABLE(inode)) {
+ if (len == i_size_read(inode)) {
+ /*
+ * The file is already in the correct state,
+ * bail out without error below.
+ */
+ len = 0;
+ } else {
+ /* too late to allocate more space */
+ error = -ETXTBSY;
+ }
+ } else {
+ if (len < i_size_read(inode)) {
+ /*
+ * Since S_IOMAP_IMMUTABLE is inode global it
+ * does not make sense to fallocate(immutable)
+ * on a sub-range of the file.
+ */
+ error = -EINVAL;
+ } else if (!RB_EMPTY_ROOT(&mapping->i_mmap)) {
+ /*
+ * It's not strictly required to prevent setting
+ * immutable while a file is already mapped, but
+ * we do it for simplicity and symmetry with the
+ * S_IOMAP_IMMUTABLE disable case.
+ */
+ error = -EBUSY;
+ } else
+ inode->i_flags |= S_IOMAP_IMMUTABLE;
+ }
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+ i_mmap_unlock_read(mapping);
+
+ if (error || len == 0)
+ return error;
+
+ /*
+ * From here, the immutable flag is already set, so new
+ * operations that would change the block map are prevented by
+ * upper layer code paths. Wwe can proceed to unshare and
+ * allocate zeroed / written extents.
+ */
+ error = xfs_reflink_unshare(ip, offset, len);
+ if (error)
+ goto err;
+
+ error = xfs_alloc_file_space(ip, offset, len,
+ XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO);
+ if (error)
+ goto err;
+
+ return 0;
+err:
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ inode->i_flags &= ~S_IOMAP_IMMUTABLE;
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+ return error;
+}
+
/*
* @next_fsb will keep track of the extent currently undergoing shift.
* @stop_fsb will keep track of the extent at which we have to stop.
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 0cede1043571..5115a32a2483 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -60,6 +60,8 @@ int xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
xfs_off_t len);
int xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
xfs_off_t len);
+int xfs_seal_file_space(struct xfs_inode *, xfs_off_t offset,
+ xfs_off_t len);
/* EOF block manipulation functions */
bool xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c4893e226fd8..e21121530a90 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -739,7 +739,8 @@ xfs_file_write_iter(
#define XFS_FALLOC_FL_SUPPORTED \
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \
FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE | \
- FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE)
+ FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE | \
+ FALLOC_FL_SEAL_BLOCK_MAP)
STATIC long
xfs_file_fallocate(
@@ -834,9 +835,14 @@ xfs_file_fallocate(
error = xfs_reflink_unshare(ip, offset, len);
if (error)
goto out_unlock;
- }
- error = xfs_alloc_file_space(ip, offset, len,
- XFS_BMAPI_PREALLOC);
+
+ error = xfs_alloc_file_space(ip, offset, len,
+ XFS_BMAPI_PREALLOC);
+ } else if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
+ error = xfs_seal_file_space(ip, offset, len);
+ } else
+ error = xfs_alloc_file_space(ip, offset, len,
+ XFS_BMAPI_PREALLOC);
}
if (error)
goto out_unlock;
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 7494dc67c66f..48546c6fbec7 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -26,6 +26,7 @@ struct space_resv {
FALLOC_FL_COLLAPSE_RANGE | \
FALLOC_FL_ZERO_RANGE | \
FALLOC_FL_INSERT_RANGE | \
- FALLOC_FL_UNSHARE_RANGE)
+ FALLOC_FL_UNSHARE_RANGE | \
+ FALLOC_FL_SEAL_BLOCK_MAP)
#endif /* _FALLOC_H_ */
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index b075f601919b..629c9b20e49b 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -76,4 +76,23 @@
*/
#define FALLOC_FL_UNSHARE_RANGE 0x40
+/*
+ * FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
+ * file logical-to-physical extent offset mappings in the file. The
+ * purpose is to allow an application to assume that there are no holes
+ * or shared extents in the file and that the metadata needed to find
+ * all the physical extents of the file is stable and can never be
+ * dirtied.
+ *
+ * The immutable property is in effect for the entire inode, so the
+ * range for this operation must start at offset 0 and len must be
+ * greater than or equal to the current size of the file. If greater,
+ * this operation allocates, unshares, hole fills, and seals in one
+ * atomic step. If len is zero then the immutable state is cleared for
+ * the inode.
+ *
+ * This flag implies FALLOC_FL_UNSHARE_RANGE and as such cannot be used
+ * with the punch, zero, collapse, or insert range modes.
+ */
+#define FALLOC_FL_SEAL_BLOCK_MAP 0x80
#endif /* _UAPI_FALLOC_H_ */
Record the immutable state in the on-disk inode so that on the next boot
the protections against reflink and hole punch etc are automatically
restored.
This deliberately does not add a FS_XFLAG_IOMAP_IMMUTABLE since
fallocate(2) is the path to toggle this flag.
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Alexander Viro <[email protected]>
Suggested-by: Dave Chinner <[email protected]>
Suggested-by: "Darrick J. Wong" <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/xfs/libxfs/xfs_format.h | 5 ++++-
fs/xfs/xfs_bmap_util.c | 14 +++++++++++---
fs/xfs/xfs_bmap_util.h | 2 +-
fs/xfs/xfs_file.c | 4 ++--
fs/xfs/xfs_ioctl.c | 4 ++--
fs/xfs/xfs_iops.c | 8 +++++---
6 files changed, 25 insertions(+), 12 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index d4d9bef20c3a..9e720e55776b 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1063,12 +1063,15 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
#define XFS_DIFLAG2_DAX_BIT 0 /* use DAX for this inode */
#define XFS_DIFLAG2_REFLINK_BIT 1 /* file's blocks may be shared */
#define XFS_DIFLAG2_COWEXTSIZE_BIT 2 /* copy on write extent size hint */
+#define XFS_DIFLAG2_IOMAP_IMMUTABLE_BIT 3 /* set S_IOMAP_IMMUTABLE for this inode */
#define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
#define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
#define XFS_DIFLAG2_COWEXTSIZE (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
+#define XFS_DIFLAG2_IOMAP_IMMUTABLE (1 << XFS_DIFLAG2_IOMAP_IMMUTABLE_BIT)
#define XFS_DIFLAG2_ANY \
- (XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE)
+ (XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
+ XFS_DIFLAG2_IOMAP_IMMUTABLE)
/*
* Inode number format:
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index c4fc79a0704f..1dcb521da456 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1021,7 +1021,8 @@ xfs_alloc_file_space(
struct xfs_inode *ip,
xfs_off_t offset,
xfs_off_t len,
- int alloc_type)
+ int alloc_type,
+ uint64_t di_flags2)
{
xfs_mount_t *mp = ip->i_mount;
xfs_off_t count;
@@ -1119,6 +1120,12 @@ xfs_alloc_file_space(
break;
}
xfs_ilock(ip, XFS_ILOCK_EXCL);
+ if (di_flags2) {
+ /* fold inode attributes for this allocation */
+ ip->i_d.di_flags2 |= di_flags2;
+ di_flags2 = 0;
+ }
+
error = xfs_trans_reserve_quota_nblks(tp, ip, qblocks,
0, quota_flag);
if (error)
@@ -1381,7 +1388,7 @@ xfs_zero_file_space(
error = xfs_alloc_file_space(ip, round_down(offset, blksize),
round_up(offset + len, blksize) -
round_down(offset, blksize),
- XFS_BMAPI_PREALLOC);
+ XFS_BMAPI_PREALLOC, 0);
out:
return error;
@@ -1461,7 +1468,8 @@ xfs_seal_file_space(
goto err;
error = xfs_alloc_file_space(ip, offset, len,
- XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO);
+ XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO,
+ XFS_DIFLAG2_IOMAP_IMMUTABLE);
if (error)
goto err;
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 5115a32a2483..1698e0c58b78 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -51,7 +51,7 @@ int xfs_bmap_last_extent(struct xfs_trans *tp, struct xfs_inode *ip,
/* preallocation and hole punch interface */
int xfs_alloc_file_space(struct xfs_inode *ip, xfs_off_t offset,
- xfs_off_t len, int alloc_type);
+ xfs_off_t len, int alloc_type, uint64_t di_flags2);
int xfs_free_file_space(struct xfs_inode *ip, xfs_off_t offset,
xfs_off_t len);
int xfs_zero_file_space(struct xfs_inode *ip, xfs_off_t offset,
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e21121530a90..2d9ffa1773bd 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -837,12 +837,12 @@ xfs_file_fallocate(
goto out_unlock;
error = xfs_alloc_file_space(ip, offset, len,
- XFS_BMAPI_PREALLOC);
+ XFS_BMAPI_PREALLOC, 0);
} else if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
error = xfs_seal_file_space(ip, offset, len);
} else
error = xfs_alloc_file_space(ip, offset, len,
- XFS_BMAPI_PREALLOC);
+ XFS_BMAPI_PREALLOC, 0);
}
if (error)
goto out_unlock;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 2e64488bc4de..c829f85bed99 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -703,7 +703,7 @@ xfs_ioc_space(
case XFS_IOC_RESVSP64:
flags |= XFS_PREALLOC_SET;
error = xfs_alloc_file_space(ip, bf->l_start, bf->l_len,
- XFS_BMAPI_PREALLOC);
+ XFS_BMAPI_PREALLOC, 0);
break;
case XFS_IOC_UNRESVSP:
case XFS_IOC_UNRESVSP64:
@@ -716,7 +716,7 @@ xfs_ioc_space(
flags |= XFS_PREALLOC_CLEAR;
if (bf->l_start > XFS_ISIZE(ip)) {
error = xfs_alloc_file_space(ip, XFS_ISIZE(ip),
- bf->l_start - XFS_ISIZE(ip), 0);
+ bf->l_start - XFS_ISIZE(ip), 0, 0);
if (error)
goto out_unlock;
}
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 469c9fa4c178..174ef95453f5 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1186,9 +1186,10 @@ xfs_diflags_to_iflags(
struct xfs_inode *ip)
{
uint16_t flags = ip->i_d.di_flags;
+ uint64_t flags2 = ip->i_d.di_flags2;
inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
- S_NOATIME | S_DAX);
+ S_NOATIME | S_DAX | S_IOMAP_IMMUTABLE);
if (flags & XFS_DIFLAG_IMMUTABLE)
inode->i_flags |= S_IMMUTABLE;
@@ -1201,9 +1202,10 @@ xfs_diflags_to_iflags(
if (S_ISREG(inode->i_mode) &&
ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE &&
!xfs_is_reflink_inode(ip) &&
- (ip->i_mount->m_flags & XFS_MOUNT_DAX ||
- ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
+ (ip->i_mount->m_flags & XFS_MOUNT_DAX || flags2 & XFS_DIFLAG2_DAX))
inode->i_flags |= S_DAX;
+ if (flags2 & XFS_DIFLAG2_IOMAP_IMMUTABLE)
+ inode->i_flags |= S_IOMAP_IMMUTABLE;
}
/*
On Sat, Jul 29, 2017, at 03:43 PM, Dan Williams wrote:
> An inode with this flag set indicates that the file's block map cannot
> be changed, no size change, deletion, hole-punch, range collapse, or
> reflink.
>
> The implementation of toggling the flag and sealing the state of the
> extent map is saved for a later patch. The functionality provided by
> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
> that provided by S_SWAPFILE, and it is targeted to replace it.
>
> For now, only xfs and the core vfs are updated to consider the new flag.
Quite a while ago I started a request for O_OBJECT:
http://www.spinics.net/lists/linux-fsdevel/msg75085.html
A few months ago I was thinking about that more and realized
it'd likely be more palatable to land as an inode flag, like
you're doing here.
Now, S_IOMAP_IMMUTABLE would be quite close to the semantics
I want for ostree, except we also want to disallow
changes to the inode uid, gid or mode. (Extended attributes are
a whole other story; but I'd like to at least disallow changes to the
security. namespace).
The goal here is mostly about resilience to *accidental* changes;
think an admin doing `cp /path/to/binary /usr/bin/bash` which
does open(O_TRUNC), which would hence corrupt all hardlinks.
S_IOMAP_IMMUTABLE would give a lot of great protection against
those types of accidental changes - most of them are either going
to be open(O_TRUNC) or O_APPEND. Since you're touching various
write paths here, perhaps we can also add
S_CONTENTS_IMMUTABLE or something at the same time?
If this lands as is - I'm quite likely to change ostree to use it;
any objections to that? As mentioned in the thread, there are several
other cases of "content immutable" files in userspace, such as
QEMU "qcow2", git objects. And really the most classic example is
/etc/sudoers and the need for a special "visudo" program to really
ensure that editors don't do in-place overwrites.
But it'd be great if we can use this push to also land "content immutabilty"
or however we decide to call it.
On Mon, Jul 31, 2017 at 9:02 AM, Colin Walters <[email protected]> wrote:
> On Sat, Jul 29, 2017, at 03:43 PM, Dan Williams wrote:
>> An inode with this flag set indicates that the file's block map cannot
>> be changed, no size change, deletion, hole-punch, range collapse, or
>> reflink.
>>
>> The implementation of toggling the flag and sealing the state of the
>> extent map is saved for a later patch. The functionality provided by
>> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
>> that provided by S_SWAPFILE, and it is targeted to replace it.
>>
>> For now, only xfs and the core vfs are updated to consider the new flag.
>
> Quite a while ago I started a request for O_OBJECT:
> http://www.spinics.net/lists/linux-fsdevel/msg75085.html
> A few months ago I was thinking about that more and realized
> it'd likely be more palatable to land as an inode flag, like
> you're doing here.
>
> Now, S_IOMAP_IMMUTABLE would be quite close to the semantics
> I want for ostree, except we also want to disallow
> changes to the inode uid, gid or mode. (Extended attributes are
> a whole other story; but I'd like to at least disallow changes to the
> security. namespace).
>
> The goal here is mostly about resilience to *accidental* changes;
> think an admin doing `cp /path/to/binary /usr/bin/bash` which
> does open(O_TRUNC), which would hence corrupt all hardlinks.
>
> S_IOMAP_IMMUTABLE would give a lot of great protection against
> those types of accidental changes - most of them are either going
> to be open(O_TRUNC) or O_APPEND. Since you're touching various
> write paths here, perhaps we can also add
> S_CONTENTS_IMMUTABLE or something at the same time?
>
> If this lands as is - I'm quite likely to change ostree to use it;
> any objections to that? As mentioned in the thread, there are several
> other cases of "content immutable" files in userspace, such as
> QEMU "qcow2", git objects. And really the most classic example is
> /etc/sudoers and the need for a special "visudo" program to really
> ensure that editors don't do in-place overwrites.
>
> But it'd be great if we can use this push to also land "content immutabilty"
> or however we decide to call it.
How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE?
On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote:
>
> How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE?
We still want the ability to make hardlinks.
On Sat, Jul 29, 2017 at 12:43:35PM -0700, Dan Williams wrote:
> An inode with this flag set indicates that the file's block map cannot
> be changed, no size change, deletion, hole-punch, range collapse, or
> reflink.
>
> The implementation of toggling the flag and sealing the state of the
> extent map is saved for a later patch. The functionality provided by
> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
> that provided by S_SWAPFILE, and it is targeted to replace it.
>
> For now, only xfs and the core vfs are updated to consider the new flag.
>
> The additional check that is added for this flag, beyond what we are
> already doing for swapfiles, is to truncate or abort writes that try to
> extend the file size.
>
> Cc: Jan Kara <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Suggested-by: Dave Chinner <[email protected]>
> Suggested-by: "Darrick J. Wong" <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> fs/attr.c | 10 ++++++++++
> fs/namei.c | 3 ++-
> fs/open.c | 6 ++++++
> fs/read_write.c | 3 +++
> fs/xfs/xfs_ioctl.c | 6 ++++++
> include/linux/fs.h | 2 ++
> mm/filemap.c | 9 +++++++++
> 7 files changed, 38 insertions(+), 1 deletion(-)
>
> diff --git a/fs/attr.c b/fs/attr.c
> index 135304146120..8573e364bd06 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare);
> */
> int inode_newsize_ok(const struct inode *inode, loff_t offset)
> {
> + if (IS_IOMAP_IMMUTABLE(inode)) {
> + /*
> + * Any size change is disallowed. Size increases may
> + * dirty metadata that an application is not prepared to
> + * sync, and a size decrease may expose free blocks to
> + * in-flight DMA.
> + */
> + return -ETXTBSY;
> + }
> +
> if (inode->i_size < offset) {
> unsigned long limit;
>
> diff --git a/fs/namei.c b/fs/namei.c
> index ddb6a7c2b3d4..588f1135c42c 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2793,7 +2793,8 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
> return -EPERM;
>
> if (check_sticky(dir, inode) || IS_APPEND(inode) ||
> - IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode))
> + IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode)
> + || IS_IOMAP_IMMUTABLE(inode))
This caught my eye because why wouldn't we be able to unlink a file
whose block map is immutable? Link count has "nothing" to do with that.
:)
Hmm... I guess the reasoning here is that if we're IOMAP_IMMUTABLE'd,
we're not allowed to touch the link count since (we assume) nobody's
who's interested in *inode is going to call fsync on it to flush the
metadata to disk?
If that's the case, then shouldn't there be a corresponding may_create
check to prevent us from linking a file into a directory?
(I don't really see why link/unlink shouldn't be allowed...)
> return -EPERM;
> if (isdir) {
> if (!d_is_dir(victim))
> diff --git a/fs/open.c b/fs/open.c
> index 35bb784763a4..7395860d7164 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> return -ETXTBSY;
>
> /*
> + * We cannot allow any allocation changes on an iomap immutable file
If it's a (regular allocation) or (unshare blocks) fallocate call, we
could return zero immediately since no writes will be able to hit
ENOSPC because we know that the block mappings are there and no CoW is
required.
> + */
> + if (IS_IOMAP_IMMUTABLE(inode))
> + return -ETXTBSY;
I've been wondering in the back of my head if we could allow a
size-extending FALLOC_FL_ZERO_RANGE here that would fsync at the end?
The use case I had in mind was extending files without having to turn
off the IOMAP_IMMUTABLE flag, since (in theory, anyway) we'd bzero() the
memory and then increase i_size, so userspace should never be able to
access storage that isn't totally finalized.
OTOH that also seems like a huge weird loophole in IOMAP_IMMUTABLE, so
maybe everyone would prefer to shoot down this use case now?
> +
> + /*
> * Revalidate the write permissions, in case security policy has
> * changed since the files were opened.
> */
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 0cc7033aa413..dc673be7c7cb 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
> if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
> return -ETXTBSY;
>
> + if (IS_IOMAP_IMMUTABLE(inode_in) || IS_IOMAP_IMMUTABLE(inode_out))
> + return -ETXTBSY;
> +
> /* Don't reflink dirs, pipes, sockets... */
> if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
> return -EISDIR;
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index e75c40a47b7d..2e64488bc4de 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1755,6 +1755,12 @@ xfs_ioc_swapext(
> goto out_put_tmp_file;
> }
>
> + if (IS_IOMAP_IMMUTABLE(file_inode(f.file)) ||
> + IS_IOMAP_IMMUTABLE(file_inode(tmp.file))) {
> + error = -EINVAL;
> + goto out_put_tmp_file;
> + }
Someone's going to have to audit more of the XFS ioctls to make sure
that none of them can touch the block mapping, but as this is an RFC it
doesn't need to be done right this instant.
> +
> /*
> * We need to ensure that the fds passed in point to XFS inodes
> * before we cast and access them as XFS structures as we have no
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6e1fd5d21248..0a254b768855 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1829,6 +1829,7 @@ struct super_operations {
> #else
> #define S_DAX 0 /* Make all the DAX code disappear */
> #endif
> +#define S_IOMAP_IMMUTABLE 16384 /* logical-to-physical extent map is fixed */
>
> /*
> * Note that nosuid etc flags are inode-specific: setting some file-system
> @@ -1867,6 +1868,7 @@ struct super_operations {
> #define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
> #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
> #define IS_DAX(inode) ((inode)->i_flags & S_DAX)
> +#define IS_IOMAP_IMMUTABLE(inode) ((inode)->i_flags & S_IOMAP_IMMUTABLE)
>
> #define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \
> (inode)->i_rdev == WHITEOUT_DEV)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a49702445ce0..e4a6529da2bd 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2806,6 +2806,15 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> if (unlikely(pos >= inode->i_sb->s_maxbytes))
> return -EFBIG;
>
> + /* Are we about to mutate the block map on an immutable file? */
> + if (IS_IOMAP_IMMUTABLE(inode)
> + && (pos + iov_iter_count(from) > i_size_read(inode))) {
> + if (pos < i_size_read(inode))
> + iov_iter_truncate(from, i_size_read(inode) - pos);
Writes past the end get truncated to stop at EOF? I'd have thought that
would be an error since userspace is asking for metadata updates it
previously said it would never need...
--D
> + else
> + return -ETXTBSY;
> + }
> +
> iov_iter_truncate(from, inode->i_sb->s_maxbytes - pos);
> return iov_iter_count(from);
> }
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Jul 29, 2017 at 12:43:40PM -0700, Dan Williams wrote:
> >From falloc.h:
>
> FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
> file logical-to-physical extent offset mappings in the file. The
> purpose is to allow an application to assume that there are no holes
> or shared extents in the file and that the metadata needed to find
> all the physical extents of the file is stable and can never be
> dirtied.
>
> For now this patch only permits setting / clearing the in-memory state
> of S_IOMAP_IMMMUTABLE, persisting the state is saved for a later patch.
>
> The implementation is careful to not allow the immutable state to change
> while any process might have any established mappings. It reuses the
> existing xfs_reflink_unshare() and xfs_alloc_file_space() to unshare
> extents and fill all holes in the file, or otherwise extend the file
> size in the same operation that sets S_IOMAP_IMMUTABLE.
>
> Cc: Jan Kara <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Suggested-by: Dave Chinner <[email protected]>
> Suggested-by: "Darrick J. Wong" <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> fs/open.c | 26 ++++++++++++-
> fs/xfs/xfs_bmap_util.c | 86 +++++++++++++++++++++++++++++++++++++++++++
> fs/xfs/xfs_bmap_util.h | 2 +
> fs/xfs/xfs_file.c | 14 +++++--
> include/linux/falloc.h | 3 +-
> include/uapi/linux/falloc.h | 19 ++++++++++
> 6 files changed, 142 insertions(+), 8 deletions(-)
>
> diff --git a/fs/open.c b/fs/open.c
> index 7395860d7164..df075484fad5 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -241,7 +241,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> struct inode *inode = file_inode(file);
> long ret;
>
> - if (offset < 0 || len <= 0)
> + if (offset < 0 || len < 0)
> + return -EINVAL;
> +
> + /* Allow zero len only for the unseal operation */
> + if (!(mode & FALLOC_FL_SEAL_BLOCK_MAP) && len == 0)
> return -EINVAL;
>
> /* Return error if mode is not supported */
> @@ -273,6 +277,17 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
> return -EINVAL;
>
> + /*
> + * Seal block map should only be used exclusively, and with
> + * the IMMUTABLE capability.
> + */
> + if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
> + if (mode & ~FALLOC_FL_SEAL_BLOCK_MAP)
> + return -EINVAL;
> + if (!capable(CAP_LINUX_IMMUTABLE))
> + return -EPERM;
> + }
> +
> if (!(file->f_mode & FMODE_WRITE))
> return -EBADF;
>
> @@ -292,9 +307,14 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> return -ETXTBSY;
>
> /*
> - * We cannot allow any allocation changes on an iomap immutable file
> + * We cannot allow any allocation changes on an iomap immutable
> + * file, however if the operation is FALLOC_FL_SEAL_BLOCK_MAP,
> + * call down to ->fallocate() to determine if the operations is
> + * allowed. ->fallocate() may either clear the flag when @len is
> + * zero, or validate that the requested operation is already the
> + * current state of the file.
> */
> - if (IS_IOMAP_IMMUTABLE(inode))
> + if (IS_IOMAP_IMMUTABLE(inode) && (!(mode & FALLOC_FL_SEAL_BLOCK_MAP)))
> return -ETXTBSY;
>
> /*
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 93e955262d07..c4fc79a0704f 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1387,6 +1387,92 @@ xfs_zero_file_space(
>
> }
>
> +int
> +xfs_seal_file_space(
> + struct xfs_inode *ip,
> + xfs_off_t offset,
> + xfs_off_t len)
> +{
> + struct inode *inode = VFS_I(ip);
> + struct address_space *mapping = inode->i_mapping;
> + int error = 0;
> +
> + if (offset)
> + return -EINVAL;
> +
> + i_mmap_lock_read(mapping);
(Are we allowed to take address_space->i_mmap_rwsem while holding
xfs_inode->i_mmaplock?)
> + xfs_ilock(ip, XFS_ILOCK_EXCL);
> + if (len == 0) {
> + /*
> + * Clear the immutable flag provided there are no active
> + * mappings. The active mapping check prevents an
> + * application that is assuming a static block map, for
> + * DAX or peer-to-peer DMA, from having this state
> + * silently change behind its back.
> + */
> + if (RB_EMPTY_ROOT(&mapping->i_mmap))
> + inode->i_flags &= ~S_IOMAP_IMMUTABLE;
> + else
> + error = -EBUSY;
> + } else if (IS_IOMAP_IMMUTABLE(inode)) {
> + if (len == i_size_read(inode)) {
> + /*
> + * The file is already in the correct state,
> + * bail out without error below.
> + */
> + len = 0;
> + } else {
> + /* too late to allocate more space */
> + error = -ETXTBSY;
> + }
> + } else {
> + if (len < i_size_read(inode)) {
> + /*
> + * Since S_IOMAP_IMMUTABLE is inode global it
> + * does not make sense to fallocate(immutable)
> + * on a sub-range of the file.
> + */
> + error = -EINVAL;
> + } else if (!RB_EMPTY_ROOT(&mapping->i_mmap)) {
> + /*
> + * It's not strictly required to prevent setting
> + * immutable while a file is already mapped, but
> + * we do it for simplicity and symmetry with the
> + * S_IOMAP_IMMUTABLE disable case.
> + */
> + error = -EBUSY;
> + } else
> + inode->i_flags |= S_IOMAP_IMMUTABLE;
> + }
> + xfs_iunlock(ip, XFS_ILOCK_EXCL);
> + i_mmap_unlock_read(mapping);
> +
> + if (error || len == 0)
> + return error;
> +
> + /*
> + * From here, the immutable flag is already set, so new
> + * operations that would change the block map are prevented by
> + * upper layer code paths. Wwe can proceed to unshare and
> + * allocate zeroed / written extents.
> + */
> + error = xfs_reflink_unshare(ip, offset, len);
At this point we still hold the io and mmap locks and the vfs thinks the
inode is iomap_immutable, but we haven't actually fixed the block
mappings, which means that the flag is set but there could be holes and
shared extents aplenty?
That seems strange to me -- wouldn't we want to try to unshare and
allocate, and only then take the ilock, check the mappings, and only set
the flag if nobody's messed with the extent map since the unshare &
allocated? IOWs,
if (len == 0)
return xfs_unseal_file_space();
xfs_reflink_unshare(...);
xfs_alloc_file_space(...);
xfs_ilock(...);
if (xfs_iomap_lacks_holes_and_shared_blocks(...)) {
VFS_I(ip)->i_flags |= S_IOMAP_IMMUTABLE;
ip->i_d.di_flags2 |= XFS_DIFLAG2_IOMAP_IMMUTABLE;
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
} else {
error = -EBUSY;
}
xfs_iunlock(...);
(I guess we hold sufficient locks, but still...)
--D
> + if (error)
> + goto err;
> +
> + error = xfs_alloc_file_space(ip, offset, len,
> + XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO);
> + if (error)
> + goto err;
> +
> + return 0;
> +err:
> + xfs_ilock(ip, XFS_ILOCK_EXCL);
> + inode->i_flags &= ~S_IOMAP_IMMUTABLE;
> + xfs_iunlock(ip, XFS_ILOCK_EXCL);
> + return error;
> +}
> +
> /*
> * @next_fsb will keep track of the extent currently undergoing shift.
> * @stop_fsb will keep track of the extent at which we have to stop.
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index 0cede1043571..5115a32a2483 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -60,6 +60,8 @@ int xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
> xfs_off_t len);
> int xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
> xfs_off_t len);
> +int xfs_seal_file_space(struct xfs_inode *, xfs_off_t offset,
> + xfs_off_t len);
>
> /* EOF block manipulation functions */
> bool xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c4893e226fd8..e21121530a90 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -739,7 +739,8 @@ xfs_file_write_iter(
> #define XFS_FALLOC_FL_SUPPORTED \
> (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \
> FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE | \
> - FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE)
> + FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE | \
> + FALLOC_FL_SEAL_BLOCK_MAP)
>
> STATIC long
> xfs_file_fallocate(
> @@ -834,9 +835,14 @@ xfs_file_fallocate(
> error = xfs_reflink_unshare(ip, offset, len);
> if (error)
> goto out_unlock;
> - }
> - error = xfs_alloc_file_space(ip, offset, len,
> - XFS_BMAPI_PREALLOC);
> +
> + error = xfs_alloc_file_space(ip, offset, len,
> + XFS_BMAPI_PREALLOC);
> + } else if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
> + error = xfs_seal_file_space(ip, offset, len);
> + } else
> + error = xfs_alloc_file_space(ip, offset, len,
> + XFS_BMAPI_PREALLOC);
> }
> if (error)
> goto out_unlock;
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index 7494dc67c66f..48546c6fbec7 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -26,6 +26,7 @@ struct space_resv {
> FALLOC_FL_COLLAPSE_RANGE | \
> FALLOC_FL_ZERO_RANGE | \
> FALLOC_FL_INSERT_RANGE | \
> - FALLOC_FL_UNSHARE_RANGE)
> + FALLOC_FL_UNSHARE_RANGE | \
> + FALLOC_FL_SEAL_BLOCK_MAP)
>
> #endif /* _FALLOC_H_ */
> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> index b075f601919b..629c9b20e49b 100644
> --- a/include/uapi/linux/falloc.h
> +++ b/include/uapi/linux/falloc.h
> @@ -76,4 +76,23 @@
> */
> #define FALLOC_FL_UNSHARE_RANGE 0x40
>
> +/*
> + * FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
> + * file logical-to-physical extent offset mappings in the file. The
> + * purpose is to allow an application to assume that there are no holes
> + * or shared extents in the file and that the metadata needed to find
> + * all the physical extents of the file is stable and can never be
> + * dirtied.
> + *
> + * The immutable property is in effect for the entire inode, so the
> + * range for this operation must start at offset 0 and len must be
> + * greater than or equal to the current size of the file. If greater,
> + * this operation allocates, unshares, hole fills, and seals in one
> + * atomic step. If len is zero then the immutable state is cleared for
> + * the inode.
> + *
> + * This flag implies FALLOC_FL_UNSHARE_RANGE and as such cannot be used
> + * with the punch, zero, collapse, or insert range modes.
> + */
> +#define FALLOC_FL_SEAL_BLOCK_MAP 0x80
> #endif /* _UAPI_FALLOC_H_ */
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Jul 29, 2017 at 12:43:46PM -0700, Dan Williams wrote:
> Record the immutable state in the on-disk inode so that on the next boot
> the protections against reflink and hole punch etc are automatically
> restored.
>
> This deliberately does not add a FS_XFLAG_IOMAP_IMMUTABLE since
> fallocate(2) is the path to toggle this flag.
>
> Cc: Jan Kara <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Suggested-by: Dave Chinner <[email protected]>
> Suggested-by: "Darrick J. Wong" <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> fs/xfs/libxfs/xfs_format.h | 5 ++++-
> fs/xfs/xfs_bmap_util.c | 14 +++++++++++---
> fs/xfs/xfs_bmap_util.h | 2 +-
> fs/xfs/xfs_file.c | 4 ++--
> fs/xfs/xfs_ioctl.c | 4 ++--
> fs/xfs/xfs_iops.c | 8 +++++---
> 6 files changed, 25 insertions(+), 12 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index d4d9bef20c3a..9e720e55776b 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -1063,12 +1063,15 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> #define XFS_DIFLAG2_DAX_BIT 0 /* use DAX for this inode */
> #define XFS_DIFLAG2_REFLINK_BIT 1 /* file's blocks may be shared */
> #define XFS_DIFLAG2_COWEXTSIZE_BIT 2 /* copy on write extent size hint */
> +#define XFS_DIFLAG2_IOMAP_IMMUTABLE_BIT 3 /* set S_IOMAP_IMMUTABLE for this inode */
> #define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
> #define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
> #define XFS_DIFLAG2_COWEXTSIZE (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
> +#define XFS_DIFLAG2_IOMAP_IMMUTABLE (1 << XFS_DIFLAG2_IOMAP_IMMUTABLE_BIT)
>
> #define XFS_DIFLAG2_ANY \
> - (XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE)
> + (XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> + XFS_DIFLAG2_IOMAP_IMMUTABLE)
>
> /*
> * Inode number format:
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index c4fc79a0704f..1dcb521da456 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1021,7 +1021,8 @@ xfs_alloc_file_space(
> struct xfs_inode *ip,
> xfs_off_t offset,
> xfs_off_t len,
> - int alloc_type)
> + int alloc_type,
> + uint64_t di_flags2)
> {
> xfs_mount_t *mp = ip->i_mount;
> xfs_off_t count;
> @@ -1119,6 +1120,12 @@ xfs_alloc_file_space(
> break;
> }
> xfs_ilock(ip, XFS_ILOCK_EXCL);
> + if (di_flags2) {
> + /* fold inode attributes for this allocation */
> + ip->i_d.di_flags2 |= di_flags2;
> + di_flags2 = 0;
Can we set the inode flag in xfs_seal_file_space instead of plumbing
di_flags2 all over the place into the bmap routines?
Also, what does
xfs_alloc_file_space(..., XFS_BMAPI_ATTRFORK, XFS_DIFLAG2_REFLINK);
mean? :)
--D
> + }
> +
> error = xfs_trans_reserve_quota_nblks(tp, ip, qblocks,
> 0, quota_flag);
> if (error)
> @@ -1381,7 +1388,7 @@ xfs_zero_file_space(
> error = xfs_alloc_file_space(ip, round_down(offset, blksize),
> round_up(offset + len, blksize) -
> round_down(offset, blksize),
> - XFS_BMAPI_PREALLOC);
> + XFS_BMAPI_PREALLOC, 0);
> out:
> return error;
>
> @@ -1461,7 +1468,8 @@ xfs_seal_file_space(
> goto err;
>
> error = xfs_alloc_file_space(ip, offset, len,
> - XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO);
> + XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO,
> + XFS_DIFLAG2_IOMAP_IMMUTABLE);
> if (error)
> goto err;
>
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index 5115a32a2483..1698e0c58b78 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -51,7 +51,7 @@ int xfs_bmap_last_extent(struct xfs_trans *tp, struct xfs_inode *ip,
>
> /* preallocation and hole punch interface */
> int xfs_alloc_file_space(struct xfs_inode *ip, xfs_off_t offset,
> - xfs_off_t len, int alloc_type);
> + xfs_off_t len, int alloc_type, uint64_t di_flags2);
> int xfs_free_file_space(struct xfs_inode *ip, xfs_off_t offset,
> xfs_off_t len);
> int xfs_zero_file_space(struct xfs_inode *ip, xfs_off_t offset,
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e21121530a90..2d9ffa1773bd 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -837,12 +837,12 @@ xfs_file_fallocate(
> goto out_unlock;
>
> error = xfs_alloc_file_space(ip, offset, len,
> - XFS_BMAPI_PREALLOC);
> + XFS_BMAPI_PREALLOC, 0);
> } else if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
> error = xfs_seal_file_space(ip, offset, len);
> } else
> error = xfs_alloc_file_space(ip, offset, len,
> - XFS_BMAPI_PREALLOC);
> + XFS_BMAPI_PREALLOC, 0);
> }
> if (error)
> goto out_unlock;
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 2e64488bc4de..c829f85bed99 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -703,7 +703,7 @@ xfs_ioc_space(
> case XFS_IOC_RESVSP64:
> flags |= XFS_PREALLOC_SET;
> error = xfs_alloc_file_space(ip, bf->l_start, bf->l_len,
> - XFS_BMAPI_PREALLOC);
> + XFS_BMAPI_PREALLOC, 0);
> break;
> case XFS_IOC_UNRESVSP:
> case XFS_IOC_UNRESVSP64:
> @@ -716,7 +716,7 @@ xfs_ioc_space(
> flags |= XFS_PREALLOC_CLEAR;
> if (bf->l_start > XFS_ISIZE(ip)) {
> error = xfs_alloc_file_space(ip, XFS_ISIZE(ip),
> - bf->l_start - XFS_ISIZE(ip), 0);
> + bf->l_start - XFS_ISIZE(ip), 0, 0);
> if (error)
> goto out_unlock;
> }
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 469c9fa4c178..174ef95453f5 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1186,9 +1186,10 @@ xfs_diflags_to_iflags(
> struct xfs_inode *ip)
> {
> uint16_t flags = ip->i_d.di_flags;
> + uint64_t flags2 = ip->i_d.di_flags2;
>
> inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
> - S_NOATIME | S_DAX);
> + S_NOATIME | S_DAX | S_IOMAP_IMMUTABLE);
>
> if (flags & XFS_DIFLAG_IMMUTABLE)
> inode->i_flags |= S_IMMUTABLE;
> @@ -1201,9 +1202,10 @@ xfs_diflags_to_iflags(
> if (S_ISREG(inode->i_mode) &&
> ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE &&
> !xfs_is_reflink_inode(ip) &&
> - (ip->i_mount->m_flags & XFS_MOUNT_DAX ||
> - ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
> + (ip->i_mount->m_flags & XFS_MOUNT_DAX || flags2 & XFS_DIFLAG2_DAX))
> inode->i_flags |= S_DAX;
> + if (flags2 & XFS_DIFLAG2_IOMAP_IMMUTABLE)
> + inode->i_flags |= S_IOMAP_IMMUTABLE;
> }
>
> /*
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 31, 2017 at 9:46 AM, Darrick J. Wong
<[email protected]> wrote:
> On Sat, Jul 29, 2017 at 12:43:35PM -0700, Dan Williams wrote:
>> An inode with this flag set indicates that the file's block map cannot
>> be changed, no size change, deletion, hole-punch, range collapse, or
>> reflink.
>>
>> The implementation of toggling the flag and sealing the state of the
>> extent map is saved for a later patch. The functionality provided by
>> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
>> that provided by S_SWAPFILE, and it is targeted to replace it.
>>
>> For now, only xfs and the core vfs are updated to consider the new flag.
>>
>> The additional check that is added for this flag, beyond what we are
>> already doing for swapfiles, is to truncate or abort writes that try to
>> extend the file size.
>>
>> Cc: Jan Kara <[email protected]>
>> Cc: Jeff Moyer <[email protected]>
>> Cc: Christoph Hellwig <[email protected]>
>> Cc: Ross Zwisler <[email protected]>
>> Cc: Alexander Viro <[email protected]>
>> Suggested-by: Dave Chinner <[email protected]>
>> Suggested-by: "Darrick J. Wong" <[email protected]>
>> Signed-off-by: Dan Williams <[email protected]>
>> ---
>> fs/attr.c | 10 ++++++++++
>> fs/namei.c | 3 ++-
>> fs/open.c | 6 ++++++
>> fs/read_write.c | 3 +++
>> fs/xfs/xfs_ioctl.c | 6 ++++++
>> include/linux/fs.h | 2 ++
>> mm/filemap.c | 9 +++++++++
>> 7 files changed, 38 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/attr.c b/fs/attr.c
>> index 135304146120..8573e364bd06 100644
>> --- a/fs/attr.c
>> +++ b/fs/attr.c
>> @@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare);
>> */
>> int inode_newsize_ok(const struct inode *inode, loff_t offset)
>> {
>> + if (IS_IOMAP_IMMUTABLE(inode)) {
>> + /*
>> + * Any size change is disallowed. Size increases may
>> + * dirty metadata that an application is not prepared to
>> + * sync, and a size decrease may expose free blocks to
>> + * in-flight DMA.
>> + */
>> + return -ETXTBSY;
>> + }
>> +
>> if (inode->i_size < offset) {
>> unsigned long limit;
>>
>> diff --git a/fs/namei.c b/fs/namei.c
>> index ddb6a7c2b3d4..588f1135c42c 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -2793,7 +2793,8 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
>> return -EPERM;
>>
>> if (check_sticky(dir, inode) || IS_APPEND(inode) ||
>> - IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode))
>> + IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode)
>> + || IS_IOMAP_IMMUTABLE(inode))
>
> This caught my eye because why wouldn't we be able to unlink a file
> whose block map is immutable? Link count has "nothing" to do with that.
>
> :)
>
> Hmm... I guess the reasoning here is that if we're IOMAP_IMMUTABLE'd,
> we're not allowed to touch the link count since (we assume) nobody's
> who's interested in *inode is going to call fsync on it to flush the
> metadata to disk?
>
> If that's the case, then shouldn't there be a corresponding may_create
> check to prevent us from linking a file into a directory?
>
> (I don't really see why link/unlink shouldn't be allowed...)
True, unlink should be allowed, the blocks will still be immutable as
long as the file is open so I think we're good.
>> return -EPERM;
>> if (isdir) {
>> if (!d_is_dir(victim))
>> diff --git a/fs/open.c b/fs/open.c
>> index 35bb784763a4..7395860d7164 100644
>> --- a/fs/open.c
>> +++ b/fs/open.c
>> @@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>> return -ETXTBSY;
>>
>> /*
>> + * We cannot allow any allocation changes on an iomap immutable file
>
> If it's a (regular allocation) or (unshare blocks) fallocate call, we
> could return zero immediately since no writes will be able to hit
> ENOSPC because we know that the block mappings are there and no CoW is
> required.
Sounds good assuming the regular allocation is staying within the
range of the current file size, and fail otherwise.
>
>> + */
>> + if (IS_IOMAP_IMMUTABLE(inode))
>> + return -ETXTBSY;
>
> I've been wondering in the back of my head if we could allow a
> size-extending FALLOC_FL_ZERO_RANGE here that would fsync at the end?
> The use case I had in mind was extending files without having to turn
> off the IOMAP_IMMUTABLE flag, since (in theory, anyway) we'd bzero() the
> memory and then increase i_size, so userspace should never be able to
> access storage that isn't totally finalized.
>
> OTOH that also seems like a huge weird loophole in IOMAP_IMMUTABLE, so
> maybe everyone would prefer to shoot down this use case now?
It seems useful, because the alternative feels more error prone if
applications are forced to toggle IOMAP_IMMUTABLE.
>
>> +
>> + /*
>> * Revalidate the write permissions, in case security policy has
>> * changed since the files were opened.
>> */
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 0cc7033aa413..dc673be7c7cb 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
>> if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
>> return -ETXTBSY;
>>
>> + if (IS_IOMAP_IMMUTABLE(inode_in) || IS_IOMAP_IMMUTABLE(inode_out))
>> + return -ETXTBSY;
>> +
>> /* Don't reflink dirs, pipes, sockets... */
>> if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
>> return -EISDIR;
>> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
>> index e75c40a47b7d..2e64488bc4de 100644
>> --- a/fs/xfs/xfs_ioctl.c
>> +++ b/fs/xfs/xfs_ioctl.c
>> @@ -1755,6 +1755,12 @@ xfs_ioc_swapext(
>> goto out_put_tmp_file;
>> }
>>
>> + if (IS_IOMAP_IMMUTABLE(file_inode(f.file)) ||
>> + IS_IOMAP_IMMUTABLE(file_inode(tmp.file))) {
>> + error = -EINVAL;
>> + goto out_put_tmp_file;
>> + }
>
> Someone's going to have to audit more of the XFS ioctls to make sure
> that none of them can touch the block mapping, but as this is an RFC it
> doesn't need to be done right this instant.
Speaking of things that need to be done before this mechanism can move
out of RFC, you had mentioned wanting to get your current devel
backlog cleared before moving ahead with this. I can offer some review
cycles, you'll just need to tolerate a higher
question-to-review-comment ratio given my relative XFS experience.
>
>> +
>> /*
>> * We need to ensure that the fds passed in point to XFS inodes
>> * before we cast and access them as XFS structures as we have no
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 6e1fd5d21248..0a254b768855 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1829,6 +1829,7 @@ struct super_operations {
>> #else
>> #define S_DAX 0 /* Make all the DAX code disappear */
>> #endif
>> +#define S_IOMAP_IMMUTABLE 16384 /* logical-to-physical extent map is fixed */
>>
>> /*
>> * Note that nosuid etc flags are inode-specific: setting some file-system
>> @@ -1867,6 +1868,7 @@ struct super_operations {
>> #define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
>> #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
>> #define IS_DAX(inode) ((inode)->i_flags & S_DAX)
>> +#define IS_IOMAP_IMMUTABLE(inode) ((inode)->i_flags & S_IOMAP_IMMUTABLE)
>>
>> #define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \
>> (inode)->i_rdev == WHITEOUT_DEV)
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index a49702445ce0..e4a6529da2bd 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2806,6 +2806,15 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>> if (unlikely(pos >= inode->i_sb->s_maxbytes))
>> return -EFBIG;
>>
>> + /* Are we about to mutate the block map on an immutable file? */
>> + if (IS_IOMAP_IMMUTABLE(inode)
>> + && (pos + iov_iter_count(from) > i_size_read(inode))) {
>> + if (pos < i_size_read(inode))
>> + iov_iter_truncate(from, i_size_read(inode) - pos);
>
> Writes past the end get truncated to stop at EOF? I'd have thought that
> would be an error since userspace is asking for metadata updates it
> previously said it would never need...
It's an error if the current position is >= EOF, but otherwise allow a
truncated write. I'm fine either way.
...but then again if we allow FALLOC_FL_ZERO_RANGE to resize the file,
why not synchronous writes too? It turns IOMAP_IMMUTABLE into a flag
that only allows blocks to be added to a file, but not removed. That
would seem to dovetail well with synchronous mmap faults when / if XFS
grows that feature.
On Mon, Jul 31, 2017, at 12:32 PM, Colin Walters wrote:
> On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote:
> >
> > How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE?
>
> We still want the ability to make hardlinks.
Also of course, symmetrically, to unlink. If we used S_IMMUTABLE for /etc/sudoers,
it'd still be racy since one would have to transiently remove the flag in order
to replace it with a new version.
Related to this topic is the fact that S_IMMUTABLE is itself mutable; I
think once S_IMMUTABLE_CONTENTS is set, it would not be able to made
mutable again.
Also I just remembered that since then memfd_create() and more notably
fcntl(F_ADD_SEALS) landed - in fact it already has flags for what we want
here AFAICS. Your S_IOMAP_IMMUTABLE is fcntl(F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW)
and mine just adds in F_SEAL_WRITE. I think there was some discussion
of the seals for persistent files when memfd_create() landed, but I can't
find it offhand.
On Mon, Jul 31, 2017 at 01:42:13PM -0400, Colin Walters wrote:
>
>
> On Mon, Jul 31, 2017, at 12:32 PM, Colin Walters wrote:
> > On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote:
> > >
> > > How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE?
> >
> > We still want the ability to make hardlinks.
>
> Also of course, symmetrically, to unlink. If we used S_IMMUTABLE for /etc/sudoers,
> it'd still be racy since one would have to transiently remove the flag in order
> to replace it with a new version.
>
> Related to this topic is the fact that S_IMMUTABLE is itself mutable; I
> think once S_IMMUTABLE_CONTENTS is set, it would not be able to made
> mutable again.
>
> Also I just remembered that since then memfd_create() and more notably
> fcntl(F_ADD_SEALS) landed - in fact it already has flags for what we want
> here AFAICS. Your S_IOMAP_IMMUTABLE is fcntl(F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW)
I don't think F_SEAL_{SHRINK,GROW} prevents reflinking or CoW of file data,
which are two things that cannot happen under S_IOMAP_IMMUTABLE that
aren't size changes. From the implementation it looks like shrink and
grow are only supposed to disallow changes to i_size, not i_blocks (or
the file block map).
Then again, I suppose F_SEAL_* only work on shmem, so maybe it simply
isn't defined for any other filesystem...? e.g. it doesn't prohibit
reflink, but the only fs implementing seals doesn't support reflink.
<shrug>
Seals cannot be removed, which is too strict for the S_IOMAP_IMMUTABLE
user cases being presented.
> and mine just adds in F_SEAL_WRITE. I think there was some discussion
> of the seals for persistent files when memfd_create() landed, but I can't
> find it offhand.
--D
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 31, 2017 at 10:09 AM, Darrick J. Wong
<[email protected]> wrote:
> On Sat, Jul 29, 2017 at 12:43:40PM -0700, Dan Williams wrote:
>> >From falloc.h:
>>
>> FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
>> file logical-to-physical extent offset mappings in the file. The
>> purpose is to allow an application to assume that there are no holes
>> or shared extents in the file and that the metadata needed to find
>> all the physical extents of the file is stable and can never be
>> dirtied.
>>
>> For now this patch only permits setting / clearing the in-memory state
>> of S_IOMAP_IMMMUTABLE, persisting the state is saved for a later patch.
>>
>> The implementation is careful to not allow the immutable state to change
>> while any process might have any established mappings. It reuses the
>> existing xfs_reflink_unshare() and xfs_alloc_file_space() to unshare
>> extents and fill all holes in the file, or otherwise extend the file
>> size in the same operation that sets S_IOMAP_IMMUTABLE.
>>
>> Cc: Jan Kara <[email protected]>
>> Cc: Jeff Moyer <[email protected]>
>> Cc: Christoph Hellwig <[email protected]>
>> Cc: Ross Zwisler <[email protected]>
>> Cc: Alexander Viro <[email protected]>
>> Suggested-by: Dave Chinner <[email protected]>
>> Suggested-by: "Darrick J. Wong" <[email protected]>
>> Signed-off-by: Dan Williams <[email protected]>
>> ---
>> fs/open.c | 26 ++++++++++++-
>> fs/xfs/xfs_bmap_util.c | 86 +++++++++++++++++++++++++++++++++++++++++++
>> fs/xfs/xfs_bmap_util.h | 2 +
>> fs/xfs/xfs_file.c | 14 +++++--
>> include/linux/falloc.h | 3 +-
>> include/uapi/linux/falloc.h | 19 ++++++++++
>> 6 files changed, 142 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/open.c b/fs/open.c
>> index 7395860d7164..df075484fad5 100644
>> --- a/fs/open.c
>> +++ b/fs/open.c
>> @@ -241,7 +241,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>> struct inode *inode = file_inode(file);
>> long ret;
>>
>> - if (offset < 0 || len <= 0)
>> + if (offset < 0 || len < 0)
>> + return -EINVAL;
>> +
>> + /* Allow zero len only for the unseal operation */
>> + if (!(mode & FALLOC_FL_SEAL_BLOCK_MAP) && len == 0)
>> return -EINVAL;
>>
>> /* Return error if mode is not supported */
>> @@ -273,6 +277,17 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>> (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
>> return -EINVAL;
>>
>> + /*
>> + * Seal block map should only be used exclusively, and with
>> + * the IMMUTABLE capability.
>> + */
>> + if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
>> + if (mode & ~FALLOC_FL_SEAL_BLOCK_MAP)
>> + return -EINVAL;
>> + if (!capable(CAP_LINUX_IMMUTABLE))
>> + return -EPERM;
>> + }
>> +
>> if (!(file->f_mode & FMODE_WRITE))
>> return -EBADF;
>>
>> @@ -292,9 +307,14 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>> return -ETXTBSY;
>>
>> /*
>> - * We cannot allow any allocation changes on an iomap immutable file
>> + * We cannot allow any allocation changes on an iomap immutable
>> + * file, however if the operation is FALLOC_FL_SEAL_BLOCK_MAP,
>> + * call down to ->fallocate() to determine if the operations is
>> + * allowed. ->fallocate() may either clear the flag when @len is
>> + * zero, or validate that the requested operation is already the
>> + * current state of the file.
>> */
>> - if (IS_IOMAP_IMMUTABLE(inode))
>> + if (IS_IOMAP_IMMUTABLE(inode) && (!(mode & FALLOC_FL_SEAL_BLOCK_MAP)))
>> return -ETXTBSY;
>>
>> /*
>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index 93e955262d07..c4fc79a0704f 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -1387,6 +1387,92 @@ xfs_zero_file_space(
>>
>> }
>>
>> +int
>> +xfs_seal_file_space(
>> + struct xfs_inode *ip,
>> + xfs_off_t offset,
>> + xfs_off_t len)
>> +{
>> + struct inode *inode = VFS_I(ip);
>> + struct address_space *mapping = inode->i_mapping;
>> + int error = 0;
>> +
>> + if (offset)
>> + return -EINVAL;
>> +
>> + i_mmap_lock_read(mapping);
>
> (Are we allowed to take address_space->i_mmap_rwsem while holding
> xfs_inode->i_mmaplock?)
Empirically, yes. Lockdep complains when those locks are taken in the
reverse order.
That seems to be inconsistent with the "mmap_sem -> i_mmap_lock ->
page_lock" note in the xfs_ilock comment. Am I confusing what
i_mmap_lock means in that comment, is that the i_mmap_lock_read(), or
the i_mmaplock in the xfs_inode?
>
>> + xfs_ilock(ip, XFS_ILOCK_EXCL);
>> + if (len == 0) {
>> + /*
>> + * Clear the immutable flag provided there are no active
>> + * mappings. The active mapping check prevents an
>> + * application that is assuming a static block map, for
>> + * DAX or peer-to-peer DMA, from having this state
>> + * silently change behind its back.
>> + */
>> + if (RB_EMPTY_ROOT(&mapping->i_mmap))
>> + inode->i_flags &= ~S_IOMAP_IMMUTABLE;
>> + else
>> + error = -EBUSY;
>> + } else if (IS_IOMAP_IMMUTABLE(inode)) {
>> + if (len == i_size_read(inode)) {
>> + /*
>> + * The file is already in the correct state,
>> + * bail out without error below.
>> + */
>> + len = 0;
>> + } else {
>> + /* too late to allocate more space */
>> + error = -ETXTBSY;
>> + }
>> + } else {
>> + if (len < i_size_read(inode)) {
>> + /*
>> + * Since S_IOMAP_IMMUTABLE is inode global it
>> + * does not make sense to fallocate(immutable)
>> + * on a sub-range of the file.
>> + */
>> + error = -EINVAL;
>> + } else if (!RB_EMPTY_ROOT(&mapping->i_mmap)) {
>> + /*
>> + * It's not strictly required to prevent setting
>> + * immutable while a file is already mapped, but
>> + * we do it for simplicity and symmetry with the
>> + * S_IOMAP_IMMUTABLE disable case.
>> + */
>> + error = -EBUSY;
>> + } else
>> + inode->i_flags |= S_IOMAP_IMMUTABLE;
>> + }
>> + xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> + i_mmap_unlock_read(mapping);
>> +
>> + if (error || len == 0)
>> + return error;
>> +
>> + /*
>> + * From here, the immutable flag is already set, so new
>> + * operations that would change the block map are prevented by
>> + * upper layer code paths. Wwe can proceed to unshare and
>> + * allocate zeroed / written extents.
>> + */
>> + error = xfs_reflink_unshare(ip, offset, len);
>
> At this point we still hold the io and mmap locks and the vfs thinks the
> inode is iomap_immutable, but we haven't actually fixed the block
> mappings, which means that the flag is set but there could be holes and
> shared extents aplenty?
>
> That seems strange to me -- wouldn't we want to try to unshare and
> allocate, and only then take the ilock, check the mappings, and only set
> the flag if nobody's messed with the extent map since the unshare &
> allocated? IOWs,
>
> if (len == 0)
> return xfs_unseal_file_space();
>
> xfs_reflink_unshare(...);
> xfs_alloc_file_space(...);
>
> xfs_ilock(...);
> if (xfs_iomap_lacks_holes_and_shared_blocks(...)) {
> VFS_I(ip)->i_flags |= S_IOMAP_IMMUTABLE;
> ip->i_d.di_flags2 |= XFS_DIFLAG2_IOMAP_IMMUTABLE;
> xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> } else {
> error = -EBUSY;
> }
> xfs_iunlock(...);
>
> (I guess we hold sufficient locks, but still...)
Yes, that looks safer, especially if other vfs paths make assumptions
about the block map upon seeing that flag set.
On Mon, Jul 31, 2017 at 11:25:34AM -0700, Dan Williams wrote:
> On Mon, Jul 31, 2017 at 10:09 AM, Darrick J. Wong
> <[email protected]> wrote:
> >> index 93e955262d07..c4fc79a0704f 100644
> >> --- a/fs/xfs/xfs_bmap_util.c
> >> +++ b/fs/xfs/xfs_bmap_util.c
> >> @@ -1387,6 +1387,92 @@ xfs_zero_file_space(
> >>
> >> }
> >>
> >> +int
> >> +xfs_seal_file_space(
> >> + struct xfs_inode *ip,
> >> + xfs_off_t offset,
> >> + xfs_off_t len)
> >> +{
> >> + struct inode *inode = VFS_I(ip);
> >> + struct address_space *mapping = inode->i_mapping;
> >> + int error = 0;
> >> +
> >> + if (offset)
> >> + return -EINVAL;
> >> +
> >> + i_mmap_lock_read(mapping);
> >
> > (Are we allowed to take address_space->i_mmap_rwsem while holding
> > xfs_inode->i_mmaplock?)
>
> Empirically, yes. Lockdep complains when those locks are taken in the
> reverse order.
My pet hate: people who rely on lockdep to tell them that locking is
wrong rather than understanding what the correct locking order is
before writing code.
> That seems to be inconsistent with the "mmap_sem -> i_mmap_lock ->
> page_lock" note in the xfs_ilock comment. Am I confusing what
> i_mmap_lock means in that comment, is that the i_mmap_lock_read(), or
> the i_mmaplock in the xfs_inode?
The latter. The lock orders you need to pay attention to are
documented in mm/filemap.c. (Which, incidentally, needs updating to
refer to i_rwsem, not i_mutex.)
* ->i_mutex
* ->i_mmap_rwsem (truncate->unmap_mapping_range)
* ->mmap_sem
* ->i_mmap_rwsem
* ->page_table_lock or pte_lock (various, mainly in memory.c)
* ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock)
As it is, I think we shold not be taking internal mm/ state locks
deep inside filesystem code as it smells of layering violations. We
don't do this anywhere else for mapping checks - if we already hold
the XFS_MMAPLOCK_EXCL here, then we've already locked out page
faults from changing the state of the inode. In which case, we
should not need a mmap internal lock to be held here, same as all
the other filesystem operations that lock out page faults....
> >> + xfs_ilock(ip, XFS_ILOCK_EXCL);
> >> + if (len == 0) {
> >> + /*
> >> + * Clear the immutable flag provided there are no active
> >> + * mappings. The active mapping check prevents an
> >> + * application that is assuming a static block map, for
> >> + * DAX or peer-to-peer DMA, from having this state
> >> + * silently change behind its back.
> >> + */
> >> + if (RB_EMPTY_ROOT(&mapping->i_mmap))
mapping_mapped(mapping)
> >> + inode->i_flags &= ~S_IOMAP_IMMUTABLE;
> >> + else
> >> + error = -EBUSY;
> >> + } else if (IS_IOMAP_IMMUTABLE(inode)) {
> >> + if (len == i_size_read(inode)) {
> >> + /*
> >> + * The file is already in the correct state,
> >> + * bail out without error below.
> >> + */
> >> + len = 0;
> >> + } else {
> >> + /* too late to allocate more space */
> >> + error = -ETXTBSY;
> >> + }
> >> + } else {
> >> + if (len < i_size_read(inode)) {
> >> + /*
> >> + * Since S_IOMAP_IMMUTABLE is inode global it
> >> + * does not make sense to fallocate(immutable)
> >> + * on a sub-range of the file.
> >> + */
> >> + error = -EINVAL;
> >> + } else if (!RB_EMPTY_ROOT(&mapping->i_mmap)) {
> >> + /*
> >> + * It's not strictly required to prevent setting
> >> + * immutable while a file is already mapped, but
> >> + * we do it for simplicity and symmetry with the
> >> + * S_IOMAP_IMMUTABLE disable case.
> >> + */
> >> + error = -EBUSY;
> >> + } else
> >> + inode->i_flags |= S_IOMAP_IMMUTABLE;
> >> + }
> >> + xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >> + i_mmap_unlock_read(mapping);
> >> +
> >> + if (error || len == 0)
> >> + return error;
I have to say, I find this checking to be fairly grotty. The "len ==
0" API to remove the immutable flag is a gross hack. IMO, it's
better to add a separate fallocate command to "unseal" the extent
map, and let that happen according to whether the file is mapped or
not. Perhaps it would be better to start with a man page
documenting the desired API?
FWIW, the if/else if/else structure could be cleaned up with a
simple "goto out_unlock" construct such as:
/* don't make immutable if inode is currently mapped */
error = -EBUSY;
if (mapping_mapped(mapping))
goto out_unlock;
/* can't do anything if inode is already immutable */
error = -ETXTBSY;
if (IS_IMMUTABLE(inode) || IS_IOMAP_IMMUTABLE(inode))
goto out_unlock;
/* XFS only supports whole file extent immutability */
error = -EINVAL;
if (len != i_size_read(inode))
goto out_unlock;
/* all good to go */
error = 0;
out_unlock:
xfs_iunlock(ip, XFS_ILOCK_EXCL);
i_mmap_unlock_read(mapping);
if (error)
return error;
/* now unshare, allocate and add immutable flag */
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sat, Jul 29, 2017 at 12:43:46PM -0700, Dan Williams wrote:
> Record the immutable state in the on-disk inode so that on the next boot
> the protections against reflink and hole punch etc are automatically
> restored.
Keep in mind we can't do this without all the userspace side support
for the addition to the in-disk format....
> This deliberately does not add a FS_XFLAG_IOMAP_IMMUTABLE since
> fallocate(2) is the path to toggle this flag.
That's a problem. The flag needs to be added so that we can /view/
the state of the inode through xfs_io. Just because we can't set it
through the extended inode flag interface doesn't mean the flag
should not exist.
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index c4fc79a0704f..1dcb521da456 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1021,7 +1021,8 @@ xfs_alloc_file_space(
> struct xfs_inode *ip,
> xfs_off_t offset,
> xfs_off_t len,
> - int alloc_type)
> + int alloc_type,
> + uint64_t di_flags2)
> {
> xfs_mount_t *mp = ip->i_mount;
> xfs_off_t count;
> @@ -1119,6 +1120,12 @@ xfs_alloc_file_space(
> break;
> }
> xfs_ilock(ip, XFS_ILOCK_EXCL);
> + if (di_flags2) {
> + /* fold inode attributes for this allocation */
> + ip->i_d.di_flags2 |= di_flags2;
> + di_flags2 = 0;
> + }
Yikes, no! Darrick already mentioned this, but it needs pointing out
again...
Especially as it means that we are setting the immutable flag before
we've allocated all the extents to fill the file space. If we've
implemented immutable extent maps correctly, then xfs_bmapi_write()
should be rejecting any attempt to allocate or modify extents if
that flag is set on the inode, which means this code will now fail
to allocate/zero anything...
IOWs, this flag should be the last thing that is set on the inode
once it's been fully allocated and zeroed.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Jul 31, 2017, at 02:23 PM, Darrick J. Wong wrote:
> I don't think F_SEAL_{SHRINK,GROW} prevents reflinking or CoW of file data,
> which are two things that cannot happen under S_IOMAP_IMMUTABLE that
> aren't size changes. From the implementation it looks like shrink and
> grow are only supposed to disallow changes to i_size, not i_blocks (or
> the file block map).
True.
> Then again, I suppose F_SEAL_* only work on shmem, so maybe it simply
> isn't defined for any other filesystem...? e.g. it doesn't prohibit
> reflink, but the only fs implementing seals doesn't support reflink.
>
> <shrug>
>
> Seals cannot be removed, which is too strict for the S_IOMAP_IMMUTABLE
> user cases being presented.
To be clear, the set of use cases is swap files and DAX, right? Or is there anything else?
I can't imagine why anyone would want to turn a swap file back into a regular file.
I haven't fully followed DAX, but I'd take your word for it if people want to
be able to remove the flag after.
Anyways, I think your broader point is right; the use cases are different enough
that it doesn't make sense to try to add S_CONTENT_IMMUTABLE (or however
one decides to call it) at the same time.
On Mon, Jul 31, 2017 at 10:15:12PM -0400, Colin Walters wrote:
> On Mon, Jul 31, 2017, at 02:23 PM, Darrick J. Wong wrote:
>
> > I don't think F_SEAL_{SHRINK,GROW} prevents reflinking or CoW of file data,
> > which are two things that cannot happen under S_IOMAP_IMMUTABLE that
> > aren't size changes. From the implementation it looks like shrink and
> > grow are only supposed to disallow changes to i_size, not i_blocks (or
> > the file block map).
>
> True.
>
> > Then again, I suppose F_SEAL_* only work on shmem, so maybe it simply
> > isn't defined for any other filesystem...? e.g. it doesn't prohibit
> > reflink, but the only fs implementing seals doesn't support reflink.
> >
> > <shrug>
> >
> > Seals cannot be removed, which is too strict for the S_IOMAP_IMMUTABLE
> > user cases being presented.
>
> To be clear, the set of use cases is swap files and DAX, right? Or is there anything else?
I've outlined other use cases in previous discussions. To repeat
myself, every so often we get someone with, say, a new high
speed camera that want to dma the camera frames direct to the
storage because they can't push 500,000 frames/s through the CPU
to storage. Hence they want to bypass the OS and DMA the data direct
to the storage. To do this they need a mechanism to freeze and unfreeze
the block map of the file so that nothing modifies the block map
while the camera hardware is dumping data direct to the storage.
Immutable extent maps provide the functionality they need to
implement this safely.
There's also other similar use cases for RDMA targets on PMEM
(regardless of whether DAX is enabled or not), and I've come across
a couple of requests for mechanisms to allow fabric based nvme
storage to do direct data transfers between storage devices, too.
All of these use cases can be safely implemented if there is a
mechanism to mark extent maps as immutable for the duration of
the operation they need to perform.
> I can't imagine why anyone would want to turn a swap file back into a regular file.
> I haven't fully followed DAX, but I'd take your word for it if people want to
> be able to remove the flag after.
DAX isn't the driver of that functionality, it's the other use cases
that need it, and why the proposed "only remove flag if len == 0"
API is a non-starter....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Aug 01, 2017 at 12:42:18PM +1000, Dave Chinner wrote:
> I've outlined other use cases in previous discussions. To repeat
> myself, every so often we get someone with, say, a new high
> speed camera that want to dma the camera frames direct to the
> storage because they can't push 500,000 frames/s through the CPU
> to storage. Hence they want to bypass the OS and DMA the data direct
> to the storage. To do this they need a mechanism to freeze and unfreeze
> the block map of the file so that nothing modifies the block map
> while the camera hardware is dumping data direct to the storage.
> Immutable extent maps provide the functionality they need to
> implement this safely.
And we have such a mechanism already: it's called the iolock during
I/O, and dirct I/O. I've worked on plenty such schemes and the proper way
works perfectly fine. Just because people ask for stupid ways to
archives that doesn't mean they understand what they are doing.
> There's also other similar use cases for RDMA targets on PMEM
> (regardless of whether DAX is enabled or not), and I've come across
> a couple of requests for mechanisms to allow fabric based nvme
> storage to do direct data transfers between storage devices, too.
> All of these use cases can be safely implemented if there is a
> mechanism to mark extent maps as immutable for the duration of
> the operation they need to perform.
As someone who spent most of them time on the last 2 years in this
area: we have a massive problem discoverability and addressing
(lack of struct page) for p2p devices. We have absolutely no problem
with the direct I/O model with them.
> DAX isn't the driver of that functionality, it's the other use cases
> that need it, and why the proposed "only remove flag if len == 0"
> API is a non-starter....
The other "use" cases are even more bullshit than the DAX one.