This series introduces a proposal to implementing atomic writes in the
kernel for torn-write protection.
This series takes the approach of adding a new "atomic" flag to each of
pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively.
When set, these indicate that we want the write issued "atomically".
Only direct IO is supported and for block devices and XFS.
The atomic writes feature requires dedicated HW support, like
SCSI WRITE_ATOMIC_16 command.
man pages update has been posted at:
https://lore.kernel.org/linux-api/[email protected]/T/#t
The goal here is to provide an interface that allow applications use
application-specific block sizes larger than logical block size
reported by the storage device or larger than filesystem block size as
reported by stat().
With this new interface, application blocks will never be torn or
fractured when written. For a power fail, for each individual application
block, all or none of the data to be written. A racing atomic write and
read will mean that the read sees all the old data or all the new data,
but never a mix of old and new.
Two new fields are added to struct statx - atomic_write_unit_min and
atomic_write_unit_max. For each atomic individual write, the total length
of a write must be a between atomic_write_unit_min and
atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
at a natural offset in the file wrt the write length.
For XFS, we must ensure extent alignment with the userspace block size.
XFS supports an extent size hint. However, it must be ensured that the
hint is honoured. For this, a new flag is added - forcealign - to
instruct the XFS block allocator to always honour the extent size hint.
The user would typically set the extent size hint at the userspace
block size to support atomic writes.
The atomic_write_unit_{min, max} values from statx on an XFS file will
consider both the backing bdev atomic_write_unit_{min, max} values and
the extent alignment for the file.
SCSI sd.c and scsi_debug and NVMe kernel support is added.
xfsprogs update for forcealign is at:
https://lore.kernel.org/linux-xfs/[email protected]/T/#t
This series is based on v6.6-rc3.
Major changes since RFC (https://lore.kernel.org/linux-scsi/[email protected]/):
- Add XFS forcealign feature
- Only allow writing a single userspace block
Alan Adamson (1):
nvme: Support atomic writes
Darrick J. Wong (3):
fs: xfs: Introduce FORCEALIGN inode flag
fs: xfs: Make file data allocations observe the 'forcealign' flag
fs: xfs: Enable file data forcealign feature
Himanshu Madhani (2):
block: Add atomic write operations to request_queue limits
block: Add REQ_ATOMIC flag
John Garry (13):
block: Limit atomic writes according to bio and queue limits
block: Pass blk_queue_get_max_sectors() a request pointer
block: Limit atomic write IO size according to
atomic_write_max_sectors
block: Error an attempt to split an atomic write bio
block: Add checks to merging of atomic writes
block: Add fops atomic write support
fs: xfs: Don't use low-space allocator for alignment > 1
fs: xfs: Support atomic write for statx
fs: iomap: Atomic write support
fs: xfs: iomap atomic write support
scsi: sd: Support reading atomic properties from block limits VPD
scsi: sd: Add WRITE_ATOMIC_16 support
scsi: scsi_debug: Atomic write support
Prasad Singamsetty (2):
fs/bdev: Add atomic write support info to statx
fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support
Documentation/ABI/stable/sysfs-block | 42 ++
block/bdev.c | 33 +-
block/blk-merge.c | 92 ++++-
block/blk-mq.c | 2 +-
block/blk-settings.c | 76 ++++
block/blk-sysfs.c | 33 ++
block/blk.h | 9 +-
block/fops.c | 42 +-
drivers/nvme/host/core.c | 29 ++
drivers/scsi/scsi_debug.c | 587 +++++++++++++++++++++------
drivers/scsi/scsi_trace.c | 22 +
drivers/scsi/sd.c | 57 ++-
drivers/scsi/sd.h | 7 +
fs/iomap/direct-io.c | 26 +-
fs/iomap/trace.h | 3 +-
fs/stat.c | 15 +-
fs/xfs/libxfs/xfs_bmap.c | 26 +-
fs/xfs/libxfs/xfs_format.h | 9 +-
fs/xfs/libxfs/xfs_inode_buf.c | 40 ++
fs/xfs/libxfs/xfs_inode_buf.h | 3 +
fs/xfs/libxfs/xfs_sb.c | 3 +
fs/xfs/xfs_inode.c | 12 +
fs/xfs/xfs_inode.h | 5 +
fs/xfs/xfs_ioctl.c | 18 +
fs/xfs/xfs_iomap.c | 40 +-
fs/xfs/xfs_iops.c | 51 +++
fs/xfs/xfs_iops.h | 4 +
fs/xfs/xfs_mount.h | 2 +
fs/xfs/xfs_super.c | 4 +
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 37 +-
include/linux/fs.h | 1 +
include/linux/iomap.h | 1 +
include/linux/stat.h | 2 +
include/scsi/scsi_proto.h | 1 +
include/trace/events/scsi.h | 1 +
include/uapi/linux/fs.h | 7 +-
include/uapi/linux/stat.h | 7 +-
38 files changed, 1179 insertions(+), 172 deletions(-)
--
2.31.1
Support providing info on atomic write unit min and max for an inode.
For simplicity, currently we limit the min at the FS block size, but a
lower limit could be supported in future.
The atomic write unit min and max is limited by the guaranteed extent
alignment for the inode.
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_iops.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_iops.h | 4 ++++
2 files changed, 55 insertions(+)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 1c1e6171209d..5bff80748223 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -546,6 +546,46 @@ xfs_stat_blksize(
return PAGE_SIZE;
}
+void xfs_ip_atomic_write_attr(struct xfs_inode *ip,
+ xfs_filblks_t *unit_min_fsb,
+ xfs_filblks_t *unit_max_fsb)
+{
+ xfs_extlen_t extsz_hint = xfs_get_extsz_hint(ip);
+ struct xfs_buftarg *target = xfs_inode_buftarg(ip);
+ struct block_device *bdev = target->bt_bdev;
+ struct xfs_mount *mp = ip->i_mount;
+ xfs_filblks_t atomic_write_unit_min,
+ atomic_write_unit_max,
+ align;
+
+ atomic_write_unit_min = XFS_B_TO_FSB(mp,
+ queue_atomic_write_unit_min_bytes(bdev->bd_queue));
+ atomic_write_unit_max = XFS_B_TO_FSB(mp,
+ queue_atomic_write_unit_max_bytes(bdev->bd_queue));
+
+ /* for RT, unset extsize gives hint of 1 */
+ /* for !RT, unset extsize gives hint of 0 */
+ if (extsz_hint && (XFS_IS_REALTIME_INODE(ip) ||
+ (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)))
+ align = extsz_hint;
+ else
+ align = 1;
+
+ if (atomic_write_unit_max == 0) {
+ *unit_min_fsb = 0;
+ *unit_max_fsb = 0;
+ } else if (atomic_write_unit_min == 0) {
+ *unit_min_fsb = 1;
+ *unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
+ align);
+ } else {
+ *unit_min_fsb = min_t(xfs_filblks_t, atomic_write_unit_min,
+ align);
+ *unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
+ align);
+ }
+}
+
STATIC int
xfs_vn_getattr(
struct mnt_idmap *idmap,
@@ -614,6 +654,17 @@ xfs_vn_getattr(
stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
stat->dio_offset_align = bdev_logical_block_size(bdev);
}
+ if (request_mask & STATX_WRITE_ATOMIC) {
+ xfs_filblks_t unit_min_fsb, unit_max_fsb;
+
+ xfs_ip_atomic_write_attr(ip, &unit_min_fsb,
+ &unit_max_fsb);
+ stat->atomic_write_unit_min = XFS_FSB_TO_B(mp, unit_min_fsb);
+ stat->atomic_write_unit_max = XFS_FSB_TO_B(mp, unit_max_fsb);
+ stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
+ stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
+ stat->result_mask |= STATX_WRITE_ATOMIC;
+ }
fallthrough;
default:
stat->blksize = xfs_stat_blksize(ip);
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 7f84a0843b24..b1e683b04301 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -19,4 +19,8 @@ int xfs_vn_setattr_size(struct mnt_idmap *idmap,
int xfs_inode_init_security(struct inode *inode, struct inode *dir,
const struct qstr *qstr);
+void xfs_ip_atomic_write_attr(struct xfs_inode *ip,
+ xfs_filblks_t *unit_min_fsb,
+ xfs_filblks_t *unit_max_fsb);
+
#endif /* __XFS_IOPS_H__ */
--
2.31.1
Currently an IO size is limited to the request_queue limits max_sectors.
Limit the size for an atomic write to queue limit atomic_write_max_sectors
value.
Signed-off-by: John Garry <[email protected]>
---
note: atomic_write_max_sectors should prob be limited to max_sectors,
which it isn't
block/blk-merge.c | 12 +++++++++++-
block/blk.h | 3 +++
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 0ccc251e22ff..8d4de9253fe9 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -171,7 +171,17 @@ static inline unsigned get_max_io_size(struct bio *bio,
{
unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT;
unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT;
- unsigned max_sectors = lim->max_sectors, start, end;
+ unsigned max_sectors, start, end;
+
+ /*
+ * We ignore lim->max_sectors for atomic writes simply because
+ * it may less than bio->write_atomic_unit, which we cannot
+ * tolerate.
+ */
+ if (bio->bi_opf & REQ_ATOMIC)
+ max_sectors = lim->atomic_write_max_sectors;
+ else
+ max_sectors = lim->max_sectors;
if (lim->chunk_sectors) {
max_sectors = min(max_sectors,
diff --git a/block/blk.h b/block/blk.h
index 94e330e9c853..6f6cd5b1830a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -178,6 +178,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request *rq)
if (unlikely(op == REQ_OP_WRITE_ZEROES))
return q->limits.max_write_zeroes_sectors;
+ if (rq->cmd_flags & REQ_ATOMIC)
+ return q->limits.atomic_write_max_sectors;
+
return q->limits.max_sectors;
}
--
2.31.1
From: Prasad Singamsetty <[email protected]>
Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
write is to be issued with torn write prevention, according to special
alignment and length rules.
Torn write prevention means that for a power or any other HW failure, all
or none of the data will be committed to storage, but never a mix of old
and new.
For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for
iocb->ki_flags field to indicate the same.
A call to statx will give the relevant atomic write info:
- atomic_write_unit_min
- atomic_write_unit_max
Both values are a power-of-2.
Applications can avail of atomic write feature by ensuring that the total
length of a write is a power-of-2 in size and also sized between
atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
must ensure that the write is at a naturally-aligned offset in the file
wrt the total write length.
Signed-off-by: Prasad Singamsetty <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
include/linux/fs.h | 1 +
include/uapi/linux/fs.h | 5 ++++-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b528f063e8ff..898952dee8eb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -328,6 +328,7 @@ enum rw_hint {
#define IOCB_SYNC (__force int) RWF_SYNC
#define IOCB_NOWAIT (__force int) RWF_NOWAIT
#define IOCB_APPEND (__force int) RWF_APPEND
+#define IOCB_ATOMIC (__force int) RWF_ATOMIC
/* non-RWF related bits - start at 16 */
#define IOCB_EVENTFD (1 << 16)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..e3b4f5bc6860 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -301,8 +301,11 @@ typedef int __bitwise __kernel_rwf_t;
/* per-IO O_APPEND */
#define RWF_APPEND ((__force __kernel_rwf_t)0x00000010)
+/* Atomic Write */
+#define RWF_ATOMIC ((__force __kernel_rwf_t)0x00000020)
+
/* mask of flags supported by the kernel */
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
- RWF_APPEND)
+ RWF_APPEND | RWF_ATOMIC)
#endif /* _UAPI_LINUX_FS_H */
--
2.31.1
We rely the block layer always being able to send a bio of size
atomic_write_unit_max without being required to split it due to request
queue or other bio limits.
A bio may contain min(BIO_MAX_VECS, limits->max_segments) vectors,
and each vector is at worst case the device logical block size from
direct IO alignment requirement.
Signed-off-by: John Garry <[email protected]>
---
block/blk-settings.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index d151be394c98..57d487a00c64 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -213,6 +213,18 @@ void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
}
EXPORT_SYMBOL(blk_queue_atomic_write_boundary_bytes);
+static unsigned int blk_queue_max_guaranteed_bio_size_sectors(
+ struct request_queue *q)
+{
+ struct queue_limits *limits = &q->limits;
+ unsigned int max_segments = min_t(unsigned int, BIO_MAX_VECS,
+ limits->max_segments);
+ /* Limit according to dev sector size as we only support direct-io */
+ unsigned int limit = max_segments * queue_logical_block_size(q);
+
+ return rounddown_pow_of_two(limit >> SECTOR_SHIFT);
+}
+
/**
* blk_queue_atomic_write_unit_min_sectors - smallest unit that can be written
* atomically to the device.
@@ -223,8 +235,10 @@ void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
unsigned int sectors)
{
struct queue_limits *limits = &q->limits;
+ unsigned int guaranteed_sectors =
+ blk_queue_max_guaranteed_bio_size_sectors(q);
- limits->atomic_write_unit_min_sectors = sectors;
+ limits->atomic_write_unit_min_sectors = min(guaranteed_sectors, sectors);
}
EXPORT_SYMBOL(blk_queue_atomic_write_unit_min_sectors);
@@ -238,8 +252,10 @@ void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
unsigned int sectors)
{
struct queue_limits *limits = &q->limits;
+ unsigned int guaranteed_sectors =
+ blk_queue_max_guaranteed_bio_size_sectors(q);
- limits->atomic_write_unit_max_sectors = sectors;
+ limits->atomic_write_unit_max_sectors = min(guaranteed_sectors, sectors);
}
EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);
--
2.31.1
For atomic writes we allow merging, but we must adhere to some additional
rules:
- Only allow merging of atomic writes with other atomic writes
- Ensure that the merged IO would not cross an atomic write boundary, if
any
We already ensure that we don't exceed the atomic writes size limit in
get_max_io_size().
Signed-off-by: John Garry <[email protected]>
---
block/blk-merge.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 72 insertions(+)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index bc21f8ff4842..5dc850924e29 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -18,6 +18,23 @@
#include "blk-rq-qos.h"
#include "blk-throttle.h"
+static bool bio_straddles_atomic_write_boundary(loff_t bi_sector,
+ unsigned int bi_size,
+ unsigned int boundary)
+{
+ loff_t start = bi_sector << SECTOR_SHIFT;
+ loff_t end = start + bi_size;
+ loff_t start_mod = start % boundary;
+ loff_t end_mod = end % boundary;
+
+ if (end - start > boundary)
+ return true;
+ if ((start_mod > end_mod) && (start_mod && end_mod))
+ return true;
+
+ return false;
+}
+
static inline void bio_get_first_bvec(struct bio *bio, struct bio_vec *bv)
{
*bv = mp_bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
@@ -664,6 +681,18 @@ int ll_back_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs)
return 0;
}
+ if (req->cmd_flags & REQ_ATOMIC) {
+ unsigned int atomic_write_boundary_bytes =
+ queue_atomic_write_boundary_bytes(req->q);
+
+ if (atomic_write_boundary_bytes &&
+ bio_straddles_atomic_write_boundary(req->__sector,
+ bio->bi_iter.bi_size + blk_rq_bytes(req),
+ atomic_write_boundary_bytes)) {
+ return 0;
+ }
+ }
+
return ll_new_hw_segment(req, bio, nr_segs);
}
@@ -683,6 +712,19 @@ static int ll_front_merge_fn(struct request *req, struct bio *bio,
return 0;
}
+ if (req->cmd_flags & REQ_ATOMIC) {
+ unsigned int atomic_write_boundary_bytes =
+ queue_atomic_write_boundary_bytes(req->q);
+
+ if (atomic_write_boundary_bytes &&
+ bio_straddles_atomic_write_boundary(
+ bio->bi_iter.bi_sector,
+ bio->bi_iter.bi_size + blk_rq_bytes(req),
+ atomic_write_boundary_bytes)) {
+ return 0;
+ }
+ }
+
return ll_new_hw_segment(req, bio, nr_segs);
}
@@ -719,6 +761,18 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
blk_rq_get_max_sectors(req, blk_rq_pos(req)))
return 0;
+ if (req->cmd_flags & REQ_ATOMIC) {
+ unsigned int atomic_write_boundary_bytes =
+ queue_atomic_write_boundary_bytes(req->q);
+
+ if (atomic_write_boundary_bytes &&
+ bio_straddles_atomic_write_boundary(req->__sector,
+ blk_rq_bytes(req) + blk_rq_bytes(next),
+ atomic_write_boundary_bytes)) {
+ return 0;
+ }
+ }
+
total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
if (total_phys_segments > blk_rq_get_max_segments(req))
return 0;
@@ -814,6 +868,18 @@ static enum elv_merge blk_try_req_merge(struct request *req,
return ELEVATOR_NO_MERGE;
}
+static bool blk_atomic_write_mergeable_rq_bio(struct request *rq,
+ struct bio *bio)
+{
+ return (rq->cmd_flags & REQ_ATOMIC) == (bio->bi_opf & REQ_ATOMIC);
+}
+
+static bool blk_atomic_write_mergeable_rqs(struct request *rq,
+ struct request *next)
+{
+ return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC);
+}
+
/*
* For non-mq, this has to be called with the request spinlock acquired.
* For mq with scheduling, the appropriate queue wide lock should be held.
@@ -833,6 +899,9 @@ static struct request *attempt_merge(struct request_queue *q,
if (req->ioprio != next->ioprio)
return NULL;
+ if (!blk_atomic_write_mergeable_rqs(req, next))
+ return NULL;
+
/*
* If we are allowed to merge, then append bio list
* from next to rq and release next. merge_requests_fn
@@ -960,6 +1029,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
if (rq->ioprio != bio_prio(bio))
return false;
+ if (blk_atomic_write_mergeable_rq_bio(rq, bio) == false)
+ return false;
+
return true;
}
--
2.31.1
Currently blk_queue_get_max_sectors() is passed a enum req_op, which does
not work for atomic writes. This is because an atomic write has a different
max sectors values to a regular write, and we need the rq->cmd_flags
to know that we have an atomic write, so pass the request pointer, which
has all information available.
Also use rq->cmd_flags instead of rq->bio->bi_opf when possible.
Signed-off-by: John Garry <[email protected]>
---
block/blk-merge.c | 3 ++-
block/blk-mq.c | 2 +-
block/blk.h | 6 ++++--
3 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 65e75efa9bd3..0ccc251e22ff 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -596,7 +596,8 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
if (blk_rq_is_passthrough(rq))
return q->limits.max_hw_sectors;
- max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
+ max_sectors = blk_queue_get_max_sectors(rq);
+
if (!q->limits.chunk_sectors ||
req_op(rq) == REQ_OP_DISCARD ||
req_op(rq) == REQ_OP_SECURE_ERASE)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1fafd54dce3c..21661778bdf0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3031,7 +3031,7 @@ void blk_mq_submit_bio(struct bio *bio)
blk_status_t blk_insert_cloned_request(struct request *rq)
{
struct request_queue *q = rq->q;
- unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
+ unsigned int max_sectors = blk_queue_get_max_sectors(rq);
unsigned int max_segments = blk_rq_get_max_segments(rq);
blk_status_t ret;
diff --git a/block/blk.h b/block/blk.h
index 08a358bc0919..94e330e9c853 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -166,9 +166,11 @@ static inline unsigned int blk_rq_get_max_segments(struct request *rq)
return queue_max_segments(rq->q);
}
-static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
- enum req_op op)
+static inline unsigned int blk_queue_get_max_sectors(struct request *rq)
{
+ struct request_queue *q = rq->q;
+ enum req_op op = req_op(rq);
+
if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
return min(q->limits.max_discard_sectors,
UINT_MAX >> SECTOR_SHIFT);
--
2.31.1
Ensure that when creating a mapping that we adhere to all the atomic
write rules.
We check that the mapping covers the complete range of the write to ensure
that we'll be just creating a single mapping.
Currently minimum granularity is the FS block size, but it should be
possibly to support lower in future.
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_iomap.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 70fe873951f3..3424fcfc04f5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -783,6 +783,7 @@ xfs_direct_write_iomap_begin(
{
struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
+ struct xfs_sb *m_sb = &mp->m_sb;
struct xfs_bmbt_irec imap, cmap;
xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset);
xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, length);
@@ -814,6 +815,41 @@ xfs_direct_write_iomap_begin(
if (error)
goto out_unlock;
+ if (flags & IOMAP_ATOMIC_WRITE) {
+ xfs_filblks_t unit_min_fsb, unit_max_fsb;
+
+ xfs_ip_atomic_write_attr(ip, &unit_min_fsb, &unit_max_fsb);
+
+ if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
+ error = -EIO;
+ goto out_unlock;
+ }
+
+ if (offset % m_sb->sb_blocksize ||
+ length % m_sb->sb_blocksize) {
+ error = -EIO;
+ goto out_unlock;
+ }
+
+ if (imap.br_blockcount == unit_min_fsb ||
+ imap.br_blockcount == unit_max_fsb) {
+ /* min and max must be a power-of-2 */
+ } else if (imap.br_blockcount < unit_min_fsb ||
+ imap.br_blockcount > unit_max_fsb) {
+ error = -EIO;
+ goto out_unlock;
+ } else if (!is_power_of_2(imap.br_blockcount)) {
+ error = -EIO;
+ goto out_unlock;
+ }
+
+ if (imap.br_startoff &&
+ imap.br_startoff % imap.br_blockcount) {
+ error = -EIO;
+ goto out_unlock;
+ }
+ }
+
if (imap_needs_cow(ip, flags, &imap, nimaps)) {
error = -EAGAIN;
if (flags & IOMAP_NOWAIT)
--
2.31.1
Add support for atomic writes, as follows:
- Ensure that the IO follows all the atomic writes rules, like must be
naturally aligned
- Set REQ_ATOMIC
Signed-off-by: John Garry <[email protected]>
---
block/fops.c | 42 +++++++++++++++++++++++++++++++++++++++++-
1 file changed, 41 insertions(+), 1 deletion(-)
diff --git a/block/fops.c b/block/fops.c
index acff3d5d22d4..516669ad69e5 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -41,6 +41,29 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
!bdev_iter_is_aligned(bdev, iter);
}
+static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
+ struct iov_iter *iter)
+{
+ unsigned int atomic_write_unit_min_bytes =
+ queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
+ unsigned int atomic_write_unit_max_bytes =
+ queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
+
+ if (!atomic_write_unit_min_bytes)
+ return false;
+ if (pos % atomic_write_unit_min_bytes)
+ return false;
+ if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
+ return false;
+ if (!is_power_of_2(iov_iter_count(iter)))
+ return false;
+ if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
+ return false;
+ if (pos % iov_iter_count(iter))
+ return false;
+ return true;
+}
+
#define DIO_INLINE_BIO_VECS 4
static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
@@ -48,6 +71,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
{
struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs;
+ bool is_read = iov_iter_rw(iter) == READ;
+ bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
loff_t pos = iocb->ki_pos;
bool should_dirty = false;
struct bio bio;
@@ -56,6 +81,9 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
if (blkdev_dio_unaligned(bdev, pos, iter))
return -EINVAL;
+ if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
+ return -EINVAL;
+
if (nr_pages <= DIO_INLINE_BIO_VECS)
vecs = inline_vecs;
else {
@@ -65,7 +93,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
return -ENOMEM;
}
- if (iov_iter_rw(iter) == READ) {
+ if (is_read) {
bio_init(&bio, bdev, vecs, nr_pages, REQ_OP_READ);
if (user_backed_iter(iter))
should_dirty = true;
@@ -74,6 +102,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
}
bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
bio.bi_ioprio = iocb->ki_ioprio;
+ if (atomic_write)
+ bio.bi_opf |= REQ_ATOMIC;
ret = bio_iov_iter_get_pages(&bio, iter);
if (unlikely(ret))
@@ -167,10 +197,14 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
struct blkdev_dio *dio;
struct bio *bio;
bool is_read = (iov_iter_rw(iter) == READ), is_sync;
+ bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
loff_t pos = iocb->ki_pos;
int ret = 0;
+ if (atomic_write)
+ return -EINVAL;
+
if (blkdev_dio_unaligned(bdev, pos, iter))
return -EINVAL;
@@ -305,6 +339,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
bool is_read = iov_iter_rw(iter) == READ;
blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
+ bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
struct blkdev_dio *dio;
struct bio *bio;
loff_t pos = iocb->ki_pos;
@@ -313,6 +348,9 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
if (blkdev_dio_unaligned(bdev, pos, iter))
return -EINVAL;
+ if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
+ return -EINVAL;
+
if (iocb->ki_flags & IOCB_ALLOC_CACHE)
opf |= REQ_ALLOC_CACHE;
bio = bio_alloc_bioset(bdev, nr_pages, opf, GFP_KERNEL,
@@ -347,6 +385,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
bio_set_pages_dirty(bio);
}
} else {
+ if (atomic_write)
+ bio->bi_opf |= REQ_ATOMIC;
task_io_account_write(bio->bi_iter.bi_size);
}
--
2.31.1
From: Prasad Singamsetty <[email protected]>
Extend statx system call to return additional info for atomic write support
support if the specified file is a block device.
Add initial support for a block device.
Signed-off-by: Prasad Singamsetty <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
block/bdev.c | 33 +++++++++++++++++++++++----------
fs/stat.c | 15 ++++++++-------
include/linux/blkdev.h | 4 ++--
include/linux/stat.h | 2 ++
include/uapi/linux/stat.h | 7 ++++++-
5 files changed, 41 insertions(+), 20 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index f3b13aa1b7d4..037a3d9ecbcb 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1028,24 +1028,37 @@ void sync_bdevs(bool wait)
iput(old_inode);
}
+#define BDEV_STATX_SUPPORTED_MSK (STATX_DIOALIGN | STATX_WRITE_ATOMIC)
+
/*
- * Handle STATX_DIOALIGN for block devices.
- *
- * Note that the inode passed to this is the inode of a block device node file,
- * not the block device's internal inode. Therefore it is *not* valid to use
- * I_BDEV() here; the block device has to be looked up by i_rdev instead.
+ * Handle STATX_{DIOALIGN, WRITE_ATOMIC} for block devices.
*/
-void bdev_statx_dioalign(struct inode *inode, struct kstat *stat)
+void bdev_statx(struct dentry *dentry, struct kstat *stat, u32 request_mask)
{
struct block_device *bdev;
- bdev = blkdev_get_no_open(inode->i_rdev);
+ if (!(request_mask & BDEV_STATX_SUPPORTED_MSK))
+ return;
+
+ bdev = blkdev_get_no_open(d_backing_inode(dentry)->i_rdev);
if (!bdev)
return;
- stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
- stat->dio_offset_align = bdev_logical_block_size(bdev);
- stat->result_mask |= STATX_DIOALIGN;
+ if (request_mask & STATX_DIOALIGN) {
+ stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
+ stat->dio_offset_align = bdev_logical_block_size(bdev);
+ stat->result_mask |= STATX_DIOALIGN;
+ }
+
+ if (request_mask & STATX_WRITE_ATOMIC) {
+ stat->atomic_write_unit_min =
+ queue_atomic_write_unit_min_bytes(bdev->bd_queue);
+ stat->atomic_write_unit_max =
+ queue_atomic_write_unit_max_bytes(bdev->bd_queue);
+ stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
+ stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
+ stat->result_mask |= STATX_WRITE_ATOMIC;
+ }
blkdev_put_no_open(bdev);
}
diff --git a/fs/stat.c b/fs/stat.c
index d43a5cc1bfa4..b840e58f41fa 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -250,13 +250,12 @@ static int vfs_statx(int dfd, struct filename *filename, int flags,
stat->attributes |= STATX_ATTR_MOUNT_ROOT;
stat->attributes_mask |= STATX_ATTR_MOUNT_ROOT;
- /* Handle STATX_DIOALIGN for block devices. */
- if (request_mask & STATX_DIOALIGN) {
- struct inode *inode = d_backing_inode(path.dentry);
-
- if (S_ISBLK(inode->i_mode))
- bdev_statx_dioalign(inode, stat);
- }
+ /* If this is a block device inode, override the filesystem
+ * attributes with the block device specific parameters
+ * that need to be obtained from the bdev backing inode
+ */
+ if (S_ISBLK(d_backing_inode(path.dentry)->i_mode))
+ bdev_statx(path.dentry, stat, request_mask);
path_put(&path);
if (retry_estale(error, lookup_flags)) {
@@ -649,6 +648,8 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
tmp.stx_mnt_id = stat->mnt_id;
tmp.stx_dio_mem_align = stat->dio_mem_align;
tmp.stx_dio_offset_align = stat->dio_offset_align;
+ tmp.stx_atomic_write_unit_min = stat->atomic_write_unit_min;
+ tmp.stx_atomic_write_unit_max = stat->atomic_write_unit_max;
return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c10e47bdb34f..f70988083734 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1533,7 +1533,7 @@ int sync_blockdev(struct block_device *bdev);
int sync_blockdev_range(struct block_device *bdev, loff_t lstart, loff_t lend);
int sync_blockdev_nowait(struct block_device *bdev);
void sync_bdevs(bool wait);
-void bdev_statx_dioalign(struct inode *inode, struct kstat *stat);
+void bdev_statx(struct dentry *dentry, struct kstat *stat, u32 request_mask);
void printk_all_partitions(void);
int __init early_lookup_bdev(const char *pathname, dev_t *dev);
#else
@@ -1551,7 +1551,7 @@ static inline int sync_blockdev_nowait(struct block_device *bdev)
static inline void sync_bdevs(bool wait)
{
}
-static inline void bdev_statx_dioalign(struct inode *inode, struct kstat *stat)
+static inline void bdev_statx(struct dentry *dentry, struct kstat *stat, u32 request_mask)
{
}
static inline void printk_all_partitions(void)
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 52150570d37a..dfa69ecfaacf 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -53,6 +53,8 @@ struct kstat {
u32 dio_mem_align;
u32 dio_offset_align;
u64 change_cookie;
+ u32 atomic_write_unit_max;
+ u32 atomic_write_unit_min;
};
/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 7cab2c65d3d7..c99d7cac2aa6 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -127,7 +127,10 @@ struct statx {
__u32 stx_dio_mem_align; /* Memory buffer alignment for direct I/O */
__u32 stx_dio_offset_align; /* File offset alignment for direct I/O */
/* 0xa0 */
- __u64 __spare3[12]; /* Spare space for future expansion */
+ __u32 stx_atomic_write_unit_max;
+ __u32 stx_atomic_write_unit_min;
+ /* 0xb0 */
+ __u64 __spare3[11]; /* Spare space for future expansion */
/* 0x100 */
};
@@ -154,6 +157,7 @@ struct statx {
#define STATX_BTIME 0x00000800U /* Want/got stx_btime */
#define STATX_MNT_ID 0x00001000U /* Got stx_mnt_id */
#define STATX_DIOALIGN 0x00002000U /* Want/got direct I/O alignment info */
+#define STATX_WRITE_ATOMIC 0x00004000U /* Want/got atomic_write_* fields */
#define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */
@@ -189,6 +193,7 @@ struct statx {
#define STATX_ATTR_MOUNT_ROOT 0x00002000 /* Root of a mount */
#define STATX_ATTR_VERITY 0x00100000 /* [I] Verity protected file */
#define STATX_ATTR_DAX 0x00200000 /* File is currently in DAX state */
+#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File supports atomic write operations */
#endif /* _UAPI_LINUX_STAT_H */
--
2.31.1
Add function sd_setup_atomic_cmnd() to setup an WRITE_ATOMIC_16
CDB for when REQ_ATOMIC flag is set for the request.
Also add trace info.
Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/scsi_trace.c | 22 ++++++++++++++++++++++
drivers/scsi/sd.c | 20 ++++++++++++++++++++
include/scsi/scsi_proto.h | 1 +
include/trace/events/scsi.h | 1 +
4 files changed, 44 insertions(+)
diff --git a/drivers/scsi/scsi_trace.c b/drivers/scsi/scsi_trace.c
index 41a950075913..3e47c4472a80 100644
--- a/drivers/scsi/scsi_trace.c
+++ b/drivers/scsi/scsi_trace.c
@@ -325,6 +325,26 @@ scsi_trace_zbc_out(struct trace_seq *p, unsigned char *cdb, int len)
return ret;
}
+static const char *
+scsi_trace_atomic_write16_out(struct trace_seq *p, unsigned char *cdb, int len)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+ unsigned int boundary_size;
+ unsigned int nr_blocks;
+ sector_t lba;
+
+ lba = get_unaligned_be64(&cdb[2]);
+ boundary_size = get_unaligned_be16(&cdb[10]);
+ nr_blocks = get_unaligned_be16(&cdb[12]);
+
+ trace_seq_printf(p, "lba=%llu txlen=%u boundary_size=%u",
+ lba, nr_blocks, boundary_size);
+
+ trace_seq_putc(p, 0);
+
+ return ret;
+}
+
static const char *
scsi_trace_varlen(struct trace_seq *p, unsigned char *cdb, int len)
{
@@ -385,6 +405,8 @@ scsi_trace_parse_cdb(struct trace_seq *p, unsigned char *cdb, int len)
return scsi_trace_zbc_in(p, cdb, len);
case ZBC_OUT:
return scsi_trace_zbc_out(p, cdb, len);
+ case WRITE_ATOMIC_16:
+ return scsi_trace_atomic_write16_out(p, cdb, len);
default:
return scsi_trace_misc(p, cdb, len);
}
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7f6cadd1f8f3..1a41656dac2d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1129,6 +1129,23 @@ static int sd_cdl_dld(struct scsi_disk *sdkp, struct scsi_cmnd *scmd)
return (hint - IOPRIO_HINT_DEV_DURATION_LIMIT_1) + 1;
}
+static blk_status_t sd_setup_atomic_cmnd(struct scsi_cmnd *cmd,
+ sector_t lba, unsigned int nr_blocks,
+ unsigned char flags)
+{
+ cmd->cmd_len = 16;
+ cmd->cmnd[0] = WRITE_ATOMIC_16;
+ cmd->cmnd[1] = flags;
+ put_unaligned_be64(lba, &cmd->cmnd[2]);
+ cmd->cmnd[10] = 0;
+ cmd->cmnd[11] = 0;
+ put_unaligned_be16(nr_blocks, &cmd->cmnd[12]);
+ cmd->cmnd[14] = 0;
+ cmd->cmnd[15] = 0;
+
+ return BLK_STS_OK;
+}
+
static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
{
struct request *rq = scsi_cmd_to_rq(cmd);
@@ -1139,6 +1156,7 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
unsigned int nr_blocks = sectors_to_logical(sdp, blk_rq_sectors(rq));
unsigned int mask = logical_to_sectors(sdp, 1) - 1;
bool write = rq_data_dir(rq) == WRITE;
+ bool atomic_write = !!(rq->cmd_flags & REQ_ATOMIC) && write;
unsigned char protect, fua;
unsigned int dld;
blk_status_t ret;
@@ -1200,6 +1218,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
if (protect && sdkp->protection_type == T10_PI_TYPE2_PROTECTION) {
ret = sd_setup_rw32_cmnd(cmd, write, lba, nr_blocks,
protect | fua, dld);
+ } else if (atomic_write) {
+ ret = sd_setup_atomic_cmnd(cmd, lba, nr_blocks, protect | fua);
} else if (sdp->use_16_for_rw || (nr_blocks > 0xffff)) {
ret = sd_setup_rw16_cmnd(cmd, write, lba, nr_blocks,
protect | fua, dld);
diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h
index 07d65c1f59db..833de67305b5 100644
--- a/include/scsi/scsi_proto.h
+++ b/include/scsi/scsi_proto.h
@@ -119,6 +119,7 @@
#define WRITE_SAME_16 0x93
#define ZBC_OUT 0x94
#define ZBC_IN 0x95
+#define WRITE_ATOMIC_16 0x9c
#define SERVICE_ACTION_BIDIRECTIONAL 0x9d
#define SERVICE_ACTION_IN_16 0x9e
#define SERVICE_ACTION_OUT_16 0x9f
diff --git a/include/trace/events/scsi.h b/include/trace/events/scsi.h
index 8e2d9b1b0e77..05f1945ed204 100644
--- a/include/trace/events/scsi.h
+++ b/include/trace/events/scsi.h
@@ -102,6 +102,7 @@
scsi_opcode_name(WRITE_32), \
scsi_opcode_name(WRITE_SAME_32), \
scsi_opcode_name(ATA_16), \
+ scsi_opcode_name(WRITE_ATOMIC_16), \
scsi_opcode_name(ATA_12))
#define scsi_hostbyte_name(result) { result, #result }
--
2.31.1
On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote:
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 7cab2c65d3d7..c99d7cac2aa6 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -127,7 +127,10 @@ struct statx {
> __u32 stx_dio_mem_align; /* Memory buffer alignment for direct I/O */
> __u32 stx_dio_offset_align; /* File offset alignment for direct I/O */
> /* 0xa0 */
> - __u64 __spare3[12]; /* Spare space for future expansion */
> + __u32 stx_atomic_write_unit_max;
> + __u32 stx_atomic_write_unit_min;
Maybe min first and then max? That seems a bit more natural, and a lot of the
code you've written handle them in that order.
> +#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File supports atomic write operations */
How would this differ from stx_atomic_write_unit_min != 0?
- Eric
From: "Darrick J. Wong" <[email protected]>
Add a new inode flag to require that all file data extent mappings must
be aligned (both the file offset range and the allocated space itself)
to the extent size hint. Having a separate COW extent size hint is no
longer allowed.
The goal here is to enable sysadmins and users to mandate that all space
mappings in a file must have a startoff/blockcount that are aligned to
(say) a 2MB alignment and that the startblock/blockcount will follow the
same alignment.
Signed-off-by: Darrick J. Wong <[email protected]>
Co-developed-by: John Garry <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_format.h | 6 +++++-
fs/xfs/libxfs/xfs_inode_buf.c | 40 +++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_inode_buf.h | 3 +++
fs/xfs/libxfs/xfs_sb.c | 3 +++
fs/xfs/xfs_inode.c | 12 +++++++++++
fs/xfs/xfs_inode.h | 5 +++++
fs/xfs/xfs_ioctl.c | 18 ++++++++++++++++
fs/xfs/xfs_mount.h | 2 ++
fs/xfs/xfs_super.c | 4 ++++
include/uapi/linux/fs.h | 2 ++
10 files changed, 94 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 371dc07233e0..d718b73f48ca 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -353,6 +353,7 @@ xfs_sb_has_compat_feature(
#define XFS_SB_FEAT_RO_COMPAT_RMAPBT (1 << 1) /* reverse map btree */
#define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
#define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
+#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30) /* aligned file data extents */
#define XFS_SB_FEAT_RO_COMPAT_ALL \
(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
@@ -1069,16 +1070,19 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
#define XFS_DIFLAG2_COWEXTSIZE_BIT 2 /* copy on write extent size hint */
#define XFS_DIFLAG2_BIGTIME_BIT 3 /* big timestamps */
#define XFS_DIFLAG2_NREXT64_BIT 4 /* large extent counters */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define XFS_DIFLAG2_FORCEALIGN_BIT 5
#define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
#define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
#define XFS_DIFLAG2_COWEXTSIZE (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
#define XFS_DIFLAG2_BIGTIME (1 << XFS_DIFLAG2_BIGTIME_BIT)
#define XFS_DIFLAG2_NREXT64 (1 << XFS_DIFLAG2_NREXT64_BIT)
+#define XFS_DIFLAG2_FORCEALIGN (1 << XFS_DIFLAG2_FORCEALIGN_BIT)
#define XFS_DIFLAG2_ANY \
(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
- XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
+ XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)
static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
{
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index a35781577cad..0c4d492c9363 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -605,6 +605,14 @@ xfs_dinode_verify(
!xfs_has_bigtime(mp))
return __this_address;
+ if (flags2 & XFS_DIFLAG2_FORCEALIGN) {
+ fa = xfs_inode_validate_forcealign(mp, mode, flags,
+ be32_to_cpu(dip->di_extsize),
+ be32_to_cpu(dip->di_cowextsize));
+ if (fa)
+ return fa;
+ }
+
return NULL;
}
@@ -772,3 +780,35 @@ xfs_inode_validate_cowextsize(
return NULL;
}
+
+/* Validate the forcealign inode flag */
+xfs_failaddr_t
+xfs_inode_validate_forcealign(
+ struct xfs_mount *mp,
+ uint16_t mode,
+ uint16_t flags,
+ uint32_t extsize,
+ uint32_t cowextsize)
+{
+ /* superblock rocompat feature flag */
+ if (!xfs_has_forcealign(mp))
+ return __this_address;
+
+ /* Only regular files and directories */
+ if (!S_ISDIR(mode) && !S_ISREG(mode))
+ return __this_address;
+
+ /* Doesn't apply to realtime files */
+ if (flags & XFS_DIFLAG_REALTIME)
+ return __this_address;
+
+ /* Requires a nonzero extent size hint */
+ if (extsize == 0)
+ return __this_address;
+
+ /* Requires no cow extent size hint */
+ if (cowextsize != 0)
+ return __this_address;
+
+ return NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 585ed5a110af..50db17d22b68 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -33,6 +33,9 @@ xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
xfs_failaddr_t xfs_inode_validate_cowextsize(struct xfs_mount *mp,
uint32_t cowextsize, uint16_t mode, uint16_t flags,
uint64_t flags2);
+xfs_failaddr_t xfs_inode_validate_forcealign(struct xfs_mount *mp,
+ uint16_t mode, uint16_t flags, uint32_t extsize,
+ uint32_t cowextsize);
static inline uint64_t xfs_inode_encode_bigtime(struct timespec64 tv)
{
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 6264daaab37b..c1be74222c70 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -162,6 +162,9 @@ xfs_sb_version_to_features(
features |= XFS_FEAT_REFLINK;
if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
features |= XFS_FEAT_INOBTCNT;
+ if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
+ features |= XFS_FEAT_FORCEALIGN;
+
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
features |= XFS_FEAT_FTYPE;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f94f7b374041..3fbfb052c778 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -634,6 +634,8 @@ xfs_ip2xflags(
flags |= FS_XFLAG_DAX;
if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
flags |= FS_XFLAG_COWEXTSIZE;
+ if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
+ flags |= FS_XFLAG_FORCEALIGN;
}
if (xfs_inode_has_attr_fork(ip))
@@ -761,6 +763,8 @@ xfs_inode_inherit_flags2(
}
if (pip->i_diflags2 & XFS_DIFLAG2_DAX)
ip->i_diflags2 |= XFS_DIFLAG2_DAX;
+ if (pip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
+ ip->i_diflags2 |= XFS_DIFLAG2_FORCEALIGN;
/* Don't let invalid cowextsize hints propagate. */
failaddr = xfs_inode_validate_cowextsize(ip->i_mount, ip->i_cowextsize,
@@ -769,6 +773,14 @@ xfs_inode_inherit_flags2(
ip->i_diflags2 &= ~XFS_DIFLAG2_COWEXTSIZE;
ip->i_cowextsize = 0;
}
+
+ if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
+ failaddr = xfs_inode_validate_forcealign(ip->i_mount,
+ VFS_I(ip)->i_mode, ip->i_diflags, ip->i_extsize,
+ ip->i_cowextsize);
+ if (failaddr)
+ ip->i_diflags2 &= ~XFS_DIFLAG2_FORCEALIGN;
+ }
}
/*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0c5bdb91152e..f66a57085908 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -305,6 +305,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
}
+static inline bool xfs_inode_forcealign(struct xfs_inode *ip)
+{
+ return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
+}
+
/*
* Return the buftarg used for data allocations on a given inode.
*/
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 55bb01173cde..4c147def835f 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1109,6 +1109,8 @@ xfs_flags2diflags2(
di_flags2 |= XFS_DIFLAG2_DAX;
if (xflags & FS_XFLAG_COWEXTSIZE)
di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
+ if (xflags & FS_XFLAG_FORCEALIGN)
+ di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
return di_flags2;
}
@@ -1143,6 +1145,22 @@ xfs_ioctl_setattr_xflags(
if (i_flags2 && !xfs_has_v3inodes(mp))
return -EINVAL;
+ /*
+ * Force-align requires a nonzero extent size hint and a zero cow
+ * extent size hint. It doesn't apply to realtime files.
+ */
+ if (fa->fsx_xflags & FS_XFLAG_FORCEALIGN) {
+ if (!xfs_has_forcealign(mp))
+ return -EINVAL;
+ if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
+ return -EINVAL;
+ if (!(fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
+ FS_XFLAG_EXTSZINHERIT)))
+ return -EINVAL;
+ if (fa->fsx_xflags & FS_XFLAG_REALTIME)
+ return -EINVAL;
+ }
+
ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
ip->i_diflags2 = i_flags2;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index d19cca099bc3..a7553b8bcc64 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -287,6 +287,7 @@ typedef struct xfs_mount {
#define XFS_FEAT_BIGTIME (1ULL << 24) /* large timestamps */
#define XFS_FEAT_NEEDSREPAIR (1ULL << 25) /* needs xfs_repair */
#define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */
+#define XFS_FEAT_FORCEALIGN (1ULL << 27) /* aligned file data extents */
/* Mount features */
#define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */
@@ -350,6 +351,7 @@ __XFS_HAS_FEAT(inobtcounts, INOBTCNT)
__XFS_HAS_FEAT(bigtime, BIGTIME)
__XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
__XFS_HAS_FEAT(large_extent_counts, NREXT64)
+__XFS_HAS_FEAT(forcealign, FORCEALIGN)
/*
* Mount features
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 819a3568b28f..1bbb25df23a7 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1703,6 +1703,10 @@ xfs_fs_fill_super(
mp->m_features &= ~XFS_FEAT_DISCARD;
}
+ if (xfs_has_forcealign(mp))
+ xfs_warn(mp,
+"EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");
+
if (xfs_has_reflink(mp)) {
if (mp->m_sb.sb_rblocks) {
xfs_alert(mp,
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index e3b4f5bc6860..cbfb09bc1717 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -140,6 +140,8 @@ struct fsxattr {
#define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */
#define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */
#define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define FS_XFLAG_FORCEALIGN 0x00020000
#define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
/* the read-only stuff doesn't really belong here, but any other place is
--
2.31.1
Hi John,
kernel test robot noticed the following build errors:
[auto build test ERROR on xfs-linux/for-next]
[also build test ERROR on axboe-block/for-next mkp-scsi/for-next jejb-scsi/for-next linus/master v6.6-rc3 next-20230929]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/John-Garry/block-Add-atomic-write-operations-to-request_queue-limits/20230929-184542
base: https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
patch link: https://lore.kernel.org/r/20230929102726.2985188-10-john.g.garry%40oracle.com
patch subject: [PATCH 09/21] block: Add checks to merging of atomic writes
config: mips-mtx1_defconfig (https://download.01.org/0day-ci/archive/20230930/[email protected]/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230930/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
>> ld.lld: error: undefined symbol: __moddi3
>>> referenced by blk-merge.c
>>> block/blk-merge.o:(ll_back_merge_fn) in archive vmlinux.a
>>> referenced by blk-merge.c
>>> block/blk-merge.o:(ll_back_merge_fn) in archive vmlinux.a
>>> referenced by blk-merge.c
>>> block/blk-merge.o:(bio_attempt_front_merge) in archive vmlinux.a
>>> referenced 3 more times
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On 9/29/23 15:49, Eric Biggers wrote:
> On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote:
>> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
>> index 7cab2c65d3d7..c99d7cac2aa6 100644
>> --- a/include/uapi/linux/stat.h
>> +++ b/include/uapi/linux/stat.h
>> @@ -127,7 +127,10 @@ struct statx {
>> __u32 stx_dio_mem_align; /* Memory buffer alignment for direct I/O */
>> __u32 stx_dio_offset_align; /* File offset alignment for direct I/O */
>> /* 0xa0 */
>> - __u64 __spare3[12]; /* Spare space for future expansion */
>> + __u32 stx_atomic_write_unit_max;
>> + __u32 stx_atomic_write_unit_min;
>
> Maybe min first and then max? That seems a bit more natural, and a lot of the
> code you've written handle them in that order.
>
>> +#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File supports atomic write operations */
>
> How would this differ from stx_atomic_write_unit_min != 0?
Is it even possible that stx_atomic_write_unit_min == 0? My understanding
is that all Linux filesystems rely on the assumption that writing a single
logical block either succeeds or does not happen, even if a power failure
occurs between writing and reading a logical block.
Thanks,
Bart.
On 01/10/2023 14:23, Bart Van Assche wrote:
> On 9/29/23 15:49, Eric Biggers wrote:
>> On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote:
>>> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
>>> index 7cab2c65d3d7..c99d7cac2aa6 100644
>>> --- a/include/uapi/linux/stat.h
>>> +++ b/include/uapi/linux/stat.h
>>> @@ -127,7 +127,10 @@ struct statx {
>>> __u32 stx_dio_mem_align; /* Memory buffer alignment for
>>> direct I/O */
>>> __u32 stx_dio_offset_align; /* File offset alignment for
>>> direct I/O */
>>> /* 0xa0 */
>>> - __u64 __spare3[12]; /* Spare space for future expansion */
>>> + __u32 stx_atomic_write_unit_max;
>>> + __u32 stx_atomic_write_unit_min;
>>
>> Maybe min first and then max? That seems a bit more natural, and a
>> lot of the
>> code you've written handle them in that order.
ok, I think it's fine to reorder
>>
>>> +#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File supports
>>> atomic write operations */
>>
>> How would this differ from stx_atomic_write_unit_min != 0?
Yeah, I suppose that we can just not set this for the case of
stx_atomic_write_unit_min == 0.
>
> Is it even possible that stx_atomic_write_unit_min == 0? My understanding
> is that all Linux filesystems rely on the assumption that writing a single
> logical block either succeeds or does not happen, even if a power failure
> occurs between writing and reading a logical block.
>
Maybe they do rely on this, but is it particularly interesting?
BTW, I would not like to provide assurances that every storage media
produced writes logical blocks atomically.
Thanks,
John
On 10/2/23 02:51, John Garry wrote:
> On 01/10/2023 14:23, Bart Van Assche wrote:
>> Is it even possible that stx_atomic_write_unit_min == 0? My
>> understanding is that all Linux filesystems rely on the assumption
>> that writing a single logical block either succeeds or does not
>> happen, even if a power failure occurs between writing and reading
>> a logical block.
>
> Maybe they do rely on this, but is it particularly interesting?
>
> BTW, I would not like to provide assurances that every storage media
> produced writes logical blocks atomically.
Neither the SCSI SBC standard nor the NVMe standard defines a "minimum
atomic write unit". So why to introduce something in the Linux kernel
that is not defined in common storage standards?
I propose to leave out stx_atomic_write_unit_min from
struct statx and also to leave out atomic_write_unit_min_sectors from
struct queue_limits. My opinion is that we should not support block
devices in the Linux kernel that do not write logical blocks atomically.
Block devices that do not write logical blocks atomically are not
compatible with Linux kernel journaling filesystems. Additionally, I'm
not sure it's even possible to write a journaling filesystem for such
block devices.
Thanks,
Bart.
On 10/2/23 03:10, John Garry wrote:
> On 29/09/2023 18:51, Bart Van Assche wrote:
>> On 9/29/23 03:27, John Garry wrote:
> > + if (pos % atomic_write_unit_min_bytes)
> > + return false;
>
> See later rules.
Is atomic_write_unit_min_bytes always equal to the logical block size?
If so, can the above test be left out?
> > + if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
> > + return false;
>
> For SCSI, there is an atomic write granularity, which dictates
> atomic_write_unit_min_bytes. So here we need to ensure that the length
> is a multiple of this value.
Are there any SCSI devices that we care about that report an ATOMIC
TRANSFER LENGTH GRANULARITY that is larger than a single logical block?
I'm wondering whether we really have to support such devices.
> > + if (!is_power_of_2(iov_iter_count(iter)))
> > + return false;
>
> This rule comes from FS block alignment and NVMe atomic boundary.
>
> FSes (XFS) have discontiguous extents. We need to ensure that an atomic
> write does not cross discontiguous extents. To do this we ensure extent
> length and alignment and limit atomic_write_unit_max_bytes to that.
>
> For NVMe, an atomic write boundary is a boundary in LBA space which an
> atomic write should not cross. We limit atomic_write_unit_max_bytes such
> that it is evenly divisible into this atomic write boundary.
>
> To ensure that the write does not cross these alignment boundaries we
> say that it must be naturally aligned and a power-of-2 in length.
>
> We may be able to relax this rule but I am not sure it buys us anything
> - typically we want to be writing a 64KB block aligned to 64KB, for
> example.
It seems to me that the requirement is_power_of_2(iov_iter_count(iter))
is necessary for some filesystems but not for all filesystems.
Restrictions that are specific to a single filesystem (XFS) should not
occur in code that is intended to be used by all filesystems
(blkdev_atomic_write_valid()).
Thanks,
Bart.
On Sat, Sep 30, 2023 at 09:40:30PM +0800, kernel test robot wrote:
> Hi John,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on xfs-linux/for-next]
> [also build test ERROR on axboe-block/for-next mkp-scsi/for-next jejb-scsi/for-next linus/master v6.6-rc3 next-20230929]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/John-Garry/block-Add-atomic-write-operations-to-request_queue-limits/20230929-184542
> base: https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
> patch link: https://lore.kernel.org/r/20230929102726.2985188-10-john.g.garry%40oracle.com
> patch subject: [PATCH 09/21] block: Add checks to merging of atomic writes
> config: mips-mtx1_defconfig (https://download.01.org/0day-ci/archive/20230930/[email protected]/config)
> compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230930/[email protected]/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All errors (new ones prefixed by >>):
>
> >> ld.lld: error: undefined symbol: __moddi3
> >>> referenced by blk-merge.c
> >>> block/blk-merge.o:(ll_back_merge_fn) in archive vmlinux.a
> >>> referenced by blk-merge.c
> >>> block/blk-merge.o:(ll_back_merge_fn) in archive vmlinux.a
> >>> referenced by blk-merge.c
> >>> block/blk-merge.o:(bio_attempt_front_merge) in archive vmlinux.a
> >>> referenced 3 more times
This does not appear to be clang specific, I can reproduce it with GCC
12.3.0 and the same configuration target.
Cheers,
Nathan
Bart,
> Neither the SCSI SBC standard nor the NVMe standard defines a "minimum
> atomic write unit". So why to introduce something in the Linux kernel
> that is not defined in common storage standards?
From SBC-5:
"The ATOMIC TRANSFER LENGTH GRANULARITY field indicates the minimum
transfer length for an atomic write command."
> I propose to leave out stx_atomic_write_unit_min from
> struct statx and also to leave out atomic_write_unit_min_sectors from
> struct queue_limits. My opinion is that we should not support block
> devices in the Linux kernel that do not write logical blocks atomically.
The statx values exist to describe the limits for I/Os sent using
RWF_ATOMIC and IOCB_ATOMIC. These limits may be different from other
reported values such as the filesystem block size and the logical block
size of the underlying device.
--
Martin K. Petersen Oracle Linux Engineering
Bart,
> Are there any SCSI devices that we care about that report an ATOMIC
> TRANSFER LENGTH GRANULARITY that is larger than a single logical
> block?
Yes.
Note that code path used inside a storage device to guarantee atomicity
of an entire I/O may be substantially different from the code path which
only offers an incremental guarantee at a single logical or physical
block level (to the extent that those guarantees are offered at all but
that's a different kettle of fish).
> I'm wondering whether we really have to support such devices.
Yes.
--
Martin K. Petersen Oracle Linux Engineering
On Mon, Oct 02, 2023 at 10:51:36AM +0100, John Garry wrote:
> On 01/10/2023 14:23, Bart Van Assche wrote:
> > On 9/29/23 15:49, Eric Biggers wrote:
> > > On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote:
> > > > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> > > > index 7cab2c65d3d7..c99d7cac2aa6 100644
> > > > --- a/include/uapi/linux/stat.h
> > > > +++ b/include/uapi/linux/stat.h
> > > > @@ -127,7 +127,10 @@ struct statx {
> > > > ????? __u32??? stx_dio_mem_align;??? /* Memory buffer alignment
> > > > for direct I/O */
> > > > ????? __u32??? stx_dio_offset_align;??? /* File offset alignment
> > > > for direct I/O */
> > > > ????? /* 0xa0 */
> > > > -??? __u64??? __spare3[12];??? /* Spare space for future expansion */
> > > > +??? __u32??? stx_atomic_write_unit_max;
> > > > +??? __u32??? stx_atomic_write_unit_min;
> > >
> > > Maybe min first and then max?? That seems a bit more natural, and a
> > > lot of the
> > > code you've written handle them in that order.
>
> ok, I think it's fine to reorder
>
> > >
> > > > +#define STATX_ATTR_WRITE_ATOMIC??????? 0x00400000 /* File
> > > > supports atomic write operations */
> > >
> > > How would this differ from stx_atomic_write_unit_min != 0?
>
> Yeah, I suppose that we can just not set this for the case of
> stx_atomic_write_unit_min == 0.
Please use the STATX_ATTR_WRITE_ATOMIC flag to indicate that the
filesystem, file and underlying device support atomic writes when
the values are non-zero. The whole point of the attribute mask is
that the caller can check the mask for supported functionality
without having to read every field in the statx structure to
determine if the functionality it wants is present.
-Dave.
--
Dave Chinner
[email protected]
On Tue, Oct 03, 2023 at 12:51:49PM +1100, Dave Chinner wrote:
> On Mon, Oct 02, 2023 at 10:51:36AM +0100, John Garry wrote:
> > On 01/10/2023 14:23, Bart Van Assche wrote:
> > > On 9/29/23 15:49, Eric Biggers wrote:
> > > > On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote:
> > > > > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> > > > > index 7cab2c65d3d7..c99d7cac2aa6 100644
> > > > > --- a/include/uapi/linux/stat.h
> > > > > +++ b/include/uapi/linux/stat.h
> > > > > @@ -127,7 +127,10 @@ struct statx {
> > > > > ????? __u32??? stx_dio_mem_align;??? /* Memory buffer alignment
> > > > > for direct I/O */
> > > > > ????? __u32??? stx_dio_offset_align;??? /* File offset alignment
> > > > > for direct I/O */
> > > > > ????? /* 0xa0 */
> > > > > -??? __u64??? __spare3[12];??? /* Spare space for future expansion */
> > > > > +??? __u32??? stx_atomic_write_unit_max;
> > > > > +??? __u32??? stx_atomic_write_unit_min;
> > > >
> > > > Maybe min first and then max?? That seems a bit more natural, and a
> > > > lot of the
> > > > code you've written handle them in that order.
> >
> > ok, I think it's fine to reorder
> >
> > > >
> > > > > +#define STATX_ATTR_WRITE_ATOMIC??????? 0x00400000 /* File
> > > > > supports atomic write operations */
> > > >
> > > > How would this differ from stx_atomic_write_unit_min != 0?
> >
> > Yeah, I suppose that we can just not set this for the case of
> > stx_atomic_write_unit_min == 0.
>
> Please use the STATX_ATTR_WRITE_ATOMIC flag to indicate that the
> filesystem, file and underlying device support atomic writes when
> the values are non-zero. The whole point of the attribute mask is
> that the caller can check the mask for supported functionality
> without having to read every field in the statx structure to
> determine if the functionality it wants is present.
^^ Seconding what Dave said.
--D
> -Dave.
> --
> Dave Chinner
> [email protected]
On Fri, Sep 29, 2023 at 10:27:20AM +0000, John Garry wrote:
> Support providing info on atomic write unit min and max for an inode.
>
> For simplicity, currently we limit the min at the FS block size, but a
> lower limit could be supported in future.
>
> The atomic write unit min and max is limited by the guaranteed extent
> alignment for the inode.
>
> Signed-off-by: John Garry <[email protected]>
> ---
> fs/xfs/xfs_iops.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++
> fs/xfs/xfs_iops.h | 4 ++++
> 2 files changed, 55 insertions(+)
>
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 1c1e6171209d..5bff80748223 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -546,6 +546,46 @@ xfs_stat_blksize(
> return PAGE_SIZE;
> }
>
> +void xfs_ip_atomic_write_attr(struct xfs_inode *ip,
> + xfs_filblks_t *unit_min_fsb,
> + xfs_filblks_t *unit_max_fsb)
Formatting.
Also, we don't use variable name shorthand for function names -
xfs_get_atomic_write_hint(ip) to match xfs_get_extsz_hint(ip)
would be appropriate, right?
> +{
> + xfs_extlen_t extsz_hint = xfs_get_extsz_hint(ip);
> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> + struct block_device *bdev = target->bt_bdev;
> + struct xfs_mount *mp = ip->i_mount;
> + xfs_filblks_t atomic_write_unit_min,
> + atomic_write_unit_max,
> + align;
> +
> + atomic_write_unit_min = XFS_B_TO_FSB(mp,
> + queue_atomic_write_unit_min_bytes(bdev->bd_queue));
> + atomic_write_unit_max = XFS_B_TO_FSB(mp,
> + queue_atomic_write_unit_max_bytes(bdev->bd_queue));
These should be set in the buftarg at mount time, like we do with
sector size masks. Then we don't need to convert them to fsbs on
every single lookup.
> + /* for RT, unset extsize gives hint of 1 */
> + /* for !RT, unset extsize gives hint of 0 */
> + if (extsz_hint && (XFS_IS_REALTIME_INODE(ip) ||
> + (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)))
Logic is non-obvious. The compound is (rt || force), not
(extsz && rt), so it took me a while to actually realise I read this
incorrectly.
if (extsz_hint &&
(XFS_IS_REALTIME_INODE(ip) ||
(ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN))) {
> + align = extsz_hint;
> + else
> + align = 1;
And now the logic looks wrong to me. We don't want to use extsz hint
for RT inodes if force align is not set, this will always use it
regardless of the fact it has nothing to do with force alignment.
Indeed, if XFS_DIFLAG2_FORCEALIGN is not set, then shouldn't this
always return min/max = 0 because atomic alignments are not in us on
this inode?
i.e. the first thing this code should do is:
*unit_min_fsb = 0;
*unit_max_fsb = 0;
if (!(ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN))
return;
Then we can check device support:
if (!buftarg->bt_atomic_write_max)
return;
Then we can check for extent size hints. If that's not set:
align = xfs_get_extsz_hint(ip);
if (align <= 1) {
unit_min_fsb = 1;
unit_max_fsb = 1;
return;
}
And finally, if there is an extent size hint, we can return that.
> + if (atomic_write_unit_max == 0) {
> + *unit_min_fsb = 0;
> + *unit_max_fsb = 0;
> + } else if (atomic_write_unit_min == 0) {
> + *unit_min_fsb = 1;
> + *unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
> + align);
Why is it valid for a device to have a zero minimum size? If it can
set a maximum, it should -always- set a minimum size as logical
sector size is a valid lower bound, yes?
> + } else {
> + *unit_min_fsb = min_t(xfs_filblks_t, atomic_write_unit_min,
> + align);
> + *unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
> + align);
> + }
Nothing here guarantees the power-of-2 sizes that the RWF_ATOMIC
user interface requires....
It also doesn't check that the extent size hint is aligned with
atomic write units.
It also doesn't check either against stripe unit alignment....
> +}
> +
> STATIC int
> xfs_vn_getattr(
> struct mnt_idmap *idmap,
> @@ -614,6 +654,17 @@ xfs_vn_getattr(
> stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
> stat->dio_offset_align = bdev_logical_block_size(bdev);
> }
> + if (request_mask & STATX_WRITE_ATOMIC) {
> + xfs_filblks_t unit_min_fsb, unit_max_fsb;
> +
> + xfs_ip_atomic_write_attr(ip, &unit_min_fsb,
> + &unit_max_fsb);
> + stat->atomic_write_unit_min = XFS_FSB_TO_B(mp, unit_min_fsb);
> + stat->atomic_write_unit_max = XFS_FSB_TO_B(mp, unit_max_fsb);
That's just nasty. We pull byte units from the bdev, convert them to
fsb to round them, then convert them back to byte counts. We should
be doing all the work in one set of units....
> + stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
> + stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
> + stat->result_mask |= STATX_WRITE_ATOMIC;
If the min/max are zero, then atomic writes are not supported on
this inode, right? Why would we set any of the attributes or result
mask to say it is supported on this file?
-Dave.
--
Dave Chinner
[email protected]
On 03/10/2023 03:57, Darrick J. Wong wrote:
>>>>> +#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File
>>>>> supports atomic write operations */
>>>> How would this differ from stx_atomic_write_unit_min != 0?
>> Yeah, I suppose that we can just not set this for the case of
>> stx_atomic_write_unit_min == 0.
> Please use the STATX_ATTR_WRITE_ATOMIC flag to indicate that the
> filesystem, file and underlying device support atomic writes when
> the values are non-zero. The whole point of the attribute mask is
> that the caller can check the mask for supported functionality
> without having to read every field in the statx structure to
> determine if the functionality it wants is present.
Sure, but again that would be just checking atomic_write_unit_min_bytes
or another atomic write block setting as that is the only way to tell
from the block layer (if atomic writes are supported), so it will be
something like:
if (request_mask & STATX_WRITE_ATOMIC &&
queue_atomic_write_unit_min_bytes(bdev->bd_queue)) {
stat->atomic_write_unit_min =
queue_atomic_write_unit_min_bytes(bdev->bd_queue);
stat->atomic_write_unit_max =
queue_atomic_write_unit_max_bytes(bdev->bd_queue);
stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
stat->result_mask |= STATX_WRITE_ATOMIC;
}
Thanks,
John
On 02/10/2023 20:12, Bart Van Assche wrote:
>> > + if (!is_power_of_2(iov_iter_count(iter)))
>> > + return false;
>>
>> This rule comes from FS block alignment and NVMe atomic boundary.
>>
>> FSes (XFS) have discontiguous extents. We need to ensure that an
>> atomic write does not cross discontiguous extents. To do this we
>> ensure extent length and alignment and limit
>> atomic_write_unit_max_bytes to that.
>>
>> For NVMe, an atomic write boundary is a boundary in LBA space which an
>> atomic write should not cross. We limit atomic_write_unit_max_bytes
>> such that it is evenly divisible into this atomic write boundary.
>>
>> To ensure that the write does not cross these alignment boundaries we
>> say that it must be naturally aligned and a power-of-2 in length.
>>
>> We may be able to relax this rule but I am not sure it buys us
>> anything - typically we want to be writing a 64KB block aligned to
>> 64KB, for example.
>
> It seems to me that the requirement is_power_of_2(iov_iter_count(iter))
> is necessary for some filesystems but not for all filesystems.
> Restrictions that are specific to a single filesystem (XFS) should not
> occur in code that is intended to be used by all filesystems
> (blkdev_atomic_write_valid()).
I don't think that is_power_of_2(write length) is specific to XFS. It is
just a simple mathematical method to ensure we obey length and alignment
requirement always.
Furthermore, if ext4 wants to support atomic writes, for example, then
it will probably base that on bigalloc. And bigalloc is power-of-2 based.
As for the rules, current proposal is:
- atomic_write_unit_min and atomic_write_unit_max are power-of-2
- write needs to be at a naturally aligned file offset
- write length needs to be a power-of-2 between atomic_write_unit_min
and atomic_write_unit_max, inclusive
Those could be relaxed to:
- atomic_write_unit_min and atomic_write_unit_max are power-of-2
- write length needs to be a multiple of atomic_write_unit_min and a max
of atomic_write_unit_max
- write needs to be at an offset aligned to atomic_write_unit_min
- write cannot cross atomic_write_unit_max boundary within the file
Are the relaxed rules better? I don't think so, and I don't like "write
cannot cross atomic_write_unit_max boundary" in terms of wording.
Thanks,
John
On 03/10/2023 04:32, Dave Chinner wrote:
> On Fri, Sep 29, 2023 at 10:27:20AM +0000, John Garry wrote:
>> Support providing info on atomic write unit min and max for an inode.
>>
>> For simplicity, currently we limit the min at the FS block size, but a
>> lower limit could be supported in future.
>>
>> The atomic write unit min and max is limited by the guaranteed extent
>> alignment for the inode.
>>
>> Signed-off-by: John Garry <[email protected]>
>> ---
>> fs/xfs/xfs_iops.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++
>> fs/xfs/xfs_iops.h | 4 ++++
>> 2 files changed, 55 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
>> index 1c1e6171209d..5bff80748223 100644
>> --- a/fs/xfs/xfs_iops.c
>> +++ b/fs/xfs/xfs_iops.c
>> @@ -546,6 +546,46 @@ xfs_stat_blksize(
>> return PAGE_SIZE;
>> }
>>
>> +void xfs_ip_atomic_write_attr(struct xfs_inode *ip,
>> + xfs_filblks_t *unit_min_fsb,
>> + xfs_filblks_t *unit_max_fsb)
>
> Formatting.
Change args to 1x tab indent, right?
>
> Also, we don't use variable name shorthand for function names -
> xfs_get_atomic_write_hint(ip) to match xfs_get_extsz_hint(ip)
> would be appropriate, right?
Changing the name format would be ok. However we are not returning a
hint, but rather the inode atomic write unit min and max values in FS
blocks. Anyway, I'll look to rework the name.
>
>
>
>> +{
>> + xfs_extlen_t extsz_hint = xfs_get_extsz_hint(ip);
>> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>> + struct block_device *bdev = target->bt_bdev;
>> + struct xfs_mount *mp = ip->i_mount;
>> + xfs_filblks_t atomic_write_unit_min,
>> + atomic_write_unit_max,
>> + align;
>> +
>> + atomic_write_unit_min = XFS_B_TO_FSB(mp,
>> + queue_atomic_write_unit_min_bytes(bdev->bd_queue));
>> + atomic_write_unit_max = XFS_B_TO_FSB(mp,
>> + queue_atomic_write_unit_max_bytes(bdev->bd_queue));
>
> These should be set in the buftarg at mount time, like we do with
> sector size masks. Then we don't need to convert them to fsbs on
> every single lookup.
ok, fine. However I do still have a doubt on whether these values should
be changeable - please see (small) comment about
atomic_write_max_sectors in patch 7/21
>
>> + /* for RT, unset extsize gives hint of 1 */
>> + /* for !RT, unset extsize gives hint of 0 */
>> + if (extsz_hint && (XFS_IS_REALTIME_INODE(ip) ||
>> + (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)))
>
> Logic is non-obvious. The compound is (rt || force), not
> (extsz && rt), so it took me a while to actually realise I read this
> incorrectly.
>
> if (extsz_hint &&
> (XFS_IS_REALTIME_INODE(ip) ||
> (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN))) {
>
>> + align = extsz_hint;
>> + else
>> + align = 1;
>
> And now the logic looks wrong to me. We don't want to use extsz hint
> for RT inodes if force align is not set, this will always use it
> regardless of the fact it has nothing to do with force alignment.
extsz_hint comes from xfs_get_extsz_hint(), which gives us the SB
extsize for the RT inode and this alignment is guaranteed, no?
>
> Indeed, if XFS_DIFLAG2_FORCEALIGN is not set, then shouldn't this
> always return min/max = 0 because atomic alignments are not in us on
> this inode?
As above, for RT I thought that extsize alignment was guaranteed and we
don't need to bother with XFS_DIFLAG2_FORCEALIGN there.
>
> i.e. the first thing this code should do is:
>
> *unit_min_fsb = 0;
> *unit_max_fsb = 0;
> if (!(ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN))
> return;
>
> Then we can check device support:
>
> if (!buftarg->bt_atomic_write_max)
> return;
>
> Then we can check for extent size hints. If that's not set:
>
> align = xfs_get_extsz_hint(ip);
> if (align <= 1) {
> unit_min_fsb = 1;
> unit_max_fsb = 1;
> return;
> }
>
> And finally, if there is an extent size hint, we can return that.
>
>> + if (atomic_write_unit_max == 0) {
>> + *unit_min_fsb = 0;
>> + *unit_max_fsb = 0;
>> + } else if (atomic_write_unit_min == 0) {
>> + *unit_min_fsb = 1;
>> + *unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
>> + align);
>
> Why is it valid for a device to have a zero minimum size?
It's not valid. Local variables atomic_write_unit_max and
atomic_write_unit_min unit here is FS blocks - maybe I should change names.
The idea is that for simplicity we won't support atomic writes for XFS
of size less than 1x FS block initially. So if the bdev has - for
example - queue_atomic_write_unit_min_bytes() == 2K and
queue_atomic_write_unit_max_bytes() == 64K, then (ignoring alignment) we
say that unit_min_fsb = 1 and unit_max_fsb = 16 (for 4K FS blocks).
> If it can
> set a maximum, it should -always- set a minimum size as logical
> sector size is a valid lower bound, yes?
>
>> + } else {
>> + *unit_min_fsb = min_t(xfs_filblks_t, atomic_write_unit_min,
>> + align);
>> + *unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
>> + align);
>> + }
>
> Nothing here guarantees the power-of-2 sizes that the RWF_ATOMIC
> user interface requires....
atomic_write_unit_min and atomic_write_unit_max will be powers-of-2 (or 0).
But, you are right, we don't check align is a power-of-2 - that can be
added.
>
> It also doesn't check that the extent size hint is aligned with
> atomic write units.
If we add a check for align being a power-of-2 and atomic_write_unit_min
and atomic_write_unit_max are already powers-of-2, then this can be
relied on, right?
>
> It also doesn't check either against stripe unit alignment....
As mentioned in earlier response, this could be enforced.
>
>> +}
>> +
>> STATIC int
>> xfs_vn_getattr(
>> struct mnt_idmap *idmap,
>> @@ -614,6 +654,17 @@ xfs_vn_getattr(
>> stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
>> stat->dio_offset_align = bdev_logical_block_size(bdev);
>> }
>> + if (request_mask & STATX_WRITE_ATOMIC) {
>> + xfs_filblks_t unit_min_fsb, unit_max_fsb;
>> +
>> + xfs_ip_atomic_write_attr(ip, &unit_min_fsb,
>> + &unit_max_fsb);
>> + stat->atomic_write_unit_min = XFS_FSB_TO_B(mp, unit_min_fsb);
>> + stat->atomic_write_unit_max = XFS_FSB_TO_B(mp, unit_max_fsb);
>
> That's just nasty. We pull byte units from the bdev, convert them to
> fsb to round them, then convert them back to byte counts. We should
> be doing all the work in one set of units....
ok, agreed. bytes is probably best.
>
>> + stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
>> + stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
>> + stat->result_mask |= STATX_WRITE_ATOMIC;
>
> If the min/max are zero, then atomic writes are not supported on
> this inode, right? Why would we set any of the attributes or result
> mask to say it is supported on this file?
ok, we won't set STATX_ATTR_WRITE_ATOMIC for min/max are zero
Thanks,
John
On Tue, Oct 03, 2023 at 08:23:26AM +0100, John Garry wrote:
> On 03/10/2023 03:57, Darrick J. Wong wrote:
> > > > > > +#define STATX_ATTR_WRITE_ATOMIC??????? 0x00400000 /* File
> > > > > > supports atomic write operations */
> > > > > How would this differ from stx_atomic_write_unit_min != 0?
> > > Yeah, I suppose that we can just not set this for the case of
> > > stx_atomic_write_unit_min == 0.
> > Please use the STATX_ATTR_WRITE_ATOMIC flag to indicate that the
> > filesystem, file and underlying device support atomic writes when
> > the values are non-zero. The whole point of the attribute mask is
> > that the caller can check the mask for supported functionality
> > without having to read every field in the statx structure to
> > determine if the functionality it wants is present.
>
> Sure, but again that would be just checking atomic_write_unit_min_bytes or
> another atomic write block setting as that is the only way to tell from the
> block layer (if atomic writes are supported), so it will be something like:
>
> if (request_mask & STATX_WRITE_ATOMIC &&
> queue_atomic_write_unit_min_bytes(bdev->bd_queue)) {
> stat->atomic_write_unit_min =
> queue_atomic_write_unit_min_bytes(bdev->bd_queue);
> stat->atomic_write_unit_max =
> queue_atomic_write_unit_max_bytes(bdev->bd_queue);
> stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
> stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
> stat->result_mask |= STATX_WRITE_ATOMIC;
The result_mask (which becomes the statx stx_mask) needs to have
STATX_WRITE_ATOMIC set any time a filesystem responds to
STATX_WRITE_ATOMIC being set in the request_mask, even if the response
is "not supported".
The attributes_mask also needs to have STATX_ATTR_WRITE_ATOMIC set if
the filesystem+file can support the flag, even if it's not currently set
for that file. This should get turned into a generic vfs helper for the
next fs that wants to support atomic write units:
static void generic_fill_statx_atomic_writes(struct kstat *stat,
struct block_device *bdev)
{
u64 min_bytes;
/* Confirm that the fs driver knows about this statx request */
stat->result_mask |= STATX_WRITE_ATOMIC;
/* Confirm that the file attribute is known to the fs. */
stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
/* Fill out the rest of the atomic write fields if supported */
min_bytes = queue_atomic_write_unit_min_bytes(bdev->bd_queue);
if (min_bytes == 0)
return;
stat->atomic_write_unit_min = min_bytes;
stat->atomic_write_unit_max =
queue_atomic_write_unit_max_bytes(bdev->bd_queue);
/* Atomic writes actually supported on this file. */
stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
}
and then:
if (request_mask & STATX_WRITE_ATOMIC)
generic_fill_statx_atomic_writes(stat, bdev);
> }
>
> Thanks,
> John
On Tue, Oct 03, 2023 at 11:56:52AM +0100, John Garry wrote:
> On 03/10/2023 04:32, Dave Chinner wrote:
> > On Fri, Sep 29, 2023 at 10:27:20AM +0000, John Garry wrote:
> > > Support providing info on atomic write unit min and max for an inode.
> > >
> > > For simplicity, currently we limit the min at the FS block size, but a
> > > lower limit could be supported in future.
> > >
> > > The atomic write unit min and max is limited by the guaranteed extent
> > > alignment for the inode.
> > >
> > > Signed-off-by: John Garry <[email protected]>
> > > ---
> > > fs/xfs/xfs_iops.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++
> > > fs/xfs/xfs_iops.h | 4 ++++
> > > 2 files changed, 55 insertions(+)
> > >
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index 1c1e6171209d..5bff80748223 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -546,6 +546,46 @@ xfs_stat_blksize(
> > > return PAGE_SIZE;
> > > }
> > > +void xfs_ip_atomic_write_attr(struct xfs_inode *ip,
> > > + xfs_filblks_t *unit_min_fsb,
> > > + xfs_filblks_t *unit_max_fsb)
> >
> > Formatting.
>
> Change args to 1x tab indent, right?
>
> >
> > Also, we don't use variable name shorthand for function names -
> > xfs_get_atomic_write_hint(ip) to match xfs_get_extsz_hint(ip)
> > would be appropriate, right?
>
> Changing the name format would be ok. However we are not returning a hint,
> but rather the inode atomic write unit min and max values in FS blocks.
> Anyway, I'll look to rework the name.
>
> >
> >
> >
> > > +{
> > > + xfs_extlen_t extsz_hint = xfs_get_extsz_hint(ip);
> > > + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> > > + struct block_device *bdev = target->bt_bdev;
> > > + struct xfs_mount *mp = ip->i_mount;
> > > + xfs_filblks_t atomic_write_unit_min,
> > > + atomic_write_unit_max,
> > > + align;
> > > +
> > > + atomic_write_unit_min = XFS_B_TO_FSB(mp,
> > > + queue_atomic_write_unit_min_bytes(bdev->bd_queue));
> > > + atomic_write_unit_max = XFS_B_TO_FSB(mp,
> > > + queue_atomic_write_unit_max_bytes(bdev->bd_queue));
> >
> > These should be set in the buftarg at mount time, like we do with
> > sector size masks. Then we don't need to convert them to fsbs on
> > every single lookup.
>
> ok, fine. However I do still have a doubt on whether these values should be
> changeable - please see (small) comment about atomic_write_max_sectors in
> patch 7/21
No, this /does/ have to be looked up every time, because the geometry of
the device can change underneath the fs without us knowing about it. If
someone snapshots an LV with different (or no) atomic write abilities
then we'll be doing the wrong checks.
And yes, it's true that this is a benign problem because we don't lock
anything in the bdev here and the block device driver will eventually
have to catch that anyway.
> >
> > > + /* for RT, unset extsize gives hint of 1 */
> > > + /* for !RT, unset extsize gives hint of 0 */
> > > + if (extsz_hint && (XFS_IS_REALTIME_INODE(ip) ||
> > > + (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)))
> >
> > Logic is non-obvious. The compound is (rt || force), not
> > (extsz && rt), so it took me a while to actually realise I read this
> > incorrectly.
> >
> > if (extsz_hint &&
> > (XFS_IS_REALTIME_INODE(ip) ||
> > (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN))) {
> >
> > > + align = extsz_hint;
> > > + else
> > > + align = 1;
> >
> > And now the logic looks wrong to me. We don't want to use extsz hint
> > for RT inodes if force align is not set, this will always use it
> > regardless of the fact it has nothing to do with force alignment.
>
> extsz_hint comes from xfs_get_extsz_hint(), which gives us the SB extsize
> for the RT inode and this alignment is guaranteed, no?
One can also set an extent size hint on realtime files that is a
multiple of the realtime extent size. IOWs, I can decide that space on
the rt volume should be given out in 32k chunks, and then later decide
that a specific rt file should actually try for 64k chunks.
> >
> > Indeed, if XFS_DIFLAG2_FORCEALIGN is not set, then shouldn't this
> > always return min/max = 0 because atomic alignments are not in us on
> > this inode?
>
> As above, for RT I thought that extsize alignment was guaranteed and we
> don't need to bother with XFS_DIFLAG2_FORCEALIGN there.
>
> >
> > i.e. the first thing this code should do is:
> >
> > *unit_min_fsb = 0;
> > *unit_max_fsb = 0;
> > if (!(ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN))
> > return;
> >
> > Then we can check device support:
> >
> > if (!buftarg->bt_atomic_write_max)
> > return;
> >
> > Then we can check for extent size hints. If that's not set:
> >
> > align = xfs_get_extsz_hint(ip);
> > if (align <= 1) {
> > unit_min_fsb = 1;
> > unit_max_fsb = 1;
> > return;
> > }
> >
> > And finally, if there is an extent size hint, we can return that.
> >
> > > + if (atomic_write_unit_max == 0) {
> > > + *unit_min_fsb = 0;
> > > + *unit_max_fsb = 0;
> > > + } else if (atomic_write_unit_min == 0) {
> > > + *unit_min_fsb = 1;
> > > + *unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
> > > + align);
> >
> > Why is it valid for a device to have a zero minimum size?
>
> It's not valid. Local variables atomic_write_unit_max and
> atomic_write_unit_min unit here is FS blocks - maybe I should change names.
Yes, please, the variable names throughout are long enough to make for
ugly code.
/* "awu" = atomic write unit */
xfs_filblks_t awu_min_fsb, align;
u64 awu_min_bytes;
awu_min_bytes = queue_atomic_write_unit_min_bytes(bdev->bd_queue);
if (!awu_min_bytes) {
/* Not supported at all. */
*unit_min_fsb = 0;
return;
}
awu_min_fsb = XFS_B_TO_FSBT(mp, awu_min_bytes);
if (awu_min_fsb < 1) {
/* Don't allow smaller than fsb atomic writes */
*unit_min_fsb = 1;
return;
}
*unit_min_fsb = min(awu_min_fsb, align);
--D
> The idea is that for simplicity we won't support atomic writes for XFS of
> size less than 1x FS block initially. So if the bdev has - for example -
> queue_atomic_write_unit_min_bytes() == 2K and
> queue_atomic_write_unit_max_bytes() == 64K, then (ignoring alignment) we say
> that unit_min_fsb = 1 and unit_max_fsb = 16 (for 4K FS blocks).
>
> > If it can
> > set a maximum, it should -always- set a minimum size as logical
> > sector size is a valid lower bound, yes?
> >
> > > + } else {
> > > + *unit_min_fsb = min_t(xfs_filblks_t, atomic_write_unit_min,
> > > + align);
> > > + *unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
> > > + align);
> > > + }
> >
> > Nothing here guarantees the power-of-2 sizes that the RWF_ATOMIC
> > user interface requires....
>
> atomic_write_unit_min and atomic_write_unit_max will be powers-of-2 (or 0).
>
> But, you are right, we don't check align is a power-of-2 - that can be
> added.
>
> >
> > It also doesn't check that the extent size hint is aligned with
> > atomic write units.
>
> If we add a check for align being a power-of-2 and atomic_write_unit_min and
> atomic_write_unit_max are already powers-of-2, then this can be relied on,
> right?
>
> >
> > It also doesn't check either against stripe unit alignment....
>
> As mentioned in earlier response, this could be enforced.
>
> >
> > > +}
> > > +
> > > STATIC int
> > > xfs_vn_getattr(
> > > struct mnt_idmap *idmap,
> > > @@ -614,6 +654,17 @@ xfs_vn_getattr(
> > > stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
> > > stat->dio_offset_align = bdev_logical_block_size(bdev);
> > > }
> > > + if (request_mask & STATX_WRITE_ATOMIC) {
> > > + xfs_filblks_t unit_min_fsb, unit_max_fsb;
> > > +
> > > + xfs_ip_atomic_write_attr(ip, &unit_min_fsb,
> > > + &unit_max_fsb);
> > > + stat->atomic_write_unit_min = XFS_FSB_TO_B(mp, unit_min_fsb);
> > > + stat->atomic_write_unit_max = XFS_FSB_TO_B(mp, unit_max_fsb);
> >
> > That's just nasty. We pull byte units from the bdev, convert them to
> > fsb to round them, then convert them back to byte counts. We should
> > be doing all the work in one set of units....
>
> ok, agreed. bytes is probably best.
>
> >
> > > + stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
> > > + stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
> > > + stat->result_mask |= STATX_WRITE_ATOMIC;
> >
> > If the min/max are zero, then atomic writes are not supported on
> > this inode, right? Why would we set any of the attributes or result
> > mask to say it is supported on this file?
>
> ok, we won't set STATX_ATTR_WRITE_ATOMIC for min/max are zero
>
> Thanks,
> John
On 10/3/23 01:37, John Garry wrote:
> I don't think that is_power_of_2(write length) is specific to XFS.
I think this is specific to XFS. Can you show me the F2FS code that
restricts the length of an atomic write to a power of two? I haven't
found it. The only power-of-two check that I found in F2FS is the
following (maybe I overlooked something):
$ git grep -nH is_power fs/f2fs
fs/f2fs/super.c:3914: if (!is_power_of_2(zone_sectors)) {
Thanks,
Bart.
On 10/2/23 17:48, Martin K. Petersen wrote:
>
> Bart,
>
>> Are there any SCSI devices that we care about that report an ATOMIC
>> TRANSFER LENGTH GRANULARITY that is larger than a single logical
>> block?
>
> Yes.
>
> Note that code path used inside a storage device to guarantee atomicity
> of an entire I/O may be substantially different from the code path which
> only offers an incremental guarantee at a single logical or physical
> block level (to the extent that those guarantees are offered at all but
> that's a different kettle of fish).
>
>> I'm wondering whether we really have to support such devices.
>
> Yes.
Hi Martin,
I'm still wondering whether we really should support storage devices
that report an ATOMIC TRANSFER LENGTH GRANULARITY that is larger than
the logical block size. Is my understanding correct that the NVMe
specification makes it mandatory to support single logical block atomic
writes since the smallest value that can be reported as the AWUN
parameter is one logical block because this parameter is a 0's based
value? Is my understanding correct that SCSI devices that report an
ATOMIC TRANSFER LENGTH GRANULARITY that is larger than the logical block
size are not able to support the NVMe protocol?
From the NVMe specification section about the identify controller
response: "Atomic Write Unit Normal (AWUN): This field indicates the
size of the write operation guaranteed to be written atomically to the
NVM across all namespaces with any supported namespace format during
normal operation. This field is specified in logical blocks and is a 0’s
based value."
Thanks,
Bart.
Bart,
> I'm still wondering whether we really should support storage devices
> that report an ATOMIC TRANSFER LENGTH GRANULARITY that is larger than
> the logical block size.
We should. The common case is that the device reports an ATOMIC TRANSFER
LENGTH GRANULARITY matching the reported physical block size. I.e. a
logical block size of 512 bytes and a physical block size of 4KB. In
that scenario a write of a single logical block would require
read-modify-write of a physical block.
> Is my understanding correct that the NVMe specification makes it
> mandatory to support single logical block atomic writes since the
> smallest value that can be reported as the AWUN parameter is one
> logical block because this parameter is a 0's based value? Is my
> understanding correct that SCSI devices that report an ATOMIC TRANSFER
> LENGTH GRANULARITY that is larger than the logical block size are not
> able to support the NVMe protocol?
That's correct. There are obviously things you can express in SCSI that
you can't in NVMe. And the other way around. Our intent is to support
both protocols.
--
Martin K. Petersen Oracle Linux Engineering
On 03/10/2023 17:45, Bart Van Assche wrote:
> On 10/3/23 01:37, John Garry wrote:
>> I don't think that is_power_of_2(write length) is specific to XFS.
>
> I think this is specific to XFS. Can you show me the F2FS code that
> restricts the length of an atomic write to a power of two? I haven't
> found it. The only power-of-two check that I found in F2FS is the
> following (maybe I overlooked something):
>
> $ git grep -nH is_power fs/f2fs
> fs/f2fs/super.c:3914: if (!is_power_of_2(zone_sectors)) {
Any usecases which we know of requires a power-of-2 block size.
Do you know of a requirement for other sizes? Or are you concerned that
it is unnecessarily restrictive?
We have to deal with HW features like atomic write boundary and FS
restrictions like extent and stripe alignment transparent, which are
almost always powers-of-2, so naturally we would want to work with
powers-of-2 for atomic write sizes.
The power-of-2 stuff could be dropped if that is what people want.
However we still want to provide a set of rules to the user to make
those HW and FS features mentioned transparent to the user.
Thanks,
John
On 02/10/2023 23:50, Nathan Chancellor wrote:
>>>> ld.lld: error: undefined symbol: __moddi3
>> >>> referenced by blk-merge.c
>> >>> block/blk-merge.o:(ll_back_merge_fn) in archive vmlinux.a
>> >>> referenced by blk-merge.c
>> >>> block/blk-merge.o:(ll_back_merge_fn) in archive vmlinux.a
>> >>> referenced by blk-merge.c
>> >>> block/blk-merge.o:(bio_attempt_front_merge) in archive vmlinux.a
>> >>> referenced 3 more times
> This does not appear to be clang specific, I can reproduce it with GCC
> 12.3.0 and the same configuration target.
Yeah, I just need to stop using the modulo operator for 64b values,
which I had already been advised to :|
Thanks,
John
On 03/10/2023 16:46, Darrick J. Wong wrote:
>> stat->result_mask |= STATX_WRITE_ATOMIC;
> The result_mask (which becomes the statx stx_mask) needs to have
> STATX_WRITE_ATOMIC set any time a filesystem responds to
> STATX_WRITE_ATOMIC being set in the request_mask, even if the response
> is "not supported".
>
> The attributes_mask also needs to have STATX_ATTR_WRITE_ATOMIC set if
> the filesystem+file can support the flag, even if it's not currently set
> for that file. This should get turned into a generic vfs helper for the
> next fs that wants to support atomic write units:
>
> static void generic_fill_statx_atomic_writes(struct kstat *stat,
> struct block_device *bdev)
> {
> u64 min_bytes;
>
> /* Confirm that the fs driver knows about this statx request */
> stat->result_mask |= STATX_WRITE_ATOMIC;
>
> /* Confirm that the file attribute is known to the fs. */
> stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
>
> /* Fill out the rest of the atomic write fields if supported */
> min_bytes = queue_atomic_write_unit_min_bytes(bdev->bd_queue);
> if (min_bytes == 0)
> return;
>
> stat->atomic_write_unit_min = min_bytes;
> stat->atomic_write_unit_max =
> queue_atomic_write_unit_max_bytes(bdev->bd_queue);
>
> /* Atomic writes actually supported on this file. */
> stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
> }
>
> and then:
>
> if (request_mask & STATX_WRITE_ATOMIC)
> generic_fill_statx_atomic_writes(stat, bdev);
>
>
That looks sensible, but, if used by an FS, we would still need a method
to include extra FS restrictions, like extent alignment as in 15/21.
Thanks,
John
On 10/3/23 19:53, Martin K. Petersen wrote:
>
> Bart,
>
>> I'm still wondering whether we really should support storage
>> devices that report an ATOMIC TRANSFER LENGTH GRANULARITY that is
>> larger than the logical block size.
>
> We should. The common case is that the device reports an ATOMIC
> TRANSFER LENGTH GRANULARITY matching the reported physical block
> size. I.e. a logical block size of 512 bytes and a physical block
> size of 4KB. In that scenario a write of a single logical block would
> require read-modify-write of a physical block.
Block devices must serialize read-modify-write operations internally
that happen when there are multiple logical blocks per physical block.
Otherwise it is not guaranteed that a READ command returns the most
recently written data to the same LBA. I think we can ignore concurrent
and overlapping writes in this discussion since these can be considered
as bugs in host software.
In other words, also for the above example it is guaranteed that writes
of a single logical block (512 bytes) are atomic, no matter what value
is reported as the ATOMIC TRANSFER LENGTH GRANULARITY.
>> Is my understanding correct that the NVMe specification makes it
>> mandatory to support single logical block atomic writes since the
>> smallest value that can be reported as the AWUN parameter is one
>> logical block because this parameter is a 0's based value? Is my
>> understanding correct that SCSI devices that report an ATOMIC
>> TRANSFER LENGTH GRANULARITY that is larger than the logical block
>> size are not able to support the NVMe protocol?
>
> That's correct. There are obviously things you can express in SCSI
> that you can't in NVMe. And the other way around. Our intent is to
> support both protocols.
How about aligning the features of the two protocols as much as
possible? My understanding is that all long-term T10 contributors are
all in favor of this.
Thanks,
Bart.
On 10/4/23 02:14, John Garry wrote:
> On 03/10/2023 17:45, Bart Van Assche wrote:
>> On 10/3/23 01:37, John Garry wrote:
>>> I don't think that is_power_of_2(write length) is specific to XFS.
>>
>> I think this is specific to XFS. Can you show me the F2FS code that
>> restricts the length of an atomic write to a power of two? I haven't
>> found it. The only power-of-two check that I found in F2FS is the
>> following (maybe I overlooked something):
>>
>> $ git grep -nH is_power fs/f2fs
>> fs/f2fs/super.c:3914: if (!is_power_of_2(zone_sectors)) {
>
> Any usecases which we know of requires a power-of-2 block size.
>
> Do you know of a requirement for other sizes? Or are you concerned that
> it is unnecessarily restrictive?
>
> We have to deal with HW features like atomic write boundary and FS
> restrictions like extent and stripe alignment transparent, which are
> almost always powers-of-2, so naturally we would want to work with
> powers-of-2 for atomic write sizes.
>
> The power-of-2 stuff could be dropped if that is what people want.
> However we still want to provide a set of rules to the user to make
> those HW and FS features mentioned transparent to the user.
Hi John,
My concern is that the power-of-2 requirements are only needed for
traditional filesystems and not for log-structured filesystems (BTRFS,
F2FS, BCACHEFS).
What I'd like to see is that each filesystem declares its atomic write
requirements (in struct address_space_operations?) and that
blkdev_atomic_write_valid() checks the filesystem-specific atomic write
requirements.
Thanks,
Bart.
Hi Bart!
> In other words, also for the above example it is guaranteed that
> writes of a single logical block (512 bytes) are atomic, no matter
> what value is reported as the ATOMIC TRANSFER LENGTH GRANULARITY.
There is no formal guarantee that a disk drive sector read-modify-write
operation results in a readable sector after a power failure. We have
definitely seen blocks being mangled in the field.
Wrt. supporting SCSI atomic block operations, the device rejects the
WRITE ATOMIC(16) command if you attempt to use a transfer length smaller
than the reported granularity. If we want to support WRITE ATOMIC(16) we
have to abide by the values reported by the device. It is not optional.
Besides, the whole point of this patch set is to increase the
"observable atomic block size" beyond the physical block size to
facilitate applications that prefer to use blocks in the 8-64KB range.
IOW, using the logical block size is not particularly interesting. The
objective is to prevent tearing of much larger blocks.
> How about aligning the features of the two protocols as much as
> possible? My understanding is that all long-term T10 contributors are
> all in favor of this.
That is exactly what this patch set does. Out of the 5-6 different
"atomic" modes of operation permitted by SCSI and NVMe, our exposed
semantics are carefully chosen to permit all compliant devices to be
used. Based on only two reported queue limits (FWIW, we started with way
more than that. I believe that complexity was part of the first RFC we
posted). Whereas this series hides most of the complexity in the various
unfortunate protocol quirks behind a simple interface: Your tear-proof
writes can't be smaller than X bytes and larger than Y bytes and they
must be naturally aligned. This simplified things substantially from an
application perspective.
--
Martin K. Petersen Oracle Linux Engineering
On Wed, Oct 04, 2023 at 10:34:13AM -0700, Bart Van Assche wrote:
> On 10/4/23 02:14, John Garry wrote:
> > On 03/10/2023 17:45, Bart Van Assche wrote:
> > > On 10/3/23 01:37, John Garry wrote:
> > > > I don't think that is_power_of_2(write length) is specific to XFS.
> > >
> > > I think this is specific to XFS. Can you show me the F2FS code that
> > > restricts the length of an atomic write to a power of two? I haven't
> > > found it. The only power-of-two check that I found in F2FS is the
> > > following (maybe I overlooked something):
> > >
> > > $ git grep -nH is_power fs/f2fs
> > > fs/f2fs/super.c:3914:??? if (!is_power_of_2(zone_sectors)) {
> >
> > Any usecases which we know of requires a power-of-2 block size.
> >
> > Do you know of a requirement for other sizes? Or are you concerned that
> > it is unnecessarily restrictive?
> >
> > We have to deal with HW features like atomic write boundary and FS
> > restrictions like extent and stripe alignment transparent, which are
> > almost always powers-of-2, so naturally we would want to work with
> > powers-of-2 for atomic write sizes.
> >
> > The power-of-2 stuff could be dropped if that is what people want.
> > However we still want to provide a set of rules to the user to make
> > those HW and FS features mentioned transparent to the user.
>
> Hi John,
>
> My concern is that the power-of-2 requirements are only needed for
> traditional filesystems and not for log-structured filesystems (BTRFS,
> F2FS, BCACHEFS).
Filesystems that support copy-on-write data (needed for arbitrary
filesystem block aligned RWF_ATOMIC support) are not necessarily log
structured. For example: XFS.
All three of the filesystems you list above still use power-of-2
block sizes for most of their metadata structures and for large data
extents. Hence once you go above a certain file size they are going
to be doing full power-of-2 block size aligned IO anyway. hence the
constraint of atomic writes needing to be power-of-2 block size
aligned to avoid RMW cycles doesn't really change for these
filesystems.
In which case, they can just set their minimum atomic IO size to be
the same as their block size (e.g. 4kB) and set the maximum to
something they can guarantee gets COW'd in a single atomic
transaction. What the hardware can do with REQ_ATOMIC IO is
completely irrelevant at this point....
> What I'd like to see is that each filesystem declares its atomic write
> requirements (in struct address_space_operations?) and that
> blkdev_atomic_write_valid() checks the filesystem-specific atomic write
> requirements.
That seems unworkable to me - IO constraints propagate from the
bottom up, not from the top down.
Consider multi-device filesystems (btrfs and XFS), where different
devices might have different atomic write parameters. Which
set of bdev parameters does the filesystem report to the querying
bdev? (And doesn't that question just sound completely wrong?)
It also doesn't work for filesystems that can configure extent
allocation alignment at an individual inode level (like XFS) - what
does the filesystem report to the device when it doesn't know what
alignment constraints individual on-disk inodes might be using?
That's why statx() vectors through filesystems to all them to set
their own parameters based on the inode statx() is being called on.
If the filesystem has a native RWF_ATOMIC implementation, it can put
it's own parameters in the statx min/max atomic write size fields.
If the fs doesn't have it's own native support, but can do physical
file offset/LBA alignment, then it publishes the block device atomic
support parameters or overrides them with it's internal allocation
alignment constraints. If the bdev doesn't support REQ_ATOMIC, the
filesystem says "atomic writes are not supported".
-Dave.
--
Dave Chinner
[email protected]
On 10/4/23 11:17, Martin K. Petersen wrote:
>
> Hi Bart!
>
>> In other words, also for the above example it is guaranteed that
>> writes of a single logical block (512 bytes) are atomic, no matter
>> what value is reported as the ATOMIC TRANSFER LENGTH GRANULARITY.
>
> There is no formal guarantee that a disk drive sector
> read-modify-write operation results in a readable sector after a
> power failure. We have definitely seen blocks being mangled in the
> field.
Aren't block devices expected to use a capacitor that provides enough
power to handle power failures cleanly?
How about blacklisting block devices that mangle blocks if a power
failure occurs? I think such block devices are not compatible with
journaling filesystems nor with log-structured filesystems.
Thanks,
Bart.
On Thu, Oct 05, 2023 at 10:10:45AM -0700, Bart Van Assche wrote:
> On 10/4/23 11:17, Martin K. Petersen wrote:
> >
> > Hi Bart!
> >
> > > In other words, also for the above example it is guaranteed that
> > > writes of a single logical block (512 bytes) are atomic, no matter
> > > what value is reported as the ATOMIC TRANSFER LENGTH GRANULARITY.
> >
> > There is no formal guarantee that a disk drive sector read-modify-write
> > operation results in a readable sector after a power failure. We have
> > definitely seen blocks being mangled in the field.
>
> Aren't block devices expected to use a capacitor that provides enough
> power to handle power failures cleanly?
Nope.
Any block device that says it operates in writeback cache mode (i.e.
almost every single consumer SATA and NVMe drive ever made) has a
volatile write back cache and so does not provide any power fail
data integrity guarantees. Simple to check, my less-than-1-yr-old
workstation tells me:
$ lspci |grep -i nvme
03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
$ cat /sys/block/nvme*n1/queue/write_cache
write back
write back
$
That they have volatile writeback caches....
> How about blacklisting block devices that mangle blocks if a power
> failure occurs? I think such block devices are not compatible with
> journaling filesystems nor with log-structured filesystems.
Statements like this from people working on storage hardware really
worry me. It demonstrates a lack of understanding of how filesystems
actually work, not to mention the fact that this architectural
problem (i.e. handling volatile device write caches correctly) was
solved in the Linux IO stack a couple of decades ago. This isn't
even 'state of the art' knowledge - this is foundational knowlege
that everyone working on storage should know.
The tl;dr summary is that filesystems will issue a cache flush
request (REQ_PREFLUSH) and/or write-through to stable storage
semantics (REQ_FUA) for any data, metadata or journal IO that has
data integrity and/or ordering requirements associated with it. The
block layer will then do the most optimal correct thing with that
request (e.g. ignore them for IO being directed at WC disabled
devices), but it guarantees the flush/fua semantics for those IOs
will be provided by all layers in the stack right down to the
persistent storage media itself. Hence all the filesystem has to do
is get it's IO and cache flush ordering correct, and everything
just works regardless of the underlying storage capabilities.
And, yes, any storage device with volatile caches that doesn't
implement cache flushes correctly is considered broken and will get
black listed....
-Dave.
--
Dave Chinner
[email protected]
On 10/5/23 15:36, Dave Chinner wrote:
> $ lspci |grep -i nvme
> 03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> 06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> $ cat /sys/block/nvme*n1/queue/write_cache
> write back
> write back
> $
>
> That they have volatile writeback caches....
It seems like what I wrote has been misunderstood completely. With
"handling a power failure cleanly" I meant that power cycling a block device
does not result in read errors nor in reading data that has never been written.
Although it is hard to find information about this topic, here is what I found
online:
* About certain SSDs with power loss protection:
https://us.transcend-info.com/embedded/technology/power-loss-protection-plp
* About another class of SSDs with power loss protection:
https://www.kingston.com/en/blog/servers-and-data-centers/ssd-power-loss-protection
* About yet another class of SSDs with power loss protection:
https://phisonblog.com/avoiding-ssd-data-loss-with-phisons-power-loss-protection-2/
So far I haven't found any information about hard disks and power failure
handling. What I found is that most current hard disks protect data with ECC.
The ECC mechanism should provide good protection against reading data that
has never been written. If a power failure occurs while a hard disk is writing
a physical block, can this result in a read error after power is restored? If
so, is this behavior allowed by storage standards?
Bart.
On Thu, Oct 05, 2023 at 03:58:38PM -0700, Bart Van Assche wrote:
> On 10/5/23 15:36, Dave Chinner wrote:
> > $ lspci |grep -i nvme
> > 03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> > 06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> > $ cat /sys/block/nvme*n1/queue/write_cache
> > write back
> > write back
> > $
> >
> > That they have volatile writeback caches....
>
> It seems like what I wrote has been misunderstood completely. With
> "handling a power failure cleanly" I meant that power cycling a block device
> does not result in read errors nor in reading data that has never been written.
Then I don't see what your concern is.
Single sector writes are guaranteed atomic and have been for as long
as I've worked in this game. OTOH, multi-sector writes are not
guaranteed to be atomic - they can get torn on sector boundaries,
but the individual sectors within that write are guaranteed to be
all-or-nothing.
Any hardware device that does not guarantee single sector write
atomicity (i.e. tears in the middle of a sector) is, by definition,
broken. And we all know that broken hardware means nothing in the
storage stack works as it should, so I just don't see what point you
are trying to make...
> Although it is hard to find information about this topic, here is what I found
> online:
> * About certain SSDs with power loss protection:
> https://us.transcend-info.com/embedded/technology/power-loss-protection-plp
> * About another class of SSDs with power loss protection:
> https://www.kingston.com/en/blog/servers-and-data-centers/ssd-power-loss-protection
> * About yet another class of SSDs with power loss protection:
> https://phisonblog.com/avoiding-ssd-data-loss-with-phisons-power-loss-protection-2/
Yup, devices that behave as if they have non-volatile write caches.
Such devices have been around for more than 30 years, they operate
the same as devices without caches at all.
> So far I haven't found any information about hard disks and power failure
> handling. What I found is that most current hard disks protect data with ECC.
> The ECC mechanism should provide good protection against reading data that
> has never been written. If a power failure occurs while a hard disk is writing
> a physical block, can this result in a read error after power is restored? If
> so, is this behavior allowed by storage standards?
If a power fail results in read errors from the storage media being
reported to the OS instead of the data that was present in the
sector before the power failure, then the device is broken. If there
is no data in the region being read because it has never been
written, then it should return zeros (no data) rather than stale
data or a media read error.
-Dave.
--
Dave Chinner
[email protected]
On 10/5/23 21:31, Dave Chinner wrote:
> Then I don't see what your concern is.
>
> Single sector writes are guaranteed atomic and have been for as long
> as I've worked in this game. OTOH, multi-sector writes are not
> guaranteed to be atomic - they can get torn on sector boundaries, but
> the individual sectors within that write are guaranteed to be
> all-or-nothing.
>
> Any hardware device that does not guarantee single sector write
> atomicity (i.e. tears in the middle of a sector) is, by definition,
> broken. And we all know that broken hardware means nothing in the
> storage stack works as it should, so I just don't see what point you
> are trying to make...
Do you agree that the above implies that it is not useful in patch 01/21
of this series to track atomic_write_unit_min_bytes in the block layer
nor to export this information to user space? The above implies that
this parameter will always be equal to the logical block size. Writes to
a single physical block happen atomically. If there are multiple logical
blocks per physical block, the block device must serialize
read/modify/write cycles internally.
Thanks,
Bart.
What is the advantage of using write flags instead of using an atomic
open flag (O_ATOMIC)? With an open flag, write, writev, pwritev would
all be supported for atomic writes. And this would potentially require
less application changes to take advantage of atomic writes.
On Fri, Sep 29, 2023 at 3:28 AM John Garry <[email protected]> wrote:
>
> From: Prasad Singamsetty <[email protected]>
>
> Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
> write is to be issued with torn write prevention, according to special
> alignment and length rules.
>
> Torn write prevention means that for a power or any other HW failure, all
> or none of the data will be committed to storage, but never a mix of old
> and new.
>
> For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for
> iocb->ki_flags field to indicate the same.
>
> A call to statx will give the relevant atomic write info:
> - atomic_write_unit_min
> - atomic_write_unit_max
>
> Both values are a power-of-2.
>
> Applications can avail of atomic write feature by ensuring that the total
> length of a write is a power-of-2 in size and also sized between
> atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
> must ensure that the write is at a naturally-aligned offset in the file
> wrt the total write length.
>
> Signed-off-by: Prasad Singamsetty <[email protected]>
> Signed-off-by: John Garry <[email protected]>
> ---
> include/linux/fs.h | 1 +
> include/uapi/linux/fs.h | 5 ++++-
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b528f063e8ff..898952dee8eb 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -328,6 +328,7 @@ enum rw_hint {
> #define IOCB_SYNC (__force int) RWF_SYNC
> #define IOCB_NOWAIT (__force int) RWF_NOWAIT
> #define IOCB_APPEND (__force int) RWF_APPEND
> +#define IOCB_ATOMIC (__force int) RWF_ATOMIC
>
> /* non-RWF related bits - start at 16 */
> #define IOCB_EVENTFD (1 << 16)
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c..e3b4f5bc6860 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -301,8 +301,11 @@ typedef int __bitwise __kernel_rwf_t;
> /* per-IO O_APPEND */
> #define RWF_APPEND ((__force __kernel_rwf_t)0x00000010)
>
> +/* Atomic Write */
> +#define RWF_ATOMIC ((__force __kernel_rwf_t)0x00000020)
> +
> /* mask of flags supported by the kernel */
> #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> - RWF_APPEND)
> + RWF_APPEND | RWF_ATOMIC)
>
> #endif /* _UAPI_LINUX_FS_H */
> --
> 2.31.1
>
Bart,
> The above implies that this parameter will always be equal to the
> logical block size.
It does not. Being able to write each individual block in an I/O without
tearing does not imply that a device can write two blocks as a single
atomic operation.
> Writes to a single physical block happen atomically. If there are
> multiple logical blocks per physical block, the block device must
> serialize read/modify/write cycles internally.
This is what SBC has to say:
"If any write command that is not an atomic write command, does not
complete successfully (e.g., the command completed with CHECK CONDITION
status, or the command was being processed at the time of a power loss
or an incorrect demount of a removable medium), then any data in the
logical blocks referenced by the LBAs specified by that command is
indeterminate."
SBC defines "atomic write command" like this:
"An atomic write command performs one or more atomic write operations.
The following write commands are atomic write commands:
a) WRITE ATOMIC (16) (see 5.48); and
b) WRITE ATOMIC (32) (see 5.49)."
You will note that none of the regular WRITE commands appear in that
list.
Now, in practice we obviously rely heavily on the fact that most devices
are implemented in a sane fashion which doesn't mess up individual
logical blocks on power fail. But the spec does not guarantee this; it
is device implementation dependent. And again, we have seen both hard
disk drives and SSDs that cause collateral damage to an entire physical
block when power is lost at the wrong time.
--
Martin K. Petersen Oracle Linux Engineering
On Fri, Oct 06, 2023 at 11:15:11AM -0700, Jeremy Bongio wrote:
> What is the advantage of using write flags instead of using an atomic
> open flag (O_ATOMIC)? With an open flag, write, writev, pwritev would
> all be supported for atomic writes. And this would potentially require
> less application changes to take advantage of atomic writes.
Atomic writes are not a property of the file or even the inode
itself, they are an attribute of the specific IO being issued by
the application.
Most applications that want atomic writes are using it as a
performance optimisation. They are likely already using DIO with
either AIO, pwritev2 or io_uring and so are already using the
interfaces that support per-IO attributes. Not every IO to every
file needs to be atomic, so a per-IO attribute makes a lot of sense
for these applications.
Add to that that implementing atomic IO semantics in the generic IO
paths (e.g. for buffered writes) is much more difficult. It's
not an unsolvable problem (especially now with high-order folio
support in the page cache), it's just way outside the scope of this
patchset.
-Dave.
--
Dave Chinner
[email protected]
Hi Christoph,
>>>
>>> Currently minimum granularity is the FS block size, but it should be
>>> possibly to support lower in future.
>> I really dislike how this forces aligned allocations. Aligned
>> allocations are a nice optimization to offload some of the work
>> to the storage hard/firmware, but we need to support it in general.
>> And I think with out of place writes into the COW fork, and atomic
>> transactions to swap it in we can do that pretty easily.
>>
>> That should also allow to get rid of the horrible forcealign mode,
>> as we can still try align if possible and just fall back to the
>> out of place writes.
>>
Can you try to explain your idea a bit more? This is blocking us.
Are you suggesting some sort of hybrid between the atomic write series
you had a few years ago and this solution?
To me that would be continuing with the following:
- per-IO RWF_ATOMIC (and not O_ATOMIC semantics of nothing is written
until some data sync)
- writes must be a power-of-two and at a naturally-aligned offset
- relying on atomic write HW support always
But for extents which are misaligned, we CoW to a new extent? I suppose
we would align that extent to alignment of the write (i.e. length of write).
BTW, we also have rtvol support which does not use forcealign as it
already can guarantee alignment, but still does rely on the same
principle of requiring alignment - would you want CoW support there also?
Thanks,
John
On Tue, Nov 28, 2023 at 08:56:37AM +0000, John Garry wrote:
> Are you suggesting some sort of hybrid between the atomic write series you
> had a few years ago and this solution?
Very roughly, yes.
> To me that would be continuing with the following:
> - per-IO RWF_ATOMIC (and not O_ATOMIC semantics of nothing is written until
> some data sync)
Yes.
> - writes must be a power-of-two and at a naturally-aligned offset
Where offset is offset in the file? It would not require it. You
probably want to do it for optimal performance, but requiring it
feeels rather limited.
> - relying on atomic write HW support always
And I think that's where we have different opinions. I think the hw
offload is a nice optimization and we should use it wherever we can.
But building the entire userspace API around it feels like a mistake.
> BTW, we also have rtvol support which does not use forcealign as it already
> can guarantee alignment, but still does rely on the same principle of
> requiring alignment - would you want CoW support there also?
Upstream doesn't have out of place write support for the RT subvolume
yet. But Darrick has a series for it and we're actively working on
upstreaming it.
On 28/11/2023 13:56, Christoph Hellwig wrote:
> On Tue, Nov 28, 2023 at 08:56:37AM +0000, John Garry wrote:
>> Are you suggesting some sort of hybrid between the atomic write series you
>> had a few years ago and this solution?
> Very roughly, yes.
>
>> To me that would be continuing with the following:
>> - per-IO RWF_ATOMIC (and not O_ATOMIC semantics of nothing is written until
>> some data sync)
> Yes.
>
>> - writes must be a power-of-two and at a naturally-aligned offset
> Where offset is offset in the file?
ok, fine, it would not be required for XFS with CoW. Some concerns still:
a. device atomic write boundary, if any
b. other FSes which do not have CoW support. ext4 is already being used
for "atomic writes" in the field - see dubious amazon torn-write prevention.
About b., we could add the pow-of-2 and file offset alignment
requirement for other FSes, but then need to add some method to
advertise that restriction.
> It would not require it. You
> probably want to do it for optimal performance, but requiring it
> feeels rather limited.
>
>> - relying on atomic write HW support always
> And I think that's where we have different opinions.
I'm just trying to understand your idea and that is not necessarily my
final opinion.
> I think the hw
> offload is a nice optimization and we should use it wherever we can.
Sure, but to me it is a concern that we have 2x paths to make robust a.
offload via hw, which may involve CoW b. no HW support, i.e. CoW always
And for no HW support, if we don't follow the O_ATOMIC model of
committing nothing until a SYNC is issued, would we allocate, write, and
later free a new extent for each write, right?
> But building the entire userspace API around it feels like a mistake.
>
ok, but FWIW it works for the usecases which we know.
>> BTW, we also have rtvol support which does not use forcealign as it already
>> can guarantee alignment, but still does rely on the same principle of
>> requiring alignment - would you want CoW support there also?
> Upstream doesn't have out of place write support for the RT subvolume
> yet. But Darrick has a series for it and we're actively working on
> upstreaming it.
Yeah, I thought that I heard this.
Thanks,
John
> b. other FSes which do not have CoW support. ext4 is already being
> used for "atomic writes" in the field
We also need raw block device access to work within the constraints
required by the hardware.
>> probably want to do it for optimal performance, but requiring it
>> feeels rather limited.
The application developers we are working with generally prefer an error
when things are not aligned properly. Predictable performance is key.
Removing the performance variability of doing double writes is the
reason for supporting atomics in the first place.
I think there is value in providing a more generic (file-centric) atomic
user API. And I think the I/O stack plumbing we provide would be useful
in supporting such an endeavor. But I am not convinced that atomic
operations in general should be limited to the couple of filesystems
that can do CoW.
--
Martin K. Petersen Oracle Linux Engineering
On Fri, Sep 29, 2023 at 10:27:15AM +0000, John Garry wrote:
> Add support for atomic writes, as follows:
> - Ensure that the IO follows all the atomic writes rules, like must be
> naturally aligned
> - Set REQ_ATOMIC
>
> Signed-off-by: John Garry <[email protected]>
> ---
> block/fops.c | 42 +++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/block/fops.c b/block/fops.c
> index acff3d5d22d4..516669ad69e5 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -41,6 +41,29 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
> !bdev_iter_is_aligned(bdev, iter);
> }
>
> +static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
> + struct iov_iter *iter)
> +{
> + unsigned int atomic_write_unit_min_bytes =
> + queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
> + unsigned int atomic_write_unit_max_bytes =
> + queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
> +
> + if (!atomic_write_unit_min_bytes)
> + return false;
The above check should have be moved to limit setting code path.
> + if (pos % atomic_write_unit_min_bytes)
> + return false;
> + if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
> + return false;
> + if (!is_power_of_2(iov_iter_count(iter)))
> + return false;
> + if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
> + return false;
> + if (pos % iov_iter_count(iter))
> + return false;
I am a bit confused about relation between atomic_write_unit_max_bytes and
atomic_write_max_bytes.
Here the max IO length is limited to be <= atomic_write_unit_max_bytes,
so looks userspace can only submit IO with write-atomic-unit naturally
aligned IO(such as, 4k, 8k, 16k, 32k, ...), but these user IOs are
allowed to be merged to big one if naturally alignment is respected and
the merged IO size is <= atomic_write_max_bytes.
Is my understanding right? If yes, I'd suggest to document the point,
and the last two checks could be change to:
/* naturally aligned */
if (pos % iov_iter_count(iter))
return false;
if (iov_iter_count(iter) > atomic_write_max_bytes)
return false;
Thanks,
Ming
On Fri, Sep 29, 2023 at 10:27:07AM +0000, John Garry wrote:
> We rely the block layer always being able to send a bio of size
> atomic_write_unit_max without being required to split it due to request
> queue or other bio limits.
>
> A bio may contain min(BIO_MAX_VECS, limits->max_segments) vectors,
> and each vector is at worst case the device logical block size from
> direct IO alignment requirement.
Both unit_max and unit_min are applied to FS bio, which is built over
single userspace buffer, so only the 1st and last vector can include
partial page, and the other vectors should always cover whole page,
then the minimal size could be:
(max_segments - 2) * PAGE_SIZE + 2 * queue_logical_block_size(q)
Thanks,
Ming
On Mon, Dec 04, 2023 at 11:19:20AM +0800, Ming Lei wrote:
> On Fri, Sep 29, 2023 at 10:27:07AM +0000, John Garry wrote:
> > We rely the block layer always being able to send a bio of size
> > atomic_write_unit_max without being required to split it due to request
> > queue or other bio limits.
> >
> > A bio may contain min(BIO_MAX_VECS, limits->max_segments) vectors,
> > and each vector is at worst case the device logical block size from
> > direct IO alignment requirement.
>
> Both unit_max and unit_min are applied to FS bio, which is built over
> single userspace buffer, so only the 1st and last vector can include
Actually it isn't true for pwritev, and sorry for the noise.
Thanks,
Ming
On 04/12/2023 02:30, Ming Lei wrote:
Hi Ming,
>> +static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
>> + struct iov_iter *iter)
>> +{
>> + unsigned int atomic_write_unit_min_bytes =
>> + queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
>> + unsigned int atomic_write_unit_max_bytes =
>> + queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
>> +
>> + if (!atomic_write_unit_min_bytes)
>> + return false;
> The above check should have be moved to limit setting code path.
Sorry, I didn't fully understand your point.
I added this here (as opposed to the caller), as I was not really
worried about speeding up the failure path. Are you saying to call even
earlier in submission path?
>
>> + if (pos % atomic_write_unit_min_bytes)
>> + return false;
>> + if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
>> + return false;
>> + if (!is_power_of_2(iov_iter_count(iter)))
>> + return false;
>> + if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
>> + return false;
>> + if (pos % iov_iter_count(iter))
>> + return false;
> I am a bit confused about relation between atomic_write_unit_max_bytes and
> atomic_write_max_bytes.
I think that naming could be improved. Or even just drop merging (and
atomic_write_max_bytes concept) until we show it to improve performance.
So generally atomic_write_unit_max_bytes will be same as
atomic_write_max_bytes, however it could be different if:
a. request queue nr hw segments or other request queue limits needs to
restrict atomic_write_unit_max_bytes
b. atomic_write_unit_max_bytes does not need to be a power-of-2 and
atomic_write_max_bytes does. So essentially:
atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes)
>
> Here the max IO length is limited to be <= atomic_write_unit_max_bytes,
> so looks userspace can only submit IO with write-atomic-unit naturally
> aligned IO(such as, 4k, 8k, 16k, 32k, ...),
correct
> but these user IOs are
> allowed to be merged to big one if naturally alignment is respected and
> the merged IO size is <= atomic_write_max_bytes.
correct, but the resultant merged IO does not have have to be naturally
aligned.
>
> Is my understanding right?
Yes, but...
> If yes, I'd suggest to document the point,
> and the last two checks could be change to:
>
> /* naturally aligned */
> if (pos % iov_iter_count(iter))
> return false;
>
> if (iov_iter_count(iter) > atomic_write_max_bytes)
> return false;
.. we would not be merging at this point as this is just IO submission
to the block layer, so atomic_write_max_bytes does not come into play
yet. If you check patch 7/21, you will see that we limit IO size to
atomic_write_max_bytes, which is relevant merging.
Thanks,
John
On 04/12/2023 03:55, Ming Lei wrote:
Hi Ming,
> On Mon, Dec 04, 2023 at 11:19:20AM +0800, Ming Lei wrote:
>> On Fri, Sep 29, 2023 at 10:27:07AM +0000, John Garry wrote:
>>> We rely the block layer always being able to send a bio of size
>>> atomic_write_unit_max without being required to split it due to request
>>> queue or other bio limits.
>>>
>>> A bio may contain min(BIO_MAX_VECS, limits->max_segments) vectors,
>>> and each vector is at worst case the device logical block size from
>>> direct IO alignment requirement.
>> Both unit_max and unit_min are applied to FS bio, which is built over
>> single userspace buffer, so only the 1st and last vector can include
> Actually it isn't true for pwritev, and sorry for the noise.
Yeah, I think that it should be:
(max_segments - 2) * PAGE_SIZE
And we need to enforce that any middle vectors are PAGE-aligned.
Thanks,
John
On Mon, Dec 04, 2023 at 09:27:00AM +0000, John Garry wrote:
> On 04/12/2023 02:30, Ming Lei wrote:
>
> Hi Ming,
>
> > > +static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
> > > + struct iov_iter *iter)
> > > +{
> > > + unsigned int atomic_write_unit_min_bytes =
> > > + queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
> > > + unsigned int atomic_write_unit_max_bytes =
> > > + queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
> > > +
> > > + if (!atomic_write_unit_min_bytes)
> > > + return false;
> > The above check should have be moved to limit setting code path.
>
> Sorry, I didn't fully understand your point.
>
> I added this here (as opposed to the caller), as I was not really worried
> about speeding up the failure path. Are you saying to call even earlier in
> submission path?
atomic_write_unit_min is one hardware property, and it should be checked
in blk_queue_atomic_write_unit_min_sectors() from beginning, then you
can avoid this check every other where.
>
> >
> > > + if (pos % atomic_write_unit_min_bytes)
> > > + return false;
> > > + if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
> > > + return false;
> > > + if (!is_power_of_2(iov_iter_count(iter)))
> > > + return false;
> > > + if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
> > > + return false;
> > > + if (pos % iov_iter_count(iter))
> > > + return false;
> > I am a bit confused about relation between atomic_write_unit_max_bytes and
> > atomic_write_max_bytes.
>
> I think that naming could be improved. Or even just drop merging (and
> atomic_write_max_bytes concept) until we show it to improve performance.
>
> So generally atomic_write_unit_max_bytes will be same as
> atomic_write_max_bytes, however it could be different if:
> a. request queue nr hw segments or other request queue limits needs to
> restrict atomic_write_unit_max_bytes
> b. atomic_write_unit_max_bytes does not need to be a power-of-2 and
> atomic_write_max_bytes does. So essentially:
> atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes)
>
plug merge often improves sequential IO perf, so if the hardware supports
this way, I think 'atomic_write_max_bytes' should be supported from the
beginning, such as:
- user space submits sequential N * (4k, 8k, 16k, ...) atomic writes, all can
be merged to single IO request, which is issued to driver.
Or
- user space submits sequential 4k, 4k, 8k, 16K, 32k, 64k atomic writes, all can
be merged to single IO request, which is issued to driver.
The hardware should recognize unit size by start LBA, and check if length is
valid, so probably the interface might be relaxed to:
1) start lba is unit aligned, and this unit is in the supported unit
range(power_2 in [unit_min, unit_max])
2) length needs to be:
- N * this_unit_size
- <= atomic_write_max_bytes
> >
> > Here the max IO length is limited to be <= atomic_write_unit_max_bytes,
> > so looks userspace can only submit IO with write-atomic-unit naturally
> > aligned IO(such as, 4k, 8k, 16k, 32k, ...),
>
> correct
>
> > but these user IOs are
> > allowed to be merged to big one if naturally alignment is respected and
> > the merged IO size is <= atomic_write_max_bytes.
>
> correct, but the resultant merged IO does not have have to be naturally
> aligned.
>
> >
> > Is my understanding right?
>
> Yes, but...
>
> > If yes, I'd suggest to document the point,
> > and the last two checks could be change to:
> >
> > /* naturally aligned */
> > if (pos % iov_iter_count(iter))
> > return false;
> >
> > if (iov_iter_count(iter) > atomic_write_max_bytes)
> > return false;
>
> .. we would not be merging at this point as this is just IO submission to
> the block layer, so atomic_write_max_bytes does not come into play yet. If
> you check patch 7/21, you will see that we limit IO size to
> atomic_write_max_bytes, which is relevant merging.
I know the motivation of atomic_write_max_bytes, and now I am wondering
atomic_write_max_bytes may be exported to userspace for the sake of
atomic write performance.
Thanks,
Ming
>>
>> I added this here (as opposed to the caller), as I was not really worried
>> about speeding up the failure path. Are you saying to call even earlier in
>> submission path?
> atomic_write_unit_min is one hardware property, and it should be checked
> in blk_queue_atomic_write_unit_min_sectors() from beginning, then you
> can avoid this check every other where.
ok, but we still need to ensure in the submission path that the block
device actually supports atomic writes - this was the initial check.
>
>>>> + if (pos % atomic_write_unit_min_bytes)
>>>> + return false;
>>>> + if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
>>>> + return false;
>>>> + if (!is_power_of_2(iov_iter_count(iter)))
>>>> + return false;
>>>> + if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
>>>> + return false;
>>>> + if (pos % iov_iter_count(iter))
>>>> + return false;
>>> I am a bit confused about relation between atomic_write_unit_max_bytes and
>>> atomic_write_max_bytes.
>> I think that naming could be improved. Or even just drop merging (and
>> atomic_write_max_bytes concept) until we show it to improve performance.
>>
>> So generally atomic_write_unit_max_bytes will be same as
>> atomic_write_max_bytes, however it could be different if:
>> a. request queue nr hw segments or other request queue limits needs to
>> restrict atomic_write_unit_max_bytes
>> b. atomic_write_unit_max_bytes does not need to be a power-of-2 and
>> atomic_write_max_bytes does. So essentially:
>> atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes)
>>
> plug merge often improves sequential IO perf, so if the hardware supports
> this way, I think 'atomic_write_max_bytes' should be supported from the
> beginning, such as:
>
> - user space submits sequential N * (4k, 8k, 16k, ...) atomic writes, all can
> be merged to single IO request, which is issued to driver.
>
> Or
>
> - user space submits sequential 4k, 4k, 8k, 16K, 32k, 64k atomic writes, all can
> be merged to single IO request, which is issued to driver.
Right, we do expect userspace to use a fixed block size, but we give
scope in the API to use variable size.
>
> The hardware should recognize unit size by start LBA, and check if length is
> valid, so probably the interface might be relaxed to:
>
> 1) start lba is unit aligned, and this unit is in the supported unit
> range(power_2 in [unit_min, unit_max])
>
> 2) length needs to be:
>
> - N * this_unit_size
> - <= atomic_write_max_bytes
Please note that we also need to consider:
- any atomic write boundary (from NVMe)
- virt boundary (from NVMe)
And, as I mentioned elsewhere, I am still not 100% comfortable that we
don't pay attention to regular max_sectors_kb...
>
>
>>> Here the max IO length is limited to be <= atomic_write_unit_max_bytes,
>>> so looks userspace can only submit IO with write-atomic-unit naturally
>>> aligned IO(such as, 4k, 8k, 16k, 32k, ...),
>> correct
>>
>>> but these user IOs are
>>> allowed to be merged to big one if naturally alignment is respected and
>>> the merged IO size is <= atomic_write_max_bytes.
>> correct, but the resultant merged IO does not have have to be naturally
>> aligned.
>>
>>> Is my understanding right?
>> Yes, but...
>>
>>> If yes, I'd suggest to document the point,
>>> and the last two checks could be change to:
>>>
>>> /* naturally aligned */
>>> if (pos % iov_iter_count(iter))
>>> return false;
>>>
>>> if (iov_iter_count(iter) > atomic_write_max_bytes)
>>> return false;
>> .. we would not be merging at this point as this is just IO submission to
>> the block layer, so atomic_write_max_bytes does not come into play yet. If
>> you check patch 7/21, you will see that we limit IO size to
>> atomic_write_max_bytes, which is relevant merging.
> I know the motivation of atomic_write_max_bytes, and now I am wondering
> atomic_write_max_bytes may be exported to userspace for the sake of
> atomic write performance.
It is available from sysfs for the request queue, but in an earlier
series Dave Chinner suggested doing more to expose to the application
programmer. So here that would mean a statx member. I'm still not
sure... it just didn't seem like a detail which the user would need to
know or be able to do much with.
Thanks,
John
On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
> ok, fine, it would not be required for XFS with CoW. Some concerns still:
> a. device atomic write boundary, if any
> b. other FSes which do not have CoW support. ext4 is already being used for
> "atomic writes" in the field - see dubious amazon torn-write prevention.
What is the 'dubious amazon torn-write prevention'?
> About b., we could add the pow-of-2 and file offset alignment requirement
> for other FSes, but then need to add some method to advertise that
> restriction.
We really need a better way to communicate I/O limitations anyway.
Something like XFS_IOC_DIOINFO on steroids.
> Sure, but to me it is a concern that we have 2x paths to make robust a.
> offload via hw, which may involve CoW b. no HW support, i.e. CoW always
Relying just on the hardware seems very limited, especially as there is
plenty of hardware that won't guarantee anything larger than 4k, and
plenty of NVMe hardware without has some other small limit like 32k
because it doesn't support multiple atomicy mode.
> And for no HW support, if we don't follow the O_ATOMIC model of committing
> nothing until a SYNC is issued, would we allocate, write, and later free a
> new extent for each write, right?
Yes. Then again if you do data journalling you do that anyway, and as
one little project I'm doing right now shows that data journling is
often the fastest thing we can do for very small writes.
On 04/12/2023 13:45, Christoph Hellwig wrote:
> On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
>> ok, fine, it would not be required for XFS with CoW. Some concerns still:
>> a. device atomic write boundary, if any
>> b. other FSes which do not have CoW support. ext4 is already being used for
>> "atomic writes" in the field - see dubious amazon torn-write prevention.
>
> What is the 'dubious amazon torn-write prevention'?
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
AFAICS, this is without any kernel changes, so no guarantee of unwanted
splitting or merging of bios.
Anyway, there will still be !CoW FSes which people want to support.
>
>> About b., we could add the pow-of-2 and file offset alignment requirement
>> for other FSes, but then need to add some method to advertise that
>> restriction.
>
> We really need a better way to communicate I/O limitations anyway.
> Something like XFS_IOC_DIOINFO on steroids.
>
>> Sure, but to me it is a concern that we have 2x paths to make robust a.
>> offload via hw, which may involve CoW b. no HW support, i.e. CoW always
>
> Relying just on the hardware seems very limited, especially as there is
> plenty of hardware that won't guarantee anything larger than 4k, and
> plenty of NVMe hardware without has some other small limit like 32k
> because it doesn't support multiple atomicy mode.
So what would you propose as the next step? Would it to be first achieve
atomic write support for XFS with HW support + CoW to ensure contiguous
extents (and without XFS forcealign)?
>
>> And for no HW support, if we don't follow the O_ATOMIC model of committing
>> nothing until a SYNC is issued, would we allocate, write, and later free a
>> new extent for each write, right?
>
> Yes. Then again if you do data journalling you do that anyway, and as
> one little project I'm doing right now shows that data journling is
> often the fastest thing we can do for very small writes.
Ignoring FSes, then how is this supposed to work for block devices? We
just always need HW support, right?
Thanks,
John
On Mon, Dec 04, 2023 at 03:19:15PM +0000, John Garry wrote:
> On 04/12/2023 13:45, Christoph Hellwig wrote:
>> On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
>>> ok, fine, it would not be required for XFS with CoW. Some concerns still:
>>> a. device atomic write boundary, if any
>>> b. other FSes which do not have CoW support. ext4 is already being used for
>>> "atomic writes" in the field - see dubious amazon torn-write prevention.
>>
>> What is the 'dubious amazon torn-write prevention'?
>
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
>
> AFAICS, this is without any kernel changes, so no guarantee of unwanted
> splitting or merging of bios.
>
> Anyway, there will still be !CoW FSes which people want to support.
Ugg, so they badly reimplement NVMe atomic write support and use it
without software stack enablement. Calling it dubious is way to
gentle..
>> Relying just on the hardware seems very limited, especially as there is
>> plenty of hardware that won't guarantee anything larger than 4k, and
>> plenty of NVMe hardware without has some other small limit like 32k
>> because it doesn't support multiple atomicy mode.
>
> So what would you propose as the next step? Would it to be first achieve
> atomic write support for XFS with HW support + CoW to ensure contiguous
> extents (and without XFS forcealign)?
I think the very first priority is just block device support without
any fs enablement. We just need to make sure the API isn't too limited
for additional use cases.
> Ignoring FSes, then how is this supposed to work for block devices? We just
> always need HW support, right?
Yes.
On 04/12/2023 15:39, Christoph Hellwig wrote:
>> So what would you propose as the next step? Would it to be first achieve
>> atomic write support for XFS with HW support + CoW to ensure contiguous
>> extents (and without XFS forcealign)?
> I think the very first priority is just block device support without
> any fs enablement. We just need to make sure the API isn't too limited
> for additional use cases.
Sounds ok
On Mon, Dec 04, 2023 at 01:13:55PM +0000, John Garry wrote:
>
> > >
> > > I added this here (as opposed to the caller), as I was not really worried
> > > about speeding up the failure path. Are you saying to call even earlier in
> > > submission path?
> > atomic_write_unit_min is one hardware property, and it should be checked
> > in blk_queue_atomic_write_unit_min_sectors() from beginning, then you
> > can avoid this check every other where.
>
> ok, but we still need to ensure in the submission path that the block device
> actually supports atomic writes - this was the initial check.
Then you may add one helper bdev_support_atomic_write().
>
> >
> > > > > + if (pos % atomic_write_unit_min_bytes)
> > > > > + return false;
> > > > > + if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
> > > > > + return false;
> > > > > + if (!is_power_of_2(iov_iter_count(iter)))
> > > > > + return false;
> > > > > + if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
> > > > > + return false;
> > > > > + if (pos % iov_iter_count(iter))
> > > > > + return false;
> > > > I am a bit confused about relation between atomic_write_unit_max_bytes and
> > > > atomic_write_max_bytes.
> > > I think that naming could be improved. Or even just drop merging (and
> > > atomic_write_max_bytes concept) until we show it to improve performance.
> > >
> > > So generally atomic_write_unit_max_bytes will be same as
> > > atomic_write_max_bytes, however it could be different if:
> > > a. request queue nr hw segments or other request queue limits needs to
> > > restrict atomic_write_unit_max_bytes
> > > b. atomic_write_unit_max_bytes does not need to be a power-of-2 and
> > > atomic_write_max_bytes does. So essentially:
> > > atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes)
> > >
> > plug merge often improves sequential IO perf, so if the hardware supports
> > this way, I think 'atomic_write_max_bytes' should be supported from the
> > beginning, such as:
> >
> > - user space submits sequential N * (4k, 8k, 16k, ...) atomic writes, all can
> > be merged to single IO request, which is issued to driver.
> >
> > Or
> >
> > - user space submits sequential 4k, 4k, 8k, 16K, 32k, 64k atomic writes, all can
> > be merged to single IO request, which is issued to driver.
>
> Right, we do expect userspace to use a fixed block size, but we give scope
> in the API to use variable size.
Maybe it is enough to just take atomic_write_unit_min_bytes
only, and allow length to be N * atomic_write_unit_min_bytes.
But it may violate atomic write boundary?
>
> >
> > The hardware should recognize unit size by start LBA, and check if length is
> > valid, so probably the interface might be relaxed to:
> >
> > 1) start lba is unit aligned, and this unit is in the supported unit
> > range(power_2 in [unit_min, unit_max])
> >
> > 2) length needs to be:
> >
> > - N * this_unit_size
> > - <= atomic_write_max_bytes
>
> Please note that we also need to consider:
> - any atomic write boundary (from NVMe)
Can you provide actual NVMe boundary value?
Firstly natural aligned write won't cross boundary, so boundary should
be >= write_unit_max, see blow code from patch 10/21:
+static bool bio_straddles_atomic_write_boundary(loff_t bi_sector,
+ unsigned int bi_size,
+ unsigned int boundary)
+{
+ loff_t start = bi_sector << SECTOR_SHIFT;
+ loff_t end = start + bi_size;
+ loff_t start_mod = start % boundary;
+ loff_t end_mod = end % boundary;
+
+ if (end - start > boundary)
+ return true;
+ if ((start_mod > end_mod) && (start_mod && end_mod))
+ return true;
+
+ return false;
+}
+
Then if the WRITE size is <= boundary, the above function should return
false, right? Looks like it is power_of(2) & aligned atomic_write_max_bytes?
> - virt boundary (from NVMe)
virt boundary is applied on bv_offset and bv_len, and NVMe's virt
bounary is (4k - 1), it shouldn't be one issue in reality.
>
> And, as I mentioned elsewhere, I am still not 100% comfortable that we don't
> pay attention to regular max_sectors_kb...
max_sectors_kb should be bigger than atomic_write_max_bytes actually,
then what is your concern?
Thanks,
Ming
On Mon, Dec 04, 2023 at 03:19:15PM +0000, John Garry wrote:
> >
> > What is the 'dubious amazon torn-write prevention'?
>
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
>
> AFAICS, this is without any kernel changes, so no guarantee of unwanted
> splitting or merging of bios.
Well, more than one company has audited the kernel paths, and it turns
out that for selected Kernel versions, after doing desk-check
verification of the relevant kernel baths, as well as experimental
verification via testing to try to find torn writes in the kernel, we
can make it safe for specific kernel versions which might be used in
hosted MySQL instances where we control the kernel, the mysql server,
and the emulated block device (and we know the database is doing
Direct I/O writes --- this won't work for PostgreSQL). I gave a talk
about this at Google I/O Next '18, five years ago[1].
[1] https://www.youtube.com/watch?v=gIeuiGg-_iw
Given the performance gains (see the talk (see the comparison of the
at time 19:31 and at 29:57) --- it's quite compelling.
Of course, I wouldn't recommend this approach for a naive sysadmin,
since most database adminsitrators won't know how to audit kernel code
(see the discussion at time 35:10 of the video), and reverify the
entire software stack before every kernel upgrade. The challenge is
how to do this safely.
The fact remains that both Amazon's EBS and Google's Persistent Disk
products are implemented in such a way that writes will not be torn
below the virtual machine, and the guarantees are in fact quite a bit
stronger than what we will probably end up advertising via NVMe and/or
SCSI. It wouldn't surprise me if this is the case (or could be made
to be the case) For Oracle Cloud as well.
The question is how to make this guarantee so that the kernel knows
when various cloud-provided block devicse do provide these greater
guarantees, and then how to make it be an architected feature, as
opposed to a happy implementation detail that has to be verified at
every kernel upgrade.
Cheers,
- Ted
On 05/12/2023 01:45, Ming Lei wrote:
>> Right, we do expect userspace to use a fixed block size, but we give scope
>> in the API to use variable size.
> Maybe it is enough to just take atomic_write_unit_min_bytes
> only, and allow length to be N * atomic_write_unit_min_bytes.
>
> But it may violate atomic write boundary?
About atomic boundary, we just don't allow a merge which will result in
a write which will straddle a boundary as there are no guarantees of
atomicity then.
Having said this, atomic write boundary is just relevant to NVMe, so if
we don't have merges there, then we could just omit this code.
>
>>> The hardware should recognize unit size by start LBA, and check if length is
>>> valid, so probably the interface might be relaxed to:
>>>
>>> 1) start lba is unit aligned, and this unit is in the supported unit
>>> range(power_2 in [unit_min, unit_max])
>>>
>>> 2) length needs to be:
>>>
>>> - N * this_unit_size
>>> - <= atomic_write_max_bytes
>> Please note that we also need to consider:
>> - any atomic write boundary (from NVMe)
> Can you provide actual NVMe boundary value?
>
> Firstly natural aligned write won't cross boundary, so boundary should
> be >= write_unit_max,
Correct
> see blow code from patch 10/21:
>
> +static bool bio_straddles_atomic_write_boundary(loff_t bi_sector,
> + unsigned int bi_size,
> + unsigned int boundary)
> +{
> + loff_t start = bi_sector << SECTOR_SHIFT;
> + loff_t end = start + bi_size;
> + loff_t start_mod = start % boundary;
> + loff_t end_mod = end % boundary;
> +
> + if (end - start > boundary)
> + return true;
> + if ((start_mod > end_mod) && (start_mod && end_mod))
> + return true;
> +
> + return false;
> +}
> +
>
> Then if the WRITE size is <= boundary, the above function should return
> false, right?
Actually if WRITE > boundary then we must be crossing a boundary and
should return true, which is what the first condition checks.
However 2x naturally-aligned atomic writes could be less than
atomic_write_max_bytes but still straddle if merged.
> Looks like it is power_of(2) & aligned atomic_write_max_bytes?
>
>> - virt boundary (from NVMe)
> virt boundary is applied on bv_offset and bv_len, and NVMe's virt
> bounary is (4k - 1), it shouldn't be one issue in reality.
On a related topic, as I understand, for NVMe we just split bios
according to virt boundary for PRP, but we won't always use PRP. So is
there value in not splitting bio according to PRP if SGL will actually
be used?
>
>> And, as I mentioned elsewhere, I am still not 100% comfortable that we don't
>> pay attention to regular max_sectors_kb...
> max_sectors_kb should be bigger than atomic_write_max_bytes actually,
> then what is your concern?
My concern is that we don't enforce that, so we may issue atomic writes
which exceed max_sectors_kb.
If we did enforce it, then atomic_write_unit_min_bytes,
atomic_write_unit_max_bytes, and atomic_write_max_bytes all need to be
limited according to max_sectors_kb.
Thanks,
John
On 05/12/2023 04:55, Theodore Ts'o wrote:
>> AFAICS, this is without any kernel changes, so no guarantee of unwanted
>> splitting or merging of bios.
> Well, more than one company has audited the kernel paths, and it turns
> out that for selected Kernel versions, after doing desk-check
> verification of the relevant kernel baths, as well as experimental
> verification via testing to try to find torn writes in the kernel, we
> can make it safe for specific kernel versions which might be used in
> hosted MySQL instances where we control the kernel, the mysql server,
> and the emulated block device (and we know the database is doing
> Direct I/O writes --- this won't work for PostgreSQL). I gave a talk
> about this at Google I/O Next '18, five years ago[1].
>
> [1]https://urldefense.com/v3/__https://www.youtube.com/watch?v=gIeuiGg-_iw__;!!ACWV5N9M2RV99hQ!I4iRp4xUyzAT0UwuEcnUBBCPKLXFKfk5FNmysFbKcQYfl0marAll5xEEVyB5mMFDqeckCWLmjU1aCR2Z$
>
> Given the performance gains (see the talk (see the comparison of the
> at time 19:31 and at 29:57) --- it's quite compelling.
>
> Of course, I wouldn't recommend this approach for a naive sysadmin,
> since most database adminsitrators won't know how to audit kernel code
> (see the discussion at time 35:10 of the video), and reverify the
> entire software stack before every kernel upgrade.
Sure
> The challenge is
> how to do this safely.
Right, and that is why I would be concerned about advertising torn-write
protection support, but someone has not gone through the effort of
auditing and verification phase to ensure that this does not happen in
their software stack ever.
>
> The fact remains that both Amazon's EBS and Google's Persistent Disk
> products are implemented in such a way that writes will not be torn
> below the virtual machine, and the guarantees are in fact quite a bit
> stronger than what we will probably end up advertising via NVMe and/or
> SCSI. It wouldn't surprise me if this is the case (or could be made
> to be the case) For Oracle Cloud as well.
>
> The question is how to make this guarantee so that the kernel knows
> when various cloud-provided block devicse do provide these greater
> guarantees, and then how to make it be an architected feature, as
> opposed to a happy implementation detail that has to be verified at
> every kernel upgrade.
The kernel can only judge atomic write support from what the HW product
data tells us, so cloud-provided block devices need to provide that
information as best possible if emulating the some storage technology.
Thanks,
John
On Mon, Dec 04, 2023 at 03:19:15PM +0000, John Garry wrote:
> On 04/12/2023 13:45, Christoph Hellwig wrote:
> > On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
> > > ok, fine, it would not be required for XFS with CoW. Some concerns still:
> > > a. device atomic write boundary, if any
> > > b. other FSes which do not have CoW support. ext4 is already being used for
> > > "atomic writes" in the field - see dubious amazon torn-write prevention.
> >
> > What is the 'dubious amazon torn-write prevention'?
>
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
>
> AFAICS, this is without any kernel changes, so no guarantee of unwanted
> splitting or merging of bios.
>
> Anyway, there will still be !CoW FSes which people want to support.
>
> >
> > > About b., we could add the pow-of-2 and file offset alignment requirement
> > > for other FSes, but then need to add some method to advertise that
> > > restriction.
> >
> > We really need a better way to communicate I/O limitations anyway.
> > Something like XFS_IOC_DIOINFO on steroids.
> >
> > > Sure, but to me it is a concern that we have 2x paths to make robust a.
> > > offload via hw, which may involve CoW b. no HW support, i.e. CoW always
> >
> > Relying just on the hardware seems very limited, especially as there is
> > plenty of hardware that won't guarantee anything larger than 4k, and
> > plenty of NVMe hardware without has some other small limit like 32k
> > because it doesn't support multiple atomicy mode.
>
> So what would you propose as the next step? Would it to be first achieve
> atomic write support for XFS with HW support + CoW to ensure contiguous
> extents (and without XFS forcealign)?
>
> >
> > > And for no HW support, if we don't follow the O_ATOMIC model of committing
> > > nothing until a SYNC is issued, would we allocate, write, and later free a
> > > new extent for each write, right?
> >
> > Yes. Then again if you do data journalling you do that anyway, and as
> > one little project I'm doing right now shows that data journling is
> > often the fastest thing we can do for very small writes.
>
> Ignoring FSes, then how is this supposed to work for block devices? We just
> always need HW support, right?
Looks the HW support could be minimized, just like what Google and Amazon did,
16KB physical block size with proper queue limit setting.
Now seems it is easy to make such device with ublk-loop by:
- use one backing disk with 16KB/32KB/.. physical block size
- expose proper physical bs & chunk_sectors & max sectors queue limit
Then any 16KB aligned direct WRITE with N*16KB length(N in [1, 8] with 256
chunk_sectors) can be atomic-write.
Thanks,
Ming