2023-12-12 11:10:30

by John Garry

Subject: [PATCH v2 00/16] block atomic writes

This series introduces a proposal for implementing atomic writes in the
kernel, for torn-write protection.

This series takes the approach of adding a new "atomic" flag to each of
pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively.
When set, these indicate that we want the write issued "atomically".

Only direct IO on block devices is supported here. For this, atomic write
HW support is required, such as the SCSI WRITE ATOMIC (16) command. For
now, the XFS support from earlier versions is sidelined, until an
interface is agreed which does not fully rely on HW offload support.

A man pages update has been posted at:
https://lore.kernel.org/linux-api/[email protected]/T/#t
(not updated since I posted v1 kernel series)

The goal here is to provide an interface that allows applications to use
application-specific block sizes larger than the logical block size
reported by the storage device or larger than the filesystem block size
as reported by stat().

With this new interface, application blocks will never be torn or
fractured when written. In the event of a power failure, for each
individual application block, all or none of the data will be written. A
racing atomic write and read will mean that the read sees all the old
data or all the new data, but never a mix of old and new.

Two new fields are added to struct statx - atomic_write_unit_min and
atomic_write_unit_max. For each individual atomic write, the total length
of the write must be between atomic_write_unit_min and
atomic_write_unit_max, inclusive, and a power-of-2. The write must also
be at a natural offset in the file with respect to the write length.
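
As an illustration, a minimal user-space sketch of the intended usage
(RWF_ATOMIC and the statx-reported unit_min/unit_max are as proposed by
this series; the flag value shown is illustrative, not final):

#define _GNU_SOURCE
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040 /* proposed uapi value; illustrative */
#endif

/*
 * Write one application block without tearing. 'len' must be a
 * power-of-2 between atomic_write_unit_min and atomic_write_unit_max
 * (as reported by statx), and 'pos' must be naturally aligned to 'len'.
 */
static ssize_t atomic_block_write(int fd, const void *buf, size_t len,
				  off_t pos)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

	if (pos % len)
		return -1; /* not naturally aligned */

	return pwritev2(fd, &iov, 1, pos, RWF_ATOMIC);
}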

Kernel support is added for SCSI (sd.c), scsi_debug, and NVMe.

Some open questions:
- How to make the API extensible for when we have no HW support? In that
case, we would probably not have to follow the power-of-2 length rule et
al. As a possible solution, maybe we can say that atomic writes are
supported for the file via statx, but not set the unit_min and max values,
and this means that writes there just need to be FS block aligned.
- For the block layer, should atomic_write_unit_max be limited by
max_sectors_kb? Currently it is not.
- How to improve the requirement that iovecs are PAGE-aligned.
There are two issues:
a. We impose this rule so as not to split BIOs due to the virt boundary
for NVMe, but the virt boundary there is 4K (and not PAGE size), so this
is broken for 16K/64K pages. An easy solution is to impose the
requirement that iovecs are 4K-aligned.
b. We don't enforce this rule for virt boundary == 0, i.e. SCSI
- Since debugging torn-writes due to unwanted kernel BIO splitting/merging
would be horrible, should we add some kernel storage stack software
integrity checks?

This series is based on v6.7-rc5.

Changes since v1:
- Drop XFS support for now
- Tidy NVMe changes and also add checks for an atomic write violating the
max atomic write power-fail (AWUPF) length and boundary (if any)
- Reject - instead of ignoring - RWF_ATOMIC for files which do not
support atomic writes
- Update block sysfs documentation
- Various tidy-ups

Alan Adamson (2):
nvme: Support atomic writes
nvme: Ensure atomic writes will be executed atomically

Himanshu Madhani (2):
block: Add atomic write operations to request_queue limits
block: Add REQ_ATOMIC flag

John Garry (10):
block: Limit atomic writes according to bio and queue limits
fs: Increase fmode_t size
block: Pass blk_queue_get_max_sectors() a request pointer
block: Limit atomic write IO size according to
atomic_write_max_sectors
block: Error an attempt to split an atomic write bio
block: Add checks to merging of atomic writes
block: Add fops atomic write support
scsi: sd: Support reading atomic write properties from block limits
VPD
scsi: sd: Add WRITE_ATOMIC_16 support
scsi: scsi_debug: Atomic write support

Prasad Singamsetty (2):
fs/bdev: Add atomic write support info to statx
fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support

Documentation/ABI/stable/sysfs-block | 47 +++
block/bdev.c | 31 +-
block/blk-merge.c | 95 ++++-
block/blk-mq.c | 2 +-
block/blk-settings.c | 84 ++++
block/blk-sysfs.c | 33 ++
block/blk.h | 9 +-
block/fops.c | 40 +-
drivers/dma-buf/dma-buf.c | 2 +-
drivers/nvme/host/core.c | 108 ++++-
drivers/nvme/host/nvme.h | 2 +
drivers/scsi/scsi_debug.c | 590 +++++++++++++++++++++------
drivers/scsi/scsi_trace.c | 22 +
drivers/scsi/sd.c | 93 ++++-
drivers/scsi/sd.h | 8 +
fs/stat.c | 44 +-
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 41 +-
include/linux/fs.h | 11 +
include/linux/stat.h | 2 +
include/linux/types.h | 2 +-
include/scsi/scsi_proto.h | 1 +
include/trace/events/scsi.h | 1 +
include/uapi/linux/fs.h | 5 +-
include/uapi/linux/stat.h | 7 +-
25 files changed, 1098 insertions(+), 184 deletions(-)

--
2.35.3


2023-12-12 11:10:45

by John Garry

Subject: [PATCH v2 09/16] block: Error an attempt to split an atomic write bio

As the name suggests, we should not be splitting these.

Signed-off-by: John Garry <[email protected]>
---
block/blk-merge.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8d4de9253fe9..bc21f8ff4842 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -319,6 +319,11 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
*segs = nsegs;
return NULL;
split:
+ if (bio->bi_opf & REQ_ATOMIC) {
+ bio->bi_status = BLK_STS_IOERR;
+ bio_endio(bio);
+ return ERR_PTR(-EIO);
+ }
/*
* We can't sanely support splitting for a REQ_NOWAIT bio. End it
* with EAGAIN if splitting is required and return an error pointer.
--
2.35.3

2023-12-12 11:11:13

by John Garry

Subject: [PATCH v2 15/16] nvme: Support atomic writes

From: Alan Adamson <[email protected]>

Support reading the NVMe atomic write identify fields to fill in the
request_queue properties.

Use the following method to calculate the limits:
atomic_write_max_bytes = flp2(NAWUPF ?: AWUPF)
atomic_write_unit_min = logical_block_size
atomic_write_unit_max = flp2(NAWUPF ?: AWUPF)
atomic_write_boundary = NABSPF
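
As a worked example (illustrative values; flp2() here is
rounddown_pow_of_two(), with the 0-based field values converted to bytes
first): for a namespace with a 4KiB LBA size reporting NAWUPF = 15 and
NABSPF = 31:

atomic_write_max_bytes = flp2((15 + 1) * 4096) = 65536
atomic_write_unit_min  = 4096
atomic_write_unit_max  = flp2((15 + 1) * 4096) = 65536
atomic_write_boundary  = (31 + 1) * 4096       = 131072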

Signed-off-by: Alan Adamson <[email protected]>
#jpg: some rewrite
Signed-off-by: John Garry <[email protected]>
---
drivers/nvme/host/core.c | 85 ++++++++++++++++++++++++++++++----------
1 file changed, 64 insertions(+), 21 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8ebdfd623e0f..fd4f09f9dbe8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1894,12 +1894,68 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
blk_queue_write_cache(q, vwc, vwc);
}

+static void nvme_update_atomic_write_disk_info(struct gendisk *disk,
+ struct nvme_ns *ns, struct nvme_id_ns *id, u32 bs, u32 phys_bs)
+{
+ unsigned int max_bytes = 0, unit_min = 0, unit_max = 0, boundary = 0;
+ u32 atomic_bs = bs;
+
+ if (id->nabo == 0) {
+ unsigned int ns_boundary;
+
+ /*
+ * Bit 1 indicates whether NAWUPF is defined for this namespace
+ * and whether it should be used instead of AWUPF. If NAWUPF ==
+ * 0 then AWUPF must be used instead.
+ */
+ if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf)
+ atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;
+ else
+ atomic_bs = (1 + ns->ctrl->subsys->awupf) * bs;
+
+ if (le16_to_cpu(id->nabspf))
+ ns_boundary = (le16_to_cpu(id->nabspf) + 1) * bs;
+ else
+ ns_boundary = 0;
+
+ /*
+ * The boundary size just needs to be a multiple
+ * of unit_max (and not necessarily a power-of-2), so
+ * this could be relaxed in the block layer in future.
+ */
+ if (!ns_boundary || is_power_of_2(ns_boundary)) {
+ max_bytes = atomic_bs;
+ unit_min = bs >> SECTOR_SHIFT;
+ unit_max = rounddown_pow_of_two(atomic_bs) >> SECTOR_SHIFT;
+ boundary = ns_boundary;
+ } else {
+ dev_notice(ns->ctrl->device, "Unsupported atomic write boundary (%d bytes)\n",
+ ns_boundary);
+ }
+ } else {
+ dev_info(ns->ctrl->device, "Atomic writes not supported for NABO set (%d blocks)\n",
+ id->nabo);
+ }
+
+	blk_queue_atomic_write_max_bytes(disk->queue, max_bytes);
+	blk_queue_atomic_write_unit_min_sectors(disk->queue, unit_min);
+	blk_queue_atomic_write_unit_max_sectors(disk->queue, unit_max);
+	blk_queue_atomic_write_boundary_bytes(disk->queue, boundary);
+
+ /*
+ * Linux filesystems assume writing a single physical block is
+ * an atomic operation. Hence limit the physical block size to the
+ * value of the Atomic Write Unit Power Fail parameter.
+ */
+ blk_queue_physical_block_size(disk->queue, min(phys_bs, atomic_bs));
+}
+
static void nvme_update_disk_info(struct gendisk *disk,
struct nvme_ns *ns, struct nvme_id_ns *id)
{
sector_t capacity = nvme_lba_to_sect(ns, le64_to_cpu(id->nsze));
- u32 bs = 1U << ns->lba_shift;
- u32 atomic_bs, phys_bs, io_opt = 0;
+ u32 bs, phys_bs, io_opt = 0;

/*
* The block layer can't support LBA sizes larger than the page size
@@ -1909,37 +1965,24 @@ static void nvme_update_disk_info(struct gendisk *disk,
if (ns->lba_shift > PAGE_SHIFT || ns->lba_shift < SECTOR_SHIFT) {
capacity = 0;
bs = (1 << 9);
+ } else {
+ bs = 1U << ns->lba_shift;
}

blk_integrity_unregister(disk);

- atomic_bs = phys_bs = bs;
- if (id->nabo == 0) {
- /*
- * Bit 1 indicates whether NAWUPF is defined for this namespace
- * and whether it should be used instead of AWUPF. If NAWUPF ==
- * 0 then AWUPF must be used instead.
- */
- if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf)
- atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;
- else
- atomic_bs = (1 + ns->ctrl->subsys->awupf) * bs;
- }
-
if (id->nsfeat & NVME_NS_FEAT_IO_OPT) {
/* NPWG = Namespace Preferred Write Granularity */
phys_bs = bs * (1 + le16_to_cpu(id->npwg));
/* NOWS = Namespace Optimal Write Size */
io_opt = bs * (1 + le16_to_cpu(id->nows));
+ } else {
+ phys_bs = bs;
}

+ nvme_update_atomic_write_disk_info(disk, ns, id, bs, phys_bs);
+
blk_queue_logical_block_size(disk->queue, bs);
- /*
- * Linux filesystems assume writing a single physical block is
- * an atomic operation. Hence limit the physical block size to the
- * value of the Atomic Write Unit Power Fail parameter.
- */
- blk_queue_physical_block_size(disk->queue, min(phys_bs, atomic_bs));
blk_queue_io_min(disk->queue, phys_bs);
blk_queue_io_opt(disk->queue, io_opt);

--
2.35.3

2023-12-12 11:11:26

by John Garry

Subject: [PATCH v2 10/16] block: Add checks to merging of atomic writes

For atomic writes we allow merging, but we must adhere to some additional
rules:
- Only allow merging of atomic writes with other atomic writes
- Ensure that the merged IO would not cross an atomic write boundary, if
any

We already ensure that we don't exceed the atomic writes size limit in
get_max_io_size().
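
As a sketch of the boundary rule (mirroring the logic of
blk_rq_straddles_atomic_write_boundary() added below; illustrative, not
the kernel code itself):

#include <stdbool.h>

/* Would the byte range [start, end) cross a power-of-2 boundary? */
static bool straddles_boundary(unsigned long long start,
			       unsigned long long end,
			       unsigned int boundary)
{
	unsigned long long mask = (unsigned long long)boundary - 1;

	if (!boundary)
		return false;
	/* Longer than the boundary itself: must cross it */
	if (end - start > boundary)
		return true;
	/* Begins or ends exactly on a boundary: fits in one window */
	if (!(start & mask) || !(end & mask))
		return false;
	/* Different boundary windows: crosses */
	return (start & ~mask) != (end & ~mask);
}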

Signed-off-by: John Garry <[email protected]>
---
block/blk-merge.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 75 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index bc21f8ff4842..05eb227a5644 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -18,6 +18,42 @@
#include "blk-rq-qos.h"
#include "blk-throttle.h"

+static bool blk_rq_straddles_atomic_write_boundary(struct request *rq,
+ unsigned int front,
+ unsigned int back)
+{
+ unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
+ unsigned int mask, imask;
+ loff_t start, end;
+
+ if (!boundary)
+ return false;
+
+ start = rq->__sector << SECTOR_SHIFT;
+ end = start + rq->__data_len;
+
+ start -= front;
+ end += back;
+
+ /* We're longer than the boundary, so must be crossing it */
+ if (end - start > boundary)
+ return true;
+
+ mask = boundary - 1;
+
+ /* start/end are boundary-aligned, so cannot be crossing */
+ if (!(start & mask) || !(end & mask))
+ return false;
+
+ imask = ~mask;
+
+ /* Top bits are different, so crossed a boundary */
+ if ((start & imask) != (end & imask))
+ return true;
+
+ return false;
+}
+
static inline void bio_get_first_bvec(struct bio *bio, struct bio_vec *bv)
{
*bv = mp_bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
@@ -664,6 +700,13 @@ int ll_back_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs)
return 0;
}

+ if (req->cmd_flags & REQ_ATOMIC) {
+ if (blk_rq_straddles_atomic_write_boundary(req,
+ 0, bio->bi_iter.bi_size)) {
+ return 0;
+ }
+ }
+
return ll_new_hw_segment(req, bio, nr_segs);
}

@@ -683,6 +726,13 @@ static int ll_front_merge_fn(struct request *req, struct bio *bio,
return 0;
}

+ if (req->cmd_flags & REQ_ATOMIC) {
+ if (blk_rq_straddles_atomic_write_boundary(req,
+ bio->bi_iter.bi_size, 0)) {
+ return 0;
+ }
+ }
+
return ll_new_hw_segment(req, bio, nr_segs);
}

@@ -719,6 +769,13 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
blk_rq_get_max_sectors(req, blk_rq_pos(req)))
return 0;

+ if (req->cmd_flags & REQ_ATOMIC) {
+ if (blk_rq_straddles_atomic_write_boundary(req,
+ 0, blk_rq_bytes(next))) {
+ return 0;
+ }
+ }
+
total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
if (total_phys_segments > blk_rq_get_max_segments(req))
return 0;
@@ -814,6 +871,18 @@ static enum elv_merge blk_try_req_merge(struct request *req,
return ELEVATOR_NO_MERGE;
}

+static bool blk_atomic_write_mergeable_rq_bio(struct request *rq,
+ struct bio *bio)
+{
+ return (rq->cmd_flags & REQ_ATOMIC) == (bio->bi_opf & REQ_ATOMIC);
+}
+
+static bool blk_atomic_write_mergeable_rqs(struct request *rq,
+ struct request *next)
+{
+ return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC);
+}
+
/*
* For non-mq, this has to be called with the request spinlock acquired.
* For mq with scheduling, the appropriate queue wide lock should be held.
@@ -833,6 +902,9 @@ static struct request *attempt_merge(struct request_queue *q,
if (req->ioprio != next->ioprio)
return NULL;

+ if (!blk_atomic_write_mergeable_rqs(req, next))
+ return NULL;
+
/*
* If we are allowed to merge, then append bio list
* from next to rq and release next. merge_requests_fn
@@ -960,6 +1032,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
if (rq->ioprio != bio_prio(bio))
return false;

+ if (blk_atomic_write_mergeable_rq_bio(rq, bio) == false)
+ return false;
+
return true;
}

--
2.35.3

2023-12-12 11:11:38

by John Garry

Subject: [PATCH v2 16/16] nvme: Ensure atomic writes will be executed atomically

From: Alan Adamson <[email protected]>

There is no dedicated NVMe atomic write command, i.e. no command which
would error when it exceeds the controller atomic write limits.

As an insurance policy against the block layer sending requests which
cannot be executed atomically, add a check in the queue path.
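
For example (illustrative numbers): with ns->atomic_boundary = 128KiB on
a 4KiB-LBA namespace, boundary = 32 blocks and imask = ~31. A 16-block
atomic write starting at LBA 24 ends at LBA 39; (24 & imask) = 0 but
(39 & imask) = 32, so the two LBAs sit in different boundary windows and
the command is failed with BLK_STS_IOERR before reaching the device.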

Signed-off-by: Alan Adamson <[email protected]>
Signed-off-by: John Garry <[email protected]>
---
drivers/nvme/host/core.c | 23 +++++++++++++++++++++++
drivers/nvme/host/nvme.h | 2 ++
2 files changed, 25 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fd4f09f9dbe8..6b89ee7e9921 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -905,6 +905,26 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
if (req->cmd_flags & REQ_RAHEAD)
dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;

+ /*
+ * Ensure that nothing has been sent which cannot be executed
+ * atomically.
+ */
+ if (req->cmd_flags & REQ_ATOMIC) {
+ if (blk_rq_bytes(req) > ns->atomic_max)
+ return BLK_STS_IOERR;
+
+ if (ns->atomic_boundary) {
+ u32 boundary = ns->atomic_boundary >> ns->lba_shift;
+ u32 imask = ~(boundary - 1);
+ u32 lba = nvme_sect_to_lba(ns, blk_rq_pos(req));
+ u32 end = lba + (blk_rq_bytes(req) >> ns->lba_shift)
+ - 1;
+
+ if ((lba & imask) != (end & imask))
+ return BLK_STS_IOERR;
+ }
+ }
+
cmnd->rw.opcode = op;
cmnd->rw.flags = 0;
cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
@@ -1937,6 +1957,9 @@ static void nvme_update_atomic_write_disk_info(struct gendisk *disk,
id->nabo);
}

+	ns->atomic_max = max_bytes;
+ ns->atomic_boundary = boundary;
+
 	blk_queue_atomic_write_max_bytes(disk->queue, max_bytes);
 	blk_queue_atomic_write_unit_min_sectors(disk->queue, unit_min);
 	blk_queue_atomic_write_unit_max_sectors(disk->queue, unit_max);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e7411dac00f7..5a3d76bc816f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -520,6 +520,8 @@ struct nvme_ns {

struct nvme_fault_inject fault_inject;

+ u32 atomic_max;
+ u32 atomic_boundary;
};

/* NVMe ns supports metadata actions by the controller (generate/strip) */
--
2.35.3

2023-12-12 11:13:28

by John Garry

Subject: [PATCH v2 12/16] scsi: sd: Support reading atomic write properties from block limits VPD

Also update block layer request queue sysfs properties.

See sbc4r22 section 6.6.4 - Block limits VPD page.
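
For reference, the Block limits VPD page fields consumed below (byte
offsets within the page, per sbc4r22 6.6.4):

bytes 44..47  MAXIMUM ATOMIC TRANSFER LENGTH
bytes 48..51  ATOMIC ALIGNMENT
bytes 52..55  ATOMIC TRANSFER LENGTH GRANULARITY
bytes 56..59  MAXIMUM ATOMIC TRANSFER LENGTH WITH ATOMIC BOUNDARY
bytes 60..63  MAXIMUM ATOMIC BOUNDARY SIZE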

Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/sd.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++-
drivers/scsi/sd.h | 8 ++++++
2 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 542a4bbb21bc..fa857a08341f 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -916,6 +916,65 @@ static blk_status_t sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
return scsi_alloc_sgtables(cmd);
}

+static void sd_config_atomic(struct scsi_disk *sdkp)
+{
+ unsigned int logical_block_size = sdkp->device->sector_size,
+ physical_block_size_sectors, max_atomic, unit_min, unit_max;
+ struct request_queue *q = sdkp->disk->queue;
+
+ if ((!sdkp->max_atomic && !sdkp->max_atomic_with_boundary) ||
+ sdkp->protection_type == T10_PI_TYPE2_PROTECTION)
+ return;
+
+ physical_block_size_sectors = sdkp->physical_block_size /
+ sdkp->device->sector_size;
+
+ unit_min = rounddown_pow_of_two(sdkp->atomic_granularity ?
+ sdkp->atomic_granularity :
+ physical_block_size_sectors);
+
+ /*
+ * Only use atomic boundary when we have the odd scenario of
+ * sdkp->max_atomic == 0, which the spec does permit.
+ */
+ if (sdkp->max_atomic) {
+ max_atomic = sdkp->max_atomic;
+ unit_max = rounddown_pow_of_two(sdkp->max_atomic);
+ sdkp->use_atomic_write_boundary = 0;
+ } else {
+ max_atomic = sdkp->max_atomic_with_boundary;
+ unit_max = rounddown_pow_of_two(sdkp->max_atomic_boundary);
+ sdkp->use_atomic_write_boundary = 1;
+ }
+
+ /*
+ * Ensure compliance with granularity and alignment. For now, keep it
+ * simple and just don't support atomic writes for values mismatched
+ * with max_{boundary}atomic, physical block size, and
+ * atomic_granularity itself.
+ *
+ * We're really being distrustful by checking unit_max also...
+ */
+ if (sdkp->atomic_granularity > 1) {
+ if (unit_min > 1 && unit_min % sdkp->atomic_granularity)
+ return;
+ if (unit_max > 1 && unit_max % sdkp->atomic_granularity)
+ return;
+ }
+
+ if (sdkp->atomic_alignment > 1) {
+ if (unit_min > 1 && unit_min % sdkp->atomic_alignment)
+ return;
+ if (unit_max > 1 && unit_max % sdkp->atomic_alignment)
+ return;
+ }
+
+ blk_queue_atomic_write_max_bytes(q, max_atomic * logical_block_size);
+ blk_queue_atomic_write_unit_min_sectors(q, unit_min);
+ blk_queue_atomic_write_unit_max_sectors(q, unit_max);
+ blk_queue_atomic_write_boundary_bytes(q, 0);
+}
+
static blk_status_t sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd,
bool unmap)
{
@@ -3071,7 +3130,7 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
sdkp->max_ws_blocks = (u32)get_unaligned_be64(&vpd->data[36]);

if (!sdkp->lbpme)
- goto out;
+ goto read_atomics;

lba_count = get_unaligned_be32(&vpd->data[20]);
desc_count = get_unaligned_be32(&vpd->data[24]);
@@ -3102,6 +3161,14 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
else
sd_config_discard(sdkp, SD_LBP_DISABLE);
}
+read_atomics:
+ sdkp->max_atomic = get_unaligned_be32(&vpd->data[44]);
+ sdkp->atomic_alignment = get_unaligned_be32(&vpd->data[48]);
+ sdkp->atomic_granularity = get_unaligned_be32(&vpd->data[52]);
+ sdkp->max_atomic_with_boundary = get_unaligned_be32(&vpd->data[56]);
+ sdkp->max_atomic_boundary = get_unaligned_be32(&vpd->data[60]);
+
+ sd_config_atomic(sdkp);
}

out:
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 409dda5350d1..990188a56b51 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -121,6 +121,13 @@ struct scsi_disk {
u32 max_unmap_blocks;
u32 unmap_granularity;
u32 unmap_alignment;
+
+ u32 max_atomic;
+ u32 atomic_alignment;
+ u32 atomic_granularity;
+ u32 max_atomic_with_boundary;
+ u32 max_atomic_boundary;
+
u32 index;
unsigned int physical_block_size;
unsigned int max_medium_access_timeouts;
@@ -151,6 +158,7 @@ struct scsi_disk {
unsigned urswrz : 1;
unsigned security : 1;
unsigned ignore_medium_access_errors : 1;
+ unsigned use_atomic_write_boundary : 1;
};
#define to_scsi_disk(obj) container_of(obj, struct scsi_disk, disk_dev)

--
2.35.3

2023-12-12 11:13:30

by John Garry

Subject: [PATCH v2 11/16] block: Add fops atomic write support

Add support for atomic writes, as follows:
- Ensure that the IO follows all the atomic write rules, such as being
naturally aligned
- Set REQ_ATOMIC

Signed-off-by: John Garry <[email protected]>
---
block/fops.c | 40 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/block/fops.c b/block/fops.c
index 0abaac705daf..ba6a2c5a74b1 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -41,6 +41,24 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
!bdev_iter_is_aligned(bdev, iter);
}

+static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
+ struct iov_iter *iter)
+{
+ struct request_queue *q = bdev_get_queue(bdev);
+ unsigned int min_bytes = queue_atomic_write_unit_min_bytes(q);
+ unsigned int max_bytes = queue_atomic_write_unit_max_bytes(q);
+
+ if (iov_iter_count(iter) & (min_bytes - 1))
+ return false;
+ if (!is_power_of_2(iov_iter_count(iter)))
+ return false;
+ if (pos & (iov_iter_count(iter) - 1))
+ return false;
+ if (iov_iter_count(iter) > max_bytes)
+ return false;
+ return true;
+}
+
#define DIO_INLINE_BIO_VECS 4

static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
@@ -48,6 +66,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
{
struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs;
+ bool is_read = iov_iter_rw(iter) == READ;
+ bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
loff_t pos = iocb->ki_pos;
bool should_dirty = false;
struct bio bio;
@@ -56,6 +76,9 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
if (blkdev_dio_unaligned(bdev, pos, iter))
return -EINVAL;

+ if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
+ return -EINVAL;
+
if (nr_pages <= DIO_INLINE_BIO_VECS)
vecs = inline_vecs;
else {
@@ -65,7 +88,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
return -ENOMEM;
}

- if (iov_iter_rw(iter) == READ) {
+ if (is_read) {
bio_init(&bio, bdev, vecs, nr_pages, REQ_OP_READ);
if (user_backed_iter(iter))
should_dirty = true;
@@ -74,6 +97,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
}
bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
bio.bi_ioprio = iocb->ki_ioprio;
+ if (atomic_write)
+ bio.bi_opf |= REQ_ATOMIC;

ret = bio_iov_iter_get_pages(&bio, iter);
if (unlikely(ret))
@@ -167,10 +192,14 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
struct blkdev_dio *dio;
struct bio *bio;
bool is_read = (iov_iter_rw(iter) == READ), is_sync;
+ bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
loff_t pos = iocb->ki_pos;
int ret = 0;

+ if (atomic_write)
+ return -EINVAL;
+
if (blkdev_dio_unaligned(bdev, pos, iter))
return -EINVAL;

@@ -305,6 +334,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
bool is_read = iov_iter_rw(iter) == READ;
blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
+ bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
struct blkdev_dio *dio;
struct bio *bio;
loff_t pos = iocb->ki_pos;
@@ -313,6 +343,9 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
if (blkdev_dio_unaligned(bdev, pos, iter))
return -EINVAL;

+ if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
+ return -EINVAL;
+
if (iocb->ki_flags & IOCB_ALLOC_CACHE)
opf |= REQ_ALLOC_CACHE;
bio = bio_alloc_bioset(bdev, nr_pages, opf, GFP_KERNEL,
@@ -347,6 +380,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
bio_set_pages_dirty(bio);
}
} else {
+ if (atomic_write)
+ bio->bi_opf |= REQ_ATOMIC;
task_io_account_write(bio->bi_iter.bi_size);
}

@@ -605,6 +640,9 @@ static int blkdev_open(struct inode *inode, struct file *filp)
if (bdev_nowait(handle->bdev))
filp->f_mode |= FMODE_NOWAIT;

+ if (queue_atomic_write_unit_min_bytes(bdev_get_queue(handle->bdev)))
+ filp->f_mode |= FMODE_CAN_ATOMIC_WRITE;
+
filp->f_mapping = handle->bdev->bd_inode->i_mapping;
filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
filp->private_data = handle;
--
2.35.3

2023-12-12 11:13:33

by John Garry

Subject: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits

From: Himanshu Madhani <[email protected]>

Add the following limits:
- atomic_write_boundary_bytes
- atomic_write_max_bytes
- atomic_write_unit_max_bytes
- atomic_write_unit_min_bytes

All atomic write limits are initialised to 0 to indicate no atomic write
support. Stacked devices are also not supported for now.

Signed-off-by: Himanshu Madhani <[email protected]>
#jpg: Heavy rewrite
Signed-off-by: John Garry <[email protected]>
---
Documentation/ABI/stable/sysfs-block | 47 ++++++++++++++++++++++
block/blk-settings.c | 60 ++++++++++++++++++++++++++++
block/blk-sysfs.c | 33 +++++++++++++++
include/linux/blkdev.h | 37 +++++++++++++++++
4 files changed, 177 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 1fe9a553c37b..ba81a081522f 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -21,6 +21,53 @@ Description:
device is offset from the internal allocation unit's
natural alignment.

+What: /sys/block/<disk>/atomic_write_max_bytes
+Date: May 2023
+Contact: Himanshu Madhani <[email protected]>
+Description:
+ [RO] This parameter specifies the maximum atomic write
+ size reported by the device. This parameter is relevant
+ for merging of writes, where a merged atomic write
+ operation must not exceed this number of bytes.
+ The atomic_write_max_bytes may exceed the value in
+ atomic_write_unit_max_bytes if atomic_write_max_bytes
+ is not a power-of-two or atomic_write_unit_max_bytes is
+ limited by some queue limits, such as max_segments.
+
+
+What: /sys/block/<disk>/atomic_write_unit_min_bytes
+Date: May 2023
+Contact: Himanshu Madhani <[email protected]>
+Description:
+ [RO] This parameter specifies the smallest block which can
+ be written atomically with an atomic write operation. All
+		atomic write operations must begin at an
+ atomic_write_unit_min boundary and must be multiples of
+ atomic_write_unit_min. This value must be a power-of-two.
+
+
+What: /sys/block/<disk>/atomic_write_unit_max_bytes
+Date: January 2023
+Contact: Himanshu Madhani <[email protected]>
+Description:
+ [RO] This parameter defines the largest block which can be
+ written atomically with an atomic write operation. This
+ value must be a multiple of atomic_write_unit_min and must
+ be a power-of-two.
+
+
+What: /sys/block/<disk>/atomic_write_boundary_bytes
+Date: May 2023
+Contact: Himanshu Madhani <[email protected]>
+Description:
+ [RO] A device may need to internally split I/Os which
+ straddle a given logical block address boundary. In that
+ case a single atomic write operation will be processed as
+		one or more sub-operations which each complete atomically.
+ This parameter specifies the size in bytes of the atomic
+ boundary if one is reported by the device. This value must
+ be a power-of-two.
+

What: /sys/block/<disk>/diskseq
Date: February 2021
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 0046b447268f..d151be394c98 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,10 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
+ lim->atomic_write_unit_min_sectors = 0;
+ lim->atomic_write_unit_max_sectors = 0;
+ lim->atomic_write_max_sectors = 0;
+ lim->atomic_write_boundary_sectors = 0;
}

/**
@@ -183,6 +187,62 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
}
EXPORT_SYMBOL(blk_queue_max_discard_sectors);

+/**
+ * blk_queue_atomic_write_max_bytes - set max bytes supported by
+ * the device for atomic write operations.
+ * @q: the request queue for the device
+ * @bytes: maximum bytes supported
+ */
+void blk_queue_atomic_write_max_bytes(struct request_queue *q,
+ unsigned int bytes)
+{
+ q->limits.atomic_write_max_sectors = bytes >> SECTOR_SHIFT;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
+
+/**
+ * blk_queue_atomic_write_boundary_bytes - Device's logical block address space
+ * which an atomic write should not cross.
+ * @q: the request queue for the device
+ * @bytes: must be a power-of-two.
+ */
+void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
+ unsigned int bytes)
+{
+ q->limits.atomic_write_boundary_sectors = bytes >> SECTOR_SHIFT;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_boundary_bytes);
+
+/**
+ * blk_queue_atomic_write_unit_min_sectors - smallest unit that can be written
+ * atomically to the device.
+ * @q: the request queue for the device
+ * @sectors: must be a power-of-two.
+ */
+void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
+ unsigned int sectors)
+{
+ struct queue_limits *limits = &q->limits;
+
+ limits->atomic_write_unit_min_sectors = sectors;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_unit_min_sectors);
+
+/**
+ * blk_queue_atomic_write_unit_max_sectors - largest unit that can be written
+ * atomically to the device.
+ * @q: the request queue for the device
+ * @sectors: must be a power-of-two.
+ */
+void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
+ unsigned int sectors)
+{
+ struct queue_limits *limits = &q->limits;
+
+ limits->atomic_write_unit_max_sectors = sectors;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);
+
/**
* blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
* @q: the request queue for the device
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0b2d04766324..4ebf148cf356 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -118,6 +118,30 @@ static ssize_t queue_max_discard_segments_show(struct request_queue *q,
return queue_var_show(queue_max_discard_segments(q), page);
}

+static ssize_t queue_atomic_write_max_bytes_show(struct request_queue *q,
+ char *page)
+{
+ return queue_var_show(queue_atomic_write_max_bytes(q), page);
+}
+
+static ssize_t queue_atomic_write_boundary_show(struct request_queue *q,
+ char *page)
+{
+ return queue_var_show(queue_atomic_write_boundary_bytes(q), page);
+}
+
+static ssize_t queue_atomic_write_unit_min_show(struct request_queue *q,
+ char *page)
+{
+ return queue_var_show(queue_atomic_write_unit_min_bytes(q), page);
+}
+
+static ssize_t queue_atomic_write_unit_max_show(struct request_queue *q,
+ char *page)
+{
+ return queue_var_show(queue_atomic_write_unit_max_bytes(q), page);
+}
+
static ssize_t queue_max_integrity_segments_show(struct request_queue *q, char *page)
{
return queue_var_show(q->limits.max_integrity_segments, page);
@@ -507,6 +531,11 @@ QUEUE_RO_ENTRY(queue_discard_max_hw, "discard_max_hw_bytes");
QUEUE_RW_ENTRY(queue_discard_max, "discard_max_bytes");
QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");

+QUEUE_RO_ENTRY(queue_atomic_write_max_bytes, "atomic_write_max_bytes");
+QUEUE_RO_ENTRY(queue_atomic_write_boundary, "atomic_write_boundary_bytes");
+QUEUE_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max_bytes");
+QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
+
QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes");
QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes");
@@ -634,6 +663,10 @@ static struct attribute *queue_attrs[] = {
&queue_discard_max_entry.attr,
&queue_discard_max_hw_entry.attr,
&queue_discard_zeroes_data_entry.attr,
+ &queue_atomic_write_max_bytes_entry.attr,
+ &queue_atomic_write_boundary_entry.attr,
+ &queue_atomic_write_unit_min_entry.attr,
+ &queue_atomic_write_unit_max_entry.attr,
&queue_write_same_max_entry.attr,
&queue_write_zeroes_max_entry.attr,
&queue_zone_append_max_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 51fa7ffdee83..ab53163dd187 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -309,6 +309,11 @@ struct queue_limits {
unsigned int discard_alignment;
unsigned int zone_write_granularity;

+ unsigned int atomic_write_boundary_sectors;
+ unsigned int atomic_write_max_sectors;
+ unsigned int atomic_write_unit_min_sectors;
+ unsigned int atomic_write_unit_max_sectors;
+
unsigned short max_segments;
unsigned short max_integrity_segments;
unsigned short max_discard_segments;
@@ -908,6 +913,14 @@ void blk_queue_zone_write_granularity(struct request_queue *q,
unsigned int size);
extern void blk_queue_alignment_offset(struct request_queue *q,
unsigned int alignment);
+void blk_queue_atomic_write_max_bytes(struct request_queue *q,
+ unsigned int bytes);
+void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
+ unsigned int sectors);
+void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
+ unsigned int sectors);
+void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
+ unsigned int bytes);
void disk_update_readahead(struct gendisk *disk);
extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
@@ -1312,6 +1325,30 @@ static inline int queue_dma_alignment(const struct request_queue *q)
return q ? q->limits.dma_alignment : 511;
}

+static inline unsigned int
+queue_atomic_write_unit_max_bytes(const struct request_queue *q)
+{
+ return q->limits.atomic_write_unit_max_sectors << SECTOR_SHIFT;
+}
+
+static inline unsigned int
+queue_atomic_write_unit_min_bytes(const struct request_queue *q)
+{
+ return q->limits.atomic_write_unit_min_sectors << SECTOR_SHIFT;
+}
+
+static inline unsigned int
+queue_atomic_write_boundary_bytes(const struct request_queue *q)
+{
+ return q->limits.atomic_write_boundary_sectors << SECTOR_SHIFT;
+}
+
+static inline unsigned int
+queue_atomic_write_max_bytes(const struct request_queue *q)
+{
+ return q->limits.atomic_write_max_sectors << SECTOR_SHIFT;
+}
+
static inline unsigned int bdev_dma_alignment(struct block_device *bdev)
{
return queue_dma_alignment(bdev_get_queue(bdev));
--
2.35.3

2023-12-12 11:13:41

by John Garry

Subject: [PATCH v2 13/16] scsi: sd: Add WRITE_ATOMIC_16 support

Add function sd_setup_atomic_cmnd() to set up a WRITE_ATOMIC_16
CDB for when the REQ_ATOMIC flag is set for the request.

Also add trace info.
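
For reference, the WRITE ATOMIC (16) CDB as built by
sd_setup_atomic_cmnd() (field layout per sbc4r22):

byte  0       opcode (0x9c)
byte  1       protect/DPO/FUA flags
bytes 2..9    LOGICAL BLOCK ADDRESS (big-endian)
bytes 10..11  ATOMIC BOUNDARY (big-endian)
bytes 12..13  TRANSFER LENGTH (big-endian)
bytes 14..15  group number / control (zeroed here)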

Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/scsi_trace.c | 22 ++++++++++++++++++++++
drivers/scsi/sd.c | 24 ++++++++++++++++++++++++
include/scsi/scsi_proto.h | 1 +
include/trace/events/scsi.h | 1 +
4 files changed, 48 insertions(+)

diff --git a/drivers/scsi/scsi_trace.c b/drivers/scsi/scsi_trace.c
index 41a950075913..3e47c4472a80 100644
--- a/drivers/scsi/scsi_trace.c
+++ b/drivers/scsi/scsi_trace.c
@@ -325,6 +325,26 @@ scsi_trace_zbc_out(struct trace_seq *p, unsigned char *cdb, int len)
return ret;
}

+static const char *
+scsi_trace_atomic_write16_out(struct trace_seq *p, unsigned char *cdb, int len)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+ unsigned int boundary_size;
+ unsigned int nr_blocks;
+ sector_t lba;
+
+ lba = get_unaligned_be64(&cdb[2]);
+ boundary_size = get_unaligned_be16(&cdb[10]);
+ nr_blocks = get_unaligned_be16(&cdb[12]);
+
+ trace_seq_printf(p, "lba=%llu txlen=%u boundary_size=%u",
+ lba, nr_blocks, boundary_size);
+
+ trace_seq_putc(p, 0);
+
+ return ret;
+}
+
static const char *
scsi_trace_varlen(struct trace_seq *p, unsigned char *cdb, int len)
{
@@ -385,6 +405,8 @@ scsi_trace_parse_cdb(struct trace_seq *p, unsigned char *cdb, int len)
return scsi_trace_zbc_in(p, cdb, len);
case ZBC_OUT:
return scsi_trace_zbc_out(p, cdb, len);
+ case WRITE_ATOMIC_16:
+ return scsi_trace_atomic_write16_out(p, cdb, len);
default:
return scsi_trace_misc(p, cdb, len);
}
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index fa857a08341f..10942e322253 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1240,6 +1240,26 @@ static int sd_cdl_dld(struct scsi_disk *sdkp, struct scsi_cmnd *scmd)
return (hint - IOPRIO_HINT_DEV_DURATION_LIMIT_1) + 1;
}

+static blk_status_t sd_setup_atomic_cmnd(struct scsi_cmnd *cmd,
+ sector_t lba, unsigned int nr_blocks,
+ bool boundary, unsigned char flags)
+{
+ cmd->cmd_len = 16;
+ cmd->cmnd[0] = WRITE_ATOMIC_16;
+ cmd->cmnd[1] = flags;
+ put_unaligned_be64(lba, &cmd->cmnd[2]);
+ put_unaligned_be16(nr_blocks, &cmd->cmnd[12]);
+ if (boundary)
+ put_unaligned_be16(nr_blocks, &cmd->cmnd[10]);
+ else
+ put_unaligned_be16(0, &cmd->cmnd[10]);
+ cmd->cmnd[14] = 0;
+ cmd->cmnd[15] = 0;
+
+ return BLK_STS_OK;
+}
+
static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
{
struct request *rq = scsi_cmd_to_rq(cmd);
@@ -1311,6 +1331,10 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
if (protect && sdkp->protection_type == T10_PI_TYPE2_PROTECTION) {
ret = sd_setup_rw32_cmnd(cmd, write, lba, nr_blocks,
protect | fua, dld);
+ } else if (rq->cmd_flags & REQ_ATOMIC && write) {
+ ret = sd_setup_atomic_cmnd(cmd, lba, nr_blocks,
+ sdkp->use_atomic_write_boundary,
+ protect | fua);
} else if (sdp->use_16_for_rw || (nr_blocks > 0xffff)) {
ret = sd_setup_rw16_cmnd(cmd, write, lba, nr_blocks,
protect | fua, dld);
diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h
index 07d65c1f59db..833de67305b5 100644
--- a/include/scsi/scsi_proto.h
+++ b/include/scsi/scsi_proto.h
@@ -119,6 +119,7 @@
#define WRITE_SAME_16 0x93
#define ZBC_OUT 0x94
#define ZBC_IN 0x95
+#define WRITE_ATOMIC_16 0x9c
#define SERVICE_ACTION_BIDIRECTIONAL 0x9d
#define SERVICE_ACTION_IN_16 0x9e
#define SERVICE_ACTION_OUT_16 0x9f
diff --git a/include/trace/events/scsi.h b/include/trace/events/scsi.h
index 8e2d9b1b0e77..05f1945ed204 100644
--- a/include/trace/events/scsi.h
+++ b/include/trace/events/scsi.h
@@ -102,6 +102,7 @@
scsi_opcode_name(WRITE_32), \
scsi_opcode_name(WRITE_SAME_32), \
scsi_opcode_name(ATA_16), \
+ scsi_opcode_name(WRITE_ATOMIC_16), \
scsi_opcode_name(ATA_12))

#define scsi_hostbyte_name(result) { result, #result }
--
2.35.3

2023-12-12 11:27:05

by John Garry

Subject: [PATCH v2 14/16] scsi: scsi_debug: Atomic write support

Add initial support for atomic writes.

As is the standard method, feed device properties via module params,
those being:
- atomic_max_size_blks
- atomic_alignment_blks
- atomic_granularity_blks
- atomic_max_size_with_boundary_blks
- atomic_max_boundary_blks

These just match sbc4r22 section 6.6.4 - Block limits VPD page.

We just support WRITE_ATOMIC_16.

The major change in the driver is how we lock the device for RW accesses.

Currently the driver uses a per-device lock for accessing device metadata
and "media" data (calls to do_device_access()) atomically for the duration
of the whole read/write command.

This does not suit verifying atomic writes, the reason being that
currently all reads/writes are atomic, so using atomic writes proves
nothing.

Change the device access model such that regular writes are only atomic
on a per-sector basis, while reads and atomic writes are fully atomic.

As mentioned, since accessing metadata and device media must be atomic,
continue to treat regular writes involving metadata - like discard or
PI - as fully atomic. We can improve this later.

Currently we only support the model where overlapping reads or writes
wait for the current access to complete before commencing an atomic
write. This is described in section 4.29.3.2 of the SBC. However, we
simplify things and wait for all accesses to complete (when issuing an
atomic write).
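
In other words, the new per-store locking maps roughly as follows (a
sketch of the model implemented below):

atomic write   -> macc_data_lck taken as writer (exclusive)
regular read   -> macc_data_lck as reader + per-sector read lock
regular write  -> macc_data_lck as reader + per-sector write lock
metadata ops   -> macc_meta_lck (discard/PI/zoned bookkeeping)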

Signed-off-by: John Garry <[email protected]>
---
drivers/scsi/scsi_debug.c | 590 +++++++++++++++++++++++++++++---------
1 file changed, 457 insertions(+), 133 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 512fae20c56c..645571b0fd2c 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -68,6 +68,8 @@ static const char *sdebug_version_date = "20210520";

/* Additional Sense Code (ASC) */
#define NO_ADDITIONAL_SENSE 0x0
+#define OVERLAP_ATOMIC_COMMAND_ASC 0x0
+#define OVERLAP_ATOMIC_COMMAND_ASCQ 0x23
#define LOGICAL_UNIT_NOT_READY 0x4
#define LOGICAL_UNIT_COMMUNICATION_FAILURE 0x8
#define UNRECOVERED_READ_ERR 0x11
@@ -102,6 +104,7 @@ static const char *sdebug_version_date = "20210520";
#define READ_BOUNDARY_ASCQ 0x7
#define ATTEMPT_ACCESS_GAP 0x9
#define INSUFF_ZONE_ASCQ 0xe
+/* see drivers/scsi/sense_codes.h */

/* Additional Sense Code Qualifier (ASCQ) */
#define ACK_NAK_TO 0x3
@@ -151,6 +154,12 @@ static const char *sdebug_version_date = "20210520";
#define DEF_VIRTUAL_GB 0
#define DEF_VPD_USE_HOSTNO 1
#define DEF_WRITESAME_LENGTH 0xFFFF
+#define DEF_ATOMIC_WR 0
+#define DEF_ATOMIC_WR_MAX_LENGTH 8192
+#define DEF_ATOMIC_WR_ALIGN 2
+#define DEF_ATOMIC_WR_GRAN 2
+#define DEF_ATOMIC_WR_MAX_LENGTH_BNDRY (DEF_ATOMIC_WR_MAX_LENGTH)
+#define DEF_ATOMIC_WR_MAX_BNDRY 128
#define DEF_STRICT 0
#define DEF_STATISTICS false
#define DEF_SUBMIT_QUEUES 1
@@ -373,7 +382,9 @@ struct sdebug_host_info {

/* There is an xarray of pointers to this struct's objects, one per host */
struct sdeb_store_info {
- rwlock_t macc_lck; /* for atomic media access on this store */
+ rwlock_t macc_data_lck; /* for media data access on this store */
+ rwlock_t macc_meta_lck; /* for atomic media meta access on this store */
+ rwlock_t macc_sector_lck; /* per-sector media data access on this store */
u8 *storep; /* user data storage (ram) */
struct t10_pi_tuple *dif_storep; /* protection info */
void *map_storep; /* provisioning map */
@@ -397,12 +408,20 @@ struct sdebug_defer {
enum sdeb_defer_type defer_t;
};

+struct sdebug_device_access_info {
+ bool atomic_write;
+ u64 lba;
+ u32 num;
+ struct scsi_cmnd *self;
+};
+
struct sdebug_queued_cmd {
/* corresponding bit set in in_use_bm[] in owning struct sdebug_queue
* instance indicates this slot is in use.
*/
struct sdebug_defer sd_dp;
struct scsi_cmnd *scmd;
+ struct sdebug_device_access_info *i;
};

struct sdebug_scsi_cmd {
@@ -462,7 +481,8 @@ enum sdeb_opcode_index {
SDEB_I_PRE_FETCH = 29, /* 10, 16 */
SDEB_I_ZONE_OUT = 30, /* 0x94+SA; includes no data xfer */
SDEB_I_ZONE_IN = 31, /* 0x95+SA; all have data-in */
- SDEB_I_LAST_ELEM_P1 = 32, /* keep this last (previous + 1) */
+ SDEB_I_ATOMIC_WRITE_16 = 32,
+ SDEB_I_LAST_ELEM_P1 = 33, /* keep this last (previous + 1) */
};


@@ -496,7 +516,8 @@ static const unsigned char opcode_ind_arr[256] = {
0, 0, 0, SDEB_I_VERIFY,
SDEB_I_PRE_FETCH, SDEB_I_SYNC_CACHE, 0, SDEB_I_WRITE_SAME,
SDEB_I_ZONE_OUT, SDEB_I_ZONE_IN, 0, 0,
- 0, 0, 0, 0, 0, 0, SDEB_I_SERV_ACT_IN_16, SDEB_I_SERV_ACT_OUT_16,
+ 0, 0, 0, 0,
+ SDEB_I_ATOMIC_WRITE_16, 0, SDEB_I_SERV_ACT_IN_16, SDEB_I_SERV_ACT_OUT_16,
/* 0xa0; 0xa0->0xbf: 12 byte cdbs */
SDEB_I_REPORT_LUNS, SDEB_I_ATA_PT, 0, SDEB_I_MAINT_IN,
SDEB_I_MAINT_OUT, 0, 0, 0,
@@ -544,6 +565,7 @@ static int resp_write_buffer(struct scsi_cmnd *, struct sdebug_dev_info *);
static int resp_sync_cache(struct scsi_cmnd *, struct sdebug_dev_info *);
static int resp_pre_fetch(struct scsi_cmnd *, struct sdebug_dev_info *);
static int resp_report_zones(struct scsi_cmnd *, struct sdebug_dev_info *);
+static int resp_atomic_write(struct scsi_cmnd *, struct sdebug_dev_info *);
static int resp_open_zone(struct scsi_cmnd *, struct sdebug_dev_info *);
static int resp_close_zone(struct scsi_cmnd *, struct sdebug_dev_info *);
static int resp_finish_zone(struct scsi_cmnd *, struct sdebug_dev_info *);
@@ -782,6 +804,11 @@ static const struct opcode_info_t opcode_info_arr[SDEB_I_LAST_ELEM_P1 + 1] = {
resp_report_zones, zone_in_iarr, /* ZONE_IN(16), REPORT ZONES) */
{16, 0x0 /* SA */, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xbf, 0xc7} },
+/* 31 */
+ {0, 0x0, 0x0, F_D_OUT | FF_MEDIA_IO,
+ resp_atomic_write, NULL, /* ATOMIC WRITE 16 */
+ {16, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff} },
/* sentinel */
{0xff, 0, 0, 0, NULL, NULL, /* terminating element */
{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} },
@@ -829,6 +856,13 @@ static unsigned int sdebug_unmap_granularity = DEF_UNMAP_GRANULARITY;
static unsigned int sdebug_unmap_max_blocks = DEF_UNMAP_MAX_BLOCKS;
static unsigned int sdebug_unmap_max_desc = DEF_UNMAP_MAX_DESC;
static unsigned int sdebug_write_same_length = DEF_WRITESAME_LENGTH;
+static unsigned int sdebug_atomic_wr = DEF_ATOMIC_WR;
+static unsigned int sdebug_atomic_wr_max_length = DEF_ATOMIC_WR_MAX_LENGTH;
+static unsigned int sdebug_atomic_wr_align = DEF_ATOMIC_WR_ALIGN;
+static unsigned int sdebug_atomic_wr_gran = DEF_ATOMIC_WR_GRAN;
+static unsigned int sdebug_atomic_wr_max_length_bndry =
+ DEF_ATOMIC_WR_MAX_LENGTH_BNDRY;
+static unsigned int sdebug_atomic_wr_max_bndry = DEF_ATOMIC_WR_MAX_BNDRY;
static int sdebug_uuid_ctl = DEF_UUID_CTL;
static bool sdebug_random = DEF_RANDOM;
static bool sdebug_per_host_store = DEF_PER_HOST_STORE;
@@ -1177,6 +1211,11 @@ static inline bool scsi_debug_lbp(void)
(sdebug_lbpu || sdebug_lbpws || sdebug_lbpws10);
}

+static inline bool scsi_debug_atomic_write(void)
+{
+ return 0 == sdebug_fake_rw && sdebug_atomic_wr;
+}
+
static void *lba2fake_store(struct sdeb_store_info *sip,
unsigned long long lba)
{
@@ -1804,6 +1843,14 @@ static int inquiry_vpd_b0(unsigned char *arr)
/* Maximum WRITE SAME Length */
put_unaligned_be64(sdebug_write_same_length, &arr[32]);

+ if (sdebug_atomic_wr) {
+ put_unaligned_be32(sdebug_atomic_wr_max_length, &arr[40]);
+ put_unaligned_be32(sdebug_atomic_wr_align, &arr[44]);
+ put_unaligned_be32(sdebug_atomic_wr_gran, &arr[48]);
+ put_unaligned_be32(sdebug_atomic_wr_max_length_bndry, &arr[52]);
+ put_unaligned_be32(sdebug_atomic_wr_max_bndry, &arr[56]);
+ }
+
return 0x3c; /* Mandatory page length for Logical Block Provisioning */
}

@@ -3305,15 +3352,240 @@ static inline struct sdeb_store_info *devip2sip(struct sdebug_dev_info *devip,
return xa_load(per_store_ap, devip->sdbg_host->si_idx);
}

+
+static inline void
+sdeb_read_lock(rwlock_t *lock)
+{
+ if (sdebug_no_rwlock)
+ __acquire(lock);
+ else
+ read_lock(lock);
+}
+
+static inline void
+sdeb_read_unlock(rwlock_t *lock)
+{
+ if (sdebug_no_rwlock)
+ __release(lock);
+ else
+ read_unlock(lock);
+}
+
+static inline void
+sdeb_write_lock(rwlock_t *lock)
+{
+ if (sdebug_no_rwlock)
+ __acquire(lock);
+ else
+ write_lock(lock);
+}
+
+static inline void
+sdeb_write_unlock(rwlock_t *lock)
+{
+ if (sdebug_no_rwlock)
+ __release(lock);
+ else
+ write_unlock(lock);
+}
+
+static inline void
+sdeb_data_read_lock(struct sdeb_store_info *sip)
+{
+ BUG_ON(!sip);
+
+ sdeb_read_lock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_read_unlock(struct sdeb_store_info *sip)
+{
+ BUG_ON(!sip);
+
+ sdeb_read_unlock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_write_lock(struct sdeb_store_info *sip)
+{
+ BUG_ON(!sip);
+
+ sdeb_write_lock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_write_unlock(struct sdeb_store_info *sip)
+{
+ BUG_ON(!sip);
+
+ sdeb_write_unlock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_sector_read_lock(struct sdeb_store_info *sip)
+{
+ BUG_ON(!sip);
+
+ sdeb_read_lock(&sip->macc_sector_lck);
+}
+
+static inline void
+sdeb_data_sector_read_unlock(struct sdeb_store_info *sip)
+{
+ BUG_ON(!sip);
+
+ sdeb_read_unlock(&sip->macc_sector_lck);
+}
+
+static inline void
+sdeb_data_sector_write_lock(struct sdeb_store_info *sip)
+{
+ BUG_ON(!sip);
+
+ sdeb_write_lock(&sip->macc_sector_lck);
+}
+
+static inline void
+sdeb_data_sector_write_unlock(struct sdeb_store_info *sip)
+{
+ BUG_ON(!sip);
+
+ sdeb_write_unlock(&sip->macc_sector_lck);
+}
+
+/*
+Atomic locking:
+We simplify the atomic model to allow only 1x atomic
+write and many non-atomic reads or writes for all
+LBAs.
+
+A RW lock has a similar behaviour:
+Only 1x writer and many readers.
+
+So use a RW lock for per-device read and write locking:
+An atomic access grabs the lock as a writer and
+non-atomic grabs the lock as a reader.
+*/
+
+static inline void
+sdeb_data_lock(struct sdeb_store_info *sip, bool atomic_write)
+{
+ if (atomic_write)
+ sdeb_data_write_lock(sip);
+ else
+ sdeb_data_read_lock(sip);
+}
+
+static inline void
+sdeb_data_unlock(struct sdeb_store_info *sip, bool atomic_write)
+{
+ if (atomic_write)
+ sdeb_data_write_unlock(sip);
+ else
+ sdeb_data_read_unlock(sip);
+}
+
+/* Allow many reads but only 1x write per sector */
+static inline void
+sdeb_data_sector_lock(struct sdeb_store_info *sip, bool do_write)
+{
+ if (do_write)
+ sdeb_data_sector_write_lock(sip);
+ else
+ sdeb_data_sector_read_lock(sip);
+}
+
+static inline void
+sdeb_data_sector_unlock(struct sdeb_store_info *sip, bool do_write)
+{
+ if (do_write)
+ sdeb_data_sector_write_unlock(sip);
+ else
+ sdeb_data_sector_read_unlock(sip);
+}
+
+static inline void
+sdeb_meta_read_lock(struct sdeb_store_info *sip)
+{
+ if (sdebug_no_rwlock) {
+ if (sip)
+ __acquire(&sip->macc_meta_lck);
+ else
+ __acquire(&sdeb_fake_rw_lck);
+ } else {
+ if (sip)
+ read_lock(&sip->macc_meta_lck);
+ else
+ read_lock(&sdeb_fake_rw_lck);
+ }
+}
+
+static inline void
+sdeb_meta_read_unlock(struct sdeb_store_info *sip)
+{
+ if (sdebug_no_rwlock) {
+ if (sip)
+ __release(&sip->macc_meta_lck);
+ else
+ __release(&sdeb_fake_rw_lck);
+ } else {
+ if (sip)
+ read_unlock(&sip->macc_meta_lck);
+ else
+ read_unlock(&sdeb_fake_rw_lck);
+ }
+}
+
+static inline void
+sdeb_meta_write_lock(struct sdeb_store_info *sip)
+{
+ if (sdebug_no_rwlock) {
+ if (sip)
+ __acquire(&sip->macc_meta_lck);
+ else
+ __acquire(&sdeb_fake_rw_lck);
+ } else {
+ if (sip)
+ write_lock(&sip->macc_meta_lck);
+ else
+ write_lock(&sdeb_fake_rw_lck);
+ }
+}
+
+static inline void
+sdeb_meta_write_unlock(struct sdeb_store_info *sip)
+{
+ if (sdebug_no_rwlock) {
+ if (sip)
+ __release(&sip->macc_meta_lck);
+ else
+ __release(&sdeb_fake_rw_lck);
+ } else {
+ if (sip)
+ write_unlock(&sip->macc_meta_lck);
+ else
+ write_unlock(&sdeb_fake_rw_lck);
+ }
+}
+
/* Returns number of bytes copied or -1 if error. */
static int do_device_access(struct sdeb_store_info *sip, struct scsi_cmnd *scp,
- u32 sg_skip, u64 lba, u32 num, bool do_write)
+ u32 sg_skip, u64 lba, u32 num, bool do_write,
+ bool atomic_write)
{
int ret;
- u64 block, rest = 0;
+ u64 block;
enum dma_data_direction dir;
struct scsi_data_buffer *sdb = &scp->sdb;
u8 *fsp;
+ int i;
+
+ /*
+ * Even though reads are inherently atomic (in this driver), we expect
+ * the atomic flag only for writes.
+ */
+ if (!do_write && atomic_write)
+ return -1;

if (do_write) {
dir = DMA_TO_DEVICE;
@@ -3329,21 +3601,26 @@ static int do_device_access(struct sdeb_store_info *sip, struct scsi_cmnd *scp,
fsp = sip->storep;

block = do_div(lba, sdebug_store_sectors);
- if (block + num > sdebug_store_sectors)
- rest = block + num - sdebug_store_sectors;

- ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
+ /* Only allow 1x atomic write or multiple non-atomic writes at any given time */
+ sdeb_data_lock(sip, atomic_write);
+ for (i = 0; i < num; i++) {
+ /* We shouldn't need to lock for atomic writes, but do it anyway */
+ sdeb_data_sector_lock(sip, do_write);
+ ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
fsp + (block * sdebug_sector_size),
- (num - rest) * sdebug_sector_size, sg_skip, do_write);
- if (ret != (num - rest) * sdebug_sector_size)
- return ret;
-
- if (rest) {
- ret += sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
- fsp, rest * sdebug_sector_size,
- sg_skip + ((num - rest) * sdebug_sector_size),
- do_write);
+ sdebug_sector_size, sg_skip, do_write);
+ sdeb_data_sector_unlock(sip, do_write);
+		if (ret != sdebug_sector_size) {
+			ret += (i * sdebug_sector_size);
+			goto out_unlock;
+		}
+		sg_skip += sdebug_sector_size;
+		if (++block >= sdebug_store_sectors)
+			block = 0;
 	}
+	ret = num * sdebug_sector_size;
+out_unlock:
+	sdeb_data_unlock(sip, atomic_write);

return ret;
}
@@ -3519,70 +3796,6 @@ static int prot_verify_read(struct scsi_cmnd *scp, sector_t start_sec,
return ret;
}

-static inline void
-sdeb_read_lock(struct sdeb_store_info *sip)
-{
- if (sdebug_no_rwlock) {
- if (sip)
- __acquire(&sip->macc_lck);
- else
- __acquire(&sdeb_fake_rw_lck);
- } else {
- if (sip)
- read_lock(&sip->macc_lck);
- else
- read_lock(&sdeb_fake_rw_lck);
- }
-}
-
-static inline void
-sdeb_read_unlock(struct sdeb_store_info *sip)
-{
- if (sdebug_no_rwlock) {
- if (sip)
- __release(&sip->macc_lck);
- else
- __release(&sdeb_fake_rw_lck);
- } else {
- if (sip)
- read_unlock(&sip->macc_lck);
- else
- read_unlock(&sdeb_fake_rw_lck);
- }
-}
-
-static inline void
-sdeb_write_lock(struct sdeb_store_info *sip)
-{
- if (sdebug_no_rwlock) {
- if (sip)
- __acquire(&sip->macc_lck);
- else
- __acquire(&sdeb_fake_rw_lck);
- } else {
- if (sip)
- write_lock(&sip->macc_lck);
- else
- write_lock(&sdeb_fake_rw_lck);
- }
-}
-
-static inline void
-sdeb_write_unlock(struct sdeb_store_info *sip)
-{
- if (sdebug_no_rwlock) {
- if (sip)
- __release(&sip->macc_lck);
- else
- __release(&sdeb_fake_rw_lck);
- } else {
- if (sip)
- write_unlock(&sip->macc_lck);
- else
- write_unlock(&sdeb_fake_rw_lck);
- }
-}
-
static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
{
bool check_prot;
@@ -3592,6 +3805,7 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
u64 lba;
struct sdeb_store_info *sip = devip2sip(devip, true);
u8 *cmd = scp->cmnd;
+ bool meta_data_locked = false;

switch (cmd[0]) {
case READ_16:
@@ -3650,6 +3864,10 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
atomic_set(&sdeb_inject_pending, 0);
}

+ /*
+ * When checking device access params, for reads we only check data
+ * versus what is set at init time, so no need to lock.
+ */
ret = check_device_access_params(scp, lba, num, false);
if (ret)
return ret;
@@ -3669,29 +3887,33 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
return check_condition_result;
}

- sdeb_read_lock(sip);
+ if (sdebug_dev_is_zoned(devip) ||
+ (sdebug_dix && scsi_prot_sg_count(scp))) {
+ sdeb_meta_read_lock(sip);
+ meta_data_locked = true;
+ }

/* DIX + T10 DIF */
if (unlikely(sdebug_dix && scsi_prot_sg_count(scp))) {
switch (prot_verify_read(scp, lba, num, ei_lba)) {
case 1: /* Guard tag error */
if (cmd[1] >> 5 != 3) { /* RDPROTECT != 3 */
- sdeb_read_unlock(sip);
+ sdeb_meta_read_unlock(sip);
mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 1);
return check_condition_result;
} else if (scp->prot_flags & SCSI_PROT_GUARD_CHECK) {
- sdeb_read_unlock(sip);
+ sdeb_meta_read_unlock(sip);
mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 1);
return illegal_condition_result;
}
break;
case 3: /* Reference tag error */
if (cmd[1] >> 5 != 3) { /* RDPROTECT != 3 */
- sdeb_read_unlock(sip);
+ sdeb_meta_read_unlock(sip);
mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 3);
return check_condition_result;
} else if (scp->prot_flags & SCSI_PROT_REF_CHECK) {
- sdeb_read_unlock(sip);
+ sdeb_meta_read_unlock(sip);
mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 3);
return illegal_condition_result;
}
@@ -3699,8 +3921,9 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
}
}

- ret = do_device_access(sip, scp, 0, lba, num, false);
- sdeb_read_unlock(sip);
+ ret = do_device_access(sip, scp, 0, lba, num, false, false);
+ if (meta_data_locked)
+ sdeb_meta_read_unlock(sip);
if (unlikely(ret == -1))
return DID_ERROR << 16;

@@ -3889,6 +4112,7 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
u64 lba;
struct sdeb_store_info *sip = devip2sip(devip, true);
u8 *cmd = scp->cmnd;
+ bool meta_data_locked = false;

switch (cmd[0]) {
case WRITE_16:
@@ -3942,10 +4166,17 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
"to DIF device\n");
}

- sdeb_write_lock(sip);
+ if (sdebug_dev_is_zoned(devip) ||
+ (sdebug_dix && scsi_prot_sg_count(scp)) ||
+ scsi_debug_lbp()) {
+ sdeb_meta_write_lock(sip);
+ meta_data_locked = true;
+ }
+
ret = check_device_access_params(scp, lba, num, true);
if (ret) {
- sdeb_write_unlock(sip);
+ if (meta_data_locked)
+ sdeb_meta_write_unlock(sip);
return ret;
}

@@ -3954,22 +4185,22 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
switch (prot_verify_write(scp, lba, num, ei_lba)) {
case 1: /* Guard tag error */
if (scp->prot_flags & SCSI_PROT_GUARD_CHECK) {
- sdeb_write_unlock(sip);
+ sdeb_meta_write_unlock(sip);
mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 1);
return illegal_condition_result;
} else if (scp->cmnd[1] >> 5 != 3) { /* WRPROTECT != 3 */
- sdeb_write_unlock(sip);
+ sdeb_meta_write_unlock(sip);
mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 1);
return check_condition_result;
}
break;
case 3: /* Reference tag error */
if (scp->prot_flags & SCSI_PROT_REF_CHECK) {
- sdeb_write_unlock(sip);
+ sdeb_meta_write_unlock(sip);
mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 3);
return illegal_condition_result;
} else if (scp->cmnd[1] >> 5 != 3) { /* WRPROTECT != 3 */
- sdeb_write_unlock(sip);
+ sdeb_meta_write_unlock(sip);
mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 3);
return check_condition_result;
}
@@ -3977,13 +4208,16 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
}
}

- ret = do_device_access(sip, scp, 0, lba, num, true);
+ ret = do_device_access(sip, scp, 0, lba, num, true, false);
if (unlikely(scsi_debug_lbp()))
map_region(sip, lba, num);
+
/* If ZBC zone then bump its write pointer */
if (sdebug_dev_is_zoned(devip))
zbc_inc_wp(devip, lba, num);
- sdeb_write_unlock(sip);
+ if (meta_data_locked)
+ sdeb_meta_write_unlock(sip);
+
if (unlikely(-1 == ret))
return DID_ERROR << 16;
else if (unlikely(sdebug_verbose &&
@@ -4090,7 +4324,8 @@ static int resp_write_scat(struct scsi_cmnd *scp,
goto err_out;
}

- sdeb_write_lock(sip);
+ /* Just keep it simple and always lock for now */
+ sdeb_meta_write_lock(sip);
sg_off = lbdof_blen;
/* Spec says Buffer xfer Length field in number of LBs in dout */
cum_lb = 0;
@@ -4133,7 +4368,11 @@ static int resp_write_scat(struct scsi_cmnd *scp,
}
}

- ret = do_device_access(sip, scp, sg_off, lba, num, true);
+ /*
+ * Write ranges atomically, to keep as close to the pre-atomic
+ * write behaviour as possible.
+ */
+ ret = do_device_access(sip, scp, sg_off, lba, num, true, true);
/* If ZBC zone then bump its write pointer */
if (sdebug_dev_is_zoned(devip))
zbc_inc_wp(devip, lba, num);
@@ -4172,7 +4411,7 @@ static int resp_write_scat(struct scsi_cmnd *scp,
}
ret = 0;
err_out_unlock:
- sdeb_write_unlock(sip);
+ sdeb_meta_write_unlock(sip);
err_out:
kfree(lrdp);
return ret;
@@ -4191,14 +4430,16 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
scp->device->hostdata, true);
u8 *fs1p;
u8 *fsp;
+ bool meta_data_locked = false;

- sdeb_write_lock(sip);
+ if (sdebug_dev_is_zoned(devip) || scsi_debug_lbp()) {
+ sdeb_meta_write_lock(sip);
+ meta_data_locked = true;
+ }

ret = check_device_access_params(scp, lba, num, true);
- if (ret) {
- sdeb_write_unlock(sip);
- return ret;
- }
+ if (ret)
+ goto out;

if (unmap && scsi_debug_lbp()) {
unmap_region(sip, lba, num);
@@ -4209,6 +4450,7 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
/* if ndob then zero 1 logical block, else fetch 1 logical block */
fsp = sip->storep;
fs1p = fsp + (block * lb_size);
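+ /* Take the data lock just around the actual backing-store update */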
+ sdeb_data_write_lock(sip);
if (ndob) {
memset(fs1p, 0, lb_size);
ret = 0;
@@ -4216,8 +4458,8 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
ret = fetch_to_dev_buffer(scp, fs1p, lb_size);

if (-1 == ret) {
- sdeb_write_unlock(sip);
- return DID_ERROR << 16;
+ ret = DID_ERROR << 16;
+ goto out;
} else if (sdebug_verbose && !ndob && (ret < lb_size))
sdev_printk(KERN_INFO, scp->device,
"%s: %s: lb size=%u, IO sent=%d bytes\n",
@@ -4234,10 +4476,12 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
/* If ZBC zone then bump its write pointer */
if (sdebug_dev_is_zoned(devip))
zbc_inc_wp(devip, lba, num);
+ sdeb_data_write_unlock(sip);
+ ret = 0;
out:
- sdeb_write_unlock(sip);
-
- return 0;
+ if (meta_data_locked)
+ sdeb_meta_write_unlock(sip);
+ return ret;
}

static int resp_write_same_10(struct scsi_cmnd *scp,
@@ -4380,25 +4624,30 @@ static int resp_comp_write(struct scsi_cmnd *scp,
return check_condition_result;
}

- sdeb_write_lock(sip);
-
ret = do_dout_fetch(scp, dnum, arr);
if (ret == -1) {
retval = DID_ERROR << 16;
- goto cleanup;
+ goto cleanup_free;
} else if (sdebug_verbose && (ret < (dnum * lb_size)))
sdev_printk(KERN_INFO, scp->device, "%s: compare_write: cdb "
"indicated=%u, IO sent=%d bytes\n", my_name,
dnum * lb_size, ret);
+
+ sdeb_data_write_lock(sip);
+ sdeb_meta_write_lock(sip);
if (!comp_write_worker(sip, lba, num, arr, false)) {
mk_sense_buffer(scp, MISCOMPARE, MISCOMPARE_VERIFY_ASC, 0);
retval = check_condition_result;
- goto cleanup;
+ goto cleanup_unlock;
}
+
+ /* Cover sip->map_storep (which map_region() sets) with data lock */
if (scsi_debug_lbp())
map_region(sip, lba, num);
-cleanup:
- sdeb_write_unlock(sip);
+cleanup_unlock:
+ sdeb_meta_write_unlock(sip);
+ sdeb_data_write_unlock(sip);
+cleanup_free:
kfree(arr);
return retval;
}
@@ -4442,7 +4691,7 @@ static int resp_unmap(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)

desc = (void *)&buf[8];

- sdeb_write_lock(sip);
+ sdeb_meta_write_lock(sip);

for (i = 0 ; i < descriptors ; i++) {
unsigned long long lba = get_unaligned_be64(&desc[i].lba);
@@ -4458,7 +4707,7 @@ static int resp_unmap(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
ret = 0;

out:
- sdeb_write_unlock(sip);
+ sdeb_meta_write_unlock(sip);
kfree(buf);

return ret;
@@ -4571,12 +4820,13 @@ static int resp_pre_fetch(struct scsi_cmnd *scp,
rest = block + nblks - sdebug_store_sectors;

/* Try to bring the PRE-FETCH range into CPU's cache */
- sdeb_read_lock(sip);
+ sdeb_data_read_lock(sip);
prefetch_range(fsp + (sdebug_sector_size * block),
(nblks - rest) * sdebug_sector_size);
if (rest)
prefetch_range(fsp, rest * sdebug_sector_size);
- sdeb_read_unlock(sip);
+
+ sdeb_data_read_unlock(sip);
fini:
if (cmd[1] & 0x2)
res = SDEG_RES_IMMED_MASK;
@@ -4735,7 +4985,7 @@ static int resp_verify(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
return check_condition_result;
}
/* Not changing store, so only need read access */
- sdeb_read_lock(sip);
+ sdeb_data_read_lock(sip);

ret = do_dout_fetch(scp, a_num, arr);
if (ret == -1) {
@@ -4757,7 +5007,7 @@ static int resp_verify(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
goto cleanup;
}
cleanup:
- sdeb_read_unlock(sip);
+ sdeb_data_read_unlock(sip);
kfree(arr);
return ret;
}
@@ -4803,7 +5053,7 @@ static int resp_report_zones(struct scsi_cmnd *scp,
return check_condition_result;
}

- sdeb_read_lock(sip);
+ sdeb_meta_read_lock(sip);

desc = arr + 64;
for (lba = zs_lba; lba < sdebug_capacity;
@@ -4901,11 +5151,70 @@ static int resp_report_zones(struct scsi_cmnd *scp,
ret = fill_from_dev_buffer(scp, arr, min_t(u32, alloc_len, rep_len));

fini:
- sdeb_read_unlock(sip);
+ sdeb_meta_read_unlock(sip);
kfree(arr);
return ret;
}

+static int resp_atomic_write(struct scsi_cmnd *scp,
+ struct sdebug_dev_info *devip)
+{
+ struct sdeb_store_info *sip;
+ u8 *cmd = scp->cmnd;
+ u16 boundary, len;
+ u64 lba, lba_tmp;
+ int ret;
+
+ if (!scsi_debug_atomic_write()) {
+ mk_sense_invalid_opcode(scp);
+ return check_condition_result;
+ }
+
+ sip = devip2sip(devip, true);
+
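+ /*
+  * WRITE ATOMIC(16) CDB: 64-bit LBA in bytes 2-9, ATOMIC BOUNDARY in
+  * bytes 10-11 and TRANSFER LENGTH (in logical blocks) in bytes 12-13.
+  */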
+ lba = get_unaligned_be64(cmd + 2);
+ boundary = get_unaligned_be16(cmd + 10);
+ len = get_unaligned_be16(cmd + 12);
+
+ lba_tmp = lba;
+ if (sdebug_atomic_wr_align &&
+ do_div(lba_tmp, sdebug_atomic_wr_align)) {
+ /* Does not meet alignment requirement */
+ mk_sense_buffer(scp, ILLEGAL_REQUEST, INVALID_FIELD_IN_CDB, 0);
+ return check_condition_result;
+ }
+
+ if (sdebug_atomic_wr_gran && len % sdebug_atomic_wr_gran) {
+ /* Does not meet granularity requirement */
+ mk_sense_buffer(scp, ILLEGAL_REQUEST, INVALID_FIELD_IN_CDB, 0);
+ return check_condition_result;
+ }
+
+ if (boundary > 0) {
+ if (boundary > sdebug_atomic_wr_max_bndry) {
+ mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1);
+ return check_condition_result;
+ }
+
+ if (len > sdebug_atomic_wr_max_length_bndry) {
+ mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1);
+ return check_condition_result;
+ }
+ } else {
+ if (len > sdebug_atomic_wr_max_length) {
+ mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1);
+ return check_condition_result;
+ }
+ }
+
+ ret = do_device_access(sip, scp, 0, lba, len, true, true);
+ if (unlikely(ret == -1))
+ return DID_ERROR << 16;
+ if (unlikely(ret != len * sdebug_sector_size))
+ return DID_ERROR << 16;
+ return 0;
+}
+
/* Logic transplanted from tcmu-runner, file_zbc.c */
static void zbc_open_all(struct sdebug_dev_info *devip)
{
@@ -4932,8 +5241,7 @@ static int resp_open_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
mk_sense_invalid_opcode(scp);
return check_condition_result;
}
-
- sdeb_write_lock(sip);
+ sdeb_meta_write_lock(sip);

if (all) {
/* Check if all closed zones can be open */
@@ -4982,7 +5290,7 @@ static int resp_open_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)

zbc_open_zone(devip, zsp, true);
fini:
- sdeb_write_unlock(sip);
+ sdeb_meta_write_unlock(sip);
return res;
}

@@ -5009,7 +5317,7 @@ static int resp_close_zone(struct scsi_cmnd *scp,
return check_condition_result;
}

- sdeb_write_lock(sip);
+ sdeb_meta_write_lock(sip);

if (all) {
zbc_close_all(devip);
@@ -5038,7 +5346,7 @@ static int resp_close_zone(struct scsi_cmnd *scp,

zbc_close_zone(devip, zsp);
fini:
- sdeb_write_unlock(sip);
+ sdeb_meta_write_unlock(sip);
return res;
}

@@ -5081,7 +5389,7 @@ static int resp_finish_zone(struct scsi_cmnd *scp,
return check_condition_result;
}

- sdeb_write_lock(sip);
+ sdeb_meta_write_lock(sip);

if (all) {
zbc_finish_all(devip);
@@ -5110,7 +5418,7 @@ static int resp_finish_zone(struct scsi_cmnd *scp,

zbc_finish_zone(devip, zsp, true);
fini:
- sdeb_write_unlock(sip);
+ sdeb_meta_write_unlock(sip);
return res;
}

@@ -5161,7 +5469,7 @@ static int resp_rwp_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
return check_condition_result;
}

- sdeb_write_lock(sip);
+ sdeb_meta_write_lock(sip);

if (all) {
zbc_rwp_all(devip);
@@ -5189,7 +5497,7 @@ static int resp_rwp_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)

zbc_rwp_zone(devip, zsp);
fini:
- sdeb_write_unlock(sip);
+ sdeb_meta_write_unlock(sip);
return res;
}

@@ -5216,6 +5524,7 @@ static void sdebug_q_cmd_complete(struct sdebug_defer *sd_dp)
if (!scp) {
pr_err("scmd=NULL\n");
goto out;
+
}

sdsc = scsi_cmd_priv(scp);
@@ -6153,6 +6462,7 @@ module_param_named(lbprz, sdebug_lbprz, int, S_IRUGO);
module_param_named(lbpu, sdebug_lbpu, int, S_IRUGO);
module_param_named(lbpws, sdebug_lbpws, int, S_IRUGO);
module_param_named(lbpws10, sdebug_lbpws10, int, S_IRUGO);
+module_param_named(atomic_wr, sdebug_atomic_wr, int, S_IRUGO);
module_param_named(lowest_aligned, sdebug_lowest_aligned, int, S_IRUGO);
module_param_named(lun_format, sdebug_lun_am_i, int, S_IRUGO | S_IWUSR);
module_param_named(max_luns, sdebug_max_luns, int, S_IRUGO | S_IWUSR);
@@ -6187,6 +6497,11 @@ module_param_named(unmap_alignment, sdebug_unmap_alignment, int, S_IRUGO);
module_param_named(unmap_granularity, sdebug_unmap_granularity, int, S_IRUGO);
module_param_named(unmap_max_blocks, sdebug_unmap_max_blocks, int, S_IRUGO);
module_param_named(unmap_max_desc, sdebug_unmap_max_desc, int, S_IRUGO);
+module_param_named(atomic_wr_max_length, sdebug_atomic_wr_max_length, int, S_IRUGO);
+module_param_named(atomic_wr_align, sdebug_atomic_wr_align, int, S_IRUGO);
+module_param_named(atomic_wr_gran, sdebug_atomic_wr_gran, int, S_IRUGO);
+module_param_named(atomic_wr_max_length_bndry, sdebug_atomic_wr_max_length_bndry, int, S_IRUGO);
+module_param_named(atomic_wr_max_bndry, sdebug_atomic_wr_max_bndry, int, S_IRUGO);
module_param_named(uuid_ctl, sdebug_uuid_ctl, int, S_IRUGO);
module_param_named(virtual_gb, sdebug_virtual_gb, int, S_IRUGO | S_IWUSR);
module_param_named(vpd_use_hostno, sdebug_vpd_use_hostno, int,
@@ -6230,6 +6545,7 @@ MODULE_PARM_DESC(lbprz,
MODULE_PARM_DESC(lbpu, "enable LBP, support UNMAP command (def=0)");
MODULE_PARM_DESC(lbpws, "enable LBP, support WRITE SAME(16) with UNMAP bit (def=0)");
MODULE_PARM_DESC(lbpws10, "enable LBP, support WRITE SAME(10) with UNMAP bit (def=0)");
+MODULE_PARM_DESC(atomic_wr, "enable ATOMIC WRITE support, support WRITE ATOMIC(16) (def=1)");
MODULE_PARM_DESC(lowest_aligned, "lowest aligned lba (def=0)");
MODULE_PARM_DESC(lun_format, "LUN format: 0->peripheral (def); 1 --> flat address method");
MODULE_PARM_DESC(max_luns, "number of LUNs per target to simulate(def=1)");
@@ -6261,6 +6577,11 @@ MODULE_PARM_DESC(unmap_alignment, "lowest aligned thin provisioning lba (def=0)"
MODULE_PARM_DESC(unmap_granularity, "thin provisioning granularity in blocks (def=1)");
MODULE_PARM_DESC(unmap_max_blocks, "max # of blocks can be unmapped in one cmd (def=0xffffffff)");
MODULE_PARM_DESC(unmap_max_desc, "max # of ranges that can be unmapped in one cmd (def=256)");
+MODULE_PARM_DESC(atomic_wr_max_length, "max # of blocks can be atomically written in one cmd (def=8192)");
+MODULE_PARM_DESC(atomic_wr_align, "minimum alignment of atomic write in blocks (def=2)");
+MODULE_PARM_DESC(atomic_wr_gran, "minimum granularity of atomic write in blocks (def=2)");
+MODULE_PARM_DESC(atomic_wr_max_length_bndry, "max # of blocks can be atomically written in one cmd with boundary set (def=8192)");
+MODULE_PARM_DESC(atomic_wr_max_bndry, "max # boundaries per atomic write (def=128)");
MODULE_PARM_DESC(uuid_ctl,
"1->use uuid for lu name, 0->don't, 2->all use same (def=0)");
MODULE_PARM_DESC(virtual_gb, "virtual gigabyte (GiB) size (def=0 -> use dev_size_mb)");
@@ -7407,6 +7728,7 @@ static int __init scsi_debug_init(void)
return -EINVAL;
}
}
+
xa_init_flags(per_store_ap, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
if (want_store) {
idx = sdebug_add_store();
@@ -7614,7 +7936,9 @@ static int sdebug_add_store(void)
map_region(sip, 0, 2);
}

- rwlock_init(&sip->macc_lck);
+ rwlock_init(&sip->macc_data_lck);
+ rwlock_init(&sip->macc_meta_lck);
+ rwlock_init(&sip->macc_sector_lck);
return (int)n_idx;
err:
sdebug_erase_store((int)n_idx, sip);
--
2.35.3

2023-12-12 16:33:06

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Tue, Dec 12, 2023 at 11:08:28AM +0000, John Garry wrote:
> Two new fields are added to struct statx - atomic_write_unit_min and
> atomic_write_unit_max. For each atomic individual write, the total length
> of a write must be between atomic_write_unit_min and
> atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
> at a natural offset in the file wrt the write length.
>
> SCSI sd.c and scsi_debug and NVMe kernel support is added.
>
> Some open questions:
> - How to make API extensible for when we have no HW support? In that case,
> we would prob not have to follow rule of power-of-2 length et al.
> As a possible solution, maybe we can say that atomic writes are
> supported for the file via statx, but not set unit_min and max values,
> and this means that writes need to be just FS block aligned there.

I don't think the power of two length is much of a problem to be
honest, and if we ever want to lift it we can still do that easily
by adding a new flag or limit.

What I'm a lot more worried about is how to tell the file system that
allocations are done right for these requirements. There is no way
a user can know that allocations in an existing file are properly
aligned, so atomic writes will just fail on existing files.

I suspect we need an on-disk flag that forces allocations to be
aligned to the atomic write limit, in some ways similar to how the
XFS rt flag works. You'd need to set it on an empty file, and all
allocations after that are guaranteed to be properly aligned.

> - For block layer, should atomic_write_unit_max be limited by
> max_sectors_kb? Currently it is not.

Well. It must be limited to max_hw_sectors to actually work.
max_sectors is a software limit below that, which with modern hardware
is actually pretty silly and a real performance issue with today's
workloads when people don't tweak it.

> - How to improve requirement that iovecs are PAGE-aligned.
> There are 2x issues:
> a. We impose this rule to not split BIOs due to virt boundary for
> NVMe, but there virt boundary is 4K (and not PAGE size, so broken for
> 16K/64K pages). Easy solution is to impose requirement that iovecs
> are 4K-aligned.
> b. We don't enforce this rule for virt boundary == 0, i.e. SCSI

.. we require any device that wants to support atomic writes to not
have that silly limit. For NVMe that would require SGL support
(and some driver changes I've been wanting to make for a long time,
where we always use SGLs for transfers larger than a single PRP if
supported).


> - Since debugging torn-writes due to unwanted kernel BIO splitting/merging
> would be horrible, should we add some kernel storage stack software
> integrity checks?

Yes, I think we'll need asserts in the drivers. At least for NVMe I
will insist on them. For SCSI I think the device actually checks
because the atomic writes are a different command anyway, or am I
misunderstanding how SCSI works here?
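
For reference, the rules quoted at the top of this mail boil down to a
userspace flow like the sketch below. RWF_ATOMIC and the statx limits are
only being proposed by this series, so the flag value and names here are
placeholders, not a final interface:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040 /* placeholder: not in released uapi headers */
#endif

/*
 * Write one application block without tearing. Per the proposal, len must
 * be a power-of-2 between atomic_write_unit_min and atomic_write_unit_max
 * (as reported by statx) and off must be naturally aligned to len.
 */
static ssize_t write_block_atomic(int fd, const void *buf, size_t len,
				  off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

	if (len & (len - 1))	/* not a power-of-2 */
		return -1;
	if (off % len)		/* not naturally aligned */
		return -1;
	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}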

2023-12-13 01:26:15

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits

On Tue, Dec 12, 2023 at 11:08:29AM +0000, John Garry wrote:
> From: Himanshu Madhani <[email protected]>
>
> Add the following limits:
> - atomic_write_boundary_bytes
> - atomic_write_max_bytes
> - atomic_write_unit_max_bytes
> - atomic_write_unit_min_bytes
>
> All atomic write limits are initialised to 0 to indicate no atomic write
> support. Stacked devices are not supported for now.
>
> Signed-off-by: Himanshu Madhani <[email protected]>
> #jpg: Heavy rewrite
> Signed-off-by: John Garry <[email protected]>
> ---
> Documentation/ABI/stable/sysfs-block | 47 ++++++++++++++++++++++
> block/blk-settings.c | 60 ++++++++++++++++++++++++++++
> block/blk-sysfs.c | 33 +++++++++++++++
> include/linux/blkdev.h | 37 +++++++++++++++++
> 4 files changed, 177 insertions(+)
>
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index 1fe9a553c37b..ba81a081522f 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -21,6 +21,53 @@ Description:
> device is offset from the internal allocation unit's
> natural alignment.
>
> +What: /sys/block/<disk>/atomic_write_max_bytes
> +Date: May 2023
> +Contact: Himanshu Madhani <[email protected]>
> +Description:
> + [RO] This parameter specifies the maximum atomic write
> + size reported by the device. This parameter is relevant
> + for merging of writes, where a merged atomic write
> + operation must not exceed this number of bytes.
> + The atomic_write_max_bytes may exceed the value in
> + atomic_write_unit_max_bytes if atomic_write_max_bytes
> + is not a power-of-two or atomic_write_unit_max_bytes is
> + limited by some queue limits, such as max_segments.
> +
> +
> +What: /sys/block/<disk>/atomic_write_unit_min_bytes
> +Date: May 2023
> +Contact: Himanshu Madhani <[email protected]>
> +Description:
> + [RO] This parameter specifies the smallest block which can
> + be written atomically with an atomic write operation. All
> + atomic write operations must begin at an
> + atomic_write_unit_min boundary and must be multiples of
> + atomic_write_unit_min. This value must be a power-of-two.
> +
> +
> +What: /sys/block/<disk>/atomic_write_unit_max_bytes
> +Date: January 2023
> +Contact: Himanshu Madhani <[email protected]>
> +Description:
> + [RO] This parameter defines the largest block which can be
> + written atomically with an atomic write operation. This
> + value must be a multiple of atomic_write_unit_min and must
> + be a power-of-two.
> +
> +
> +What: /sys/block/<disk>/atomic_write_boundary_bytes
> +Date: May 2023
> +Contact: Himanshu Madhani <[email protected]>
> +Description:
> + [RO] A device may need to internally split I/Os which
> + straddle a given logical block address boundary. In that
> + case a single atomic write operation will be processed as
> + one or more sub-operations which each complete atomically.
> + This parameter specifies the size in bytes of the atomic
> + boundary if one is reported by the device. This value must
> + be a power-of-two.
> +
>
> What: /sys/block/<disk>/diskseq
> Date: February 2021
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 0046b447268f..d151be394c98 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -59,6 +59,10 @@ void blk_set_default_limits(struct queue_limits *lim)
> lim->zoned = BLK_ZONED_NONE;
> lim->zone_write_granularity = 0;
> lim->dma_alignment = 511;
> + lim->atomic_write_unit_min_sectors = 0;
> + lim->atomic_write_unit_max_sectors = 0;
> + lim->atomic_write_max_sectors = 0;
> + lim->atomic_write_boundary_sectors = 0;

Can we move the four into a single structure and set them up in a single
API? Then cross-validation can be done in that API.

> }
>
> /**
> @@ -183,6 +187,62 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
> }
> EXPORT_SYMBOL(blk_queue_max_discard_sectors);
>
> +/**
> + * blk_queue_atomic_write_max_bytes - set max bytes supported by
> + * the device for atomic write operations.
> + * @q: the request queue for the device
> + * @size: maximum bytes supported
> + */
> +void blk_queue_atomic_write_max_bytes(struct request_queue *q,
> + unsigned int bytes)
> +{
> + q->limits.atomic_write_max_sectors = bytes >> SECTOR_SHIFT;
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);

What if the driver doesn't call it but the driver supports atomic writes?

I guess the default max sectors should be atomic_write_unit_max_sectors
if the feature is enabled.

> +
> +/**
> + * blk_queue_atomic_write_boundary_bytes - Device's logical block address space
> + * which an atomic write should not cross.
> + * @q: the request queue for the device
> + * @bytes: must be a power-of-two.
> + */
> +void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
> + unsigned int bytes)
> +{
> + q->limits.atomic_write_boundary_sectors = bytes >> SECTOR_SHIFT;
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_boundary_bytes);

Default atomic_write_boundary_sectors should be
atomic_write_unit_max_sectors in case of atomic write?

> +
> +/**
> + * blk_queue_atomic_write_unit_min_sectors - smallest unit that can be written
> + * atomically to the device.
> + * @q: the request queue for the device
> + * @sectors: must be a power-of-two.
> + */
> +void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
> + unsigned int sectors)
> +{
> + struct queue_limits *limits = &q->limits;
> +
> + limits->atomic_write_unit_min_sectors = sectors;
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_min_sectors);

atomic_write_unit_min_sectors should be >= (physical block size >> 9),
given that the minimum atomic write unit is the physical sector for all disks.

> +
> +/*
> + * blk_queue_atomic_write_unit_max_sectors - largest unit that can be written
> + * atomically to the device.
> + * @q: the request queue for the device
> + * @sectors: must be a power-of-two.
> + */
> +void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
> + unsigned int sectors)
> +{
> + struct queue_limits *limits = &q->limits;
> +
> + limits->atomic_write_unit_max_sectors = sectors;
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);

atomic_write_unit_max_sectors should be >= atomic_write_unit_min_sectors.


Thanks,
Ming
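
For illustration, a combined setter along the lines Ming suggests might
look like the sketch below. The structure and function names are
hypothetical and not part of the posted series; only the queue_limits
fields come from the patch quoted above:

/* Hypothetical sketch only: none of these names exist in the series */
struct blk_atomic_write_limits {
	unsigned int unit_min_sectors;	/* power-of-2 */
	unsigned int unit_max_sectors;	/* power-of-2, >= unit_min */
	unsigned int max_sectors;	/* 0 means default to unit_max */
	unsigned int boundary_sectors;	/* power-of-2, 0 means no boundary */
};

static int blk_queue_atomic_write_limits(struct request_queue *q,
		const struct blk_atomic_write_limits *l)
{
	struct queue_limits *lim = &q->limits;

	/* Cross-validate everything in one place, as suggested */
	if (!is_power_of_2(l->unit_min_sectors) ||
	    !is_power_of_2(l->unit_max_sectors) ||
	    l->unit_max_sectors < l->unit_min_sectors)
		return -EINVAL;
	if (l->boundary_sectors && !is_power_of_2(l->boundary_sectors))
		return -EINVAL;

	lim->atomic_write_unit_min_sectors = l->unit_min_sectors;
	lim->atomic_write_unit_max_sectors = l->unit_max_sectors;
	lim->atomic_write_max_sectors =
		l->max_sectors ? l->max_sectors : l->unit_max_sectors;
	lim->atomic_write_boundary_sectors = l->boundary_sectors;
	return 0;
}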

2023-12-13 09:14:27

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits

>> +
>>
>> What: /sys/block/<disk>/diskseq
>> Date: February 2021
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 0046b447268f..d151be394c98 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -59,6 +59,10 @@ void blk_set_default_limits(struct queue_limits *lim)
>> lim->zoned = BLK_ZONED_NONE;
>> lim->zone_write_granularity = 0;
>> lim->dma_alignment = 511;
>> + lim->atomic_write_unit_min_sectors = 0;
>> + lim->atomic_write_unit_max_sectors = 0;
>> + lim->atomic_write_max_sectors = 0;
>> + lim->atomic_write_boundary_sectors = 0;
>
> Can we move the four into single structure

There is no precedent for a similar structure in struct queue_limits. So
would only passing a structure to the blk-settings.c API be ok?

> and setup them in single
> API? Then cross-validation can be done in this API.

I suppose so, if you think that it is better.

We rely on the driver to provide sound values. I suppose that we can
sanitize them also (in a single API).

>
>> }
>>
>> /**
>> @@ -183,6 +187,62 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
>> }
>> EXPORT_SYMBOL(blk_queue_max_discard_sectors);
>>
>> +/**
>> + * blk_queue_atomic_write_max_bytes - set max bytes supported by
>> + * the device for atomic write operations.
>> + * @q: the request queue for the device
>> + * @size: maximum bytes supported
>> + */
>> +void blk_queue_atomic_write_max_bytes(struct request_queue *q,
>> + unsigned int bytes)
>> +{
>> + q->limits.atomic_write_max_sectors = bytes >> SECTOR_SHIFT;
>> +}
>> +EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
>
> What if driver doesn't call it but driver supports atomic write?

We rely on the driver to do this. Any basic level of testing will show
an issue if they don't.

>
> I guess the default max sectors should be atomic_write_unit_max_sectors
> if the feature is enabled.

Sure. If we have a single API to set all values, then we don't need to
worry about this (assuming the values are filled in properly).

>
>> +
>> +/**
>> + * blk_queue_atomic_write_boundary_bytes - Device's logical block address space
>> + * which an atomic write should not cross.
>> + * @q: the request queue for the device
>> + * @bytes: must be a power-of-two.
>> + */
>> +void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
>> + unsigned int bytes)
>> +{
>> + q->limits.atomic_write_boundary_sectors = bytes >> SECTOR_SHIFT;
>> +}
>> +EXPORT_SYMBOL(blk_queue_atomic_write_boundary_bytes);
>
> Default atomic_write_boundary_sectors should be
> atomic_write_unit_max_sectors in case of atomic write?

Having atomic_write_boundary_sectors default to
atomic_write_unit_max_sectors is effectively the same as a default of 0:
an atomic write is naturally aligned and never longer than
atomic_write_unit_max, so it can never cross such a boundary anyway.

>
>> +
>> +/**
>> + * blk_queue_atomic_write_unit_min_sectors - smallest unit that can be written
>> + * atomically to the device.
>> + * @q: the request queue for the device
>> + * @sectors: must be a power-of-two.
>> + */
>> +void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
>> + unsigned int sectors)
>> +{
>> + struct queue_limits *limits = &q->limits;
>> +
>> + limits->atomic_write_unit_min_sectors = sectors;
>> +}
>> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_min_sectors);
>
> atomic_write_unit_min_sectors should be >= (physical block size >> 9)
> given the minimized atomic write unit is physical sector for all disk.

For SCSI, we have a granularity VPD value, and when set we pay attention
to that. If not, we use the phys block size.

For NVMe, we use the logical block size. The physical block size can
be greater than the logical block size when npwg is set, and I don't
think it's suitable to use that as the minimum atomic write unit.

Anyway, I am not too keen on sanitizing this value in this way.

>
>> +
>> +/*
>> + * blk_queue_atomic_write_unit_max_sectors - largest unit that can be written
>> + * atomically to the device.
>> + * @q: the request queue for the device
>> + * @sectors: must be a power-of-two.
>> + */
>> +void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
>> + unsigned int sectors)
>> +{
>> + struct queue_limits *limits = &q->limits;
>> +
>> + limits->atomic_write_unit_max_sectors = sectors;
>> +}
>> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);
>
> atomic_write_unit_max_sectors should be >= atomic_write_unit_min_sectors.
>

Again, we rely on the driver to provide sound values. However, as
mentioned, we can sanitize.

Thanks,
John

2023-12-13 09:32:51

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 12/12/2023 16:32, Christoph Hellwig wrote:
> On Tue, Dec 12, 2023 at 11:08:28AM +0000, John Garry wrote:
>> Two new fields are added to struct statx - atomic_write_unit_min and
>> atomic_write_unit_max. For each atomic individual write, the total length
>> of a write must be between atomic_write_unit_min and
>> atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
>> at a natural offset in the file wrt the write length.
>>
>> SCSI sd.c and scsi_debug and NVMe kernel support is added.
>>
>> Some open questions:
>> - How to make API extensible for when we have no HW support? In that case,
>> we would prob not have to follow rule of power-of-2 length et al.
>> As a possible solution, maybe we can say that atomic writes are
>> supported for the file via statx, but not set unit_min and max values,
>> and this means that writes need to be just FS block aligned there.
> I don't think the power of two length is much of a problem to be
> honest, and if we ever want to lift it we can still do that easily
> by adding a new flag or limit.

ok, but it would be nice to have some idea of what that flag or limit
change would be.

>
> What I'm a lot more worried about is how to tell the file system that
> allocations are done right for these requirement. There is no way
> a user can know that allocations in an existing file are properly
> aligned, so atomic writes will just fail on existing files.
>
> I suspect we need an on-disk flag that forces allocations to be
> aligned to the atomic write limit, in some ways similar how the
> XFS rt flag works. You'd need to set it on an empty file, and all
> allocations after that are guaranteed to be properly aligned.

Hmmm... so how is this different to the XFS forcealign feature?

For XFS, I thought that your idea was to always CoW new extents for
misaligned extents or writes which spanned multiple extents.

>
>> - For block layer, should atomic_write_unit_max be limited by
>> max_sectors_kb? Currently it is not.
> Well. It must be limited to max_hw_sectors to actually work.

Sure, as max_hw_sectors may also be limited by DMA controller max
mapping size.

> max_sectors is a software limit below that, which with modern hardware
> is actually pretty silly and a real performance issue with todays
> workloads when people don't tweak it..

Right, so we should limit atomic write queue limits to max_hw_sectors.
But people can still tweak max_sectors, and I am inclined to say that
atomic_write_unit_max et al should be (dynamically) limited to
max_sectors also.

>
>> - How to improve requirement that iovecs are PAGE-aligned.
>> There are 2x issues:
>> a. We impose this rule to not split BIOs due to virt boundary for
>> NVMe, but there virt boundary is 4K (and not PAGE size, so broken for
>> 16K/64K pages). Easy solution is to impose requirement that iovecs
>> are 4K-aligned.
>> b. We don't enforce this rule for virt boundary == 0, i.e. SCSI
> .. we require any device that wants to support atomic writes to not
> have that silly limit. For NVMe that would require SGL support
> (and some driver changes I've been wanting to make for long where
> we always use SGLs for transfers larger than a single PRP if supported)

If we could avoid dealing with a virt boundary, then that would be nice.

Are there any patches yet for the change to always use SGLs for
transfers larger than a single PRP?

On a related topic, I am not sure about how - or if we even should -
enforce iovec PAGE-alignment or length; rather, the user could just be
advised that iovecs must be PAGE-aligned and at least PAGE length to achieve
atomic_write_unit_max.

>
>
>> - Since debugging torn-writes due to unwanted kernel BIO splitting/merging
>> would be horrible, should we add some kernel storage stack software
>> integrity checks?
> Yes, I think we'll need asserts in the drivers. At least for NVMe I
> will insist on them.

Please see 16/16 in this series.

> For SCSI I think the device actually checks
> because the atomic writes are a different command anyway, or am I
> misunderstanding how SCSI works here?

Right, for WRITE ATOMIC (16) the target will reject a command when it
does not respect the device atomic write limits.

Thanks,
John

2023-12-13 12:29:03

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits

On Wed, Dec 13, 2023 at 09:13:48AM +0000, John Garry wrote:
> > > +
> > > What: /sys/block/<disk>/diskseq
> > > Date: February 2021
> > > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > > index 0046b447268f..d151be394c98 100644
> > > --- a/block/blk-settings.c
> > > +++ b/block/blk-settings.c
> > > @@ -59,6 +59,10 @@ void blk_set_default_limits(struct queue_limits *lim)
> > > lim->zoned = BLK_ZONED_NONE;
> > > lim->zone_write_granularity = 0;
> > > lim->dma_alignment = 511;
> > > + lim->atomic_write_unit_min_sectors = 0;
> > > + lim->atomic_write_unit_max_sectors = 0;
> > > + lim->atomic_write_max_sectors = 0;
> > > + lim->atomic_write_boundary_sectors = 0;
> >
> > Can we move the four into single structure
>
> There is no precedent for a similar structure in struct queue_limits. So
> would only passing a structure to the blk-settings.c API be ok?

Yes, this structure is part of the new API.

>
> > and setup them in single
> > API? Then cross-validation can be done in this API.
>
> I suppose so, if you think that it is better.
>
> We rely on the driver to provide sound values. I suppose that we can
> sanitize them also (in a single API).

Please make the interface correct from the beginning; one good API is
helpful for both sides, such as for isolating problems, locating bugs
and abstracting common logic.

And relying on API users is absolutely not good design.

>
> >
> > > }
> > > /**
> > > @@ -183,6 +187,62 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
> > > }
> > > EXPORT_SYMBOL(blk_queue_max_discard_sectors);
> > > +/**
> > > + * blk_queue_atomic_write_max_bytes - set max bytes supported by
> > > + * the device for atomic write operations.
> > > + * @q: the request queue for the device
> > > + * @size: maximum bytes supported
> > > + */
> > > +void blk_queue_atomic_write_max_bytes(struct request_queue *q,
> > > + unsigned int bytes)
> > > +{
> > > + q->limits.atomic_write_max_sectors = bytes >> SECTOR_SHIFT;
> > > +}
> > > +EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
> >
> > What if driver doesn't call it but driver supports atomic write?
>
> We rely on the driver to do this. Any basic level of testing will show an
> issue if they don't.

Software quality depends on good requirements analysis, design and
implementation, not on testing.

You simply cannot cover all possibilities in testing.

>
> >
> > I guess the default max sectors should be atomic_write_unit_max_sectors
> > if the feature is enabled.
>
> Sure. If we have a single API to set all values, then we don't need to worry
> about this (assuming the values are filled in properly).
>
> >
> > > +
> > > +/**
> > > + * blk_queue_atomic_write_boundary_bytes - Device's logical block address space
> > > + * which an atomic write should not cross.
> > > + * @q: the request queue for the device
> > > + * @bytes: must be a power-of-two.
> > > + */
> > > +void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
> > > + unsigned int bytes)
> > > +{
> > > + q->limits.atomic_write_boundary_sectors = bytes >> SECTOR_SHIFT;
> > > +}
> > > +EXPORT_SYMBOL(blk_queue_atomic_write_boundary_bytes);
> >
> > Default atomic_write_boundary_sectors should be
> > atomic_write_unit_max_sectors in case of atomic write?
>
> Having atomic_write_boundary_sectors default to
> atomic_write_unit_max_sectors is effectively the same as a default of 0.
>
> >
> > > +
> > > +/**
> > > + * blk_queue_atomic_write_unit_min_sectors - smallest unit that can be written
> > > + * atomically to the device.
> > > + * @q: the request queue for the device
> > > + * @sectors: must be a power-of-two.
> > > + */
> > > +void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
> > > + unsigned int sectors)
> > > +{
> > > + struct queue_limits *limits = &q->limits;
> > > +
> > > + limits->atomic_write_unit_min_sectors = sectors;
> > > +}
> > > +EXPORT_SYMBOL(blk_queue_atomic_write_unit_min_sectors);
> >
> > atomic_write_unit_min_sectors should be >= (physical block size >> 9)
> > given the minimized atomic write unit is physical sector for all disk.
>
> For SCSI, we have a granularity VPD value, and when set we pay attention to
> that. If not, we use the phys block size.
>
> For NVMe, we use the logical block size. For physical block size, that can
> be greater than the logical block size for npwg set, and I don't think it's
> suitable use that as minimum atomic write unit.

I highly suspect it is wrong to use the logical block size as the minimum
supported atomic write unit, given that the physical block size is supposed
to be the minimum atomic write unit.

>
> Anyway, I am not too keen on sanitizing this value in this way.
>
> >
> > > +
> > > +/*
> > > + * blk_queue_atomic_write_unit_max_sectors - largest unit that can be written
> > > + * atomically to the device.
> > > + * @q: the request queue for the device
> > > + * @sectors: must be a power-of-two.
> > > + */
> > > +void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
> > > + unsigned int sectors)
> > > +{
> > > + struct queue_limits *limits = &q->limits;
> > > +
> > > + limits->atomic_write_unit_max_sectors = sectors;
> > > +}
> > > +EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);
> >
> > atomic_write_unit_max_sectors should be >= atomic_write_unit_min_sectors.
> >
>
> Again, we rely on the driver to provide sound values. However, as mentioned,
> we can sanitize.

Relying on the driver to provide sound values is absolutely bad design
from an API viewpoint.

Thanks,
Ming

2023-12-13 15:45:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote:
>>> - How to make API extensible for when we have no HW support? In that case,
>>> we would prob not have to follow rule of power-of-2 length et al.
>>> As a possible solution, maybe we can say that atomic writes are
>>> supported for the file via statx, but not set unit_min and max values,
>>> and this means that writes need to be just FS block aligned there.
>> I don't think the power of two length is much of a problem to be
>> honest, and if we every want to lift it we can still do that easily
>> by adding a new flag or limit.
>
> ok, but it would be nice to have some idea on what that flag or limit
> change would be.

That would require a concrete use case. The simplest thing for a file
system that can or does log I/O would simply be a flag waiving all
the alignment and size requirements.

>> I suspect we need an on-disk flag that forces allocations to be
>> aligned to the atomic write limit, in some ways similar how the
>> XFS rt flag works. You'd need to set it on an empty file, and all
>> allocations after that are guaranteed to be properly aligned.
>
> Hmmm... so how is this different to the XFS forcealign feature?

Maybe not much. But that's not what it is about - we need a common
API for this and not some XFS internal flag. So if this is something
we could support in ext4 as well that would be a good step. And for
btrfs you'd probably want to support something like it in nocow mode
if people care enough, or always support atomics and write out of
place.

> For XFS, I thought that your idea was to always CoW new extents for
> misaligned extents or writes which spanned multiple extents.

Well, that is useful for a few things:

- atomic writes on hardware that does not support it
- atomic writes for bufferd I/O
- supporting other sizes / alignments than the strict power of
two above.

> Right, so we should limit atomic write queue limits to max_hw_sectors. But
> people can still tweak max_sectors, and I am inclined to say that
> atomic_write_unit_max et al should be (dynamically) limited to max_sectors
> also.

Allowing people to tweak it seems to be asking for trouble.

>> have that silly limit. For NVMe that would require SGL support
>> (and some driver changes I've been wanting to make for long where
>> we always use SGLs for transfers larger than a single PRP if supported)
>
> If we could avoid dealing with a virt boundary, then that would be nice.
>
> Are there any patches yet for the change to always use SGLs for transfers
> larger than a single PRP?

No.

> On a related topic, I am not sure about how - or if we even should -
> enforce iovec PAGE-alignment or length; rather, the user could just be
> advised that iovecs must be PAGE-aligned and min PAGE length to achieve
> atomic_write_unit_max.

Anything that just advises the user, is not clear cut, and results in
an error is data loss waiting to happen. Even more so if it differs
from device to device.

2023-12-13 16:28:57

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 13/12/2023 15:44, Christoph Hellwig wrote:
> On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote:
>>>> - How to make API extensible for when we have no HW support? In that case,
>>>> we would prob not have to follow rule of power-of-2 length et al.
>>>> As a possible solution, maybe we can say that atomic writes are
>>>> supported for the file via statx, but not set unit_min and max values,
>>>> and this means that writes need to be just FS block aligned there.
>>> I don't think the power of two length is much of a problem to be
>>> honest, and if we ever want to lift it we can still do that easily
>>> by adding a new flag or limit.
>> ok, but it would be nice to have some idea on what that flag or limit
>> change would be.
> That would require a concrete use case. The simplest thing for a file
> system that can or does log I/O would simply be a flag waiving all
> the alignment and size requirements.

ok...

>
>>> I suspect we need an on-disk flag that forces allocations to be
>>> aligned to the atomic write limit, in some ways similar how the
>>> XFS rt flag works. You'd need to set it on an empty file, and all
>>> allocations after that are guaranteed to be properly aligned.
>> Hmmm... so how is this different to the XFS forcealign feature?
> Maybe not much. But that's not what it is about - we need a common
> API for this and not some XFS internal flag. So if this is something
> we could support in ext4 as well that would be a good step. And for
> btrfs you'd probably want to support something like it in nocow mode
> if people care enough, or always support atomics and write out of
> place.

The forcealign feature was a bit of a case of knowing specifically what to
do for XFS to enable atomic writes.

But some common method to configure any FS file for atomic writes (if
supported) would be preferable, right?
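
One generic way to carry such a knob would be the existing
FS_IOC_FSSETXATTR interface. A minimal sketch, assuming a hypothetical
FS_XFLAG_ATOMICWRITES inode flag (the ioctl and struct fsxattr are real;
the flag name and bit value are made up here):

#include <linux/fs.h>
#include <sys/ioctl.h>

#ifndef FS_XFLAG_ATOMICWRITES
#define FS_XFLAG_ATOMICWRITES	0x00020000	/* hypothetical bit */
#endif

/*
 * Per the discussion above, this would have to be set while the file is
 * still empty, so that all subsequent allocations are aligned.
 */
static int mark_file_for_atomic_writes(int fd)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;
	fsx.fsx_xflags |= FS_XFLAG_ATOMICWRITES;
	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}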

>
>> For XFS, I thought that your idea was to always CoW new extents for
>> misaligned extents or writes which spanned multiple extents.
> Well, that is useful for a few things:
>
> - atomic writes on hardware that does not support it
> - atomic writes for bufferd I/O
> - supporting other sizes / alignments than the strict power of
> two above.

Sure

>
>> Right, so we should limit atomic write queue limits to max_hw_sectors. But
>> people can still tweak max_sectors, and I am inclined to say that
>> atomic_write_unit_max et al should be (dynamically) limited to max_sectors
>> also.
> Allowing people to tweak it seems to be asking for trouble.

I tend to agree....

Let's just limit at max_hw_sectors.

>
>>> have that silly limit. For NVMe that would require SGL support
>>> (and some driver changes I've been wanting to make for long where
>>> we always use SGLs for transfers larger than a single PRP if supported)
>> If we could avoid dealing with a virt boundary, then that would be nice.
>>
>> Are there any patches yet for the change to always use SGLs for transfers
>> larger than a single PRP?
> No.

As below, if we are going to enforce alignment/size of iovecs elsewhere,
then I don't need to worry about the virt boundary.

>
>> On a related topic, I am not sure about how - or if we even should -
>> enforce iovec PAGE-alignment or length; rather, the user could just be
>> advised that iovecs must be PAGE-aligned and min PAGE length to achieve
>> atomic_write_unit_max.
> Anything that just advises the user, is not clear cut, and results in
> an error is data loss waiting to happen. Even more so if it differs
> from device to device.

ok, fine. I suppose we'd better enforce it then.

Thanks,
John

2023-12-13 19:02:40

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits

On 13/12/2023 12:28, Ming Lei wrote:
>> For NVMe, we use the logical block size. For physical block size, that can
>> be greater than the logical block size for npwg set, and I don't think it's
>> suitable to use that as the minimum atomic write unit.
> I highly suspect it is wrong to use logical block size as minimum
> support atomic write unit, given physical block size is supposed to
> be the minimum atomic write unit.

I would tend to agree, but I am still a bit curious about how npwg is used
to calculate atomic_bs/phys_bs, as it seems to be a recommended
performance-related value. It would hint to me that it is the phys_bs,
but is that the same as the atomic min granularity?

>
>> Anyway, I am not too keen on sanitizing this value in this way.
>>
>>>> +
>>>> +/*
>>>> + * blk_queue_atomic_write_unit_max_sectors - largest unit that can be written
>>>> + * atomically to the device.
>>>> + * @q: the request queue for the device
>>>> + * @sectors: must be a power-of-two.
>>>> + */
>>>> +void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
>>>> + unsigned int sectors)
>>>> +{
>>>> + struct queue_limits *limits = &q->limits;
>>>> +
>>>> + limits->atomic_write_unit_max_sectors = sectors;
>>>> +}
>>>> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);
>>> atomic_write_unit_max_sectors should be >= atomic_write_unit_min_sectors.
>>>
>> Again, we rely on the driver to provide sound values. However, as mentioned,
>> we can sanitize.
> Relying on driver to provide sound value is absolutely bad design from API
> viewpoint.

OK, fine, I'll look to revise the API to make it more robust.

Thanks,
John

2023-12-14 04:36:07

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits


Hi Ming!

>> + lim->atomic_write_unit_min_sectors = 0;
>> + lim->atomic_write_unit_max_sectors = 0;
>> + lim->atomic_write_max_sectors = 0;
>> + lim->atomic_write_boundary_sectors = 0;
>
> Can we move the four into single structure and setup them in single
> API? Then cross-validation can be done in this API.

Why would we put them in a separate struct? We don't do that for any of
the other queue_limits.

--
Martin K. Petersen Oracle Linux Engineering

2023-12-14 04:38:44

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits


Ming,

> Relying on driver to provide sound value is absolutely bad design from
> API viewpoint.

All the other queue_limits are validated by the LLDs. It's challenging
to lift that validation to the block layer since the values reported are
heavily protocol-dependent and thus information is lost if we do it
somewhere else.

--
Martin K. Petersen Oracle Linux Engineering

2023-12-14 13:47:53

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits

Hi Martin,

On Wed, Dec 13, 2023 at 11:38:10PM -0500, Martin K. Petersen wrote:
>
> Ming,

On Thu, Dec 14, 2023 at 12:35 PM Martin K. Petersen <[email protected]> wrote:
>
>
> Hi Ming!
>
> >> +    lim->atomic_write_unit_min_sectors = 0;
> >> +    lim->atomic_write_unit_max_sectors = 0;
> >> +    lim->atomic_write_max_sectors = 0;
> >> +    lim->atomic_write_boundary_sectors = 0;
> >
> > Can we move the four into single structure and setup them in single
> > API? Then cross-validation can be done in this API.
>
> Why would we put them in a separate struct? We don't do that for any of
> the other queue_limits.

All four limits serve the same purpose of supporting atomic write, and
there are many benefits to defining a single API to set up the atomic
parameters:

1) common logic can be put in a single place, such as running
cross-validation among the limits and setting up default values; that is
impossible with the approach in this patch

2) all limits are supposed to be set up once by the driver in the same
place, so doing them in a single API simplifies both the driver and the
block layer, and the API itself becomes less fragile

3) it is easier to trace and troubleshoot

>
> > Relying on driver to provide sound value is absolutely bad design from
> > API viewpoint.
>
> All the other queue_limits are validated by the LLDs. It's challenging
> to lift that validation to the block layer since the values reported are
> heavily protocol-dependent and

Once the atomic limits are put into the block layer, they are no longer
driver specific: scsi and nvme are going to support them first, and sooner
or later other drivers (dm & md, loop or ublk, ...) may try to support them.

Also, in theory they are not protocol-dependent: usually the physical block
size is the minimum atomic-write unit, and John's patches are trying to
support a bigger atomic-write unit where the scsi/nvme protocol supports
it, but the concept is common to disks in general. The same holds at the
implementation level; for NAND flash based storage, for example, I guess
atomic writes would be supported by an atomic update of the FTL's LBA/PBA
mapping.

> thus information is lost if we do it somewhere else.

The block layer only focuses on common logic, such as the four limits
added to the request queue, which are consumed by the block layer and
related to other generic limits (physical block size, max IO size); that
is not the same as a driver's internal validation.


Thanks,
Ming

2023-12-14 14:37:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
>>> Are there any patches yet for the change to always use SGLs for transfers
>>> larger than a single PRP?
>> No.

Here is the WIP version. With that you'd need to make atomic writes
conditional on !ctrl->need_virt_boundary.

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8ebdfd623e0f78..e04faffd6551fe 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1889,7 +1889,8 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
}
- blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
+ if (q == ctrl->admin_q || ctrl->need_virt_boundary)
+ blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
blk_queue_dma_alignment(q, 3);
blk_queue_write_cache(q, vwc, vwc);
}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e7411dac00f725..aa98794a3ec53d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -262,6 +262,7 @@ enum nvme_ctrl_flags {
struct nvme_ctrl {
bool comp_seen;
bool identified;
+ bool need_virt_boundary;
enum nvme_ctrl_state state;
spinlock_t lock;
struct mutex scan_lock;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 61af7ff1a9d6ba..a8d273b475cb40 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -60,8 +60,7 @@ MODULE_PARM_DESC(max_host_mem_size_mb,
static unsigned int sgl_threshold = SZ_32K;
module_param(sgl_threshold, uint, 0644);
MODULE_PARM_DESC(sgl_threshold,
- "Use SGLs when average request segment size is larger or equal to "
- "this size. Use 0 to disable SGLs.");
+ "Use SGLs when > 0. Use 0 to disable SGLs.");

#define NVME_PCI_MIN_QUEUE_SIZE 2
#define NVME_PCI_MAX_QUEUE_SIZE 4095
@@ -504,23 +503,6 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
spin_unlock(&nvmeq->sq_lock);
}

-static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
- int nseg)
-{
- struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
- unsigned int avg_seg_size;
-
- avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
-
- if (!nvme_ctrl_sgl_supported(&dev->ctrl))
- return false;
- if (!nvmeq->qid)
- return false;
- if (!sgl_threshold || avg_seg_size < sgl_threshold)
- return false;
- return true;
-}
-
static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
{
const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
@@ -769,12 +751,14 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
struct nvme_command *cmnd)
{
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ bool sgl_supported = nvme_ctrl_sgl_supported(&dev->ctrl) &&
+ nvmeq->qid && sgl_threshold;
blk_status_t ret = BLK_STS_RESOURCE;
int rc;

if (blk_rq_nr_phys_segments(req) == 1) {
- struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
struct bio_vec bv = req_bvec(req);

if (!is_pci_p2pdma_page(bv.bv_page)) {
@@ -782,8 +766,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
return nvme_setup_prp_simple(dev, req,
&cmnd->rw, &bv);

- if (nvmeq->qid && sgl_threshold &&
- nvme_ctrl_sgl_supported(&dev->ctrl))
+ if (sgl_supported)
return nvme_setup_sgl_simple(dev, req,
&cmnd->rw, &bv);
}
@@ -806,7 +789,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
goto out_free_sg;
}

- if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
+ if (sgl_supported)
ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
else
ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
@@ -3036,6 +3019,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
result = nvme_init_ctrl_finish(&dev->ctrl, false);
if (result)
goto out_disable;
+ if (!nvme_ctrl_sgl_supported(&dev->ctrl))
+ dev->ctrl.need_virt_boundary = true;

nvme_dbbuf_dma_alloc(dev);

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 81e2621169e5d3..416a9fbcccfc74 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -838,6 +838,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
error = nvme_init_ctrl_finish(&ctrl->ctrl, false);
if (error)
goto out_quiesce_queue;
+ ctrl->ctrl.need_virt_boundary = true;

return 0;

2023-12-14 15:47:40

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 14/12/2023 14:37, Christoph Hellwig wrote:
> On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
>>>> Are there any patches yet for the change to always use SGLs for transfers
>>>> larger than a single PRP?
>>> No.
> Here is the WIP version. With that you'd need to make atomic writes
> conditional on !ctrl->need_virt_boundary.

Cheers, I gave it a quick spin and on the surface looks ok.

# more /sys/block/nvme0n1/queue/atomic_write_unit_max_bytes
262144
# more /sys/block/nvme0n1/queue/virt_boundary_mask
0

2023-12-14 16:12:44

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits

On Wed, Dec 13, 2023 at 09:25:38AM +0800, Ming Lei wrote:

<full quote deleted, please only quote what is relevant>


> > lim->dma_alignment = 511;
> > + lim->atomic_write_unit_min_sectors = 0;
> > + lim->atomic_write_unit_max_sectors = 0;
> > + lim->atomic_write_max_sectors = 0;
> > + lim->atomic_write_boundary_sectors = 0;
>
> Can we move the four into single structure and setup them in single
> API? Then cross-validation can be done in this API.

Please don't be arbitrarily different from all the other limits. What
we really should be doing is an API that updates all limits at the
same time, and I actually have code for that, I'll just need to finish
it. I do not want to block this series for it, though.

2023-12-18 22:50:38

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Thu, Dec 14, 2023 at 03:37:09PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
> >>> Are there any patches yet for the change to always use SGLs for transfers
> >>> larger than a single PRP?
> >> No.
>
> Here is the WIP version. With that you'd need to make atomic writes
> conditional on !ctrl->need_virt_boundary.

This looks pretty good as-is!

> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 8ebdfd623e0f78..e04faffd6551fe 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1889,7 +1889,8 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
> blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
> blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
> }
> - blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
> + if (q == ctrl->admin_q || ctrl->need_virt_boundary)
> + blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
> blk_queue_dma_alignment(q, 3);
> blk_queue_write_cache(q, vwc, vwc);
> }
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index e7411dac00f725..aa98794a3ec53d 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -262,6 +262,7 @@ enum nvme_ctrl_flags {
> struct nvme_ctrl {
> bool comp_seen;
> bool identified;
> + bool need_virt_boundary;
> enum nvme_ctrl_state state;
> spinlock_t lock;
> struct mutex scan_lock;
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 61af7ff1a9d6ba..a8d273b475cb40 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -60,8 +60,7 @@ MODULE_PARM_DESC(max_host_mem_size_mb,
> static unsigned int sgl_threshold = SZ_32K;
> module_param(sgl_threshold, uint, 0644);
> MODULE_PARM_DESC(sgl_threshold,
> - "Use SGLs when average request segment size is larger or equal to "
> - "this size. Use 0 to disable SGLs.");
> + "Use SGLs when > 0. Use 0 to disable SGLs.");
>
> #define NVME_PCI_MIN_QUEUE_SIZE 2
> #define NVME_PCI_MAX_QUEUE_SIZE 4095
> @@ -504,23 +503,6 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
> spin_unlock(&nvmeq->sq_lock);
> }
>
> -static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
> - int nseg)
> -{
> - struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
> - unsigned int avg_seg_size;
> -
> - avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
> -
> - if (!nvme_ctrl_sgl_supported(&dev->ctrl))
> - return false;
> - if (!nvmeq->qid)
> - return false;
> - if (!sgl_threshold || avg_seg_size < sgl_threshold)
> - return false;
> - return true;
> -}
> -
> static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
> {
> const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
> @@ -769,12 +751,14 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
> static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> struct nvme_command *cmnd)
> {
> + struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
> struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> + bool sgl_supported = nvme_ctrl_sgl_supported(&dev->ctrl) &&
> + nvmeq->qid && sgl_threshold;
> blk_status_t ret = BLK_STS_RESOURCE;
> int rc;
>
> if (blk_rq_nr_phys_segments(req) == 1) {
> - struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
> struct bio_vec bv = req_bvec(req);
>
> if (!is_pci_p2pdma_page(bv.bv_page)) {
> @@ -782,8 +766,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> return nvme_setup_prp_simple(dev, req,
> &cmnd->rw, &bv);
>
> - if (nvmeq->qid && sgl_threshold &&
> - nvme_ctrl_sgl_supported(&dev->ctrl))
> + if (sgl_supported)
> return nvme_setup_sgl_simple(dev, req,
> &cmnd->rw, &bv);
> }
> @@ -806,7 +789,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> goto out_free_sg;
> }
>
> - if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
> + if (sgl_supported)
> ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
> else
> ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
> @@ -3036,6 +3019,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> result = nvme_init_ctrl_finish(&dev->ctrl, false);
> if (result)
> goto out_disable;
> + if (!nvme_ctrl_sgl_supported(&dev->ctrl))
> + dev->ctrl.need_virt_boundary = true;
>
> nvme_dbbuf_dma_alloc(dev);
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 81e2621169e5d3..416a9fbcccfc74 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -838,6 +838,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
> error = nvme_init_ctrl_finish(&ctrl->ctrl, false);
> if (error)
> goto out_quiesce_queue;
> + ctrl->ctrl.need_virt_boundary = true;
>
> return 0;
>

2023-12-19 05:15:17

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
> On 13/12/2023 15:44, Christoph Hellwig wrote:
> > On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote:
> > > > > - How to make API extensible for when we have no HW support? In that case,
> > > > > we would prob not have to follow rule of power-of-2 length et al.
> > > > > As a possible solution, maybe we can say that atomic writes are
> > > > > supported for the file via statx, but not set unit_min and max values,
> > > > > and this means that writes need to be just FS block aligned there.
> > > > I don't think the power of two length is much of a problem to be
> > > > honest, and if we every want to lift it we can still do that easily
> > > > by adding a new flag or limit.
> > > ok, but it would be nice to have some idea on what that flag or limit
> > > change would be.
> > That would require a concrete use case. The simplest thing for a file
> > system that can or does log I/O would be a flag waiving all
> > the alignment and size requirements.
>
> ok...
>
> >
> > > > I suspect we need an on-disk flag that forces allocations to be
> > > > aligned to the atomic write limit, in some ways similar how the
> > > > XFS rt flag works. You'd need to set it on an empty file, and all
> > > > allocations after that are guaranteed to be properly aligned.
> > > Hmmm... so how is this different to the XFS forcealign feature?
> > Maybe not much. But that's not what it is about - we need a common
> > API for this and not some XFS internal flag. So if this is something
> > we could support in ext4 as well that would be a good step. And for
> > btrfs you'd probably want to support something like it in nocow mode
> > if people care enough, or always support atomics and write out of
> > place.
>
> The forcealign feature was a bit of case of knowing specifically what to do
> for XFS to enable atomic writes.
>
> But some common method to configure any FS file for atomic writes (if
> supported) would be preferable, right?

/me stumbles back in with plenty of covidbrain to go around.

So ... Christoph, you're asking for a common API for
sysadmins/applications to signal to the filesystem that they want all
data allocations aligned to a given value, right?

e.g. if a nvme device advertises a capability for untorn writes between
$lbasize and 64k, then we need a way to set up each untorn-file with
some alignment between $lbasize and 64k?

or if cxl storage becomes not ung-dly expensive, then we'd want a way to
set up files with an alignment of 2M (x86) or 512M (arm64lol) to take
advantage of PMD mmap mappings?

I guess we could define a fsxattr xflag for FORCEALIGN and declare that
the fsx_extsize field mandates startoff/startblock mapping
alignment if FORCEALIGN is set.

It just happens that since fsxattr comes from XFS then it'll be easy
enough for us to wire that up. Maybe.

Concretely, xfs would have to have a way to force the
ag/rtgroup/rtextent size to align with the maximum anticipated required
alignment (e.g. 64k) of the device so that groups (and hence per-group
allocations) don't split the alignment. xfs would also need to
constrain the per-inode alignment to some factor of the ag size / rt
extent size.

Writing files could use hw acceleration directly if all the factors are
set up correctly; or fall back to COW. Assuming there's a way to
require that writeback pick up all the folios in a $forcealign chunk for
simultaneous writeout. (I think this is a lingering bug in the xfs
rtreflink patchset.) Also, IIRC John mentioned that there might be some
programs that would rather get an EIO than fall back to COW.

ext4 could maybe sort of do this by allowing forcealign alignments up to
the bigalloc size, if bigalloc is enabled. Assuming bigalloc isn't DOA.
IIRC bigalloc only supports power of 2 factors. It wouldn't have the
COW problem because it only supports overwrites.

As usual, I dunno about btrfs.

So does that sound like more or less what you're getting at, Christoph?
Or have I misread the entire thing? (PS: there's been like 1200 fsdevel
messages since I got sick and I've only had sufficient brainpower to
limp the xfs 6.8 patches across the finish line...)

--D

> >
> > > For XFS, I thought that your idea was to always CoW new extents for
> > > misaligned extents or writes which spanned multiple extents.
> > Well, that is useful for two things:
> >
> > - atomic writes on hardware that does not support it
> > - atomic writes for bufferd I/O
> > - supporting other sizes / alignments than the strict power of
> > two above.
>
> Sure
>
> >
> > > Right, so we should limit atomic write queue limits to max_hw_sectors. But
> > > people can still tweak max_sectors, and I am inclined to say that
> > > atomic_write_unit_max et al should be (dynamically) limited to max_sectors
> > > also.
> > Allowing people to tweak it seems to be asking for trouble.
>
> I tend to agree....
>
> Let's just limit at max_hw_sectors.
>
> >
> > > > have that silly limit. For NVMe that would require SGL support
> > > > (and some driver changes I've been wanting to make for long where
> > > > we always use SGLs for transfers larger than a single PRP if supported)
> > > If we could avoid dealing with a virt boundary, then that would be nice.
> > >
> > > Are there any patches yet for the change to always use SGLs for transfers
> > > larger than a single PRP?
> > No.
>
> As below, if we are going to enforce alignment/size of iovecs elsewhere,
> then I don't need to worry about virt boundary.
>
> >
> > > On a related topic, I am not sure about how - or if we even should -
> > > enforce iovec PAGE-alignment or length; rather, the user could just be
> > > advised that iovecs must be PAGE-aligned and min PAGE length to achieve
> > > atomic_write_unit_max.
> > Anything that just advises the user and is not clear cut and results in
> > an error is data loss waiting to happen. Even more so if it differs
> > from device to device.
>
> ok, fine. I suppose that we better enforce it then.
>
> Thanks,
> John
>

2023-12-19 05:21:45

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Mon, Dec 18, 2023 at 09:14:56PM -0800, Darrick J. Wong wrote:
> /me stumbles back in with plenty of covidbrain to go around.
>
> So ... Christoph, you're asking for a common API for
> sysadmins/applications to signal to the filesystem that they want all
> data allocations aligned to a given value, right?
>
> e.g. if a nvme device advertises a capability for untorn writes between
> $lbasize and 64k, then we need a way to set up each untorn-file with
> some alignment between $lbasize and 64k?
>
> or if cxl storage becomes not ung-dly expensive, then we'd want a way to
> set up files with an alignment of 2M (x86) or 512M (arm64lol) to take
> advantage of PMD mmap mappings?

The most important point is to not mix these up.

If I want to use a file for atomic writes I should tell the fs about
it, and preferably in a way that does not require me to know about weird
internal implementation details of every file system. I really just
want to use atomic writes. Preferably I'd just start using them after
asking for the limits. But that's obviously not going to work for
file systems that use the hardware offload and don't otherwise align
to the limit (which would suck for very small files anyway :))

So as a compromise I tell the file system before writing or otherwise
adding any data [1] to a file that I want to use atomic writes so that
the fs can take the right decisions.

[1] reflinking data into such a marked file will be ... interesting.


2023-12-19 12:44:36

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 19/12/2023 05:21, Christoph Hellwig wrote:
> On Mon, Dec 18, 2023 at 09:14:56PM -0800, Darrick J. Wong wrote:
>> /me stumbles back in with plenty of covidbrain to go around.
>>
>> So ... Christoph, you're asking for a common API for
>> sysadmins/applications to signal to the filesystem that they want all
>> data allocations aligned to a given value, right?
>>
>> e.g. if a nvme device advertises a capability for untorn writes between
>> $lbasize and 64k, then we need a way to set up each untorn-file with
>> some alignment between $lbasize and 64k?
>>
>> or if cxl storage becomes not ung-dly expensive, then we'd want a way to
>> set up files with an alignment of 2M (x86) or 512M (arm64lol) to take
>> advantage of PMD mmap mappings?
>
> The most important point is to not mix these up.
>
> If I want to use a file for atomic writes I should tell the fs about
> it, and preferably in a way that does not require me to know about weird
> internal implementation details of every file system. I really just
> want to use atomic writes. Preferably I'd just start using them after
> asking for the limits. But that's obviously not going to work for
> file systems that use the hardware offload and don't otherwise align
> to the limit (which would suck for very small files anyway :))
>
> So as a compromise I tell the file system before writing or otherwise
> adding any data [1] to a file that I want to use atomic writes so that
> the fs can take the right decisions.
>
> [1] reflinking data into such a marked file will be ... interesting.
>

How about something based on fcntl, like below? We will prob also
require some per-FS flag for enabling atomic writes without HW support.
That flag might also be useful for XFS for differentiating forcealign
for atomic writes from plain forcealign.

---8<----

diff --git a/fs/fcntl.c b/fs/fcntl.c
index c80a6acad742..d4a50655565a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -313,6 +313,15 @@ static long fcntl_rw_hint(struct file *file, unsigned int cmd,
}
}

+static long fcntl_atomic_write(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ if (file->f_op->enable_atomic_writes)
+ return file->f_op->enable_atomic_writes(file, arg);
+
+ return -EOPNOTSUPP;
+}
+
static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
struct file *filp)
{
@@ -419,6 +428,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
case F_SET_RW_HINT:
err = fcntl_rw_hint(filp, cmd, arg);
break;
+ case F_SET_ATOMIC_WRITE_SIZE:
+ err = fcntl_atomic_write(filp, cmd, arg);
+ break;
default:
break;
}
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e33e5e13b95f..dba206cbe1ab 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1478,6 +1478,46 @@ xfs_file_mmap(
return 0;
}

+STATIC int
+xfs_enable_atomic_writes(
+ struct file *file,
+ unsigned int unit_max)
+{
+ struct xfs_inode *ip = XFS_I(file_inode(file));
+ struct mnt_idmap *idmap = file_mnt_idmap(file);
+ struct dentry *dentry = file->f_path.dentry;
+ struct xfs_buftarg *target = xfs_inode_buftarg(ip);
+ struct block_device *bdev = target->bt_bdev;
+ struct request_queue *q = bdev->bd_queue;
+ struct inode *inode = &ip->i_vnode;
+
+ if (queue_atomic_write_unit_max_bytes(q) == 0) {
+ if (unit_max != 0) {
+ /* We expect unbounded CoW size if no HW support */
+ return -ENOTBLK;
+ }
+ /* Do something for CoW support ... */
+
+ return 0;
+ }
+
+ if (!unit_max)
+ return -EINVAL;
+
+ /* bodge alert */
+ if (inode->i_op->fileattr_set) {
+ struct fileattr fa = {
+ .fsx_extsize = unit_max,
+ .fsx_xflags = FS_XFLAG_EXTSIZE | FS_XFLAG_FORCEALIGN,
+ .fsx_valid = 1,
+ };
+
+ return inode->i_op->fileattr_set(idmap, dentry, &fa);
+ }
+
+ return -EOPNOTSUPP;
+}
+
const struct file_operations xfs_file_operations = {
.llseek = xfs_file_llseek,
.read_iter = xfs_file_read_iter,
@@ -1498,6 +1538,7 @@ const struct file_operations xfs_file_operations = {
.fallocate = xfs_file_fallocate,
.fadvise = xfs_file_fadvise,
.remap_file_range = xfs_file_remap_range,
+ .enable_atomic_writes = xfs_enable_atomic_writes,
};

const struct file_operations xfs_dir_file_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98b7a7a8c42e..a715f98057b8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1956,6 +1956,7 @@ struct file_operations {
int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
unsigned int poll_flags);
+ int (*enable_atomic_writes)(struct file *, unsigned int unit_max);
} __randomize_layout;

/* Wrap a directory iterator that needs exclusive inode access */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6c80f96049bd..69780c49622e 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -56,6 +56,8 @@
#define F_GET_FILE_RW_HINT (F_LINUX_SPECIFIC_BASE + 13)
#define F_SET_FILE_RW_HINT (F_LINUX_SPECIFIC_BASE + 14)

+#define F_SET_ATOMIC_WRITE_SIZE (F_LINUX_SPECIFIC_BASE + 15)
+
/*
* Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
* used to clear any hints previously set.
--


--->8----
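For illustration, an application would use the proposed opcode roughly
as below (a sketch against the RFC diff above only;
F_SET_ATOMIC_WRITE_SIZE does not exist in any released headers):

#include <fcntl.h>
#include <stdio.h>

/* Ask the FS to enable atomic writes of up to unit_max bytes on fd */
static int enable_atomic_writes(int fd, unsigned int unit_max)
{
	if (fcntl(fd, F_SET_ATOMIC_WRITE_SIZE, unit_max) < 0) {
		perror("F_SET_ATOMIC_WRITE_SIZE");
		return -1;
	}
	return 0;
}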

2023-12-19 15:18:24

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
> How about something based on fcntl, like below? We will prob also require
> some per-FS flag for enabling atomic writes without HW support. That flag
> might also be useful for XFS for differentiating forcealign for atomic
> writes from plain forcealign.

I would have just exposed it through a user visible flag instead of
adding yet another ioctl/fcntl opcode and yet another method.

And yes, for anything that doesn't always support atomic writes it would
need to be persisted.

2023-12-19 16:54:23

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 19/12/2023 15:17, Christoph Hellwig wrote:
> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>> How about something based on fcntl, like below? We will prob also require
>> some per-FS flag for enabling atomic writes without HW support. That flag
>> might also be useful for XFS for differentiating forcealign for atomic
>> writes from plain forcealign.
> I would have just exposed it through a user visible flag instead of
> adding yet another ioctl/fcntl opcode and yet another method.
>

Any specific type of flag?

I would suggest a file attribute which we can set via chattr, but that
is still using an ioctl and would require a new inode flag; but at least
there is standard userspace support.

> And yes, for anything that doesn't always support atomic writes it would
> need to be persisted.



2023-12-21 06:51:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
> On 19/12/2023 15:17, Christoph Hellwig wrote:
>> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>>> How about something based on fcntl, like below? We will prob also require
>>> some per-FS flag for enabling atomic writes without HW support. That flag
>>> might also be useful for XFS for differentiating forcealign for atomic
>>> writes from plain forcealign.
>> I would have just exposed it through a user visible flag instead of
>> adding yet another ioctl/fcntl opcode and yet another method.
>>
>
> Any specific type of flag?
>
> I would suggest a file attribute which we can set via chattr, but that is
> still using an ioctl and would require a new inode flag; but at least there
> is standard userspace support.

I'd be fine with that, but we're kinda running out of flag there.
That's why I suggested the FS_XFLAG_ instead, which basically works
the same.


2023-12-21 09:50:53

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 21/12/2023 06:50, Christoph Hellwig wrote:
> On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
>> On 19/12/2023 15:17, Christoph Hellwig wrote:
>>> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>>>> How about something based on fcntl, like below? We will prob also require
>>>> some per-FS flag for enabling atomic writes without HW support. That flag
>>>> might also be useful for XFS for differentiating forcealign for atomic
>>>> writes from plain forcealign.
>>> I would have just exposed it through a user visible flag instead of
>>> adding yet another ioctl/fcntl opcode and yet another method.
>>>
>>
>> Any specific type of flag?
>>
>> I would suggest a file attribute which we can set via chattr, but that is
>> still using an ioctl and would require a new inode flag; but at least there
>> is standard userspace support.
>
> I'd be fine with that, but we're kinda running out of flag there.

Yeah, in looking at e2fsprogs they are all used.

> That's why I suggested the FS_XFLAG_ instead, which basically works
> the same.
>

ok, fine, we can try that out.

On another topic, maybe you can advise..

I noticed the NVMe patch to stop always setting virt boundary (thanks),
but I am struggling with the wording of the iovec rules. I'd like to reuse
iov_iter_is_aligned() to enforce any such rule.

I am thinking:
- ubuf / iovecs need to be PAGE-aligned
- each iovec length needs to be a multiple of PAGE_SIZE

But that does not work for total length < PAGE_SIZE.

So then we could have:
- ubuf / iovecs need to be PAGE-aligned
- each iovec length needs to be a multiple of atomic_write_unit_min. If the
total length > PAGE_SIZE, each iovec length also needs to be a multiple of
PAGE_SIZE.

I'd rather something simpler. Maybe it's ok.
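For illustration, the second set of rules could be expressed with
iov_iter_is_aligned() roughly as below (an untested sketch; unit_min
stands for the file's atomic_write_unit_min and is assumed to be a
power-of-2):

/*
 * Sketch: iovec bases must be PAGE-aligned; lengths must be a
 * multiple of unit_min, and also of PAGE_SIZE once the total
 * write is larger than a page.
 */
static bool atomic_write_iter_aligned(struct iov_iter *iter,
		unsigned int unit_min)
{
	unsigned int len_mask = unit_min - 1;

	if (iov_iter_count(iter) > PAGE_SIZE)
		len_mask |= PAGE_SIZE - 1;

	return iov_iter_is_aligned(iter, PAGE_SIZE - 1, len_mask);
}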

Thanks,
John


2023-12-21 12:19:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Thu, Dec 21, 2023 at 09:49:35AM +0000, John Garry wrote:
> I noticed the NVMe patch to stop always setting virt boundary (thanks), but
> I am struggling for the wording for iovecs rules. I'd like to reuse
> iov_iter_is_aligned() to enforce any such rule.
>
> I am thinking:
> - ubuf / iovecs need to be PAGE-aligned
> - each iovec length needs to be a multiple of PAGE_SIZE
>
> But that does not work for total length < PAGE_SIZE.
>
> So then we could have:
> - ubuf / iovecs need to be PAGE-aligned
> - each iovec length needs to be a multiple of atomic_write_unit_min. If the
> total length > PAGE_SIZE, each iovec length also needs to be a multiple of
> PAGE_SIZE.
>
> I'd rather something simpler. Maybe it's ok.

If we decided to not support atomic writes on anything setting a virt
boundary we don't have to care about the alignment of each vector,
and IMHO we should do that as everything else would be a life in
constant pain. If we really have a use case for atomic writes on
consumer NVMe devices we'll just have to limit it to a single iovec.

2023-12-21 12:56:29

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 21/12/2023 12:19, Christoph Hellwig wrote:
>> So then we could have:
>> - ubuf / iovecs need to be PAGE-aligned
>> - each iovec length needs to be a multiple of atomic_write_unit_min. If the
>> total length > PAGE_SIZE, each iovec length also needs to be a multiple of
>> PAGE_SIZE.
>>
>> I'd rather something simpler. Maybe it's ok.
> If we decided to not support atomic writes on anything setting a virt
> boundary we don't have to care about the alignment of each vector,

ok, I think that alignment is not so important, but we still need to
consider a minimum length per iovec, such that we will always be able to
fit a write of length atomic_write_unit_max in a bio.

> and IMHO we should do that as everything else would be a life in
> constant pain. If we really have a use case for atomic writes on
> consumer NVMe devices we'll just have to limit it to a single iovec.

I'd be more inclined to say that SGL support is compulsory, but I don't
know if that is limiting atomic writes support to an unsatisfactory
market segment.



2023-12-21 13:01:28

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Thu, Dec 21, 2023 at 12:48:24PM +0000, John Garry wrote:
>>> - ubuf / iovecs need to be PAGE-aligned
>>> - each iovec length needs to be a multiple of atomic_write_unit_min. If the
>>> total length > PAGE_SIZE, each iovec length also needs to be a multiple of
>>> PAGE_SIZE.
>>>
>>> I'd rather something simpler. Maybe it's ok.
>> If we decided to not support atomic writes on anything setting a virt
>> boundary we don't have to care about the alignment of each vector,
>
> ok, I think that alignment is not so important, but we still need to
> consider a minimum length per iovec, such that we will always be able to
> fit a write of length atomic_write_unit_max in a bio.

I don't think you mean a minimum length per iovec for that, but a maximum
number of iovecs instead. For SGL-capable devices that would be
BIO_MAX_VECS, otherwise 1.

2023-12-21 13:19:35

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 21/12/2023 12:57, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 12:48:24PM +0000, John Garry wrote:
>>>> - ubuf / iovecs need to be PAGE-aligned
>>>> - each iovec length needs to be a multiple of atomic_write_unit_min. If the
>>>> total length > PAGE_SIZE, each iovec length also needs to be a multiple of
>>>> PAGE_SIZE.
>>>>
>>>> I'd rather something simpler. Maybe it's ok.
>>> If we decided to not support atomic writes on anything setting a virt
>>> boundary we don't have to care about the alignment of each vector,
>>
>> ok, I think that alignment is not so important, but we still need to
>> consider a minimum length per iovec, such that we will always be able to
>> fit a write of length atomic_write_unit_max in a bio.
>
> I don't think you mean a minimum length per iovec for that, but a maximum
> number of iovecs instead.

That would make sense, but I was thinking that if the request queue has
a limit on segments then we would need to specify an iovec min length.

> For SGL-capable devices that would be
> BIO_MAX_VECS, otherwise 1.

ok, but we would need to advertise that or whatever segment limit. A
statx field just for that seems a bit inefficient in terms of space.

2023-12-21 13:22:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Thu, Dec 21, 2023 at 01:18:33PM +0000, John Garry wrote:
>> For SGL-capable devices that would be
>> BIO_MAX_VECS, otherwise 1.
>
> ok, but we would need to advertise that or whatever segment limit. A statx
> field just for that seems a bit inefficient in terms of space.

I'd rather not hard code BIO_MAX_VECS in the ABI, which suggests we
want to export it as a field. Network file systems also might have
their own limits for one reason or another.
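To sketch where such a field could live, following the statx additions
already in this series (the segments field name below is hypothetical):

struct statx {
	/* ... existing fields ... */
	__u32 stx_atomic_write_unit_min;	/* added by this series */
	__u32 stx_atomic_write_unit_max;	/* added by this series */
	__u32 stx_atomic_write_segments_max;	/* hypothetical: max iovecs
						   per atomic write */
	/* ... */
};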

2023-12-21 14:04:11

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 21/12/2023 13:22, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 01:18:33PM +0000, John Garry wrote:
>>> For SGL-capable devices that would be
>>> BIO_MAX_VECS, otherwise 1.
>>
>> ok, but we would need to advertise that or whatever segment limit. A statx
>> field just for that seems a bit inefficient in terms of space.
>
> I'd rather not hard code BIO_MAX_VECS in the ABI,

For sure

> which suggests we
> want to export it as a field.

ok

> Network file systems also might have
> their own limits for one reason or another.

Understood

2024-01-09 10:26:19

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 21/12/2023 06:50, Christoph Hellwig wrote:
> On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
>> On 19/12/2023 15:17, Christoph Hellwig wrote:
>>> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>>>> How about something based on fcntl, like below? We will prob also require
>>>> some per-FS flag for enabling atomic writes without HW support. That flag
>>>> might also be useful for XFS for differentiating forcealign for atomic
>>>> writes from plain forcealign.
>>> I would have just exposed it through a user visible flag instead of
>>> adding yet another ioctl/fcntl opcode and yet another method.
>>>
>> Any specific type of flag?
>>
>> I would suggest a file attribute which we can set via chattr, but that is
>> still using an ioctl and would require a new inode flag; but at least there
>> is standard userspace support.
> I'd be fine with that, but we're kinda running out of flag there.
> That's why I suggested the FS_XFLAG_ instead, which basically works
> the same.

Hi Christoph,

Coming back to this topic... how about this FS_XFLAG_ and fsxattr update:

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index da43810b7485..9ef15fced20c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -118,7 +118,8 @@ struct fsxattr {
__u32 fsx_nextents; /* nextents field value (get) */
__u32 fsx_projid; /* project identifier (get/set) */
__u32 fsx_cowextsize; /* CoW extsize field value (get/set) */
- unsigned char fsx_pad[8];
+ __u32 fsx_atomicwrites_size; /* unit max */
+ unsigned char fsx_pad[4];
};

/*
@@ -140,6 +141,7 @@ struct fsxattr {
#define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */
#define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */
#define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
+#define FS_XFLAG_ATOMICWRITES 0x00020000
#define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */

/* the read-only stuff doesn't really belong here, but any other place is

Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE
being set.

So a user can issue:

>xfs_io -c "atomic-writes 64K" mnt/file
>xfs_io -c "atomic-writes" mnt/file
[65536] mnt/file

and then:

>xfs_io -c "lsattr" mnt/file
------------------W mnt/file

(W is new flag for atomic writes obvs)

The user will still have to issue statx to get the actual atomic write
limit for a file, as 'xfs_io -c "atomic-writes"' does not take into
account any HW/linux block layer atomic write limits.

FS_XFLAG_ATOMICWRITES will force XFS extent size and alignment to
fsx_atomicwrites_size when we have HW support, so effectively the same as
forcealign. For no HW support, we still specify a size. In the case of a
possible XFS CoW solution for no atomic write HW support, I suppose that
there would be no size limit in reality, so specifying the size would
just be for userspace experience consistency.

Is this the sort of userspace API which you would like to see?

John

2024-01-09 16:02:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote:
> So a user can issue:
>
> >xfs_io -c "atomic-writes 64K" mnt/file
> >xfs_io -c "atomic-writes" mnt/file
> [65536] mnt/file

Let me try to decipher that:

- the first call sets a 64k fsx_atomicwrites_size
- the second call queries fsx_atomicwrites_size?

> The user will still have to issue statx to get the actual atomic write
> limit for a file, as 'xfs_io -c "atomic-writes"' does not take into account
> any HW/linux block layer atomic write limits.

So will the set side never fail?

> Is this the sort of userspace API which you would like to see?

What I had in mind (and that doesn't mean it's right..) was that
the user just sets a binary flag, and the fs reports the best it
could. But there might be reasons to do it differently.


2024-01-09 16:54:48

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 09/01/2024 16:02, Christoph Hellwig wrote:
> On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote:
>> So a user can issue:
>>
>>> xfs_io -c "atomic-writes 64K" mnt/file
>>> xfs_io -c "atomic-writes" mnt/file
>> [65536] mnt/file
> Let me try to decipher that:
>
> - the first call sets a 64k fsx_atomicwrites_size

It should also set FS_XFLAG_ATOMICWRITES for the file. So this step does
everything to enable atomic writes for that file.

> - the second call queries fsx_atomicwrites_size?

Right, I'm just demo'ing how it would look

>
>> The user will still have to issue statx to get the actual atomic write
>> limit for a file, as 'xfs_io -c "atomic-writes"' does not take into account
>> any HW/linux block layer atomic write limits.
> So will the set side never fail?

It could fail.

Examples of when it could fail could include:

a. If the user gave a bad size value. The size should be a power-of-2,
divisible into the AG size, and compatible with any stripe alignment.

b. If the file already had flags set which are incompatible with or not
supported for atomic writes.

c. The file already has data written. I guess that's obvious.

>
>> Is this the sort of userspace API which you would like to see?
> What I had in mind (and that doesn't mean it's right..) was that
> the user just sets a binary flag, and the fs reports the best it
> could. But there might be reasons to do it differently.

That is what I am trying to do, but I also want the user to be able to
specify the atomic write unit max size. I'd rather not use the atomic
write unit max from the device for that, as it could be huge. However,
re-reading a. above makes me think that the kernel should have more
input on this, but we still need some input on the max from the user...

Thanks,
John


2024-01-09 23:14:35

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote:
> On 21/12/2023 06:50, Christoph Hellwig wrote:
> > On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
> > > On 19/12/2023 15:17, Christoph Hellwig wrote:
> > > > On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
> > > > > How about something based on fcntl, like below? We will prob also require
> > > > > some per-FS flag for enabling atomic writes without HW support. That flag
> > > > > might also be useful for XFS for differentiating forcealign for atomic
> > > > > writes from plain forcealign.
> > > > I would have just exposed it through a user visible flag instead of
> > > > adding yet another ioctl/fcntl opcode and yet another method.
> > > >
> > > Any specific type of flag?
> > >
> > > I would suggest a file attribute which we can set via chattr, but that is
> > > still using an ioctl and would require a new inode flag; but at least there
> > > is standard userspace support.
> > I'd be fine with that, but we're kinda running out of flag there.
> > That's why I suggested the FS_XFLAG_ instead, which basically works
> > the same.
>
> Hi Christoph,
>
> Coming back to this topic... how about this FS_XFLAG_ and fsxattr update:
>
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index da43810b7485..9ef15fced20c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -118,7 +118,8 @@ struct fsxattr {
> __u32 fsx_nextents; /* nextents field value (get) */
> __u32 fsx_projid; /* project identifier (get/set) */
> __u32 fsx_cowextsize; /* CoW extsize field value (get/set) */
> - unsigned char fsx_pad[8];
> + __u32 fsx_atomicwrites_size; /* unit max */
> + unsigned char fsx_pad[4];
> };
>
> /*
> @@ -140,6 +141,7 @@ struct fsxattr {
> #define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */
> #define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */
> #define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
> +#define FS_XFLAG_ATOMICWRITES 0x00020000
> #define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
>
> /* the read-only stuff doesn't really belong here, but any other place is
>
> Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE being
> set.
>
> So a user can issue:
>
> >xfs_io -c "atomic-writes 64K" mnt/file
> >xfs_io -c "atomic-writes" mnt/file
> [65536] mnt/file

Where are you going to store this value in the inode? It requires a
new field in the inode and so is a change of on-disk format, right?

As it is, I really don't see this as a better solution than the
original generic "force align" flag that simply makes the extent
size hint alignment a hard physical alignment requirement rather
than just a hint. This has multiple uses (DAX PMD alignment is
another), so I just don't see why something that has a single,
application specific API that implements a hard physical alignment
is desirable.

Indeed, the whole reason that extent size hints are so versatile is
that they implement a generic allocation alignment/size function
that can be used for anything your imagination extends to. If they
were implemented as a "only allow RAID stripe aligned/sized
allocation" for the original use case then that functionality would
have been far less useful than it has proven to be over the past
couple of decades.

Hence history teaches us that we should be designing the API around
the generic filesystem function required (hard alignment of physical
extent allocation), not the specific use case that requires that
functionality.

-Dave.
--
Dave Chinner
[email protected]

2024-01-10 08:56:27

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 09/01/2024 23:04, Dave Chinner wrote:
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -118,7 +118,8 @@ struct fsxattr {
>> __u32 fsx_nextents; /* nextents field value (get) */
>> __u32 fsx_projid; /* project identifier (get/set) */
>> __u32 fsx_cowextsize; /* CoW extsize field value (get/set) */
>> - unsigned char fsx_pad[8];
>> + __u32 fsx_atomicwrites_size; /* unit max */
>> + unsigned char fsx_pad[4];
>> };
>>
>> /*
>> @@ -140,6 +141,7 @@ struct fsxattr {
>> #define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */
>> #define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */
>> #define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
>> +#define FS_XFLAG_ATOMICWRITES 0x00020000
>> #define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
>>
>> /* the read-only stuff doesn't really belong here, but any other place is
>>
>> Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE being
>> set.
>>
>> So a user can issue:
>>
>>> xfs_io -c "atomic-writes 64K" mnt/file
>>> xfs_io -c "atomic-writes" mnt/file
>> [65536] mnt/file
> Where are you going to store this value in the inode? It requires a
> new field in the inode and so is a change of on-disk format, right?

It would require an on-disk format change, unless we can find an
alternative way to store the value, like:
a. re-use pre-existing extsize or even cowextsize fields and 'xfs_io -c
"atomic-writes $SIZE"' would update those fields and
FS_XFLAG_ATOMICWRITES would be incompatible with FS_XFLAG_COWEXTSIZE or
FS_XFLAG_EXTSIZE
b. require that FS_XFLAG_EXTSIZE and extsize also be set to enable atomic
writes, with extsize used as the atomic write unit max

I'm trying to think of ways to avoid requiring a value, but I don't see
good options, like:
- make atomic write unit max some compile-time option
- require mkfs stripe alignment/width be set and use that as basis for
atomic write unit max

We could just use the atomic write unit max which HW provides, but that
could be 1MB or more and that will hardly give efficient data usage for
small files. But maybe we don't care about that if we expect this
feature to only be used on DB files, which can be huge anyway. However I
still have concerns – we require that value to be fixed, but a disk
firmware update could increase that value, which could leave us with
pre-existing mis-aligned extents.

>
> As it is, I really don't see this as a better solution than the
> original generic "force align" flag that simply makes the extent
> size hint alignment a hard physical alignment requirement rather
> than just a hint. This has multiple uses (DAX PMD alignment is
> another), so I just don't see why something that has a single,
> application specific API that implements a hard physical alignment
> is desirable.

I would still hope that we will support forcealign separately for those
purposes.

>
> Indeed, the whole reason that extent size hints are so versatile is
> that they implement a generic allocation alignment/size function
> that can be used for anything your imagination extends to. If they
> were implemented as a "only allow RAID stripe aligned/sized
> allocation" for the original use case then that functionality would
> have been far less useful than it has proven to be over the past
> couple of decades.
>
> Hence history teaches us that we should be designing the API around
> the generic filesystem function required (hard alignment of physical
> extent allocation), not the specific use case that requires that
> functionality.

I understand your concern. However I am not sure that forcealign even
gives us everything we need to enable atomic writes. There is an
issue where we were required to pre-zero a file prior to issuing atomic
writes to ensure extents are suitably sized, so FS_XFLAG_ATOMICWRITES
would make the FS do what is required to avoid that pre-zeroing (though
that pre-zeroing requirement does sound like a forcealign issue...)

Furthermore, there was some desire to support atomic writes on block
devices with no HW support by using a CoW-based solution, and forcealign
would not be relevant there.

Thanks,
John


2024-01-10 09:19:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Wed, Jan 10, 2024 at 10:04:00AM +1100, Dave Chinner wrote:
> Hence history teaches us that we should be designing the API around
> the generic filesystem function required (hard alignment of physical
> extent allocation), not the specific use case that requires that
> functionality.

I disagree. The alignment requirement is an artefact of how you
implement atomic writes. As the fs user I care that I can do atomic
writes on a file and need to query how big the writes can be and
what alignment is required.

The forcealign feature is a sensible fs side implementation of that
if using hardware based atomic writes with alignment requirements,
but it is a really lousy userspace API.

So with John's API proposal for XFS with hardware alignment based atomic
writes we could still use force align.

Requesting atomic writes for an inode will set the forcealign flag
and the extent size hint, and after that it'll report atomic write
capabilities. Roughly the same implementation, but not an API
tied to an implementation detail.

2024-01-11 01:41:22

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Wed, Jan 10, 2024 at 10:19:29AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 10, 2024 at 10:04:00AM +1100, Dave Chinner wrote:
> > Hence history teaches us that we should be designing the API around
> > the generic filesystem function required (hard alignment of physical
> > extent allocation), not the specific use case that requires that
> > functionality.
>
> I disagree. The alignment requirement is an artefact of how you
> implement atomic writes. As the fs user I care that I can do atomic
> writes on a file and need to query how big the writes can be and
> what alignment is required.
>
> The forcealign feature is a sensible fs side implementation of that
> if using hardware based atomic writes with alignment requirements,
> but it is a really lousy userspace API.
>
> So with John's API proposal for XFS with hardware alignment based atomic
> writes we could still use force align.
>
> Requesting atomic writes for an inode will set the forcealign flag
> and the extent size hint, and after that it'll report atomic write
> capabilities. Roughly the same implementation, but not an API
> tied to an implementation detail.

Sounds good to me! So to summarize, this is approximately what
userspace programs would have to do:

struct statx statx;
struct fsxattr fsxattr;
int fd = open("/foofile", O_RDWR | O_DIRECT);

ioctl(fd, FS_IOC_GETXATTR, &fsxattr);

fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */

ioctl(fd, FS_IOC_SETXATTR, &fsxattr);

statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);

if (statx.stx_atomic_write_unit_max >= 16384) {
	pwritev2(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
	printf("HAPPY DANCE\n");
}

(Assume we bail out on errors.)

--D

2024-01-11 05:03:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
> struct statx statx;
> struct fsxattr fsxattr;
> int fd = open("/foofile", O_RDWR | O_DIRECT);
>
> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
>
> fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
> fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
>
> ioctl(fd, FS_IOC_SETXATTR, &fsxattr);
>
> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
>
> if (statx.stx_atomic_write_unit_max >= 16384) {
> pwritev2(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
> printf("HAPPY DANCE\n");
> }

I think this still needs a check if the fs needs alignment for
atomic writes at all. i.e.

struct statx statx;
struct fsxattr fsxattr;
int fd = open("/foofile", O_RDWR | O_DIRECT);

ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
if (statx.stx_atomic_write_unit_max < 16384) {
	bailout();
}

fsxattr.fsx_xflags |= FS_XFLAG_WRITE_ATOMIC;
if (statx.stx_atomic_write_alignment) {
	fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN;
	fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
}
if (ioctl(fd, FS_IOC_SETXATTR, &fsxattr) < 0) {
	bailout();
}

pwritev2(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
printf("HAPPY DANCE\n");

2024-01-11 09:56:42

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 11/01/2024 05:02, Christoph Hellwig wrote:
> On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
>> struct statx statx;
>> struct fsxattr fsxattr;
>> int fd = open("/foofile", O_RDWR | O_DIRECT);

I'm assuming O_CREAT also.

>>
>> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
>>
>> fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
>> fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
>>
>> ioctl(fd, FS_IOC_SETXATTR, &fsxattr);
>>
>> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
>>
>> if (statx.stx_atomic_write_unit_max >= 16384) {
>> pwritev2(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
>> printf("HAPPY DANCE\n");
>> }
>
> I think this still needs a check if the fs needs alignment for
> atomic writes at all. i.e.
>
> struct statx statx;
> struct fsxattr fsxattr;
> int fd = open("/foofile", O_RDWR | O_DIRECT);
>
> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
> if (statx.stx_atomic_write_unit_max < 16384) {
> bailout();
> }

How could this value be >= 16384 initially? Would it be from
pre-configured FS alignment, like XFS RT extsize? Or is this from some
special CoW-based atomic write support? Or FS block size of 16384?

Incidentally, for consistency only setting FS_XFLAG_WRITE_ATOMIC will
lead to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC
is set, would it make sense to have statx return 0 for
STATX_WRITE_ATOMIC? Otherwise the user may be misled to think that it is
ok to issue an atomic write (when it isn’t).

Thanks,
John

>
> fsxattr.fsx_xflags |= FS_XFLAG_WRITE_ATOMIC;
> if (statx.stx_atomic_write_alignment) {
> fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN;
> fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
> }
> if (ioctl(fd, FS_IOC_SETXATTR, &fsxattr) < 0) {
> bailout();
> }
>
> pwritev2(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
> printf("HAPPY DANCE\n");
>



2024-01-11 14:46:08

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Thu, Jan 11, 2024 at 09:55:36AM +0000, John Garry wrote:
> On 11/01/2024 05:02, Christoph Hellwig wrote:
>> On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
>>> struct statx statx;
>>> struct fsxattr fsxattr;
>>> int fd = open("/foofile", O_RDWR | O_DIRECT);
>
> I'm assuming O_CREAT also.

Yes.

>> I think this still needs a check if the fs needs alignment for
>> atomic writes at all. i.e.
>>
>> struct statx statx;
>> struct fsxattr fsxattr;
>> int fd = open("/foofile", O_RDWR | O_DIRECT);
>>
>> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
>> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
>> if (statx.stx_atomic_write_unit_max < 16384) {
>> bailout();
>> }
>
> How could this value be >= 16384 initially? Would it be from pre-configured
> FS alignment, like XFS RT extsize? Or is this from some special CoW-based
> atomic write support? Or FS block size of 16384?

Sorry, this check should not be here at all, we should only check it
later.

> Incidentally, for consistency only setting FS_XFLAG_WRITE_ATOMIC will lead
> to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC is set,
> would it make sense to have statx return 0 for STATX_WRITE_ATOMIC?

True. We might need to report the limits even without that, though.


2024-01-11 16:12:46

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes


>
>>> I think this still needs a check if the fs needs alignment for
>>> atomic writes at all. i.e.
>>>
>>> struct statx statx;
>>> struct fsxattr fsxattr;
>>> int fd = open("/foofile", O_RDWR | O_DIRECT);
>>>
>>> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
>>> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
>>> if (statx.stx_atomic_write_unit_max < 16384) {
>>> bailout();
>>> }
>>
>> How could this value be >= 16384 initially? Would it be from pre-configured
>> FS alignment, like XFS RT extsize? Or is this from some special CoW-based
>> atomic write support? Or FS block size of 16384?
>
> Sorry, this check should not be here at all, we should only check it
> later.
>
>> Incidentally, for consistency only setting FS_XFLAG_WRITE_ATOMIC will lead
>> to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC is set
>> would it make sense to have statx return 0 for STATX_WRITE_ATOMIC.
>
> True. We might need to report the limits even without that, though.

Could we just error the SETXATTR ioctl when FS_XFLAG_FORCEALIGN is not
set (and it is required)? The problem is that ioctl reports -EINVAL for
any such errors, so it is hard for the user to know the issue...

Thanks,
John


2024-01-11 16:15:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Thu, Jan 11, 2024 at 04:11:38PM +0000, John Garry wrote:
> Could we just error the SETXATTR ioctl when FS_XFLAG_FORCEALIGN is not set
> (and it is required)? The problem is that ioctl reports -EINVAL for any
> such errors, so hard for the user to know the issue...

Sure. Pick a good unique error code, though.

2024-01-16 11:37:41

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 21/12/2023 13:22, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 01:18:33PM +0000, John Garry wrote:
>>> For SGL-capable devices that would be
>>> BIO_MAX_VECS, otherwise 1.
>> ok, but we would need to advertise that or whatever segment limit. A statx
>> field just for that seems a bit inefficient in terms of space.
> I'd rather not hard code BIO_MAX_VECS in the ABI, which suggests we
> want to export it as a field. Network file systems also might have
> their own limits for one reason or another.

Hi Christoph,

I have been looking at this issue again and I am not sure if telling the
user the max number of segments allowed is the best option. I’m worried
that the resultant atomic write unit max will be too small.

The background again is that we want to tell the user what the maximum
atomic write unit size is, such that we can always guarantee to fit the
write in a single bio. And there would be no iovec length or alignment
rules.

The max segments value advertised would be min(queue max segments,
BIO_MAX_VECS), so it would be 256 when the request queue is not limiting.

The worst case scenario for iovec layout (most inefficient) which the
user could provide would be like .iov_base = 0x...0E00 and .iov_len =
0x400, which would mean that we would have 2x pages and 2x DMA sg elems
required for each 1024B-length iovec. I am assuming that we will still
use the direct IO rule of LBS length and alignment.

As such, we then need to set atomic write unit max = min(queue max
segments, BIO_MAX_VECS) * LBS. That would mean atomic write unit max 256
* 512 = 128K (for 512B LBS). For a DMA controller of max segments 64,
for example, we would have 32K. These seem too low.

Alternatively, I'm thinking that we should just limit to 1x iovec always,
and then atomic write unit max = (min(queue max segments, BIO_MAX_VECS)
- 1) * PAGE_SIZE [ignoring first/last iovec contents]. It also makes
support for non-enterprise NVMe drives more straightforward. If someone
wants, they can introduce support for multi-iovec later, but it would
prob require some more iovec length/alignment rules.
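To make the arithmetic concrete (a sketch, not code from the series):

/* Sketch: candidate atomic_write_unit_max derivations */
unsigned int max_segs = min_t(unsigned int,
		queue_max_segments(q), BIO_MAX_VECS);

/*
 * Multi-iovec worst case: each segment may carry only one logical
 * block, e.g. 256 * 512B = 128K, or 32K for 64 segments.
 */
unsigned int awu_max_multi = rounddown_pow_of_two(max_segs *
		queue_logical_block_size(q));

/*
 * Single-iovec rule: all but the first and last segments are
 * guaranteed to carry a full page.
 */
unsigned int awu_max_single = rounddown_pow_of_two((max_segs - 1) *
		PAGE_SIZE);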

Please let me know your thoughts.

Thanks,
John


2024-01-17 15:02:30

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On Tue, Jan 16, 2024 at 11:35:47AM +0000, John Garry wrote:
> As such, we then need to set atomic write unit max = min(queue max
> segments, BIO_MAX_VECS) * LBS. That would mean atomic write unit max 256 *
> 512 = 128K (for 512B LBS). For a DMA controller of max segments 64, for
> example, then we would have 32K. These seem too low.

I don't see how this would work if we support multiple sectors.

>
> Alternatively, I'm thinking that we should just limit to 1x iovec always, and
> then atomic write unit max = (min(queue max segments, BIO_MAX_VECS) - 1) *
> PAGE_SIZE [ignoring first/last iovec contents]. It also makes support for
> non-enterprise NVMe drives more straightforward. If someone wants, they can
> introduce support for multi-iovec later, but it would prob require some
> more iovec length/alignment rules.

Supporting just a single iovec initially is fine with me, as extending
that is pretty easy. Just check with your potential users that they can
live with it.

I'd probably still advertise the limits even if it currently always is 1.


2024-01-17 16:18:14

by John Garry

[permalink] [raw]
Subject: Re: [PATCH v2 00/16] block atomic writes

On 17/01/2024 15:02, Christoph Hellwig wrote:
> On Tue, Jan 16, 2024 at 11:35:47AM +0000, John Garry wrote:
>> As such, we then need to set atomic write unit max = min(queue max
>> segments, BIO_MAX_VECS) * LBS. That would mean atomic write unit max 256 *
>> 512 = 128K (for 512B LBS). For a DMA controller of max segments 64, for
>> example, then we would have 32K. These seem too low.
>
> I don't see how this would work if we support multiple sectors.
>
>>
>> Alternatively, I'm thinking that we should just limit to 1x iovec always, and
>> then atomic write unit max = (min(queue max segments, BIO_MAX_VECS) - 1) *
>> PAGE_SIZE [ignoring first/last iovec contents]. It also makes support for
>> non-enterprise NVMe drives more straightforward. If someone wants, they can
>> introduce support for multi-iovec later, but it would prob require some
>> more iovec length/alignment rules.
>
> Supporting just a single iovec initially is fine with me, as extending
> that is pretty easy. Just talk to your potential users that they can
> live with it.

Yeah, any porting I know about has just been using aio with
IO_CMD_PWRITE, so would be ok

>
> I'd probably still advertise the limits even if it currently always is 1.
>

I suppose that we don't need any special rule until we support > 1, as
we cannot break anyone already using > 1 :)

Thanks,
John