2023-09-29 15:51:43

by John Garry

Subject: [PATCH 01/21] block: Add atomic write operations to request_queue limits

From: Himanshu Madhani <[email protected]>

Add the following limits:
- atomic_write_boundary_bytes
- atomic_write_max_bytes
- atomic_write_unit_max_bytes
- atomic_write_unit_min_bytes

All atomic write limits are initialised to 0 to indicate no atomic write
support. Stacked devices are not supported for now.

Signed-off-by: Himanshu Madhani <[email protected]>
#jpg: Heavy rewrite
Signed-off-by: John Garry <[email protected]>
---
Documentation/ABI/stable/sysfs-block | 42 +++++++++++++++++++
block/blk-settings.c | 60 ++++++++++++++++++++++++++++
block/blk-sysfs.c | 33 +++++++++++++++
include/linux/blkdev.h | 33 +++++++++++++++
4 files changed, 168 insertions(+)
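
Note for reviewers (not part of the patch): a low-level driver would
typically advertise these limits from its probe/configure path with
something like the sketch below. The function name and the values are
made up for illustration only; a real driver must derive the numbers
from what the hardware reports.

#include <linux/blkdev.h>

/*
 * Hypothetical example only: the limit values are invented; a real
 * driver derives them from the device's reported capabilities.
 */
static void foo_set_atomic_write_limits(struct request_queue *q)
{
	/* Largest single atomic write accepted by the device: 128KB. */
	blk_queue_atomic_write_max_bytes(q, 128 * 1024);

	/* Any aligned power-of-two unit from 4KB to 64KB is written atomically. */
	blk_queue_atomic_write_unit_min_sectors(q, 4096 >> SECTOR_SHIFT);
	blk_queue_atomic_write_unit_max_sectors(q, 65536 >> SECTOR_SHIFT);

	/* Atomic writes must not cross a 64KB boundary in LBA space. */
	blk_queue_atomic_write_boundary_bytes(q, 65536);
}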

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 1fe9a553c37b..05df7f74cbc1 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -21,6 +21,48 @@ Description:
device is offset from the internal allocation unit's
natural alignment.

+What: /sys/block/<disk>/atomic_write_max_bytes
+Date: May 2023
+Contact: Himanshu Madhani <[email protected]>
+Description:
+ [RO] This parameter specifies the maximum atomic write
+ size reported by the device. An atomic write operation
+ must not exceed this number of bytes.
+
+
+What: /sys/block/<disk>/atomic_write_unit_min_bytes
+Date: May 2023
+Contact: Himanshu Madhani <[email protected]>
+Description:
+ [RO] This parameter specifies the smallest block which can
+ be written atomically with an atomic write operation. All
+ atomic write operations must begin at an
+ atomic_write_unit_min boundary and must be multiples of
+ atomic_write_unit_min. This value must be a power-of-two.
+
+
+What: /sys/block/<disk>/atomic_write_unit_max_bytes
+Date: May 2023
+Contact: Himanshu Madhani <[email protected]>
+Description:
+ [RO] This parameter defines the largest block which can be
+ written atomically with an atomic write operation. This
+ value must be a multiple of atomic_write_unit_min and must
+ be a power-of-two.
+
+
+What: /sys/block/<disk>/atomic_write_boundary_bytes
+Date: May 2023
+Contact: Himanshu Madhani <[email protected]>
+Description:
+ [RO] A device may need to internally split I/Os which
+ straddle a given logical block address boundary. In that
+ case a single atomic write operation will be processed as
+ one or more sub-operations which each complete atomically.
+ This parameter specifies the size in bytes of the atomic
+ boundary if one is reported by the device. This value must
+ be a power-of-two.
+

What: /sys/block/<disk>/diskseq
Date: February 2021
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 0046b447268f..d151be394c98 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,10 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
+ lim->atomic_write_unit_min_sectors = 0;
+ lim->atomic_write_unit_max_sectors = 0;
+ lim->atomic_write_max_sectors = 0;
+ lim->atomic_write_boundary_sectors = 0;
}

/**
@@ -183,6 +187,62 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
}
EXPORT_SYMBOL(blk_queue_max_discard_sectors);

+/**
+ * blk_queue_atomic_write_max_bytes - set max bytes supported by
+ * the device for atomic write operations.
+ * @q: the request queue for the device
+ * @bytes: maximum bytes supported
+ */
+void blk_queue_atomic_write_max_bytes(struct request_queue *q,
+ unsigned int bytes)
+{
+ q->limits.atomic_write_max_sectors = bytes >> SECTOR_SHIFT;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
+
+/**
+ * blk_queue_atomic_write_boundary_bytes - set the boundary in the device's
+ * logical block address space which an atomic write should not cross.
+ * @q: the request queue for the device
+ * @bytes: must be a power-of-two.
+ */
+void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
+ unsigned int bytes)
+{
+ q->limits.atomic_write_boundary_sectors = bytes >> SECTOR_SHIFT;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_boundary_bytes);
+
+/**
+ * blk_queue_atomic_write_unit_min_sectors - smallest unit that can be written
+ * atomically to the device.
+ * @q: the request queue for the device
+ * @sectors: must be a power-of-two.
+ */
+void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
+ unsigned int sectors)
+{
+ struct queue_limits *limits = &q->limits;
+
+ limits->atomic_write_unit_min_sectors = sectors;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_unit_min_sectors);
+
+/**
+ * blk_queue_atomic_write_unit_max_sectors - largest unit that can be written
+ * atomically to the device.
+ * @q: the request queue for the device
+ * @sectors: must be a power-of-two.
+ */
+void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
+ unsigned int sectors)
+{
+ struct queue_limits *limits = &q->limits;
+
+ limits->atomic_write_unit_max_sectors = sectors;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);
+
/**
* blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
* @q: the request queue for the device
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 63e481262336..c193a04d7df7 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -118,6 +118,30 @@ static ssize_t queue_max_discard_segments_show(struct request_queue *q,
return queue_var_show(queue_max_discard_segments(q), page);
}

+static ssize_t queue_atomic_write_max_bytes_show(struct request_queue *q,
+ char *page)
+{
+ return queue_var_show(queue_atomic_write_max_bytes(q), page);
+}
+
+static ssize_t queue_atomic_write_boundary_show(struct request_queue *q,
+ char *page)
+{
+ return queue_var_show(queue_atomic_write_boundary_bytes(q), page);
+}
+
+static ssize_t queue_atomic_write_unit_min_show(struct request_queue *q,
+ char *page)
+{
+ return queue_var_show(queue_atomic_write_unit_min_bytes(q), page);
+}
+
+static ssize_t queue_atomic_write_unit_max_show(struct request_queue *q,
+ char *page)
+{
+ return queue_var_show(queue_atomic_write_unit_max_bytes(q), page);
+}
+
static ssize_t queue_max_integrity_segments_show(struct request_queue *q, char *page)
{
return queue_var_show(q->limits.max_integrity_segments, page);
@@ -507,6 +531,11 @@ QUEUE_RO_ENTRY(queue_discard_max_hw, "discard_max_hw_bytes");
QUEUE_RW_ENTRY(queue_discard_max, "discard_max_bytes");
QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");

+QUEUE_RO_ENTRY(queue_atomic_write_max_bytes, "atomic_write_max_bytes");
+QUEUE_RO_ENTRY(queue_atomic_write_boundary, "atomic_write_boundary_bytes");
+QUEUE_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max_bytes");
+QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
+
QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes");
QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes");
@@ -633,6 +662,10 @@ static struct attribute *queue_attrs[] = {
&queue_discard_max_entry.attr,
&queue_discard_max_hw_entry.attr,
&queue_discard_zeroes_data_entry.attr,
+ &queue_atomic_write_max_bytes_entry.attr,
+ &queue_atomic_write_boundary_entry.attr,
+ &queue_atomic_write_unit_min_entry.attr,
+ &queue_atomic_write_unit_max_entry.attr,
&queue_write_same_max_entry.attr,
&queue_write_zeroes_max_entry.attr,
&queue_zone_append_max_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index eef450f25982..c10e47bdb34f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -309,6 +309,11 @@ struct queue_limits {
unsigned int discard_alignment;
unsigned int zone_write_granularity;

+ unsigned int atomic_write_boundary_sectors;
+ unsigned int atomic_write_max_sectors;
+ unsigned int atomic_write_unit_min_sectors;
+ unsigned int atomic_write_unit_max_sectors;
+
unsigned short max_segments;
unsigned short max_integrity_segments;
unsigned short max_discard_segments;
@@ -908,6 +913,14 @@ void blk_queue_zone_write_granularity(struct request_queue *q,
unsigned int size);
extern void blk_queue_alignment_offset(struct request_queue *q,
unsigned int alignment);
+extern void blk_queue_atomic_write_max_bytes(struct request_queue *q,
+ unsigned int bytes);
+extern void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
+ unsigned int sectors);
+extern void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
+ unsigned int sectors);
+extern void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
+ unsigned int bytes);
void disk_update_readahead(struct gendisk *disk);
extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
@@ -1312,6 +1325,26 @@ static inline int queue_dma_alignment(const struct request_queue *q)
return q ? q->limits.dma_alignment : 511;
}

+static inline unsigned int queue_atomic_write_unit_max_bytes(const struct request_queue *q)
+{
+ return q->limits.atomic_write_unit_max_sectors << SECTOR_SHIFT;
+}
+
+static inline unsigned int queue_atomic_write_unit_min_bytes(const struct request_queue *q)
+{
+ return q->limits.atomic_write_unit_min_sectors << SECTOR_SHIFT;
+}
+
+static inline unsigned int queue_atomic_write_boundary_bytes(const struct request_queue *q)
+{
+ return q->limits.atomic_write_boundary_sectors << SECTOR_SHIFT;
+}
+
+static inline unsigned int queue_atomic_write_max_bytes(const struct request_queue *q)
+{
+ return q->limits.atomic_write_max_sectors << SECTOR_SHIFT;
+}
+
static inline unsigned int bdev_dma_alignment(struct block_device *bdev)
{
return queue_dma_alignment(bdev_get_queue(bdev));
--
2.31.1


2023-10-03 16:41:40

by Bart Van Assche

Subject: Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits

On 9/29/23 03:27, John Garry wrote:
> +What: /sys/block/<disk>/atomic_write_unit_min_bytes
> +Date: May 2023
> +Contact: Himanshu Madhani <[email protected]>
> +Description:
> + [RO] This parameter specifies the smallest block which can
> + be written atomically with an atomic write operation. All
> + atomic write operations must begin at an
> + atomic_write_unit_min boundary and must be multiples of
> + atomic_write_unit_min. This value must be a power-of-two.

I have two comments about these descriptions:
- Referring to "atomic writes" only is not sufficient. It should be
explained that in this context "atomic" means "indivisible" only and
also that there are no guarantees that the data written by an atomic
write will survive a power failure. See also the difference between
the NVMe parameters AWUN and AWUPF.
- atomic_write_unit_min_bytes will always be the logical block size so I
don't think it is useful to make the block layer track this value nor
to export this value through sysfs.

Thanks,

Bart.

2023-10-04 03:01:59

by Martin K. Petersen

Subject: Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits


Bart,

> also that there are no guarantees that the data written by an atomic
> write will survive a power failure. See also the difference between
> the NVMe parameters AWUN and AWUPF.

We only care about *PF. The *N variants were cut from the same cloth as
TRIM and UNMAP.

--
Martin K. Petersen Oracle Linux Engineering

2023-10-04 17:28:41

by Bart Van Assche

Subject: Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits

On 10/3/23 20:00, Martin K. Petersen wrote:
>
> Bart,
>
>> also that there are no guarantees that the data written by an
>> atomic write will survive a power failure. See also the difference
>> between the NVMe parameters AWUN and AWUPF.
>
> We only care about *PF. The *N variants were cut from the same cloth
> as TRIM and UNMAP.

In my opinion there is a contradiction between the above reply and patch
19/21 of this series. Data written with the SCSI WRITE ATOMIC command is
not guaranteed to survive a power failure. The following quote from
SBC-5 makes this clear:

"4.29.2 Atomic write operations that do not complete

If the device server is not able to successfully complete an atomic
write operation (e.g., the command is terminated or aborted), then the
device server shall ensure that none of the LBAs specified by the atomic
write operation have been altered by any logical block data from the
atomic write operation (i.e., the specified LBAs return logical block
data as if the atomic write operation had not occurred).

If a power loss causes loss of logical block data from an atomic write
operation in a volatile write cache that has not yet been stored on the
medium, then the device server shall ensure that none of the LBAs
specified by the atomic write operation have been altered by any logical
block data from the atomic write operation (i.e., the specified LBAs
return logical block data as if the atomic write operation had not
occurred and writes from the cache to the medium preserve the specified
atomicity)."

In other words, if a power failure occurs, SCSI devices are allowed to
discard the data written with a WRITE ATOMIC command if no SYNCHRONIZE
CACHE command has been submitted after that WRITE ATOMIC command or if
the SYNCHRONIZE CACHE command did not complete before the power failure.

Thanks,

Bart.

2023-10-04 18:27:26

by Martin K. Petersen

Subject: Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits


Bart,

> In my opinion there is a contradiction between the above reply and
> patch 19/21 of this series. Data written with the SCSI WRITE ATOMIC
> command is not guaranteed to survive a power failure.

That is not the intent. The intent is to ensure that for any given
application block (say 16KB), the application block on media will
contain either 100% old data or 100% new data. Always.

If a storage device offers no such guarantee across a power failure,
then it is not suitable for use by applications which do not tolerate
torn writes. That is why the writes-are-atomic-unless-there's-a-problem
variant of the values reported in NVMe is of no interest.

--
Martin K. Petersen Oracle Linux Engineering

2023-10-04 21:01:18

by Bart Van Assche

Subject: Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits

On 10/3/23 20:00, Martin K. Petersen wrote:
>
> Bart,
>
>> also that there are no guarantees that the data written by an atomic
>> write will survive a power failure. See also the difference between
>> the NVMe parameters AWUN and AWUPF.
>
> We only care about *PF. The *N variants were cut from the same cloth as
> TRIM and UNMAP.

Hi Martin,

Has the following approach been considered? RWF_ATOMIC only guarantees
atomicity. Persistence is not guaranteed without fsync() / fdatasync().

I think this would be more friendly towards battery-powered devices
(smartphones). On these devices it can be safe to skip fsync() /
fdatasync() if the battery level is high enough.

Thanks,

Bart.



2023-10-05 14:49:13

by John Garry

Subject: Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits

On 04/10/2023 22:00, Bart Van Assche wrote:
>>
>> We only care about *PF. The *N variants were cut from the same cloth as
>> TRIM and UNMAP.
>
> Hi Martin,
>
> Has the following approach been considered? RWF_ATOMIC only guarantees
> atomicity. Persistence is not guaranteed without fsync() / fdatasync().

This is the approach taken. Please consult the proposed man pages, where
we say that persistence is not guaranteed without
O_SYNC/O_DSYNC/fsync()/fdatasync().

The only thing which RWF_ATOMIC guarantees is that the write will not be
torn.
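
Very roughly, the usage model from userspace would look something like
the sketch below (illustration only: RWF_ATOMIC is what this series
proposes, the flag value here is just a placeholder until the uapi
header lands, and error handling is trimmed):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Sketch only: RWF_ATOMIC is the flag proposed by this series; it is
 * not in released uapi headers, so this value is a placeholder.
 */
#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040
#endif

static int write_block_atomically(int fd, const void *buf, size_t len,
				  off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

	/*
	 * All-or-nothing: after a crash the block holds either the old
	 * or the new data, never a mix. len and off must respect the
	 * atomic_write_unit_{min,max} and alignment rules.
	 */
	if (pwritev2(fd, &iov, 1, off, RWF_ATOMIC) != (ssize_t)len)
		return -1;

	/*
	 * Atomicity does not imply persistence: flush explicitly, or
	 * open the file with O_SYNC/O_DSYNC instead.
	 */
	return fdatasync(fd);
}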

See 2.1.4.2.2 "Non-volatile requirements" in the NVMe spec, which implies
that the FUA bit or a flush command is required for persistence.

In 4.29.2 "Atomic write operations that do not complete" in SBC-4, we are
told that atomic writes may be pending in the device's volatile cache and
that none of the atomic write data will be written if a power failure
causes loss of data from the write.

>
> I think this would be more friendly towards battery-powered devices
> (smartphones). On these devices it can be safe to skip fsync() /
> fdatasync() if the battery level is high enough.

Thanks,
John