2022-09-23 17:59:31

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH v15 00/13] support zoned block devices with non-power-of-2 zone sizes

Hi Jens,
Please consider this patch series for the 6.1 release.

- Background and Motivation:

The zone storage implementation in Linux, introduced since v4.10, first
targetted SMR drives which have a power of 2 (po2) zone size alignment
requirement. The po2 zone size was further imposed implicitly by the
block layer's blk_queue_chunk_sectors(), used to prevent IO merging
across chunks beyond the specified size, since v3.16 through commit
762380ad9322 ("block: add notion of a chunk size for request merging").
But this same general block layer po2 requirement for blk_queue_chunk_sectors()
was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
to be non-power-of-2").

NAND, which is the media used in newer zoned storage devices, does not
naturally align to po2. In these devices, zone capacity(cap) is not the
same as the po2 zone size. When the zone cap != zone size, then unmapped
LBAs are introduced to cover the space between the zone cap and zone size.
po2 requirement does not make sense for these type of zone storage devices.
This patch series aims to remove these unmapped LBAs for zoned devices when
zone cap is npo2. This is done by relaxing the po2 zone size constraint
in the kernel and allowing zoned device with npo2 zone sizes if zone cap
== zone size.

Removing the po2 requirement from zone storage should be possible
now provided that no userspace regression and no performance regressions are
introduced. Stop-gap patches have been already merged into f2fs-tools to
proactively not allow npo2 zone sizes until proper support is added [1].

There were two efforts previously to add support to npo2 devices: 1) via
device level emulation [2] but that was rejected with a final conclusion
to add support for non po2 zoned device in the complete stack[3] 2)
adding support to the complete stack by removing the constraint in the
block layer and NVMe layer with support to btrfs, zonefs, etc which was
rejected with a conclusion to add a dm target for FS support [0]
to reduce the regression impact.

This series adds support to npo2 zoned devices in the block and nvme
layer and a new **dm target** is added: dm-po2zoned-target. This new
target will be initially used for filesystems such as btrfs and
f2fs until native npo2 zone support is added.

- Patchset description:
Patches 1-3 deals with removing the po2 constraint from the
block layer.

Patches 4-5 deals with removing the constraint from nvme zns.

Patch 5 removes the po2 contraint in null blk

Patch 6 adds npo2 support to zonefs

Patches 7-13 adds support for npo2 zoned devices in the DM layer and
adds a new target dm-po2zoned-target which converts a zoned device with
npo2 zone size into a zoned target with po2 zone size.

The patch series is based on linux-next tag: next-20220921

Testing:
The new target was tested with blktest and zonefs test suite in qemu and
on a real ZNS device with npo2 zone size.

Performance Measurement on a null blk:
Device:
zone size = 128M, blocksize=4k

FIO cmd:
fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd --size=23G
--io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
--loops=4

The following results are an average of 4 runs on AMD Ryzen 5 5600X with
32GB of RAM:

Sequential Write:
x-----------------x---------------------------------x---------------------------------x
| IOdepth | 8 | 16 |
x-----------------x---------------------------------x---------------------------------x
| | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch | 578 | 2257 | 12.80 | 576 | 2248 | 25.78 |
x-----------------x---------------------------------x---------------------------------x
| With patch | 581 | 2268 | 12.74 | 576 | 2248 | 25.85 |
x-----------------x---------------------------------x---------------------------------x

Sequential read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth | 8 | 16 |
x-----------------x---------------------------------x---------------------------------x
| | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch | 667 | 2605 | 11.79 | 675 | 2637 | 23.49 |
x-----------------x---------------------------------x---------------------------------x
| With patch | 667 | 2605 | 11.79 | 675 | 2638 | 23.48 |
x-----------------x---------------------------------x---------------------------------x

Random read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth | 8 | 16 |
x-----------------x---------------------------------x---------------------------------x
| | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch | 522 | 2038 | 15.05 | 514 | 2006 | 30.87 |
x-----------------x---------------------------------x---------------------------------x
| With patch | 522 | 2039 | 15.04 | 523 | 2042 | 30.33 |
x-----------------x---------------------------------x---------------------------------x

Minor variations are noticed in Sequential write with io depth 8 and
in random read with io depth 16. But overall no noticeable differences
were noticed

[0] https://lore.kernel.org/lkml/PH0PR04MB74166C87F694B150A5AE0F009BD09@PH0PR04MB7416.namprd04.prod.outlook.com/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/commit/?h=dev-test&id=6afcf6493578e77528abe65ab8b12f3e1c16749f
[2] https://lore.kernel.org/all/[email protected]/T/
[3] https://lore.kernel.org/all/20220315135245.eqf4tqngxxb7ymqa@unifi/

Changes since v1:
- Put the function declaration and its usage in the same commit (Bart)
- Remove bdev_zone_aligned function (Bart)
- Change the name from blk_queue_zone_aligned to blk_queue_is_zone_start
(Damien)
- q is never null in from bdev_get_queue (Damien)
- Add condition during bringup and check for zsze == zcap for npo2
drives (Damien)
- Rounddown operation should be made generic to work in 32 bits arch
(bart)
- Add comments where generic calculation is directly used instead having
special handling for po2 zone sizes (Hannes)
- Make the minimum zone size alignment requirement for btrfs to be 1M
instead of BTRFS_STRIPE_LEN(David)

Changes since v2:
- Minor formatting changes

Changes since v3:
- Make superblock mirror align with the existing superblock log offsets
(David)
- DM change return value and remove extra newline
- Optimize null blk zone index lookup with shift for po2 zone size

Changes since v4:
- Remove direct filesystems support for npo2 devices (Johannes, Hannes,
Damien)

Changes since v5:
- Use DIV_ROUND_UP* helper instead of round_up as it breaks 32bit arch
build in null blk(kernel-test-robot, Nathan)
- Use DIV_ROUND_UP_SECTOR_T also in blkdev_nr_zones function instead of
open coding it with div64_u64
- Added extra condition in dm-zoned and in dm to reject non power of 2
zone sizes.

Changes since v6:
- Added a new dm target for non power of 2 devices
- Added support for non power of 2 devices in the DM layer.

Changes since v7:
- Improved dm target for non power of 2 zoned devices with some bug
fixes and rearrangement
- Removed some unnecessary comments.

Changes since v8:
- Rename dm-po2z to dm-po2zone
- set max_io_len for the target to po2 zone size sector
- Simplify dm-po2zone target by removing some superfluous conditions
- Added documentation for the new dm-po2zone target
- Change pr_warn to pr_err for critical errors
- Split patch 2 and 11 with their corresponding prep patches
- Minor spelling and grammatical improvements

Changes since v9:
- Add a check for a zoned device in dm-po2zone ctr.
- Rephrased some commit messages and documentation for clarity

Changes since v10:
- Simplified dm_poz_map function (Damien)

Changes since v11:
- Rename bio_in_emulated_zone_area and some formatting adjustments
(Damien)

Changes since v12:
- Changed the name from dm-po2zone to dm-po2zoned to have a common
naming convention for zoned devices(Mike)
- Return directly from the dm_po2z_map function instead of having
returns from different functions (Mike)
- Change target type to target feature flag in commit header (Mike)
- Added dm_po2z_status function and NOWAIT flag to the target
- Added some extra information to the target's documentation.

Changes since v13:
- Use goto for cleanup in dm-po2zoned target (Mike)
- Added dtr to dm-po2zoned target
- Expose zone capacity instead of po2 zone size for
DMSTATUS_TYPE_INFO(Mike)

Changes since v14:
- Make sure to put device if ctr fails after dm_get_device(Mike)

Luis Chamberlain (1):
dm-zoned: ensure only power of 2 zone sizes are allowed

Pankaj Raghav (12):
block: make bdev_nr_zones and disk_zone_no generic for npo2 zone size
block: rearrange bdev_{is_zoned,zone_sectors,get_queue} helper in
blkdev.h
block: allow blk-zoned devices to have non-power-of-2 zone size
nvmet: Allow ZNS target to support non-power_of_2 zone sizes
nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
null_blk: allow zoned devices with non power-of-2 zone sizes
zonefs: allow non power of 2 zoned devices
dm-zone: use generic helpers to calculate offset from zone start
dm-table: allow zoned devices with non power-of-2 zone sizes
dm: call dm_zone_endio after the target endio callback for zoned
devices
dm: introduce DM_EMULATED_ZONES target feature flag
dm: add power-of-2 target for zoned devices with non power-of-2 zone
sizes

.../admin-guide/device-mapper/dm-po2zoned.rst | 79 +++++
.../admin-guide/device-mapper/index.rst | 1 +
block/blk-core.c | 2 +-
block/blk-zoned.c | 37 ++-
drivers/block/null_blk/main.c | 5 +-
drivers/block/null_blk/null_blk.h | 1 +
drivers/block/null_blk/zoned.c | 18 +-
drivers/md/Kconfig | 10 +
drivers/md/Makefile | 2 +
drivers/md/dm-po2zoned-target.c | 293 ++++++++++++++++++
drivers/md/dm-table.c | 20 +-
drivers/md/dm-zone.c | 8 +-
drivers/md/dm-zoned-target.c | 8 +
drivers/md/dm.c | 8 +-
drivers/nvme/host/zns.c | 14 +-
drivers/nvme/target/zns.c | 3 +-
fs/zonefs/super.c | 6 +-
fs/zonefs/zonefs.h | 1 -
include/linux/blkdev.h | 80 +++--
include/linux/device-mapper.h | 9 +
20 files changed, 530 insertions(+), 75 deletions(-)
create mode 100644 Documentation/admin-guide/device-mapper/dm-po2zoned.rst
create mode 100644 drivers/md/dm-po2zoned-target.c

--
2.25.1


2022-09-23 18:08:46

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH v15 01/13] block: make bdev_nr_zones and disk_zone_no generic for npo2 zone size

Adapt bdev_nr_zones and disk_zone_no functions so that they can
also work for non-power-of-2 zone sizes.

As the existing deployments assume that a device zone size is a power of
2 number of sectors, power-of-2 optimized calculation is used for those
devices.

There are no direct hot paths modified and the changes just
introduce one new branch per call.

Reviewed-by: Luis Chamberlain <[email protected]>
Reviewed-by: Adam Manzanares <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
block/blk-zoned.c | 13 +++++++++----
include/linux/blkdev.h | 8 +++++++-
2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index a264621d4905..dce9c95b4bcd 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -111,17 +111,22 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
* bdev_nr_zones - Get number of zones
* @bdev: Target device
*
- * Return the total number of zones of a zoned block device. For a block
- * device without zone capabilities, the number of zones is always 0.
+ * Return the total number of zones of a zoned block device, including the
+ * eventual small last zone if present. For a block device without zone
+ * capabilities, the number of zones is always 0.
*/
unsigned int bdev_nr_zones(struct block_device *bdev)
{
sector_t zone_sectors = bdev_zone_sectors(bdev);
+ sector_t capacity = bdev_nr_sectors(bdev);

if (!bdev_is_zoned(bdev))
return 0;
- return (bdev_nr_sectors(bdev) + zone_sectors - 1) >>
- ilog2(zone_sectors);
+
+ if (is_power_of_2(zone_sectors))
+ return (capacity + zone_sectors - 1) >> ilog2(zone_sectors);
+
+ return DIV_ROUND_UP_SECTOR_T(capacity, zone_sectors);
}
EXPORT_SYMBOL_GPL(bdev_nr_zones);

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8038c5fbde40..6c6bf4dd5709 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -674,9 +674,15 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)

static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
{
+ sector_t zone_sectors = disk->queue->limits.chunk_sectors;
+
if (!blk_queue_is_zoned(disk->queue))
return 0;
- return sector >> ilog2(disk->queue->limits.chunk_sectors);
+
+ if (is_power_of_2(zone_sectors))
+ return sector >> ilog2(zone_sectors);
+
+ return div64_u64(sector, zone_sectors);
}

static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
--
2.25.1

2022-09-23 18:15:14

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH v15 05/13] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size

Remove the condition which disallows non-power_of_2 zone size ZNS drive
to be updated and use generic method to calculate number of zones
instead of relying on log and shift based calculation on zone size.

The power_of_2 calculation has been replaced directly with generic
calculation without special handling. Both modified functions are not
used in hot paths, they are only used during initialization &
revalidation of the ZNS device.

As rounddown macro from math.h does not work for 32 bit architectures,
round down operation is open coded.

Reviewed-by: Luis Chamberlain <[email protected]>
Reviewed by: Adam Manzanares <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
drivers/nvme/host/zns.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 12316ab51bda..fe1d715d61cc 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
}

ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
- if (!is_power_of_2(ns->zsze)) {
- dev_warn(ns->ctrl->device,
- "invalid zone size:%llu for namespace:%u\n",
- ns->zsze, ns->head->ns_id);
- status = -ENODEV;
- goto free_data;
- }

disk_set_zoned(ns->disk, BLK_ZONED_HM);
blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
@@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
sizeof(struct nvme_zone_descriptor);

nr_zones = min_t(unsigned int, nr_zones,
- get_capacity(ns->disk) >> ilog2(ns->zsze));
+ div64_u64(get_capacity(ns->disk), ns->zsze));

bufsize = sizeof(struct nvme_zone_report) +
nr_zones * sizeof(struct nvme_zone_descriptor);
@@ -182,6 +175,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
int ret, zone_idx = 0;
unsigned int nz, i;
size_t buflen;
+ u64 remainder = 0;

if (ns->head->ids.csi != NVME_CSI_ZNS)
return -EINVAL;
@@ -197,7 +191,9 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;

- sector &= ~(ns->zsze - 1);
+ /* Round down the sector value to the nearest zone start */
+ div64_u64_rem(sector, ns->zsze, &remainder);
+ sector -= remainder;
while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
memset(report, 0, buflen);

--
2.25.1

2022-09-23 18:32:21

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH v15 09/13] dm-zone: use generic helpers to calculate offset from zone start

Use the bdev_offset_from_zone_start() helper function to calculate
the offset from zone start instead of using power of 2 based
calculation.

Signed-off-by: Pankaj Raghav <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
---
drivers/md/dm-zone.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 3dafc0e8b7a9..ac6fc1293d41 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -390,7 +390,8 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
case REQ_OP_WRITE_ZEROES:
case REQ_OP_WRITE:
/* Writes must be aligned to the zone write pointer */
- if ((clone->bi_iter.bi_sector & (zsectors - 1)) != zwp_offset)
+ if (bdev_offset_from_zone_start(md->disk->part0,
+ clone->bi_iter.bi_sector) != zwp_offset)
return false;
break;
case REQ_OP_ZONE_APPEND:
@@ -602,11 +603,8 @@ void dm_zone_endio(struct dm_io *io, struct bio *clone)
*/
if (clone->bi_status == BLK_STS_OK &&
bio_op(clone) == REQ_OP_ZONE_APPEND) {
- sector_t mask =
- (sector_t)bdev_zone_sectors(disk->part0) - 1;
-
orig_bio->bi_iter.bi_sector +=
- clone->bi_iter.bi_sector & mask;
+ bdev_offset_from_zone_start(disk->part0, clone->bi_iter.bi_sector);
}

return;
--
2.25.1

2022-09-23 18:38:14

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH v15 13/13] dm: add power-of-2 target for zoned devices with non power-of-2 zone sizes

Only zoned devices with power-of-2(po2) number of sectors per zone(zone
size) were supported in linux but now non power-of-2(npo2) zone sizes
support has been added to the block layer.

Filesystems such as F2FS and btrfs have support for zoned devices with
po2 zone size assumption. Before adding native support for npo2 zone
sizes, it was suggested to create a dm target for npo2 zone size device to
appear as a po2 zone size target so that file systems can initially
work without any explicit changes.

The design of this target is very simple: remap the device zone size to
the zone capacity and change the zone size to be the nearest power of 2
value.

For e.g., a device with a zone size/capacity of 3M will have an equivalent
target layout as follows:

Device layout :-
zone capacity = 3M
zone size = 3M

|--------------|-------------|
0 3M 6M

Target layout :-
zone capacity=3M
zone size = 4M

|--------------|---|--------------|---|
0 3M 4M 7M 8M

The area between target's zone capacity and zone size will be emulated
in the target.
The read IOs that fall in the emulated gap area will return 0 filled
bio and all the other IOs in that area will result in an error.
If a read IO span across the emulated area boundary, then the IOs are
split across them. All other IO operations that span across the emulated
area boundary will result in an error.

The target can be easily created as follows:
dmsetup create <label> --table '0 <size_sects> po2zoned /dev/nvme<id>'

The target does not support partial mapping of the underlying
device as there is no use-case for it.

Note:
This target is not related to dm-zoned target, which exposes a zoned block
device as a regular block device without any write constraint.

This target only exposes a different zone size than the underlying device.
The underlying device's other constraints will be directly exposed to the
target.

Signed-off-by: Pankaj Raghav <[email protected]>
Suggested-by: Johannes Thumshirn <[email protected]>
Suggested-by: Damien Le Moal <[email protected]>
Suggested-by: Hannes Reinecke <[email protected]>
---
.../admin-guide/device-mapper/dm-po2zoned.rst | 79 +++++
.../admin-guide/device-mapper/index.rst | 1 +
drivers/md/Kconfig | 10 +
drivers/md/Makefile | 2 +
drivers/md/dm-po2zoned-target.c | 293 ++++++++++++++++++
5 files changed, 385 insertions(+)
create mode 100644 Documentation/admin-guide/device-mapper/dm-po2zoned.rst
create mode 100644 drivers/md/dm-po2zoned-target.c

diff --git a/Documentation/admin-guide/device-mapper/dm-po2zoned.rst b/Documentation/admin-guide/device-mapper/dm-po2zoned.rst
new file mode 100644
index 000000000000..8a35eab0b714
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-po2zoned.rst
@@ -0,0 +1,79 @@
+===========
+dm-po2zoned
+===========
+The dm-po2zoned device mapper target exposes a zoned block device with a
+non-power-of-2(npo2) number of sectors per zone as a power-of-2(po2)
+number of sectors per zone(zone size).
+The filesystems that support zoned block devices such as F2FS and BTRFS
+assume po2 zone size as the kernel has traditionally only supported
+those devices. However, as the kernel now supports zoned block devices with
+npo2 zone sizes, the filesystems can run on top of the dm-po2zoned target before
+adding native support.
+
+Partial mapping of the underlying device is not supported by this target as
+there is no use-case for it.
+
+.. note::
+ This target is **not related** to **dm-zoned target**, which exposes a
+ zoned block device as a regular block device without any write constraint.
+
+ This target only exposes a different **zone size** than the underlying device.
+ The underlying device's other **constraints** will be exposed to the target.
+
+Algorithm
+=========
+The device mapper target maps the underlying device's zone size to the
+zone capacity and changes the zone size to the nearest po2 zone size.
+The gap between the zone capacity and the zone size is emulated in the target.
+E.g., a zoned block device with a zone size (and capacity) of 3M will have an
+equivalent target layout with mapping as follows:
+
+::
+
+ 0M 3M 4M 6M 8M
+ | | | | |
+ +x------------+--+x---------+--+x------- Target
+ |x | |x | |x
+ x x x
+ x x x
+ x x x
+ x x x
+ |x |x |x
+ +x------------+x------------+x---------- Device
+ | | |
+ 0M 3M 6M
+
+A simple remap is performed for all the BIOs that do not cross the
+emulation gap area, i.e., the area between the zone capacity and size.
+
+If a BIO lies in the emulation gap area, the following operations are performed:
+
+ Read:
+ - If the BIO lies entirely in the emulation gap area, then zero out the BIO and complete it.
+ - If the BIO spans the emulation gap area, split the BIO across the zone capacity boundary
+ and remap only the BIO within the zone capacity boundary. The other part of the split BIO
+ will be zeroed out.
+
+ Other operations:
+ - Return an error
+
+Table parameters
+================
+
+::
+
+ <dev path>
+
+Mandatory parameters:
+
+ <dev path>:
+ Full pathname to the underlying block-device, or a
+ "major:minor" device-number.
+
+Examples
+========
+
+::
+
+ #!/bin/sh
+ echo "0 `blockdev --getsz $1` po2zoned $1" | dmsetup create po2z
diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst
index cde52cc09645..5df93711cef5 100644
--- a/Documentation/admin-guide/device-mapper/index.rst
+++ b/Documentation/admin-guide/device-mapper/index.rst
@@ -23,6 +23,7 @@ Device Mapper
dm-service-time
dm-uevent
dm-zoned
+ dm-po2zoned
era
kcopyd
linear
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 998a5cfdbc4e..74fdfd02ab5f 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -518,6 +518,16 @@ config DM_FLAKEY
help
A target that intermittently fails I/O for debugging purposes.

+config DM_PO2ZONED
+ tristate "Zoned block devices target emulating a power-of-2 number of sectors per zone"
+ depends on BLK_DEV_DM
+ depends on BLK_DEV_ZONED
+ help
+ A target that converts a zoned block device with non-power-of-2(npo2)
+ number of sectors per zone to be power-of-2(po2). Use this target for
+ zoned block devices with npo2 number of sectors per zone until native
+ support is added to the filesystems and applications.
+
config DM_VERITY
tristate "Verity target support"
depends on BLK_DEV_DM
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 84291e38dca8..ee05722bc637 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -26,6 +26,7 @@ dm-era-y += dm-era-target.o
dm-clone-y += dm-clone-target.o dm-clone-metadata.o
dm-verity-y += dm-verity-target.o
dm-zoned-y += dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
+dm-po2zoned-y += dm-po2zoned-target.o

md-mod-y += md.o md-bitmap.o
raid456-y += raid5.o raid5-cache.o raid5-ppl.o
@@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
obj-$(CONFIG_DM_DELAY) += dm-delay.o
obj-$(CONFIG_DM_DUST) += dm-dust.o
obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o
+obj-$(CONFIG_DM_PO2ZONED) += dm-po2zoned.o
obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o
diff --git a/drivers/md/dm-po2zoned-target.c b/drivers/md/dm-po2zoned-target.c
new file mode 100644
index 000000000000..0f8f768a94cf
--- /dev/null
+++ b/drivers/md/dm-po2zoned-target.c
@@ -0,0 +1,293 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Samsung Electronics Co., Ltd.
+ */
+
+#include <linux/device-mapper.h>
+
+#define DM_MSG_PREFIX "po2zoned"
+
+struct dm_po2z_target {
+ struct dm_dev *dev;
+ sector_t zone_size; /* Actual zone size of the underlying dev*/
+ sector_t zone_size_po2; /* zone_size rounded to the nearest po2 value */
+ unsigned int zone_size_po2_shift;
+ sector_t zone_size_diff; /* diff between zone_size_po2 and zone_size */
+ unsigned int nr_zones;
+};
+
+static inline unsigned int npo2_zone_no(struct dm_po2z_target *dmh,
+ sector_t sect)
+{
+ return div64_u64(sect, dmh->zone_size);
+}
+
+static inline unsigned int po2_zone_no(struct dm_po2z_target *dmh,
+ sector_t sect)
+{
+ return sect >> dmh->zone_size_po2_shift;
+}
+
+static inline sector_t device_to_target_sect(struct dm_target *ti,
+ sector_t sect)
+{
+ struct dm_po2z_target *dmh = ti->private;
+
+ return sect + (npo2_zone_no(dmh, sect) * dmh->zone_size_diff) +
+ ti->begin;
+}
+
+/*
+ * This target works on the complete zoned device. Partial mapping is not
+ * supported.
+ * Construct a zoned po2 logical device: <dev-path>
+ */
+static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+ struct dm_po2z_target *dmh = NULL;
+ int ret;
+ sector_t zone_size;
+ sector_t dev_capacity;
+
+ if (argc != 1)
+ return -EINVAL;
+
+ dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
+ if (!dmh)
+ return -ENOMEM;
+
+ ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+ &dmh->dev);
+ if (ret) {
+ ti->error = "Device lookup failed";
+ goto free;
+ }
+
+ if (!bdev_is_zoned(dmh->dev->bdev)) {
+ DMERR("%pg is not a zoned device", dmh->dev->bdev);
+ ret = -EINVAL;
+ goto bad;
+ }
+
+ zone_size = bdev_zone_sectors(dmh->dev->bdev);
+ dev_capacity = get_capacity(dmh->dev->bdev->bd_disk);
+ if (ti->len != dev_capacity) {
+ DMERR("%pg Partial mapping of the target is not supported",
+ dmh->dev->bdev);
+ ret = -EINVAL;
+ goto bad;
+ }
+
+ if (is_power_of_2(zone_size))
+ DMWARN("%pg: underlying device has a power-of-2 number of sectors per zone",
+ dmh->dev->bdev);
+
+ dmh->zone_size = zone_size;
+ dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
+ dmh->zone_size_po2_shift = ilog2(dmh->zone_size_po2);
+ dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
+ ti->private = dmh;
+ ti->max_io_len = dmh->zone_size_po2;
+ dmh->nr_zones = npo2_zone_no(dmh, ti->len);
+ ti->len = dmh->zone_size_po2 * dmh->nr_zones;
+ return 0;
+
+bad:
+ dm_put_device(ti, dmh->dev);
+free:
+ kfree(dmh);
+ return ret;
+}
+
+static void dm_po2z_dtr(struct dm_target *ti)
+{
+ struct dm_po2z_target *dmh = ti->private;
+
+ dm_put_device(ti, dmh->dev);
+ kfree(dmh);
+}
+
+static int dm_po2z_report_zones_cb(struct blk_zone *zone, unsigned int idx,
+ void *data)
+{
+ struct dm_report_zones_args *args = data;
+ struct dm_target *ti = args->tgt;
+ struct dm_po2z_target *dmh = ti->private;
+
+ zone->start = device_to_target_sect(ti, zone->start);
+ zone->wp = device_to_target_sect(ti, zone->wp);
+ zone->len = dmh->zone_size_po2;
+ args->next_sector = zone->start + zone->len;
+
+ return args->orig_cb(zone, args->zone_idx++, args->orig_data);
+}
+
+static int dm_po2z_report_zones(struct dm_target *ti,
+ struct dm_report_zones_args *args,
+ unsigned int nr_zones)
+{
+ struct dm_po2z_target *dmh = ti->private;
+ sector_t sect =
+ po2_zone_no(dmh, dm_target_offset(ti, args->next_sector)) *
+ dmh->zone_size;
+
+ return blkdev_report_zones(dmh->dev->bdev, sect, nr_zones,
+ dm_po2z_report_zones_cb, args);
+}
+
+static int dm_po2z_end_io(struct dm_target *ti, struct bio *bio,
+ blk_status_t *error)
+{
+ if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
+ bio->bi_iter.bi_sector =
+ device_to_target_sect(ti, bio->bi_iter.bi_sector);
+
+ return DM_ENDIO_DONE;
+}
+
+static void dm_po2z_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+ struct dm_po2z_target *dmh = ti->private;
+
+ limits->chunk_sectors = dmh->zone_size_po2;
+}
+
+static void dm_po2z_status(struct dm_target *ti, status_type_t type,
+ unsigned int status_flags, char *result,
+ unsigned int maxlen)
+{
+ struct dm_po2z_target *dmh = ti->private;
+ size_t sz = 0;
+
+ switch (type) {
+ case STATUSTYPE_INFO:
+ DMEMIT("%s %lld", dmh->dev->name,
+ (unsigned long long)dmh->zone_size);
+ break;
+
+ case STATUSTYPE_TABLE:
+ DMEMIT("%s", dmh->dev->name);
+ break;
+
+ case STATUSTYPE_IMA:
+ result[0] = '\0';
+ break;
+ }
+}
+
+/**
+ * dm_po2z_bio_in_emulated_zone_area - check if bio is in the emulated zone area
+ * @dmh: target data
+ * @bio: bio
+ * @offset: bio offset to emulated zone boundary
+ *
+ * Check if a @bio is partly or completely in the emulated zone area. If the
+ * @bio is partly in the emulated zone area, @offset can be used to split
+ * the @bio across the emulated zone boundary. @offset
+ * will be negative if the @bio completely lies in the emulated area.
+ *
+ */
+static bool dm_po2z_bio_in_emulated_zone_area(struct dm_po2z_target *dmh,
+ struct bio *bio, int *offset)
+{
+ unsigned int zone_idx = po2_zone_no(dmh, bio->bi_iter.bi_sector);
+ sector_t nr_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+ sector_t sector_offset =
+ bio->bi_iter.bi_sector - (zone_idx << dmh->zone_size_po2_shift);
+
+ *offset = dmh->zone_size - sector_offset;
+
+ return sector_offset + nr_sectors > dmh->zone_size;
+}
+
+static inline void dm_po2z_read_zeroes(struct bio *bio)
+{
+ zero_fill_bio(bio);
+ bio_endio(bio);
+}
+
+static int dm_po2z_map(struct dm_target *ti, struct bio *bio)
+{
+ struct dm_po2z_target *dmh = ti->private;
+ int split_io_pos;
+
+ bio_set_dev(bio, dmh->dev->bdev);
+ bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
+
+ if (op_is_zone_mgmt(bio_op(bio)) || !bio_sectors(bio))
+ goto remap_sector;
+
+ if (!dm_po2z_bio_in_emulated_zone_area(dmh, bio, &split_io_pos))
+ goto remap_sector;
+
+ /*
+ * Read operation on the emulated zone area (between zone capacity
+ * and zone size) will fill the bio with zeroes. Any other operation
+ * in the emulated area should return an error.
+ */
+ if (bio_op(bio) == REQ_OP_READ) {
+ /*
+ * If the bio is across emulated zone boundary, split the bio at
+ * the boundary.
+ */
+ if (split_io_pos > 0) {
+ dm_accept_partial_bio(bio, split_io_pos);
+ goto remap_sector;
+ }
+
+ dm_po2z_read_zeroes(bio);
+ return DM_MAPIO_SUBMITTED;
+ }
+ /* Other IOs in emulated zone area should result in an error */
+ return DM_MAPIO_KILL;
+
+remap_sector:
+ /* convert from target sector to device sector */
+ bio->bi_iter.bi_sector -= (po2_zone_no(dmh, bio->bi_iter.bi_sector) *
+ dmh->zone_size_diff);
+ return DM_MAPIO_REMAPPED;
+}
+
+static int dm_po2z_iterate_devices(struct dm_target *ti,
+ iterate_devices_callout_fn fn, void *data)
+{
+ struct dm_po2z_target *dmh = ti->private;
+ sector_t len = dmh->nr_zones * dmh->zone_size;
+
+ return fn(ti, dmh->dev, 0, len, data);
+}
+
+static struct target_type dm_po2z_target = {
+ .name = "po2zoned",
+ .version = { 1, 0, 0 },
+ .features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONES |
+ DM_TARGET_NOWAIT,
+ .map = dm_po2z_map,
+ .end_io = dm_po2z_end_io,
+ .report_zones = dm_po2z_report_zones,
+ .iterate_devices = dm_po2z_iterate_devices,
+ .module = THIS_MODULE,
+ .io_hints = dm_po2z_io_hints,
+ .status = dm_po2z_status,
+ .ctr = dm_po2z_ctr,
+ .dtr = dm_po2z_dtr,
+};
+
+static int __init dm_po2z_init(void)
+{
+ return dm_register_target(&dm_po2z_target);
+}
+
+static void __exit dm_po2z_exit(void)
+{
+ dm_unregister_target(&dm_po2z_target);
+}
+
+/* Module hooks */
+module_init(dm_po2z_init);
+module_exit(dm_po2z_exit);
+
+MODULE_DESCRIPTION(DM_NAME "power-of-2 zoned target");
+MODULE_AUTHOR("Pankaj Raghav <[email protected]>");
+MODULE_LICENSE("GPL");
+
--
2.25.1

2022-09-23 18:40:19

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH v15 03/13] block: allow blk-zoned devices to have non-power-of-2 zone size

Checking if a given sector is aligned to a zone is a common
operation that is performed for zoned devices. Add
bdev_is_zone_start helper to check for this instead of opencoding it
everywhere.

Convert the calculations on zone size to be generic instead of relying on
power-of-2(po2) based arithmetic in the block layer using the helpers
wherever possible.

The only hot path affected by this change for zoned devices with po2
zone size is in blk_check_zone_append() but bdev_is_zone_start() helper is
used to optimize the calculation for po2 zone sizes.

Finally, allow zoned devices with non po2 zone sizes provided that their
zone capacity and zone size are equal. The main motivation to allow zoned
devices with non po2 zone size is to remove the unmapped LBA between
zone capcity and zone size for devices that cannot have a po2 zone
capacity.

Reviewed-by: Luis Chamberlain <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
block/blk-core.c | 2 +-
block/blk-zoned.c | 24 ++++++++++++++++++------
include/linux/blkdev.h | 30 ++++++++++++++++++++++++++++++
3 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 203be672da52..ed5afdb39751 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -559,7 +559,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
return BLK_STS_NOTSUPP;

/* The bio sector must point to the start of a sequential zone */
- if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) - 1) ||
+ if (!bdev_is_zone_start(bio->bi_bdev, bio->bi_iter.bi_sector) ||
!bio_zone_is_seq(bio))
return BLK_STS_IOERR;

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index dce9c95b4bcd..6806c69c81dc 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -285,10 +285,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
return -EINVAL;

/* Check alignment (handle eventual smaller last zone) */
- if (sector & (zone_sectors - 1))
+ if (!bdev_is_zone_start(bdev, sector))
return -EINVAL;

- if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
+ if (!bdev_is_zone_start(bdev, nr_sectors) && end_sector != capacity)
return -EINVAL;

/*
@@ -486,14 +486,26 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
* smaller last zone.
*/
if (zone->start == 0) {
- if (zone->len == 0 || !is_power_of_2(zone->len)) {
- pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
- disk->disk_name, zone->len);
+ if (zone->len == 0) {
+ pr_warn("%s: Invalid zero zone size", disk->disk_name);
+ return -ENODEV;
+ }
+
+ /*
+ * Non power-of-2 zone size support was added to remove the
+ * gap between zone capacity and zone size. Though it is technically
+ * possible to have gaps in a non power-of-2 device, Linux requires
+ * the zone size to be equal to zone capacity for non power-of-2
+ * zoned devices.
+ */
+ if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
+ pr_err("%s: Invalid zone capacity %lld with non power-of-2 zone size %lld",
+ disk->disk_name, zone->capacity, zone->len);
return -ENODEV;
}

args->zone_sectors = zone->len;
- args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
+ args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
} else if (zone->start + args->zone_sectors < capacity) {
if (zone->len != args->zone_sectors) {
pr_warn("%s: Invalid zoned device with non constant zone size\n",
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6cf43f9384cc..e29799076298 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -704,6 +704,30 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
return div64_u64(sector, zone_sectors);
}

+static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
+ sector_t sec)
+{
+ sector_t zone_sectors = bdev_zone_sectors(bdev);
+ u64 remainder = 0;
+
+ if (!bdev_is_zoned(bdev))
+ return 0;
+
+ if (is_power_of_2(zone_sectors))
+ return sec & (zone_sectors - 1);
+
+ div64_u64_rem(sec, zone_sectors, &remainder);
+ return remainder;
+}
+
+static inline bool bdev_is_zone_start(struct block_device *bdev, sector_t sec)
+{
+ if (!bdev_is_zoned(bdev))
+ return false;
+
+ return bdev_offset_from_zone_start(bdev, sec) == 0;
+}
+
static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
{
if (!blk_queue_is_zoned(disk->queue))
@@ -748,6 +772,12 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
{
return 0;
}
+
+static inline bool bdev_is_zone_start(struct block_device *bdev, sector_t sec)
+{
+ return false;
+}
+
static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
{
return 0;
--
2.25.1

2022-09-23 18:40:20

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH v15 02/13] block: rearrange bdev_{is_zoned,zone_sectors,get_queue} helper in blkdev.h

Define bdev_is_zoned(), bdev_zone_sectors() and bdev_get_queue() earlier
in the blkdev.h include file. Simplify bdev_is_zoned() by removing the
superfluous NULL check for request queue while we are at it.

This commit has no functional change, and it is a prep patch for allowing
zoned devices with non-power-of-2 zone sizes in the block layer.

Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
include/linux/blkdev.h | 43 +++++++++++++++++++-----------------------
1 file changed, 19 insertions(+), 24 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6c6bf4dd5709..6cf43f9384cc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -635,6 +635,11 @@ static inline bool queue_is_mq(struct request_queue *q)
return q->mq_ops;
}

+static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
+{
+ return bdev->bd_queue; /* this is never NULL */
+}
+
#ifdef CONFIG_PM
static inline enum rpm_status queue_rpm_status(struct request_queue *q)
{
@@ -666,6 +671,20 @@ static inline bool blk_queue_is_zoned(struct request_queue *q)
}
}

+static inline bool bdev_is_zoned(struct block_device *bdev)
+{
+ return blk_queue_is_zoned(bdev_get_queue(bdev));
+}
+
+static inline sector_t bdev_zone_sectors(struct block_device *bdev)
+{
+ struct request_queue *q = bdev_get_queue(bdev);
+
+ if (!blk_queue_is_zoned(q))
+ return 0;
+ return q->limits.chunk_sectors;
+}
+
#ifdef CONFIG_BLK_DEV_ZONED
static inline unsigned int disk_nr_zones(struct gendisk *disk)
{
@@ -892,11 +911,6 @@ int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags);
int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
unsigned int flags);

-static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
-{
- return bdev->bd_queue; /* this is never NULL */
-}
-
/* Helper to convert BLK_ZONE_ZONE_XXX to its string format XXX */
const char *blk_zone_cond_str(enum blk_zone_cond zone_cond);

@@ -1296,25 +1310,6 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
return BLK_ZONED_NONE;
}

-static inline bool bdev_is_zoned(struct block_device *bdev)
-{
- struct request_queue *q = bdev_get_queue(bdev);
-
- if (q)
- return blk_queue_is_zoned(q);
-
- return false;
-}
-
-static inline sector_t bdev_zone_sectors(struct block_device *bdev)
-{
- struct request_queue *q = bdev_get_queue(bdev);
-
- if (!blk_queue_is_zoned(q))
- return 0;
- return q->limits.chunk_sectors;
-}
-
static inline int queue_dma_alignment(const struct request_queue *q)
{
return q ? q->dma_alignment : 511;
--
2.25.1

2022-09-23 18:40:56

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH v15 07/13] zonefs: allow non power of 2 zoned devices

The zone size shift variable is useful only if the zone sizes are known
to be power of 2. Remove that variable and use generic helpers from
block layer to calculate zone index in zonefs.

Acked-by: Damien Le Moal <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/zonefs/super.c | 6 ++----
fs/zonefs/zonefs.h | 1 -
2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 860f0b1032c6..e549ef16738c 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -476,10 +476,9 @@ static void __zonefs_io_error(struct inode *inode, bool write)
{
struct zonefs_inode_info *zi = ZONEFS_I(inode);
struct super_block *sb = inode->i_sb;
- struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
unsigned int noio_flag;
unsigned int nr_zones =
- zi->i_zone_size >> (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
+ bdev_zone_no(sb->s_bdev, zi->i_zone_size >> SECTOR_SHIFT);
struct zonefs_ioerr_data err = {
.inode = inode,
.write = write,
@@ -1401,7 +1400,7 @@ static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
struct zonefs_inode_info *zi = ZONEFS_I(inode);
int ret = 0;

- inode->i_ino = zone->start >> sbi->s_zone_sectors_shift;
+ inode->i_ino = bdev_zone_no(sb->s_bdev, zone->start);
inode->i_mode = S_IFREG | sbi->s_perm;

zi->i_ztype = type;
@@ -1776,7 +1775,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
* interface constraints.
*/
sb_set_blocksize(sb, bdev_zone_write_granularity(sb->s_bdev));
- sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
sbi->s_uid = GLOBAL_ROOT_UID;
sbi->s_gid = GLOBAL_ROOT_GID;
sbi->s_perm = 0640;
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 4b3de66c3233..39895195cda6 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -177,7 +177,6 @@ struct zonefs_sb_info {
kgid_t s_gid;
umode_t s_perm;
uuid_t s_uuid;
- unsigned int s_zone_sectors_shift;

unsigned int s_nr_files[ZONEFS_ZTYPE_MAX];

--
2.25.1

2022-09-23 18:43:07

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH v15 12/13] dm: introduce DM_EMULATED_ZONES target feature flag

Introduce a new target feature flag: DM_EMULATED_ZONES for targets with
a different number of sectors per zone (aka zone size) than the underlying
device zone size.

This target feature flag is introduced as the existing zoned targets assume
that the target and the underlying device have the same zone size.
The new target: dm-po2zoned will use this new target feature flag
as it emulates the zone boundary that is different from the underlying
zoned device.

Signed-off-by: Pankaj Raghav <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
---
drivers/md/dm-table.c | 13 ++++++++++---
include/linux/device-mapper.h | 9 +++++++++
2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 38b83c383e8f..8324bd660f09 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1619,13 +1619,20 @@ static bool dm_table_supports_zoned_model(struct dm_table *t,
return true;
}

-static int device_not_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
+/*
+ * Callback function to check for device zone sector across devices. If the
+ * DM_TARGET_EMULATED_ZONES target feature flag is not set, then the target
+ * should have the same zone sector as the underlying devices.
+ */
+static int check_valid_device_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
unsigned int *zone_sectors = data;

- if (!bdev_is_zoned(dev->bdev))
+ if (!bdev_is_zoned(dev->bdev) ||
+ dm_target_supports_emulated_zones(ti->type))
return 0;
+
return bdev_zone_sectors(dev->bdev) != *zone_sectors;
}

@@ -1650,7 +1657,7 @@ static int validate_hardware_zoned_model(struct dm_table *t,
if (!zone_sectors)
return -EINVAL;

- if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {
+ if (dm_table_any_dev_attr(t, check_valid_device_zone_sectors, &zone_sectors)) {
DMERR("%s: zone sectors is not consistent across all zoned devices",
dm_device_name(t->md));
return -EINVAL;
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index f8e3a96b97b0..8504da98e829 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -294,6 +294,15 @@ struct target_type {
#define dm_target_supports_mixed_zoned_model(type) (false)
#endif

+#ifdef CONFIG_BLK_DEV_ZONED
+#define DM_TARGET_EMULATED_ZONES 0x00000400
+#define dm_target_supports_emulated_zones(type) \
+ ((type)->features & DM_TARGET_EMULATED_ZONES)
+#else
+#define DM_TARGET_EMULATED_ZONES 0x00000000
+#define dm_target_supports_emulated_zones(type) (false)
+#endif
+
struct dm_target {
struct dm_table *table;
struct target_type *type;
--
2.25.1

2022-09-23 18:43:37

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH v15 08/13] dm-zoned: ensure only power of 2 zone sizes are allowed

From: Luis Chamberlain <[email protected]>

dm-zoned relies on the assumption that the zone size is a
power-of-2(po2) and the zone capacity is same as the zone size.

Ensure only po2 devices can be used as dm-zoned target until a native
support for zoned devices with non-po2 zone size is added.

Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
drivers/md/dm-zoned-target.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
index 95b132b52f33..9325bf5dee81 100644
--- a/drivers/md/dm-zoned-target.c
+++ b/drivers/md/dm-zoned-target.c
@@ -792,6 +792,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
return -EINVAL;
}
zone_nr_sectors = bdev_zone_sectors(bdev);
+ if (!is_power_of_2(zone_nr_sectors)) {
+ ti->error = "Zone size is not a power-of-2 number of sectors";
+ return -EINVAL;
+ }
zoned_dev->zone_nr_sectors = zone_nr_sectors;
zoned_dev->nr_zones = bdev_nr_zones(bdev);
}
@@ -804,6 +808,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
return -EINVAL;
}
zoned_dev->zone_nr_sectors = bdev_zone_sectors(bdev);
+ if (!is_power_of_2(zoned_dev->zone_nr_sectors)) {
+ ti->error = "Zone size is not a power-of-2 number of sectors";
+ return -EINVAL;
+ }
zoned_dev->nr_zones = bdev_nr_zones(bdev);
}

--
2.25.1

2022-09-23 18:56:04

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH v15 11/13] dm: call dm_zone_endio after the target endio callback for zoned devices

dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
uses either native append or append emulation, and it is called before the
endio of the target. But target endio can still update the clone bio
after dm_zone_endio is called, thereby, the orig bio does not contain
the updated information anymore.

Currently, this is not a problem as the targets that support zoned devices
such as dm-zoned, dm-linear, and dm-crypt do not have an endio function,
and even if they do (such as dm-flakey), they don't modify the
bio->bi_iter.bi_sector of the cloned bio that is used to update the
orig_bio's bi_sector in dm_zone_endio function.

This is a prep patch for the new dm-po2zoned target as it modifies
bi_sector in the endio callback.

Call dm_zone_endio for zoned devices after calling the target's endio
function.

Reviewed-by: Mike Snitzer <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
drivers/md/dm.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7c35dea88ed1..874e1dc9fc26 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1122,10 +1122,6 @@ static void clone_endio(struct bio *bio)
disable_write_zeroes(md);
}

- if (static_branch_unlikely(&zoned_enabled) &&
- unlikely(bdev_is_zoned(bio->bi_bdev)))
- dm_zone_endio(io, bio);
-
if (endio) {
int r = endio(ti, bio, &error);
switch (r) {
@@ -1154,6 +1150,10 @@ static void clone_endio(struct bio *bio)
}
}

+ if (static_branch_unlikely(&zoned_enabled) &&
+ unlikely(bdev_is_zoned(bio->bi_bdev)))
+ dm_zone_endio(io, bio);
+
if (static_branch_unlikely(&swap_bios_enabled) &&
unlikely(swap_bios_limit(ti, bio)))
up(&md->swap_bios_semaphore);
--
2.25.1

2022-09-27 17:23:08

by Pankaj Raghav

[permalink] [raw]
Subject: Re: [PATCH v15 13/13] dm: add power-of-2 target for zoned devices with non power-of-2 zone sizes

Gentle ping Mike.

On 2022-09-23 19:36, Pankaj Raghav wrote:
> Only zoned devices with power-of-2(po2) number of sectors per zone(zone
> size) were supported in linux but now non power-of-2(npo2) zone sizes
> support has been added to the block layer.
>
> Filesystems such as F2FS and btrfs have support for zoned devices with
> po2 zone size assumption. Before adding native support for npo2 zone
> sizes, it was suggested to create a dm target for npo2 zone size device to
> appear as a po2 zone size target so that file systems can initially
> work without any explicit changes.
>
> The design of this target is very simple: remap the device zone size to
> the zone capacity and change the zone size to be the nearest power of 2
> value.
>
> For e.g., a device with a zone size/capacity of 3M will have an equivalent
> target layout as follows:
>
> Device layout :-
> zone capacity = 3M
> zone size = 3M
>
> |--------------|-------------|
> 0 3M 6M
>
> Target layout :-
> zone capacity=3M
> zone size = 4M
>
> |--------------|---|--------------|---|
> 0 3M 4M 7M 8M
>
> The area between target's zone capacity and zone size will be emulated
> in the target.
> The read IOs that fall in the emulated gap area will return 0 filled
> bio and all the other IOs in that area will result in an error.
> If a read IO span across the emulated area boundary, then the IOs are
> split across them. All other IO operations that span across the emulated
> area boundary will result in an error.
>
> The target can be easily created as follows:
> dmsetup create <label> --table '0 <size_sects> po2zoned /dev/nvme<id>'
>
> The target does not support partial mapping of the underlying
> device as there is no use-case for it.
>
> Note:
> This target is not related to dm-zoned target, which exposes a zoned block
> device as a regular block device without any write constraint.
>
> This target only exposes a different zone size than the underlying device.
> The underlying device's other constraints will be directly exposed to the
> target.
>
> Signed-off-by: Pankaj Raghav <[email protected]>
> Suggested-by: Johannes Thumshirn <[email protected]>
> Suggested-by: Damien Le Moal <[email protected]>
> Suggested-by: Hannes Reinecke <[email protected]>
> ---
> .../admin-guide/device-mapper/dm-po2zoned.rst | 79 +++++
> .../admin-guide/device-mapper/index.rst | 1 +
> drivers/md/Kconfig | 10 +
> drivers/md/Makefile | 2 +
> drivers/md/dm-po2zoned-target.c | 293 ++++++++++++++++++
> 5 files changed, 385 insertions(+)
> create mode 100644 Documentation/admin-guide/device-mapper/dm-po2zoned.rst
> create mode 100644 drivers/md/dm-po2zoned-target.c
>
> diff --git a/Documentation/admin-guide/device-mapper/dm-po2zoned.rst b/Documentation/admin-guide/device-mapper/dm-po2zoned.rst
> new file mode 100644
> index 000000000000..8a35eab0b714
> --- /dev/null
> +++ b/Documentation/admin-guide/device-mapper/dm-po2zoned.rst
> @@ -0,0 +1,79 @@
> +===========
> +dm-po2zoned
> +===========
> +The dm-po2zoned device mapper target exposes a zoned block device with a
> +non-power-of-2(npo2) number of sectors per zone as a power-of-2(po2)
> +number of sectors per zone(zone size).
> +The filesystems that support zoned block devices such as F2FS and BTRFS
> +assume po2 zone size as the kernel has traditionally only supported
> +those devices. However, as the kernel now supports zoned block devices with
> +npo2 zone sizes, the filesystems can run on top of the dm-po2zoned target before
> +adding native support.
> +
> +Partial mapping of the underlying device is not supported by this target as
> +there is no use-case for it.
> +
> +.. note::
> + This target is **not related** to **dm-zoned target**, which exposes a
> + zoned block device as a regular block device without any write constraint.
> +
> + This target only exposes a different **zone size** than the underlying device.
> + The underlying device's other **constraints** will be exposed to the target.
> +
> +Algorithm
> +=========
> +The device mapper target maps the underlying device's zone size to the
> +zone capacity and changes the zone size to the nearest po2 zone size.
> +The gap between the zone capacity and the zone size is emulated in the target.
> +E.g., a zoned block device with a zone size (and capacity) of 3M will have an
> +equivalent target layout with mapping as follows:
> +
> +::
> +
> + 0M 3M 4M 6M 8M
> + | | | | |
> + +x------------+--+x---------+--+x------- Target
> + |x | |x | |x
> + x x x
> + x x x
> + x x x
> + x x x
> + |x |x |x
> + +x------------+x------------+x---------- Device
> + | | |
> + 0M 3M 6M
> +
> +A simple remap is performed for all the BIOs that do not cross the
> +emulation gap area, i.e., the area between the zone capacity and size.
> +
> +If a BIO lies in the emulation gap area, the following operations are performed:
> +
> + Read:
> + - If the BIO lies entirely in the emulation gap area, then zero out the BIO and complete it.
> + - If the BIO spans the emulation gap area, split the BIO across the zone capacity boundary
> + and remap only the BIO within the zone capacity boundary. The other part of the split BIO
> + will be zeroed out.
> +
> + Other operations:
> + - Return an error
> +
> +Table parameters
> +================
> +
> +::
> +
> + <dev path>
> +
> +Mandatory parameters:
> +
> + <dev path>:
> + Full pathname to the underlying block-device, or a
> + "major:minor" device-number.
> +
> +Examples
> +========
> +
> +::
> +
> + #!/bin/sh
> + echo "0 `blockdev --getsz $1` po2zoned $1" | dmsetup create po2z
> diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst
> index cde52cc09645..5df93711cef5 100644
> --- a/Documentation/admin-guide/device-mapper/index.rst
> +++ b/Documentation/admin-guide/device-mapper/index.rst
> @@ -23,6 +23,7 @@ Device Mapper
> dm-service-time
> dm-uevent
> dm-zoned
> + dm-po2zoned
> era
> kcopyd
> linear
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index 998a5cfdbc4e..74fdfd02ab5f 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -518,6 +518,16 @@ config DM_FLAKEY
> help
> A target that intermittently fails I/O for debugging purposes.
>
> +config DM_PO2ZONED
> + tristate "Zoned block devices target emulating a power-of-2 number of sectors per zone"
> + depends on BLK_DEV_DM
> + depends on BLK_DEV_ZONED
> + help
> + A target that converts a zoned block device with non-power-of-2(npo2)
> + number of sectors per zone to be power-of-2(po2). Use this target for
> + zoned block devices with npo2 number of sectors per zone until native
> + support is added to the filesystems and applications.
> +
> config DM_VERITY
> tristate "Verity target support"
> depends on BLK_DEV_DM
> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> index 84291e38dca8..ee05722bc637 100644
> --- a/drivers/md/Makefile
> +++ b/drivers/md/Makefile
> @@ -26,6 +26,7 @@ dm-era-y += dm-era-target.o
> dm-clone-y += dm-clone-target.o dm-clone-metadata.o
> dm-verity-y += dm-verity-target.o
> dm-zoned-y += dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
> +dm-po2zoned-y += dm-po2zoned-target.o
>
> md-mod-y += md.o md-bitmap.o
> raid456-y += raid5.o raid5-cache.o raid5-ppl.o
> @@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
> obj-$(CONFIG_DM_DELAY) += dm-delay.o
> obj-$(CONFIG_DM_DUST) += dm-dust.o
> obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o
> +obj-$(CONFIG_DM_PO2ZONED) += dm-po2zoned.o
> obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
> obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
> obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o
> diff --git a/drivers/md/dm-po2zoned-target.c b/drivers/md/dm-po2zoned-target.c
> new file mode 100644
> index 000000000000..0f8f768a94cf
> --- /dev/null
> +++ b/drivers/md/dm-po2zoned-target.c
> @@ -0,0 +1,293 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022 Samsung Electronics Co., Ltd.
> + */
> +
> +#include <linux/device-mapper.h>
> +
> +#define DM_MSG_PREFIX "po2zoned"
> +
> +struct dm_po2z_target {
> + struct dm_dev *dev;
> + sector_t zone_size; /* Actual zone size of the underlying dev*/
> + sector_t zone_size_po2; /* zone_size rounded to the nearest po2 value */
> + unsigned int zone_size_po2_shift;
> + sector_t zone_size_diff; /* diff between zone_size_po2 and zone_size */
> + unsigned int nr_zones;
> +};
> +
> +static inline unsigned int npo2_zone_no(struct dm_po2z_target *dmh,
> + sector_t sect)
> +{
> + return div64_u64(sect, dmh->zone_size);
> +}
> +
> +static inline unsigned int po2_zone_no(struct dm_po2z_target *dmh,
> + sector_t sect)
> +{
> + return sect >> dmh->zone_size_po2_shift;
> +}
> +
> +static inline sector_t device_to_target_sect(struct dm_target *ti,
> + sector_t sect)
> +{
> + struct dm_po2z_target *dmh = ti->private;
> +
> + return sect + (npo2_zone_no(dmh, sect) * dmh->zone_size_diff) +
> + ti->begin;
> +}
> +
> +/*
> + * This target works on the complete zoned device. Partial mapping is not
> + * supported.
> + * Construct a zoned po2 logical device: <dev-path>
> + */
> +static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> +{
> + struct dm_po2z_target *dmh = NULL;
> + int ret;
> + sector_t zone_size;
> + sector_t dev_capacity;
> +
> + if (argc != 1)
> + return -EINVAL;
> +
> + dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
> + if (!dmh)
> + return -ENOMEM;
> +
> + ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
> + &dmh->dev);
> + if (ret) {
> + ti->error = "Device lookup failed";
> + goto free;
> + }
> +
> + if (!bdev_is_zoned(dmh->dev->bdev)) {
> + DMERR("%pg is not a zoned device", dmh->dev->bdev);
> + ret = -EINVAL;
> + goto bad;
> + }
> +
> + zone_size = bdev_zone_sectors(dmh->dev->bdev);
> + dev_capacity = get_capacity(dmh->dev->bdev->bd_disk);
> + if (ti->len != dev_capacity) {
> + DMERR("%pg Partial mapping of the target is not supported",
> + dmh->dev->bdev);
> + ret = -EINVAL;
> + goto bad;
> + }
> +
> + if (is_power_of_2(zone_size))
> + DMWARN("%pg: underlying device has a power-of-2 number of sectors per zone",
> + dmh->dev->bdev);
> +
> + dmh->zone_size = zone_size;
> + dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
> + dmh->zone_size_po2_shift = ilog2(dmh->zone_size_po2);
> + dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
> + ti->private = dmh;
> + ti->max_io_len = dmh->zone_size_po2;
> + dmh->nr_zones = npo2_zone_no(dmh, ti->len);
> + ti->len = dmh->zone_size_po2 * dmh->nr_zones;
> + return 0;
> +
> +bad:
> + dm_put_device(ti, dmh->dev);
> +free:
> + kfree(dmh);
> + return ret;
> +}
> +
> +static void dm_po2z_dtr(struct dm_target *ti)
> +{
> + struct dm_po2z_target *dmh = ti->private;
> +
> + dm_put_device(ti, dmh->dev);
> + kfree(dmh);
> +}
> +
> +static int dm_po2z_report_zones_cb(struct blk_zone *zone, unsigned int idx,
> + void *data)
> +{
> + struct dm_report_zones_args *args = data;
> + struct dm_target *ti = args->tgt;
> + struct dm_po2z_target *dmh = ti->private;
> +
> + zone->start = device_to_target_sect(ti, zone->start);
> + zone->wp = device_to_target_sect(ti, zone->wp);
> + zone->len = dmh->zone_size_po2;
> + args->next_sector = zone->start + zone->len;
> +
> + return args->orig_cb(zone, args->zone_idx++, args->orig_data);
> +}
> +
> +static int dm_po2z_report_zones(struct dm_target *ti,
> + struct dm_report_zones_args *args,
> + unsigned int nr_zones)
> +{
> + struct dm_po2z_target *dmh = ti->private;
> + sector_t sect =
> + po2_zone_no(dmh, dm_target_offset(ti, args->next_sector)) *
> + dmh->zone_size;
> +
> + return blkdev_report_zones(dmh->dev->bdev, sect, nr_zones,
> + dm_po2z_report_zones_cb, args);
> +}
> +
> +static int dm_po2z_end_io(struct dm_target *ti, struct bio *bio,
> + blk_status_t *error)
> +{
> + if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
> + bio->bi_iter.bi_sector =
> + device_to_target_sect(ti, bio->bi_iter.bi_sector);
> +
> + return DM_ENDIO_DONE;
> +}
> +
> +static void dm_po2z_io_hints(struct dm_target *ti, struct queue_limits *limits)
> +{
> + struct dm_po2z_target *dmh = ti->private;
> +
> + limits->chunk_sectors = dmh->zone_size_po2;
> +}
> +
> +static void dm_po2z_status(struct dm_target *ti, status_type_t type,
> + unsigned int status_flags, char *result,
> + unsigned int maxlen)
> +{
> + struct dm_po2z_target *dmh = ti->private;
> + size_t sz = 0;
> +
> + switch (type) {
> + case STATUSTYPE_INFO:
> + DMEMIT("%s %lld", dmh->dev->name,
> + (unsigned long long)dmh->zone_size);
> + break;
> +
> + case STATUSTYPE_TABLE:
> + DMEMIT("%s", dmh->dev->name);
> + break;
> +
> + case STATUSTYPE_IMA:
> + result[0] = '\0';
> + break;
> + }
> +}
> +
> +/**
> + * dm_po2z_bio_in_emulated_zone_area - check if bio is in the emulated zone area
> + * @dmh: target data
> + * @bio: bio
> + * @offset: bio offset to emulated zone boundary
> + *
> + * Check if a @bio is partly or completely in the emulated zone area. If the
> + * @bio is partly in the emulated zone area, @offset can be used to split
> + * the @bio across the emulated zone boundary. @offset
> + * will be negative if the @bio completely lies in the emulated area.
> + *
> + */
> +static bool dm_po2z_bio_in_emulated_zone_area(struct dm_po2z_target *dmh,
> + struct bio *bio, int *offset)
> +{
> + unsigned int zone_idx = po2_zone_no(dmh, bio->bi_iter.bi_sector);
> + sector_t nr_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
> + sector_t sector_offset =
> + bio->bi_iter.bi_sector - (zone_idx << dmh->zone_size_po2_shift);
> +
> + *offset = dmh->zone_size - sector_offset;
> +
> + return sector_offset + nr_sectors > dmh->zone_size;
> +}
> +
> +static inline void dm_po2z_read_zeroes(struct bio *bio)
> +{
> + zero_fill_bio(bio);
> + bio_endio(bio);
> +}
> +
> +static int dm_po2z_map(struct dm_target *ti, struct bio *bio)
> +{
> + struct dm_po2z_target *dmh = ti->private;
> + int split_io_pos;
> +
> + bio_set_dev(bio, dmh->dev->bdev);
> + bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
> +
> + if (op_is_zone_mgmt(bio_op(bio)) || !bio_sectors(bio))
> + goto remap_sector;
> +
> + if (!dm_po2z_bio_in_emulated_zone_area(dmh, bio, &split_io_pos))
> + goto remap_sector;
> +
> + /*
> + * Read operation on the emulated zone area (between zone capacity
> + * and zone size) will fill the bio with zeroes. Any other operation
> + * in the emulated area should return an error.
> + */
> + if (bio_op(bio) == REQ_OP_READ) {
> + /*
> + * If the bio is across emulated zone boundary, split the bio at
> + * the boundary.
> + */
> + if (split_io_pos > 0) {
> + dm_accept_partial_bio(bio, split_io_pos);
> + goto remap_sector;
> + }
> +
> + dm_po2z_read_zeroes(bio);
> + return DM_MAPIO_SUBMITTED;
> + }
> + /* Other IOs in emulated zone area should result in an error */
> + return DM_MAPIO_KILL;
> +
> +remap_sector:
> + /* convert from target sector to device sector */
> + bio->bi_iter.bi_sector -= (po2_zone_no(dmh, bio->bi_iter.bi_sector) *
> + dmh->zone_size_diff);
> + return DM_MAPIO_REMAPPED;
> +}
> +
> +static int dm_po2z_iterate_devices(struct dm_target *ti,
> + iterate_devices_callout_fn fn, void *data)
> +{
> + struct dm_po2z_target *dmh = ti->private;
> + sector_t len = dmh->nr_zones * dmh->zone_size;
> +
> + return fn(ti, dmh->dev, 0, len, data);
> +}
> +
> +static struct target_type dm_po2z_target = {
> + .name = "po2zoned",
> + .version = { 1, 0, 0 },
> + .features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONES |
> + DM_TARGET_NOWAIT,
> + .map = dm_po2z_map,
> + .end_io = dm_po2z_end_io,
> + .report_zones = dm_po2z_report_zones,
> + .iterate_devices = dm_po2z_iterate_devices,
> + .module = THIS_MODULE,
> + .io_hints = dm_po2z_io_hints,
> + .status = dm_po2z_status,
> + .ctr = dm_po2z_ctr,
> + .dtr = dm_po2z_dtr,
> +};
> +
> +static int __init dm_po2z_init(void)
> +{
> + return dm_register_target(&dm_po2z_target);
> +}
> +
> +static void __exit dm_po2z_exit(void)
> +{
> + dm_unregister_target(&dm_po2z_target);
> +}
> +
> +/* Module hooks */
> +module_init(dm_po2z_init);
> +module_exit(dm_po2z_exit);
> +
> +MODULE_DESCRIPTION(DM_NAME "power-of-2 zoned target");
> +MODULE_AUTHOR("Pankaj Raghav <[email protected]>");
> +MODULE_LICENSE("GPL");
> +

2022-09-28 14:51:21

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH v15 13/13] dm: add power-of-2 target for zoned devices with non power-of-2 zone sizes

On Fri, Sep 23 2022 at 1:36P -0400,
Pankaj Raghav <[email protected]> wrote:

> Only zoned devices with power-of-2(po2) number of sectors per zone(zone
> size) were supported in linux but now non power-of-2(npo2) zone sizes
> support has been added to the block layer.
>
> Filesystems such as F2FS and btrfs have support for zoned devices with
> po2 zone size assumption. Before adding native support for npo2 zone
> sizes, it was suggested to create a dm target for npo2 zone size device to
> appear as a po2 zone size target so that file systems can initially
> work without any explicit changes.
>
> The design of this target is very simple: remap the device zone size to
> the zone capacity and change the zone size to be the nearest power of 2
> value.
>
> For e.g., a device with a zone size/capacity of 3M will have an equivalent
> target layout as follows:
>
> Device layout :-
> zone capacity = 3M
> zone size = 3M
>
> |--------------|-------------|
> 0 3M 6M
>
> Target layout :-
> zone capacity=3M
> zone size = 4M
>
> |--------------|---|--------------|---|
> 0 3M 4M 7M 8M
>
> The area between target's zone capacity and zone size will be emulated
> in the target.
> The read IOs that fall in the emulated gap area will return 0 filled
> bio and all the other IOs in that area will result in an error.
> If a read IO span across the emulated area boundary, then the IOs are
> split across them. All other IO operations that span across the emulated
> area boundary will result in an error.
>
> The target can be easily created as follows:
> dmsetup create <label> --table '0 <size_sects> po2zoned /dev/nvme<id>'
>
> The target does not support partial mapping of the underlying
> device as there is no use-case for it.
>
> Note:
> This target is not related to dm-zoned target, which exposes a zoned block
> device as a regular block device without any write constraint.
>
> This target only exposes a different zone size than the underlying device.
> The underlying device's other constraints will be directly exposed to the
> target.
>
> Signed-off-by: Pankaj Raghav <[email protected]>
> Suggested-by: Johannes Thumshirn <[email protected]>
> Suggested-by: Damien Le Moal <[email protected]>
> Suggested-by: Hannes Reinecke <[email protected]>

If/when this series is accepted by Jens and others the DM changes can
just be included along with the rest of the series.

Reviewed-by: Mike Snitzer <[email protected]>

2022-09-28 18:21:02

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH v15 07/13] zonefs: allow non power of 2 zoned devices

On 9/23/22 10:36, Pankaj Raghav wrote:
> The zone size shift variable is useful only if the zone sizes are known
> to be power of 2. Remove that variable and use generic helpers from
> block layer to calculate zone index in zonefs.

Reviewed-by: Bart Van Assche <[email protected]>


2022-09-28 18:21:45

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH v15 11/13] dm: call dm_zone_endio after the target endio callback for zoned devices

On 9/23/22 10:36, Pankaj Raghav wrote:
> dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
> uses either native append or append emulation, and it is called before the
> endio of the target. But target endio can still update the clone bio
> after dm_zone_endio is called, thereby, the orig bio does not contain
> the updated information anymore.
>
> Currently, this is not a problem as the targets that support zoned devices
> such as dm-zoned, dm-linear, and dm-crypt do not have an endio function,
> and even if they do (such as dm-flakey), they don't modify the
> bio->bi_iter.bi_sector of the cloned bio that is used to update the
> orig_bio's bi_sector in dm_zone_endio function.
>
> This is a prep patch for the new dm-po2zoned target as it modifies
> bi_sector in the endio callback.
>
> Call dm_zone_endio for zoned devices after calling the target's endio
> function.

Reviewed-by: Bart Van Assche <[email protected]>

2022-09-29 07:27:40

by Pankaj Raghav

[permalink] [raw]
Subject: Re: [PATCH v15 00/13] support zoned block devices with non-power-of-2 zone sizes

> Hi Jens,
> Please consider this patch series for the 6.1 release.
>

Hi Jens, Christoph, and Keith,
All the patches have a Reviewed-by tag at this point. Can we queue this up
for 6.1?

--
Pankaj

2022-09-30 15:20:40

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v15 00/13] support zoned block devices with non-power-of-2 zone sizes

On 9/29/22 12:31 AM, Pankaj Raghav wrote:
>> Hi Jens,
>> Please consider this patch series for the 6.1 release.
>>
>
> Hi Jens, Christoph, and Keith,
> All the patches have a Reviewed-by tag at this point. Can we queue this up
> for 6.1?

It's getting pretty late for 6.1 and I'd really like to have both Christoph
and Martin sign off on these changes.

--
Jens Axboe


2022-09-30 20:45:41

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH v15 00/13] support zoned block devices with non-power-of-2 zone sizes

On 9/30/22 08:13, Jens Axboe wrote:
> On 9/29/22 12:31 AM, Pankaj Raghav wrote:
>>> Hi Jens,
>>> Please consider this patch series for the 6.1 release.
>>>
>>
>> Hi Jens, Christoph, and Keith,
>> All the patches have a Reviewed-by tag at this point. Can we queue this up
>> for 6.1?
>
> It's getting pretty late for 6.1 and I'd really like to have both Christoph
> and Martin sign off on these changes.

Hi Jens,

Agreed that it's getting late for 6.1.

Since this has not been mentioned in the cover letter, I want to add
that in the near future we will need these patches for Android devices.
JEDEC is working on supporting zoned storage for UFS devices, the
storage devices used in all modern Android phones. Although it would be
possible to make the offset between zone starts a power of two by
inserting gap zones between data zones, UFS vendors asked not to do this
and hence need support for zone sizes that are not a power of two. An
advantage of not having to deal with gap zones is better filesystem
performance since filesystem extents cannot span gap zones. Having to
split filesystem extents because of gap zones reduces filesystem
performance.

Thanks,

Bart.


2022-09-30 21:30:55

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v15 00/13] support zoned block devices with non-power-of-2 zone sizes

On 9/30/22 1:38 PM, Bart Van Assche wrote:
> On 9/30/22 08:13, Jens Axboe wrote:
>> On 9/29/22 12:31 AM, Pankaj Raghav wrote:
>>>> Hi Jens,
>>>> ?? Please consider this patch series for the 6.1 release.
>>>>
>>>
>>> Hi Jens, Christoph, and Keith,
>>> ? All the patches have a Reviewed-by tag at this point. Can we queue this up
>>> for 6.1?
>>
>> It's getting pretty late for 6.1 and I'd really like to have both Christoph
>> and Martin sign off on these changes.
>
> Hi Jens,
>
> Agreed that it's getting late for 6.1.
>
> Since this has not been mentioned in the cover letter, I want to add
> that in the near future we will need these patches for Android
> devices. JEDEC is working on supporting zoned storage for UFS devices,
> the storage devices used in all modern Android phones. Although it
> would be possible to make the offset between zone starts a power of
> two by inserting gap zones between data zones, UFS vendors asked not
> to do this and hence need support for zone sizes that are not a power
> of two. An advantage of not having to deal with gap zones is better
> filesystem performance since filesystem extents cannot span gap zones.
> Having to split filesystem extents because of gap zones reduces
> filesystem performance.

Noted. I'll find some time to review this as well separately, once we're
on the other side of the merge window.

--
Jens Axboe

2022-10-01 01:00:50

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH v15 00/13] support zoned block devices with non-power-of-2 zone sizes

On 10/1/22 04:38, Bart Van Assche wrote:
> On 9/30/22 08:13, Jens Axboe wrote:
>> On 9/29/22 12:31 AM, Pankaj Raghav wrote:
>>>> Hi Jens,
>>>> Please consider this patch series for the 6.1 release.
>>>>
>>>
>>> Hi Jens, Christoph, and Keith,
>>> All the patches have a Reviewed-by tag at this point. Can we queue this up
>>> for 6.1?
>>
>> It's getting pretty late for 6.1 and I'd really like to have both Christoph
>> and Martin sign off on these changes.
>
> Hi Jens,
>
> Agreed that it's getting late for 6.1.
>
> Since this has not been mentioned in the cover letter, I want to add
> that in the near future we will need these patches for Android devices.
> JEDEC is working on supporting zoned storage for UFS devices, the
> storage devices used in all modern Android phones. Although it would be
> possible to make the offset between zone starts a power of two by
> inserting gap zones between data zones, UFS vendors asked not to do this
> and hence need support for zone sizes that are not a power of two. An
> advantage of not having to deal with gap zones is better filesystem
> performance since filesystem extents cannot span gap zones. Having to
> split filesystem extents because of gap zones reduces filesystem
> performance.

As mentioned many times, my opinion is that a good implementation should
*not* have any extent span zone boundaries. So personally, I do not
consider such argument as a valid justification for the non-power-of-2
zone size support.

>
> Thanks,
>
> Bart.
>
>

--
Damien Le Moal
Western Digital Research

2022-10-01 02:52:59

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH v15 00/13] support zoned block devices with non-power-of-2 zone sizes

On 9/30/22 17:45, Damien Le Moal wrote:
> On 10/1/22 04:38, Bart Van Assche wrote:
>> Since this has not been mentioned in the cover letter, I want to add
>> that in the near future we will need these patches for Android devices.
>> JEDEC is working on supporting zoned storage for UFS devices, the
>> storage devices used in all modern Android phones. Although it would be
>> possible to make the offset between zone starts a power of two by
>> inserting gap zones between data zones, UFS vendors asked not to do this
>> and hence need support for zone sizes that are not a power of two. An
>> advantage of not having to deal with gap zones is better filesystem
>> performance since filesystem extents cannot span gap zones. Having to
>> split filesystem extents because of gap zones reduces filesystem
>> performance.
>
> As mentioned many times, my opinion is that a good implementation should
> *not* have any extent span zone boundaries. So personally, I do not
> consider such argument as a valid justification for the non-power-of-2
> zone size support.

Hi Damien,

Although the filesystem extent issue probably can be solved in software,
the argument that UFS vendors strongly prefer not to have gap zones and
hence need support for zone sizes that are not a power of two remains.

Thanks,

Bart.

2022-10-24 22:10:10

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH v15 00/13] support zoned block devices with non-power-of-2 zone sizes

On 9/30/22 14:24, Jens Axboe wrote:
> Noted. I'll find some time to review this as well separately, once we're
> on the other side of the merge window.

Hi Jens,

Now that we are on the other side of the merge window: do you perhaps
want Pankaj to repost this patch series? From what I have heard in
several fora (JEDEC, SNIA) all flash storage vendors except one (WDC)
are in favor of a contiguous LBA space and hence are in favor of
supporting zone sizes that are not a power of two.

As you may know in JEDEC we are working on standardizing zoned storage
for UFS devices. We (JEDEC JC-64.1 committee members) would like to know
whether or not we should require that the UFS zone size should be a
power of two.

Thank you,

Bart.