- Background and Motivation:
The zone storage implementation in Linux, introduced since v4.10, first
targetted SMR drives which have a power of 2 (po2) zone size alignment
requirement. The po2 zone size was further imposed implicitly by the
block layer's blk_queue_chunk_sectors(), used to prevent IO merging
across chunks beyond the specified size, since v3.16 through commit
762380ad9322 ("block: add notion of a chunk size for request merging").
But this same general block layer po2 requirement for blk_queue_chunk_sectors()
was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
to be non-power-of-2").
NAND, which is the media used in newer zoned storage devices, does not
naturally align to po2. In these devices, zone capacity(cap) is not the
same as the po2 zone size. When the zone cap != zone size, then unmapped
LBAs are introduced to cover the space between the zone cap and zone size.
po2 requirement does not make sense for these type of zone storage devices.
This patch series aims to remove these unmapped LBAs for zoned devices when
zone cap is npo2. This is done by relaxing the po2 zone size constraint
in the kernel and allowing zoned device with npo2 zone sizes if zone cap
== zone size.
Removing the po2 requirement from zone storage should be possible
now provided that no userspace regression and no performance regressions are
introduced. Stop-gap patches have been already merged into f2fs-tools to
proactively not allow npo2 zone sizes until proper support is added [1].
There were two efforts previously to add support to npo2 devices: 1) via
device level emulation [2] but that was rejected with a final conclusion
to add support for non po2 zoned device in the complete stack[3] 2)
adding support to the complete stack by removing the constraint in the
block layer and NVMe layer with support to btrfs, zonefs, etc which was
rejected with a conclusion to add a dm target for FS support [0]
to reduce the regression impact.
This series adds support to npo2 zoned devices in the block and nvme
layer and a new **dm target** is added: dm-po2z-target. This new
target will be initially used for filesystems such as btrfs and
f2fs until native npo2 zone support is added.
- Patchset description:
Patches 1-3 deals with removing the po2 constraint from the
block layer.
Patches 4-5 deals with removing the constraint from nvme zns.
Patch 5 removes the po2 contraint in null blk
Patch 6 adds npo2 support to zonefs
Patches 7-13 adds support for npo2 zoned devices in the DM layer and
adds a new target dm-po2z-target which converts a zoned device with npo2
zone size into a zoned target with po2 zone size.
The patch series is based on linux-next tag: next-20220823
Testing:
The new target was tested with blktest and zonefs test suite in qemu and
on a real ZNS device with npo2 zone size.
Performance Measurement on a null blk:
Device:
zone size = 128M, blocksize=4k
FIO cmd:
fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd --size=23G
--io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
--loops=4
The following results are an average of 4 runs on AMD Ryzen 5 5600X with
32GB of RAM:
Sequential Write:
x-----------------x---------------------------------x---------------------------------x
| IOdepth | 8 | 16 |
x-----------------x---------------------------------x---------------------------------x
| | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch | 578 | 2257 | 12.80 | 576 | 2248 | 25.78 |
x-----------------x---------------------------------x---------------------------------x
| With patch | 581 | 2268 | 12.74 | 576 | 2248 | 25.85 |
x-----------------x---------------------------------x---------------------------------x
Sequential read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth | 8 | 16 |
x-----------------x---------------------------------x---------------------------------x
| | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch | 667 | 2605 | 11.79 | 675 | 2637 | 23.49 |
x-----------------x---------------------------------x---------------------------------x
| With patch | 667 | 2605 | 11.79 | 675 | 2638 | 23.48 |
x-----------------x---------------------------------x---------------------------------x
Random read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth | 8 | 16 |
x-----------------x---------------------------------x---------------------------------x
| | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch | 522 | 2038 | 15.05 | 514 | 2006 | 30.87 |
x-----------------x---------------------------------x---------------------------------x
| With patch | 522 | 2039 | 15.04 | 523 | 2042 | 30.33 |
x-----------------x---------------------------------x---------------------------------x
Minor variations are noticed in Sequential write with io depth 8 and
in random read with io depth 16. But overall no noticeable differences
were noticed
[0] https://lore.kernel.org/lkml/PH0PR04MB74166C87F694B150A5AE0F009BD09@PH0PR04MB7416.namprd04.prod.outlook.com/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/commit/?h=dev-test&id=6afcf6493578e77528abe65ab8b12f3e1c16749f
[2] https://lore.kernel.org/all/[email protected]/T/
[3] https://lore.kernel.org/all/20220315135245.eqf4tqngxxb7ymqa@unifi/
Changes since v1:
- Put the function declaration and its usage in the same commit (Bart)
- Remove bdev_zone_aligned function (Bart)
- Change the name from blk_queue_zone_aligned to blk_queue_is_zone_start
(Damien)
- q is never null in from bdev_get_queue (Damien)
- Add condition during bringup and check for zsze == zcap for npo2
drives (Damien)
- Rounddown operation should be made generic to work in 32 bits arch
(bart)
- Add comments where generic calculation is directly used instead having
special handling for po2 zone sizes (Hannes)
- Make the minimum zone size alignment requirement for btrfs to be 1M
instead of BTRFS_STRIPE_LEN(David)
Changes since v2:
- Minor formatting changes
Changes since v3:
- Make superblock mirror align with the existing superblock log offsets
(David)
- DM change return value and remove extra newline
- Optimize null blk zone index lookup with shift for po2 zone size
Changes since v4:
- Remove direct filesystems support for npo2 devices (Johannes, Hannes,
Damien)
Changes since v5:
- Use DIV_ROUND_UP* helper instead of round_up as it breaks 32bit arch
build in null blk(kernel-test-robot, Nathan)
- Use DIV_ROUND_UP_SECTOR_T also in blkdev_nr_zones function instead of
open coding it with div64_u64
- Added extra condition in dm-zoned and in dm to reject non power of 2
zone sizes.
Changes since v6:
- Added a new dm target for non power of 2 devices
- Added support for non power of 2 devices in the DM layer.
Changes since v7:
- Improved dm target for non power of 2 zoned devices with some bug
fixes and rearrangement
- Removed some unnecessary comments.
Changes since v8:
- Rename dm-po2z to dm-po2zone
- set max_io_len for the target to po2 zone size sector
- Simplify dm-po2zone target by removing some superfluous conditions
- Added documentation for the new dm-po2zone target
- Change pr_warn to pr_err for critical errors
- Split patch 2 and 11 with their corresponding prep patches
- Minor spelling and grammatical improvements
Changes since v9:
- Add a check for a zoned device in dm-po2zone ctr.
- Rephrased some commit messages and documentation for clarity
Changes since v10:
- Simplified dm_poz_map function (Damien)
Changes since v11:
- Rename bio_in_emulated_zone_area and some formatting adjustments
(Damien)
Luis Chamberlain (1):
dm-zoned: ensure only power of 2 zone sizes are allowed
Pankaj Raghav (12):
block: make bdev_nr_zones and disk_zone_no generic for npo2 zone size
block:rearrange bdev_{is_zoned,zone_sectors,get_queue} helpers in
blkdev.h
block: allow blk-zoned devices to have non-power-of-2 zone size
nvmet: Allow ZNS target to support non-power_of_2 zone sizes
nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
null_blk: allow zoned devices with non power-of-2 zone sizes
zonefs: allow non power of 2 zoned devices
dm-zone: use generic helpers to calculate offset from zone start
dm-table: allow zoned devices with non power-of-2 zone sizes
dm: call dm_zone_endio after the target endio callback for zoned
devices
dm: introduce DM_EMULATED_ZONES target type
dm: add power-of-2 target for zoned devices with non power-of-2 zone
sizes
.../admin-guide/device-mapper/dm-po2zone.rst | 71 +++++
.../admin-guide/device-mapper/index.rst | 1 +
block/blk-core.c | 2 +-
block/blk-zoned.c | 37 ++-
drivers/block/null_blk/main.c | 5 +-
drivers/block/null_blk/null_blk.h | 1 +
drivers/block/null_blk/zoned.c | 18 +-
drivers/md/Kconfig | 10 +
drivers/md/Makefile | 2 +
drivers/md/dm-po2zone-target.c | 260 ++++++++++++++++++
drivers/md/dm-table.c | 20 +-
drivers/md/dm-zone.c | 8 +-
drivers/md/dm-zoned-target.c | 8 +
drivers/md/dm.c | 8 +-
drivers/nvme/host/zns.c | 14 +-
drivers/nvme/target/zns.c | 3 +-
fs/zonefs/super.c | 6 +-
fs/zonefs/zonefs.h | 1 -
include/linux/blkdev.h | 80 ++++--
include/linux/device-mapper.h | 9 +
20 files changed, 489 insertions(+), 75 deletions(-)
create mode 100644 Documentation/admin-guide/device-mapper/dm-po2zone.rst
create mode 100644 drivers/md/dm-po2zone-target.c
--
2.25.1
From: Luis Chamberlain <[email protected]>
dm-zoned relies on the assumption that the zone size is a
power-of-2(po2) and the zone capacity is same as the zone size.
Ensure only po2 devices can be used as dm-zoned target until a native
support for zoned devices with non-po2 zone size is added.
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
drivers/md/dm-zoned-target.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
index 95b132b52f33..9325bf5dee81 100644
--- a/drivers/md/dm-zoned-target.c
+++ b/drivers/md/dm-zoned-target.c
@@ -792,6 +792,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
return -EINVAL;
}
zone_nr_sectors = bdev_zone_sectors(bdev);
+ if (!is_power_of_2(zone_nr_sectors)) {
+ ti->error = "Zone size is not a power-of-2 number of sectors";
+ return -EINVAL;
+ }
zoned_dev->zone_nr_sectors = zone_nr_sectors;
zoned_dev->nr_zones = bdev_nr_zones(bdev);
}
@@ -804,6 +808,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
return -EINVAL;
}
zoned_dev->zone_nr_sectors = bdev_zone_sectors(bdev);
+ if (!is_power_of_2(zoned_dev->zone_nr_sectors)) {
+ ti->error = "Zone size is not a power-of-2 number of sectors";
+ return -EINVAL;
+ }
zoned_dev->nr_zones = bdev_nr_zones(bdev);
}
--
2.25.1
Define bdev_is_zoned(), bdev_zone_sectors() and bdev_get_queue() earlier
in the blkdev.h include file. Simplify bdev_is_zoned() by removing the
superfluous NULL check for request queue while we are at it.
This commit has no functional change, and it is a prep patch for allowing
zoned devices with non-power-of-2 zone sizes in the block layer.
Signed-off-by: Pankaj Raghav <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
---
include/linux/blkdev.h | 43 +++++++++++++++++++-----------------------
1 file changed, 19 insertions(+), 24 deletions(-)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ab82d1ff0cce..84e7881262e3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -635,6 +635,11 @@ static inline bool queue_is_mq(struct request_queue *q)
return q->mq_ops;
}
+static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
+{
+ return bdev->bd_queue; /* this is never NULL */
+}
+
#ifdef CONFIG_PM
static inline enum rpm_status queue_rpm_status(struct request_queue *q)
{
@@ -666,6 +671,20 @@ static inline bool blk_queue_is_zoned(struct request_queue *q)
}
}
+static inline bool bdev_is_zoned(struct block_device *bdev)
+{
+ return blk_queue_is_zoned(bdev_get_queue(bdev));
+}
+
+static inline sector_t bdev_zone_sectors(struct block_device *bdev)
+{
+ struct request_queue *q = bdev_get_queue(bdev);
+
+ if (!blk_queue_is_zoned(q))
+ return 0;
+ return q->limits.chunk_sectors;
+}
+
#ifdef CONFIG_BLK_DEV_ZONED
static inline unsigned int disk_nr_zones(struct gendisk *disk)
{
@@ -892,11 +911,6 @@ int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags);
int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
unsigned int flags);
-static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
-{
- return bdev->bd_queue; /* this is never NULL */
-}
-
/* Helper to convert BLK_ZONE_ZONE_XXX to its string format XXX */
const char *blk_zone_cond_str(enum blk_zone_cond zone_cond);
@@ -1296,25 +1310,6 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
return BLK_ZONED_NONE;
}
-static inline bool bdev_is_zoned(struct block_device *bdev)
-{
- struct request_queue *q = bdev_get_queue(bdev);
-
- if (q)
- return blk_queue_is_zoned(q);
-
- return false;
-}
-
-static inline sector_t bdev_zone_sectors(struct block_device *bdev)
-{
- struct request_queue *q = bdev_get_queue(bdev);
-
- if (!blk_queue_is_zoned(q))
- return 0;
- return q->limits.chunk_sectors;
-}
-
static inline int queue_dma_alignment(const struct request_queue *q)
{
return q ? q->dma_alignment : 511;
--
2.25.1
The zone size shift variable is useful only if the zone sizes are known
to be power of 2. Remove that variable and use generic helpers from
block layer to calculate zone index in zonefs.
Acked-by: Damien Le Moal <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/zonefs/super.c | 6 ++----
fs/zonefs/zonefs.h | 1 -
2 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 860f0b1032c6..e549ef16738c 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -476,10 +476,9 @@ static void __zonefs_io_error(struct inode *inode, bool write)
{
struct zonefs_inode_info *zi = ZONEFS_I(inode);
struct super_block *sb = inode->i_sb;
- struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
unsigned int noio_flag;
unsigned int nr_zones =
- zi->i_zone_size >> (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
+ bdev_zone_no(sb->s_bdev, zi->i_zone_size >> SECTOR_SHIFT);
struct zonefs_ioerr_data err = {
.inode = inode,
.write = write,
@@ -1401,7 +1400,7 @@ static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
struct zonefs_inode_info *zi = ZONEFS_I(inode);
int ret = 0;
- inode->i_ino = zone->start >> sbi->s_zone_sectors_shift;
+ inode->i_ino = bdev_zone_no(sb->s_bdev, zone->start);
inode->i_mode = S_IFREG | sbi->s_perm;
zi->i_ztype = type;
@@ -1776,7 +1775,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
* interface constraints.
*/
sb_set_blocksize(sb, bdev_zone_write_granularity(sb->s_bdev));
- sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
sbi->s_uid = GLOBAL_ROOT_UID;
sbi->s_gid = GLOBAL_ROOT_GID;
sbi->s_perm = 0640;
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 4b3de66c3233..39895195cda6 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -177,7 +177,6 @@ struct zonefs_sb_info {
kgid_t s_gid;
umode_t s_perm;
uuid_t s_uuid;
- unsigned int s_zone_sectors_shift;
unsigned int s_nr_files[ZONEFS_ZTYPE_MAX];
--
2.25.1
Introduce a new target type DM_EMULATED_ZONES for targets with
a different number of sectors per zone (aka zone size) than the underlying
device zone size.
This target type is introduced as the existing zoned targets assume
that the target and the underlying device have the same zone size.
The new target: dm-po2zone will use this new target
type as it emulates the zone boundary that is different from the
underlying zoned device.
Signed-off-by: Pankaj Raghav <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
---
drivers/md/dm-table.c | 13 ++++++++++---
include/linux/device-mapper.h | 9 +++++++++
2 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 31eb1d29d136..b37991ea3ffb 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1614,13 +1614,20 @@ static bool dm_table_supports_zoned_model(struct dm_table *t,
return true;
}
-static int device_not_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
+/*
+ * Callback function to check for device zone sector across devices. If the
+ * DM_TARGET_EMULATED_ZONES target feature flag is not set, then the target
+ * should have the same zone sector as the underlying devices.
+ */
+static int check_valid_device_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
unsigned int *zone_sectors = data;
- if (!bdev_is_zoned(dev->bdev))
+ if (!bdev_is_zoned(dev->bdev) ||
+ dm_target_supports_emulated_zones(ti->type))
return 0;
+
return bdev_zone_sectors(dev->bdev) != *zone_sectors;
}
@@ -1645,7 +1652,7 @@ static int validate_hardware_zoned_model(struct dm_table *t,
if (!zone_sectors)
return -EINVAL;
- if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {
+ if (dm_table_any_dev_attr(t, check_valid_device_zone_sectors, &zone_sectors)) {
DMERR("%s: zone sectors is not consistent across all zoned devices",
dm_device_name(t->md));
return -EINVAL;
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 04c6acf7faaa..83e20de264c9 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -294,6 +294,15 @@ struct target_type {
#define dm_target_supports_mixed_zoned_model(type) (false)
#endif
+#ifdef CONFIG_BLK_DEV_ZONED
+#define DM_TARGET_EMULATED_ZONES 0x00000400
+#define dm_target_supports_emulated_zones(type) \
+ ((type)->features & DM_TARGET_EMULATED_ZONES)
+#else
+#define DM_TARGET_EMULATED_ZONES 0x00000000
+#define dm_target_supports_emulated_zones(type) (false)
+#endif
+
struct dm_target {
struct dm_table *table;
struct target_type *type;
--
2.25.1
A generic bdev_zone_no() helper is added to calculate zone number for a
given sector in a block device. This helper internally uses disk_zone_no()
to find the zone number.
Use the helper bdev_zone_no() to calculate nr of zones. This let's us
make modifications to the math if needed in one place and adds now
support for zoned devices with non po2 zone size.
Reviewed by: Adam Manzanares <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
drivers/nvme/target/zns.c | 3 +--
include/linux/blkdev.h | 5 +++++
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
index c7ef69f29fe4..662f1a92f39b 100644
--- a/drivers/nvme/target/zns.c
+++ b/drivers/nvme/target/zns.c
@@ -241,8 +241,7 @@ static unsigned long nvmet_req_nr_zones_from_slba(struct nvmet_req *req)
{
unsigned int sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
- return bdev_nr_zones(req->ns->bdev) -
- (sect >> ilog2(bdev_zone_sectors(req->ns->bdev)));
+ return bdev_nr_zones(req->ns->bdev) - bdev_zone_no(req->ns->bdev, sect);
}
static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d0d66a0db224..b2636dcca105 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1340,6 +1340,11 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
return BLK_ZONED_NONE;
}
+static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
+{
+ return disk_zone_no(bdev->bd_disk, sec);
+}
+
static inline int queue_dma_alignment(const struct request_queue *q)
{
return q ? q->dma_alignment : 511;
--
2.25.1
dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
uses either native append or append emulation, and it is called before the
endio of the target. But target endio can still update the clone bio
after dm_zone_endio is called, thereby, the orig bio does not contain
the updated information anymore.
Currently, this is not a problem as the targets that support zoned devices
such as dm-zoned, dm-linear, and dm-crypt do not have an endio function,
and even if they do (such as dm-flakey), they don't modify the
bio->bi_iter.bi_sector of the cloned bio that is used to update the
orig_bio's bi_sector in dm_zone_endio function.
This is a prep patch for the new dm-po2zone target as it modifies
bi_sector in the endio callback.
Call dm_zone_endio for zoned devices after calling the target's endio
function.
Signed-off-by: Pankaj Raghav <[email protected]>
---
drivers/md/dm.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 60549b65c799..58b392c51d04 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1122,10 +1122,6 @@ static void clone_endio(struct bio *bio)
disable_write_zeroes(md);
}
- if (static_branch_unlikely(&zoned_enabled) &&
- unlikely(bdev_is_zoned(bio->bi_bdev)))
- dm_zone_endio(io, bio);
-
if (endio) {
int r = endio(ti, bio, &error);
switch (r) {
@@ -1154,6 +1150,10 @@ static void clone_endio(struct bio *bio)
}
}
+ if (static_branch_unlikely(&zoned_enabled) &&
+ unlikely(bdev_is_zoned(bio->bi_bdev)))
+ dm_zone_endio(io, bio);
+
if (static_branch_unlikely(&swap_bios_enabled) &&
unlikely(swap_bios_limit(ti, bio)))
up(&md->swap_bios_semaphore);
--
2.25.1
Checking if a given sector is aligned to a zone is a common
operation that is performed for zoned devices. Add
bdev_is_zone_start helper to check for this instead of opencoding it
everywhere.
Convert the calculations on zone size to be generic instead of relying on
power-of-2(po2) based arithmetic in the block layer using the helpers
wherever possible.
The only hot path affected by this change for zoned devices with po2
zone size is in blk_check_zone_append() but bdev_is_zone_start() helper is
used to optimize the calculation for po2 zone sizes.
Finally, allow zoned devices with non po2 zone sizes provided that their
zone capacity and zone size are equal. The main motivation to allow zoned
devices with non po2 zone size is to remove the unmapped LBA between
zone capcity and zone size for devices that cannot have a po2 zone
capacity.
Reviewed-by: Luis Chamberlain <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
block/blk-core.c | 2 +-
block/blk-zoned.c | 24 ++++++++++++++++++------
include/linux/blkdev.h | 30 ++++++++++++++++++++++++++++++
3 files changed, 49 insertions(+), 7 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index a0d1104c5590..1cb519220ffb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -563,7 +563,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
return BLK_STS_NOTSUPP;
/* The bio sector must point to the start of a sequential zone */
- if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) - 1) ||
+ if (!bdev_is_zone_start(bio->bi_bdev, bio->bi_iter.bi_sector) ||
!bio_zone_is_seq(bio))
return BLK_STS_IOERR;
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index dce9c95b4bcd..6806c69c81dc 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -285,10 +285,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
return -EINVAL;
/* Check alignment (handle eventual smaller last zone) */
- if (sector & (zone_sectors - 1))
+ if (!bdev_is_zone_start(bdev, sector))
return -EINVAL;
- if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
+ if (!bdev_is_zone_start(bdev, nr_sectors) && end_sector != capacity)
return -EINVAL;
/*
@@ -486,14 +486,26 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
* smaller last zone.
*/
if (zone->start == 0) {
- if (zone->len == 0 || !is_power_of_2(zone->len)) {
- pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
- disk->disk_name, zone->len);
+ if (zone->len == 0) {
+ pr_warn("%s: Invalid zero zone size", disk->disk_name);
+ return -ENODEV;
+ }
+
+ /*
+ * Non power-of-2 zone size support was added to remove the
+ * gap between zone capacity and zone size. Though it is technically
+ * possible to have gaps in a non power-of-2 device, Linux requires
+ * the zone size to be equal to zone capacity for non power-of-2
+ * zoned devices.
+ */
+ if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
+ pr_err("%s: Invalid zone capacity %lld with non power-of-2 zone size %lld",
+ disk->disk_name, zone->capacity, zone->len);
return -ENODEV;
}
args->zone_sectors = zone->len;
- args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
+ args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
} else if (zone->start + args->zone_sectors < capacity) {
if (zone->len != args->zone_sectors) {
pr_warn("%s: Invalid zoned device with non constant zone size\n",
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 84e7881262e3..d0d66a0db224 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -704,6 +704,30 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
return div64_u64(sector, zone_sectors);
}
+static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
+ sector_t sec)
+{
+ sector_t zone_sectors = bdev_zone_sectors(bdev);
+ u64 remainder = 0;
+
+ if (!bdev_is_zoned(bdev))
+ return 0;
+
+ if (is_power_of_2(zone_sectors))
+ return sec & (zone_sectors - 1);
+
+ div64_u64_rem(sec, zone_sectors, &remainder);
+ return remainder;
+}
+
+static inline bool bdev_is_zone_start(struct block_device *bdev, sector_t sec)
+{
+ if (!bdev_is_zoned(bdev))
+ return false;
+
+ return bdev_offset_from_zone_start(bdev, sec) == 0;
+}
+
static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
{
if (!blk_queue_is_zoned(disk->queue))
@@ -748,6 +772,12 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
{
return 0;
}
+
+static inline bool bdev_is_zone_start(struct block_device *bdev, sector_t sec)
+{
+ return false;
+}
+
static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
{
return 0;
--
2.25.1
Use the bdev_offset_from_zone_start() helper function to calculate
the offset from zone start instead of using power of 2 based
calculation.
Signed-off-by: Pankaj Raghav <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
---
drivers/md/dm-zone.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 3dafc0e8b7a9..ac6fc1293d41 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -390,7 +390,8 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
case REQ_OP_WRITE_ZEROES:
case REQ_OP_WRITE:
/* Writes must be aligned to the zone write pointer */
- if ((clone->bi_iter.bi_sector & (zsectors - 1)) != zwp_offset)
+ if (bdev_offset_from_zone_start(md->disk->part0,
+ clone->bi_iter.bi_sector) != zwp_offset)
return false;
break;
case REQ_OP_ZONE_APPEND:
@@ -602,11 +603,8 @@ void dm_zone_endio(struct dm_io *io, struct bio *clone)
*/
if (clone->bi_status == BLK_STS_OK &&
bio_op(clone) == REQ_OP_ZONE_APPEND) {
- sector_t mask =
- (sector_t)bdev_zone_sectors(disk->part0) - 1;
-
orig_bio->bi_iter.bi_sector +=
- clone->bi_iter.bi_sector & mask;
+ bdev_offset_from_zone_start(disk->part0, clone->bi_iter.bi_sector);
}
return;
--
2.25.1
Allow dm to support zoned devices with non power-of-2(po2) zone sizes as
the block layer now supports it.
Signed-off-by: Pankaj Raghav <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
---
drivers/md/dm-table.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 332f96b58252..31eb1d29d136 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -250,7 +250,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
if (bdev_is_zoned(bdev)) {
unsigned int zone_sectors = bdev_zone_sectors(bdev);
- if (start & (zone_sectors - 1)) {
+ if (!bdev_is_zone_start(bdev, start)) {
DMWARN("%s: start=%llu not aligned to h/w zone size %u of %pg",
dm_device_name(ti->table->md),
(unsigned long long)start,
@@ -267,7 +267,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
* devices do not end up with a smaller zone in the middle of
* the sector range.
*/
- if (len & (zone_sectors - 1)) {
+ if (!bdev_is_zone_start(bdev, len)) {
DMWARN("%s: len=%llu not aligned to h/w zone size %u of %pg",
dm_device_name(ti->table->md),
(unsigned long long)len,
@@ -1642,8 +1642,7 @@ static int validate_hardware_zoned_model(struct dm_table *t,
return -EINVAL;
}
- /* Check zone size validity and compatibility */
- if (!zone_sectors || !is_power_of_2(zone_sectors))
+ if (!zone_sectors)
return -EINVAL;
if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {
--
2.25.1
Only zoned devices with power-of-2(po2) number of sectors per zone(zone
size) were supported in linux but now non power-of-2(npo2) zone sizes
support has been added to the block layer.
Filesystems such as F2FS and btrfs have support for zoned devices with
po2 zone size assumption. Before adding native support for npo2 zone
sizes, it was suggested to create a dm target for npo2 zone size device to
appear as a po2 zone size target so that file systems can initially
work without any explicit changes by using this target.
The design of this target is very simple: remap the device zone size to
the zone capacity and change the zone size to be the nearest power of 2
value.
For e.g., a device with a zone size/capacity of 3M will have an equivalent
target layout as follows:
Device layout :-
zone capacity = 3M
zone size = 3M
|--------------|-------------|
0 3M 6M
Target layout :-
zone capacity=3M
zone size = 4M
|--------------|---|--------------|---|
0 3M 4M 7M 8M
The area between target's zone capacity and zone size will be emulated
in the target.
The read IOs that fall in the emulated gap area will return 0 filled
bio and all the other IOs in that area will result in an error.
If a read IO span across the emulated area boundary, then the IOs are
split across them. All other IO operations that span across the emulated
area boundary will result in an error.
The target can be easily created as follows:
dmsetup create <label> --table '0 <size_sects> po2zone /dev/nvme<id>'
Note that the target does not support partial mapping of the underlying
device.
Signed-off-by: Pankaj Raghav <[email protected]>
Suggested-by: Johannes Thumshirn <[email protected]>
Suggested-by: Damien Le Moal <[email protected]>
Suggested-by: Hannes Reinecke <[email protected]>
---
.../admin-guide/device-mapper/dm-po2zone.rst | 71 +++++
.../admin-guide/device-mapper/index.rst | 1 +
drivers/md/Kconfig | 10 +
drivers/md/Makefile | 2 +
drivers/md/dm-po2zone-target.c | 260 ++++++++++++++++++
5 files changed, 344 insertions(+)
create mode 100644 Documentation/admin-guide/device-mapper/dm-po2zone.rst
create mode 100644 drivers/md/dm-po2zone-target.c
diff --git a/Documentation/admin-guide/device-mapper/dm-po2zone.rst b/Documentation/admin-guide/device-mapper/dm-po2zone.rst
new file mode 100644
index 000000000000..19dc215fbcca
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-po2zone.rst
@@ -0,0 +1,71 @@
+==========
+dm-po2zone
+==========
+The dm-po2zone device mapper target exposes a zoned block device with a
+non-power-of-2(npo2) number of sectors per zone as a power-of-2(po2)
+number of sectors per zone(zone size).
+The filesystems that support zoned block devices such as F2FS and BTRFS
+assume po2 zone size as the kernel has traditionally only supported
+those devices. However, as the kernel now supports zoned block devices with
+npo2 zone sizes, the filesystems can run on top of the dm-po2zone target before
+adding native support.
+
+Partial mapping of the underlying device is not supported by this target.
+
+Algorithm
+=========
+The device mapper target maps the underlying device's zone size to the
+zone capacity and changes the zone size to the nearest po2 zone size.
+The gap between the zone capacity and the zone size is emulated in the target.
+E.g., a zoned block device with a zone size (and capacity) of 3M will have an
+equivalent target layout with mapping as follows:
+
+::
+
+ 0M 3M 4M 6M 8M
+ | | | | |
+ +x------------+--+x---------+--+x------- Target
+ |x | |x | |x
+ x x x
+ x x x
+ x x x
+ x x x
+ |x |x |x
+ +x------------+x------------+x---------- Device
+ | | |
+ 0M 3M 6M
+
+A simple remap is performed for all the BIOs that do not cross the
+emulation gap area, i.e., the area between the zone capacity and size.
+
+If a BIO crosses the emulation gap area, the following operations are performed:
+
+ Read:
+ - If the BIO lies entirely in the emulation gap area, then zero out the BIO and complete it.
+ - If the BIO spans the emulation gap area, split the BIO across the zone capacity boundary
+ and remap only the BIO within the zone capacity boundary. The other part of the split BIO
+ will be zeroed out.
+
+ Other operations:
+ - Return an error
+
+Table parameters
+================
+
+::
+
+ <dev path>
+
+Mandatory parameters:
+
+ <dev path>:
+ Full pathname to the underlying block-device, or a
+ "major:minor" device-number.
+
+Examples
+========
+
+::
+
+ #!/bin/sh
+ echo "0 `blockdev --getsz $1` po2zone $1" | dmsetup create po2z
diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst
index cde52cc09645..1fd04b5b0565 100644
--- a/Documentation/admin-guide/device-mapper/index.rst
+++ b/Documentation/admin-guide/device-mapper/index.rst
@@ -23,6 +23,7 @@ Device Mapper
dm-service-time
dm-uevent
dm-zoned
+ dm-po2zone
era
kcopyd
linear
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 998a5cfdbc4e..638801b2449a 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -518,6 +518,16 @@ config DM_FLAKEY
help
A target that intermittently fails I/O for debugging purposes.
+config DM_PO2ZONE
+ tristate "Zoned block devices target emulating a power-of-2 number of sectors per zone"
+ depends on BLK_DEV_DM
+ depends on BLK_DEV_ZONED
+ help
+ A target that converts a zoned block device with non-power-of-2(npo2)
+ number of sectors per zone to be power-of-2(po2). Use this target for
+ zoned block devices with npo2 number of sectors per zone until native
+ support is added to the filesystems and applications.
+
config DM_VERITY
tristate "Verity target support"
depends on BLK_DEV_DM
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 84291e38dca8..c23f81cc8789 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -26,6 +26,7 @@ dm-era-y += dm-era-target.o
dm-clone-y += dm-clone-target.o dm-clone-metadata.o
dm-verity-y += dm-verity-target.o
dm-zoned-y += dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
+dm-po2zone-y += dm-po2zone-target.o
md-mod-y += md.o md-bitmap.o
raid456-y += raid5.o raid5-cache.o raid5-ppl.o
@@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
obj-$(CONFIG_DM_DELAY) += dm-delay.o
obj-$(CONFIG_DM_DUST) += dm-dust.o
obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o
+obj-$(CONFIG_DM_PO2ZONE) += dm-po2zone.o
obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o
diff --git a/drivers/md/dm-po2zone-target.c b/drivers/md/dm-po2zone-target.c
new file mode 100644
index 000000000000..34ccbeec9a59
--- /dev/null
+++ b/drivers/md/dm-po2zone-target.c
@@ -0,0 +1,260 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Samsung Electronics Co., Ltd.
+ */
+
+#include <linux/device-mapper.h>
+
+#define DM_MSG_PREFIX "po2zone"
+
+struct dm_po2z_target {
+ struct dm_dev *dev;
+ sector_t zone_size; /* Actual zone size of the underlying dev*/
+ sector_t zone_size_po2; /* zone_size rounded to the nearest po2 value */
+ unsigned int zone_size_po2_shift;
+ sector_t zone_size_diff; /* diff between zone_size_po2 and zone_size */
+ unsigned int nr_zones;
+};
+
+static inline unsigned int npo2_zone_no(struct dm_po2z_target *dmh,
+ sector_t sect)
+{
+ return div64_u64(sect, dmh->zone_size);
+}
+
+static inline unsigned int po2_zone_no(struct dm_po2z_target *dmh,
+ sector_t sect)
+{
+ return sect >> dmh->zone_size_po2_shift;
+}
+
+static inline sector_t target_to_device_sect(struct dm_po2z_target *dmh,
+ sector_t sect)
+{
+ return sect - (po2_zone_no(dmh, sect) * dmh->zone_size_diff);
+}
+
+static inline sector_t device_to_target_sect(struct dm_po2z_target *dmh,
+ sector_t sect)
+{
+ return sect + (npo2_zone_no(dmh, sect) * dmh->zone_size_diff);
+}
+
+/*
+ * This target works on the complete zoned device. Partial mapping is not
+ * supported.
+ * Construct a zoned po2 logical device: <dev-path>
+ */
+static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+ struct dm_po2z_target *dmh = NULL;
+ int ret;
+ sector_t zone_size;
+ sector_t dev_capacity;
+
+ if (argc != 1)
+ return -EINVAL;
+
+ dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
+ if (!dmh)
+ return -ENOMEM;
+
+ ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+ &dmh->dev);
+ if (ret) {
+ ti->error = "Device lookup failed";
+ kfree(dmh);
+ return ret;
+ }
+
+ if (!bdev_is_zoned(dmh->dev->bdev)) {
+ DMERR("%pg is not a zoned device", dmh->dev->bdev);
+ kfree(dmh);
+ return -EINVAL;
+ }
+
+ zone_size = bdev_zone_sectors(dmh->dev->bdev);
+ dev_capacity = get_capacity(dmh->dev->bdev->bd_disk);
+ if (ti->len != dev_capacity || ti->begin) {
+ DMERR("%pg Partial mapping of the target not supported",
+ dmh->dev->bdev);
+ kfree(dmh);
+ return -EINVAL;
+ }
+
+ if (is_power_of_2(zone_size))
+ DMWARN("%pg: underlying device has a power-of-2 number of sectors per zone",
+ dmh->dev->bdev);
+
+ dmh->zone_size = zone_size;
+ dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
+ dmh->zone_size_po2_shift = ilog2(dmh->zone_size_po2);
+ dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
+ ti->private = dmh;
+ ti->max_io_len = dmh->zone_size_po2;
+ dmh->nr_zones = npo2_zone_no(dmh, ti->len);
+ ti->len = dmh->zone_size_po2 * dmh->nr_zones;
+
+ return 0;
+}
+
+static int dm_po2z_report_zones_cb(struct blk_zone *zone, unsigned int idx,
+ void *data)
+{
+ struct dm_report_zones_args *args = data;
+ struct dm_po2z_target *dmh = args->tgt->private;
+
+ zone->start = device_to_target_sect(dmh, zone->start);
+ zone->wp = device_to_target_sect(dmh, zone->wp);
+ zone->len = dmh->zone_size_po2;
+ args->next_sector = zone->start + zone->len;
+
+ return args->orig_cb(zone, args->zone_idx++, args->orig_data);
+}
+
+static int dm_po2z_report_zones(struct dm_target *ti,
+ struct dm_report_zones_args *args,
+ unsigned int nr_zones)
+{
+ struct dm_po2z_target *dmh = ti->private;
+ sector_t sect = po2_zone_no(dmh, args->next_sector) * dmh->zone_size;
+
+ return blkdev_report_zones(dmh->dev->bdev, sect, nr_zones,
+ dm_po2z_report_zones_cb, args);
+}
+
+static int dm_po2z_end_io(struct dm_target *ti, struct bio *bio,
+ blk_status_t *error)
+{
+ struct dm_po2z_target *dmh = ti->private;
+
+ if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
+ bio->bi_iter.bi_sector =
+ device_to_target_sect(dmh, bio->bi_iter.bi_sector);
+
+ return DM_ENDIO_DONE;
+}
+
+static void dm_po2z_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+ struct dm_po2z_target *dmh = ti->private;
+
+ limits->chunk_sectors = dmh->zone_size_po2;
+}
+
+/**
+ * dm_po2z_bio_in_emulated_zone_area - check if bio is in the emulated zone area
+ * @dmh: target data
+ * @bio: bio
+ * @offset: bio offset to emulated zone boundary
+ *
+ * Check if a @bio is partly or completely in the emulated zone area. If the
+ * @bio is partly in the emulated zone area, @offset can be used to split
+ * the @bio across the emulated zone boundary. @offset
+ * will be negative if the @bio completely lies in the emulated area.
+ *
+ */
+static bool dm_po2z_bio_in_emulated_zone_area(struct dm_po2z_target *dmh,
+ struct bio *bio, int *offset)
+{
+ unsigned int zone_idx = po2_zone_no(dmh, bio->bi_iter.bi_sector);
+ sector_t nr_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+ sector_t sector_offset =
+ bio->bi_iter.bi_sector - (zone_idx << dmh->zone_size_po2_shift);
+
+ *offset = dmh->zone_size - sector_offset;
+
+ return sector_offset + nr_sectors > dmh->zone_size;
+}
+
+static inline int dm_po2z_read_zeroes(struct bio *bio)
+{
+ zero_fill_bio(bio);
+ bio_endio(bio);
+ return DM_MAPIO_SUBMITTED;
+}
+
+static inline int dm_po2z_remap_sector(struct dm_po2z_target *dmh,
+ struct bio *bio)
+{
+ bio->bi_iter.bi_sector =
+ target_to_device_sect(dmh, bio->bi_iter.bi_sector);
+ return DM_MAPIO_REMAPPED;
+}
+
+static int dm_po2z_map(struct dm_target *ti, struct bio *bio)
+{
+ struct dm_po2z_target *dmh = ti->private;
+ int split_io_pos;
+
+ bio_set_dev(bio, dmh->dev->bdev);
+
+ if (op_is_zone_mgmt(bio_op(bio)))
+ return dm_po2z_remap_sector(dmh, bio);
+
+ if (!bio_sectors(bio))
+ return DM_MAPIO_REMAPPED;
+
+ /*
+ * Read operation on the emulated zone area (between zone capacity
+ * and zone size) will fill the bio with zeroes. Any other operation
+ * in the emulated area should return an error.
+ */
+ if (!dm_po2z_bio_in_emulated_zone_area(dmh, bio, &split_io_pos))
+ return dm_po2z_remap_sector(dmh, bio);
+
+ if (bio_op(bio) == REQ_OP_READ) {
+ /*
+ * If the bio is across emulated zone boundary, split the bio at
+ * the boundary.
+ */
+ if (split_io_pos > 0) {
+ dm_accept_partial_bio(bio, split_io_pos);
+ return dm_po2z_remap_sector(dmh, bio);
+ }
+ return dm_po2z_read_zeroes(bio);
+ }
+
+ return DM_MAPIO_KILL;
+}
+
+static int dm_po2z_iterate_devices(struct dm_target *ti,
+ iterate_devices_callout_fn fn, void *data)
+{
+ struct dm_po2z_target *dmh = ti->private;
+ sector_t len = dmh->nr_zones * dmh->zone_size;
+
+ return fn(ti, dmh->dev, 0, len, data);
+}
+
+static struct target_type dm_po2z_target = {
+ .name = "po2zone",
+ .version = { 1, 0, 0 },
+ .features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONES,
+ .map = dm_po2z_map,
+ .end_io = dm_po2z_end_io,
+ .report_zones = dm_po2z_report_zones,
+ .iterate_devices = dm_po2z_iterate_devices,
+ .module = THIS_MODULE,
+ .io_hints = dm_po2z_io_hints,
+ .ctr = dm_po2z_ctr,
+};
+
+static int __init dm_po2z_init(void)
+{
+ return dm_register_target(&dm_po2z_target);
+}
+
+static void __exit dm_po2z_exit(void)
+{
+ dm_unregister_target(&dm_po2z_target);
+}
+
+/* Module hooks */
+module_init(dm_po2z_init);
+module_exit(dm_po2z_exit);
+
+MODULE_DESCRIPTION(DM_NAME "power-of-2 zoned target");
+MODULE_AUTHOR("Pankaj Raghav <[email protected]>");
+MODULE_LICENSE("GPL");
+
--
2.25.1
On 8/23/22 05:18, Pankaj Raghav wrote:
> Use the bdev_offset_from_zone_start() helper function to calculate
> the offset from zone start instead of using power of 2 based
> calculation.
Reviewed-by: Bart Van Assche <[email protected]>
On 8/23/22 05:18, Pankaj Raghav wrote:
> From: Luis Chamberlain <[email protected]>
>
> dm-zoned relies on the assumption that the zone size is a
> power-of-2(po2) and the zone capacity is same as the zone size.
>
> Ensure only po2 devices can be used as dm-zoned target until a native
> support for zoned devices with non-po2 zone size is added.
Reviewed-by: Bart Van Assche <[email protected]>
On 8/23/22 05:18, Pankaj Raghav wrote:
> Define bdev_is_zoned(), bdev_zone_sectors() and bdev_get_queue() earlier
> in the blkdev.h include file. Simplify bdev_is_zoned() by removing the
> superfluous NULL check for request queue while we are at it.
>
> This commit has no functional change, and it is a prep patch for allowing
> zoned devices with non-power-of-2 zone sizes in the block layer.
It seems like a space is missing in the patch subject after the colon?
Anyway:
Reviewed-by: Bart Van Assche <[email protected]>
On 8/23/2022 6:18 AM, Pankaj Raghav wrote:
> Checking if a given sector is aligned to a zone is a common
> operation that is performed for zoned devices. Add
> bdev_is_zone_start helper to check for this instead of opencoding it
> everywhere.
>
> Convert the calculations on zone size to be generic instead of relying on
> power-of-2(po2) based arithmetic in the block layer using the helpers
> wherever possible.
>
> The only hot path affected by this change for zoned devices with po2
> zone size is in blk_check_zone_append() but bdev_is_zone_start() helper is
> used to optimize the calculation for po2 zone sizes.
>
> Finally, allow zoned devices with non po2 zone sizes provided that their
> zone capacity and zone size are equal. The main motivation to allow zoned
> devices with non po2 zone size is to remove the unmapped LBA between
> zone capcity and zone size for devices that cannot have a po2 zone
> capacity.
>
> Reviewed-by: Luis Chamberlain <[email protected]>
> Reviewed-by: Hannes Reinecke <[email protected]>
> Reviewed-by: Bart Van Assche <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> block/blk-core.c | 2 +-
> block/blk-zoned.c | 24 ++++++++++++++++++------
> include/linux/blkdev.h | 30 ++++++++++++++++++++++++++++++
> 3 files changed, 49 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index a0d1104c5590..1cb519220ffb 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -563,7 +563,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
> return BLK_STS_NOTSUPP;
>
> /* The bio sector must point to the start of a sequential zone */
> - if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) - 1) ||
> + if (!bdev_is_zone_start(bio->bi_bdev, bio->bi_iter.bi_sector) ||
> !bio_zone_is_seq(bio))
> return BLK_STS_IOERR;
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index dce9c95b4bcd..6806c69c81dc 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -285,10 +285,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
> return -EINVAL;
>
> /* Check alignment (handle eventual smaller last zone) */
> - if (sector & (zone_sectors - 1))
> + if (!bdev_is_zone_start(bdev, sector))
> return -EINVAL;
>
> - if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
> + if (!bdev_is_zone_start(bdev, nr_sectors) && end_sector != capacity)
> return -EINVAL;
>
> /*
> @@ -486,14 +486,26 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
> * smaller last zone.
> */
> if (zone->start == 0) {
> - if (zone->len == 0 || !is_power_of_2(zone->len)) {
> - pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
> - disk->disk_name, zone->len);
> + if (zone->len == 0) {
> + pr_warn("%s: Invalid zero zone size", disk->disk_name);
> + return -ENODEV;
> + }
> +
> + /*
> + * Non power-of-2 zone size support was added to remove the
> + * gap between zone capacity and zone size. Though it is technically
> + * possible to have gaps in a non power-of-2 device, Linux requires
> + * the zone size to be equal to zone capacity for non power-of-2
> + * zoned devices.
> + */
> + if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
> + pr_err("%s: Invalid zone capacity %lld with non power-of-2 zone size %lld",
> + disk->disk_name, zone->capacity, zone->len);
> return -ENODEV;
> }
>
> args->zone_sectors = zone->len;
> - args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
> + args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
> } else if (zone->start + args->zone_sectors < capacity) {
> if (zone->len != args->zone_sectors) {
> pr_warn("%s: Invalid zoned device with non constant zone size\n",
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 84e7881262e3..d0d66a0db224 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -704,6 +704,30 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
> return div64_u64(sector, zone_sectors);
> }
>
> +static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
> + sector_t sec)
> +{
> + sector_t zone_sectors = bdev_zone_sectors(bdev);
> + u64 remainder = 0;
> +
> + if (!bdev_is_zoned(bdev))
> + return 0;
See below
> +
> + if (is_power_of_2(zone_sectors))
> + return sec & (zone_sectors - 1);
> +
> + div64_u64_rem(sec, zone_sectors, &remainder);
> + return remainder;
> +}
> +
> +static inline bool bdev_is_zone_start(struct block_device *bdev, sector_t sec)
> +{
> + if (!bdev_is_zoned(bdev))
> + return false;
Duplicating the same check above, and the check above is less clear in
the case of !zoned since it returns 0 and not some warning that makes
sense in the case of zoned check on !zoned bdev.
Can you simply exclude above check?
> +
> + return bdev_offset_from_zone_start(bdev, sec) == 0;
> +}
> +
> static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
> {
> if (!blk_queue_is_zoned(disk->queue))
> @@ -748,6 +772,12 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
> {
> return 0;
> }
> +
> +static inline bool bdev_is_zone_start(struct block_device *bdev, sector_t sec)
> +{
> + return false;
> +}
> +
> static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
> {
> return 0;
On 8/26/2022 2:06 PM, Jonathan Derrick wrote:
>
>
> On 8/23/2022 6:18 AM, Pankaj Raghav wrote:
>> Checking if a given sector is aligned to a zone is a common
>> operation that is performed for zoned devices. Add
>> bdev_is_zone_start helper to check for this instead of opencoding it
>> everywhere.
>>
>> Convert the calculations on zone size to be generic instead of relying on
>> power-of-2(po2) based arithmetic in the block layer using the helpers
>> wherever possible.
>>
>> The only hot path affected by this change for zoned devices with po2
>> zone size is in blk_check_zone_append() but bdev_is_zone_start()
>> helper is
>> used to optimize the calculation for po2 zone sizes.
>>
>> Finally, allow zoned devices with non po2 zone sizes provided that their
>> zone capacity and zone size are equal. The main motivation to allow zoned
>> devices with non po2 zone size is to remove the unmapped LBA between
>> zone capcity and zone size for devices that cannot have a po2 zone
>> capacity.
>>
>> Reviewed-by: Luis Chamberlain <[email protected]>
>> Reviewed-by: Hannes Reinecke <[email protected]>
>> Reviewed-by: Bart Van Assche <[email protected]>
>> Signed-off-by: Pankaj Raghav <[email protected]>
>> ---
>> block/blk-core.c | 2 +-
>> block/blk-zoned.c | 24 ++++++++++++++++++------
>> include/linux/blkdev.h | 30 ++++++++++++++++++++++++++++++
>> 3 files changed, 49 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index a0d1104c5590..1cb519220ffb 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -563,7 +563,7 @@ static inline blk_status_t
>> blk_check_zone_append(struct request_queue *q,
>> return BLK_STS_NOTSUPP;
>> /* The bio sector must point to the start of a sequential zone */
>> - if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) -
>> 1) ||
>> + if (!bdev_is_zone_start(bio->bi_bdev, bio->bi_iter.bi_sector) ||
>> !bio_zone_is_seq(bio))
>> return BLK_STS_IOERR;
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index dce9c95b4bcd..6806c69c81dc 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -285,10 +285,10 @@ int blkdev_zone_mgmt(struct block_device *bdev,
>> enum req_op op,
>> return -EINVAL;
>> /* Check alignment (handle eventual smaller last zone) */
>> - if (sector & (zone_sectors - 1))
>> + if (!bdev_is_zone_start(bdev, sector))
>> return -EINVAL;
>> - if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
>> + if (!bdev_is_zone_start(bdev, nr_sectors) && end_sector != capacity)
>> return -EINVAL;
>> /*
>> @@ -486,14 +486,26 @@ static int blk_revalidate_zone_cb(struct
>> blk_zone *zone, unsigned int idx,
>> * smaller last zone.
>> */
>> if (zone->start == 0) {
>> - if (zone->len == 0 || !is_power_of_2(zone->len)) {
>> - pr_warn("%s: Invalid zoned device with non power of two
>> zone size (%llu)\n",
>> - disk->disk_name, zone->len);
>> + if (zone->len == 0) {
>> + pr_warn("%s: Invalid zero zone size", disk->disk_name);
>> + return -ENODEV;
>> + }
>> +
>> + /*
>> + * Non power-of-2 zone size support was added to remove the
>> + * gap between zone capacity and zone size. Though it is
>> technically
>> + * possible to have gaps in a non power-of-2 device, Linux
>> requires
>> + * the zone size to be equal to zone capacity for non power-of-2
>> + * zoned devices.
>> + */
>> + if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
>> + pr_err("%s: Invalid zone capacity %lld with non
>> power-of-2 zone size %lld",
>> + disk->disk_name, zone->capacity, zone->len);
>> return -ENODEV;
>> }
>> args->zone_sectors = zone->len;
>> - args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
>> + args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
>> } else if (zone->start + args->zone_sectors < capacity) {
>> if (zone->len != args->zone_sectors) {
>> pr_warn("%s: Invalid zoned device with non constant zone
>> size\n",
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 84e7881262e3..d0d66a0db224 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -704,6 +704,30 @@ static inline unsigned int disk_zone_no(struct
>> gendisk *disk, sector_t sector)
>> return div64_u64(sector, zone_sectors);
>> }
>> +static inline sector_t bdev_offset_from_zone_start(struct
>> block_device *bdev,
>> + sector_t sec)
>> +{
>> + sector_t zone_sectors = bdev_zone_sectors(bdev);
>> + u64 remainder = 0;
>> +
>> + if (!bdev_is_zoned(bdev))
>> + return 0;
> See below
>
>> +
>> + if (is_power_of_2(zone_sectors))
>> + return sec & (zone_sectors - 1);
>> +
>> + div64_u64_rem(sec, zone_sectors, &remainder);
>> + return remainder;
>> +}
>> +
>> +static inline bool bdev_is_zone_start(struct block_device *bdev,
>> sector_t sec)
>> +{
>> + if (!bdev_is_zoned(bdev))
>> + return false;
> Duplicating the same check above, and the check above is less clear in
> the case of !zoned since it returns 0 and not some warning that makes
> sense in the case of zoned check on !zoned bdev.
> Can you simply exclude above check?
Nevermind, just noticed bdev_offset_from_zone_start is used in later
patches.
>
>
>> +
>> + return bdev_offset_from_zone_start(bdev, sec) == 0;
>> +}
>> +
>> static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t
>> sector)
>> {
>> if (!blk_queue_is_zoned(disk->queue))
>> @@ -748,6 +772,12 @@ static inline unsigned int disk_zone_no(struct
>> gendisk *disk, sector_t sector)
>> {
>> return 0;
>> }
>> +
>> +static inline bool bdev_is_zone_start(struct block_device *bdev,
>> sector_t sec)
>> +{
>> + return false;
>> +}
>> +
>> static inline unsigned int bdev_max_open_zones(struct block_device
>> *bdev)
>> {
>> return 0;
On 2022-08-30 04:52, Shinichiro Kawasaki wrote:
> On Aug 23, 2022 / 14:18, Pankaj Raghav wrote:
>> Only zoned devices with power-of-2(po2) number of sectors per zone(zone
>> size) were supported in linux but now non power-of-2(npo2) zone sizes
>> support has been added to the block layer.
>>
>> Filesystems such as F2FS and btrfs have support for zoned devices with
>> po2 zone size assumption. Before adding native support for npo2 zone
>> sizes, it was suggested to create a dm target for npo2 zone size device to
>> appear as a po2 zone size target so that file systems can initially
>> work without any explicit changes by using this target.
>
> FYI, with this patch series, I created the new dm target and ran blktests zbd
> group for it. And I observed zbd/007 test case failure (other test cases
> passed). The test checks sector mapping of zoned dm-linear, dm-flakey and dm-
> crypt. Some changes in the test case look required to handle the new target.
>
Thanks for testing it. I am aware of this test case, and I skipped it while
I was testing my target.
The test needs to be adapted as the container's start, and the logical
device's start will be different for this target.
I initially thought this test case might not apply to the dm-po2zone
target, but at a closer look, it is helpful once the zone offset is adapted
while doing a reset and writing data as the test only verifies the relative
WP position.
I also noticed that this test relies on getting the underlying device id
using `dmsetup table` command. The target currently lacks the `.status`
callback which appends the device id details. I will add them as a part of
the next revision for this target. Thanks.
On Tue, Aug 23 2022 at 8:18P -0400,
Pankaj Raghav <[email protected]> wrote:
> From: Luis Chamberlain <[email protected]>
>
> dm-zoned relies on the assumption that the zone size is a
> power-of-2(po2) and the zone capacity is same as the zone size.
>
> Ensure only po2 devices can be used as dm-zoned target until a native
> support for zoned devices with non-po2 zone size is added.
>
> Reviewed-by: Hannes Reinecke <[email protected]>
> Reviewed-by: Damien Le Moal <[email protected]>
> Signed-off-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
On Tue, Aug 23 2022 at 8:18P -0400,
Pankaj Raghav <[email protected]> wrote:
> Allow dm to support zoned devices with non power-of-2(po2) zone sizes as
> the block layer now supports it.
>
> Signed-off-by: Pankaj Raghav <[email protected]>
> Reviewed-by: Bart Van Assche <[email protected]>
> Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
On Tue, Aug 23 2022 at 8:18P -0400,
Pankaj Raghav <[email protected]> wrote:
> dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
> uses either native append or append emulation, and it is called before the
> endio of the target. But target endio can still update the clone bio
> after dm_zone_endio is called, thereby, the orig bio does not contain
> the updated information anymore.
>
> Currently, this is not a problem as the targets that support zoned devices
> such as dm-zoned, dm-linear, and dm-crypt do not have an endio function,
> and even if they do (such as dm-flakey), they don't modify the
> bio->bi_iter.bi_sector of the cloned bio that is used to update the
> orig_bio's bi_sector in dm_zone_endio function.
>
> This is a prep patch for the new dm-po2zone target as it modifies
> bi_sector in the endio callback.
>
> Call dm_zone_endio for zoned devices after calling the target's endio
> function.
>
> Signed-off-by: Pankaj Raghav <[email protected]>
Great patch header, explains the situation nicely.
Reviewed-by: Mike Snitzer <[email protected]>
On Tue, Aug 23 2022 at 8:18P -0400,
Pankaj Raghav <[email protected]> wrote:
> Use the bdev_offset_from_zone_start() helper function to calculate
> the offset from zone start instead of using power of 2 based
> calculation.
>
> Signed-off-by: Pankaj Raghav <[email protected]>
> Reviewed-by: Luis Chamberlain <[email protected]>
> Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
On Tue, Aug 23 2022 at 8:18P -0400,
Pankaj Raghav <[email protected]> wrote:
> Introduce a new target type DM_EMULATED_ZONES for targets with
> a different number of sectors per zone (aka zone size) than the underlying
> device zone size.
>
> This target type is introduced as the existing zoned targets assume
> that the target and the underlying device have the same zone size.
> The new target: dm-po2zone will use this new target
> type as it emulates the zone boundary that is different from the
> underlying zoned device.
>
> Signed-off-by: Pankaj Raghav <[email protected]>
> Reviewed-by: Damien Le Moal <[email protected]>
This patch's use of "target type" jargon isn't valid.
Please say "target feature flag" and rename DM_EMULATED_ZONES to
DM_TARGET_EMULATED_ZONES in the subject and header.
But, with those fixed:
Signed-off-by: Mike Snitzer <[email protected]>
(fyi, I'll review patch 13, the dm-po2zone target, tomorrow)
On 2022-09-02 02:28, Mike Snitzer wrote:
> On Tue, Aug 23 2022 at 8:18P -0400,
> Pankaj Raghav <[email protected]> wrote:
>
>> Introduce a new target type DM_EMULATED_ZONES for targets with
>> a different number of sectors per zone (aka zone size) than the underlying
>> device zone size.
>>
>> This target type is introduced as the existing zoned targets assume
>> that the target and the underlying device have the same zone size.
>> The new target: dm-po2zone will use this new target
>> type as it emulates the zone boundary that is different from the
>> underlying zoned device.
>>
>> Signed-off-by: Pankaj Raghav <[email protected]>
>> Reviewed-by: Damien Le Moal <[email protected]>
>
> This patch's use of "target type" jargon isn't valid.
>
> Please say "target feature flag" and rename DM_EMULATED_ZONES to
> DM_TARGET_EMULATED_ZONES in the subject and header.
> Good catch. I will fix it up for the next version.
> But, with those fixed:
>
> Signed-off-by: Mike Snitzer <[email protected]>
>
You mean <Reviewed-By> ? :)
> (fyi, I'll review patch 13, the dm-po2zone target, tomorrow)
>
Thanks.
On 2022-08-23 14:18, Pankaj Raghav wrote:
> +static int dm_po2z_iterate_devices(struct dm_target *ti,
> + iterate_devices_callout_fn fn, void *data)
> +{
> + struct dm_po2z_target *dmh = ti->private;
> + sector_t len = dmh->nr_zones * dmh->zone_size;
> +
> + return fn(ti, dmh->dev, 0, len, data);
> +}
> +
> +static struct target_type dm_po2z_target = {
> + .name = "po2zone",
> + .version = { 1, 0, 0 },
> + .features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONES,
This target also supports DM_TARGET_NOWAIT feature flag. I will add it in
the next version.
> + .map = dm_po2z_map,
> + .end_io = dm_po2z_end_io,
> + .report_zones = dm_po2z_report_zones,
> + .iterate_devices = dm_po2z_iterate_devices,
> + .module = THIS_MODULE,
> + .io_hints = dm_po2z_io_hints,
> + .ctr = dm_po2z_ctr,
> +};
> +
> +static int __init dm_po2z_init(void)
> +{
> + return dm_register_target(&dm_po2z_target);
> +}
> +
> +static void __exit dm_po2z_exit(void)
> +{
> + dm_unregister_target(&dm_po2z_target);
> +}
> +
> +/* Module hooks */
> +module_init(dm_po2z_init);
> +module_exit(dm_po2z_exit);
> +
> +MODULE_DESCRIPTION(DM_NAME "power-of-2 zoned target");
> +MODULE_AUTHOR("Pankaj Raghav <[email protected]>");
> +MODULE_LICENSE("GPL");
> +
On Fri, Sep 02 2022 at 8:02P -0400,
Pankaj Raghav <[email protected]> wrote:
> On 2022-09-02 02:28, Mike Snitzer wrote:
> > On Tue, Aug 23 2022 at 8:18P -0400,
> > Pankaj Raghav <[email protected]> wrote:
> >
> >> Introduce a new target type DM_EMULATED_ZONES for targets with
> >> a different number of sectors per zone (aka zone size) than the underlying
> >> device zone size.
> >>
> >> This target type is introduced as the existing zoned targets assume
> >> that the target and the underlying device have the same zone size.
> >> The new target: dm-po2zone will use this new target
> >> type as it emulates the zone boundary that is different from the
> >> underlying zoned device.
> >>
> >> Signed-off-by: Pankaj Raghav <[email protected]>
> >> Reviewed-by: Damien Le Moal <[email protected]>
> >
> > This patch's use of "target type" jargon isn't valid.
> >
> > Please say "target feature flag" and rename DM_EMULATED_ZONES to
> > DM_TARGET_EMULATED_ZONES in the subject and header.
> > Good catch. I will fix it up for the next version.
> > But, with those fixed:
> >
> > Signed-off-by: Mike Snitzer <[email protected]>
> >
> You mean <Reviewed-By> ? :)
Ah, yes Reviewed-By, force of habit ;)
On Tue, Aug 23 2022 at 8:18P -0400,
Pankaj Raghav <[email protected]> wrote:
> Only zoned devices with power-of-2(po2) number of sectors per zone(zone
> size) were supported in linux but now non power-of-2(npo2) zone sizes
> support has been added to the block layer.
>
> Filesystems such as F2FS and btrfs have support for zoned devices with
> po2 zone size assumption. Before adding native support for npo2 zone
> sizes, it was suggested to create a dm target for npo2 zone size device to
> appear as a po2 zone size target so that file systems can initially
> work without any explicit changes by using this target.
>
> The design of this target is very simple: remap the device zone size to
> the zone capacity and change the zone size to be the nearest power of 2
> value.
>
> For e.g., a device with a zone size/capacity of 3M will have an equivalent
> target layout as follows:
>
> Device layout :-
> zone capacity = 3M
> zone size = 3M
>
> |--------------|-------------|
> 0 3M 6M
>
> Target layout :-
> zone capacity=3M
> zone size = 4M
>
> |--------------|---|--------------|---|
> 0 3M 4M 7M 8M
>
> The area between target's zone capacity and zone size will be emulated
> in the target.
> The read IOs that fall in the emulated gap area will return 0 filled
> bio and all the other IOs in that area will result in an error.
> If a read IO span across the emulated area boundary, then the IOs are
> split across them. All other IO operations that span across the emulated
> area boundary will result in an error.
>
> The target can be easily created as follows:
> dmsetup create <label> --table '0 <size_sects> po2zone /dev/nvme<id>'
>
> Note that the target does not support partial mapping of the underlying
> device.
>
> Signed-off-by: Pankaj Raghav <[email protected]>
> Suggested-by: Johannes Thumshirn <[email protected]>
> Suggested-by: Damien Le Moal <[email protected]>
> Suggested-by: Hannes Reinecke <[email protected]>
This target needs more review from those who Suggested-by it.
And the header and docs needs to address:
1) why is a partial mapping of the underlying device disallowed?
2) why is it assumed all IO is read-only? (talk to me and others like
we don't know the inherent limitations of this class of zoned hw)
On a code level:
1) are you certain you're properly failing all writes?
- are writes allowed to the "zone capacity area" but _not_
allowed to the "emulated zone area"? (if yes, _please document_).
2) yes, you absolutely need to implement the .status target_type hook
(for both STATUS and TABLE).
3) really not loving the nested return (of DM_MAPIO_SUBMITTED or
DM_MAPIO_REMAPPED) from methods called from dm_po2z_map(). Would
prefer to not have to do a depth-first search to see where and when
dm_po2z_map() returns a DM_MAPIO_XXX unless there is a solid
justification for it. To me it just obfuscates the DM interface a
bit too much.
Otherwise, pretty clean code and nothing weird going on.
I look forward to seeing your next (final?) revision of this patchset.
Thanks,
Mike
On Fri, Sep 02 2022 at 4:55P -0400,
Mike Snitzer <[email protected]> wrote:
> On Tue, Aug 23 2022 at 8:18P -0400,
> Pankaj Raghav <[email protected]> wrote:
>
> > Only zoned devices with power-of-2(po2) number of sectors per zone(zone
> > size) were supported in linux but now non power-of-2(npo2) zone sizes
> > support has been added to the block layer.
> >
> > Filesystems such as F2FS and btrfs have support for zoned devices with
> > po2 zone size assumption. Before adding native support for npo2 zone
> > sizes, it was suggested to create a dm target for npo2 zone size device to
> > appear as a po2 zone size target so that file systems can initially
> > work without any explicit changes by using this target.
> >
> > The design of this target is very simple: remap the device zone size to
> > the zone capacity and change the zone size to be the nearest power of 2
> > value.
> >
> > For e.g., a device with a zone size/capacity of 3M will have an equivalent
> > target layout as follows:
> >
> > Device layout :-
> > zone capacity = 3M
> > zone size = 3M
> >
> > |--------------|-------------|
> > 0 3M 6M
> >
> > Target layout :-
> > zone capacity=3M
> > zone size = 4M
> >
> > |--------------|---|--------------|---|
> > 0 3M 4M 7M 8M
> >
> > The area between target's zone capacity and zone size will be emulated
> > in the target.
> > The read IOs that fall in the emulated gap area will return 0 filled
> > bio and all the other IOs in that area will result in an error.
> > If a read IO span across the emulated area boundary, then the IOs are
> > split across them. All other IO operations that span across the emulated
> > area boundary will result in an error.
> >
> > The target can be easily created as follows:
> > dmsetup create <label> --table '0 <size_sects> po2zone /dev/nvme<id>'
> >
> > Note that the target does not support partial mapping of the underlying
> > device.
> >
> > Signed-off-by: Pankaj Raghav <[email protected]>
> > Suggested-by: Johannes Thumshirn <[email protected]>
> > Suggested-by: Damien Le Moal <[email protected]>
> > Suggested-by: Hannes Reinecke <[email protected]>
>
>
> This target needs more review from those who Suggested-by it.
>
> And the header and docs needs to address:
>
> 1) why is a partial mapping of the underlying device disallowed?
> 2) why is it assumed all IO is read-only? (talk to me and others like
> we don't know the inherent limitations of this class of zoned hw)
>
> On a code level:
> 1) are you certain you're properly failing all writes?
> - are writes allowed to the "zone capacity area" but _not_
> allowed to the "emulated zone area"? (if yes, _please document_).
> 2) yes, you absolutely need to implement the .status target_type hook
> (for both STATUS and TABLE).
> 3) really not loving the nested return (of DM_MAPIO_SUBMITTED or
> DM_MAPIO_REMAPPED) from methods called from dm_po2z_map(). Would
> prefer to not have to do a depth-first search to see where and when
> dm_po2z_map() returns a DM_MAPIO_XXX unless there is a solid
> justification for it. To me it just obfuscates the DM interface a
> bit too much.
>
> Otherwise, pretty clean code and nothing weird going on.
>
> I look forward to seeing your next (final?) revision of this patchset.
Thinking further.. I'm left confused about just what the heck this
target is assuming.
E.g.: feels like its exposing a readonly end of the zone is very
bi-polar... yet no hint to upper layer it shouldn't write to that
read-only end (the "emulated zone").. but there has to be some zoned
magic assumed? And I'm just naive?
Mike
Hi Mike,
>> Note that the target does not support partial mapping of the underlying
>> device.
>>
>> Signed-off-by: Pankaj Raghav <[email protected]>
>> Suggested-by: Johannes Thumshirn <[email protected]>
>> Suggested-by: Damien Le Moal <[email protected]>
>> Suggested-by: Hannes Reinecke <[email protected]>
>
>
> This target needs more review from those who Suggested-by it.
>
> And the header and docs needs to address:
>
> 1) why is a partial mapping of the underlying device disallowed?
While it is technically possible, I don't see any use-case to do so for
this target. I can mention it in the documentation as well.
> 2) why is it assumed all IO is read-only? (talk to me and others like
> we don't know the inherent limitations of this class of zoned hw)
>
TL;DR: no, we don't assume all IO to be read-only. All operations all
allowed until the zone capacity, and only reads are permitted in the
emulated gap area.
A bit of context about Zoned HW(especially ZNS SSD):
Zone: A contiguous range of logical block addresses managed as a single unit.
Zoned Block Device: A block device that consists of multiple zones.
Zone size: Size of a zone
Zone capacity: Usable logical blocks in a zone
According to ZNS spec, the LBAs from zone capacity to zone size behave like
deallocated blocks when read and are not allowed to be written. Until now,
zone capacity can be any value, but zone size needed to be a power-of-2 to
work in Linux (More information about this is also in my cover letter).
This patch series aims to allow non-po2 zone size devices with zone
capacity == zone size to work in Linux.
A non-po2 zone size device might not work correctly in filesystems that
support zoned devices such as btrfs and f2fs as they assume po2 zone sizes.
Therefore, this target is created to enable these filesystems to work with
non-po2 zone sizes until native support is added.
This target's zone capacity will be the same as the underlying device, but
the target's zone size will be the nearest po2 value of its zone capacity.
Furthermore, the area between the zone capacity and zone size of the target
(emulated gap area) will resemble the spec behavior: behave like the
deallocated blocks when read (we fill zeroes in the bio) and are not
allowed to write.
Does that clarify your question?
> On a code level:
> 1) are you certain you're properly failing all writes?
> - are writes allowed to the "zone capacity area" but _not_
> allowed to the "emulated zone area"? (if yes, _please document_).
I have already documented in Documentation:
A simple remap is performed for all the BIOs that do not cross the
emulation gap area, i.e., the area between the zone capacity and size.
If a BIO lies in the emulation gap area, the following operations are
performed:
Read:
- If the BIO lies entirely in the emulation gap area, then zero out the
BIO and complete it.
- If the BIO spans the emulation gap area, split the BIO across the
zone capacity boundary and remap only the BIO within the zone capacity
boundary. The other part of the split BIO will be zeroed out.
Other operations:
- Return an error
Maybe it is not clear enough?? Let me know.
> 2) yes, you absolutely need to implement the .status target_type hook
> (for both STATUS and TABLE).
I already queued this change locally. I will send it as a part of the next rev.
> 3) really not loving the nested return (of DM_MAPIO_SUBMITTED or
> DM_MAPIO_REMAPPED) from methods called from dm_po2z_map(). Would
> prefer to not have to do a depth-first search to see where and when
> dm_po2z_map() returns a DM_MAPIO_XXX unless there is a solid
> justification for it. To me it just obfuscates the DM interface a
> bit too much.
>
Got it. Do you prefer having the return statements in the dm_po2z_map
itself instead of returning a helper function, which in return returns the
status code?
What about something like this:
static inline void dm_po2z_read_zeroes(struct bio *bio)
{
zero_fill_bio(bio);
bio_endio(bio);
}
static int dm_po2z_map(struct dm_target *ti, struct bio *bio)
{
struct dm_po2z_target *dmh = ti->private;
int split_io_pos;
bio_set_dev(bio, dmh->dev->bdev);
if (op_is_zone_mgmt(bio_op(bio)))
goto remap_sector;
if (!bio_sectors(bio))
return DM_MAPIO_REMAPPED;
if (!dm_po2z_bio_in_emulated_zone_area(dmh, bio, &split_io_pos))
goto remap_sector;
/*
* Read operation on the emulated zone area (between zone capacity
* and zone size) will fill the bio with zeroes.Any other operation
* in the emulated area should return an error.
*/
if (bio_op(bio) == REQ_OP_READ) {
/*
* If the bio is across emulated zone boundary, split
* the bio at
* the boundary.
*/
if (split_io_pos > 0) {
dm_accept_partial_bio(bio, split_io_pos);
goto remap_sector;
}
dm_po2z_read_zeroes(bio);
return DM_MAPIO_SUBMITTED;
}
return DM_MAPIO_KILL;
remap_sector:
bio->bi_iter.bi_sector =
target_to_device_sect(dmh, bio->bi_iter.bi_sector);
return DM_MAPIO_REMAPPED;
}
> Otherwise, pretty clean code and nothing weird going on.
>
> I look forward to seeing your next (final?) revision of this patchset.
>
> Thanks,
> Mike
>
>>
>> 1) why is a partial mapping of the underlying device disallowed?
>> 2) why is it assumed all IO is read-only? (talk to me and others like
>> we don't know the inherent limitations of this class of zoned hw)
>>
>> On a code level:
>> 1) are you certain you're properly failing all writes?
>> - are writes allowed to the "zone capacity area" but _not_
>> allowed to the "emulated zone area"? (if yes, _please document_).
>> 2) yes, you absolutely need to implement the .status target_type hook
>> (for both STATUS and TABLE).
>> 3) really not loving the nested return (of DM_MAPIO_SUBMITTED or
>> DM_MAPIO_REMAPPED) from methods called from dm_po2z_map(). Would
>> prefer to not have to do a depth-first search to see where and when
>> dm_po2z_map() returns a DM_MAPIO_XXX unless there is a solid
>> justification for it. To me it just obfuscates the DM interface a
>> bit too much.
>>
>> Otherwise, pretty clean code and nothing weird going on.
>>
>> I look forward to seeing your next (final?) revision of this patchset.
>
> Thinking further.. I'm left confused about just what the heck this
> target is assuming.
>
> E.g.: feels like its exposing a readonly end of the zone is very
> bi-polar... yet no hint to upper layer it shouldn't write to that
> read-only end (the "emulated zone").. but there has to be some zoned
> magic assumed? And I'm just naive?
>
You are absolutely right about "zoned magic". Applications that use a zoned
block device are aware of the zone capacity and zone size. BLKREPORTZONE
ioctl is typically used to get the zone information from a zoned block device.
This target adjusts the zone report so that zone size and zone capacity are
modified correctly (see dm_po2z_report_zones() and
dm_po2z_report_zones_cb() functions).