2022-09-20 09:50:54

by Pankaj Raghav

Subject: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes

- Background and Motivation:

The zoned storage implementation in Linux, introduced in v4.10, first
targeted SMR drives, which have a power-of-2 (po2) zone size alignment
requirement. The po2 zone size was further imposed implicitly by the
block layer's blk_queue_chunk_sectors(), used to prevent IO merging
across chunks beyond the specified size, since v3.16 through commit
762380ad9322 ("block: add notion of a chunk size for request merging").
But this general po2 requirement of blk_queue_chunk_sectors() was
removed in v5.10 through commit 07d098e6bbad ("block: allow
'chunk_sectors' to be non-power-of-2").

NAND, the media used in newer zoned storage devices, does not naturally
align to po2 sizes. In these devices, the zone capacity (cap) is not the
same as the po2 zone size. When zone cap != zone size, unmapped LBAs are
introduced to cover the space between the zone cap and the zone size.
The po2 requirement does not make sense for this type of zoned storage
device. This patch series aims to remove these unmapped LBAs for zoned
devices whose zone cap is npo2. This is done by relaxing the po2 zone
size constraint in the kernel and allowing zoned devices with npo2 zone
sizes, provided that zone cap == zone size.
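
As an illustration, consider a hypothetical device with a 96M zone
capacity (the numbers in this sketch are made up for illustration and
do not describe any particular drive):

/*
 * Sketch: per-zone unmapped gap under the po2 zone size constraint
 * vs. with an npo2 zone size (values in MiB for brevity).
 */
#include <stdio.h>

int main(void)
{
	unsigned long long zone_cap = 96;	/* writable space per zone */
	unsigned long long po2_zone_size = 128;	/* next po2 >= zone_cap */

	/* Today: every zone carries a hole of unmapped LBAs. */
	printf("gap per zone: %lluM\n", po2_zone_size - zone_cap); /* 32M */

	/* With this series (zone cap == zone size), the hole is gone. */
	unsigned long long npo2_zone_size = zone_cap;
	printf("gap per zone: %lluM\n", npo2_zone_size - zone_cap); /* 0M */
	return 0;
}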

Removing the po2 requirement from zoned storage should be possible now,
provided that no userspace or performance regressions are introduced.
Stop-gap patches have already been merged into f2fs-tools to proactively
reject npo2 zone sizes until proper support is added [1].

There were two previous efforts to add support for npo2 devices: 1) via
device-level emulation [2], which was rejected with the conclusion that
support for non-po2 zoned devices should be added across the complete
stack [3]; 2) by adding support across the complete stack, removing the
constraint in the block layer and the NVMe layer with support for btrfs,
zonefs, etc., which was rejected with the conclusion that a dm target
should be added for FS support [0] to reduce the regression impact.

This series adds support for npo2 zoned devices in the block and NVMe
layers, and a new **dm target** is added: dm-po2zoned. This new target
will initially be used for filesystems such as btrfs and f2fs until
native npo2 zone support is added.

- Patchset description:
Patches 1-3 deal with removing the po2 constraint from the
block layer.

Patches 4-5 deal with removing the constraint from the NVMe target
and host (zns).

Patch 6 removes the po2 constraint in null_blk.

Patch 7 adds npo2 support to zonefs.

Patches 8-13 add support for npo2 zoned devices in the DM layer and
add a new target, dm-po2zoned, which converts a zoned device with an
npo2 zone size into a zoned target with a po2 zone size (see the
sketch below).
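
A simplified, hypothetical userspace model of that remapping (the
names and numbers here are illustrative and are not the target's
actual symbols):

#include <stdio.h>
#include <stdint.h>

#define NPO2_ZONE_SECTORS 196608ULL /* e.g. 96M device zones, in 512B sectors */
#define PO2_ZONE_SECTORS  262144ULL /* 128M zones exposed to upper layers */

static uint64_t remap_sector(uint64_t sector)
{
	uint64_t zone_no = sector / PO2_ZONE_SECTORS; /* exposed zone index */
	uint64_t offset = sector % PO2_ZONE_SECTORS;  /* offset in exposed zone */

	/*
	 * Offsets at or beyond the device zone size fall in the emulated
	 * area; the real target handles IO there specially (e.g. reads
	 * complete zero-filled). Here we simply flag such sectors.
	 */
	if (offset >= NPO2_ZONE_SECTORS)
		return UINT64_MAX;

	return zone_no * NPO2_ZONE_SECTORS + offset;
}

int main(void)
{
	/* Start of exposed zone 1 maps to the start of device zone 1. */
	printf("%llu\n", (unsigned long long)remap_sector(262144)); /* 196608 */
	/* Sector 200000 falls in zone 0's emulated area. */
	printf("%llu\n", (unsigned long long)remap_sector(200000)); /* flagged */
	return 0;
}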

The patch series is based on linux-next tag: next-20220919

Testing:
The new target was tested with blktests and the zonefs test suite, both
in qemu and on a real ZNS device with an npo2 zone size.

Performance measurement on null_blk:
Device:
zone size = 128M, blocksize=4k

FIO cmd:
fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd --size=23G
--io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
--loops=4

The following results are an average of 4 runs on an AMD Ryzen 5 5600X with
32GB of RAM:

Sequential Write:
x-----------------x---------------------------------x---------------------------------x
| IOdepth | 8 | 16 |
x-----------------x---------------------------------x---------------------------------x
| | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch | 578 | 2257 | 12.80 | 576 | 2248 | 25.78 |
x-----------------x---------------------------------x---------------------------------x
| With patch | 581 | 2268 | 12.74 | 576 | 2248 | 25.85 |
x-----------------x---------------------------------x---------------------------------x

Sequential read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth | 8 | 16 |
x-----------------x---------------------------------x---------------------------------x
| | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch | 667 | 2605 | 11.79 | 675 | 2637 | 23.49 |
x-----------------x---------------------------------x---------------------------------x
| With patch | 667 | 2605 | 11.79 | 675 | 2638 | 23.48 |
x-----------------x---------------------------------x---------------------------------x

Random read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth | 8 | 16 |
x-----------------x---------------------------------x---------------------------------x
| | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch | 522 | 2038 | 15.05 | 514 | 2006 | 30.87 |
x-----------------x---------------------------------x---------------------------------x
| With patch | 522 | 2039 | 15.04 | 523 | 2042 | 30.33 |
x-----------------x---------------------------------x---------------------------------x

Minor variations show up in sequential write at IO depth 8 and in
random read at IO depth 16, but overall no noticeable difference was
observed.

[0] https://lore.kernel.org/lkml/PH0PR04MB74166C87F694B150A5AE0F009BD09@PH0PR04MB7416.namprd04.prod.outlook.com/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/commit/?h=dev-test&id=6afcf6493578e77528abe65ab8b12f3e1c16749f
[2] https://lore.kernel.org/all/[email protected]/T/
[3] https://lore.kernel.org/all/20220315135245.eqf4tqngxxb7ymqa@unifi/

Changes since v1:
- Put the function declaration and its usage in the same commit (Bart)
- Remove the bdev_zone_aligned function (Bart)
- Change the name from blk_queue_zone_aligned to blk_queue_is_zone_start
  (Damien)
- q is never null when it comes from bdev_get_queue (Damien)
- Add a condition during bringup to check for zsze == zcap on npo2
  drives (Damien)
- Make the rounddown operation generic so that it works on 32-bit arches
  (Bart)
- Add comments where the generic calculation is used directly instead of
  special handling for po2 zone sizes (Hannes)
- Make the minimum zone size alignment requirement for btrfs 1M instead
  of BTRFS_STRIPE_LEN (David)

Changes since v2:
- Minor formatting changes

Changes since v3:
- Make superblock mirror align with the existing superblock log offsets
(David)
- DM change return value and remove extra newline
- Optimize null blk zone index lookup with shift for po2 zone size

Changes since v4:
- Remove direct filesystem support for npo2 devices (Johannes, Hannes,
  Damien)

Changes since v5:
- Use the DIV_ROUND_UP* helpers instead of round_up, as round_up breaks
  the 32-bit arch build in null_blk (kernel test robot, Nathan)
- Use DIV_ROUND_UP_SECTOR_T in the blkdev_nr_zones function as well,
  instead of open coding it with div64_u64
- Added an extra condition in dm-zoned and in dm to reject non power of 2
  zone sizes

Changes since v6:
- Added a new dm target for non power of 2 devices
- Added support for non power of 2 devices in the DM layer.

Changes since v7:
- Improved dm target for non power of 2 zoned devices with some bug
fixes and rearrangement
- Removed some unnecessary comments.

Changes since v8:
- Rename dm-po2z to dm-po2zone
- Set max_io_len for the target to the po2 zone size in sectors
- Simplify the dm-po2zone target by removing some superfluous conditions
- Added documentation for the new dm-po2zone target
- Change pr_warn to pr_err for critical errors
- Split patches 2 and 11 from their corresponding prep patches
- Minor spelling and grammatical improvements

Changes since v9:
- Add a check for a zoned device in dm-po2zone ctr.
- Rephrased some commit messages and documentation for clarity

Changes since v10:
- Simplified dm_poz_map function (Damien)

Changes since v11:
- Rename bio_in_emulated_zone_area and some formatting adjustments
(Damien)

Changes since v12:
- Changed the name from dm-po2zone to dm-po2zoned to follow a common
  naming convention for zoned devices (Mike)
- Return directly from the dm_po2z_map function instead of having
  returns from different functions (Mike)
- Change the target type to a target feature flag in the commit header
  (Mike)
- Added a dm_po2z_status function and the NOWAIT flag to the target
- Added some extra information to the target's documentation

Changes since v13:
- Use goto for cleanup in the dm-po2zoned target (Mike)
- Added a dtr to the dm-po2zoned target
- Expose the zone capacity instead of the po2 zone size for
  DMSTATUS_TYPE_INFO (Mike)

Luis Chamberlain (1):
dm-zoned: ensure only power of 2 zone sizes are allowed

Pankaj Raghav (12):
block: make bdev_nr_zones and disk_zone_no generic for npo2 zone size
block: rearrange bdev_{is_zoned,zone_sectors,get_queue} helper in
blkdev.h
block: allow blk-zoned devices to have non-power-of-2 zone size
nvmet: Allow ZNS target to support non-power_of_2 zone sizes
nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
null_blk: allow zoned devices with non power-of-2 zone sizes
zonefs: allow non power of 2 zoned devices
dm-zone: use generic helpers to calculate offset from zone start
dm-table: allow zoned devices with non power-of-2 zone sizes
dm: call dm_zone_endio after the target endio callback for zoned
devices
dm: introduce DM_EMULATED_ZONES target feature flag
dm: add power-of-2 target for zoned devices with non power-of-2 zone
sizes

.../admin-guide/device-mapper/dm-po2zoned.rst | 79 +++++
.../admin-guide/device-mapper/index.rst | 1 +
block/blk-core.c | 2 +-
block/blk-zoned.c | 37 ++-
drivers/block/null_blk/main.c | 5 +-
drivers/block/null_blk/null_blk.h | 1 +
drivers/block/null_blk/zoned.c | 18 +-
drivers/md/Kconfig | 10 +
drivers/md/Makefile | 2 +
drivers/md/dm-po2zoned-target.c | 291 ++++++++++++++++++
drivers/md/dm-table.c | 20 +-
drivers/md/dm-zone.c | 8 +-
drivers/md/dm-zoned-target.c | 8 +
drivers/md/dm.c | 8 +-
drivers/nvme/host/zns.c | 14 +-
drivers/nvme/target/zns.c | 3 +-
fs/zonefs/super.c | 6 +-
fs/zonefs/zonefs.h | 1 -
include/linux/blkdev.h | 80 +++--
include/linux/device-mapper.h | 9 +
20 files changed, 528 insertions(+), 75 deletions(-)
create mode 100644 Documentation/admin-guide/device-mapper/dm-po2zoned.rst
create mode 100644 drivers/md/dm-po2zoned-target.c

--
2.25.1


2022-09-20 09:51:02

by Pankaj Raghav

Subject: [PATCH v14 03/13] block: allow blk-zoned devices to have non-power-of-2 zone size

Checking whether a given sector is aligned to the start of a zone is a
common operation for zoned devices. Add a bdev_is_zone_start() helper
for this instead of open-coding it everywhere.

Convert the calculations on the zone size to be generic instead of
relying on power-of-2 (po2) arithmetic, using the new helpers in the
block layer wherever possible.

The only hot path affected by this change for zoned devices with a po2
zone size is blk_check_zone_append(), where the bdev_is_zone_start()
helper keeps the calculation optimized for po2 zone sizes.

Finally, allow zoned devices with non-po2 zone sizes provided that their
zone capacity and zone size are equal. The main motivation for allowing
zoned devices with a non-po2 zone size is to remove the unmapped LBAs
between zone capacity and zone size for devices that cannot have a po2
zone capacity.
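
A quick numeric sanity check of the generic calculation (a userspace
sketch, not the kernel code; plain % stands in for the kernel's
div64_u64_rem(), which keeps the math correct on 32-bit arches):

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

/* A 96M zone size in 512-byte sectors: 96 * 2048 = 196608, not a po2. */
#define ZONE_SECTORS 196608ULL

static uint64_t offset_from_zone_start(uint64_t sector)
{
	return sector % ZONE_SECTORS;
}

static bool is_zone_start(uint64_t sector)
{
	return offset_from_zone_start(sector) == 0;
}

int main(void)
{
	/* 393216 = 2 * 196608: the start of the third zone. */
	printf("%d\n", is_zone_start(393216)); /* prints 1 */
	/* 262144 is a po2 boundary but falls inside zone 1 here. */
	printf("%d\n", is_zone_start(262144)); /* prints 0 */
	return 0;
}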

Reviewed-by: Luis Chamberlain <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
block/blk-core.c | 2 +-
block/blk-zoned.c | 24 ++++++++++++++++++------
include/linux/blkdev.h | 30 ++++++++++++++++++++++++++++++
3 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4d0dd0e9e46d..735f63b6159a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -559,7 +559,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
return BLK_STS_NOTSUPP;

/* The bio sector must point to the start of a sequential zone */
- if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) - 1) ||
+ if (!bdev_is_zone_start(bio->bi_bdev, bio->bi_iter.bi_sector) ||
!bio_zone_is_seq(bio))
return BLK_STS_IOERR;

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index dce9c95b4bcd..6806c69c81dc 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -285,10 +285,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
return -EINVAL;

/* Check alignment (handle eventual smaller last zone) */
- if (sector & (zone_sectors - 1))
+ if (!bdev_is_zone_start(bdev, sector))
return -EINVAL;

- if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
+ if (!bdev_is_zone_start(bdev, nr_sectors) && end_sector != capacity)
return -EINVAL;

/*
@@ -486,14 +486,26 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
* smaller last zone.
*/
if (zone->start == 0) {
- if (zone->len == 0 || !is_power_of_2(zone->len)) {
- pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
- disk->disk_name, zone->len);
+ if (zone->len == 0) {
+ pr_warn("%s: Invalid zero zone size", disk->disk_name);
+ return -ENODEV;
+ }
+
+ /*
+ * Non power-of-2 zone size support was added to remove the
+ * gap between zone capacity and zone size. Though it is technically
+ * possible to have gaps in a non power-of-2 device, Linux requires
+ * the zone size to be equal to zone capacity for non power-of-2
+ * zoned devices.
+ */
+ if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
+ pr_err("%s: Invalid zone capacity %lld with non power-of-2 zone size %lld",
+ disk->disk_name, zone->capacity, zone->len);
return -ENODEV;
}

args->zone_sectors = zone->len;
- args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
+ args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
} else if (zone->start + args->zone_sectors < capacity) {
if (zone->len != args->zone_sectors) {
pr_warn("%s: Invalid zoned device with non constant zone size\n",
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6cf43f9384cc..e29799076298 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -704,6 +704,30 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
return div64_u64(sector, zone_sectors);
}

+static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
+ sector_t sec)
+{
+ sector_t zone_sectors = bdev_zone_sectors(bdev);
+ u64 remainder = 0;
+
+ if (!bdev_is_zoned(bdev))
+ return 0;
+
+ if (is_power_of_2(zone_sectors))
+ return sec & (zone_sectors - 1);
+
+ div64_u64_rem(sec, zone_sectors, &remainder);
+ return remainder;
+}
+
+static inline bool bdev_is_zone_start(struct block_device *bdev, sector_t sec)
+{
+ if (!bdev_is_zoned(bdev))
+ return false;
+
+ return bdev_offset_from_zone_start(bdev, sec) == 0;
+}
+
static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
{
if (!blk_queue_is_zoned(disk->queue))
@@ -748,6 +772,12 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
{
return 0;
}
+
+static inline bool bdev_is_zone_start(struct block_device *bdev, sector_t sec)
+{
+ return false;
+}
+
static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
{
return 0;
--
2.25.1

2022-09-20 10:07:56

by Pankaj Raghav

Subject: [PATCH v14 07/13] zonefs: allow non power of 2 zoned devices

The zone size shift variable is useful only if the zone size is known to
be a power of 2. Remove that variable and use the generic helpers from
the block layer to calculate the zone index in zonefs.

Acked-by: Damien Le Moal <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/zonefs/super.c | 6 ++----
fs/zonefs/zonefs.h | 1 -
2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 860f0b1032c6..e549ef16738c 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -476,10 +476,9 @@ static void __zonefs_io_error(struct inode *inode, bool write)
{
struct zonefs_inode_info *zi = ZONEFS_I(inode);
struct super_block *sb = inode->i_sb;
- struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
unsigned int noio_flag;
unsigned int nr_zones =
- zi->i_zone_size >> (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
+ bdev_zone_no(sb->s_bdev, zi->i_zone_size >> SECTOR_SHIFT);
struct zonefs_ioerr_data err = {
.inode = inode,
.write = write,
@@ -1401,7 +1400,7 @@ static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
struct zonefs_inode_info *zi = ZONEFS_I(inode);
int ret = 0;

- inode->i_ino = zone->start >> sbi->s_zone_sectors_shift;
+ inode->i_ino = bdev_zone_no(sb->s_bdev, zone->start);
inode->i_mode = S_IFREG | sbi->s_perm;

zi->i_ztype = type;
@@ -1776,7 +1775,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
* interface constraints.
*/
sb_set_blocksize(sb, bdev_zone_write_granularity(sb->s_bdev));
- sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
sbi->s_uid = GLOBAL_ROOT_UID;
sbi->s_gid = GLOBAL_ROOT_GID;
sbi->s_perm = 0640;
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 4b3de66c3233..39895195cda6 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -177,7 +177,6 @@ struct zonefs_sb_info {
kgid_t s_gid;
umode_t s_perm;
uuid_t s_uuid;
- unsigned int s_zone_sectors_shift;

unsigned int s_nr_files[ZONEFS_ZTYPE_MAX];

--
2.25.1

2022-09-20 10:12:08

by Pankaj Raghav

Subject: [PATCH v14 11/13] dm: call dm_zone_endio after the target endio callback for zoned devices

dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
uses either native append or append emulation, and it is called before the
endio of the target. But target endio can still update the clone bio
after dm_zone_endio is called, thereby, the orig bio does not contain
the updated information anymore.

Currently, this is not a problem as the targets that support zoned devices
such as dm-zoned, dm-linear, and dm-crypt do not have an endio function,
and even if they do (such as dm-flakey), they don't modify the
bio->bi_iter.bi_sector of the cloned bio that is used to update the
orig_bio's bi_sector in dm_zone_endio function.

This is a prep patch for the new dm-po2zoned target, as it modifies
bi_sector in its endio callback.

Call dm_zone_endio for zoned devices after calling the target's endio
function.

Reviewed-by: Mike Snitzer <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
drivers/md/dm.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7c35dea88ed1..874e1dc9fc26 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1122,10 +1122,6 @@ static void clone_endio(struct bio *bio)
disable_write_zeroes(md);
}

- if (static_branch_unlikely(&zoned_enabled) &&
- unlikely(bdev_is_zoned(bio->bi_bdev)))
- dm_zone_endio(io, bio);
-
if (endio) {
int r = endio(ti, bio, &error);
switch (r) {
@@ -1154,6 +1150,10 @@ static void clone_endio(struct bio *bio)
}
}

+ if (static_branch_unlikely(&zoned_enabled) &&
+ unlikely(bdev_is_zoned(bio->bi_bdev)))
+ dm_zone_endio(io, bio);
+
if (static_branch_unlikely(&swap_bios_enabled) &&
unlikely(swap_bios_limit(ti, bio)))
up(&md->swap_bios_semaphore);
--
2.25.1

2022-09-21 17:37:37

by Mike Snitzer

Subject: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]

On Tue, Sep 20 2022 at 5:11P -0400,
Pankaj Raghav <[email protected]> wrote:

> - Background and Motivation:
>
> [...]
>
> This series adds support for npo2 zoned devices in the block and NVMe
> layers, and a new **dm target** is added: dm-po2zoned. This new target
> will initially be used for filesystems such as btrfs and f2fs until
> native npo2 zone support is added.

As this patchset nears the point of being "ready for merge" and DM's
"zoned" oriented targets are multiplying, I need to understand: where
are we collectively going? How long are we expecting to support the
"stop-gap zoned storage" layers we've constructed?

I know https://zonedstorage.io/docs/introduction exists... but it
_seems_ stale given the emergence of ZNS and new permutations of zoned
hardware. Maybe that isn't quite fair (it does cover A LOT!) but I'm
still left wanting (e.g. "bring it all home for me!")...

Damien, as the most "zoned storage" oriented engineer I know, can you
please kick things off by shedding light on where Linux is now, and
where it's going, for "zoned storage"?

To give some additional context to help me when you answer: I'm left
wondering what, if any, role dm-zoned has to play moving forward given
ZNS is "the future" (and yeah "the future" is now but...)? E.g.: Does
it make sense to stack dm-zoned on top of dm-po2zoned!?

Yet more context: When I'm asked to add full-blown support for
dm-zoned to RHEL my gut is "please no, why!?". And if we really
should add dm-zoned, is dm-po2zoned now also a requirement (to support
non-power-of-2 ZNS devices in our never-ending engineering of "zoned
storage" compatibility stop-gaps)?

In addition, it was my understanding that WDC had yet another zoned DM
target called "dm-zap" that is for ZNS based devices... It's all a bit
messy in my head (that's on me for not keeping up, but I think we need
a recap!)

So please help me, and others, become more informed as quickly as
possible! ;)

Thanks,
Mike

ps. I'm asking all this in the open on various Linux mailing lists
because it doesn't seem right to request a concall to inform only
me... I think others may need similar "zoned storage" help.

2022-09-22 00:08:38

by Damien Le Moal

Subject: Re: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]

On 9/22/22 02:27, Mike Snitzer wrote:
> On Tue, Sep 20 2022 at 5:11P -0400,
> Pankaj Raghav <[email protected]> wrote:
>
>> - Background and Motivation:
>>
>> [...]
>
> As this patchset nears the point of being "ready for merge" and DM's
> "zoned" oriented targets are multiplying, I need to understand: where
> are we collectively going? How long are we expecting to support the
> "stop-gap zoned storage" layers we've constructed?
>
> I know https://zonedstorage.io/docs/introduction exists... but it
> _seems_ stale given the emergence of ZNS and new permutations of zoned
> hardware. Maybe that isn't quite fair (it does cover A LOT!) but I'm
> still left wanting (e.g. "bring it all home for me!")...
>
> Damien, as the most "zoned storage" oriented engineer I know, can you
> please kick things off by shedding light on where Linux is now, and
> where it's going, for "zoned storage"?

Let me first start with what we have seen so far with deployments in the
field.

The largest user base for zoned storage is for now hyperscalers (cloud
services) deploying SMR disks. E.g. Dropbox has many times publicized its
use of SMR HDDs. ZNS is fairly new, and while it is being actively
evaluated by many, there are not yet any large scale deployments that I am
aware of.

Most of the large scale SMR users today mainly use the zoned storage
drives directly, without a file system, similarly to their use of regular
block devices. Some erasure coded object store sits on top of the zoned
drives and manage them. The interface used for that has now switched to
using the kernel API, from libzbc pass-through in the early days of SMR
support. With the inclusion of zonefs in kernel 5.6, many are now
switching to using that instead of directly accessing the block device
file. zonefs makes the application development somewhat easier (there is
no need for issuing zone management ioctls) and can also result in
applications that can actually run almost as-is on top of regular block
devices with a file system. That is a very interesting property,
especially in development phase for the user.

Beside these large scale SMR deployments, there are also many smaller
users. For these cases, dm-zoned seemed to be used a lot. In particular,
the Chia cryptocurrency boom (now fading ?) did generate a fair amount of
new SMR users relying on dm-zoned. With btrfs zoned storage support
maturing, dm-zoned is not as needed as it used to be, though. SMR drives can
be used directly under btrfs and I certainly am always recommending this
approach over dm-zoned+ext4 or dm-zoned+xfs as performance is much better
for write intensive workloads.

For Linux kernel overall, zoned storage is in a very good shape for raw
block device use and zonefs use. Production deployments we are seeing are
a proof of that. Currently, my team effort is mostly focused on btrfs and
zonefs and increasing zoned storage use cases.

1) For btrfs, Johannes and Naohiro are working on stabilizing support for
ZNS (we still have some issues with the management of active zones) and
implementing de-clustered parity RAID support so that zoned drives can be
used in RAID 0, 1, 10, 5, 6 and erasure coded volumes. This will address
use cases such as home NAS boxes, backup servers, small file servers,
video applications (e.g. video surveillance) etc. Essentially, any
application with large storage capacity needs that is not a distributed
setup. There are many.

2) For zonefs, I have some to-do items lined up to improve performance
(better read IO tail latency) and further improve ease of use (e.g. remove
the O_DIRECT write constraint).

3) At the block device level, we are also working on adding zoned block
device specifications to virtio and implementing that support in qemu and
the kernel. Patches are floating around now but not yet merged. This
addresses the use of zoned storage in VM environments through virtio
interface instead of directly attaching devices to guests.

> To give some additional context to help me when you answer: I'm left
> wondering what, if any, role dm-zoned has to play moving forward given
> ZNS is "the future" (and yeah "the future" is now but...)? E.g.: Does
> it make sense to stack dm-zoned on top of dm-po2zoned!?

That is a lot to unfold in a small paragraph :)

First of all, I would rephrase "ZNS is the future" into "ZNS is a very
interesting alternative to generic NVMe SSDs". The reason is that HDDs
are not dead, far from it. They are still way cheaper than SSDs in $/TB :)
So ZNS is not really in competition with SMR HDDs here. The two are
complementary, exactly like regular SSDs are complementary to regular HDDs.

dm-zoned serves some use cases for SMR HDDs (see above) but does not
address ZNS (more on this below). And given that all SMR HDDs on the market
today have a zone size that is a power of 2 number of LBAs (256MB zone
size is by far the most common), dm-po2zoned is not required at all for SMR.

Pankaj's patch series is all about supporting ZNS devices that have a zone
size that is not a power of 2 number of LBAs as some vendors want to
produce such drives. There is no such move happening in the SMR world as
all users are happy with the current zone sizes which match the kernel
support (which currently requires a power-of-2 number of LBAs for the zone
size).

I do not think we have yet reached a consensus on whether we really want
to accept any zone size for zoned storage. I personally am not a big fan
of removing the existing constraint, as it makes the code somewhat
heavier (multiplications and divisions instead of bit shifts) without
introducing any benefit to the user that I can see (or agree with).
There is also a risk of forcing users to redesign/change their code to
support different devices in the same system. It is never nice to
fragment support like this within the same device class. This is why
several people, including me, requested something like dm-po2zoned, to
avoid breaking user applications if support for non-power-of-2 zone size
drives is merged. Better than nothing for sure, but not ideal either.
That is only my opinion. There are different opinions out there.

> Yet more context: When I'm asked to add full-blown support for
> dm-zoned to RHEL my gut is "please no, why!?". And if we really
> should add dm-zoned, is dm-po2zoned now also a requirement (to support
> non-power-of-2 ZNS devices in our never-ending engineering of "zoned
> storage" compatibility stop-gaps)?

Support for dm-zoned in RHEL really depends on if your customers need it.
Having SMR and ZNS block device (CONFIG_BLK_DEV_ZONED) and zonefs support
enabled would already cover a lot of use cases on their own, at least the
ones we see in the field today.

Going forward, we expect more use cases to rely on btrfs rather than
dm-zoned or any equivalent DM target for ZNS. And that can also include
non power of 2 zone size drives as btrfs should normally be able to handle
such devices, if the support for them is merged. But we are not there yet
with btrfs support, hence dm-po2zoned.

But again, that all depends on whether Pankaj's patch series is accepted,
that is, on everybody accepting that we lift the power-of-2 zone size
constraint.

> In addition, it was my understanding that WDC had yet another zoned DM
> target called "dm-zap" that is for ZNS based devices... It's all a bit
> messy in my head (that's on me for not keeping up, but I think we need
> a recap!)

Since the ZNS specification does not define conventional zones, dm-zoned
cannot be used as a standalone DM target (read: single block device) with
NVMe zoned block devices. Furthermore, due to its block mapping scheme,
dm-zoned does not support devices with zones that have a capacity lower
than the zone size. So ZNS is really a big *no* for dm-zoned. dm-zap is a
prototype and in a nutshell is the equivalent of dm-zoned for ZNS. dm-zap
can deal with the smaller zone capacity and does not require conventional
zones. We are not trying to push for dm-zap to be merged for now as we are
still evaluating its potential use cases. We also have a different but
functionally equivalent approach implemented as a block device driver that
we are evaluating internally.

Given the above mentioned usage pattern we have seen so far for zoned
storage, it is not yet clear if something like dm-zap for ZNS is needed
beside some niche use cases.

> So please help me, and others, become more informed as quickly as
> possible! ;)

I hope the above helps. If you want me to develop further any of the
points above, feel free to let me know.

> ps. I'm asking all this in the open on various Linux mailing lists
> because it doesn't seem right to request a concall to inform only
> me... I think others may need similar "zoned storage" help.

All good with me :)

--
Damien Le Moal
Western Digital Research

2022-09-22 12:36:04

by Pankaj Raghav

Subject: Re: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]

Thanks a lot Damien for the summary. Your feedback has made this series
much better.

> Pankaj's patch series is all about supporting ZNS devices that have a zone
> size that is not a power of 2 number of LBAs as some vendors want to
> produce such drives. There is no such move happening in the SMR world as
> all users are happy with the current zone sizes which match the kernel
> support (which currently requires a power-of-2 number of LBAs for the zone
> size).
>
> I do not think we have yet reached a consensus on whether we really want
> to accept any zone size for zoned storage. I personally am not a big fan
> of removing the existing constraint, as it makes the code somewhat
> heavier (multiplications and divisions instead of bit shifts) without
> introducing any benefit to the user that I can see (or agree with).
> There is also a risk of forcing users to redesign/change their code to
> support different devices in the same system. It is never nice to
> fragment support like this within the same device class. This is why
> several people, including me, requested something like dm-po2zoned, to
> avoid breaking user applications if support for non-power-of-2 zone size
> drives is merged. Better than nothing for sure, but not ideal either.
> That is only my opinion. There are different opinions out there.

I appreciate that you have explained the different perspectives. We have
covered this in writing and orally, and it seems to me that we have good
coverage of the arguments on the list.

At this point, I would like to ask the opinion of Jens, Christoph and
Keith. Do you think we are missing anything in the series? Can this be
queued up for 6.1 (after I send the next version with a minor fix suggested
by Mike)?

--
Regards,
Pankaj

2022-09-22 20:43:32

by Mike Snitzer

Subject: Re: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]

On Wed, Sep 21 2022 at 7:55P -0400,
Damien Le Moal <[email protected]> wrote:

> On 9/22/22 02:27, Mike Snitzer wrote:
> > On Tue, Sep 20 2022 at 5:11P -0400,
> > Pankaj Raghav <[email protected]> wrote:
> >
> >> - Background and Motivation:
> >>
> >> [...]
> >
> > As this patchset nears the point of being "ready for merge" and DM's
> > "zoned" oriented targets are multiplying, I need to understand: where
> > are we collectively going? How long are we expecting to support the
> > "stop-gap zoned storage" layers we've constructed?
> >
> > I know https://zonedstorage.io/docs/introduction exists... but it
> > _seems_ stale given the emergence of ZNS and new permutations of zoned
> > hardware. Maybe that isn't quite fair (it does cover A LOT!) but I'm
> > still left wanting (e.g. "bring it all home for me!")...
> >
> > Damien, as the most "zoned storage" oriented engineer I know, can you
> > please kick things off by shedding light on where Linux is now, and
> > where it's going, for "zoned storage"?
>
> Let me first start with what we have seen so far with deployments in the
> field.

<snip>

Thanks for all your insights on zoned storage, very appreciated!

> > In addition, it was my understanding that WDC had yet another zoned DM
> > target called "dm-zap" that is for ZNS based devices... It's all a bit
> > messy in my head (that's on me for not keeping up, but I think we need
> > a recap!)
>
> Since the ZNS specification does not define conventional zones, dm-zoned
> cannot be used as a standalone DM target (read: single block device) with
> NVMe zoned block devices. Furthermore, due to its block mapping scheme,
> dm-zoned does not support devices with zones that have a capacity lower
> than the zone size. So ZNS is really a big *no* for dm-zoned. dm-zap is a
> prototype and in a nutshell is the equivalent of dm-zoned for ZNS. dm-zap
> can deal with the smaller zone capacity and does not require conventional
> zones. We are not trying to push for dm-zap to be merged for now as we are
> still evaluating its potential use cases. We also have a different but
> functionally equivalent approach implemented as a block device driver that
> we are evaluating internally.
>
> Given the above mentioned usage pattern we have seen so far for zoned
> storage, it is not yet clear if something like dm-zap for ZNS is needed
> beside some niche use cases.

OK, good to know. I do think dm-zoned should be trained to _not_
allow use with ZNS NVMe devices (maybe that is in place and I just
missed it?). Because there is some confusion with at least one
customer that is asserting dm-zoned is somehow enabling them to use
ZNS NVMe devices!

Maybe they somehow don't _need_ conventional zones (writes are handled
by some other layer? and dm-zoned access is confined to read only)!?
And might they also be using ZNS NVMe devices that do _not_ have a
zone capacity lower than the zone size?

Or maybe they are mistaken and we should ask more specific questions
of them?

> > So please help me, and others, become more informed as quickly as
> > possible! ;)
>
> I hope the above helps. If you want me to develop further any of the
> points above, feel free to let me know.

You've been extremely helpful, thanks!

2022-09-22 22:00:01

by Damien Le Moal

Subject: Re: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]

On 9/23/22 04:37, Mike Snitzer wrote:
> On Wed, Sep 21 2022 at 7:55P -0400,
> Damien Le Moal <[email protected]> wrote:
>
>> On 9/22/22 02:27, Mike Snitzer wrote:
>>> On Tue, Sep 20 2022 at 5:11P -0400,
>>> Pankaj Raghav <[email protected]> wrote:
>>>
>>>> - Background and Motivation:
>>>>
>>>> [...]
>>>
>>> As this patchset nears the point of being "ready for merge" and DM's
>>> "zoned" oriented targets are multiplying, I need to understand: where
>>> are we collectively going? How long are we expecting to support the
>>> "stop-gap zoned storage" layers we've constructed?
>>>
>>> I know https://zonedstorage.io/docs/introduction exists... but it
>>> _seems_ stale given the emergence of ZNS and new permutations of zoned
>>> hardware. Maybe that isn't quite fair (it does cover A LOT!) but I'm
>>> still left wanting (e.g. "bring it all home for me!")...
>>>
>>> Damien, as the most "zoned storage" oriented engineer I know, can you
>>> please kick things off by shedding light on where Linux is now, and
>>> where it's going, for "zoned storage"?
>>
>> Let me first start with what we have seen so far with deployments in the
>> field.
>
> <snip>
>
> Thanks for all your insights on zoned storage, very appreciated!
>
>>> In addition, it was my understanding that WDC had yet another zoned DM
>>> target called "dm-zap" that is for ZNS based devices... It's all a bit
>>> messy in my head (that's on me for not keeping up, but I think we need
>>> a recap!)
>>
>> Since the ZNS specification does not define conventional zones, dm-zoned
>> cannot be used as a standalone DM target (read: single block device) with
>> NVMe zoned block devices. Furthermore, due to its block mapping scheme,
>> dm-zoned does not support devices with zones that have a capacity lower
>> than the zone size. So ZNS is really a big *no* for dm-zoned. dm-zap is a
>> prototype and in a nutshell is the equivalent of dm-zoned for ZNS. dm-zap
>> can deal with the smaller zone capacity and does not require conventional
>> zones. We are not trying to push for dm-zap to be merged for now as we are
>> still evaluating its potential use cases. We also have a different but
>> functionally equivalent approach implemented as a block device driver that
>> we are evaluating internally.
>>
>> Given the above mentioned usage pattern we have seen so far for zoned
>> storage, it is not yet clear if something like dm-zap for ZNS is needed
>> beside some niche use cases.
>
> OK, good to know. I do think dm-zoned should be trained to _not_
> allow use with ZNS NVMe devices (maybe that is in place and I just
> missed it?). Because there is some confusion with at least one
> customer that is asserting dm-zoned is somehow enabling them to use
> ZNS NVMe devices!

dm-zoned checks for conventional zones and also that all zones have a zone
capacity that is equal to the zone size. The first point puts ZNS out but
a second regular drive can be used to emulate conventional zones. However,
the second point (zone cap < zone size) is pretty much a given with ZNS
and so rules it out.

If anything, we should also add a check on the max number of active zones,
which is also a limitation that ZNS drives have, unlike SMR drives. Since
dm-zoned does not handle active zones at all, any drive with a limit
should be excluded. I will send patches for that.
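
Something along these lines in the dm-zoned device validation would do
it (an untested sketch; bdev_max_active_zones() returns 0 when the
drive reports no limit, and the surrounding context is assumed here,
not taken from an actual patch):

	/* Reject zoned drives that limit the number of active zones. */
	if (bdev_max_active_zones(bdev)) {
		ti->error = "Device has an active zone limit";
		return -EINVAL;
	}
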
>
> Maybe they somehow don't _need_ conventional zones (writes are handled
> by some other layer? and dm-zoned access is confined to read only)!?
> And might they also be using ZNS NVMe devices that do _not_ have a
> zone capacity lower than the zone size?

It is a possibility. Indeed, if:
1) the zone capacity is equal to the zone size,
2) a second regular drive is used to emulate conventional zones, and
3) there is no limit on the max number of active zones,

then dm-zoned will work just fine. But again, I seriously doubt that point
(3) holds. And we should check that upfront in the dm-zoned ctr.

> Or maybe they are mistaken and we should ask more specific questions
> of them?

Getting the exact drive characteristics (zone size, capacity and zone
resource limits) will tell you if dm-zoned can work or not.

--
Damien Le Moal
Western Digital Research

2022-09-23 00:02:53

by Bart Van Assche

Subject: Re: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]

On 9/21/22 16:55, Damien Le Moal wrote:
> But again, that all depends on if Pankaj patch series is accepted, that
> is, on everybody accepting that we lift the power-of-2 zone size constraint.

The companies that are busy with implementing zoned storage for UFS
devices are asking for kernel support for non-power-of-2 zone sizes.

Thanks,

Bart.